zavolan_group / pipelines / scRNA-seq-simulation
Merge request !7: feat: add sampling from transcript file
Merged
Michele Garioni requested to merge feature into main 3 years ago
Overview 0, Commits 1, Pipelines 1, Changes 5
#1
create csv file after sampling from input transcripts
Compare main (base) and latest version e8f8acc9 (1 commit, 3 years ago)
5 files, +107 −1
src/sampleinput.py (new file, 0 → 100644, +73 −0)

"""Samples transcripts from input.

Samples a defined number of transcripts following
the relative RNA abundance per gene of a given input.
"""
import logging
from pathlib import Path
from random import choices

LOG = logging.getLogger(__name__)


def sample_from_input(
        input_file: Path,
        output_file: Path = Path.cwd() / 'sampled_cell.csv',
        n: int = 10000,
        sep: str = ',',
) -> None:
    """Samples transcripts from input.

    Samples a defined total number of transcripts following
    the relative RNA abundance per gene of a given input and
    writes the sampled counts per gene to a csv file.

    Args:
        input_file (Path): path of the input gene expression file.
        output_file (Path): path of the sampled gene expression file.
        n (int): number of total transcripts to be sampled.
        sep (str): separator of the input file.
    """
    myfile = open(input_file, 'r')
    # initialize empty dictionary
    input_dc = {}
    # read line, split key-value and assign key and value to the
    # dictionary after stripping \n character.
    LOG.info('reading file...')
    for myline in myfile:
        gene = myline.split(sep)
        input_dc[gene[0].strip()] = int(gene[1].strip())
    myfile.close()
    LOG.debug(input_dc)
    LOG.info('file read.')
    # extract count numbers and calculate relative abundance
    counts = list(input_dc.values())
    tot_counts = sum(counts)
    relative_value = [x / tot_counts for x in counts]
    # sampling
    LOG.info('sampling reads...')
    sampled_genes = choices(
        list(input_dc.keys()), weights=relative_value, k=n
    )
    # initialize empty dictionary
    sampled_dc = dict()
    # count the occurrences of each gene in the sampled list
    for i in sampled_genes:
        if i not in sampled_dc:
            sampled_dc[i] = 1
        else:
            sampled_dc[i] += 1
    LOG.info('reads sampled.')
    # write sample dictionary to a csv file, joining the
    # key value pairs with a comma
    myfile = open(output_file, 'w')
    LOG.info('writing output...')
    for (k, v) in sampled_dc.items():
        line = ','.join([str(k), str(v)])
        myfile.write(line + '\n')
    myfile.close()
    LOG.info('output written.')
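
For review purposes, here is a minimal sketch of how the same read, sample, count, write flow could look with context managers and `collections.Counter`. It is not the merged code: the names `sample_counts` and `demo`, the three-gene input, the blank-line skip, and the returned `Counter` are all assumptions made for illustration.

```python
import random
import tempfile
from collections import Counter
from pathlib import Path


def sample_counts(input_file: Path, output_file: Path,
                  n: int = 10000, sep: str = ',') -> Counter:
    """Hypothetical compact variant of sample_from_input (not the merged code)."""
    counts = {}
    # The context manager closes the file even if a line fails to parse.
    with open(input_file, 'r') as handle:
        for line in handle:
            if not line.strip():
                continue  # assumption: blank lines are skipped
            gene, count = line.split(sep)[:2]
            counts[gene.strip()] = int(count.strip())
    # Weighted sampling with replacement; Counter tallies genes in one pass.
    sampled = Counter(
        random.choices(list(counts), weights=list(counts.values()), k=n)
    )
    with open(output_file, 'w') as handle:
        for gene, hits in sampled.items():
            handle.write(f'{gene},{hits}\n')
    return sampled


def demo():
    """Run the sketch on a made-up three-gene input in a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / 'genes.csv'
        dst = Path(tmp) / 'sampled_cell.csv'
        src.write_text('GeneA,10\nGeneB,30\nGeneC,60\n')
        random.seed(0)  # fixed seed so repeated runs give the same draw
        sampled = sample_counts(src, dst, n=1000)
        return sampled, dst.read_text()
```

Calling `demo()` returns the tally and the text written to the output file. The per-gene counts always add up to `n`, while the split between genes varies with the seed.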
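
A possible review note on the `relative_value` step: `random.choices` treats `weights` as relative weights, so passing the raw counts should give the same sampling distribution as passing the normalised fractions. The sketch below illustrates this with made-up counts; `COUNTS` and `draw` are hypothetical names, not part of the merge request.

```python
import random
from collections import Counter

# Made-up example counts, not taken from the merge request.
COUNTS = {'GeneA': 10, 'GeneB': 30, 'GeneC': 60}


def draw(weights, k=1000, seed=42):
    """Sample k genes with the given weights and return their frequencies."""
    rng = random.Random(seed)  # local generator keeps the demo reproducible
    sampled = rng.choices(list(COUNTS), weights=weights, k=k)
    return {gene: hits / k for gene, hits in Counter(sampled).items()}


raw = list(COUNTS.values())
total = sum(raw)
normalised = [c / total for c in raw]

# Both calls target the same distribution (roughly 0.1 / 0.3 / 0.6).
freq_raw = draw(raw)
freq_normalised = draw(normalised)
```

Normalising first is harmless and keeps the intent explicit, so this is a simplification option rather than a required change.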