Workflow for generating synthetic data
Encapsulate all the functionality of scRNA-seq data generation in a single workflow, written in Nextflow (https://github.com/nextflow-io/nextflow).
Inputs (I#):
- I1: Path to genome sequence file (fasta)
- I2: Path to genome annotation file (gtf)
- I3: Path to gene expression values (csv: geneID,count)
- I4: Total number of transcripts to sample
- I5: Probability of intron inclusion
- I6: Script containing function for constructing poly(A) tails
- I7: Length of poly(A) tails
- I8: Dictionary with nucleotide frequencies in poly(A) tails
- I9: Primer sequence
- I10: Threshold for the energy of primer-mRNA interaction needed for priming
- I11: Mean and standard deviation of fragment length
- I12: Read length (number of sequencing cycles)
- I13: Number of cells to simulate
- I14: Directory for storing output files
- I15: Software for predicting energy of primer-target interaction
- I16: Pattern specifying the reads file name for an individual cell
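
The inputs above could be collected in one typed container before they are handed to the workflow. The following is a minimal Python sketch for illustration only, not the Nextflow code this issue asks for. All field names and example values are hypothetical, and the sanity checks (probability within [0, 1], positive counts, non-empty primer) are assumptions about sensible values rather than requirements stated here.

```python
from dataclasses import dataclass, replace
from pathlib import Path
from typing import Dict


@dataclass(frozen=True)
class WorkflowInputs:
    """Inputs I1..I16 in list order. All field names are hypothetical."""
    genome_fasta: Path                 # I1
    annotation_gtf: Path               # I2
    expression_csv: Path               # I3 (geneID,count)
    n_transcripts: int                 # I4
    p_intron: float                    # I5
    polya_script: Path                 # I6
    polya_length: int                  # I7
    polya_nt_freqs: Dict[str, float]   # I8
    primer: str                        # I9
    priming_energy_threshold: float    # I10
    fragment_mean: float               # I11 (mean)
    fragment_sd: float                 # I11 (standard deviation)
    read_length: int                   # I12
    n_cells: int                       # I13
    outdir: Path                       # I14
    energy_tool: str                   # I15
    reads_name_pattern: str            # I16

    def problems(self) -> list:
        """Return readable problems; an empty list means the values look sane."""
        found = []
        if not 0.0 <= self.p_intron <= 1.0:
            found.append("p_intron must lie in [0, 1]")
        # Assumed: these counts and lengths must be strictly positive.
        for name in ("n_transcripts", "read_length", "n_cells"):
            if getattr(self, name) <= 0:
                found.append(f"{name} must be a positive integer")
        if self.fragment_mean <= 0 or self.fragment_sd < 0:
            found.append("fragment length needs mean > 0 and sd >= 0")
        if not self.primer:
            found.append("primer sequence must not be empty")
        return found


# Placeholder values, chosen only to make the sketch runnable.
example = WorkflowInputs(
    genome_fasta=Path("genome.fa"),
    annotation_gtf=Path("annotation.gtf"),
    expression_csv=Path("expression.csv"),
    n_transcripts=1000,
    p_intron=0.1,
    polya_script=Path("polya.py"),
    polya_length=100,
    polya_nt_freqs={"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    primer="TTTTTTTTTTTTTTT",
    priming_energy_threshold=-10.0,
    fragment_mean=300.0,
    fragment_sd=50.0,
    read_length=100,
    n_cells=10,
    outdir=Path("results"),
    energy_tool="placeholder-tool",
    reads_name_pattern="cell_{id}.fastq",
)

print(example.problems())                         # no problems expected
print(replace(example, p_intron=1.5).problems())  # flags the probability
```

Checking the values once, up front, would let a run fail before any per-cell simulation starts.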
Output of this issue: Nextflow code for executing the workflow
Outputs (O#) of the entire workflow:
- O1: Path to sampled transcript structures (gtf)
- O2: Path to transcript counts (csv: transcriptID,count)
- O3: Path to sampled transcript sequences (fasta)
- O4: Path to annotated internal priming sites (gtf)
- O5: Path to unique cDNA sequences
- O6: Path to cDNA count table
- O7: Path to sequences of terminal fragments (fasta)
- O8: Path to read sequences
The workflow will include the following steps:
- Repeat the simulation for the required number of cells (I13)
- Generate transcript structures (#2)
  - Inputs: I2, I3, I5
  - Outputs: O1, O2
- Extract transcript sequences (#3)
  - Inputs: I1, I6, I7, I8, O1
  - Outputs: O3
- Predict priming sites (#4)
  - Inputs: I9, I10, I15, O3
  - Outputs: O4
- Generate cDNAs (#5)
  - Inputs: O2, O3, O4
  - Outputs: O5, O6
- Terminal fragment selection (#6)
  - Inputs: O5, O6, I11
  - Outputs: O7
- Read sequencing (#7)
  - Inputs: O7, I12
  - Outputs: O8
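
The step wiring listed above can be checked mechanically before any Nextflow code is written. The sketch below is a Python illustration under stated assumptions: the step keys are hypothetical names, the I#/O# labels are copied from the steps as listed, and the total of 16 inputs is taken from the input list in this issue. It derives an execution order from the O# dependencies and reports which inputs no individual step consumes.

```python
# Step wiring copied from the list above: name -> (inputs, outputs).
# The step keys are hypothetical names, not existing process names.
STEPS = {
    "generate_transcript_structures": (["I2", "I3", "I5"], ["O1", "O2"]),
    "extract_transcript_sequences": (["I1", "I6", "I7", "I8", "O1"], ["O3"]),
    "predict_priming_sites": (["I9", "I10", "I15", "O3"], ["O4"]),
    "generate_cdnas": (["O2", "O3", "O4"], ["O5", "O6"]),
    "select_terminal_fragments": (["O5", "O6", "I11"], ["O7"]),
    "sequence_reads": (["O7", "I12"], ["O8"]),
}
N_INPUTS = 16  # number of entries in the input list


def execution_order(steps):
    """Order steps so each runs after the steps that produce its O# inputs."""
    producer = {out: name for name, (_, outs) in steps.items() for out in outs}
    order, done = [], set()

    def visit(name, stack=()):
        if name in done:
            return
        if name in stack:
            raise ValueError(f"cyclic dependency through {name}")
        for label in steps[name][0]:
            if label.startswith("O"):
                if label not in producer:
                    raise ValueError(f"{name} needs {label}, which no step produces")
                visit(producer[label], stack + (name,))
        done.add(name)
        order.append(name)

    for name in steps:
        visit(name)
    return order


def unused_inputs(steps, n_inputs=N_INPUTS):
    """List the I# labels that no step names among its inputs."""
    used = {
        label
        for ins, _ in steps.values()
        for label in ins
        if label.startswith("I")
    }
    return [f"I{i}" for i in range(1, n_inputs + 1) if f"I{i}" not in used]


print(execution_order(STEPS))
print(unused_inputs(STEPS))
```

With the wiring as listed, the derived order matches the order of the steps, and I4, I13, I14 and I16 are not named by any individual step. I13 drives the per-cell repetition, and I14 and I16 plausibly concern where outputs are written, but the step that should consume I4 (total number of transcripts to sample) may be worth confirming.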