@@ -7,7 +7,7 @@ The second approach to assessing the accuracy of computational analysis methods
In this project we will implement a procedure for sampling reads from mRNA sequences, incorporating a few sources of “noise”. These include the presence of multiple transcript isoforms from a given gene, some of which are incompletely spliced, the stochastic binding of primers to RNA fragments, and the stochastic sampling of DNA fragments for sequencing. We will then use standard methods to estimate gene expression from the simulated data. We will repeat the process multiple times, with each repetition corresponding to a single cell. We will then compare the estimates obtained from the simulated cells with the gene expression values assumed in the simulation. We will also explore which steps in the sample preparation have the largest impact on the accuracy of gene expression estimates.
-Inputs to the simuation:
+Inputs to the simulation:
1. CSV-formatted table “GeneID,Counts” specifying the number of transcripts expressed, on average, for each gene in a given cell type. These can come, for example, from a bulk RNA-seq experiment on sorted cells of a given type.
2. File with the genome sequence
3. GFF/GTF-formatted file with the transcript annotation of the genome
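
As a minimal sketch of how the first input could feed the per-cell simulation, the Python snippet below reads a “GeneID,Counts” table and draws transcript counts for a few simulated cells. Several things here are assumptions rather than part of the project specification: the table is taken to have a header row with exactly the column names `GeneID` and `Counts`, each cell is assumed to contain a fixed total number of transcripts, and transcripts are drawn with probability proportional to the average counts (a multinomial model). The function names and the example gene IDs are hypothetical.

```python
import csv
import io
import random
from collections import Counter


def read_mean_counts(handle):
    """Parse a "GeneID,Counts" table (header row assumed) into {gene_id: mean_count}."""
    reader = csv.DictReader(handle)
    return {row["GeneID"]: float(row["Counts"]) for row in reader}


def sample_cell(mean_counts, n_transcripts, rng):
    """Draw one cell's transcript counts.

    Assumption: a fixed number of transcripts per cell, each assigned to a
    gene with probability proportional to that gene's average count.
    """
    genes = list(mean_counts)
    weights = [mean_counts[g] for g in genes]
    drawn = rng.choices(genes, weights=weights, k=n_transcripts)
    tally = Counter(drawn)
    # Report every gene, including those that received no transcripts.
    return {g: tally.get(g, 0) for g in genes}


# Hypothetical example table; the gene IDs and values are made up.
example_csv = "GeneID,Counts\ngeneA,50\ngeneB,30\ngeneC,0\n"
mean_counts = read_mean_counts(io.StringIO(example_csv))

rng = random.Random(0)  # fixed seed so the simulated cells are reproducible
cells = [sample_cell(mean_counts, 80, rng) for _ in range(3)]
```

Each element of `cells` is a dictionary of per-gene transcript counts for one simulated cell. These are the "true" values against which expression estimates from the simulated reads would later be compared.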