@@ -6,3 +6,17 @@ For all of these reasons, testing the accuracy of computational analysis methods
The second approach to assessing the accuracy of computational analysis methods is to use *synthetic data*: generate data sets by simulating the experimental steps, then determine whether the computational analysis recovers the properties of the data that were assumed in the simulation. For example, if the goal of the analysis is to infer gene expression levels from scRNA-seq data, one simulates such data assuming specific transcript abundances, which the computational method should then recover. In general, it is very difficult to model each step of the experimental procedure accurately, so simulations leave out some (possibly many) of the complexities of the experiment. Good performance on simulated data is therefore more of a sanity check on the method than a guarantee that it will give accurate results on *real* data. Nevertheless, such sanity checks should be done. Furthermore, simulations can help build intuition about which steps of the experiment have the largest consequences for the outcome and where specific behaviors come from.
In this project we will implement a procedure for sampling reads from mRNA sequences, incorporating a few sources of “noise”. These include the presence of multiple transcript isoforms from a given gene, some of which are incompletely spliced, stochastic binding of primers to RNA fragments, and stochastic sampling of DNA fragments for sequencing. We will then use standard methods to estimate gene expression from the simulated data. We will repeat the process multiple times, with each repetition corresponding to a single cell. We will then compare the estimates obtained from the simulated cells with the gene expression values assumed in the simulation. We will also explore which steps in the sample preparation have the largest impact on the accuracy of gene expression estimates.
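
As a rough illustration of the sampling procedure described above, the following is a minimal sketch, not the project's implementation. All function names (`build_transcript`, `sample_read`, `simulate_cell`) are hypothetical, and several modeling choices are assumptions made only for this example: each intron is retained independently with a constant probability, fragment lengths are normally distributed, fragment start positions are uniform, the read is taken from the start of the fragment, and genes are drawn in proportion to their expected transcript counts. Primer binding is not modeled here.

```python
import random


def build_transcript(exons, introns, p_retain, rng):
    """Join exons, keeping each intron independently with probability p_retain."""
    parts = []
    for i, exon in enumerate(exons):
        parts.append(exon)
        if i < len(introns) and rng.random() < p_retain:
            parts.append(introns[i])  # incompletely spliced isoform
    return "".join(parts)


def sample_read(transcript, mean_len, sd_len, read_len, rng):
    """Draw one fragment; return its first read_len bases, or None if too short."""
    if len(transcript) < read_len:
        return None
    # Normally distributed fragment length (assumption), capped at transcript length.
    frag_len = min(int(round(rng.gauss(mean_len, sd_len))), len(transcript))
    if frag_len < read_len:
        return None  # fragment cannot yield a full-length read
    start = rng.randint(0, len(transcript) - frag_len)  # uniform start (assumption)
    return transcript[start:start + read_len]


def simulate_cell(genes, mean_counts, n_reads, p_retain,
                  mean_len, sd_len, read_len, rng):
    """Return {gene_id: [reads]} for one simulated cell.

    genes maps gene_id -> (list of exon sequences, list of intron sequences).
    Genes are sampled in proportion to mean_counts (assumption).
    """
    gene_ids = list(genes)
    weights = [mean_counts[g] for g in gene_ids]
    reads = {g: [] for g in gene_ids}
    for gene_id in rng.choices(gene_ids, weights=weights, k=n_reads):
        exons, introns = genes[gene_id]
        transcript = build_transcript(exons, introns, p_retain, rng)
        read = sample_read(transcript, mean_len, sd_len, read_len, rng)
        if read is not None:
            reads[gene_id].append(read)
    return reads
```

Calling `simulate_cell` once per cell with a shared `random.Random(seed)` instance gives reproducible runs; the per-gene read lists can then be passed to an expression estimator and compared with `mean_counts`.
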
Inputs to the simulation:
1. CSV-formatted table with columns “GeneID,Counts”, specifying the average number of transcripts expressed for each gene in a given cell type. These can come, for example, from a bulk RNA-seq experiment on sorted cells of a given type.
2. File with the genome sequence
3. GFF/GTF-formatted file with the transcript annotation of the genome
4. Number of reads to sequence
5. Number of cells to simulate
6. Mean and standard deviation of RNA fragment length
7. Read length
8. Probability of intron inclusion: initially a single constant applied to every intron, which can later be extended to intron-specific values. In the latter case, estimates could be obtained from bulk RNA-seq data by dividing the average per-position coverage in a given intron by the average per-position coverage of the gene, or of the flanking exons.
9. Option to add poly(A) tails to transcripts and an associated function for generating these tails (with specific length distribution and non-A nucleotide frequency).
10. Parameters for evaluating internal priming: primer sequence and a function implementing the constraints on priming sites (accessibility, energy of interaction, perfect matching at the last primer position, etc.).
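
Inputs 8 to 10 each involve a small function, which could be sketched as follows. This is only an illustration under stated assumptions, and all names (`intron_inclusion_estimate`, `polya_tail`, `priming_sites`) are hypothetical. The tail length is assumed to be normally distributed, non-A nucleotides are assumed to be chosen uniformly from C, G and T, and the priming rule is a deliberately simplified stand-in that counts complementary positions instead of modeling accessibility or interaction energy.

```python
import random

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def intron_inclusion_estimate(intron_cov, exon_cov):
    """Mean per-position intron coverage divided by mean per-position
    exon (or gene) coverage, capped at 1."""
    if not intron_cov or not exon_cov:
        raise ValueError("coverage lists must be non-empty")
    mean_exon = sum(exon_cov) / len(exon_cov)
    if mean_exon == 0:
        return 0.0  # no exonic signal, so no evidence of inclusion
    return min(1.0, (sum(intron_cov) / len(intron_cov)) / mean_exon)


def polya_tail(mean_len, sd_len, non_a_freq, rng):
    """Generate a poly(A) tail with occasional non-A nucleotides.

    Length is normally distributed (assumption) and truncated at zero.
    """
    length = max(0, int(round(rng.gauss(mean_len, sd_len))))
    return "".join(
        rng.choice("CGT") if rng.random() < non_a_freq else "A"
        for _ in range(length)
    )


def priming_sites(rna, primer, min_matches):
    """Return start positions in rna where the primer could anneal.

    Simplified rule (assumption): the site must be complementary to the
    primer at min_matches or more positions, and the base pairing with the
    last (3') primer position must match perfectly. Sequences use the DNA
    alphabet, as they would when taken from a genome file.
    """
    # The primer binds antiparallel, so the target is its reverse complement;
    # target[0] is the base that pairs with the primer's last position.
    target = "".join(_COMPLEMENT[b] for b in reversed(primer))
    sites = []
    for start in range(len(rna) - len(target) + 1):
        window = rna[start:start + len(target)]
        if window[0] != target[0]:
            continue  # mismatch at the last primer position
        matches = sum(w == t for w, t in zip(window, target))
        if matches >= min_matches:
            sites.append(start)
    return sites
```

For example, with an oligo(dT)-like primer `"TTTT"`, `priming_sites` reports A-rich stretches inside the transcript body as candidate internal priming sites, in addition to any poly(A) tail appended by `polya_tail`.
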