Analysis of internal priming sites
To design better methods for identifying internal priming sites, we need cleaner and more comprehensive data sets of such sites, as well as of bona fide poly(A) sites. Such data could be used to train machine learning models that classify putative sites obtained from scRNA-seq. We will construct a synthetic data set from our simulation data and visualize the two sequence sets with sequence logos (WebLogo can be used to make the logos: https://github.com/WebLogo/weblogo).
Input:
- BAM file with read alignments to the genome
- GTF file with genome annotations
- Parameters for defining the neighborhood around sites
- Type of "background" (e.g. Markov of a particular order)
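
As an illustration of the "background" input, a Markov model of a given order can be estimated from a set of sequences by counting which base follows each context of `order` bases. The sketch below is only one possible reading of that parameter: the function name `fit_markov_background`, the pseudocount, and the choice to skip contexts containing non-ACGT symbols are all assumptions, not part of the task description.

```python
from collections import Counter, defaultdict


def fit_markov_background(sequences, order=1, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from sequences.

    Hypothetical helper: returns {context: {base: probability}}.
    With order=0 the only context is the empty string, i.e. plain
    base frequencies.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Assumption: ignore positions whose context or base is not in
            # the alphabet (e.g. N), rather than imputing them.
            if base in alphabet and all(c in alphabet for c in context):
                counts[context][base] += 1

    model = {}
    for context, ctr in counts.items():
        # The pseudocount keeps unseen transitions from getting probability 0.
        total = sum(ctr.values()) + pseudocount * len(alphabet)
        model[context] = {b: (ctr[b] + pseudocount) / total for b in alphabet}
    return model


if __name__ == "__main__":
    background = fit_markov_background(["ACGTACGTAAAA", "TTTTACGA"], order=1)
    for context in sorted(background):
        print(context, background[context])
```

Contexts that never occur in the training sequences are absent from the returned dictionary, so a caller would need to decide how to handle them (for example by falling back to a lower order).
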
Output:
- Sequences around internal priming sites
- Sequences around bona fide transcript ends
- Figure with sequence logos for the two sequence sets
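
To make the outputs concrete, the following sketch extracts a strand-aware window around each site and computes the per-position information content that a sequence logo displays. It rests on several assumptions that are not in the description above: the genome sequence is available as a dictionary from contig name to string (a genome FASTA is not listed among the inputs), site positions are 0-based, and the helper names `window_around_site` and `information_content` are hypothetical. The information content uses a uniform background with no small-sample correction, so values from WebLogo may differ.

```python
import math

# Complement table used for reverse-complementing minus-strand windows.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def window_around_site(genome, chrom, pos, strand, upstream=10, downstream=10):
    """Return the sequence around a 0-based site, oriented 5'->3' on the
    transcript strand, or None if the window runs off the contig."""
    seq = genome[chrom]
    if strand == "+":
        start, end = pos - upstream, pos + downstream + 1
    else:
        # On the minus strand, "upstream" lies at higher genomic coordinates.
        start, end = pos - downstream, pos + upstream + 1
    if start < 0 or end > len(seq):
        return None
    window = seq[start:end].upper()
    if strand == "-":
        window = window.translate(COMPLEMENT)[::-1]
    return window


def information_content(sequences, alphabet="ACGT"):
    """Per-position information content in bits for equal-length sequences.

    Assumes a uniform background and applies no small-sample correction.
    """
    max_bits = math.log2(len(alphabet))
    bits = []
    for i in range(len(sequences[0])):
        column = [s[i] for s in sequences if s[i] in alphabet]
        if not column:
            bits.append(0.0)
            continue
        entropy = 0.0
        for base in alphabet:
            p = column.count(base) / len(column)
            if p > 0:
                entropy -= p * math.log2(p)
        bits.append(max_bits - entropy)
    return bits


if __name__ == "__main__":
    genome = {"chr1": "ACGTACGTAA"}  # toy contig, for illustration only
    sites = [("chr1", 4, "+"), ("chr1", 4, "-")]
    windows = [window_around_site(genome, c, p, s, 2, 2) for c, p, s in sites]
    print(windows)
    print(information_content([w for w in windows if w is not None]))
```

Running the same two functions separately on the internal priming sites and on the bona fide transcript ends would give the two sequence sets and a numeric summary of what each logo shows.
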