Analysis of internal priming sites

To design better methods for identifying internal priming sites, we need more comprehensive and cleaner data sets of such sites, as well as of bona fide poly(A) sites. Such data could be used to train machine learning models that classify putative sites obtained from scRNA-seq. We will construct a synthetic data set from our simulation data. We will also visualize the two sequence sets as sequence logos, which can be made with WebLogo (https://github.com/WebLogo/weblogo).

Input:

BAM file with read alignments to the genome
GTF file with genome annotations
Parameters for defining the neighborhood around sites
Type of "background" model (e.g. a Markov chain of a particular order)
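
To make the last two inputs concrete, here is a minimal Python sketch of how a neighborhood around a site could be cut out and how a Markov background of a given order could be estimated. It is an illustration under assumptions, not a prescribed implementation: the function names, the 0-based site coordinate, the `upstream`/`downstream` parameters and the pseudocount are all hypothetical choices, and the chromosome sequence is assumed to be already loaded as a string.

```python
from collections import Counter, defaultdict

# Translation table for complementing DNA (assumes plain A/C/G/T input).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_neighborhood(chrom_seq, site, strand, upstream, downstream):
    """Cut out the window around a 0-based site, in transcript orientation.

    `upstream` and `downstream` are hypothetical neighborhood parameters,
    counted relative to the direction of transcription. Returns None if
    the window would extend past either end of the chromosome.
    """
    if strand == "+":
        start, end = site - upstream, site + downstream + 1
    else:
        # On the minus strand, upstream lies at higher genomic coordinates.
        start, end = site - downstream, site + upstream + 1
    if start < 0 or end > len(chrom_seq):
        return None
    window = chrom_seq[start:end].upper()
    return window if strand == "+" else reverse_complement(window)


def markov_background(sequences, order, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from a set of sequences.

    Returns a dict mapping each observed context to a dict of base
    probabilities. Contexts that never occur are absent from the result.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip k-mers containing ambiguous symbols such as N.
            if base in alphabet and all(c in alphabet for c in context):
                counts[context][base] += 1
    model = {}
    for context, ctr in counts.items():
        total = sum(ctr.values()) + pseudocount * len(alphabet)
        model[context] = {b: (ctr[b] + pseudocount) / total for b in alphabet}
    return model
```

With `order=0` the only context is the empty string, so the model reduces to plain base frequencies. Which sequences the background is estimated from (for example the extracted windows themselves or wider genomic regions) is left open here.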

Output:

Sequences around internal priming sites
Sequences around bona fide transcript ends
Figure with the sequence logos for the two sequence sets
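
As a rough guide to what the logos summarize, the following Python sketch computes per-position base frequencies and information content (in bits) for a set of equal-length sequences. It is a simplified illustration: it assumes a uniform background and applies no small-sample correction, so the values may differ from what a logo tool reports, and the function names are hypothetical.

```python
import math
from collections import Counter


def position_frequencies(sequences, alphabet="ACGT"):
    """Return one dict of base frequencies per alignment column."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    freqs = []
    for pos in range(length):
        # Count only symbols from the alphabet (ignores N and similar).
        column = Counter(s[pos] for s in sequences if s[pos] in alphabet)
        total = sum(column.values())
        if total == 0:
            # No usable bases in this column: fall back to uniform.
            freqs.append({b: 1.0 / len(alphabet) for b in alphabet})
        else:
            freqs.append({b: column[b] / total for b in alphabet})
    return freqs


def information_content(freqs):
    """Bits per position: log2(alphabet size) minus Shannon entropy."""
    bits = []
    for column in freqs:
        entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
        bits.append(math.log2(len(column)) - entropy)
    return bits
```

Running both functions on the internal priming set and on the bona fide set gives two per-position profiles that can be compared directly, alongside the logo figure.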