Extract transcript sequences
Given a GTF specification of the exon/intron structures of transcripts and the genome sequence, construct the nucleotide sequence of each transcript and append a poly(A) tail.
Input:
- GTF file with exon/intron structures of transcripts
- File with genome sequence
- Length of the poly(A) tail
- Dictionary of expected nucleotide frequencies in poly(A) tail
Output: FASTA-formatted file of transcript sequences
For each transcript, traverse the list of exons from 5' to 3', extract the sequence of each exon from the genome using its coordinates, and concatenate the exon sequences. Then append a tail of the specified length to the 3' end of the transcript, sampling its nucleotides according to the given mononucleotide frequencies (the frequency of A will, of course, be much higher than that of any other nucleotide).
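The exon extraction and concatenation step could be sketched roughly as follows. This is only an illustration under several assumptions: the genome is already loaded into a plain dict mapping chromosome names to sequences, exon coordinates are 1-based and inclusive (the GTF convention), and transcripts on the minus strand are reverse-complemented so that the result reads 5' to 3'. The names `reverse_complement` and `build_transcript_sequence` are hypothetical and do not come from any existing code.

```python
# Hypothetical helpers; the genome is assumed to be a dict {chromosome: sequence}.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_transcript_sequence(genome, chrom, exons, strand):
    """Concatenate exon sequences into a transcript sequence (5' to 3').

    exons: iterable of (start, end) tuples, 1-based inclusive as in GTF.
    strand: "+" or "-".
    """
    # Sort exons by genomic position and splice them together.
    ordered = sorted(exons)
    spliced = "".join(genome[chrom][start - 1:end] for start, end in ordered)
    # On the minus strand, the 5'-most exon has the highest coordinate;
    # reverse-complementing the spliced sequence gives the 5'-to-3' order.
    return reverse_complement(spliced) if strand == "-" else spliced


genome = {"chr1": "AAACCCGGGTTT"}
exons = [(7, 9), (1, 3)]
print(build_transcript_sequence(genome, "chr1", exons, "+"))  # AAAGGG
print(build_transcript_sequence(genome, "chr1", exons, "-"))  # CCCTTT
```

Parsing the GTF file into per-transcript exon lists and reading the genome file are left out here, since the choice of parser is not specified.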
Note that the poly(A) tail can be added by a generic padding function, which could be reused in other contexts as well.
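One possible shape for such a padding function is sketched below. It assumes the frequencies are passed as a dict from nucleotide to weight and that each tail position is sampled independently; the function name `pad_sequence` and the optional `rng` parameter (for reproducible sampling) are hypothetical choices, not part of the specification.

```python
import random


def pad_sequence(seq, length, freqs, rng=None):
    """Append `length` randomly sampled nucleotides to the 3' end of `seq`.

    freqs: dict mapping nucleotide -> relative frequency (assumed format).
    rng: optional random.Random instance, e.g. seeded for reproducibility.
    """
    if length <= 0:
        return seq
    if rng is None:
        rng = random.Random()
    nucleotides = list(freqs)
    weights = [freqs[n] for n in nucleotides]
    # Each position is drawn independently according to the given weights.
    tail = "".join(rng.choices(nucleotides, weights=weights, k=length))
    return seq + tail


poly_a_freqs = {"A": 0.9, "C": 0.04, "G": 0.04, "T": 0.02}
padded = pad_sequence("ACGTACGT", 20, poly_a_freqs, rng=random.Random(42))
print(len(padded))  # 28
```

Because the function only takes a sequence, a length, and a frequency table, it makes no assumption that the padding is a poly(A) tail, which is what would allow it to be reused elsewhere.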