Read sequencing
Simulate the sequencing of reads, using terminal fragments as templates. Each read is a fixed-length copy starting at the 5' end of a fragment. If the desired read length exceeds the fragment length, sequencing would in principle proceed into the 3' adaptor and then possibly yield random bases. For simplicity, we assume here that the missing positions are filled with random nucleotides.
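The padding rule described above can be sketched for a single read as follows. This is only an illustration under assumptions: the function name `make_read`, its parameters, and the use of Python's standard `random` module are hypothetical and not prescribed by this specification.

```python
import random


def make_read(fragment, read_length, nuc_freqs, rng=random):
    """Return a read of exactly read_length bases taken from the 5' end of fragment.

    Hypothetical helper: nuc_freqs maps nucleotides to relative frequencies,
    e.g. {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}.
    """
    # Copy up to read_length bases from the 5' end (start of the string).
    read = fragment[:read_length]
    missing = read_length - len(read)
    if missing > 0:
        # Fragment too short: append random nucleotides at the 3' end,
        # drawn according to the given frequencies.
        bases = list(nuc_freqs)
        weights = [nuc_freqs[b] for b in bases]
        read += "".join(rng.choices(bases, weights=weights, k=missing))
    return read
```

For example, `make_read("ACGTACGT", 4, freqs)` returns the first four bases unchanged, while a fragment of two bases and a read length of five gains three random bases at its 3' end.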
Input:
- Fasta-formatted file of sequences of terminal fragments from transcripts
- Number of reads to sample
- Read length (number of sequencing cycles)
- Dictionary of nucleotide frequencies, used to pad the read if the input fragment is shorter than the read length
Output: Fasta-formatted file of reads of identical length, representing 5' ends of the terminal fragments.
To generate each read, a terminal fragment is sampled from input 1, with replacement. A segment of the specified read length (input 3) is then extracted from the 5' end of the fragment. If the fragment is shorter than the read length, random nucleotides are appended to the 3' end, drawn according to the probabilities given in input 4, until the read length is reached. A unique name is created for each read, and the name and read sequence are written to the output file in Fasta format. The process is repeated for the specified number of reads (input 2).
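A minimal end-to-end sketch of this procedure is shown below. Several points are assumptions rather than part of the specification: the function names `read_fasta` and `simulate_reads`, the `seed` parameter, uniform sampling of fragments (the text only requires sampling with replacement), and the read naming scheme `read_<index>_<fragment name>`.

```python
import random


def read_fasta(path):
    """Parse a Fasta file into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                header = line[1:].split()
                # Keep only the first word of the header as the name.
                name, chunks = (header[0] if header else ""), []
            else:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def simulate_reads(fragments_path, n_reads, read_length, nuc_freqs,
                   out_path, seed=None):
    """Write n_reads reads of length read_length to out_path in Fasta format."""
    rng = random.Random(seed)  # seed is an assumption, for reproducibility
    fragments = read_fasta(fragments_path)
    if not fragments:
        raise ValueError("no fragments found in input file")
    bases = list(nuc_freqs)
    weights = [nuc_freqs[b] for b in bases]
    with open(out_path, "w") as out:
        for i in range(1, n_reads + 1):
            # Sampling with replacement; uniform choice is an assumption.
            frag_name, frag_seq = rng.choice(fragments)
            read = frag_seq[:read_length]  # 5' end of the fragment
            missing = read_length - len(read)
            if missing > 0:
                # Pad the 3' end with random nucleotides.
                read += "".join(rng.choices(bases, weights=weights, k=missing))
            # The running index i makes every read name unique.
            out.write(f">read_{i}_{frag_name}\n{read}\n")
```

A call such as `simulate_reads("fragments.fa", 1000, 50, {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, "reads.fa", seed=42)` would write 1000 reads of 50 nucleotides each; the file names here are placeholders.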