Sample transcript counts given average expression levels

Given a total number of transcripts, their relative abundance in a sample, and the genome annotation, sample representative transcripts per gene in proportion to their relative abundance levels.

Input:

Csv-formatted file ("ID,Level") with expression levels per gene (or per transcript).
Total number of transcripts to sample.
Gtf-formatted file with the intron/exon coordinates of the transcripts represented in the expression file.

Output:

Gtf-formatted file of the sampled transcripts.
Csv-formatted file ("ID,Count") with the number of copies sampled for each representative transcript.

First, we pick a representative transcript for each gene in the annotation file. The representative is the transcript with the highest level of experimental support (i.e. the lowest transcript support level value). If a gene has multiple such transcripts, the one that covers the largest genomic region is chosen (based on the coordinates of its exons).
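
The selection rule above can be sketched as follows. This is only a minimal illustration under several assumptions: the annotation has already been parsed into one tuple per exon, the transcript support level is available as an integer (how missing or "NA" values are handled is left open), and "largest genomic region" is interpreted as the span from the first exon start to the last exon end. The function name `pick_representatives` and the tuple layout are hypothetical.

```python
def pick_representatives(exons):
    """Return {gene_id: representative_transcript_id}.

    exons: iterable of (gene_id, transcript_id, tsl, start, end) tuples,
    one per exon (hypothetical layout; tsl is an integer, lower is better).
    """
    # Collect the support level and genomic extent of every transcript.
    info = {}
    for gene_id, tx_id, tsl, start, end in exons:
        rec = info.setdefault(
            tx_id, {"gene": gene_id, "tsl": tsl, "start": start, "end": end}
        )
        rec["start"] = min(rec["start"], start)
        rec["end"] = max(rec["end"], end)

    # Keep, per gene, the transcript with the lowest TSL;
    # ties are broken by the largest span (assumed interpretation).
    best = {}
    for tx_id, rec in info.items():
        span = rec["end"] - rec["start"] + 1
        key = (rec["tsl"], -span)
        gene = rec["gene"]
        if gene not in best or key < best[gene][0]:
            best[gene] = (key, tx_id)
    return {gene: tx_id for gene, (_, tx_id) in best.items()}


# Toy example with made-up IDs and coordinates.
exons = [
    ("g1", "t1", 1, 100, 200),
    ("g1", "t1", 1, 300, 400),
    ("g1", "t2", 1, 100, 200),
    ("g1", "t2", 1, 300, 500),
    ("g1", "t3", 2, 50, 1000),
    ("g2", "t4", 3, 10, 20),
]
representatives = pick_representatives(exons)
```

In this toy input, t1 and t2 share the best support level, and t2 covers the larger region, so it would be chosen for g1 even though t3 spans more bases. Transcripts that are still tied after both criteria are resolved by input order here, which is an arbitrary choice.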

Then, we sample transcript counts up to the specified total, in proportion to the expression levels given in the first input file. The expression levels can be provided either per transcript ID or per gene ID. If they are given per transcript, the listed transcripts are not guaranteed to be the representative ones, so the expression has to be reassigned to the representative transcript of each gene. If they are given per gene, they need to be assigned to the representative transcript as well. So, a dictionary of representative transcript ID : gene ID has to be built first. Then the expression of all transcripts associated with a gene should be summed on a per-gene basis (if the expression values are not already provided per gene), and the resulting gene expression level should be transferred to the representative transcript and written out.
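
A possible sketch of this step is given below. It assumes the expression file has already been read into a dictionary, that a transcript-to-gene mapping for all annotated transcripts is available, and that "sampling in proportion to expression" means drawing the total number of transcripts with replacement, weighted by expression level. That sampling scheme is an assumption, not something the description prescribes. The function names `expression_per_representative` and `sample_counts` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def expression_per_representative(levels, tx_to_gene, rep_to_gene):
    """Map expression levels onto representative transcripts.

    levels: {ID: level}, where ID is a gene ID or a transcript ID.
    tx_to_gene: {transcript_id: gene_id} for all annotated transcripts.
    rep_to_gene: {representative_transcript_id: gene_id}.
    """
    # Sum expression per gene; an ID that is not a known transcript
    # is treated as a gene ID (assumption).
    per_gene = defaultdict(float)
    for id_, level in levels.items():
        per_gene[tx_to_gene.get(id_, id_)] += level

    # Transfer each gene-level value to the gene's representative transcript.
    gene_to_rep = {gene: tx for tx, gene in rep_to_gene.items()}
    return {
        gene_to_rep[gene]: level
        for gene, level in per_gene.items()
        if gene in gene_to_rep
    }


def sample_counts(rep_levels, total, seed=None):
    """Draw `total` transcripts, weighted by expression level."""
    rng = random.Random(seed)
    ids = list(rep_levels)
    weights = [rep_levels[i] for i in ids]
    return Counter(rng.choices(ids, weights=weights, k=total))


# Toy example with made-up IDs and levels, given per transcript.
tx_to_gene = {"t1": "g1", "t2": "g1", "t4": "g2"}
rep_to_gene = {"t2": "g1", "t4": "g2"}
rep_levels = expression_per_representative(
    {"t1": 1.0, "t2": 3.0, "t4": 0.0}, tx_to_gene, rep_to_gene
)
counts = sample_counts(rep_levels, total=100, seed=1)
```

The counts always sum to the requested total, and a transcript with zero expression is never drawn. Note that `random.choices` raises a `ValueError` if all weights are zero, so an input without any expressed gene would need separate handling.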

Edited Aug 18, 2022 by MihaelaZavolan