Add rule to create transcriptome from genome and gene annotations
Currently, three genome resource files have to be provided as input to the workflow:
- a genome FASTA file
- gene annotations in GTF format
- a transcriptome FASTA file
Requiring more inputs than necessary places an undue burden on the user and is a source of error, especially when there is no single source of truth for a given piece of information (here: transcript sequences). A transcriptome explicitly provided by the user may well not match the one implied by the entries in the gene annotation file in combination with the genome sequences they refer to.
The workflow should thus include a rule that, based on the gene annotations and genome file, automatically creates the transcriptome.
One way to do that is via the gffread utility that comes with Cufflinks. The tool compiles the transcriptome from the genome and gene annotations with the following call (tested on human annotations/genome with the gffread packaged with cufflinks 2.2.1):
gffread -g GENOME_FASTA -w TRANSCRIPTOME_FASTA GFF/GTF
Note that the automatic parsing of transcript names/identifiers may or may not work as intended for a given gene annotation file, as the GFF/GTF format does not contain a standardized field for feature identifiers. Instead, identifiers are typically given in the attributes field by convention (e.g., Ensembl uses transcript_id in at least human and mouse), but this assumption may not hold in every case.
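To check up front whether a given annotation file follows the transcript_id convention, something along these lines could be used. This is only a minimal sketch: the function name is hypothetical, and it assumes tab-separated GTF records with nine columns and attributes written as key "value"; pairs.

```python
import re

# Assumption: attributes follow the GTF convention `key "value";`
_TRANSCRIPT_ID = re.compile(r'\btranscript_id "([^"]+)"')


def transcript_id_from_gtf_line(line):
    """Return the transcript_id of a GTF record, or None if it has none."""
    if line.startswith("#"):
        # Header/comment lines carry no feature information
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        # Not a well-formed GTF record
        return None
    match = _TRANSCRIPT_ID.search(fields[8])
    return match.group(1) if match else None
```

Running this over all exon entries and counting the None results would give a quick indication of whether the identifiers can be parsed as expected.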
As an alternative, BEDTools provides the getfasta functionality:
bedtools getfasta -s -fi GENOME_FASTA -bed GFF/GTF/BED/VCF > OUTPUT_FASTA
However, this will produce sequences for all of the features in the GTF file, will not merge the exons of a given transcript and does not automatically set the transcript names as required. This solution would therefore require considerable pre-processing (filtering of exon entries, conversion to BED12 format, parsing of the transcript identifier).
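To give an idea of what that pre-processing would involve, here is a rough sketch that keeps only exon entries, groups them by transcript and emits one BED12 line per transcript. The function name is hypothetical, and the sketch assumes that identifiers are stored as transcript_id "..." in the attributes field and that all exons of a transcript lie on the same chromosome and strand. GTF coordinates are 1-based and inclusive while BED coordinates are 0-based and half-open, hence the start - 1.

```python
import re
from collections import defaultdict

# Assumption: identifiers are stored as `transcript_id "..."`
_TRANSCRIPT_ID = re.compile(r'\btranscript_id "([^"]+)"')


def gtf_exons_to_bed12(gtf_lines):
    """Convert the exon entries of a GTF into one BED12 line per transcript."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Keep exon entries only; skip genes, transcripts, CDS, etc.
        if len(fields) != 9 or fields[2] != "exon":
            continue
        match = _TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        key = (fields[0], fields[6], match.group(1))
        # GTF: 1-based inclusive -> BED: 0-based half-open
        exons[key].append((int(fields[3]) - 1, int(fields[4])))

    bed_lines = []
    for (chrom, strand, transcript_id), blocks in exons.items():
        blocks.sort()
        start, end = blocks[0][0], blocks[-1][1]
        sizes = ",".join(str(e - s) for s, e in blocks)
        offsets = ",".join(str(s - start) for s, _ in blocks)
        bed_lines.append("\t".join([
            chrom, str(start), str(end), transcript_id, "0", strand,
            str(start), str(end), "0", str(len(blocks)), sizes, offsets,
        ]))
    return bed_lines
```

The resulting file would then presumably have to be passed to bedtools getfasta with options for concatenating the BED12 blocks and for using the name column as the FASTA header, which the call above does not yet include. That is quite a bit more moving parts than the single gffread call.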