# TranslateLSTM
written by Niels Schlusser, 15.11.2023
This repository contains Python scripts to predict various measures of translation output from transcript sequences using TranslateLSTM, an artificial neural network architecture presented in "Current limitations in predicting mRNA translation with deep learning models".
There are deep learning scripts for three use cases:
1. training a model on synthetic MPRA data, in the directory translateLSTM_MPRA/
2. training a model on endogenous TE data, in the directory translateLSTM_endogenous/
3. transfer learning from (1) to (2), in the directory tl_TranslateLSTM_endogenous/
## Example training data
The following example training data are provided in the directory HEK293_training_data/:
- MPRA data from Sample et al. (2019)
- endogenous (riboseq/RNAseq) data based on Alexaki et al. (2020)
- clinvar variants based on Landrum et al. (2020)
## Scripts
Scripts that:
1. turn the output of RNAseq and ribosome profiling into translation efficiency estimates
2. append non-sequential features to a given data set
3. construct a data set from a vcf file
can be found in the directory training_data_preprocessing/.
## Preprocessing
### MPRA data
The preprocessing procedure for MPRA data calculates and appends the non-sequential features
- UTR_length
- number_outframe_uAUGs
- number_inframe_uAUGs
- normalized_5p_folding_energy
- GC_content
to the input file.
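For orientation, the sketch below shows how such features can be computed from a 5'UTR sequence. It is a minimal illustration, not the repository's code: the frame convention for uAUGs and the per-nucleotide normalization of the folding energy are assumptions, and the folding energy relies on the ViennaRNA Python bindings.
```python
# Minimal sketch, not the repository's code. Assumes the ViennaRNA
# Python bindings (import RNA) are installed; the frame convention and
# the normalization are illustrative assumptions.
import RNA

def nonsequential_features(utr: str) -> dict:
    utr = utr.upper().replace("U", "T")
    n = len(utr)
    inframe = outframe = 0
    for i in range(n - 2):
        if utr[i:i + 3] == "ATG":
            # assumption: "in-frame" = same reading frame as the main ORF,
            # i.e. the distance to the CDS start is a multiple of 3
            if (n - i) % 3 == 0:
                inframe += 1
            else:
                outframe += 1
    _, mfe = RNA.fold(utr.replace("T", "U"))  # minimum free energy, kcal/mol
    return {
        "UTR_length": n,
        "number_outframe_uAUGs": outframe,
        "number_inframe_uAUGs": inframe,
        "normalized_5p_folding_energy": mfe / n if n else 0.0,  # assumed per-nt normalization
        "GC_content": (utr.count("G") + utr.count("C")) / n if n else 0.0,
    }
```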
### Endogenous data
The preprocessing procedure for endogenous data takes mapping files from:
- riboseq data analysis
- RNAseq data analysis
as *input* and turns them into a file with:
- translation efficiencies
- 5'UTR sequences
- CDS sequences
- all available non-sequential features (UTR_length, number_outframe_uAUGs, number_inframe_uAUGs, normalized_5p_folding_energy, GC_content, number_exons, log_ORF_length)
#### In- and outputs
As an input from the riboseq side, you need:
- a transcriptome fasta file (generate with gffread)
- a file with the CDS coordinates of the longest protein coding transcript per gene
- bam and bai files of the riboseq mapping
- an alignment json file that contains the P-site offsets for different RPF lengths
- a tsv file that links gene ID and transcript ID
From the RNAseq side, you need:
- transcripts_numreads.tsv (output from kallisto)
- a file with the TIN scores (potentially per replicate)
- a tsv file that contains the read counts mapped to the CDS of a given gene (create with Rsubread)
- a file that contains the number of exons per transcript (generate from gtf file)
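To illustrate how the riboseq inputs fit together, the following sketch reads the offsets json and assigns P-sites on one transcript with pysam. The json layout ({read_length: offset}) and all file names are assumptions for illustration, not the repository's actual formats.
```python
# Illustrative only; the json layout and file names are assumptions.
import json
import pysam

with open("psite_offsets.json") as fh:
    offsets = {int(k): int(v) for k, v in json.load(fh).items()}

psite_positions = []
with pysam.AlignmentFile("riboseq.bam", "rb") as bam:  # requires the .bai index
    for read in bam.fetch("ENST00000000233"):  # reads mapped to one transcript
        offset = offsets.get(read.query_length)  # P-site offset depends on RPF length
        if offset is not None:
            psite_positions.append(read.reference_start + offset)
```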
#### Preprocessing procedure
The script for endogenous data preprocessing:
1. queries 5'UTRs from ENSEMBL via pybiomart
2. selects the most covered transcript according to kallisto
3. calculates the TE
4. computes and appends non-sequential features to the file.
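As a rough illustration of step 3, translation efficiency is the ratio of ribosome footprint density to mRNA density on the CDS, typically log-transformed. The sketch below assumes simple per-transcript count tables; the column names and the normalization are placeholders, not the script's actual implementation.
```python
# Rough sketch of a TE computation; column names and normalization are assumptions.
import numpy as np
import pandas as pd

ribo = pd.read_csv("ribo_cds_counts.tsv", sep="\t")  # transcript_id, count, cds_length
rna = pd.read_csv("rna_cds_counts.tsv", sep="\t")    # transcript_id, count

df = ribo.merge(rna, on="transcript_id", suffixes=("_ribo", "_rna"))
for side in ("ribo", "rna"):
    dens = df[f"count_{side}"] / (df["cds_length"] / 1e3)  # reads per kb of CDS
    df[f"dens_{side}"] = dens / (dens.sum() / 1e6)         # per million (TPM-like)
df["log_TE"] = np.log2(df["dens_ribo"] / df["dens_rna"])
```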
### Clinvar variants
The preprocessing script for the clinvar variants:
1. downloads the clinvar vcf file from a specified URL
2. filters for the 5'UTR variants using bedtools intersect
3. converts the mutations to relative transcript coordinates
4. constructs the mutated 5'UTRs
5. queries the CDS sequences
6. augments the file with non-sequential features
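Step 3 above is the only non-obvious coordinate arithmetic; the toy function below shows the idea for a plus-strand transcript described by its ordered exon blocks. It is purely illustrative and ignores strandedness and indels.
```python
# Toy illustration of genomic -> transcript-relative coordinates
# (plus strand only, SNVs only; 0-based, half-open exon intervals).
def genomic_to_transcript(pos, exons):
    offset = 0
    for start, end in exons:
        if start <= pos < end:
            return offset + (pos - start)
        offset += end - start
    return None  # position is intronic / outside the transcript

# variant at genomic position 1050, exons [1000,1100) and [1200,1300)
print(genomic_to_transcript(1050, [(1000, 1100), (1200, 1300)]))  # -> 50
```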
## Learning scripts
The learning scripts for all three use cases work the same way.
### In- and outputs
There are a few parameters to specify in the middle of the script:
- maximum 5'UTR length in nt
- the length of the region of the CDS to be considered in nt
- the name of the 5'UTR input column in the input file
- the name of the CDS column in the input file
- the names of the non-sequential features columns in the input file
- the number of non-sequential features the pretrained model was trained on (usually 5, for transfer learning only)
- the name of the output column
- the path to the data set
- the path to the directory in which to save the scalers
- the path for saving the trained model
- the path for the pretrained model (for transfer learning, only)
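For orientation, such a parameter block could look like the sketch below; every name and value here is an illustrative placeholder, so check the script itself for the actual variables.
```python
# Illustrative placeholders only; the actual variable names live in the scripts.
MAX_UTR_LENGTH = 100          # maximum 5'UTR length in nt
CDS_LENGTH = 99               # length of the considered CDS region in nt
UTR_COLUMN = "utr"            # 5'UTR column in the input file
CDS_COLUMN = "cds"            # CDS column in the input file
NONSEQ_COLUMNS = [            # non-sequential feature columns
    "UTR_length", "number_outframe_uAUGs", "number_inframe_uAUGs",
    "normalized_5p_folding_energy", "GC_content",
]
N_PRETRAINED_NONSEQ = 5       # transfer learning only
OUTPUT_COLUMN = "log_TE"      # target column
DATA_PATH = "HEK293_training_data/data.tsv"
SCALER_DIR = "scalers/"
MODEL_PATH = "models/translateLSTM"
PRETRAINED_MODEL_PATH = "models/mpra_pretrained"  # transfer learning only
```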
All these scripts can be run in three modes:
1. normal mode (training and testing)
2. with the suffix 'predict' after `python3 <scriptname>`, for prediction and scatterplot creation only
3. with the suffix 'train', for training only
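For example (with `<scriptname>` standing in for any of the learning scripts):
```
python3 <scriptname>          # normal mode: training and testing
python3 <scriptname> predict  # prediction and scatterplot only
python3 <scriptname> train    # training only
```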
### Supplementary scripts
The transfer-learning directory also contains a script for making end-to-end predictions and a script for predicting the log TE of the entire clinvar data set.
For the end-to-end prediction, only the input sequences (UTR and CDS) and, if necessary, the number of exons per transcript need to be provided in a tsv file; all other non-sequential features are computed by the script.
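For illustration, such an input file could look like the following (the column names are placeholders; check the script for the headers it expects):
```
utr	cds	number_exons
GGCGCAGCGGAAGCGTG	ATGGCTTCCAAGGTGAAA	3
```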
## Installation
Clone the repository and move to the new folder with
```
git clone https://git.scicore.unibas.ch/zavolan_group/data_analysis/predicting-translation-initiation-efficiency.git
cd predicting-translation-initiation-efficiency
```
Setting up a conda environment for tensorflow is highly non-trivial and hardware-specific, especially if you want GPU acceleration. For installation help, refer to [this site](https://www.geeksforgeeks.org/how-to-install-tensorflow-in-anaconda/) and others.
On top of tensorflow, the conda environment for preprocessing requires some additional packages specified in the requirements.txt file.
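A typical setup might look like this (the environment name and Python version are illustrative; install tensorflow separately following the instructions linked above):
```
conda create -n translateLSTM python=3.10
conda activate translateLSTM
pip install -r requirements.txt
```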
## License
This code is published under the MIT license.