
TranslateLSTM

written by Niels Schlusser, 15.11.2023

This repository contains Python scripts to predict various measures of translation output from transcript sequences using TranslateLSTM, an artificial neural network architecture presented in "Current limitations in predicting mRNA translation with deep learning models".

There are deep learning scripts for three different use cases:

  1. training a model on synthetic MPRA data in the directory translateLSTM_MPRA/
  2. training a model on endogenous TE data in the directory translateLSTM_endogenous/
  3. performing transfer learning from (1) to (2) in the directory tl_TranslateLSTM_endogenous/

Example training data

  • MPRA data from Sample et al. (2019)
  • endogenous (riboseq/RNAseq) data based on Alexaki et al. (2020)
  • ClinVar variants based on Landrum et al. (2020)

are provided in the directory HEK293_training_data/

Scripts

  1. turn the output of RNAseq and ribosome profiling into translation efficiency estimates,
  2. append non-sequential features to a given data set, and
  3. construct a data set from a VCF file

can be found in the directory training_data_preprocessing/.

Preprocessing

MPRA data

The preprocessing procedure for MPRA data calculates and appends the non-sequential features

  • UTR_length
  • number_outframe_uAUGs
  • number_inframe_uAUGs
  • normalized_5p_folding_energy
  • GC_content

to the input file.
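
As an illustration, the following is a minimal sketch of how such features can be derived from a single 5'UTR sequence (plain Python; the function name is hypothetical and the repository's own implementation may differ; the folding energy additionally requires an RNA folding tool such as ViennaRNA and is omitted here):

# Hypothetical helper, not the repository's actual code.
def mpra_features(utr: str) -> dict:
    utr = utr.upper().replace("U", "T")
    n = len(utr)
    # A uAUG is in-frame if its distance to the main start codon, which
    # begins immediately after the 5'UTR, is a multiple of 3.
    starts = [i for i in range(n - 2) if utr[i:i + 3] == "ATG"]
    return {
        "UTR_length": n,
        "number_inframe_uAUGs": sum((n - i) % 3 == 0 for i in starts),
        "number_outframe_uAUGs": sum((n - i) % 3 != 0 for i in starts),
        "GC_content": (utr.count("G") + utr.count("C")) / n if n else 0.0,
    }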

Endogenous data

The preprocessing procedure for endogenous data takes mapping files from:

  • riboseq data analysis
  • RNAseq data analysis

as input and turns them into a file with:

  • translation efficiencies
  • 5'UTR sequences
  • CDS sequences
  • all available non-sequential features (UTR_length, number_outframe_uAUGs, number_inframe_uAUGs, normalized_5p_folding_energy, GC_content, number_exons, log_ORF_length)

Inputs and outputs

As an input from the riboseq side, you need:

  • a transcriptome FASTA file (generate with gffread)
  • a file with the CDS coordinates of the longest protein-coding transcript per gene
  • the BAM and BAI files of the riboseq mapping
  • a JSON alignment file that contains the P-site offsets for the different RPF lengths
  • a TSV file that links gene IDs and transcript IDs

From the RNAseq side, you need:

  • transcripts_numreads.tsv (output from kallisto)
  • a file with the TIN scores (potentially per replicate)
  • a TSV file that contains the read counts mapped to the CDS of each gene (create with Rsubread)
  • a file that contains the number of exons per transcript (generate from the GTF file)

Preprocessing procedure for endogenous data

The script for endogenous data preprocessing

  1. queries 5'UTRs from Ensembl via pybiomart
  2. selects the most covered transcript according to kallisto
  3. calculates the TE (a hedged sketch follows below)
  4. computes and appends non-sequential features to the file
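
The TE estimate in step 3 can be pictured as the ratio of ribosome footprint density to mRNA density over the CDS. A minimal sketch under that assumption (the actual normalization in the script may differ):

import numpy as np

# Hypothetical illustration, not the repository's actual code: log2 TE from
# CDS-mapped riboseq and RNAseq counts, with a pseudocount against zeros.
def log_te(ribo_counts, rna_counts, pseudocount=0.5):
    ribo = np.asarray(ribo_counts, dtype=float) + pseudocount
    rna = np.asarray(rna_counts, dtype=float) + pseudocount
    ribo /= ribo.sum()  # normalize each library to its sequencing depth
    rna /= rna.sum()
    return np.log2(ribo / rna)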

Preprocessing procedure for the ClinVar variants

The preprocessing script for the ClinVar variants

  1. downloads the ClinVar VCF file from a specified URL
  2. filters for the 5'UTR variants using bedtools intersect
  3. converts the mutations to relative transcript coordinates (sketched below)
  4. constructs the mutated 5'UTRs
  5. queries the CDS sequences
  6. augments the file with non-sequential features
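
Step 3 amounts to mapping a genomic variant position onto the exon structure of its transcript. A minimal, hypothetical sketch (the function name and the exon representation are assumptions, not the repository's actual code):

# exons: genomic (start, end) intervals, 1-based inclusive, listed in
# transcript order (descending genomic coordinates on the minus strand);
# returns a 0-based transcript-relative coordinate
def genomic_to_transcript(pos: int, exons, strand: str = "+"):
    offset = 0
    for start, end in exons:
        if start <= pos <= end:
            return offset + (pos - start if strand == "+" else end - pos)
        offset += end - start + 1
    return None  # the position does not fall into an exon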

Learning scripts

The learning scripts for all three use cases work the same way.

Inputs and outputs

There are a few parameters to specify in the middle of the script:

  • the maximum 5'UTR length in nt
  • the length of the CDS region to be considered, in nt
  • the name of the 5'UTR column in the input file
  • the name of the CDS column in the input file
  • the names of the non-sequential feature columns in the input file
  • the number of non-sequential features the pretrained model was trained on (usually 5; transfer learning only)
  • the name of the output column
  • the path to the data set
  • the path to the directory where the scalers are saved
  • the path for saving the trained model
  • the path to the pretrained model (transfer learning only)
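
These parameters are plain variables inside each script. A hypothetical example of what such a block might look like (all names and values are illustrative, not the scripts' actual identifiers):

max_utr_length = 100          # maximum 5'UTR length in nt
cds_length = 99               # length of the CDS region considered, in nt
utr_column = "utr_sequence"   # 5'UTR column in the input file
cds_column = "cds_sequence"   # CDS column in the input file
feature_columns = ["UTR_length", "number_outframe_uAUGs",
                   "number_inframe_uAUGs", "normalized_5p_folding_energy",
                   "GC_content"]
n_pretrained_features = 5     # transfer learning only
output_column = "log_TE"
data_path = "HEK293_training_data/data.tsv"          # illustrative path
scaler_dir = "scalers/"
model_path = "models/translateLSTM"
pretrained_model_path = "models/translateLSTM_MPRA"  # transfer learning only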

All these scripts can be run

  1. in normal mode (training and testing): python3 <scriptname>
  2. in prediction mode (prediction and scatterplot creation only): python3 <scriptname> predict
  3. in training mode (training only): python3 <scriptname> train

Supplementary scripts

The transfer-learning directory also contains a script for making end-to-end predictions and a script for predicting the log TE of the entire ClinVar data set. For the end-to-end prediction, only the input sequences (UTR and CDS) and, if necessary, the number of exons per transcript need to be provided in a TSV file; all other non-sequential features are computed by the script. A hypothetical example of such an input file follows below.
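
A minimal input file for the end-to-end prediction could look like this (tab-separated; the column names are illustrative, check the script for the exact headers it expects):

utr_sequence	cds_sequence	number_exons
GGCGCAGCGGAAGCGTGAACC	ATGGCTTCCAACGATTACACC	4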

Installation

Clone the git repository and move into the new folder with

git clone https://git.scicore.unibas.ch/zavolan_group/data_analysis/predicting-translation-initiation-efficiency.git
cd predicting-translation-initiation-efficiency

Setting up a conda environment for TensorFlow is highly non-trivial and hardware-specific, especially if you want GPU acceleration; refer to the TensorFlow installation documentation and similar resources for help. On top of TensorFlow, the conda environment for preprocessing requires some additional packages, specified in the requirements.txt file.
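
As a rough starting point for a CPU-only setup (an assumption; adapt the TensorFlow install to your hardware and drivers):

conda create -n translatelstm python
conda activate translatelstm
pip install tensorflow
pip install -r requirements.txt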

License

This code is published under the MIT license.