# TranslateLSTM
written by Niels Schlusser, 15.11.2023
There are deep learning scripts for essentially three different use cases:
1. training a model on MPRA data
2. training a model on endogenous TE data in the directory translateLSTM_endogenous/
3. doing transfer learning from (1) to (2) in the directory tl_TranslateLSTM_endogenous/
## Example training data
- MPRA data from Sample et al. (2019)
- endogenous (riboseq/RNAseq) data based on Alexaki et al. (2020)
- clinvar variants based on Landrum et al. (2020)
are provided in the directory HEK293_training_data/.
## Scripts
Scripts that
1. turn RNAseq and ribosome profiling output into translation efficiency estimates
2. append non-sequential features to a given data set
3. construct a data set based on a vcf file
can be found in the directory training_data_preprocessing/.
## Preprocessing
### MPRA data
The preprocessing procedure for MPRA data calculates and appends the non-sequential features
- UTR_length
- number_outframe_uAUGs
- number_inframe_uAUGs
- normalized_5p_folding_energy
- GC_content
to the input file.
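As an illustration, the simpler of these features can be computed from the 5'UTR sequence alone. The sketch below is not the repository's code; the frame convention for in-frame uAUGs (distance to the downstream start codon divisible by 3) is an assumption, and normalized_5p_folding_energy is omitted because it additionally requires an RNA folding tool such as ViennaRNA.

```python
def utr_features(utr: str) -> dict:
    """Compute sequence-derived non-sequential features for one 5'UTR."""
    utr = utr.upper().replace("U", "T")  # accept RNA or DNA alphabet
    n = len(utr)
    inframe = outframe = 0
    for i in range(n - 2):
        if utr[i:i + 3] == "ATG":
            # assumed convention: in-frame means the distance to the
            # main AUG (i.e., to the end of the UTR) is a multiple of 3
            if (n - i) % 3 == 0:
                inframe += 1
            else:
                outframe += 1
    gc = (utr.count("G") + utr.count("C")) / n if n else 0.0
    return {
        "UTR_length": n,
        "number_inframe_uAUGs": inframe,
        "number_outframe_uAUGs": outframe,
        "GC_content": gc,
    }
```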
### Endogenous data
The preprocessing procedure for endogenous data takes mapping files from:
- riboseq data analysis
- RNAseq data analysis
as *input* and turns it into a file with:
- CDS sequences
- all available non-sequential features (UTR_length, number_outframe_uAUGs, number_inframe_uAUGs, normalized_5p_folding_energy, GC_content, number_exons, log_ORF_length)
#### In- and outputs
As an input from the riboseq side, you need:
- a transcriptome fasta file (generate with gffread)
- a file with the CDS coordinates of the longest protein coding transcript per gene
From the RNAseq side, you need:
- a tsv file that contains the read counts mapped to the CDS of a given gene (create with Rsubread)
- a file that contains the number of exons per transcript (generate from gtf file)
#### (Endogenous) preprocessing procedure
The script for endogenous data preprocessing
1. queries 5'UTRs from ENSEMBL via pybiomart
2. selects the most covered transcript according to kallisto
3. calculates the TE
4. computes and appends the non-sequential features to the file.
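Step 3, the TE estimate, can be sketched as a library-size-normalized ratio of riboseq to RNAseq counts on the CDS. The exact normalization used by the script may differ; treat this as an illustration under stated assumptions (counts-per-million normalization, log2 ratio, genes with zero counts dropped).

```python
import math

def log_te(ribo_counts: dict, rna_counts: dict) -> dict:
    """Per-gene log2 TE from CDS-mapped riboseq and RNAseq read counts."""
    ribo_total = sum(ribo_counts.values())
    rna_total = sum(rna_counts.values())
    out = {}
    for gene in ribo_counts.keys() & rna_counts.keys():
        # normalize each library to counts per million
        ribo_cpm = 1e6 * ribo_counts[gene] / ribo_total
        rna_cpm = 1e6 * rna_counts[gene] / rna_total
        if ribo_cpm > 0 and rna_cpm > 0:
            out[gene] = math.log2(ribo_cpm / rna_cpm)
    return out
```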
### Preprocessing procedure for the clinvar variants
1. downloads the clinvar vcf file from a specified URL
2. filters for the 5'UTR variants using bedtools intersect
3. converts the mutations to relative transcript coordinates
6. augments the file with non-sequential features
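Once a mutation is expressed in relative transcript coordinates (step 3), applying it to the sequence is straightforward. This is a hypothetical sketch, not the repository's code; the function name and the 0-based coordinate convention are assumptions.

```python
def apply_snv(utr: str, pos: int, ref: str, alt: str) -> str:
    """Apply a single-nucleotide variant at a 0-based position relative
    to the transcript's 5' end, checking the reference base first."""
    if utr[pos] != ref:
        raise ValueError(f"reference mismatch at {pos}: {utr[pos]} != {ref}")
    return utr[:pos] + alt + utr[pos + 1:]
```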
## Learning scripts
The learning scripts for all three use cases work the same way.
### In- and outputs
There are a few parameters to specify in the middle of the script:
- maximum 5'UTR length in nt
- the length of the region of the CDS to be considered in nt
All these scripts can be run in
2. with the suffix 'predict' after `python3 <scriptname>` for prediction and scatterplot creation only
3. with the suffix 'train' for training only
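The mode switch described above can be sketched as follows. The CLI here is an assumption based on this description, not the scripts' actual argument handling; in particular, running without a suffix is assumed to do both training and prediction.

```python
import sys

def planned_actions(argv):
    """Map the optional 'train'/'predict' suffix to the phases to run."""
    mode = argv[1] if len(argv) > 1 else "both"  # assumed default: do both
    actions = []
    if mode in ("train", "both"):
        actions.append("train")
    if mode in ("predict", "both"):
        actions.append("predict")
    return actions

if __name__ == "__main__":
    for action in planned_actions(sys.argv):
        print(action)
```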
### Supplementary scripts
The transfer-learning directory also contains a script for making end-to-end predictions and a script for predicting the log TE of the entire clinvar data set.
For the end-to-end prediction, only the input sequences (UTR and CDS) and, if necessary, the number of exons per transcript need to be provided in a tsv file; all other non-sequential features are computed by the script.
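A minimal end-to-end input file could be assembled like this. The column names (`utr`, `cds`, `number_exons`) and the example sequences are assumptions for illustration; check the supplementary script for the header it actually expects.

```python
import csv
import io

# one row per transcript: 5'UTR sequence, CDS sequence, exon count
rows = [
    {"utr": "GGCAGCCACC", "cds": "ATGGCTAGC", "number_exons": 4},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["utr", "cds", "number_exons"],
                        delimiter="\t")
writer.writeheader()
writer.writerows(rows)
tsv_text = buf.getvalue()  # write to a .tsv file for the prediction script
```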
## Installation
Clone the git repository and move to the new folder with
```
git clone https://git.scicore.unibas.ch/zavolan_group/data_analysis/predicting-translation-initiation-efficiency.git
cd predicting-translation-initiation-efficiency
```
Setting up a conda environment for tensorflow is highly non-trivial and hardware dependent.
On top of tensorflow, the conda environment for preprocessing requires some additional packages specified in the requirements.txt file.
# License
This code is published under the MIT license.