This repository contains Python scripts to predict translation initiation efficiency from transcript sequences using TranslateLLM, an artificial neural network architecture presented in "Predicting the translation output from the mRNA sequence - an assessment of the accuracy and parameter-efficiency of deep learning models".
There are deep learning scripts for essentially three different use cases:
(1) training a model on synthetic MPRA data in the directory translateLLM_MPRA/
(2) training a model on endogenous TE data in the directory translateLLM_endogenous/
(3) performing transfer learning from (1) to (2) in the directory tl_TranslateLLM_endogenous/
Example training data (MPRA from Sample et al. (2019), endogenous data based on Alexaki et al. (2020), and ClinVar variants based on Landrum et al. (2020)) are provided in the directory HEK293_training_data/.
Scripts for turning the output of RNA-seq and ribosome profiling analyses into an endogenous data set, for appending non-sequential features to a given data set, and for constructing a data set from a VCF file can be found in the directory training_data_preprocessing/.
The preprocessing procedure for MPRA data only appends the non-sequential features UTR_length, number_outframe_uAUGs, number_inframe_uAUGs, normalized_5p_folding_energy, and GC_content to the input file.
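
As an illustration, a minimal sketch of how these five features could be computed from a 5'UTR sequence. The frame convention (a uAUG counts as in-frame when it lies a multiple of three nucleotides upstream of the main start codon), the length normalization of the folding energy, and the use of the ViennaRNA Python bindings are assumptions, not necessarily what the actual scripts do.

```python
# Hypothetical sketch of the five non-sequential MPRA features;
# the repository's own preprocessing may define them differently.
import re
import RNA  # ViennaRNA Python bindings (assumed available)

def nonsequential_features(utr: str) -> dict:
    utr = utr.upper().replace("U", "T")
    L = len(utr)

    # uAUGs: positions of AUG (ATG) triplets inside the 5'UTR.
    # "In frame" is assumed to mean the uAUG sits a multiple of
    # three nucleotides upstream of the main start codon.
    aug_positions = [m.start() for m in re.finditer("ATG", utr)]
    inframe = sum(1 for p in aug_positions if (L - p) % 3 == 0)
    outframe = len(aug_positions) - inframe

    # Minimum free energy of the folded 5'UTR, normalized by length
    # (the normalization scheme is an assumption).
    _, mfe = RNA.fold(utr.replace("T", "U"))

    return {
        "UTR_length": L,
        "number_outframe_uAUGs": outframe,
        "number_inframe_uAUGs": inframe,
        "normalized_5p_folding_energy": mfe / L if L else 0.0,
        "GC_content": (utr.count("G") + utr.count("C")) / L if L else 0.0,
    }

# Example:
# nonsequential_features("GGCACGATGCCATTGCC")
```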
The preprocessing procedure for endogenous data takes the output of ribo-seq and RNA-seq data analysis and turns it into a file with translation efficiencies, 5'UTRs, CDSs, and all available non-sequential features (UTR_length, number_outframe_uAUGs, number_inframe_uAUGs, normalized_5p_folding_energy, GC_content, number_exons, log_ORF_length).
As input from the ribo-seq side, you need:
- a transcriptome FASTA file (generate with gffread),
- a file with the CDS coordinates of the longest protein-coding transcript per gene,
- the BAM and BAI files of the ribo-seq mapping,
- an alignment JSON file containing the P-site offsets for the different RPF lengths,
- a TSV file linking gene ID and transcript ID.
From the RNA-seq side, you need:
- transcripts_numreads.tsv (output of kallisto),
- a file with the TIN scores (potentially per replicate),
- a TSV file with the read counts mapped to the CDS of each gene (create with Rsubread).
Moreover, a file with the number of exons per transcript (generate from the GTF file) is required.
5'UTRs are queried from ENSEMBL via pybiomart, the most covered transcript according to kallisto is selected, the TE is calculated, and the non-sequential features are appended to the file.
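
To illustrate the final TE calculation, here is a pandas sketch that picks the most covered transcript per gene from the kallisto output, merges it with the ribo-seq CDS counts, and computes a log2 translation efficiency. All file names, column names, and the pseudocount are assumptions and do not reflect the actual script's interface.

```python
# Hypothetical sketch of the TE calculation step; file layouts and
# column names are assumptions, not the repository's actual interface.
import numpy as np
import pandas as pd

# Ribo-seq: read counts mapped to the CDS of each gene (e.g. from Rsubread).
ribo = pd.read_csv("ribo_cds_counts.tsv", sep="\t")      # columns: gene_id, ribo_counts
# RNA-seq: kallisto transcript abundances.
rna = pd.read_csv("transcripts_numreads.tsv", sep="\t")  # columns: transcript_id, est_counts
# Mapping between gene and transcript identifiers.
g2t = pd.read_csv("gene2transcript.tsv", sep="\t")       # columns: gene_id, transcript_id

# Keep the most covered transcript per gene according to kallisto.
rna = rna.merge(g2t, on="transcript_id")
rna = rna.sort_values("est_counts", ascending=False).drop_duplicates("gene_id")

# Merge with the ribo-seq counts and compute a log2 translation efficiency
# with a pseudocount (pseudocount and normalization are assumptions).
df = rna.merge(ribo, on="gene_id")
df["TE"] = np.log2((df["ribo_counts"] + 1) / (df["est_counts"] + 1))

df.to_csv("endogenous_TE_dataset.tsv", sep="\t", index=False)
```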
The preprocessing procedure for the ClinVar variants downloads the ClinVar VCF file from a specified URL, keeps only the 5'UTR variants using bedtools intersect, converts the mutations to relative transcript coordinates, constructs the mutated 5'UTRs, queries the CDS sequences, and finally augments the file with metadata.
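
As an illustration of the coordinate conversion and 5'UTR mutation steps, a minimal sketch for simple substitutions in a single-exon 5'UTR. The coordinate conventions, strand handling, and function names are assumptions; a real pipeline would also need to handle multi-exon UTRs and indels.

```python
# Hypothetical sketch of converting a ClinVar SNV to relative 5'UTR
# coordinates and building the mutated 5'UTR; assumes a single-exon
# 5'UTR with 1-based genomic coordinates and handles substitutions only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def to_transcript_coord(pos: int, utr_start: int, utr_end: int, strand: str) -> int:
    """Map a 1-based genomic position to a 0-based position in the 5'UTR."""
    if strand == "+":
        return pos - utr_start
    return utr_end - pos  # minus strand: the transcript runs right to left

def mutate_utr(utr_seq: str, pos: int, ref: str, alt: str,
               utr_start: int, utr_end: int, strand: str) -> str:
    """Return the 5'UTR sequence carrying the variant."""
    i = to_transcript_coord(pos, utr_start, utr_end, strand)
    if strand == "-":
        # VCF alleles are given on the plus strand; complement them.
        ref, alt = ref.translate(COMPLEMENT), alt.translate(COMPLEMENT)
    assert utr_seq[i] == ref, "reference allele does not match the 5'UTR sequence"
    return utr_seq[:i] + alt + utr_seq[i + 1:]

# Example: a C>T substitution in a plus-strand 5'UTR spanning positions 1000-1016.
# mutate_utr("GGCACGCCATTGCCAAG", 1006, "C", "T", 1000, 1016, "+")
```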