diff --git a/README.md b/README.md
index c117c98fa4da291e3f962148f5fc5aab51d96592..4412baecdf1ce12809fcce57525acce0e45b98b2 100644
--- a/README.md
+++ b/README.md
@@ -13,12 +13,14 @@ There are deep learning scripts for essentially three different usecases:
 - MPRA from Sample et.al. (2019)
 - endogenous (riboseq/RNAseq) data based on Alexaki et.al. (2020)
 - clinvar variations based on Landrum et. al. (2020)
+
 are provided in the directory HEK293_training_data/
 
 ## Scripts
 1. turn the output of RNAseq and ribosome profiling data into translation efficiency estimates
 2. append non-sequential features to a given data set
 3. construct a data set based on a vcf file
+
 can be found in the directory training_data_preprocessing/.
 
 
@@ -30,12 +32,14 @@ The preprocessing procedure for MPRA data calculates and appends the non-sequent
 - number_inframe_uAUGs
 - normalized_5p_folding_energy
 - GC_content
+
 to the input file using.
 
 ### Endogenous data
 The preprocessing procedure for endogenous data takes mapping files from:
 - riboseq data analysis
 - RNAseq data analysis
+
 as *input* and turns it into a file with:
 - translation efficiencies
 - 5'UTR sequences
@@ -49,6 +53,7 @@ As an input from the riboseq side, you need:
 - bam and bai file of the mapping done in riboseq
 - an alignment json file that contains the p-site offsets for different RPF lengths
 - a tsv file that links gene id and transcript id
+
 From the RNA seq side, you need
 - transcripts_numreads.tsv (output from kallisto)
 - a file with the TIN scores (potentially per replicate)
@@ -86,6 +91,7 @@ There are a few parameters to specify in the middle of the script:
 - the path to the directory where to save the scalers
 - the path for saving the trained model
 - the path for the pretrained model (for transfer learning, only)
+
 All these scripts can be run in
 1. normal mode (training and testing)
 2. with the suffix 'predict' after '''python3 <scriptname>''' for prediction and scatterplot creation, only