Commit fe99cc84 authored by Niels Schlusser

changed names to translateLSTM

parent 99bc8547
 written by Niels Schlusser, 15.11.2023
-This repository contains different python scripts to predict translation initiation efficiency from transcript sequences using TranslateLLM, an artificial neural network architecture as presented in "Predicting the translation output from the mRNA sequence - an assessment of the accuracy and parameter-efficiency of deep learning models".
+This repository contains different python scripts to predict translation initiation efficiency from transcript sequences using TranslateLSTM, an artificial neural network architecture as presented in "Predicting the translation output from the mRNA sequence - an assessment of the accuracy and parameter-efficiency of deep learning models".
 There are deep learning scripts for essentially three different usecases:
-(1) training a model on synthetic MPRA data in the directory translateLLM_MPRA/
-(2) training a model on endogenous TE data in the directory translateLLM_endogenous/
-(3) do transfer learning from (1) to (2) in the directory tl_TranslateLLM_endogenous/
+(1) training a model on synthetic MPRA data in the directory translateLSTM_MPRA/
+(2) training a model on endogenous TE data in the directory translateLSTM_endogenous/
+(3) do transfer learning from (1) to (2) in the directory tl_TranslateLSTM_endogenous/
 Example training data (MPRA from Sample et.al. (2019), endogenous data based on Alexaki et.al. (2020), and clinvar variations based on Landrum et. al. (2020)) are provided in the directory HEK293_training_data/.
 Scripts to turn the output of RNAseq and ribosome profiling data into an endogenous data set, appending non-sequential features to a given data set, and constructing a data set based on a vcf file can be found in the directory training_data_preprocessing/.
@@ -19,4 +19,4 @@ The transfer-learning directory also contains a script for making end-to-end pre
 For the end-to-end prediction, just the input sequences (UTR and CDS), and, if necessary, the number of exons per transcript need to be provided in a tsv file, all other non-sequential features are computed by the script.
 This code is published under the MIT license.
\ No newline at end of file
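The README states that end-to-end prediction only needs the UTR and CDS sequences and, if necessary, the exon count per transcript in a tsv file. The sketch below shows one way such a file could be written and read back with the standard library. The column names `utr`, `cds`, and `n_exons` are hypothetical, as are the made-up placeholder sequences; the README does not state the header the prediction script expects, so check the script before relying on them.

```python
import csv

# Hypothetical column names; the prediction script may expect different ones.
COLUMNS = ["utr", "cds", "n_exons"]


def write_input_tsv(rows, path):
    """Write transcript records (dicts keyed by COLUMNS) to a tab-separated file."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


def read_input_tsv(path):
    """Read the file back as a list of dicts, converting n_exons to int."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    for row in rows:
        row["n_exons"] = int(row["n_exons"])
    return rows


# Made-up placeholder sequences, for illustration only.
example_rows = [
    {"utr": "GGGACATCGTAGAGAGTCGTACTTA", "cds": "ATGGCTAGCAAAGGAGAAGAACTTTTC", "n_exons": 3},
    {"utr": "CCGGTCGCCACC", "cds": "ATGGTGAGCAAGGGCGAGGAGCTG", "n_exons": 1},
]

if __name__ == "__main__":
    write_input_tsv(example_rows, "example_input.tsv")
    print(read_input_tsv("example_input.tsv"))
```

Keeping the exon count as an explicit column mirrors the README's remark that it must be supplied by the user, while the remaining non-sequential features are computed by the script.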
@@ -130,8 +130,8 @@ output_col='TE'
 data_path = '../HEK293_training_data/init_effs_HEK293_endogenous.tsv'
 scaler_dir = '../HEK293_training_data/scalers/'
-pt_model_path = '../translateLLM_MPRA/TranslateLLM_opt100.h5'
-tl_model_path = 'tl_TranslateLLM_HEK293.h5'
+pt_model_path = '../translateLSTM_MPRA/TranslateLSTM_opt100.h5'
+tl_model_path = 'tl_TranslateLSTM_HEK293.h5'
 #nucleotide dictionary
@@ -218,8 +218,8 @@ if len(sys.argv) < 2 or sys.argv[1] == 'predict':
 xmin, xmax, ymin, ymax = plt.axis()
 plt.text(xmin+1.0, ymax-0.3, '$R_{Pearson}$=%.3f, $R_{Spearman}$=%.3f' % (rho_p,rho_s),fontsize = 12,color='black')
-plt.savefig('scatterplot_tl_TranslateLLM_HEK293.pdf')
+plt.savefig('scatterplot_tl_TranslateLSTM_HEK293.pdf')
 plt.close()
 raw_test['predicted_'+output_col] = pred
-raw_test.to_csv("predictions_test_tl_TranslateLLM_HEK293_"+output_col+".tsv",sep="\t",index=False)
+raw_test.to_csv("predictions_test_tl_TranslateLSTM_HEK293_"+output_col+".tsv",sep="\t",index=False)
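The scatterplot annotation in this script prints `rho_p` and `rho_s`, but the diff does not show how they are computed. As a rough illustration only, the following pure-Python sketch computes a Pearson correlation and a Spearman correlation (Pearson on average ranks, so ties are handled). The function names are hypothetical and this is not necessarily how the script obtains its values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    measured = [0.2, 1.1, 0.7, 1.9, 1.4]   # made-up values
    predicted = [0.3, 0.9, 0.8, 1.6, 1.5]  # made-up values
    rho_p, rho_s = pearson(measured, predicted), spearman(measured, predicted)
    print('$R_{Pearson}$=%.3f, $R_{Spearman}$=%.3f' % (rho_p, rho_s))
```

Reporting both coefficients is useful because Pearson measures linear agreement, while Spearman only depends on the ordering of predictions.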
@@ -112,7 +112,7 @@ output_col='rl'
 data_path = '../HEK293_training_data/opt100_nonseq_feat.tsv'
 scaler_dir = '../HEK293_training_data/scalers/'
-integrated_model_path = 'TranslateLLM_opt100.h5'
+integrated_model_path = 'TranslateLSTM_opt100.h5'
 #nucleotide dictionary
@@ -175,8 +175,8 @@ if len(sys.argv) < 2 or sys.argv[1] == 'predict':
 xmin, xmax, ymin, ymax = plt.axis()
 plt.text(xmin+1.0, ymax-0.3, '$R_{Pearson}$=%.3f, $R_{Spearman}$=%.3f' % (rho_p,rho_s),fontsize = 12,color='black')
-plt.savefig('scatterplot_TranslateLLM_opt100.pdf')
+plt.savefig('scatterplot_TranslateLSTM_opt100.pdf')
 plt.close()
 raw_test['predicted_'+output_col] = pred
-raw_test.to_csv("predictions_test_TranslateLLM_opt100_"+output_col+".tsv",sep="\t",index=False)
+raw_test.to_csv("predictions_test_TranslateLSTM_opt100_"+output_col+".tsv",sep="\t",index=False)
@@ -128,7 +128,7 @@ output_col='TE'
 data_path = '../HEK293_training_data/init_effs_HEK293_endogenous.tsv'
 scaler_dir = '../HEK293_training_data/scalers/'
-integrated_model_path = 'TranslateLLM_end_HEK293.h5'
+integrated_model_path = 'TranslateLSTM_end_HEK293.h5'
 #nucleotide dictionary
@@ -197,8 +197,8 @@ if len(sys.argv) < 2 or sys.argv[1] == 'predict':
 xmin, xmax, ymin, ymax = plt.axis()
 plt.text(xmin+1.0, ymax-0.3, '$R_{Pearson}$=%.3f, $R_{Spearman}$=%.3f' % (rho_p,rho_s),fontsize = 12,color='black')
-plt.savefig('scatterplot_TranslateLLM_end_HEK293.pdf')
+plt.savefig('scatterplot_TranslateLSTM_end_HEK293.pdf')
 plt.close()
 raw_test['predicted_'+output_col] = pred
-raw_test.to_csv("predictions_test_TranslateLLM_end_HEK293_"+output_col+".tsv",sep="\t",index=False)
+raw_test.to_csv("predictions_test_TranslateLSTM_end_HEK293_"+output_col+".tsv",sep="\t",index=False)
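Because this commit renames `TranslateLLM` to `TranslateLSTM` in directory names, model paths, and output file names, a scan for leftover occurrences of the old name is a cheap way to check that the rename is complete. The sketch below is one possible check, not part of the repository: the function name `find_leftovers` and the list of file suffixes are assumptions, and the match is case-insensitive so that both `translateLLM_MPRA/` and `TranslateLLM_opt100.h5` are caught.

```python
import re
from pathlib import Path

# Old name, matched case-insensitively (covers translateLLM and TranslateLLM).
OLD_NAME = re.compile(r"translateLLM", re.IGNORECASE)


def find_leftovers(root, suffixes=(".py", ".md", ".txt")):
    """Return (path, line_number, text) tuples that still contain the old name.

    A line number of 0 marks a hit in a file or directory name itself.
    Only files whose suffix is in `suffixes` are opened and scanned.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if OLD_NAME.search(path.name):
            hits.append((str(path), 0, path.name))
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if OLD_NAME.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for file_path, lineno, text in find_leftovers("."):
        print(f"{file_path}:{lineno}: {text}")
```

Run from the repository root, an empty result suggests no remaining references in the scanned file types; binary artifacts such as `.h5` models are only checked by name.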