Collapse directionality params in sample table
Description
Closes #160
Introduce a dictionary to infer "directionality parameters" for different tools from one provided Salmon library type.
directionality_dict = {
    "SF": {
        "kallisto": "--fr-stranded",
        "alfa": "fr-secondstrand",
        "alfa_plus": "str1",
        "alfa_minus": "str2",
    },
    "SR": {
        "kallisto": "--rf-stranded",
        "alfa": "fr-firststrand",
        "alfa_plus": "str2",
        "alfa_minus": "str1",
    },
}
This mapping is based on the following documentation and was checked by Dominik and Mihaela:
- Salmon
- Kallisto
- ALFA
- Zarp issue created by Dominik when first incorporating ALFA
- Blog about strandedness in RNAseq
For single-end sequencing, the two codes 'SF' and 'SR' are used; for paired-end sequencing, one of 'I' (inward), 'O' (outward), or 'M' (matching) is prepended to 'SF' or 'SR'. However, these three paired-end orientations are treated the same when translated for the other tools, as only Salmon has a specific code for them.
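A minimal sketch of how such a lookup could work, using the dictionary values listed above. The prefix handling and the error messages are assumptions made for illustration; the actual `get_directionality` in the Snakefile may be written differently.

```python
# Mapping copied from the description; keys are the stranded Salmon codes.
directionality_dict = {
    "SF": {
        "kallisto": "--fr-stranded",
        "alfa": "fr-secondstrand",
        "alfa_plus": "str1",
        "alfa_minus": "str2",
    },
    "SR": {
        "kallisto": "--rf-stranded",
        "alfa": "fr-firststrand",
        "alfa_plus": "str2",
        "alfa_minus": "str1",
    },
}

# Paired-end orientation prefixes: inward, outward, matching.
PAIRED_END_PREFIXES = ("I", "O", "M")


def get_directionality(libtype: str, tool: str) -> str:
    """Return the directionality parameter of `tool` for a Salmon libtype."""
    # Drop the paired-end prefix, since only Salmon distinguishes I/O/M.
    strand = libtype[1:] if libtype[:1] in PAIRED_END_PREFIXES else libtype
    if strand not in directionality_dict:
        # Assumption: anything else (including the old 'A') is rejected.
        raise ValueError(f"Unsupported libtype: {libtype!r}")
    if tool not in directionality_dict[strand]:
        raise ValueError(f"Unknown tool key: {tool!r}")
    return directionality_dict[strand][tool]
```

With this sketch, `get_directionality("ISF", "kallisto")` and `get_directionality("SF", "kallisto")` both return `"--fr-stranded"`, while `get_directionality("A", "kallisto")` raises a `ValueError`.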
Note: With these changes, the `libtype` in `samples.tsv` has to be specified explicitly; the previous value 'A' (automatic inference) is no longer valid.
Changes made:
- Snakefile(s):
  - "directionality dict"
  - `get_directionality(libtype, tool)` to find the correct parameter for each tool, given the Salmon libtype. Calls to the different tools' directionality params from samples.tsv are replaced by the `get_directionality` function
- `prepare_inputs.py`:
  - list of allowed libtypes in argparse help
  - function `get_libtype`, which infers the Salmon libtype from the labkey parameters 'SENSE or ANTISENSE' and 'pe or se'
  - deleted functions that infer the libtypes for kallisto and ALFA separately
- updated md5sum for samples.tsv test `test_scripts_prepare_inputs_table`
- samples tables for all other zarp tests:
  - removed `kallisto_directionality`, `alfa_directionality`, `alfa_plus`, `alfa_minus`
  - replaced libtype 'A' with the correct libtype
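For existing sample tables, the column removal listed above could be scripted roughly as follows. This is only a sketch: `migrate_samples_table` is a hypothetical helper that is not part of this MR, and the set of allowed libtypes is assumed to be the two single-end codes plus their I/O/M paired-end variants. It deliberately does not guess a libtype; rows that still carry 'A' are reported so they can be fixed by hand.

```python
import csv
import io

# Columns removed from samples.tsv by this change.
REMOVED_COLUMNS = {
    "kallisto_directionality",
    "alfa_directionality",
    "alfa_plus",
    "alfa_minus",
}

# Assumption: exactly these Salmon libtypes are accepted.
ALLOWED_LIBTYPES = {"SF", "SR", "ISF", "ISR", "OSF", "OSR", "MSF", "MSR"}


def migrate_samples_table(old_tsv: str) -> str:
    """Drop the removed columns and check that every libtype is explicit."""
    reader = csv.DictReader(io.StringIO(old_tsv), delimiter="\t")
    fieldnames = reader.fieldnames or []
    if "libtype" not in fieldnames:
        raise ValueError("samples table has no 'libtype' column")
    kept = [name for name in fieldnames if name not in REMOVED_COLUMNS]

    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=kept,
        delimiter="\t",
        extrasaction="ignore",  # silently skip the removed columns
        lineterminator="\n",
    )
    writer.writeheader()
    # Data rows start on line 2 of the file (line 1 is the header).
    for line_no, row in enumerate(reader, start=2):
        if row["libtype"] not in ALLOWED_LIBTYPES:
            raise ValueError(
                f"line {line_no}: libtype {row['libtype']!r} must be set manually"
            )
        writer.writerow(row)
    return out.getvalue()
```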
Tested:
Apart from the zarp integration tests, I also ran the pipeline on
- `test_alfa` files created by Dominik when originally incorporating ALFA. I was able to reproduce the plots he got when specifying correct or reversed parameters.
- real mouse data from Meric. As input files I used the first 250000 lines from the three CR samples.
# Old samples.tsv
libtype  fq1_polya_3p       fq1_polya_5p   kallisto_directionality  alfa_directionality  alfa_plus  alfa_minus  fq2
A        AAAAAAAAAAAAAAAAA  XXXXXXXXXXXXX  --rf                     fr-firststrand       str2       str1        XXXXXXXXXXXXX
# New samples.tsv
libtype  fq1_polya_3p       fq1_polya_5p   fq2
SF       AAAAAAAAAAAAAAAAA  XXXXXXXXXXXXX  XXXXXXXXXXXXX
I obtained the same results with the old and new pipeline versions. When 'SR' is specified instead of 'SF', ALFA classifies most reads as 'opposite strand', and kallisto and Salmon cannot align most reads, as expected.
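For anyone wanting to repeat the test on real data, a truncated input like the one described above (first 250000 lines per sample) could be produced along these lines. This is a sketch under assumptions: the helper name `head_fastq` is hypothetical, and the input is assumed to be a gzip-compressed FASTQ file with four lines per record.

```python
import gzip
from itertools import islice


def head_fastq(src: str, dst: str, n_lines: int = 250_000) -> int:
    """Copy the first `n_lines` lines of a gzipped FASTQ; return lines written."""
    # One FASTQ record spans 4 lines, so avoid cutting a record in half.
    if n_lines % 4 != 0:
        raise ValueError("n_lines should be a multiple of 4")
    written = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in islice(fin, n_lines):
            fout.write(line)
            written += 1
    return written
```

For paired-end samples, the same `n_lines` would have to be applied to both mate files so that the read pairs stay in sync.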