Collapse directionality params in sample table

CJHerrmann requested to merge simplify_directionality into dev

Description

closes #160 (closed)
Introduce a dictionary to infer "directionality parameters" for different tools from one provided Salmon library type.

directionality_dict = {
    "SF": {
        "kallisto": "--fr-stranded",
        "alfa": "fr-secondstrand",
        "alfa_plus": "str1",
        "alfa_minus": "str2",
    },
    "SR": {
        "kallisto": "--rf-stranded",
        "alfa": "fr-firststrand",
        "alfa_plus": "str2",
        "alfa_minus": "str1",
    },
}

This mapping is based on the following tool documentation and resources, and was checked by Dominik and Mihaela:
Salmon, Kallisto, ALFA, the Zarp issue created by Dominik when first incorporating ALFA, and a blog post about strandedness in RNA-seq.

For single-end sequencing, the two codes 'SF' and 'SR' are used; for paired-end sequencing, one of 'I' (inward), 'O' (outward), or 'M' (matching) is prepended to 'SF' or 'SR'. However, these three orientation prefixes are treated the same when translated for the other tools, as only Salmon has a specific code for them.
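
To illustrate the translation described above, here is a minimal sketch of how such a lookup could work. It is not the code from the Snakefile: the constant name DIRECTIONALITY, the prefix handling, and the error messages are assumptions made for this example; only the signature get_directionality(libtype, tool) and the dictionary values come from this merge request.

```python
# Hypothetical sketch; the real dictionary and function live in the Snakefile.
DIRECTIONALITY = {
    "SF": {
        "kallisto": "--fr-stranded",
        "alfa": "fr-secondstrand",
        "alfa_plus": "str1",
        "alfa_minus": "str2",
    },
    "SR": {
        "kallisto": "--rf-stranded",
        "alfa": "fr-firststrand",
        "alfa_plus": "str2",
        "alfa_minus": "str1",
    },
}


def get_directionality(libtype, tool):
    """Translate a Salmon libtype into the parameter expected by `tool`."""
    # Paired-end libtypes carry an orientation prefix (I, O or M) that only
    # Salmon uses; drop it so that e.g. "ISF" is looked up as "SF".
    if len(libtype) == 3 and libtype[0] in "IOM":
        key = libtype[1:]
    else:
        key = libtype
    if key not in DIRECTIONALITY:
        # Assumed behaviour: reject anything else, including the old "A".
        raise ValueError(f"Unsupported libtype: {libtype!r}")
    if tool not in DIRECTIONALITY[key]:
        raise ValueError(f"Unknown tool: {tool!r}")
    return DIRECTIONALITY[key][tool]
```

With this sketch, get_directionality("ISR", "alfa") and get_directionality("SR", "alfa") both yield "fr-firststrand", while the former value 'A' raises a ValueError.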

Note: With these changes, the libtype in samples.tsv must be specified explicitly; the previous value 'A' (automatic inference) is no longer valid.

Changes made:

  • Snakefile(s):
    • "directionality dict"
    • get_directionality(libtype, tool), which returns the correct parameter for each tool given the Salmon libtype. Lookups of the tools' directionality params in samples.tsv are replaced by calls to get_directionality
  • prepare_inputs.py:
    • list of allowed libtypes in argparse help
    • function get_libtype, which infers the salmon libtype from labkey parameters 'SENSE or ANTISENSE' and 'pe or se'
    • deleted functions that infer the libtypes for kallisto and ALFA separately
    • updated md5sum for samples.tsv test test_scripts_prepare_inputs_table
  • samples tables for all other zarp tests:
    • removed kallisto_directionality, alfa_directionality, alfa_plus, alfa_minus
    • replaced libtype 'A' with correct libtype.
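
As a rough illustration of the get_libtype step in prepare_inputs.py, the sketch below derives a Salmon libtype from the two labkey values. The parameter names, the SENSE to 'SF' and ANTISENSE to 'SR' mapping, and the choice of the inward prefix 'I' for paired-end data are all assumptions for this example; the actual function may differ.

```python
def get_libtype(directionality, mate_mode):
    """Infer a Salmon libtype from labkey-style values (hypothetical sketch).

    directionality: "SENSE" or "ANTISENSE"
    mate_mode: "pe" (paired-end) or "se" (single-end)
    """
    # Assumption: SENSE reads are forward-stranded, ANTISENSE reads reverse.
    strand_codes = {"SENSE": "SF", "ANTISENSE": "SR"}
    # Assumption: paired-end libraries are inward-facing ("I");
    # single-end libraries take no orientation prefix.
    prefixes = {"pe": "I", "se": ""}

    if directionality not in strand_codes:
        raise ValueError(f"Unknown directionality: {directionality!r}")
    if mate_mode not in prefixes:
        raise ValueError(f"Unknown mate mode: {mate_mode!r}")
    return prefixes[mate_mode] + strand_codes[directionality]
```

Under these assumptions, a single-end SENSE sample maps to 'SF' and a paired-end ANTISENSE sample maps to 'ISR'.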

Tested:

Apart from the zarp integration tests, I also ran the pipeline on

  • test_alfa files created by Dominik when incorporating ALFA originally. Was able to reproduce the plots he got when specifying correct or reverse parameters.
  • real mouse data from Meric. As input files I used the first 250000 lines from the three CR samples.
# Old samples.tsv
  libtype	fq1_polya_3p	fq1_polya_5p	kallisto_directionality	alfa_directionality	alfa_plus	alfa_minus	fq2
  A	AAAAAAAAAAAAAAAAA	XXXXXXXXXXXXX	--rf	fr-firststrand	str2	str1	XXXXXXXXXXXXX

# New samples.tsv
  libtype	fq1_polya_3p	fq1_polya_5p	fq2
  SF	AAAAAAAAAAAAAAAAA	XXXXXXXXXXXXX	XXXXXXXXXXXXX

Obtained the same results with the old and new pipeline version. When specifying 'SR' instead of 'SF', ALFA classifies most reads as 'opposite strand', and kallisto and salmon cannot align most reads, as expected.
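
For anyone updating an existing samples table by hand, the column change shown above could be scripted roughly as follows. This helper is not part of the merge request; the function name and its arguments are hypothetical, and the libtype to substitute for 'A' has to be supplied by the user because it cannot be inferred from the table.

```python
import csv
import io

# Columns removed from samples.tsv by this merge request.
LEGACY_COLUMNS = {
    "kallisto_directionality",
    "alfa_directionality",
    "alfa_plus",
    "alfa_minus",
}


def migrate_samples_table(old_tsv, libtype):
    """Drop the legacy columns and replace libtype 'A' (hypothetical helper).

    old_tsv: contents of the old tab-separated samples table
    libtype: Salmon libtype to use wherever the old table says 'A'
    """
    reader = csv.DictReader(io.StringIO(old_tsv), delimiter="\t")
    kept = [name for name in reader.fieldnames if name not in LEGACY_COLUMNS]

    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=kept,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently skip the legacy columns
    )
    writer.writeheader()
    for row in reader:
        if row["libtype"] == "A":
            row["libtype"] = libtype
        writer.writerow(row)
    return out.getvalue()
```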

Edited by CJHerrmann