Commits · ef2c6624df80ed62388863de0044078ba902f15a · zavolan_group / pipelines / ZARP

Jun 12, 2020
- fix(prepare_inputs): support relative paths · 402da16e
  Alex Kanitz authored 4 years ago
  
Apr 27, 2020

Refactor LabKey to Snakemake script · 556f1e12

Alex Kanitz authored 4 years ago

- clean up command line interface
  - improve descriptions
  - add consistent structure
  - remove or merge superfluous CLI arguments
  - set defaults
  - update test calls
  - update docs
  - when importing data from LabKey, table is saved to 'samples.tsv.labkey' in same directory as Snakemake sample table
- allow user to specify environment variables and relative paths in input table and on CLI
  - relative paths in the input table are interpreted with respect to the directory containing the input table
  - relative paths on the CLI are interpreted with respect to the current working directory; this makes the tests portable but is discouraged in production because the behavior is hard to predict from the user's perspective; consequently, a warning is issued
- set STAR index size to read length - 1
- remove `gtf_filtered` and `tr_fasta_filtered` and update Snakefiles and test sample tables accordingly
- rename some MultiQC report-related parameters and update Snakefiles and test config files accordingly
- add logging
- add docstrings to module and all functions
- add typing definitions to all functions
- restructure and comment code to improve readability
- linters `flake8` and `mypy` pass
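
The path handling described above (environment variables expanded, relative table paths anchored at the table's directory) could look roughly like the sketch below. It is an illustration only, not the script's actual code: the helper name `resolve_path` and the example paths are assumptions.

```python
import os


def resolve_path(path: str, base_dir: str) -> str:
    """Expand environment variables and anchor a relative path at base_dir.

    Hypothetical helper; absolute paths are returned unchanged apart
    from normalization.
    """
    expanded = os.path.expanduser(os.path.expandvars(path))
    if os.path.isabs(expanded):
        return os.path.normpath(expanded)
    return os.path.normpath(os.path.join(base_dir, expanded))


# Paths from the input table are anchored at the table's own directory.
table = os.path.abspath("samples.tsv")  # example location, not from the source
table_dir = os.path.dirname(table)
print(resolve_path("fastq/sample1.fq.gz", table_dir))

# Paths given on the command line are anchored at the working directory.
print(resolve_path("config.yaml", os.getcwd()))
```

Passing the base directory explicitly keeps the two cases (table vs. CLI) visible at the call site, which is where the difference in behavior matters.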


Major refactoring · 6cf28511

BIOPZ-Katsantoni Maria authored 4 years ago and

Alex Kanitz committed 4 years ago

* Sequencing mode-related changes:
  * allowed sequencing modes in Snakemake input table changed from `paired_end` and `single_end` to `pe` and `se`, respectively
  * remove sequencing mode from output paths for each rule
  * corresponding wildcards removed entirely from all rules that do not depend on sequencing mode (currently all rules that are defined in the main `Snakefile` in the project root directory)
  * where absolutely necessary, sequencing mode is added as part of output file or directory instead
  * remove dependency on sequencing mode from the `FastQC` rule; it now runs separately for each strand
* Changes related to MultiQC and output file/directory structure
  * moving and renaming outputs for MultiQC is no longer required
  * code to create MultiQC custom config externalized into script `scripts/rhea_multiqc_config.py`
  * add MultiQC output files with deterministic output to md5 sum checks performed during execution of `tests/test_integration_workflow/test.{local,slurm}.sh`
  * output filenames for each rule now follow this general structure: `samples/{sample_name}/{rule}/{output_file}`
  * log directory structure changed to match results directory structure
* Miscellaneous changes
  * consistent, PEP8-compliant formatting in most parts, including Snakemake files, where allowed
  * remove rule `extract_decoys_salmon`; equivalent file `chrName.txt` produced by `star_index` is used instead
  * add rule `start` which copies sample data to the results directory and enforces uniform naming
  * refactoring of ALFA rules and modification of the CI/CD test to ensure compatibility
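
To illustrate the renamed sequencing modes and the new output layout `samples/{sample_name}/{rule}/{output_file}`, here is a small sketch. The function names, the `results` root directory and the example file name are assumptions made for the illustration; only the mode labels and the path pattern come from the commit message.

```python
from pathlib import PurePosixPath

# Old -> new sequencing mode labels, as listed in the commit message.
MODE_ALIASES = {"paired_end": "pe", "single_end": "se"}


def normalize_mode(mode: str) -> str:
    """Translate a legacy mode label to the new short form (hypothetical helper)."""
    if mode in MODE_ALIASES.values():
        return mode
    if mode in MODE_ALIASES:
        return MODE_ALIASES[mode]
    raise ValueError(f"unknown sequencing mode: {mode!r}")


def output_path(sample_name: str, rule: str, output_file: str,
                results_dir: str = "results") -> str:
    """Build a path following samples/{sample_name}/{rule}/{output_file}.

    The `results_dir` prefix is an assumption, not taken from the source.
    """
    return str(PurePosixPath(results_dir, "samples", sample_name, rule, output_file))


print(normalize_mode("paired_end"))
print(output_path("sample1", "fastqc", "report.zip"))
```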


Mar 20, 2020

Fix Poly(A)-trimming rule · 392b04d2

BIOPZ-Katsantoni Maria authored 4 years ago and

Alex Kanitz committed 4 years ago

In `labkey_to_snakemake.py`, fix the parameters so that every mate has both a 3p and a
5p poly(A) feature, matching the `-a`, `-g`, `-A` and `-G` options of `cutadapt`.
Depending on whether a mate is sense or antisense, the appropriate variable is
populated; the remaining variables are filled with 'XXXXXXXXXXXX', which leads to no
trimming by `cutadapt`. The poly(A)-trimming rules are fixed to include all of the
`-a`, `-g`, `-A` and `-G` options.
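
A rough sketch of how the four values might be populated and passed to `cutadapt` follows. It is not the script's actual code. The field names, the oligo length of 15 and the assumption that the second mate has the opposite orientation of the first are all hypothetical. The sketch also assumes that a sense mate carries oligo-A at its 3' end and an antisense mate oligo-T at its 5' end. In `cutadapt`, `-a`/`-g` are the 3'/5' adapters of the first read and `-A`/`-G` those of the second read.

```python
from typing import Dict, List

NO_TRIM = "XXXXXXXXXXXX"  # placeholder from the commit message (no trimming)
OLIGO_LEN = 15            # hypothetical length of the oligo-A/oligo-T stretch


def polya_fields(mate1_is_sense: bool) -> Dict[str, str]:
    """Return 3p/5p poly(A) values for both mates (hypothetical field names)."""
    fields = {
        "fq1_polya_3p": NO_TRIM,
        "fq1_polya_5p": NO_TRIM,
        "fq2_polya_3p": NO_TRIM,
        "fq2_polya_5p": NO_TRIM,
    }
    if mate1_is_sense:
        # Assumption: sense mate ends in oligo-A; antisense mate starts with oligo-T.
        fields["fq1_polya_3p"] = "A" * OLIGO_LEN
        fields["fq2_polya_5p"] = "T" * OLIGO_LEN
    else:
        fields["fq1_polya_5p"] = "T" * OLIGO_LEN
        fields["fq2_polya_3p"] = "A" * OLIGO_LEN
    return fields


def cutadapt_args(fields: Dict[str, str]) -> List[str]:
    """Always emit all four options; placeholders stand in for unused ones."""
    return [
        "-a", fields["fq1_polya_3p"],
        "-g", fields["fq1_polya_5p"],
        "-A", fields["fq2_polya_3p"],
        "-G", fields["fq2_polya_5p"],
    ]


print(" ".join(cutadapt_args(polya_fields(mate1_is_sense=True))))
```

Emitting all four options unconditionally keeps the rule's command line identical for every sample; only the values in the sample table differ.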


Mar 12, 2020
- added alfa_indexes to config.yaml · a7dddfe6
  Dominik Burri authored 5 years ago
  
Feb 21, 2020

handle polyA processing in input preparation script · c4e20a21

BIOPZ-Katsantoni Maria authored 5 years ago and

Alex Kanitz committed 5 years ago

- fixes some functions in `labkey_to_snakemake.py`
- add optional argument for trimming polyA tails; they are trimmed as follows:
  - if mate is sense, oligo-A is added to sample table for `cutadapt` rule to trim
  - if mate is antisense, oligo-T is added to sample table for `cutadapt` rule to trim
  - if option is set to `--trim_polya`, oligo-X stretch is added to sample table and `cutadapt` will not trim


Feb 20, 2020

create log directories in Snakefile · 5e1ec85e

Alex Kanitz authored 5 years ago

- log and, if workflow is executed on cluster, cluster log directories are explicitly created in `Snakefile`
- location of main log directory can be configured in `config.yaml` (field `log_dir`, previously: `local_log`; requires change in script `labkey_to_snakemake.py` as well as subworkflows as field name is hard-coded there)
- location of cluster log directory can be configured in `cluster.json` (in field `__default__` -> `out`)
- `config.yaml` and `cluster.json` in `tests/input_files` are set such that a directory `logs/` is created in the directory where Snakemake is run (i.e., the directory of each test); cluster logs are stored in a subdirectory `logs/cluster`
- removes instructions to explicitly create log directories from docs and all test scripts
- cleans up main `Snakefile` (apart from Snakemake-specific syntax, now passes `flake8` linter test)
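
Creating the log directories explicitly might look like the sketch below. It is written as plain Python rather than as part of a `Snakefile`, and it assumes that the `out` entry in `cluster.json` holds a log file pattern whose parent directory is the cluster log directory. The function name and the example values are hypothetical.

```python
import os
import tempfile
from typing import List, Optional


def create_log_dirs(config: dict, cluster_config: Optional[dict] = None,
                    root: str = ".") -> List[str]:
    """Create the main log directory and, if configured, the cluster log directory."""
    created = []
    log_dir = os.path.join(root, config["log_dir"])
    os.makedirs(log_dir, exist_ok=True)
    created.append(log_dir)
    if cluster_config:
        # Assumption: `out` is a file pattern such as "logs/cluster/{rule}.out".
        cluster_out = cluster_config["__default__"]["out"]
        cluster_dir = os.path.join(root, os.path.dirname(cluster_out))
        os.makedirs(cluster_dir, exist_ok=True)
        created.append(cluster_dir)
    return created


# Example with values resembling the test setup (`logs/` and `logs/cluster`).
with tempfile.TemporaryDirectory() as tmp:
    dirs = create_log_dirs(
        {"log_dir": "logs"},
        {"__default__": {"out": "logs/cluster/{rule}.out"}},
        root=tmp,
    )
    print([os.path.relpath(d, tmp) for d in dirs])
```

Using `exist_ok=True` makes the step safe to repeat, so the workflow can be re-run without first removing the directories.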


Feb 15, 2020

get Snakemake input from LabKey API · eea0206f

BIOPZ-Katsantoni Maria authored 5 years ago and

Alex Kanitz committed 5 years ago

- add script that prepares Snakemake input files 'samples.tsv' and 'config.yaml' from LabKey table
- script either connects to API directly (with '--remote' and related options) or processes a tab-separated LabKey dump file
- add tests for both use cases
- common input files for tests now in 'tests/input_files'
- update all other tests to account for new file locations
- update documentation
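
The two input modes (direct API access vs. a tab-separated dump file) could be dispatched along the lines of the sketch below. Only the `--remote` flag is taken from the commit message. The `--input-table` option and the function names are hypothetical, and the actual API call is deliberately left out.

```python
import argparse
import csv
import io
from typing import Dict, List, Optional, Sequence, TextIO


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse a minimal, hypothetical command line for the two input modes."""
    parser = argparse.ArgumentParser(description="Prepare Snakemake inputs")
    parser.add_argument("--remote", action="store_true",
                        help="fetch the table via the LabKey API")
    parser.add_argument("--input-table", default=None,
                        help="hypothetical option: path to a LabKey dump (TSV)")
    args = parser.parse_args(argv)
    if not args.remote and args.input_table is None:
        parser.error("either --remote or --input-table is required")
    return args


def read_dump(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-separated LabKey dump into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Example: process an in-memory dump instead of a real file.
example = io.StringIO("Name\tMode\nsample1\tpe\n")
print(read_dump(example))
```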


Feb 14, 2020
- LabKey-like input to Snakemake input · 979e6cdd
  BIOPZ-Katsantoni Maria authored 5 years ago and Alex Kanitz committed 5 years ago
  
  - separate organism genome architecture (different input folder)
  - change MD5 checksums to match the new output
Feb 03, 2020

generate Snakemake inputs from LabKey data table · cd541afe

BIOPZ-Katsantoni Maria authored 5 years ago and

Alex Kanitz committed 5 years ago

Adds script `scripts/labkey_to_snakemake.py` which
- maps LabKey table fields to Snakemake parameters
- assembles required parameters from the table data
- infers required parameters from the input data
- produces files `config.yaml` and `samples.tsv` required by the Snakemake pipeline
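
As an illustration of the mapping step, the sketch below renames table columns and writes a tab-separated sample table. All column names on both sides of the mapping are hypothetical; the real field names used by `scripts/labkey_to_snakemake.py` are not given in this commit message.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

# Hypothetical LabKey column -> Snakemake column mapping.
FIELD_MAP = {
    "Sample name": "sample",
    "Path to FASTQ": "fq1",
    "Sequencing mode": "seqmode",
}


def map_row(row: Dict[str, str]) -> Dict[str, str]:
    """Rename the LabKey fields of one row to Snakemake parameter names."""
    return {new: row[old] for old, new in FIELD_MAP.items()}


def write_samples(rows: Iterable[Dict[str, str]], handle: TextIO) -> None:
    """Write mapped rows as a tab-separated table with a header line."""
    writer = csv.DictWriter(handle, fieldnames=list(FIELD_MAP.values()),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(map_row(row))


buffer = io.StringIO()
write_samples(
    [{"Sample name": "s1", "Path to FASTQ": "s1.fq.gz", "Sequencing mode": "pe"}],
    buffer,
)
print(buffer.getvalue())
```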

A self-contained integration test for the script is located at `tests/test_scripts_labkey_to_snakemake` (execute script `test.sh`) and was added to the CI/CD pipeline.

Note that interim changes to the `master` branch were merged into this branch to avoid conflicts during merging.

Closes #39
