Commits · e265ed7539312e1c6b99889876f565192603a5d5 · zavolan_group / pipelines / ZARP

May 07, 2021

Use biocontainers images for star, gffread, salmon, kallisto, cutadapt,... · be8eb02d

Use biocontainers images for star, gffread, salmon, kallisto, cutadapt, samtools, fastqc, alfa, bedtools, bedgraphtobigwig. Change container from bash to ubuntu. Fixes #149

be8eb02d

Apr 15, 2021
- feat: enable user to configure CLI params per rule · d35d5500
  CJHerrmann authored 3 years ago and Alex Kanitz committed 3 years ago
  
  d35d5500
Feb 26, 2021
- Remove unnecessary files in the results directory · 6804ea67
  Dominik Burri authored 4 years ago and BIOPZ-Gypas Foivos committed 4 years ago
  
  6804ea67
Feb 11, 2021
- MultiQC plugins for TIN scores and ALFA Fixes #138 · fab75506
  BIOPZ-Bak Maciej authored 4 years ago and BIOPZ-Gypas Foivos committed 4 years ago
  
  fab75506
Jun 23, 2020
- Merge kallisto rules: kallisto_merge_genes and kallisto_merge_transcript · 1f007e19
  BIOPZ-Iborra de Toledo Paula authored 4 years ago and BIOPZ-Gypas Foivos committed 4 years ago
  
  The rules rely on https://github.com/zavolanlab/merge_kallisto Update info in pipeline_documentation.md
  1f007e19
Jun 18, 2020
- Increase memory for salmon index · 595f881e
  BIOPZ-Börsch Anastasiya authored 4 years ago
  
  595f881e
Jun 17, 2020
- Increase memory for some rules · b72a811a
  BIOPZ-Börsch Anastasiya authored 4 years ago
  
  b72a811a
Jun 15, 2020
- fix: Renamed samples_concat.tsv to samples.multiple_lanes.tsv. Renamed rows... · 05c167fd
  BIOPZ-Katsantoni Maria authored 4 years ago and Alex Kanitz committed 4 years ago
  
  fix: Renamed samples_concat.tsv to samples.multiple_lanes.tsv. Renamed the split rows to the same names as the other test samples, so that the tests (md5 and such) do not change. Removed the one-lane samples. Created config that uses this tsv file
  05c167fd
- refactor: rename LabKey/input table column · 82c4dcea
  BIOPZ-Börsch Anastasiya authored 4 years ago and Alex Kanitz committed 4 years ago
  
  82c4dcea
Jun 12, 2020
- Add Snakemake report · f7cda546
  BIOPZ-Bak Maciej authored 4 years ago and Alex Kanitz committed 4 years ago
  
  f7cda546
- fix(prepare_inputs): support relative paths · 402da16e
  Alex Kanitz authored 4 years ago
  
  402da16e
Apr 27, 2020

Refactor LabKey to Snakemake script · 556f1e12

Alex Kanitz authored 4 years ago

- clean up command line interface
  - improve descriptions
  - add consistent structure
  - remove or merge superfluous CLI arguments
  - set defaults
  - update test calls
  - update docs
  - when importing data from LabKey, table is saved to 'samples.tsv.labkey' in same directory as Snakemake sample table
- allow user to specify environment variables and relative paths in input table and on CLI
  - relative paths in the input table are interpreted with respect to the directory containing the input table
  - other relative paths are interpreted with respect to the current working directory; this is to achieve portability with respect to tests but is discouraged in production because its behavior is not very predictable from the user's perspective; consequently a warning is thrown
- set STAR index size to read length - 1
- remove `gtf_filtered` and `tr_fasta_filtered` and update Snakefiles and test sample tables accordingly
- rename some MultiQC report-related parameters and update Snakefiles and test config files accordingly
- add logging
- add docstrings to module and all functions
- add typing definitions to all functions
- restructure and comment code to improve readability
- linters `flake8` and `mypy` pass
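
The path handling described in this commit message could look roughly like the sketch below. It is not taken from the script itself: the function name `resolve_path` and its parameters are hypothetical, and the warning text is made up for illustration. Only the behavior (environment variables are expanded, table paths are anchored at the table's directory, other relative paths fall back to the current working directory with a warning) follows the description above.

```python
import os
import warnings
from typing import Optional


def resolve_path(value: str, table_path: Optional[str] = None) -> str:
    """Expand environment variables and make a relative path absolute.

    Hypothetical helper: if `table_path` is given, a relative path is
    anchored at the directory containing that table; otherwise it is
    anchored at the current working directory and a warning is issued.
    """
    expanded = os.path.expandvars(value)
    if os.path.isabs(expanded):
        return os.path.normpath(expanded)
    if table_path is not None:
        base = os.path.dirname(os.path.abspath(table_path))
    else:
        warnings.warn(
            f"Relative path '{value}' is interpreted with respect to "
            "the current working directory."
        )
        base = os.getcwd()
    return os.path.normpath(os.path.join(base, expanded))
```

Passing the table location explicitly keeps the result independent of where the script is started, which is presumably why the working-directory fallback is discouraged in production.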

556f1e12

Major refactoring · 6cf28511

BIOPZ-Katsantoni Maria authored 4 years ago and Alex Kanitz committed 4 years ago

* Sequencing mode-related changes:
  * allowed sequencing modes in Snakemake input table changed from `paired_end` and `single_end` to `pe` and `se`, respectively
  * remove sequencing mode from output paths for each rule
  * corresponding wildcards removed entirely from all rules that do not depend on sequencing mode (currently all rules that are defined in the main `Snakefile` in the project root directory)
  * where absolutely necessary, sequencing mode is added as part of output file or directory instead
  * remove dependency of sequencing mode for rule for `FastQC`; now runs separately for each strand
* Changes related to MultiQC and output file/directory structure
  * moving and renaming outputs for MultiQC is no longer required
  * code to create MultiQC custom config externalized into script `scripts/rhea_multiqc_config.py`
  * add MultiQC output files with deterministic output to md5 sum checks performed during execution of `tests/test_integration_workflow/test.{local,slurm}.sh`
  * output filenames for each rule now follow this general structure: `samples/{sample_name}/{rule}/{output_file}`
  * change log directory structure matches results directory structure
* Miscellaneous changes
  * consistent, PEP8-compliant formatting in most parts, including Snakemake files, where allowed
  * remove rule `extract_decoys_salmon`; equivalent file `chrName.txt` produced by `star_index` is used instead
  * add rule `start` which copies sample data to the results directory and enforces uniform naming
  * refactoring of ALFA rules and modification of the CI/CD test to ensure compatibility
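
As an illustration of the naming scheme introduced in this commit, the sketch below builds an output path following `samples/{sample_name}/{rule}/{output_file}` and maps the old sequencing-mode labels to the new ones. The helper names and the sample and file names used in the usage lines are hypothetical; only the path layout and the `pe`/`se` labels come from the commit message.

```python
import posixpath

# Old sequencing-mode labels and their replacements, as listed above.
SEQ_MODES = {"paired_end": "pe", "single_end": "se"}


def output_path(sample_name: str, rule: str, output_file: str,
                root: str = "samples") -> str:
    """Build a path of the form {root}/{sample_name}/{rule}/{output_file}."""
    return posixpath.join(root, sample_name, rule, output_file)


def convert_seq_mode(old_label: str) -> str:
    """Translate an old sequencing-mode label; raise on unknown labels."""
    try:
        return SEQ_MODES[old_label]
    except KeyError:
        raise ValueError(f"Unknown sequencing mode: {old_label}") from None


# Hypothetical sample and file names, for illustration only.
print(output_path("sample_1", "star_index", "chrName.txt"))
print(convert_seq_mode("paired_end"))
```

Because the sequencing mode no longer appears in the directory layout, a rule that does not depend on it needs only the sample name to locate its outputs.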

6cf28511

Add rules for bigWig creation · 907082c3
CJHerrmann authored 4 years ago and Alex Kanitz committed 4 years ago

907082c3

Mar 25, 2020
- Update salmon transcriptome index generation · 4e3cac05
  BIOPZ-Bak Maciej authored 4 years ago and Alex Kanitz committed 4 years ago
  
  4e3cac05
Mar 20, 2020

extend ALFA functionality · f5e2f6ac
Dominik Burri authored 5 years ago and Alex Kanitz committed 5 years ago
```
- generate nucleotide distribution for unique reads only
- new rule to generate PNG image for MultiQC
```
f5e2f6ac

Fix Poly(A)-trimming rule · 392b04d2

BIOPZ-Katsantoni Maria authored 5 years ago and Alex Kanitz committed 5 years ago

In labkey_to_snakemake.py, fixed the parameters so that there is a 3p as well as a 5p polyA
feature for every mate, which can be matched to the -a, -g, -A and -G options of cutadapt.
Depending on which mate is sense or antisense, the appropriate variable is populated
and the rest of the variables are filled with 'XXXXXXXXXXXX', which leads to no trimming by
cutadapt. The poly-A trimming rules are fixed to contain all -a -g -A -G options.
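
A minimal sketch of this idea follows. It is not the code from `labkey_to_snakemake.py`. It assumes that the sense mate carries the poly(A) stretch at its 3' end and the antisense mate a poly(T) stretch at its 5' end, and the oligo length of 15 is an arbitrary choice. The function and field names are hypothetical. The placeholder string and the cutadapt options `-a`, `-g`, `-A` and `-G` are the ones named in the commit message.

```python
from typing import Dict, List

# Placeholder from the commit message; it should not match real reads,
# so cutadapt effectively trims nothing for that option.
PLACEHOLDER = "XXXXXXXXXXXX"


def polya_fields(mate_is_sense: bool, oligo_len: int = 15) -> Dict[str, str]:
    """Return hypothetical 3p/5p poly(A) fields for a single mate."""
    if mate_is_sense:
        # Assumption: sense reads end in a poly(A) stretch.
        return {"polya_3p": "A" * oligo_len, "polya_5p": PLACEHOLDER}
    # Assumption: antisense reads start with a poly(T) stretch.
    return {"polya_3p": PLACEHOLDER, "polya_5p": "T" * oligo_len}


def cutadapt_polya_args(mate1_is_sense: bool) -> List[str]:
    """Build the -a/-g/-A/-G arguments for a paired-end library."""
    mate1 = polya_fields(mate1_is_sense)
    mate2 = polya_fields(not mate1_is_sense)
    return [
        "-a", mate1["polya_3p"],  # 3' adapter, first mate
        "-g", mate1["polya_5p"],  # 5' adapter, first mate
        "-A", mate2["polya_3p"],  # 3' adapter, second mate
        "-G", mate2["polya_5p"],  # 5' adapter, second mate
    ]
```

Always passing all four options keeps the rule's command line identical for every sample; only the values in the sample table differ.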

392b04d2

Mar 19, 2020
- MultiQC · fd1e3123
  BIOPZ-Bak Maciej authored 5 years ago and Alex Kanitz committed 5 years ago
  
  fd1e3123
Mar 17, 2020
- Update resource requirements for Slurm cluster · aae0ffde
  BIOPZ-Börsch Anastasiya authored 5 years ago and Alex Kanitz committed 5 years ago
  
  aae0ffde
- Fix cutadapt overtrimming · bab8f25a
  BIOPZ-Katsantoni Maria authored 5 years ago and Alex Kanitz committed 5 years ago
  
  bab8f25a
Mar 12, 2020

add TIN score merge and plot steps · a0babc83
BIOPZ-Bak Maciej authored 5 years ago

a0babc83

replaced synthetic test by new one. · 46e6e00b

Dominik Burri authored 5 years ago

moved input_files into top-layer test directory for consistency.

corrected removal of test files

46e6e00b

included tests for ALFA qc · ad3a8e52
Dominik Burri authored 5 years ago
```
corrected md5sum for config.yaml

remove unnecessary file
```
ad3a8e52

added rule for · 37fb0fd0

Dominik Burri authored 5 years ago

- renaming bedgraph
- creating ALFA qc plots

removed conda dependence, moved import statement.

included ALFA in finish rule, corrected annotation.gtf and config.yaml, created new .svg

37fb0fd0

Mar 06, 2020
- Replace cufflinks image with gffread image · c3d15275
  BIOPZ-Gypas Foivos authored 5 years ago
  
  c3d15275
- Generate bedgraph file of normalised coverage. Fixes #45 · a54ff3e8
  Dominik Burri authored 5 years ago and BIOPZ-Gypas Foivos committed 5 years ago
  
  a54ff3e8
- Extract transcript sequences from genome (fasta file) and gene annotations (gtf file). Fixes #62 · 0def7b72
  BIOPZ-Iborra de Toledo Paula authored 5 years ago and BIOPZ-Gypas Foivos committed 5 years ago
  
  0def7b72
Feb 21, 2020

Use minified docker images · dc2afcf9
Alex Kanitz authored 5 years ago

dc2afcf9

Add rule that combines TPM values from Salmon · bb1f9b8f

BIOPZ-Iborra de Toledo Paula authored 5 years ago and Alex Kanitz committed 5 years ago

-  Remove files with non-deterministic output from `tests/test_integration_workflow/expected_output.files`
-  Update MD5 sums in `tests/test_integration_workflow/expected_output.md5`
-  Update new workflow DAG and rule graph images

bb1f9b8f

handle polyA processing in input preparation script · c4e20a21

BIOPZ-Katsantoni Maria authored 5 years ago and Alex Kanitz committed 5 years ago

- fixes some functions in `labkey_to_snakemake.py`
- add optional argument for trimming polyA tails; they are trimmed as follows:
  - if mate is sense, oligo-A is added to sample table for `cutadapt` rule to trim
  - if mate is antisense, oligo-T is added to sample table for `cutadapt` rule to trim
  - if option is set to `--trim_polya`, oligo-X stretch is added to sample table and `cutadapt` will not trim

c4e20a21

Feb 20, 2020

create log directories in Snakefile · 5e1ec85e

Alex Kanitz authored 5 years ago

- log and, if workflow is executed on cluster, cluster log directories are explicitly created in `Snakefile`
- location of main log directory can be configured in `config.yaml` (field `log_dir`, previously: `local_log`; requires change in script `labkey_to_snakemake.py` as well as subworkflows as field name is hard-coded there)
- location of cluster log directory can be configured in `cluster.json` (in field `__default__` -> `out`)
- `config.yaml` and `cluster.json` in `tests/input_files` are set such that a directory `logs/` is created in the directory where Snakemake is run (i.e., the directory of each test); cluster logs are stored in a subdirectory `logs/cluster`
- removes instructions to explicitly create log directories from docs and all test scripts
- cleans up main `Snakefile` (apart from Snakemake-specific syntax, now passes `flake8` linter test)
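
Creating the log directories up front might look like the sketch below. The function name `create_log_dirs` is hypothetical and the code is not copied from the `Snakefile`. The field names `log_dir` and `__default__` -> `out` are the ones mentioned in the commit message.

```python
import os
from typing import List, Optional


def create_log_dirs(config: dict,
                    cluster_config: Optional[dict] = None) -> List[str]:
    """Create the main log directory and, if a cluster config is given,
    the cluster log directory. Returns the directories that were ensured."""
    created = []
    log_dir = config["log_dir"]
    os.makedirs(log_dir, exist_ok=True)
    created.append(log_dir)
    if cluster_config:
        cluster_log_dir = cluster_config["__default__"]["out"]
        os.makedirs(cluster_log_dir, exist_ok=True)
        created.append(cluster_log_dir)
    return created
```

Using `exist_ok=True` makes the call safe to repeat, so the workflow can be re-run without removing existing logs first.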

5e1ec85e

Feb 18, 2020

run tests in verbose mode · 0d95577e

Alex Kanitz authored 5 years ago

- trap call functionalized through cleanup() function
- function added to all test scripts
- function prints out exit status of last command before trap
- flag `--verbose` added to Snakemake calls in all test scripts
- script tests renamed to follow naming convention 'test_script_<script_name>_<script_run_mode>'

0d95577e

Feb 17, 2020

add TIN score calculation · c538fe8b

BIOPZ-Bak Maciej authored 5 years ago and Alex Kanitz committed 5 years ago

- add rule for input preparation (GTF to BED12)
- add rule for TIN score calculation
- update rule graph and DAG image
- update Slurm cluster config

c538fe8b

Feb 15, 2020

get Snakemake input from LabKey API · eea0206f

BIOPZ-Katsantoni Maria authored 5 years ago and Alex Kanitz committed 5 years ago

- add script that prepares Snakemake input files 'samples.tsv' and 'config.yaml' from LabKey table
- script either connects to API directly (with '--remote' and related options) or processes a tab-separated LabKey dump file
- add tests for both use cases
- common input files for tests now in 'tests/input_files'
- update all other tests to account for new file locations
- update documentation

eea0206f

Feb 14, 2020

repo follows recommended structure · 1e52fa56
Alex Kanitz authored 5 years ago

1e52fa56
hot fix integration test script · 3fe8f0d1
Alex Kanitz authored 5 years ago

3fe8f0d1
LabKey-like input to Snakemake input · 979e6cdd
BIOPZ-Katsantoni Maria authored 5 years ago and Alex Kanitz committed 5 years ago
```
- separate organism genome architecture (different input folder)
- change MD5 checksums to match the new output
```
979e6cdd

display rule graph instead of DAG · ff08b9c3

CJHerrmann authored 5 years ago and Alex Kanitz committed 5 years ago

- add script `tests/test_rule_graph/test.sh` to generate a rule graph in `images/rule_graph.svg`
- display rule graph created in `README.md` instead of specific workflow DAG
- add test script to GitLab CI config
- renamed test to create workflow DAG from `test_create_dag_chart` to `test_create_dag_image` (also output file is renamed from `images/workflow_dag.svg` to `images/dag_test_workflow.svg`)

ff08b9c3

verify test workflow output against ground truth · c6341db2

Alex Kanitz authored 5 years ago

add tests to workflow integration test `tests/test_workflow_integration` that
- verify that STAR alignments match expected alignments (based on "ground truth" files)
- verify that Salmon gene quantification assigns the correct number of reads to each gene (based on "ground truth" files)

resolves #49

c6341db2

Feb 09, 2020

replace test files with small synthetic ones · 48e012a0

Alex Kanitz authored 5 years ago

- replaces existing larger libraries and annotations in test cases `test_create_dag_chart` and `test_integration_workflow`
- adds the following new test files:
  - `chr1-10000-20000.fa`: artificial chromosome of length 10'000 (based on human chromosome 1)
  - `chr1-10000-20000.gtf`: matching gene annotation file with two gene and three multi-exon transcript entries
  - `chr1-10000-20000.transcripts.fa`: sequences of the transcripts listed in the gene annotation file
  - `synthetic.mate_?.fastq.gz`: 10 read pairs randomly sampled from the genic regions of the artificial chromosome
  - `synthetic.*.bed`: BED files with expected alignments for each read; names of overlapping genes are specified in a 7th column
- updates file paths in the relevant sample tables
- extends and updates checksum checking of result files in CI/CD pipeline
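
Checksum checking of result files can be sketched in Python as shown below. The actual tests are shell scripts, so this is only an illustration of the principle. It assumes the expected checksums are stored in the usual `md5sum` output format (checksum, whitespace, file path), and all function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def md5sum(path: str, chunk_size: int = 65536) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_file(path: str) -> Dict[str, str]:
    """Parse lines of the form '<checksum>  <file path>' into a dict
    mapping file path to expected checksum."""
    expected = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            checksum, file_path = line.split(maxsplit=1)
            # md5sum marks binary mode with a leading '*'.
            expected[file_path.lstrip("*")] = checksum
    return expected


def find_mismatches(expected: Dict[str, str]) -> List[str]:
    """Return the paths whose current checksum differs from the expected one."""
    return [path for path, checksum in expected.items()
            if md5sum(path) != checksum]
```

Such a check is only meaningful for files with deterministic content, which is why outputs that vary between runs are excluded from the checksum lists elsewhere in this history.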

48e012a0