diff --git a/README.md b/README.md index 3125a563e4f3ee01a13484a082acc0a3005f0794..84b8c845eb218f63ef4c4ea64743c859067aa07c 100644 --- a/README.md +++ b/README.md @@ -1,35 +1,41 @@ -# ZARP +<div align="left"> + <img width="20%" align="left" src=images/zarp_logo.svg> +</div> -[Snakemake][snakemake] workflow that covers common steps of short read RNA-Seq -library analysis developed by the [Zavolan lab][zavolan-lab]. +**ZARP** ([Zavolan-Lab][zavolan-lab] Automated RNA-Seq Pipeline) is a generic RNA-Seq analysis workflow that allows +users to process and analyze Illumina short-read sequencing libraries with minimum effort. The workflow relies on +publicly available bioinformatics tools and currently handles single or paired-end stranded bulk RNA-seq data. +The workflow is developed in [Snakemake][snakemake], a widely used workflow management system in the bioinformatics +community. -Reads are analyzed (pre-processed, aligned, quantified) with state-of-the-art -tools to give meaningful initial insights into the quality and composition -of an RNA-Seq library, reducing hands-on time for bioinformaticians and giving -experimentalists the possibility to rapidly assess their data. +According to the current ZARP implementation, reads are analyzed (pre-processed, aligned, quantified) with state-of-the-art +tools to give meaningful initial insights into the quality and composition of an RNA-Seq library, reducing hands-on time for bioinformaticians and giving experimentalists the possibility to rapidly assess their data. Additional reports summarise the results of the individual steps and provide useful visualisations. -Below is a schematic representation of the individual steps of the workflow -("pe" refers to "paired-end"): +<div align="center"> + <img width="60%" src=images/zarp_schema.png> +</div> -> ![rule_graph][rule-graph] -For a more detailed description of each step, please refer to the [workflow -documentation][pipeline-documentation]. 
+> **Note:** For a more detailed description of each step, please refer to the [workflow +> documentation][pipeline-documentation]. -## Requirements -Currently the workflow is only available for Linux distributions. It was tested -on the following distributions: +# Requirements +The workflow has been tested on: - CentOS 7.5 - Debian 10 - Ubuntu 16.04, 18.04 -## Installation +> **NOTE:** +> Currently, we only support **Linux** execution. -### Cloning the repository -Traverse to the desired directory/folder on your file system, then clone/get the +# Installation + +## 1. Clone the repository + +Go to the desired directory/folder on your file system, then clone/get the repository and move into the respective directory with: ```bash @@ -37,42 +43,45 @@ git clone ssh://git@git.scicore.unibas.ch:2222/zavolan_group/pipelines/zarp.git cd zarp ``` -### Installing Conda +## 2. Conda installation Workflow dependencies can be conveniently installed with the [Conda][conda] -package manager. We recommend that you install -[Miniconda][miniconda-installation] for your system (Linux). Be sure to select -Python 3 option. The workflow was built and tested with `miniconda 4.7.12`. +package manager. We recommend that you install [Miniconda][miniconda-installation] +for your system (Linux). Be sure to select the Python 3 option. +The workflow was built and tested with `miniconda 4.7.12`. Other versions are not guaranteed to work as expected. -### Installing dependencies +## 3. Dependencies installation For improved reproducibility and reusability of the workflow, each individual step of the workflow runs either in its own [Singularity][singularity] -container or in its own [Conda][conda] virtual environemnt. As a consequence, running this workflow has very few individual dependencies. However, for the **container execution** it requires Singularity to be installed on the system where the workflow is executed. 
As the functional installation of Singularity requires root privileges, and Conda currently only provides Singularity for Linux architectures, the installation instructions are -slightly different depending on your system/setup: +container or in its own [Conda][conda] virtual environment. +As a consequence, running this workflow has very few individual dependencies. +The **container execution** requires Singularity to be installed on the system where the workflow is executed. +As the functional installation of Singularity requires root privileges, and Conda currently only provides Singularity +for Linux architectures, the installation instructions are slightly different depending on your system/setup: -#### For most users +### For most users -If you do *not* have root privileges on the machine you want to run the -workflow on *or* if you do not have a Linux machine, please [install +If you do *not* have root privileges on the machine you want +to run the workflow on *or* if you do not have a Linux machine, please [install Singularity][singularity-install] separately and in privileged mode, depending on your system. You may have to ask an authorized person (e.g., a systems administrator) to do that. This will almost certainly be required if you want -to run the workflow on a high-performance computing (HPC) cluster. We have -successfully tested the workflow with the following Singularity versions: +to run the workflow on a high-performance computing (HPC) cluster. 
-- `v2.4.5` -- `v2.6.2` -- `v3.5.2` +> **NOTE:** +> The workflow has been tested with the following Singularity versions: +> * `v2.6.2` +> * `v3.5.2` After installing Singularity, install the remaining dependencies with: - ```bash conda env create -f install/environment.yml ``` -#### As root user on Linux + +### As root user on Linux If you have a Linux machine, as well as root privileges, (e.g., if you plan to run the workflow on your own computer), you can execute the following command @@ -82,7 +91,7 @@ to include Singularity in the Conda environment: conda env create -f install/environment.root.yml ``` -### Activate environment +## 4. Activate environment Activate the Conda environment with: @@ -90,7 +99,9 @@ Activate the Conda environment with: conda activate zarp ``` -### Installing non-essential dependencies +# Extra installation steps (optional) + +## 5. Non-essential dependencies installation Most tests have additional dependencies. If you are planning to run tests, you will need to install these by executing the following command _in your active @@ -100,38 +111,33 @@ Conda environment_: conda env update -f install/environment.dev.yml ``` -## Testing the installation +## 6. Successful installation tests We have prepared several tests to check the integrity of the workflow and its components. These can be found in subdirectories of the `tests/` directory. -The most critical of these tests enable you execute the entire workflow on a +The most critical of these tests enable you to execute the entire workflow on a set of small example input files. Note that for this and other tests to complete successfully, [additional dependencies](#installing-non-essential-dependencies) -need to be installed. - -### Test workflow on local machine - -Execute the following command to run the test workflow on your local machine (with singularity): - +need to be installed. 
+Execute one of the following commands to run the test workflow +on your local machine: +* Test workflow on local machine with **Singularity**: ```bash bash tests/test_integration_workflow/test.local.sh ``` - -Alternatively execute the following command to run the test workflow on your local machine (with conda): +* Test workflow on local machine with **Conda**: ```bash bash tests/test_integration_workflow_with_conda/test.local.sh ``` +Execute one of the following commands to run the test workflow +on a [Slurm][slurm]-managed high-performance computing (HPC) cluster: -### Test workflow via Slurm - -Execute the following command to run the test workflow on a -[Slurm][slurm]-managed high-performance computing (HPC) cluster: +* Test workflow with **Singularity**: ```bash bash tests/test_integration_workflow/test.slurm.sh ``` - -or +* Test workflow with **Conda**: ```bash bash tests/test_integration_workflow_with_conda/test.slurm.sh @@ -144,10 +150,10 @@ bash tests/test_integration_workflow_with_conda/test.slurm.sh > Consult the manual of your workload manager as well as the section of the > Snakemake manual dealing with [profiles]. -## Running the workflow on your own samples +# Running the workflow on your own samples 1. Assuming that your current directory is the repository's root directory, -create a directory for your workflow run and traverse inside it with: +create a directory for your workflow run and move into it with: ```bash mkdir config/my_run @@ -176,7 +182,7 @@ or cluster execution. Before execution of the respective command, you need to remember to update the argument of the `--singularity-args` option of a respective profile (file: `profiles/{profile}/config.yaml`) so that it contains a comma-separated list of _all_ directories -containing input data files (samples and any annoation files etc) required for +containing input data files (samples and any annotation files etc) required for your run. 
Runner script for _local execution_: @@ -223,6 +229,8 @@ your run. [profiles]: <https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles> [miniconda-installation]: <https://docs.conda.io/en/latest/miniconda.html> [rule-graph]: images/rule_graph.svg +[zarp-logo]: images/zarp_logo.svg +[zarp-schema]: images/zarp_schema.png [snakemake]: <https://snakemake.readthedocs.io/en/stable/> [singularity]: <https://sylabs.io/singularity/> [singularity-install]: <https://sylabs.io/guides/3.5/admin-guide/installation.html> diff --git a/images/zarp_logo.svg b/images/zarp_logo.svg new file mode 100644 index 0000000000000000000000000000000000000000..5fb2d8487ff33fdbe679ce88c71a5dd2f1d5ee33 --- /dev/null +++ b/images/zarp_logo.svg @@ -0,0 +1,91 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!-- Created with Inkscape (http://www.inkscape.org/) --> + +<svg + xmlns:dc="http://purl.org/dc/elements/1.1/" + xmlns:cc="http://creativecommons.org/ns#" + xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" + xmlns:svg="http://www.w3.org/2000/svg" + xmlns="http://www.w3.org/2000/svg" + xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" + xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" + width="56.26725mm" + height="48.036488mm" + viewBox="0 0 199.37215 170.20803" + id="svg4261" + version="1.1" + inkscape:version="0.91 r13725" + sodipodi:docname="zarp_1.svg"> + <defs + id="defs4263" /> + <sodipodi:namedview + id="base" + pagecolor="#ffffff" + bordercolor="#666666" + borderopacity="1.0" + inkscape:pageopacity="0" + inkscape:pageshadow="2" + inkscape:zoom="1.979899" + inkscape:cx="-20.897608" + inkscape:cy="8.0923225" + inkscape:document-units="px" + inkscape:current-layer="layer1" + showgrid="false" + fit-margin-top="2" + fit-margin-left="2" + fit-margin-right="2" + fit-margin-bottom="2" + inkscape:window-width="1920" + inkscape:window-height="1056" + inkscape:window-x="0" + inkscape:window-y="24" + inkscape:window-maximized="1" /> + 
<metadata + id="metadata4266"> + <rdf:RDF> + <cc:Work + rdf:about=""> + <dc:format>image/svg+xml</dc:format> + <dc:type + rdf:resource="http://purl.org/dc/dcmitype/StillImage" /> + <dc:title></dc:title> + </cc:Work> + </rdf:RDF> + </metadata> + <g + inkscape:label="Layer 1" + inkscape:groupmode="layer" + id="layer1" + transform="translate(-143.17107,-38.686762)"> + <path + style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:70px;line-height:100%;font-family:Laksaman;-inkscape-font-specification:'Laksaman, Bold';text-align:center;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#e5ff80;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.99999988px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" + d="m 243.57805,88.696175 4.26256,-42.817231 9.23941,42.774121 27.72509,-33.056461 -13.42567,56.419606 63.97528,13.31515 -55.4256,5.63049 13.41883,14.07622 -35.29737,-10.00975 2.62545,46.29513 -21.58683,-30.34208 -9.33482,50.67438 -14.42343,-45.05975 -63.46414,14.09207 41.42335,-40.35184 -32.50423,-14.98737 41.31563,-10.16393 -6.52137,-34.918195 24.81619,26.68928 8.48284,-37.93928 z" + id="path4227" + inkscape:connector-curvature="0" + sodipodi:nodetypes="ccccccccccccccccccccc" /> + <path + d="m 233.52739,151.64498 -36.82867,20.19032 -4.34714,-6.07217 8.88857,-44.66529 -18.75483,15.37297 -8.11745,-7.10713 34.18337,-24.0349 2.6618,6.34119 -6.60331,47.12207 21.66058,-15.94774 z" + id="path4219" + inkscape:connector-curvature="0" + sodipodi:nodetypes="ccccccccccc" + style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:70px;line-height:100%;font-family:Laksaman;-inkscape-font-specification:'Laksaman, 
Bold';text-align:center;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#00aad4;fill-opacity:1;stroke:#000000;stroke-width:1.77165365;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" /> + <path + d="m 251.57333,139.84191 -6.23388,7.90103 -6.52448,-19.20392 -15.51878,2.39211 -0.56383,20.5798 -8.03645,-5.98464 6.16552,-42.44714 9.51639,-1.60852 z m -15.38208,-15.39566 -9.00772,-16.1482 -2.67194,17.94854 z" + style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:70px;line-height:100%;font-family:Laksaman;-inkscape-font-specification:'Laksaman, Bold';text-align:center;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#00aad4;fill-opacity:1;stroke:#000000;stroke-width:1.77165377;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" + id="path4221" + inkscape:connector-curvature="0" + sodipodi:nodetypes="ccccccccccccc" /> + <path + d="m 253.71902,93.936195 c 4.28794,-0.17115 7.84842,0.555 10.68149,2.17841 2.86857,1.58862 4.31255,4.035545 4.33187,7.340785 0.024,4.10654 -2.06683,7.16206 -6.27258,9.16659 2.36696,1.37488 4.04026,3.31175 5.0199,5.8106 0.97962,2.49886 1.75442,8.63598 2.56944,13.54579 l -7.57779,-3.91442 c -1.52577,-10.32469 -2.28878,-13.69806 -7.50577,-13.48984 l -3.69875,-0.10794 -0.87052,14.84643 -7.26172,3.35664 0.19202,-37.316425 c 3.96245,-0.82604 7.42658,-1.29825 10.39241,-1.41662 z M 263.803,103.95269 c -0.0244,-4.173295 -3.37763,-6.126585 -10.05967,-5.859895 -2.00103,0.0799 -3.76933,0.23393 -5.30486,0.46219 l 2.11632,11.087925 4.57821,0.0728 c 6.06575,0.0965 8.69461,-1.55637 8.67,-5.76305 z" + style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:70px;line-height:100%;font-family:Laksaman;-inkscape-font-specification:'Laksaman, 
Bold';text-align:center;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#00aad4;fill-opacity:1;stroke:#000000;stroke-width:1.77165365;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" + id="path4223" + inkscape:connector-curvature="0" + sodipodi:nodetypes="scscsccsccccsssccss" /> + <path + d="m 287.31989,95.035445 c 3.33232,0.73811 5.88925,2.26024 7.67077,4.5664 1.81603,2.279665 2.35289,5.095225 1.61054,8.446685 -0.7928,3.57923 -5.05204,6.72224 -5.05204,6.72224 -2.81597,1.45848 -3.44969,2.22861 -7.40288,2.39792 -0.84688,0.0363 -4.05394,-0.3302 -5.24704,-0.76514 l -2.399,13.30577 -5.85632,-0.90533 8.64507,-34.574535 c 3.25075,0.0715 5.92772,0.34013 8.0309,0.80599 z m 5.33499,12.394995 c 0.9874,-4.45777 -2.21202,-6.1985 -4.93709,-7.455185 -1.43225,-0.6605 -4.16678,-1.50776 -5.26087,-1.61356 l -2.50604,12.303985 c 1.32815,0.60139 3.42622,1.21973 3.42622,1.21973 3.66289,0.26727 8.23273,0.2631 9.27778,-4.45497 z" + style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:70px;line-height:100%;font-family:Laksaman;-inkscape-font-specification:'Laksaman, Bold';text-align:center;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#00aad4;fill-opacity:1;stroke:#000000;stroke-width:1.77165353;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" + id="path4225" + inkscape:connector-curvature="0" + sodipodi:nodetypes="scscsccccsssccss" /> + </g> +</svg> diff --git a/images/zarp_schema.png b/images/zarp_schema.png new file mode 100644 index 0000000000000000000000000000000000000000..bdff27d77c7ef08059c00ac37fe99f0b2226902e Binary files /dev/null and b/images/zarp_schema.png differ diff --git a/pipeline_documentation.md b/pipeline_documentation.md index 63b36129f31e2a83d37b1f714da38847fef07eda..f123d951cd792956025a038f3b97d63419997410 100644 --- a/pipeline_documentation.md +++ 
b/pipeline_documentation.md @@ -101,27 +101,26 @@ Visual representation of workflow. Automatically prepared with Parameter name | Description | Data type(s) --- | --- | --- -sample | Descriptive sample name | `str` -seqmode | Required for various steps of the workflow. One of `pe` (for paired-end libraries) or `se` (for single-end libraries). | `str` -fq1 | Path of library file in `.fastq.gz` format (or mate 1 read file for paired-end libraries) | `str` -index_size | Required for [STAR](#third-party-software-used). Ideally the maximum read length minus 1 (`max(ReadLength)-1`). Values lower than maximum read length may result in lower mapping accuracy, while higher values may result in longer processing times. | `int` -kmer | Required for [Salmon](#third-party-software-used). Default value of 31 usually works fine for reads of 75 bp or longer. Consider using lower values if poor mapping is observed. | `int` +sample | Descriptive sample name. <br> **NOTE**: Samples split across multiple FASTQ files (multilane samples) can be merged automatically by using the same ID. | `str` +seqmode | One of `pe` (paired-end) or `se` (single-end), according to the protocol used. | `str` +fq1 | Path of library file in `.fastq.gz` format (or mate 1 read file for paired-end libraries). | `str` fq2 | Path of mate 2 read file in `.fastq.gz` format. Value ignored for for single-end libraries. | `str` fq1_3p | Required for [Cutadapt](#third-party-software-used). 3' adapter of mate 1. Use value such as `XXXXXXXXXXXXXXX` if no adapter present or if no trimming is desired. | `str` fq1_5p | Required for [Cutadapt](#third-party-software-used). 5' adapter of mate 1. Use value such as `XXXXXXXXXXXXXXX` if no adapter present or if no trimming is desired. | `str` fq2_3p | Required for [Cutadapt](#third-party-software-used). 3' adapter of mate 2. Use value such as `XXXXXXXXXXXXXXX` if no adapter present or if no trimming is desired. Value ignored for single-end libraries. 
| `str` fq2_5p | Required for [Cutadapt](#third-party-software-used). 5' adapter of mate 2. Use value such as `XXXXXXXXXXXXXXX` if no adapter present or if no trimming is desired. Value ignored for single-end libraries. | `str` -organism | Name or identifier of organism or organism-specific genome resource version. Has to correspond to the naming of provided genome and gene annotation files and directories, like "ORGANISM" in the path below. Example: `GRCh38` | `str` -gtf | Required for [STAR](#third-party-software-used). Path to gene annotation `.fa` file. File needs to be in subdirectory corresponding to `organism` field. Example: `/path/to/GRCh38/gene_annotations.gtf` | `str` -gtf_filtered | Required for [Salmon](#third-party-software-used). Path to filtered gene annotation `.gtf` file. File needs to be in subdirectory corresponding to `organism` field. Example: `/path/to/GRCh38/gene_annotations.filtered.gtf` | `str` -genome | Required for [STAR](#third-party-software-used). Path to genome `.fa` file. File needs to be in subdirectory corresponding to `organism` field. Example: `/path/to/GRCh38/genome.fa` | `str` -sd | Required for [kallisto](#third-party-software-used) and [Salmon](#third-party-software-used), but only for single-end libraries. Estimated standard deviation of fragment length distribution. Can be assessed from, e.g., BioAnalyzer profiles. Value ignored for paired-end libraries. | `int` -mean | Required for [kallisto](#third-party-software-used) and [Salmon](#third-party-software-used), but only for single-end libraries. Estimated mean of fragment length distribution. Can be assessed, e.g., from BioAnalyzer profiles. Value ignored for paired-end libraries. | `int` -libtype | Required for [Salmon](#third-party-software-used), and, after internal conversion, for [kallisto](#third-party-software-used) and [ALFA](#third-party-software-used) . See [Salmon manual][docs-salmon] for allowed values. 
**WARNING**: do *NOT* use `A` to automatically infer the salmon library type, this will cause kallisto and ALFA to fail. | `str` fq1_polya3p | Required for [Cutadapt](#third-party-software-used). Stretch of `A`s or `T`s, depending on read orientation. Trimmed from the 3' end of the read. Use value such as `XXXXXXXXXXXXXXX` if no poly(A) stretch present or if no trimming is desired. | `str` fq1_polya5p | Required for [Cutadapt](#third-party-software-used). Stretch of `A`s or `T`s, depending on read orientation. Trimmed from the 5' end of the read. Use value such as `XXXXXXXXXXXXXXX` if no poly(A) stretch present or if no trimming is desired. | `str` fq2_polya3p | Required for [Cutadapt](#third-party-software-used). Stretch of `A`s or `T`s, depending on read orientation. Trimmed from the 3' end of the read. Use value such as `XXXXXXXXXXXXXXX` if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. | `str` fq2_polya5p | Required for [Cutadapt](#third-party-software-used). Stretch of `A`s or `T`s, depending on read orientation. Trimmed from the 5' end of the read. Use value such as `XXXXXXXXXXXXXXX` if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. | `str` +index_size | Required for [STAR](#third-party-software-used). Ideally the maximum read length minus 1 (`max(ReadLength)-1`). Values lower than maximum read length may result in lower mapping accuracy, while higher values may result in longer processing times. | `int` +kmer | Required for [Salmon](#third-party-software-used). Default value of 31 usually works fine for reads of 75 bp or longer. Consider using lower values if poor mapping is observed. | `int` +organism | Name or identifier of organism or organism-specific genome resource version. Has to correspond to the naming of provided genome and gene annotation files and directories, like "ORGANISM" in the path below. 
<br> **Example:** `GRCh38` | `str` +gtf | Required for [STAR](#third-party-software-used). Path to gene annotation `.gtf` file. File needs to be in subdirectory corresponding to `organism` field. <br> **Example:** `/path/to/GRCh38/gene_annotations.gtf` | `str` +genome | Required for [STAR](#third-party-software-used). Path to genome `.fa` file. File needs to be in subdirectory corresponding to `organism` field. <br> **Example:** `/path/to/GRCh38/genome.fa` | `str` +sd | Required for [kallisto](#third-party-software-used) and [Salmon](#third-party-software-used), but **only** for single-end libraries. Estimated standard deviation of fragment length distribution. Can be assessed, e.g., from BioAnalyzer profiles. | `int` +mean | Required for [kallisto](#third-party-software-used) and [Salmon](#third-party-software-used), but **only** for single-end libraries. Estimated mean of fragment length distribution. Can be assessed, e.g., from BioAnalyzer profiles. | `int` +libtype | Required for [Salmon](#third-party-software-used), and, after internal conversion, for [kallisto](#third-party-software-used) and [ALFA](#third-party-software-used). See [Salmon manual][docs-salmon] for allowed values. <br>**WARNING**: Do *NOT* use `A` to automatically infer the Salmon library type; this will cause kallisto and ALFA to fail. | `str` #### Create log directories