Commit 55beeb8e authored by BIOPZ-Katsantoni Maria, committed by BIOPZ-Bak Maciej
Update documentation

parent 331f778b
1 merge request: !104 Update documentation
# ZARP
<div align="left">
<img width="20%" align="left" src=images/zarp_logo.svg>
</div>
[Snakemake][snakemake] workflow that covers common steps of short read RNA-Seq
library analysis developed by the [Zavolan lab][zavolan-lab].
**ZARP** ([Zavolan-Lab][zavolan-lab] Automated RNA-Seq Pipeline) is a generic RNA-Seq analysis workflow that allows
users to process and analyze Illumina short-read sequencing libraries with minimum effort. The workflow relies on
publicly available bioinformatics tools and currently handles single-end or paired-end stranded bulk RNA-Seq data.
The workflow is developed in [Snakemake][snakemake], a widely used workflow management system in the bioinformatics
community.
Reads are analyzed (pre-processed, aligned, quantified) with state-of-the-art
tools to give meaningful initial insights into the quality and composition
of an RNA-Seq library, reducing hands-on time for bioinformaticians and giving
experimentalists the possibility to rapidly assess their data.
According to the current ZARP implementation, reads are analyzed (pre-processed, aligned, quantified) with state-of-the-art
tools to give meaningful initial insights into the quality and composition of an RNA-Seq library, reducing hands-on time for bioinformaticians and giving experimentalists the possibility to rapidly assess their data. Additional reports summarise the results of the individual steps and provide useful visualisations.
Below is a schematic representation of the individual steps of the workflow
("pe" refers to "paired-end"):
<div align="center">
<img width="60%" src=images/zarp_schema.png>
</div>
> ![rule_graph][rule-graph]
For a more detailed description of each step, please refer to the [workflow
documentation][pipeline-documentation].
> **Note:** For a more detailed description of each step, please refer to the [workflow
> documentation][pipeline-documentation].
## Requirements
Currently the workflow is only available for Linux distributions. It was tested
on the following distributions:
# Requirements
The workflow has been tested on:
- CentOS 7.5
- Debian 10
- Ubuntu 16.04, 18.04
## Installation
> **NOTE:**
> Currently, we only support **Linux** execution.
### Cloning the repository
Traverse to the desired directory/folder on your file system, then clone/get the
# Installation
## 1. Clone the repository
Go to the desired directory/folder on your file system, then clone/get the
repository and move into the respective directory with:
```bash
......@@ -37,42 +43,45 @@ git clone ssh://git@git.scicore.unibas.ch:2222/zavolan_group/pipelines/zarp.git
cd zarp
```
### Installing Conda
## 2. Conda installation
Workflow dependencies can be conveniently installed with the [Conda][conda]
package manager. We recommend that you install
[Miniconda][miniconda-installation] for your system (Linux). Be sure to select
Python 3 option. The workflow was built and tested with `miniconda 4.7.12`.
package manager. We recommend that you install [Miniconda][miniconda-installation]
for your system (Linux). Be sure to select the Python 3 option.
The workflow was built and tested with `miniconda 4.7.12`.
Other versions are not guaranteed to work as expected.
### Installing dependencies
## 3. Dependencies installation
For improved reproducibility and reusability of the workflow,
each individual step of the workflow runs either in its own [Singularity][singularity]
container or in its own [Conda][conda] virtual environment. As a consequence, running this workflow has very few individual dependencies. However, for the **container execution** it requires Singularity to be installed on the system where the workflow is executed. As the functional installation of Singularity requires root privileges, and Conda currently only provides Singularity for Linux architectures, the installation instructions are
slightly different depending on your system/setup:
container or in its own [Conda][conda] virtual environment.
As a consequence, running this workflow has very few individual dependencies.
The **container execution** requires Singularity to be installed on the system where the workflow is executed.
As the functional installation of Singularity requires root privileges, and Conda currently only provides Singularity
for Linux architectures, the installation instructions are slightly different depending on your system/setup:
#### For most users
### For most users
If you do *not* have root privileges on the machine you want to run the
workflow on *or* if you do not have a Linux machine, please [install
If you do *not* have root privileges on the machine you want
to run the workflow on *or* if you do not have a Linux machine, please [install
Singularity][singularity-install] separately and in privileged mode, depending
on your system. You may have to ask an authorized person (e.g., a systems
administrator) to do that. This will almost certainly be required if you want
to run the workflow on a high-performance computing (HPC) cluster. We have
successfully tested the workflow with the following Singularity versions:
to run the workflow on a high-performance computing (HPC) cluster.
- `v2.4.5`
- `v2.6.2`
- `v3.5.2`
> **NOTE:**
> The workflow has been tested with the following Singularity versions:
> * `v2.6.2`
> * `v3.5.2`
After installing Singularity, install the remaining dependencies with:
```bash
conda env create -f install/environment.yml
```
#### As root user on Linux
### As root user on Linux
If you have a Linux machine, as well as root privileges, (e.g., if you plan to
run the workflow on your own computer), you can execute the following command
......@@ -82,7 +91,7 @@ to include Singularity in the Conda environment:
conda env create -f install/environment.root.yml
```
### Activate environment
## 4. Activate environment
Activate the Conda environment with:
......@@ -90,7 +99,9 @@ Activate the Conda environment with:
conda activate zarp
```
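
As a quick sanity check after activating the environment, the following sketch reports which of the tools mentioned in this README are available on your `PATH`. The helper name `check_tools` is hypothetical and not part of ZARP. It is also an assumption that `snakemake` is provided by the `zarp` environment; `singularity` may be installed system-wide rather than through Conda.

```bash
# Hypothetical helper (not part of ZARP): report whether each named
# command is available on PATH. Returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools named in this README; adjust the list to your setup.
check_tools conda snakemake singularity \
    || echo "Some tools are missing; revisit the installation steps above."
```

A missing `singularity` is only a problem if you plan to use the container execution; the Conda-based execution does not need it.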
### Installing non-essential dependencies
# Extra installation steps (optional)
## 5. Non-essential dependencies installation
Most tests have additional dependencies. If you are planning to run tests, you
will need to install these by executing the following command _in your active
......@@ -100,38 +111,33 @@ Conda environment_:
conda env update -f install/environment.dev.yml
```
## Testing the installation
## 6. Successful installation tests
We have prepared several tests to check the integrity of the workflow and its
components. These can be found in subdirectories of the `tests/` directory.
The most critical of these tests enable you execute the entire workflow on a
The most critical of these tests enable you to execute the entire workflow on a
set of small example input files. Note that for this and other tests to complete
successfully, [additional dependencies](#5-non-essential-dependencies-installation)
need to be installed.
### Test workflow on local machine
Execute the following command to run the test workflow on your local machine (with singularity):
need to be installed.
Execute one of the following commands to run the test workflow
on your local machine:
* Test workflow on local machine with **Singularity**:
```bash
bash tests/test_integration_workflow/test.local.sh
```
Alternatively execute the following command to run the test workflow on your local machine (with conda):
* Test workflow on local machine with **Conda**:
```bash
bash tests/test_integration_workflow_with_conda/test.local.sh
```
Execute one of the following commands to run the test workflow
on a [Slurm][slurm]-managed high-performance computing (HPC) cluster:
### Test workflow via Slurm
Execute the following command to run the test workflow on a
[Slurm][slurm]-managed high-performance computing (HPC) cluster:
* Test workflow with **Singularity**:
```bash
bash tests/test_integration_workflow/test.slurm.sh
```
or
* Test workflow with **Conda**:
```bash
bash tests/test_integration_workflow_with_conda/test.slurm.sh
......@@ -144,10 +150,10 @@ bash tests/test_integration_workflow_with_conda/test.slurm.sh
> Consult the manual of your workload manager as well as the section of the
> Snakemake manual dealing with [profiles].
## Running the workflow on your own samples
# Running the workflow on your own samples
1. Assuming that your current directory is the repository's root directory,
create a directory for your workflow run and traverse inside it with:
create a directory for your workflow run and move into it with:
```bash
mkdir config/my_run
......@@ -176,7 +182,7 @@ or cluster execution. Before execution of the respective command, you need to
remember to update the argument of the `--singularity-args` option of a
respective profile (file: `profiles/{profile}/config.yaml`) so that
it contains a comma-separated list of _all_ directories
containing input data files (samples and any annoation files etc) required for
containing input data files (samples and any annotation files etc) required for
your run.
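
The comma-separated list can be assembled by hand, or with a small helper such as the sketch below. The function name `join_bind_dirs` and the example paths are hypothetical. The sketch also assumes that the profile passes the directories to Singularity through its `--bind` option, so check the existing value in `profiles/{profile}/config.yaml` before copying the result.

```bash
# Hypothetical helper: join directory paths with commas, the form
# accepted by Singularity's --bind option.
join_bind_dirs() {
    (IFS=','; echo "$*")
}

# Example paths are placeholders; replace them with the directories that
# hold your samples and annotation files.
bind_dirs="$(join_bind_dirs /path/to/samples /path/to/annotation)"

# Print a line that could be adapted for the profile's config.yaml
# (assumes the profile uses the `singularity-args` key with `--bind`).
echo "singularity-args: \"--bind ${bind_dirs}\""
```

Every directory that holds an input file must appear in the list; otherwise the containerised steps cannot see the file.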
Runner script for _local execution_:
......@@ -223,6 +229,8 @@ your run.
[profiles]: <https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles>
[miniconda-installation]: <https://docs.conda.io/en/latest/miniconda.html>
[rule-graph]: images/rule_graph.svg
[zarp-logo]: images/zarp_logo.svg
[zarp-schema]: images/zarp_schema.svg
[snakemake]: <https://snakemake.readthedocs.io/en/stable/>
[singularity]: <https://sylabs.io/singularity/>
[singularity-install]: <https://sylabs.io/guides/3.5/admin-guide/installation.html>
......
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Created with Inkscape (http://www.inkscape.org/) -->
<svg
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
width="56.26725mm"
height="48.036488mm"
viewBox="0 0 199.37215 170.20803"
id="svg4261"
version="1.1"
inkscape:version="0.91 r13725"
sodipodi:docname="zarp_1.svg">
<defs
id="defs4263" />
<sodipodi:namedview
id="base"
pagecolor="#ffffff"
bordercolor="#666666"
borderopacity="1.0"
inkscape:pageopacity="0"
inkscape:pageshadow="2"
inkscape:zoom="1.979899"
inkscape:cx="-20.897608"
inkscape:cy="8.0923225"
inkscape:document-units="px"
inkscape:current-layer="layer1"
showgrid="false"
fit-margin-top="2"
fit-margin-left="2"
fit-margin-right="2"
fit-margin-bottom="2"
inkscape:window-width="1920"
inkscape:window-height="1056"
inkscape:window-x="0"
inkscape:window-y="24"
inkscape:window-maximized="1" />
<metadata
id="metadata4266">
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
</cc:Work>
</rdf:RDF>
</metadata>
<g
inkscape:label="Layer 1"
inkscape:groupmode="layer"
id="layer1"
transform="translate(-143.17107,-38.686762)">
<path
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:70px;line-height:100%;font-family:Laksaman;-inkscape-font-specification:'Laksaman, Bold';text-align:center;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#e5ff80;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.99999988px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
d="m 243.57805,88.696175 4.26256,-42.817231 9.23941,42.774121 27.72509,-33.056461 -13.42567,56.419606 63.97528,13.31515 -55.4256,5.63049 13.41883,14.07622 -35.29737,-10.00975 2.62545,46.29513 -21.58683,-30.34208 -9.33482,50.67438 -14.42343,-45.05975 -63.46414,14.09207 41.42335,-40.35184 -32.50423,-14.98737 41.31563,-10.16393 -6.52137,-34.918195 24.81619,26.68928 8.48284,-37.93928 z"
id="path4227"
inkscape:connector-curvature="0"
sodipodi:nodetypes="ccccccccccccccccccccc" />
<path
d="m 233.52739,151.64498 -36.82867,20.19032 -4.34714,-6.07217 8.88857,-44.66529 -18.75483,15.37297 -8.11745,-7.10713 34.18337,-24.0349 2.6618,6.34119 -6.60331,47.12207 21.66058,-15.94774 z"
id="path4219"
inkscape:connector-curvature="0"
sodipodi:nodetypes="ccccccccccc"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:70px;line-height:100%;font-family:Laksaman;-inkscape-font-specification:'Laksaman, Bold';text-align:center;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#00aad4;fill-opacity:1;stroke:#000000;stroke-width:1.77165365;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" />
<path
d="m 251.57333,139.84191 -6.23388,7.90103 -6.52448,-19.20392 -15.51878,2.39211 -0.56383,20.5798 -8.03645,-5.98464 6.16552,-42.44714 9.51639,-1.60852 z m -15.38208,-15.39566 -9.00772,-16.1482 -2.67194,17.94854 z"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:70px;line-height:100%;font-family:Laksaman;-inkscape-font-specification:'Laksaman, Bold';text-align:center;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#00aad4;fill-opacity:1;stroke:#000000;stroke-width:1.77165377;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
id="path4221"
inkscape:connector-curvature="0"
sodipodi:nodetypes="ccccccccccccc" />
<path
d="m 253.71902,93.936195 c 4.28794,-0.17115 7.84842,0.555 10.68149,2.17841 2.86857,1.58862 4.31255,4.035545 4.33187,7.340785 0.024,4.10654 -2.06683,7.16206 -6.27258,9.16659 2.36696,1.37488 4.04026,3.31175 5.0199,5.8106 0.97962,2.49886 1.75442,8.63598 2.56944,13.54579 l -7.57779,-3.91442 c -1.52577,-10.32469 -2.28878,-13.69806 -7.50577,-13.48984 l -3.69875,-0.10794 -0.87052,14.84643 -7.26172,3.35664 0.19202,-37.316425 c 3.96245,-0.82604 7.42658,-1.29825 10.39241,-1.41662 z M 263.803,103.95269 c -0.0244,-4.173295 -3.37763,-6.126585 -10.05967,-5.859895 -2.00103,0.0799 -3.76933,0.23393 -5.30486,0.46219 l 2.11632,11.087925 4.57821,0.0728 c 6.06575,0.0965 8.69461,-1.55637 8.67,-5.76305 z"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:70px;line-height:100%;font-family:Laksaman;-inkscape-font-specification:'Laksaman, Bold';text-align:center;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#00aad4;fill-opacity:1;stroke:#000000;stroke-width:1.77165365;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
id="path4223"
inkscape:connector-curvature="0"
sodipodi:nodetypes="scscsccsccccsssccss" />
<path
d="m 287.31989,95.035445 c 3.33232,0.73811 5.88925,2.26024 7.67077,4.5664 1.81603,2.279665 2.35289,5.095225 1.61054,8.446685 -0.7928,3.57923 -5.05204,6.72224 -5.05204,6.72224 -2.81597,1.45848 -3.44969,2.22861 -7.40288,2.39792 -0.84688,0.0363 -4.05394,-0.3302 -5.24704,-0.76514 l -2.399,13.30577 -5.85632,-0.90533 8.64507,-34.574535 c 3.25075,0.0715 5.92772,0.34013 8.0309,0.80599 z m 5.33499,12.394995 c 0.9874,-4.45777 -2.21202,-6.1985 -4.93709,-7.455185 -1.43225,-0.6605 -4.16678,-1.50776 -5.26087,-1.61356 l -2.50604,12.303985 c 1.32815,0.60139 3.42622,1.21973 3.42622,1.21973 3.66289,0.26727 8.23273,0.2631 9.27778,-4.45497 z"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:70px;line-height:100%;font-family:Laksaman;-inkscape-font-specification:'Laksaman, Bold';text-align:center;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#00aad4;fill-opacity:1;stroke:#000000;stroke-width:1.77165353;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
id="path4225"
inkscape:connector-curvature="0"
sodipodi:nodetypes="scscsccccsssccss" />
</g>
</svg>
images/zarp_schema.png

305 KiB

......@@ -101,27 +101,26 @@ Visual representation of workflow. Automatically prepared with
Parameter name | Description | Data type(s)
--- | --- | ---
sample | Descriptive sample name | `str`
seqmode | Required for various steps of the workflow. One of `pe` (for paired-end libraries) or `se` (for single-end libraries). | `str`
fq1 | Path of library file in `.fastq.gz` format (or mate 1 read file for paired-end libraries) | `str`
index_size | Required for [STAR](#third-party-software-used). Ideally the maximum read length minus 1 (`max(ReadLength)-1`). Values lower than maximum read length may result in lower mapping accuracy, while higher values may result in longer processing times. | `int`
kmer | Required for [Salmon](#third-party-software-used). Default value of 31 usually works fine for reads of 75 bp or longer. Consider using lower values if poor mapping is observed. | `int`
sample | Descriptive sample name. <br> **NOTE**: Samples split across multiple FASTQ files (multi-lane samples) can be merged automatically by using the same sample ID. | `str`
seqmode | Allowed values: `pe` (paired-end) or `se` (single-end), depending on the protocol used. | `str`
fq1 | Path of library file in `.fastq.gz` format (or mate 1 read file for paired-end libraries). | `str`
fq2 | Path of mate 2 read file in `.fastq.gz` format. Value ignored for single-end libraries. | `str`
fq1_3p | Required for [Cutadapt](#third-party-software-used). 3' adapter of mate 1. Use value such as `XXXXXXXXXXXXXXX` if no adapter present or if no trimming is desired. | `str`
fq1_5p | Required for [Cutadapt](#third-party-software-used). 5' adapter of mate 1. Use value such as `XXXXXXXXXXXXXXX` if no adapter present or if no trimming is desired. | `str`
fq2_3p | Required for [Cutadapt](#third-party-software-used). 3' adapter of mate 2. Use value such as `XXXXXXXXXXXXXXX` if no adapter present or if no trimming is desired. Value ignored for single-end libraries. | `str`
fq2_5p | Required for [Cutadapt](#third-party-software-used). 5' adapter of mate 2. Use value such as `XXXXXXXXXXXXXXX` if no adapter present or if no trimming is desired. Value ignored for single-end libraries. | `str`
organism | Name or identifier of organism or organism-specific genome resource version. Has to correspond to the naming of provided genome and gene annotation files and directories, like "ORGANISM" in the path below. Example: `GRCh38` | `str`
gtf | Required for [STAR](#third-party-software-used). Path to gene annotation `.fa` file. File needs to be in subdirectory corresponding to `organism` field. Example: `/path/to/GRCh38/gene_annotations.gtf` | `str`
gtf_filtered | Required for [Salmon](#third-party-software-used). Path to filtered gene annotation `.gtf` file. File needs to be in subdirectory corresponding to `organism` field. Example: `/path/to/GRCh38/gene_annotations.filtered.gtf` | `str`
genome | Required for [STAR](#third-party-software-used). Path to genome `.fa` file. File needs to be in subdirectory corresponding to `organism` field. Example: `/path/to/GRCh38/genome.fa` | `str`
sd | Required for [kallisto](#third-party-software-used) and [Salmon](#third-party-software-used), but only for single-end libraries. Estimated standard deviation of fragment length distribution. Can be assessed from, e.g., BioAnalyzer profiles. Value ignored for paired-end libraries. | `int`
mean | Required for [kallisto](#third-party-software-used) and [Salmon](#third-party-software-used), but only for single-end libraries. Estimated mean of fragment length distribution. Can be assessed, e.g., from BioAnalyzer profiles. Value ignored for paired-end libraries. | `int`
libtype | Required for [Salmon](#third-party-software-used), and, after internal conversion, for [kallisto](#third-party-software-used) and [ALFA](#third-party-software-used) . See [Salmon manual][docs-salmon] for allowed values. **WARNING**: do *NOT* use `A` to automatically infer the salmon library type, this will cause kallisto and ALFA to fail. | `str`
fq1_polya3p | Required for [Cutadapt](#third-party-software-used). Stretch of `A`s or `T`s, depending on read orientation. Trimmed from the 3' end of the read. Use value such as `XXXXXXXXXXXXXXX` if no poly(A) stretch present or if no trimming is desired. | `str`
fq1_polya5p | Required for [Cutadapt](#third-party-software-used). Stretch of `A`s or `T`s, depending on read orientation. Trimmed from the 5' end of the read. Use value such as `XXXXXXXXXXXXXXX` if no poly(A) stretch present or if no trimming is desired. | `str`
fq2_polya3p | Required for [Cutadapt](#third-party-software-used). Stretch of `A`s or `T`s, depending on read orientation. Trimmed from the 3' end of the read. Use value such as `XXXXXXXXXXXXXXX` if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. | `str`
fq2_polya5p | Required for [Cutadapt](#third-party-software-used). Stretch of `A`s or `T`s, depending on read orientation. Trimmed from the 5' end of the read. Use value such as `XXXXXXXXXXXXXXX` if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. | `str`
index_size | Required for [STAR](#third-party-software-used). Ideally the maximum read length minus 1 (`max(ReadLength)-1`). Values lower than maximum read length may result in lower mapping accuracy, while higher values may result in longer processing times. | `int`
kmer | Required for [Salmon](#third-party-software-used). Default value of 31 usually works fine for reads of 75 bp or longer. Consider using lower values if poor mapping is observed. | `int`
organism | Name or identifier of organism or organism-specific genome resource version. Has to correspond to the naming of provided genome and gene annotation files and directories, like "ORGANISM" in the path below. <br> **Example:** `GRCh38` | `str`
gtf | Required for [STAR](#third-party-software-used). Path to gene annotation `.gtf` file. File needs to be in subdirectory corresponding to `organism` field. <br> **Example:** `/path/to/GRCh38/gene_annotations.gtf` | `str`
genome | Required for [STAR](#third-party-software-used). Path to genome `.fa` file. File needs to be in subdirectory corresponding to `organism` field. <br> **Example:** `/path/to/GRCh38/genome.fa` | `str`
sd | Required for [kallisto](#third-party-software-used) and [Salmon](#third-party-software-used), but **only** for single-end libraries. Estimated standard deviation of fragment length distribution. Can be assessed from, e.g., BioAnalyzer profiles. | `int`
mean | Required for [kallisto](#third-party-software-used) and [Salmon](#third-party-software-used), but **only** for single-end libraries. Estimated mean of fragment length distribution. Can be assessed from, e.g., BioAnalyzer profiles. | `int`
libtype | Required for [Salmon](#third-party-software-used), and, after internal conversion, for [kallisto](#third-party-software-used) and [ALFA](#third-party-software-used). See [Salmon manual][docs-salmon] for allowed values. <br>**WARNING**: do *NOT* use `A` to automatically infer the Salmon library type; this will cause kallisto and ALFA to fail. | `str`
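
To pick a value for `index_size`, the maximum read length can be read directly from the library file. The sketch below is one way to do that; the function name `suggest_index_size` is hypothetical and not part of the workflow, and it assumes a gzip-compressed FASTQ file with exactly four lines per record (no wrapped sequence lines).

```bash
# Hypothetical helper: print the longest read length minus 1 for a
# gzip-compressed FASTQ file (assumes 4 lines per record).
suggest_index_size() {
    gzip -dc "$1" \
        | awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
               END { print max - 1 }'
}

# Usage (placeholder path):
#   suggest_index_size /path/to/sample_1.fastq.gz
```

For paired-end libraries, run it on both mate files and use the larger of the two values.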
#### Create log directories
......