Skip to content
Snippets Groups Projects
pipeline_documentation.md 35.6 KiB
Newer Older
# ZARP: workflow documentation

This document describes the individual steps of the workflow. For instructions
on installation and usage please see [here](README.md).

## Table of Contents

- [**Third-party software used**](#third-party-software-used)
- [**Description of workflow steps**](#description-of-workflow-steps)
  - [**Rule graph**](#rule-graph)
  - [**Preparatory**](#preparatory)
    - [**Read samples table**](#read-samples-table)
    - [**Create log directories**](#create-log-directories)
  - [**Sequencing mode-independent**](#sequencing-mode-independent)
    - [**start**](#start)
    - [**create_index_star**](#create_index_star)
    - [**extract_transcriptome**](#extract_transcriptome)
    - [**concatenate_transcriptome_and_genome**](#concatenate_transcriptome_and_genome)
    - [**create_index_salmon**](#create_index_salmon)
    - [**create_index_kallisto**](#create_index_kallisto)
    - [**extract_transcripts_as_bed12**](#extract_transcripts_as_bed12)
    - [**fastqc**](#fastqc)
    - [**index_genomic_alignment_samtools**](#index_genomic_alignment_samtools)
    - [**star_rpm**](#star_rpm)
    - [**rename_star_rpm_for_alfa**](#rename_star_rpm_for_alfa)
    - [**sort_bed_4_big**](#sort_bed_4_big)
    - [**prepare_bigWig**](#prepare_bigwig)
    - [**calculate_TIN_scores**](#calculate_tin_scores)
    - [**salmon_quantmerge_genes**](#salmon_quantmerge_genes)
    - [**salmon_quantmerge_transcripts**](#salmon_quantmerge_transcripts)
    - [**kallisto_merge_genes**](#kallisto_merge_genes)
    - [**kallisto_merge_transcripts**](#kallisto_merge_transcripts)
    - [**pca_kallisto**](#pca_kallisto)
    - [**pca_salmon**](#pca_salmon)
    - [**generate_alfa_index**](#generate_alfa_index)
    - [**alfa_qc**](#alfa_qc)
    - [**prepare_multiqc_config**](#prepare_multiqc_config)
    - [**multiqc_report**](#multiqc_report)
    - [**finish**](#finish)
    - [**Sequencing mode-specific**](#sequencing-mode-specific)
    - [**remove_adapters_cutadapt**](#remove_adapters_cutadapt)
    - [**remove_polya_cutadapt**](#remove_polya_cutadapt)
    - [**map_genome_star**](#map_genome_star)
    - [**quantification_salmon**](#quantification_salmon)
    - [**genome_quantification_kallisto**](#genome_quantification_kallisto)

## Third-party software used

> Tag lines were taken from the developers' websites (code repository or manual)

| Name | License | Tag line | More info |
| --- | --- | --- | --- |
| **ALFA** | [MIT][license-mit] | _"**A**nnotation **L**andscape **F**or **A**ligned reads"_ - _"[...] provides a global overview of features distribution composing NGS dataset(s)"_ | [code][code-alfa] / [manual][docs-alfa] / [publication][pub-alfa] |
| **bedGraphToBigWig** | [MIT][license-mit] | _"Convert a bedGraph file to bigWig format"_ | [code][code-bedgraphtobigwig] / [manual][code-bedgraphtobigwig] |
| **bedtools** | [GPLv2][license-gpl2] | _"[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF"_ | [code][code-bedtools] / [manual][code-bedtools] |
| **cutadapt** | [MIT][license-mit] | _"[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads"_ | [code][code-cutadapt] / [manual][docs-cutadapt] / [publication][pub-cutadapt] |
| **gffread** | [MIT][license-mit] | _"[...] validate, filter, convert and perform various other operations on GFF files"_ | [code][code-gffread] / [manual][docs-gffread] |
| **FastQC** | [GPLv3][license-gpl3] | _"A quality control analysis tool for high throughput sequencing data"_ | [code][code-fastqc] / [manual][docs-fastqc] |
| **ImageMagick** | [custom][license-imagemagick]^ | _"[...] create, edit, compose, or convert bitmap images"_ | [code][code-imagemagick] / [manual][docs-imagemagick] |
| **kallisto** | [BSD-2][license-bsd2] | _"[...] program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads"_ | [code][code-kallisto] / [manual][docs-kallisto] / [publication][pub-kallisto] |
| **MultiQC** | [GPLv3][license-gpl3] | _"Aggregate results from bioinformatics analyses across many samples into a single report"_ | [code][code-multiqc] / [manual][docs-multiqc] / [publication][pub-multiqc] |
| **RSeqC** | [GPLv3][license-gpl3] | _"[...] comprehensively evaluate different aspects of RNA-seq experiments, such as sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure."_ | [code][code-rseqc] / [manual][docs-rseqc] / [publication][pub-rseqc] |
| **Salmon** | [GPLv3][license-gpl3] | _"Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment"_ | [code][code-salmon] / [manual][docs-salmon] / [publication][pub-salmon] |
| **SAMtools** | [MIT][license-mit] | _"[...] suite of programs for interacting with high-throughput sequencing data"_ | [code][code-samtools] / [manual][docs-samtools] / [publication][pub-samtools] |
| **STAR** | [MIT][license-mit] | _"**S**pliced **T**ranscripts **A**lignment to a **R**eference"_ - _"RNA-seq aligner"_ | [code][code-star] / [manual][docs-star] / [publication][pub-star] |

^ compatible with [GPLv3][license-gpl3]

## Description of workflow steps

> The workflow consists of three Snakemake files: A main `Snakefile` and an
> individual Snakemake file for each sequencing mode (single-end and
> paired-end), as parameters for some tools differ between sequencing modes.
> The main `Snakefile` contains general steps for the creation of indices and
> other required files derived from the annotations, steps that are applicable
> to both sequencing modes, and steps that deal with gathering, summarizing or
> combining results. Individual steps of the workflow are described briefly, and
> links to the respective software manuals are given. Parameters that can be
> modified by the user (via the samples table) are also described. Descriptions
> for steps for which individual "rules" exist for single- and paired-end
> sequencing libraries are combined, and only differences between the modes are
> highlighted.

### Rule graph

![rule_graph][rule-graph]

Visual representation of workflow. Automatically prepared with
[Snakemake][docs-snakemake].

### Preparatory

#### Read sample table

##### Requirements

- tab-separated values (`.tsv`) file
- first row has to contain parameter names as in [`samples.tsv`](tests/input_files/samples.tsv)
- first column used as sample identifiers

Parameter name | Description | Data type(s)
--- | --- | ---
sample | Descriptive sample name | `str`
seqmode | Required for various steps of the workflow. One of `pe` (for paired-end libraries) or `se` (for single-end libraries). | `str`
fq1 | Path of library file in `.fastq.gz` format (or mate 1 read file for paired-end libraries) | `str`
index_size | Required for [STAR](#third-party-software-used). Ideally the maximum read length minus 1 (`max(ReadLength)-1`). Values lower than maximum read length may result in lower mapping accuracy, while higher values may result in longer processing times. | `int`
kmer | Required for [Salmon](#third-party-software-used). Default value of 31 usually works fine for reads of 75 bp or longer. Consider using lower values if poor mapping is observed. | `int`
fq2 | Path of mate 2 read file in `.fastq.gz` format. Value ignored for for single-end libraries. | `str`
fq1_3p | Required for [Cutadapt](#third-party-software-used). 3' adapter of mate 1. Use value such as `XXXXXXXXXXXXXXX` if no adapter present or if no trimming is desired. | `str`
fq1_5p | Required for [Cutadapt](#third-party-software-used). 5' adapter of mate 1. Use value such as `XXXXXXXXXXXXXXX` if no adapter present or if no trimming is desired. | `str`
fq2_3p | Required for [Cutadapt](#third-party-software-used). 3' adapter of mate 2. Use value such as `XXXXXXXXXXXXXXX` if no adapter present or if no trimming is desired. Value ignored for single-end libraries. | `str`
fq2_5p | Required for [Cutadapt](#third-party-software-used). 5' adapter of mate 2. Use value such as `XXXXXXXXXXXXXXX` if no adapter present or if no trimming is desired. Value ignored for single-end libraries. | `str`
organism | Name or identifier of organism or organism-specific genome resource version. Has to correspond to the naming of provided genome and gene annotation files and directories, like "ORGANISM" in the path below. Example: `GRCh38` | `str`
gtf | Required for [STAR](#third-party-software-used). Path to gene annotation `.fa` file. File needs to be in subdirectory corresponding to `organism` field. Example: `/path/to/GRCh38/gene_annotations.gtf` | `str`
gtf_filtered | Required for [Salmon](#third-party-software-used). Path to filtered gene annotation `.gtf` file. File needs to be in subdirectory corresponding to `organism` field. Example: `/path/to/GRCh38/gene_annotations.filtered.gtf` | `str`
genome | Required for [STAR](#third-party-software-used). Path to genome `.fa` file. File needs to be in subdirectory corresponding to `organism` field. Example: `/path/to/GRCh38/genome.fa` | `str`
sd | Required for [kallisto](#third-party-software-used) and [Salmon](#third-party-software-used), but only for single-end libraries. Estimated standard deviation of fragment length distribution. Can be assessed from, e.g., BioAnalyzer profiles. Value ignored for paired-end libraries. | `int`
mean | Required for [kallisto](#third-party-software-used) and [Salmon](#third-party-software-used), but only for single-end libraries. Estimated mean of fragment length distribution. Can be assessed, e.g., from BioAnalyzer profiles. Value ignored for paired-end libraries. | `int`
multimappers | Required for [STAR](#third-party-software-used). Maximum number of multiple alignments allowed for a read; if exceeded, the read is considered unmapped. | `int`
soft_clip | Required for [STAR](#third-party-software-used). One of `Local` (standard local alignment with soft-clipping allowed) or `EndToEnd` (force end-to-end read alignment, do not soft-clip). | `str`
pass_mode | Required for [STAR](#third-party-software-used). One of `None` (1-pass mapping) or `Basic` (basic 2-pass mapping, with all 1st-pass junctions inserted into the genome indices on the fly). | `str`
libtype | Required for [Salmon](#third-party-software-used), and, after internal conversion, for [kallisto](#third-party-software-used) and [ALFA](#third-party-software-used) . See [Salmon manual][docs-salmon] for allowed values.    **WARNING**: do *NOT* use `A` to automatically infer the salmon library type, this will cause kallisto and ALFA to fail.  | `str`
fq1_polya3p | Required for [Cutadapt](#third-party-software-used). Stretch of `A`s or `T`s, depending on read orientation. Trimmed from the 3' end of the read. Use value such as `XXXXXXXXXXXXXXX` if no poly(A) stretch present or if no trimming is desired. | `str`
fq1_polya5p | Required for [Cutadapt](#third-party-software-used). Stretch of `A`s or `T`s, depending on read orientation. Trimmed from the 5' end of the read. Use value such as `XXXXXXXXXXXXXXX` if no poly(A) stretch present or if no trimming is desired. | `str`
fq2_polya3p | Required for [Cutadapt](#third-party-software-used). Stretch of `A`s or `T`s, depending on read orientation. Trimmed from the 3' end of the read. Use value such as `XXXXXXXXXXXXXXX` if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. | `str`
fq2_polya5p | Required for [Cutadapt](#third-party-software-used). Stretch of `A`s or `T`s, depending on read orientation. Trimmed from the 5' end of the read. Use value such as `XXXXXXXXXXXXXXX` if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. | `str`

#### Create log directories

Sets up logging directories for the workflow run environment. Vanilla Python
statement, not a Snakemake rule.

### Sequencing mode-independent

#### `start`

Copy and rename read files.

> Local rule

- **Input**
  - Reads file (`.fastq.gz`)
- **Output**
  - Reads file, copied, renamed (`.fastq.gz`); used in [**fastqc**](#fastqc) and
    [**remove_adapters_cutadapt**](#remove_adapters_cutadapt)

#### `create_index_star`

Create index for [**STAR**](#third-party-software-used) short read aligner.

> Indices need only be generated once for each combination of genome, set of
> annotations and index size.

- **Input**
  - Genome sequence file (`.fasta`)
  - Gene annotation file (`.gtf`)
- **Parameters**
  - **samples.tsv**
    - `--sjdbOverhang`: maximum read length - 1; lower values may reduce accuracy, higher values may increase STAR runtime; specify in sample table column `index_size`
- **Output**
  - STAR index; used in [**map_genome_star**](#map_genome_star)
  - Index includes files:
    - Chromosome length table `chrNameLength.txt`; used in
      [**generate_alfa_index**](#generate_alfa_index) and
      [**prepare_bigWig**](#prepare_bigwig)
    - Chromosome name list `chrName.txt`; used in
      [**create_index_salmon**](#create_index_salmon)

#### `extract_transcriptome`

Create transcriptome from genome and gene annotations with
[**gffread**](#third-party-software-used).

- **Input**
  - Genome sequence file (`.fasta`)
  - Gene annotation file (`.gtf`)
- **Output**
  - Transcriptome sequence file (`.fasta`); used in
    [**concatenate_transcriptome_and_genome**](#concatenate_transcriptome_and_genome)
    and [**create_index_kallisto**](#create_index_kallisto)

#### `concatenate_transcriptome_and_genome`

Concatenate reference transcriptome and genome.

> Required by [Salmon][docs-salmon-selective-alignment]

- **Input**
  - Genome sequence file (`.fasta`)
  - Transcriptome sequence file (`.fasta`); from
    [**extract_transcriptome**](#extract_transcriptome)
- **Output**
  - Transcriptome genome reference file (`.fasta`); used in
    [**create_index_salmon**](#create_index_salmon)

#### `create_index_salmon`

Create index for [**Salmon**](#third-party-software-used) quantification.

> Required if Salmon is to be used in "mapping-based" mode.
>create_index_salmon
> Index is built using an auxiliary k-mer hash over k-mers of a specified
> length (default: 31). While the mapping algorithms will make use of
> arbitrarily long matches between the query and reference, the selected k-mer
> size will act as the minimum acceptable length for a valid match. Thus, a
> smaller value of k may slightly improve sensitivty. Empirically, the default
> value of k = 31 seems to work well for reads of 75 bp or longer. For shorter
> reads, consider using a smaller k.

- **Input**
  - Transcriptome genome reference file (`.fasta`); from
    [**concatenate_transcriptome_and_genome**](#concatenate_transcriptome_and_genome)
  - Chromosome name list `chrName.txt`; from
    [**create_index_star**](#create_index_star)
- **Parameters**
  - **samples.tsv**
    - `--kmerLen`: k-mer length; specify in sample table column `kmer`
- **Output**
  - Salmon index; used in [**quantification_salmon**](#quantification_salmon)

#### `create_index_kallisto`

Create index for [**kallisto**](#third-party-software-used) quantification.

> The default kmer size of 31 is used in this workflow and is not configurable
> by the user.
- **Input**
  - Transcriptome sequence file (`.fasta`); from
    [**extract_transcriptome**](#extract_transcriptome)
- **Output**
  - kallisto index; used in
    [**genome_quantification_kallisto**](#genome_quantification_kallisto)

#### `extract_transcripts_as_bed12`

Convert transcripts from `.gtf` to extended 12-column `.bed` format with
[custom-script][custom-script-gtf-to-bed12]. Note that the default transcript type setting is used, which is "protein_coding".

- **Input**
  - Gene annotation file (`.gtf`)
- **Output**
  - Transcript annotations file (12-column `.bed`); used in
    [**calculate_TIN_scores**](#calculate_tin_scores)

#### `fastqc`

Prepare quality control report for reads library with
[**FastQC**](#third-party-software-used).

- **Input**
  - Reads file (`.fastq.gz`); from [**start**](#start)
- **Output**
  - FastQC output directory with report (`.txt`) and figures (`.png`); used in
    [**multiqc_report**](#multiqc_report)

#### `index_genomic_alignment_samtools`

Index BAM file with [**SAMtools**](#third-party-software-used).

> Indexing a genome sorted BAM file allows one to quickly extract alignments
> overlapping particular genomic regions. Moreover, indexing is required by
> genome viewers such as IGV so that the viewers can quickly display alignments
> in a genomic region of interest.

- **Input**
  - Alignemnts file (`.bam`); from [**map_genome_star**](#map_genome_star)
- **Output**
  - BAM index file (`.bam.bai`); used in [**star_rpm**](#star_rpm) &
    [**calculate_TIN_scores**](#calculate_tin_scores)

#### `star_rpm`

Create stranded bedGraph coverage (`.bg`) with
[**STAR**](#third-party-software-used).

> Uses STAR's [RPM normalization][docs-star-rpm-norm] functionality.
>
> STAR RPM uses SAM flags to correctly tell where the read and its mate mapped
> to. That is, if mate 1 is mapped to the plus strand, then mate 2 is mapped to
> the minus strand, and STAR will count mate 1 and mate 2 to the plus strand.
> This is in contrast to `bedtools genomecov -bg -split`, where a mate is
> assigned to a strand irrespective of its corresponding mate.

- **Input**
  - Alignments file (`.bam`); from [**map_genome_star**](#map_genome_star)
  - BAM index file (`.bam.bai`); from
    [**index_genomic_alignment_samtools**](#index_genomic_alignment_samtools)
- **Output**
  - Coverage file (`.bg`); used in [**multiqc_report**](#multiqc_report) and
    [**rename_star_rpm_for_alfa**](#rename_star_rpm_for_alfa)

#### `rename_star_rpm_for_alfa`

Rename and copy stranded bedGraph coverage tracks such that they comply with
[**ALFA**](#third-party-software-used).

> Local rule
>
> Renaming to `plus.bg` and `minus.bg` depends on library orientation, which is
> provided by user in sample table column `libtype`.

- **Input**
  - Coverage file (`.bg`); from [**star_rpm**](#star_rpm)
- **Output**
  - Coverage file, renamed (`.bg`); used in [**alfa_qc**](#alfa_qc) and
    [**sort_bed_4_big**](#sort_bed_4_big)

#### `sort_bed_4_big`

Sort bedGraph files with [**bedtools**](#third-party-software-used).

- **Input**
  - Coverage files, renamed (`.bg`); from
    [**rename_star_rpm_for_alfa**](#rename_star_rpm_for_alfa)
- **Output**
  - Coverage files, sorted (`.bg`); used in [**prepare_bigWig**](#prepare_bigwig)

#### `prepare_bigWig`

Generate bigWig from bedGraph with
[**bedGraphToBigWig**](#third-party-software-used).

- **Input**
  - Coverage files, sorted (`.bg`); from [**sort_bed_4_big**](#sort_bed_4_big)
  - Chromosome length table `chrNameLength.txt`; from
    [**create_index_star**](#create_index_star)
- **Output**
  - Coverage files, one per strand and sample (`.bw`); used in
    [**finish**](#finish)

#### `calculate_TIN_scores`

Calculates the Transcript Integrity Number (TIN) for each transcript with
[custom script][custom-script-tin] based on
[**RSeqC**](#third-party-software-used).

> TIN is conceptually similar to RIN (RNA integrity number) but provides
> transcript-level measurement of RNA quality and is more sensitive for
> low-quality RNA samples.
>
> - TIN score of a transcript reflects RNA integrity of the transcript
> - Median TIN score across all transcripts reflects RNA integrity of library
> - TIN ranges from 0 (the worst) to 100 (the best)
> - A TIN of 60 means that 60% of the transcript would have been covered if
>   the read coverage were uniform
> - A TIN of 0 will be assigned if the transcript has no coverage or coverage
>   is below threshold

- **Input**
  - Alignments file (`.bam`); from [**map_genome_star**](#map_genome_star)
  - BAM index file (`.bam.bai`); from
    [**index_genomic_alignment_samtools**](#index_genomic_alignment_samtools)
  - Transcript annotations file (12-column `.bed`); from
    [**extract_transcripts_as_bed12**](#extract_transcripts_as_bed12)
- **Parameters**
  - **rule_config.yaml**
    - `-c 0`: minimum number of read mapped to a transcript (default 10)
- **Output**
  - TIN score table (custom `tsv`); used in
    [**merge_TIN_scores**](#merge_tin_scores)

#### `salmon_quantmerge_genes`

Merge gene-level expression estimates for all samples with
[**Salmon**](#third-party-software-used).

> Rule is run once per sequencing mode

- **Input**
  - Gene expression tables (custom `.tsv`) for samples of same sequencing mode;
    from [**quantification_salmon**](#quantification_salmon)
- **Output**
  - Gene TPM table (custom `.tsv`); used in
    [**multiqc_report**](#multiqc_report)
  - Gene read count table (custom `.tsv`); used in
    [**pca_salmon**](#pca_salmon) and [**pca_kallisto**](#pca_kallisto)

#### `salmon_quantmerge_transcripts`

Merge transcript-level expression estimates for all samples with
[**Salmon**](#third-party-software-used).

> Rule is run once per sequencing mode

- **Input**
  - Transcript expression tables (custom `.tsv`) for samples of same sequencing
    mode; from [**quantification_salmon**](#quantification_salmon)
- **Output**
  - Transcript TPM table (custom `.tsv`); used in
    [**multiqc_report**](#multiqc_report)
  - Transcript read count table (custom `.tsv`); used in
    [**pca_salmon**](#pca_salmon) and [**pca_kallisto**](#pca_kallisto)
#### `kallisto_merge_genes`

Merge gene-level expression estimates for all samples with 
[custom script][custom-script-merge-kallisto].

> Rule is run once per sequencing mode

- **Input**
  - Transcript expression tables (custom `.h5`) for samples of same sequencing
    mode; from [**genome_quantification_kallisto**](#genome_quantification_kallisto) 
  - Gene annotation file (custom `.gtf`)
- **Output**
  - Gene TPM table (custom `.tsv`)
  - Gene read count table (custom `.tsv`)
  - Mapping gene/transcript IDs table (custom `.tsv`)
- **Non-configurable & non-default**
  - `-txOut FALSE`: gene-level summarization (default would be transcript level) 

#### `kallisto_merge_transcripts`

Merge transcript-level expression estimates for all samples with 
[custom script][custom-script-merge-kallisto].

> Rule is run once per sequencing mode

- **Input**
  - Transcript expression tables (custom `.h5`) for samples of same sequencing
    mode; from [**genome_quantification_kallisto**](#genome_quantification_kallisto) 
- **Output**
  - Transcript TPM table (custom `.tsv`)
  - Transcript read count table (custom `.tsv`)

#### `pca_kallisto`

Run PCA analysis on kallisto genes and transcripts with [custom script][custom-script-zpca].

> Rule is run one time for transcript estimates and one time for genes estimates

- **Input**
  - Transcript/Genes TPM table (custom `.tsv`)
- **Output**
  - Directory with PCA plots, scree plot and top loading scores.

#### `pca_salmon`

Run PCA analysis on salmon genes and transcripts with [custom script][custom-script-zpca].

> Rule is run one time for transcript estimates and one time for genes estimates

- **Input**
  - Transcript/Genes TPM table (custom `.tsv`)
- **Output**
  - Directory with PCA plots, scree plot and top loading scores.

#### `generate_alfa_index`

Create index for [**ALFA**](#third-party-software-used).

- **Input**
  - Gene annotation file (`.gtf`)
  - Chromosome length table `chrNameLength.txt`; from
    [**create_index_star**](#create_index_star)
- **Output**
  - ALFA index; stranded and unstranded; used in
    [**alfa_qc**](#alfa_qc)

#### `alfa_qc`

Annotate alignments with [**ALFA**](#third-party-software-used).

> For details on output plots, see [ALFA documentation][docs-alfa].   
> Note: the read orientation of a sample will be inferred from salmon `libtype` specified in `samples.tsv`

- **Input**
  - Coverage files, renamed (`.bg`); from
    [**rename_star_rpm_for_alfa**](#rename_star_rpm_for_alfa)
  - ALFA index, stranded; from [**generate_alfa_index**](#generate_alfa_index)
- **Parameters**
- **Output**
  - Figures for biotypes and feature categories (`.pdf`)
  - Feature counts table (custom `.tsv`); used in
    [**alfa_qc_all_samples**](#alfa_qc_all_samples)

#### `prepare_multiqc_config`

Prepare config file for [**MultiQC**](#third-party-software-used).

> Local rule

- **Input**
  - Directories created during
    [**prepare_files_for_report**](#prepare_files_for_report)
- **Parameters**
  All parameters for this rule have to be specified in main `config.yaml`
  - `--intro-text`
  - `--custom-logo`
  - `--url`
- **Output**
  - Config file (`.yaml`); used in [**multiqc_report**](#multiqc_report)

#### `multiqc_report`

Prepare interactive report from results and logs with
[**MultiQC**](#third-party-software-used).

- **Input**
  - Config file (`.yaml`); from
    [**prepare_multiqc_config**](#prepare_multiqc_config)
  - ALFA plot, combined (`.png`); from
    [**alfa_concat_results**](#alfa_concat_results)
  - Coverage file (`.bg`); from [**star_rpm**](#star_rpm)
  - FastQC output directory with report (`.txt`) and figures (`.png`); from
    [**fastqc**](#fastqc)
  - Gene TPM table (custom `.tsv`); from
    [**salmon_quantmerge_genes**](#salmon_quantmerge_genes)
  - Gene read count table (custom `.tsv`); from
    [**salmon_quantmerge_genes**](#salmon_quantmerge_genes)
  - Pseudoalignments file (`.sam`); from
    [**genome_quantification_kallisto**](#genome_quantification_kallisto)
  - TIN score box plots (`.pdf` and `.png`); from
    [**plot_TIN_scores**](#plot_tin_scores)
  - Transcript TPM table (custom `.tsv`); from
    [**salmon_quantmerge_transcripts**](#salmon_quantmerge_transcripts)
  - Transcript read count table (custom `.tsv`); from
    [**salmon_quantmerge_transcripts**](#salmon_quantmerge_transcripts)

- **Output**
  - Directory with automatically generated `.html` report

#### `finish`

Target rule as required by [Snakemake][docs-snakemake-target-rule].

> Local rule

- **Input**
  - MultiQC report; from [**multiqc_report**](#multiqc_report)
  - Coverage files, one per strand and sample (`.bw`); used in
    [**prepare_bigWig**](#prepare_bigwig)

### Sequencing mode-specific

> Steps described here have two variants, one with the specified names for
> samples prepared with a single-end sequencing protocol, one with `pe_`
> prepended to the specified name for samples prepared with a paired-end
> sequencing protocol.

#### `remove_adapters_cutadapt`

Remove adapter sequences from reads with
[**Cutadapt**](#third-party-software-used).

- **Input**
  - Reads file (`.fastq.gz`); from [**start**](#start)
- **Parameters**
  - **samples.tsv**
    - Adapters to be removed; specify in sample table columns `fq1_3p`, `fq1_5p`,
    `fq2_3p`, `fq2_5p`
  - **rule_config.yaml:**
    - `-m 10`: Discard processed reads that are shorter than 10 nt. If
    specified in `rule_config.yaml`, it will override ZARP's default value of
    `m=1` for this parameter. Note that this is different from `cutadapt`'s
    default behavior (`m=0`), which leads to empty reads being retained,
    causing problems in downstream applications in ZARP. We thus strongly
    recommend to **not** set the value of `m` to `0`! Refer to `cutadapt`'s
    [documentation][docs-cutadapt-m] for more information on the `m`
    parameter.
    - `-n 2`: search for all the given adapter sequences repeatedly, either until
    no adapter match was found or until 2 rounds have been performed. (default 1)

- **Output**
  - Reads file (`.fastq.gz`); used in
    [**remove_polya_cutadapt**](#remove_polya_cutadapt)

#### `remove_polya_cutadapt`

Remove poly(A) tails from reads with
[**Cutadapt**](#third-party-software-used).

- **Input**
  - Reads file (`.fastq.gz`); from
    [**remove_adapters_cutadapt**](#remove_adapters_cutadapt)
- **Parameters**
  - **samples.tsv**
    - Poly(A) stretches to be removed; specify in sample table columns `fq1_polya` and `fq2_polya`
  - **rule_config.yaml**
    - `-m 10`: Discard processed reads that are shorter than 10 nt. If
    specified in `rule_config.yaml`, it will override ZARP's default value of
    `m=1` for this parameter. Note that this is different from `cutadapt`'s
    default behavior (`m=0`), which leads to empty reads being retained,
    causing problems in downstream applications in ZARP. We thus strongly
    recommend to **not** set the value of `m` to `0`! Refer to `cutadapt`'s
    [documentation][docs-cutadapt-m] for more information on the `m`
    parameter.
    - `-O 1`: minimal overlap of 1 (default: 3)
- **Output**
  - Reads file (`.fastq.gz`); used in
    [**genome_quantification_kallisto**](#genome_quantification_kallisto),
    [**map_genome_star**](#map_genome_star) and
    [**quantification_salmon**](#quantification_salmon)

#### `map_genome_star`

Align short reads to reference genome and/or transcriptome with
[**STAR**](#third-party-software-used).

- **Input**
  - Reads file (`.fastq.gz`); from
    [**remove_polya_cutadapt**](#remove_polya_cutadapt)
  - Index; from [**create_index_star**](#create_index_star)
- **Parameters**
  - **samples.tsv**
    - `--outFilterMultimapNmax`: maximum number of multiple alignments allowed; if exceeded, read is considered unmapped; specify in sample table column `multimappers`
    - `--alignEndsType`: one of `Local` (standard local alignment with soft-clipping allowed) or `EndToEnd` (force end-to-end read alignment, do not soft-clip); specify in sample table column `soft_clip`
    - `--twopassMode`: one of `None` (1-pass mapping) or `Basic` (basic 2-pass mapping, with all 1st-pass junctions inserted into the genome indices on the fly); specify in sample table column `pass_mode`
  - **rule_config.yaml**
    - `--outFilterMultimapScoreRange=0`: the score range below the maximum score for multimapping alignments (default 1)
    - `--outFilterType=BySJout`: reduces the number of ”spurious” junctions
- **Output**
  - Aligned reads file (`.bam`); used in
    [**calculate_TIN_scores**](#calculate_TIN_scores),
    [**index_genomic_alignment_samtools**](#index_genomic_alignment_samtools)
    and [**star_rpm**](#star_rpm)
  - STAR log file
- **Non-configurable & non-default**
  - `--outSAMattributes=All`: NH HI AS nM NM MD jM jI MC ch
  - `--outStd=BAM_SortedByCoordinate`: which output will be directed to `STDOUT` (default 'Log')
  - `--outSAMtype=BAM SortedByCoordinate`: type of SAM/BAM output (default SAM)
  - `--outSAMattrRGline`: ID:rnaseq_pipeline SM: *sampleID*

#### `quantification_salmon`

Estimate transcript- and gene-level expression with
[**Salmon**](#third-party-software-used).

- **Input**
  - Reads file (`.fastq.gz`); from
    [**remove_polya_cutadapt**](#remove_polya_cutadapt)
  - Filtered annotation file (`.gtf`)
  - Index; from [**create_index_salmon**](#create_index_salmon)
- **Parameters**
  - **samples.tsv**
    - `libType`: see [Salmon manual][docs-salmon] for allowed values; specify in sample table column `libtype`
    - `--fldMean`: mean of distribution of fragment lengths; specify in sample table column `mean` **(single-end only)**
    - `--fldSD`: standard deviation of distribution of fragment lengths; specify in sample table column `sd` **(single-end only)**
  - **rule_config.yaml**
    - `--seqBias`: [correct for sequence specific
    biases](https://salmon.readthedocs.io/en/latest/salmon.html#seqbias)
    - `--validateMappings`: enables selective alignment of the sequencing reads when mapping them to the transcriptome; this can improve both the sensitivity and specificity of mapping and, as a result, can [improve quantification accuracy](https://salmon.readthedocs.io/en/latest/salmon.html#validatemappings).
    - `--writeUnmappedNames`: write out the names of reads (or mates in paired-end reads) that do not map to the transcriptome. For paired-end this gives flags that indicate how a read failed to map **(paired-end only)**
- **Output**
  - Gene expression table (`quant.sf`); used in
    [**salmon_quantmerge_genes**](#salmon_quantmerge_genes)
  - Transcript expression table ( `quant.sf`); used in
    [**salmon_quantmerge_transcripts**](#salmon_quantmerge_transcripts)

#### `genome_quantification_kallisto`

Generate pseudoalignments of reads to transcripts with
[**kallisto**](#third-party-software-used).
> Note: the kallisto strandedness parameter will be inferred from salmon `libtype` specified in `samples.tsv`

- **Input**
  - Reads file (`.fastq.gz`); from
    [**remove_polya_cutadapt**](#remove_polya_cutadapt)
  - Index; from [**create_index_kallisto**](#create_index_kallisto)
- **Parameters**
  - **samples.tsv**
    - `-l`: mean of distribution of fragment lengths; specify in sample table column `mean` **(single-end only)**
    - `-s`: standard deviation of distribution of fragment lengths; specify in sample table column `sd` **(single-end only)**
- **Output**
  - Pseudoalignments file (`.sam`) and
  - abundance (`.h5`) 
  used in [**kallisto_merge_genes**](#kallisto_merge_genes)
- **Non-configurable & non-default**
  - `--single`: Quantify single-end reads **(single-end only)**
  - `--pseudobam`: Save pseudoalignments to transcriptome to BAM file

[code-alfa]: <https://github.com/biocompibens/ALFA>
[code-bedgraphtobigwig]: <https://github.com/ucscGenomeBrowser/kent>
[code-bedtools]: <https://github.com/arq5x/bedtools2>
[code-cutadapt]: <https://github.com/marcelm/cutadapt>
[code-gffread]: <https://github.com/gpertea/gffread>
[code-fastqc]: <https://github.com/s-andrews/FastQC>
[code-imagemagick]: <https://github.com/ImageMagick/ImageMagick/>
[code-kallisto]: <https://github.com/pachterlab/kallisto>
[code-multiqc]: <https://github.com/ewels/MultiQC>
[code-rseqc]: <http://rseqc.sourceforge.net/>
[code-salmon]: <https://github.com/COMBINE-lab/salmon>
[code-samtools]: <https://github.com/samtools/samtools>
[code-star]: <https://github.com/alexdobin/STAR>
[custom-script-gtf-to-bed12]: <https://github.com/zavolanlab/zgtf>
[custom-script-tin]: <https://github.com/zavolanlab/tin-score-calculation>
[custom-script-merge-kallisto]: <https://github.com/zavolanlab/merge_kallisto>
[custom-script-zpca]: <https://github.com/zavolanlab/zpca>
[docs-alfa]: <https://github.com/biocompibens/ALFA#manual>
[docs-bedgraphtobigwig]: <http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/>
[docs-bedtools]: <https://bedtools.readthedocs.io/en/latest/>
[docs-cutadapt]: <https://cutadapt.readthedocs.io/en/stable/>
[docs-cutadapt-m]: <https://cutadapt.readthedocs.io/en/stable/guide.html#filtering-reads>
[docs-gffread]: <http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread>
[docs-fastqc]: <http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/>
[docs-imagemagick]: <https://imagemagick.org/>
[docs-kallisto]: <http://pachterlab.github.io/kallisto/manual.html>
[docs-multiqc]: <https://multiqc.info/docs/>
[docs-rseqc]: <http://rseqc.sourceforge.net/#usage-information>
[docs-salmon]: <https://salmon.readthedocs.io/en/latest/>
[docs-salmon-selective-alignment]: <https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/>
[docs-samtools]: <http://www.htslib.org/doc/samtools.html>
[docs-snakemake]: <https://snakemake.readthedocs.io/en/stable/>
[docs-snakemake-target-rule]: <https://snakemake.readthedocs.io/en/stable/tutorial/basics.html#step-7-adding-a-target-rule>
[docs-star]: <https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf>
[docs-star-rpm-norm]: <https://ycl6.gitbooks.io/rna-seq-data-analysis/visualization.html>
[license-bsd2]: <https://opensource.org/licenses/BSD-2-Clause>
[license-gpl2]: <https://opensource.org/licenses/GPL-2.0>
[license-gpl3]: <https://opensource.org/licenses/GPL-3.0>
[license-imagemagick]: <https://github.com/ImageMagick/ImageMagick/blob/master/LICENSE>
[license-mit]: <https://opensource.org/licenses/MIT>
[pub-alfa]: <https://doi.org/10.1186/s12864-019-5624-2>
[pub-cutadapt]: <https://doi.org/10.14806/ej.17.1.200>
[pub-kallisto]: <https://doi.org/10.1038/nbt.3519>
[pub-multiqc]: <https://doi.org/10.1093/bioinformatics/btw354>
[pub-rseqc]: <https://doi.org/10.1093/bioinformatics/bts356>
[pub-salmon]: <https://doi.org/10.1038/nmeth.4197>
[pub-samtools]: <https://doi.org/10.1093/bioinformatics/btp352>
[pub-star]: <https://doi.org/10.1093/bioinformatics/bts635>
[rule-graph]: images/rule_graph.svg