pipeline_documentation.md



ZARP: workflow documentation
This document describes the individual steps of the workflow. For instructions
on installation and usage please see here.

Table of Contents

Third-party software used

Description of workflow steps

Rule graph

Preparatory

Read samples table
Create log directories


Sequencing mode-independent

start
create_index_star
extract_transcriptome
concatenate_transcriptome_and_genome
create_index_salmon
create_index_kallisto
extract_transcripts_as_bed12
fastqc
index_genomic_alignment_samtools
star_rpm
rename_star_rpm_for_alfa
sort_bed_4_big
prepare_bigWig
calculate_TIN_scores
merge_TIN_scores
plot_TIN_scores
salmon_quantmerge_genes
salmon_quantmerge_transcripts
kallisto_merge_genes
kallisto_merge_transcripts
generate_alfa_index
alfa_qc
alfa_qc_all_samples
alfa_concat_results
prepare_multiqc_config
multiqc_report
finish
Sequencing mode-specific
remove_adapters_cutadapt
remove_polya_cutadapt
map_genome_star
quantification_salmon
genome_quantification_kallisto


Third-party software used

Tag lines were taken from the developers' websites (code repository or manual)


Name
License
Tag line
More info


ALFA
MIT

"Annotation Landscape For Aligned reads" - "[...] provides a global overview of features distribution composing NGS dataset(s)"


code / manual / publication


bedGraphToBigWig
MIT
"Convert a bedGraph file to bigWig format"

code / manual


bedtools
GPLv2
"[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF"

code / manual


cutadapt
MIT
"[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads"

code / manual / publication


gffread
MIT
"[...] validate, filter, convert and perform various other operations on GFF files"

code / manual


FastQC
GPLv3
"A quality control analysis tool for high throughput sequencing data"

code / manual


ImageMagick

custom^
"[...] create, edit, compose, or convert bitmap images"

code / manual


kallisto
BSD-2
"[...] program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads"

code / manual / publication


MultiQC
GPLv3
"Aggregate results from bioinformatics analyses across many samples into a single report"

code / manual / publication


RSeqC
GPLv3
"[...] comprehensively evaluate different aspects of RNA-seq experiments, such as sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure."

code / manual / publication


Salmon
GPLv3
"Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment"

code / manual / publication


SAMtools
MIT
"[...] suite of programs for interacting with high-throughput sequencing data"

code / manual / publication


STAR
MIT

"Spliced Transcripts Alignment to a Reference" - "RNA-seq aligner"


code / manual / publication


^ compatible with GPLv3

Description of workflow steps

The workflow consists of three Snakemake files: A main Snakefile and an
individual Snakemake file for each sequencing mode (single-end and
paired-end), as parameters for some tools differ between sequencing modes.
The main Snakefile contains general steps for the creation of indices and
other required files derived from the annotations, steps that are applicable
to both sequencing modes, and steps that deal with gathering, summarizing or
combining results. Individual steps of the workflow are described briefly, and
links to the respective software manuals are given. Parameters that can be
modified by the user (via the samples table) are also described. Descriptions
for steps for which individual "rules" exist for single- and paired-end
sequencing libraries are combined, and only differences between the modes are
highlighted.


Rule graph

Visual representation of workflow. Automatically prepared with
Snakemake.

Preparatory

Read sample table

Requirements

tab-separated values (.tsv) file
first row has to contain parameter names as in samples.tsv

first column used as sample identifiers


Parameter name
Description
Data type(s)


sample
Descriptive sample name
str


seqmode
Required for various steps of the workflow. One of pe (for paired-end libraries) or se (for single-end libraries).
str


fq1
Path of library file in .fastq.gz format (or mate 1 read file for paired-end libraries)
str


index_size
Required for STAR. Ideally the maximum read length minus 1 (max(ReadLength)-1). Values lower than maximum read length may result in lower mapping accuracy, while higher values may result in longer processing times.
int


kmer
Required for Salmon. Default value of 31 usually works fine for reads of 75 bp or longer. Consider using lower values of poor mapping is observed.
int


fq2
Path of mate 2 read file in .fastq.gz format. Value ignored for for single-end libraries.
str


fq1_3p
Required for Cutadapt. 3' adapter of mate 1. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired.
str


fq1_5p
Required for Cutadapt. 5' adapter of mate 1. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired.
str


fq2_3p
Required for Cutadapt. 3' adapter of mate 2. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. Value ignored for single-end libraries.
str


fq2_5p
Required for Cutadapt. 5' adapter of mate 2. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. Value ignored for single-end libraries.
str


organism
Name or identifier of organism or organism-specific genome resource version. Has to correspond to the naming of provided genome and gene annotation files and directories, like "ORGANISM" in the path below. Example: GRCh38

str


gtf
Required for STAR. Path to gene annotation .fa file. File needs to be in subdirectory corresponding to organism field. Example: /path/to/GRCh38/gene_annotations.gtf

str


gtf_filtered
Required for Salmon. Path to filtered gene annotation .gtf file. File needs to be in subdirectory corresponding to organism field. Example: /path/to/GRCh38/gene_annotations.filtered.gtf

str


genome
Required for STAR. Path to genome .fa file. File needs to be in subdirectory corresponding to organism field. Example: /path/to/GRCh38/genome.fa

str


sd
Required for kallisto and Salmon, but only for single-end libraries. Estimated standard deviation of fragment length distribution. Can be assessed from, e.g., BioAnalyzer profiles. Value ignored for paired-end libraries.
int


mean
Required for kallisto and Salmon, but only for single-end libraries. Estimated mean of fragment length distribution. Can be assessed, e.g., from BioAnalyzer profiles. Value ignored for paired-end libraries.
int


multimappers
Required for STAR. Maximum number of multiple alignments allowed for a read; if exceeded, the read is considered unmapped.
int


soft_clip
Required for STAR. One of Local (standard local alignment with soft-clipping allowed) or EndToEnd (force end-to-end read alignment, do not soft-clip).
str


pass_mode
Required for STAR. One of None (1-pass mapping) or Basic (basic 2-pass mapping, with all 1st-pass junctions inserted into the genome indices on the fly).
str


libtype
Required for Salmon. See Salmon manual for allowed values. If in doubt, enter A to automatically infer the library type.
str


kallisto_directionality
Required for kallisto and ALFA. One of --fr-stranded (strand-specific reads, first read forward) and --rf-stranded (strand-specific reads, first read reverse)
str


fq1_polya3p
Required for Cutadapt. Stretch of As or Ts, depending on read orientation. Trimmed from the 3' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired.
str


fq1_polya5p
Required for Cutadapt. Stretch of As or Ts, depending on read orientation. Trimmed from the 5' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired.
str


fq2_polya3p
Required for Cutadapt. Stretch of As or Ts, depending on read orientation. Trimmed from the 3' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries.
str


fq2_polya5p
Required for Cutadapt. Stretch of As or Ts, depending on read orientation. Trimmed from the 5' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries.
str


Create log directories
Sets up logging directories for the workflow run environment. Vanilla Python
statement, not a Snakemake rule.

Sequencing mode-independent

start

Copy and rename read files.

Local rule


Input

Reads file (.fastq.gz)


Output

Reads file, copied, renamed (.fastq.gz); used in fastqc and
remove_adapters_cutadapt


create_index_star

Create index for STAR short read aligner.

Indices need only be generated once for each combination of genome, set of
annotations and index size.


Input

Genome sequence file (.fasta)
Gene annotation file (.gtf)


Parameters


--sjdbOverhang: maximum read length - 1; lower values may reduce accuracy,
higher values may increase STAR runtime; specify in sample table column
index_size


Output

STAR index; used in map_genome_star

Index includes files:

Chromosome length table chrNameLength.txt; used in
generate_alfa_index and
prepare_bigWig

Chromosome name list chrName.txt; used in
create_index_salmon


extract_transcriptome

Create transcriptome from genome and gene annotations with
gffread.


Input

Genome sequence file (.fasta)
Gene annotation file (.gtf)


Output

Transcriptome sequence file (.fasta); used in
concatenate_transcriptome_and_genome
and create_index_kallisto


concatenate_transcriptome_and_genome

Concatenate reference transcriptome and genome.

Required by Salmon


Input

Genome sequence file (.fasta)
Transcriptome sequence file (.fasta); from
extract_transcriptome


Output

Transcriptome genome reference file (.fasta); used in
create_index_salmon


create_index_salmon

Create index for Salmon quantification.

Required if Salmon is to be used in "mapping-based" mode.
create_index_salmon
Index is built using an auxiliary k-mer hash over k-mers of a specified
length (default: 31). While the mapping algorithms will make use of
arbitrarily long matches between the query and reference, the selected k-mer
size will act as the minimum acceptable length for a valid match. Thus, a
smaller value of k may slightly improve sensitivty. Empirically, the default
value of k = 31 seems to work well for reads of 75 bp or longer. For shorter
reads, consider using a smaller k.


Input

Transcriptome genome reference file (.fasta); from
concatenate_transcriptome_and_genome

Chromosome name list chrName.txt; from
create_index_star


Parameters


--kmerLen: k-mer length; specify in sample table column kmer


Output

Salmon index; used in quantification_salmon


create_index_kallisto

Create index for kallisto quantification.

The default kmer size of 31 is used in this workflow and is not configurable
by the user.


Input

Transcriptome sequence file (.fasta); from
extract_transcriptome


Output

kallisto index; used in
genome_quantification_kallisto


extract_transcripts_as_bed12

Convert transcripts from .gtf to extended 12-column .bed format with
custom-script.


Input

Gene annotation file (.gtf)


Output

Transcript annotations file (12-column .bed); used in
calculate_TIN_scores


fastqc

Prepare quality control report for reads library with
FastQC.


Input

Reads file (.fastq.gz); from start


Output

FastQC output directory with report (.txt) and figures (.png); used in
multiqc_report


index_genomic_alignment_samtools

Index BAM file with SAMtools.

Indexing a genome sorted BAM file allows one to quickly extract alignments
overlapping particular genomic regions. Moreover, indexing is required by
genome viewers such as IGV so that the viewers can quickly display alignments
in a genomic region of interest.


Input

Alignemnts file (.bam); from map_genome_star


Output

BAM index file (.bam.bai); used in star_rpm &
calculate_TIN_scores


star_rpm

Create stranded bedGraph coverage (.bg) with
STAR.

Uses STAR's RPM normalization functionality.
STAR RPM uses SAM flags to correctly tell where the read and its mate mapped
to. That is, if mate 1 is mapped to the plus strand, then mate 2 is mapped to
the minus strand, and STAR will count mate 1 and mate 2 to the plus strand.
This is in contrast to bedtools genomecov -bg -split, where a mate is
assigned to a strand irrespective of its corresponding mate.


Input

Alignments file (.bam); from map_genome_star

BAM index file (.bam.bai); from
index_genomic_alignment_samtools


Output

Coverage file (.bg); used in multiqc_report and
rename_star_rpm_for_alfa


Non-configurable & non-default

--outWigStrans="Stranded"
--outWigNorm="RPM"


rename_star_rpm_for_alfa

Rename and copy stranded bedGraph coverage tracks such that they comply with
ALFA.

Local rule
Renaming to plus.bg and minus.bg depends on library orientation, which is
provided by user in sample table column kallisto_directionality.


Input

Coverage file (.bg); from star_rpm


Output

Coverage file, renamed (.bg); used in alfa_qc and
sort_bed_4_big


sort_bed_4_big

Sort bedGraph files with bedtools.


Input

Coverage files, renamed (.bg); from
rename_star_rpm_for_alfa


Output

Coverage files, sorted (.bg); used in prepare_bigWig


prepare_bigWig

Generate bigWig from bedGraph with
bedGraphToBigWig.


Input

Coverage files, sorted (.bg); from sort_bed_4_big

Chromosome length table chrNameLength.txt; from
create_index_star


Output

Coverage files, one per strand and sample (.bw); used in
finish


calculate_TIN_scores

Calculates the Transcript Integrity Number (TIN) for each transcript with
custom script based on
RSeqC.

TIN is conceptually similar to RIN (RNA integrity number) but provides
transcript-level measurement of RNA quality and is more sensitive for
low-quality RNA samples.

TIN score of a transcript reflects RNA integrity of the transcript
Median TIN score across all transcripts reflects RNA integrity of library
TIN ranges from 0 (the worst) to 100 (the best)
A TIN of 60 means that 60% of the transcript would have been covered if
the read coverage were uniform
A TIN of 0 will be assigned if the transcript has no coverage or coverage
is below threshold


Input

Alignments file (.bam); from map_genome_star

BAM index file (.bam.bai); from
index_genomic_alignment_samtools

Transcript annotations file (12-column .bed); from
extract_transcripts_as_bed12


Output

TIN score table (custom tsv); used in
merge_TIN_scores


merge_TIN_scores

Merges TIN score tables for all samples with custom script.


Input

TIN score table (custom tsv); per sample; from
calculate_TIN_scores


Output

TIN score table (custom tsv); for all samples; used in
plot_TIN_scores


plot_TIN_scores

Generate sample-wise box plots of
TIN scores with custom script.


Input

TIN score table (custom tsv); for all samples; from
merge_TIN_scores


Output

TIN score box plots (.pdf and .png); used in
multiqc_report


salmon_quantmerge_genes

Merge gene-level expression estimates for all samples with
Salmon.

Rule is run once per sequencing mode


Input

Gene expression tables (custom .tsv) for samples of same sequencing mode;
from quantification_salmon


Output

Gene TPM table (custom .tsv); used in
multiqc_report

Gene read count table (custom .tsv); used in
multiqc_report


salmon_quantmerge_transcripts

Merge transcript-level expression estimates for all samples with
Salmon.

Rule is run once per sequencing mode


Input

Transcript expression tables (custom .tsv) for samples of same sequencing
mode; from quantification_salmon


Output

Transcript TPM table (custom .tsv); used in
multiqc_report

Transcript read count table (custom .tsv); used in
multiqc_report


kallisto_merge_genes

Merge gene-level expression estimates for all samples with
custom script.

Rule is run once per sequencing mode


Input

Transcript expression tables (custom .h5) for samples of same sequencing
mode; from genome_quantification_kallisto

Gene annotation file (custom .gtf)


Output

Gene TPM table (custom .tsv)
Gene read count table (custom .tsv)
Mapping gene/transcript IDs table (custom .tsv)