Skip to content
Snippets Groups Projects

ZARP: workflow documentation

This document describes the individual steps of the workflow. For instructions on installation and usage please see here.

Table of Contents

Third-party software used

Tag lines were taken from the developers' websites (code repository or manual)

Name License Tag line More info
ALFA MIT "Annotation Landscape For Aligned reads" - "[...] provides a global overview of features distribution composing NGS dataset(s)" code / manual / publication
bedGraphToBigWig MIT "Convert a bedGraph file to bigWig format" code / manual
bedtools GPLv2 "[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF" code / manual
cutadapt MIT "[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads" code / manual / publication
gffread MIT "[...] validate, filter, convert and perform various other operations on GFF files" code / manual
FastQC GPLv3 "A quality control analysis tool for high throughput sequencing data" code / manual
ImageMagick custom^ "[...] create, edit, compose, or convert bitmap images" code / manual
kallisto BSD-2 "[...] program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads" code / manual / publication
MultiQC GPLv3 "Aggregate results from bioinformatics analyses across many samples into a single report" code / manual / publication
RSeqC GPLv3 "[...] comprehensively evaluate different aspects of RNA-seq experiments, such as sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure." code / manual / publication
Salmon GPLv3 "Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment" code / manual / publication
SAMtools MIT "[...] suite of programs for interacting with high-throughput sequencing data" code / manual / publication
STAR MIT "Spliced Transcripts Alignment to a Reference" - "RNA-seq aligner" code / manual / publication

^ compatible with GPLv3

Description of workflow steps

The workflow consists of three Snakemake files: A main Snakefile and an individual Snakemake file for each sequencing mode (single-end and paired-end), as parameters for some tools differ between sequencing modes. The main Snakefile contains general steps for the creation of indices and other required files derived from the annotations, steps that are applicable to both sequencing modes, and steps that deal with gathering, summarizing or combining results. Individual steps of the workflow are described briefly, and links to the respective software manuals are given. Parameters that can be modified by the user (via the samples table) are also described. Descriptions for steps for which individual "rules" exist for single- and paired-end sequencing libraries are combined, and only differences between the modes are highlighted.

Rule graph

rule_graph

Visual representation of workflow. Automatically prepared with Snakemake.

Preparatory

Read sample table

Requirements
  • tab-separated values (.tsv) file
  • first row has to contain parameter names as in samples.tsv
  • first column used as sample identifiers
Parameter name Description Data type(s)
sample Descriptive sample name str
seqmode Required for various steps of the workflow. One of pe (for paired-end libraries) or se (for single-end libraries). str
fq1 Path of library file in .fastq.gz format (or mate 1 read file for paired-end libraries) str
index_size Required for STAR. Ideally the maximum read length minus 1 (max(ReadLength)-1). Values lower than maximum read length may result in lower mapping accuracy, while higher values may result in longer processing times. int
kmer Required for Salmon. Default value of 31 usually works fine for reads of 75 bp or longer. Consider using lower values of poor mapping is observed. int
fq2 Path of mate 2 read file in .fastq.gz format. Value ignored for for single-end libraries. str
fq1_3p Required for Cutadapt. 3' adapter of mate 1. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. str
fq1_5p Required for Cutadapt. 5' adapter of mate 1. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. str
fq2_3p Required for Cutadapt. 3' adapter of mate 2. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. Value ignored for single-end libraries. str
fq2_5p Required for Cutadapt. 5' adapter of mate 2. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. Value ignored for single-end libraries. str
organism Name or identifier of organism or organism-specific genome resource version. Has to correspond to the naming of provided genome and gene annotation files and directories, like "ORGANISM" in the path below. Example: GRCh38 str
gtf Required for STAR. Path to gene annotation .fa file. File needs to be in subdirectory corresponding to organism field. Example: /path/to/GRCh38/gene_annotations.gtf str
gtf_filtered Required for Salmon. Path to filtered gene annotation .gtf file. File needs to be in subdirectory corresponding to organism field. Example: /path/to/GRCh38/gene_annotations.filtered.gtf str
genome Required for STAR. Path to genome .fa file. File needs to be in subdirectory corresponding to organism field. Example: /path/to/GRCh38/genome.fa str
sd Required for kallisto and Salmon, but only for single-end libraries. Estimated standard deviation of fragment length distribution. Can be assessed from, e.g., BioAnalyzer profiles. Value ignored for paired-end libraries. int
mean Required for kallisto and Salmon, but only for single-end libraries. Estimated mean of fragment length distribution. Can be assessed, e.g., from BioAnalyzer profiles. Value ignored for paired-end libraries. int
multimappers Required for STAR. Maximum number of multiple alignments allowed for a read; if exceeded, the read is considered unmapped. int
soft_clip Required for STAR. One of Local (standard local alignment with soft-clipping allowed) or EndToEnd (force end-to-end read alignment, do not soft-clip). str
pass_mode Required for STAR. One of None (1-pass mapping) or Basic (basic 2-pass mapping, with all 1st-pass junctions inserted into the genome indices on the fly). str
libtype Required for Salmon. See Salmon manual for allowed values. If in doubt, enter A to automatically infer the library type. str
kallisto_directionality Required for kallisto and ALFA. One of --fr-stranded (strand-specific reads, first read forward) and --rf-stranded (strand-specific reads, first read reverse) str
fq1_polya3p Required for Cutadapt. Stretch of As or Ts, depending on read orientation. Trimmed from the 3' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. str
fq1_polya5p Required for Cutadapt. Stretch of As or Ts, depending on read orientation. Trimmed from the 5' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. str
fq2_polya3p Required for Cutadapt. Stretch of As or Ts, depending on read orientation. Trimmed from the 3' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. str
fq2_polya5p Required for Cutadapt. Stretch of As or Ts, depending on read orientation. Trimmed from the 5' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. str

Create log directories

Sets up logging directories for the workflow run environment. Vanilla Python statement, not a Snakemake rule.

Sequencing mode-independent

start

Copy and rename read files.

Local rule

create_index_star

Create index for STAR short read aligner.

Indices need only be generated once for each combination of genome, set of annotations and index size.

  • Input
    • Genome sequence file (.fasta)
    • Gene annotation file (.gtf)
  • Parameters
    • --sjdbOverhang: maximum read length - 1; lower values may reduce accuracy, higher values may increase STAR runtime; specify in sample table column index_size
  • Output

extract_transcriptome

Create transcriptome from genome and gene annotations with gffread.

concatenate_transcriptome_and_genome

Concatenate reference transcriptome and genome.

Required by Salmon

create_index_salmon

Create index for Salmon quantification.

Required if Salmon is to be used in "mapping-based" mode. create_index_salmon Index is built using an auxiliary k-mer hash over k-mers of a specified length (default: 31). While the mapping algorithms will make use of arbitrarily long matches between the query and reference, the selected k-mer size will act as the minimum acceptable length for a valid match. Thus, a smaller value of k may slightly improve sensitivty. Empirically, the default value of k = 31 seems to work well for reads of 75 bp or longer. For shorter reads, consider using a smaller k.

create_index_kallisto

Create index for kallisto quantification.

The default kmer size of 31 is used in this workflow and is not configurable by the user.

extract_transcripts_as_bed12

Convert transcripts from .gtf to extended 12-column .bed format with custom-script.

  • Input
    • Gene annotation file (.gtf)
  • Output

fastqc

Prepare quality control report for reads library with FastQC.

  • Input
    • Reads file (.fastq.gz); from start
  • Output
    • FastQC output directory with report (.txt) and figures (.png); used in multiqc_report

index_genomic_alignment_samtools

Index BAM file with SAMtools.

Indexing a genome sorted BAM file allows one to quickly extract alignments overlapping particular genomic regions. Moreover, indexing is required by genome viewers such as IGV so that the viewers can quickly display alignments in a genomic region of interest.

star_rpm

Create stranded bedGraph coverage (.bg) with STAR.

Uses STAR's RPM normalization functionality.

STAR RPM uses SAM flags to correctly tell where the read and its mate mapped to. That is, if mate 1 is mapped to the plus strand, then mate 2 is mapped to the minus strand, and STAR will count mate 1 and mate 2 to the plus strand. This is in contrast to bedtools genomecov -bg -split, where a mate is assigned to a strand irrespective of its corresponding mate.

rename_star_rpm_for_alfa

Rename and copy stranded bedGraph coverage tracks such that they comply with ALFA.

Local rule

Renaming to plus.bg and minus.bg depends on library orientation, which is provided by user in sample table column kallisto_directionality.

sort_bed_4_big

Sort bedGraph files with bedtools.

prepare_bigWig

Generate bigWig from bedGraph with bedGraphToBigWig.

  • Input
  • Output
    • Coverage files, one per strand and sample (.bw); used in finish

calculate_TIN_scores

Calculates the Transcript Integrity Number (TIN) for each transcript with custom script based on RSeqC.

TIN is conceptually similar to RIN (RNA integrity number) but provides transcript-level measurement of RNA quality and is more sensitive for low-quality RNA samples.

  • TIN score of a transcript reflects RNA integrity of the transcript
  • Median TIN score across all transcripts reflects RNA integrity of library
  • TIN ranges from 0 (the worst) to 100 (the best)
  • A TIN of 60 means that 60% of the transcript would have been covered if the read coverage were uniform
  • A TIN of 0 will be assigned if the transcript has no coverage or coverage is below threshold

merge_TIN_scores

Merges TIN score tables for all samples with custom script.

plot_TIN_scores

Generate sample-wise box plots of TIN scores with custom script.

salmon_quantmerge_genes

Merge gene-level expression estimates for all samples with Salmon.

Rule is run once per sequencing mode

salmon_quantmerge_transcripts

Merge transcript-level expression estimates for all samples with Salmon.

Rule is run once per sequencing mode

  • Input
    • Transcript expression tables (custom .tsv) for samples of same sequencing mode; from quantification_salmon
  • Output

generate_alfa_index

Create index for ALFA.

  • Input
    • Gene annotation file (.gtf)
    • Chromosome length table chrNameLength.txt; from create_index_star
  • Output
    • ALFA index; stranded and unstranded; used in alfa_qc

alfa_qc

Annotate alignments with ALFA.

For details on output plots, see ALFA documentation.

  • Input
  • Parameters
    • -s: library orientation; specified by user in sample table column kallisto_directionality
  • Output
    • Figures for biotypes and feature categories (.pdf)
    • Feature counts table (custom .tsv); used in alfa_qc_all_samples

alfa_qc_all_samples

Combines output of all samples with ALFA.

  • Input
    • Feature counts table (custom .tsv); from alfa_qc
  • Output
    • Figures for biotypes and feature categories (.pdf); summarized for all samples together; used in alfa_concat_results

alfa_concat_results

Concatenate and convert ALFA output plots into single plot with ImageMagick.

  • Input
    • Figures for biotypes and feature categories (.pdf); for individual and summarized for all samples
  • Output

prepare_multiqc_config

Prepare config file for MultiQC.

Local rule

multiqc_report

Prepare interactive report from results and logs with MultiQC.

finish

Target rule as required by Snakemake.

Local rule

Sequencing mode-specific

Steps described here have two variants, one with the specified names for samples prepared with a single-end sequencing protocol, one with pe_ prepended to the specified name for samples prepared with a paired-end sequencing protocol.

remove_adapters_cutadapt

Remove adapter sequences from reads with Cutadapt.

  • Input
    • Reads file (.fastq.gz); from start
  • Parameters
    • Adapters to be removed; specify in sample table columns fq1_3p, fq1_5p, fq2_3p, fq2_5p
  • Output
  • Non-configurable & non-default
    • -e 0.1: maximum error-rate of 10%
    • -j 8: use 8 threads
    • -m 10: Discard processed reads that are shorter than 10
    • -n 2: search for all the given adapter sequences repeatedly, either until no adapter match was found or until 2 rounds have been performed.
    • --pair-filter=any: (paired-end only) filtering criteria must apply to any of the two reads in order for a read pair to be discarded

remove_polya_cutadapt

Remove poly(A) tails from reads with Cutadapt.

  • Input
  • Parameters
    • Poly(A) stretches to be removed; specify in sample table columns fq1_polya and fq2_polya
  • Output
  • Non-configurable & non-default
    • -e 0.1: maximum error-rate of 10%
    • -j 8: use 8 threads
    • -m 10: Discard processed reads that are shorter than 10
    • -n 1: search for all the given adapter sequences repeatedly, either until no adapter match was found or until 1 round has been performed.
    • --pair-filter=any: (paired-end only) filtering criteria must apply to any of the two reads in order for a read pair to be discarded
    • -O 1: (single-end only) minimal overlap of 1

map_genome_star

Align short reads to reference genome and/or transcriptome with STAR.

  • Input
  • Parameters
    • --outFilterMultimapNmax: maximum number of multiple alignments allowed; if exceeded, read is considered unmapped; specify in sample table column multimappers
    • --alignEndsType: one of Local (standard local alignment with soft-clipping allowed) or EndToEnd (force end-to-end read alignment, do not soft-clip); specify in sample table column soft_clip
    • --twopassMode: one of None (1-pass mapping) or Basic (basic 2-pass mapping, with all 1st-pass junctions inserted into the genome indices on the fly); specify in sample table column pass_mode
  • Output
  • Non-configurable & non-default
    • --outSAMunmapped=None: do not output unmapped reads in SAM file
    • --outFilterMultimapScoreRange=0: the score range below the maximum score for multimapping alignments
    • --outSAMattributes=All: NH HI AS nM NM MD jM jI MC ch
    • --outStd=BAM_SortedByCoordinate: which output will be directed to STDOUT
    • --outSAMtype=BAM_SortedByCoordinate: type of SAM/BAM output
    • --outFilterMismatchNoverLmax=0.04: alignment will be output only if its ratio of mismatches to mapped length is less than or equal to this value
    • --outFilterScoreMinOverLread=0.3: same as outFilterScoreMin, but normalized to read length (sum of mates’ lengths for paired-end reads)
    • --outFilterMatchNminOverLread=0.3: minimal fraction of aligned bases
    • --outFilterType=BySJout: reduces the number of ”spurious” junctions
    • --outReadsUnmapped=None: do not output unmapped reads

quantification_salmon

Estimate transcript- and gene-level expression with Salmon.

  • Input
  • Parameters
    • libType: see Salmon manual for allowed values; specify in sample table column libtype
    • --fldMean: mean of distribution of fragment lengths; specify in sample table column mean (single-end only)
    • --fldSD: standard deviation of distribution of fragment lengths; specify in sample table column sd (single-end only)
  • Output
  • Non-configurable & non-default
    • --seqBias: correct for sequence specific biases
    • --validateMappings: enables selective alignment of the sequencing reads when mapping them to the transcriptome; this can improve both the sensitivity and specificity of mapping and, as a result, can improve quantification accuracy.
    • --writeUnmappedNames: write out the names of reads (or mates in paired-end reads) that do not map to the transcriptome.

genome_quantification_kallisto

Generate pseudoalignments of reads to transcripts with kallisto.

  • Input
  • Parameters
    • directionality; specify in sample table column kallisto_directionality
    • -l: mean of distribution of fragment lengths; specify in sample table column mean (single-end only)
    • -s: standard deviation of distribution of fragment lengths; specify in sample table column sd (single-end only)
  • Output