-
The rules rely on https://github.com/zavolanlab/merge_kallisto Update info in pipeline_documentation.md
The rules rely on https://github.com/zavolanlab/merge_kallisto Update info in pipeline_documentation.md
ZARP: workflow documentation
This document describes the individual steps of the workflow. For instructions on installation and usage please see here.
Table of Contents
- Third-party software used
-
Description of workflow steps
- Rule graph
- Preparatory
-
Sequencing mode-independent
- start
- create_index_star
- extract_transcriptome
- concatenate_transcriptome_and_genome
- create_index_salmon
- create_index_kallisto
- extract_transcripts_as_bed12
- fastqc
- index_genomic_alignment_samtools
- star_rpm
- rename_star_rpm_for_alfa
- sort_bed_4_big
- prepare_bigWig
- calculate_TIN_scores
- merge_TIN_scores
- plot_TIN_scores
- salmon_quantmerge_genes
- salmon_quantmerge_transcripts
- kallisto_merge_genes
- kallisto_merge_transcripts
- generate_alfa_index
- alfa_qc
- alfa_qc_all_samples
- alfa_concat_results
- prepare_multiqc_config
- multiqc_report
- finish
- Sequencing mode-specific
- remove_adapters_cutadapt
- remove_polya_cutadapt
- map_genome_star
- quantification_salmon
- genome_quantification_kallisto
Third-party software used
Tag lines were taken from the developers' websites (code repository or manual)
Name | License | Tag line | More info |
---|---|---|---|
ALFA | MIT | "Annotation Landscape For Aligned reads" - "[...] provides a global overview of features distribution composing NGS dataset(s)" | code / manual / publication |
bedGraphToBigWig | MIT | "Convert a bedGraph file to bigWig format" | code / manual |
bedtools | GPLv2 | "[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF" | code / manual |
cutadapt | MIT | "[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads" | code / manual / publication |
gffread | MIT | "[...] validate, filter, convert and perform various other operations on GFF files" | code / manual |
FastQC | GPLv3 | "A quality control analysis tool for high throughput sequencing data" | code / manual |
ImageMagick | custom^ | "[...] create, edit, compose, or convert bitmap images" | code / manual |
kallisto | BSD-2 | "[...] program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads" | code / manual / publication |
MultiQC | GPLv3 | "Aggregate results from bioinformatics analyses across many samples into a single report" | code / manual / publication |
RSeqC | GPLv3 | "[...] comprehensively evaluate different aspects of RNA-seq experiments, such as sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure." | code / manual / publication |
Salmon | GPLv3 | "Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment" | code / manual / publication |
SAMtools | MIT | "[...] suite of programs for interacting with high-throughput sequencing data" | code / manual / publication |
STAR | MIT | "Spliced Transcripts Alignment to a Reference" - "RNA-seq aligner" | code / manual / publication |
^ compatible with GPLv3
Description of workflow steps
The workflow consists of three Snakemake files: A main
Snakefile
and an individual Snakemake file for each sequencing mode (single-end and paired-end), as parameters for some tools differ between sequencing modes. The mainSnakefile
contains general steps for the creation of indices and other required files derived from the annotations, steps that are applicable to both sequencing modes, and steps that deal with gathering, summarizing or combining results. Individual steps of the workflow are described briefly, and links to the respective software manuals are given. Parameters that can be modified by the user (via the samples table) are also described. Descriptions for steps for which individual "rules" exist for single- and paired-end sequencing libraries are combined, and only differences between the modes are highlighted.
Rule graph
Visual representation of workflow. Automatically prepared with Snakemake.
Preparatory
Read sample table
Requirements
- tab-separated values (
.tsv
) file - first row has to contain parameter names as in
samples.tsv
- first column used as sample identifiers
Parameter name | Description | Data type(s) |
---|---|---|
sample | Descriptive sample name | str |
seqmode | Required for various steps of the workflow. One of pe (for paired-end libraries) or se (for single-end libraries). |
str |
fq1 | Path of library file in .fastq.gz format (or mate 1 read file for paired-end libraries) |
str |
index_size | Required for STAR. Ideally the maximum read length minus 1 (max(ReadLength)-1 ). Values lower than maximum read length may result in lower mapping accuracy, while higher values may result in longer processing times. |
int |
kmer | Required for Salmon. Default value of 31 usually works fine for reads of 75 bp or longer. Consider using lower values of poor mapping is observed. | int |
fq2 | Path of mate 2 read file in .fastq.gz format. Value ignored for for single-end libraries. |
str |
fq1_3p | Required for Cutadapt. 3' adapter of mate 1. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. |
str |
fq1_5p | Required for Cutadapt. 5' adapter of mate 1. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. |
str |
fq2_3p | Required for Cutadapt. 3' adapter of mate 2. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. Value ignored for single-end libraries. |
str |
fq2_5p | Required for Cutadapt. 5' adapter of mate 2. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. Value ignored for single-end libraries. |
str |
organism | Name or identifier of organism or organism-specific genome resource version. Has to correspond to the naming of provided genome and gene annotation files and directories, like "ORGANISM" in the path below. Example: GRCh38
|
str |
gtf | Required for STAR. Path to gene annotation .fa file. File needs to be in subdirectory corresponding to organism field. Example: /path/to/GRCh38/gene_annotations.gtf
|
str |
gtf_filtered | Required for Salmon. Path to filtered gene annotation .gtf file. File needs to be in subdirectory corresponding to organism field. Example: /path/to/GRCh38/gene_annotations.filtered.gtf
|
str |
genome | Required for STAR. Path to genome .fa file. File needs to be in subdirectory corresponding to organism field. Example: /path/to/GRCh38/genome.fa
|
str |
sd | Required for kallisto and Salmon, but only for single-end libraries. Estimated standard deviation of fragment length distribution. Can be assessed from, e.g., BioAnalyzer profiles. Value ignored for paired-end libraries. | int |
mean | Required for kallisto and Salmon, but only for single-end libraries. Estimated mean of fragment length distribution. Can be assessed, e.g., from BioAnalyzer profiles. Value ignored for paired-end libraries. | int |
multimappers | Required for STAR. Maximum number of multiple alignments allowed for a read; if exceeded, the read is considered unmapped. | int |
soft_clip | Required for STAR. One of Local (standard local alignment with soft-clipping allowed) or EndToEnd (force end-to-end read alignment, do not soft-clip). |
str |
pass_mode | Required for STAR. One of None (1-pass mapping) or Basic (basic 2-pass mapping, with all 1st-pass junctions inserted into the genome indices on the fly). |
str |
libtype | Required for Salmon. See Salmon manual for allowed values. If in doubt, enter A to automatically infer the library type. |
str |
kallisto_directionality | Required for kallisto and ALFA. One of --fr-stranded (strand-specific reads, first read forward) and --rf-stranded (strand-specific reads, first read reverse) |
str |
fq1_polya3p | Required for Cutadapt. Stretch of A s or T s, depending on read orientation. Trimmed from the 3' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. |
str |
fq1_polya5p | Required for Cutadapt. Stretch of A s or T s, depending on read orientation. Trimmed from the 5' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. |
str |
fq2_polya3p | Required for Cutadapt. Stretch of A s or T s, depending on read orientation. Trimmed from the 3' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. |
str |
fq2_polya5p | Required for Cutadapt. Stretch of A s or T s, depending on read orientation. Trimmed from the 5' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. |
str |
Create log directories
Sets up logging directories for the workflow run environment. Vanilla Python statement, not a Snakemake rule.
Sequencing mode-independent
start
Copy and rename read files.
Local rule
-
Input
- Reads file (
.fastq.gz
)
- Reads file (
-
Output
- Reads file, copied, renamed (
.fastq.gz
); used in fastqc and remove_adapters_cutadapt
- Reads file, copied, renamed (
create_index_star
Create index for STAR short read aligner.
Indices need only be generated once for each combination of genome, set of annotations and index size.
-
Input
- Genome sequence file (
.fasta
) - Gene annotation file (
.gtf
)
- Genome sequence file (
-
Parameters
-
--sjdbOverhang
: maximum read length - 1; lower values may reduce accuracy, higher values may increase STAR runtime; specify in sample table columnindex_size
-
-
Output
- STAR index; used in map_genome_star
- Index includes files:
- Chromosome length table
chrNameLength.txt
; used in generate_alfa_index and prepare_bigWig - Chromosome name list
chrName.txt
; used in create_index_salmon
- Chromosome length table
extract_transcriptome
Create transcriptome from genome and gene annotations with gffread.
-
Input
- Genome sequence file (
.fasta
) - Gene annotation file (
.gtf
)
- Genome sequence file (
-
Output
- Transcriptome sequence file (
.fasta
); used in concatenate_transcriptome_and_genome and create_index_kallisto
- Transcriptome sequence file (
concatenate_transcriptome_and_genome
Concatenate reference transcriptome and genome.
Required by Salmon
-
Input
- Genome sequence file (
.fasta
) - Transcriptome sequence file (
.fasta
); from extract_transcriptome
- Genome sequence file (
-
Output
- Transcriptome genome reference file (
.fasta
); used in create_index_salmon
- Transcriptome genome reference file (
create_index_salmon
Create index for Salmon quantification.
Required if Salmon is to be used in "mapping-based" mode. create_index_salmon Index is built using an auxiliary k-mer hash over k-mers of a specified length (default: 31). While the mapping algorithms will make use of arbitrarily long matches between the query and reference, the selected k-mer size will act as the minimum acceptable length for a valid match. Thus, a smaller value of k may slightly improve sensitivty. Empirically, the default value of k = 31 seems to work well for reads of 75 bp or longer. For shorter reads, consider using a smaller k.
-
Input
- Transcriptome genome reference file (
.fasta
); from concatenate_transcriptome_and_genome - Chromosome name list
chrName.txt
; from create_index_star
- Transcriptome genome reference file (
-
Parameters
-
--kmerLen
: k-mer length; specify in sample table columnkmer
-
-
Output
- Salmon index; used in quantification_salmon
create_index_kallisto
Create index for kallisto quantification.
The default kmer size of 31 is used in this workflow and is not configurable by the user.
-
Input
- Transcriptome sequence file (
.fasta
); from extract_transcriptome
- Transcriptome sequence file (
-
Output
- kallisto index; used in genome_quantification_kallisto
extract_transcripts_as_bed12
Convert transcripts from .gtf
to extended 12-column .bed
format with
custom-script.
-
Input
- Gene annotation file (
.gtf
)
- Gene annotation file (
-
Output
- Transcript annotations file (12-column
.bed
); used in calculate_TIN_scores
- Transcript annotations file (12-column
fastqc
Prepare quality control report for reads library with FastQC.
-
Input
- Reads file (
.fastq.gz
); from start
- Reads file (
-
Output
- FastQC output directory with report (
.txt
) and figures (.png
); used in multiqc_report
- FastQC output directory with report (
index_genomic_alignment_samtools
Index BAM file with SAMtools.
Indexing a genome sorted BAM file allows one to quickly extract alignments overlapping particular genomic regions. Moreover, indexing is required by genome viewers such as IGV so that the viewers can quickly display alignments in a genomic region of interest.
-
Input
- Alignemnts file (
.bam
); from map_genome_star
- Alignemnts file (
-
Output
- BAM index file (
.bam.bai
); used in star_rpm & calculate_TIN_scores
- BAM index file (
star_rpm
Create stranded bedGraph coverage (.bg
) with
STAR.
Uses STAR's RPM normalization functionality.
STAR RPM uses SAM flags to correctly tell where the read and its mate mapped to. That is, if mate 1 is mapped to the plus strand, then mate 2 is mapped to the minus strand, and STAR will count mate 1 and mate 2 to the plus strand. This is in contrast to
bedtools genomecov -bg -split
, where a mate is assigned to a strand irrespective of its corresponding mate.
-
Input
- Alignments file (
.bam
); from map_genome_star - BAM index file (
.bam.bai
); from index_genomic_alignment_samtools
- Alignments file (
-
Output
- Coverage file (
.bg
); used in multiqc_report and rename_star_rpm_for_alfa
- Coverage file (
-
Non-configurable & non-default
--outWigStrans="Stranded"
--outWigNorm="RPM"
rename_star_rpm_for_alfa
Rename and copy stranded bedGraph coverage tracks such that they comply with ALFA.
Local rule
Renaming to
plus.bg
andminus.bg
depends on library orientation, which is provided by user in sample table columnkallisto_directionality
.
-
Input
- Coverage file (
.bg
); from star_rpm
- Coverage file (
-
Output
- Coverage file, renamed (
.bg
); used in alfa_qc and sort_bed_4_big
- Coverage file, renamed (
sort_bed_4_big
Sort bedGraph files with bedtools.
-
Input
- Coverage files, renamed (
.bg
); from rename_star_rpm_for_alfa
- Coverage files, renamed (
-
Output
- Coverage files, sorted (
.bg
); used in prepare_bigWig
- Coverage files, sorted (
prepare_bigWig
Generate bigWig from bedGraph with bedGraphToBigWig.
-
Input
- Coverage files, sorted (
.bg
); from sort_bed_4_big - Chromosome length table
chrNameLength.txt
; from create_index_star
- Coverage files, sorted (
-
Output
- Coverage files, one per strand and sample (
.bw
); used in finish
- Coverage files, one per strand and sample (
calculate_TIN_scores
Calculates the Transcript Integrity Number (TIN) for each transcript with custom script based on RSeqC.
TIN is conceptually similar to RIN (RNA integrity number) but provides transcript-level measurement of RNA quality and is more sensitive for low-quality RNA samples.
- TIN score of a transcript reflects RNA integrity of the transcript
- Median TIN score across all transcripts reflects RNA integrity of library
- TIN ranges from 0 (the worst) to 100 (the best)
- A TIN of 60 means that 60% of the transcript would have been covered if the read coverage were uniform
- A TIN of 0 will be assigned if the transcript has no coverage or coverage is below threshold
-
Input
- Alignments file (
.bam
); from map_genome_star - BAM index file (
.bam.bai
); from index_genomic_alignment_samtools - Transcript annotations file (12-column
.bed
); from extract_transcripts_as_bed12
- Alignments file (
-
Output
- TIN score table (custom
tsv
); used in merge_TIN_scores
- TIN score table (custom
merge_TIN_scores
Merges TIN score tables for all samples with custom script.
-
Input
- TIN score table (custom
tsv
); per sample; from calculate_TIN_scores
- TIN score table (custom
-
Output
- TIN score table (custom
tsv
); for all samples; used in plot_TIN_scores
- TIN score table (custom
plot_TIN_scores
Generate sample-wise box plots of TIN scores with custom script.
-
Input
- TIN score table (custom
tsv
); for all samples; from merge_TIN_scores
- TIN score table (custom
-
Output
- TIN score box plots (
.pdf
and.png
); used in multiqc_report
- TIN score box plots (
salmon_quantmerge_genes
Merge gene-level expression estimates for all samples with Salmon.
Rule is run once per sequencing mode
-
Input
- Gene expression tables (custom
.tsv
) for samples of same sequencing mode; from quantification_salmon
- Gene expression tables (custom
-
Output
- Gene TPM table (custom
.tsv
); used in multiqc_report - Gene read count table (custom
.tsv
); used in multiqc_report
- Gene TPM table (custom
salmon_quantmerge_transcripts
Merge transcript-level expression estimates for all samples with Salmon.
Rule is run once per sequencing mode
-
Input
- Transcript expression tables (custom
.tsv
) for samples of same sequencing mode; from quantification_salmon
- Transcript expression tables (custom
-
Output
- Transcript TPM table (custom
.tsv
); used in multiqc_report - Transcript read count table (custom
.tsv
); used in multiqc_report
- Transcript TPM table (custom
kallisto_merge_genes
Merge gene-level expression estimates for all samples with custom script.
Rule is run once per sequencing mode
-
Input
- Transcript expression tables (custom
.h5
) for samples of same sequencing mode; from genome_quantification_kallisto - Gene annotation file (custom
.gtf
)
- Transcript expression tables (custom
-
Output
- Gene TPM table (custom
.tsv
) - Gene read count table (custom
.tsv
) - Mapping gene/transcript IDs table (custom
.tsv
)
- Gene TPM table (custom