CJHerrmann
--- a/pipeline_steps.md → pipeline_documentation.md

+ 52

− 49
+++ b/pipeline_steps.md → pipeline_documentation.md

 @@ -9,31 +9,32 @@
 * **create_index_salmon**
 * **create_index_kallisto**
 * **extract_transcripts_as_bed12**
+* **index_genomic_alignment_samtools**
+* **star_rpm**
 * **calculate_TIN_scores**
 * **salmon_quantmerge_genes**
 * **salmon_quantmerge_transcripts**
-### sequencing mode specific
+### Sequencing mode specific
-* **pe_fastqc**
+* **(pe_)fastqc**
-* **pe_remove_adapters_cutadapt**
+* **(pe_)remove_adapters_cutadapt**
-* **pe_remove_polya_cutadapt**
+* **(pe_)remove_polya_cutadapt**
-* **pe_map_genome_star**
+* **(pe_)map_genome_star**
-* **pe_index_genomic_alignment_samtools**
+* **(pe_)quantification_salmon**
-* **pe_quantification_salmon**
+* **(pe_)genome_quantification_kallisto**
-* **pe_genome_quantification_kallisto**
-* **star_rpm_paired_end**
 ## Detailed description of steps
-The pipeline consists of three snakefiles: The main Snakefile contains some general rules for the creation of indices and rules that deal with summary steps and combining of results across samples of the run. For single-end and paired-end sequencing samples there are two separate sub-snakefiles, as parameters to individual tools differ between the sequencing modes.    
+The pipeline consists of three snakefiles: A main Snakefile and an individual Snakefile for each sequencing mode (single-end and paired-end), as parameters to individual tools differ between the sequencing modes. The main Snakefile contains some general rules for the creation of indices, rules that are applicable to both sequencing modes, and rules that deal with summary steps and combining results across samples of the run.     
-Individual rules of the pipeline will be described shortly, and links to the respective software manuals are given. If parameters can be influenced by the user (via the input samples table) they will also be described.
+Individual rules of the pipeline are described briefly, and links to the respective software manuals are given. If parameters can be influenced by the user (via the samples table) they are also described.
-Description of paired- and single-end rules are combined, only differences will be highlighted.
+Descriptions of paired- and single-end rules are combined; only differences are highlighted.
 ### General
 #### read samples table
-The input samples table will be read.    
 Requirements: 
 * tab separated file
 * first row has to contain parameter names as in  [samples.tsv](tests/input_files/samples.tsv)
 @@ -71,20 +72,24 @@ fq2_polya | stretch of As or Ts, depending on read orientation; for cutadapt (ty
 Currently not implemented as a Snakemake rule, but as a general statement.
 #### create_index_star
-Create index for STAR alignments. Supply the reference genome sequences (FASTA files) and annotations (GTF file), from which STAR generates genome indexes that are utilized in the 2nd (mapping) step. The genome indexes are saved to disk and need only be generated once for each genome/annotation/index size combination. [STAR manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf)    
+Create index for STAR alignments. Supply the reference genome sequences (FASTA files) and annotations (GTF file), from which STAR generates genome indexes that are utilized in the 2nd (mapping) step. The genome indexes are saved to disk and need only be generated once for each genome/annotation/index size combination. [STAR manual](http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STAR.posix/doc/STARmanual.pdf#section.2)    
 **Input:** genome fasta file, gtf file    
 **Parameters:** sjdbOverhang (This is the `index_size` specified in the samples table).    
 **Output:** chrNameLength.txt will be used for STAR mapping; chrName.txt
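
As a rough illustration of the index step, the following Python sketch assembles a STAR `genomeGenerate` command line. The function name, file names and the value 75 are hypothetical placeholders, not taken from the pipeline; the flags themselves are standard STAR options, and the actual Snakemake rule may pass additional ones.

```python
import shlex


def star_index_command(genome_fasta, gtf, index_dir, index_size, threads=1):
    """Build the argument list for a STAR genome index run.

    index_size corresponds to the `index_size` column of the samples table
    and is passed to STAR as --sjdbOverhang.
    """
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", index_dir,
        "--genomeFastaFiles", genome_fasta,
        "--sjdbGTFfile", gtf,
        "--sjdbOverhang", str(index_size),
        "--runThreadN", str(threads),
    ]


# Hypothetical inputs, for illustration only.
cmd = star_index_command("genome.fa", "annotation.gtf", "star_index_75", 75)
print(shlex.join(cmd))
```

Because the index depends on the genome, the annotation and the index size, a separate output directory per combination (as in the hypothetical `star_index_75`) avoids rebuilding an index that already exists.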
 #### extract_transcriptome
-> TODO
+Create transcriptome from genome and gene annotations using [gffread](https://github.com/gpertea/gffread).    
+**Input:** `genome` and `gtf` of the input samples table    
+**Output:** transcriptome fasta file.    
 #### create_index_salmon
-Create index for Salmon quantification. If you want to use Salmon in mapping-based mode, then you first have to build a salmon index for your transcriptome. This will build the mapping-based index, using an auxiliary k-mer hash over k-mers of length 31. While the mapping algorithms will make use of arbitrarily long matches between the query and reference, the k size selected here will act as the minimum acceptable length for a valid match. Thus, a smaller value of k may slightly improve sensitivty. We find that a k of 31 seems to work well for reads of 75bp or longer, but you might consider a smaller k if you plan to deal with shorter reads. [Salmon manual](https://salmon.readthedocs.io/en/latest/salmon.html)   
+Create index for [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html) quantification. If you want to use Salmon in mapping-based mode, then you first have to build a Salmon index for your transcriptome. This will build the mapping-based index, using an auxiliary k-mer hash over k-mers of length 31. While the mapping algorithms will make use of arbitrarily long matches between the query and reference, the k size selected here will act as the minimum acceptable length for a valid match. Thus, a smaller value of k may slightly improve sensitivity. A k of 31 seems to work well for reads of 75bp or longer, but you might consider a smaller k if you plan to deal with shorter reads.   
 **Input:** transcriptome fasta file for transcripts to be quantified    
 **Parameters:** kmer length (`kmer` in the input samples table).    
 @@ -92,19 +97,40 @@ Create index for Salmon quantification. If you want to use Salmon in mapping-bas
 #### create_index_kallisto
-Create index for Kallisto quantification. Similar to salmon index described above. The default kmer size of 31 is used in this pipeline and thus not adaptable by the user. [Kallisto manual](https://pachterlab.github.io/kallisto/manual).    
+Create index for [Kallisto](https://pachterlab.github.io/kallisto/manual) quantification. Similar to the Salmon index described above. The default kmer size of 31 is used in this pipeline and cannot be changed by the user.       
 **Input:** transcriptome fasta file for transcripts to be quantified    
 **Output:** kallisto index, used for kallisto quantification.    
 #### extract_transcripts_as_bed12
-Convert transcripts from gtf to bed12 format. This is needed for the TIN score calculation and doesn't require any parameters.    
+Convert transcripts from gtf to bed12 format. This is needed for the TIN score calculation and doesn't require any parameters. [GitLab repository](https://git.scicore.unibas.ch/zavolan_group/tools/gtf_transcript_type_to_bed12/)    
 **Input:** gtf file    
 **Output:** "full_transcripts_protein_coding.bed"    
+#### index_genomic_alignment_samtools
+Index the genomic alignment with [samtools index](http://quinlanlab.org/tutorials/samtools/samtools.html#samtools-index). Indexing a genome sorted BAM file allows one to quickly extract alignments overlapping particular genomic regions. Moreover, indexing is required by genome viewers such as IGV so that the viewers can quickly display alignments in each genomic region to which you navigate.    
+Needed for TIN score calculation and bedgraph coverage calculation.    
+**Input:** bam file    
+**Output:** bam.bai index file    
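
A minimal sketch of how the index command and its expected output path relate, assuming the default behaviour of `samtools index` (the index is written next to the BAM file with `.bai` appended). The function name and the example path are hypothetical.

```python
from pathlib import Path


def samtools_index_command(bam):
    """Return the samtools command and the index path it is expected to create."""
    bam = Path(bam)
    # By default, samtools index writes <file>.bam.bai alongside the BAM file.
    bai = bam.with_name(bam.name + ".bai")
    return ["samtools", "index", str(bam)], bai


# Hypothetical path, for illustration only.
index_cmd, index_path = samtools_index_command("results/sample1/sample1.sorted.bam")
print(index_cmd, index_path)
```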
+#### star_rpm
+Create stranded bedgraph coverage with STAR's RPM normalisation.
+Described [here](https://ycl6.gitbooks.io/rna-seq-data-analysis/visualization.html)    
+**Input:** .bam, .bam.bai index    
+**Output:** coverage bedGraphs 
+**Arguments not influenceable by the user:**   
+--outWigStrand "Stranded"    
+--outWigNorm "RPM"  
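
The coverage step can be sketched as a STAR call that reads an existing BAM file instead of mapping reads. This is only an illustration: the function name, the BAM path and the output prefix are hypothetical, and the flag spellings follow the STAR manual (the strand option there is `--outWigStrand`).

```python
def star_rpm_command(bam, out_prefix):
    """Build a STAR call that converts a sorted BAM into stranded,
    RPM-normalised bedGraph coverage tracks."""
    return [
        "STAR",
        "--runMode", "inputAlignmentsFromBAM",
        "--inputBAMfile", bam,
        "--outWigType", "bedGraph",
        "--outWigStrand", "Stranded",  # fixed by the pipeline, not user-adjustable
        "--outWigNorm", "RPM",         # fixed by the pipeline, not user-adjustable
        "--outFileNamePrefix", out_prefix,
    ]


# Hypothetical paths, for illustration only.
rpm_cmd = star_rpm_command("sample1.sorted.bam", "coverage/sample1_")
print(" ".join(rpm_cmd))
```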
 #### calculate_TIN_scores
 Given a set of BAM files and a gene annotation BED file, calculates the Transcript Integrity Number (TIN) for each transcript. [GitLab repository](https://git.scicore.unibas.ch/zavolan_group/tools/tin_score_calculation). TIN is conceptually similar to RIN (RNA integrity number) but provides a transcript-level measurement of RNA quality and is more sensitive for measuring low-quality RNA samples:
 @@ -116,6 +142,8 @@ Given a set of BAM files and a gene annotation BED file, calculates the Transcri
 **Input:** aligned reads.bam.bai, "full_transcripts_protein_coding.bed"         
 **Output:** TIN score tsv file
 #### salmon_quantmerge_genes
 Merge the salmon quantification *gene* results for all samples of the same sequencing mode into a single file. This is done separately for TPM and number of reads.    
 @@ -133,10 +161,9 @@ Merge the salmon quantification *transcript* results for all samples of same seq
 #### (pe_)fastqc
 [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.   
-**Input:** raw fastq files    
+**Input:** raw fastq file(s)    
 **Output:** fastqc report (.txt) and several figures (.png)
-*Same for single- and paired-end.*
 #### (pe_)remove_adapters_cutadapt
 @@ -150,7 +177,7 @@ Merge the salmon quantification *transcript* results for all samples of same seq
 **Arguments not influenceable by the user:**        
 -e 0.1  maximum error-rate of 10%    
 -j 8    use 8 threads    
-m 10   Discard processed reads that are shorter than 10
+-m 10   Discard processed reads that are shorter than 10    
 -n 3    search for all the given adapter sequences repeatedly, either until no adapter match was found or until 3 rounds have been performed.    
 *paired end:*    
 @@ -160,13 +187,13 @@ Merge the salmon quantification *transcript* results for all samples of same seq
 -O 1    minimal overlap of 1
 #### (pe_)remove_polya_cutadapt
-Here, cutadapt is used to remove poly(A) tails. 
+Here, [Cutadapt](https://cutadapt.readthedocs.io/en/stable/) is used to remove poly(A) tails. 
 **Input:** fastq reads    
 **Parameters:** Adapters to be removed, specified by user in the columns 'fq1_polya', 'fq2_polya', respectively.    
 **Output:** fastq files with poly(A) tails removed, reads shorter than 10nt will be discarded. 
-**Arguments like above and additionally:**    
+**Arguments as in remove_adapters_cutadapt, and additionally:**    
 --match-read-wildcards This option allows wildcard characters to be matched within reads as well. It is needed because "XXXXXX" is specified in the samples table when no tail should be trimmed; this does not match any nucleotides, so nothing is done here.    
 -n 2    search for all the given adapter sequences repeatedly, either until no adapter match was found or until 2 rounds have been performed.    
 -q 6    trim low-quality 3' ends with a quality cutoff of 6    
 @@ -179,9 +206,7 @@ Here, cutadapt is used to remove poly(A) tails.
 -O 1    minimal overlap of 1
 #### (pe_)map_genome_star
-Spliced Transcripts Alignment to a Reference    
+Spliced Transcripts Alignment to a Reference (STAR). Read the [publication](https://www.ncbi.nlm.nih.gov/pubmed/23104886) or check out the [STAR manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf).    
-[Publication](https://www.ncbi.nlm.nih.gov/pubmed/23104886).    
-[STAR manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf).    
 **Input:** STAR_index, reads as .fastq.gz    
 **Parameters:**    
 @@ -207,14 +232,7 @@ Spliced Transcripts Alignment to a Reference
 *Same for single- and paired-end.*
-#### (pe_)index_genomic_alignment_samtools
-Index the genomic alignment with [samtools index](http://quinlanlab.org/tutorials/samtools/samtools.html#samtools-index). Indexing a genome sorted BAM file allows one to quickly extract alignments overlapping particular genomic regions. Moreover, indexing is required by genome viewers such as IGV so that the viewers can quickly display alignments in each genomic region to which you navigate.    
-Needed for TIN score calculation and bedgraph coverage calculation.    
-**Input:** bam file    
-**Output:** bam.bai index file    
-*Same for single- and paired-end.*
 #### (pe_)quantification_salmon
 [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html) is a tool for wicked-fast transcript quantification from RNA-seq data.
 @@ -246,7 +264,7 @@ Needed for TIN score calculation and bedgraph coverage calculation.
 * .fastq.gz reads, adapters and poly(A)tails removed.
 * kallisto index, from **create_index_kallisto** 
-**Parameters:** directionality, `kallisto_directionality` from samples table    
+**Parameters:** directionality, which is `kallisto_directionality` from the samples table    
 **Output:** Pseudoalignment .sam file
 @@ -254,18 +272,3 @@ Needed for TIN score calculation and bedgraph coverage calculation.
 * -l: fragment length, user specified as `mean`
 * -s: fragment length SD, user specified as `sd` 
-#### star_rpm_paired_end
-Create stranded bedgraph coverage with STARs RPM normalisation.
-Described [here](https://ycl6.gitbooks.io/rna-seq-data-analysis/visualization.html)    
-**Input:** .bam, .bam.bai index
-**Output:** coverage bedGraphs 
-**Arguments not influencable by user:**   
--outWigStrans "Stranded"    
--outWigNorm "RPM"  
-*Same for single- and paired-end.*
 \ No newline at end of file