From d57020176ee38542cf5b29bb3f63638fae5a72ac Mon Sep 17 00:00:00 2001
From: burri0000 <dominik.burri@unibas.ch>
Date: Thu, 12 Mar 2020 13:42:02 +0100
Subject: [PATCH] added ALFA documentation fixed typos

---
 pipeline_documentation.md | 42 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/pipeline_documentation.md b/pipeline_documentation.md
index dcd55ca..18a6488 100644
--- a/pipeline_documentation.md
+++ b/pipeline_documentation.md
@@ -12,9 +12,13 @@ This document describes the individual rules of the pipeline for information pur
 * **extract_transcripts_as_bed12**
 * **index_genomic_alignment_samtools**
 * **star_rpm**
+* **rename_star_rpm_for_alfa**
 * **calculate_TIN_scores**
 * **salmon_quantmerge_genes**
 * **salmon_quantmerge_transcripts**
+* **generate_alfa_index**
+* **alfa_qc**
+* **alfa_qc_all_samples**
 
 ### Sequencing mode specific
 * **(pe_)fastqc**
@@ -123,6 +127,9 @@ Needed for TIN score calculation and bedgraph coverage calculation.
 Create stranded bedgraph coverage with STAR's RPM normalisation.
 Described [here](https://ycl6.gitbooks.io/rna-seq-data-analysis/visualization.html)
 
+STAR RPM uses SAM flags to determine where a read and its mate mapped. That is, if read1 maps to the plus strand, then read2 maps to the minus strand, and STAR counts both read1 and read2 towards the plus strand.
+This is in contrast to `bedtools genomecov -bg -split`, where reads are assigned to their respective strand irrespective of their mates.
+
 **Input:** .bam, .bam.bai index
 **Output:** coverage bedGraphs
 
@@ -131,6 +138,14 @@ Described [here](https://ycl6.gitbooks.io/rna-seq-data-analysis/visualization.ht
 --outWigNorm "RPM"
 
 
+#### rename_star_rpm_for_alfa
+Local rule to rename and copy the stranded bedgraph coverage tracks such that they comply with [ALFA](https://github.com/biocompibens/ALFA).
+The renaming to `plus.bg` and `minus.bg` depends on the library orientation, which is provided by the user in `kallisto_directionality`.
+
+
+**Input:** .bg coverage tracks
+**Output:** renamed and copied bedgraph files
+
 #### calculate_TIN_scores
 Given a set of BAM files and a gene annotation BED file, calculates the Transcript Integrity Number (TIN) for each transcript. [GitLab repository](https://git.scicore.unibas.ch/zavolan_group/tools/tin_score_calculation).
 TIN is conceptually similar to RIN (RNA integrity number) but provides transcript level measurement of RNA quality and is more sensitive to measure low quality RNA samples:
@@ -158,6 +173,33 @@ Merge the salmon quantification *transcript* results for all samples of same seq
 **Output:** Two tsv files for transcript quantifications, one for tpm and one for number of reads.
 
 
+#### generate_alfa_index
+Create the ALFA index files needed to run [ALFA](https://github.com/biocompibens/ALFA) for a given organism.
+
+**Input:** .gtf genome annotation, chrNameLength.txt file containing chromosome names and lengths
+**Output:** two ALFA index files, one stranded and one unstranded
+
+
+#### alfa_qc
+Run [ALFA](https://github.com/biocompibens/ALFA) from stranded bedgraph tracks.
+The library orientation is required as either *fr-firststrand* or *fr-secondstrand*. Currently, the values from `kallisto_directionality` are re-used.
+
+ALFA counts features in the bedgraph coverage tracks using the library orientation and the ALFA index files. The counts are stored in `ALFA_feature_counts.tsv`.
+
+The main outputs of ALFA are two plots, `ALFA_Biotypes.pdf` and `ALFA_Categories.pdf`. They display the nucleotide distributions among the different features and their enrichment. For details see the [ALFA documentation](https://github.com/biocompibens/ALFA).
+
+
+**Input:** the renamed .bg files (suffixed with `out.plus.bg` and `out.minus.bg`), library orientation, the stranded ALFA index file
+**Output:** ALFA_Biotypes.pdf and ALFA_Categories.pdf; ALFA_feature_counts.tsv containing the table for the plots
+
+
+#### alfa_qc_all_samples
+Combine the output of all samples into one plot generated by [ALFA](https://github.com/biocompibens/ALFA).
+
+**Input:** ALFA_feature_counts.tsv from each sample in `samples.tsv`
+**Output:** ALFA_Biotypes.pdf and ALFA_Categories.pdf for all samples together
+
+
 ### Sequencing mode specific rules
 #### (pe_)fastqc
 [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
-- 
GitLab