diff --git a/README.md b/README.md
index f5185f2362196156a38818a0a132d25bf28fffa7..695871cb7aa34d6847925864c5b36bf19b6736ac 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,11 @@
# _MIRFLOWZ_

-Suite of [Snakemake][snakemake] workflows for the mapping and quantification
+[Snakemake][snakemake] workflow for the mapping and quantification
of smallRNA-seq libraries, including miRNA and isomiR quantification.

## Installation

-All workflows live inside this repository and will be available for you to run
-after following the installation instructions layed out in this section.
+The whole workflow lives inside this repository and will be available for you
+to run after following the installation instructions laid out in this section.

### Cloning the repository

@@ -20,9 +20,9 @@ cd mirflowz

### Dependencies

-For improved reproducibility and reusability of the workflows, as well as an
-easy means to run them on a high performance computing (HPC) cluster managed,
-e.g., by [Slurm][slurm], all steps of the workflows run inside their own
+For improved reproducibility and reusability of the workflow, as well as an
+easy means to run it on a high-performance computing (HPC) cluster managed,
+e.g., by [Slurm][slurm], all steps of the workflow run inside their own
containers. As a consequence, running this workflow has very few individual
dependencies. It does, however, require the package manager [Conda][conda] and
the container engine [Singularity][singularity] to be installed before you
@@ -36,7 +36,7 @@ proceed.

### Setting up a virtual environment

-It you do not already have [Conda][conda] installed globally on your system,
-we recommend that you install [Miniconda][miniconda]. For faster creation of
+If you do not already have [Conda][conda] installed globally on your system,
+we recommend that you install [Miniconda][miniconda-installation]. For faster creation of
the environment (and Conda environments in general), you can also install
[Mamba][mamba] on top of Conda. In that case, replace `conda` with `mamba` in
the commands below (particularly in `conda env create`).

@@ -55,7 +55,7 @@ the instructions in this section to make sure the workflows are ready to use.

-#### Run test workflows on local machine
+#### Run test workflow on local machine

-Execute the following command to run the test workflows on your local machine:
+Execute the following command to run the test workflow on your local machine:

```bash
bash test/test_workflow_local.sh
@@ -63,7 +63,7 @@ bash test/test_workflow_local.sh

-#### Run test workflows via Slurm
+#### Run test workflow via Slurm

-Execute the following command to run the test workflows on a Slurm-managed
+Execute the following command to run the test workflow on a Slurm-managed
high-performance computing (HPC) cluster:

```bash
@@ -72,12 +72,12 @@ bash test/test_workflow_slurm.sh

> **NOTE:** The Slurm tests were configured to run on the developer's cluster.
> Several files may need to be modified if you would like to run tests (and
-> the actual workflows) on other systems. These may possibly include the
+> the actual workflow) on other systems. These may possibly include the
> following (relative to the repository root directory), but potentially others
> as well:
>
> * `jobscript.sh`
-> * `RUNS/JOB/{prepare,map,quantify}/cluster.json`
+> * `test/cluster.json`
> * `test/test_workflow_slurm.sh`
>
> Consult the manual of
@@ -86,8 +86,8 @@ bash test/test_workflow_slurm.sh

#### DAG and rule graph

-Execute the following commands to generate DAG and rule graph images for each
-workflow. Outputs will be found in the `images/` directory in the repository
+Execute the following commands to generate DAG and rule graph images for the
+workflow. The output will be found in the `images/` directory in the repository
root.
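+
+For orientation, such graph images are typically produced with Snakemake's
+built-in graph exports piped through Graphviz. The sketch below is
+illustrative only (the `Snakefile` path is an assumption, and the repository's
+test scripts remain authoritative):
+
+```bash
+# export the job DAG and the condensed rule graph as SVG images (requires Graphviz)
+snakemake --snakefile snakemake/Snakefile --dag | dot -Tsvg > images/dag.svg
+snakemake --snakefile snakemake/Snakefile --rulegraph | dot -Tsvg > images/rule_graph.svg
+```
+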
> **NOTE:** It is essential that you run the DAG and rule graph tests only
@@ -110,86 +110,122 @@ bash test/test_cleanup.sh

## Usage

-Now that your virtual environment is set up and the workflows are deployed and
-tested, you can go ahead and run the workflows on your samples.
+Now that your virtual environment is set up and the workflow is deployed and
+tested, you can go ahead and run the workflow on your samples.

-But first, here is a brief description of what each of the three workflows
-does:
+But first, here is a brief description of what the workflow does:

### Workflow description

-The repository contains the following workflows, all implemented in Snakemake
-and fully containerized:
+The workflow consists of one main `Snakefile`
+that gathers three subworkflows,
+**_PREPARE_**, **_MAP_** and **_QUANTIFY_**.
+Each of them is implemented in an individual Snakemake file
+and fully containerized.

#### _PREPARE_

-The first workflow, **_PREPARE_** downloads and processes "genome resources"
+The first subworkflow, **_PREPARE_**, downloads and processes "genome resources"
from the publicly available repositories [Ensembl][ensembl] and
[miRBase][mirbase] according to your instructions. Resources are then processed
to prepare indexes and other contingent resources that will be used in later
steps.

-The scheme below is a visual representation of an example run of the
-**_PREPARE_** workflow:
-
-> ![rule-graph-prepare][rule-graph-prepare]
-
#### _MAP_

-The second workflow, **_MAP_** aligns the user-provided short read smallRNA-seq
-libraries against the references generated with the **_PREPARE_** workflow. For
+The second subworkflow, **_MAP_**, aligns the user-provided short-read smallRNA-seq
+libraries against the references generated with the **_PREPARE_** subworkflow. For
increased fidelity it uses two separate aligning tools, [Segemehl][segemehl] and
our in-house tool [Oligomap][oligomap]. In both cases, reads are aligned
separately to the genome and the transcriptome. Afterwards, alignments are
merged in a way that only the best alignment (or alignments) of each read are
kept.

-The scheme below is a visual representation of an example run of the **_MAP_**
-workflow:
-
-> ![rule-graph-map][rule-graph-map]
-
#### _QUANTIFY_

-The third and final workflow, **_QUANTIFY_** quantifies miRNA expression by
-intersecting the alignments from the **_MAP_** workflow with the annotations
-generated in the **_PREPARE_** workflow. Intersections are computed with
-[`bedtools`][bedtools] for one or multiple of mature, primary transcripts and
+The third and final subworkflow, **_QUANTIFY_**, quantifies miRNA expression by
+intersecting the alignments from the **_MAP_** subworkflow with the annotations
+generated in the **_PREPARE_** subworkflow. Intersections are computed with
+[`bedtools`][bedtools] for any or all of mature miRNAs, primary transcripts and
isomiRs. Reads consistent with each miRNA are counted and tabulated.
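+
+Conceptually, each such intersection boils down to a [`bedtools`][bedtools]
+call along the lines of the sketch below; the file names are made up for
+illustration, and this is not the workflow's literal command:
+
+```bash
+# report each alignment (-wa) together with the annotation feature it overlaps (-wb)
+bedtools intersect -a alignments.bed -b mirna_annotations.bed -wa -wb > intersections.tsv
+```
+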
-The scheme below is a visual representation of an example run of the
-**_QUANTIFY_** workflow:
-> ![rule-graph-quantify][rule-graph-quantify]
+The scheme below is a visual representation of
+an example run of the whole workflow:
+
+> ![rule-graph][rule-graph]

-### Running the workflows
+### Running the workflow

-Assuming that you are currently inside the repository's root directory, change
-to the run root directory:
+Assuming that you are currently inside the repository's root directory,
+create a directory from which you will run your workflow,
+name it whatever you want, e.g., `MY_ANALYSIS`,
+and change into it:

```bash
-cd RUNS
+mkdir MY_ANALYSIS
+cd MY_ANALYSIS
```

+Within the `MY_ANALYSIS/` directory,
+create a new directory to store your library;
+name it as you like, e.g., `my_lib`, and move into it:

-Now make a clean copy of the `JOB` directory and name it whatever you want,
-e.g., `MY_ANALYSIS`:
+```bash
+mkdir my_lib
+cd my_lib
+```
+
+Place your library file here.
+In addition, create a sample table and fill it with the correct entries.
+You can look at `test/test_files/sample_table.csv`
+to get an idea of what this file must contain.

```bash
-cp -r JOB MY_ANALYSIS
+touch sample_table.csv
```

-Now traverse to the new directory. You will see that there are three
-subdirectories, one for each workflow, change into the one you would like to
-run (probably `prepare`).
+Now traverse back to the previous directory,
+where a couple of things must be done.
+
+First of all, copy the `config_template.yaml` there as `config.yaml`:

```bash
-cd MY_ANALYSIS
-cd {prepare,map,quantify}
+cd ..
+cp ../config/config_template.yaml ./config.yaml
+```
+
+Then, using your editor of choice, adjust the parameters of `config.yaml`.
+The file explains what each of the parameters means and how you can
+meaningfully fill them in.
+
+Depending on how you want to run the workflow,
+you can either copy the script to run it **locally** with
+
+```bash
+cp ../test/test_workflow_local.sh ./run_workflow_local.sh
+```
+
+or copy the script to run the workflow on a **cluster via Slurm**,
+along with the `cluster.json` file, with
+
+```bash
+cp ../test/test_workflow_slurm.sh ./run_workflow_slurm.sh
+cp ../test/cluster.json ./cluster.json
+```
+
+In both cases, the final 9 lines must be removed; with GNU `head`, this can be
+done as follows (adjust the filename if you copied the Slurm script):
+
+```bash
+head -n -9 run_workflow_local.sh > tmp.sh && mv tmp.sh run_workflow_local.sh
```
+
+Finally, you can optionally copy the script
+to remove all artifacts generated by the run:

-Before running the workflow adjust the parameters in file `config.yaml`. The
-file explains what each of the parameters means and how you can meaningfully
-fill them in.
+```bash
+cp ../test/test_cleanup.sh ./run_cleanup.sh
+```

To start workflow execution, run:

@@ -207,149 +243,116 @@ To start workflow execution, run:
After successful execution of the workflow, results and logs will be found in
`results/` and `logs/` directories, respectively.
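+
+For reference, the copied run scripts are essentially thin wrappers around a
+Snakemake call. The sketch below is illustrative only; the `Snakefile` path
+and core count are assumptions, and the scripts shipped in `test/` remain
+authoritative:
+
+```bash
+# launch the workflow with containerized rules on 4 local cores
+snakemake \
+    --snakefile="../snakemake/Snakefile" \
+    --configfile="config.yaml" \
+    --use-singularity \
+    --cores=4 \
+    --printshellcmds
+```
+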
-### Appendix: Configuration files
+### Appendix: Configuration file

-_MIRFLOWZ_ comes with template configuration files for each individual
-workflow. These contain notes on how to fill in each parameter.
-
-#### _PREPARE_
+_MIRFLOWZ_ comes with a template configuration file for the whole workflow.
+This template contains notes on how to fill in each parameter.

-**File location:** `RUNS/JOB/prepare/config.yaml`
+**File location:** `config/config_template.yaml`

```yaml
---
-#### GLOBAL PARAMETERS #####
-
-# Directories
-# Usually there is no need to change these
-scripts_dir: "../../../scripts"
-output_dir: "results"
-local_log: "logs/local"
-cluster_log: "logs/cluster"
-
-# Isomirs annotation file
-# Number of base pairs to add/substract from 5' (start) and 3' (end) coordinates.
-bp_5p: [0] # array of numbers, e.g., [-2,-1,0,+1], to include 2 upstream and 1 downstream nts
-bp_3p: [0] # array of numbers, e.g., [-2,-1,0,+1], to include 2 upstream and 1 downstream nts
-
-# List of inputs
-organism: ["org/pre"] # e.g., ["homo_sapiens/GRCh38.100", "mus_musculus/GRCm37.98"]
-
-#### PARAMETERS SPECIFIC TO INPUTS ####
-
-org/pre: # One section for each list item in "organism"; names have to match precisely
-  # URLs to genome, gene & miRNA annotations
-  genome_url: # FTP/HTTP URL to gzipped genome in FASTA format, Ensembl style
-  gtf_url: # FTP/HTTP URL to gzipped gene annotations in GTF format, Ensembl style
-  mirna_url: # FTP/HTTP URL to unzipped microRNA annotations in GFF format, miRBase style
-
-  # Chromosome name mappings between UCSC <-> Ensembl
-  # Other organisms available at: https://github.com/dpryan79/ChromosomeMappings
-  map_chr_url: # FTP/HTTP URL to mapping table
-  # Chromosome name mapping parameters:
-  column: 1 # Column number from input file where to change chromosome name
-  delimiter: "TAB" # Delimiter of the input file
-...
-```
+#############################
+#### REQUIRED PARAMETERS ####
+#############################

-> **Note:** We expect the genome and gene annotations to be formatted according
-> the style used by Ensembl. Other formats are very likely to lead to problems,
-> if not in this pipeline, then further down the road in the mapping or
-> annotation pipelines. The miRNA annotation file is expected to originate from
-> miRBase, or follow their exact layout.
+# All paths are relative to the current working directory unless noted otherwise

-#### _MAP_
+samples: path/to/samples_table.csv

-**File location:** `RUNS/JOB/map/config.yaml`
+#### GENOME RESOURCES ####

-```yaml
----
-#### GLOBAL PARAMETERS ####
-
-# Directories
-# Usually there is no need to change these
-scripts_dir: "../../../scripts"
-output_dir: "results"
-local_log: "logs/local"
-cluster_log: "logs/cluster"
-
-# Resources: genome, transcriptome, genes, miRs
-# All of these are produced by the "prepare" workflow
-genome: "path/to/genome.processed.fa"
-gtf: "path/to/gene_annotations.filtered.gtf"
-transcriptome: "path/to/transcriptome_idtrim.fa"
-transcriptome_index_segemehl: "path/to/transcriptome_index_segemehl.idx"
-genome_index_segemehl: "path/to/genome_index_segemehl.idx"
-exons: "path/to/exons.bed"
-header_of_collapsed_fasta: "path/to/headerOfCollapsedFasta.sam"
-
-# Tool parameters: quality filter
-q_value: 10 # Q (Phred) score; minimum quality score to keep
-p_value: 50 # minimum % of bases that must have Q quality
+# All genome resources have to match the source/organism
+# of all samples in the sample table

-# Tool parameters: adapter removal
-error_rate: 0.1 # fraction of allowed errors
-minimum_length: 15 # discard processed reads shorter than the indicated length
-overlap: 3 # minimum overlap length of adapter and read to trim the bases
-max_n: 0 # discard reads containing more than the indicated number of N bases
+# FTP/HTTP URL to gzipped genome in FASTA format, Ensembl style
+genome_url: # e.g. "ftp://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"

-# Tool parameters: mapping
-max_length_reads: 30 # maximum length of processed reads to map with oligomap
-nh: 100 # discard reads with more mappings than the indicated number
+# FTP/HTTP URL to gzipped gene annotations in GTF format, Ensembl style
+gtf_url: # e.g. "ftp://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.chr.gtf.gz"
"ftp://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.chr.gtf.gz" -# Inputs information -input_dir: "path/to/input_directory" -sample: ["sample_1", "sample_2"] # put all sample names, separated by comma +# FTP/HTTP URL to unzipped microRNA annotations in GFF format, miRBase style +mirna_url: # e.g. "https://www.mirbase.org/ftp/CURRENT/genomes/hsa.gff3" -#### PARAMETERS SPECIFIC TO INPUTS #### +# Tab-separated mappings table between UCSC (column 1) +# and Ensembl (coulm 2) chromosome names +# Available at: https://github.com/dpryan79/ChromosomeMappings +map_chr_url: # e.g. "https://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh38_UCSC2ensembl.txt" -sample_1: # one section per list item in "sample"; names have to match - adapter: "XXXXXXXXXXXXXXXXXXXX" # 3' adapter sequence to trim - format: "fa" # file format; currently supported: "fa" -... -``` -#### _QUANTIFY_ +############################### +#### "OPTIONAL" PARAMETERS #### +############################### -**File location:** `RUNS/JOB/quantify/config.yaml` +# The below parameters only need to be changed if the default behavior of +# MIRFLOWZ is to be changed; however, they still need to be present! -```yaml ---- -#### GLOBAL PARAMETERS #### +#### DIRECTORIES #### + +output_dir: results/ +local_log: logs/local/ +cluster_log: logs/cluster/ +scripts_dir: ../scripts/ + + +#### ISOMIR GENERATION PARAMETERS #### + +# Generate isomiR annotations with the indicated number of shifts relative to +# the start and end position of each annotated mature miRNA, as an array of +# relative positions +# Examples: +# - `bp_5p: [-2,0,+1]` and `bp_3p: [+1]` generates 3 isomiRs for each mature +# miRNA: one that starts two nucleotides before, one that starts exactly at +# and one that starts one nucleotide after the annotated mature miRNA; all +# isomiRs will stop one nucleotide after the end of the annotated mature +# miRNA; note that because `0` is not included in the `bp_3p` array, the +# annotated mature miRNAs will not be included in this example +# - Use `bp_5p: [0]` and `bp_3p: [0]` to only include the mature annotated +# miRNAs and no isomiRs -# Directories -# Usually there is no need to change these -output_dir: "results" -scripts_dir: "../scripts" -local_log: "logs/local" -cluster_log: "logs/cluster" +bp_5p: [-2, -1, 0, +1, +2] +bp_3p: [-2, -1, 0, +1, +2] + +#### PROCESSING PARAMETERS #### + +# quality filter +q_value: 10 # Q (Phred) score; minimum quality score to keep +p_value: 50 # minimum % of bases that must have Q quality + +# adapter removal +error_rate: 0.1 # fraction of allowed errors +minimum_length: 15 # discard processed reads shorter than the indicated length +overlap: 3 # minimum overlap length of adapter and read to trim the bases +max_n: 0 # discard reads containing more than the indicated number of N bases + +# mapping +max_length_reads: 30 # maximum length of processed reads to map with oligomap +nh: 100 # discard reads with more mappings than the indicated number + +#### QUANTIFICATION PARAMETERS #### # Types of miRNAs to quantify # Remove miRNA types you are not interested in mir_list: ["miRNA", "miRNA_primary_transcript", "isomirs"] - -# Resources: miR annotations, chromosome name mappings -# All of these are produced by the "prepare" workflow -mirnas_anno: "path/to/mirna_filtered.bed" -isomirs_anno: "path/to/isomirs_annotation.bed" - -# Inputs information -input_dir: "path/to/input_directory" -sample: ["sample_1", "sample_2"] # put all samples, separated by comma ... 
+> **Note:** We expect the genome and gene annotations to be formatted according
+> to the style used by Ensembl. Other formats are very likely to lead to
+> problems. The miRNA annotation file is expected to originate from miRBase,
+> or follow their exact layout.
+

[bedtools]: <https://github.com/arq5x/bedtools2>
[conda]: <https://docs.conda.io/projects/conda/en/latest/index.html>
-[cluster execution]: <https://snakemake.readthedocs.io/en/stable/executing/cluster-cloud.html#cluster-execution>
+[cluster execution]: <https://snakemake.readthedocs.io/en/stable/executing/cluster.html>
[ensembl]: <https://ensembl.org/>
[mamba]: <https://github.com/mamba-org/mamba>
[miniconda-installation]: <https://docs.conda.io/en/latest/miniconda.html>
[mirbase]: <https://mirbase.org/>
[oligomap]: <https://bio.tools/oligomap>
-[rule-graph-map]: images/rule_graph_map.svg
-[rule-graph-prepare]: images/rule_graph_prepare.svg
-[rule-graph-quantify]: images/rule_graph_quantify.svg
+[rule-graph]: images/rule_graph.svg
[segemehl]: <https://www.bioinf.uni-leipzig.de/Software/segemehl/>
[singularity]: <https://sylabs.io/singularity/>
[slurm]: <https://slurm.schedmd.com/documentation.html>