MIRFLOWZ
A suite of Snakemake workflows for mapping and quantifying small RNA-seq libraries, including miRNA and isomiR quantification.
Installation
All workflows live inside this repository and will be available for you to run after following the installation instructions laid out in this section.
Cloning the repository
Traverse to the desired path on your file system, then clone the repository and change into it with:
git clone ssh://git@git.scicore.unibas.ch:2222/zavolan_group/pipelines/mirflowz.git
cd mirflowz
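If you do not have SSH access to the server configured, cloning over HTTPS usually works as well. The HTTPS URL below is inferred from the SSH address following common GitLab conventions; verify it on the hosting platform if the clone fails:
# HTTPS clone; URL inferred from the SSH remote above, not taken from the docs
git clone https://git.scicore.unibas.ch/zavolan_group/pipelines/mirflowz.git
cd mirflowz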
Dependencies
For improved reproducibility and reusability of the workflows, as well as an easy means to run them on a high-performance computing (HPC) cluster managed, e.g., by Slurm, all steps of the workflows run inside their own containers. As a consequence, running these workflows has very few individual dependencies. It does, however, require the package manager Conda and the container engine Singularity to be installed before you proceed.
NOTE: If you have root permissions on your system and you do not already have Singularity installed globally, you can conveniently use Conda to install it. In that case, replace environment.yml with environment.root.yml in the first command below.
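To confirm that both dependencies are available before proceeding, you can query their versions; either command failing indicates that the corresponding tool still needs to be installed:
conda --version        # package manager used to create the environment
singularity --version  # container engine used to run the workflow steps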
Setting up a virtual environment
If you do not already have Conda installed globally on your system, we recommend that you install Miniconda. For faster creation of the environment (and of Conda environments in general), you can also install Mamba on top of Conda. In that case, replace conda with mamba in the commands below (particularly in conda env create).
Create and activate the environment with necessary dependencies with Conda:
conda env create -f environment.yml
conda activate mirflowz
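As a quick sanity check, you can verify that Snakemake, which we expect the environment to provide since it runs the workflows, is now available:
# Should print a version number; if not, re-check the environment creation step
snakemake --version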
Testing your installation
Several tests are provided to check the integrity of the installation. Follow the instructions in this section to make sure the workflows are ready to use.
Run test workflows on local machine
Execute the following command to run the test workflows on your local machine:
bash test/test_workflow_local.sh
Run test workflows via Slurm
Execute the following command to run the test workflows on a Slurm-managed high-performance computing (HPC) cluster:
bash test/test_workflow_slurm.sh
NOTE: The Slurm tests are configured to run on the developer's cluster. Several files may need to be modified if you would like to run the tests (and the actual workflows) on other systems. These likely include the following (paths relative to the repository root directory), but potentially others as well:
jobscript.sh
RUNS/JOB/{prepare,map,quantify}/cluster.json
test/test_workflow_slurm.sh
Consult the manual of your batch scheduling system, as well as the section of the Snakemake manual dealing with cluster execution.
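For orientation, a cluster submission in Snakemake's classic cluster-config style, which the files listed above suggest is used here, looks roughly like the following sketch; the Snakefile path and the sbatch resource fields are placeholders to adapt to your system, not values taken from this repository:
# Sketch of a Slurm submission via Snakemake's cluster-config mechanism;
# all paths and resource placeholders are illustrative
snakemake \
    --snakefile path/to/Snakefile \
    --cluster-config cluster.json \
    --jobscript jobscript.sh \
    --cluster "sbatch --time={cluster.time} --mem={cluster.mem}" \
    --use-singularity \
    --cores 4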
DAG and rule graph
Execute the following commands to generate DAG and rule graph images for each workflow. Outputs will be placed in the images/ directory in the repository root.
NOTE: Run the DAG and rule graph tests only after running the test workflows, as they require files that are only created during those test runs.
bash test/test_dag.sh
bash test/test_rule_graph.sh
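These scripts wrap standard Snakemake functionality; if you prefer, you can render the graphs by hand with Snakemake's built-in --dag and --rulegraph options piped through Graphviz (the Snakefile path is a placeholder):
# Render the job DAG and the rule graph as SVG; requires Graphviz (dot)
snakemake --snakefile path/to/Snakefile --dag | dot -Tsvg > images/dag.svg
snakemake --snakefile path/to/Snakefile --rulegraph | dot -Tsvg > images/rule_graph.svg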
Clean up test results
After successfully running the tests above, you can run the following command to remove all artifacts generated by the test runs:
bash test/test_cleanup.sh
Usage
Now that your virtual environment is set up and the workflows are deployed and tested, you can go ahead and run the workflows on your samples.
But first, here is a brief description of what each of the three workflows does:
Workflow description
The repository contains the following workflows, all implemented in Snakemake and fully containerized:
PREPARE
The first workflow, PREPARE, downloads and processes "genome resources" from the publicly available repositories Ensembl and miRBase according to your instructions. The resources are then processed to prepare indexes and other derived files that will be used in later steps.
The scheme below is a visual representation of an example run of the PREPARE workflow:
MAP
The second workflow, MAP, aligns the user-provided short-read small RNA-seq libraries against the references generated with the PREPARE workflow. For increased fidelity, it uses two separate alignment tools, Segemehl and our in-house tool Oligomap. In both cases, reads are aligned separately to the genome and the transcriptome. Afterwards, alignments are merged such that only the best alignment (or alignments) of each read is kept.
The scheme below is a visual representation of an example run of the MAP workflow:
QUANTIFY
The third and final workflow, QUANTIFY, quantifies miRNA expression by intersecting the alignments from the MAP workflow with the annotations generated in the PREPARE workflow. Intersections are computed with bedtools for one or more of: mature miRNAs, primary transcripts, and isomiRs. Reads consistent with each miRNA are counted and tabulated.
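Conceptually, the intersection step resembles the following bedtools call; the file names are placeholders and the exact options used by the workflow may differ:
# Report each alignment (-wa) together with the annotation it overlaps (-wb),
# requiring matching strands (-s); file names are illustrative only
bedtools intersect -a alignments.bed -b mirna_annotations.bed -wa -wb -s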
The scheme below is a visual representation of an example run of the QUANTIFY workflow:
Running the workflows
Assuming that you are currently inside the repository's root directory, change to the run root directory:
cd RUNS
Now make a clean copy of the JOB directory and name it whatever you want, e.g., MY_ANALYSIS:
cp -r JOB MY_ANALYSIS
Now traverse to the new directory. You will see that there are three subdirectories, one for each workflow; change into the one you would like to run (probably prepare):
cd MY_ANALYSIS
cd {prepare,map,quantify}
Before running the workflow, adjust the parameters in the file config.yaml. The file explains what each parameter means and how to fill it in meaningfully.
To start workflow execution, run:
./run_workflow_slurm.sh
NOTE: Check back in the installation section for more information on how to run the workflows on your HPC system. Although we do provide a workflow runner (run_workflow_local.sh) to execute the workflows locally on your laptop or desktop machine, we recommend against that for real-world data, as the resource requirements are very high (possibly more than 50 GB of memory!).
After successful execution of the workflow, results and logs can be found in the results/ and logs/ directories, respectively.
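While a Slurm run is in progress, you can monitor the submitted jobs and follow the logs as they are written; the log file name below is schematic, as the actual names depend on the workflow and its rules:
# Monitor your Slurm jobs and follow a cluster log (file name is illustrative)
squeue -u "$USER"
tail -f logs/cluster/some_rule.log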
Appendix: Configuration files
MIRFLOWZ comes with template configuration files for each individual workflow. These contain notes on how to fill in each parameter.
PREPARE
File location: RUNS/JOB/prepare/config.yaml
---
#### GLOBAL PARAMETERS ####
# Directories
# Usually there is no need to change these
scripts_dir: "../../../scripts"
output_dir: "results"
local_log: "logs/local"
cluster_log: "logs/cluster"
# Isomirs annotation file
# Number of base pairs to add/subtract from the 5' (start) and 3' (end) coordinates.
bp_5p: [0] # array of numbers, e.g., [-2,-1,0,+1], to include 2 upstream and 1 downstream nts
bp_3p: [0] # array of numbers, e.g., [-2,-1,0,+1], to include 2 upstream and 1 downstream nts
# List of inputs
organism: ["org/pre"] # e.g., ["homo_sapiens/GRCh38.100", "mus_musculus/GRCm37.98"]
#### PARAMETERS SPECIFIC TO INPUTS ####
org/pre: # One section for each list item in "organism"; names have to match precisely
# URLs to genome, gene & miRNA annotations
genome_url: # FTP/HTTP URL to gzipped genome in FASTA format, Ensembl style
gtf_url: # FTP/HTTP URL to gzipped gene annotations in GTF format, Ensembl style
mirna_url: # FTP/HTTP URL to unzipped microRNA annotations in GFF format, miRBase style
# Chromosome name mappings between UCSC <-> Ensembl
# Other organisms available at: https://github.com/dpryan79/ChromosomeMappings
map_chr_url: # FTP/HTTP URL to mapping table
# Chromosome name mapping parameters:
column: 1 # Column number from input file where to change chromosome name
delimiter: "TAB" # Delimiter of the input file
...
Note: We expect the genome and gene annotations to be formatted according to the style used by Ensembl. Other formats are very likely to lead to problems, if not in this workflow, then further down the road in the MAP or QUANTIFY workflows. The miRNA annotation file is expected to originate from miRBase, or to follow their exact layout.
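Before launching the workflow, it can save time to verify that the configured URLs are reachable. A minimal check with curl, where the URL shown is a placeholder for the value of genome_url in your config.yaml:
# Fetch only the response headers to confirm the resource exists;
# replace the placeholder URL with your actual genome_url
curl -sIL "https://example.org/path/to/genome.fa.gz" | head -n 1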
MAP
File location: RUNS/JOB/map/config.yaml
---
#### GLOBAL PARAMETERS ####
# Directories
# Usually there is no need to change these
scripts_dir: "../../../scripts"
output_dir: "results"
local_log: "logs/local"
cluster_log: "logs/cluster"
# Resources: genome, transcriptome, genes, miRs
# All of these are produced by the "prepare" workflow
genome: "path/to/genome.processed.fa"
gtf: "path/to/gene_annotations.filtered.gtf"
transcriptome: "path/to/transcriptome_idtrim.fa"
transcriptome_index_segemehl: "path/to/transcriptome_index_segemehl.idx"
genome_index_segemehl: "path/to/genome_index_segemehl.idx"
exons: "path/to/exons.bed"
header_of_collapsed_fasta: "path/to/headerOfCollapsedFasta.sam"
# Tool parameters: quality filter
q_value: 10 # Q (Phred) score; minimum quality score to keep
p_value: 50 # minimum % of bases that must have Q quality
# Tool parameters: adapter removal
error_rate: 0.1 # fraction of allowed errors
minimum_length: 15 # discard processed reads shorter than the indicated length
overlap: 3 # minimum overlap length of adapter and read to trim the bases
max_n: 0 # discard reads containing more than the indicated number of N bases
# Tool parameters: mapping
max_length_reads: 30 # maximum length of processed reads to map with oligomap
nh: 100 # discard reads with more mappings than the indicated number
# Inputs information
input_dir: "path/to/input_directory"
sample: ["sample_1", "sample_2"] # put all sample names, separated by comma
#### PARAMETERS SPECIFIC TO INPUTS ####
sample_1: # one section per list item in "sample"; names have to match
adapter: "XXXXXXXXXXXXXXXXXXXX" # 3' adapter sequence to trim
format: "fa" # file format; currently supported: "fa"
...
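The adapter-removal parameter names map directly onto Cutadapt options, which suggests, although this is an inference rather than a statement about the workflow's internals, that the trimming step behaves roughly like:
# Rough Cutadapt equivalent of the adapter-removal parameters above;
# -e = error_rate, -m = minimum_length, -O = overlap, --max-n = max_n;
# file names are placeholders
cutadapt -a XXXXXXXXXXXXXXXXXXXX -e 0.1 -m 15 -O 3 --max-n 0 \
    -o sample_1.trimmed.fa sample_1.fa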
QUANTIFY
File location: RUNS/JOB/quantify/config.yaml
---
#### GLOBAL PARAMETERS ####
# Directories
# Usually there is no need to change these
output_dir: "results"
scripts_dir: "../scripts"
local_log: "logs/local"
cluster_log: "logs/cluster"
# Types of miRNAs to quantify
# Remove miRNA types you are not interested in
mir_list: ["miRNA", "miRNA_primary_transcript", "isomirs"]
# Resources: miR annotations, chromosome name mappings
# All of these are produced by the "prepare" workflow
mirnas_anno: "path/to/mirna_filtered.bed"
isomirs_anno: "path/to/isomirs_annotation.bed"
# Inputs information
input_dir: "path/to/input_directory"
sample: ["sample_1", "sample_2"] # put all samples, separated by comma
...
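The first two entries in mir_list correspond to feature types used in miRBase GFF3 files, while "isomirs" refers to the isomiR annotations generated by the PREPARE workflow. To see which feature types a miRBase-style annotation file actually contains (hsa.gff3 is a placeholder for your miRNA annotation file):
# List the distinct feature types (column 3) of a GFF3 file with their counts
grep -v '^#' hsa.gff3 | cut -f3 | sort | uniq -c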