RNA-Seq pipeline

Snakemake workflow for general purpose RNA-Seq library annotation developed by the Zavolan lab.

Reads are processed, aligned, quantified and analyzed with state-of-the-art tools to give meaningful initial insights into various aspects of an RNA-Seq library while cutting down on hands-on time for bioinformaticians.

The scheme below is a visual representation of an example run of the workflow:

rule_graph

Installation

Cloning the repository

Navigate to the desired path on your file system, then clone the repository and move into it with:

git clone ssh://git@git.scicore.unibas.ch:2222/zavolan_group/pipelines/rnaseqpipeline.git
cd rnaseqpipeline

Installing Singularity

For improved reproducibility and reusability of the workflow, as well as an easy means to run it on a high performance computing (HPC) cluster managed, e.g., by Slurm, each individual step of the workflow runs in its own container. Specifically, containers are created out of Singularity images built for each software used within the workflow. As a consequence, running this workflow has very few individual dependencies. It does, however, require that Singularity be installed. See the links below for installation instructions for the most up-to-date (as of writing) as well as for the tested version (2.6.1) of Singularity:

If you have root privileges, you can install Singularity directly together with Snakemake in a virtual environment (see next section).

Setting up a Snakemake virtual environment

In addition to Singularity, Snakemake needs to be installed. We strongly recommend doing so via a virtual environment. Here we describe the steps necessary to set up such a virtual environment with a recent version (v4.4+) of the conda package manager. If you prefer to use another solution, such as virtualenv, adapt the steps according to the specific instructions of your preferred solution.

If you do not have conda installed for Python 3, we recommend installing Miniconda, the minimal version (Python and package manager only); see the link for installation instructions. Be sure to select the correct version for your operating system and ensure that you select the Python 3 option.

To create and activate a snakemake environment, run:

conda create -n rnaseq_pipeline \
    -c bioconda \
    -c conda-forge \
    snakemake=5.10.0 
conda activate rnaseq_pipeline

or, to create a conda environment containing both Snakemake and Singularity (currently not working on macOS):

Note: Singularity has to be installed as root, so wherever you don't have root privileges, use the installation methods described above!

conda create -n rnaseq_pipeline \
    -c bioconda \
    -c conda-forge \
    snakemake=5.10.0 \
    singularity=3.5.2
conda activate rnaseq_pipeline 

All installation requirements should now be met.
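As a quick sanity check before running the tests, the following minimal sketch reports whether the two required executables can be found on your PATH. It is not part of the repository; the function name `find_tools` is made up for illustration, and the executable names `snakemake` and `singularity` are assumed to match what the installation steps above provide.

```python
import shutil


def find_tools(tools=("snakemake", "singularity")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    # Run this inside the activated conda environment.
    for tool, path in find_tools().items():
        print(f"{tool}: {path if path else 'NOT FOUND'}")
```

If either tool is reported as not found, check that the `rnaseq_pipeline` environment is activated and that Singularity was installed by one of the methods described above.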

Testing the installation

We have prepared several tests to check the integrity of the workflow. The most important one lets you execute the workflow on a small set of example input files.

Run workflow on local machine

Execute the following command to run the test workflow on your local machine:

bash tests/test_integration_workflow/test.local.sh

Run workflow via Slurm

Execute the following command to run the test workflow on a Slurm-managed HPC cluster:

bash tests/test_integration_workflow/test.slurm.sh

NOTE: Depending on the configuration of your Slurm installation, or if you use a different workload manager, you may need to adapt the file cluster.json and the arguments to the options --config and --cores in the file test.slurm.sh, both located in the directory tests/test_integration_workflow. Consult the manual of your workload manager as well as the section of the Snakemake manual dealing with cluster execution.
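To make the role of cluster.json more concrete, here is a minimal sketch of how a cluster configuration could be combined with a submission command. Only the key names (threads, mem, queue, time, name, out) and the `sbatch` template are taken from the Slurm runner script shown later in this README; all values, the rule name `example_rule` and the helper functions are hypothetical. The merge of `__default__` with rule-specific entries only approximates what Snakemake does internally.

```python
import json
from types import SimpleNamespace

# Hypothetical values; adapt them to your cluster.
cluster_config = {
    "__default__": {
        "queue": "6hours",
        "time": "01:00:00",
        "threads": "1",
        "mem": "4G",
        "name": "rnaseq_job",
        "out": "logs/cluster_log/job.out",
    },
    # Entries for a specific rule override the defaults.
    "example_rule": {"threads": "8", "mem": "32G"},
}

# Template copied from the Slurm runner script in this README.
SBATCH_TEMPLATE = (
    "sbatch --cpus-per-task={cluster.threads} --mem={cluster.mem} "
    "--qos={cluster.queue} --time={cluster.time} "
    "--job-name={cluster.name} -o {cluster.out} -p scicore"
)


def resolve(config, rule):
    """Merge the defaults with the rule-specific settings (rule wins)."""
    return {**config.get("__default__", {}), **config.get(rule, {})}


def submit_command(config, rule):
    """Fill the sbatch template with the settings resolved for one rule."""
    return SBATCH_TEMPLATE.format(cluster=SimpleNamespace(**resolve(config, rule)))


if __name__ == "__main__":
    print(json.dumps(cluster_config, indent=4))  # contents of cluster.json
    print(submit_command(cluster_config, "example_rule"))
```

If your workload manager uses different resource options, both the keys in cluster.json and the placeholders in the submission command need to change together.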

Running the workflow on your own samples

  1. Assuming that you are currently inside the repository's root directory, create a directory for your workflow run and move into it with:

    mkdir config/my_run
    cd config/my_run
  2. Create empty sample table, workflow configuration and, if necessary, cluster configuration files:

    touch samples.tsv
    touch config.yaml
    touch cluster.json
  3. Use your editor of choice to manually populate these files with appropriate values. Have a look at the examples in the tests/ directory to see what the files should look like, specifically:

  4. Create a runner script. Pick one of the following choices for either local or cluster execution. Before executing the respective command, you must replace the data directory placeholders in the argument to the --singularity-args option with a comma-separated list of all directories containing input data files (samples, annotation files, etc.) required for your run.

    Runner script for local execution:

    cat << "EOF" > run.sh
    #!/bin/bash
    mkdir -p logs/local_log
    snakemake \
        --snakefile="../../snakemake/Snakefile" \
        --configfile="config.yaml" \
        --cores=4 \
        --printshellcmds \
        --rerun-incomplete \
        --use-singularity \
        --singularity-args="--bind <data_dir_1>,<data_dir_2>,<data_dir_n>"
    EOF

    OR

    Runner script for Slurm cluster execution (note that you may need to modify the arguments to --cluster and --cores depending on your HPC and workload manager configuration):

    cat << "EOF" > run.sh
    #!/bin/bash
    mkdir -p logs/cluster_log
    snakemake \
        --snakefile="../../snakemake/Snakefile" \
        --configfile="config.yaml" \
        --cluster-config="cluster.json" \
        --cluster="sbatch --cpus-per-task={cluster.threads} --mem={cluster.mem} --qos={cluster.queue} --time={cluster.time} --job-name={cluster.name} -o {cluster.out} -p scicore" \
        --cores=256 \
        --printshellcmds \
        --rerun-incomplete \
        --use-singularity \
        --singularity-args="--bind <data_dir_1>,<data_dir_2>,<data_dir_n>"
    EOF
  5. Start your workflow run:

    bash run.sh
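The data directory placeholders in step 4 are easy to get wrong when many directories are involved. The sketch below, which is not part of the repository, builds the argument for --singularity-args from a list of directories. The function name and the example paths are hypothetical; the only assumption about Singularity is that --bind accepts a comma-separated list of paths, as used in the runner scripts above.

```python
from pathlib import Path


def singularity_bind_args(data_dirs):
    """Return a '--bind dir1,dir2,...' string for the given directories.

    Paths are made absolute and duplicates are dropped, keeping the order.
    """
    resolved = []
    for directory in data_dirs:
        path = str(Path(directory).expanduser().resolve())
        if path not in resolved:
            resolved.append(path)
    if not resolved:
        raise ValueError("at least one data directory is required")
    return "--bind " + ",".join(resolved)


if __name__ == "__main__":
    # Hypothetical directories holding samples and annotation files.
    print(singularity_bind_args(["/data/samples", "/data/annotation"]))
```

The printed string can be pasted into run.sh as the value of --singularity-args.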

Configuring workflow runs via LabKey tables

Our lab stores metadata for sequencing samples in a locally deployed LabKey instance. This repository provides two scripts that give programmatic access to the LabKey data table and convert it to the corresponding workflow inputs (samples.tsv and config.yaml), respectively. As such, these scripts largely automate step 3 of the above instructions. However, as these scripts were written specifically for the needs of our lab, they are likely not portable or, at least, will require considerable modification for other setups (e.g., a different LabKey table structure).

NOTE: All of the below steps assume that your current working directory is the repository's root directory.

  1. The scripts have additional dependencies that can be installed with:

    pip install -r scripts/requirements.txt
  2. In order to gain programmatic access to LabKey via its API, a credential file is required. Create it with the following command after replacing the placeholder values with your real credentials (talk to your LabKey manager if you do not have these):

    cat << EOF | ( umask 0377; cat >> ${HOME}/.netrc; )
    machine <remote-instance-of-labkey-server>  
    login <user-email>
    password <user-password>  
    EOF
  3. Generate the workflow configuration with the following command, after replacing the placeholders with the appropriate values (check out the help screen with option '--help' for further options and information):

    python scripts/labkey_to_snakemake.py \
        --input_dict="scripts/labkey_to_snakemake.dict.tsv" \
        --config_file="config/my_run/config.yaml" \
        --samples_table="config/my_run/samples.tsv" \
        --remote \
        --project-name="project_name" \
        --table-name="table_name" \
        <path_to_annotation_files>
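Before running the conversion script, it can help to confirm that the credential file from step 2 parses and contains an entry for your LabKey server. The following sketch uses Python's standard `netrc` module for that. It is an illustration only and says nothing about how the repository's scripts read the file; the host name `labkey.example.org`, the credentials and the function name are hypothetical.

```python
import netrc
import os
import tempfile


def read_labkey_credentials(host, netrc_path):
    """Return (login, password) for host from a netrc file, or None if absent."""
    auth = netrc.netrc(netrc_path).authenticators(host)
    if auth is None:
        return None
    login, _account, password = auth
    return login, password


if __name__ == "__main__":
    # Demonstrate on a throwaway file rather than on the real ~/.netrc.
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = os.path.join(tmp_dir, "netrc")
        with open(demo_path, "w") as handle:
            handle.write(
                "machine labkey.example.org\n"
                "login user@example.org\n"
                "password not-a-real-password\n"
            )
        print(read_labkey_credentials("labkey.example.org", demo_path))
```

To check your real file, pass the path to `${HOME}/.netrc` and the host name you entered for the `machine` field.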

Additional information

The metadata fields in the LabKey instance and the corresponding parameters in the Snakemake workflow have different names. The mapping between LabKey field identifiers and Snakemake parameters is listed below:

| LabKey | Snakemake |
| --- | --- |
| Entry date | entry_date |
| Path to FASTQ file(s) | fastq_path |
| Condition name | condition |
| Replicate name | replicate_name |
| End type (PAIRED or SINGLE) | seqmode |
| Name of Mate1 FASTQ file | fq1 |
| Name of Mate2 FASTQ file | fq2 |
| Direction of Mate1 (SENSE, ANTISENSE or RANDOM) | mate1_direction |
| Direction of Mate2 (SENSE, ANTISENSE or RANDOM) | mate2_direction |
| 5' adapter of Mate1 | fq1_5p |
| 3' adapter of Mate1 | fq1_3p |
| 5' adapter of Mate2 | fq2_5p |
| 3' adapter of Mate2 | fq2_3p |
| Fragment length mean | mean |
| Fragment length SD | sd |
| Quality control flag (PASSED or FAILED) | quality_control_flag |
| Checksum of raw Mate1 FASTQ file | mate1_checksum |
| Checksum of raw Mate2 FASTQ file | mate2_checksum |
| Name of metadata file | metadata |
| Name of quality control file for Mate1 | mate1_quality |
| Name of quality control file for Mate2 | mate2_quality |
| Organism | organism |
| Taxon ID | taxon_id |
| Name of Strain / Isolate / Breed / Ecotype | strain_name |
| Strain / Isolate / Breed / Ecotype ID | strain_id |
| Biomaterial provider | biomaterial_provider |
| Source / tissue name | source_name |
| Tissue code | tissue_code |
| Additional tissue description | tissue_description |
| Genotype short name | genotype_name |
| Genotype description | genotype_description |
| Disease short name | disease_name |
| Disease description | disease_description |
| Abbreviation for treatment | treatment |
| Treatment description | treatment_description |
| Gender | gender |
| Age | age |
| Developmental stage | development_stage |
| Passage number | passage_number |
| Sample preparation date (YYYY-MM-DD) | sample_prep_date |
| Prepared by | prepared_by |
| Documentation | documentation |
| Name of protocol file | protocol_file |
| Sequencing date (YYYY-MM-DD) | seq_date |
| Sequencing instrument | seq_instrument |
| Library preparation kit | library_kit |
| Cycles | cycles |
| Molecule | molecule |
| Contaminant sequences | contaminant_seqs |
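Several fields in the mapping above only allow a fixed set of values (PAIRED or SINGLE, SENSE/ANTISENSE/RANDOM, PASSED or FAILED). The sketch below checks a sample table against those sets. It rests on assumptions not confirmed by this README: that the table is tab-separated, that its header uses the Snakemake parameter names, and that values are spelled in upper case exactly as listed. The function name is hypothetical, and the workflow itself may validate its inputs differently.

```python
import csv
import io

# Allowed values as listed in the LabKey/Snakemake mapping above.
ALLOWED = {
    "seqmode": {"PAIRED", "SINGLE"},
    "mate1_direction": {"SENSE", "ANTISENSE", "RANDOM"},
    "mate2_direction": {"SENSE", "ANTISENSE", "RANDOM"},
    "quality_control_flag": {"PASSED", "FAILED"},
}


def find_invalid_values(tsv_text):
    """Return (row_number, column, value) for each value outside its allowed set.

    Missing columns and empty cells are skipped, since, for example,
    single-end samples may carry no Mate2 information.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        for column, allowed in ALLOWED.items():
            value = row.get(column)
            if value and value not in allowed:
                problems.append((row_number, column, value))
    return problems


if __name__ == "__main__":
    example = (
        "seqmode\tmate1_direction\tquality_control_flag\n"
        "PAIRED\tSENSE\tPASSED\n"
        "single\tSENSE\tFAILED\n"
    )
    print(find_invalid_values(example))
```

For a real table, read the file contents first, for example with `open("config/my_run/samples.tsv").read()`, and pass them to the function.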