Commit aa6c88f8 authored by Alex Kanitz

Merge branch 'dev' into 'master'

Bump version to v0.2.0

See merge request !76
parents 4dd1ad78 2bdcf390
Pipeline #11050 passed
Showing changes with 1137 additions and 796 deletions
@@ -328,6 +328,7 @@ pip-selfcheck.json
.DS_Store
runs/.*
!runs/PUT_YOUR_WORKFLOW_RUN_CONFIGS_HERE
._*
._.DS_Store
.snakemake/
logs/
@@ -4,7 +4,7 @@ before_script:
- apt update && apt install -y gcc
- conda init bash && source ~/.bashrc && echo $CONDA_DEFAULT_ENV
- conda env create -f install/environment.root.yml
- conda activate rhea && echo $CONDA_DEFAULT_ENV
- conda activate zarp && echo $CONDA_DEFAULT_ENV
- conda env update -f install/environment.dev.yml
test:
@@ -12,11 +12,12 @@ test:
# add code quality tests here
# add unit tests here
# add script tests here
- bash tests/test_scripts_labkey_to_snakemake_table/test.sh
- bash tests/test_scripts_labkey_to_snakemake_api/test.sh
- bash tests/test_scripts_prepare_inputs_table/test.sh
- bash tests/test_scripts_prepare_inputs_labkey/test.sh
- bash tests/test_alfa/test.sh
# add integration tests here
- bash tests/test_create_dag_image/test.sh
- bash tests/test_create_rule_graph/test.sh
- bash tests/test_integration_workflow/test.local.sh
- bash tests/test_integration_workflow_multiple_lanes/test.local.sh
# Rhea pipeline
# ZARP
[Snakemake][snakemake] workflow for general purpose RNA-Seq library annotation
developed by the [Zavolan lab][zavolan-lab].
[Snakemake][snakemake] workflow that covers common steps of short read RNA-Seq
library analysis developed by the [Zavolan lab][zavolan-lab].
Reads are processed, aligned, quantified and analyzed with state-of-the-art
tools to give meaningful initial insights into various aspects of an RNA-Seq
library while cutting down on hands-on time for bioinformaticians.
Reads are analyzed (pre-processed, aligned, quantified) with state-of-the-art
tools to give meaningful initial insights into the quality and composition
of an RNA-Seq library, reducing hands-on time for bioinformaticians and
enabling experimentalists to rapidly assess their data.
Below is a schematic representation of the individual workflow steps ("pe"
refers to "paired-end"):
Below is a schematic representation of the individual steps of the workflow
("pe" refers to "paired-end"):
> ![rule_graph][rule-graph]
For a more detailed description of each step, please refer to the [pipeline
documentation][pipeline-documentation].
For a more detailed description of each step, please refer to the [workflow
documentation][workflow-documentation].
## Requirements
@@ -28,12 +29,12 @@ on the following distributions:
### Cloning the repository
Traverse to the desired path on your file system, then clone the repository and
move into it with:
Traverse to the desired directory/folder on your file system, then clone/get the
repository and move into the respective directory with:
```bash
git clone ssh://git@git.scicore.unibas.ch:2222/zavolan_group/pipelines/rhea.git
cd rhea
git clone ssh://git@git.scicore.unibas.ch:2222/zavolan_group/pipelines/zarp.git
cd zarp
```
### Installing Conda
@@ -49,8 +50,8 @@ Other versions are not guaranteed to work as expected.
For improved reproducibility and reusability of the workflow,
each individual step of the workflow runs in its own [Singularity][singularity]
container. As a consequence, running this workflow has very few
individual dependencies. It does, however, require Singularity to be installed
on the system running the workflow. As the functional installation of
individual dependencies. However, it requires Singularity to be installed
on the system where the workflow is executed. As the functional installation of
Singularity requires root privileges, and Conda currently only provides
Singularity for Linux architectures, the installation instructions are
slightly different depending on your system/setup:
@@ -90,7 +91,7 @@ conda env create -f install/environment.root.yml
Activate the Conda environment with:
```bash
conda activate rhea
conda activate zarp
```
### Installing non-essential dependencies
@@ -105,15 +106,14 @@ conda env update -f install/environment.dev.yml
## Testing the installation
We have prepared several tests to check the integrity of the workflow, its
components and non-essential processing scripts. These can be found in
subdirectories of the `tests/` directory. The most critical of these tests
lets you execute the entire workflow on a small set of example input files.
Note that for this and other tests to complete without issues,
[additional dependencies](#installing-non-essential-dependencies) need to be
installed.
We have prepared several tests to check the integrity of the workflow and its
components. These can be found in subdirectories of the `tests/` directory.
The most critical of these tests enables you to execute the entire workflow on a
set of small example input files. Note that for this and other tests to complete
successfully, [additional dependencies](#installing-non-essential-dependencies)
need to be installed.
### Run workflow on local machine
### Test workflow on local machine
Execute the following command to run the test workflow on your local machine:
@@ -121,7 +121,7 @@ Execute the following command to run the test workflow on your local machine:
bash tests/test_integration_workflow/test.local.sh
```
### Run workflow via Slurm
### Test workflow via Slurm
Execute the following command to run the test workflow on a
[Slurm][slurm]-managed high-performance computing (HPC) cluster:
@@ -131,15 +131,15 @@ bash tests/test_integration_workflow/test.slurm.sh
```
> **NOTE:** Depending on the configuration of your Slurm installation or if
> using a different workflow manager, you may need to adapt file `cluster.json`
> and the arguments to options `--config` and `--cores` in file
> using a different workload manager, you may need to adapt file `cluster.json`
> and the arguments to options `--config` and `--cores` in the file
> `test.slurm.sh`, both located in directory `tests/test_integration_workflow`.
> Consult the manual of your workload manager as well as the section of the
> Snakemake manual dealing with [cluster execution].
## Running the workflow on your own samples
1. Assuming that you are currently inside the repository's root directory,
1. Assuming that your current directory is the repository's root directory,
create a directory for your workflow run and traverse inside it with:
```bash
@@ -156,7 +156,7 @@ configuration files:
touch cluster.json
```
3. Use your editor of choice to manually populate these files with appropriate
3. Use your editor of choice to populate these files with appropriate
values. Have a look at the examples in the `tests/` directory to see what the
files should look like, specifically:
@@ -166,7 +166,7 @@ files should look like, specifically:
4. Create a runner script. Pick one of the following choices for either local
or cluster execution. Before execution of the respective command, you must
replace the data directory placeholders in the argument to the
replace the data directory placeholders in the argument of the
`--singularity-args` option with a comma-separated list of _all_ directories
containing input data files (samples and any annotation files etc.) required for
your run.
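To illustrate the bind-path requirement described above (a minimal sketch with a hypothetical helper function, not part of the repository), the argument to `--singularity-args` is simply a comma-separated join of all data directories:

```python
def singularity_bind_arg(directories):
    """Build the value for Snakemake's --singularity-args option.

    `directories` must contain ALL directories holding input data files
    (samples, annotation files, etc.); Singularity accepts them as a
    comma-separated list after --bind.
    """
    return "--bind " + ",".join(directories)


# Example (placeholder paths):
# singularity_bind_arg(["/data/samples", "/data/annotations"])
# -> "--bind /data/samples,/data/annotations"
```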
@@ -223,9 +223,11 @@ Our lab stores metadata for sequencing samples in a locally deployed
programmatic access to the LabKey data table and convert it to the
corresponding workflow inputs (`samples.tsv` and `config.yaml`), respectively.
As such, these scripts largely automate step 3. of the above instructions.
However, as these scripts were specifically for the needs of our lab, they are
likely not portable or, at least, will require considerable modification for
other setups (e.g., different LabKey table structure).
However, as these scripts were written specifically for the needs of our lab,
they are likely not directly usable or, at least, will require considerable
modification for other setups (e.g., different LabKey table structure).
Nevertheless, they can serve as an example for interfacing between LabKey and
your workflow.
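As a rough illustration of what such interfacing involves (a self-contained sketch, not the actual `prepare_inputs.py` implementation), converting LabKey records to the workflow's `samples.tsv` amounts to applying a column mapping like the one in `scripts/prepare_inputs.dict.tsv`:

```python
import csv
import io

# Illustrative subset of a LabKey-to-Snakemake column mapping, in the
# spirit of scripts/prepare_inputs.dict.tsv.
COLUMN_MAP = {
    "Sample_Name": "sample_name",
    "Single_Paired": "seqmode",
    "Mate1_File": "fq1",
    "Mate2_File": "fq2",
}


def labkey_rows_to_samples_tsv(rows):
    """Render LabKey-style records (list of dicts) as a samples.tsv string."""
    fieldnames = list(COLUMN_MAP.values())
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter="\t")
    writer.writeheader()
    for row in rows:
        # Keep only mapped columns, renamed to their Snakemake equivalents
        writer.writerow({COLUMN_MAP[k]: v for k, v in row.items()
                         if k in COLUMN_MAP})
    return buf.getvalue()
```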
> **NOTE:** All of the below steps assume that your current working directory
> is the repository's root directory.
@@ -254,10 +256,10 @@ replacing the placeholders with the appropriate values (check out the
help screen with option '--help' for further options and information):
```bash
python scripts/labkey_to_snakemake.py \
python scripts/prepare_inputs.py \
--labkey-domain="my.labkey.service.io" \
--labkey-path="/my/project/path" \
--input-to-output-mapping="scripts/labkey_to_snakemake.dict.tsv" \
--input-to-output-mapping="scripts/prepare_inputs.dict.tsv" \
--resources-dir="/path/to/my/genome/resources" \
--output-table="config/my_run/samples.tsv" \
--config_file="config/my_run/config.yaml" \
This diff is collapsed.
name: rhea
name: zarp
channels:
- bioconda
- conda-forge
name: rhea
name: zarp
channels:
- conda-forge
- defaults
dependencies:
- graphviz=2.40.1
- jinja2=2.11.2
- networkx=2.4
- pip=20.0.2
- pygments=2.6.1
- pygraphviz=1.3
- python=3.7.4
- singularity=3.5.2
- pip:
- pandas==1.0.1
- snakemake==5.10.0
- snakemake==5.19.2
name: rhea
name: zarp
channels:
- conda-forge
- defaults
dependencies:
- graphviz=2.40.1
- jinja2=2.11.2
- networkx=2.4
- pip=20.0.2
- pygments=2.6.1
- pygraphviz=1.3
- python=3.7.4
- pip:
- pandas==1.0.1
- snakemake==5.10.0
- snakemake==5.19.2
# Rhea: workflow documentation
# ZARP: workflow documentation
This document describes the individual steps of the workflow. For instructions
on installation and usage please see [here](README.md).
@@ -30,6 +30,8 @@ on installation and usage please see [here](README.md).
- [**plot_TIN_scores**](#plot_tin_scores)
- [**salmon_quantmerge_genes**](#salmon_quantmerge_genes)
- [**salmon_quantmerge_transcripts**](#salmon_quantmerge_transcripts)
- [**kallisto_merge_genes**](#kallisto_merge_genes)
- [**kallisto_merge_transcripts**](#kallisto_merge_transcripts)
- [**generate_alfa_index**](#generate_alfa_index)
- [**alfa_qc**](#alfa_qc)
- [**alfa_qc_all_samples**](#alfa_qc_all_samples)
@@ -37,7 +39,7 @@ on installation and usage please see [here](README.md).
- [**prepare_multiqc_config**](#prepare_multiqc_config)
- [**multiqc_report**](#multiqc_report)
- [**finish**](#finish)
- [**Sequencing mode-specific**](#sequencing-mode-specific)
- [**Sequencing mode-specific**](#sequencing-mode-specific)
- [**remove_adapters_cutadapt**](#remove_adapters_cutadapt)
- [**remove_polya_cutadapt**](#remove_polya_cutadapt)
- [**map_genome_star**](#map_genome_star)
@@ -416,6 +418,36 @@ Merge transcript-level expression estimates for all samples with
- Transcript read count table (custom `.tsv`); used in
[**multiqc_report**](#multiqc_report)
#### `kallisto_merge_genes`
Merge gene-level expression estimates for all samples with
[custom script][custom-script-merge-kallisto].
> Rule is run once per sequencing mode
- **Input**
- Transcript expression tables (custom `.h5`) for samples of same sequencing
mode; from [**genome_quantification_kallisto**](#genome_quantification_kallisto)
- Gene annotation file (custom `.gtf`)
- **Output**
- Gene TPM table (custom `.tsv`)
- Gene read count table (custom `.tsv`)
- Mapping gene/transcript IDs table (custom `.tsv`)
#### `kallisto_merge_transcripts`
Merge transcript-level expression estimates for all samples with
[custom script][custom-script-merge-kallisto].
> Rule is run once per sequencing mode
- **Input**
- Transcript expression tables (custom `.h5`) for samples of same sequencing
mode; from [**genome_quantification_kallisto**](#genome_quantification_kallisto)
- **Output**
- Transcript TPM table (custom `.tsv`)
- Transcript read count table (custom `.tsv`)
#### `generate_alfa_index`
Create index for [**ALFA**](#third-party-software-used).
@@ -681,6 +713,7 @@ Generate pseudoalignments of reads to transcripts with
[code-star]: <https://github.com/alexdobin/STAR>
[custom-script-gtf-to-bed12]: <https://git.scicore.unibas.ch/zavolan_group/tools/gtf_transcript_type_to_bed12>
[custom-script-tin]: <https://git.scicore.unibas.ch/zavolan_group/tools/tin_score_calculation>
[custom-script-merge-kallisto]: <https://github.com/zavolanlab/merge_kallisto>
[docs-alfa]: <https://github.com/biocompibens/ALFA#manual>
[docs-bedgraphtobigwig]: <http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/>
[docs-bedtools]: <https://bedtools.readthedocs.io/en/latest/>
@@ -2,7 +2,7 @@ labkey snakemake
Entry_Date entry_date
Path_Fastq_Files fastq_path
Condition_Name condition
Replicate_Name replicate_name
Sample_Name sample_name
Single_Paired seqmode
Mate1_File fq1
Mate2_File fq2
@@ -42,7 +42,13 @@
{
"time": "03:00:00",
"threads":"8",
"mem":"10G"
"mem":"40G"
},
"sort_bed_4_big":
{
"time": "03:00:00",
"threads":"8",
"mem":"20G"
},
"create_index_kallisto":
{
@@ -168,13 +174,13 @@
{
"time": "03:00:00",
"threads":"6",
"mem":"10G"
"mem":"20G"
},
"quantification_salmon":
{
"time": "03:00:00",
"threads":"6",
"mem":"10G"
"mem":"20G"
},
"pe_genome_quantification_kallisto":
{
---
samples: "../input_files/samples.multiple_lanes.tsv"
output_dir: "results"
log_dir: "logs"
kallisto_indexes: "results/kallisto_indexes"
salmon_indexes: "results/salmon_indexes"
star_indexes: "results/star_indexes"
alfa_indexes: "results/alfa_indexes"
report_description: "No description provided by user"
report_logo: "../../images/logo.128px.png"
report_url: "https://zavolan.biozentrum.unibas.ch/"
...
\ No newline at end of file
File added
File added
File added
File added
sample seqmode fq1 fq2 index_size kmer fq1_3p fq1_5p fq2_3p fq2_5p organism gtf genome sd mean kallisto_directionality alfa_directionality alfa_plus alfa_minus multimappers soft_clip pass_mode libtype fq1_polya_3p fq1_polya_5p fq2_polya_3p fq2_polya_5p
synthetic_10_reads_paired_synthetic_10_reads_paired pe ../input_files/pe_lane1/synthetic_split_lane1.mate_1.fastq.gz ../input_files/pe_lane1/synthetic_split_lane1.mate_2.fastq.gz 75 31 AGATCGGAAGAGCACA XXXXXXXXXXXXXXX AGATCGGAAGAGCGT XXXXXXXXXXXXXXX homo_sapiens ../input_files/homo_sapiens/annotation.gtf ../input_files/homo_sapiens/genome.fa 100 250 --fr fr-firststrand str1 str2 10 EndToEnd None A AAAAAAAAAAAAAAA XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX TTTTTTTTTTTTTTT
synthetic_10_reads_paired_synthetic_10_reads_paired pe ../input_files/pe_lane2/synthetic_split_lane2.mate_1.fastq.gz ../input_files/pe_lane2/synthetic_split_lane2.mate_2.fastq.gz 75 31 AGATCGGAAGAGCACA XXXXXXXXXXXXXXX AGATCGGAAGAGCGT XXXXXXXXXXXXXXX homo_sapiens ../input_files/homo_sapiens/annotation.gtf ../input_files/homo_sapiens/genome.fa 100 250 --fr fr-firststrand str1 str2 10 EndToEnd None A AAAAAAAAAAAAAAA XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX TTTTTTTTTTTTTTT
synthetic_10_reads_mate_1_synthetic_10_reads_mate_1 se ../input_files/se_lane1/synthetic_split_lane1.mate_1.fastq.gz XXXXXXXXXXXXXXX 75 31 AGATCGGAAGAGCACA XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX homo_sapiens ../input_files/homo_sapiens/annotation.gtf ../input_files/homo_sapiens/genome.fa 100 250 --fr fr-firststrand str1 str2 10 EndToEnd None A AAAAAAAAAAAAAAA XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX
synthetic_10_reads_mate_1_synthetic_10_reads_mate_1 se ../input_files/se_lane2/synthetic_split_lane2.mate_1.fastq.gz XXXXXXXXXXXXXXX 75 31 AGATCGGAAGAGCACA XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX homo_sapiens ../input_files/homo_sapiens/annotation.gtf ../input_files/homo_sapiens/genome.fa 100 250 --fr fr-firststrand str1 str2 10 EndToEnd None A AAAAAAAAAAAAAAA XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX
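The multiple-lanes test table above lists the same sample once per lane. As a sketch of the underlying idea (illustrative only, not the workflow's actual lane-merging logic), such rows can be grouped by sample name:

```python
import csv
import io
from collections import defaultdict


def group_lanes(samples_tsv):
    """Group per-lane FASTQ paths by sample name from a samples.tsv string."""
    reader = csv.DictReader(io.StringIO(samples_tsv), delimiter="\t")
    lanes = defaultdict(list)
    for row in reader:
        # One entry per lane; the workflow later merges these per sample
        lanes[row["sample"]].append(row["fq1"])
    return dict(lanes)
```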
File added