Christoph Stritt authored

Genome assembly workflow

The genome assembly workflow includes the following tools/steps:

  • LongQC: Compute read summary statistics. The reads are not modified in any way before assembly.
  • Flye: Assembly.
  • circlator: Reorient the assembly such that it begins with dnaA.
  • bakta: Annotate the reoriented assembly.
  • minimap2: Map the long reads back against the assembly. The resulting alignments can be used to check for inconsistencies between reads and assemblies.

All this software is ready-to-use in a container (see the .def and .yml files in the container folder).

Quick start

Set up conda environment with snakemake and singularity

conda env create -f config/environment.yml

Run the pipeline

The commands below assume that the pipeline is run from within the PacbioSnake folder. If it is run from elsewhere, adapt the paths accordingly and provide absolute paths for the input file and the output directory.

# Connect to sciCORE through the terminal

# Create a screen session named assembly, which allows you to run a job in the background.
screen -R assembly

# Load the conda environment
conda activate PacbioSnake

# Run the pipeline 
./run_assembly_pipeline.py \
  -s config/samples.tsv \
  -o ~/assemblies \
  -j 5 \
  -t 2

# Leave the screen while the pipeline is running: Ctrl + a, then d
# Re-attach the screen
screen -r assembly

Some explanations

The assembly pipeline can be run by executing the run_assembly_pipeline.py script. This is a wrapper around the snakemake command where some parameters and paths are hardwired to work in the sciCORE environment with minimal user input.
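To illustrate what such a wrapper does, here is a minimal sketch and not the actual contents of run_assembly_pipeline.py. It takes the options documented in this README (-s, -o, -n, -j, -t) and turns them into a snakemake argument list. The function name build_snakemake_command, the Snakefile location workflow/Snakefile, and the choice to pass values through --config are assumptions made for illustration. Newer snakemake versions may prefer a different flag than --use-singularity.

```python
import argparse


def build_snakemake_command(argv=None):
    """Translate wrapper-style options into a snakemake argument list.

    Hypothetical helper: the real wrapper may hardwire different paths
    and pass additional cluster-specific options.
    """
    parser = argparse.ArgumentParser(description="Sketch of an assembly wrapper")
    parser.add_argument("-s", dest="samples", required=True,
                        help="sample table (tab-separated, no header)")
    parser.add_argument("-o", dest="outdir", required=True,
                        help="output directory")
    parser.add_argument("-n", dest="dry_run", action="store_true",
                        help="perform a dry run")
    parser.add_argument("-j", dest="jobs", type=int, default=4,
                        help="number of jobs to run in parallel")
    parser.add_argument("-t", dest="threads", type=int, default=10,
                        help="number of threads per job")
    args = parser.parse_args(argv)

    cmd = [
        "snakemake",
        "--snakefile", "workflow/Snakefile",   # assumed location of the Snakefile
        "--configfile", "config/config.yaml",
        "--jobs", str(args.jobs),
        "--use-singularity",
        # Override config values with what the user passed on the command line
        "--config",
        f"samples={args.samples}",
        f"outdir={args.outdir}",
        f"threads_per_job={args.threads}",
    ]
    if args.dry_run:
        cmd.append("--dry-run")
    return cmd
```

The returned list could be handed to subprocess.run(cmd, check=True). Building a list rather than a shell string avoids quoting problems with paths that contain spaces.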

To see the arguments accepted by run_assembly_pipeline.py, type:

./run_assembly_pipeline.py -h

Two arguments are required:

  • -s: a tab-separated table, without header, that contains the sample names and the corresponding paths to the HiFi consensus reads in FASTQ format
  • -o: path to the output directory

Optional arguments:

  • -n: perform dry run (recommended), to see if all the paths work out
  • -j: number of jobs to run in parallel (default = 4)
  • -t: number of threads per job (default = 10)
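Because the sample table passed with -s has no header, formatting mistakes are easy to miss. The following sketch checks a table before a run, under these assumptions: it has exactly two columns (sample name, then read path), and sample names must be unique. The function names read_sample_table and missing_inputs are hypothetical and not part of the pipeline.

```python
import csv
from pathlib import Path


def read_sample_table(path):
    """Return a list of (sample_name, fastq_path) tuples from a headerless TSV."""
    samples = []
    with open(path, newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if not row or all(not field.strip() for field in row):
                continue  # skip blank lines
            if len(row) != 2:
                raise ValueError(
                    f"line {line_no}: expected 2 tab-separated columns, found {len(row)}"
                )
            name, fastq = (field.strip() for field in row)
            if not name or not fastq:
                raise ValueError(f"line {line_no}: empty sample name or path")
            samples.append((name, Path(fastq)))

    # Assumption: each sample gets its own output folder, so names must be unique
    names = [name for name, _ in samples]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate sample names: {duplicates}")
    return samples


def missing_inputs(samples):
    """Return the samples whose read file does not exist on disk."""
    return [(name, fastq) for name, fastq in samples if not fastq.is_file()]
```

A check like this complements the dry run (-n). The dry run tests whether snakemake can resolve the paths, while this sketch reports malformed lines with their line numbers.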

Output

For each sample defined in the samples table, a folder is generated in the output directory. It contains:

assembly.circularized.renamed.fasta
bakta/
circlator/
flye/
longqc/
remapping/

Configuration

Two large files required to run the pipeline are not in this repository but are available on sciCORE:

Singularity container with all required software

/scicore/home/gagneux/GROUP/PacbioSnake_resources/containers/assemblySC.sif

Bakta database

/scicore/home/gagneux/GROUP/PacbioSnake_resources/databases/bakta_db

config.yml

In the file config/config.yaml some global parameters can be set:

# REQUIRED
samples: config/samples.tsv # Path to sample table, no header, tab-separated
outdir: ./results # Path to output directory

# OPTIONAL
annotate: "Yes" # Annotate assembly with bakta Yes/No

ref:
  genome_size: 4.4m # 
  gbf: resources/H37Rv.gbf # Used for bakta annotation step

bakta_db: resources/bakta_db # Used for bakta annotation step
container: containers/assemblySMK.sif # Singularity container containing all required software

threads_per_job: 4 # Should match cpus-per-task in the snakemake command
 
keep_intermediate: "Yes" # Not implemented yet...