Genome assembly workflow
The genome assembly workflow includes the following tools/steps:
- LongQC: Compute read summary statistics. The reads are not modified in any way before assembly.
- Flye: Assemble the reads.
- circlator: Reorient the assembly such that it begins with dnaA.
- bakta: Annotate the reoriented assembly.
- minimap2: Map the long reads back against the assembly. The resulting alignments can be used to check for inconsistencies between reads and assemblies.
All this software is ready-to-use in a container (see the .def and .yml files in the container folder).
Quick start
Set up conda environment with snakemake and singularity
conda env create -f config/environment.yml
Run the pipeline
The commands below assume that the pipeline is run from within the PacbioSnake folder. If it is run from anywhere else, absolute paths for the input file and the output directory need to be provided.
# Connect to sciCORE through the terminal
# Create a screen session named assembly, which allows you to run a job in the background.
screen -R assembly
# Load the conda environment
conda activate PacbioSnake
# Run the pipeline
./run_assembly_pipeline.py \
-s config/samples.tsv \
-o ~/assemblies \
-j 5 \
-t 2
# Leave the screen while the pipeline is running: Ctrl + a + d
# Re-attach the screen
screen -r assembly
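When the pipeline is started from outside the PacbioSnake folder, the input and output paths have to be made absolute first. The following is a minimal Python sketch of that step. The helper name build_pipeline_command and the example install location are hypothetical; only the -s, -o, -j and -t flags and their defaults are taken from this README.

```python
import os
from pathlib import Path


def build_pipeline_command(pipeline_dir, samples, outdir, jobs=4, threads=10):
    """Return the wrapper call as an argument list with absolute paths.

    Hypothetical helper for illustration; it is not part of the pipeline.
    """
    def absolute(path):
        # Expand "~" and anchor relative paths at the current directory.
        return os.path.abspath(os.path.expanduser(str(path)))

    script = Path(absolute(pipeline_dir)) / "run_assembly_pipeline.py"
    return [
        str(script),
        "-s", absolute(samples),
        "-o", absolute(outdir),
        "-j", str(jobs),
        "-t", str(threads),
    ]


if __name__ == "__main__":
    # The install location below is an assumption; adjust it to your setup.
    cmd = build_pipeline_command(
        "~/PacbioSnake", "my_samples.tsv", "~/assemblies", jobs=5, threads=2
    )
    print(" ".join(cmd))
```

The printed command can be pasted into the screen session, or the list can be passed to subprocess.run.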
Some explanations
The assembly pipeline can be run by executing the run_assembly_pipeline.py script. This is a wrapper around the snakemake command where some parameters and paths are hardwired to work in the sciCORE environment with minimal user input.
To see the arguments accepted by run_assembly_pipeline.py, type ./run_assembly_pipeline.py -h
Two arguments are required:
- -s: a tab-separated table, without header, that contains the sample names and the corresponding paths to the HiFi consensus reads in FASTQ format
- -o: path to the output directory
Optional arguments:
- -n: perform a dry run (recommended) to check that all paths resolve
- -j: number of jobs to run in parallel (default = 4)
- -t: number of threads per job (default = 10)
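A malformed sample table is an easy mistake to make, so it can help to check the file passed to -s before starting a run. The sketch below assumes exactly two columns per line, sample name first and read path second. The function name check_samples_table is hypothetical and the checks are illustrative, not part of the pipeline.

```python
import csv
from pathlib import Path


def check_samples_table(tsv_path):
    """Return a list of problems found in a header-less, tab-separated sample table."""
    problems = []
    seen = set()
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_no, row in enumerate(reader, start=1):
            if not row or all(not field.strip() for field in row):
                continue  # ignore blank lines
            if len(row) != 2:
                problems.append(
                    f"line {line_no}: expected 2 tab-separated columns, found {len(row)}"
                )
                continue
            sample, reads = (field.strip() for field in row)
            if sample in seen:
                problems.append(f"line {line_no}: duplicate sample name '{sample}'")
            seen.add(sample)
            # Assumption: the reads must already exist as a regular file.
            if not Path(reads).expanduser().is_file():
                problems.append(f"line {line_no}: reads file not found: {reads}")
    return problems
```

Calling check_samples_table("config/samples.tsv") returns an empty list when the table looks fine. Relative read paths are resolved against the current working directory, so run the check from the same place as the pipeline.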
Output
For each sample defined in the samples table, a folder is generated in the output directory. It contains:
assembly.circularized.renamed.fasta
bakta/
circlator/
flye/
longqc/
remapping/
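After a run, the listing above can be used to spot samples whose outputs are incomplete. This is a minimal Python sketch, assuming that each sample folder is named after the sample name in the samples table. The function name missing_outputs is hypothetical.

```python
from pathlib import Path

# Entries listed in the Output section of this README.
EXPECTED_OUTPUTS = (
    "assembly.circularized.renamed.fasta",
    "bakta",
    "circlator",
    "flye",
    "longqc",
    "remapping",
)


def missing_outputs(outdir, sample, expected=EXPECTED_OUTPUTS):
    """Return the expected entries that are absent from <outdir>/<sample>."""
    sample_dir = Path(outdir).expanduser() / sample
    return [name for name in expected if not (sample_dir / name).exists()]
```

For example, missing_outputs("~/assemblies", "sampleA") returns an empty list for a complete sample; "sampleA" is a placeholder name. If annotation is switched off in the configuration, the bakta folder is presumably not created, in which case a shorter tuple can be passed as expected.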
Configuration
Two large resources required to run the pipeline are not in this repository but are available on sciCORE:
Singularity container with all required software
/scicore/home/gagneux/GROUP/PacbioSnake_resources/containers/assemblySC.sif
Bakta database
/scicore/home/gagneux/GROUP/PacbioSnake_resources/databases/bakta_db
config.yaml
In the file config/config.yaml some global parameters can be set:
# REQUIRED
samples: config/samples.tsv # Path to sample table, no header, tab-separated
outdir: ./results # Path to output directory
# OPTIONAL
annotate: "Yes" # Annotate assembly with bakta Yes/No
ref:
  genome_size: 4.4m #
  gbf: resources/H37Rv.gbf # Used for bakta annotation step
bakta_db: resources/bakta_db # Used for bakta annotation step
container: containers/assemblySMK.sif # Singularity container containing all required software
threads_per_job: 4 # Should match cpus-per-task in the snakemake command
keep_intermediate: "Yes" # Not implemented yet...
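The settings above can be sanity-checked once they have been loaded into a Python dictionary, for example with a YAML parser. This is a minimal sketch under stated assumptions: the function name check_config is hypothetical, the defaults used for missing optional keys mirror the values shown above, and the pipeline itself may validate its configuration differently.

```python
REQUIRED_KEYS = ("samples", "outdir")


def check_config(config):
    """Return a list of problems in a mapping loaded from config/config.yaml."""
    problems = [
        f"missing required key: {key}" for key in REQUIRED_KEYS if not config.get(key)
    ]
    # The config comments give "Yes" and "No" as the accepted values.
    annotate = config.get("annotate", "Yes")
    if annotate not in ("Yes", "No"):
        problems.append(f"annotate must be 'Yes' or 'No', got {annotate!r}")
    threads = config.get("threads_per_job", 4)
    if not isinstance(threads, int) or threads < 1:
        problems.append(f"threads_per_job must be a positive integer, got {threads!r}")
    return problems
```

The check on annotate also catches a common YAML pitfall: parsers that follow YAML 1.1, such as PyYAML, read an unquoted Yes as the boolean true, which is why the value is quoted in the file.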