The aim of this workshop is to get familiar with the pangenome graph framework, which, given a set of accurate assemblies, allows a full characterization of genetic variation of any type in any region of the genome.
It consists of two basic modules, in which we create a graph, visualize it, and call variants from it, and five "biological inference" modules, in which the graph is used to address specific biological questions:
It consists of two basic modules, in which we create a graph, visualize it, and call variants from it, and four modules in which the graph is used to address specific biological questions:
**Basics**
* [Create a graph and visualize it](#create-a-graph--visualize-it)
* [Call variants from a graph](#call-variants-from-a-graph)
* [Create a graph and visualize it](A_create_graph.ipynb)
* [Call variants from a graph](B_call_variants.ipynb)
**Biological inference**
* [Core genome alignment](#core-genome-alignment)
* [Insertion or deletion?](#categorize-svs-into-insertions-and-deletions)
This workshop is based on Jupyter notebooks, from which we run Python and bash code. You can run the notebooks from the sciCORE open-on-demand platform (http://ood-ubuntu.scicore.unibas.ch/) or from an editor like Visual Studio Code.
We are going to use [PGGB](https://pggb.readthedocs.io/en/latest/), a tool set developed by the Human Pangenome Reference Consortium, for creating and analyzing pangenome graphs. In addition, a few Python packages need to be installed, as explained below. Please try to do at least steps 1 and 2 below **before** the workshop. Step 2 in particular might take some time.
We are going to use [PGGB](https://pggb.readthedocs.io/en/latest/), a tool set developed by the Human Pangenome Reference Consortium, for creating and analyzing pangenome graphs. In addition, a few Python packages need to be installed, as explained below.
Log in to your sciCORE account and do the following:
@@ -51,33 +48,8 @@ The data we will explore in this workshop are 17 genomes that have been assemble
These genomes have neither been published nor thoroughly analyzed, so genuine discoveries are possible during the workshop!
## Basics
### Create a graph & visualize it
Given a set of assemblies and some assembly metadata, we select assemblies, rename them, and write them to a single FASTA file. This file is the only mandatory input for PGGB. We also explore the effect of two key parameters of PGGB, the minimum pairwise identity between seeds (-p) and the seed length (-s).
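The renaming and merging step can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the function names and the toy sequences are hypothetical, and the `sample#haplotype#contig` header scheme (PanSN) is assumed here because the PGGB documentation recommends it.

```python
def rename_fasta_records(fasta_text, sample, haplotype=1, delim="#"):
    """Rewrite FASTA headers as sample#haplotype#contig (assumed PanSN-style naming)."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Keep only the first word of the original header as the contig name
            contig = line[1:].split()[0]
            out.append(f">{sample}{delim}{haplotype}{delim}{contig}")
        elif line.strip():
            out.append(line.strip())
    return "\n".join(out) + "\n"


def combine_assemblies(assemblies):
    """Concatenate renamed assemblies; `assemblies` maps sample name to FASTA text."""
    return "".join(
        rename_fasta_records(text, sample) for sample, text in assemblies.items()
    )


# Toy input: two hypothetical strains with one contig each
example = {
    "strainA": ">contig_1 length=8\nACGTACGT\n",
    "strainB": ">contig_1 length=8\nACGTTCGT\n",
}
combined = combine_assemblies(example)
```

In practice the FASTA text would be read from the assembly files selected via the metadata table, and `combined` would be written to the single file passed to PGGB.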
### Call variants from a graph
In this part, we obtain variants in standard VCF format from the graph, using an arbitrary reference assembly. We summarize the indels and SVs and explore the complication of nested SVs.
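A summary of variant types can be sketched by comparing REF and ALT allele lengths in each VCF record. The 50 bp cutoff between indels and SVs is a common convention and an assumption here, as are the function names and the toy records. Symbolic alleles (such as `<DEL>`) and nested variants are not handled in this sketch.

```python
from collections import Counter


def classify_variant(ref, alt, sv_min_len=50):
    """Classify one REF/ALT pair by allele length (threshold is an assumption)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    size = abs(len(alt) - len(ref))
    if size == 0:
        return "MNP"
    return "SV" if size >= sv_min_len else "indel"


def summarize_vcf(vcf_lines, sv_min_len=50):
    """Count variant classes over VCF data lines, one count per ALT allele."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:
            counts[classify_variant(ref, alt, sv_min_len)] += 1
    return counts


# Toy records: a SNP, a 1 bp deletion, a 60 bp insertion, and a multiallelic site
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ref\t100\t.\tA\tG\t60\t.\t.",
    "ref\t200\t.\tAT\tA\t60\t.\t.",
    "ref\t300\t.\tA\t" + "A" + "C" * 60 + "\t60\t.\t.",
    "ref\t400\t.\tA\tG,AGG\t60\t.\t.",
]
summary = summarize_vcf(toy_vcf)
```

Counting per ALT allele rather than per record keeps multiallelic sites from being lumped into a single class.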
## Biological inference
### Core genome alignment
To make sense of genetic variation we need a phylogenetic tree. Here we traverse the graph and extract SNPs in nodes that are shared by all assemblies (i.e. the core genome). These SNPs are used to create an alignment and to estimate a tree.
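The notion of "nodes shared by all assemblies" can be illustrated on a graph in GFA format, where each `P` line lists the oriented nodes a path visits. The sketch below only finds the core nodes; it assumes one path per assembly, and the function name and toy graph are hypothetical. SNP extraction and alignment building are not shown.

```python
def core_nodes(gfa_lines):
    """Return the IDs of nodes visited by every path (P line) in a GFA v1 graph."""
    node_sets = []
    for line in gfa_lines:
        if not line.startswith("P\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        steps = fields[2].split(",")
        # Each step is a node ID followed by its orientation (+ or -)
        node_sets.append({step[:-1] for step in steps})
    return set.intersection(*node_sets) if node_sets else set()


# Toy graph: nodes 2 and 3 form a bubble, nodes 1 and 4 are shared by all paths
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACG",
    "S\t2\tT",
    "S\t3\tC",
    "S\t4\tGGA",
    "P\tstrainA\t1+,2+,4+\t*",
    "P\tstrainB\t1+,3+,4+\t*",
    "P\tstrainC\t1+,2+,4+\t*",
]
core = core_nodes(toy_gfa)
```

With multi-contig assemblies, the paths belonging to one assembly would first need to be merged into a single node set before taking the intersection.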
### Graph annotation & liftovers
In this part we explore variants in genes and regions of interest, making use of the lift-over functionalities of PGGB. These make it possible to translate positions in one genome to positions in any other genome in the graph.
### Categorize SVs into insertions and deletions
Telling whether a sequence missing in genome A reflects a deletion in an ancestor of A or an insertion elsewhere requires information about this sequence in more than two strains. Here we assess the frequency of the structural variant and its presence/absence in an outgroup strain in order to distinguish insertions from deletions.
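One simple way to express this reasoning is to treat the outgroup state as ancestral and polarize variants that are polymorphic within the ingroup. The rule below is an illustrative assumption, not necessarily the one used in the notebook, and the function and strain names are hypothetical.

```python
def polarize_sv(presence, outgroup):
    """Label a presence/absence variant as insertion or deletion.

    presence: dict mapping strain name to True (sequence present) or False.
    outgroup: name of the strain whose state is assumed to be ancestral.
    Returns (label, frequency of the sequence among ingroup strains).
    """
    ingroup = [present for strain, present in presence.items() if strain != outgroup]
    freq_present = sum(ingroup) / len(ingroup)
    if freq_present in (0.0, 1.0):
        # Not polymorphic within the ingroup: the outgroup alone cannot polarize it
        return "undetermined", freq_present
    if presence[outgroup]:
        # Sequence is ancestral, so ingroup strains lacking it carry a deletion
        return "deletion", freq_present
    # Sequence is absent in the outgroup, so ingroup strains carrying it gained it
    return "insertion", freq_present


label_del, freq_del = polarize_sv(
    {"outgroup": True, "A": False, "B": True, "C": True}, "outgroup"
)
label_ins, freq_ins = polarize_sv(
    {"outgroup": False, "A": True, "B": False, "C": False}, "outgroup"
)
```

A single outgroup can mislead if the variant arose on the outgroup branch itself, so the ingroup frequency is returned alongside the label as a sanity check.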
### Identify gene conversion events
Gene conversion occurs through recombination between close paralogs and can result in heavily mutated genes. Gene conversion in the MTBC affects the functionally interesting PE/PPE gene families. Furthermore, treating SNPs introduced through gene conversion as point mutations can bias various downstream analyses. Here we identify gene conversion events, which give themselves away as variant hotspots in single strains.
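Variant hotspots can be sketched as windows in which the variants private to one strain are unusually dense. The window size and the minimum count below are arbitrary assumptions chosen for illustration, and the function name and positions are hypothetical.

```python
def find_hotspots(positions, window=100, min_variants=5):
    """Find clusters with at least `min_variants` positions within `window` bp.

    positions: variant positions private to a single strain.
    Returns a list of (start, end, variant_count), with overlapping windows merged.
    """
    pos = sorted(positions)
    merged = []
    j = 0
    for i in range(len(pos)):
        # Move j to the first position outside the window that starts at pos[i]
        while j < len(pos) and pos[j] - pos[i] < window:
            j += 1
        if j - i >= min_variants:
            start, end = pos[i], pos[j - 1]
            if merged and start <= merged[-1][1]:
                # Overlaps the previous hotspot: extend it
                merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
            else:
                merged.append((start, end))
    return [(s, e, sum(s <= p <= e for p in pos)) for s, e in merged]


# Toy example: five variants within 40 bp, flanked by isolated variants
private_snps = [10, 500, 510, 520, 530, 540, 2000]
hotspots = find_hotspots(private_snps)
```

Candidate hotspots found this way would still need to be checked against a paralogous donor sequence before being called gene conversion.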
### Large unassembled duplications
Large duplications and amplifications pose a challenge for long-read sequencing. But even if these regions may not be assembled properly, duplications can be identified and studied.
## More modules to come...
- Mapping long reads against the assemblies to assess assembly accuracy