Workshop day 0

Commit 310baa99 authored 6 months ago by Christoph Stritt
--- a/workshop/README.md
+++ b/workshop/README.md
 # Workshop: Long-read genome assembly and pangenome graph analysis
-Swiss TPH, 5.11.2024, christoph.stritt@swisstph.ch
+Swiss TPH, 5.11.2024
+
+christoph.stritt@swisstph.ch

 The aim of this workshop is to get familiar with the pangenome graph framework, which, given a set of accurate assemblies, allows a full characterization of genetic variation of any type in any region of the genome.

-It consists of two basic modules where we create a graph, visualize it, and call variants from it; and five "biological inference" modules where the graph is used to do various things:
+It consists of two basic modules where we create a graph, visualize it, and call variants from it; and four modules where the graph is used to do various things:


 **Basics**
-*  [Create a graph and visualize it](#create-a-graph--visualize-it)
-*  [Call variants from a graph](#call-variants-from-a-graph)
+*  [Create a graph and visualize it](A_create_graph.ipynb)
+*  [Call variants from a graph](B_call_variants.ipynb)

-**Biological inference**
-*  [Core genome alignment](#core-genome-alignment)
-*  [Insertion or deletion?](#categorize-svs-into-insertions-and-deletions)
-*  [Identify gene conversion events](#identify-gene-conversion-events)
-*  [Large unassembled duplications](#large-unassembled-duplications)
+**Less basic**
+*  [Core genome alignment for a phylogeny](C_core_genome_alignment.ipynb)
+*  [SV: insertion or deletion?](D_categorize_SVs.ipynb)
+*  [Identify gene conversion events](E_gene_conversion.ipynb)
+*  [Large unassembled duplications](F_large_duplications.ipynb)


 ## Set up
 This workshop is based on Jupyter notebooks, from which we run Python and bash code. You can run the notebooks from the sciCORE open-on-demand platform (http://ood-ubuntu.scicore.unibas.ch/) or from an editor like Visual Studio Code.

-We are going to use [PGGB](https://pggb.readthedocs.io/en/latest/) for creating and analyzing pangenome graphs, a tool set developed by the Human Pangenome Reference Consortium. In addition, a few Python packages need to be installed, as explained below. Please try to do at least steps 1 and 2 below **before** the workshop. Especially step 2 might take some time.
+We are going to use [PGGB](https://pggb.readthedocs.io/en/latest/) for creating and analyzing pangenome graphs, a tool set developed by the Human Pangenome Reference Consortium. In addition, a few Python packages need to be installed, as explained below.

+Log in to your sciCORE account and do the following:

 Step 1: Clone the repository from GitLab.
 ```
 git clone https://git.scicore.unibas.ch/TBRU/PacbioSnake
 ```

-
 Step 2: Copy the PGGB container into the pipeline directory
 ```
 cd PacbioSnake
 cp /scicore/home/gagneux/GROUP/PacbioSnake_resources/containers/pggb_latest.sif ./
 ```

-
 Step 3: Set up a conda environment with the required Python packages. This environment can then be selected as the kernel for running the notebooks.
-
-pandas, biopython, singularity, matplotlib, ipykernel, raxml-ng
-
 ```
-conda env create -f config/environment.yml
-
+conda create -n pacbio_ws singularity=3.8.6 pandas biopython matplotlib ipykernel
 ```
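
To confirm that the kernel selected for the notebooks sees the packages installed in step 3, one option is to try locating them from a notebook cell. This is only a minimal sketch: the helper name `missing_packages` is hypothetical, and note that Biopython is imported as `Bio`.

```python
import importlib.util


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the Python packages from step 3 (Biopython imports as "Bio").
required = ["pandas", "Bio", "matplotlib", "ipykernel"]
print(missing_packages(required))
```

An empty list means all four packages can be imported; any name that is printed is missing from the active environment.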


@@ -51,33 +48,8 @@ The data we will explore in this workshop are 17 genomes that have been assemble

 These genomes have neither been published nor thoroughly analyzed, so genuine discoveries are possible during the workshop!

-## Basics
-
-### Create a graph & visualize it
-Given a set of assemblies and some assembly metadata, we select assemblies, rename them, and write them to a single fasta file. This is the only mandatory input for PGGB. We also explore the effect of two key parameters of PGGB, the minimum pairwise identity between seeds (-p) and the seed length (-s). 
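
The renaming and merging step could look roughly like the sketch below. It is not the notebook's code: the function name `combine_assemblies` is hypothetical, the headers follow the `sample#haplotype#contig` (PanSN) pattern recommended in the PGGB documentation, and the haplotype id `1` is an assumption for haploid genomes. Each header is assumed to contain a contig name.

```python
def combine_assemblies(assemblies, out_path):
    """Write all assemblies to one FASTA file with renamed headers.

    assemblies: dict mapping a sample name to the path of its FASTA file.
    Returns the number of sequences written.
    """
    n_sequences = 0
    with open(out_path, "w") as out:
        for sample, fasta in assemblies.items():
            with open(fasta) as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if not line:
                        continue  # skip empty lines
                    if line.startswith(">"):
                        # Keep only the first word of the original header.
                        contig = line[1:].split()[0]
                        out.write(f">{sample}#1#{contig}\n")
                        n_sequences += 1
                    else:
                        out.write(line + "\n")
    return n_sequences
```

The resulting file can then be passed to PGGB as its single input.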
-
-### Call variants from a graph
-In this part, we obtain variants in classic vcf format from the graph, using an arbitrary reference assembly. A summary is created of the indels and SVs, and the complication of nested SVs is explored. 
-
-
-## Biological inference
-
-### Core genome alignment
-To make sense of genetic variation we need a phylogenetic tree. Here we traverse the graph and extract SNPs in nodes that are shared by all assemblies (i.e. the core genome). These SNPs are used to create an alignment and to estimate a tree.
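
The core-genome filter can be illustrated independently of the graph. The sketch below assumes the variable sites have already been extracted into a dictionary of `{site: {strain: base}}`; this data layout and the function name `core_snp_alignment` are assumptions for illustration, not the format used in the notebook.

```python
def core_snp_alignment(sites, strains):
    """Build an alignment from SNP sites that are present in every strain.

    sites: dict mapping a site id to a dict {strain: base}.
    strains: list of all strain names.
    Returns a dict {strain: concatenated SNP alleles}.
    """
    alignment = {strain: [] for strain in strains}
    for site_id in sorted(sites):
        alleles = sites[site_id]
        # Core genome: the site must be covered by all strains.
        if any(strain not in alleles for strain in strains):
            continue
        # Keep only variable sites with single-base alleles.
        bases = {alleles[strain] for strain in strains}
        if len(bases) < 2 or any(len(base) != 1 for base in bases):
            continue
        for strain in strains:
            alignment[strain].append(alleles[strain])
    return {strain: "".join(seq) for strain, seq in alignment.items()}
```

The returned sequences all have the same length and can be written to a FASTA file as input for a tree-building program.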
-
-### Graph annotation & liftovers
-In this part we explore variants in genes and regions of interest, making use of the lift-over functionalities of PGGB. These allow positions in one genome to be translated to any other genome in the graph.
-
-### Categorize SVs into insertions and deletions
-Telling whether a sequence missing in genome A reflects a deletion in an ancestor of A or an insertion elsewhere requires information about this sequence in more than two strains. Here we assess the frequency of the structural variant and its presence/absence in an outgroup strain in order to distinguish insertions from deletions.
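
The decision rule described here can be sketched as follows. The function name `classify_sv` and the input layout are assumptions for illustration. A single outgroup can also mislead, for example if it lost the sequence independently, which is why the frequency is returned alongside the label.

```python
def classify_sv(present_in, ingroup, outgroup):
    """Polarize a presence/absence variant using an outgroup strain.

    present_in: set of strains that carry the sequence.
    ingroup: list of ingroup strain names.
    outgroup: name of the outgroup strain.
    Returns (label, frequency of the sequence in the ingroup).
    """
    n_present = sum(strain in present_in for strain in ingroup)
    frequency = n_present / len(ingroup)
    if n_present == 0 or n_present == len(ingroup):
        label = "not polymorphic in ingroup"
    elif outgroup in present_in:
        # Sequence is likely ancestral, so strains lacking it carry a deletion.
        label = "deletion"
    else:
        # Sequence is absent from the outgroup, so carriers likely gained it.
        label = "insertion"
    return label, frequency
```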
-
-### Identify gene conversion events
-Gene conversion occurs through recombination between close paralogs and can result in heavily mutated genes. Gene conversion in the MTBC affects the functionally interesting PE/PPE gene families. Furthermore, treating SNPs introduced through gene conversion as point mutations can bias various downstream analyses. Here we identify gene conversion events, which give themselves away as variant hotspots in single strains.
-
-### Large unassembled duplications
-Large duplications and amplifications pose a challenge for long-read sequencing. But even if these regions may not be assembled properly, duplications can be identified and studied. 
-

 ## More modules to come...
-
+- Mapping long reads against the assemblies to assess assembly accuracy
 - Mapping short reads against a graph
+