Workshop continued...

ddaf6496 · cstritt · 922071c0 · ddaf6496 · ddaf6496 · ddaf6496
Commit ddaf6496 authored 9 months ago by cstritt
--- a/workshop/A_create_graph.ipynb
+++ b/workshop/A_create_graph.ipynb
@@ -1140,7 +1140,7 @@
    "\n",
    "❓Explore the graph. Is there a lot of structural variation? \n",
    "\n",
-    "❓In the workshop/data/genes folder, there are the sequences of some features of interest. Pick one and blast it against the graph. What do you find?\n",
+    "❓In the workshop/data/genes folder, there are the sequences of some features of interest. TBD1, for instance, is a region absent in the reference genome H37Rv and thus invisible to most short-read approaches. How does TBD1 look like? \n",
    "\n"
   ]
  }

 %% Cell type:markdown id: tags:

 # Create a pangenome graph with PGGB

 The folder **/scicore/home/gagneux/GROUP/PacbioSnake_resources/L1_samples** contains the output from running the assembly pipeline on the 19 L1 strains. In a first step, we will use the assembly summaries created by the pipeline to decide which assemblies to include in the graph.

 The assembly summaries can be found here: **/scicore/home/gagneux/GROUP/PacbioSnake_resources/workshop/L1_samples/assembly_summaries.tsv**

 PGGB requires as input a single fasta file, so in a second step we will combine the selected sequences and rename them.

 %% Cell type:markdown id: tags:

 # Inspect the Flye assembly summary

 %% Cell type:code id: tags:

 ``` python
 import pandas

 path_to_summary = '/scicore/home/gagneux/GROUP/PacbioSnake_resources/workshop/L1_samples/assembly_summaries.tsv'

 md = pandas.read_csv(path_to_summary, sep='\t')
 md.head()
 ```

 %% Output

         sample  #seq_name   length  cov. circ. repeat  mult. alt_group graph_path
    0  PB000195   contig_1  4444823   113     Y      N      1         *          1
    1  PB000196  contig_29  4415329   133     N      N      1         *      29,30
    2  PB000196  contig_36    13589    31     N      Y      1        38     *,36,*
    3  PB000196  contig_46    11948    46     N      Y      1        46     *,46,*
    4  PB000196  contig_55    10270    45     N      Y      1        56     *,55,*

 %% Cell type:markdown id: tags:

 How many sequences per strain?

 %% Cell type:code id: tags:

 ``` python
 md['sample'].value_counts()
 ```

 %% Output

    sample
    PB000205    49
    PB000202     7
    PB000196     6
    PB000220     2
    PB000198     2
    PB000207     1
    PB000219     1
    PB000211     1
    PB000210     1
    PB000209     1
    PB000208     1
    PB000195     1
    PB000206     1
    PB000203     1
    PB000201     1
    PB000200     1
    PB000199     1
    PB000197     1
    PB000204     1
    Name: count, dtype: int64

 %% Cell type:markdown id: tags:

 Which sequences are cirularized?

 %% Cell type:code id: tags:

 ``` python
 md[md['circ.'] == 'Y']
 ```

 %% Output

          sample  #seq_name   length  cov. circ. repeat  mult. alt_group  \
    0   PB000195   contig_1  4444823   113     Y      N      1         *
    7   PB000197   contig_1  4410932   129     Y      N      1         *
    10  PB000199   contig_1  4426248   112     Y      N      1         *
    11  PB000200   contig_1  4428581    83     Y      N      1         *
    12  PB000201   contig_1  4408479    96     Y      N      1         *
    20  PB000203   contig_1  4434008    47     Y      N      1         *
    21  PB000204   contig_1  4419776    74     Y      N      1         *
    22  PB000205  contig_53  4420883    37     Y      Y      2         *
    71  PB000206   contig_1  4440794   117     Y      N      1         *
    72  PB000207   contig_1  4424460    85     Y      N      1         *
    73  PB000208   contig_1  4419922   119     Y      N      1         *
    74  PB000209   contig_1  4443791    80     Y      N      1         *
    75  PB000210   contig_1  4444571   115     Y      N      1         *
    76  PB000211   contig_1  4441994   121     Y      N      1         *
    77  PB000219   contig_1  4410932   101     Y      N      1         *
    
       graph_path
    0           1
    7           1
    10          1
    11          1
    12          1
    20          1
    21          1
    22         53
    71          1
    72          1
    73          1
    74          1
    75          1
    76          1
    77          1

 %% Cell type:markdown id: tags:

 Sequences larger than 4 Mb

 %% Cell type:code id: tags:

 ``` python
 md[md['length']>4e6]
 ```

 %% Output

          sample  #seq_name   length  cov. circ. repeat  mult. alt_group  \
    0   PB000195   contig_1  4444823   113     Y      N      1         *
    1   PB000196  contig_29  4415329   133     N      N      1         *
    7   PB000197   contig_1  4410932   129     Y      N      1         *
    8   PB000198   contig_1  4332804    86     N      N      1         *
    10  PB000199   contig_1  4426248   112     Y      N      1         *
    11  PB000200   contig_1  4428581    83     Y      N      1         *
    12  PB000201   contig_1  4408479    96     Y      N      1         *
    20  PB000203   contig_1  4434008    47     Y      N      1         *
    21  PB000204   contig_1  4419776    74     Y      N      1         *
    22  PB000205  contig_53  4420883    37     Y      Y      2         *
    71  PB000206   contig_1  4440794   117     Y      N      1         *
    72  PB000207   contig_1  4424460    85     Y      N      1         *
    73  PB000208   contig_1  4419922   119     Y      N      1         *
    74  PB000209   contig_1  4443791    80     Y      N      1         *
    75  PB000210   contig_1  4444571   115     Y      N      1         *
    76  PB000211   contig_1  4441994   121     Y      N      1         *
    77  PB000219   contig_1  4410932   101     Y      N      1         *
    78  PB000220   contig_1  4319283    88     N      N      1         *
    
       graph_path
    0           1
    1       29,30
    7           1
    8       2,1,2
    10          1
    11          1
    12          1
    20          1
    21          1
    22         53
    71          1
    72          1
    73          1
    74          1
    75          1
    76          1
    77          1
    78      2,1,2

 %% Cell type:markdown id: tags:

 Which samples lack a circular genome?

 %% Cell type:code id: tags:

 ``` python
 samples_circ = set(md['sample'][md['circ.'] == 'Y'])
 samples_nocirc = md['sample'][md['sample'].isin(samples_circ) == False]

 print(set(samples_nocirc))
 ```

 %% Output

    {'PB000220', 'PB000198', 'PB000202', 'PB000196'}

 %% Cell type:markdown id: tags:


 ❓ What is the matter with these genomes?

 ❓ Should we include them in the graph?

 %% Cell type:code id: tags:

 ``` python
 md[md['sample'] == 'PB000220']
 ```

 %% Output

          sample #seq_name   length  cov. circ. repeat  mult. alt_group graph_path
    78  PB000220  contig_1  4319283    88     N      N      1         *      2,1,2
    79  PB000220  contig_2   110156   166     N      Y      2         *          2

 %% Cell type:markdown id: tags:

 # Combine complete genomes

 Let us be liberal and just use all sequences > 4 Mb. We can add a tag to the name of the 'suspicuous' ones.

 To combine assemblies, we will loop through the pipeline output directories and get the sequences from the individual fasta files. We will also add H37Rv, on the one hand to have an outgroup, on the other to use the H37Rv annotation.

 %% Cell type:code id: tags:

 ``` python
 import os
 from Bio import SeqIO


 path_to_pipeline_output = '/scicore/home/gagneux/GROUP/PacbioSnake_resources/workshop/L1_samples'

 # Create results directory and output handle
 os.makedirs('results', exist_ok=True)
 output_handle = open('./results/assemblies_4mb.fasta', 'w')

 # Get assembly summary data for all sequences > 4Mb
 md_4mb = md[md['length']>4e6]


 # Loop through pipeline output directory
 for root, dirs, files in os.walk(path_to_pipeline_output):
    for file in files:
        if file.startswith('PB') and file.endswith('.fasta'):

            for rec in SeqIO.parse(os.path.join(root, file), "fasta"):

                if len(rec.seq) < 4e6:
                    continue

                strain = rec.id.split('_')[0]

                # Look up the strain in the assembly summary
                strain_circ = md_4mb['circ.'][md_4mb['sample'] == strain].to_string(index=False)
                strain_repeat = md_4mb['repeat'][md_4mb['sample'] == strain].to_string(index=False)

                # Create new seq ID
                new_id = strain
                if strain_circ == 'N':
                    new_id += '#circNo'
                if strain_repeat == 'Y':
                    new_id += '#repeatY'
                rec.id = new_id
                rec.description = ''
                SeqIO.write(rec, output_handle, 'fasta')


 output_handle.close()

 ```

 %% Cell type:markdown id: tags:

 Finally, add H37Rv, zip and index the sequences.

 %% Cell type:code id: tags:

 ``` python
 %%bash
 PATH_TO_PGGB=/scicore/home/gagneux/stritt0001/programs/PacbioSnake/pggb_latest.sif

 sed '1s/^>.*$/>H37Rv/' data/GCF_000195955.2_ASM19595v2_genomic.fna >> ./results/assemblies_4mb.fasta

 singularity exec ${PATH_TO_PGGB} bgzip ./results/assemblies_4mb.fasta
 singularity exec ${PATH_TO_PGGB} samtools faidx ./results/assemblies_4mb.fasta.gz
 ```

 %% Output

    INFO:    Converting SIF file to temporary sandbox...
    INFO:    Cleaning up image...
    INFO:    Converting SIF file to temporary sandbox...
    INFO:    Cleaning up image...

 %% Cell type:markdown id: tags:

 # Run the PanGenome Graph Builder

 Now we can create a graph, which essentially is an all-vs-all alignment that is translated into a concise format of nodes, edges, and paths.


 The main parameters parameters of PGGB are:

    -i : path to combined assemblies
    -o : output directory name
    -n : number of haplotypes
    -p : minimum pairwise identity between seeds
    -s : seed length


 https://pggb.readthedocs.io/en/latest/


 As shown by [Yang et al. 2023](https://doi.org/10.3389/fgene.2023.1225248), different values for -p and -s affect the structure of the graph, especially when building it from fairly diverse genomes. For TB with its highly similar genomes, I don't expect a dramatic effect of changing these parameters. The exception might be -s when there are larger duplications. -s should approximately correspond to the maximum length of repeats in the genome. If set too small, a duplication larger than -s could be represented somewhat confusingly in the graph as multiple variants.


 [Here](pics/pggb-flow-diagram.png) is a flow diagram of what PGGB does.


 %% Cell type:markdown id: tags:

 ### Create slurm script

 %% Cell type:code id: tags:

 ``` python
 %%bash

 PATH_TO_PGGB=/scicore/home/gagneux/stritt0001/programs/PacbioSnake/pggb_latest.sif
 INPUT=/scicore/home/gagneux/stritt0001/programs/PacbioSnake/workshop/results/assemblies_4mb.fasta.gz
 OUTDIR=/scicore/home/gagneux/stritt0001/programs/PacbioSnake/workshop/results/pggb

 N_HAPLOTYPES=18
 MIN_SEED_ID=99
 SEED_LENGTH=5k

 echo "\
 #!/bin/bash

 #SBATCH --cpus-per-task=20
 #SBATCH --mem-per-cpu=1G
 #SBATCH --time=06:00:00
 #SBATCH --qos=6hours
 #SBATCH --output=results/pggb_stdout.%j
 #SBATCH --error=results/pggb_stderr.%j

 singularity exec ${PATH_TO_PGGB} pggb \
  -i ${INPUT} \
  -o ${OUTDIR} \
  -m \
  -t 20 \
  -n ${N_HAPLOTYPES} \
  -p ${MIN_SEED_ID} \
  -s ${SEED_LENGTH}" > results/run_pggb.slrm
 ```

 %% Cell type:markdown id: tags:

 And submit the job. With 20 CPUs and 18 strains, this should take less than 10 minutes.

 %% Cell type:code id: tags:

 ``` python
 %%bash

 sbatch results/run_pggb.slrm
 ```

 %% Output

    Submitted batch job 12295701

 %% Cell type:markdown id: tags:

 PGGB creates quite some output. We will only use the file ending with .gfa, which is ..., and the one ending with .og, which is a ...

 %% Cell type:markdown id: tags:

 # Graph visualization

 Like IGV for short reads, visualization tools

 Bandage allows you to:

  - browse a graph interactively
  - zoom in on regions of interest, export figures
  - extract node sequences
  - blast sequences against the graph

 A limitation of Bandage is that it does not allow you to see which strain follows which path through the graph. There are **more recent tools** I haven't explored yet, for example [VRPG](https://www.evomicslab.org/app/vrpg/) or [PanGraphViewer](https://github.com/TF-Chan-Lab/panGraphViewer), which seem to be able to do this.


 To **run Bandage**, open a sciCORE Desktop on https://ood-ubuntu.scicore.unibas.ch. Then open a terminal, load the Bandage module, and open the GUI.

 ```bash
 ml Bandage/0.9.0-GCCcore-13.2.0
 Bandage
 ```

 Then you should be able to load the graph from where you have saved it.

 ❓Explore the graph. Is there a lot of structural variation?

-❓In the workshop/data/genes folder, there are the sequences of some features of interest. Pick one and blast it against the graph. What do you find?
+❓In the workshop/data/genes folder, there are the sequences of some features of interest. TBD1, for instance, is a region absent in the reference genome H37Rv and thus invisible to most short-read approaches. How does TBD1 look like?


--- a/workshop/C_core_genome_alignment.ipynb
+++ b/workshop/C_core_genome_alignment.ipynb
@@ -8,12 +8,12 @@
    "\n",
    "Nothing makes sense in biology except in the light of a phylogenetic tree... \n",
    "\n",
-    "In this module, we take the variants from the vcf and create a SNP alignment. We further use the graph to estimate the number of non-variable positions. "
+    "In this module, we take the variants from the graph vcf and create a SNP alignment. We further use the graph to estimate the number of non-variable positions. "
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
@@ -43,32 +43,28 @@
    "print(f'Number of excluded positions: {len(excluded)}')"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "By using the variants.vcf file, we will also include sites that do not vary within the L1 samples, but only between L1 and H37Rv."
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "['H37Rv', '2163475', '>10733>10736', 'G', 'A', '60', '.', 'AC=0;AF=0;AN=17;AT=>10733>10734>10736,>10733>10735>10736;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']\n",
-      "['H37Rv', '2163481', '>10736>10739', 'G', 'A', '60', '.', 'AC=0;AF=0;AN=17;AT=>10736>10737>10739,>10736>10738>10739;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']\n",
-      "['H37Rv', '2163484', '>10739>10742', 'A', 'C', '60', '.', 'AC=0;AF=0;AN=17;AT=>10739>10740>10742,>10739>10741>10742;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']\n",
-      "['H37Rv', '2163486', '>10742>10745', 'C', 'G', '60', '.', 'AC=0;AF=0;AN=17;AT=>10742>10743>10745,>10742>10744>10745;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']\n",
-      "['H37Rv', '2163488', '>10745>10748', 'T', 'C', '60', '.', 'AC=0;AF=0;AN=17;AT=>10745>10746>10748,>10745>10747>10748;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']\n",
-      "['H37Rv', '3113846', '>15187>15190', 'G', 'C', '60', '.', 'AC=0;AF=0;AN=17;AT=>15187>15188>15190,>15187>15189>15190;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '', '', '.', '', '', '', '0', '0']\n",
-      "['H37Rv', '3113855', '>15193>15196', 'G', 'C', '60', '.', 'AC=0;AF=0;AN=17;AT=>15193>15194>15196,>15193>15195>15196;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '', '', '.', '', '', '', '0', '0']\n",
-      "['H37Rv', '3113862', '>15196>15199', 'C', 'T', '60', '.', 'AC=0;AF=0;AN=17;AT=>15196>15198>15199,>15196>15197>15199;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '', '', '.', '', '', '', '0', '0']\n",
-      "['H37Rv', '3113874', '>15204>15207', 'G', 'A', '60', '.', 'AC=0;AF=0;AN=17;AT=>15204>15205>15207,>15204>15206>15207;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '0', '', '.', '', '', '', '0', '0']\n",
-      "['H37Rv', '3113879', '>15207>15210', 'C', 'T', '60', '.', 'AC=0;AF=0;AN=17;AT=>15207>15208>15210,>15207>15209>15210;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '0', '', '.', '', '', '', '0', '0']\n",
-      "['H37Rv', '3935950', '>19372>19376', 'T', 'C', '60', '.', 'AC=17;AF=1;AN=17;AT=>19372>19373>19376,>19372>19375>19376;NS=17;LV=2;PS=>19356>19447', 'GT', '1', '1', '1', '', '.', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']\n",
      "Filtered out:\n",
-      "In repeat or DR locus 729\n",
-      "Not SNP 859\n",
-      "Missing genotypes 312\n",
-      "Invariant 21\n",
-      "Kept:  4505\n"
+      "In repeat or DR locus 849\n",
+      "Not SNP 967\n",
+      "Missing genotypes 317\n",
+      "Invariant 1\n",
+      "Kept:  5115\n"
     ]
    }
   ],
@@ -125,7 +121,7 @@
    "        \n",
    "        # Check if all strains have genotypes (not sure what happened if not)\n",
    "        if len(gt) != len(strains):\n",
-    "            print(fields)\n",
+    "            #print(fields)\n",
    "            continue\n",
    "\n",
    "        # Write alleles to site\n",
@@ -141,6 +137,7 @@
    "            else:\n",
    "                strain_allele = alleles[int(strain_gt)]\n",
    "            site += strain_allele\n",
+    "\n",
    "            \n",
    "        # Exclude sites with more too many missing genotypes\n",
    "        prop_missing = site.count('-') / len(site)\n",
@@ -151,6 +148,7 @@
    "        \n",
    "        # Re-check if there are remaining invariant sites\n",
    "        bases_only = site.replace('-', '')\n",
+    "        bases_only += ref\n",
    "        if len(set(bases_only)) == 1:\n",
    "            discarded[\"Invariant\"] += 1\n",
    "            continue\n",
@@ -182,7 +180,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
@@ -191,7 +189,7 @@
       "19"
      ]
     },
-     "execution_count": 15,
+     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -224,7 +222,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 16,
+   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
@@ -297,14 +295,14 @@
   "source": [
    "# Estimate a tree\n",
    "\n",
-    "Use Stamatakis model in RAXML to account for invariable sites and correct branch lengths.\n",
+    "... using your method of choice. \n",
    "\n",
-    "--model GTR+G+ASC_STAM{633511/1206483/1214591/640523}"
+    "Below I use raxml-ng with the Stamatakis model to account for invariable sites and correct branch lengths."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": 10,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
@@ -315,7 +313,7 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "Submitted batch job 12899598\n"
+      "Submitted batch job 13472974\n"
     ]
    }
   ],
@@ -336,7 +334,7 @@
    "  \n",
    "raxml-ng \\\n",
    "  --msa results/H37Rv_SNP_alignment.from_graph.fasta \\\n",
-    "  --all --model GTR+G+ASC_STAM{633312/1206359/1214398/640320} \\\n",
+    "  --all --model GTR+G+ASC_STAM{712428/1359777/1354977/713950} \\\n",
    "  --tree pars{10} --bs-trees 100 \\\n",
    "  --outgroup H37Rv \\\n",
    "  --threads 20\" > run_raxml.slrm\n",
@@ -351,7 +349,8 @@
   "source": [
    "Plot the tree, for example with https://itol.embl.de.\n",
    "\n",
-    "❓ Is there anything remarkable about the tree? "
+    "❓ Is there anything remarkable about the tree?\n",
+    "❓ "
   ]
  }
 ],

 %% Cell type:markdown id: tags:

 # From graph to phylogeny

 Nothing makes sense in biology except in the light of a phylogenetic tree...

-In this module, we take the variants from the vcf and create a SNP alignment. We further use the graph to estimate the number of non-variable positions.
+In this module, we take the variants from the graph vcf and create a SNP alignment. We further use the graph to estimate the number of non-variable positions.

 %% Cell type:code id: tags:

 ``` python
 # Create list of positions to exclude
 import pandas

 dr_loci = pandas.read_csv("data/DR_genes_coordinates.tsv", sep="\t")
 repeats = pandas.read_csv("data/H37Rv.nucmer_repeats_id90_l50.tsv")

 excluded = set()

 for i, gene in dr_loci.iterrows():
    excluded.update(range(gene["start_region"], gene["stop_region"]+1))

 for i,hpair in repeats.iterrows():
    excluded.update(range(hpair["S1"], hpair["E1"]+1))
    excluded.update(range(hpair["S2"], hpair["E2"]+1))

 print(f'Number of excluded positions: {len(excluded)}')
 ```

 %% Output

    Number of excluded positions: 176136

+%% Cell type:markdown id: tags:
+
+By using the variants.vcf file, we will also include sites that do not vary within the L1 samples, but only between L1 and H37Rv.
+
 %% Cell type:code id: tags:

 ``` python
 variants = 'results/variants.vcf'
 outgroup = 'H37Rv'

 max_missing_prop = 0.1

 keep = []

 discarded = {
    "In repeat or DR locus" : 0,
    "Not SNP" : 0,
    "Missing genotypes" : 0,
    "Invariant" : 0
 }

 with open(variants) as f:

    for line in f:

        if line.startswith("#CHROM"):

            fields = line.strip().split("\t")
            strains = fields[9:]

            # Dictionary to store SNPs for each strain
            strain_seqs = {strain : "" for strain in strains}
            strain_seqs["H37Rv"] = ""
            continue

        elif line.startswith("#"):
            continue

        fields = line.strip().split("\t")
        position = int(fields[1])

        ref = fields[3]
        alt = fields[4].split(",")
        alleles = [ref] + alt

        gt = fields[9:]

        # Exclude sites in repeats and DR loci
        if position in excluded:
            discarded["In repeat or DR locus"] += 1
            continue

        # Only SNPs
        if len(ref) > 1 or any(len(x) > 1 for x in alt):
            discarded["Not SNP"] += 1
            continue

        # Check if all strains have genotypes (not sure what happened if not)
        if len(gt) != len(strains):
-            print(fields)
+            #print(fields)
            continue

        # Write alleles to site
        site = ''

        for strain in strains:

            strain_i = strains.index(strain)
            strain_gt = gt[strain_i]

            if not strain_gt or strain_gt not in '012345':
                strain_allele = '-'
            else:
                strain_allele = alleles[int(strain_gt)]
            site += strain_allele

+
        # Exclude sites with more too many missing genotypes
        prop_missing = site.count('-') / len(site)
        if prop_missing > max_missing_prop:
            discarded["Missing genotypes"] += 1
            continue


        # Re-check if there are remaining invariant sites
        bases_only = site.replace('-', '')
+        bases_only += ref
        if len(set(bases_only)) == 1:
            discarded["Invariant"] += 1
            continue


        for i, allele in enumerate(site):
            strain = strains[i]
            strain_seqs[strain] += allele

        # Write H37Rv allele
        strain_seqs["H37Rv"] += ref

        keep.append((position, alleles, gt))


 print("Filtered out:")
 for key, value in discarded.items():
    print(key, str(value))

 print("Kept: ", str(len(keep)))
 ```

 %% Output

-    ['H37Rv', '2163475', '>10733>10736', 'G', 'A', '60', '.', 'AC=0;AF=0;AN=17;AT=>10733>10734>10736,>10733>10735>10736;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']
-    ['H37Rv', '2163481', '>10736>10739', 'G', 'A', '60', '.', 'AC=0;AF=0;AN=17;AT=>10736>10737>10739,>10736>10738>10739;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']
-    ['H37Rv', '2163484', '>10739>10742', 'A', 'C', '60', '.', 'AC=0;AF=0;AN=17;AT=>10739>10740>10742,>10739>10741>10742;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']
-    ['H37Rv', '2163486', '>10742>10745', 'C', 'G', '60', '.', 'AC=0;AF=0;AN=17;AT=>10742>10743>10745,>10742>10744>10745;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']
-    ['H37Rv', '2163488', '>10745>10748', 'T', 'C', '60', '.', 'AC=0;AF=0;AN=17;AT=>10745>10746>10748,>10745>10747>10748;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']
-    ['H37Rv', '3113846', '>15187>15190', 'G', 'C', '60', '.', 'AC=0;AF=0;AN=17;AT=>15187>15188>15190,>15187>15189>15190;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '', '', '.', '', '', '', '0', '0']
-    ['H37Rv', '3113855', '>15193>15196', 'G', 'C', '60', '.', 'AC=0;AF=0;AN=17;AT=>15193>15194>15196,>15193>15195>15196;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '', '', '.', '', '', '', '0', '0']
-    ['H37Rv', '3113862', '>15196>15199', 'C', 'T', '60', '.', 'AC=0;AF=0;AN=17;AT=>15196>15198>15199,>15196>15197>15199;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '', '', '.', '', '', '', '0', '0']
-    ['H37Rv', '3113874', '>15204>15207', 'G', 'A', '60', '.', 'AC=0;AF=0;AN=17;AT=>15204>15205>15207,>15204>15206>15207;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '0', '', '.', '', '', '', '0', '0']
-    ['H37Rv', '3113879', '>15207>15210', 'C', 'T', '60', '.', 'AC=0;AF=0;AN=17;AT=>15207>15208>15210,>15207>15209>15210;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '0', '', '.', '', '', '', '0', '0']
-    ['H37Rv', '3935950', '>19372>19376', 'T', 'C', '60', '.', 'AC=17;AF=1;AN=17;AT=>19372>19373>19376,>19372>19375>19376;NS=17;LV=2;PS=>19356>19447', 'GT', '1', '1', '1', '', '.', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']
    Filtered out:
-    In repeat or DR locus 729
-    Not SNP 859
-    Missing genotypes 312
-    Invariant 21
-    Kept:  4505
+    In repeat or DR locus 849
+    Not SNP 967
+    Missing genotypes 317
+    Invariant 1
+    Kept:  5115

 %% Cell type:markdown id: tags:

 Write variable alignment to fasta

 %% Cell type:code id: tags:

 ``` python
 from Bio import SeqIO
 from Bio.Seq import Seq
 from Bio.SeqRecord import SeqRecord

 records = []

 for strain in ["H37Rv"] + strains:

    sequence = Seq(strain_seqs[strain])
    record = SeqRecord(sequence, id=strain, name="", description="")
    records.append(record)

 SeqIO.write(records, "results/H37Rv_SNP_alignment.from_graph.fasta", "fasta")
 ```

 %% Output

    19

 %% Cell type:markdown id: tags:

 # Count non-variable positions

 The idea here is to use the graph and find all nodes which are shared by all strains (i.e. they do not vary).
 From these we can estimate the number of non-variable positions.

 %% Cell type:code id: tags:

 ``` python
 from collections import Counter

 graph='results/pggb/assemblies_4mb.fasta.gz.a206ba8.417fcdf.47e7f7c.smooth.final.gfa'

 n_paths = 0

 node_seqs = {}
 node_count = {}

 # First traversal: get nodes present in all paths
 with open(graph) as f:
    for line in f:

        fields = line.strip().split('\t')

        if line.startswith('S'):
            node_seqs[fields[1]] = fields[2]

        elif line.startswith('P'):

            n_paths += 1

            paff = [x[:-1] for x in fields[2].split(',') ]

            for node in paff:

                if node not in node_count:
                    node_count[node] = 0

                node_count[node] += 1

 invariable_seq = ''

 for node in node_count:
    if node_count[node] == n_paths:
        invariable_seq += node_seqs[node]

 base_counts = Counter(invariable_seq)

 for base in base_counts:
    print(base, base_counts[base])


 total = sum([base_counts[base] for base in base_counts])
 gc = base_counts['G'] + base_counts['C']

 print('Total invariable bases: ', total)
 print('GC content: ', gc/total)
 ```

 %% Output

    G 1354977
    T 713950
    C 1359777
    A 712428
    Total invariable bases:  4141132
    GC content:  0.6555584318490693

 %% Cell type:markdown id: tags:

 # Estimate a tree

-Use Stamatakis model in RAXML to account for invariable sites and correct branch lengths.
+... using your method of choice.

--model GTR+G+ASC_STAM{633511/1206483/1214591/640523}
+Below I use raxml-ng with the Stamatakis model to account for invariable sites and correct branch lengths.

 %% Cell type:code id: tags:

 ``` python
 %%bash

 echo "\
 #!/bin/bash
 #SBATCH --cpus-per-task=20
 #SBATCH --mem-per-cpu=1G
 #SBATCH --time=06:00:00
 #SBATCH --qos=6hours
 #SBATCH --output=results/raxml_stdout.%j
 #SBATCH --error=results/raxml_stderr.%j

 source ~/miniconda3/etc/profile.d/conda.sh
 conda activate raxml

 raxml-ng \
  --msa results/H37Rv_SNP_alignment.from_graph.fasta \
-  --all --model GTR+G+ASC_STAM{633312/1206359/1214398/640320} \
+  --all --model GTR+G+ASC_STAM{712428/1359777/1354977/713950} \
  --tree pars{10} --bs-trees 100 \
  --outgroup H37Rv \
  --threads 20" > run_raxml.slrm


 sbatch run_raxml.slrm
 ```

 %% Output

-    Submitted batch job 12899598
+    Submitted batch job 13472974

 %% Cell type:markdown id: tags:

 Plot the tree, for example with https://itol.embl.de.

 ❓ Is there anything remarkable about the tree?
+❓

--- a/workshop/D_graph_annotation.ipynb
+++ b/workshop/D_graph_annotation.ipynb
--- a/workshop/E_gene_conversion.ipynb
+++ b/workshop/E_gene_conversion.ipynb
--- a/workshop/F_categorize_SVs.ipynb
+++ b/workshop/F_categorize_SVs.ipynb
--- a/workshop/G_large_duplications.ipynb
+++ b/workshop/G_large_duplications.ipynb
@@ -6,8 +6,41 @@
   "source": [
    "# Explore potential large duplications\n",
    "\n",
+    "The PB000198 and PB000220 assemblies could both not be closed and show the same pattern in the assembly graph: a piece of 110kb that is repeated twice and could not be integrated into the chromosome. These two isolates are closely related, so most likely the reason their assembly failed is biological and not technical. \n",
+    "\n",
+    "One useful way to explore what is going on here is to go back to the long reads, align them against a closely related complete genome, and visualize the problematic region with IGV. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "READS_PB198=\n",
+    "ASM=\n",
+    "\n",
    "\n"
   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Load the assembly and the aligned reads in IGV (available on sciCORE Desktop).\n",
+    "\n",
+    "❓Can you find the problematic region? What genes does it contain?\n",
+    "\n",
+    "❓What has happened here, why has the assembly failed? \n",
+    "\n",
+    "❓\n"
+   ]
  }
 ],
 "metadata": {

 %% Cell type:markdown id: tags:

 # Explore potential large duplications

+The PB000198 and PB000220 assemblies could both not be closed and show the same pattern in the assembly graph: a piece of 110kb that is repeated twice and could not be integrated into the chromosome. These two isolates are closely related, so most likely the reason their assembly failed is biological and not technical.

+One useful way to explore what is going on here is to go back to the long reads, align them against a closely related complete genome, and visualize the problematic region with IGV.
+
+%% Cell type:code id: tags:
+
+``` 
+%%bash
+
+READS_PB198=
+ASM=
+
+
+```
+
+%% Cell type:markdown id: tags:
+
+Load the assembly and the aligned reads in IGV (available on sciCORE Desktop).
+
+❓Can you find the problematic region? What genes does it contain?
+
+❓What has happened here, why has the assembly failed?
+
+❓

--- a/workshop/run_raxml.slrm
+++ b/workshop/run_raxml.slrm
+#!/bin/bash
+#SBATCH --cpus-per-task=20
+#SBATCH --mem-per-cpu=1G
+#SBATCH --time=06:00:00
+#SBATCH --qos=6hours
+#SBATCH --output=results/raxml_stdout.%j
+#SBATCH --error=results/raxml_stderr.%j
+  
+source ~/miniconda3/etc/profile.d/conda.sh
+conda activate raxml
+  
+raxml-ng   --msa results/H37Rv_SNP_alignment.from_graph.fasta   --all --model GTR+G+ASC_STAM{712428/1359777/1354977/713950}   --tree pars{10} --bs-trees 100   --outgroup H37Rv   --threads 20