Pangenome graph notebook added

e97528b1 · Christoph Stritt · dca40749 · e97528b1 · dca40749 · e97528b1
Commit e97528b1 authored 10 months ago by Christoph Stritt
--- a/notebooks/create_pangenome_graph.ipynb
+++ b/notebooks/create_pangenome_graph.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Create a pangenome graph with pggb\n",
+    "\n",
+    "pggb requires a single gzipped and indexed fasta file containing all assemblies. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Install biopython\n",
+    "\n",
+    "If not already present, create a conda environment with biopython. \n",
+    "\n",
+    "```\n",
+    "conda create -n bio biopython\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then select the bio environent as the kernel for the Jupyter notebook."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Combine assemblies\n",
+    "\n",
+    "The script below creates a file 'assemblies_combined.fasta' in which all assemblies listed in 'paths_to_assemblies.txt' are combined.\n",
+    "\n",
+    "'paths_to_assemblies.txt' should be a text file with one path per line. \n",
+    "\n",
+    "If one_contig_only is set to True, only single-contig assemblies are included."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from Bio import SeqIO\n",
+    "\n",
+    "\n",
+    "assemblies = 'paths_to_assemblies.txt'\n",
+    "one_contig_only = False\n",
+    "\n",
+    "fasta_combined = []\n",
+    "\n",
+    "with open(assemblies) as f:\n",
+    "    \n",
+    "    for line in f:\n",
+    "        \n",
+    "        assembly_path = line.strip()\n",
+    "        strain = assembly_path.split('/')[-1].split('.')[0]\n",
+    "     \n",
+    "        records = [rec for rec in SeqIO.parse(assembly_path, \"fasta\")]\n",
+    "        \n",
+    "        if one_contig_only and len(records) > 1:\n",
+    "            print('Discarded multi-contig assembly for ' + strain)\n",
+    "            continue\n",
+    "        \n",
+    "        fasta_combined += records\n",
+    "\n",
+    "SeqIO.write(fasta_combined, 'assemblies_combined.fasta', 'fasta')\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Zip and index fasta"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "shellscript"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "gzip single_contig_assemblies.fasta\n",
+    "samtools faidx single_contig_assemblies.fasta.gz"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Run pggb\n",
+    "\n",
+    "Important paramters:\n",
+    "\n",
+    "    -i : path to combined assemblies\n",
+    "    -o : output directory name\n",
+    "    -n : Number of contigs included\n",
+    "    -p : Similarity of contigs\n",
+    "    -s : Maximum length of repeats in the genome"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Example slurm script:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#!/bin/bash\n",
+    "\n",
+    "#SBATCH --cpus-per-task=20\n",
+    "#SBATCH --mem-per-cpu=4G\n",
+    "#SBATCH --time=24:00:00\n",
+    "#SBATCH --qos=1day\n",
+    "#SBATCH --output=stdout.%j\n",
+    "#SBATCH --error=stderr.%j\n",
+    "\n",
+    "singularity exec /scicore/home/gagneux/GROUP/PacbioSnake_resources/containers/pggb_latest.sif pggb \\\n",
+    "  -i assemblies_combined.fasta.gz \\\n",
+    "  -o ./pggb \\\n",
+    "  -m \\\n",
+    "  -t 20 \\\n",
+    "  -n 52 \\\n",
+    "  -p 99 \\\n",
+    "  -s 5k\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
+%% Cell type:markdown id: tags:
+
+# Create a pangenome graph with pggb
+
+pggb requires a single gzipped and indexed fasta file containing all assemblies.
+
+%% Cell type:markdown id: tags:
+
+## Install biopython
+
+If not already present, create a conda environment with biopython.
+
+```
+conda create -n bio biopython
+```
+
+%% Cell type:markdown id: tags:
+
+Then select the bio environent as the kernel for the Jupyter notebook.
+
+%% Cell type:markdown id: tags:
+
+## Combine assemblies
+
+The script below creates a file 'assemblies_combined.fasta' in which all assemblies listed in 'paths_to_assemblies.txt' are combined.
+
+'paths_to_assemblies.txt' should be a text file with one path per line.
+
+If one_contig_only is set to True, only single-contig assemblies are included.
+
+%% Cell type:code id: tags:
+
+``` python
+from Bio import SeqIO
+
+
+assemblies = 'paths_to_assemblies.txt'
+one_contig_only = False
+
+fasta_combined = []
+
+with open(assemblies) as f:
+
+    for line in f:
+
+        assembly_path = line.strip()
+        strain = assembly_path.split('/')[-1].split('.')[0]
+
+        records = [rec for rec in SeqIO.parse(assembly_path, "fasta")]
+
+        if one_contig_only and len(records) > 1:
+            print('Discarded multi-contig assembly for ' + strain)
+            continue
+
+        fasta_combined += records
+
+SeqIO.write(fasta_combined, 'assemblies_combined.fasta', 'fasta')
+
+```
+
+%% Cell type:markdown id: tags:
+
+## Zip and index fasta
+
+%% Cell type:code id: tags:
+
+``` python
+%%bash
+
+gzip single_contig_assemblies.fasta
+samtools faidx single_contig_assemblies.fasta.gz
+```
+
+%% Cell type:markdown id: tags:
+
+## Run pggb
+
+Important paramters:
+
+    -i : path to combined assemblies
+    -o : output directory name
+    -n : Number of contigs included
+    -p : Similarity of contigs
+    -s : Maximum length of repeats in the genome
+
+%% Cell type:markdown id: tags:
+
+Example slurm script:
+
+%% Cell type:markdown id: tags:
+
+#!/bin/bash
+
+#SBATCH --cpus-per-task=20
+#SBATCH --mem-per-cpu=4G
+#SBATCH --time=24:00:00
+#SBATCH --qos=1day
+#SBATCH --output=stdout.%j
+#SBATCH --error=stderr.%j
+
+singularity exec /scicore/home/gagneux/GROUP/PacbioSnake_resources/containers/pggb_latest.sif pggb \
+  -i assemblies_combined.fasta.gz \
+  -o ./pggb \
+  -m \
+  -t 20 \
+  -n 52 \
+  -p 99 \
+  -s 5k
--- a/notebooks/pangenome_graph.ipynb
+++ b/notebooks/pangenome_graph.ipynb
--- a/notebooks/tree_from_graph.ipynb
+++ b/notebooks/tree_from_graph.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Create a phylogenetic tree from a pangenome graph\n",
+    "\n",
+    "Call SNPs from graph with vg deconstruct.\n",
+    "\n",
+    "Filter out resistance genes and clustered variants indicative of gene conversion or selection acting on sloppy targets.\n",
+    "\n",
+    "Maybe determine clustering threshold through simulations?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "singularity exec /home/cristobal/programs/pggb_latest.sif vg deconstruct \\\n",
+    "  pggb/assemblies_combined.fasta.gz.d71a954.11fba48.a4da63a.smooth.final.gfa -d1 -e \\\n",
+    "    -p H37Rv \\\n",
+    "    -t 4 \\\n",
+    "    --all-snarls \\\n",
+    "    > variants.vcf\n",
+    "\n",
+    "grep -v -c ^# variants.vcf"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create list of positions to exclude\n",
+    "import pandas\n",
+    "\n",
+    "dr_loci = pandas.read_csv(\"DR_genes_coordinates.tsv\", sep=\"\\t\")\n",
+    "repeats = pandas.read_csv(\"nucmer/delta.out.filtered.tsv\")\n",
+    "\n",
+    "excluded = set()\n",
+    "\n",
+    "for i, gene in dr_loci.iterrows():\n",
+    "    excluded.update(range(gene[\"start_region\"], gene[\"stop_region\"]+1))\n",
+    "\n",
+    "for i,hpair in repeats.iterrows():\n",
+    "    excluded.update(range(hpair[\"S1\"], hpair[\"E1\"]+1))\n",
+    "    excluded.update(range(hpair[\"S2\"], hpair[\"E2\"]+1))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "canettii = [\"CIPT_140010059_/_STB-A\", \"ET1291\"]\n",
+    "\n",
+    "keep = []\n",
+    "\n",
+    "discarded = {\n",
+    "    \"In repeat or DR locus\" : 0,\n",
+    "    \"Not SNP\" : 0,\n",
+    "    \"canettii only\" : 0,\n",
+    "    \"Missing genotypes\" : 0,\n",
+    "    \"Invariant\" : 0\n",
+    "}\n",
+    "\n",
+    "with open(\"variants.vcf\") as f:\n",
+    "    \n",
+    "    for line in f:\n",
+    "        \n",
+    "        if line.startswith(\"#CHROM\"):\n",
+    "            \n",
+    "            fields = line.strip().split(\"\\t\")\n",
+    "            strains = fields[9:]\n",
+    "            canettii_i = [strains.index(x) for x in canettii]\n",
+    "            \n",
+    "            # Dictionary to store SNPs for each strain\n",
+    "            strain_seqs = {strain : \"\" for strain in strains}\n",
+    "            strain_seqs[\"H37Rv\"] = \"\"\n",
+    "            continue\n",
+    "        \n",
+    "        elif line.startswith(\"#\"):\n",
+    "            continue\n",
+    "\n",
+    "        fields = line.strip().split(\"\\t\")\n",
+    "        position = int(fields[1])\n",
+    "        \n",
+    "        ref = fields[3]\n",
+    "        alt = fields[4].split(\",\")\n",
+    "        \n",
+    "        gt = fields[9:]\n",
+    "        \n",
+    "        # Two entries weirdly have missing genotypes. Skip them\n",
+    "        if len(gt) != len(strains):\n",
+    "            print(fields)\n",
+    "            continue\n",
+    "        \n",
+    "        \n",
+    "        canettii_gt = [gt[i] for i in canettii_i]\n",
+    "        others = [gt[i] for i in range(len(gt)) if i not in canettii_i]\n",
+    "        \n",
+    "        \n",
+    "        # Exclude sites in repeats and DR loci\n",
+    "        if position in excluded:\n",
+    "            discarded[\"In repeat or DR locus\"] += 1\n",
+    "            continue\n",
+    "\n",
+    "        # Only SNPs\n",
+    "        if len(ref) > 1 or any(len(x) > 1 for x in alt):\n",
+    "            discarded[\"Not SNP\"] += 1\n",
+    "            continue\n",
+    "            \n",
+    "        # Skip canettii-only variants\n",
+    "        if \"1\" in canettii_gt and \"1\" not in others:\n",
+    "            discarded[\"canettii only\"] += 1\n",
+    "            continue\n",
+    "\n",
+    "        # Exclude missing genotypes\n",
+    "        if any(x in gt for x in [\"\", \".\"]):\n",
+    "            discarded[\"Missing genotypes\"] += 1\n",
+    "            #continue\n",
+    "\n",
+    "\n",
+    "        # Write alleles to site\n",
+    "        alleles = [ref] + alt       \n",
+    "        site = ''\n",
+    "        \n",
+    "        for strain in strains:\n",
+    "            strain_i = strains.index(strain)\n",
+    "            strain_gt = gt[strain_i]\n",
+    "            if not strain_gt or strain_gt not in '012345':\n",
+    "                strain_allele = '-'\n",
+    "            else:\n",
+    "                strain_allele = alleles[int(strain_gt)]\n",
+    "            \n",
+    "            site += strain_allele\n",
+    "            \n",
+    "        # Re-check if there are remaining invariant sites\n",
+    "        bases_only = ''\n",
+    "        for allele in site:\n",
+    "            if allele in ['A','C','G', 'T']:\n",
+    "                bases_only += allele\n",
+    "\n",
+    "        #bases_only = ref + site.replace('-', '')\n",
+    "        \n",
+    "        \n",
+    "        if len(set(bases_only)) > 1:\n",
+    "                \n",
+    "            for i, allele in enumerate(site):\n",
+    "                strain = strains[i]\n",
+    "                strain_seqs[strain] += allele\n",
+    "                \n",
+    "            # Write H37Rv allele\n",
+    "            strain_seqs[\"H37Rv\"] += ref\n",
+    "            \n",
+    "            keep.append((position, alleles, gt))\n",
+    "            \n",
+    "        else:\n",
+    "            discarded[\"Invariant\"] += 1\n",
+    "\n",
+    "                \n",
+    "            \n",
+    "print(\"Filtered out:\")\n",
+    "for k in discarded:\n",
+    "    print(k, str(discarded[k]))\n",
+    "    \n",
+    "print(\"Kept: \", str(len(keep)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Write variable alignment to fasta"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from Bio import SeqIO\n",
+    "from Bio.Seq import Seq\n",
+    "from Bio.SeqRecord import SeqRecord\n",
+    "\n",
+    "records = []\n",
+    "\n",
+    "for strain in [\"H37Rv\"] + strains:\n",
+    "    \n",
+    "    sequence = Seq(strain_seqs[strain])\n",
+    "    record = SeqRecord(sequence, id=strain, name=\"\", description=\"\")\n",
+    "    records.append(record)\n",
+    "    \n",
+    "SeqIO.write(records, \"H37Rv_SNP_alignment.from_graph.fasta\", \"fasta\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Count non-variable positions\n",
+    "\n",
+    "Get the core nodes from the graph and count the number of A,C,G,T"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from collections import Counter\n",
+    "\n",
+    "GRAPH='/home/cristobal/TB/projects/deletionism/NOTEBOOKS/A_ancestral_reference/pggb/assemblies_combined.fasta.gz.d71a954.11fba48.a4da63a.smooth.final.gfa'\n",
+    "\n",
+    "n_paths = 0\n",
+    "\n",
+    "node_seqs = {}\n",
+    "node_count = {}\n",
+    "\n",
+    "# First traversal: get nodes present in all paths\n",
+    "with open(GRAPH) as f:\n",
+    "    for line in f:\n",
+    "        \n",
+    "        fields = line.strip().split('\\t')\n",
+    "        \n",
+    "        if line.startswith('S'):\n",
+    "            node_seqs[fields[1]] = fields[2]\n",
+    "        \n",
+    "        elif line.startswith('P'):\n",
+    "            \n",
+    "            n_paths += 1\n",
+    "            \n",
+    "            paff = [x[:-1] for x in fields[2].split(',') ]\n",
+    "            \n",
+    "            for node in paff: \n",
+    "                \n",
+    "                if node not in node_count:\n",
+    "                    node_count[node] = 0\n",
+    "                \n",
+    "                node_count[node] += 1\n",
+    "                \n",
+    "invariable_seq = ''\n",
+    "\n",
+    "for node in node_count:\n",
+    "    if node_count[node] == n_paths:\n",
+    "        invariable_seq += node_seqs[node]\n",
+    "        \n",
+    "base_counts = Counter(invariable_seq)\n",
+    "\n",
+    "for base in base_counts:\n",
+    "    print(base, base_counts[base])\n",
+    "\n",
+    "\n",
+    "total = sum([base_counts[base] for base in base_counts])\n",
+    "gc = base_counts['G'] + base_counts['C']\n",
+    "\n",
+    "print('Total invariable bases: ', total)\n",
+    "print('GC content: ', gc/total)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Estimate tree!\n",
+    "\n",
+    "Use Stamatakis model in RAXML to account for invariable sites and correct branch lengths.\n",
+    "\n",
+    "--model GTR+G+ASC_STAM{633511/1206483/1214591/640523}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "~/programs/raxml-ng_v1.2.0_linux_x86_64/raxml-ng \\\n",
+    "  --msa H37Rv_SNP_alignment.from_graph.fasta \\\n",
+    "  --all --model GTR+G+ASC_STAM{633312/1206359/1214398/640320} --tree pars{10} --bs-trees 100 --threads 4 --outgroup CIPT_140010059_/_STB-A\n",
+    "\n",
+    "mv H37Rv_SNP_alignment.from_graph.fasta.* raxml\n",
+    "\n",
+    " "
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
+%% Cell type:markdown id: tags:
+
+# Create a phylogenetic tree from a pangenome graph
+
+Call SNPs from graph with vg deconstruct.
+
+Filter out resistance genes and clustered variants indicative of gene conversion or selection acting on sloppy targets.
+
+Maybe determine clustering threshold through simulations?
+
+%% Cell type:code id: tags:
+
+``` 
+%%bash
+
+singularity exec /home/cristobal/programs/pggb_latest.sif vg deconstruct \
+  pggb/assemblies_combined.fasta.gz.d71a954.11fba48.a4da63a.smooth.final.gfa -d1 -e \
+    -p H37Rv \
+    -t 4 \
+    --all-snarls \
+    > variants.vcf
+
+grep -v -c ^# variants.vcf
+```
+
+%% Cell type:code id: tags:
+
+``` 
+# Create list of positions to exclude
+import pandas
+
+dr_loci = pandas.read_csv("DR_genes_coordinates.tsv", sep="\t")
+repeats = pandas.read_csv("nucmer/delta.out.filtered.tsv")
+
+excluded = set()
+
+for i, gene in dr_loci.iterrows():
+    excluded.update(range(gene["start_region"], gene["stop_region"]+1))
+
+for i,hpair in repeats.iterrows():
+    excluded.update(range(hpair["S1"], hpair["E1"]+1))
+    excluded.update(range(hpair["S2"], hpair["E2"]+1))
+```
+
+%% Cell type:code id: tags:
+
+``` 
+canettii = ["CIPT_140010059_/_STB-A", "ET1291"]
+
+keep = []
+
+discarded = {
+    "In repeat or DR locus" : 0,
+    "Not SNP" : 0,
+    "canettii only" : 0,
+    "Missing genotypes" : 0,
+    "Invariant" : 0
+}
+
+with open("variants.vcf") as f:
+
+    for line in f:
+
+        if line.startswith("#CHROM"):
+
+            fields = line.strip().split("\t")
+            strains = fields[9:]
+            canettii_i = [strains.index(x) for x in canettii]
+
+            # Dictionary to store SNPs for each strain
+            strain_seqs = {strain : "" for strain in strains}
+            strain_seqs["H37Rv"] = ""
+            continue
+
+        elif line.startswith("#"):
+            continue
+
+        fields = line.strip().split("\t")
+        position = int(fields[1])
+
+        ref = fields[3]
+        alt = fields[4].split(",")
+
+        gt = fields[9:]
+
+        # Two entries weirdly have missing genotypes. Skip them
+        if len(gt) != len(strains):
+            print(fields)
+            continue
+
+
+        canettii_gt = [gt[i] for i in canettii_i]
+        others = [gt[i] for i in range(len(gt)) if i not in canettii_i]
+
+
+        # Exclude sites in repeats and DR loci
+        if position in excluded:
+            discarded["In repeat or DR locus"] += 1
+            continue
+
+        # Only SNPs
+        if len(ref) > 1 or any(len(x) > 1 for x in alt):
+            discarded["Not SNP"] += 1
+            continue
+
+        # Skip canettii-only variants
+        if "1" in canettii_gt and "1" not in others:
+            discarded["canettii only"] += 1
+            continue
+
+        # Exclude missing genotypes
+        if any(x in gt for x in ["", "."]):
+            discarded["Missing genotypes"] += 1
+            #continue
+
+
+        # Write alleles to site
+        alleles = [ref] + alt
+        site = ''
+
+        for strain in strains:
+            strain_i = strains.index(strain)
+            strain_gt = gt[strain_i]
+            if not strain_gt or strain_gt not in '012345':
+                strain_allele = '-'
+            else:
+                strain_allele = alleles[int(strain_gt)]
+
+            site += strain_allele
+
+        # Re-check if there are remaining invariant sites
+        bases_only = ''
+        for allele in site:
+            if allele in ['A','C','G', 'T']:
+                bases_only += allele
+
+        #bases_only = ref + site.replace('-', '')
+
+
+        if len(set(bases_only)) > 1:
+
+            for i, allele in enumerate(site):
+                strain = strains[i]
+                strain_seqs[strain] += allele
+
+            # Write H37Rv allele
+            strain_seqs["H37Rv"] += ref
+
+            keep.append((position, alleles, gt))
+
+        else:
+            discarded["Invariant"] += 1
+
+
+
+print("Filtered out:")
+for k in discarded:
+    print(k, str(discarded[k]))
+
+print("Kept: ", str(len(keep)))
+```
+
+%% Cell type:markdown id: tags:
+
+Write variable alignment to fasta
+
+%% Cell type:code id: tags:
+
+``` 
+from Bio import SeqIO
+from Bio.Seq import Seq
+from Bio.SeqRecord import SeqRecord
+
+records = []
+
+for strain in ["H37Rv"] + strains:
+
+    sequence = Seq(strain_seqs[strain])
+    record = SeqRecord(sequence, id=strain, name="", description="")
+    records.append(record)
+
+SeqIO.write(records, "H37Rv_SNP_alignment.from_graph.fasta", "fasta")
+```
+
+%% Cell type:markdown id: tags:
+
+## Count non-variable positions
+
+Get the core nodes from the graph and count the number of A,C,G,T
+
+%% Cell type:code id: tags:
+
+``` 
+from collections import Counter
+
+GRAPH='/home/cristobal/TB/projects/deletionism/NOTEBOOKS/A_ancestral_reference/pggb/assemblies_combined.fasta.gz.d71a954.11fba48.a4da63a.smooth.final.gfa'
+
+n_paths = 0
+
+node_seqs = {}
+node_count = {}
+
+# First traversal: get nodes present in all paths
+with open(GRAPH) as f:
+    for line in f:
+
+        fields = line.strip().split('\t')
+
+        if line.startswith('S'):
+            node_seqs[fields[1]] = fields[2]
+
+        elif line.startswith('P'):
+
+            n_paths += 1
+
+            paff = [x[:-1] for x in fields[2].split(',') ]
+
+            for node in paff:
+
+                if node not in node_count:
+                    node_count[node] = 0
+
+                node_count[node] += 1
+
+invariable_seq = ''
+
+for node in node_count:
+    if node_count[node] == n_paths:
+        invariable_seq += node_seqs[node]
+
+base_counts = Counter(invariable_seq)
+
+for base in base_counts:
+    print(base, base_counts[base])
+
+
+total = sum([base_counts[base] for base in base_counts])
+gc = base_counts['G'] + base_counts['C']
+
+print('Total invariable bases: ', total)
+print('GC content: ', gc/total)
+```
+
+%% Cell type:markdown id: tags:
+
+## Estimate tree!
+
+Use Stamatakis model in RAXML to account for invariable sites and correct branch lengths.
+
+--model GTR+G+ASC_STAM{633511/1206483/1214591/640523}
+
+%% Cell type:code id: tags:
+
+``` 
+%%bash
+
+~/programs/raxml-ng_v1.2.0_linux_x86_64/raxml-ng \
+  --msa H37Rv_SNP_alignment.from_graph.fasta \
+  --all --model GTR+G+ASC_STAM{633312/1206359/1214398/640320} --tree pars{10} --bs-trees 100 --threads 4 --outgroup CIPT_140010059_/_STB-A
+
+mv H37Rv_SNP_alignment.from_graph.fasta.* raxml
+
+
+```