Skip to content
Snippets Groups Projects
Commit e97528b1 authored by Christoph Stritt's avatar Christoph Stritt
Browse files

Pangenome graph notebook added

parent dca40749
Branches
No related tags found
No related merge requests found
%% Cell type:markdown id: tags:
# Create a pangenome graph with pggb
pggb requires a single gzipped and indexed fasta file containing all assemblies.
%% Cell type:markdown id: tags:
## Install biopython
If not already present, create a conda environment with biopython.
```
conda create -n bio biopython
```
%% Cell type:markdown id: tags:
Then select the bio environent as the kernel for the Jupyter notebook.
%% Cell type:markdown id: tags:
## Combine assemblies
The script below creates a file 'assemblies_combined.fasta' in which all assemblies listed in 'paths_to_assemblies.txt' are combined.
'paths_to_assemblies.txt' should be a text file with one path per line.
If one_contig_only is set to True, only single-contig assemblies are included.
%% Cell type:code id: tags:
``` python
from Bio import SeqIO
assemblies = 'paths_to_assemblies.txt'
one_contig_only = False
fasta_combined = []
with open(assemblies) as f:
for line in f:
assembly_path = line.strip()
strain = assembly_path.split('/')[-1].split('.')[0]
records = [rec for rec in SeqIO.parse(assembly_path, "fasta")]
if one_contig_only and len(records) > 1:
print('Discarded multi-contig assembly for ' + strain)
continue
fasta_combined += records
SeqIO.write(fasta_combined, 'assemblies_combined.fasta', 'fasta')
```
%% Cell type:markdown id: tags:
## Zip and index fasta
%% Cell type:code id: tags:
``` python
%%bash
gzip single_contig_assemblies.fasta
samtools faidx single_contig_assemblies.fasta.gz
```
%% Cell type:markdown id: tags:
## Run pggb
Important paramters:
-i : path to combined assemblies
-o : output directory name
-n : Number of contigs included
-p : Similarity of contigs
-s : Maximum length of repeats in the genome
%% Cell type:markdown id: tags:
Example slurm script:
%% Cell type:markdown id: tags:
#!/bin/bash
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=4G
#SBATCH --time=24:00:00
#SBATCH --qos=1day
#SBATCH --output=stdout.%j
#SBATCH --error=stderr.%j
singularity exec /scicore/home/gagneux/GROUP/PacbioSnake_resources/containers/pggb_latest.sif pggb \
-i assemblies_combined.fasta.gz \
-o ./pggb \
-m \
-t 20 \
-n 52 \
-p 99 \
-s 5k
%% Cell type:markdown id: tags:
# Create a phylogenetic tree from a pangenome graph
Call SNPs from graph with vg deconstruct.
Filter out resistance genes and clustered variants indicative of gene conversion or selection acting on sloppy targets.
Maybe determine clustering threshold through simulations?
%% Cell type:code id: tags:
```
%%bash
singularity exec /home/cristobal/programs/pggb_latest.sif vg deconstruct \
pggb/assemblies_combined.fasta.gz.d71a954.11fba48.a4da63a.smooth.final.gfa -d1 -e \
-p H37Rv \
-t 4 \
--all-snarls \
> variants.vcf
grep -v -c ^# variants.vcf
```
%% Cell type:code id: tags:
```
# Create list of positions to exclude
import pandas
dr_loci = pandas.read_csv("DR_genes_coordinates.tsv", sep="\t")
repeats = pandas.read_csv("nucmer/delta.out.filtered.tsv")
excluded = set()
for i, gene in dr_loci.iterrows():
excluded.update(range(gene["start_region"], gene["stop_region"]+1))
for i,hpair in repeats.iterrows():
excluded.update(range(hpair["S1"], hpair["E1"]+1))
excluded.update(range(hpair["S2"], hpair["E2"]+1))
```
%% Cell type:code id: tags:
```
canettii = ["CIPT_140010059_/_STB-A", "ET1291"]
keep = []
discarded = {
"In repeat or DR locus" : 0,
"Not SNP" : 0,
"canettii only" : 0,
"Missing genotypes" : 0,
"Invariant" : 0
}
with open("variants.vcf") as f:
for line in f:
if line.startswith("#CHROM"):
fields = line.strip().split("\t")
strains = fields[9:]
canettii_i = [strains.index(x) for x in canettii]
# Dictionary to store SNPs for each strain
strain_seqs = {strain : "" for strain in strains}
strain_seqs["H37Rv"] = ""
continue
elif line.startswith("#"):
continue
fields = line.strip().split("\t")
position = int(fields[1])
ref = fields[3]
alt = fields[4].split(",")
gt = fields[9:]
# Two entries weirdly have missing genotypes. Skip them
if len(gt) != len(strains):
print(fields)
continue
canettii_gt = [gt[i] for i in canettii_i]
others = [gt[i] for i in range(len(gt)) if i not in canettii_i]
# Exclude sites in repeats and DR loci
if position in excluded:
discarded["In repeat or DR locus"] += 1
continue
# Only SNPs
if len(ref) > 1 or any(len(x) > 1 for x in alt):
discarded["Not SNP"] += 1
continue
# Skip canettii-only variants
if "1" in canettii_gt and "1" not in others:
discarded["canettii only"] += 1
continue
# Exclude missing genotypes
if any(x in gt for x in ["", "."]):
discarded["Missing genotypes"] += 1
#continue
# Write alleles to site
alleles = [ref] + alt
site = ''
for strain in strains:
strain_i = strains.index(strain)
strain_gt = gt[strain_i]
if not strain_gt or strain_gt not in '012345':
strain_allele = '-'
else:
strain_allele = alleles[int(strain_gt)]
site += strain_allele
# Re-check if there are remaining invariant sites
bases_only = ''
for allele in site:
if allele in ['A','C','G', 'T']:
bases_only += allele
#bases_only = ref + site.replace('-', '')
if len(set(bases_only)) > 1:
for i, allele in enumerate(site):
strain = strains[i]
strain_seqs[strain] += allele
# Write H37Rv allele
strain_seqs["H37Rv"] += ref
keep.append((position, alleles, gt))
else:
discarded["Invariant"] += 1
print("Filtered out:")
for k in discarded:
print(k, str(discarded[k]))
print("Kept: ", str(len(keep)))
```
%% Cell type:markdown id: tags:
Write variable alignment to fasta
%% Cell type:code id: tags:
```
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
records = []
for strain in ["H37Rv"] + strains:
sequence = Seq(strain_seqs[strain])
record = SeqRecord(sequence, id=strain, name="", description="")
records.append(record)
SeqIO.write(records, "H37Rv_SNP_alignment.from_graph.fasta", "fasta")
```
%% Cell type:markdown id: tags:
## Count non-variable positions
Get the core nodes from the graph and count the number of A,C,G,T
%% Cell type:code id: tags:
```
from collections import Counter
GRAPH='/home/cristobal/TB/projects/deletionism/NOTEBOOKS/A_ancestral_reference/pggb/assemblies_combined.fasta.gz.d71a954.11fba48.a4da63a.smooth.final.gfa'
n_paths = 0
node_seqs = {}
node_count = {}
# First traversal: get nodes present in all paths
with open(GRAPH) as f:
for line in f:
fields = line.strip().split('\t')
if line.startswith('S'):
node_seqs[fields[1]] = fields[2]
elif line.startswith('P'):
n_paths += 1
paff = [x[:-1] for x in fields[2].split(',') ]
for node in paff:
if node not in node_count:
node_count[node] = 0
node_count[node] += 1
invariable_seq = ''
for node in node_count:
if node_count[node] == n_paths:
invariable_seq += node_seqs[node]
base_counts = Counter(invariable_seq)
for base in base_counts:
print(base, base_counts[base])
total = sum([base_counts[base] for base in base_counts])
gc = base_counts['G'] + base_counts['C']
print('Total invariable bases: ', total)
print('GC content: ', gc/total)
```
%% Cell type:markdown id: tags:
## Estimate tree!
Use Stamatakis model in RAXML to account for invariable sites and correct branch lengths.
--model GTR+G+ASC_STAM{633511/1206483/1214591/640523}
%% Cell type:code id: tags:
```
%%bash
~/programs/raxml-ng_v1.2.0_linux_x86_64/raxml-ng \
--msa H37Rv_SNP_alignment.from_graph.fasta \
--all --model GTR+G+ASC_STAM{633312/1206359/1214398/640320} --tree pars{10} --bs-trees 100 --threads 4 --outgroup CIPT_140010059_/_STB-A
mv H37Rv_SNP_alignment.from_graph.fasta.* raxml
```
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment