Skip to content
Snippets Groups Projects
Commit ddaf6496 authored by cstritt's avatar cstritt
Browse files

Workshop continued...

parent 922071c0
No related branches found
No related tags found
No related merge requests found
%% Cell type:markdown id: tags:
# Create a pangenome graph with PGGB
The folder **/scicore/home/gagneux/GROUP/PacbioSnake_resources/L1_samples** contains the output from running the assembly pipeline on the 19 L1 strains. In a first step, we will use the assembly summaries created by the pipeline to decide which assemblies to include in the graph.
The assembly summaries can be found here: **/scicore/home/gagneux/GROUP/PacbioSnake_resources/workshop/L1_samples/assembly_summaries.tsv**
PGGB requires as input a single fasta file, so in a second step we will combine the selected sequences and rename them.
%% Cell type:markdown id: tags:
# Inspect the Flye assembly summary
%% Cell type:code id: tags:
``` python
import pandas
path_to_summary = '/scicore/home/gagneux/GROUP/PacbioSnake_resources/workshop/L1_samples/assembly_summaries.tsv'
md = pandas.read_csv(path_to_summary, sep='\t')
md.head()
```
%% Output
sample #seq_name length cov. circ. repeat mult. alt_group graph_path
0 PB000195 contig_1 4444823 113 Y N 1 * 1
1 PB000196 contig_29 4415329 133 N N 1 * 29,30
2 PB000196 contig_36 13589 31 N Y 1 38 *,36,*
3 PB000196 contig_46 11948 46 N Y 1 46 *,46,*
4 PB000196 contig_55 10270 45 N Y 1 56 *,55,*
%% Cell type:markdown id: tags:
How many sequences per strain?
%% Cell type:code id: tags:
``` python
md['sample'].value_counts()
```
%% Output
sample
PB000205 49
PB000202 7
PB000196 6
PB000220 2
PB000198 2
PB000207 1
PB000219 1
PB000211 1
PB000210 1
PB000209 1
PB000208 1
PB000195 1
PB000206 1
PB000203 1
PB000201 1
PB000200 1
PB000199 1
PB000197 1
PB000204 1
Name: count, dtype: int64
%% Cell type:markdown id: tags:
Which sequences are cirularized?
%% Cell type:code id: tags:
``` python
md[md['circ.'] == 'Y']
```
%% Output
sample #seq_name length cov. circ. repeat mult. alt_group \
0 PB000195 contig_1 4444823 113 Y N 1 *
7 PB000197 contig_1 4410932 129 Y N 1 *
10 PB000199 contig_1 4426248 112 Y N 1 *
11 PB000200 contig_1 4428581 83 Y N 1 *
12 PB000201 contig_1 4408479 96 Y N 1 *
20 PB000203 contig_1 4434008 47 Y N 1 *
21 PB000204 contig_1 4419776 74 Y N 1 *
22 PB000205 contig_53 4420883 37 Y Y 2 *
71 PB000206 contig_1 4440794 117 Y N 1 *
72 PB000207 contig_1 4424460 85 Y N 1 *
73 PB000208 contig_1 4419922 119 Y N 1 *
74 PB000209 contig_1 4443791 80 Y N 1 *
75 PB000210 contig_1 4444571 115 Y N 1 *
76 PB000211 contig_1 4441994 121 Y N 1 *
77 PB000219 contig_1 4410932 101 Y N 1 *
graph_path
0 1
7 1
10 1
11 1
12 1
20 1
21 1
22 53
71 1
72 1
73 1
74 1
75 1
76 1
77 1
%% Cell type:markdown id: tags:
Sequences larger than 4 Mb
%% Cell type:code id: tags:
``` python
md[md['length']>4e6]
```
%% Output
sample #seq_name length cov. circ. repeat mult. alt_group \
0 PB000195 contig_1 4444823 113 Y N 1 *
1 PB000196 contig_29 4415329 133 N N 1 *
7 PB000197 contig_1 4410932 129 Y N 1 *
8 PB000198 contig_1 4332804 86 N N 1 *
10 PB000199 contig_1 4426248 112 Y N 1 *
11 PB000200 contig_1 4428581 83 Y N 1 *
12 PB000201 contig_1 4408479 96 Y N 1 *
20 PB000203 contig_1 4434008 47 Y N 1 *
21 PB000204 contig_1 4419776 74 Y N 1 *
22 PB000205 contig_53 4420883 37 Y Y 2 *
71 PB000206 contig_1 4440794 117 Y N 1 *
72 PB000207 contig_1 4424460 85 Y N 1 *
73 PB000208 contig_1 4419922 119 Y N 1 *
74 PB000209 contig_1 4443791 80 Y N 1 *
75 PB000210 contig_1 4444571 115 Y N 1 *
76 PB000211 contig_1 4441994 121 Y N 1 *
77 PB000219 contig_1 4410932 101 Y N 1 *
78 PB000220 contig_1 4319283 88 N N 1 *
graph_path
0 1
1 29,30
7 1
8 2,1,2
10 1
11 1
12 1
20 1
21 1
22 53
71 1
72 1
73 1
74 1
75 1
76 1
77 1
78 2,1,2
%% Cell type:markdown id: tags:
Which samples lack a circular genome?
%% Cell type:code id: tags:
``` python
samples_circ = set(md['sample'][md['circ.'] == 'Y'])
samples_nocirc = md['sample'][md['sample'].isin(samples_circ) == False]
print(set(samples_nocirc))
```
%% Output
{'PB000220', 'PB000198', 'PB000202', 'PB000196'}
%% Cell type:markdown id: tags:
❓ What is the matter with these genomes?
❓ Should we include them in the graph?
%% Cell type:code id: tags:
``` python
md[md['sample'] == 'PB000220']
```
%% Output
sample #seq_name length cov. circ. repeat mult. alt_group graph_path
78 PB000220 contig_1 4319283 88 N N 1 * 2,1,2
79 PB000220 contig_2 110156 166 N Y 2 * 2
%% Cell type:markdown id: tags:
# Combine complete genomes
Let us be liberal and just use all sequences > 4 Mb. We can add a tag to the name of the 'suspicuous' ones.
To combine assemblies, we will loop through the pipeline output directories and get the sequences from the individual fasta files. We will also add H37Rv, on the one hand to have an outgroup, on the other to use the H37Rv annotation.
%% Cell type:code id: tags:
``` python
import os
from Bio import SeqIO
path_to_pipeline_output = '/scicore/home/gagneux/GROUP/PacbioSnake_resources/workshop/L1_samples'
# Create results directory and output handle
os.makedirs('results', exist_ok=True)
output_handle = open('./results/assemblies_4mb.fasta', 'w')
# Get assembly summary data for all sequences > 4Mb
md_4mb = md[md['length']>4e6]
# Loop through pipeline output directory
for root, dirs, files in os.walk(path_to_pipeline_output):
for file in files:
if file.startswith('PB') and file.endswith('.fasta'):
for rec in SeqIO.parse(os.path.join(root, file), "fasta"):
if len(rec.seq) < 4e6:
continue
strain = rec.id.split('_')[0]
# Look up the strain in the assembly summary
strain_circ = md_4mb['circ.'][md_4mb['sample'] == strain].to_string(index=False)
strain_repeat = md_4mb['repeat'][md_4mb['sample'] == strain].to_string(index=False)
# Create new seq ID
new_id = strain
if strain_circ == 'N':
new_id += '#circNo'
if strain_repeat == 'Y':
new_id += '#repeatY'
rec.id = new_id
rec.description = ''
SeqIO.write(rec, output_handle, 'fasta')
output_handle.close()
```
%% Cell type:markdown id: tags:
Finally, add H37Rv, zip and index the sequences.
%% Cell type:code id: tags:
``` python
%%bash
PATH_TO_PGGB=/scicore/home/gagneux/stritt0001/programs/PacbioSnake/pggb_latest.sif
sed '1s/^>.*$/>H37Rv/' data/GCF_000195955.2_ASM19595v2_genomic.fna >> ./results/assemblies_4mb.fasta
singularity exec ${PATH_TO_PGGB} bgzip ./results/assemblies_4mb.fasta
singularity exec ${PATH_TO_PGGB} samtools faidx ./results/assemblies_4mb.fasta.gz
```
%% Output
INFO: Converting SIF file to temporary sandbox...
INFO: Cleaning up image...
INFO: Converting SIF file to temporary sandbox...
INFO: Cleaning up image...
%% Cell type:markdown id: tags:
# Run the PanGenome Graph Builder
Now we can create a graph, which essentially is an all-vs-all alignment that is translated into a concise format of nodes, edges, and paths.
The main parameters parameters of PGGB are:
-i : path to combined assemblies
-o : output directory name
-n : number of haplotypes
-p : minimum pairwise identity between seeds
-s : seed length
https://pggb.readthedocs.io/en/latest/
As shown by [Yang et al. 2023](https://doi.org/10.3389/fgene.2023.1225248), different values for -p and -s affect the structure of the graph, especially when building it from fairly diverse genomes. For TB with its highly similar genomes, I don't expect a dramatic effect of changing these parameters. The exception might be -s when there are larger duplications. -s should approximately correspond to the maximum length of repeats in the genome. If set too small, a duplication larger than -s could be represented somewhat confusingly in the graph as multiple variants.
[Here](pics/pggb-flow-diagram.png) is a flow diagram of what PGGB does.
%% Cell type:markdown id: tags:
### Create slurm script
%% Cell type:code id: tags:
``` python
%%bash
PATH_TO_PGGB=/scicore/home/gagneux/stritt0001/programs/PacbioSnake/pggb_latest.sif
INPUT=/scicore/home/gagneux/stritt0001/programs/PacbioSnake/workshop/results/assemblies_4mb.fasta.gz
OUTDIR=/scicore/home/gagneux/stritt0001/programs/PacbioSnake/workshop/results/pggb
N_HAPLOTYPES=18
MIN_SEED_ID=99
SEED_LENGTH=5k
echo "\
#!/bin/bash
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=1G
#SBATCH --time=06:00:00
#SBATCH --qos=6hours
#SBATCH --output=results/pggb_stdout.%j
#SBATCH --error=results/pggb_stderr.%j
singularity exec ${PATH_TO_PGGB} pggb \
-i ${INPUT} \
-o ${OUTDIR} \
-m \
-t 20 \
-n ${N_HAPLOTYPES} \
-p ${MIN_SEED_ID} \
-s ${SEED_LENGTH}" > results/run_pggb.slrm
```
%% Cell type:markdown id: tags:
And submit the job. With 20 CPUs and 18 strains, this should take less than 10 minutes.
%% Cell type:code id: tags:
``` python
%%bash
sbatch results/run_pggb.slrm
```
%% Output
Submitted batch job 12295701
%% Cell type:markdown id: tags:
PGGB creates quite some output. We will only use the file ending with .gfa, which is ..., and the one ending with .og, which is a ...
%% Cell type:markdown id: tags:
# Graph visualization
Like IGV for short reads, visualization tools
Bandage allows you to:
- browse a graph interactively
- zoom in on regions of interest, export figures
- extract node sequences
- blast sequences against the graph
A limitation of Bandage is that it does not allow you to see which strain follows which path through the graph. There are **more recent tools** I haven't explored yet, for example [VRPG](https://www.evomicslab.org/app/vrpg/) or [PanGraphViewer](https://github.com/TF-Chan-Lab/panGraphViewer), which seem to be able to do this.
To **run Bandage**, open a sciCORE Desktop on https://ood-ubuntu.scicore.unibas.ch. Then open a terminal, load the Bandage module, and open the GUI.
```bash
ml Bandage/0.9.0-GCCcore-13.2.0
Bandage
```
Then you should be able to load the graph from where you have saved it.
❓Explore the graph. Is there a lot of structural variation?
❓In the workshop/data/genes folder, there are the sequences of some features of interest. Pick one and blast it against the graph. What do you find?
❓In the workshop/data/genes folder, there are the sequences of some features of interest. TBD1, for instance, is a region absent in the reference genome H37Rv and thus invisible to most short-read approaches. How does TBD1 look like?
......
%% Cell type:markdown id: tags:
# From graph to phylogeny
Nothing makes sense in biology except in the light of a phylogenetic tree...
In this module, we take the variants from the vcf and create a SNP alignment. We further use the graph to estimate the number of non-variable positions.
In this module, we take the variants from the graph vcf and create a SNP alignment. We further use the graph to estimate the number of non-variable positions.
%% Cell type:code id: tags:
``` python
# Create list of positions to exclude
import pandas
dr_loci = pandas.read_csv("data/DR_genes_coordinates.tsv", sep="\t")
repeats = pandas.read_csv("data/H37Rv.nucmer_repeats_id90_l50.tsv")
excluded = set()
for i, gene in dr_loci.iterrows():
excluded.update(range(gene["start_region"], gene["stop_region"]+1))
for i,hpair in repeats.iterrows():
excluded.update(range(hpair["S1"], hpair["E1"]+1))
excluded.update(range(hpair["S2"], hpair["E2"]+1))
print(f'Number of excluded positions: {len(excluded)}')
```
%% Output
Number of excluded positions: 176136
%% Cell type:markdown id: tags:
By using the variants.vcf file, we will also include sites that do not vary within the L1 samples, but only between L1 and H37Rv.
%% Cell type:code id: tags:
``` python
variants = 'results/variants.vcf'
outgroup = 'H37Rv'
max_missing_prop = 0.1
keep = []
discarded = {
"In repeat or DR locus" : 0,
"Not SNP" : 0,
"Missing genotypes" : 0,
"Invariant" : 0
}
with open(variants) as f:
for line in f:
if line.startswith("#CHROM"):
fields = line.strip().split("\t")
strains = fields[9:]
# Dictionary to store SNPs for each strain
strain_seqs = {strain : "" for strain in strains}
strain_seqs["H37Rv"] = ""
continue
elif line.startswith("#"):
continue
fields = line.strip().split("\t")
position = int(fields[1])
ref = fields[3]
alt = fields[4].split(",")
alleles = [ref] + alt
gt = fields[9:]
# Exclude sites in repeats and DR loci
if position in excluded:
discarded["In repeat or DR locus"] += 1
continue
# Only SNPs
if len(ref) > 1 or any(len(x) > 1 for x in alt):
discarded["Not SNP"] += 1
continue
# Check if all strains have genotypes (not sure what happened if not)
if len(gt) != len(strains):
print(fields)
#print(fields)
continue
# Write alleles to site
site = ''
for strain in strains:
strain_i = strains.index(strain)
strain_gt = gt[strain_i]
if not strain_gt or strain_gt not in '012345':
strain_allele = '-'
else:
strain_allele = alleles[int(strain_gt)]
site += strain_allele
# Exclude sites with more too many missing genotypes
prop_missing = site.count('-') / len(site)
if prop_missing > max_missing_prop:
discarded["Missing genotypes"] += 1
continue
# Re-check if there are remaining invariant sites
bases_only = site.replace('-', '')
bases_only += ref
if len(set(bases_only)) == 1:
discarded["Invariant"] += 1
continue
for i, allele in enumerate(site):
strain = strains[i]
strain_seqs[strain] += allele
# Write H37Rv allele
strain_seqs["H37Rv"] += ref
keep.append((position, alleles, gt))
print("Filtered out:")
for key, value in discarded.items():
print(key, str(value))
print("Kept: ", str(len(keep)))
```
%% Output
['H37Rv', '2163475', '>10733>10736', 'G', 'A', '60', '.', 'AC=0;AF=0;AN=17;AT=>10733>10734>10736,>10733>10735>10736;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']
['H37Rv', '2163481', '>10736>10739', 'G', 'A', '60', '.', 'AC=0;AF=0;AN=17;AT=>10736>10737>10739,>10736>10738>10739;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']
['H37Rv', '2163484', '>10739>10742', 'A', 'C', '60', '.', 'AC=0;AF=0;AN=17;AT=>10739>10740>10742,>10739>10741>10742;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']
['H37Rv', '2163486', '>10742>10745', 'C', 'G', '60', '.', 'AC=0;AF=0;AN=17;AT=>10742>10743>10745,>10742>10744>10745;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']
['H37Rv', '2163488', '>10745>10748', 'T', 'C', '60', '.', 'AC=0;AF=0;AN=17;AT=>10745>10746>10748,>10745>10747>10748;CONFLICT=PB000196;NS=17;LV=1;PS=>10706>10752', 'GT', '', '.', '', '', '', '', '', '', '0']
['H37Rv', '3113846', '>15187>15190', 'G', 'C', '60', '.', 'AC=0;AF=0;AN=17;AT=>15187>15188>15190,>15187>15189>15190;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '', '', '.', '', '', '', '0', '0']
['H37Rv', '3113855', '>15193>15196', 'G', 'C', '60', '.', 'AC=0;AF=0;AN=17;AT=>15193>15194>15196,>15193>15195>15196;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '', '', '.', '', '', '', '0', '0']
['H37Rv', '3113862', '>15196>15199', 'C', 'T', '60', '.', 'AC=0;AF=0;AN=17;AT=>15196>15198>15199,>15196>15197>15199;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '', '', '.', '', '', '', '0', '0']
['H37Rv', '3113874', '>15204>15207', 'G', 'A', '60', '.', 'AC=0;AF=0;AN=17;AT=>15204>15205>15207,>15204>15206>15207;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '0', '', '.', '', '', '', '0', '0']
['H37Rv', '3113879', '>15207>15210', 'C', 'T', '60', '.', 'AC=0;AF=0;AN=17;AT=>15207>15208>15210,>15207>15209>15210;CONFLICT=PB000206;NS=17;LV=1;PS=>15186>15211', 'GT', '', '0', '', '', '', '', '', '', '0', '', '.', '', '', '', '0', '0']
['H37Rv', '3935950', '>19372>19376', 'T', 'C', '60', '.', 'AC=17;AF=1;AN=17;AT=>19372>19373>19376,>19372>19375>19376;NS=17;LV=2;PS=>19356>19447', 'GT', '1', '1', '1', '', '.', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']
Filtered out:
In repeat or DR locus 729
Not SNP 859
Missing genotypes 312
Invariant 21
Kept: 4505
In repeat or DR locus 849
Not SNP 967
Missing genotypes 317
Invariant 1
Kept: 5115
%% Cell type:markdown id: tags:
Write variable alignment to fasta
%% Cell type:code id: tags:
``` python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
records = []
for strain in ["H37Rv"] + strains:
sequence = Seq(strain_seqs[strain])
record = SeqRecord(sequence, id=strain, name="", description="")
records.append(record)
SeqIO.write(records, "results/H37Rv_SNP_alignment.from_graph.fasta", "fasta")
```
%% Output
19
%% Cell type:markdown id: tags:
# Count non-variable positions
The idea here is to use the graph and find all nodes which are shared by all strains (i.e. they do not vary).
From these we can estimate the number of non-variable positions.
%% Cell type:code id: tags:
``` python
from collections import Counter
graph='results/pggb/assemblies_4mb.fasta.gz.a206ba8.417fcdf.47e7f7c.smooth.final.gfa'
n_paths = 0
node_seqs = {}
node_count = {}
# First traversal: get nodes present in all paths
with open(graph) as f:
for line in f:
fields = line.strip().split('\t')
if line.startswith('S'):
node_seqs[fields[1]] = fields[2]
elif line.startswith('P'):
n_paths += 1
paff = [x[:-1] for x in fields[2].split(',') ]
for node in paff:
if node not in node_count:
node_count[node] = 0
node_count[node] += 1
invariable_seq = ''
for node in node_count:
if node_count[node] == n_paths:
invariable_seq += node_seqs[node]
base_counts = Counter(invariable_seq)
for base in base_counts:
print(base, base_counts[base])
total = sum([base_counts[base] for base in base_counts])
gc = base_counts['G'] + base_counts['C']
print('Total invariable bases: ', total)
print('GC content: ', gc/total)
```
%% Output
G 1354977
T 713950
C 1359777
A 712428
Total invariable bases: 4141132
GC content: 0.6555584318490693
%% Cell type:markdown id: tags:
# Estimate a tree
Use Stamatakis model in RAXML to account for invariable sites and correct branch lengths.
... using your method of choice.
--model GTR+G+ASC_STAM{633511/1206483/1214591/640523}
Below I use raxml-ng with the Stamatakis model to account for invariable sites and correct branch lengths.
%% Cell type:code id: tags:
``` python
%%bash
echo "\
#!/bin/bash
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=1G
#SBATCH --time=06:00:00
#SBATCH --qos=6hours
#SBATCH --output=results/raxml_stdout.%j
#SBATCH --error=results/raxml_stderr.%j
source ~/miniconda3/etc/profile.d/conda.sh
conda activate raxml
raxml-ng \
--msa results/H37Rv_SNP_alignment.from_graph.fasta \
--all --model GTR+G+ASC_STAM{633312/1206359/1214398/640320} \
--all --model GTR+G+ASC_STAM{712428/1359777/1354977/713950} \
--tree pars{10} --bs-trees 100 \
--outgroup H37Rv \
--threads 20" > run_raxml.slrm
sbatch run_raxml.slrm
```
%% Output
Submitted batch job 12899598
Submitted batch job 13472974
%% Cell type:markdown id: tags:
Plot the tree, for example with https://itol.embl.de.
❓ Is there anything remarkable about the tree?
......
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
%% Cell type:markdown id: tags:
# Explore potential large duplications
The PB000198 and PB000220 assemblies could both not be closed and show the same pattern in the assembly graph: a piece of 110kb that is repeated twice and could not be integrated into the chromosome. These two isolates are closely related, so most likely the reason their assembly failed is biological and not technical.
One useful way to explore what is going on here is to go back to the long reads, align them against a closely related complete genome, and visualize the problematic region with IGV.
%% Cell type:code id: tags:
```
%%bash
READS_PB198=
ASM=
```
%% Cell type:markdown id: tags:
Load the assembly and the aligned reads in IGV (available on sciCORE Desktop).
❓Can you find the problematic region? What genes does it contain?
❓What has happened here, why has the assembly failed?
......
#!/bin/bash
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=1G
#SBATCH --time=06:00:00
#SBATCH --qos=6hours
#SBATCH --output=results/raxml_stdout.%j
#SBATCH --error=results/raxml_stderr.%j
source ~/miniconda3/etc/profile.d/conda.sh
conda activate raxml
raxml-ng --msa results/H37Rv_SNP_alignment.from_graph.fasta --all --model GTR+G+ASC_STAM{712428/1359777/1354977/713950} --tree pars{10} --bs-trees 100 --outgroup H37Rv --threads 20
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment