Skip to content
Snippets Groups Projects
Commit b3e7b681 authored by Christoph Stritt's avatar Christoph Stritt
Browse files

Revert "minor changes"

This reverts commit 116f3f4c
parent 116f3f4c
No related branches found
No related tags found
1 merge request!2Revert "minor changes"
%% Cell type:markdown id: tags:
# Create a pangenome graph with PGGB
PGGB requires as input a single fasta file that contains the genome assemblies.
The file **/data/MTBC.L1.assemblies.fasta** contains all the contigs resulting from the assembly of the 19 L1 strains, while **/data/assembly_summaries.tsv** contains assembly metadata.
Starting from these two files, we do the following:
- load the metadata, select complete genomes
- write all complete genomes to a fasta
❓ What do we mean with 'complete'?
%% Cell type:markdown id: tags:
# Inspect the Flye assembly summary
%% Cell type:code id: tags:
``` python
import pandas
md = pandas.read_csv('data/assembly_summaries.tsv', sep='\t')
md.head()
```
%% Output
sample #seq_name length cov. circ. repeat mult. alt_group graph_path
0 PB000195 contig_1 4444823 114 Y N 1 * 1
1 PB000196 contig_10 4421359 133 N N 1 * 10,41
2 PB000196 contig_47 13589 32 N Y 1 48 *,47,*
3 PB000196 contig_1 10270 45 N Y 1 6 *,1,*
4 PB000196 contig_42 1662 52 N N 1 * -41,42
%% Cell type:markdown id: tags:
How many sequences per strain?
%% Cell type:code id: tags:
``` python
md['sample'].value_counts()
```
%% Output
sample
PB000205 49
PB000202 7
PB000196 4
PB000220 2
PB000198 2
PB000207 1
PB000219 1
PB000211 1
PB000210 1
PB000209 1
PB000208 1
PB000195 1
PB000206 1
PB000203 1
PB000201 1
PB000200 1
PB000199 1
PB000197 1
PB000204 1
Name: count, dtype: int64
%% Cell type:markdown id: tags:
Which sequences are cirularized?
%% Cell type:code id: tags:
``` python
md[md['circ.'] == 'Y']
```
%% Output
sample #seq_name length cov. circ. repeat mult. alt_group \
0 PB000195 contig_1 4444823 114 Y N 1 *
5 PB000197 contig_1 4410932 129 Y N 1 *
8 PB000199 contig_1 4426248 112 Y N 1 *
9 PB000200 contig_1 4428581 84 Y N 1 *
10 PB000201 contig_1 4408479 96 Y N 1 *
18 PB000203 contig_1 4434008 47 Y N 1 *
19 PB000204 contig_1 4419776 74 Y N 1 *
20 PB000205 contig_50 4420883 37 Y Y 2 *
69 PB000206 contig_1 4440794 119 Y N 1 *
70 PB000207 contig_1 4424460 85 Y N 1 *
71 PB000208 contig_1 4419922 119 Y N 1 *
72 PB000209 contig_1 4443791 80 Y N 1 *
73 PB000210 contig_1 4444571 115 Y N 1 *
74 PB000211 contig_1 4441994 120 Y N 1 *
75 PB000219 contig_1 4410932 101 Y N 1 *
graph_path
0 1
5 1
8 1
9 1
10 1
18 1
19 1
20 50
69 1
70 1
71 1
72 1
73 1
74 1
75 1
%% Cell type:markdown id: tags:
Sequences larger than 4 Mb
%% Cell type:code id: tags:
``` python
md[md['length']>4e6]
```
%% Output
sample #seq_name length cov. circ. repeat mult. alt_group \
0 PB000195 contig_1 4444823 114 Y N 1 *
1 PB000196 contig_10 4421359 133 N N 1 *
5 PB000197 contig_1 4410932 129 Y N 1 *
6 PB000198 contig_1 4332805 86 N N 1 *
8 PB000199 contig_1 4426248 112 Y N 1 *
9 PB000200 contig_1 4428581 84 Y N 1 *
10 PB000201 contig_1 4408479 96 Y N 1 *
18 PB000203 contig_1 4434008 47 Y N 1 *
19 PB000204 contig_1 4419776 74 Y N 1 *
20 PB000205 contig_50 4420883 37 Y Y 2 *
69 PB000206 contig_1 4440794 119 Y N 1 *
70 PB000207 contig_1 4424460 85 Y N 1 *
71 PB000208 contig_1 4419922 119 Y N 1 *
72 PB000209 contig_1 4443791 80 Y N 1 *
73 PB000210 contig_1 4444571 115 Y N 1 *
74 PB000211 contig_1 4441994 120 Y N 1 *
75 PB000219 contig_1 4410932 101 Y N 1 *
76 PB000220 contig_1 4319282 88 N N 1 *
graph_path
0 1
1 10,41
5 1
6 2,1,2
8 1
9 1
10 1
18 1
19 1
20 50
69 1
70 1
71 1
72 1
73 1
74 1
75 1
76 2,1,2
%% Cell type:markdown id: tags:
For which samples do we lack a circular genome?
%% Cell type:code id: tags:
``` python
samples_circ = md['sample'][md['circ.'] == 'Y']
samples_noncirc = md['sample'][md['sample'].isin(samples_circ) == False]
print(set(samples_noncirc))
```
%% Output
{'PB000220', 'PB000202', 'PB000198', 'PB000196'}
%% Cell type:markdown id: tags:
❓ What is the matter with these genomes?
❓ Should we include them in the graph?
%% Cell type:code id: tags:
``` python
md[md['sample'] == 'PB000205']
```
%% Output
sample #seq_name length cov. circ. repeat mult. alt_group \
20 PB000205 contig_50 4420883 37 Y Y 2 *
21 PB000205 contig_49 216232 6 N Y 1 *
22 PB000205 contig_53 163344 6 N Y 1 *
23 PB000205 contig_51 142109 6 N Y 1 *
24 PB000205 contig_9 131264 6 N Y 1 *
25 PB000205 contig_8 120360 6 N Y 1 *
26 PB000205 contig_37 117735 6 N Y 1 *
27 PB000205 contig_2 117435 7 N Y 1 *
28 PB000205 contig_33 104292 6 N Y 1 *
29 PB000205 contig_44 102971 6 N Y 1 *
30 PB000205 contig_31 93943 5 N Y 1 *
31 PB000205 contig_6 89619 5 N Y 1 *
32 PB000205 contig_5 87578 6 N Y 1 *
33 PB000205 contig_23 85152 5 N Y 1 *
34 PB000205 contig_13 84548 5 N Y 1 *
35 PB000205 contig_19 76959 5 N Y 1 *
36 PB000205 contig_54 74739 5 N Y 1 *
37 PB000205 contig_52 72581 5 N Y 1 *
38 PB000205 contig_21 69776 6 N Y 1 *
39 PB000205 contig_20 68663 5 N Y 1 *
40 PB000205 contig_22 68060 6 N Y 1 *
41 PB000205 contig_15 67068 6 N Y 1 *
42 PB000205 contig_25 66397 7 N Y 1 *
43 PB000205 contig_12 55613 7 N Y 1 *
44 PB000205 contig_18 52108 5 N Y 1 *
45 PB000205 contig_17 51926 6 N Y 1 *
46 PB000205 contig_1 47711 6 N Y 1 *
47 PB000205 contig_48 45758 6 N Y 1 *
48 PB000205 contig_42 45451 5 N Y 1 *
49 PB000205 contig_38 45013 6 N Y 1 *
50 PB000205 contig_43 41645 5 N Y 1 *
51 PB000205 contig_34 39550 6 N Y 1 *
52 PB000205 contig_26 38539 7 N Y 1 *
53 PB000205 contig_30 37006 7 N Y 1 *
54 PB000205 contig_47 35474 6 N Y 1 *
55 PB000205 contig_28 33249 7 N Y 1 *
56 PB000205 contig_32 33146 5 N Y 1 *
57 PB000205 contig_41 32796 5 N Y 1 *
58 PB000205 contig_3 31924 7 N Y 1 *
59 PB000205 contig_40 31087 6 N Y 1 *
60 PB000205 contig_16 28303 6 N Y 1 *
61 PB000205 contig_36 26152 6 N Y 1 *
62 PB000205 contig_46 24509 6 N Y 1 *
63 PB000205 contig_45 19766 6 N Y 1 *
64 PB000205 contig_14 15999 5 N Y 1 *
65 PB000205 contig_24 15160 8 N Y 1 *
66 PB000205 contig_29 13880 6 N Y 1 *
67 PB000205 contig_39 12605 8 N Y 1 *
68 PB000205 contig_4 11914 6 N Y 1 *
graph_path
20 50
21 *,49,*
22 *,53,*
23 *,51,*
24 *,9,*
25 *,8,*
26 *,37,*
27 *,2,*
28 *,33,*
29 *,44,*
30 *,31,*
31 *,6,*
32 *,5,*
33 *,23,*
34 *,13,*
35 *,19,*
36 *,54,*
37 *,52,*
38 *,21,*
39 *,20,*
40 *,22,*
41 *,15,*
42 *,25,*
43 *,12,*
44 *,18,*
45 *,17,*
46 *,1,*
47 *,48,*
48 *,42,*
49 *,38,*
50 *,43,*
51 *,34,*
52 *,26,*
53 *,30,*
54 *,47,*
55 *,28,*
56 *,32,*
57 *,41,*
58 *,3,*
59 *,40,*
60 *,16,*
61 *,36,*
62 *,46,*
63 *,45,*
64 *,14,*
65 *,24,*
66 *,29,*
67 *,39,*
68 *,4,*
%% Cell type:markdown id: tags:
# Combine complete genomes
Let us be liberal and just use all sequences > 4 Mb. We can add a tag to the name of the 'unclean' ones.
First a quick look at the fasta headers.
%% Cell type:code id: tags:
``` python
%%bash
grep ^\> data/MTBC.L1.assemblies.fasta
```
%% Output
>PB000195_1
>PB000196_1
>PB000196_2
>PB000196_3
>PB000197_1
>PB000198_1
>PB000198_2
>PB000199_1
>PB000200_1
>PB000201_1
>PB000202_1
>PB000202_2
>PB000202_3
>PB000202_4
>PB000202_5
>PB000202_6
>PB000202_7
>PB000203_1
>PB000204_1
>PB000205_1
>PB000205_2
>PB000205_3
>PB000205_4
>PB000205_5
>PB000205_6
>PB000205_7
>PB000205_8
>PB000205_9
>PB000205_10
>PB000205_11
>PB000205_12
>PB000205_13
>PB000205_14
>PB000205_15
>PB000205_16
>PB000205_17
>PB000205_18
>PB000205_19
>PB000205_20
>PB000205_21
>PB000205_22
>PB000205_23
>PB000205_24
>PB000205_25
>PB000205_26
>PB000205_27
>PB000205_28
>PB000205_29
>PB000205_30
>PB000205_31
>PB000205_32
>PB000205_33
>PB000205_34
>PB000205_35
>PB000205_36
>PB000205_37
>PB000205_38
>PB000205_39
>PB000205_40
>PB000205_41
>PB000205_42
>PB000205_43
>PB000205_44
>PB000205_45
>PB000205_46
>PB000205_47
>PB000205_48
>PB000205_49
>PB000206_1
>PB000207_1
>PB000208_1
>PB000209_1
>PB000210_1
>PB000211_1
>PB000219_1
>PB000220_1
>PB000220_2
%% Cell type:code id: tags:
``` python
from Bio import SeqIO
assemblies_all = 'data/MTBC.L1.assemblies.fasta'
assemblies_pggb = open('results/assemblies_4mb.fasta', 'w')
# All sequences > 4Mb
md_4mb = md[md['length']>4e6]
for rec in SeqIO.parse(assemblies_all, "fasta"):
if len(rec.seq) < 4e6:
continue
strain = rec.id.split('_')[0]
strain_circ = md_4mb['circ.'][md_4mb['sample'] == strain].to_string(index=False)
strain_repeat = md_4mb['repeat'][md_4mb['sample'] == strain].to_string(index=False)
new_id = strain
if strain_circ == 'N':
new_id += '#circNo'
if strain_repeat == 'Y':
new_id += '#repeatY'
rec.id = new_id
rec.description = ''
SeqIO.write(rec, assemblies_pggb, 'fasta')
```
%% Cell type:markdown id: tags:
# Run PGGB
Important paramters:
-i : path to combined assemblies
-o : output directory name
-n : Number of haplotypes
-p : Similarity of contigs
-s : Maximum length of repeats in the genome
%% Cell type:markdown id: tags:
Create slurm script
## Create slurm script
%% Cell type:code id: tags:
``` python
%%bash
echo '\
#!/bin/bash
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=1G
#SBATCH --time=06:00:00
#SBATCH --qos=6hours
#SBATCH --output=stdout.%j
#SBATCH --error=stderr.%j
singularity exec /scicore/home/gagneux/GROUP/PacbioSnake_resources/containers/pggb_latest.sif pggb \
-i assemblies_combined.fasta.gz \
-o ./pggb_L1 \
-o ./pggb \
-m \
-t 20 \
-n 18 \
-n 52 \
-p 99 \
-s 5k' > run_pggb.slrm
-s 5k
```
%% Cell type:markdown id: tags:
Submit it
# Variant calling
%% Cell type:code id: tags:
``` python
%%bash
sbatch run_pggb.slrm
REF=
/scicore/home/gagneux/GROUP/PacbioSnake_resources/containers/pggb_latest.sif vg deconstruct
```
%% Cell type:markdown id: tags:
# Visualization
![](pics/is6110_ppe.small.png)
https://www.evomicslab.org/app/vrpg/
......
workshop/pics/is6110_ppe.small.png

37.7 KiB

0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment