%% Cell type:markdown id: tags:
# 14-15.05.2018 - Cape Town Genomics Workshop part III
%% Cell type:markdown id: tags:
## 4. Variant Calling
To call SNPs, we first need to generate a pileup file. A pileup file summarizes, for each position in the reference genome, the reads covering it: how many there are, which bases they carry and with what qualities. The most important difference from the previous formats is that those were **read-centered outputs**, whereas now we move to **reference-centered outputs**. We will use SAMtools again to produce the pileup.
### 4.1 Mpileup
The first step is to create a 'pileup file'. For this we use the sorted, filtered BAM file that we produced in the previous mapping step and the *M. tuberculosis* reference genome.
A pileup is a column-wise representation, at the base level, of the reads aligned to the reference. The pileup file summarises the read data at every genomic position that is covered by at least one read.
Please type the following command into your terminal:
Each line consists of the chromosome name, the genomic position, the reference base, the number of reads covering the site, read bases and base qualities.
In the read base column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, `ACGTN` for a mismatch on the forward strand and `acgtn` for a mismatch on the reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+` indicates an insertion between this reference position and the next one. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence.
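To make the encoding of the read base column concrete, here is a minimal Python sketch that counts the symbols described above in a single read-base string. The function name `summarize_read_bases` and the example string are made up for illustration. The sketch also skips a few symbols of the samtools pileup format that the text does not cover (`^` plus a mapping-quality character at a read start, `$` at a read end, `-[0-9]+[ACGTNacgtn]+` for deletions and `*` for a deleted base), and it assumes a well-formed input string.

```python
import re

def summarize_read_bases(bases):
    """Count matches, mismatches and insertions in one pileup read-base string."""
    counts = {"match_fwd": 0, "match_rev": 0,
              "mismatch_fwd": 0, "mismatch_rev": 0, "insertions": []}
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # Read start: the next character encodes the mapping quality, skip both.
            i += 2
            continue
        if c in "+-":
            # Indel: sign, then the length as an integer, then that many bases.
            digits = re.match(r"[0-9]+", bases[i + 1:]).group()
            start = i + 1 + len(digits)
            end = start + int(digits)
            if c == "+":
                counts["insertions"].append(bases[start:end])
            i = end
            continue
        if c == ".":
            counts["match_fwd"] += 1
        elif c == ",":
            counts["match_rev"] += 1
        elif c in "ACGTN":
            counts["mismatch_fwd"] += 1
        elif c in "acgtn":
            counts["mismatch_rev"] += 1
        # Any other symbol (for example "$" or "*") is ignored in this sketch.
        i += 1
    return counts

# Hypothetical read-base string, not taken from the workshop data.
print(summarize_read_bases("..,,A.c+2AG,^].$"))
```

In the example string, `+2AG` is read as one insertion of two bases (`AG`), so those two letters are not counted as mismatches.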
SNP and indel calling algorithms can scan this file to look for differences between the reference allele and the most common allele found in the reads. For the SNP calling, we will use VarScan, whose output is easy to interpret.
### 4.2 SNP and InDel calling
Please type the following command into your terminal:
VarScan makes consensus calls (SNP/Indel/Reference) from the 'pileup' file created in the previous step.
We can tell VarScan which parameters to use when making the consensus calls.
These parameters are:
- Minimum read depth at a position to make a call (`--min-coverage`): 7
- Minimum base quality at a position to count a read (`--min-avg-qual`): 20
- Minimum variant allele frequency threshold (`--min-var-freq`): 0.1
- Minimum frequency to call homozygote (`--min-freq-for-hom`): 0.9
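The way these thresholds interact can be illustrated with a small Python sketch. This is a deliberately simplified toy, not VarScan's actual algorithm: VarScan applies further tests (for example a p-value threshold) that are left out here. The function name `call_site` and its return labels are invented for illustration, and `depth` and `alt_reads` are assumed to count only reads that already passed the base quality filter.

```python
def call_site(depth, alt_reads, min_coverage=7, min_var_freq=0.1,
              min_freq_for_hom=0.9):
    """Toy classification of one site from read counts (not VarScan's real logic)."""
    if depth < min_coverage:
        return "no call"               # too few reads to say anything
    freq = alt_reads / depth           # variant allele frequency
    if freq < min_var_freq:
        return "reference"             # variant too rare, keep the reference
    if freq >= min_freq_for_hom:
        return "homozygous variant"    # (almost) all reads carry the variant
    return "heterozygous variant"      # variant present in only part of the reads

# Hypothetical read counts, chosen to hit each branch.
for depth, alt in [(5, 5), (20, 1), (20, 8), (20, 19)]:
    print(depth, alt, call_site(depth, alt))
```

With the defaults above, a site needs at least 7 reads to be called at all, and a variant supported by 90% or more of the reads is reported as homozygous.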
What is a VCF file?
Variant Call Format (VCF) is a text file format for storing marker genotype data.
Every VCF file has three parts in the following order:
- Meta-information lines (lines beginning with "##").
- One header line (line beginning with "#CHROM").
- Data lines containing marker and genotype data (one variant per line). A data line is called a VCF record.
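The three-part layout can be sketched in a few lines of Python. The function name `split_vcf` and the tiny VCF text below are made up for illustration (they are not the workshop output), and the sketch assumes a well-formed, tab-separated file in which the header line comes before the records.

```python
def split_vcf(text):
    """Split VCF text into meta-information lines, header columns and records."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if not line.strip():
            continue                                  # skip empty lines
        if line.startswith("##"):
            meta.append(line)                         # meta-information line
        elif line.startswith("#CHROM"):
            header = line.lstrip("#").split("\t")     # column names
        else:
            # One VCF record: pair each column name with its value.
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records

# Toy VCF content, invented for this example.
example = "\n".join([
    "##fileformat=VCFv4.1",
    "##source=example",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["chr1", "1234", ".", "A", "G", ".", "PASS", "."]),
])

meta, header, records = split_vcf(example)
print(len(meta), header, records[0]["POS"], records[0]["FILTER"])
```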
%% Cell type:markdown id: tags:
Let's look into the VCF we've just created:
%% Cell type:code id: tags:
``` python
! cat ERR760779.snps.vcf
```
%% Cell type:code id: tags:
``` python
! grep -v 'PASS' ERR760779.snps.vcf
```
%% Cell type:markdown id: tags:
### 4.3 Annotation of variants
The next step is to annotate the variants we have discovered, i.e. to attribute functional characteristics to them: whether they fall in coding or intergenic regions, which genes they affect, whether they are synonymous or nonsynonymous, etc.
For this we use the bioinformatics program *snpEff*. SnpEff is a **variant annotation and effect prediction tool**. It annotates and predicts the effects of variants on genes (such as amino acid changes).
For more information on snpEff you can read the documentation: http://snpeff.sourceforge.net/SnpEff_manual.html
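To illustrate what "synonymous" and "nonsynonymous" mean, the following Python sketch compares the amino acid encoded by a reference codon with the one encoded by a mutated codon. This is not how snpEff is implemented (snpEff works from a full genome annotation); the function name `classify_coding_change` is invented, and the codons in the example are arbitrary. The table uses the standard codon-to-amino-acid assignments, which the bacterial translation table shares.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for all 64 codons, ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_coding_change(ref_codon, alt_codon):
    """Return (effect, reference amino acid, alternative amino acid)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    effect = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return effect, ref_aa, alt_aa

# Arbitrary example codons: the first change keeps glutamate (E),
# the second replaces it with valine (V).
print(classify_coding_change("GAA", "GAG"))
print(classify_coding_change("GAA", "GTA"))
```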
Please type the following command into your terminal:
- *-interval* to specify additional custom-made annotation. We've created this file; it mainly contains annotation on gene essentiality (as reported in this paper: http://mbio.asm.org/content/8/1/e02133-16.abstract). You can complement this file with any extra annotation you want!
if i[0] != 'position':  # ignore the first line of the file, as it is the header
    # print the mutation, the number of genomes with this mutation in a fixed
    # state (1/1) and the number of genomes with it in a variable state (0/1)
    print(i[4], i.count('1/1'), i.count('0/1'))
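The loop and the input file that the fragment above belongs to are not shown here, so the following is only a hedged, self-contained sketch of the same counting idea. The table layout (a header row starting with `position`, the mutation in the fifth column, one genotype column per genome), the function name `count_genotypes` and all values are assumptions made for illustration.

```python
# Hypothetical table: one row per variant, one genotype column per genome.
rows = [
    ["position", "ref", "alt", "gene", "mutation", "genome_1", "genome_2", "genome_3"],
    ["100", "A", "G", "geneA", "A100G", "1/1", "0/1", "1/1"],
    ["250", "C", "T", "geneB", "C250T", "0/0", "0/1", "0/0"],
]

def count_genotypes(rows):
    """Return (mutation, fixed count, variable count) for every data row."""
    summary = []
    for row in rows:
        if row[0] == "position":
            continue                     # skip the header row
        fixed = row.count("1/1")         # genomes where the mutation is fixed
        variable = row.count("0/1")      # genomes where the mutation is variable
        summary.append((row[4], fixed, variable))
    return summary

for mutation, fixed, variable in count_genotypes(rows):
    print(mutation, fixed, variable)
```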