nb3

1ed887b6 · Chloe Marie Loiseau · bc1dd3d8 · 1ed887b6
Commit 1ed887b6 authored 7 years ago by Chloe Marie Loiseau
--- a/notebooks/CapeTown_Genomics_Tutorial_partIII.ipynb
+++ b/notebooks/CapeTown_Genomics_Tutorial_partIII.ipynb
 {
 "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 14-15.05.2018 - Cape Town Genomics Workshop part III"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},

 %% Cell type:markdown id: tags:

+# 14-15.05.2018 - Cape Town Genomics Workshop part III
+
+%% Cell type:markdown id: tags:
+
 ## 4. Variant Calling

 In order to call SNPs, we need first to generate a pileup file. A pileup file summarizes, for each position in the reference genome, the number of reads covering it as well as the mapping quality of those reads. So, the most important difference with respect to the previous formats is that those were **read-centered outputs** and now we move to **reference-centered outputs**. We will use SAMtools again to produce the pileup:


 ### 4.1 Mpileup

 The first step is to create a 'pileup file'. For this we use the sorted filtered bam-file that we produced in the mapping step before and the *M. tuberculosis* reference genome.

 A pileup is a column wise representation of the aligned read - at the base level - to the reference. The pileup file summarises all data from the reads at each genomic region that is covered by at least one read.

 please type the following command into your terminal:

    - sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_mpileup.slurm

 %% Cell type:code id: tags:

 ``` python
 ! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_mpileup.slurm
 ```

 %% Cell type:code id: tags:

 ``` python
 ! head -n 20 ERR760779.pileup
 ```

 %% Cell type:markdown id: tags:

 Each line consists of the chromosome name, the genomic position, the reference base, the number of reads covering the site, read bases and base qualities.
 At the read base column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, `ACGTN' for a mismatch on the forward strand and `acgtn' for a mismatch on the reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence.

 source: http://samtools.sourceforge.net/pileup.shtml


 SNP and indel calling algorithms can scan this file to look at differences between the reference allele and the most common allele found in the reads. For the SNP calling, we will use VarScan which output is easy to interpret


 ### 4.2 SNP and InDel calling

 please type the following command into your terminal:

    - sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_varscan.slurm

 %% Cell type:code id: tags:

 ``` python
 ! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_varscan.slurm
 ```

 %% Cell type:markdown id: tags:


 VarScan makes consensus calls (SNP/Indel/Reference) from the 'pileup' file based created in the previous step.
 We can indicate to Varscan parameters make the consensus calls.

 These parameters are:
 - Minimum read depth at a position to make a call (--min-coverage)   : 7
 - Minimum base quality at a position to count a read (--min-avg-qual): 20
 - Minimum variant allele frequency threshold (--min-var-freq)        : 0.1
 - Minimum frequency to call homozygote (--min-freq-for-hom)          : 0.9


 What is a VCF file:

 Variant Call Format (VCF) is a text file format for storing marker genotype data.

 Every VCF file has three parts in the following order:

 - Meta-information lines (lines beginning with "##").
 - One header line (line beginning with "#CHROM").
 - Data lines contain marker and genotype data (one variant per line). A data line is called a VCF record.


 %% Cell type:markdown id: tags:

 Let's look into the VCF we've just created:

 %% Cell type:code id: tags:

 ``` python
 ! cat ERR760779.snps.vcf
 ```

 %% Cell type:code id: tags:

 ``` python
 ! grep -v 'PASS' ERR760779.snps.vcf
 ```

 %% Cell type:markdown id: tags:

 ### 4.3 Annotation of variants

 The next steps would be to annotate the variants we have discovered (attribute functional characteristics to the variants, e.g if they fall on coding or intergenetic regions, in which genes, are they synonymous or nonsynonymous, etc).

 For this we use the bioinformatics programm *snpEff*. SnpEff is a **variant annotation and effect prediction tool**. It annotates and predicts the effects of variants on genes (such as amino acid changes).

 For more information on snpEff you can read the documentation: http://snpeff.sourceforge.net/SnpEff_manual.html

 Please type the following command into your terminal:

    - sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_snpEff.slurm


 **Note**: we will only annotate the VCF with SNPs in this course.

 %% Cell type:code id: tags:

 ``` python
 ! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_snpeff.slurm
 ```

 %% Cell type:markdown id: tags:

 We use:
 - *-interval* to specify additionnal custom-made annotation. We've created this file and it contains annotation on essentiality mainly (as reported in this paper: http://mbio.asm.org/content/8/1/e02133-16.abstract). You can complement this file with any extra annotation you want !


 %% Cell type:code id: tags:

 ``` python
 ! cat ~/Workshop_SA/notebooks/annotation/additionnal_annotations.bed
 ```

 %% Cell type:markdown id: tags:

 ## 5. Data cleaning

 We have created a lot of intermediate files which we no longer need.
 We should remove them as they take up unnecessary disk space.

 %% Cell type:code id: tags:

 ``` python
 ! rm ERR760779.snps.vcf
 ! rm ERR760779.pileup
 ! rm ERR760779.sam
 ```

 %% Cell type:code id: tags:

 ``` python
 ! rm ERR760779_1P.trimmed.fastq.gz
 ! rm ERR760779_1U.trimmed.fastq.gz
 ! rm ERR760779_2P.trimmed.fastq.gz
 ! rm ERR760779_2U.trimmed.fastq.gz
 ```

 %% Cell type:code id: tags:

 ``` python
 ! rm ERR760779_paired.bam
 ! rm ERR760779.sorted.bam
 ! rm ERR760779_1U.bam
 ! rm ERR760779_1U.sorted.bam
 ! rm ERR760779_2U.bam
 ! rm ERR760779_2U.sorted.bam
 ```

 %% Cell type:markdown id: tags:

 #### How to reduce the amount of files produced ?

 By using *pipe* (|) one can ensure that the output of one command (stdout) feeds directly as input (stdin) to the next one.

 For example:

    - bwa mem -t 2 -M ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta ERR760779_1P.trimmed.fastq.gz ERR760779_2P.trimmed.fastq.gz | samtools view -Sbhu -|samtools sort -m 4G - ERR760779_paired.sorted

    - samtools mpileup -ABQ0 -q 20 -f ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta ERR760779.dedup.bam | varscan mpileup2indel ERR760779.pileup --min-coverage 7 --min-avg-qual 20 --min-var-freq 0.1 --min-freq-for-hom 0.9 --strand-filter 1 --output-vcf 1 1> ERR760779.indels.vcf

 %% Cell type:markdown id: tags:

 ## 6. Parsing the annotated VCF for SNPs involved in drug-resistance
 We've added a file with a list of high-confidence drug resistance-mutations in the MTBC:

 %% Cell type:code id: tags:

 ``` python
 ! cat Drug_resistance_mutations_MTBC.txt
 ```

 %% Cell type:markdown id: tags:

 Let us grep for non-synonymous mutations in Rv0667 (rpoB):

 %% Cell type:code id: tags:

 ``` python
 ! grep 'Rv0667' ERR760779.snp.ann.vcf | grep 'missense_variant'
 ```

 %% Cell type:markdown id: tags:

 Our genome had the **S450L** mutation which confers resistance to rifampicin.

 This position is covered by 164 reads (ADP=164) and 163 of these reads have a T instead of a C at this position.

 the SNP frequency is thus: **99.39%**. We can be confident this SNP is a true positive.

 %% Cell type:markdown id: tags:

 ### A more powerful way to look for mutations from WGS data


 #### How to efficiently recovered all drug resistance mutations in our genome ?

 We will use python for that:

 %% Cell type:code id: tags:

 ``` python
 list_of_genomics_positions = []
 list_of_mutations = []
 with open('Workshop_SA/notebooks/Drug_resistance_mutations_MTBC.txt','r') as input_file: # open the file with the list of known DRMs
    for row in input_file: # Loop into the list of known DRMs
        row = row.strip() # remove the \n at the end of the rows
        row = row.split() # split each row by tab (\t). Each row is now as a python list
        list_of_genomics_positions += [row[1]]
        ref_base = row[2]
        alt_base = row[3]
        locus = row[4]
        substitution = row[5]
        drug = row[6]
        source = row[7]
        list_of_mutations += [[row[1],ref_base,alt_base,locus,substitution,drug,source]]
 ```

 %% Cell type:code id: tags:

 ``` python
 VCF_dico = {}
 with open('ERR760779.snp.ann.vcf','r') as VCF: # open the VCF
    for i in VCF: # Loop into the lines of the VCF
        if i[0] != '#': # if the first character of the line in the VCF is # ( we want to ignore the header of the VCF)
            i = i.strip()
            i = i.split()

            POS = str(i[1]) # position is the second column of the VCF
            REF = str(i[3]) # reference is the fourth column of the VCF
            ALT = str(i[4]) # ALT is the fifth column of the VCF
            INFO = i[7].split(';')
            a = i[9].split(':')
            COVERAGE = a[3]
            SNP_FREQ = a[6]
            ANNOTATION = INFO[5].split('|')
            IMPACT = ANNOTATION[1]
            GENE = ANNOTATION[3]
            MUTATION = ANNOTATION[10]

            VCF_dico[POS] = [REF,ALT,COVERAGE,SNP_FREQ,IMPACT,GENE,MUTATION]
 ```

 %% Cell type:code id: tags:

 ``` python
 with open('DRM_in_ERR760779.txt','w') as output_file:
    output_file.write('position\tref\talt\tcoverage\tSNP_frequency\ttype\tgene\tprotein_change\n')
    for items in VCF_dico.items():
        genomic_position = items[0]
        ann = items[1]
        if genomic_position in list_of_genomics_positions:
            output_file.write(genomic_position+'\t'+ann[0]+'\t'+ann[1]+'\t'+ann[2]+'\t'+ann[3]+'\t'+ann[4]+'\t'+ann[5]+'\t'+ann[6]+'\n')

 ```

 %% Cell type:code id: tags:

 ``` python
 ! cat DRM_in_ERR760779.txt
 ```

 %% Cell type:markdown id: tags:

 ### Retrieving all drug resistance mutations in the dataset.

 For the time of this workshop we will not have time to perform the different bioinformatics steps for all the genomes of the Elholm dataset.

 We've prepared a merged VCF using GATK.

 %% Cell type:code id: tags:

 ``` python
 list_of_sample_names = []
 VCF_merged_dico = {}

 with open('Workshop_SA/data_Eldholm/Eldholm2015_vcf_merged','r') as vcf_merged:
    for lines in vcf_merged:
        if lines[0:2] != '##':
            lines = lines.strip().split()
            if lines[0] == '#CHROM':
                list_of_sample_names += lines[9:] # to exclude the 9 first columns which are #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
            else:
                genotypes = []
                POS = str(lines[1]) # position is the second column of the VCF
                REF = str(lines[3]) # reference is the fourth column of the VCF
                ALT = str(lines[4]) # ALT is the fifth column of the VCF
                for i in lines[9:]:
                    genotypes += [i[0:3]]



                VCF_merged_dico[POS] = [REF,ALT,genotypes]
 ```

 %% Cell type:code id: tags:

 ``` python
 with open('all_DRM_in_Eldholm.txt','w') as DRM_Eldholm_output_file:
    DRM_Eldholm_output_file.write('position\tref\talt\t'+'\t'.join(list_of_sample_names)+'\n')
    for items in VCF_merged_dico.items():
        genomic_position = items[0]
        ann = items[1]

        if genomic_position in list_of_genomics_positions:
            for m in list_of_mutations:
                if genomic_position in m and ann[0]==m[1] and ann[1]==m[2]:
                    print(m)
                    DRM_Eldholm_output_file.write(genomic_position+'\t'+ann[0]+'\t'+ann[1]+'\t'+m[3]+'\t'+m[4]+'\t'+'\t'.join(ann[2])+'\n')



 ```

 %% Cell type:code id: tags:

 ``` python

 with open('all_DRM_in_Eldholm.txt','r') as ifile:
    for i in ifile:
        i = i.strip().split()
        if i[0] != 'position': # ignore the first line of the file as it is the header

            print(i[4], i.count('1/1'),i.count('0/1')) # print the mutation, the number of genomes with this mutation in a fixed state, the number of genomes with this mutation  in a variable state



 ```