Skip to content
Snippets Groups Projects
Commit cc3b9304 authored by Chloe Marie Loiseau's avatar Chloe Marie Loiseau
Browse files

added gatk

parent c640ad02
No related branches found
No related tags found
No related merge requests found
%% Cell type:markdown id: tags:
# 14-15.05.2018 - Cape Town Genomics Workshop part II
%% Cell type:markdown id: tags:
## 3. Mapping the processed reads to a reference genome (creation of a SAM/BAM file) and BAM post-processing
<img src="images/read_mapping.png" width="350">
### 3.1 Mapping the processed reads to a reference genome with BWA
<img src="images/bw.png" width="350">
To map the trimmed sequencing reads (\*.trimmed.fastq.gz) to the reference genome, we use the tool BWA. (BWA website: http://bio-bwa.sourceforge.net)
BWA is a short read aligner, that can take a reference genome and map single- or paired-end data to it. BWA will produce a mapping file in a SAM-format.
SAM stands for Sequence Alignment/Map format, and BAM is the binary version of a SAM file. These formats have become industry standards for reporting alignment/mapping information, so it's good to understand them to some degree.
%% Cell type:code id: tags:
``` python
! bwa
```
%% Cell type:markdown id: tags:
Please run in the command line:
- sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_BWA.slurm
This might take some time, so we run it now and go over the commands after.
%% Cell type:code id: tags:
``` python
! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_BWA.slurm
```
%% Output
#!/bin/bash
#SBATCH --job-name=BWA
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --output=BWA.o
#SBATCH --error=BWA.e
### 3.1.1 Making the index
singularity exec /home/container.img bwa index -a bwtsw ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta
### 3.1.2 Running BWA
singularity exec /home/container.img bwa mem -M ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta ERR760779_1P.trimmed.fastq.gz ERR760779_2P.trimmed.fastq.gz > ERR760779_paired.sam
singularity exec /home/container.img bwa mem -M ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta ERR760779_1U.trimmed.fastq.gz > ERR760779_1U.sam
singularity exec container.img bwa mem -M ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_refe/home/rence.fasta ERR760779_2U.trimmed.fastq.gz > ERR760779_2U.sam
### convert the SAM to BAM format
singularity exec /home/container.img samtools view -bhu ERR760779_paired.sam > ERR760779_paired.bam
singularity exec /home/container.img samtools view -bhu ERR760779_1U.sam > ERR760779_1U.bam
singularity exec /home/container.img samtools view -bhu ERR760779_2U.sam > ERR760779_2U.bam
### sort the BAM file by position
singularity exec /home/container.img samtools sort -m 4G ERR760779_paired.bam > ERR760779.sorted.bam
singularity exec /home/container.img samtools sort -m 4G ERR760779_1U.bam > ERR760779_1U.sorted.bam
singularity exec container.img samtools sort -m 4G ERR760779_2U.bam > ERR760779_2U.sorted.bam
### To go faster one could bring the three last commands into one using pipe:
### bwa mem -t 2 -M ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta ERR760779_1P.trimmed.fastq.gz ERR760779_2P.trimmed.fastq.gz | samtools view -Sbhu -|samtools sort -m 4G - ERR760779_paired.sorted
### 3.1.3 Merging the BAMs
singularity exec /home/container.img samtools merge ERR760779_merged.bam ERR760779.sorted.bam ERR760779_1U.sorted.bam ERR760779_2U.sorted.bam
### 3.1.4 Retrieving the unmapped reads
singularity exec /home/container.img samtools view -b -f 4 ERR760779_merged.bam > ERR760779.unmapped.bam
%% Cell type:markdown id: tags:
### 3.1.1 Making an index
- bwa index -a bwtsw ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta
The first step of using BWA is to make an index of the reference genome in fasta format.
This creates a collection of files used by BWA to perform the alignment.
The basic usage of the bwa index is:
bwa index [-a bwtsw|is] input_reference.fasta index_prefix
%% Cell type:markdown id: tags:
### 3.1.2 Running BWA
The **bwa mem** algorithm is one of the three algorithms provided by BWA. It performs **local alignment** and produces alignments for different part of the query sequence.
Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]
where index_prefix is the index for the reference genome generated from bwa index, and in1.fq, in2.fq are the input files of sequencing data that can be **single-end or paired-end**.
Additional options for bwa mem can be found in the BWA manual.
%% Cell type:code id: tags:
``` python
! bwa mem
```
%% Output
Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]
Algorithm options:
-t INT number of threads [1]
-k INT minimum seed length [19]
-w INT band width for banded alignment [100]
-d INT off-diagonal X-dropoff [100]
-r FLOAT look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
-y INT seed occurrence for the 3rd round seeding [20]
-c INT skip seeds with more than INT occurrences [500]
-D FLOAT drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
-W INT discard a chain if seeded bases shorter than INT [0]
-m INT perform at most INT rounds of mate rescues for each read [50]
-S skip mate rescue
-P skip pairing; mate rescue performed unless -S also in use
Scoring options:
-A INT score for a sequence match, which scales options -TdBOELU unless overridden [1]
-B INT penalty for a mismatch [4]
-O INT[,INT] gap open penalties for deletions and insertions [6,6]
-E INT[,INT] gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
-L INT[,INT] penalty for 5'- and 3'-end clipping [5,5]
-U INT penalty for an unpaired read pair [17]
-x STR read type. Setting -x changes multiple parameters unless overridden [null]
pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)
ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)
intractg: -B9 -O16 -L5 (intra-species contigs to ref)
Input/output options:
-p smart pairing (ignoring in2.fq)
-R STR read group header line such as '@RG\tID:foo\tSM:bar' [null]
-H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]
-o FILE sam file to output results to [stdout]
-j treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
-5 for split alignment, take the alignment with the smallest coordinate as primary
-q don't modify mapQ of supplementary alignments
-K INT process INT input bases in each batch regardless of nThreads (for reproducibility) []
-v INT verbosity level: 1=error, 2=warning, 3=message, 4+=debugging [3]
-T INT minimum score to output [30]
-h INT[,INT] if there are <INT hits with score >80% of the max score, output all in XA [5,200]
-a output all alignments for SE or unpaired PE
-C append FASTA/FASTQ comment to SAM output
-V output the reference FASTA header in the XR tag
-Y use soft clipping for supplementary alignments
-M mark shorter split hits as secondary
-I FLOAT[,FLOAT[,INT[,INT]]]
specify the mean, standard deviation (10% of the mean if absent), max
(4 sigma from the mean if absent) and min of the insert size distribution.
FR orientation only. [inferred]
Note: Please read the man page for detailed description of the command line and options.
%% Cell type:markdown id: tags:
#### Mapping the forward and reverse paired reads:
Reads can be aligned to the formatted reference following the command:
- bwa mem -M ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta ERR760779_1P.trimmed.fastq.gz ERR760779_2P.trimmed.fastq.gz > ERR760779_paired.sam
### Mapping the forward unpaired reads (treated as single-end reads)
- bwa mem -M ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta ERR760779_1U.trimmed.fastq.gz > ERR760779_1U.sam
### Mapping the reverse unpaired reads (treated as single-end reads)
- bwa mem -M ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta ERR760779_2U.trimmed.fastq.gz > ERR760779_2U.sam
%% Cell type:markdown id: tags:
The command calls the program (bwa) and the mode of alignment (mem). The MEM algorithm is recommended when dealing with long reads like those generated by HiSeq or MiSeq sequencers.
%% Cell type:markdown id: tags:
The resulting file is a SAM file.
The SAM file is composed of a few header lines
Then, for each read that mapped to the reference genome, there is one line which defines the read and the position in the reference genome where the read mapped and a quality of the map.
Contrary to other formats, as we will see later on (BAM), SAM is a text tab-delimited file, so we can screen it:
%% Cell type:code id: tags:
``` python
! head ERR760779_paired.sam
! head $HOME/ERR760779_paired.sam
```
%% Cell type:markdown id: tags:
Let's have a look at the size of these three SAM files:
%% Cell type:code id: tags:
``` python
! du -h ERR760779_paired.sam
! du -h $HOME/ERR760779_paired.sam
```
%% Cell type:code id: tags:
``` python
! du -h ERR760779_1U.sam
! du -h $HOME/ERR760779_1U.sam
```
%% Cell type:code id: tags:
``` python
! du -h ERR760779_2U.sam
! du -h $HOME/ERR760779_2U.sam
```
%% Cell type:markdown id: tags:
For the most part, you don't really need to understand the SAM format since most analysis software will handle it for you.
%% Cell type:markdown id: tags:
Although the **SAM** file contains all the information for downstream analyses, the size of the files makes very difficult to work with it.
Another format called **BAM** can be used instead. This format can be interpreted as a **binary file** for NGS alignment data.
To generate a proper BAM file, we will need to convert the SAM to BAM format and then **sort the BAM file by position** so it can be used by SNP calling algorithms.
There is a suite of tools called **SAMtools** that can perform all these operation on the SAM file
%% Cell type:code id: tags:
``` python
! samtools view
```
%% Cell type:markdown id: tags:
The command we ran in the slurm script to convert the SAM file into a BAM file was:
- samtools view -bhu ERR760779_paired.sam > ERR760779_paired.bam
- samtools view -bhu ERR760779_1U.sam > ERR760779_1U.bam
- samtools view -bhu ERR760779_2U.sam > ERR760779_2U.bam
%% Cell type:markdown id: tags:
Look at the size of the BAM file:
%% Cell type:code id: tags:
``` python
! du -h ERR760779_paired.bam
! du -h $HOME/ERR760779_paired.bam
```
%% Cell type:markdown id: tags:
The command we ran in the slurm script to sort the BAM file by position:
- samtools sort ERR760779_paired.bam > ERR760779.sorted.bam
- samtools sort ERR760779_1U.bam > ERR760779_1U.sorted.bam
- samtools sort ERR760779_2U.bam > ERR760779_2U.sorted.bam
%% Cell type:markdown id: tags:
### 3.1.3 Merging the BAM's
We've just run *bwa mem* 3 times:
- once on the paired-end reads
- once on the forward unpaired reads
- once on the reverse unpaired reads
We must now merge the three BAM files into one using **samtools merge**:
- samtools merge ERR760779_merged.bam ERR760779.sorted.bam ERR760779_1U.sorted.bam ERR760779_2U.sorted.bam
%% Cell type:markdown id: tags:
### 3.1.4 Retrieving the unmapped reads ?
Unmapped reads are often discarded from whole-genome sequencing analysis although they may contain useful information.
Indeed, the unmapped reads may revel novel sequences (large insertions, sources of contamination, symbionts (not in Mtb), etc...).
We will not analyse the unmapped reads here but
To retrieve the unmapped reads we run:
- samtools view -b -f 4 ERR760779_merged.bam > ERR760779.unmapped.bam
the -b is used to indicate we want a BAM output
the -f 4 is to tell samtools we want to retrieve only the reads with the flag 4 present (=unmapped reads).
Here are the different flags:
<img src="images/SAM_flags.png" width="500">
%% Cell type:markdown id: tags:
### 3.2 Improving the quality of the alignment
### 3.2.1 Marking Duplicates with Picard
<img src="images/MD.png" width="350">
Please type in the command line the following command:
- sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_MarkDuplicates.slurm
Duplicates can arise during sample preparation e.g. library construction using PCR.
MarkDuplicates identifies **read pairs with the same orientation** that have the exact **same 5′ start position** in the mapping. It takes into account clipping on the 5′ end of the read and makes calculations based on where the 5′ start position would be if the entire read had mapped to the reference.
The tool's main output is a new SAM or BAM file, in which duplicates have been identified in the SAM flags field for
each read.
MarkDuplicates also produces a metrics file indicating the numbers of duplicates for both single- and paired-end reads.
Documentation: http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
%% Cell type:code id: tags:
``` python
! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_MarkDuplicates.slurm
```
%% Output
#!/bin/bash
#SBATCH --job-name=MarkDuplicates
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=6:00:00
#SBATCH --output=MarkDuplicates.o
#SBATCH --error=MarkDuplicates.e
#SBATCH --qos=6hours
singularity exec container.img picard MarkDuplicates INPUT=ERR760779.bam OUTPUT=ERR760779.dedup.bam METRICS_FILE=ERR760779.markduplicates.metrics ASSUME_SORTED=true
%% Cell type:code id: tags:
``` python
! cat ERR760779.markduplicates.metrics
! cat $HOME/ERR760779.markduplicates.metrics
```
%% Cell type:markdown id: tags:
The metrics file contains the number of duplicates flagged in your genome:
%% Cell type:markdown id: tags:
### 3.2.2 BAM index
Some tools require a BAM Index file to more efficiently access reads in a BAM file.
this is done using the command samtools index:
- samtools index ERR760779.dedup.bam
%% Cell type:markdown id: tags:
The command samtools index creates a an 'index' file which is a companion to the BAM file.
This index file has the same name, suffixed with **.bai. **
This file acts like an external table of contents, and allows other programs to jump directly to specific parts of the BAM file without reading through all of the sequences. Without the corresponding BAM file, your BAI file is useless, since it doesn't actually contain any sequence data.
%% Cell type:markdown id: tags:
Please run in the command line:
- sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_index.slurm
%% Cell type:markdown id: tags:
### 3.2.3 InDel Realignment with GATK
it is common to realign reads around small indels. Since, differences in resolving the indels may
cause artificial SNPs in the downstream analysis.
The
GATK software for instance offers the possibility to realign
Please run in the command line:
- sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_GATK.slurm
Then you have to index the newly produced BAM:
- sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_index2.slurm
%% Cell type:code id: tags:
``` python
! cat /scicore/home/gagneux/loiseau/Workshop_SA/notebooks/slurm_scripts/launch_GATK.slurm
```
%% Output
#!/bin/bash
#SBATCH --job-name=GATK
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=6:00:00
#SBATCH --output=GATK.o
#SBATCH --error=GATK.e
singularity exec container.img gatk-launch -T RealignerTargetCreator -nt 1 -R ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta -o ERR760779.intervals -I ERR760779.dedup.bam
singularity exec container.img gatk-launch --disable_bam_indexing -T IndelRealigner R ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta -targetIntervals ERR760779.intervals -I ERR760779.dedup.bam -o ERR760779.dedup.realigned.bam
%% Cell type:markdown id: tags:
### 3.3 Metrics
%% Cell type:markdown id: tags:
We can run the programm qualimap with the 'algorithm' bamqc to get a quality report of the BAM we produced.
Specifically we get descriptive statistics on the coverage, the mapping percentage etc.
please type the following command into your terminal:
- sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_bamqc.slurm
%% Cell type:code id: tags:
``` python
! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_bamqc.slurm
```
%% Output
#!/bin/bash
#SBATCH --job-name=bamQC
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=0:30:00
#SBATCH --output=bamQC.o
#SBATCH --error=bamQC.e
#SBATCH --qos=30min
singularity exec container.img qualimap bamqc -bam ERR760779.dedup.bam -sd -sdmode 1 -outdir . -outfile ERR760779_bamqc
%% Cell type:markdown id: tags:
The program qualimap produces a directory. Enter into the directory and list the files:
%% Cell type:code id: tags:
``` python
%cd ERR760779.dedup_stats
```
%% Cell type:code id: tags:
``` python
ls
```
%% Cell type:markdown id: tags:
There are two files:
* one pdf file: you can open it from your terminal using firefox
* one text file, which we read here:
%% Cell type:code id: tags:
``` python
! cat genome_results.txt
```
%% Cell type:markdown id: tags:
### Exercise:
Get the mapping percentage, percentage of duplicated reads, Mean coverage, percentage of reference genomenot covered at all, percentage of reference genome covered by less than 7 reads:
- mapping percentage: ..................
- percentage of duplicated reads: ............
- Mean coverage: ............
- percentage of reference genomenot covered by less than 1 read: .........
- percentage of reference genome covered by less than 7 reads : .......
%% Cell type:markdown id: tags:
Go back into your home directory:
%% Cell type:code id: tags:
``` python
%cd
```
......
%% Cell type:markdown id: tags:
# 14-15.05.2018 - Cape Town Genomics Workshop part III
%% Cell type:markdown id: tags:
## 4. Variant Calling
In order to call SNPs, we need first to generate a pileup file. A pileup file summarizes, for each position in the reference genome, the number of reads covering it as well as the mapping quality of those reads. So, the most important difference with respect to the previous formats is that those were **read-centered outputs** and now we move to **reference-centered outputs**. We will use SAMtools again to produce the pileup:
### 4.1 Mpileup
The first step is to create a 'pileup file'. For this we use the sorted filtered bam-file that we produced in the mapping step before and the *M. tuberculosis* reference genome.
A pileup is a column wise representation of the aligned read - at the base level - to the reference. The pileup file summarises all data from the reads at each genomic region that is covered by at least one read.
please type the following command into your terminal:
- sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_mpileup.slurm
%% Cell type:code id: tags:
``` python
! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_mpileup.slurm
```
%% Cell type:code id: tags:
``` python
! head -n 20 ERR760779.pileup
```
%% Cell type:markdown id: tags:
Each line consists of the chromosome name, the genomic position, the reference base, the number of reads covering the site, read bases and base qualities.
At the read base column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, `ACGTN' for a mismatch on the forward strand and `acgtn' for a mismatch on the reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence.
source: http://samtools.sourceforge.net/pileup.shtml
SNP and indel calling algorithms can scan this file to look at differences between the reference allele and the most common allele found in the reads. For the SNP calling, we will use VarScan which output is easy to interpret
### 4.2 SNP and InDel calling
please type the following command into your terminal:
- sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_varscan.slurm
%% Cell type:code id: tags:
``` python
! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_varscan.slurm
```
%% Cell type:markdown id: tags:
VarScan makes consensus calls (SNP/Indel/Reference) from the 'pileup' file based created in the previous step.
We can indicate to Varscan parameters make the consensus calls.
These parameters are:
- Minimum read depth at a position to make a call (--min-coverage) : 7
- Minimum base quality at a position to count a read (--min-avg-qual): 20
- Minimum variant allele frequency threshold (--min-var-freq) : 0.1
- Minimum frequency to call homozygote (--min-freq-for-hom) : 0.9
What is a VCF file:
Variant Call Format (VCF) is a text file format for storing marker genotype data.
Every VCF file has three parts in the following order:
- Meta-information lines (lines beginning with "##").
- One header line (line beginning with "#CHROM").
- Data lines contain marker and genotype data (one variant per line). A data line is called a VCF record.
%% Cell type:code id: tags:
``` python
! cat varscan.e
```
%% Output
cat: varscan.e: No such file or directory
%% Cell type:markdown id: tags:
#### Exercice
- How many consensus calls are in your VCF? How many would you expect ? Is it less, equal or more than what you expect ?
- how many SNPs were called ?
- how many InDels were called ?
%% Cell type:markdown id: tags:
Let's look into the VCF we've just created:
%% Cell type:markdown id: tags:
#### a) Have a look at the first 28 lines of the VCF which contains the Meta-information lines (beginning with ##)
Each meta-information line must have the form ##KEY=VALUE.
The first meta-information line must specify the VCF version number (version 4.1 in the example).
Additional meta-information lines are optional, but are often included to describe terms used in the FILTER, INFO, and FORMAT fields.
%% Cell type:code id: tags:
``` python
! head -n 28 ERR760779.snps.vcf
```
%% Cell type:markdown id: tags:
#### b) Have a look at the one header line (line beginning with "#CHROM").
The first nine columns of the header line and data lines describe the variants:
*CHROM* the chromosome name. (Here: MTB_anc)
*POS* the genome coordinate. Within a chromosome, VCF records are sorted in order of increasing position.
*ID* a semicolon-separated list of marker identifiers.
*REF* the reference allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC").
*ALT* the alternate allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC"). If there is more than one alternate alleles, the field should be a comma-separated list of alternate alleles.
*QUAL* probability that the ALT allele is incorrectly specified, expressed on the the phred scale (-10log10(probability)).
*FILTER* Either "PASS" or a semicolon-separated list of failed quality control filters.
*INFO* additional information (no white space, tabs, or semi-colons permitted).
*FORMAT* colon-separated list of data subfields reported for each sample.
%% Cell type:code id: tags:
``` python
! head -n 29 ERR760779.snps.vcf | tail -n 1
```
%% Cell type:code id: tags:
``` python
! grep -v 'PASS' ERR760779.snps.vcf
```
%% Cell type:markdown id: tags:
We now have the genomic positions where the variants are but we don't know in which gene they fall, or what the effect of the variant is.
%% Cell type:markdown id: tags:
### 4.3 Annotation of variants
The next steps would be to annotate the variants we have discovered (attribute functional characteristics to the variants, e.g if they fall on coding or intergenetic regions, in which genes, are they synonymous or nonsynonymous, etc).
For this we use the bioinformatics programm *snpEff*. SnpEff is a **variant annotation and effect prediction tool**. It annotates and predicts the effects of variants on genes (such as amino acid changes).
For more information on snpEff you can read the documentation: http://snpeff.sourceforge.net/SnpEff_manual.html
Please type the following command into your terminal:
- sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_snpeff.slurm
**Note**: we will only annotate the VCF with SNPs in this course.
%% Cell type:code id: tags:
``` python
! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_snpeff.slurm
```
%% Cell type:markdown id: tags:
We use:
- *-interval* to specify additionnal custom-made annotation. We've created this file and it contains annotation on essentiality mainly (as reported in this paper: http://mbio.asm.org/content/8/1/e02133-16.abstract). You can complement this file with any extra annotation you want !
%% Cell type:code id: tags:
``` python
! cat ~/Workshop_SA/notebooks/annotation/additionnal_annotations.bed
```
%% Cell type:markdown id: tags:
## 5. Data cleaning
We have created a lot of intermediate files which we no longer need.
We should remove them as they take up unnecessary disk space.
%% Cell type:code id: tags:
``` python
! rm -r ERR760779_1_fastqc.zip
! rm -r ERR760779_1P.trimmed_fastqc.zip
```
%% Cell type:code id: tags:
``` python
! rm ERR760779_1P.trimmed.fastq.gz
! rm ERR760779_1U.trimmed.fastq.gz
! rm ERR760779_2P.trimmed.fastq.gz
! rm ERR760779_2U.trimmed.fastq.gz
```
%% Cell type:code id: tags:
``` python
! rm ERR760779.snps.vcf
! rm ERR760779.pileup
! rm ERR760779.sam
```
%% Cell type:code id: tags:
``` python
! rm ERR760779_paired.bam
! rm ERR760779.sorted.bam
! rm ERR760779_1U.bam
! rm ERR760779_1U.sorted.bam
! rm ERR760779_2U.bam
! rm ERR760779_2U.sorted.bam
```
%% Cell type:markdown id: tags:
#### How to reduce the amount of files produced ?
By using *pipe* (|) one can ensure that the output of one command (stdout) feeds directly as input (stdin) to the next one.
For example:
- bwa mem -t 2 -M ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta ERR760779_1P.trimmed.fastq.gz ERR760779_2P.trimmed.fastq.gz | samtools view -Sbhu -|samtools sort -m 4G - ERR760779_paired.sorted
- samtools mpileup -ABQ0 -q 20 -f ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta ERR760779.dedup.bam | varscan mpileup2indel ERR760779.pileup --min-coverage 7 --min-avg-qual 20 --min-var-freq 0.1 --min-freq-for-hom 0.9 --strand-filter 1 --output-vcf 1 1> ERR760779.indels.vcf
%% Cell type:markdown id: tags:
## 6. Parsing the annotated VCF for SNPs involved in drug-resistance
We've added a file with a list of high-confidence drug resistance-mutations in the MTBC:
%% Cell type:code id: tags:
``` python
! cat Drug_resistance_mutations_MTBC.txt
```
%% Cell type:markdown id: tags:
Let us grep for non-synonymous mutations in Rv0667 (rpoB):
%% Cell type:code id: tags:
``` python
! grep 'Rv0667' ERR760779.snp.ann.vcf | grep 'missense_variant'
```
%% Cell type:markdown id: tags:
Our genome had the **S450L** mutation which confers resistance to rifampicin.
This position is covered by 164 reads (ADP=164) and 163 of these reads have a T instead of a C at this position.
the SNP frequency is thus: **99.39%**. We can be confident this SNP is a true positive.
%% Cell type:markdown id: tags:
### A more powerful way to look for mutations from WGS data
#### How to efficiently recovered all drug resistance mutations in our genome ?
We will use python for that:
%% Cell type:code id: tags:
``` python
list_of_genomics_positions = []
list_of_mutations = []
with open('Workshop_SA/notebooks/Drug_resistance_mutations_MTBC.txt','r') as input_file: # open the file with the list of known DRMs
for row in input_file: # Loop into the list of known DRMs
row = row.strip() # remove the \n at the end of the rows
row = row.split() # split each row by tab (\t). Each row is now as a python list
list_of_genomics_positions += [row[1]]
ref_base = row[2]
alt_base = row[3]
locus = row[4]
substitution = row[5]
drug = row[6]
source = row[7]
list_of_mutations += [[row[1],ref_base,alt_base,locus,substitution,drug,source]]
```
%% Cell type:code id: tags:
``` python
VCF_dico = {}
with open('ERR760779.snp.ann.vcf','r') as VCF: # open the VCF
for i in VCF: # Loop into the lines of the VCF
if i[0] != '#': # if the first character of the line in the VCF is # ( we want to ignore the header of the VCF)
i = i.strip()
i = i.split()
POS = str(i[1]) # position is the second column of the VCF
REF = str(i[3]) # reference is the fourth column of the VCF
ALT = str(i[4]) # ALT is the fifth column of the VCF
INFO = i[7].split(';')
a = i[9].split(':')
COVERAGE = a[3]
SNP_FREQ = a[6]
ANNOTATION = INFO[5].split('|')
IMPACT = ANNOTATION[1]
GENE = ANNOTATION[3]
MUTATION = ANNOTATION[10]
VCF_dico[POS] = [REF,ALT,COVERAGE,SNP_FREQ,IMPACT,GENE,MUTATION]
```
%% Cell type:code id: tags:
``` python
with open('DRM_in_ERR760779.txt','w') as output_file:
output_file.write('position\tref\talt\tcoverage\tSNP_frequency\ttype\tgene\tprotein_change\n')
for items in VCF_dico.items():
genomic_position = items[0]
ann = items[1]
if genomic_position in list_of_genomics_positions:
output_file.write(genomic_position+'\t'+ann[0]+'\t'+ann[1]+'\t'+ann[2]+'\t'+ann[3]+'\t'+ann[4]+'\t'+ann[5]+'\t'+ann[6]+'\n')
```
%% Cell type:code id: tags:
``` python
! cat DRM_in_ERR760779.txt
```
%% Cell type:markdown id: tags:
### Retrieving all drug resistance mutations in the dataset.
For the time of this workshop we will not have time to perform the different bioinformatics steps for all the genomes of the Elholm dataset.
We've prepared a merged VCF using GATK.
%% Cell type:code id: tags:
``` python
%cd loiseau/
```
%% Output
/scicore/home/gagneux/loiseau
%% Cell type:code id: tags:
``` python
list_of_sample_names = []
VCF_merged_dico = {}
with open('Workshop_SA/data_Eldholm/Eldholm2015_vcf_merged','r') as vcf_merged:
for lines in vcf_merged:
if lines[0:2] != '##':
lines = lines.strip().split()
if lines[0] == '#CHROM':
list_of_sample_names += lines[9:] # to exclude the 9 first columns which are #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
else:
genotypes = []
POS = str(lines[1]) # position is the second column of the VCF
REF = str(lines[3]) # reference is the fourth column of the VCF
ALT = str(lines[4]) # ALT is the fifth column of the VCF
for i in lines[9:]:
genotypes += [i[0:3]]
VCF_merged_dico[POS] = [REF,ALT,genotypes]
```
%% Cell type:code id: tags:
``` python
with open('all_DRM_in_Eldholm.txt','w') as DRM_Eldholm_output_file:
DRM_Eldholm_output_file.write('position\tref\talt\t'+'\t'.join(list_of_sample_names)+'\n')
DRM_Eldholm_output_file.write('position\tref\talt\tlocus\tmutation\t'+'\t'.join(list_of_sample_names)+'\n')
for items in VCF_merged_dico.items():
genomic_position = items[0]
ann = items[1]
if genomic_position in list_of_genomics_positions:
for m in list_of_mutations:
if genomic_position in m and ann[0]==m[1] and ann[1]==m[2]:
print(m)
DRM_Eldholm_output_file.write(genomic_position+'\t'+ann[0]+'\t'+ann[1]+'\t'+m[3]+'\t'+m[4]+'\t'+'\t'.join(ann[2])+'\n')
```
%% Output
['761110', 'A', 'T', 'Rv0667', 'D435V', 'RIF', 'Walker_resistant-resistant']
['4247431', 'G', 'C', 'Rv3795', 'M306I', 'EMB', 'Walker_resistant-resistant']
['4249583', 'G', 'A', 'Rv3795', 'D1024N', 'EMB', 'Boettger_DST']
['761100', 'C', 'A', 'Rv0667', 'Q432K', 'RIF', 'Walker_resistant-resistant']
['2155168', 'C', 'G', 'Rv1908c', 'S315T', 'INH', 'Walker_resistant-resistant']
['761155', 'C', 'T', 'Rv0667', 'S450L', 'RIF', 'Walker_resistant-resistant']
['761140', 'A', 'G', 'Rv0667', 'H445R', 'RIF', 'Walker_resistant-resistant']
['7581', 'G', 'A', 'Rv0006', 'D94N', 'FQ', 'Walker_resistant-resistant']
['7582', 'A', 'G', 'Rv0006', 'D94G', 'FQ', 'Walker_resistant-resistant']
['4247429', 'A', 'G', 'Rv3795', 'M306V', 'EMB', 'Walker_resistant-resistant']
['7570', 'C', 'T', 'Rv0006', 'A90V', 'FQ', 'Walker_resistant-resistant']
['1473246', 'A', 'G', 'rrs', 'A1401G', 'AK;CAP;KAN', 'Walker_resistant-resistant']
['1673432', 'T', 'C', 'Rv1483', 'T-8C', 'INH;ETH', 'Walker_resistant-resistant;Boettger_DST']
['761139', 'C', 'T', 'Rv0667', 'H445Y', 'RIF', 'Walker_resistant-resistant']
['1673425', 'C', 'T', 'Rv1483', 'C-15T', 'INH;ETH', 'Walker_resistant-resistant;Boettger_DST']
['4247730', 'G', 'C', 'Rv3795', 'G406A', 'EMB', 'Walker_resistant-resistant']
%% Cell type:code id: tags:
``` python
with open('all_DRM_in_Eldholm.txt','r') as ifile:
for i in ifile:
i = i.strip().split()
if i[0] != 'position': # ignore the first line of the file as it is the header
print(i[4], i.count('1/1'),i.count('0/1')) # print the mutation, the number of genomes with this mutation in a fixed state, the number of genomes with this mutation in a variable state
```
%% Output
('D435V', 4, 0)
('M306I', 1, 0)
('D1024N', 1, 0)
('Q432K', 1, 0)
('S315T', 248, 0)
('S450L', 234, 1)
('H445R', 1, 0)
('D94N', 0, 1)
('D94G', 3, 0)
('M306V', 1, 0)
('A90V', 3, 0)
('A1401G', 230, 0)
('T-8C', 1, 0)
('H445Y', 3, 0)
('C-15T', 1, 0)
('G406A', 234, 0)
%% Cell type:code id: tags:
``` python
```
......
......@@ -4,4 +4,4 @@
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
singularity exec /home/container.img qualimap bamqc -bam ERR760779.dedup.bam -sd -sdmode 1 -outdir . -outfile ERR760779_bamqc
singularity exec /home/container.img qualimap bamqc -bam ERR760779.dedup.realigned.bam -sd -sdmode 1 -outdir . -outfile ERR760779_bamqc
......@@ -4,5 +4,5 @@
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
singularity exec /home/container.img samtools mpileup -ABQ0 -q 20 -f ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta ERR760779.dedup.bam > ERR760779.pileup
singularity exec /home/container.img samtools mpileup -ABQ0 -q 20 -f ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta ERR760779.dedup.realigned.bam > ERR760779.pileup
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment