Skip to content
Snippets Groups Projects
Commit efdf6932 authored by Chloe Marie Loiseau's avatar Chloe Marie Loiseau
Browse files

changes

parent 7f7a0877
No related branches found
No related tags found
No related merge requests found
%% Cell type:markdown id: tags:
# 14-15.05.2018 - Cape Town Genomics Workshop part I
%% Cell type:markdown id: tags:
## 0. Getting started
### How to start the jupyter notebook
1. Access the cloud: ssh student01@86.119.40.206
1. Access the cloud: ssh studentXX@86.119.40.206
2. Your password is: stphcourse2018
3. cp -r ../Workshop_SA.git
4. singularity exec ../container.img jupyter notebook --no-browser --ip='*' --port=YourPortNumber eg.30000
5. Type in the browser: http://86.119.40.206:YourPortNumber/?token=c0669c145a630ea14b6ec3b29b870811844fefe12c375feb
3. copy this folder to your home directory: cp -r /home/Workshop_SA/ .
4. In hour home, type: singularity exec /home/container.img jupyter notebook --no-browser --ip='*' --port=YourPortNumber eg.30000
If you wish to access the git from your webbrowser, the URL is: https://git.scicore.unibas.ch/TBRU/Workshop_SA
%% Cell type:markdown id: tags:
### Why use the jupyter notebook ?
- **Reproducibility**: it allows you to record your computational analysis, your figures, and comments together.
- It is your bioinformatics 'lab book'.
### Useful tips to use in the jupyter notebook
- Run the command in the 'code cell': Shift + Return
- Run the command in the 'code cell': Shift + Enter
- You can change the cell type from Code to Markdown to include explanatory text in your notebook
- Use the "tab" key to autocomplement commands
- https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
%% Cell type:markdown id: tags:
### Magics
Taken from: https://blog.dominodatalab.com/lesser-known-ways-of-using-notebooks/
You can start notebooks with different kernels (e.g., R, Julia) — not just Python. What you might not know is that even within a notebook, you can run different types of code in different cells. With "magics", it is possible to use different languages
You can start notebooks with different kernels (e.g., R, Shell) — not just Python. What you might not know is that even within a notebook, you can run different types of code in different cells. With "magics", it is possible to use different languages
By running % lsmagic in a cell you get a list of all the available magics. You can use % to start a single-line expression to run with the magics command. Or you can use a double %% to run a multi-line expression.
Some of my favorites are:
!: to run a shell command.
% bash to run cell with bash in a subprocess.
! to run a shell command.
% bash to run cell with bash in a subprocess.
### Using shell commands
Any command that works at the command-line can be used in IPython by prefixing it with the ! character. For example, the ls, pwd, and echo commands can be run as follows:
%% Cell type:code id: tags:
``` python
! pwd
! echo 'The files in my working directory are:'
! ls
```
%% Output
/scicore/home/gagneux/loiseau/Workshop_SA/notebooks
The files in my working directory are:
adapters Drug_resistance_mutations_MTBC.txt
annotation images
CapeTown_Genomics_Tutorial_partIII.ipynb Locus_to_exclude_Mtb.txt
CapeTown_Genomics_Tutorial_partII.ipynb reference_genome
CapeTown_Genomics_Tutorial_partI.ipynb slurm_scripts
%% Cell type:markdown id: tags:
## What you will learn in these tutorials:
- Run programs via the command line.
- Perform essential steps of a Illumina whole-genome sequencing analysis pipeline of MTBC genomes.
## Content of this tutorial:
- Finding genetic variants from raw sequencing data:
- Looking into a fastq file: reads, Phred Quality scores
- Raw read processing and quality assessment
- **Finding genetic variants from raw sequencing data**:
- Looking into a fastq file: quality assessment of the reads
- Raw read processing: trimming of illumina adapters and low quality bases
- Mapping processed reads to a reference genome (creation of a BAM file)
- BAM post-processing
- BAM quality assesment
- Variant identification (creation of a VCF file)
- Variant Annotation
<img src="images/Pipeline1.png" width="500">
<img src="images/Pipeline1.png" width="600">
- You want to find genetic variants (SNPs, insertion, deletions) in these sequences.
- To do so, you need to perform the following bioinformatics steps:
<img src="images/Pipeline2.png" width="500">
<img src="images/Pipeline2.png" width="600">
%% Cell type:markdown id: tags:
## 1. Raw data quality control
<img src="images/fastqc.png" width="500">
%% Cell type:markdown id: tags:
### 1.1 FastQ file
%% Cell type:markdown id: tags:
#### The fastq file we will be working on is found here:
Forward read: ~/Workshop_SA/data_Eldholm/ERR760779_1.fastq.gz
Reverse read: ~/Workshop_SA/data_Eldholm/ERR760779_2.fastq.gz
The fastq files are compressed (.gz) to save space. Let's have a look at the first read of the file.
The fastq files are compressed (.gz) to save space. Let's have a look at the first read of the file using zcat.
For this, read the first 4 lines of the file:
%% Cell type:code id: tags:
``` python
! zcat ~/Workshop_SA/data_Eldholm/ERR760779_1.fastq.gz | head -n 4
```
%% Cell type:markdown id: tags:
Let's do the same thing for the reverse read:
%% Cell type:code id: tags:
``` python
! zcat ~/Workshop_SA/data_Eldholm/ERR760779_2.fastq.gz | head -n 4
```
%% Cell type:markdown id: tags:
### Exercise:
Get the file size, number of reads, total number of bases, and coverage per fastq file (length of *M. tuberculosis* genome: 4'411'532 bp)
- file size : ..................
- number of reads : ............
- lenght of reads : ............
- estimated coverage: .........
- Phred score encoding : .......
An estimation of the coverage can be calculated using the number of reads (N), the length of the *M. tuberculosis* genome (G) and the read length (L)
Clue:
<img src="images/phred_score_encoding.png" width="700">
You can also use this piece of code to detect the encoding of the phred quality scores in your fastq file:
%% Cell type:code id: tags:
``` python
%%!
zcat ~/Workshop_SA/data_Eldholm/ERR760779_1.fastq.gz| head -n 10000 |\
awk '{if(NR%4==0) printf("%s",$0);}' | od -A n -t u1 | \
awk 'BEGIN{min=100;max=0;} \
{for(i=1;i<=NF;i++) \
{if($i>max) max=$i; \
if($i<min) min=$i;}}END \
{if(max<=74 && min<59) \
print "Phred+33"; \
else
if(max>73 && min>=64) \
print "Phred+64"; \
else \
if(min>=59 && min<64 && max>73) \
print "Solexa+64"; else print "Unknown score encoding!";}'
```
%% Cell type:markdown id: tags:
### 1.2 FastQC: quality control of the raw data
You can access the documentation of fastQC here: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Calling FastQC with the -h flag we get an indication of the different arguments we can use:
%% Cell type:code id: tags:
``` python
! fastqc -h
```
%% Cell type:markdown id: tags:
Go back to the terminal and run from the command line type:
- sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_fastqc.slurm
For this you will have to open a new terminal window and reconnect:
- ssh studentXX@86.119.40.206
- password: stphcourse2018
%% Cell type:code id: tags:
``` python
! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_fastqc.slurm
```
%% Cell type:markdown id: tags:
The command produces two file:
- an html which you can visualise using firefox for example
- a compressed folder (.zip). You can see the content of this folder by using the command 'unzip'.
You can visualise the html file using firefox.
To visualise the html file open a new terminal on MobaXterm and type:
- scp studentXX@86.119.40.206:/home/studentXX/ERR760779_1_fastqc.html Desktop
From the terminal, type:
firefox ERR760779_1_fastqc.html
The html file is now on your local computer, on your desktop. Double click on it.
%% Cell type:markdown id: tags:
In the fastqc report you can see that:
- the quality of the sequences decreases towards the end of the reads
- there are 'over-represented sequences'. These are the Illumina adapters.
To improve the quality of the raw sequences, we will use the software Trimmomatic.
Trimmomatic is used to remove low quality bases and remove illumina adapters.
%% Cell type:markdown id: tags:
## 2. Raw read processing
### 2.1 Removal of illumina adapters and low quality bases
<img src="images/trimmomatic.png" width="350">
Trimmomatic manual: http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
#### What do we remove adapter sequences from our reads ?
Removal of adapter sequences in a process called **read trimming**, or **clipping**, is one of the first steps in analyzing NGS data.
Adapters have to be ligated to every single DNA molecule during library preparation. For Illumina short read sequencing, the corresponding protocols involve (in most cases) a DNA fragmentation step, followed by the ligation of certain oligonucleotides to the 5’ and 3’ ends. These 5' and 3' adapter sequences have important functions in Illumina sequencing, since they hold barcoding sequences, forward/reverse primers (for paired-end sequencing) and the important binding sequences for immobilizing the fragments to the flowcell and allowing bridge-amplification.
<img src="images/Illumina_adapters.png" width="350">
Adapter contamination will lead to **NGS alignment errors** and an **increased number of unaligned reads**, since the adapter sequences are synthetic and do not occur in the genomic sequence.
Source : https://www.ecseq.com/support/ngs/trimming-adapter-sequences-is-it-necessary
%% Cell type:markdown id: tags:
From the command-line type:
- sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_trimmomatic.slurm
%% Cell type:code id: tags:
``` python
! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_trimmomatic.slurm
```
%% Cell type:markdown id: tags:
For running trimmomatic in paired-end mode:
2 input files:
- forward reads: ERR760779_1.fastq.gz
- reverse reads : ERR760779_2.fastq.gz
4 output files:
- forward paired reads: ERR760779_**1P**.trimmed.fastq.gz
- forward unpaired reads: ERR760779_ **1U**.trimmed.fastq.gz
- reverse paired reads: ERR760779_**2P**.trimmed.fastq.gz
- reverse unpaired reads: ERR760779_**2U**.trimmed.fastq.gz
When Trimmomatic performs trimming of adapters and low quality bases, the length of the reads will drop.
If the read length drops below a certain threshold (minlen) then the read will be removed from the fastq, resulting in *unpaired* reads. These *unpaired* reads will go into the output files **1U** or **2U**.
There are many possible *options* that come with Trimmomatic (cf. manual). Here are three we will use today:
- ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
- SLIDINGWINDOW: Performs a sliding window trimming approach. It starts scanning at the 5‟ end and clips the read once the average quality within the window falls below a threshold.
- MINLEN: Drop the read if it is below a specified length
%% Cell type:code id: tags:
``` python
! cat trimmomatic.e
```
%% Cell type:markdown id: tags:
### Exercise:
- How many reads were dropped by Trimmomatic ?
- Why are complete reads dropped ?
- What is the percentage of reads we will find in the files ERR760779_**1P**.trimmed.fastq.gz and ERR760779_**2P**.trimmed.fastq.gz ?
- What is the percentage of reads we will find in the files ERR760779_**1U**.trimmed.fastq.gz and ERR760779_**2U**.trimmed.fastq.gz ?
- How many reads were dropped by Trimmomatic ? ................
- Why are complete reads dropped ? ................
- What is the percentage of reads we will find in the files ERR760779_**1P**.trimmed.fastq.gz and ERR760779_**2P**.trimmed.fastq.gz ? ................
- What is the percentage of reads we will find in the files ERR760779_**1U**.trimmed.fastq.gz and ERR760779_**2U**.trimmed.fastq.gz ? ................
%% Cell type:markdown id: tags:
### 2.2 Quality Check of the processed reads
We can run FastQC on the trimmed data to see how the quality of the reads has improved:
In the command line, type;
- sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_fastqc_on_trimmed_data.slurm
%% Cell type:code id: tags:
``` python
! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_fastqc_on_trimmed_data.slurm
```
%% Output
#!/bin/bash
#SBATCH --job-name=fastqc_trim
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
singularity exec /home/container.img fastqc -t 1 -q ERR760779_1P.trimmed.fastq.gz -o .
%% Cell type:markdown id: tags:
As before, to visualise the html file produce, open it with firefox:
- firefox ERR760779_1P.html
As before, to visualise the html file produced:
- scp studentXX@86.119.40.206:/home/studentXX/ERR760779_1P.trimmed.html Desktop
%% Cell type:code id: tags:
``` python
```
......
This diff is collapsed.
#!/bin/bash
#SBATCH --job-name=GATK
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=6:00:00
#SBATCH --output=GATK.o
#SBATCH --error=GATK.e
singularity exec container.img gatk-launch -T RealignerTargetCreator -nt 1 -R ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta -o ERR760779.intervals -I ERR760779.dedup.bam
singularity exec container.img gatk-launch --disable_bam_indexing -T IndelRealigner R ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta -targetIntervals ERR760779.intervals -I ERR760779.dedup.bam -o ERR760779.dedup.realigned.bam
#!/bin/bash
#SBATCH --job-name=index
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
singularity exec /home/container.img samtools index ERR760779.dedup.realigned.bam
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment