changes

efdf6932 · Chloe Marie Loiseau · 7f7a0877 · efdf6932 · efdf6932 · efdf6932
Commit efdf6932 authored 7 years ago by Chloe Marie Loiseau
--- a/notebooks/CapeTown_Genomics_Tutorial_partI.ipynb
+++ b/notebooks/CapeTown_Genomics_Tutorial_partI.ipynb
@@ -17,11 +17,13 @@
   "source": [
    "## 0. Getting started\n",
    "### How to start the jupyter notebook\n",
-    "1. Access the cloud: ssh student01@86.119.40.206\n",
+    "1. Access the cloud: ssh studentXX@86.119.40.206\n",
    "2. Your password is: stphcourse2018\n",
-    "3. cp -r ../Workshop_SA.git\n",
-    "4. singularity exec ../container.img jupyter notebook --no-browser --ip='*' --port=YourPortNumber eg.30000\n",
-    "5. Type in the browser: http://86.119.40.206:YourPortNumber/?token=c0669c145a630ea14b6ec3b29b870811844fefe12c375feb\n"
+    "3. copy this folder to your home directory: cp -r /home/Workshop_SA/ .\n",
+    "4. In hour home, type: singularity exec /home/container.img jupyter notebook --no-browser --ip='*' --port=YourPortNumber eg.30000\n",
+    "\n",
+    "\n",
+    "If you wish to access the git from your webbrowser, the URL is: https://git.scicore.unibas.ch/TBRU/Workshop_SA"
   ]
  },
  {
@@ -39,7 +41,7 @@
    "- It is your bioinformatics 'lab book'.\n",
    "\n",
    "### Useful tips to use in the jupyter notebook\n",
-    "- Run the command in the 'code cell': Shift + Return\n",
+    "- Run the command in the 'code cell': Shift + Enter\n",
    "- You can change the cell type from Code to Markdown to include explanatory text in your notebook\n",
    "- Use the \"tab\" key to autocomplement commands\n",
    "- https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/\n",
@@ -53,19 +55,47 @@
    "### Magics\n",
    "Taken from: https://blog.dominodatalab.com/lesser-known-ways-of-using-notebooks/\n",
    "\n",
-    "You can start notebooks with different kernels (e.g., R, Julia) — not just Python. What you might not know is that even within a notebook, you can run different types of code in different cells. With \"magics\", it is possible to use different languages \n",
+    "You can start notebooks with different kernels (e.g., R, Shell) — not just Python. What you might not know is that even within a notebook, you can run different types of code in different cells. With \"magics\", it is possible to use different languages \n",
    "By running % lsmagic in a cell you get a list of all the available magics. You can use % to start a single-line expression to run with the magics command. Or you can use a double %% to run a multi-line expression.\n",
    "\n",
    "Some of my favorites are:\n",
    "\n",
-    "!: to run a shell command.\n",
-    "% bash to run cell with bash in a subprocess.\n",
+    "    ! to run a shell command.\n",
+    "\n",
+    "    % bash to run cell with bash in a subprocess.\n",
    "\n",
    "### Using shell commands\n",
    "\n",
    "Any command that works at the command-line can be used in IPython by prefixing it with the ! character. For example, the ls, pwd, and echo commands can be run as follows:\n"
   ]
  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/scicore/home/gagneux/loiseau/Workshop_SA/notebooks\n",
+      "The files in my working directory are:\n",
+      "adapters\t\t\t\t  Drug_resistance_mutations_MTBC.txt\n",
+      "annotation\t\t\t\t  images\n",
+      "CapeTown_Genomics_Tutorial_partIII.ipynb  Locus_to_exclude_Mtb.txt\n",
+      "CapeTown_Genomics_Tutorial_partII.ipynb   reference_genome\n",
+      "CapeTown_Genomics_Tutorial_partI.ipynb\t  slurm_scripts\n"
+     ]
+    }
+   ],
+   "source": [
+    "! pwd\n",
+    "! echo 'The files in my working directory are:'\n",
+    "! ls"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -75,19 +105,19 @@
    " - Perform essential steps of a Illumina whole-genome sequencing analysis pipeline of MTBC genomes.\n",
    "\n",
    "## Content of this tutorial:\n",
-    "- Finding genetic variants from raw sequencing data:\n",
-    "    - Looking into a fastq file: reads, Phred Quality scores\n",
-    "    - Raw read processing and quality assessment\n",
+    "- **Finding genetic variants from raw sequencing data**:\n",
+    "    - Looking into a fastq file: quality assessment of the reads\n",
+    "    - Raw read processing: trimming of illumina adapters and low quality bases \n",
    "    - Mapping processed reads to a reference genome (creation of a BAM file)\n",
    "    - BAM post-processing \n",
    "    - BAM quality assesment\n",
    "    - Variant identification (creation of a VCF file)\n",
    "    - Variant Annotation\n",
-    "<img src=\"images/Pipeline1.png\" width=\"500\">\n",
+    "<img src=\"images/Pipeline1.png\" width=\"600\">\n",
    "- You want to find genetic variants (SNPs, insertion, deletions) in these sequences.\n",
    "- To do so, you need to perform the following bioinformatics steps:\n",
    "\n",
-    "<img src=\"images/Pipeline2.png\" width=\"500\">\n",
+    "<img src=\"images/Pipeline2.png\" width=\"600\">\n",
    "\n"
   ]
  },
@@ -114,7 +144,7 @@
    "    Forward read: ~/Workshop_SA/data_Eldholm/ERR760779_1.fastq.gz\n",
    "    Reverse read: ~/Workshop_SA/data_Eldholm/ERR760779_2.fastq.gz\n",
    "    \n",
-    "The fastq files are compressed (.gz) to save space. Let's have a look at the first read of the file.\n",
+    "The fastq files are compressed (.gz) to save space. Let's have a look at the first read of the file using zcat.\n",
    "For this, read the first 4 lines of the file:"
   ]
  },
@@ -222,7 +252,11 @@
   "metadata": {},
   "source": [
    "Go back to the terminal and run from the command line type:\n",
-    "    - sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_fastqc.slurm"
+    "    - sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_fastqc.slurm\n",
+    "    \n",
+    "For this you will have to open a new terminal window and reconnect:\n",
+    "    - ssh studentXX@86.119.40.206\n",
+    "    - password: stphcourse2018"
   ]
  },
  {
@@ -244,11 +278,11 @@
    "    - an html which you can visualise using firefox for example\n",
    "    - a compressed folder (.zip). You can see the content of this folder by using the command 'unzip'. \n",
    "    \n",
-    "You can visualise the html file using firefox. \n",
+    "To visualise the html file open a new terminal on MobaXterm and type:\n",
+    "    - scp studentXX@86.119.40.206:/home/studentXX/ERR760779_1_fastqc.html Desktop\n",
    "\n",
-    "From the terminal, type:\n",
-    "    \n",
-    "    firefox ERR760779_1_fastqc.html"
+    "\n",
+    "The html file is now on your local computer, on your desktop. Double click on it."
   ]
  },
  {
@@ -342,10 +376,10 @@
   "metadata": {},
   "source": [
    "### Exercise: \n",
-    "- How many reads were dropped by Trimmomatic ? \n",
-    "- Why are complete reads dropped ? \n",
-    "- What is the percentage of reads we will find in the files ERR760779_**1P**.trimmed.fastq.gz and ERR760779_**2P**.trimmed.fastq.gz ?\n",
-    "- What is the percentage of reads we will find in the files ERR760779_**1U**.trimmed.fastq.gz and ERR760779_**2U**.trimmed.fastq.gz ?"
+    "- How many reads were dropped by Trimmomatic ? ................\n",
+    "- Why are complete reads dropped ? ................\n",
+    "- What is the percentage of reads we will find in the files ERR760779_**1P**.trimmed.fastq.gz and ERR760779_**2P**.trimmed.fastq.gz ? ................\n",
+    "- What is the percentage of reads we will find in the files ERR760779_**1U**.trimmed.fastq.gz and ERR760779_**2U**.trimmed.fastq.gz ? ................"
   ]
  },
  {
@@ -389,9 +423,18 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "As before, to visualise the html file produce, open it with firefox:\n",
-    "    - firefox ERR760779_1P.html"
+    "As before, to visualise the html file produced:\n",
+    "    - scp studentXX@86.119.40.206:/home/studentXX/ERR760779_1P.trimmed.html Desktop"
   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
  }
 ],
 "metadata": {

 %% Cell type:markdown id: tags:

 # 14-15.05.2018 - Cape Town Genomics Workshop part I

 %% Cell type:markdown id: tags:

 ## 0. Getting started
 ### How to start the jupyter notebook
-1. Access the cloud: ssh student01@86.119.40.206
+1. Access the cloud: ssh studentXX@86.119.40.206
 2. Your password is: stphcourse2018
-3. cp -r ../Workshop_SA.git
-4. singularity exec ../container.img jupyter notebook --no-browser --ip='*' --port=YourPortNumber eg.30000
-5. Type in the browser: http://86.119.40.206:YourPortNumber/?token=c0669c145a630ea14b6ec3b29b870811844fefe12c375feb
+3. copy this folder to your home directory: cp -r /home/Workshop_SA/ .
+4. In hour home, type: singularity exec /home/container.img jupyter notebook --no-browser --ip='*' --port=YourPortNumber eg.30000
+
+
+If you wish to access the git from your webbrowser, the URL is: https://git.scicore.unibas.ch/TBRU/Workshop_SA

 %% Cell type:markdown id: tags:



 ### Why use the jupyter notebook ?
 - **Reproducibility**: it allows you to record your computational analysis, your figures, and comments together.
 - It is your bioinformatics 'lab book'.

 ### Useful tips to use in the jupyter notebook
- Run the command in the 'code cell': Shift + Return
+- Run the command in the 'code cell': Shift + Enter
 - You can change the cell type from Code to Markdown to include explanatory text in your notebook
 - Use the "tab" key to autocomplement commands
 - https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/


 %% Cell type:markdown id: tags:

 ### Magics
 Taken from: https://blog.dominodatalab.com/lesser-known-ways-of-using-notebooks/

-You can start notebooks with different kernels (e.g., R, Julia) — not just Python. What you might not know is that even within a notebook, you can run different types of code in different cells. With "magics", it is possible to use different languages
+You can start notebooks with different kernels (e.g., R, Shell) — not just Python. What you might not know is that even within a notebook, you can run different types of code in different cells. With "magics", it is possible to use different languages
 By running % lsmagic in a cell you get a list of all the available magics. You can use % to start a single-line expression to run with the magics command. Or you can use a double %% to run a multi-line expression.

 Some of my favorites are:

-!: to run a shell command.
-% bash to run cell with bash in a subprocess.
+    ! to run a shell command.
+
+    % bash to run cell with bash in a subprocess.

 ### Using shell commands

 Any command that works at the command-line can be used in IPython by prefixing it with the ! character. For example, the ls, pwd, and echo commands can be run as follows:

+%% Cell type:code id: tags:
+
+``` python
+! pwd
+! echo 'The files in my working directory are:'
+! ls
+```
+
+%% Output
+
+    /scicore/home/gagneux/loiseau/Workshop_SA/notebooks
+    The files in my working directory are:
+    adapters				  Drug_resistance_mutations_MTBC.txt
+    annotation				  images
+    CapeTown_Genomics_Tutorial_partIII.ipynb  Locus_to_exclude_Mtb.txt
+    CapeTown_Genomics_Tutorial_partII.ipynb   reference_genome
+    CapeTown_Genomics_Tutorial_partI.ipynb	  slurm_scripts
+
 %% Cell type:markdown id: tags:

 ## What you will learn in these tutorials:
 - Run programs via the command line.
 - Perform essential steps of a Illumina whole-genome sequencing analysis pipeline of MTBC genomes.

 ## Content of this tutorial:
- Finding genetic variants from raw sequencing data:
-    - Looking into a fastq file: reads, Phred Quality scores
-    - Raw read processing and quality assessment
+- **Finding genetic variants from raw sequencing data**:
+    - Looking into a fastq file: quality assessment of the reads
+    - Raw read processing: trimming of illumina adapters and low quality bases
    - Mapping processed reads to a reference genome (creation of a BAM file)
    - BAM post-processing
    - BAM quality assesment
    - Variant identification (creation of a VCF file)
    - Variant Annotation
-<img src="images/Pipeline1.png" width="500">
+<img src="images/Pipeline1.png" width="600">
 - You want to find genetic variants (SNPs, insertion, deletions) in these sequences.
 - To do so, you need to perform the following bioinformatics steps:

-<img src="images/Pipeline2.png" width="500">
+<img src="images/Pipeline2.png" width="600">


 %% Cell type:markdown id: tags:

 ## 1. Raw data quality control
 <img src="images/fastqc.png" width="500">

 %% Cell type:markdown id: tags:

 ### 1.1 FastQ file

 %% Cell type:markdown id: tags:

 #### The fastq file we will be working on is found here:
    Forward read: ~/Workshop_SA/data_Eldholm/ERR760779_1.fastq.gz
    Reverse read: ~/Workshop_SA/data_Eldholm/ERR760779_2.fastq.gz

-The fastq files are compressed (.gz) to save space. Let's have a look at the first read of the file.
+The fastq files are compressed (.gz) to save space. Let's have a look at the first read of the file using zcat.
 For this, read the first 4 lines of the file:

 %% Cell type:code id: tags:

 ``` python
 ! zcat ~/Workshop_SA/data_Eldholm/ERR760779_1.fastq.gz | head -n 4
 ```

 %% Cell type:markdown id: tags:

 Let's do the same thing for the reverse read:

 %% Cell type:code id: tags:

 ``` python
 ! zcat ~/Workshop_SA/data_Eldholm/ERR760779_2.fastq.gz | head -n 4
 ```

 %% Cell type:markdown id: tags:

 ### Exercise:
 Get the file size, number of reads, total number of bases, and coverage per fastq file (length of *M. tuberculosis*  genome: 4'411'532 bp)

 - file size         : ..................
 - number of reads   : ............
 - lenght of reads   : ............
 - estimated coverage: .........
 - Phred score encoding : .......

 An estimation of the coverage can be calculated using the number of reads (N), the length of the *M. tuberculosis* genome (G) and the read length (L)

 Clue:

 <img src="images/phred_score_encoding.png" width="700">

 You can also use this piece of code to detect the encoding of the phred quality scores in your fastq file:

 %% Cell type:code id: tags:

 ``` python
 %%!
 zcat ~/Workshop_SA/data_Eldholm/ERR760779_1.fastq.gz| head -n 10000 |\
  awk '{if(NR%4==0) printf("%s",$0);}' |  od -A n -t u1 | \
  awk 'BEGIN{min=100;max=0;} \
      {for(i=1;i<=NF;i++) \
          {if($i>max) max=$i; \
               if($i<min) min=$i;}}END \
          {if(max<=74 && min<59) \
                     print "Phred+33"; \
           else
           if(max>73 && min>=64) \
                     print "Phred+64"; \
           else \
           if(min>=59 && min<64 && max>73) \
                     print "Solexa+64"; else print "Unknown score encoding!";}'
 ```

 %% Cell type:markdown id: tags:

 ### 1.2 FastQC: quality control of the raw data

 You can access the documentation of fastQC here: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/


 Calling FastQC with the -h flag we get an indication of the different arguments we can use:

 %% Cell type:code id: tags:

 ``` python
 ! fastqc -h
 ```

 %% Cell type:markdown id: tags:

 Go back to the terminal and run from the command line type:
    - sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_fastqc.slurm

+For this you will have to open a new terminal window and reconnect:
+    - ssh studentXX@86.119.40.206
+    - password: stphcourse2018
+
 %% Cell type:code id: tags:

 ``` python
 ! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_fastqc.slurm
 ```

 %% Cell type:markdown id: tags:

 The command produces two file:
    - an html which you can visualise using firefox for example
    - a compressed folder (.zip). You can see the content of this folder by using the command 'unzip'.

-You can visualise the html file using firefox.
+To visualise the html file open a new terminal on MobaXterm and type:
+    - scp studentXX@86.119.40.206:/home/studentXX/ERR760779_1_fastqc.html Desktop

-From the terminal, type:

-    firefox ERR760779_1_fastqc.html
+The html file is now on your local computer, on your desktop. Double click on it.

 %% Cell type:markdown id: tags:

 In the fastqc report you can see that:
    - the quality of the sequences decreases towards the end of the reads
    - there are 'over-represented sequences'. These are the Illumina adapters.

 To improve the quality of the raw sequences, we will use the software Trimmomatic.
 Trimmomatic is used to remove low quality bases and remove illumina adapters.

 %% Cell type:markdown id: tags:

 ## 2. Raw read processing

 ### 2.1 Removal of illumina adapters and low quality bases

 <img src="images/trimmomatic.png" width="350">
 Trimmomatic manual: http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
 #### What do we remove adapter sequences  from our reads ?
 Removal of adapter sequences in a process called **read trimming**, or **clipping**, is one of the first steps in analyzing NGS data.
 Adapters have to be ligated to every single DNA molecule during library preparation. For Illumina short read sequencing, the corresponding protocols involve (in most cases) a DNA fragmentation step, followed by the ligation of certain oligonucleotides to the 5’ and 3’ ends. These 5' and 3' adapter sequences have important functions in Illumina sequencing, since they hold barcoding sequences, forward/reverse primers (for paired-end sequencing) and the important binding sequences for immobilizing the fragments to the flowcell and allowing bridge-amplification.
 <img src="images/Illumina_adapters.png" width="350">

 Adapter contamination will lead to **NGS alignment errors** and an **increased number of unaligned reads**, since the adapter sequences are synthetic and do not occur in the genomic sequence.

 Source : https://www.ecseq.com/support/ngs/trimming-adapter-sequences-is-it-necessary


 %% Cell type:markdown id: tags:

 From the command-line type:
    - sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_trimmomatic.slurm

 %% Cell type:code id: tags:

 ``` python
 ! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_trimmomatic.slurm
 ```

 %% Cell type:markdown id: tags:

 For running trimmomatic in paired-end mode:
 2 input files:
    -  forward reads: ERR760779_1.fastq.gz
    -  reverse reads : ERR760779_2.fastq.gz
 4 output files:
    - forward paired reads:   ERR760779_**1P**.trimmed.fastq.gz
    - forward unpaired reads: ERR760779_ **1U**.trimmed.fastq.gz
    - reverse paired reads:   ERR760779_**2P**.trimmed.fastq.gz
    - reverse unpaired reads: ERR760779_**2U**.trimmed.fastq.gz

 When Trimmomatic performs trimming of adapters and low quality bases, the length of the reads will drop.
 If the read length drops below a certain threshold (minlen) then the read will be removed from the fastq, resulting in *unpaired* reads. These *unpaired* reads will go into the output files **1U** or **2U**.

 There are many possible *options* that come with Trimmomatic (cf. manual). Here are three we will use today:
 - ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
 - SLIDINGWINDOW: Performs a sliding window trimming approach. It starts scanning at the 5‟ end and clips the read once the average quality within the window falls below a threshold.
 - MINLEN: Drop the read if it is below a specified length

 %% Cell type:code id: tags:

 ``` python
 ! cat trimmomatic.e
 ```

 %% Cell type:markdown id: tags:

 ### Exercise:
- How many reads were dropped by Trimmomatic ?
- Why are complete reads dropped ?
- What is the percentage of reads we will find in the files ERR760779_**1P**.trimmed.fastq.gz and ERR760779_**2P**.trimmed.fastq.gz ?
- What is the percentage of reads we will find in the files ERR760779_**1U**.trimmed.fastq.gz and ERR760779_**2U**.trimmed.fastq.gz ?
+- How many reads were dropped by Trimmomatic ? ................
+- Why are complete reads dropped ? ................
+- What is the percentage of reads we will find in the files ERR760779_**1P**.trimmed.fastq.gz and ERR760779_**2P**.trimmed.fastq.gz ? ................
+- What is the percentage of reads we will find in the files ERR760779_**1U**.trimmed.fastq.gz and ERR760779_**2U**.trimmed.fastq.gz ? ................

 %% Cell type:markdown id: tags:

 ### 2.2 Quality Check of the processed reads

 We can run FastQC on the trimmed data to see how the quality of the reads has improved:

 In the command line, type;
    - sbatch ~/Workshop_SA/notebooks/slurm_scripts/launch_fastqc_on_trimmed_data.slurm

 %% Cell type:code id: tags:

 ``` python
 ! cat ~/Workshop_SA/notebooks/slurm_scripts/launch_fastqc_on_trimmed_data.slurm
 ```

 %% Output

    #!/bin/bash
    
    #SBATCH --job-name=fastqc_trim
    #SBATCH --cpus-per-task=1
    #SBATCH --mem-per-cpu=4G
    
    singularity exec /home/container.img fastqc -t 1 -q ERR760779_1P.trimmed.fastq.gz -o .

 %% Cell type:markdown id: tags:

-As before, to visualise the html file produce, open it with firefox:
-    - firefox ERR760779_1P.html
+As before, to visualise the html file produced:
+    - scp studentXX@86.119.40.206:/home/studentXX/ERR760779_1P.trimmed.html Desktop
+
+%% Cell type:code id: tags:
+
+``` python
+```

--- a/notebooks/Locus_to_exclude_Mtb.txt
+++ b/notebooks/Locus_to_exclude_Mtb.txt
--- a/notebooks/slurm_scripts/launch_GATK.slurm
+++ b/notebooks/slurm_scripts/launch_GATK.slurm
+#!/bin/bash 
+
+#SBATCH --job-name=GATK
+#SBATCH --cpus-per-task=1
+#SBATCH --mem-per-cpu=4G
+#SBATCH --time=6:00:00
+#SBATCH --output=GATK.o
+#SBATCH --error=GATK.e
+
+singularity exec container.img gatk-launch -T RealignerTargetCreator -nt 1 -R ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta -o ERR760779.intervals -I ERR760779.dedup.bam
+
+
+singularity exec container.img gatk-launch --disable_bam_indexing -T IndelRealigner R ~/Workshop_SA/notebooks/reference_genome/MTB_ancestor_reference.fasta -targetIntervals ERR760779.intervals -I ERR760779.dedup.bam -o ERR760779.dedup.realigned.bam
--- a/notebooks/slurm_scripts/launch_index2.slurm
+++ b/notebooks/slurm_scripts/launch_index2.slurm
+#!/bin/bash 
+
+#SBATCH --job-name=index
+#SBATCH --cpus-per-task=1
+#SBATCH --mem-per-cpu=2G
+
+singularity exec /home/container.img samtools index ERR760779.dedup.realigned.bam