diff --git a/exercises_MolEpiSA/R_tutorial_Eldholm2015.ipynb b/exercises_MolEpiSA/R_tutorial_Eldholm2015.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..41fde2875eebd8ab068c4f9359789ff4ff5dcfde
--- /dev/null
+++ b/exercises_MolEpiSA/R_tutorial_Eldholm2015.ipynb
@@ -0,0 +1,583 @@
+{
+ "cells": [
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "The aim of this tutorial is to become familiarised with some of the tools and conceps used to approach epidemiology using WGS. \n",
+    "\n",
+    "We will use the data of the Buenos Aires outbreak published in Eldholm et al 2015. We have extracted the variants from those Fastq files for you, and from the files that contain the variants you have built alignments containing all the genomes of that dataset yesterday. Today we will be using one of the alignments produced yesterday.\n",
+    "\n",
+    "We will be using R troughout the tutorial, however there are others methods and implementations which you could use (check presentation). We start by loading the R libraries necessary for the exercise. Outside the course you would need first to install those packages using the command install.packages(\"nameOfThePackage\").\n",
+    "The find more about the methods used in this tutorial and application of the method to influenza check \"Jombart T. et al. Reconstructing disease outbreaks from genetic data: a graph approach.2001.Heredity.106(2): 383–390\". "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "library(ape)\n",
+    "library(adegenet)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Check what is your working directory"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "getwd()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Read the DNA alignment into R"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "alignment <- read.dna(file=\"~/Workshop_SA/data_Eldholm/Eldholm2015_vcf_merged_hap_miss_0.9_transposed.fasta_only_var_also_outgroup\", format=\"fasta\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Check the characteritics of the aligment. How many sequences? What are their size? What is the base composition? In what would a whole genome alignment be different (appart from the size)?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "alignment"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will now check what are the mean pair-wise distances of our population. Does it fit with an outbreak?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "distanceN <- dist.dna(alignment,model= \"N\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We plot now the calculated pair-wise genetic distances"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist(distanceN)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "What is going on here? \n",
+    "Let us try ploting the distances in another way. For that we need to transform out the distances into a data frame object. Data frames are a handy way of storing objects in R because they are easy to manipulate. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "temp <- as.data.frame(as.matrix(distanceN))\n",
+    "#plot temp\n",
+    "table.paint(temp, cleg = 0, clabel.row = 0.5, clabel.col = 0.5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can see that one of the genomes is very different from the others. Let us inspect the names of the genomes. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rownames(temp)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can see that the reference genome MTC_ANC is present in our alignment. \n",
+    "Confirm what you saw the matrix of distances by extracting the distance of the MTB_ANC to any other genome in the dataset. In R to extract a column from  data frame you use the symbol \"$\" followed by the name of the column"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "temp$MTB_ANC"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will remove the column and the row of MTB_ANC from our  pair-wise distances. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "temp <-temp[-which(names(temp) %in% c(\"MTB_ANC\")), -which(names(temp) %in% c(\"MTB_ANC\"))]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we check that the ancestor has been removed by  checking the dimensions of your dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dim(temp)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will transform the \"temp\" data frame into a matrix object (another kind of R objects) for the remaining exercises, and called it \"distances_noOutGroup\"."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "distances_noOutGroup <-(as.matrix(temp))\n",
+    "distances_noOutGroup[lower.tri(distances_noOutGroup)] <- NA"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let us plot the pair-wise genetic distances after removing the MTB_ANC"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist(distances_noOutGroup,main=\"Distribution of pairwise genetic distances Eldholm2015\", xlab=\"Number of differing SNPs\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now let us try to identify how many clusters of cases we have in this \n",
+    "population. We will be using a simple clustering approach based on the number of mutations separating sequences, classifying them in the same cluster if their distance is less than a given threshold.Of course we need to decide before hand what the threshold will be. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#here we use the cutoff of 5 SNPs\n",
+    "clust5 <- gengraph(distances_noOutGroup,cutoff=5,truenames=F)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    " #sequences are the nodes of the graphs; edges link sequences from the same cluster; numbers on the edges indicate numbers of mutations\n",
+    "clust5"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "How many clusters did you find? WHat is the distribution of the size of the clusters?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "clust5$clust$csize"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let us plot the clusters. Generally to change the different options of the plot you can check http://igraph.org/r/doc/plot.common.html "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plot(clust5$g, vertex.size=2,vertex.label.dist=0.5, vertex.color=\"red\", edge.arrow.size=0.5,vertex.label=NA)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Setting up another threshold would lead to a different number of clusters. As we have discuss it is difficult to know what is the right threshold. \n",
+    "In any case the results obtained so far confirm that these strains belong to an outbreak as  we have seen with the results obtained with the phylogenetic tree; most of the strains are very closely related suggesting recent transmission. The phylogenetic tree built yesterday gives us an idea of the possible chains of transmissions, but it does not have into account the collection dates. \n",
+    "The SeqTrack algorithm implemented in the  R \"adegenet\" package aims to reconstruct ancestries between the sampled sequences based on their genetic distances and collection dates based on maximum parsimony.\n",
+    "This is implemented in \"adegenet\" by the function seqTrack (see ?seqTrack). \n",
+    "We will use SeqTrack on the matrix of pairwise distances (distances_noOutGroup), indicating the labels of the cases (x.names=cases$id) and the collection dates (x.dates=dates).\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The collection dates of the strains is available as supplementary table together with the publication of Elhdholm et al 2015. Let us read that table into R."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "metadata <- read.csv(\"~/Workshop_SA/exercises_MolEpiSA/Eldholm2015_metatadata_edited.txt\", sep=\"\\t\", header=T,stringsAsFactors = FALSE)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#check what kind of information is in the metadata\n",
+    "head(metadata)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#check the size of the metadata\n",
+    "dim(metadata)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We see that we have more metadata (n=256) than we have genomes in our alignment (n=248). Let us extract from  the metadata the Gnumbers which are in the alignment.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#We create an object called \"gnumbers_alignment\" where we store the names of the genomes\n",
+    "gnumbers_alignment <- colnames(distances_noOutGroup)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#we compare our genetic distance matrix with the metadata "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#we extract from the metadata the genomes we have in teh first.\n",
+    "\n",
+    "metadata248 <- metadata[which(metadata$G_number %in% gnumbers_alignment),]\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#check the size of your new metadata table. \n",
+    "dim(metadata248)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#the following commands will just format the column corresponding to the isolation date in a way that R understands it. \n",
+    "metadata248<- transform(metadata248, Year = substr(metadata248$ISOLATE_DATE, 1, 4), Month = substr(metadata248$ISOLATE_DATE, 5, 6))\n",
+    "dates <-as.Date(metadata248$Year,\"%Y\")\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#the seqTrack reconstruction requires a matrix object so we will transform our matrix of distancse again\n",
+    "distances <- as.matrix(temp)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "reconstruction <- seqTrack(distances, x.names=metadata248$G_number, x.dates=dates)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#let us inspect the reconstruction done by seqTrack.\n",
+    "#reconstruction$id: the indices of the cases.\n",
+    "#reconstruction$ances:the indices of the putative ancestors of the cases.\n",
+    "#reconstruction$weight:the number of mutations for each putative ancestry.\n",
+    "#reconstruction$date:the collection dates of the cases.\n",
+    "#reconstruction$ances.date:the collection dates of the putative ancestors.\n",
+    "\n",
+    "reconstruction"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Let us plot the reconstruction. Again, generally to change the different options of the plot you can check http://igraph.org/r/doc/plot.common.html\n",
+    "reconstrunction_graph <- plot(reconstruction, vertex.size=2,vertex.label.dist=0.5, vertex.color=\"red\", edge.arrow.size=0.5,vertex.label=NA)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Compare that to the first plot you had generated. The temporal information allows you to gain some resolution. However, as we have discussed in the introduction to molecular epidemiology of Mtb, latency and HIV co-infection might distort the relationship between time and SNP accumulation. So we should not fully trust the resconstruction and ideally one should have more epi data that could help us with confirming some of the links identified by seqTrack. \n",
+    "For the purpose of this exercise we will assume now that the reconstruction is correct. \n",
+    "The plot seems to suggest that there are a few individuals key to the transmission process. We will calculate the number of secondary cases per infected individual, that is, the individual effective reproduction numbers (Ri). That information is  in reconstruction$ances. We will just write a small R code to retrieve it. \n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "Ri <- sapply(1:248, function(i) sum(reconstruction$ances==i, na.rm=TRUE))\n",
+    "names(Ri) <- paste((rownames(reconstruction)),sep=\"\")\n",
+    "Ri <-as.data.frame(Ri)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Let us visualise Ri\n",
+    "plot(table(Ri),ylab=\"counts\", main=\"Number of secondary cases per infected individum\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As we saw in the reconstruction plot, it is apparent that there are key individuals that have generated a lot of secondary cases. We could compare this to the phylogenetic tree you have built yesterday and to the  the original publication time scaled phylogenetic tree. For the sake of time, again let us believe our reconstruction for the purpose of the our tutorial. \n",
+    "\n",
+    "Perhaps one interesting aspect to look at would be to check if there is any relationship between the drug resistance profiles and the Ri. For that we will important the DR resistance profiles of each genome."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "DR_table <-read.csv(\"~/Workshop_SA/exercises_MolEpiSA/all_DRM_in_Eldholm_edited.txt\",sep=\"\\t\",header=T,stringsAsFactors=F)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#check the size of  your table\n",
+    "dim(DR_table)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Check how the table looks like. The first 5 column contain annotation\n",
+    "DR_table[1:10,1:10]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#create a new table with mutation encoded as 1 if they are fixed inside the host and 0 if otherwise. This is a simplification because we are ignoring the variable (e.g. 0/1) genotypes\n",
+    "mutations_per_genome <- ifelse (DR_table==\"1/1\",1,0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mutations_per_genome"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#count how many DR mutations there are per genome.\n",
+    "mutations_per_genome_table <- colSums(mutations_per_genome[,6:253])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# let us see if there is any relatioship between the number of resistance mutation and Ri\n",
+    "mutations_per_genome_table <- as.data.frame(mutations_per_genome_table)\n",
+    "ordered.mutations_per_genome_table <- mutations_per_genome_table[ order(rownames(mutations_per_genome_table)) , ,drop=F]\n",
+    "ordered.Ri <- Ri[order(rownames(Ri)),,drop=F]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plot(ordered.mutations_per_genome_table$mutations_per_genome_table,ordered.Ri$Ri,ylab=\"Ri\",xlab=\"Number of resistance mutations\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Interestingly we see that isolates with high Ri have generally also high numbers of DR conferring mutations. What could that mean?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "R",
+   "language": "R",
+   "name": "ir"
+  },
+  "language_info": {
+   "codemirror_mode": "r",
+   "file_extension": ".r",
+   "mimetype": "text/x-r-source",
+   "name": "R",
+   "pygments_lexer": "r",
+   "version": "3.4.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}