Remove PCR duplicates

added To Do label

changed the description

Suggested tool: UMI-tools (https://umi-tools.readthedocs.io/en/latest/). A Docker image is available: https://hub.docker.com/r/zavolab/umi-tools

unassigned @katsanto

assigned to @katsanto

unassigned @katsanto

changed title from Deal with UMIs to Add rule for UMI-tools

added 1 deleted label

removed To Do label

added To Do label

changed milestone to %v0.1.0 release

assigned to @katsanto

removed milestone

removed To Do label

added Won't fix label

I believe UMI collapsing requires a lot of work and has minimal application for this workflow. The time-consuming part is that UMIs are sometimes still in the sequence and sometimes already removed and appended to the sequence name, and the naming is not always consistent. We would need a way to know whether the samples to be analysed contain UMIs, and then enforce a common format for the processed reads. Since this workflow is for bulk RNA-seq, I also do not know to what extent this would be useful. Won't fix for now, but if people ask for this feature, we can revisit.
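
To illustrate the detection problem described above, here is a minimal sketch that counts how many of the first reads in a FASTQ file carry a UMI-like suffix in the read name. It assumes the `_<UMI>` layout that `umi_tools extract` appends to the read identifier; the function name `count_umi_like_headers` and the default of 1000 reads are made up for this example, and any other naming scheme would go undetected.

```shell
# Hypothetical helper (not part of the workflow): report how many of the
# first N reads have a read ID ending in "_<UMI>", e.g. "@read1_ACGTAC".
# Only this one naming convention is checked; others are NOT detected.
count_umi_like_headers() {
    fastq_gz="$1"            # gzip-compressed FASTQ file
    n_reads="${2:-1000}"     # number of reads to inspect (assumed default)
    gzip -dc "$fastq_gz" | head -n "$((n_reads * 4))" |
        awk 'NR % 4 == 1 {
                 total++
                 # $1 is the read ID, i.e. the header up to the first blank
                 if ($1 ~ /_[ACGTN]+$/) hits++
             }
             END { printf "%d/%d\n", hits, total }'
}
```

Calling `count_umi_like_headers sample_1.fastq.gz` prints `hits/total`. A result near zero would only mean that this particular convention is absent, not that the library has no UMIs.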

closed

reopened

@zavolan suggested to look into this: https://www.nature.com/articles/s41598-019-48242-w.pdf (does not require UMIs and runs on FASTQ files)
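
For orientation, UMI-free deduplication at the FASTQ level can be sketched in its most naive form as below. This is not the method used by the linked tool; it is only an exact-match illustration that keeps the first read pair seen for each combination of mate 1 and mate 2 sequences. The function name and file arguments are hypothetical, bash is assumed (process substitution), and every distinct sequence pair is held in memory, so it would not scale to real samples.

```shell
#!/usr/bin/env bash
# Naive exact-duplicate removal for paired-end FASTQ (illustration only).
# Usage: dedup_fastq_pairs R1.fastq.gz R2.fastq.gz OUT1.fastq OUT2.fastq
dedup_fastq_pairs() {
    local r1="$1" r2="$2" out1="$3" out2="$4"
    # Put each 4-line FASTQ record on one tab-separated line, then join the
    # two mates side by side: fields 1-4 are mate 1, fields 5-8 are mate 2.
    paste <(gzip -dc "$r1" | paste - - - -) \
          <(gzip -dc "$r2" | paste - - - -) |
        awk -F'\t' -v o1="$out1" -v o2="$out2" '
            # Keep a pair only the first time its two sequences are seen.
            !seen[$2 FS $6]++ {
                print $1 "\n" $2 "\n" $3 "\n" $4 > o1
                print $5 "\n" $6 "\n" $7 "\n" $8 > o2
            }'
}
```

The output is written uncompressed and keeps the input read order.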

added Discuss pipeline labels and removed Won't fix label

changed title from Add rule for UMI-tools to Remove PCR duplicates

I did a quick test run of the tool suggested by @zavolan: https://sourceforge.net/projects/ngsreadstreatment/

It is distributed as a compiled Java JAR, so there are no dependencies other than Java itself, but the newest release requires Java 13. This matters because previous releases of the tool do not support multi-threading (crucial).

I pulled some compressed paired-end RNA-Seq reads (fastq.gz) and ran the tool with 8 threads:

$ time java -jar NgsReadsTreatment_v1.3.jar GSM1502498_1.fastq.gz GSM1502498_2.fastq.gz 8
Total memory consumed: 230MB

real	2493m12.292s
user	30m59.395s
sys	248m8.630s

So: it took about 41.5 h of wall-clock time. We could provide 16 or 32 cores, but still, I think this is a little too long to include in zarp by default(?)
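
As a quick sanity check on those numbers, the sketch below converts the `time` output above to hours and compares CPU time with wall-clock time. The values are copied from the run above; the function name `summarize_timing` is made up for this example.

```shell
# Convert the `time` output of the test run to hours and compare
# CPU time (user + sys) with wall-clock time (real).
summarize_timing() {
    awk 'BEGIN {
        real = 2493 * 60 + 12.292   # real  2493m12.292s, in seconds
        user = 30 * 60 + 59.395     # user    30m59.395s, in seconds
        sys  = 248 * 60 + 8.630     # sys    248m8.630s,  in seconds
        printf "wall-clock: %.2f h\n", real / 3600
        printf "CPU (user+sys): %.2f h\n", (user + sys) / 3600
        printf "CPU/wall ratio: %.2f\n", (user + sys) / real
    }'
}

summarize_timing
```

This works out to roughly 41.5 h of wall-clock time against under 5 h of CPU time, most of it `sys`. That may suggest the run was not CPU-bound with 8 threads, in which case adding cores would not necessarily shorten it much, but that is an inference from a single run.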

The input files are ~5.2 GB each (~27 GB after unzipping). The output (~7.2 GB per file) is written uncompressed by default, which is why it is larger than the compressed input (so no, it turns out the tool is not introducing duplicates). The output is ~4x smaller than the uncompressed input:

$ ls -al
total 24335625
drwxr-xr-x  2 bakma zavolan       4096 Apr 23 18:57 .
drwxr-x--- 84 bakma zavolan       8192 Apr 23 15:35 ..
-rw-r--r--  1 bakma zavolan 7216070381 Apr 23 18:18 GSM1502498_1_1_trated.fastq
-rwxr-x---  1 bakma zavolan 5223961535 Apr 20 15:25 GSM1502498_1.fastq.gz
-rw-r--r--  1 bakma zavolan        178 Apr 23 18:18 GSM1502498_1_Report.log
-rw-r--r--  1 bakma zavolan 7186719878 Apr 23 18:18 GSM1502498_2_2_trated.fastq
-rwxr-x---  1 bakma zavolan 5292211941 Apr 20 15:25 GSM1502498_2.fastq.gz
-rw-r--r--  1 bakma zavolan     277227 Apr 22 00:34 NgsReadsTreatment_v1.3.jar

added To Do label

removed To Do label

removed Discuss label

added Future label

Check whether there is an easy-to-use tool out there. Otherwise, keep this in mind if you ever adapt this workflow for small RNA-seq or single-cell data.

removed Future label

added To Do label

Some recommendations from Twitter (just to write them down):

samtools-markdup: http://www.htslib.org/doc/samtools-markdup.html
MarkDuplicates from picard tools: http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates and https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-

It seems that Picard MarkDuplicates (http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates and https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-) is the tool that runs equally efficiently for single-end and paired-end sequencing data. However, if we implement this, we need additional rules for sorting the BAM files and for converting them back to FASTQ (the tool works at the BAM level), since some downstream analyses also use the FASTQ files. Should we go ahead with this? @zavolan @kanitz @herrmchr @bakma @burri0000 @gypas
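
To make the extra steps concrete, here is a rough sketch of what the additional rules would have to run for paired-end data: sort, remove duplicates, convert back to FASTQ. It assumes `samtools` and the `picard` wrapper script (as installed by Bioconda) are on `PATH`; the function name, output file names, and thread default are placeholders, not existing workflow rules.

```shell
# Hypothetical helper: deduplicate an aligned paired-end BAM with Picard
# and convert the result back to FASTQ.
# Usage: dedup_bam_to_fastq IN.bam OUT_PREFIX [THREADS]
dedup_bam_to_fastq() {
    in_bam="$1"          # aligned reads, any sort order
    prefix="$2"          # prefix for all output files
    threads="${3:-4}"    # threads for sorting (assumed default)

    # 1. Coordinate-sort the alignments for MarkDuplicates.
    samtools sort -@ "$threads" -o "$prefix.sorted.bam" "$in_bam"

    # 2. Flag duplicates and drop them from the output BAM.
    picard MarkDuplicates \
        I="$prefix.sorted.bam" \
        O="$prefix.dedup.bam" \
        M="$prefix.dup_metrics.txt" \
        REMOVE_DUPLICATES=true

    # 3. Bring mates back together and write them out as FASTQ.
    samtools collate -u -O "$prefix.dedup.bam" |
        samtools fastq -1 "$prefix.dedup_1.fastq.gz" \
                       -2 "$prefix.dedup_2.fastq.gz" \
                       -0 /dev/null -s /dev/null -n
}
```

Single-end libraries would need a different `samtools fastq` call, and the intermediate BAM files roughly double the disk footprint per sample, which is part of the effort being weighed here.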

Personally, I don't think it's worth the effort, but I don't have a strong opinion on it.

Same here.

I also don't think it's a big issue in bulk RNA-seq. If time permits, OK, but after we close the currently open issues.

added low priority label

removed To Do label

added Future label

[26.08.2021] Minigroup meeting: we decided we will not support this feature for this release.
CC: @zavolan @katsanto @kanitz

closed

reopened

closed
