Remove PCR duplicates
Include UMI-tools in the pipeline
Activity
- CJHerrmann added To Do label
- CJHerrmann changed the description
- CJHerrmann assigned to @katsanto
- Maintainer
Suggested tool: https://umi-tools.readthedocs.io/en/latest/ (Docker image available: https://hub.docker.com/r/zavolab/umi-tools)
- CJHerrmann unassigned @katsanto
- BIOPZ-Katsantoni Maria assigned to @katsanto
- BIOPZ-Katsantoni Maria unassigned @katsanto
- Alex Kanitz changed title from Deal with UMIs to Add rule for UMI-tools
- Alex Kanitz added 1 deleted label
- Alex Kanitz removed To Do label
- Alex Kanitz added To Do label
- Alex Kanitz changed milestone to %v0.1.0 release
- BIOPZ-Katsantoni Maria assigned to @katsanto
- Alex Kanitz removed milestone
- BIOPZ-Katsantoni Maria removed To Do label
- BIOPZ-Katsantoni Maria added Won't fix label
- Maintainer
I believe UMI collapsing requires a lot of work and has minimal application for this workflow. The time-consuming part is that UMIs are sometimes still in the read sequence, while at other times they have already been removed and added to the sequence name, and not always in a consistent way. We would need to know whether the samples to be analysed contain UMIs and then enforce a common format for the processed reads. Since this workflow is for bulk RNA-seq, I also do not know to what extent this would be useful. Won't fix for now, but if people ask for this feature, we can revisit.
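To illustrate what "enforce a common format" could look like, here is a minimal sketch in Python. It assumes the target convention is a trailing `_<UMI>` on the read ID (which is how `umi_tools extract` stores UMIs by default) and that in-read UMIs sit at the 5' end with a known length. The function name `normalize_umi` and the `umi_len` parameter are hypothetical; other header layouts seen in the wild would need their own patterns.

```python
import re

# Assumption: a UMI that was already moved to the read name appears as a
# trailing "_<UMI>" on the read ID. Other layouts are not handled here.
UMI_IN_NAME = re.compile(r"_([ACGTN]+)$")


def normalize_umi(header, seq, qual, umi_len=None):
    """Return (header, seq, qual) with the UMI stored as '_<UMI>' on the read ID.

    If the read ID already ends in '_<UMI>', the record is returned unchanged.
    Otherwise the first `umi_len` bases are treated as the UMI, trimmed from
    the sequence and quality strings, and appended to the read ID.
    """
    read_id, _, comment = header.partition(" ")
    if UMI_IN_NAME.search(read_id):
        return header, seq, qual
    if umi_len is None:
        raise ValueError("no UMI in read name and no in-read UMI length given")
    umi = seq[:umi_len]
    new_id = f"{read_id}_{umi}"
    new_header = f"{new_id} {comment}" if comment else new_id
    return new_header, seq[umi_len:], qual[umi_len:]
```

Even in this simplified form, the caller still has to know per sample whether UMIs are present and how long they are, which is the bookkeeping effort described above.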
- BIOPZ-Katsantoni Maria closed
- BIOPZ-Katsantoni Maria reopened
- Owner
@zavolan suggested looking into this: https://www.nature.com/articles/s41598-019-48242-w.pdf (does not require UMIs and runs on FASTQ files)
- Alex Kanitz changed title from Add rule for UMI-tools to Remove PCR duplicates
- Maintainer
I did a quick test run of the tool suggested by @zavolan: https://sourceforge.net/projects/ngsreadstreatment/
It is a compiled Java binary, so it has no dependencies, but the newest release does require Java version 13. This matters because previous releases of the tool do not support multi-threading, which is crucial.
I pulled some paired-end RNA-Seq compressed reads (fastq.gz) and ran the tool with 8 threads:
$ time java -jar NgsReadsTreatment_v1.3.jar GSM1502498_1.fastq.gz GSM1502498_2.fastq.gz 8
Total memory consumed: 230MB
real 2493m12.292s
user 30m59.395s
sys 248m8.630s
So: it finished after roughly 41.5 h of wall-clock time. We could provide 16/32 cores, but still, I think this is a little too long to include in zarp by default(?)
The input files are ~5.2GB each (~27GB after unzipping). The output (~7.2GB per file) is uncompressed by default, which is why it is larger than the compressed input (so no, it turns out the tool is not introducing duplicates). The output is ~4x smaller than the uncompressed input:
$ ls -al
total 24335625
drwxr-xr-x  2 bakma zavolan       4096 Apr 23 18:57 .
drwxr-x--- 84 bakma zavolan       8192 Apr 23 15:35 ..
-rw-r--r--  1 bakma zavolan 7216070381 Apr 23 18:18 GSM1502498_1_1_trated.fastq
-rwxr-x---  1 bakma zavolan 5223961535 Apr 20 15:25 GSM1502498_1.fastq.gz
-rw-r--r--  1 bakma zavolan        178 Apr 23 18:18 GSM1502498_1_Report.log
-rw-r--r--  1 bakma zavolan 7186719878 Apr 23 18:18 GSM1502498_2_2_trated.fastq
-rwxr-x---  1 bakma zavolan 5292211941 Apr 20 15:25 GSM1502498_2.fastq.gz
-rw-r--r--  1 bakma zavolan     277227 Apr 22 00:34 NgsReadsTreatment_v1.3.jar
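For comparison, UMI-free duplicate removal at the FASTQ level can be sketched in a few lines of Python. This is explicitly not the algorithm used by NgsReadsTreatment: it is a plain exact-match filter that keeps the first read pair seen for each (mate 1, mate 2) sequence combination, using a set of 16-byte hashes. All function names are hypothetical, and memory use grows with the number of unique pairs, so it is a baseline to reason about rather than a proposed replacement.

```python
import gzip
import hashlib
from itertools import islice


def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from an iterator over FASTQ lines."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield tuple(line.rstrip("\n") for line in record)


def dedup_pairs(pairs):
    """Yield only the first read pair seen for each (mate 1, mate 2) sequence."""
    seen = set()
    for rec1, rec2 in pairs:
        # Hash both mate sequences together; a pair is a duplicate only if
        # both mates are identical to those of an earlier pair.
        key = hashlib.blake2b(
            f"{rec1[1]}\t{rec2[1]}".encode(), digest_size=16
        ).digest()
        if key not in seen:
            seen.add(key)
            yield rec1, rec2


def dedup_files(in1, in2, out1, out2):
    """Deduplicate two gzipped, mate-ordered FASTQ files into gzipped output."""
    with gzip.open(in1, "rt") as f1, gzip.open(in2, "rt") as f2, \
            gzip.open(out1, "wt") as o1, gzip.open(out2, "wt") as o2:
        for rec1, rec2 in dedup_pairs(zip(read_fastq(f1), read_fastq(f2))):
            o1.write("\n".join(rec1) + "\n")
            o2.write("\n".join(rec2) + "\n")
```

A call such as `dedup_files("GSM1502498_1.fastq.gz", "GSM1502498_2.fastq.gz", "dedup_1.fastq.gz", "dedup_2.fastq.gz")` would write compressed output; the output file names here are placeholders. Exact matching treats reads that differ by a single sequencing error as distinct, so it will generally remove fewer reads than alignment-based approaches.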
- BIOPZ-Katsantoni Maria added To Do label
- BIOPZ-Katsantoni Maria removed To Do label
- BIOPZ-Katsantoni Maria removed Discuss label
- BIOPZ-Katsantoni Maria added Future label
- Maintainer
Check whether there is an easy-to-use tool out there. Otherwise, keep this in mind if you ever adapt this workflow for small RNA-seq or single-cell data.
- BIOPZ-Katsantoni Maria removed Future label
- BIOPZ-Katsantoni Maria added To Do label
- Maintainer
Some recommendations from Twitter (just to write them down):
- samtools-markdup: http://www.htslib.org/doc/samtools-markdup.html
- MarkDuplicates from picard tools: http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates and https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-
- Maintainer
It seems that Picard MarkDuplicates (http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates and https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-) is the tool that runs equally efficiently on single-end and paired-end sequencing data. However, it works at the BAM level, so if we implement this we need additional rules for sorting the BAM files and for converting them back to FASTQ, since some downstream analyses also use the FASTQ files. Should we go ahead with this? @zavolan @kanitz @herrmchr @bakma @burri0000 @gypas
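The extra steps mentioned above can be sketched as a small Python helper that only builds the command lines, without running them. Assumptions: a `picard` wrapper executable is on the PATH (otherwise `java -jar picard.jar` would be needed), the legacy `I=`/`O=`/`M=` argument style is accepted by the installed Picard version, and the function name, file naming scheme, and `paired` flag are hypothetical. Reads that never made it into the BAM file cannot be recovered by the final conversion step.

```python
def markdup_commands(bam, prefix, paired=True):
    """Build the commands for sort -> MarkDuplicates -> BAM-to-FASTQ.

    Returns a list of argument lists suitable for subprocess.run().
    Nothing is executed here.
    """
    sorted_bam = f"{prefix}.sorted.bam"
    dedup_bam = f"{prefix}.dedup.bam"
    metrics = f"{prefix}.dup_metrics.txt"
    cmds = [
        # MarkDuplicates expects coordinate-sorted input.
        ["samtools", "sort", "-o", sorted_bam, bam],
        # REMOVE_DUPLICATES=true drops duplicates instead of only flagging them.
        ["picard", "MarkDuplicates", f"I={sorted_bam}", f"O={dedup_bam}",
         f"M={metrics}", "REMOVE_DUPLICATES=true"],
    ]
    if paired:
        # Mates must be adjacent before they can be split into two FASTQ files.
        name_sorted = f"{prefix}.dedup.namesorted.bam"
        cmds.append(["samtools", "sort", "-n", "-o", name_sorted, dedup_bam])
        cmds.append(["samtools", "fastq",
                     "-1", f"{prefix}_1.fastq.gz", "-2", f"{prefix}_2.fastq.gz",
                     "-0", "/dev/null", "-s", "/dev/null", name_sorted])
    else:
        # Single-end reads carry neither the READ1 nor the READ2 flag,
        # so samtools fastq writes them to the -0 file.
        cmds.append(["samtools", "fastq", "-0", f"{prefix}.fastq.gz", dedup_bam])
    return cmds
```

In a Snakemake workflow each of these commands would more naturally become its own rule; the list form here is only meant to make the number of additional steps concrete (four for paired-end, three for single-end).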
- Maintainer
Personally, I don't think it's worth the effort, but I don't have a strong opinion on it.
- Maintainer
"Personally, I don't think it's worth the effort, but I don't have a strong opinion on it."
Same here.
- Owner
I also don't think it's a big issue in bulk RNA-seq. If time permits, OK, but after we close the currently open issues.
- BIOPZ-Katsantoni Maria added low priority label
- BIOPZ-Katsantoni Maria removed To Do label
- BIOPZ-Bak Maciej added Future label
- BIOPZ-Bak Maciej closed
- CJHerrmann reopened
- CJHerrmann closed