Which output files should be merged? The one from pe_genome_quantification_kallisto and se_genome_quantification_kallisto? It is a SAM file, so likely not. Or is there another file in quant_kallisto/?
And how should the summary tables look like? Sample name, gene name and TPM as columns? And another table for projected raw gene counts?
And the same for transcript level?
Single rule ("kallisto_merge") to generate separated summary tables for transcripts and genes, for expression (TPM) and projected raw counts. To do so I am writing an R script using tximport package. To generate the gene-level summarization tables, transcripts need to be associated with gene IDs in a two-column data frame linking transcript id to gene id, since kallisto only provides transcript-level information. In the link below there's an example of how they generate such table within R, but since we want to apply this pipeline to different organisms I'm not sure how to do it for a general case, and not just humans. (https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#kallisto)
Any idea/suggestion on what's the best way to approach this issue?
You give as input the gtf file (option --gtf) and inside your script, you generate the gene-transcript table
Or you create the gene transcript table in another rule/script and you give it as input in your script
Maybe the second solution is better because we might use the gene/transcript tables in other rules as well.
Thanks @gypas and @katsanto. I have pushed the branch so you can have a look. I wired the rules but I still need to finish the Rscript, but I think I'm close the make it work. Of course the test failed, expected.
For the gene-transcript table I'm creating it in the same R script where I'm merging the kallisto tables, but I am not returining that gene-trx table as an output result, I didn't know we may needed. Should I also save the gene-trx table in the same directory where I'm returning the merged kallisto tables?
I did it on Friday evening before seeing your message @gypas, but if it's better to have it in a separated rule I can modify it. One rule for generating the gene-trx table, and separately the rules for merging kallisto tables.
For running those rules, I will need to create the dockerimage with R and the libraries I'm using.
@gypas do you think you can help me with that? I did create once a dockerimage but was a long time ago. I have seen that there's an explanation on how to do so in the "joining the computational side" tutorials. I can also have a look and try it to do myself.
Hi @devagy74 Yes, just save the gene/transcript table in the same directory. This should be fine for now.
Regarding the Dockerfile I can generate the image. Once you are done with the script (push it in the same branch) and let me know. I will take the script and store it in a new github repo as I did for the following R script (https://github.com/zavolanlab/prune_tree).
After that, I will let you know to adjust the correspondin rules in the Snakefiles. Is that OK?
Perfect @gypas. I am just having some issues on how to pass as an argument in R a list of files (the tables to merge). I'm fixing it, after that I will push again the script. Thanks!
I have update the merge_kallisto.R script.
I tested it locally with the test input files that I previously generate running the pipeline: Rscript workflow/scripts/merge_kallisto.R --input tests/test_integration_workflow/results/samples/synthetic_10_reads_paired_synthetic_10_reads_paired/quant_kallisto/abundance.h5,tests/test_integration_workflow/results/samples/synthetic_10_reads_mate_1_synthetic_10_reads_mate_1/quant_kallisto/abundance.h5 --names synthetic_10_reads_paired_synthetic_10_reads_paired,synthetic_10_reads_mate_1_synthetic_10_reads_mate_1 --anno tests/input_files/homo_sapiens/annotation.gtf -v
I have test it for generating tpm and counts tables for both transcripts and genes, and it worked. But when running it in snakemake it fails and I have been trying to fix it without success. I think it may be the way in how snakemake returns the input tables that R script should read. But honestly I'm running out of ideas, and maybe it's a small detail I'm not seeing.
Can you maybe have a look?
I'm using it for the function tr2g_gtf (https://rdrr.io/bioc/BUSpaRse/man/tr2g_gtf.html) to extract the transcript IDs with its corresponding gene ID from the annotation file, and build the trx/gene IDs table. If you know another function that does so, but I haven't found anything else.