@kanitz @gypas
I realised that there is an issue with the structure of the names in the Snakemake pipelines.
I wanted to make the indexes depend on the organism and the index size, independent of the sample, but I never propagated these values to the output of the Snakemake rules, which leads to many limitations.
I propose that I now include all the wildcards in the output names.
This means we have to fix the md5sums in the test_integration_workflow.
Not sure what the limitations are. Could you please give some more background here? It doesn't sound like a bug to me, though.
And perhaps propose some convention here on how you want files to be named. Personally, I'm not very fond at all of this "single-end"/"paired-end" directory tree. I would find it most intuitive to have a directory for each tool in the root results directory, and then all the logs for individual samples under these, e.g.:
Alternatively, a layout with the samples in the root, and the logs of all tools as subdirectories would be okay, too, I guess. Perhaps talk with @bakma to see how this affects the MultiQC rule.
In general, I think we should find some solution that is more amenable to programmatic access to the files. Right now it's very painful (see problem with MD5 sums).
The way I understand the wiring logic, the single_end and paired_end flags are taken from the TSV table (one row per sample), and Snakemake is able to execute the proper versions of the rules (se/pe) precisely because these words are hardcoded into the output paths of the subsequent steps. I therefore assume that these identifiers would have to be present at some level of the results directory tree. Is that right, @katsanto?
What I can imagine is that we write a function inside the Snakefile that would append an appropriate prefix to the sample name based on the mentioned flag. In the end we would be able to obtain a structure @kanitz proposed:
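The helper mentioned above could be sketched roughly as follows. This is only an illustration: the function names, the `se`/`pe` prefixes and the `results/<tool>/<sample>` layout are assumptions made for the example, not the layout agreed on in this thread.

```python
from pathlib import PurePosixPath

# Hypothetical prefixes; the thread only says the flag comes from the sample table.
MODE_PREFIXES = {"single_end": "se", "paired_end": "pe"}


def prefixed_sample(sample: str, mode: str) -> str:
    """Return the sample name with a prefix derived from the sequencing mode."""
    try:
        prefix = MODE_PREFIXES[mode]
    except KeyError:
        raise ValueError(f"unknown sequencing mode: {mode!r}") from None
    return f"{prefix}_{sample}"


def result_path(tool: str, sample: str, mode: str, root: str = "results") -> str:
    """Build <root>/<tool>/<prefixed sample>, i.e. one directory per tool."""
    return str(PurePosixPath(root) / tool / prefixed_sample(sample, mode))
```

Such a function could be called from an input function or a `params` entry of a rule, so that the se/pe distinction lives in one place instead of being hardcoded into every path.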
From my end (final report) such a structure is also more convenient but I do not think it is crucial. MultiQC config seems customizable enough that I should be able to parse the current result and log directories (this is my current opinion as of 22.02.2020, it might change in future if I run into problems).
What is important, though, is that right now, for most of the rules, we are redirecting both the stdout and stderr streams to the same log file. This should change, as for some tools MultiQC wants to parse a stdout file, not a combination of the two streams. So, if it is possible, I think we should split the logs into two separate streams: out and err. I am not sure whether Snakemake supports two separate variables under the log keyword, or how that would work...
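On the question of separate log files: the Snakemake documentation describes named entries under the `log` directive (for example one for stdout and one for stderr), which a rule's shell command can then reference individually. Independent of Snakemake, the underlying idea is just to send the two streams to different files. A minimal Python sketch of that idea follows; the function name `run_with_split_logs` and the `.stdout.log`/`.stderr.log` naming are assumptions for illustration only, not part of the pipeline.

```python
import subprocess
from pathlib import Path


def run_with_split_logs(cmd, log_dir, name):
    """Run cmd, writing stdout and stderr to two separate log files.

    Returns the exit code and the paths of both log files.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    out_path = log_dir / f"{name}.stdout.log"  # hypothetical naming scheme
    err_path = log_dir / f"{name}.stderr.log"  # hypothetical naming scheme
    with open(out_path, "w") as out, open(err_path, "w") as err:
        # Each stream gets its own file handle, so they are never interleaved.
        result = subprocess.run(cmd, stdout=out, stderr=err, check=False)
    return result.returncode, out_path, err_path
```

With the streams split like this, a report tool can be pointed at the stdout file alone, while warnings and errors stay available in the stderr file.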
All in all, I think we should not design this ad hoc.
We should sit together and come up with reasonable results and logs directory trees that satisfy both the Snakemake syntax and the MultiQC parsing constraints.
EDIT
Actually, when I think about this now, it would be much better for me to have the structure @kanitz proposed: if every sample had the exact same name throughout the whole pipeline execution, there would be much less cleaning to do in the final report table. The files would be placed under different directories that mark their distinct processing steps, of course. Appending a suffix like "adapter_trimmed" or "polyA_trimmed" to the sample name itself could be a pain for me to parse later, so I would like to avoid that...
I strongly agree with sitting together and discussing this carefully. What goes in and what comes out of the pipeline is absolutely crucial for usability and interfacing with upstream and downstream solutions, respectively (it doesn't all end with MultiQC, the results are going to be used in other pipelines etc). Probably not the best idea to rush this as changing this later would mean API breaking changes (and more work).
Perhaps some of you Snakemake users/gurus (@bakma, @gypas, @katsanto, @herrmchr, @devagy74) could expand on the limitations and try to see what alternatives Snakemake offers to overcome them, other than using wildcards and complicated filenames / directory structures.
On redirecting stdout and stderr streams, I also agree with @bakma: this is pretty standard practice and should be easy enough to change. Would you write an issue for that, @bakma?
eliminate samples folders -- put the name in the sample
name of the results folder should be indicative of the run (unique but at the same time not rerun if the same samples have run with the same parameters)
BIOPZ-Katsantoni Maria changed title from "Restructure rules to contain all wildcards" to "Restructure snakemake rules"
So I think that rewiring the whole existing pipeline WILL break everything, and we will spend an equally long time getting back to where we are now.
What I will apply are the following changes:
indexes outside of the results
Then when a run has finished we move the results to the specific sample uuid architecture (we can do that in the final rule).
When we later have new rules specific to analyses that involve multiple samples, we take this uuid information into account and then think about how to do this in a more structured manner.
In the end the overhead is minimal (so no more work on the existing and working things).
I do not think the overhead that comes with this whole "optimal" approach is worth it.
Summary of some discussion. Outdated perhaps but didn't want to lose it ;-)
Sample- or run-specific parameters are not stored in any filenames or paths as wildcards because, apart from being unwieldy, they may also lead to issues with the maximum allowed length of filenames or paths (255 and 4096 chars, respectively, on most Linux file systems). Exception: single-end/paired-end. Instead, we will use unique sample- and run-specific identifiers, generated from the checksum of a row of the sample table or of the entire sample table, respectively. In particular, the initial n characters of the checksum (configurable, but set to perhaps 8 by default) could be used as identifiers.
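The identifier scheme described above could be sketched as follows. This is an illustration under stated assumptions, not an agreed implementation: the function names, the choice of SHA-256 as checksum, the column names in the usage example, and the decision to sort columns and rows before hashing (so that reordering the table does not change the identifiers) are all hypothetical.

```python
import hashlib


def _canonical_row(row: dict) -> str:
    """Serialise one sample-table row with columns in sorted order (assumption)."""
    return "\t".join(f"{key}={row[key]}" for key in sorted(row))


def short_checksum(text: str, n: int = 8) -> str:
    """Return the first n hex characters of the SHA-256 checksum of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:n]


def sample_id(row: dict, n: int = 8) -> str:
    """Identifier for one sample, derived from its row of the sample table."""
    return short_checksum(_canonical_row(row), n)


def run_id(rows: list, n: int = 8) -> str:
    """Identifier for a run, derived from the entire sample table.

    Rows are sorted after serialisation so that row order does not matter
    (an assumption; the thread does not specify this).
    """
    return short_checksum("\n".join(sorted(_canonical_row(r) for r in rows)), n)
```

For example, `sample_id({"sample": "A", "organism": "homo_sapiens", "index_size": "75"})` yields an 8-character hexadecimal string that stays the same as long as the row's values do, which is what allows an unchanged sample to be recognised across runs. Shorter identifiers raise the chance of collisions, so n is best kept configurable, as proposed.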
Agreed to keep the wiring as is for now and to get to a first production version quickly. A redesign of the wiring, if any, could be the focus of a future version.
Points to consider for possible re-design:
minimizing Snakemake re-runs if data (sample-specific, project-specific or resource-specific) is already available