It would be useful to get the CPU, time, and memory demands of each Snakemake job and of the overall Rhea run.
This would give users some idea of how demanding samples for organism X are in terms of overall runtime and memory consumption.
For the runtime I use the benchmark mechanism; I have such a directive for every shell-based rule in my workflows. It creates one more logfile apart from stdout and stderr.
Wall-clock time should be the same without multi-threading. If a different number of cores is provided locally than on the cluster, then I would expect different times.
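For reference, this is roughly what it looks like on a rule (rule, command, and file names below are made up for illustration):

```python
rule map_reads:
    input:
        "results/{sample}.fq.gz"
    output:
        "results/{sample}.bam"
    log:
        "logs/map_reads/{sample}.log"
    # Snakemake writes a one-row TSV per job with wall-clock time
    # (columns s, h:m:s) and memory stats such as max_rss.
    benchmark:
        "benchmarks/map_reads/{sample}.tsv"
    threads: 4
    shell:
        "some_mapper --threads {threads} {input} > {output} 2> {log}"
```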
Cool. So in principle we could write a script that harvests these files and prepares a report that could be integrated into MultiQC. Could be nice, and reusable for any combination of Snakemake and MultiQC.
Yes, I agree, parsing those would be very easy.
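Something along these lines should do, as a sketch (it assumes pandas and a `benchmarks/<rule>/<sample>.tsv` layout, which is an assumption on my side):

```python
"""Harvest Snakemake benchmark TSVs into a single summary table."""
from pathlib import Path

import pandas as pd

records = []
for tsv in sorted(Path("benchmarks").glob("**/*.tsv")):
    df = pd.read_csv(tsv, sep="\t")
    # Assumed layout: benchmarks/<rule>/<sample>.tsv
    df.insert(0, "rule", tsv.parent.name)
    df.insert(1, "sample", tsv.stem)
    records.append(df)

summary = pd.concat(records, ignore_index=True)
# The "_mqc" suffix lets MultiQC pick the table up as custom content.
summary.to_csv("benchmarks_summary_mqc.tsv", sep="\t", index=False)
```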
However, what do you think about Snakemake reports? Like this one: https://koesterlab.github.io/resources/report.html
This is another option. I am unsure if we want to expand in another direction, though...
For the paper, @zavolan recommended using the RBP ENCODE knockdown data as a use case, which I think is a good idea.
My recommendation would be to reserve a node from scicore with a specific amount of RAM and CPUs and make plots similar to the ones we had here. This is what we did in the past to make sure that we were not affected by other processes running on the node/cluster.
But we are not there yet. We need to fix a few more things first and add some extra functionality. My bet is that the TIN score script is still the pipeline's performance bottleneck, but I hope I am wrong.
@gypas this is not about showing what resources Rhea consumes under optimal conditions but what it consumes in production. It will allow us to report ranges of minimal hardware requirements, so that end users can estimate whether they can run an analysis on a laptop/desktop, and sysadmins can estimate how to configure their HPC when setting up Rhea.
We need this info for every test run, so it's not something to do in the future.
On that note, I think even in a publication, reporting ranges of real-world resource requirements for a set of input parameters (sample size, genome size, annotation size) is probably more useful than some hypothetical minimal value on some arbitrary node with an arbitrary chipset and bus. For our benchmarking that was different, because we were comparing different tools, so the relative numbers mattered.
I believe that implementing this would be highly useful for zarp: https://koesterlab.github.io/resources/report.html. Since this is not required for making the workflow functional, I suggest we work on it for the v1.0.0 release.
As I understand it, the Snakemake report is more generic and would need some changes in the code to replace MultiQC. I guess the idea is to use the Snakemake report for runtime statistics.
Also, I think this should be included and reported as soon as zarp is public.
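For the runtime statistics, that would mainly mean wrapping the relevant outputs in report() so they end up in the HTML generated by `snakemake --report report.html`; a sketch (rule and file names are hypothetical):

```python
rule plot_runtime_stats:
    input:
        "results/benchmarks_summary.tsv"
    output:
        # report() marks the file for inclusion in `snakemake --report`.
        report("results/runtime_stats.svg", category="Performance")
    script:
        "scripts/plot_runtime_stats.py"
```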
This is also relevant to #44 (closed). I also think that these are not mutually exclusive: MultiQC is a good way to get statistics on the samples and the various processing steps, while the Snakemake report is more useful for summaries of the runtimes. If the user is more interested in sample-specific info, then a notebook, or even an HTML export of one, could contain interactive, searchable plots. These could then be packaged with the RO-Crate. Since this workflow is specific to samples and not to experiment-level analyses, I think it is good to provide some diversity and more flexibility in the output descriptions, allowing for easier integration of features in later versions.
In my understanding, MultiQC primarily includes summaries about sample content, while Snakemake summaries are about the execution. I suppose one could start including reports of one type in the other, but is that necessary? Why don't we keep these two conceptually separate issues separate?