It would be useful to get the CPU, time, and memory demands of each Snakemake job and of the overall Rhea run.
This would give users some idea of how demanding samples for organism X are in terms of overall runtime and memory consumption.
For the runtime I use the benchmark mechanism; I have such a directive for every shell-based rule in my workflows. It creates one more logfile apart from stdout and stderr.
Wall-clock time should be the same without multi-threading. If a different number of cores is provided locally than on the cluster, then I would expect different times.
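For reference, this is roughly what it looks like on a rule (rule, command, and file names below are made up for illustration):

```python
rule map_reads:
    input:
        "results/{sample}.fq.gz"
    output:
        "results/{sample}.bam"
    log:
        "logs/map_reads/{sample}.log"
    # Snakemake writes a one-row TSV per job with wall-clock time
    # (columns s, h:m:s) and memory stats such as max_rss.
    benchmark:
        "benchmarks/map_reads/{sample}.tsv"
    threads: 4
    shell:
        "some_mapper --threads {threads} {input} > {output} 2> {log}"
```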
Cool. So in principle we could write a script that harvests these files and prepares a report that could be integrated into MultiQC. Could be nice, and reusable for any combination of Snakemake and MultiQC.
Yes, I agree, parsing those would be very easy.
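Something along these lines should do, as a sketch (it assumes pandas and a `benchmarks/<rule>/<sample>.tsv` layout, which is an assumption on my side):

```python
"""Harvest Snakemake benchmark TSVs into a single summary table."""
from pathlib import Path

import pandas as pd

records = []
for tsv in sorted(Path("benchmarks").glob("**/*.tsv")):
    df = pd.read_csv(tsv, sep="\t")
    # Assumed layout: benchmarks/<rule>/<sample>.tsv
    df.insert(0, "rule", tsv.parent.name)
    df.insert(1, "sample", tsv.stem)
    records.append(df)

summary = pd.concat(records, ignore_index=True)
# The "_mqc" suffix lets MultiQC pick the table up as custom content.
summary.to_csv("benchmarks_summary_mqc.tsv", sep="\t", index=False)
```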
However, what do you think about Snakemake reports? Like this one: https://koesterlab.github.io/resources/report.html
This is another option. I am unsure if we want to expand in another direction, though...
For the paper, @zavolan recommended using the RBP ENCODE knockdown data as a use case, which I think is a good idea.
My recommendation would be to reserve a node from scicore with a specific amount of RAM and CPUs and make plots similar to the ones we had here. This is what we did in the past to make sure that we were not affected by other processes running on the node/cluster.
But we are not there yet. We need to fix a few more things first and add some extra functionality. My bet is that the TIN score script is still the pipeline's performance bottleneck, but I hope I am wrong.
@gypas this is not about showing what resources Rhea consumes under optimal conditions but what it consumes in production. It will allow us to report ranges of minimal hardware requirements, so that end users can estimate whether they can run an analysis on a laptop/desktop, and sysadmins can estimate how to configure their HPC when setting up Rhea.
We need this info for every test run, so it's not something to do in the future.
On that note, I think even in a publication, reporting ranges of real-world resource requirements for a set of input parameters (sample size, genome size, annotation size) is probably more useful than some hypothetical minimal value on some arbitrary node with an arbitrary chipset and bus. For our benchmarking that was different, because we were comparing different tools, so the relative numbers mattered.
I believe that implementing this would be highly useful for zarp: https://koesterlab.github.io/resources/report.html. Since this is not required for making the workflow functional, I suggest we work on it for the v1.0.0 release.
As I understand it, the Snakemake report is more generic and would need some changes in the code to replace MultiQC. I guess the idea is to use the Snakemake report for runtime statistics.
Also, I think this should be included and reported as soon as zarp is public.
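For the runtime statistics, that would mainly mean wrapping the relevant outputs in report() so they end up in the HTML generated by `snakemake --report report.html`; a sketch (rule and file names are hypothetical):

```python
rule plot_runtime_stats:
    input:
        "results/benchmarks_summary.tsv"
    output:
        # report() marks the file for inclusion in `snakemake --report`.
        report("results/runtime_stats.svg", category="Performance")
    script:
        "scripts/plot_runtime_stats.py"
```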
This is also relevant to #44 (closed). I also think that these are not mutually exclusive: MultiQC is a good way to get statistics on the samples and the various processing steps, while the Snakemake report is more useful for summaries of the runtimes. If the user is more interested in sample-specific info, then a notebook, or even an HTML export of one, could contain interactive, searchable plots. These could then be packaged with the RO-Crate. Since this workflow is specific to samples and not to experiment-level analyses, I think it is good to provide some diversity and more flexibility in the output descriptions, allowing for easier integration of features in later versions.
In my understanding, MultiQC primarily includes summaries about sample content, while Snakemake summaries are about the execution. I suppose one could start including reports of one type in the other, but is that necessary? Why don't we keep these two conceptually separate issues separate?