So the issue here is that the Snakemake report is generated by a second `snakemake` call from the bash script. The problem with this is that the report and the paths are relative, or have to be provided by the user, which is not convenient. My suggested solution is:
- Create a function that makes the call to the Snakemake report.
- Create a function that also takes care of the Slurm call.
- Do this in the labkey-to-snakemake script and rename it, as it does more than that now.
The downside is that we again introduce wrappers, which might complicate future maintenance. The advantage is that we can also expand with more functions, or create classes to handle bash scripts for other environments as well.
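As a minimal sketch of the first suggestion, the report call could be wrapped in a small Python function that resolves the paths itself, so the user no longer has to pass them relative to the working directory. All names here (`build_report_command`, its parameters) are hypothetical, not existing ZARP code:

```python
from pathlib import Path
from typing import List

def build_report_command(snakefile: str, report_path: str) -> List[str]:
    """Assemble the `snakemake --report` invocation with absolute paths.

    Resolving paths inside the function removes the need for the user to
    supply correct relative paths; a caller can pass the result to
    subprocess.run() once the pipeline has finished.
    """
    return [
        "snakemake",
        "--snakefile", str(Path(snakefile).resolve()),
        "--report", str(Path(report_path).resolve()),
    ]
```

A corresponding function for the Slurm call could be built the same way, keeping both in one module so they can be reused or replaced later.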
> Create a function that makes the call to the snakemake report. Create a function that also takes care of the slurm call. Do this in the labkey to snakemake script and rename it, as it does more than that now.
Isn't this labkey-to-snakemake script executed before the pipeline runs/finishes? If so, this will not work, as the command-line call for snakemake to create a report has to be executed after the pipeline finishes.
Or maybe I don't get it... is this labkey-to-snakemake script gonna be a master script that both fetches data, then calls a function that executes the Snakemake pipeline, and then calls another function that executes the snakemake-report step?
I haven't thought about all of the implications yet, but I could imagine having a CLI executable zarp that serves as the single entry point for a user wanting to run their samples. However, one thing is clear to my mind, if we do have such a master script, apart from being well written over multiple modules, packaged as an executable and properly unit- and integration-tested, it would have to have a really intuitive API. Basically the following (or very similar) should all work:
```bash
# zarp local single-end library
zarp my_single_end_sample.fastq.gz

# zarp local paired-end library
zarp my_first_mate.fastq.gz,my_second_mate.fastq.gz

# zarp multiple local libraries
zarp my_single_end_sample.fastq.gz my_first_mate.fastq.gz,my_second_mate.fastq.gz

# zarp library stored on SRA
zarp SRR10690190

# zarp both local and remote (SRA) libraries
zarp my_single_end_sample.fastq.gz SRR10690190

# zarp sample with non-default public resources
zarp -g GRCh38.100 SRR10690190

# zarp multiple samples in batch mode
zarp -t my_sample_table

# zarp samples described in a LabKey table
zarp -t labkey:my_labkey_table

# zarp sample and automatically push results RO-Crate to Zenodo
zarp --publish my_single_end_sample.fastq.gz
```
To achieve this, the tool would need to:
- run interactive self-configuration (cluster config, default genome resources, any other sample-independent config; upon first use or upon explicit request with `zarp --configure`)
- run HTSinfer on samples (unless run in batch mode with all metadata complete)
- run ZARP pre-processing
- run ZARP
- run ZARP post-processing
- publish ZARP results (optional)
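The command-line surface sketched above could be captured with a standard `argparse` parser. This is only an illustration of the proposed API, not existing code; the option names mirror the examples given earlier:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the proposed `zarp` CLI entry point (hypothetical)."""
    parser = argparse.ArgumentParser(prog="zarp")
    parser.add_argument(
        "samples", nargs="*",
        help="local FASTQ paths (mate pairs comma-separated) or SRA run IDs")
    parser.add_argument(
        "-t", "--table",
        help="sample table; prefix with 'labkey:' for a LabKey table")
    parser.add_argument(
        "-g", "--genome",
        help="non-default public genome resources, e.g. GRCh38.100")
    parser.add_argument(
        "--configure", action="store_true",
        help="run interactive self-configuration")
    parser.add_argument(
        "--publish", action="store_true",
        help="push results RO-Crate to Zenodo")
    return parser
```

Keeping the parser in its own module would make it straightforward to unit-test and to package as a console entry point in a ZARP-cli repo.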
I think it's doable and worthwhile. But if we want to do it, I think we should think about structuring this first, probably in a separate repo ZARP-cli, that we could publish on PyPI and then import in the main ZARP repo. In this way, we would stay neutral to any other entry points to the pipeline (e.g., cloud execution).
I agree with most of the points, and I was going in this direction myself, though not as eloquently.
So in this case there should be options like 'slurm', 'kubernetes', or 'local', so that the appropriate cluster config, for example, is triggered with standard parameters? I still do not understand at which point the execution system would be selected. I believe this is worth the effort, but as you said, planning and thinking through the problems ahead is key. I think for now it is worth writing some modest functions and making a wrapper, and once this is out we can start the restructuring.
Hey @kanitz, revisiting this: should I create a bash script in the prepare-inputs script? How are people currently expected to trigger the workflow? Do you think it makes sense to add local and slurm options to the current input script and refactor this later?
Not sure what you mean by the "current input script", could you point to a file? I'm suffering from quite extreme context switches, sorry. I also don't have a clear idea how anyone is expected to run ZARP at the moment or what this issue is about in detail. My comment was really just a suggestion of how I could envision the usage in the mid- to long-term. And I agreed with your previous comment to write modest functions that however we should ideally still be able to at least partly reuse. The list in my comment gives some high-level idea on how we could structure this. We could try to start breaking these individual points up into individual manageable issues. I assume (though it's not very clear to me) that this issue is one such bite-sized work package.
Apart from this, for now I can only say that I would advise against writing Bash scripts and recommend writing proper Python code instead. In the worst case we could use subprocesses to execute system commands from Python, but I think Snakemake can also be imported and run directly, so whatever wrapper you have in mind would probably translate much better to a Python module, class, or perhaps just a function. It would be much easier to make use of that later on and, together with other such code pieces/scripts, it could be packaged in a way similar to how I outlined above.
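A minimal sketch of the subprocess route, with the command assembly kept in a pure function so it can be unit-tested without actually running the pipeline. Function names and the idea of selecting a Slurm setup via `--profile` are assumptions for illustration:

```python
import subprocess
from typing import List, Optional

def build_run_command(snakefile: str, cores: int = 4,
                      profile: Optional[str] = None) -> List[str]:
    """Assemble a snakemake invocation; `profile` could point to e.g. a
    Slurm profile directory, keeping cluster specifics out of this code."""
    cmd = ["snakemake", "--snakefile", snakefile, "--cores", str(cores)]
    if profile:
        cmd += ["--profile", profile]
    return cmd

def run_pipeline(snakefile: str, **kwargs) -> int:
    """Execute snakemake as a subprocess and return its exit code."""
    return subprocess.run(build_run_command(snakefile, **kwargs)).returncode
```

Separating command construction from execution also keeps the door open for later switching to Snakemake's Python API without changing callers.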
Hi @kanitz, sorry for the crude explanation. Let me rephrase:
Currently there is no bash script for triggering Snakemake (like those making the calls for test_integration_workflow). So I was wondering if it makes sense to write some functions that create such a bash script in prepare_inputs.py, where we create the samples table and the config. We could also add an argument taking the values local and slurm, upon which the corresponding bash script is created. Does this make sense?
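The idea above could be sketched as a small helper in prepare_inputs.py that writes a trigger script for the chosen mode. The templates and function name are hypothetical; real cluster flags would belong in a profile or config rather than hard-coded strings:

```python
from pathlib import Path

# Illustrative templates only; actual Slurm parameters would come from config.
_TEMPLATES = {
    "local": "#!/bin/bash\nsnakemake --cores {cores} --configfile {config}\n",
    "slurm": ("#!/bin/bash\nsnakemake --cores {cores} --configfile {config} "
              "--cluster sbatch --jobs {cores}\n"),
}

def write_run_script(mode: str, config: str, cores: int = 4,
                     out: str = "run_zarp.sh") -> Path:
    """Write a 'local' or 'slurm' trigger script next to the generated config."""
    if mode not in _TEMPLATES:
        raise ValueError(f"mode must be one of {sorted(_TEMPLATES)}")
    path = Path(out)
    path.write_text(_TEMPLATES[mode].format(cores=cores, config=config))
    path.chmod(0o755)  # make the script directly executable
    return path
```

This keeps the local/slurm switch as a single argument, which is easy to refactor later into the more general entry point discussed above.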