Deal with relative paths in input tables

We have now slightly changed the LabKey test table by changing the relative path to the test FASTQ files. If anyone gets errors related to this, let us know.

I'm not entirely sure I understand the issue completely. Guess the way I've addressed this for the CI is how it's currently being done in the test scripts. Basically, you cd into a defined directory (the test directory in that case) and then from there all relative paths are correct. Then upon finishing the test, cd back to the directory where the test script was called from. This way the tests can be called from any location and still relative paths can be used.
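A minimal sketch of the "cd in, run, cd back" approach described above, written in Python rather than as a shell test script. The name `working_directory` is hypothetical and not part of the repository:

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(path):
    """Temporarily change the current working directory to `path`."""
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        # Always return to the caller's directory, even if the body raises.
        os.chdir(previous)
```

Inside a `with working_directory(test_dir):` block, relative paths are interpreted relative to `test_dir`; afterwards the caller is back where the test was started from.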

Generally, I do believe of course that absolute paths in, e.g., the sample table should be supported. But so should relative paths. So any logic dealing with paths should consider this.

Yes, correct. My issue is: relative to what? I changed the Snakefile to run from the level below Rhea no matter where we trigger it from. So now the LabKey table that previously had a relative path in the form of "../inputs/" will have "tests/inputs". But the same issue arises for the Python script labkey_to_snakemake, which reads those files to determine read size. So I added one more variable called "testdir" so that I can always read the FASTQ files no matter where my test is relative to the Rhea level. Tell me if something sounds unreasonable. I am fixing it now.

Could I ask what the reason is for introducing this constraint? Is this convention? Cause it seems to me that it should be up to the person calling Snakemake (either directly or via a calling script) to determine how to supply arguments, e.g., where the Snakefile is and so on, not somehow in the Snakefile itself (and this is kinda what I understand you suggest doing).

This issue arises mainly because I make a call to a Python script within scripts, let's say. If the call to the Snakefile is made from any other level, the script cannot be found. So I thought I would create an option workdir within the config. If you have a better suggestion on that, then we do not need this. When one decides to provide a samples_table, I thought it was best to have a path relative to the Snakefile; otherwise, how can we make sure that the location from which Snakemake is called coincides with the relative paths provided in the samples table? That's why I considered this convention.

Yes, let's quickly step back and think through the possibilities and see what's perhaps most intuitive for relative paths.

Paths are relative to:

1. the location of the file in which they are specified
2. the location from which Snakemake is called (either the directory in which the calling user resides OR the directory in which the calling script resides)
3. a common fixed anchor, e.g., the project root or directory in which Snakefile resides (same as project root in our case)
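To make the three candidate reference points concrete, here is a rough sketch that resolves the same relative path against each of them. The function name `resolve_path`, its parameters and the anchor labels are assumptions for illustration only, and the example paths assume a POSIX file system:

```python
import os
from pathlib import Path


def resolve_path(path, *, table_file, call_dir, project_root, anchor="table"):
    """Resolve `path` against one of the three candidate reference points."""
    candidate = Path(path)
    if candidate.is_absolute():
        # Absolute paths need no reference point.
        return candidate
    bases = {
        "table": Path(table_file).parent,  # the file in which the path is specified
        "cwd": Path(call_dir),             # where Snakemake is called from
        "root": Path(project_root),        # fixed anchor, e.g. the Snakefile directory
    }
    # normpath collapses ".." segments without touching the file system.
    return Path(os.path.normpath(bases[anchor] / candidate))
```

With a sample table at `/proj/tests/t1/samples.tsv`, the entry `../inputs/a.fastq` under the "table" anchor and the entry `tests/inputs/a.fastq` under the "root" anchor (with project root `/proj`) both point at `/proj/tests/inputs/a.fastq`. Under the "cwd" anchor the result depends entirely on where the caller happens to be.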

Any other consistent options?

Yes. I thought that, instead of switching between those 3, we should start using one of them to make things easier moving on. I have made a few changes; maybe we can discuss more once you have looked at them in detail yourself.

I think the solutions differ in the required mental strain and knowledge of underlying assumptions by the user and developers in order to set up correctly.

In my opinion 1. is most intuitive for the user: if you fill out the sample table, it makes sense to specify paths relative to the position of the file you are just working on. On the other hand, option 3 (the one you are suggesting) seems to be easier to manage for us (as developers). We know that all relative paths always have to be interpreted relative to the Snakefile. That's easy enough to remember and we are few enough people to propagate and enforce this rule among ourselves, probably without major conflicts. Number 2 seems to be the worst option because it neither has the advantage of a fixed reference point (as has 3.), nor is it clear to users when they enter their paths.

My vote, in order:

Option 1 (because it's always better to make it easier for the user than for the developer)
Option 3 (not much behind, because end users would likely rather use absolute paths and fixing the reference point does make things a little easier for us)
.....
Option 2 (wouldn't really advise)

Other opinions? @katsanto @gypas @bakma @herrmchr @boersch

I thought we could have like a standard folder in Rhea where users could put their samples and thus make them only indicate names and no paths in general and we make a few decisions internally (like specifying a root dir, which someone could easily change if they understand the concept of paths)

I'm not sure... Relative paths are a convention that everyone knows about (including all shells), and I think we should not only allow users to make use of them but also make sure they can rely on them working as expected. The convention you suggest sounds convenient for many cases, but first it is still a burden on the user who needs to be aware of it, and secondly (and more importantly): if we do support relative paths, then how would you differentiate a sample name (as referring to a file in some standard sample directory) from a bare file name (without path) that would normally be interpreted by any shell as a file in the current directory?

I personally think that the organization of a sample folder is up to the user. He should be free to define a var SAMPLE_DIR, export it and write in the sample table something like ${SAMPLE_DIR}/sample_1.fastq.gz. This would give the user the convenience that you have in mind without us playing around with the way that paths are typically understood.
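If sample table entries may contain variables such as `${SAMPLE_DIR}`, the code reading the table has to expand them. A small sketch using the standard library's `os.path.expandvars`, where the helper name `expand_sample_path` is hypothetical:

```python
import os


def expand_sample_path(raw_path):
    """Expand $VAR / ${VAR} references and a leading ~ in a sample table entry."""
    expanded = os.path.expanduser(os.path.expandvars(raw_path))
    if "$" in expanded:
        # expandvars leaves unknown variables untouched, so flag them early.
        # Assumption: real sample paths never contain a literal "$".
        raise ValueError(f"Unresolved variable in path: {raw_path!r}")
    return expanded
```

After `export SAMPLE_DIR=/data/samples`, the entry `${SAMPLE_DIR}/sample_1.fastq.gz` would expand to `/data/samples/sample_1.fastq.gz`, while ordinary relative and absolute paths pass through unchanged.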

EDIT: We could of course implement a fallback system that checks in some (optional) pre-defined sample directory if it doesn't find, say, sample_1.fastq.gz in the current working directory (when interpreted as a relative path). That would offer some built-in, easily configurable support for a common sample directory out of the box. However, I'm not sure whether this isn't "feature creep" as it's so easy to set up yourself. For sure I don't think it's something that we should implement without anyone specifically asking for it.
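For reference, the fallback idea could look roughly like the sketch below. This is purely illustrative and not something the pipeline does; `find_sample` and its parameters are made-up names:

```python
from pathlib import Path


def find_sample(name, fallback_dir=None, cwd=None):
    """Look for `name` as an ordinary relative path first, then in `fallback_dir`."""
    base = Path(cwd) if cwd is not None else Path.cwd()
    primary = base / name  # the usual interpretation of a relative path
    if primary.exists():
        return primary
    if fallback_dir is not None:
        secondary = Path(fallback_dir) / name  # optional pre-defined sample directory
        if secondary.exists():
            return secondary
    raise FileNotFoundError(name)
```

The ordinary interpretation always wins, so a file in the current working directory is never shadowed by one of the same name in the sample directory.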

Ok, then should I create a Docker image for the Python script that is run within the Snakefile, or is there a way to anchor the scripts folder to the Snakefile so that wherever I make a call to the Snakefile from I can still see the scripts folder?

Sorry, I don't understand where the Docker image comes into play. It's a bit too abstract for me. Could you perhaps try to distill your thought a bit more? :) Perhaps something like:

"How can we know the (user's current working directory / directory of the Snakefile / directory of the sample table) from inside the Snakemake run?"

Is (any of) that your problem?

When I run a rule in Snakemake I use the relative path scripts, which is at the same level as the Snakefile, as generally suggested. The suggestion with Snakemake is also to run snakemake from the same level as the Snakefile. Now, when we test, we make a call like ../../Snakefile. This means that the workdir understood by the Snakefile changes each time there is a call from a different level. So now the scripts path is not going to be recognised anymore, because the workdir of Snakemake is tests/test_..

I thought it was best to always consider as workdir the level below Rhea. To do this we either:

go to the level below Rhea to make the call to the Snakefile, or
set the workdir in the config as the level below Rhea, and make all the other assumptions, or
never use scripts, which means we should containerize every little thing.
Is there another option perhaps?
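One further option is to anchor the scripts folder to the directory of the Snakefile instead of to the working directory. As far as I know, Snakemake exposes that directory inside a Snakefile as `workflow.basedir`, so the hypothetical helper below could be called as `script_path(workflow.basedir, "labkey_to_snakemake.py")`. The helper name, the `.py` file name and the default `scripts` subfolder are assumptions for illustration:

```python
from pathlib import Path


def script_path(snakefile_dir, script_name, scripts_subdir="scripts"):
    """Build an absolute path to a helper script located next to the Snakefile."""
    path = Path(snakefile_dir).resolve() / scripts_subdir / script_name
    if not path.is_file():
        # Fail early with the full path, which makes misplaced scripts easy to spot.
        raise FileNotFoundError(f"Script not found: {path}")
    return path
```

Because the returned path is absolute, it stays valid whether Snakemake is called from the project root or from somewhere under tests/.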

Not sure whether this is relevant to the discussion, but just to let you know how it is implemented in the PolyASite pipeline:
In the config file we specify among others

scripts_dir
samples_dir
annotation_dir ...
and in the Snakefile we just have to refer to config['scripts_dir'] if we want to call a script. According to the Snakemake recommendations on organizing projects, this scripts directory should reside in workflows/scripts, relative to the Snakefile. Also see our Zavolab snakemake cookiecutter.
In the config I can then set either absolute paths (as I do when testing and if I have my annotations stored somewhere already), or relative paths, which I then define relative to the location where I call the pipeline (which is ideally where the Snakefile is, because otherwise I get confused ;).

Thanks a lot @herrmchr, that's good to know! :)

Important point about workflows/scripts/, I forgot that. I think we should follow those recommendations for any scripts that are actually part of the pipeline, but not scripts that run before or after the pipeline (such as the one that processes LabKey tables to generate Snakemake inputs). I've intentionally put these scripts in the root folder scripts/ when I set up our repo in the structure recommended by Snakemake (as cited by @herrmchr), because they are not part of the workflow.

And note that those scripts run inside the pipeline (i.e., the ones that should go in workflows/scripts/) should be avoided as much as possible and are only acceptable for cases where some logic is needed to tie the pipeline together. They should be all Python and should in principle be replaceable by a "run rule". They are not meant for any general heavy lifting/computation because all of these "usual rules" should run inside containers and any related scripts (such as e.g. the TIN score calculation) should go inside their own repos. So please don't use this as a dump to offload any scripts to avoid setting up Dockerfiles and repos ;-)

As for using config variables: fine with me to define these internally if it makes things easier (usually it's nice to have a single file with all config/hardcoded things of course). However, as we discussed with @katsanto yesterday, consider that things like script dirs (whether scripts or workflows/scripts) are not to be configured by users. Or anyone really (which is why I'd be okay with keeping them hardcoded). There is no use case for a stand-in replacement for these scripts (let alone entire folders), so it's probably not a good idea to mix their configuration with that of other parameters that are configurable by the user in the same file, because the availability of too many options is just a source of errors, confusion and frustration from a usability point of view.

As for your handling of relative paths: if you tell me that your config file is also in the same location as the Snakefile, then you are basically following both options 1. and 2. above and usually, when executing inside the "Snakefile folder", option 3. as well. That's fine of course, but I'd probably still implement option 1 for reasons described above (and discussed on the phone with @katsanto). It's just the most intuitive when actually writing a config file, because it's what the user would most likely expect without any additional knowledge. And all of options 1. through 3. are easy to implement technically.
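As a rough illustration of what option 1 could mean for a sample table, the sketch below rewrites relative entries so that they are interpreted relative to the table's own location. The tab-separated layout, the column name `fq1` and the function name `load_sample_table` are assumptions, not the actual table format:

```python
import csv
import os
from pathlib import Path


def load_sample_table(table_file, path_columns=("fq1",)):
    """Read a tab-separated table, resolving relative paths against the table's directory."""
    table_dir = Path(table_file).resolve().parent
    rows = []
    with open(table_file, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            for column in path_columns:
                value = row.get(column)
                if value and not os.path.isabs(value):
                    # Anchor at the table's location, not at the working directory.
                    row[column] = os.path.normpath(table_dir / value)
            rows.append(row)
    return rows
```

Absolute paths are left untouched, so both styles can coexist in the same table.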

mentioned in merge request !61 (closed)

mentioned in merge request !62 (merged)

changed title from Local paths in labkey table in the test_scripts_labkey_to_snakemake_api and test_scripts_labkey_to_snakemake to Deal with relative paths in input tables

assigned to @kanitz

mentioned in commit fc53c215

closed via merge request !62 (merged)
