Puuuuuhhhh... PLEASE promise me not to file a merge request for this anytime soon. Like maybe when we're back in the offices and social distancing is a thing of the past?
Well, that's what I counted on when I wrote that. I'm just utterly unconvinced that Rhea should have any kind of group comparisons. It sounds way too much like one-tool-to-rule-them-all, Swiss army knife kind-of-thing. I'm a modular guy and it's clearly a separate pipeline to me. I'm not even too happy with the rules we have now that summarize things over several samples in the same pipeline (don't wanna get into this now, it's not that big of a deal). I'm afraid it's gonna be a maintenance and usability nightmare (as if it isn't already hard enough to prepare inputs for Rhea as it is now!).
What's wrong with having a separate pipeline in a separate project (ideally after we have finished up and submitted Rhea)? It would just take a count table and a design file, simple enough API. If you absolutely must, you could optionally pass it a Rhea output directory instead of the count table (and then write a wrapper that starts Rhea and that new DE pipeline successively).
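To make the "count table and a design file" interface a bit more concrete, here is a minimal sketch of the input check such a separate pipeline could start with. The file layouts are assumptions made for illustration only: a tab-separated count table whose first column holds feature IDs and whose remaining columns are samples, and a tab-separated design file with the hypothetical columns `sample` and `group`. Nothing here is part of Rhea.

```python
import csv
import io


def read_design(handle):
    """Return {sample: group} from a TSV with 'sample' and 'group' columns (assumed layout)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: row["group"] for row in reader}


def read_count_samples(handle):
    """Return the sample names from the count table header (first column = feature ID)."""
    header = next(csv.reader(handle, delimiter="\t"))
    return header[1:]


def check_inputs(count_handle, design_handle):
    """Check that every sample in the design has a count column; return {group: [samples]}."""
    design = read_design(design_handle)
    samples = read_count_samples(count_handle)
    missing = sorted(set(design) - set(samples))
    if missing:
        raise ValueError(f"samples in design but not in count table: {missing}")
    groups = {}
    for sample, group in design.items():
        groups.setdefault(group, []).append(sample)
    return groups


if __name__ == "__main__":
    # Toy in-memory inputs; real inputs would be files on disk.
    counts = io.StringIO("gene\ts1\ts2\ts3\ng1\t10\t12\t3\n")
    design = io.StringIO("sample\tgroup\ns1\tctrl\ns2\tctrl\ns3\tkd\n")
    print(check_inputs(counts, design))
```

The point is only that the handshake between the two pipelines can be this small: two plain-text files and one consistency check.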
Yes, I understand modularity, and for sure the goal is not to develop a pipeline to rule them all. The goal is to have a meaningful, easy-to-use (and easy-to-maintain) pipeline with well-documented steps that will allow people to get a good overview of their experiment. With that in mind (and as an end user who plans to use the pipeline on a daily basis): you normally design an experiment with some comparisons in mind that you want to perform. This means that in most cases you have to perform some DE analysis, have a look at how the data look in a PCA or heatmap plot, and generate summary tables that you can load later for further analysis. This is, I think, the most common use case, where people go from raw data to a list of genes to validate in the wet lab. This is why it would be nice to have some additional (optional) steps for those (like me) who want to run the analysis with one click once they have properly defined the necessary config files.
On the other hand, I also find it useful to have a separate pipeline for other comparisons (DE at the gene/isoform level) that relies on the output of one or several runs of Rhea. For example, you might want to compare the data from an experiment you performed with a published experiment. This means that you run Rhea for each of the datasets and then use the new pipeline to perform the comparisons you want. In such cases you have to consider batch effects, so this is something that a more specialized DE analysis pipeline has to focus on.
With this in mind, I propose to do both. First, finish Rhea with the simple DE steps (it's just 3-4 new rules, and I can be responsible for maintaining the pipeline in the future), and then develop a second, more specialized pipeline that can perform more advanced comparisons (or just the simple ones). I do not see any big problem with having some overlap between the pipelines. The first pipeline gives you some general-purpose results, while the second one allows you to make comparisons that you did not plan in advance. For example, you might want to use only high-quality datasets from a public experiment.
I think that in general we need a collection of tools that helps us analyze data faster, lets us and others use our tools (which increases the number of citations), and makes it easier to write materials and methods or supplementary sections. It's much more convenient to say "I ran Rhea (version XYZ) with these options" than to describe all the tools one by one, again and again. This is also one of the reasons I want to publish the pipeline. We have the know-how and we know what needs to be done, so it would be sad not to take advantage of the opportunity we have here. What we need to do is advertise our tools more. I think I can advertise our pipelines to a large number of labs. I think Alex @kanitz can advertise them in the ELIXIR/GA4GH community and push them as the example/default workflows in the different cloud or cluster solutions, while Mihaela @zavolan can advertise them more in the RNA world.
I went a bit off topic, but this is how I see things. See you tomorrow.
Well, Rhea is not easy to use at the moment, so let's talk about adding more complexity to running it (i.e., adding a design file) after we have actually reached a point where it is in fact easy to use. Maintenance is also already an issue, and I think there are more fundamental rules to be added that deal with analyzing individual samples (e.g., UMI support) or multiple samples (e.g., PCA) without going into comparing groups of samples. So just adding another 3 or 4 rules is not exactly making it easier to maintain.
That being said - I totally get your point about having to run just one thing to get the most common results. But I repeat (paraphrased): what's wrong with having a separate pipeline for sample group comparisons and writing a simple wrapper executable that starts one after the other? It forces us to separate concerns and write clean APIs for the handshake between the two (or three, IMO) pipelines, it increases usage flexibility, and it avoids an unmaintainable "God object"-like situation. And it can still be made as convenient as (in fact more convenient than) running Snakemake directly - after all, we are already writing wrappers for basically all Snakemake pipelines.
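To illustrate what I mean by a simple wrapper, here is a minimal sketch, assuming each pipeline can be started as a single command. The stage names are hypothetical and the demo commands are trivial placeholders; in practice each command would be whatever invocation starts the respective pipeline.

```python
import subprocess
import sys


def run_stages(stages):
    """Run (name, command) pairs in order; stop at the first stage that fails.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, command in stages:
        result = subprocess.run(command)
        if result.returncode != 0:
            raise RuntimeError(
                f"stage '{name}' failed with exit code {result.returncode}"
            )
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder commands standing in for the real pipeline invocations.
    stages = [
        ("per-sample-pipeline", [sys.executable, "-c", "print('per-sample analysis done')"]),
        ("group-comparison-pipeline", [sys.executable, "-c", "print('group comparison done')"]),
    ]
    print(run_stages(stages))
```

Because the second stage only starts when the first one exits cleanly, the only contract between the pipelines is the set of files the first one leaves behind, which is exactly the clean handshake I am arguing for.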
If it's about having these pipelines under the same namespace (Rhea) or even inside the same repo, that's a whole different thing and I think we are much more likely to find an agreement. And if we can manage to write the wrapper in Snakemake (if that's your concern), that's also fine.
Differential expression analysis and any other analysis steps that rely on "merging" samples into groups are outside the scope of this workflow. Development of a separate workflow should be considered.