OK, while I agree with most of the points, please be aware that the cleanup potential is limited for this rule. I can make the code more readable, but I cannot get around the fact that we are parsing logs with their own fixed structure and operating on a given directory tree of results.
This refers to the exact line you mention: the modifications to the log files are our own adjustments so that MultiQC parses the sample names nicely. They are required as a legacy of the file naming and directory structure we chose for the results. If we ever decide to change that design and restructure anything earlier in the pipeline, this rule will have to be adjusted accordingly, so it is highly variable and dependent on all the previous steps. That is why there is no point investing a lot of time into it now: we would end up reworking all the wiring with every minor version of the software, and that is just not productive.
Maybe it will be easier to explain why all these modifications are needed with a concrete example. All of this is a legacy of the design we have had from the beginning. Please take a look at the output field of the rule `pe_remove_adapters_cutadapt`:
And now, since all logs and output files have suffixes like `remove_adapters` or `remove_polya` incorporated directly into the file names, MultiQC recognises them as different samples rather than the same sample at different processing steps. Therefore I have to step in and carry out these modifications so that, in the end, MultiQC actually understands that they are the same sample.
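To illustrate the kind of adjustment this rule performs (the filenames and suffix list below are made up for the example, not the pipeline's actual outputs), a minimal sketch of stripping step suffixes so MultiQC derives the same sample name from logs of different steps:

```python
import re

# Hypothetical processing-step suffixes baked into the filenames;
# the real pipeline has its own naming scheme.
STEP_SUFFIXES = ["remove_adapters", "remove_polya"]

def clean_sample_name(filename: str) -> str:
    """Strip step suffixes so MultiQC sees one sample, not several."""
    name = filename
    for suffix in STEP_SUFFIXES:
        # Remove the suffix together with its leading '.' or '_' separator.
        name = re.sub(rf"[._]{suffix}", "", name)
    return name

# Without this cleaning, MultiQC would treat these two logs as
# two different samples instead of two steps of the same sample:
print(clean_sample_name("sample_A.remove_adapters.log"))  # sample_A.log
print(clean_sample_name("sample_A.remove_polya.log"))     # sample_A.log
```

MultiQC does ship its own filename-cleaning mechanism (the `fn_clean_exts` / `extra_fn_clean_exts` config options), so depending on the naming scheme, part of this could potentially be pushed into the MultiQC config instead of a rule.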
I presume that if we had a directory structure with files following a logic like this:
Then the filenames would be parsed properly, the logs would contain these filenames, and I guess there would be no confusion.
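For example (a purely hypothetical layout, not the pipeline's actual structure), something like:

```
results/sample_A/remove_adapters/sample_A.fastq.gz
results/sample_A/remove_polya/sample_A.fastq.gz
```

where the processing step lives in the directory name and the filename stays constant, so MultiQC would infer the same sample name at every step without any post-hoc renaming.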
I was actually thinking about adjusting the whole structure to match something like the above, but I came to the conclusion that it would be a very poor design: to have the whole pipeline follow one logic and structure and then, at the very last minute, after all the analyses are done, inject a rule that reorganizes the entire results directory, adjusts the names of the output files, et cetera. Not to mention that we would have to parse all the standard output and error streams and adjust the names in there accordingly, to avoid discrepancies between the log contents and what is actually in the directories. A very poor choice; we should either follow one logic from start to end (with minor modifications for parsing, if required) or rewrite the logic from scratch, at the level of each rule. I think that, at least for now, the former is much better.
So this rule, `prepare_files_for_report`, is a little "hacky" by nature, and I don't think we will ever make it perfectly clean. IMO we just have to make it good enough.