For example, samples downloaded from SRA can be divided into several runs (i.e. SRR identifiers).
The user wants to analyse the sample as a whole.
Therefore, the runs from this sample need to be concatenated before running Rhea.
How and when should the FASTQ files be concatenated?
Possible solutions:
The user concatenates the runs themselves (see the sketch below).
The files are named like SAMPLE01_001.fastq.gz, SAMPLE01_002.fastq.gz, etc.
Each FASTQ file associated with the sample is entered in samples.tsv.
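A minimal sketch of what that per-sample concatenation boils down to, in Python and with hypothetical file names following the pattern above; gzip members can simply be appended to one another, so merging the runs is a byte-level copy:

```python
import shutil

# Hypothetical per-run files for one sample; gzip allows simple member
# concatenation, so appending the compressed files as-is is enough.
run_files = ["SAMPLE01_001.fastq.gz", "SAMPLE01_002.fastq.gz"]

with open("SAMPLE01.fastq.gz", "wb") as merged:
    for run_file in run_files:
        with open(run_file, "rb") as handle:
            shutil.copyfileobj(handle, merged)
```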
@gypas If it were just about running your own samples, I would probably agree with you. However, if we want users to be able to run samples (more or less) straight from SRA, it becomes an issue that will complicate the workflow considerably: download sample table, concatenate samples, then enter things manually again into our complicated and error-prone format :-/ Anyway, it shouldn't be hard to implement this in some satisfactory way.
You might wonder why I create 'ugly' bash scripts in the middle:
In that pipeline the user provides just the record ID and the sample names, and the data are downloaded and merged irrespective of lanes. Therefore, since the user (and, following that, Snakemake) does not possess the information about potential lanes per sample, the whole graph is built around sample IDs. That is why I need to create small bash scripts in the middle of the workflow that download and merge the per-lane files.
We should devise a simple mechanism that optionally allows the user to concatenate samples. Merging all SRR entries of an SRX may or may not be desired.
Also, if I recall correctly, there are cases where there are multiple .sra files for the same SRR entry. In these cases it may actually be appropriate to concatenate by default, because I think these result from files surpassing some fixed size limit for individual files (2 Gb for original FASTQ files, I think), so they result from the same sample/lane/run and the splitting of the data is arbitrary.
Thanks @burri0000, very useful link. Posting the entire section on RUNS as well as the one on EXPERIMENTS:
EXPERIMENT
Each SRA EXPERIMENT (SRA accession SRX#) is a unique sequencing result for a specific sample.
Example
Six sequencing libraries were prepared from a single biological sample. Three were single-end libraries, and three paired-end, although the paired-end libraries were sequenced using both paired and unidirectional sequencing. Two of the single-end libraries were treated using a targeted selection approach for some runs. Libraries were sequenced on two different instruments at three sequencing labs. In all there are 13 different combinations of library + sequencing strategy + layout + instrument model. Each combination represents a unique EXPERIMENT.
What an example?! Can anyone tell me how they got to 13??? Anyway, to continue:
Additional information may be included in the EXPERIMENT. For example, you should differentiate biological replicates using EXPERIMENTs if sequencing results were obtained separately from each animal in a group of otherwise identical animals (treated, non-treated, healthy, infected, etc.). In that case, the above EXPERIMENTs may be represented by a combination of replicate number + library + sequencing strategy + layout + instrument model.
An SRA EXPERIMENT is the main publishable unit in the SRA database.
Most of the descriptive information captured at the level of the SRA EXPERIMENT will be displayed in the public record.
RUN
SRA RUN is simply a manifest of data file(s) that are derived from sequencing a library described by the associated EXPERIMENT.
When submitting in the SRA Submission Portal wizard, submitters only provide types and names for the sequence data files that they will be uploading.
Paired-end data files (forward/reverse) must be listed together in the same RUN in order for the two files to be correctly processed as paired-end.
All data files listed in a RUN will be merged into a single .sra archive file. Therefore, files from different samples or experiments should not be grouped in the same RUN.
So yes, multiple .sra files SHOULD not happen anymore. On the other hand, I've seen it happen, so to avoid any bugs, if possible, we should probably write code that expects a list of .sra files for a single SRR.
Also, as @burri0000 has already said, in principle runs should match 1:1 to experiments, which should be a unique combination of library + sequencing strategy + layout + instrument model, which would suggest that the default mode for multiple RUNs per EXPERIMENT is to concatenate them. However, I think we should NOT automatically concatenate them, because the SRA does not enforce these rules (at least they did not use to), so whatever they describe here is really just how things would be in an ideal world where manuals are infallible from Day One, and everyone would not only RTFM, but also do what it says. Since that's not the case, there will be plenty of cases where people may not want to auto-concatenate multiple runs of an experiment.
See below how the PolyASite pipeline does it. The PolyASite design file contains SRP, SRX and SRR for each sample, but I got the information from SraRunTables mostly anyways. But of course there's an ugly shell call...
```python
rule change_fastq_to_fq_se:
    ##LOCAL##
    # ATTENTION: For some samples, multiple SRR run files (aka fastq files) belong to
    # a single sample. If this is the case, they are concatenated here
    # before the softlink is established
    ## No Singularity support required ##
    input:
        sample_fq = lambda wildcards: expand(
            os.path.join(
                config["samples_dir"],
                wildcards.sample,
                "{srr}.fastq.gz"),
            srr = samples.loc[wildcards.sample, "SRR"].split(","))
    output:
        sample_fq = os.path.join(
            config["samples_dir"],
            "{sample}",
            "{sample}.fq.gz")
    params:
        file_dir = os.path.join(config["samples_dir"], "{sample}"),
        sample_srr = lambda wildcards: samples.loc[wildcards.sample, "SRR"],
        first_srr = lambda wildcards: samples.loc[wildcards.sample, "SRR"].split(",")[0],
        sample_id = "{sample}"
    shell:
        '''
        cd {params.file_dir}
        IFS=',' read -ra SRR <<< "{params.sample_srr}"
        if [[ "${{#SRR[@]}}" > "1" ]]; then
            first_file="${{SRR[0]}}.fastq.gz"
            for i in $(seq 1 $((${{#SRR[@]}}-1))); do
                curr_file="${{SRR[$i]}}.fastq.gz"
                cat ${{curr_file}} >> ${{first_file}}
            done
        fi
        ln -fs {params.first_srr}.fastq.gz {params.sample_id}.fq.gz
        cd -
        '''
```
Thanks! :) Not a problem with a shell solution in a case like this. A lot saner than writing a Python script that needs to be maintained and anyway uses a system library that eventually ends up doing the same thing that the shell calls do. Possibly inside a 300 Mb Docker image.
I do think this can probably be simplified a little though... :)
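For illustration only, a sketch (not the actual pipeline code) of a slightly leaner variant: instead of appending the remaining runs to the first run's file and then symlinking it, the merged file could be written directly, since gzip members concatenate as-is. It reuses the config keys and samples table from the rule above.

```python
rule concat_sample_fastq_se:
    # Sketch only: merge all SRR FASTQ files belonging to one sample
    # into a single per-sample file, instead of the append + symlink dance.
    input:
        sample_fq = lambda wildcards: expand(
            os.path.join(config["samples_dir"], wildcards.sample, "{srr}.fastq.gz"),
            srr = samples.loc[wildcards.sample, "SRR"].split(","))
    output:
        sample_fq = os.path.join(config["samples_dir"], "{sample}", "{sample}.fq.gz")
    shell:
        "cat {input.sample_fq} > {output.sample_fq}"
```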
@herrmchr and @devagy74 it would be nice if you could comment on what the issue currently is and how you solved it before closing, since this issue has expanded and is hard to follow.
A subworkflow to concatenate samples and write a new samples.tsv in which each sample name appears only once. The issue here is that Snakemake doesn't know about the new samples table, because subworkflow outputs have to be used as rule inputs in order to be considered. Just reading the samples table as we do is not considered by Snakemake in the dependencies, and it will fail because the file is not present from the beginning.
Keeping the original samples table and, if duplicate rows occur, using only the info from one row; if there is a single row, taking that info directly. Here we ran into strange issues where we couldn't treat the one-value and two-value outputs in the same way.
Example:
samples_table.loc["s1","index"].tolist()
for duplicate rows: [75, 75] (type list)
for a single row: 75 (type int, can't be indexed)
@devagy74 and I experienced similar problems with pandas objects independently of each other.
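A small pandas sketch of the behaviour described above, with a hypothetical table and a hypothetical helper that normalises the result to a list in both cases:

```python
import pandas as pd

# Hypothetical samples table: "s1" occurs twice (two lanes), "s2" once.
samples_table = pd.DataFrame(
    {"index": [75, 75, 42]},
    index=["s1", "s1", "s2"])

def get_values(sample, column):
    """Always return a list, whether .loc yields a scalar or a Series."""
    values = samples_table.loc[sample, column]
    if isinstance(values, pd.Series):
        return values.tolist()
    return [values]

get_values("s1", "index")  # [75, 75]
get_values("s2", "index")  # [42]
```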
Notes:
For concatenating the samples with bash, a string (not a list) of comma-separated input filenames is needed
pandas .unique() might come in handy when expanding on the samples table in rule finish, as duplicates would just be ignored. I was working on that in the beginning but stopped because it was decided to use subworkflows. (A rough sketch of both notes follows below.)
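A rough sketch of both notes, again with a hypothetical per-run samples table (one row per SRR); names and accessions are made up:

```python
import pandas as pd

# Hypothetical samples table with one row per SRR run (two lanes for "s1").
samples = pd.DataFrame(
    {"sample": ["s1", "s1", "s2"],
     "SRR": ["SRR0000001", "SRR0000002", "SRR0000003"]}).set_index("sample")

# Note 1: bash-side concatenation wants a comma-separated string, not a list.
# (For a single-row sample, .loc returns a scalar string, so the normalisation
# from the sketch above would still be needed first.)
srr_string = ",".join(samples.loc["s1", "SRR"])   # "SRR0000001,SRR0000002"

# Note 2: deduplicated sample names for expanding targets in rule finish.
unique_samples = samples.index.unique().tolist()  # ["s1", "s2"]
```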
The concatenation does not work. I tested the pipeline with single-end samples, where each sample was sequenced in two lanes (the sample table, samples.tsv, is attached). I got the following error message when I triggered the pipeline:
```
IndexError in line 20 of /scicore/home/zavolan/boersch/GroupProjects/zarp/Snakefile:0
  File "/scicore/home/zavolan/boersch/GroupProjects/zarp/Snakefile", line 583, in <module>
  File "/scicore/home/zavolan/boersch/GroupProjects/zarp/Snakefile", line 583, in <listcomp>
  File "/scicore/home/zavolan/boersch/GroupProjects/zarp/Snakefile", line 20, in get_sample
  File "/scicore/home/zavolan/boersch/Software/miniconda3/envs/zarp/lib/python3.7/site-packages/pandas/core/series.py", line 1071, in __getitem__
  File "/scicore/home/zavolan/boersch/Software/miniconda3/envs/zarp/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4750, in get_value
```
I modified the sample table by removing the second lane for each sample and the run was clean.
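Not the actual Snakefile code, but a quick check (assuming the table is tab-separated and read with the sample name as index) that would show whether duplicated sample names are what trips get_sample here:

```python
import pandas as pd

# Hypothetical check, reading the table the way the workflow does
# (tab-separated, sample name as index column).
samples = pd.read_csv("samples.tsv", sep="\t", index_col="sample")

# Sample names occurring on more than one line (e.g. one line per lane) make
# .loc return a Series instead of a scalar, which a helper like get_sample
# may not expect.
duplicated = samples.index[samples.index.duplicated()].unique().tolist()
print("Duplicated sample names:", duplicated)
```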
Ok, I tried to reproduce the error and I could not get the same error. One thing I noticed, though, when trying to run this example is that two additional bindings are needed: /scicore/home/zavolan/GROUP/Genomes,/scicore/projects/openbis/userstore/biozentrum_zavolan/
Also make sure once more that you have the correct zarp env. Please tell me if any of these fixes the errors.
Ok, so in an effort to reproduce the error, I cloned the dev branch. I activated the environment from install/environment.yml. I put the table you provided in place as samples_multiple_lanes.tsv and then triggered test_integration_workflow_multiple_lanes, which tests the new approach, with these two additional bindings in the bind options: /scicore/home/zavolan/GROUP/Genomes,/scicore/projects/openbis/userstore/biozentrum_zavolan/
This was running successfully. I did not want to wait until the end, so I killed the process. From the error above I can only assume that the pipeline does not even start? Did you run snakemake -np --verbose so that maybe we get more insight into why this is happening?
I did not try to run the tests; I was running real samples, and the error is reproducible in my case (I checked one more time). I used the following commands to work only on the environment; all other files remained unchanged:
```
(zarp) [boersch@worker03 gcn4_project_multilane]$ bash 02_run.sh
IndexError in line 20 of /scicore/home/zavolan/boersch/GroupProjects/zarp/Snakefile:0
  File "/scicore/home/zavolan/boersch/GroupProjects/zarp/Snakefile", line 583, in <module>
  File "/scicore/home/zavolan/boersch/GroupProjects/zarp/Snakefile", line 583, in <listcomp>
  File "/scicore/home/zavolan/boersch/GroupProjects/zarp/Snakefile", line 20, in get_sample
  File "/scicore/home/zavolan/boersch/Software/miniconda3/envs/zarp/lib/python3.7/site-packages/pandas/core/series.py", line 1071, in __getitem__
  File "/scicore/home/zavolan/boersch/Software/miniconda3/envs/zarp/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4750, in get_value
```
And trigger the pipeline again, which results in a clean run:
```
(zarp) [boersch@worker03 gcn4_project_multilane]$ bash 02_run.sh
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 256
Job counts:
...
```
It seems there is indeed some inconsistency in the env dependencies. In the future it is helpful to run the specific tests and point to the error, since these are easier to reproduce; additionally, we need to create an extra test if none of the existing ones captures the error, so that we can avoid similar errors in future versions. I am working on this for now. I will soon push the fix.
I believe you; as I said in Slack, I haven't tested with your new yeast samples table. The question now is whether the test we are using for merge_multiple_lanes does (not) reflect reality, and whether we should change the test setup/files/samples_table. Can you pin down the difference between your real-world example and the test?
@herrmchr if you substitute the samples.mulitple_lanes.tsv with the tsv attached above, does test_integration_workflow_multiple_lanes work? Using the non-dev zarp conda env, the part of the test that requires bedtools will fail, but do you get an error in the Snakemake part? I do not get an error like the one @boersch does. Maybe you can see if you can reproduce this.
@katsanto Sure, I decided to attach all the files that I used for running. The files cluster.json, config.yaml and samples.tsv were located in a directory called config. The config directory was located in the same directory as 02_run.sh, which I triggered.
It is wrong... this should not be there at all. There are other versions of pandas and biopython, so things get messed up. Also, within the lab we should always use the dev conda env, as it contains the labkey dependency that we need. Please modify this in the tutorial you prepared. First see if this works for you; for me it is already working because I did not do the pip install step at all.