Add rule for ALFA
-
Investigate ALFA (https://github.com/biocompibens/ALFA)
- to use for creating summaries of sample content
- whether it makes sense to use it in combination with STAR
-
Add Snakemake rule for ALFA
- Run alfa for all the samples together and not for each of the samples separetely.
-
Add tests for ALFA rule
- create test sample table (e.g. for parts of chr21.bam) with sample info
- create test case
- write bash script that executes Snakemake with the relevant settings
-
add script to
.gitlab-ci.yml
- Show closed items
Activity
-
Newest first Oldest first
-
Show all activity Show comments only Show history only
- Maintainer
Multimapping reads
Use the
quick_start
mapped reads to investigate counting behaviour of multi-mapping reads. These reads are annotated in two ways: i) FLAG 256 for the secondary alignment and ii) NH tag for the number of occurrences of this read in the alignment file.adjust
quick_start.bam
create file with same number of aligned reads, but two reads are multimapping to two different locations (from intergenic to 5UTR and from CDS to 3UTR) and are annotated with the above tags. Resulting
multimapping.sam
:@HD VN:1.4 SO:coordinate @SQ SN:Chr1 LN:1250 @PG ID:STAR PN:STAR VN:STAR_2.4.2a_modified CL:condor_exec.exe --runThreadN 10 --genomeDir /projects/merip/Mathieu/Mapping/TAIR10_STAR_index_with_GTF --readFilesIn /projects/merip/Mathieu/Cutadapt/Min_48nt/241.trim.fastq --outFileNamePrefix 241. --outSAMtype BAM SortedByCoordinate --outFilterMultimapNmax 1 --outFilterMismatchNoverLmax 0.06 --alignIntronMax 10000 @CO user command line: condor_exec.exe --readFilesIn /projects/merip/Mathieu/Cutadapt/Min_48nt/241.trim.fastq --outFileNamePrefix 241. --runThreadN 10 --genomeDir /projects/merip/Mathieu/Mapping/TAIR10_STAR_index_with_GTF --outFilterMismatchNoverLmax 0.06 --outFilterMultimapNmax 1 --alignIntronMax 10000 --outSAMtype BAM SortedByCoordinate HWI-1KL110:162:C7268ACXX:8:1105:18693:99774 0 Chr1 150 255 50M * 0 0 ATACCAAACCAGAGAAAACTAATACATAATCGCAGAAATACAGATTACGG @CCFFFFFHHHHHJJJJJJ+CGHIJJJJJJIJJJIJJJJJJJJJJJIJJJ NH:i:2 HI:i:1 AS:i:43 nM:i:2 HWI-1KL110:162:C7268ACXX:8:1105:18693:99774 256 Chr1 300 255 50M * 0 0 ATACCAAACCAGAGAAAACTAATACATAATCGCAGAAATACAGATTACGG @CCFFFFFHHHHHJJJJJJ+CGHIJJJJJJIJJJIJJJJJJJJJJJIJJJ NH:i:2 HI:i:1 AS:i:43 nM:i:2 HWI-1KL110:162:C7268ACXX:8:1206:17943:50608 0 Chr1 300 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJIJJJJJJJIIJJIJJJJJIJJJJIJJJJIJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1215:9399:59794 0 Chr1 500 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG @<@DDBDDFBDHHIEHGGEHICHIIIGHHIEG8DG@?FDEGGFHIGHGJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:2101:19036:57639 0 Chr1 500 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:2207:19417:41334 0 Chr1 500 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJIJJJJJJJJJJJJJIJIJJJJJJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:2213:9216:81119 0 Chr1 500 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJIJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1111:13970:73107 0 Chr1 500 255 50M * 0 0 AGTGAAAATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCCAACGACGAGG @C@FFFFFHHHHHJJJJJJJIIJJJIJJJJJHHHJJJIIJJJJJJJJJJH NH:i:2 HI:i:1 AS:i:48 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1111:13970:73107 256 Chr1 1100 255 50M * 0 0 AGTGAAAATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCCAACGACGAGG @C@FFFFFHHHHHJJJJJJJIIJJJIJJJJJHHHJJJIIJJJJJJJJJJH NH:i:2 HI:i:1 AS:i:48 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1116:13448:90140 0 Chr1 1100 255 50M * 0 0 CGGCCACTATCTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTG BBCFFFFFHHHHHJJJJJJJJJJJJJIJJJJJJJJJJJJJIIJJIJHFFF NH:i:1 HI:i:1 AS:i:47 nM:i:1
run
alfa
Snakemake rule:
rule alfa_process: input: one="../ALFA/multimapping_reads/multimapping.bam", two="../ALFA/Quick_start/quick_start.bam" output: "output.Biotypes.png", "output.Categories.png" params: strand="forward", filetype="png", genome_index="../ALFA/quick_start", lone="multimapping", ltwo="quicK_start" conda: "../env_alfa.yaml" threads: 8 shell: """alfa -g {params.genome_index} \ --bam {input.one} {params.lone} {input.two} {params.ltwo} \ -s {params.strand} \ --{params.filetype} output \ -p {threads} """
Result
The annotations change according to the new positions output.Categories.
It appears that there is no check on the aligned reads. Multi-mapping reads are counted several times.
- Maintainer
Investigate source code
alfa.py
has a functiondef run_genomecov(arguments): """ Run genomecov (from Bedtools through pybedtools lib) for a set of parameters to produce a BedGraph file. """ (strand, bam_file, sample_label, name, bedgraph_extension, options) = arguments # Load the BAM file input_file = pybedtools.BedTool(bam_file) # Run genomecov if strand == "": input_file.genome_coverage(bg=True, split=True).saveas(options.output_dir + sample_label + name + bedgraph_extension) else: input_file.genome_coverage(bg=True, split=True, strand=strand).saveas(options.output_dir + sample_label + name + bedgraph_extension) return None
The coverage of the aligned reads is computed with pybedtools
genome_coverage()
and saved as.bedgraph
.There seems to be no option in
bedtools genomecov
to account for multi-mapping reads. - Author Maintainer
Thanks, @burri0000. So to conclude, does it mean that it uses multimappers by default or that it ignores multimappers?
- Maintainer
ALFA uses all the entries from the
.bam
. So if the.bam
contains multi-mappers, then these reads count multiple times.One way is to remove all secondary alignments with the SAM FLAG 1024. Then each read is only counted once. But this looses information. I will look a bit more in detail how we can account for this.
- Maintainer
I am sorry, it is FLAG 256, not 1024.
- Maintainer
Things to address:
-
Adjust snakemake
rule alfa_process
for multiple samples and directory structure -
Adjust
alfa.py
to account for multimappers: divide count by number of mapping positions
-
Adjust snakemake
- Dominik Burri added Doing label
added Doing label
- Maintainer
Instead of adjusting
alfa.py
, runalfa
from coverage files with--bedgraph
.- The coverage
.bedgraph
should be computed by multiplying each alignment with a factor corresponding to sth. like 1/{NH:i:X} where X is the number. - Another nice thing would be to display the percentage of reads with multiple mappings
Example of a multi-mapping read from a 10x genomics run (sample3a):
K00335:36:HJ5J5BBXX:6:1213:6380:23276 256 1 29895 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:3 AS:i:84 nM:i:1 TX:Z:ENST00000473358,+341,88M GX:Z:ENSG00000243485 GN:Z:MIR1302-2HG fx:Z:ENSG00000243485 RE:A:E li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1 K00335:36:HJ5J5BBXX:6:1213:6380:23276 256 1 200411 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:2 AS:i:84 nM:i:1 RE:A:I li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1 K00335:36:HJ5J5BBXX:6:1213:6380:23276 256 12 31949 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:6 AS:i:84 nM:i:1 RE:A:I li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1 K00335:36:HJ5J5BBXX:6:1213:6380:23276 272 15 101960980 0 88M * 0 0 TTCTGTCTCTGTGAATTTTGACTATTCTAGGCACTTCACAAAAGTGGACTCATACGATATCTGTAGTTTTGCGTCTGGCTTCTCTATT JJJJJJJJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJFJJJJJFJJJJJFJFAJJJJJJJJJJFJJJJJJJJJJJJJJJFFFAA NH:i:6 HI:i:5 AS:i:84 nM:i:1 TX:Z:ENST00000561145,+341,88M GX:Z:ENSG00000259553 GN:Z:AC140725.1 fx:Z:ENSG00000259553 RE:A:E li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1 K00335:36:HJ5J5BBXX:6:1213:6380:23276 0 19 71502 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:1 AS:i:84 nM:i:1 RE:A:I xf:i:0 li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1 K00335:36:HJ5J5BBXX:6:1213:6380:23276 256 9 29673 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:4 AS:i:84 nM:i:1 TX:Z:ENST00000422679,+2016,88M GX:Z:ENSG00000227518 GN:Z:AL928970.1 fx:Z:ENSG00000227518 RE:A:E li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
- The coverage
- Maintainer
Snakemake rules for running ALFA
ALFA requires a special form of sample input, namely a .bam label construct for each sample. Therefore, I hacked it by putting the samples and their labels in a dictionary
samples = {"possorted": "../../polyasite/cellranger_count/sample3a/outs/possorted_genome_bam.bam", "mapq": "../../polyasite/out/sample3a/filtered/mapq.bam", "uniqueumi": "../../polyasite/out/sample3a/filtered/uniqueumi.bam", "cbtags": "../../polyasite/out/sample3a/filtered/cbtags.bam"}
and a list for the rule
alfa_process
is generatedsamples_list = [] for k, v in samples.items(): samples_list.append(v) samples_list.append(k)
The rules for genome indexing and sample processing
rule all: input: "output/output.Biotypes.png", "output/output.Categories.png" rule alfa_index: input: "../../polyasite/datasets/ref_transcriptome/refdata-cellranger-GRCh38-3.0.0/genes/genes.sorted.gtf" output: "sorted_genes.stranded.ALFA_index", "sorted_genes.unstranded.ALFA_index" conda: "../env_alfa.yaml" threads: 8 shell: "alfa -a {input} -g sorted_genes -p {threads}" rule alfa_process: input: samples=list(samples.values()), str_index="sorted_genes.stranded.ALFA_index", unstr_index="sorted_genes.unstranded.ALFA_index" output: "output/output.Biotypes.png", "output/output.Categories.png" params: strand="forward", filetype="png", genome_index="sorted_genes", samples=samples_list conda: "../env_alfa.yaml" threads: 8 shell: """alfa -g {params.genome_index} \ --bam {params.samples} \ -s {params.strand} \ --{params.filetype} output/output \ -p {threads} """
I run
snakemake
on the slurm cluster withrun_cluster.sh
#!/usr/bin/bash # use rule all snakemake --rerun-incomplete \ --use-conda \ --cluster-config cluster.json \ --cluster "sbatch --cpus-per-task={cluster.threads} --mem={cluster.mem} --time={cluster.time} --qos={cluster.queue} --job-name={cluster.name} -o {cluster.out}" \ --cores 8
and
cluster.json
{ "alfa_index": { "queue":"6hours", "time":"02:00:00", "threads":"8", "mem":"8G", "name": "{rule}.{wildcards}", "out": "$PWD/logs/cluster_log/{rule}.{wildcards}-%j-%N.out" }, "alfa_process": { "queue":"6hours", "time":"02:00:00", "threads":"4", "mem":"8G", "name": "{rule}.{wildcards}", "out": "$PWD/logs/cluster_log/{rule}.{wildcards}-%j-%N.out" } }
env_alfa.yaml
containsname: alfa channels: - biocomp - conda-forge - bioconda - defaults dependencies: - alfa=1.1.1
- MihaelaZavolan changed the description
changed the description
- Owner
Mikhail and Dominik will check if the source code can be modified to include weighting for multimappers.
- Maintainer
Account for multi-mappers
There are two possibilities to account for multi-mapping reads:
- create a weighted coverage
.bedgraph
(by dividing each alignment by the number of mappings in the NH tag). Supply this.bedgraph
to ALFA and compute feature distribution and create plots with ALFA. ALFA should allow this option with multiple samples. - create the feature distribution and plots with own script by assigning each alignment, resp. mapped base pair, to a feature (CDS, 3'UTR, intergenic, etc.). The features can be stored with intervaltree, see
make_plots.py
from Mikhail's crunch-mara.
Both options need to iterate over all lines in the alignment file
.bam
. Also, since the mapping is done with STAR, we assume the existence of the NH tag for weighting the reads. - create a weighted coverage
- Maintainer
Normalised coverage
STAR has an option to create wig tracks (e.g.
.begraph
) from.bam
with normalisation to reads per million (RPM):STAR --runMode inputAlignmentsFromBAM --runThreadN XX --inputBAMfile XX.bam --outWigType bedGraph --outWigStrand Unstranded (or Stranded) --outWigNorm RPM --outFileNamePrefix ./
RPM calculation
The RPM is calculated as:
[(raw count)/(genomic hit number; NH tag)] / (library size) * 1e6
We found the code for this in STAR's
signalFromBAM.cpp
.Plots with ALFA
The obtained
.bedgraph
can now be fed into ALFA with the option--bedgraph
to create the desired graphs, stranded or unstranded is possible. Collapse replies - Maintainer
This is the description regarding RPM of STAR Mapping RNA-seq Reads with STAR: Alternatve protocol 4:
The signal are generated separately for two genomic strands: .str1. and .str2., which correspond to different genomic strands depending on the library preparation protocols. For the protocols in which the 1st read is on the opposite strand to the RNA molecule (such as Illumina stranded Tru-Seq), .str1. corresponds to the (−) strand and .str2. corresponds to the (+) strand.
- Dominik Burri mentioned in commit 4af9f0b1
mentioned in commit 4af9f0b1
- Maintainer
Created branch
alfa_qc
with snakemake rules for performing ALFA. So far tested this on one sample/scicore/home/zavolan/burri0000/test_alfa/sample/chr21.bam
.Tests can be performed with:
snakemake --use-singularity --singularity-args "--bind ${PWD},'/scicore/home/zavolan/burri0000/test_alfa/sample/'" run_alfa_bg_stranded
and
snakemake --use-singularity --singularity-args "--bind ${PWD},'/scicore/home/zavolan/burri0000/test_alfa/sample/'" run_alfa_bg_unstranded
- Alex Kanitz mentioned in issue #42 (closed)
mentioned in issue #42 (closed)
- Alex Kanitz changed title from Investigate ALFA to Add ALFA
changed title from Investigate ALFA to Add ALFA
- Alex Kanitz changed the description
changed the description
- Alex Kanitz marked the checklist item Add Snakemake rule for ALFA as completed
marked the checklist item Add Snakemake rule for ALFA as completed
- Alex Kanitz marked the checklist item Add Snakemake rule for ALFA as incomplete
marked the checklist item Add Snakemake rule for ALFA as incomplete
- Alex Kanitz marked the checklist item Add Snakemake rule for ALFA as completed
marked the checklist item Add Snakemake rule for ALFA as completed
- Alex Kanitz changed title from Add ALFA to Add rule for ALFA
changed title from Add ALFA to Add rule for ALFA
- Alex Kanitz added 1 deleted label
added 1 deleted label
- Alex Kanitz mentioned in issue #6 (closed)
mentioned in issue #6 (closed)
- Maintainer
Integrated alfa rules into Snakemake pipeline with small test files. Things still to consider:
- Orientation: STAR_rpm gives
str1
andstr2
, where the first one is for mate 1. Inalfa
, this should be considered by renamingstr1
toplus
forforward
orientation andstr1
tominus
inreverse
orientation. Need to test with a more meaningful file. -
alfa_bg_XX
is copying into a local directory, should be changed.
- Orientation: STAR_rpm gives
- Author Maintainer
This might help for multiple samples...
params: bam_and_sample_name = expand(str(os.path.join(config["output_dir"], "{sample1}", "bowtie", "alignment.sorted.bam ")) +str("{sample1}"), sample1=get_samples()),
- BIOPZ-Gypas Foivos changed the description
changed the description
- Alex Kanitz changed milestone to %v0.1.0 release
changed milestone to %v0.1.0 release
- Dominik Burri mentioned in issue #45 (closed)
mentioned in issue #45 (closed)
- Maintainer
ALFA runs now (4f7447ac) for all samples in the sample table (tested for synthetic data). However, there are some limitations to it:
- all the samples must have the same genome
- all the samples must have the same orientation
It is implemented by only using the first entry in the lists, no checks are performed.
To do's
- test the rule for meaningful sample
- extend the rule to create for each genome, orientation combination ALFA output
- Dominik Burri marked the checklist item create test case as completed
marked the checklist item create test case as completed
- Dominik Burri marked the checklist item write bash script that executes Snakemake with the relevant settings as completed
marked the checklist item write bash script that executes Snakemake with the relevant settings as completed
- Dominik Burri marked the checklist item add script to
.gitlab-ci.yml
as completedmarked the checklist item add script to
.gitlab-ci.yml
as completed - Dominik Burri marked the checklist item Run alfa for all the samples together and not for each of the samples separetely. as completed
marked the checklist item Run alfa for all the samples together and not for each of the samples separetely. as completed
- Dominik Burri marked the checklist item create test sample table (e.g. for parts of chr21.bam) with sample info as completed
marked the checklist item create test sample table (e.g. for parts of chr21.bam) with sample info as completed
- Dominik Burri marked the checklist item Add tests for ALFA rule as completed
marked the checklist item Add tests for ALFA rule as completed
- Maintainer
Tested ALFA with sample
chr21
.- The resulting category plots for paired end are: PE Sense: ALFA_plots.Categories.pdf and PE Antisense: ALFA_plots.Categories.pdf.
- Single end sense and antisense: SE Sense: ALFA_plots.Categories.pdf and SE Antisense: ALFA_plots.Categories.pdf
1 - Owner
Looks good, eh? :)
Just before I forget: I played around with interactive
git rebase
yesterday for some merge requests on some other project on GitHub. It's really cool and super useful especially in your case where you have a long-living branch with many commits. So let's do this for you as well. Let me know when you are ready to merge, so that I can show you how it works.Edited by Alex Kanitz - Owner
I am not sure yet of the results. Is it typical that 70% of reads map to introns? Of course, it's not an enrichment, but it looks like a large number. Also for the single end, the results for sense and antisense are the same. Is it my mistake (i.e. input files are identical) or is it a 'feature' of the tool that it does not consider the sense for the single end reads?
- Owner
Didn't look at introns vs exons as I don't know what sample was run. But you are definitely right about single end results looking funny.
- Maintainer
I am not totally convinced of the results either. I played a little by manually re-setting strand orientation (ALFA option
-s fr-firststrand
or-s fr-secondstrand
and I got the same results in the PE and SE sample. - Owner
I think the best way to make sense of this is to use a synthetic data set (like the one I generated with 10 reads or so).
- Maintainer
alfa.py
generates.bedgraph
internally withif options.strandness in ["forward", "fr-firststrand"]: parameter_sets.append(["+", b, l, ".plus", bedgraph_extension, o]) parameter_sets.append(["-", b, l, ".minus", bedgraph_extension, o]) elif options.strandness in ["reverse", "fr-secondstrand", bedgraph_extension, o]: parameter_sets.append(["-", b, l, ".plus", bedgraph_extension, o]) parameter_sets.append(["+", b, l, ".minus", bedgraph_extension, o])
with parameter_sets: (strand, bam_file, sample_label, name, bedgraph_extension, options)
which then is called in
run_genomecov(arguments)
input_file.genome_coverage(bg=True, split=True, strand=strand).saveas(options.output_dir + sample_label + name + bedgraph_extension)
So, the naming for alfa should always be
.plus
forstr1
and.minus
forstr2
of STAR RPM output, irrespective of orientation.(preliminary) Results
- Maintainer
Okay, so Anastasiya and I tested manually the output of
chr21.bam
:- running ALFA directly (incl. BEDGraphs and generating final reports) for
chr21.bam
we get: chr21 directly: ALFA_plots.Categories.pdf - generating BEDGraphs with STAR, renaming resulting BEDGraphs str1 into minus and str2 into plus, then running ALFA from renamed bedgraphs: chr21 manually with STAR RPM: ALFA_plots.Categories.pdf
Checking
chr21.bam
, and all the above.bedgraph
in IGV reveals that STAR BEDGraphs do report correctly that mapped reads are assigned on the respective strand, based on the library orientation, whereas ALFA's bedgraph (which usesbedtools genomecov
) reports coverage irrespective of library orientation. That is, for a gene on the plus strand, ALFA's begraph reports approx. half of the reads on the plus and half on the minus strand. STAR RPM reports almost all reads on the plus strand.If we rename the STAR RPM bedgraphs as I mentioned above, about 70% of reads are reported to map to the opposite strand. So my previous statement about renaming is wrong.
Edited by Dominik Burri - running ALFA directly (incl. BEDGraphs and generating final reports) for
- Maintainer
I opened an issue on github: https://github.com/biocompibens/ALFA/issues/3
Mathieu answered but I don't quite get his plan.
Anyway, I adjust the snakemake rule for creating ALFA plots for all samples with the same orientation.
- Author Maintainer
Note: The conclusion from today's discussion is that we will use STAR to export the bedgraph, but we have to equally distribute the reads in case of multi-mappers.
- Maintainer
Sounds good. About equally distributing reads: This is why I use STAR RPM, i.e.
STAR --runMode inputAlignmentsFromBAM --runThreadN XX --inputBAMfile XX.bam --outWigType bedGraph --outWigStrand Stranded --outWigNorm RPM --outFileNamePrefix ./
and use the
.UniqueMultiple.
output files. - Owner
@burri0000: perhaps you could comment on the issue on the ALFA repo what we are going to do and that perhaps that they could recommend this procedure in their docs for people who are running ALFA for paired end samples (until they have time to fix the code). Apparently it's an underlying issue with
bedtools genomecov
which ALFA uses, and so there are not likely to be able to fix it with just a few lines of code. (there's an option-pc
which sounds like it might address this issue, but it's not well documented and apparently there are issues where people say that also with that option it doesn't really work as expected; I would link to the issue(s) but I haven't seen them myself, please add if anybody finds it) 1 - Dominik Burri mentioned in commit ab7f578c
mentioned in commit ab7f578c
- Maintainer
Adjusted rules to run ALFA for all samples; the rule gathers
feature_counts.tsv
tables from each sample and plots them together. The result for the chr21 test is: ALFA_plots.Categories.pdf and ALFA_plots.Biotypes.pdfI think these results make sense. The PE antisense could be wrongly annotated because the reads look the same in IGV (checked with Anastasiya).
- Maintainer
Added rules for producing ALFA output per sample and for all samples. See commit b8347944.
- Maintainer
Test paired end mapping
I want to test how the SAM flags affect the output
test files
The files I create are adapted from
quick_start.bam
of ALFA. I create two files with each 10 reads, resp. 5 read pairs. 8/10 reads map according to the file name, and the other 2 are on the opposite strand. The reason to include 2 reads on the opposite strand is, that the resulting.bedgraph
from STAR RPM is not empty. If the.bedgraph
is empty, ALFA returns an error.For the file, where 8/10 reads map to the plus strand: The flag
99
means that R1 (mate1) maps to the plus strand and147
that this is R2 (mate2) which maps to the minus strand. For the file in which 8/10 reads map to the minus strand: SAM flag83
means that R1 maps to minus strand and flage163
that R2 maps to the plus strand.paired_end_R1_on_plus.bam
@HD VN:1.4 SO:coordinate @SQ SN:Chr1 LN:1250 @PG ID:STAR PN:STAR VN:STAR_2.4.2a_modified CL:condor_exec.exe --runThreadN 10 --genomeDir /projects/merip/Mathieu/Mapping/TAIR10_STAR_index_with_GTF --readFilesIn /projects/merip/Mathieu/Cutadapt/Min_48nt/241.trim.fastq --outFileNamePrefix 241. --outSAMtype BAM SortedByCoordinate --outFilterMultimapNmax 1 --outFilterMismatchNoverLmax 0.06 --alignIntronMax 10000 @CO user command line: condor_exec.exe --readFilesIn /projects/merip/Mathieu/Cutadapt/Min_48nt/241.trim.fastq --outFileNamePrefix 241. --runThreadN 10 --genomeDir /projects/merip/Mathieu/Mapping/TAIR10_STAR_index_with_GTF --outFilterMismatchNoverLmax 0.06 --outFilterMultimapNmax 1 --alignIntronMax 10000 --outSAMtype BAM SortedByCoordinate HWI-1KL110:162:C7268ACXX:8:1105:18693:99774 99 Chr1 300 255 50M = 300 0 ATACCAAACCAGAGAAAACTAATACATAATCGCAGAAATACAGATTACGG @CCFFFFFHHHHHJJJJJJ+CGHIJJJJJJIJJJIJJJJJJJJJJJIJJJ NH:i:1 HI:i:1 AS:i:43 nM:i:2 HWI-1KL110:162:C7268ACXX:8:1105:18693:99774 147 Chr1 300 255 50M = 300 0 ATACCAAACCAGAGAAAACTAATACATAATCGCAGAAATACAGATTACGG @CCFFFFFHHHHHJJJJJJ+CGHIJJJJJJIJJJIJJJJJJJJJJJIJJJ NH:i:1 HI:i:1 AS:i:43 nM:i:2 HWI-1KL110:162:C7268ACXX:8:1110:3857:68388 99 Chr1 300 255 50M = 300 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFGHHHHJJJJJJJJJJJIIIJJJEHJJJJJJJJIJJJJJIIJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1110:3857:68388 147 Chr1 300 255 50M = 300 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFGHHHHJJJJJJJJJJJIIIJJJEHJJJJJJJJIJJJJJIIJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1206:17943:50608 99 Chr1 300 255 50M = 300 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJIJJJJJJJIIJJIJJJJJIJJJJIJJJJIJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1206:17943:50608 147 Chr1 300 255 50M = 300 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJIJJJJJJJIIJJIJJJJJIJJJJIJJJJIJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1215:9399:59794 99 Chr1 500 255 50M = 500 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG @<@DDBDDFBDHHIEHGGEHICHIIIGHHIEG8DG@?FDEGGFHIGHGJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1215:9399:59794 147 Chr1 500 255 50M = 500 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG @<@DDBDDFBDHHIEHGGEHICHIIIGHHIEG8DG@?FDEGGFHIGHGJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:2101:19036:57639 163 Chr1 500 255 50M = 500 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:2101:19036:57639 83 Chr1 500 255 50M = 500 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1
paired_end_R1_on_minus.bam
@HD VN:1.4 SO:coordinate @SQ SN:Chr1 LN:1250 @PG ID:STAR PN:STAR VN:STAR_2.4.2a_modified CL:condor_exec.exe --runThreadN 10 --genomeDir /projects/merip/Mathieu/Mapping/TAIR10_STAR_index_with_GTF --readFilesIn /projects/merip/Mathieu/Cutadapt/Min_48nt/241.trim.fastq --outFileNamePrefix 241. --outSAMtype BAM SortedByCoordinate --outFilterMultimapNmax 1 --outFilterMismatchNoverLmax 0.06 --alignIntronMax 10000 @CO user command line: condor_exec.exe --readFilesIn /projects/merip/Mathieu/Cutadapt/Min_48nt/241.trim.fastq --outFileNamePrefix 241. --runThreadN 10 --genomeDir /projects/merip/Mathieu/Mapping/TAIR10_STAR_index_with_GTF --outFilterMismatchNoverLmax 0.06 --outFilterMultimapNmax 1 --alignIntronMax 10000 --outSAMtype BAM SortedByCoordinate HWI-1KL110:162:C7268ACXX:8:1105:18693:99774 163 Chr1 300 255 50M = 300 0 ATACCAAACCAGAGAAAACTAATACATAATCGCAGAAATACAGATTACGG @CCFFFFFHHHHHJJJJJJ+CGHIJJJJJJIJJJIJJJJJJJJJJJIJJJ NH:i:1 HI:i:1 AS:i:43 nM:i:2 HWI-1KL110:162:C7268ACXX:8:1105:18693:99774 83 Chr1 300 255 50M = 300 0 ATACCAAACCAGAGAAAACTAATACATAATCGCAGAAATACAGATTACGG @CCFFFFFHHHHHJJJJJJ+CGHIJJJJJJIJJJIJJJJJJJJJJJIJJJ NH:i:1 HI:i:1 AS:i:43 nM:i:2 HWI-1KL110:162:C7268ACXX:8:1110:3857:68388 163 Chr1 300 255 50M = 300 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFGHHHHJJJJJJJJJJJIIIJJJEHJJJJJJJJIJJJJJIIJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1110:3857:68388 83 Chr1 300 255 50M = 300 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFGHHHHJJJJJJJJJJJIIIJJJEHJJJJJJJJIJJJJJIIJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1206:17943:50608 163 Chr1 300 255 50M = 300 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJIJJJJJJJIIJJIJJJJJIJJJJIJJJJIJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1206:17943:50608 83 Chr1 300 255 50M = 300 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJIJJJJJJJIIJJIJJJJJIJJJJIJJJJIJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1215:9399:59794 163 Chr1 500 255 50M = 500 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG @<@DDBDDFBDHHIEHGGEHICHIIIGHHIEG8DG@?FDEGGFHIGHGJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:1215:9399:59794 83 Chr1 500 255 50M = 500 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG @<@DDBDDFBDHHIEHGGEHICHIIIGHHIEG8DG@?FDEGGFHIGHGJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:2101:19036:57639 99 Chr1 500 255 50M = 500 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1 HWI-1KL110:162:C7268ACXX:8:2101:19036:57639 147 Chr1 500 255 50M = 500 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1
genome & chrNameLength
I use the genome
quick_start.gtf
which is provided in ALFA. The authors provide a nice overview in their readme.I rename
quick_start.chr_len.txt
toresults/star_indexes/homo_sapiens/75/STAR_index/chrNameLength.txt
samples table
In the samples table I use each
.bam
twice, once with the information that the library is sense, and once in which the library is antisense. This will generate all 4 possible outcomes.Note that I will run the pipeline from
.bam
and therefore only need columns sample, seqmode, index_size, gtf, * kallisto_directionality*.samples.tsv
sample seqmode fq1 index_size kmer fq2 fq1_3p fq1_5p fq2_3p fq2_5p organism gtf gtf_filtered genome tr_fasta_filtered sd mean multimappers soft_clip pass_mode libtype kallisto_directionality fq1_polya fq2_polya paired_end_R1_on_plus_sense paired_end input_files/project1/synthetic.mate_1.fastq.gz 75 31 input_files/project1/synthetic.mate_2.fastq.gz AGATCGGAAGAGCACA XXXXXXXXXXXXX AGATCGGAAGAGCGT XXXXXXXXXXXXX homo_sapiens input_files/homo_sapiens/quick_start.gtf homo_sapiens/quick_start.gtf input_files/homo_sapiens/genome.fa input_files/homo_sapiens/transcriptome.fa 100 250 10 EndToEnd None A --fr AAAAAAAAAAAAAAAAA TTTTTTTTTTTTTTTTT paired_end_R1_on_plus_antisense paired_end input_files/project1/synthetic.mate_1.fastq.gz 75 31 input_files/project1/synthetic.mate_2.fastq.gz AGATCGGAAGAGCACA XXXXXXXXXXXXX AGATCGGAAGAGCGT XXXXXXXXXXXXX homo_sapiens input_files/homo_sapiens/quick_start.gtf homo_sapiens/quick_start.gtf input_files/homo_sapiens/genome.fa input_files/homo_sapiens/transcriptome.fa 100 250 10 EndToEnd None A --rf AAAAAAAAAAAAAAAAA TTTTTTTTTTTTTTTTT paired_end_R1_on_minus_sense paired_end input_files/project1/synthetic.mate_1.fastq.gz 75 31 input_files/project1/synthetic.mate_2.fastq.gz AGATCGGAAGAGCACA XXXXXXXXXXXXX AGATCGGAAGAGCGT XXXXXXXXXXXXX homo_sapiens input_files/homo_sapiens/quick_start.gtf homo_sapiens/quick_start.gtf input_files/homo_sapiens/genome.fa input_files/homo_sapiens/transcriptome.fa 100 250 10 EndToEnd None A --fr AAAAAAAAAAAAAAAAA TTTTTTTTTTTTTTTTT paired_end_R1_on_minus_antisense paired_end input_files/project1/synthetic.mate_1.fastq.gz 75 31 input_files/project1/synthetic.mate_2.fastq.gz AGATCGGAAGAGCACA XXXXXXXXXXXXX AGATCGGAAGAGCGT XXXXXXXXXXXXX homo_sapiens input_files/homo_sapiens/quick_start.gtf homo_sapiens/quick_start.gtf input_files/homo_sapiens/genome.fa input_files/homo_sapiens/transcriptome.fa 100 250 10 EndToEnd None A --rf AAAAAAAAAAAAAAAAA TTTTTTTTTTTTTTTTT
Run pipeline
the files and directories are located in
config/synthetic_paired/
.snakemake \ --snakefile="../../Snakefile" \ --configfile="input_files/config.yaml" \ --cores=4 \ --printshellcmds \ --rerun-incomplete \ --use-singularity \ --use-conda \ --singularity-args="--bind ${PWD}" \ --verbose \ results/ALFA/ALFA_plots.Categories.pdf
Results
The gene in
quick_start.gtf
is located on the plus strand. So I should get 80% mapped for sample paired_end_R1_on_plus_sense and paired_end_R1_on_minus_antisense. This is the case: ALFA_plots.Biotypes.pdf & ALFA_plots.Categories.pdf.The other two samples, paired_end_R1_on_minus_sense and paired_end_R1_on_plus_antisense are with 8/10 reads mapping to opposite strand of the gene.
- Maintainer
I included the samples above as independent test for ALFA in
tests/test_alfa
(81c4c7b4). - Dominik Burri mentioned in merge request !39 (closed)
mentioned in merge request !39 (closed)
- Maintainer
Added in merge request !41 (merged)
- Dominik Burri closed
closed
- CJHerrmann mentioned in merge request !93 (merged)
mentioned in merge request !93 (merged)