Add rule for ALFA
- Investigate ALFA (https://github.com/biocompibens/ALFA)
  - to use for creating summaries of sample content
  - whether it makes sense to use it in combination with STAR
- Add Snakemake rule for ALFA
  - Run alfa for all the samples together and not for each sample separately.
- Add tests for ALFA rule
  - create test sample table (e.g. for parts of chr21.bam) with sample info
  - create test case
  - write bash script that executes Snakemake with the relevant settings
- Add script to .gitlab-ci.yml
Activity
- Maintainer
Multimapping reads
Use the quick_start mapped reads to investigate the counting behaviour of multi-mapping reads. These reads are annotated in two ways: i) FLAG 256 for the secondary alignment and ii) the NH tag for the number of occurrences of the read in the alignment file.
Adjust quick_start.bam: create a file with the same number of aligned reads, but with two reads multi-mapping to two different locations (from intergenic to 5'UTR and from CDS to 3'UTR), annotated with the above tags. Resulting multimapping.sam:

@HD VN:1.4 SO:coordinate
@SQ SN:Chr1 LN:1250
@PG ID:STAR PN:STAR VN:STAR_2.4.2a_modified CL:condor_exec.exe --runThreadN 10 --genomeDir /projects/merip/Mathieu/Mapping/TAIR10_STAR_index_with_GTF --readFilesIn /projects/merip/Mathieu/Cutadapt/Min_48nt/241.trim.fastq --outFileNamePrefix 241. --outSAMtype BAM SortedByCoordinate --outFilterMultimapNmax 1 --outFilterMismatchNoverLmax 0.06 --alignIntronMax 10000
@CO user command line: condor_exec.exe --readFilesIn /projects/merip/Mathieu/Cutadapt/Min_48nt/241.trim.fastq --outFileNamePrefix 241. --runThreadN 10 --genomeDir /projects/merip/Mathieu/Mapping/TAIR10_STAR_index_with_GTF --outFilterMismatchNoverLmax 0.06 --outFilterMultimapNmax 1 --alignIntronMax 10000 --outSAMtype BAM SortedByCoordinate
HWI-1KL110:162:C7268ACXX:8:1105:18693:99774 0 Chr1 150 255 50M * 0 0 ATACCAAACCAGAGAAAACTAATACATAATCGCAGAAATACAGATTACGG @CCFFFFFHHHHHJJJJJJ+CGHIJJJJJJIJJJIJJJJJJJJJJJIJJJ NH:i:2 HI:i:1 AS:i:43 nM:i:2
HWI-1KL110:162:C7268ACXX:8:1105:18693:99774 256 Chr1 300 255 50M * 0 0 ATACCAAACCAGAGAAAACTAATACATAATCGCAGAAATACAGATTACGG @CCFFFFFHHHHHJJJJJJ+CGHIJJJJJJIJJJIJJJJJJJJJJJIJJJ NH:i:2 HI:i:1 AS:i:43 nM:i:2
HWI-1KL110:162:C7268ACXX:8:1206:17943:50608 0 Chr1 300 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJIJJJJJJJIIJJIJJJJJIJJJJIJJJJIJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1215:9399:59794 0 Chr1 500 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG @<@DDBDDFBDHHIEHGGEHICHIIIGHHIEG8DG@?FDEGGFHIGHGJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:2101:19036:57639 0 Chr1 500 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:2207:19417:41334 0 Chr1 500 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJIJJJJJJJJJJJJJIJIJJJJJJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:2213:9216:81119 0 Chr1 500 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJIJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1111:13970:73107 0 Chr1 500 255 50M * 0 0 AGTGAAAATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCCAACGACGAGG @C@FFFFFHHHHHJJJJJJJIIJJJIJJJJJHHHJJJIIJJJJJJJJJJH NH:i:2 HI:i:1 AS:i:48 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1111:13970:73107 256 Chr1 1100 255 50M * 0 0 AGTGAAAATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCCAACGACGAGG @C@FFFFFHHHHHJJJJJJJIIJJJIJJJJJHHHJJJIIJJJJJJJJJJH NH:i:2 HI:i:1 AS:i:48 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1116:13448:90140 0 Chr1 1100 255 50M * 0 0 CGGCCACTATCTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTG BBCFFFFFHHHHHJJJJJJJJJJJJJIJJJJJJJJJJJJJIIJJIJHFFF NH:i:1 HI:i:1 AS:i:47 nM:i:1
Run alfa with the following Snakemake rule:

rule alfa_process:
    input:
        one="../ALFA/multimapping_reads/multimapping.bam",
        two="../ALFA/Quick_start/quick_start.bam"
    output:
        "output.Biotypes.png",
        "output.Categories.png"
    params:
        strand="forward",
        filetype="png",
        genome_index="../ALFA/quick_start",
        lone="multimapping",
        ltwo="quick_start"
    conda:
        "../env_alfa.yaml"
    threads: 8
    shell:
        """alfa -g {params.genome_index} \
        --bam {input.one} {params.lone} {input.two} {params.ltwo} \
        -s {params.strand} \
        --{params.filetype} output \
        -p {threads}
        """
Result
The annotations change according to the new positions (output.Categories).
It appears that there is no check on the aligned reads: multi-mapping reads are counted several times.
- Maintainer
Investigate source code
alfa.py has the function:

def run_genomecov(arguments):
    """
    Run genomecov (from Bedtools through pybedtools lib) for a set of
    parameters to produce a BedGraph file.
    """
    (strand, bam_file, sample_label, name, bedgraph_extension, options) = arguments
    # Load the BAM file
    input_file = pybedtools.BedTool(bam_file)
    # Run genomecov
    if strand == "":
        input_file.genome_coverage(bg=True, split=True).saveas(
            options.output_dir + sample_label + name + bedgraph_extension)
    else:
        input_file.genome_coverage(bg=True, split=True, strand=strand).saveas(
            options.output_dir + sample_label + name + bedgraph_extension)
    return None
The coverage of the aligned reads is computed with pybedtools genome_coverage() and saved as .bedgraph. There seems to be no option in bedtools genomecov to account for multi-mapping reads.
- Author Maintainer
Thanks, @burri0000. So to conclude, does it mean that it uses multimappers by default or that it ignores multimappers?
- Maintainer
ALFA uses all the entries from the .bam. So if the .bam contains multi-mappers, then these reads count multiple times. One way is to remove all secondary alignments with the SAM FLAG 1024; then each read is only counted once, but this loses information. I will look in more detail at how we can account for this.
- Maintainer
I am sorry, it is FLAG 256, not 1024.
- Maintainer
Things to address:
- Adjust snakemake rule alfa_process for multiple samples and the directory structure
- Adjust alfa.py to account for multimappers: divide the count by the number of mapping positions
- Adjust snakemake
- Dominik Burri added Doing label
- Maintainer
Instead of adjusting alfa.py, run alfa from coverage files with --bedgraph.
- The coverage .bedgraph should be computed by multiplying each alignment with a factor corresponding to something like 1/{NH:i:X}, where X is the number of mapping positions.
- Another nice thing would be to display the percentage of reads with multiple mappings.
Example of a multi-mapping read from a 10x genomics run (sample3a):
K00335:36:HJ5J5BBXX:6:1213:6380:23276 256 1 29895 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:3 AS:i:84 nM:i:1 TX:Z:ENST00000473358,+341,88M GX:Z:ENSG00000243485 GN:Z:MIR1302-2HG fx:Z:ENSG00000243485 RE:A:E li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
K00335:36:HJ5J5BBXX:6:1213:6380:23276 256 1 200411 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:2 AS:i:84 nM:i:1 RE:A:I li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
K00335:36:HJ5J5BBXX:6:1213:6380:23276 256 12 31949 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:6 AS:i:84 nM:i:1 RE:A:I li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
K00335:36:HJ5J5BBXX:6:1213:6380:23276 272 15 101960980 0 88M * 0 0 TTCTGTCTCTGTGAATTTTGACTATTCTAGGCACTTCACAAAAGTGGACTCATACGATATCTGTAGTTTTGCGTCTGGCTTCTCTATT JJJJJJJJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJFJJJJJFJJJJJFJFAJJJJJJJJJJFJJJJJJJJJJJJJJJFFFAA NH:i:6 HI:i:5 AS:i:84 nM:i:1 TX:Z:ENST00000561145,+341,88M GX:Z:ENSG00000259553 GN:Z:AC140725.1 fx:Z:ENSG00000259553 RE:A:E li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
K00335:36:HJ5J5BBXX:6:1213:6380:23276 0 19 71502 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:1 AS:i:84 nM:i:1 RE:A:I xf:i:0 li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
K00335:36:HJ5J5BBXX:6:1213:6380:23276 256 9 29673 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:4 AS:i:84 nM:i:1 TX:Z:ENST00000422679,+2016,88M GX:Z:ENSG00000227518 GN:Z:AL928970.1 fx:Z:ENSG00000227518 RE:A:E li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
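The percentage of reads with multiple mappings could be estimated directly from the NH tags. A minimal, stdlib-only sketch of the idea (the helper name is ours, not part of ALFA; it counts primary alignments only, so each read is counted exactly once):

```python
import re

def multimapper_fraction(sam_lines):
    """Return the fraction of reads whose NH tag is > 1.

    Only primary alignments are counted (FLAG bit 256 = secondary
    alignment is skipped), so every read contributes once.
    """
    total = 0
    multi = 0
    for line in sam_lines:
        if line.startswith("@"):        # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 256:                  # secondary alignment
            continue
        total += 1
        m = re.search(r"\bNH:i:(\d+)", line)
        if m and int(m.group(1)) > 1:
            multi += 1
    return multi / total if total else 0.0
```

For the read above (NH:i:6, one primary plus five secondary alignments), this counts a single multi-mapping read.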
- Maintainer
Snakemake rules for running ALFA
ALFA requires a special form of sample input, namely a .bam/label pair for each sample. Therefore, I hacked it by putting the samples and their labels in a dictionary:

samples = {
    "possorted": "../../polyasite/cellranger_count/sample3a/outs/possorted_genome_bam.bam",
    "mapq": "../../polyasite/out/sample3a/filtered/mapq.bam",
    "uniqueumi": "../../polyasite/out/sample3a/filtered/uniqueumi.bam",
    "cbtags": "../../polyasite/out/sample3a/filtered/cbtags.bam"
}

and a list for the rule alfa_process is generated:

samples_list = []
for k, v in samples.items():
    samples_list.append(v)
    samples_list.append(k)
The rules for genome indexing and sample processing:

rule all:
    input:
        "output/output.Biotypes.png",
        "output/output.Categories.png"

rule alfa_index:
    input:
        "../../polyasite/datasets/ref_transcriptome/refdata-cellranger-GRCh38-3.0.0/genes/genes.sorted.gtf"
    output:
        "sorted_genes.stranded.ALFA_index",
        "sorted_genes.unstranded.ALFA_index"
    conda:
        "../env_alfa.yaml"
    threads: 8
    shell:
        "alfa -a {input} -g sorted_genes -p {threads}"

rule alfa_process:
    input:
        samples=list(samples.values()),
        str_index="sorted_genes.stranded.ALFA_index",
        unstr_index="sorted_genes.unstranded.ALFA_index"
    output:
        "output/output.Biotypes.png",
        "output/output.Categories.png"
    params:
        strand="forward",
        filetype="png",
        genome_index="sorted_genes",
        samples=samples_list
    conda:
        "../env_alfa.yaml"
    threads: 8
    shell:
        """alfa -g {params.genome_index} \
        --bam {params.samples} \
        -s {params.strand} \
        --{params.filetype} output/output \
        -p {threads}
        """
I run snakemake on the slurm cluster with run_cluster.sh:

#!/usr/bin/bash
# use rule all
snakemake --rerun-incomplete \
    --use-conda \
    --cluster-config cluster.json \
    --cluster "sbatch --cpus-per-task={cluster.threads} --mem={cluster.mem} --time={cluster.time} --qos={cluster.queue} --job-name={cluster.name} -o {cluster.out}" \
    --cores 8
and cluster.json:

{
    "alfa_index": {
        "queue": "6hours",
        "time": "02:00:00",
        "threads": "8",
        "mem": "8G",
        "name": "{rule}.{wildcards}",
        "out": "$PWD/logs/cluster_log/{rule}.{wildcards}-%j-%N.out"
    },
    "alfa_process": {
        "queue": "6hours",
        "time": "02:00:00",
        "threads": "4",
        "mem": "8G",
        "name": "{rule}.{wildcards}",
        "out": "$PWD/logs/cluster_log/{rule}.{wildcards}-%j-%N.out"
    }
}
env_alfa.yaml contains:

name: alfa
channels:
  - biocomp
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - alfa=1.1.1
- MihaelaZavolan changed the description
- Owner
Mikhail and Dominik will check if the source code can be modified to include weighting for multimappers.
- Maintainer
Account for multi-mappers
There are two possibilities to account for multi-mapping reads:
- Create a weighted coverage .bedgraph (by dividing each alignment by the number of mappings in the NH tag). Supply this .bedgraph to ALFA and compute the feature distribution and plots with ALFA. ALFA should allow this option with multiple samples.
- Create the feature distribution and plots with our own script by assigning each alignment (or each mapped base pair) to a feature (CDS, 3'UTR, intergenic, etc.). The features can be stored with intervaltree, see make_plots.py from Mikhail's crunch-mara.
Both options need to iterate over all lines in the alignment .bam file. Also, since the mapping is done with STAR, we assume the existence of the NH tag for weighting the reads.
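The first option (a 1/NH-weighted coverage) can be sketched in plain Python without pybedtools, by accumulating per-base weights from SAM records. This is only an illustration under stated assumptions (tab-separated SAM text, simple ungapped CIGARs like "50M", NH tag present as with STAR output); the function names are ours:

```python
import re
from collections import defaultdict

def weighted_coverage(sam_lines):
    """Accumulate NH-weighted coverage per (reference, position).

    Each alignment adds 1/NH to every base it covers, so a read
    mapping to N places contributes a total weight of 1 genome-wide.
    Only simple CIGARs like '50M' are handled in this sketch.
    """
    cov = defaultdict(float)
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.split("\t")
        ref, pos, cigar = fields[2], int(fields[3]), fields[5]
        m = re.search(r"\bNH:i:(\d+)", line)
        nh = int(m.group(1)) if m else 1
        match = re.fullmatch(r"(\d+)M", cigar)
        if not match:
            continue  # skip indels/splices in this sketch
        for p in range(pos, pos + int(match.group(1))):
            cov[(ref, p)] += 1.0 / nh
    return cov

def to_bedgraph(cov):
    """Emit per-base BedGraph intervals (0-based, half-open).

    A real BedGraph would merge adjacent positions with equal values;
    this sketch keeps one interval per base for clarity.
    """
    return [f"{ref}\t{p - 1}\t{p}\t{cov[(ref, p)]:.4f}"
            for (ref, p) in sorted(cov)]
```

The resulting lines could then be written to a file and fed to ALFA via --bedgraph.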
- Maintainer
Normalised coverage
STAR has an option to create wig tracks (e.g. .bedgraph) from .bam with normalisation to reads per million (RPM):

STAR --runMode inputAlignmentsFromBAM \
    --runThreadN XX \
    --inputBAMfile XX.bam \
    --outWigType bedGraph \
    --outWigStrand Unstranded (or Stranded) \
    --outWigNorm RPM \
    --outFileNamePrefix ./
RPM calculation
The RPM is calculated as:
[(raw count)/(genomic hit number; NH tag)] / (library size) * 1e6
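Spelled out as code, the per-alignment contribution looks like the following (a paraphrase of the formula above, not the actual STAR source; the names are ours):

```python
def rpm_contribution(raw_count, nh, library_size):
    """RPM contribution of one alignment: divide by NH (genomic hit
    number), normalise by library size, scale to 'per million'."""
    return (raw_count / nh) / library_size * 1e6
```

So a read mapping to two places in a library of one million reads contributes 0.5 RPM at each location.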
We found the code for this in STAR's signalFromBAM.cpp.
Plots with ALFA
The obtained .bedgraph can now be fed into ALFA with the option --bedgraph to create the desired graphs; both stranded and unstranded are possible.
- Maintainer
This is the description regarding RPM from "Mapping RNA-seq Reads with STAR", Alternative Protocol 4:
The signal are generated separately for two genomic strands: .str1. and .str2., which correspond to different genomic strands depending on the library preparation protocols. For the protocols in which the 1st read is on the opposite strand to the RNA molecule (such as Illumina stranded Tru-Seq), .str1. corresponds to the (−) strand and .str2. corresponds to the (+) strand.
- Dominik Burri mentioned in commit 4af9f0b1
- Maintainer
Created branch alfa_qc with snakemake rules for running ALFA. So far tested this on one sample: /scicore/home/zavolan/burri0000/test_alfa/sample/chr21.bam.
Tests can be performed with:

snakemake --use-singularity --singularity-args "--bind ${PWD},'/scicore/home/zavolan/burri0000/test_alfa/sample/'" run_alfa_bg_stranded

and

snakemake --use-singularity --singularity-args "--bind ${PWD},'/scicore/home/zavolan/burri0000/test_alfa/sample/'" run_alfa_bg_unstranded
- Alex Kanitz mentioned in issue #42 (closed)
- Alex Kanitz changed title from Investigate ALFA to Add ALFA
- Alex Kanitz changed the description
- Alex Kanitz marked the checklist item Add Snakemake rule for ALFA as completed
- Alex Kanitz marked the checklist item Add Snakemake rule for ALFA as incomplete