Add rule for ALFA
- Investigate ALFA (https://github.com/biocompibens/ALFA)
  - to use for creating summaries of sample content
  - whether it makes sense to use it in combination with STAR
- Add Snakemake rule for ALFA
  - Run alfa for all the samples together and not for each sample separately.
- Add tests for ALFA rule
  - create test sample table (e.g. for parts of chr21.bam) with sample info
  - create test case
  - write bash script that executes Snakemake with the relevant settings
- Add script to .gitlab-ci.yml
Activity
- Maintainer
Multimapping reads
Use the quick_start mapped reads to investigate the counting behaviour of multi-mapping reads. These reads are annotated in two ways: i) FLAG 256 for the secondary alignment and ii) the NH tag for the number of occurrences of the read in the alignment file.
Adjust quick_start.bam: create a file with the same number of aligned reads, but with two reads multi-mapping to two different locations (from intergenic to 5'UTR and from CDS to 3'UTR), annotated with the above tags. Resulting multimapping.sam:

@HD VN:1.4 SO:coordinate
@SQ SN:Chr1 LN:1250
@PG ID:STAR PN:STAR VN:STAR_2.4.2a_modified CL:condor_exec.exe --runThreadN 10 --genomeDir /projects/merip/Mathieu/Mapping/TAIR10_STAR_index_with_GTF --readFilesIn /projects/merip/Mathieu/Cutadapt/Min_48nt/241.trim.fastq --outFileNamePrefix 241. --outSAMtype BAM SortedByCoordinate --outFilterMultimapNmax 1 --outFilterMismatchNoverLmax 0.06 --alignIntronMax 10000
@CO user command line: condor_exec.exe --readFilesIn /projects/merip/Mathieu/Cutadapt/Min_48nt/241.trim.fastq --outFileNamePrefix 241. --runThreadN 10 --genomeDir /projects/merip/Mathieu/Mapping/TAIR10_STAR_index_with_GTF --outFilterMismatchNoverLmax 0.06 --outFilterMultimapNmax 1 --alignIntronMax 10000 --outSAMtype BAM SortedByCoordinate
HWI-1KL110:162:C7268ACXX:8:1105:18693:99774 0 Chr1 150 255 50M * 0 0 ATACCAAACCAGAGAAAACTAATACATAATCGCAGAAATACAGATTACGG @CCFFFFFHHHHHJJJJJJ+CGHIJJJJJJIJJJIJJJJJJJJJJJIJJJ NH:i:2 HI:i:1 AS:i:43 nM:i:2
HWI-1KL110:162:C7268ACXX:8:1105:18693:99774 256 Chr1 300 255 50M * 0 0 ATACCAAACCAGAGAAAACTAATACATAATCGCAGAAATACAGATTACGG @CCFFFFFHHHHHJJJJJJ+CGHIJJJJJJIJJJIJJJJJJJJJJJIJJJ NH:i:2 HI:i:1 AS:i:43 nM:i:2
HWI-1KL110:162:C7268ACXX:8:1206:17943:50608 0 Chr1 300 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJIJJJJJJJIIJJIJJJJJIJJJJIJJJJIJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1215:9399:59794 0 Chr1 500 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG @<@DDBDDFBDHHIEHGGEHICHIIIGHHIEG8DG@?FDEGGFHIGHGJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:2101:19036:57639 0 Chr1 500 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:2207:19417:41334 0 Chr1 500 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJIJJJJJJJJJJJJJIJIJJJJJJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:2213:9216:81119 0 Chr1 500 255 50M * 0 0 ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJIJJJ NH:i:1 HI:i:1 AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1111:13970:73107 0 Chr1 500 255 50M * 0 0 AGTGAAAATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCCAACGACGAGG @C@FFFFFHHHHHJJJJJJJIIJJJIJJJJJHHHJJJIIJJJJJJJJJJH NH:i:2 HI:i:1 AS:i:48 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1111:13970:73107 256 Chr1 1100 255 50M * 0 0 AGTGAAAATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCCAACGACGAGG @C@FFFFFHHHHHJJJJJJJIIJJJIJJJJJHHHJJJIIJJJJJJJJJJH NH:i:2 HI:i:1 AS:i:48 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1116:13448:90140 0 Chr1 1100 255 50M * 0 0 CGGCCACTATCTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTG BBCFFFFFHHHHHJJJJJJJJJJJJJIJJJJJJJJJJJJJIIJJIJHFFF NH:i:1 HI:i:1 AS:i:47 nM:i:1
Run alfa with the following Snakemake rule:

rule alfa_process:
    input:
        one="../ALFA/multimapping_reads/multimapping.bam",
        two="../ALFA/Quick_start/quick_start.bam"
    output:
        "output.Biotypes.png",
        "output.Categories.png"
    params:
        strand="forward",
        filetype="png",
        genome_index="../ALFA/quick_start",
        lone="multimapping",
        ltwo="quick_start"
    conda:
        "../env_alfa.yaml"
    threads: 8
    shell:
        """alfa -g {params.genome_index} \
        --bam {input.one} {params.lone} {input.two} {params.ltwo} \
        -s {params.strand} \
        --{params.filetype} output \
        -p {threads}
        """
Result
The annotations change according to the new positions (output.Categories).
It appears that there is no check on the aligned reads: multi-mapping reads are counted several times.
- Maintainer
Investigate source code
alfa.py has the function:

def run_genomecov(arguments):
    """
    Run genomecov (from Bedtools through pybedtools lib) for a set of
    parameters to produce a BedGraph file.
    """
    (strand, bam_file, sample_label, name, bedgraph_extension, options) = arguments
    # Load the BAM file
    input_file = pybedtools.BedTool(bam_file)
    # Run genomecov
    if strand == "":
        input_file.genome_coverage(bg=True, split=True).saveas(
            options.output_dir + sample_label + name + bedgraph_extension)
    else:
        input_file.genome_coverage(bg=True, split=True, strand=strand).saveas(
            options.output_dir + sample_label + name + bedgraph_extension)
    return None
The coverage of the aligned reads is computed with pybedtools genome_coverage() and saved as .bedgraph. There seems to be no option in bedtools genomecov to account for multi-mapping reads.
- Author Maintainer
Thanks, @burri0000. So to conclude, does it mean that it uses multimappers by default or that it ignores multimappers?
- Maintainer
ALFA uses all the entries from the .bam. So if the .bam contains multi-mappers, then these reads count multiple times. One way is to remove all secondary alignments with the SAM FLAG 1024; then each read is only counted once, but this loses information. I will look in more detail at how we can account for this.
- Maintainer
I am sorry, it is FLAG 256, not 1024.
- Maintainer
Things to address:
- Adjust snakemake rule alfa_process for multiple samples and the directory structure
- Adjust alfa.py to account for multimappers: divide the count by the number of mapping positions
- Adjust snakemake
- Dominik Burri added Doing label
- Maintainer
Instead of adjusting alfa.py, run alfa from coverage files with --bedgraph.
- The coverage .bedgraph should be computed by multiplying each alignment with a factor corresponding to something like 1/{NH:i:X}, where X is the number of mapping positions.
- Another nice thing would be to display the percentage of reads with multiple mappings.
Example of a multi-mapping read from a 10x genomics run (sample3a):
K00335:36:HJ5J5BBXX:6:1213:6380:23276 256 1 29895 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:3 AS:i:84 nM:i:1 TX:Z:ENST00000473358,+341,88M GX:Z:ENSG00000243485 GN:Z:MIR1302-2HG fx:Z:ENSG00000243485 RE:A:E li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
K00335:36:HJ5J5BBXX:6:1213:6380:23276 256 1 200411 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:2 AS:i:84 nM:i:1 RE:A:I li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
K00335:36:HJ5J5BBXX:6:1213:6380:23276 256 12 31949 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:6 AS:i:84 nM:i:1 RE:A:I li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
K00335:36:HJ5J5BBXX:6:1213:6380:23276 272 15 101960980 0 88M * 0 0 TTCTGTCTCTGTGAATTTTGACTATTCTAGGCACTTCACAAAAGTGGACTCATACGATATCTGTAGTTTTGCGTCTGGCTTCTCTATT JJJJJJJJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJFJJJJJFJJJJJFJFAJJJJJJJJJJFJJJJJJJJJJJJJJJFFFAA NH:i:6 HI:i:5 AS:i:84 nM:i:1 TX:Z:ENST00000561145,+341,88M GX:Z:ENSG00000259553 GN:Z:AC140725.1 fx:Z:ENSG00000259553 RE:A:E li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
K00335:36:HJ5J5BBXX:6:1213:6380:23276 0 19 71502 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:1 AS:i:84 nM:i:1 RE:A:I xf:i:0 li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
K00335:36:HJ5J5BBXX:6:1213:6380:23276 256 9 29673 0 88M * 0 0 AATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCACTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAA AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJAFJFJJJJJFJJJJJFJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJ NH:i:6 HI:i:4 AS:i:84 nM:i:1 TX:Z:ENST00000422679,+2016,88M GX:Z:ENSG00000227518 GN:Z:AL928970.1 fx:Z:ENSG00000227518 RE:A:E li:i:0 CR:Z:CCCTCCTAGCTGTCTA CY:Z:AAFFFJJJJJJJFJJJ CB:Z:CCCTCCTAGCTGTCTA-1 UR:Z:TGTTCCGCAA UY:Z:JJJJJJJJJJ UB:Z:TGTTCCGCAA RG:Z:sample3a:0:1:HJ5J5BBXX:1
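The percentage of reads with multiple mappings could be estimated directly from the NH tags. A minimal, stdlib-only sketch of the idea (the helper name is ours, not part of ALFA; it counts primary alignments only, so each read is counted exactly once):

```python
import re

def multimapper_fraction(sam_lines):
    """Return the fraction of reads whose NH tag is > 1.

    Only primary alignments are counted (FLAG bit 256 = secondary
    alignment is skipped), so every read contributes once.
    """
    total = 0
    multi = 0
    for line in sam_lines:
        if line.startswith("@"):        # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 256:                  # secondary alignment
            continue
        total += 1
        m = re.search(r"\bNH:i:(\d+)", line)
        if m and int(m.group(1)) > 1:
            multi += 1
    return multi / total if total else 0.0
```

For the read above (NH:i:6, one primary plus five secondary alignments), this counts a single multi-mapping read.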
- Maintainer
Snakemake rules for running ALFA
ALFA requires a special form of sample input, namely a .bam/label pair for each sample. Therefore, I hacked it by putting the samples and their labels in a dictionary:

samples = {
    "possorted": "../../polyasite/cellranger_count/sample3a/outs/possorted_genome_bam.bam",
    "mapq": "../../polyasite/out/sample3a/filtered/mapq.bam",
    "uniqueumi": "../../polyasite/out/sample3a/filtered/uniqueumi.bam",
    "cbtags": "../../polyasite/out/sample3a/filtered/cbtags.bam"
}

and a list for the rule alfa_process is generated:

samples_list = []
for k, v in samples.items():
    samples_list.append(v)
    samples_list.append(k)
The rules for genome indexing and sample processing:

rule all:
    input:
        "output/output.Biotypes.png",
        "output/output.Categories.png"

rule alfa_index:
    input:
        "../../polyasite/datasets/ref_transcriptome/refdata-cellranger-GRCh38-3.0.0/genes/genes.sorted.gtf"
    output:
        "sorted_genes.stranded.ALFA_index",
        "sorted_genes.unstranded.ALFA_index"
    conda:
        "../env_alfa.yaml"
    threads: 8
    shell:
        "alfa -a {input} -g sorted_genes -p {threads}"

rule alfa_process:
    input:
        samples=list(samples.values()),
        str_index="sorted_genes.stranded.ALFA_index",
        unstr_index="sorted_genes.unstranded.ALFA_index"
    output:
        "output/output.Biotypes.png",
        "output/output.Categories.png"
    params:
        strand="forward",
        filetype="png",
        genome_index="sorted_genes",
        samples=samples_list
    conda:
        "../env_alfa.yaml"
    threads: 8
    shell:
        """alfa -g {params.genome_index} \
        --bam {params.samples} \
        -s {params.strand} \
        --{params.filetype} output/output \
        -p {threads}
        """
I run snakemake on the slurm cluster with run_cluster.sh:

#!/usr/bin/bash
# use rule all
snakemake --rerun-incomplete \
    --use-conda \
    --cluster-config cluster.json \
    --cluster "sbatch --cpus-per-task={cluster.threads} --mem={cluster.mem} --time={cluster.time} --qos={cluster.queue} --job-name={cluster.name} -o {cluster.out}" \
    --cores 8
and cluster.json:

{
    "alfa_index": {
        "queue": "6hours",
        "time": "02:00:00",
        "threads": "8",
        "mem": "8G",
        "name": "{rule}.{wildcards}",
        "out": "$PWD/logs/cluster_log/{rule}.{wildcards}-%j-%N.out"
    },
    "alfa_process": {
        "queue": "6hours",
        "time": "02:00:00",
        "threads": "4",
        "mem": "8G",
        "name": "{rule}.{wildcards}",
        "out": "$PWD/logs/cluster_log/{rule}.{wildcards}-%j-%N.out"
    }
}
env_alfa.yaml contains:

name: alfa
channels:
  - biocomp
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - alfa=1.1.1
- MihaelaZavolan changed the description
- Owner
Mikhail and Dominik will check if the source code can be modified to include weighting for multimappers.
- Maintainer
Account for multi-mappers
There are two possibilities to account for multi-mapping reads:
- Create a weighted coverage .bedgraph (by dividing each alignment by the number of mappings in the NH tag). Supply this .bedgraph to ALFA and compute the feature distribution and plots with ALFA. ALFA should allow this option with multiple samples.
- Create the feature distribution and plots with our own script by assigning each alignment (or each mapped base pair) to a feature (CDS, 3'UTR, intergenic, etc.). The features can be stored with intervaltree, see make_plots.py from Mikhail's crunch-mara.
Both options need to iterate over all lines in the alignment .bam file. Also, since the mapping is done with STAR, we assume the existence of the NH tag for weighting the reads.
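The first option (a 1/NH-weighted coverage) can be sketched in plain Python without pybedtools, by accumulating per-base weights from SAM records. This is only an illustration under stated assumptions (tab-separated SAM text, simple ungapped CIGARs like "50M", NH tag present as with STAR output); the function names are ours:

```python
import re
from collections import defaultdict

def weighted_coverage(sam_lines):
    """Accumulate NH-weighted coverage per (reference, position).

    Each alignment adds 1/NH to every base it covers, so a read
    mapping to N places contributes a total weight of 1 genome-wide.
    Only simple CIGARs like '50M' are handled in this sketch.
    """
    cov = defaultdict(float)
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.split("\t")
        ref, pos, cigar = fields[2], int(fields[3]), fields[5]
        m = re.search(r"\bNH:i:(\d+)", line)
        nh = int(m.group(1)) if m else 1
        match = re.fullmatch(r"(\d+)M", cigar)
        if not match:
            continue  # skip indels/splices in this sketch
        for p in range(pos, pos + int(match.group(1))):
            cov[(ref, p)] += 1.0 / nh
    return cov

def to_bedgraph(cov):
    """Emit per-base BedGraph intervals (0-based, half-open).

    A real BedGraph would merge adjacent positions with equal values;
    this sketch keeps one interval per base for clarity.
    """
    return [f"{ref}\t{p - 1}\t{p}\t{cov[(ref, p)]:.4f}"
            for (ref, p) in sorted(cov)]
```

The resulting lines could then be written to a file and fed to ALFA via --bedgraph.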
- Maintainer
Normalised coverage
STAR has an option to create wig tracks (e.g. .bedgraph) from .bam with normalisation to reads per million (RPM):

STAR --runMode inputAlignmentsFromBAM \
    --runThreadN XX \
    --inputBAMfile XX.bam \
    --outWigType bedGraph \
    --outWigStrand Unstranded (or Stranded) \
    --outWigNorm RPM \
    --outFileNamePrefix ./
RPM calculation
The RPM is calculated as:
[(raw count)/(genomic hit number; NH tag)] / (library size) * 1e6
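Spelled out as code, the per-alignment contribution looks like the following (a paraphrase of the formula above, not the actual STAR source; the names are ours):

```python
def rpm_contribution(raw_count, nh, library_size):
    """RPM contribution of one alignment: divide by NH (genomic hit
    number), normalise by library size, scale to 'per million'."""
    return (raw_count / nh) / library_size * 1e6
```

So a read mapping to two places in a library of one million reads contributes 0.5 RPM at each location.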
We found the code for this in STAR's signalFromBAM.cpp.
Plots with ALFA
The obtained .bedgraph can now be fed into ALFA with the option --bedgraph to create the desired graphs; both stranded and unstranded are possible.
- Maintainer
This is the description regarding RPM from "Mapping RNA-seq Reads with STAR", Alternative Protocol 4:
The signal are generated separately for two genomic strands: .str1. and .str2., which correspond to different genomic strands depending on the library preparation protocols. For the protocols in which the 1st read is on the opposite strand to the RNA molecule (such as Illumina stranded Tru-Seq), .str1. corresponds to the (−) strand and .str2. corresponds to the (+) strand.
- Dominik Burri mentioned in commit 4af9f0b1
- Maintainer
Created branch alfa_qc with snakemake rules for running ALFA. So far tested this on one sample: /scicore/home/zavolan/burri0000/test_alfa/sample/chr21.bam.
Tests can be performed with:

snakemake --use-singularity --singularity-args "--bind ${PWD},'/scicore/home/zavolan/burri0000/test_alfa/sample/'" run_alfa_bg_stranded

and

snakemake --use-singularity --singularity-args "--bind ${PWD},'/scicore/home/zavolan/burri0000/test_alfa/sample/'" run_alfa_bg_unstranded
- Alex Kanitz mentioned in issue #42 (closed)
- Alex Kanitz changed title from Investigate ALFA to Add ALFA
- Alex Kanitz changed the description
- Alex Kanitz marked the checklist item Add Snakemake rule for ALFA as completed
- Alex Kanitz marked the checklist item Add Snakemake rule for ALFA as incomplete