
A-seq read processing pipeline

The scripts and pipelines provided in this repository help to process A-seq read libraries. The repository contains all scripts needed to assemble the processing yourself and additionally provides pipeline scripts that run the entire processing automatically.

To get started, clone the repository into your target directory. Once this is done, ensure that all required external software/scripts are installed.

git clone https://git.scicore.unibas.ch/zavolan_public/A-seq2-processing.git path/to/workdir
cd path/to/workdir

Requirements

To run this pipeline, your computer requires 40 GB of available memory (RAM) to process larger genomes (e.g. human or mouse). Moreover, snakemake is used to facilitate the automated execution of all analysis steps. The easiest way to use the pipeline is to set up a python3 virtual environment and run the pipeline within this environment. The following section gives instructions on how to install this virtual environment.

snakemake

Snakemake is a workflow management system that helps to create and execute data processing pipelines. It requires python3 and is most easily installed via the bioconda channel of the Anaconda python distribution. You are set up after the following steps:

Step 1: Downloading Miniconda3

On Linux:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

On MacOS X:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh

Step 2: Creating a new environment with all required packages/software

conda create -n Aseq-snakemake -c bioconda -c conda-forge -c ostrokach --file requirements.txt

If you already have a conda version with python2 installed, create the new python 3 environment like this:

conda create -n Aseq-snakemake -c bioconda -c conda-forge -c ostrokach --file requirements.txt python=3

With the installation of the environment, all software required by the pipeline (as listed in requirements.txt) is installed as well.

Step 3: Activate the environment

source activate Aseq-snakemake

To exit the environment (after finishing the usage of the pipeline), just execute

source deactivate

Step 4: Download/Provide all necessary files for the used genome:

  • genome sequence for the organism of interest: Let's assume we want to work with the human genome (ENSEMBL GRCh38). It can be downloaded from ENSEMBL with the following commands:

    wget -P resources ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
    gzip -d resources/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

    Please note: With the commands above, the genome sequence is stored as a fasta file in the resources folder of the working directory. The file must be unzipped so that it can be used later during the pipeline run.

  • gtf annotation for the organism of interest: The repository already contains the ENSEMBL annotation for the human genome version GRCh38. Other gtf files can be downloaded directly from ENSEMBL. To download another gtf file, just follow the genome download instructions above, i.e. download the file into the resources folder and unzip it afterwards (see the sketch below). Please note: GTF and fasta file should be downloaded from the same source to ensure that the chromosome names match.
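
    As a sketch (please verify the exact FTP path on the ENSEMBL server before use), fetching and unpacking the GRCh38 release-87 annotation would look like this:

    wget -P resources ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz
    gzip -d resources/Homo_sapiens.GRCh38.87.gtf.gz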

Step 5: Prepare cluster execution

If you plan to run the pipeline with individual jobs on a high-performance computing (HPC) cluster, execute the following command to ensure that the jobs know where to find the software within the environment set up in steps 2 and 3.

python scripts/complete-jobscript.py -i jobscript.sh -o jobscript.withJOBID.withPATH.sh

Run the pipeline

Configure input parameters

The working directory contains a file named Aseq_config.yaml. It is the central file in which all user settings, parameter values and path specifications are stored. During a run, all steps of the pipeline retrieve their parameter values from this file. It follows the yaml syntax (more information about yaml and its syntax is available online), which makes it easy to read and edit. The main principles are:

  • everything that comes after a # symbol is treated as a comment and will not be interpreted
  • parameters are given as key-value pairs, where the key is the parameter name and the value is the parameter value

Before starting the pipeline, open the Aseq_config.yaml configuration file and set all options as required. This should at least include:

  • name of the input directory - where your input fastq files are stored
  • name of the output directory - where the pipeline should store the output files (the directory is created if it does not exist)
  • name of the log directory - where the pipeline should store the log files
  • name(s) of your input samples - please note: If your sample is named sample1.fq.gz then sample1 will be kept as the naming scheme throughout the entire run to indicate output files that belong to this input file, e.g. the pipeline will create a file called sample1.3pSites.noIP.bed.gz. If you have multiple input files, just follow the given pattern with one sample name per line (each preceded by a dash that indicates another list item).
  • UMI length - if the length of the random nucleotides was changed, then the read 5' start pattern needs to be adjusted (default: read_5p_start: ....TTT)

If you downloaded a genome other than the one given above, it is required to also set the correct path and file name for the genome key. If the gtf annotation is different as well, the values for gene_annotation and utr_name also need to be changed (utr_name refers to the label used in the gtf to mark 3' UTR regions). A sketch of a configuration is shown below.
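
A minimal sketch of what such a configuration could look like (the keys read_5p_start, genome, gene_annotation and utr_name are taken from the descriptions above; the key names for the directories and the sample list are placeholders, so keep the key names that already exist in Aseq_config.yaml):

# illustrative sketch - directory and sample-list key names are placeholders
input_dir: "fastq"          # where the input fastq files are stored (placeholder key name)
output_dir: "results"       # created if it does not exist (placeholder key name)
log_dir: "logs"             # where log files are written (placeholder key name)
samples:                    # one sample name per line, each preceded by a dash (placeholder key name)
  - sample1
  - sample2
read_5p_start: ....TTT      # 4 random nucleotides (UMI) followed by TTT
genome: resources/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
gene_annotation: resources/Homo_sapiens.GRCh38.87.gtf
utr_name: three_prime_utr   # label marking 3' UTR regions in the gtf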


Start a run

Once you set up your configuration file, running the pipeline locally on your computer is as easy as invoking:

max_cores=4 # maximum number of threads that will run in parallel
# if not yet done: source activate Aseq-snakemake
snakemake -p --cores ${max_cores}

or, if you would like to save the snakemake messages in a log file:

max_cores=4 # maximum number of threads that will run in parallel
snakemake -p --cores ${max_cores} &> log_output.log

It is recommended to set the max_cores parameter so that multiple steps of the pipeline can run in parallel. Moreover, some scripts of the pipeline make use of a parallel environment, and defining a maximum number of available cores prevents overloading the machine.
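
Before the actual run, snakemake's generic dry-run mode can optionally be used to list the jobs and shell commands that would be executed, without running anything; this is standard snakemake behaviour, not a pipeline-specific step, but it is a quick sanity check of the configuration:

# if not yet done: source activate Aseq-snakemake
snakemake -n -p   # -n: dry run, -p: print the shell commands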

The same pipeline can also easily be run in a parallel cluster environment. It was tested on a cluster with CentOS 6.5 as operating system and Univa Grid Engine 8.3 as queuing system. Here, the pipeline has to be invoked as follows:

max_jobs=100 # maximum number of concurrently submitted jobs
snakemake \
  --cluster-config cluster_config.json \
  --jobscript jobscript.withJOBID.withPATH.sh \
  --cluster "qsub -pe smp {cluster.threads} \
    -cwd \
    -l membycore={cluster.mem} \
    -R y \
    -l runtime={cluster.time} \
    -o {params.cluster_log} \
    -j y" \
  --jobs ${max_jobs}

Detailed explanations for each step of the pipeline

The following notes assume that the pipeline was started with default configurations, the current working directory as results directory and one sample called test.fq.gz. Log-files for the different steps are stored in the logs/ directory.

create_log_dir:

input: -

output: *temp, created_log.tmp

details: Create a directory for log-entries written by HPC jobs (only used when the pipeline is run on an HPC cluster). *local

shell command: pipeline inherent python code

fastq2fasta:

input: test.fq.gz

output: *temp, test.fa.gz

details: Convert the input fastq file into fasta format.

shell command:

perl scripts/ag-convert-FASTQ-to-fasta.pl \
     --fastq_to_fasta=fastq_to_fasta \
     --fastx_renamer=fastx_renamer \
     --oufile=test.fa.gz \
     test.fq.gz

raw_read_cnt:

input: test.fa.gz

output: *temp, counts/test.raw.nr.out

details: Count the number of raw reads the run starts with. *local

shell command: pipeline inherent python code
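
The counting itself happens in python code inside the pipeline. Purely as an illustration (this is not the pipeline's actual code, and the real counts file may have a different format), the number of raw reads equals the number of header lines in the gzipped fasta file:

zcat test.fa.gz | grep -c "^>"   # number of reads = number of fasta headers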

select_for_valid_5p_configuration:

input: test.fa.gz

parameters:

  • adapter: sequence of the 5' adapter that is cleaved off (default: "....TTT"; name in config-file: 'read_5p_start')

output: *temp, test.5ptrimmed.UMI_available.fa.gz

details: Discard reads that do not start with the expected pattern NNNNTTT; append the starting tetramer sequence as unique molecular identifier (UMI) to the read name.

shell command:

adapter="....TTT"
zcat test.fa.gz \
     | perl scripts/rs-filter-by-5p-adapter.keep5pAdapter.pl \
     --adapter=${adapter} \
     | gzip > test.5ptrimmed.UMI_available.fa.gz

proper_5p_end_read_cnt:

input: test.5ptrimmed.UMI_available.fa.gz

output: *temp, counts/test.5ptrimmed.nr.out

details: Count the number of remaining reads. *local

shell command: pipeline inherent python code

collapse_full_UMI_duplicates:

input: test.5ptrimmed.UMI_available.fa.gz

output: *temp, test.UMI_dupl_removed.fa.gz

details: Keep only one copy of reads that share the same sequence as well as their UMI.

shell command:

zcat test.5ptrimmed.UMI_available.fa.gz \
     | perl scripts/rs-collapse-UMIduplicates.keepUMI.pl \
     | gzip > test.UMI_dupl_removed.fa.gz

no_PCR_dupl_read_cnt:

input: test.UMI_dupl_removed.fa.gz

output: *temp, counts/test.no_full_PCR_dupl.nr.out

details: Count the number of remaining reads. *local

shell command: pipeline inherent python code

trim_3p_adapter:

input: test.UMI_dupl_removed.fa.gz

parameters:

  • adapt: 5' sequence of the 3' adapter (default: "TGGAATTCTCGGGTGCCAAGG"; name in config-file: '3p_adapter')
  • minLen: minimum length a read must have after adapter trimming, otherwise it is discarded (default: 15, name in config-file: 'min_length')

output: *temp, test.trimmed.UMI.fa.gz

details: Remove adapter sequence from the 3' end of reads, reverse complement the reads and keep only those with a minimum length.

shell command:

adapt="TGGAATTCTCGGGTGCCAAGG"
minLen=15
cutadapt -a ${adapt} \
    --minimum-length ${minLen} \
    test.UMI_dupl_removed.fa.gz \
    | fastx_reverse_complement \
    | gzip > test.trimmed.UMI.fa.gz

no_3pAdapter_read_cnt:

input: test.trimmed.UMI.fa.gz

output: *temp, counts/test.trimmed.nr.out

details: Count the number of remaining reads. *local

shell command: pipeline inherent python code

get_valid_reads:

input: test.trimmed.UMI.fa.gz

parameters:

  • maxN: maximum number of N nucleotides per read (default: 2; name in config-file: 'maxN')
  • maxAcontent: maximum fraction of A nucleotides per read (default: 0.8; name in config-file: 'maxAcontent')

output: *temp, test.valid.trimmed.UMI.fa.gz

details: Remove reads with more than 2 Ns, more than 80 % As or an A at the 3' end. *local

shell command:

maxN=2
maxAcontent=0.8
zcat test.trimmed.UMI.fa.gz \
    | perl scripts/ag-filter-seqs-by-nucleotide-composition.pl --max ${maxN} --nuc N \
    | perl scripts/ag-filter-seqs-by-nucleotide-composition.pl --max ${maxAcontent} --nuc A \
    | perl scripts/ag-filter-seqs-by-last-nuc.pl \
    | gzip > test.valid.trimmed.UMI.fa.gz

valid_read_cnt:

input: test.valid.trimmed.UMI.fa.gz

output: *temp, counts/test.valid.nr.out

details: Count the number of remaining reads with appropriate quality. *local

shell command: pipeline inherent python code

create_STAR_index:

input: -

parameters:

  • read_len: the index should be built for the longest read length minus one (default: number_of_sequencing_cycles - 5_prime_adapter_length - 1 = 50 - 7 - 1; name in config-file: 'read_length')
  • genome: fasta file of the used genome (name in config-file: 'genome')
  • gtf: corresponding gene annotation as gtf file (name in config-file: 'gene_annotation')
  • idx_dir: directory to store the created files into (name in config-file: 'STAR_idx')

threads: default: 8, name in config-file: 'threads.STAR'

output: the directory given above (idx_dir) with several index files, e.g. SAindex

details: Create an index for the genome. This step has high memory requirements (GRCh38: ~31 GB).

shell command:

genome="resources/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"
gtf="resources/Homo_sapiens.GRCh38.87.gtf"
idx_dir="resources/Homo_sapiens.GRCh38.50nt.index"

if [[ ! -d ${idx_dir} ]]; then mkdir ${idx_dir}; fi
STAR \
    --runMode genomeGenerate \
    --sjdbOverhang 52 \
    --genomeDir ${idx_dir} \
    --genomeFastaFiles ${genome} \
    --runThreadN 8 \
    --sjdbGTFfile ${gtf}

mapping:

input: STAR_index, test.valid.trimmed.UMI.fa.gz

parameters:

  • anno: gene annotation as gtf file (name in config-file: 'gene_annotation')
  • outdir: name of the directory for the STAR output (default: "test.STAR_out/")
  • idx_dir: directory with index files (name in config-file: 'STAR_idx')

threads: default: 8, name in config-file: 'threads.STAR'

output: test.STAR_out/Aligned.sortedByCoord.out.bam

details: Map the reads to the genome (splice mappings are considered). STAR must not soft-clip the ends of reads. Similar to "create_STAR_index", this step has high memory requirements.

shell command:

anno="resources/Homo_sapiens.GRCh38.87.gtf"
outdir="test.STAR_out/"
idx_dir="resources/Homo_sapiens.GRCh38.50nt.index"

if [[ ! -d ${outdir} ]]; then mkdir ${outdir}; fi
STAR \
    --runMode alignReads \
    --twopassMode Basic \
    --runThreadN 8 \
    --genomeDir ${idx_dir} \
    --sjdbGTFfile ${anno} \
    --readFilesIn test.valid.trimmed.UMI.fa.gz \
    --outFileNamePrefix ${outdir} \
    --outSAMtype BAM SortedByCoordinate \
    --limitBAMsortRAM 31532137230 \
    --readFilesCommand zcat \
    --outSAMstrandField intronMotif \
    --alignIntronMax 200000 \
    --outReadsUnmapped Fastx \
    --outSAMattributes All \
    --alignEndsType EndToEnd

bam2bed:

input: test.STAR_out/Aligned.sortedByCoord.out.bam

threads: default: 8 (name in config-file: 'threads.bam2bed')

output: *temp, test.reads.bed.gz

details: Convert each mapping into BED-format. Store the edit distance for each mapping as score in BED column 5.

shell command:

samtools view test.STAR_out/Aligned.sortedByCoord.out.bam \
    | python scripts/rs-bam2bed.py \
    --processors 8 \
    | gzip > test.reads.bed.gz

select_by_edit_distance:

input: test.reads.bed.gz

output: *temp, test.edit_distance_filtered.bed.gz

details: When reads map to multiple locations, only keep the mapping(s) with the smallest edit distance. Set the score of each kept mapping to 1 / n, with n = number of mappings with the smallest edit distance (e.g. a read with two equally good mappings gets a score of 0.5 at each location).

shell command:

python scripts/rs-select-min-edit-distance.py \
       --bed test.reads.bed.gz \
        | gzip > test.edit_distance_filtered.bed.gz

collapse_mapped_UMI_duplicates:

input: test.edit_distance_filtered.bed.gz

output: *temp, test.reads.collapsed.bed.gz, counts/test.mapped.nr.out

details: From multiple reads with the same UMI mapping to the same location, keep only one. Also, count the number of uniquely mapping reads and reads that map to multiple loci.

shell command:

perl scripts/rs-collapse-mapped-UMIduplicates.pl \
     --count_out=counts/test.mapped.nr.out \
     test.edit_distance_filtered.bed.gz \
     | gzip > test.reads.collapsed.bed.gz

mapped_read_cnt:

input: counts/test.valid.nr.out, counts/test.mapped.nr.out

output: counts/test.mapped.noUMI_duplicates.nr.out

details: Appends the number of unique and multi mappers to the counts of valid reads. *local

shell commands: pipeline inherent python code

get_3p_ends:

input: test.reads.collapsed.bed.gz

parameters:

  • exclude_chr: list of chromosomes that are excluded from the output (e.g. mitochondrial or Y chromosome; name in config-file: 'excluded_chr')
  • min_align: minimum number of nucleotides at the 3' end that have to align perfectly to the reference (default: 4, name in config-file: 'min_3p_align')

output: *temp, test.3pSites.bed.gz

details: From all mappings, only keep the 3' end if it mapped perfectly to the reference. These sites indicate the actual cleavage sites.

shell command:

min_align=4

perl scripts/rs-get-3pEnds-from-bed.pl \
     --exclude=Y --exclude=M \
     --strict \
     --min_align=${min_align} \
     test.reads.collapsed.bed.gz \
     | gzip > test.3pSites.bed.gz

raw_3p_ends_cnt:

input: test.3pSites.bed.gz

output: *temp, counts/test.raw_3p_ends.nr.out

details: Count the number of retrieved raw 3' end processing sites. *local

shell command: pipeline inherent python code

fetch_flanking_seqs:

input: test.3pSites.bed.gz

parameters:

  • upstream_ext: number of nucleotides added upstream to the cleavage site for sequence retrieval (default: 0, name in config-file: 'upstream_region.IP')
  • downstream_ext: number of nucleotides added downstream to the cleavage site for sequence retrieval (default: 10, name in config-file: 'downstream_region.IP')
  • genome: fasta file for the genome sequence (name in config-file: 'genome')

output: *temp, test.3pSites.bed.seqs.gz

details: Retrieve the genomic sequence around the raw 3' end processing sites.

shell command:

upstream_ext=0
downstream_ext=10
genome="resources/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"

perl scripts/rs-fetch-flanking-fasta.pl \
     --genome=${genome} \
     --upstream=${upstream_ext} \
     --downstream=${downstream_ext} \
     test.3pSites.bed.gz \
     | gzip > test.3pSites.bed.seqs.gz

assign_IP_sites:

input: test.3pSites.bed.seqs.gz

parameters:

  • upstream_ext: number of nucleotides added upstream to the cleavage site for sequence retrieval (default: 0, name in config-file: 'upstream_region.IP')
  • downstream_ext: number of nucleotides added downstream to the cleavage site for sequence retrieval (default: 10, name in config-file: 'downstream_region.IP')
  • tot_As: number of overall As in the downstream region so that the 3' end site is considered as internal priming event (default: 7, name in config-file: 'total_As')
  • consec_As: number of consecutive As in the downstream region so that the 3' end site is considered as internal priming event (default: 6, name in config-file: 'consecutive_As')
  • ds_patterns: list of patterns. If the sequence directly downstream of the cleavage site starts with one of these patterns, the 3' end site is considered as internal priming event (default: AAAA, AGAA, AAGA, AAAG, name in config-file: 'downstream_patterns')

output: test.3pSites.ip.bed.gz

details: Based on pre-defined patterns, assess which 3' end processing sites are likely internal priming artifacts. Mark these. This file is an important output file that is used in later steps to create poly(A) site clusters.

shell command:

upstream_ext=0
downstream_ext=10
tot_As=7
consec_As=6

perl scripts/ag-assign-internal-priming-sites.pl \
     --upstream_len=${upstream_ext} \
     --downstream_len=${downstream_ext} \
     --consecutive_As=${consec_As} \
     --total_As=${tot_As} \
     --ds_pattern=AAAA --ds_pattern=AGAA --ds_pattern=AAGA --ds_pattern=AAAG \
     test.3pSites.bed.seqs.gz \
     | gzip > test.3pSites.ip.bed.gz

noIP_3p_ends:

input: test.3pSites.ip.bed.gz

output: test.3pSites.noIP.bed.gz

details: Create a BED-file that only contains valid (i.e. non-IP) 3' end processing sites. This file is intended for users who want to proceed with individual 3' end sites. *local

shell command:

zcat test.3pSites.ip.bed.gz \
     | perl -ne 'my @F = split("\t"); print $_ if($F[3] ne "IP");' \
     | gzip > test.3pSites.noIP.bed.gz

noIP_3p_ends_cnt:

input: test.3pSites.noIP.bed.gz

output: counts/test.summary.tsv

details: Count the number of valid individual raw 3' end processing sites. Output a tsv file that contains all numbers collected for the current sample (from raw reads to 3' end processing sites). All other count-related files are deleted. *local

shell command: pipeline inherent python code

sites_feature_overlap_stats:

input: test.3pSites.noIP.bed.gz

parameters:

  • gtf: gene annotation as gtf file (see above)
  • utr_name: label that indicates 3' UTR regions in the gtf annotation (if 3' UTR and 5' UTR are not marked separately, the output also reports the overlap of 3' end sites with annotated 5' UTR regions; default: 'three_prime_utr', name in config-file: 'utr_name')

output: counts/annotation_overlap/test.raw_3p_ends.noIP.summary.tsv

details: Intersect all provided 3' end processing sites with the given annotation. If sites do not overlap with any annotated feature, mark them as intergenic. Otherwise, differentiate if a site intersects with:

  • an exon
  • an intron
  • a terminal exon
  • a 3' UTR.

If a site overlaps with different features, count it partially toward each of them. For each feature from the above list, the output reports the absolute number of overlaps (these numbers might not be integers) and the fraction of sites falling into this feature, relative to all sites that intersect with any of the features. Hence, the fractions of the listed features sum up to one.

shell command:

gtf="resources/Homo_sapiens.GRCh38.87.gtf"
utr_name="three_prime_utr"

python scripts/rs-sites-to-feature-annotation.py \
       --verbose \
       --gtf ${gtf} \
       --utr_name ${utr_name} \
       --bed test.3pSites.noIP.bed.gz \
       > counts/annotation_overlap/test.raw_3p_ends.noIP.summary.tsv

pool_samples:

input: test.3pSites.ip.bed.gz # and all other files of this study, like: test2.3pSites.ip.bed.gz, test3.3pSites.ip.bed.gz, ...

output: 3pSites.tsv.gz

details: Merge 3' end processing sites from individual samples into a comprehensive table for the entire study. Keep track of the raw reads per site from each individual sample.

shell command:

perl scripts/ag-pool-sites.pl \
     --noip \
     test.3pSites.ip.bed.gz \
     | gzip > 3pSites.tsv.gz

pooled_3p_sites_cnt:

input: 3pSites.tsv.gz

output: *temp, counts/pooled_3p_ends.nr.out

details: Count the number of overall determined 3' end processing sites. *local

shell command: pipeline inherent python code

assign_polyA_signals:

input: 3pSites.tsv.gz

parameters:

  • signals: poly(A) signals that are assigned to individual 3' end sites (the list of signals comprises: AATAAA, ATTAAA, TATAAA, AGTAAA, AATACA, CATAAA, AATATA, GATAAA, AATGAA, AATAAT, AAGAAA, ACTAAA, AATAGA, ATTACA, AACAAA, ATTATA, AACAAG, AATAAG)
  • genome: fasta file for the genome sequence (name in config-file: 'genome')

output: 3pSites.PAS.tsv.gz

details: Assign poly(A) signals to 3' end processing sites: consider the region from 60 nt upstream to 10 nt downstream and note every occurrence of any of the hexamer motifs given under signals.

shell command:

genome="resources/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"

# only 2 of the 18 motifs are listed here as an example
perl scripts/ag-assign-polyAsignals.pl \
     --motif=AATAAA --motif=ATTAAA \
     --genome=${genome} \
     3pSites.tsv.gz \
     | gzip > 3pSites.PAS.tsv.gz

sample_specific_bg:

input: 3pSites.PAS.tsv.gz

parameters:

  • cutoff: percentage of preliminary clusters that are required to have an annotated poly(A) signal (default: 90; name in config-file: 'polyAsignal_cutoff')
  • upstream: length in nucleotides to search upstream for 3' end sites in order to create preliminary poly(A) site clusters (default: 25, name in config-file: 'upstream_bg_clusters')
  • downstream: length in nucleotides to search downstream for 3' end sites in order to create preliminary poly(A) site clusters (default: 25, name in config-file: 'downstream_bg_clusters')

output: *temp, filteredSites/test.filtered.tsv # and potential other samples from the study like: filteredSites/test2.filtered.tsv, ...

details: Define a sample-specific background level. Do so by merging individual 3' end sites that lie within 25 nt up- or downstream of higher-ranked 3' end sites into poly(A) site clusters. Iterate through the list of created clusters from highest to lowest expression. Find the expression value below which the percentage of clusters with an annotated poly(A) signal drops below the given cutoff. Consider all 3' end sites that contribute to poly(A) site clusters below this expression cutoff as background and set their expression to 0 (i.e. ignore them in subsequent steps).

shell command:

cutoff=90
upstream=25
downstream=25

perl scripts/rs-find-sample-specific-cutoff.pl \
     --cutoff=${cutoff} \
     --upstream=${upstream} \
     --downstream=${downstream} \
     --sample=test \
     3pSites.PAS.tsv.gz \
     > filteredSites/test.filtered.tsv

create_noBG_3pSites_table:

input: filteredSites/test.filtered.tsv # and potential other samples from the study like: filteredSites/test2.filtered.tsv, ...

output: *temp, 3pSites.PAS.filtered.tsv.gz

details: Merge together the sample-specific tables in which the expression of background 3' end sites was set to 0. The output table is essentially the same as 3pSites.PAS.tsv.gz except that background reads are deleted.

shell command: pipeline inherent python code

delete_zero_rows:

input: 3pSites.PAS.filtered.tsv.gz

output: 3pSites.PAS.noBG.tsv.gz

details: Because the expression of 3' end processing sites in different samples was set to zero (see the step before last), individual 3' end sites may end up without assigned expression in any sample. Delete these 3' end processing sites.

shell command: pipeline inherent python code

cluster_sites:

input: 3pSites.PAS.noBG.tsv.gz

parameters:

  • upstream_ext: region upstream of high ranked sites to find lower ranked sites that are clustered (default: 12, name in config-file: 'upstream_clusters')
  • downstream_ext: region downstream of high ranked sites to find lower ranked sites that are clustered (default: 12, name in config-file: 'downstream_clusters')

output: *temp, clusters.primary.tsv.gz

details: Cluster together closely spaced 3' end processing sites. Do so by:

  1. ranking sites according to their number of supporting samples, followed by the total expression of the sites.
  2. iterating over the sites from highest to lowest rank
  3. assigning lower ranked sites to higher ranked sites if they are in the predefined up- or downstream region of the higher ranked site.

Additionally, label poly(A) site clusters as potential internal priming candidates as soon as any constituting 3' end site of the cluster overlaps with an annotated poly(A) signal. **

shell command:

upstream_ext=12
downstream_ext=12

perl scripts/ag-generate-clusters.pl \
     --upstream=${upstream_ext} \
     --downstream=${downstream_ext} \
     3pSites.PAS.noBG.tsv.gz \
     | gzip > clusters.primary.tsv.gz

merge_clusters:

input: clusters.primary.tsv.gz

parameters:

  • maxsize: maximum cluster size that merged clusters must not exceed (except for those that share the same poly(A) signal; default: 25, name in config-file: 'max_cluster_size')
  • minDistToPAS: only relevant for the processing of potential internal priming candidates; minimum distance of the most 3' located individual site of a cluster to the closest annotated poly(A) signal; the cluster is rescued if the distance is greater than or equal to the given minimum (default: 15, name in config-file: 'min_dist_to_PAS')

output: clusters.merged.tsv.gz

details: Merge poly(A) site clusters further if they share the same set of poly(A) signals or if their combined length does not exceed the given maximum. In addition, resolve potential internal priming candidates. **

shell command:

maxsize=25
minDistToPAS=15

perl scripts/ag-merge-clusters.pl \
     --maxsize=${maxsize} \
     --minDistToPAS=${minDistToPAS} \
     clusters.primary.tsv.gz \
     | gzip > clusters.merged.tsv.gz

clusters_to_bed:

input: clusters.merged.tsv.gz

output: clusters.merged.bed

details: Convert the tsv-table from the last step into proper BED-format. Use the summed-up normalized expression as score in column 5 (the read count per 3' end site per sample is library-normalized as reads per million, and the overall cluster expression is the sum over all constituting individual 3' end sites across all samples).

shell command:

perl scripts/ag-clusters-to-BED.pl clusters.merged.tsv.gz clusters.merged.bed

clusters_cnt:

input: clusters.merged.bed

output: counts/clusters.nr.out

details: Count the number of obtained poly(A) site clusters. Create a file that reports the number of pooled 3' end sites and the number of obtained clusters. *local

shell command: pipeline inherent python code

clusters_feature_overlap_stats:

input: clusters.merged.bed

parameters:

  • gtf: gene annotation as gtf file (see above)
  • utr_name: label that indicates 3' UTR regions in the given annotation (see above)

output: counts/annotation_overlap/clusters.summary.tsv

details: Intersect all provided poly(A) site clusters with the given annotation. If clusters do not overlap with any annotated feature, mark them as intergenic. Otherwise, differentiate if a cluster intersects with:

  • an exon
  • an intron
  • a terminal exon
  • a 3' UTR.

If a cluster overlaps with different features, count it for each of them. In the output, for each of the given features, the total number of clusters that overlap with this feature type and the corresponding fraction are reported. Due to the counting scheme described above, the reported fractions do not sum up to 1. There are two scenarios in which a cluster is assigned to multiple features:

  1. the cluster might overlap an intron-exon border of a transcript (i.e. it overlaps the boundary between two features)
  2. the cluster might be located in a region that is annotated as terminal exon for one transcript but as intron for another transcript (i.e. the annotation for a certain region is ambiguous).

In both cases, it makes sense to count the cluster for both features. Hence the fraction per feature in the output report can be read as: from all clusters, the given fraction x overlaps with the feature y.

shell command:

gtf="resources/Homo_sapiens.GRCh38.87.gtf"
utr_name="three_prime_utr"

python scripts/rs-clusters-to-feature-annotation.py \
       --verbose \
       --gtf ${gtf} \
       --utr_name ${utr_name} \
       --bed clusters.merged.bed \
       > counts/annotation_overlap/clusters.summary.tsv

*temp: indicates temporary files that are deleted once the pipeline has run successfully. So, under normal conditions you will not find these files in your results directory.

*local: these steps are executed on the local machine, even when the pipeline is run in cluster mode

** The description given in this README only briefly covers the case of potential internal priming events caused by priming to A-rich regions around the poly(A) signal. Please refer to the accompanying publication for more details on this case.