A-seq read processing pipeline
The scripts and pipelines provided in this repository help to process A-seq read libraries. The repository contains all scripts needed to assemble the processing yourself and additionally provides pipeline scripts that run the entire processing automatically.
To get started, clone the repository into your target directory. Once this is done, ensure that all required external software/scripts are installed.
git clone https://git.scicore.unibas.ch/zavolan_public/A-seq2-processing.git path/to/workdir
cd path/to/workdir
Requirements
To run this pipeline, your computer requires 40 GB of available memory (RAM) to process larger genomes (e.g. human or mouse). Moreover, snakemake is used to facilitate the automated execution of all analysis steps. The easiest way to use the pipeline is to set up a python3 virtual environment and run the pipeline within this environment. The following sections give instructions on how to install the virtual environment for the analysis.
snakemake
Snakemake is a workflow management system that helps to create and execute data processing pipelines. It requires python3 and can be most easily installed via the bioconda channel of the anaconda python distribution. Your setup is complete after the following steps:
Step 1: Downloading Miniconda3
On Linux:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
On MacOS X:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
Step 2: Creating a new environment with all required packages/software
conda create -n Aseq-snakemake -c bioconda -c conda-forge -c ostrokach --file requirements.txt
If you already have a conda version with python2 installed, create the new python 3 environment like this:
conda create -n Aseq-snakemake -c bioconda -c conda-forge -c ostrokach --file requirements.txt python=3
With the installation of the environment, the following software is installed as well:
- Python (v3.5.1)
- Snakemake (v3.9.1)
- Graphviz (v2.38.0)
- PyYaml (v3.12)
- Numpy (v1.11.3)
- Docutils (v0.12)
- Perl (v5.22.2.1)
- fastx toolkit (v0.0.14)
- cutadapt (v1.12)
- STAR (v2.5.2a)
- samtools (v1.3.1)
- bedtools (v2.26.0)
- gzip (v1.7)
Step 3: Activate the environment
source activate Aseq-snakemake
To exit the environment (once you have finished using the pipeline), just execute
source deactivate
Step 4: Download/Provide all necessary files for the used genome:
- genome sequence for the organism of interest: Let's assume we want to work with the human genome (ENSEMBL GRCh38). It can be downloaded from ENSEMBL with the following commands:
wget -P resources ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
gzip -d resources/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
Please note: With the commands above, the genome sequence is stored as a fasta file in the resources folder of the working directory. The file must be unzipped (as done by the second command) before the pipeline is run.
- gtf annotation for the organism of interest: The repository already contains the ENSEMBL annotation for the human genome version GRCh38. Other gtf files can be downloaded directly from ENSEMBL. To download another gtf file, just follow the genome download instructions above, i.e. download the file into the resources folder and unzip it afterwards. Please note: GTF and fasta file should be downloaded from the same source to ensure that the chromosome names match.
Step 5: Prepare cluster execution
If you plan to run the pipeline with individual jobs on a high performance computing (HPC) cluster: Execute the following command to ensure that the jobs know where to find the software within the environment we set up in steps 2 and 3.
python scripts/complete-jobscript.py -i jobscript.sh -o jobscript.withJOBID.withPATH.sh
Run the pipeline
Configure input parameters
The working directory contains a file named Aseq_config.yaml. It is the central file in which all user settings, parameter values and path specifications are stored. During a run, all steps of the pipeline retrieve their parameter values from this file. It follows the yaml syntax (find more information about yaml and its syntax here), which makes it easy to read and edit. The main principles are:
- everything that comes after a # symbol is considered a comment and will not be interpreted
- parameters are given as key-value pairs, with key being the name and value the value of a parameter
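For illustration, a minimal excerpt in this style, using two keys that appear later in this README (the shipped Aseq_config.yaml contains many more entries and may format the values slightly differently):
# everything after the hash symbol is ignored
read_5p_start: ....TTT  # key 'read_5p_start' with the value '....TTT'
genome: resources/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa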
Before starting the pipeline, open the Aseq_config.yaml configuration file and set all options as required. This should at least include:
- name of the input directory - where are your input fastq files stored
- name of the output directory - where should the pipeline store the output files (the directory is created if it does not exist)
- name of the log directory - where should the pipeline store the log files
- name(s) of your input samples - please note: If your sample is named sample1.fq.gz then sample1 will be kept as naming scheme throughout the entire run to indicate output files that belong to this input file, e.g. the pipeline will create a file called sample1.3pSites.noIP.bed.gz. If you have multiple input files, just follow the given pattern with one sample name per line and a dash that indicates another list item (see the example below).
- UMI length - if the length of the random nucleotides was changed, then the read 5' start pattern needs to be adjusted (default: read_5p_start: ....TTT)
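For example, with two input files sample1.fq.gz and sample2.fq.gz, the corresponding list in the configuration file would look roughly like this (the key name 'samples' is only a placeholder here; use the key name that is already present in the shipped Aseq_config.yaml):
samples:  # placeholder key name - keep the one used in Aseq_config.yaml
  - sample1
  - sample2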
If you downloaded a genome other than the one given above, it is required to also set the correct path and file name for the genome key. When the gtf annotation is different as well, the values for gene_annotation and utr_name also need to be changed (utr_name refers to the label used in the gtf to mark 3' UTR regions).
Start a run
Once you have set up your configuration file, running the pipeline locally on your computer is as easy as invoking:
max_cores=4 # maximum number of threads that will run in parallel
# if not yet done: source activate Aseq-snakemake
snakemake -p --cores ${max_cores}
or, if you would like to save the snakemake messages in a log file:
max_cores=4 # maximum number of threads that will run in parallel
snakemake -p --cores ${max_cores} &> log_output.log
It is recommended to set the max_cores parameter so that multiple steps of the pipeline can run in parallel. Moreover, some scripts of the pipeline make use of a parallel environment, and defining a maximum number of available cores prevents overloading the machine.
The same pipeline can also easily be run in a parallel cluster environment. It was tested on a cluster computer with CentOS 6.5 as operating system and Univa Grid Engine 8.3 as queuing system. Here, the pipeline has to be invoked as follows:
max_jobs=100 # maximum number of concurrently submitted jobs
snakemake \
--cluster-config cluster_config.json \
--jobscript jobscript.withJOBID.withPATH.sh \
--cluster "qsub -pe smp {cluster.threads} \
-cwd \
-l membycore={cluster.mem} \
-R y \
-l runtime={cluster.time} \
-o {params.cluster_log} \
-j y" \
--jobs ${max_jobs}
Detailed explanations for each step of the pipeline
The following notes assume that the pipeline was started with default configurations, the current working directory as results directory and one sample called test.fq.gz. Log files for the different steps are stored in the logs/ directory.
create_log_dir:
input: -
output: *temp, created_log.tmp
details: Create a directory for log entries written by HPC jobs (only used when the pipeline is run on an HPC cluster). *local
shell command: pipeline inherent python code
fastq2fasta:
input: test.fq.gz
output: *temp, test.fa.gz
details: Convert the input fastq file into fasta format.
shell command:
perl scripts/ag-convert-FASTQ-to-fasta.pl \
--fastq_to_fasta=fastq_to_fasta \
--fastx_renamer=fastx_renamer \
--oufile=test.fa.gz \
test.fq.gz
raw_read_cnt:
input: test.fa.gz
output: *temp, counts/test.raw.nr.out
details: Count the number of raw reads the run starts with. *local
shell command: pipeline inherent python code
select_for_valid_5p_configuration:
input: test.fa.gz
parameters:
- adapter: sequence of the 5' adapter that is cleaved off (default: "....TTT"; name in config-file: 'read_5p_start')
output: *temp, test.5ptrimmed.UMI_available.fa.gz
details: Discard reads that do not start with the expected pattern NNNNTTT; append the starting tetramer sequence as unique molecular identifier (UMI) to the read name.
shell command:
adapter="....TTT"
zcat test.fa.gz \
| perl scripts/rs-filter-by-5p-adapter.keep5pAdapter.pl \
--adapter=${adapter} \
| gzip > test.5ptrimmed.UMI_available.fa.gz
proper_5p_end_read_cnt:
input: test.5ptrimmed.UMI_available.fa.gz
output: *temp, counts/test.5ptrimmed.nr.out
details: Count the number of remaining reads. *local
shell command: pipeline inherent python code
collapse_full_UMI_duplicates:
input: test.5ptrimmed.UMI_available.fa.gz
output: *temp, test.UMI_dupl_removed.fa.gz
details: Keep only one copy of reads that share the same sequence as well as their UMI.
shell command:
zcat test.5ptrimmed.UMI_available.fa.gz \
| perl scripts/rs-collapse-UMIduplicates.keepUMI.pl \
| gzip > test.UMI_dupl_removed.fa.gz
no_PCR_dupl_read_cnt:
input: test.UMI_dupl_removed.fa.gz
output: *temp, counts/test.no_full_PCR_dupl.nr.out
details: Count the number of remaining reads. *local
shell command: pipeline inherent python code
trim_3p_adapter:
input: test.UMI_dupl_removed.fa.gz
parameters:
- adapt: 5' sequence of the 3' adapter (default: "TGGAATTCTCGGGTGCCAAGG"; name in config-file: '3p_adapter')
- minLen: minimum length a read must have after adapter trimming, otherwise it is discarded (default: 15, name in config-file: 'min_length')
output: *temp, test.trimmed.UMI.fa.gz
details: Remove adapter sequence from the 3' end of reads, reverse complement the reads and keep only those with a minimum length.
shell command:
adapt="TGGAATTCTCGGGTGCCAAGG"
minLen=15
cutadapt -a ${adapt} \
--minimum-length ${minLen} \
test.UMI_dupl_removed.fa.gz \
| fastx_reverse_complement \
| gzip > test.trimmed.UMI.fa.gz
no_3pAdapter_read_cnt:
input: test.trimmed.UMI.fa.gz
output: *temp, counts/test.trimmed.nr.out
details: Count the number of remaining reads. *local
shell command: pipeline inherent python code
get_valid_reads:
input: test.trimmed.UMI.fa.gz
parameters:
- maxN: maximum number of N nucleotides per read (default: 2; name in config-file: 'maxN')
- maxAcontent: maximum fraction of A nucleotides per read (default: 0.8; name in config-file: 'maxAcontent')
output: *temp, test.valid.trimmed.UMI.fa.gz
details: Remove reads with more than 2 Ns, more than 80 % As or an A at the 3' end. *local
shell command:
maxN=2
maxAcontent=0.8
zcat test.trimmed.UMI.fa.gz \
| perl scripts/ag-filter-seqs-by-nucleotide-composition.pl --max ${maxN} --nuc N \
| perl scripts/ag-filter-seqs-by-nucleotide-composition.pl --max ${maxAcontent} --nuc A \
| perl scripts/ag-filter-seqs-by-last-nuc.pl \
| gzip > test.valid.trimmed.UMI.fa.gz
valid_read_cnt:
input: test.valid.trimmed.UMI.fa.gz
output: *temp, counts/test.valid.nr.out
details: Count the number of remaining reads with appropriate quality. *local
shell command: pipeline inherent python code
create_STAR_index:
input: -
parameters:
- read_len: the index should be built for the longest read length minus one (default: number_of_sequencing_cycles - 5_prime_adapter_length - 1 = 50 - 7 - 1; name in config-file: 'read_length')
- genome: fasta file of the used genome (name in config-file: 'genome')
- gtf: corresponding gene annotation as gtf file (name in config-file: 'gene_annotation')
- idx_dir: directory to store the created files into (name in config-file: 'STAR_idx')
threads: default: 8, name in config-file: 'threads.STAR'
output: the directory given above (idx_dir) with several index files, e.g. SAindex
details: Create an index for the genome. This step has high memory requirements (GRCh38: ~31 GB).
shell command:
genome="resources/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"
gtf="resources/Homo_sapiens.GRCh38.87.gtf"
idx_dir="resources/Homo_sapiens.GRCh38.50nt.index"
mkdir -p ${idx_dir}
STAR \
--runMode genomeGenerate \
--sjdbOverhang 52 \
--genomeDir ${idx_dir} \
--genomeFastaFiles ${genome} \
--runThreadN 8 \
--sjdbGTFfile ${gtf}
mapping:
input: STAR_index, test.valid.trimmed.UMI.fa.gz
parameters:
- anno: gene annotation as gtf file (name in config-file: 'gene_annotation')
- outdir: name of the directory for the STAR output (default: "test.STAR_out/")
- idx_dir: directory with index files (name in config-file: 'STAR_idx')
threads: default: 8, name in config-file: 'threads.STAR'
output: test.STAR_out/Aligned.sortedByCoord.out.bam
details: Map the reads to the genome (splice mappings are considered). STAR must not soft-clip the ends of reads. Similar to "create_STAR_index", this step has high memory requirements.
shell command:
anno="resources/Homo_sapiens.GRCh38.87.gtf"
outdir="test.STAR_out/"
idx_dir="resources/Homo_sapiens.GRCh38.50nt.index"
mkdir -p ${outdir}
STAR \
--runMode alignReads \
--twopassMode Basic \
--runThreadN 8 \
--genomeDir ${idx_dir} \
--sjdbGTFfile ${anno} \
--readFilesIn test.valid.trimmed.UMI.fa.gz \
--outFileNamePrefix ${outdir} \
--outSAMtype BAM SortedByCoordinate \
--limitBAMsortRAM 31532137230 \
--readFilesCommand zcat \
--outSAMstrandField intronMotif \
--alignIntronMax 200000 \
--outReadsUnmapped Fastx \
--outSAMattributes All \
--alignEndsType EndToEnd
bam2bed:
input: test.STAR_out/Aligned.sortedByCoord.out.bam
threads: default: 8 (name in config-file: 'threads.bam2bed')
output: *temp, test.reads.bed.gz
details: Convert each mapping into BED-format. Store the edit distance for each mapping as score in BED column 5.
shell command:
samtools view test.STAR_out/Aligned.sortedByCoord.out.bam \
| python scripts/rs-bam2bed.py \
--processors 8 \
| gzip > test.reads.bed.gz
select_by_edit_distance:
input: test.reads.bed.gz
output: *temp, test.edit_distance_filtered.bed.gz
details: When a read maps to multiple locations, only keep the mapping(s) with the smallest edit distance. Set the score of each kept mapping to 1 / n, with n = number of mappings with the smallest edit distance (see the illustration below the command).
shell command:
python scripts/rs-select-min-edit-distance.py \
--bed test.reads.bed.gz \
| gzip > test.edit_distance_filtered.bed.gz
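For illustration (hypothetical coordinates and read name): a read that maps with the same minimal edit distance to two loci is reported twice in the output, each time with score 1/2 = 0.5 in BED column 5:
chr1    10000    10050    read_1    0.5    +
chr7    52000    52050    read_1    0.5    +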
collapse_mapped_UMI_duplicates:
input: test.edit_distance_filtered.bed.gz
output: *temp, test.reads.collapsed.bed.gz, counts/test.mapped.nr.out
details: From multiple reads with the same UMI mapping to the same location, keep only one. Also, count the number of uniquely mapping reads and reads that map to multiple loci.
shell command:
# the input file is passed as the last argument here (the script may alternatively read it from stdin)
perl scripts/rs-collapse-mapped-UMIduplicates.pl \
--count_out=counts/test.mapped.nr.out \
test.edit_distance_filtered.bed.gz \
| gzip > test.reads.collapsed.bed.gz
mapped_read_cnt:
input: counts/test.valid.nr.out, counts/test.mapped.nr.out
output: counts/test.mapped.noUMI_duplicates.nr.out
details: Appends the number of unique and multi mappers to the counts of valid reads. *local
shell commands: pipeline inherent python code
get_3p_ends:
input: test.reads.collapsed.bed.gz
parameters:
- exclude_chr: list of chromosomes that are excluded from the output (e.g. mitochondrial or Y chromosome; name in config-file: 'excluded_chr')
- min_align: minimum number of nucleotides at the 3' end that have to align perfectly to the reference (default: 4, name in config-file: 'min_3p_align')
output: *temp, test.3pSites.bed.gz
details: From each mapping, keep only the position of the 3' end, and only if its terminal nucleotides (at least min_align of them) aligned perfectly to the reference. These sites indicate the actual cleavage sites.
shell command:
min_align=4
perl scripts/rs-get-3pEnds-from-bed.pl \
--exclude=Y --exclude=M \
--strict \
--min_align=${min_align} \
test.reads.collapsed.bed.gz \
| gzip > test.3pSites.bed.gz
raw_3p_ends_cnt:
input: test.3pSites.bed.gz
output: *temp, counts/test.raw_3p_ends.nr.out
details: Count the number of retrieved raw 3' end processing sites. *local
shell command: pipeline inherent python code
fetch_flanking_seqs:
input: test.3pSites.bed.gz
parameters:
- upstream_ext: number of nucleotides added upstream to the cleavage site for sequence retrieval (default: 0, name in config-file: 'upstream_region.IP')
- downstream_ext: number of nucleotides added downstream to the cleavage site for sequence retrieval (default: 10, name in config-file: 'downstream_region.IP')
- genome: fasta file for the genome sequence (name in config-file: 'genome')
output: *temp, test.3pSites.bed.seqs.gz
details: Retrieve the genomic sequence around the raw 3' end processing sites.
shell command:
upstream_ext=0
downstream_ext=10
genome="resources/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"
perl scripts/rs-fetch-flanking-fasta.pl \
--genome=${genome} \
--upstream=${upstream_ext} \
--downstream=${downstream_ext} \
test.3pSites.bed.gz \
| gzip > test.3pSites.bed.seqs.gz
assign_IP_sites:
input: test.3pSites.bed.seqs.gz
parameters:
- upstream_ext: number of nucleotides added upstream to the cleavage site for sequence retrieval (default: 0, name in config-file: 'upstream_region.IP')
- downstream_ext: number of nucleotides added downstream to the cleavage site for sequence retrieval (default: 10, name in config-file: 'downstream_region.IP')
- tot_As: number of overall As in the downstream region so that the 3' end site is considered as internal priming event (default: 7, name in config-file: 'total_As')
- consec_As: number of consecutive As in the downstream region so that the 3' end site is considered as internal priming event (default: 6, name in config-file: 'consecutive_As')
- ds_patterns: list of patterns. If the sequence directly downstream of the cleavage site starts with one of these patterns, the 3' end site is considered as internal priming event (default: AAAA, AGAA, AAGA, AAAG, name in config-file: 'downstream_patterns')
output: test.3pSites.ip.bed.gz
details: Based on pre-defined patterns, assess which 3' end processing sites are likely internal priming artifacts. Mark these. This file is an important output file that is used in later steps to create poly(A) site clusters.
shell command:
upstream_ext=0
downstream_ext=10
tot_As=7
consec_As=6
perl scripts/ag-assign-internal-priming-sites.pl \
--upstream_len=${upstream_ext} \
--downstream_len=${downstream_ext} \
--consecutive_As=${consec_As} \
--total_As=${tot_As} \
--ds_pattern=AAAA --ds_pattern=AGAA --ds_pattern=AAGA --ds_pattern=AAAG \
test.3pSites.bed.seqs.gz \
| gzip > test.3pSites.ip.bed.gz
noIP_3p_ends:
input: test.3pSites.ip.bed.gz
output: test.3pSites.noIP.bed.gz
details: Create a BED-file that only contains valid (i.e. non-IP) 3' end processing sites. This file is intended for users who want to proceed with individual 3' end sites. *local
shell command:
zcat test.3pSites.ip.bed.gz \
| perl -ne 'my @F = split("\t"); print $_ if($F[3] ne "IP");' \
| gzip > test.3pSites.noIP.bed.gz
noIP_3p_ends_cnt:
input: test.3pSites.noIP.bed.gz
output: counts/test.summary.tsv
details: Count the number of valid individual raw 3' end processing sites. Output a tsv file that contains all numbers collected for the current sample (from raw reads to 3' end processing sites). All other count-related files are deleted. *local
shell command: pipeline inherent python code
sites_feature_overlap_stats:
input: test.3pSites.noIP.bed.gz
parameters:
- gtf: gene annotation as gtf file (see above)
- utr_name: label that indicates 3' UTR regions in the gtf annotation (if 3' UTR and 5' UTR are not marked separately, the output also reports the overlap of 3' end sites with annotated 5' UTR regions; default: 'three_prime_utr', name in config-file: 'utr_name')
output: counts/annotation_overlap/test.raw_3p_ends.noIP.summary.tsv
details: Intersect all provided 3' end processing sites with the given annotation. If sites do not overlap with any annotated feature, mark them as intergenic. Otherwise, differentiate if a site intersects with:
- an exon
- an intron
- a terminal exon
- a 3' UTR.
If a site overlaps with different features, count it partially towards each of them. For each feature from the above list, the output reports the absolute number of overlaps (these numbers might not be integers) and the fraction of sites falling into this feature out of all sites that intersect with any of the features. Hence, the fractions over all listed features sum up to one.
shell command:
gtf="resources/Homo_sapiens.GRCh38.87.gtf"
utr_name="three_prime_utr"
python scripts/rs-sites-to-feature-annotation.py \
--verbose \
--gtf ${gtf} \
--utr_name ${utr_name} \
--bed test.3pSites.noIP.bed.gz \
> counts/annotation_overlap/test.raw_3p_ends.noIP.summary.tsv
pool_samples:
input: test.3pSites.ip.bed.gz # and all other files of this study, like: test2.3pSites.ip.bed.gz, test3.3pSites.ip.bed.gz, ...
output: 3pSites.tsv.gz
details: Merge 3' end processing sites from individual samples into a comprehensive table for the entire study. Keep track of the raw reads per site from each individual sample.
shell command:
perl scripts/ag-pool-sites.pl \
--noip \
test.3pSites.ip.bed.gz \
| gzip > 3pSites.tsv.gz
pooled_3p_sites_cnt:
input: 3pSites.tsv.gz
output: *temp, counts/pooled_3p_ends.nr.out
details: Count the number of overall determined 3' end processing sites. *local
shell command: pipeline inherent python code
assign_polyA_signals:
input: 3pSites.tsv.gz
parameters:
- signals: poly(A) signals that are assigned to individual 3' end sites (the list of signals comprises: AATAAA, ATTAAA, TATAAA, AGTAAA, AATACA, CATAAA, AATATA, GATAAA, AATGAA, AATAAT, AAGAAA, ACTAAA, AATAGA, ATTACA, AACAAA, ATTATA, AACAAG, AATAAG)
- genome: fasta file for the genome sequence (name in config-file: 'genome')
output: 3pSites.PAS.tsv.gz
details: Assign poly(A) signals to 3' end processing sites: consider the region from 60 nt upstream to 10 nt downstream and note every occurrence of any of the hexamer motifs given under signals.
shell command:
genome="resources/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"
# only 2 out of the 18 motifs are listed here as an example
perl scripts/ag-assign-polyAsignals.pl \
--motif=AATAAA --motif=ATTAAA \
--genome=${genome} \
3pSites.tsv.gz \
| gzip > 3pSites.PAS.tsv.gz
sample_specific_bg:
input: 3pSites.PAS.tsv.gz
parameters:
- cutoff: Percent of preliminary clusters that are required to have an annotated poly(A) signal (default: 90; name in config-file: 'polyAsignal_cutoff')
- upstream: length in nucleotides to search upstream for 3' end sites in order to create preliminary poly(A) site clusters (default: 25, name in config-file: 'upstream_bg_clusters')
- downstream: length in nucleotides to search downstream for 3' end sites in order to create preliminary poly(A) site clusters (default: 25, name in config-file: 'downstream_bg_clusters')
output: *temp, filteredSites/test.filtered.tsv # and potential other samples from the study like: filteredSites/test2.filtered.tsv, ...
details: Define a sample-specific background level. Do so by merging individual 3' end sites that lie within 25 nt up- or downstream of higher-ranked 3' end sites into preliminary poly(A) site clusters. Iterate through the list of created clusters from highest to lowest expression and find the expression level below which the percentage of clusters with an annotated poly(A) signal drops below the given cutoff. Consider all 3' end sites that contribute to poly(A) site clusters below this expression cutoff as background and set their expression to 0 (i.e. ignore them in later steps).
shell command:
cutoff=90
upstream=25
downstream=25
perl scripts/rs-find-sample-specific-cutoff.pl \
--cutoff=${cutoff} \
--upstream=${upstream} \
--downstream=${downstream} \
--sample=test \
3pSites.PAS.tsv.gz \
> filteredSites/test.filtered.tsv
create_noBG_3pSites_table:
input: filteredSites/test.filtered.tsv # and potential other samples from the study like: filteredSites/test2.filtered.tsv, ...
output: *temp, 3pSites.PAS.filtered.tsv.gz
details: Merge together the sample-specific tables in which the expression of background 3' end sites was set to 0. The output table is essentially the same as 3pSites.PAS.tsv.gz except that background reads are deleted.
shell command: pipeline inherent python code
delete_zero_rows:
input: 3pSites.PAS.filtered.tsv.gz
output: 3pSites.PAS.noBG.tsv.gz
details: By setting the expression of 3' end processing sites in different samples to zero (see the sample_specific_bg step above), it is possible that individual 3' end sites no longer have any assigned expression in any sample. Delete these 3' end processing sites without expression in any sample.
shell command: pipeline inherent python code
cluster_sites:
input: 3pSites.PAS.noBG.tsv.gz"
parameters:
- upstream_ext: region upstream of high ranked sites to find lower ranked sites that are clustered (default: 12, name in config-file: 'upstream_clusters')
- downstream_ext: region downstream of high ranked sites to find lower ranked sites that are clustered (default: 12, name in config-file: 'downstream_clusters')
output: *temp, clusters.primary.tsv.gz
details: Cluster together closely spaced 3' end processing sites. Do so by:
- ranking sites according to their number of supporting samples, followed by the total expression of the sites
- iterating over the sites from highest to lowest rank
- assigning lower ranked sites to higher ranked sites if they are in the predefined up- or downstream region of the higher ranked site.
Additionally, label poly(A) site clusters as potential internal priming candidates as soon as any constituting 3' end site of the cluster overlaps with an annotated poly(A) signal. **
shell command:
upstream_ext=12
downstream_ext=12
perl scripts/ag-generate-clusters.pl \
--upstream=${upstream_ext} \
--downstream=${downstream_ext} \
3pSites.PAS.noBG.tsv.gz \
| gzip > clusters.primary.tsv.gz
merge_clusters:
input: clusters.primary.tsv.gz
parameters:
- maxsize: maximum cluster size that merged clusters must not exceed (except for those that share the same poly(A) signal; default: 25, name in config-file: 'max_cluster_size')
- minDistToPAS: only relevant for the processing of potential internal priming candidates; minimum distance of the most 3' located individual site of a cluster to the closest annotated poly(A) signal; rescue the cluster if the distance is greater than or equal to the given minimum (default: 15, name in config-file: 'min_dist_to_PAS')
output: clusters.merged.tsv.gz
details: Further merge poly(A) site clusters if they share the same set of poly(A) signals or if their combined length does not exceed the given maximum. Further resolve potential internal priming candidates. **
shell command:
maxsize=25
minDistToPAS=15
perl scripts/ag-merge-clusters.pl \
--maxsize=${maxsize} \
--minDistToPAS=${minDistToPAS} \
clusters.primary.tsv.gz \
| gzip > clusters.merged.tsv.gz
clusters_to_bed:
input: clusters.merged.tsv.gz
output: clusters.merged.bed
details: Convert the tsv-table from the last step into proper BED-format. Use the summed normalized expression as score in column 5 (the read count per 3' end site per sample is library-normalized to reads per million, and the overall cluster expression is the sum over all constituting individual 3' end sites across all samples; see the worked example below the command).
shell command:
perl scripts/ag-clusters-to-BED.pl clusters.merged.tsv.gz clusters.merged.bed
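As a worked example with illustrative numbers: if sample1 contributes 4 reads to a given 3' end site and the sample's library contains 2,000,000 reads in total (the exact normalization basis is defined by the conversion script), this site contributes 4 / 2,000,000 * 1,000,000 = 2 reads per million; the score in column 5 is the sum of such contributions over all constituting 3' end sites of the cluster and over all samples.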
clusters_cnt:
input: clusters.merged.bed
output: counts/clusters.nr.out
details: Count the number of obtained poly(A) site clusters. Create a file that reports the number of pooled 3' end sites and the number of obtained clusters. *local
shell command: pipeline inherent python code
clusters_feature_overlap_stats:
input: clusters.merged.bed
parameters:
- gtf: gene annotation as gtf file (see above)
- utr_name: label that indicates 3' UTR regions in the given annotation (see above)
output: counts/annotation_overlap/clusters.summary.tsv
details: Intersect all provided poly(A) site clusters with the given annotation. If clusters do not overlap with any annotated feature, mark them as intergenic. Otherwise, differentiate if a cluster intersects with:
- an exon
- an intron
- a terminal exon
- a 3' UTR.
If a cluster overlaps with different features, count it for each of them. In the output, for each of the given features the total number of clusters that overlap with this feature type and the corresponding fraction are reported. Due to the counting scheme described above, the reported fractions do not sum up to 1. There are two scenarios in which a cluster is assigned to multiple features:
- the cluster might overlap an intron-exon border of a transcript (i.e. it overlaps the boundary between two features)
- the cluster might be located in a region that is annotated as terminal exon for one transcript but as intron for another transcript (i.e. the annotation for a certain region is ambiguous).
In both cases, it makes sense to count the cluster for both features. Hence the fraction per feature in the output report can be read as: from all clusters, the given fraction x overlaps with the feature y.
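As a worked example with illustrative numbers: assume 100 clusters in total, of which 55 overlap a 3' UTR, 50 a terminal exon, 20 an intron and 15 an exon (some clusters overlap more than one feature); the reported fractions are then 0.55, 0.50, 0.20 and 0.15, which sum to more than 1 because clusters overlapping several features are counted once for each of them.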
shell command:
gtf="resources/Homo_sapiens.GRCh38.87.gtf"
utr_name="three_prime_utr"
python scripts/rs-clusters-to-feature-annotation.py \
--verbose \
--gtf ${gtf} \
--utr_name ${utr_name} \
--bed clusters.merged.bed \
> counts/annotation_overlap/clusters.summary.tsv
*temp: indicates temporary files that are deleted as soon as the pipeline has run successfully. Under normal conditions you will therefore never find these files in your results directory.
*local: these steps are executed on the local machine, even when the pipeline is run in cluster mode
** The description given in this README only briefly covers the case of potential internal priming events due to priming to A-rich regions around the poly(A) signal. Please refer to the accompanying publication for more details on this case.