Skip to content
Snippets Groups Projects
seqalg.rst 40.42 KiB

:mod:`seq.alg <ost.seq.alg>` -- Algorithms for Sequences

Algorithms for Alignments

Create pairwise alignments

OpenStructure provides naive implementations to create pairwise local, global and semi-global alignments between two sequences:

  • :func:`LocalAlign`
  • :func:`GlobalAlign`
  • :func:`SemiGlobalAlign`

The use of parasail as a drop in replacement is optional and provides significant speedups. It must be enabled at compile time - see installation instructions.

Reference:

Jeff Daily. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. (2016) BMC Bioinformatics

Parasail allows to choose from various strategies but for the sake of simplicity, this Python binding always calls parasail_<mode>_trace_scan_sat which seems reasonably fast across the global, semi-global and local modes. See parasail documentation for more information.

You can always check if the alignment algorithms use parasail or the naive implementations by calling:

Substitution Weight Matrices and BLOSUM Matrices

Substitution weights for alignment algorithms

Four already preset BLOSUM (BLOcks SUbstitution Matrix) matrices are available at different levels of sequence identity:

  • BLOSUM45
  • BLOSUM62
  • BLOSUM80
  • BLOSUM100

Nucleotide substitution matrices:

  • NUC44: Nucleotide substitution matrix used in blastn that can deal with IUPAC ambiguity codes. ATTENTION: has been edited to explicitely encode T/U equivalence, i.e. you can just do m.GetWeight('G', 'U') instead of first translating 'U' to 'T'.

They can be directly accessed upon importing the sequence module:

from ost import seq
mat = seq.alg.BLOSUM62
print(mat.GetWeight('A', 'A'))

Contact Prediction

This is a set of functions for predicting pairwise contacts from a multiple sequence alignment (MSA). The core method here is mutual information which uses coevolution to predict contacts. Mutual information is complemented by two other methods which score pairs of columns of a MSA from the likelyhood of certain amino acid pairs to form contacts (statistical potential) and the likelyhood of finding certain substitutions of aminio-acid pairs in columns of the MSA corresponding to interacting residues.

Object containing the results form a contact prediction.

This class is used to associate a weight to any substitution from one amino-acid pair (a,b) to any other pair (c,d).

This class is used to associate a weight to any pair of amino-acids.

Get and analyze distance matrices from alignments

Given a multiple sequence alignment between a reference sequence (first sequence in alignment) and a list of structures (remaining sequences in alignment with an attached view to the structure), this set of functions can be used to analyze differences between the structures.

Example:

# SETUP: aln is multiple sequence alignment, where first sequence is the
#        reference sequence and all others have a structure attached

# clip alignment to only have parts with at least 3 sequences (incl. ref.)
# -> aln will be cut and clip_start is 1st column of aln that was kept
clip_start = seq.alg.ClipAlignment(aln, 3)

# get variance measure and distance to mean for each residue pair
d_map = seq.alg.CreateDistanceMap(aln)
var_map = seq.alg.CreateVarianceMap(d_map)
dist_to_mean = seq.alg.CreateDist2Mean(d_map)

# report min. and max. variances
print("MIN-MAX:", var_map.Min(), "-", var_map.Max())
# get data and json-strings for further processing
var_map_data = var_map.GetData()
var_map_json = var_map.GetJsonString()
dist_to_mean_data = dist_to_mean.GetData()
dist_to_mean_json = dist_to_mean.GetJsonString()

Container used by :class:`DistanceMap` to store a pair wise distance for each structure. Each structure is identified by its index in the originally used alignment (see :func:`CreateDistanceMap`).

Container returned by :func:`CreateDistanceMap`. Essentially a symmetric :meth:`GetSize` x :meth:`GetSize` matrix containing up to :meth:`GetNumStructures` distances (list stored as :class:`Distances`). Indexing of residues starts at 0 and corresponds to the positions in the originally used alignment (see :func:`CreateDistanceMap`).

Container returned by :func:`CreateVarianceMap`. Like :class:`DistanceMap`, it is a symmetric :meth:`GetSize` x :meth:`GetSize` matrix containing variance measures. Indexing of residues is as in :class:`DistanceMap`.

Container returned by :func:`CreateDist2Mean`. Stores distances to mean for :meth:`GetNumResidues` residues of :meth:`GetNumStructures` structures. Indexing of residues is as in :class:`DistanceMap`. Indexing of structures goes from 0 to :meth:`GetNumStructures` - 1 and is in the same order as the structures in the originally used alignment.

Container returned by :func:`CreateMeanlDDTHA`. Stores mean lDDT values for :meth:`GetNumResidues` residues of :meth:`GetNumStructures` structures. Has the exact same functionality and behaviour as :class:`Dist2Mean`

HMM Algorithms

Openstructure implements basic HMM-related functionality that aims at calculating an HMM-HMM alignment score as described in Soding, Bioinformatics (2005) 21(7), 951-60. This is the score which is optimized in the Viterbi algorithm of the hhalign tool. As a prerequisite, OpenStructure also implements adding pseudo counts to :class:`ost.seq.ProfileHandle` in order to avoid zero probabilities for unobserved transitions/emissions. Given these requirements, all functions in this section require HMM related data (transition probabilities, neff values, etc.) to be set, which is the case if you load a file in hhm format.

Example with pseudo count assignment:

from ost import io, seq

prof_query = io.LoadSequenceProfile("query.hhm")
prof_tpl = io.LoadSequenceProfile("tpl.hhm")
aln = io.LoadAlignment("aln.fasta")

# assign pseudo counts to transition probabilities
seq.alg.AddTransitionPseudoCounts(prof_query)
seq.alg.AddTransitionPseudoCounts(prof_tpl)

# hhblits/hhalign 3 assign different pseudo counts to
# query and template. The reason is computational efficiency.
# The more expensive Angermueller et al. pseudo counts
# are assigned to the query.
path_to_crf = "/path/to/hh-suite/data/context_data.crf"
lib = seq.alg.ContextProfileDB.FromCRF(path_to_crf)
seq.alg.AddAAPseudoCounts(prof_query, lib)

# templates are assigned the computationally cheaper pseudo
# counts derived from a Gonnet substitution matrix
seq.alg.AddAAPseudoCounts(prof_tpl)

# assign null model pseudo counts
# this should be done AFTER you assigned pseudo counts to emission
# probabilities as this affects the result
seq.alg.AddNullPseudoCounts(prof_query)
seq.alg.AddNullPseudoCounts(prof_tpl)

print("score:", seq.alg.HMMScore(prof_query, prof_tpl, aln, 0, 1))

Database that contains context profiles which will be used to add pseudo counts as described by Angermueller et al., Bioinformatics (2012) 28, 3240-3247.

AAIndex annotations

The annotations/scores can either refer to single amino acids or represent pairwise values. The two types are:

The actual data of an entry in the aaindex database is stored in a :class:`aaindex.AAIndexData` object: