-
Gerardo Tauriello authoredGerardo Tauriello authored
:mod:`seq.alg <ost.seq.alg>` -- Algorithms for Sequences
Algorithms for Alignments
Contact Prediction
This is a set of functions for predicting pairwise contacts from a multiple sequence alignment (MSA). The core method here is mutual information which uses coevolution to predict contacts. Mutual information is complemented by two other methods which score pairs of columns of a MSA from the likelyhood of certain amino acid pairs to form contacts (statistical potential) and the likelyhood of finding certain substitutions of aminio-acid pairs in columns of the MSA corresponding to interacting residues.
Object containing the results form a contact prediction.
This class is used to associate a weight to any substitution from one amino-acid pair (a,b) to any other pair (c,d).
This class is used to associate a weight to any pair of amino-acids.
Get and analyze distance matrices from alignments
Given a multiple sequence alignment between a reference sequence (first sequence in alignment) and a list of structures (remaining sequences in alignment with an attached view to the structure), this set of functions can be used to analyze differences between the structures.
Example:
# SETUP: aln is multiple sequence alignment, where first sequence is the
# reference sequence and all others have a structure attached
# clip alignment to only have parts with at least 3 sequences (incl. ref.)
# -> aln will be cut and clip_start is 1st column of aln that was kept
clip_start = seq.alg.ClipAlignment(aln, 3)
# get variance measure and distance to mean for each residue pair
d_map = seq.alg.CreateDistanceMap(aln)
var_map = seq.alg.CreateVarianceMap(d_map)
dist_to_mean = seq.alg.CreateDist2Mean(d_map)
# report min. and max. variances
print("MIN-MAX:", var_map.Min(), "-", var_map.Max())
# get data and json-strings for further processing
var_map_data = var_map.GetData()
var_map_json = var_map.GetJsonString()
dist_to_mean_data = dist_to_mean.GetData()
dist_to_mean_json = dist_to_mean.GetJsonString()
Container used by :class:`DistanceMap` to store a pair wise distance for each structure. Each structure is identified by its index in the originally used alignment (see :func:`CreateDistanceMap`).
Container returned by :func:`CreateDistanceMap`. Essentially a symmetric :meth:`GetSize` x :meth:`GetSize` matrix containing up to :meth:`GetNumStructures` distances (list stored as :class:`Distances`). Indexing of residues starts at 0 and corresponds to the positions in the originally used alignment (see :func:`CreateDistanceMap`).
Container returned by :func:`CreateVarianceMap`. Like :class:`DistanceMap`, it is a symmetric :meth:`GetSize` x :meth:`GetSize` matrix containing variance measures. Indexing of residues is as in :class:`DistanceMap`.
Container returned by :func:`CreateDist2Mean`. Stores distances to mean for :meth:`GetNumResidues` residues of :meth:`GetNumStructures` structures. Indexing of residues is as in :class:`DistanceMap`. Indexing of structures goes from 0 to :meth:`GetNumStructures` - 1 and is in the same order as the structures in the originally used alignment.
Container returned by :func:`CreateMeanlDDTHA`. Stores mean lDDT values for :meth:`GetNumResidues` residues of :meth:`GetNumStructures` structures. Has the exact same functionality and behaviour as :class:`Dist2Mean`
HMM Algorithms
Openstructure implements basic HMM-related functionality that aims at calculating an HMM-HMM alignment score as described in Soding, Bioinformatics (2005) 21(7), 951-60. This is the score which is optimized in the Viterbi algorithm of the hhalign tool. As a prerequisite, OpenStructure also implements adding pseudo counts to :class:`ost.seq.ProfileHandle` in order to avoid zero probabilities for unobserved transitions/emissions. Given these requirements, all functions in this section require HMM related data (transition probabilities, neff values, etc.) to be set, which is the case if you load a file in hhm format.
Example with pseudo count assignment:
from ost import io, seq
prof_query = io.LoadSequenceProfile("query.hhm")
prof_tpl = io.LoadSequenceProfile("tpl.hhm")
aln = io.LoadAlignment("aln.fasta")
# assign pseudo counts to transition probabilities
seq.alg.AddTransitionPseudoCounts(prof_query)
seq.alg.AddTransitionPseudoCounts(prof_tpl)
# hhblits/hhalign 3 assign different pseudo counts to
# query and template. The reason is computational efficiency.
# The more expensive Angermueller et al. pseudo counts
# are assigned to the query.
path_to_crf = "/path/to/hh-suite/data/context_data.crf"
lib = seq.alg.ContextProfileDB.FromCRF(path_to_crf)
seq.alg.AddAAPseudoCounts(prof_query, lib)
# templates are assigned the computationally cheaper pseudo
# counts derived from a Gonnet substitution matrix
seq.alg.AddAAPseudoCounts(prof_tpl)
# assign null model pseudo counts
# this should be done AFTER you assigned pseudo counts to emission
# probabilities as this affects the result
seq.alg.AddNullPseudoCounts(prof_query)
seq.alg.AddNullPseudoCounts(prof_tpl)
print("score:", seq.alg.HMMScore(prof_query, prof_tpl, aln, 0, 1))
Database that contains context profiles which will be used to add pseudo counts as described by Angermueller et al., Bioinformatics (2012) 28, 3240-3247.