diff --git a/modules/seq/alg/doc/seqalg.rst b/modules/seq/alg/doc/seqalg.rst index 2137e5ab17b5d3cd1ba68715c6b050f4790c0f32..ca4a9c304d411a69d436104a634f0864ad675b77 100644 --- a/modules/seq/alg/doc/seqalg.rst +++ b/modules/seq/alg/doc/seqalg.rst @@ -4,6 +4,8 @@ .. module:: ost.seq.alg :synopsis: Algorithms for sequences +Algorithms for Alignments +-------------------------------------------------------------------------------- .. function:: MergePairwiseAlignments(pairwise_alns, ref_seq) @@ -215,8 +217,10 @@ corresponding to interacting residues. .. autofunction:: PredictContacts -.. function:: CalculateMutualInformation(aln,weights=LoadConstantContactWeightMatrix(), - apc_correction=true,zpx_transformation=true,small_number_correction=0.05) +.. function:: CalculateMutualInformation(aln, \ + weights=LoadConstantContactWeightMatrix(), \ + apc_correction=true, zpx_transformation=true, \ + small_number_correction=0.05) Calculates the mutual information (MI) from a multiple sequence alignemnt. Contributions of each pair of amino-acids are weighted using the matrix **weights** (weighted mutual information). The average product correction (**apc_correction**) correction and transformation into Z-scores (**zpx_transofrmation**) increase prediciton accuracy by reducing the effect of phylogeny and other noise sources. The small number correction reduces noise for alignments with small number of sequences of low diversity. @@ -231,7 +235,8 @@ corresponding to interacting residues. :param small_number_correction: initial values for the probabilities of having a given pair of amino acids *p(a,b)*. :type small_number_correction: :class:`float` -.. function:: CalculateContactScore(aln,weights=LoadDefaultContactWeightMatrix()) +.. function:: CalculateContactScore(aln, \ + weights=LoadDefaultContactWeightMatrix()) Calculates the Contact Score (*CoSc*) from a multiple sequence alignment. For each pair of residues *(i,j)* (pair of columns in the MSA), *CoSc(i,j)* is the average over the values of the **weights** corresponding to the amino acid pairs in the columns. @@ -240,8 +245,8 @@ corresponding to interacting residues. :param weights: The contact weight matrix :type weights: :class`ContactWeightMatrix` -.. function:: CalculateContactSubstitutionScore(aln,ref_seq_index=0, - weights=LoadDefaultPairSubstWeightMatrix()) +.. function:: CalculateContactSubstitutionScore(aln, ref_seq_index=0, \ + weights=LoadDefaultPairSubstWeightMatrix()) Calculates the Contact Substitution Score (*CoEvoSc*) from a multiple sequence alignment. For each pair of residues *(i,j)* (pair of columns in the MSA), *CoEvoSc(i,j)* is the average over the values of the **weights** corresponding to substituting the amino acid pair in the reference sequence (given by **ref_seq_index**) with all other pairs in columns *(i,j)* of the **aln**. @@ -290,3 +295,281 @@ corresponding to interacting residues. A :class:`CharList` of one letter codes of the amino acids for which weights are found in the **weights** matrix. +Get and analyze distance matrices from alignments +-------------------------------------------------------------------------------- + +**Example:** + +.. code-block:: python + + # SETUP: aln is multiple sequence alignment, where first sequence is the + # reference sequence and all others have a structure attached + + # clip alignment to only have parts with at least 3 sequences (incl. ref.) + # -> aln will be cut and clip_start is 1st column of aln that was kept + clip_start = seq.alg.ClipAlignment(aln, 3) + # get variance measure and distance to mean for each residue pair + d_map = seq.alg.CreateDistanceMap(aln) + var_map = seq.alg.CreateVarianceMap(d_map) + dist_to_mean = seq.alg.CreateDist2Mean(d_map) + # report min. and max. variances + print "MIN-MAX:", var_map.Min(), "-", var_map.Max() + # get data and json-strings for further processing + var_map_data = var_map.GetData() + var_map_json = var_map.GetJsonString() + dist_to_mean_data = dist_to_mean.GetData() + dist_to_mean_json = dist_to_mean.GetJsonString() + +.. function:: ClipAlignment(aln, n_seq_thresh=2, set_offset=true, \ + remove_empty=true) + + Clips alignment so that first and last column have at least the desired number + of structures. + + :param aln: Multiple sequence alignment. Will be cut! + :type aln: :class:`~ost.seq.AlignmentHandle` + :param n_seq_thresh: Minimal number of sequences desired. + :type n_seq_thresh: :class:`int` + :param set_offset: Shall we update offsets for attached views? + :type set_offset: :class:`bool` + :param remove_empty: Shall we remove sequences with only gaps in cut aln? + :type remove_empty: :class:`bool` + :returns: Starting column (0-indexed), where cut region starts (w.r.t. + original aln). -1, if there is no region in the alignment with + at least the desired number of structures. + :rtype: :class:`int` + +.. function:: CreateDistanceMap(aln) + + Create distance map from a multiple sequence alignment. + + The algorithm requires that the sequence alignment consists of at least two + sequences. The sequence at index 0 serves as a frame of reference. All the + other sequences must have an attached view and a properly set sequence offset + (see :meth:`~ost.seq.AlignmentHandle.SetSequenceOffset`). + + For each of the attached views, the C-alpha distance pairs are extracted and + mapped onto the corresponding C-alpha distances in the reference sequence. + + :param aln: Multiple sequence alignment. + :type aln: :class:`~ost.seq.AlignmentHandle` + :returns: Distance map. + :rtype: :class:`DistanceMap` + :raises: Exception if *aln* has less than 2 sequences or any sequence (apart + from index 0) is lacking an attached view. + +.. function:: CreateVarianceMap(d_map, sigma=25) + + :returns: Variance measure for each entry in *d_map*. + :rtype: :class:`VarianceMap` + :param d_map: Distance map as created with :func:`CreateDistanceMap`. + :type d_map: :class:`DistanceMap` + :param sigma: Used for weighting of variance measure + (see :meth:`Distances.GetWeightedStdDev`) + :type sigma: :class:`float` + :raises: Exception if *d_map* has no entries. + +.. function:: CreateDist2Mean(d_map) + + :returns: Distances to mean for each structure in *d_map*. + Structures are in the same order as passed when creating *d_map*. + :rtype: :class:`Dist2Mean` + :param d_map: Distance map as created with :func:`CreateDistanceMap`. + :type d_map: :class:`DistanceMap` + :raises: Exception if *d_map* has no entries. + +.. class:: Distances + + Container used by :class:`DistanceMap` to store a pair wise distance for each + structure. Each structure is identified by its index in the originally used + alignment (see :func:`CreateDistanceMap`). + + .. method:: GetDataSize() + + :returns: Number of pairwise distances. + :rtype: :class:`int` + + .. method:: GetAverage() + + :returns: Average of all distances. + :rtype: :class:`float` + :raises: Exception if there are no distances. + + .. method:: GetMin() + GetMax() + + :returns: Minimal/maximal distance. + :rtype: :class:`tuple` (distance (:class:`float`), index (:class:`int`)) + :raises: Exception if there are no distances. + + .. method:: GetDataElement(index) + + :returns: Element at given *index*. + :rtype: :class:`tuple` (distance (:class:`float`), index (:class:`int`)) + :param index: Index within list of distances (must be < :meth:`GetDataSize`). + :type index: :class:`int` + :raises: Exception if there are no distances or *index* out of bounds. + + .. method:: GetStdDev() + + :returns: Standard deviation of all distances. + :rtype: :class:`float` + :raises: Exception if there are no distances. + + .. method:: GetWeightedStdDev(sigma) + + :returns: Standard deviation of all distances multiplied by + exp( :meth:`GetAverage` / (-2*sigma) ). + :rtype: :class:`float` + :param sigma: Defines weight. + :type sigma: :class:`float` + :raises: Exception if there are no distances. + + .. method:: GetNormStdDev() + + :returns: Standard deviation of all distances divided by :meth:`GetAverage`. + :rtype: :class:`float` + :raises: Exception if there are no distances. + +.. class:: DistanceMap + + Container returned by :func:`CreateDistanceMap`. + Essentially a symmetric :meth:`GetSize` x :meth:`GetSize` matrix containing + up to :meth:`GetNumStructures` distances (list stored as :class:`Distances`). + Indexing of residues starts at 0 and corresponds to the positions in the + originally used alignment (see :func:`CreateDistanceMap`). + + .. method:: GetDistances(i_res1, i_res2) + + :returns: List of distances for given pair of residue indices. + :rtype: :class:`Distances` + :param i_res1: Index of residue. + :type i_res1: :class:`int` + :param i_res2: Index of residue. + :type i_res2: :class:`int` + + .. method:: GetSize() + + :returns: Number of residues in map. + :rtype: :class:`int` + + .. method:: GetNumStructures() + + :returns: Number of structures originally used when creating the map + (see :func:`CreateDistanceMap`). + :rtype: :class:`int` + +.. class:: VarianceMap + + Container returned by :func:`CreateVarianceMap`. + Like :class:`DistanceMap`, it is a symmetric :meth:`GetSize` x :meth:`GetSize` + matrix containing variance measures. + Indexing of residues is as in :class:`DistanceMap`. + + .. method:: Get(i_res1, i_res2) + + :returns: Variance measure for given pair of residue indices. + :rtype: :class:`float` + :param i_res1: Index of residue. + :type i_res1: :class:`int` + :param i_res2: Index of residue. + :type i_res2: :class:`int` + + .. method:: GetSize() + + :returns: Number of residues in map. + :rtype: :class:`int` + + .. method:: Min() + Max() + + :returns: Minimal/maximal variance in the map. + :rtype: :class:`float` + + .. method:: ExportDat(file_name) + ExportCsv(file_name) + ExportJson(file_name) + + Write all variance measures into a file. The possible formats are: + + - "dat" file: a list of "*i_res1+1* *i_res2+1* variance" lines + - "csv" file: a list of ";" separated variances (one line for each *i_res1*) + - "json" file: a JSON formatted file (see :meth:`GetJsonString`) + + :param file_name: Path to file to be created. + :type file_name: :class:`str` + :raises: Exception if the file cannot be opened for writing. + + .. method:: GetJsonString() + + :returns: A JSON formatted list of :meth:`GetSize` lists with + :meth:`GetSize` variances + :rtype: :class:`str` + + .. method:: GetData() + + Gets all the data in this map at once. Note that this is much faster (10x + speedup observed) than parsing :meth:`GetJsonString` or using :meth:`Get` + on each element. + + :returns: A list of :meth:`GetSize` lists with :meth:`GetSize` variances. + :rtype: :class:`list` of :class:`list` of :class:`float` + +.. class:: Dist2Mean + + Container returned by :func:`CreateDist2Mean`. + Stores distances to mean for :meth:`GetNumResidues` residues of + :meth:`GetNumStructures` structures. + Indexing of residues is as in :class:`DistanceMap`. + Indexing of structures goes from 0 to :meth:`GetNumStructures` - 1 and is in + the same order as the structures in the originally used alignment. + + .. method:: Get(i_res, i_str) + + :returns: Distance to mean for given residue and structure indices. + :rtype: :class:`float` + :param i_res: Index of residue. + :type i_res: :class:`int` + :param i_str: Index of structure. + :type i_str: :class:`int` + + .. method:: GetNumResidues() + + :returns: Number of residues. + :rtype: :class:`int` + + .. method:: GetNumStructures() + + :returns: Number of structures. + :rtype: :class:`int` + + .. method:: ExportDat(file_name) + ExportCsv(file_name) + ExportJson(file_name) + + Write all distance measures into a file. The possible formats are: + + - "dat" file: a list of "*i_res+1* distances" lines (distances are space + separated) + - "csv" file: a list of ";" separated distances (one line for each *i_res*) + - "json" file: a JSON formatted file (see :meth:`GetJsonString`) + + :param file_name: Path to file to be created. + :type file_name: :class:`str` + :raises: Exception if the file cannot be opened for writing. + + .. method:: GetJsonString() + + :returns: A JSON formatted list of :meth:`GetNumResidues` lists with + :meth:`GetNumStructures` distances. + :rtype: :class:`str` + + .. method:: GetData() + + Gets all the data in this map at once. Note that this is much faster (10x + speedup observed) than parsing :meth:`GetJsonString` or using :meth:`Get` + on each element. + + :returns: A list of :meth:`GetNumResidues` lists with + :meth:`GetNumStructures` distances. + :rtype: :class:`list` of :class:`list` of :class:`float` diff --git a/modules/seq/alg/pymod/wrap_seq_alg.cc b/modules/seq/alg/pymod/wrap_seq_alg.cc index 50f84ab9cd18235608c2db8e3e941656aeb6b791..30909790c3c9b61944523e4f966d42d1ffb0a40d 100644 --- a/modules/seq/alg/pymod/wrap_seq_alg.cc +++ b/modules/seq/alg/pymod/wrap_seq_alg.cc @@ -1,7 +1,7 @@ //------------------------------------------------------------------------------ // This file is part of the OpenStructure project <www.openstructure.org> // -// Copyright (C) 2008-2011 by the OpenStructure authors +// Copyright (C) 2008-2016 by the OpenStructure authors // // This library is free software; you can redistribute it and/or modify it under // the terms of the GNU Lesser General Public License as published by the Free @@ -32,12 +32,56 @@ #include <ost/seq/alg/pair_subst_weight_matrix.hh> #include <ost/seq/alg/contact_weight_matrix.hh> #include <ost/seq/alg/contact_prediction_score.hh> +#include <ost/seq/alg/clip_alignment.hh> +#include <ost/seq/alg/distance_map.hh> +#include <ost/seq/alg/variance_map.hh> using namespace boost::python; using namespace ost::seq; using namespace ost::seq::alg; -BOOST_PYTHON_MODULE(_ost_seq_alg) +//////////////////////////////////////////////////////////////////// +// wrappers +namespace { +tuple WrapDistancesGetMin(const Distances& distance) { + const std::pair<Real, int> dist = distance.GetMin(); + return boost::python::make_tuple(dist.first, dist.second); +} +tuple WrapDistancesGetMax(const Distances& distance) { + const std::pair<Real, int> dist = distance.GetMax(); + return boost::python::make_tuple(dist.first, dist.second); +} +tuple WrapDistancesGetDataElement(const Distances& distance, uint index) { + const std::pair<Real, int> dist = distance.GetDataElement(index); + return boost::python::make_tuple(dist.first, dist.second); +} + +template <typename T> +list GetList(const T& data, uint num_rows, uint num_cols) { + list ret; + for (uint row = 0; row < num_rows; ++row) { + list my_row; + for (uint col = 0; col < num_cols; ++col) { + my_row.append(data(row, col)); + } + ret.append(my_row); + } + return ret; +} + +list VarMapGetData(const VarianceMapPtr v_map) { + return GetList(*v_map, v_map->GetSize(), v_map->GetSize()); +} + +list DistToMeanGetData(const Dist2MeanPtr d2m) { + return GetList(*d2m, d2m->GetNumResidues(), d2m->GetNumStructures()); +} +} // anon ns +//////////////////////////////////////////////////////////////////// + +//////////////////////////////////////////////////////////////////// +// Work on alignments +void export_aln_alg() { enum_<RefMode::Type>("RefMode") .value("ALIGNMENT", RefMode::ALIGNMENT) @@ -71,7 +115,12 @@ BOOST_PYTHON_MODULE(_ost_seq_alg) def("SemiGlobalAlign", &SemiGlobalAlign,(arg("seq1"),arg("seq2"),arg("subst_weight"), arg("gap_open")=-5, arg("gap_ext")=-2)); def("ShannonEntropy", &ShannonEntropy, (arg("aln"), arg("ignore_gaps")=true)); +} +//////////////////////////////////////////////////////////////////// +// Contact Prediction +void export_contact_prediction() +{ class_<PairSubstWeightMatrix>("PairSubstWeightMatrix",init< std::vector <std::vector <std::vector <std::vector <Real> > > >,std::vector <char> >()) .def(init<>()) .def_readonly("weights",&PairSubstWeightMatrix::weights) @@ -99,7 +148,9 @@ BOOST_PYTHON_MODULE(_ost_seq_alg) def("LoadConstantContactWeightMatrix",LoadConstantContactWeightMatrix); def("LoadDefaultPairSubstWeightMatrix",LoadDefaultPairSubstWeightMatrix); - scope mat_scope = class_<SubstWeightMatrix, SubstWeightMatrixPtr>("SubstWeightMatrix", init<>()) + // NOTE: anything after this is within SubstWeightMatrix scope + scope mat_scope = class_<SubstWeightMatrix, SubstWeightMatrixPtr> + ("SubstWeightMatrix", init<>()) .def("GetWeight", &SubstWeightMatrix::GetWeight) .def("SetWeight", &SubstWeightMatrix::SetWeight) .def("GetMinWeight", &SubstWeightMatrix::GetMinWeight) @@ -113,5 +164,68 @@ BOOST_PYTHON_MODULE(_ost_seq_alg) .value("BLOSUM80", SubstWeightMatrix::BLOSUM80) .value("BLOSUM100", SubstWeightMatrix::BLOSUM100) ; +} +//////////////////////////////////////////////////////////////////// +// getting/analyzing distance matrices from alignments +void export_distance_analysis() +{ + def("ClipAlignment", &ClipAlignment, (arg("aln"), arg("n_seq_thresh")=2, + arg("set_offset")=true, arg("remove_empty")=true)); + def("CreateDistanceMap", &CreateDistanceMap, (arg("aln"))); + def("CreateVarianceMap", &CreateVarianceMap, (arg("d_map"), arg("sigma")=25)); + def("CreateDist2Mean", &CreateDist2Mean, (arg("d_map"))); + + class_<Distances>("Distances", no_init) + .def("GetDataSize", &Distances::GetDataSize) + .def("GetAverage", &Distances::GetAverage) + .def("GetMin", &WrapDistancesGetMin) + .def("GetMax", &WrapDistancesGetMax) + .def("GetDataElement", &WrapDistancesGetDataElement, (arg("index"))) + .def("GetStdDev", &Distances::GetStdDev) + .def("GetWeightedStdDev", &Distances::GetWeightedStdDev, (arg("sigma"))) + .def("GetNormStdDev", &Distances::GetNormStdDev) + ; + + class_<DistanceMap, DistanceMapPtr, + boost::noncopyable>("DistanceMap", no_init) + .def("GetDistances", &DistanceMap::GetDistances, + return_value_policy<reference_existing_object>(), + (arg("i_res1"), arg("i_res2"))) + .def("GetSize", &DistanceMap::GetSize) + .def("GetNumStructures", &DistanceMap::GetNumStructures) + ; + + class_<VarianceMap, VarianceMapPtr, + boost::noncopyable>("VarianceMap", no_init) + .def("Get", &VarianceMap::Get, (arg("i_res1"), arg("i_res2")), + return_value_policy<return_by_value>()) + .def("GetSize", &VarianceMap::GetSize) + .def("Min", &VarianceMap::Min) + .def("Max", &VarianceMap::Max) + .def("ExportDat", &VarianceMap::ExportDat, (arg("file_name"))) + .def("ExportCsv", &VarianceMap::ExportCsv, (arg("file_name"))) + .def("ExportJson", &VarianceMap::ExportJson, (arg("file_name"))) + .def("GetJsonString", &VarianceMap::GetJsonString) + .def("GetData", &VarMapGetData) + ; + + class_<Dist2Mean, Dist2MeanPtr, + boost::noncopyable>("Dist2Mean", no_init) + .def("Get", &Dist2Mean::Get, (arg("i_res"), arg("i_str"))) + .def("GetNumResidues", &Dist2Mean::GetNumResidues) + .def("GetNumStructures", &Dist2Mean::GetNumStructures) + .def("ExportDat", &Dist2Mean::ExportDat, (arg("file_name"))) + .def("ExportCsv", &Dist2Mean::ExportCsv, (arg("file_name"))) + .def("ExportJson", &Dist2Mean::ExportJson, (arg("file_name"))) + .def("GetJsonString", &Dist2Mean::GetJsonString) + .def("GetData", &DistToMeanGetData) + ; +} + +BOOST_PYTHON_MODULE(_ost_seq_alg) +{ + export_aln_alg(); + export_contact_prediction(); + export_distance_analysis(); } diff --git a/modules/seq/alg/src/CMakeLists.txt b/modules/seq/alg/src/CMakeLists.txt index 7e969a6405fded0d522879d06a99ca86b2fb29b6..ea32bbd6cb8c98977e6d0933f132edc7a33d93c7 100644 --- a/modules/seq/alg/src/CMakeLists.txt +++ b/modules/seq/alg/src/CMakeLists.txt @@ -1,40 +1,44 @@ set(OST_SEQ_ALG_HEADERS -sequence_identity.hh -sequence_similarity.hh -module_config.hh -ins_del.hh -subst_weight_matrix.hh alignment_opts.hh -entropy.hh -merge_pairwise_alignments.hh +clip_alignment.hh conservation.hh -local_align.hh +contact_prediction_score.hh +contact_weight_matrix.hh +distance_map.hh +entropy.hh global_align.hh -semiglobal_align.hh +ins_del.hh +local_align.hh +merge_pairwise_alignments.hh +module_config.hh pair_subst_weight_matrix.hh -contact_weight_matrix.hh -contact_prediction_score.hh -data/default_pair_subst_weight_matrix.hh +semiglobal_align.hh +sequence_identity.hh +sequence_similarity.hh +subst_weight_matrix.hh +variance_map.hh data/default_contact_weight_matrix.hh +data/default_pair_subst_weight_matrix.hh ) set(OST_SEQ_ALG_SOURCES -merge_pairwise_alignments.cc -local_align.cc +clip_alignment.cc +conservation.cc +contact_prediction_score.cc +contact_weight_matrix.cc +distance_map.cc +entropy.cc global_align.cc +ins_del.cc +local_align.cc +merge_pairwise_alignments.cc +pair_subst_weight_matrix.cc semiglobal_align.cc -entropy.cc sequence_identity.cc sequence_similarity.cc -ins_del.cc -conservation.cc -contact_weight_matrix.cc subst_weight_matrix.cc -pair_subst_weight_matrix.cc -contact_prediction_score.cc +variance_map.cc ) module(NAME seq_alg HEADER_OUTPUT_DIR ost/seq/alg SOURCES ${OST_SEQ_ALG_SOURCES} HEADERS ${OST_SEQ_ALG_HEADERS} DEPENDS_ON ost_seq) - - diff --git a/modules/seq/alg/src/clip_alignment.cc b/modules/seq/alg/src/clip_alignment.cc new file mode 100644 index 0000000000000000000000000000000000000000..02ff9c14b572ff46f0386ae57ce9adb3f81a2d9a --- /dev/null +++ b/modules/seq/alg/src/clip_alignment.cc @@ -0,0 +1,99 @@ +//------------------------------------------------------------------------------ +// This file is part of the OpenStructure project <www.openstructure.org> +// +// Copyright (C) 2008-2016 by the OpenStructure authors +// +// This library is free software; you can redistribute it and/or modify it under +// the terms of the GNU Lesser General Public License as published by the Free +// Software Foundation; either version 3.0 of the License, or (at your option) +// any later version. +// This library is distributed in the hope that it will be useful, but WITHOUT +// ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +// FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more +// details. +// +// You should have received a copy of the GNU Lesser General Public License +// along with this library; if not, write to the Free Software Foundation, Inc., +// 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA +//------------------------------------------------------------------------------ + +/* + Author: Gerardo Tauriello, Juergen Haas + */ + +#include <vector> +#include <ost/seq/alg/clip_alignment.hh> + +namespace ost { namespace seq { namespace alg { + +int ClipAlignment(AlignmentHandle& aln, uint n_seq_thresh, + bool set_offset, bool remove_empty) { + // count entries per column + const int num_residues = aln.GetLength(); + const int num_seqs = aln.GetCount(); + std::vector<uint> counts(num_residues, 0); + // parse all sequences + for (int i_s = 0; i_s < num_seqs; ++i_s) { + String s = aln.GetSequence(i_s).GetString(); + // only consider non-gaps + const size_t from = s.find_first_not_of("-"); + if (from == String::npos) continue; + const size_t last = s.find_last_not_of("-"); + for (size_t i = from; i <= last; ++i) { + if (s[i] != '-') ++counts[i]; + } + } + + // find clip_start and clip_end + int clip_start = 0; + while (clip_start < num_residues && counts[clip_start] < n_seq_thresh) { + ++clip_start; + } + // check if we clipped away everything -> ABORT + if (clip_start == num_residues) { + if (remove_empty) aln = seq::CreateAlignment(); + return -1; + } + int clip_end = num_residues - 1; + while (clip_end > clip_start && counts[clip_end] < n_seq_thresh) { + --clip_end; + } + + // update offsets + if (set_offset) { + for (int i_s = 0; i_s < num_seqs; ++i_s) { + // only relevant if a view attached... + if (!aln.GetSequence(i_s).HasAttachedView()) continue; + // count non-gapped part to be cut + String s = aln.GetSequence(i_s).GetString(); + const size_t from = s.find_first_not_of("-"); + if (from == String::npos) continue; + uint cut_res = 0; + for (int i = from; i < clip_start; ++i) { + if (s[i] != '-') ++cut_res; + } + // update offset + aln.SetSequenceOffset(i_s, aln.GetSequenceOffset(i_s) + cut_res); + } + } + + // cut alignment + if (clip_end < num_residues - 1) aln.Cut(clip_end + 1, num_residues); + if (clip_start > 0) aln.Cut(0, clip_start); + + // remove empty sequences + if (remove_empty) { + int i_s = 0; + while (i_s < aln.GetCount()) { + if (aln.GetSequence(i_s).GetFirstNonGap() == -1) { + aln.RemoveSequence(i_s); + } else { + ++i_s; + } + } + } + + return clip_start; +} + +}}} diff --git a/modules/seq/alg/src/clip_alignment.hh b/modules/seq/alg/src/clip_alignment.hh new file mode 100644 index 0000000000000000000000000000000000000000..7af6abf5bd04bd2f33f9721b9e1edf9b0e81e272 --- /dev/null +++ b/modules/seq/alg/src/clip_alignment.hh @@ -0,0 +1,47 @@ +//------------------------------------------------------------------------------ +// This file is part of the OpenStructure project <www.openstructure.org> +// +// Copyright (C) 2008-2016 by the OpenStructure authors +// +// This library is free software; you can redistribute it and/or modify it under +// the terms of the GNU Lesser General Public License as published by the Free +// Software Foundation; either version 3.0 of the License, or (at your option) +// any later version. +// This library is distributed in the hope that it will be useful, but WITHOUT +// ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +// FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more +// details. +// +// You should have received a copy of the GNU Lesser General Public License +// along with this library; if not, write to the Free Software Foundation, Inc., +// 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA +//------------------------------------------------------------------------------ + +/* + Author: Gerardo Tauriello, Juergen Haas + */ + +#ifndef OST_SEQ_ALG_CLIP_ALIGNMENT_HH +#define OST_SEQ_ALG_CLIP_ALIGNMENT_HH + +#include <ost/seq/alignment_handle.hh> +#include <ost/seq/alg/module_config.hh> + +namespace ost { namespace seq { namespace alg { + +/// \brief Clips alignment so that first and last column have at least the +/// desired number of structures. +/// +/// \param aln Multiple sequence alignment. Will be cut. +/// \param n_seq_thresh Minimal number of sequences desired. +/// \param set_offset Shall we update offsets for attached views? +/// \param remove_empty Shall we remove sequences with only gaps in cut aln? +/// \returns Starting column (0-indexed), where cut region starts (w.r.t. +/// original aln). -1, if there is no region in the alignment with +/// at least the desired number of structures. +int DLLEXPORT_OST_SEQ_ALG +ClipAlignment(AlignmentHandle& aln, uint n_seq_thresh=2, bool set_offset=true, + bool remove_empty=true); +}}} + +#endif diff --git a/modules/seq/alg/src/distance_map.cc b/modules/seq/alg/src/distance_map.cc new file mode 100644 index 0000000000000000000000000000000000000000..57a1729135147b1707c47221e67d3768a304064e --- /dev/null +++ b/modules/seq/alg/src/distance_map.cc @@ -0,0 +1,75 @@ +//------------------------------------------------------------------------------ +// This file is part of the OpenStructure project <www.openstructure.org> +// +// Copyright (C) 2008-2016 by the OpenStructure authors +// +// This library is free software; you can redistribute it and/or modify it under +// the terms of the GNU Lesser General Public License as published by the Free +// Software Foundation; either version 3.0 of the License, or (at your option) +// any later version. +// This library is distributed in the hope that it will be useful, but WITHOUT +// ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +// FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more +// details. +// +// You should have received a copy of the GNU Lesser General Public License +// along with this library; if not, write to the Free Software Foundation, Inc., +// 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA +//------------------------------------------------------------------------------ + +/* + Author: Marco Biasini, Juergen Haas, Gerardo Tauriello + */ + +#include <ost/mol/mol.hh> +#include <ost/seq/alg/distance_map.hh> + +namespace ost { namespace seq { namespace alg { + +DistanceMapPtr CreateDistanceMap(const seq::AlignmentHandle& alignment) { + + // setup and check + if (alignment.GetCount() <= 1) { + throw IntegrityError("At least two sequences required!"); + } + seq::ConstSequenceHandle ref_seq = alignment.GetSequence(0); + String ref_seq_str = ref_seq.GetString(); + uint len = ref_seq.GetGaplessString().size(); + DistanceMapPtr dist_map(new DistanceMap(len, alignment.GetCount() - 1)); + + // parse all non-ref sequences + for (int j = 1; j < alignment.GetCount(); ++j) { + seq::ConstSequenceHandle s = alignment.GetSequence(j); + if (!s.HasAttachedView()) { + throw IntegrityError("All non-ref sequences must have an attached view!"); + } + // parse all residues in ref seq. + for (uint x = 0; x < ref_seq_str.size(); ++x) { + // skip gaps + if (ref_seq_str[x] == '-') continue; + // find CA and skip gaps and weird residues + mol::ResidueView res_a = s.GetResidue(x); + if (!res_a) continue; + mol::AtomView ca_a = res_a.FindAtom("CA"); + if (!ca_a) continue; + // compare with all others (considering symmetry) + for (uint y = x+1; y < ref_seq_str.size(); ++y) { + // skip gaps + if (ref_seq_str[y]=='-') continue; + // find CA and skip gaps and weird residues + mol::ResidueView res_b = s.GetResidue(y); + if (!res_b) continue; + mol::AtomView ca_b = res_b.FindAtom("CA"); + if (!ca_b) continue; + // add it to the map + dist_map->AddDistance(ref_seq.GetResidueIndex(x), + ref_seq.GetResidueIndex(y), + geom::Distance(ca_a.GetPos(), ca_b.GetPos()), + j); + } + } + } + return dist_map; +} + +}}} diff --git a/modules/seq/alg/src/distance_map.hh b/modules/seq/alg/src/distance_map.hh new file mode 100644 index 0000000000000000000000000000000000000000..faf1f553ccfee616d58d40a760234a5dc8f8dd63 --- /dev/null +++ b/modules/seq/alg/src/distance_map.hh @@ -0,0 +1,187 @@ +//------------------------------------------------------------------------------ +// This file is part of the OpenStructure project <www.openstructure.org> +// +// Copyright (C) 2008-2016 by the OpenStructure authors +// +// This library is free software; you can redistribute it and/or modify it under +// the terms of the GNU Lesser General Public License as published by the Free +// Software Foundation; either version 3.0 of the License, or (at your option) +// any later version. +// This library is distributed in the hope that it will be useful, but WITHOUT +// ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +// FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more +// details. +// +// You should have received a copy of the GNU Lesser General Public License +// along with this library; if not, write to the Free Software Foundation, Inc., +// 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA +//------------------------------------------------------------------------------ + +/* + Author: Marco Biasini, Juergen Haas, Gerardo Tauriello + */ + +#ifndef OST_SEQ_ALG_DISTANCE_MAP_HH +#define OST_SEQ_ALG_DISTANCE_MAP_HH + +#include <numeric> +#include <cmath> +#include <algorithm> + +#include <ost/tri_matrix.hh> +#include <ost/integrity_error.hh> +#include <ost/seq/alignment_handle.hh> +#include <ost/seq/alg/module_config.hh> + +namespace ost { namespace seq { namespace alg { + +/////////////////////////////////////////////////////////////////////////////// +// HELPERS +namespace { + +template <typename Pair> +struct sum_first { + typename Pair::first_type operator()(const typename Pair::first_type& x, + const Pair& y) const + { + return x + y.first; + } +}; + +template <typename Pair> +struct CompareFirst { + bool operator()(const Pair& x, const Pair& y) const { + return (x.first < y.first); + } +}; + +} // anon ns +/////////////////////////////////////////////////////////////////////////////// + +/// \brief Container for a pair wise distance for each structure. +/// Each structure is identified by its index in the originally used alignment. +struct DLLEXPORT_OST_SEQ_ALG Distances { + + typedef std::vector<std::pair<Real,int> > FloatArray; + + uint GetDataSize() const { return values_.size(); } + + void Add(Real dist, int index) { + values_.push_back(std::make_pair(dist,index)); + } + + Real GetAverage() const { + if (values_.empty()) throw IntegrityError("List of distances empty!"); + Real sum = std::accumulate(values_.begin(), values_.end(), 0.0, + sum_first<FloatArray::value_type>()); + return sum/values_.size(); + } + + std::pair<Real,int> GetMin() const { + if (values_.empty()) throw IntegrityError("List of distances empty!"); + return *std::min_element(values_.begin(), values_.end(), + CompareFirst<FloatArray::value_type>()); + } + + std::pair<Real,int> GetMax() const { + if (values_.empty()) throw IntegrityError("List of distances empty!"); + return *std::max_element(values_.begin(), values_.end(), + CompareFirst<FloatArray::value_type>()); + } + + std::pair<Real,int> GetDataElement(uint index) const { + if (values_.size() != 0) { + if (values_.size() > index) { + return values_[index]; + } else { + std::stringstream error_message; + error_message << "size of distances data vector: " << values_.size() + << " is smaller than requested: " << index+1; + throw IntegrityError(error_message.str()); + } + } else { + throw IntegrityError("List of distances empty!"); + } + } + + Real GetStdDev() const { + if (values_.empty()) throw IntegrityError("List of distances empty!"); + const Real avg = this->GetAverage(); + return this->do_get_stddev(avg); + } + + Real GetWeightedStdDev(Real sigma) const { + if (values_.empty()) throw IntegrityError("List of distances empty!"); + const Real avg = this->GetAverage(); + const Real weight = std::exp(avg/(-2*sigma)); + return this->do_get_stddev(avg) * weight; + } + + Real GetNormStdDev() const { + if (values_.empty()) throw IntegrityError("List of distances empty!"); + const Real avg = this->GetAverage(); + return this->do_get_stddev(avg) / avg; + } + + bool operator==(const Distances& rhs) const { return values_ == rhs.values_; } + bool operator!=(const Distances& rhs) const { return values_ != rhs.values_; } + +private: + + Real do_get_stddev(Real avg) const { + Real var = 0.0; + for (FloatArray::const_iterator i = values_.begin(), + e = values_.end(); i != e; ++i) { + const Real diff = i->first - avg; + var += diff*diff; + } + return std::sqrt(var/values_.size()); + } + + FloatArray values_; +}; + +/// \brief Container for pair wise distances of a set of structures. +/// Essentially a symmetric GetSize x GetSize matrix containing up to +/// GetNumStructures distances (see Distances). +class DistanceMap; +typedef boost::shared_ptr<DistanceMap> DistanceMapPtr; + +class DLLEXPORT_OST_SEQ_ALG DistanceMap { +public: + DistanceMap(uint nresidues, uint num_structures) + : dist_(nresidues), num_structures_(num_structures) { } + + void AddDistance(uint i, uint j, Real distance, int index) { + dist_(i ,j).Add(distance,index); + } + + const Distances& GetDistances(uint i, uint j) const { + return dist_(i, j); + } + + uint GetSize() const { return dist_.GetSize(); } + + uint GetNumStructures() const { return num_structures_; } + +private: + TriMatrix<Distances> dist_; + uint num_structures_; +}; + + +/// \brief create distance map from a multiple sequence alignment. +/// +/// The algorithm requires that the sequence alignment consists of at least two +/// sequences. The sequence at index 0 serves as a frame of reference. All the +/// other sequences must have an attached view and a properly set +/// \ref seq::SequenceHandle::SetSequenceOffset() "sequence offset". +/// +/// For each of the attached views, the C-alpha distance pairs are extracted and +/// mapped onto the corresponding C-alpha distances in the reference sequence. +DistanceMapPtr DLLEXPORT_OST_SEQ_ALG +CreateDistanceMap(const seq::AlignmentHandle& alignment); + +}}} + +#endif diff --git a/modules/seq/alg/src/variance_map.cc b/modules/seq/alg/src/variance_map.cc new file mode 100644 index 0000000000000000000000000000000000000000..c3c214f836f124096bd4ddc2a08ffafe4a575b35 --- /dev/null +++ b/modules/seq/alg/src/variance_map.cc @@ -0,0 +1,196 @@ +//------------------------------------------------------------------------------ +// This file is part of the OpenStructure project <www.openstructure.org> +// +// Copyright (C) 2008-2016 by the OpenStructure authors +// +// This library is free software; you can redistribute it and/or modify it under +// the terms of the GNU Lesser General Public License as published by the Free +// Software Foundation; either version 3.0 of the License, or (at your option) +// any later version. +// This library is distributed in the hope that it will be useful, but WITHOUT +// ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +// FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more +// details. +// +// You should have received a copy of the GNU Lesser General Public License +// along with this library; if not, write to the Free Software Foundation, Inc., +// 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA +//------------------------------------------------------------------------------ + +/* + Author: Gerardo Tauriello, Juergen Haas + */ + +#include <fstream> +#include <iostream> +#include <sstream> + +#include <ost/mol/mol.hh> + +#include <ost/seq/alg/distance_map.hh> +#include <ost/seq/alg/variance_map.hh> + +namespace ost { namespace seq { namespace alg { + +/////////////////////////////////////////////////////////////////////////////// +// HELPERS +namespace { +// dump stuff using () operator +template <typename T> +void DumpCsv(const String& file_name, const T& data, uint num_rows, + uint num_cols) { + // open file + std::ofstream out_file(file_name.c_str()); + if (!out_file) throw Error("The file '" + file_name + "' cannot be opened."); + // dump each row + for (uint row = 0; row < num_rows; ++row) { + for (uint col = 0; col < num_cols; ++col) { + out_file << data(row, col); + if (col < num_cols-1) out_file << ";"; + } + out_file << std::endl; + } + out_file.close(); +} + +template <typename T> +String GetJson(const T& data, uint num_rows, uint num_cols) { + std::ostringstream out_stream; + out_stream << "["; + for (uint row = 0; row < num_rows; ++row) { + out_stream << "["; + for (uint col = 0; col < num_cols; ++col) { + out_stream << data(row, col); + if (col < num_cols-1) out_stream << ","; + } + out_stream << "]"; + if (row < num_rows-1) out_stream << ","; + } + out_stream << "]"; + return out_stream.str(); +} +} // anon ns + +/////////////////////////////////////////////////////////////////////////////// +// IO +void VarianceMap::ExportDat(const String& file_name) { + uint len = this->GetSize(); + if (len == 0) throw IntegrityError("Matrix empty"); + // dump it + std::ofstream out_file(file_name.c_str()); + if (!out_file) throw Error("The file '" + file_name + "' cannot be opened."); + for (uint rows = 0; rows < len; ++rows) { + for (uint cols = 0; cols < len; ++cols) { + out_file << (rows+1) << " " << (cols+1) << " " << this->Get(rows, cols) + << std::endl; + } + } + out_file.close(); +} + +void VarianceMap::ExportCsv(const String& file_name) { + uint len = this->GetSize(); + if (len == 0) throw IntegrityError("Matrix empty"); + DumpCsv(file_name, *this, len, len); +} + +void VarianceMap::ExportJson(const String& file_name) { + if (this->GetSize() == 0) throw IntegrityError("Matrix empty"); + std::ofstream out_file(file_name.c_str()); + if (!out_file) throw Error("The file '" + file_name + "' cannot be opened."); + out_file << this->GetJsonString() << std::endl; + out_file.close(); +} + +String VarianceMap::GetJsonString() { + uint len = this->GetSize(); + if (len == 0) throw IntegrityError("Matrix empty"); + return GetJson(*this, len, len); +} + +void Dist2Mean::ExportDat(const String& file_name) { + if (values_.size() == 0) throw IntegrityError("Matrix empty"); + // dump it + std::ofstream out_file(file_name.c_str()); + if (!out_file) throw Error("The file '" + file_name + "' cannot be opened."); + for (uint i_res = 0; i_res < num_residues_; ++i_res) { + out_file << (i_res+1); + for (uint i_str = 0; i_str < num_structures_; ++i_str) { + out_file << " " << this->Get(i_res, i_str); + } + out_file << std::endl; + } + out_file.close(); +} + +void Dist2Mean::ExportCsv(const String& file_name) { + if (values_.size() == 0) throw IntegrityError("Matrix empty"); + DumpCsv(file_name, *this, num_residues_, num_structures_); +} + +void Dist2Mean::ExportJson(const String& file_name) { + if (values_.size() == 0) throw IntegrityError("Matrix empty"); + std::ofstream out_file(file_name.c_str()); + if (!out_file) throw Error("The file '" + file_name + "' cannot be opened."); + out_file << this->GetJsonString() << std::endl; + out_file.close(); +} + +String Dist2Mean::GetJsonString() { + if (values_.size() == 0) throw IntegrityError("Matrix empty"); + return GetJson(*this, num_residues_, num_structures_); +} + +/////////////////////////////////////////////////////////////////////////////// +// Create maps +VarianceMapPtr CreateVarianceMap(const DistanceMapPtr dmap, Real sigma) { + // setup + uint len = dmap->GetSize(); + if (len == 0) { + throw IntegrityError("empty distance map provided"); + } + // get weighted std for each pair (symmetric storage!) + VarianceMapPtr vmap(new VarianceMap(len)); + for (uint i_res = 0; i_res < len; ++i_res) { + for (uint i_res2 = 0; i_res2 <= i_res; ++i_res2) { + const Distances& di = dmap->GetDistances(i_res, i_res2); + if (di.GetDataSize() > 0) { + vmap->Set(i_res, i_res2, di.GetWeightedStdDev(sigma)); + } else { + vmap->Set(i_res, i_res2, 0); + } + } + } + return vmap; +} + +Dist2MeanPtr CreateDist2Mean(const DistanceMapPtr dmap) { + // setup/check + uint nstruc = dmap->GetNumStructures(); + uint len = dmap->GetSize(); + if (len == 0 || nstruc == 0) { + throw IntegrityError("empty distance map provided"); + } + // sum up all distances to mean for each residue (symmetric dmap!) + Dist2MeanPtr dist2mean(new Dist2Mean(len, nstruc)); + for (uint i_res = 0; i_res < len; ++i_res) { + for (uint i_res2 = 0; i_res2 <= i_res; ++i_res2) { + const Distances& di = dmap->GetDistances(i_res, i_res2); + const uint size = di.GetDataSize(); + if (size >= 2) { + const Real avg = di.GetAverage(); + for (uint h = 0; h < size; ++h) { + const std::pair<Real, int> ret = di.GetDataElement(h); + const Real val = std::abs(ret.first - avg); + dist2mean->Add(i_res, ret.second-1, val); + if (i_res != i_res2) dist2mean->Add(i_res2, ret.second-1, val); + } + } + } + } + // normalize + dist2mean->DivideBy(len); + return dist2mean; +} + +}}} diff --git a/modules/seq/alg/src/variance_map.hh b/modules/seq/alg/src/variance_map.hh new file mode 100644 index 0000000000000000000000000000000000000000..0c7e02824241a7cee6becbc580cc408b6bf41747 --- /dev/null +++ b/modules/seq/alg/src/variance_map.hh @@ -0,0 +1,137 @@ +//------------------------------------------------------------------------------ +// This file is part of the OpenStructure project <www.openstructure.org> +// +// Copyright (C) 2008-2016 by the OpenStructure authors +// +// This library is free software; you can redistribute it and/or modify it under +// the terms of the GNU Lesser General Public License as published by the Free +// Software Foundation; either version 3.0 of the License, or (at your option) +// any later version. +// This library is distributed in the hope that it will be useful, but WITHOUT +// ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +// FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more +// details. +// +// You should have received a copy of the GNU Lesser General Public License +// along with this library; if not, write to the Free Software Foundation, Inc., +// 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA +//------------------------------------------------------------------------------ + +/* + Author: Gerardo Tauriello, Juergen Haas + */ + +#ifndef OST_SEQ_ALG_VARIANCE_MAP_HH +#define OST_SEQ_ALG_VARIANCE_MAP_HH + +#include <algorithm> + +#include <ost/seq/alignment_handle.hh> +#include <ost/tri_matrix.hh> +#include <ost/integrity_error.hh> +#include <ost/seq/alg/module_config.hh> +#include <ost/seq/alg/distance_map.hh> + +namespace ost { namespace seq { namespace alg { + +class VarianceMap; +class Dist2Mean; +typedef boost::shared_ptr<VarianceMap> VarianceMapPtr; +typedef boost::shared_ptr<Dist2Mean> Dist2MeanPtr; + +/// \brief Container for variances for each entry in a distance map. +/// Main functionality: Get/Set, Min, Max, ExportXXX +/// Get/Set and GetSize taken from TriMatrix. +class DLLEXPORT_OST_SEQ_ALG VarianceMap: public TriMatrix<Real> { +public: + // all values initialized to 0 in constructor! + VarianceMap(int nresidues): TriMatrix<Real>(nresidues, 0) { } + + Real Min() { + if (this->GetSize() == 0) throw IntegrityError("Matrix empty"); + std::vector<Real>& data = this->Data(); + return *std::min_element(data.begin(), data.end()); + } + + Real Max() { + if (this->GetSize() == 0) throw IntegrityError("Matrix empty"); + std::vector<Real>& data = this->Data(); + return *std::max_element(data.begin(), data.end()); + } + + void ExportDat(const String& file_name); + void ExportCsv(const String& file_name); + void ExportJson(const String& file_name); + String GetJsonString(); +}; + +/// \brief Container for distances to mean for N structures. +/// Main functionality: Get/Set, ExportXXX +class DLLEXPORT_OST_SEQ_ALG Dist2Mean { +public: + // all values initialized to 0 in constructor! + Dist2Mean(uint num_residues, uint num_structures) + : num_residues_(num_residues), num_structures_(num_structures) + , values_(num_residues * num_structures, 0) { } + + void Set(uint i_res, uint i_str, Real val) { + values_[GetIndex(i_res, i_str)] = val; + } + + Real Get(uint i_res, uint i_str) const { + return values_[GetIndex(i_res, i_str)]; + } + + Real& operator()(uint i_res, uint i_str) { + return values_[GetIndex(i_res, i_str)]; + } + Real operator()(uint i_res, uint i_str) const { + return values_[GetIndex(i_res, i_str)]; + } + + std::vector<Real>& Data() { return values_; } + + uint GetNumResidues() const { return num_residues_; } + uint GetNumStructures() const { return num_structures_; } + + void Add(uint i_res, uint i_str, Real val) { + values_[GetIndex(i_res, i_str)] += val; + } + + void DivideBy(Real val) { + for (uint i = 0; i < values_.size(); ++i) values_[i] /= val; + } + + void ExportDat(const String& file_name); + void ExportCsv(const String& file_name); + void ExportJson(const String& file_name); + String GetJsonString(); + +private: + uint GetIndex(uint i_res, uint i_str) const { + assert(i_res < num_residues_); + assert(i_str < num_structures_); + return (i_res * num_structures_ + i_str); + } + + uint num_residues_; + uint num_structures_; + std::vector<Real> values_; +}; + +/// \returns Variance measure for each entry in dmap. +/// \param dmap Distance map as created with CreateDistanceMap. +/// \param sigma Used for weighting of variance measure +/// (see Distances::GetWeightedStdDev) +VarianceMapPtr DLLEXPORT_OST_SEQ_ALG +CreateVarianceMap(const DistanceMapPtr dmap, Real sigma=25); + +/// \returns Distances to mean for each structure in dmap. +/// Structures are in the same order as passed when creating dmap. +/// \param dmap Distance map as created with CreateDistanceMap. +Dist2MeanPtr DLLEXPORT_OST_SEQ_ALG +CreateDist2Mean(const DistanceMapPtr dmap); + +}}} + +#endif diff --git a/modules/seq/alg/tests/CMakeLists.txt b/modules/seq/alg/tests/CMakeLists.txt index ca0e72988f5fb0e4e8e7841a405e7f1303541cd5..529a3eb62e02f3273258680064670f4e9ceb8850 100644 --- a/modules/seq/alg/tests/CMakeLists.txt +++ b/modules/seq/alg/tests/CMakeLists.txt @@ -1,4 +1,5 @@ set(OST_SEQ_ALG_UNIT_TESTS + test_distance_analysis.cc test_merge_pairwise_alignments.cc test_sequence_identity.cc tests.cc @@ -10,5 +11,6 @@ set(OST_SEQ_ALG_UNIT_TESTS test_aligntoseqres.py ) -ost_unittest(MODULE seq_alg SOURCES "${OST_SEQ_ALG_UNIT_TESTS}") +ost_unittest(MODULE seq_alg SOURCES "${OST_SEQ_ALG_UNIT_TESTS}" + LINK ost_mol ost_io) diff --git a/modules/seq/alg/tests/test_distance_analysis.cc b/modules/seq/alg/tests/test_distance_analysis.cc new file mode 100644 index 0000000000000000000000000000000000000000..e8b41e819329c2c6adcf60e6f241154bc740f186 --- /dev/null +++ b/modules/seq/alg/tests/test_distance_analysis.cc @@ -0,0 +1,269 @@ +//------------------------------------------------------------------------------ +// This file is part of the OpenStructure project <www.openstructure.org> +// +// Copyright (C) 2008-2016 by the OpenStructure authors +// +// This library is free software; you can redistribute it and/or modify it under +// the terms of the GNU Lesser General Public License as published by the Free +// Software Foundation; either version 3.0 of the License, or (at your option) +// any later version. +// This library is distributed in the hope that it will be useful, but WITHOUT +// ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +// FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more +// details. +// +// You should have received a copy of the GNU Lesser General Public License +// along with this library; if not, write to the Free Software Foundation, Inc., +// 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA +//------------------------------------------------------------------------------ + +/* + Author: Gerardo Tauriello + */ + +#define BOOST_TEST_DYN_LINK +#include <boost/test/unit_test.hpp> +#include <boost/test/auto_unit_test.hpp> + +#include <ost/mol/atom_view.hh> +#include <ost/mol/xcs_editor.hh> +#include <ost/io/mol/load_entity.hh> +#include <ost/seq/alg/clip_alignment.hh> +#include <ost/seq/alg/distance_map.hh> +#include <ost/seq/alg/variance_map.hh> +#include <ost/integrity_error.hh> + +using namespace ost; +using namespace ost::seq; + +BOOST_AUTO_TEST_SUITE(ost_seq_alg); + +// HELPERS +AlignmentHandle GetClippingAln() { + AlignmentHandle aln = CreateAlignment(); + aln.AddSequence(CreateSequence("ref", "DDFAGTHN")); + aln.AddSequence(CreateSequence("A", "-DFAG---")); + aln.AddSequence(CreateSequence("B", "-----THN")); + aln.AddSequence(CreateSequence("C", "--FAG---")); + aln.AddSequence(CreateSequence("D", "DDFAG---")); + // attach view + mol::EntityHandle ent = io::LoadEntity("testfiles/dummy.pdb"); + for (int i_s = 1; i_s < aln.GetCount(); ++i_s) { + aln.AttachView(i_s, ent.CreateFullView()); + aln.SetSequenceOffset(i_s, aln.GetSequence(i_s).GetFirstNonGap()); + } + return aln; +} + +AlignmentHandle GetDistanceMapAln() { + // prepare alignment + AlignmentHandle aln = CreateAlignment(); + aln.AddSequence(CreateSequence("ref", "DDFAGTHN")); + aln.AddSequence(CreateSequence("A", "DDFAGT--")); + aln.AddSequence(CreateSequence("B", "--FAGTH-")); + // attach views + mol::EntityHandle ent = io::LoadEntity("testfiles/dummy.pdb"); + // move CA of res. at index 3 (res. num = 4) for ent + mol::AtomHandle atom = ent.FindAtom("A", 4, "CA"); + ent.EditXCS(mol::UNBUFFERED_EDIT).SetAtomPos(atom, geom::Vec3(0,0,0)); + BOOST_CHECK_EQUAL(ent.FindAtom("A", 4, "CA").GetPos(), geom::Vec3(0,0,0)); + mol::EntityHandle ent2 = io::LoadEntity("testfiles/dummy.pdb"); + aln.AttachView(1, ent.CreateFullView()); + aln.AttachView(2, ent2.CreateFullView()); + aln.SetSequenceOffset(2, 2); + return aln; +} + +alg::DistanceMapPtr GetDistanceMap() { + AlignmentHandle aln = GetDistanceMapAln(); + return alg::CreateDistanceMap(aln); +} + +// TESTS +BOOST_AUTO_TEST_CASE(test_clip_alignment_no_change) { + // clip it at level 2 (no change!) + AlignmentHandle aln_cut = GetClippingAln(); + int clip_start = alg::ClipAlignment(aln_cut, 2); + BOOST_CHECK_EQUAL(clip_start, 0); + BOOST_CHECK_EQUAL(aln_cut.GetLength(), 8); + BOOST_CHECK_EQUAL(aln_cut.GetCount(), 5); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(0).GetString(), "DDFAGTHN"); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(1).GetString(), "-DFAG---"); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(2).GetString(), "-----THN"); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(3).GetString(), "--FAG---"); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(4).GetString(), "DDFAG---"); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(0), 0); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(1), 1); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(2), 5); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(3), 2); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(4), 0); +} + +BOOST_AUTO_TEST_CASE(test_clip_alignment_no_offset) { + // clip it at level 3 without updating offsets and deleting + AlignmentHandle aln_cut = GetClippingAln(); + int clip_start = alg::ClipAlignment(aln_cut, 3, false, false); + BOOST_CHECK_EQUAL(clip_start, 1); + BOOST_CHECK_EQUAL(aln_cut.GetLength(), 4); + BOOST_CHECK_EQUAL(aln_cut.GetCount(), 5); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(0).GetString(), "DFAG"); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(1).GetString(), "DFAG"); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(2).GetString(), "----"); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(3).GetString(), "-FAG"); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(4).GetString(), "DFAG"); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(0), 0); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(1), 1); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(2), 5); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(3), 2); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(4), 0); +} + +BOOST_AUTO_TEST_CASE(test_clip_alignment_def) { + // clip it at level 3 with updating offsets and deleting + AlignmentHandle aln_cut = GetClippingAln(); + int clip_start = alg::ClipAlignment(aln_cut, 3, true, true); + BOOST_CHECK_EQUAL(clip_start, 1); + BOOST_CHECK_EQUAL(aln_cut.GetLength(), 4); + BOOST_CHECK_EQUAL(aln_cut.GetCount(), 4); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(0).GetString(), "DFAG"); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(1).GetString(), "DFAG"); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(2).GetString(), "-FAG"); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(3).GetString(), "DFAG"); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(0), 0); // REF! + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(1), 1); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(2), 2); + BOOST_CHECK_EQUAL(aln_cut.GetSequenceOffset(3), 1); +} + +BOOST_AUTO_TEST_CASE(test_clip_alignment_remove_all) { + // clip it at level 5 to remove all + AlignmentHandle aln_cut = GetClippingAln(); + int clip_start = alg::ClipAlignment(aln_cut, 5, true, true); + BOOST_CHECK_EQUAL(clip_start, -1); + BOOST_CHECK_EQUAL(aln_cut.GetCount(), 0); +} + +BOOST_AUTO_TEST_CASE(test_clip_alignment_generic) { + // generic tests + AlignmentHandle aln_cut = CreateAlignment(); + int clip_start = alg::ClipAlignment(aln_cut); + BOOST_CHECK_EQUAL(clip_start, -1); + BOOST_CHECK_EQUAL(aln_cut.GetCount(), 0); + aln_cut.AddSequence(CreateSequence("ref", "DDFAG")); + aln_cut.AddSequence(CreateSequence("A", "-----")); + clip_start = alg::ClipAlignment(aln_cut, 1, true, true); + BOOST_CHECK_EQUAL(clip_start, 0); + BOOST_CHECK_EQUAL(aln_cut.GetCount(), 1); + BOOST_CHECK_EQUAL(aln_cut.GetSequence(0).GetString(), "DDFAG"); + aln_cut.AddSequence(CreateSequence("A", "-----")); + clip_start = alg::ClipAlignment(aln_cut, 2, true, true); + BOOST_CHECK_EQUAL(clip_start, -1); + BOOST_CHECK_EQUAL(aln_cut.GetCount(), 0); +} + +BOOST_AUTO_TEST_CASE(test_distance_map) { + // setup + AlignmentHandle aln = GetDistanceMapAln(); + alg::DistanceMapPtr d_map = alg::CreateDistanceMap(aln); + geom::Vec3 pos2 = aln.GetResidue(2, 2).FindAtom("CA").GetPos(); + geom::Vec3 pos3 = aln.GetResidue(2, 3).FindAtom("CA").GetPos(); + geom::Vec3 pos4 = aln.GetResidue(2, 4).FindAtom("CA").GetPos(); + // check map size + BOOST_CHECK_EQUAL(d_map->GetSize(), 8u); + BOOST_CHECK_EQUAL(d_map->GetNumStructures(), 2u); + // check single entry + const alg::Distances& di = d_map->GetDistances(2,4); + BOOST_CHECK(di == d_map->GetDistances(4,2)); + BOOST_CHECK_EQUAL(di.GetDataSize(), 2u); + std::pair<Real,int> data0 = di.GetDataElement(0); + std::pair<Real,int> data1 = di.GetDataElement(1); + BOOST_CHECK_EQUAL(data0.second + data1.second, 1+2); + BOOST_CHECK_EQUAL(data0.first, geom::Length(pos4 - pos2)); + BOOST_CHECK_EQUAL(data1.first, data0.first); + // check modified entry + const alg::Distances& di2 = d_map->GetDistances(3,4); + BOOST_CHECK_EQUAL(di2.GetDataSize(), 2u); + data0 = di2.GetDataElement(0); + data1 = di2.GetDataElement(1); + if (data0.second == 2) std::swap(data0, data1); + BOOST_CHECK_EQUAL(data0.second, 1); + BOOST_CHECK_EQUAL(data1.second, 2); + BOOST_CHECK_EQUAL(data0.first, geom::Length(pos4)); + BOOST_CHECK_EQUAL(data1.first, geom::Length(pos4 - pos3)); + BOOST_CHECK(data0.first != data1.first); + // min/max and other features + std::pair<Real,int> min_data = di2.GetMin(); + std::pair<Real,int> max_data = di2.GetMax(); + if (data0.first > data1.first) { + BOOST_CHECK(max_data == data0); + BOOST_CHECK(min_data == data1); + } else { + BOOST_CHECK(max_data == data1); + BOOST_CHECK(min_data == data0); + } + const Real exp_avg = (data0.first + data1.first)/2; + BOOST_CHECK_CLOSE(di2.GetAverage(), exp_avg, 1e-3); + const Real exp_std = std::abs(data0.first - exp_avg); + BOOST_CHECK_CLOSE(di2.GetStdDev(), exp_std, 1e-3); + BOOST_CHECK_CLOSE(di2.GetWeightedStdDev(2), exp_std * std::exp(-exp_avg/4), + 1e-3); + BOOST_CHECK_CLOSE(di2.GetNormStdDev(), exp_std/exp_avg, 1e-3); + // check other entries + BOOST_CHECK_EQUAL(d_map->GetDistances(1,2).GetDataSize(), 1u); + BOOST_CHECK_EQUAL(d_map->GetDistances(1,7).GetDataSize(), 0u); + BOOST_CHECK_EQUAL(d_map->GetDistances(4,7).GetDataSize(), 0u); +} + +BOOST_AUTO_TEST_CASE(test_variance_map) { + alg::DistanceMapPtr d_map = GetDistanceMap(); + // check variance map + alg::VarianceMapPtr std_mat = alg::CreateVarianceMap(d_map); + BOOST_CHECK_EQUAL(std_mat->GetSize(), 8); + for (uint row = 0; row < 8; ++row) { + for (uint col = 0; col < row; ++col) { + if ((col == 3 && row <= 5) || (row == 3 && col >= 2)) { + BOOST_CHECK(std_mat->Get(row, col) > 0); + } else { + BOOST_CHECK_EQUAL(std_mat->Get(row, col), Real(0)); + } + BOOST_CHECK_EQUAL(std_mat->Get(row, col), std_mat->Get(col, row)); + } + BOOST_CHECK_EQUAL(std_mat->Get(row, row), Real(0)); + } + Real max_std = std::max(std_mat->Get(3,4), std_mat->Get(2,3)); + max_std = std::max(std_mat->Get(3,5), max_std); + BOOST_CHECK_EQUAL(std_mat->Min(), Real(0)); + BOOST_CHECK_EQUAL(std_mat->Max(), max_std); + // check json export + String json = std_mat->GetJsonString(); + BOOST_CHECK_EQUAL(std::count(json.begin(), json.end(), '['), 8+1); + BOOST_CHECK_EQUAL(std::count(json.begin(), json.end(), ']'), 8+1); + BOOST_CHECK_EQUAL(std::count(json.begin(), json.end(), ','), 7*8 + 7); +} + +BOOST_AUTO_TEST_CASE(test_dist_to_mean) { + alg::DistanceMapPtr d_map = GetDistanceMap(); + // check distance to mean + alg::Dist2MeanPtr dist_to_mean = alg::CreateDist2Mean(d_map); + uint num_res = dist_to_mean->GetNumResidues(); + uint num_str = dist_to_mean->GetNumStructures(); + BOOST_CHECK_EQUAL(num_res, 8u); + BOOST_CHECK_EQUAL(num_str, 2u); + for (uint row = 0; row < num_res; ++row) { + for (uint col = 0; col < num_str; ++col) { + if (row >= 2 && row <= 5) { + BOOST_CHECK(dist_to_mean->Get(row, col) != 0); + } else { + if (dist_to_mean->Get(row, col) != 0) std::cout << row << ", " << col; + BOOST_CHECK_EQUAL(dist_to_mean->Get(row, col), Real(0)); + } + } + } + // check json export + String json = dist_to_mean->GetJsonString(); + BOOST_CHECK_EQUAL(std::count(json.begin(), json.end(), '['), 8+1); + BOOST_CHECK_EQUAL(std::count(json.begin(), json.end(), ']'), 8+1); + BOOST_CHECK_EQUAL(std::count(json.begin(), json.end(), ','), 1*8 + 7); +} + +BOOST_AUTO_TEST_SUITE_END(); diff --git a/modules/seq/alg/tests/testfiles/dummy.pdb b/modules/seq/alg/tests/testfiles/dummy.pdb new file mode 100644 index 0000000000000000000000000000000000000000..30bd0291d76fcf1580cef43f24ffab295caa8ec0 --- /dev/null +++ b/modules/seq/alg/tests/testfiles/dummy.pdb @@ -0,0 +1,41 @@ +ATOM 1 N ASP A 1 0.000 0.000 0.000 1.00 0.00 N +ATOM 2 CA ASP A 1 1.430 0.000 0.000 1.00 0.00 C +ATOM 3 C ASP A 1 1.866 1.425 0.000 1.00 0.00 C +ATOM 4 O ASP A 1 2.762 1.828 0.740 1.00 0.00 O +ATOM 5 CB ASP A 1 1.954 -0.709 -1.214 1.00 0.00 C +ATOM 6 N ASP A 2 1.227 2.248 -0.851 1.00 0.00 N +ATOM 7 CA ASP A 2 1.585 3.631 -0.913 1.00 0.00 C +ATOM 8 C ASP A 2 1.354 4.202 0.444 1.00 0.00 C +ATOM 9 O ASP A 2 2.176 4.940 0.984 1.00 0.00 O +ATOM 10 CB ASP A 2 0.748 4.344 -1.932 1.00 0.00 C +ATOM 11 N PHE A 3 0.198 3.868 1.046 1.00 0.00 N +ATOM 12 CA PHE A 3 -0.101 4.374 2.350 1.00 0.00 C +ATOM 13 C PHE A 3 0.977 3.896 3.260 1.00 0.00 C +ATOM 14 O PHE A 3 1.514 4.643 4.077 1.00 0.00 O +ATOM 15 CB PHE A 3 -1.437 3.875 2.812 1.00 0.00 C +ATOM 16 N ALA A 4 1.333 2.604 3.143 1.00 0.00 N +ATOM 17 CA ALA A 4 2.360 2.066 3.980 1.00 0.00 C +ATOM 18 C ALA A 4 3.599 2.847 3.707 1.00 0.00 C +ATOM 19 O ALA A 4 4.320 3.256 4.616 1.00 0.00 O +ATOM 20 CB ALA A 4 2.578 0.614 3.676 1.00 0.00 C +ATOM 21 N GLY A 5 3.887 3.083 2.415 1.00 0.00 N +ATOM 22 CA GLY A 5 5.058 3.826 2.067 1.00 0.00 C +ATOM 23 C GLY A 5 4.925 5.169 2.697 1.00 0.00 C +ATOM 24 O GLY A 5 5.861 5.699 3.294 1.00 0.00 O +ATOM 25 N THR A 6 3.729 5.772 2.579 1.00 0.00 N +ATOM 26 CA THR A 6 3.518 7.063 3.156 1.00 0.00 C +ATOM 27 C THR A 6 3.755 6.930 4.621 1.00 0.00 C +ATOM 28 O THR A 6 4.423 7.752 5.246 1.00 0.00 O +ATOM 29 CB THR A 6 2.119 7.534 2.891 1.00 0.00 C +ATOM 30 N HIS A 7 3.200 5.863 5.224 1.00 0.00 N +ATOM 31 CA HIS A 7 3.379 5.662 6.628 1.00 0.00 C +ATOM 32 C HIS A 7 4.845 5.537 6.866 1.00 0.00 C +ATOM 33 O HIS A 7 5.402 6.125 7.792 1.00 0.00 O +ATOM 34 CB HIS A 7 2.668 4.419 7.075 1.00 0.00 C +ATOM 35 N ASN A 8 5.526 4.749 6.016 1.00 0.00 N +ATOM 36 CA ASN A 8 6.937 4.576 6.176 1.00 0.00 C +ATOM 37 C ASN A 8 7.558 5.925 6.054 1.00 0.00 C +ATOM 38 O ASN A 8 8.423 6.279 6.854 1.00 0.00 O +ATOM 39 CB ASN A 8 7.475 3.654 5.122 1.00 0.00 C +TER 40 ASN A 8 +END diff --git a/modules/seq/base/doc/seq.rst b/modules/seq/base/doc/seq.rst index 5022c0aba584d651d8e0d0d056e89cbc635468ff..4c3e2bf17517818a8de3fac10fe02f8088191429 100644 --- a/modules/seq/base/doc/seq.rst +++ b/modules/seq/base/doc/seq.rst @@ -231,7 +231,7 @@ the same length. New instances of alignments are created with Typically sequence alignments are used column-based, i.e by looking at an aligned columns in the sequence alignment. To get a row-based (sequence) view -on the sequence list, use :meth:`GetSequenceList()`. +on the sequence list, use :meth:`GetSequences()`. All functions that operate on an alignment will again produce a valid alignment. This mean that it is not possible to change the length of one sequence, without @@ -277,7 +277,7 @@ an alignment: Returns the sequence at the given index, raising an IndexError when trying to access an inexistent sequence. - .. method:: GetSequenceList() + .. method:: GetSequences() Returns a list of all sequence of the alignment. @@ -343,7 +343,7 @@ an alignment: .. method:: Cut(start, end) Removes the columns in the half-closed interval `start`, `end` from the - alignment. + alignment. Note that this function does not update offsets! .. code-block:: python @@ -365,16 +365,6 @@ an alignment: :param new_region: The region to be inserted :type new_region: :class:`AlignedRegion` or :class:`AlignmentHandle` - - .. method:: ShiftRegion(start, end, amount, master=-1) - - Shift columns in the half-closed interval `start`, `end`. If amount is a - positive number, the columns are shifted to the right, if negative, the - columns are shifted to the left. - - If master is set to -1, all sequences in the region are affected, otherwise - only the sequence at index equal to master is shifted. - .. method:: GetMatchingBackboneViews(index1=0, index2=1) Returns a tuple of entity views containing matching backbone atoms for the diff --git a/modules/seq/base/src/alignment_handle.hh b/modules/seq/base/src/alignment_handle.hh index 9990e170570d4c19f4c0f4a9eaf5129ea1b2f60f..6ac044cfdbd8ecf8dc3d33c16fbd3ce4b8c7ff27 100644 --- a/modules/seq/base/src/alignment_handle.hh +++ b/modules/seq/base/src/alignment_handle.hh @@ -44,7 +44,7 @@ class AlignedColumnIterator; /// /// Typically sequence alignments are used column-based, i.e by looking at an /// aligned columns in the sequence alignment. To get a row-based (sequence) -/// view on the sequence list, use AlignmentHandle::GetSequenceList(). For an +/// view on the sequence list, use AlignmentHandle::GetSequences(). For an /// overview of how to use the sequence module, see \ref module_seq "here" /// /// All operators that operate on an alignment will again produce a valid