


Structural Data
The structural database serves as a container for structural backbone
and profile data. It can be filled with chains of PDB structures together with
their corresponding profiles as produced by the HHSuite tools [soding2005].
The structural and profile data are complemented with additional information.
The following features are stored on a per-residue basis:

* The amino acid one-letter code
* The coordinates of the backbone atoms (N, CA, C, O)
* The phi/psi dihedral angles
* The secondary structure state as defined by dssp
* The solvent accessibility in square Angstrom
* The residue depth, defined as the average distance from all atoms of a
  residue to the closest surface vertex as calculated by msms [sanner1996].
  This is a simplified version of the residue depth discussed in
  [chakravarty1999] and is calculated directly when structural information
  is added to the StructureDB.
* The amino acid frequencies as given by an input sequence profile
* The amino acid frequencies derived from structural alignments as described
  in [zhou2005]. Since calculating such a profile already requires a
  StructureDB, this is a chicken-and-egg problem: when structural
  information is added to the StructureDB, the corresponding memory is
  only allocated and set to zero. This information is therefore only
  meaningful if you calculate these profiles and set them manually (or load
  the provided default database).
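
The simplified residue depth described above can be illustrated with a minimal pure-Python sketch. It is not the StructureDB implementation: the function name ``residue_depth`` is hypothetical, and the surface vertices are a hand-written toy list standing in for a surface that msms would normally compute.

```python
import math


def residue_depth(atom_coords, surface_vertices):
    """Average, over all atoms, of the distance to the closest surface vertex."""
    if not atom_coords or not surface_vertices:
        raise ValueError("need at least one atom and one surface vertex")
    total = 0.0
    for atom in atom_coords:
        # distance from this atom to its nearest surface vertex
        total += min(math.dist(atom, vertex) for vertex in surface_vertices)
    return total / len(atom_coords)


# Toy data (assumption): two atoms and two surface vertices, in Angstrom.
atoms = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
surface = [(0.0, 0.0, 3.0), (4.0, 0.0, 0.0)]
depth = residue_depth(atoms, surface)
```

In this toy case the closest vertex is 3 Angstrom away from the first atom and 1 Angstrom away from the second, so the averaged depth is 2 Angstrom.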


Defining Chains and Fragments
A CoordInfo object is generated automatically whenever a new chain is added to
the structural database. It holds internal information on how
the corresponding chain is stored in the database.
The FragmentInfo defines a fragment in the structural database.


:param chain_index: Fills :attr:`chain_index`
:param offset: Fills :attr:`offset`
:param length: Fills :attr:`length`
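
How the three attributes could identify a fragment can be sketched with a small stand-in class. ``FragmentInfoSketch`` is hypothetical and not the real FragmentInfo. The sketch assumes that ``offset`` is a zero-based residue index within the chain and that the chains are available as plain sequence strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentInfoSketch:
    """Hypothetical stand-in: which chain, where it starts, how long it is."""
    chain_index: int
    offset: int
    length: int

    def residue_range(self):
        # zero-based residue indices covered by the fragment (assumption)
        return range(self.offset, self.offset + self.length)


# Toy stand-in for the chains stored in a database (assumption).
chains = ["MKTAYIAKQR", "GSHMLEDPVA"]

frag = FragmentInfoSketch(chain_index=1, offset=3, length=4)
chain = chains[frag.chain_index]
frag_seq = chain[frag.offset:frag.offset + frag.length]
```

Here ``frag_seq`` is the four-residue stretch of the second chain that starts at index 3.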


The Structure Database
The following code example demonstrates how to create a structural database
and fill it with content.
Calculating the structural profiles is computationally expensive, and the cost
depends heavily on the size of the database used as source. If you want to do
this for a larger database, consider two things:

* Use a database of limited size as structural source (between 5000 and
  10000 nonredundant chains is enough)
* Use the :class:`ost.seq.ProfileDB` to gather profiles produced by jobs
  running in parallel


Finding Fragments based on Geometric Features
The fragment database lets you organize, search and access the information
stored in a structural database (:class:`StructureDB`). In its current form it
groups fragments into bins according to their length (including stems) and the
geometry of their N-stem and C-stem (described by 4 angles and the distance
between the N-stem C atom and the C-stem N atom). It can therefore be searched
for fragments matching a given geometry of the N and C stems. The bins are
accessed through a hash table, which makes searching the database very fast.
This example illustrates how to create a custom FragDB based on a StructureDB:


:param dist_bin_size: Size of the distance parameter binning in Angstrom
:param angle_bin_size: Size of the angle parameter binning in degrees
:type dist_bin_size: :class:`float`
:type angle_bin_size: :class:`int`
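
The binning idea behind the two parameters can be illustrated with a minimal pure-Python sketch. It is not the FragDB implementation: the function ``stem_bin_key``, the layout of the key, the default bin sizes and the toy fragments are all assumptions made for illustration. A Python ``dict`` plays the role of the hash table.

```python
from collections import defaultdict


def stem_bin_key(length, dist, angles, dist_bin_size=1.0, angle_bin_size=20):
    """Map fragment length and stem geometry to a hashable bin key.

    dist is in Angstrom, the four angles are in degrees.
    """
    dist_bin = int(dist // dist_bin_size)
    # wrap angles into [0, 360) so that negative angles bin consistently
    angle_bins = tuple(int((a % 360.0) // angle_bin_size) for a in angles)
    return (length, dist_bin) + angle_bins


# Toy fragments (assumption): name, length, stem distance, four stem angles.
fragments = [
    ("frag_a", 5, 7.3, (10.0, 95.0, -30.0, 170.0)),
    ("frag_b", 5, 7.9, (15.0, 85.0, -25.0, 165.0)),
    ("frag_c", 5, 9.1, (10.0, 95.0, -30.0, 170.0)),
]

# Build the table once: every fragment goes into the bin of its key.
bins = defaultdict(list)
for name, length, dist, angles in fragments:
    bins[stem_bin_key(length, dist, angles)].append(name)

# A query is a single key computation followed by a single lookup.
query_key = stem_bin_key(5, 7.5, (12.0, 90.0, -28.0, 168.0))
hits = bins.get(query_key, [])
```

With these toy values ``frag_a`` and ``frag_b`` share the bin of the query, while ``frag_c`` lands in a different distance bin. Smaller bin sizes give more selective hits at the cost of missing fragments that sit just across a bin border.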


Finding Fragments based on Sequence Features
In some cases you might want to use the :class:`StructureDB` to search
for fragments that might represent the structural conformation of interest.
The :class:`Fragger` searches a :class:`StructureDB` for the n fragments
that maximize a certain score, and gathers a set of fragments with guaranteed
structural diversity based on an rmsd_threshold. You can use the :class:`Fragger`
wrapped in a full-fledged pipeline implemented in
:class:`~promod3.modelling.FraggerHandle`, or search for fragments from scratch
using an arbitrary linear combination of scores:


SeqID:
Calculates the fraction of identical amino acids when comparing
a potential fragment from the :class:`StructureDB` with the target sequence.

SeqSim:
Calculates the average substitution-matrix-based sequence similarity of amino acids
when comparing a potential fragment from the :class:`StructureDB` with the target
sequence.

SSAgree:
Calculates the average agreement between the secondary structure predicted by PSIPRED [Jones1999]
and the dssp [kabsch1983] assignment stored in the :class:`StructureDB`.
The agreement term is based on a probabilistic approach also used in HHSearch [soding2005].

TorsionProbability:
Calculates the average probability of observing the phi/psi dihedral angles of a potential
fragment from the :class:`StructureDB` given the target sequence. The probabilities are
extracted from the :class:`TorsionSampler` class.

SequenceProfile:
Calculates the average profile score between the amino acid frequencies of a potential
fragment from the :class:`StructureDB` and a target profile, assuming a gap-free alignment
between them. The scores are calculated as L1 distances between the profile columns.

StructureProfile:
Calculates the average profile score between the amino acid frequencies of a potential
fragment from the :class:`StructureDB` and a target profile, assuming a gap-free alignment
between them. The scores are calculated as L1 distances between the profile columns.
In this case, the amino acid frequencies extracted from structural alignments are used.
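
Two of the score components above, and their linear combination, can be sketched in plain Python. This is an illustration and not the Fragger implementation: the function names and the weights are hypothetical, and the profile columns are shortened to two frequencies instead of twenty to keep the example readable. Because an L1 distance is smaller for more similar profiles, the sketch gives it a negative weight. How the real scores are signed is not stated here, so treat that choice as an assumption.

```python
def seq_id(frag_seq, target_seq):
    """Fraction of identical positions in a gap-free comparison."""
    if len(frag_seq) != len(target_seq) or not target_seq:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(frag_seq, target_seq))
    return matches / len(target_seq)


def profile_l1(frag_profile, target_profile):
    """Average L1 distance between aligned profile columns."""
    if len(frag_profile) != len(target_profile) or not target_profile:
        raise ValueError("profiles must be non-empty and of equal length")
    total = 0.0
    for col_a, col_b in zip(frag_profile, target_profile):
        total += sum(abs(x - y) for x, y in zip(col_a, col_b))
    return total / len(target_profile)


def combined_score(weighted_components):
    """Linear combination: sum of weight * score over all components."""
    return sum(weight * score for weight, score in weighted_components)


# Toy fragment and target of length 2 (assumption).
identity = seq_id("AC", "AD")
distance = profile_l1([(1.0, 0.0), (0.5, 0.5)], [(0.5, 0.5), (0.5, 0.5)])

# Hypothetical weights: reward identity, penalize profile distance.
score = combined_score([(1.0, identity), (-0.5, distance)])
```

One of two positions is identical, and only the first profile column differs (by an L1 distance of 1.0), so both components evaluate to 0.5 before weighting.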

A Fragger object to search a :class:`StructureDB` for fragments with seq
as target sequence. You need to add some score components before you can
finally call the Fill function.


:param seq: Sequence of fragments to be searched
:type seq: :class:`str`
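
The idea of collecting the best-scoring fragments while enforcing structural diversity through an RMSD threshold can be sketched as a greedy filter. This is one plausible reading of the description, not the algorithm used by the :class:`Fragger`: the function names are hypothetical, and the RMSD is computed on the coordinates as given, without any superposition.

```python
import math


def rmsd(coords_a, coords_b):
    """Plain coordinate RMSD without superposition (simplifying assumption)."""
    sq_sum = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(sq_sum / len(coords_a))


def select_diverse(candidates, n, rmsd_threshold):
    """Greedily keep up to n (score, coords) pairs, best score first.

    A candidate is kept only if it is at least rmsd_threshold away
    from every candidate that has already been kept.
    """
    selected = []
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    for score, coords in ranked:
        if len(selected) >= n:
            break
        if all(rmsd(coords, kept) >= rmsd_threshold for _, kept in selected):
            selected.append((score, coords))
    return selected


# Toy candidates (assumption): a score plus two atom positions each.
candidates = [
    (0.9, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),
    (0.8, [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)]),  # near-duplicate of the first
    (0.7, [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)]),
    (0.6, [(0.0, 5.0, 0.0), (1.0, 5.0, 0.0)]),
]
picked = select_diverse(candidates, n=2, rmsd_threshold=0.5)
```

The second candidate scores well but lies only 0.1 Angstrom from the first, so it is skipped and the third candidate fills the second slot.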


A simple storable map of Fragger objects. The idea is that one can use the map
to cache fragger lists that have already been generated.
You can use :meth:`Contains` to check if an item with a given key
(:class:`int`) already exists and access items with the [] operator (see
:meth:`__getitem__` and :meth:`__setitem__`).
Serialization is meant to be temporary and is not guaranteed to be portable.
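
The caching pattern described above can be sketched with a small dictionary-backed stand-in. ``FraggerCacheSketch`` and ``get_or_build`` are hypothetical names used for illustration. Only ``Contains`` and the [] operator mirror what is described here, and serialization is left out.

```python
class FraggerCacheSketch:
    """Hypothetical stand-in for a map from integer keys to fragger results."""

    def __init__(self):
        self._items = {}

    def Contains(self, key):
        return key in self._items

    def __getitem__(self, key):
        return self._items[key]

    def __setitem__(self, key, value):
        self._items[key] = value


def get_or_build(cache, key, build):
    """Return the cached item for key, running build() only on a miss."""
    if not cache.Contains(key):
        cache[key] = build()
    return cache[key]


calls = []


def expensive_search():
    # placeholder for a costly fragment search (assumption)
    calls.append(1)
    return ["fragment_1", "fragment_2"]


cache = FraggerCacheSketch()
first = get_or_build(cache, 42, expensive_search)
second = get_or_build(cache, 42, expensive_search)
```

The second lookup returns the stored object, so the expensive search runs only once.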

The PsipredPrediction class
A container for the secondary structure prediction by Psipred.
Represents a list of :class:`PsipredPrediction` objects


[soding2005]
Söding J (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics 21 (7): 951–960.


[sanner1996]
Sanner M, Olson AJ, Spehner JC (1996). Reduced Surface: an Efficient Way to Compute Molecular Surfaces. Biopolymers 38 (3): 305–320.


[chakravarty1999]
Chakravarty S, Varadarajan R (1999). Residue depth: a novel parameter for the analysis of protein structure and stability. Structure 7 (7): 723–732.


[zhou2005]
Zhou H, Zhou Y (2005). Fold Recognition by Combining Sequence Profiles Derived From Evolution and From Depth-Dependent Structural Alignment of Fragments. Proteins 58 (2): 321–328.


[Jones1999]
Jones DT (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195–202.


[kabsch1983]
Kabsch W, Sander C (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.