Xavier Robin authored
:mod:`~ost.seq` -- Sequences and Alignments
The :mod:`seq` module helps you work with sequence data of various kinds. It has classes for :class:`single sequences <SequenceHandle>`, :class:`lists of sequences <SequenceList>` and :class:`alignments <AlignmentHandle>` of two or more sequences.
Attaching Structures to Sequences
As OpenStructure is a computational structural biology framework, it is not surprising that the sequence classes have been designed to work together with structural data. Each sequence can have an attached :class:`~ost.mol.EntityView`, allowing for fast mapping between residues in the entity view and positions in the sequence.
Sequence Offset
When using sequences and structures together, the start of the structure and the beginning of the sequence often do not coincide. In the following case, the alignment of sequences B and C only covers a subsequence of structure A:
A acefghiklmnpqrstuvwy
B     ghiklm
C     123-45
We would now like to know which residue in protein A is aligned to which residue in sequence C. This is achieved by setting the sequence offset of sequence C to 4. In essence, the sequence offset influences all the mapping operations from position in the sequence to residue index and vice versa. By default, the sequence offset is 0.
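The effect of the offset on the mapping can be illustrated with a small stand-alone sketch. It does not use the OpenStructure classes; the helper names ``residue_index`` and ``position`` are hypothetical, and the sketch assumes that the residue index of a position is the offset plus the number of non-gap characters in front of it. Returning ``None`` for a gap is a simplification made here.

```python
GAP = "-"

def residue_index(gapped_seq, pos, offset=0):
    """Map a position in a gapped sequence to a residue index (None for a gap)."""
    if gapped_seq[pos] == GAP:
        return None
    # count the residues (non-gap characters) in front of this position
    residues_before = sum(1 for ch in gapped_seq[:pos] if ch != GAP)
    return offset + residues_before

def position(gapped_seq, index, offset=0):
    """Inverse mapping: residue index back to a position in the gapped sequence."""
    target = index - offset
    seen = 0
    for pos, ch in enumerate(gapped_seq):
        if ch == GAP:
            continue
        if seen == target:
            return pos
        seen += 1
    raise IndexError("residue index outside of the sequence")

structure_a = "acefghiklmnpqrstuvwy"
seq_b = "ghiklm"
seq_c = "123-45"

# with an offset of 4, the first position of B maps to residue 4 of A
first = residue_index(seq_b, 0, offset=4)
print(first, structure_a[first])
# position 3 of C is a gap and has no residue index
print(residue_index(seq_c, 3, offset=4))
# and back again: residue index 7 sits at position 4 of C
print(position(seq_c, 7, offset=4))
```

With an offset of 0, the same functions return plain indices into the ungapped sequence, which corresponds to the default described above.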
Loading and Saving Sequences and Alignments
The :mod:`~ost.io` module supports input and output of common sequence formats. Single sequences can be loaded from disk with :func:`~ost.io.LoadSequence`, alignments with :func:`~ost.io.LoadAlignment` and lists of sequences with :func:`~ost.io.LoadSequenceList`. In addition to the file-based input methods, sequences can also be loaded from a string:
from ost import io

seq_string = '''>sequence
abcdefghiklmnop'''
s = io.SequenceFromString(seq_string, 'fasta')
print(s.name, s) # will print "sequence abcdefghiklmnop"
Note that in this case, specifying the format is mandatory.
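To make the structure of such a FASTA string explicit, here is a minimal stand-alone sketch of how the header line and the sequence lines can be separated. The function ``parse_fasta`` is hypothetical and is not the parser used by the :mod:`~ost.io` module; it only handles the plain case of ``>`` headers followed by sequence lines.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new header closes the previous record, if there is one
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # sequence data may span several lines
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

seq_string = '''>sequence
abcdefghiklmnop'''
records = parse_fasta(seq_string)
print(records)
```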
The SequenceHandle
Represents a sequence. New instances are created with :func:`CreateSequence`.
The SequenceList
Represents a list of sequences. The class provides a row-based interface.
The AlignmentHandle
The :class:`AlignmentHandle` represents a list of aligned sequences. In contrast to :class:`SequenceList`, an alignment requires all sequences to be of the same length. New instances of alignments are created with :func:`CreateAlignment` and :func:`AlignmentFromSequenceList`.
Typically, sequence alignments are used column-based, i.e. by looking at the aligned columns of the sequence alignment. To get a row-based (sequence) view on the sequence list, use :meth:`~AlignmentHandle.GetSequences()`.
All functions that operate on an alignment will again produce a valid alignment. This means that it is not possible to change the length of one sequence without adjusting the other sequences, too.
The following example shows how to iterate over the columns and sequences of an alignment:
from ost import io

aln = io.LoadAlignment('aln.fasta')
# iterate over the columns
for col in aln:
    print(col)
# iterate over the sequences
for s in aln.sequences:
    print(s)
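The relation between the row-based and the column-based view, and the rule that all sequences of an alignment keep the same length, can also be shown without any OpenStructure objects. The following is a sketch on plain Python strings; ``insert_gap_column`` is a hypothetical helper and not part of the :class:`AlignmentHandle` interface.

```python
rows = ["ghiklm", "123-45"]

# an alignment is only valid if every row has the same length
assert len({len(r) for r in rows}) == 1

# column-based view: zip the rows so that each item is one aligned column
columns = ["".join(col) for col in zip(*rows)]
print(columns)

def insert_gap_column(rows, pos):
    """Insert a gap at the same position in every row, keeping lengths equal."""
    return [r[:pos] + "-" + r[pos:] for r in rows]

# editing all rows at once keeps the alignment valid
longer = insert_gap_column(rows, 2)
print(longer)
```

Changing only one of the rows would break the equal-length requirement, which is why the sketch edits all rows in a single operation.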
Represents a slice of an :class:`AlignmentHandle`.
Extracting views from sequences
Handling Sequence Profiles
The :class:`ProfileHandle` provides a simple container for profiles for each residue. It mainly contains:
- N :class:`ProfileColumn` objects (N = number of residues in sequence), each of which contains 20 amino acid frequencies
- a :attr:`~ProfileHandle.sequence` (:class:`str`) of length N
- a :attr:`~ProfileHandle.null_model` to use for this profile
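As an illustration of what "20 amino acid frequencies per column" means, the sketch below derives such frequencies from the raw counts in the columns of a small alignment. This is a simplification and not how :class:`ProfileHandle` objects are produced: real profiles typically involve sequence weighting and pseudocounts. The names ``column_frequencies`` and ``build_profile`` and the alphabetical amino acid order are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes (assumed order)

def column_frequencies(column):
    """Return 20 amino acid frequencies for one alignment column."""
    # gaps and non-standard characters are ignored
    residues = [ch for ch in column.upper() if ch in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]

def build_profile(rows):
    """One list of 20 frequencies per alignment column."""
    return [column_frequencies("".join(col)) for col in zip(*rows)]

rows = ["ACD", "ACE", "A-E", "GCE"]
profile = build_profile(rows)

for i, freqs in enumerate(profile):
    # show only the amino acids that actually occur in the column
    observed = {aa: f for aa, f in zip(AMINO_ACIDS, freqs) if f > 0}
    print(i, observed)
```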
Optionally, HMM-related information can be added. This includes transition probabilities between Match, Insertion and Deletion states, and neff values (number of effective sequences, a measure of local sequence diversity).
The possible HMM transitions between Match (M), Insertion (I) and Deletion (D) states are listed below. Transitions between Deletion and Insertion are disallowed:
HMM_M2M, HMM_M2I, HMM_M2D, HMM_I2M, HMM_I2I, HMM_D2M, HMM_D2D
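The seven names follow directly from the rule stated above. As a small stand-alone sketch (plain strings, not the actual enumeration values), they can be derived by pairing all states and dropping the two disallowed combinations:

```python
from itertools import product

STATES = "MID"  # Match, Insertion, Deletion
DISALLOWED = {("I", "D"), ("D", "I")}  # no direct Insertion <-> Deletion transitions

allowed = [
    "HMM_%s2%s" % (src, dst)
    for src, dst in product(STATES, repeat=2)
    if (src, dst) not in DISALLOWED
]
print(allowed)
```

Of the nine possible state pairs, seven remain: three leaving the Match state and two each leaving the Insertion and Deletion states.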
Data container for HMM-related information that can be assigned to profile columns.
A simple database to gather :class:`ProfileHandle` objects. It is possible to save them to disk in a compressed format with limited accuracy (4 digits for each frequency).
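To give an idea of what storing frequencies with 4 digits can look like, the sketch below scales each frequency by 10000 and stores it as a 16-bit unsigned integer. The encoding, the byte layout and the function names are assumptions chosen for illustration; the actual on-disk format of the database is not described here and may differ.

```python
import struct

SCALE = 10000  # keep 4 decimal digits per frequency (assumed encoding)

def pack_frequencies(freqs):
    """Encode frequencies in [0, 1] as little-endian 16-bit unsigned integers."""
    ints = [int(round(f * SCALE)) for f in freqs]
    return struct.pack("<%dH" % len(ints), *ints)

def unpack_frequencies(blob):
    """Decode the bytes written by pack_frequencies()."""
    count = len(blob) // 2
    return [i / SCALE for i in struct.unpack("<%dH" % count, blob)]

freqs = [0.123456, 0.5, 0.0, 0.376544]
blob = pack_frequencies(freqs)
restored = unpack_frequencies(blob)

print(len(blob), "bytes")
print(restored)
```

Each value takes 2 bytes instead of the 8 bytes of a double, at the cost of a rounding error of at most 0.00005 per frequency.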