Skip to content
Snippets Groups Projects
Commit 156499e3 authored by Studer Gabriel's avatar Studer Gabriel
Browse files

Make data extraction helper specific to their tasks: DisCo and GMQE

parent 4f4c50c9
Branches
Tags
No related merge requests found
...@@ -351,45 +351,164 @@ marks non-resolved residues with '-'. ...@@ -351,45 +351,164 @@ marks non-resolved residues with '-'.
:raises: :exc:`ost.Error` if requested data is not present :raises: :exc:`ost.Error` if requested data is not present
.. method:: ExtractTemplateData(entry_name, chain_name, aln, indexer, \ .. class:: DisCoDataContainers(indexer_path, \
seqres_container, atomseq_container, \ seqres_container_path, \
position_container) atomseq_container_path, \
position_container_path)
Thin wrapper to organize the data containers required in
:meth:`ExtractTemplateDataDisCo`. All containers are loaded
from disk and are expected to be consistent.
:param indexer_path: Path to indexer file to access the data containers
(:class:`LinearIndexer`)
:param seqres_container_path: Path to container with SEQRES data
(:class:`LinearCharacterContainer`)
:param atomseq_container_path: Path to container with ATOMSEQ data
(:class:`LinearCharacterContainer`)
:param position_container_path: Path to container with CA position data
(:class:`LinearPositionContainer`)
:type indexer_path: :class:`str`
:type seqres_container_path: :class:`str`
:type atomseq_container_path: :class:`str`
:type position_container_path: :class:`str`
.. attribute:: indexer
.. attribute:: seqres_container
.. attribute:: atomseq_container
.. attribute:: position_container
.. method:: ExtractTemplateDataDisCo(entry_name, chain_name, aln, \
data_containers, residue_numbers, \
positions)
Data extraction helper for the specific use of extracting data required
to generate DisCoContainers in the QMEAN scoring function.
Let's say we have a target-template alignment in *aln* (first seq: target, Let's say we have a target-template alignment in *aln* (first seq: target,
second seq: template). This function extracts all valid template positions second seq: template). This function extracts all valid template positions
given the entry specified by *entry_name* and *chain_name*. The template and their residue numbers given the entry specified by *entry_name* and
sequence in *aln* must match the sequence in *seqres_container*. Again, *chain_name*. The template sequence in *aln* must match the sequence in
the *atomseq_container* is used to identify valid positions. The according *seqres_container*. Again, the *atomseq_container* is used to identify valid
residue numbers relative to the target sequence in *aln* are also returned. positions.
:param entry_name: Name of assembly you want the data from :param entry_name: Name of assembly you want the data from
:param chain_name: Name of chain you want the data from :param chain_name: Name of chain you want the data from
:param aln: Target-template sequence alignment :param aln: Target-template sequence alignment
:param indexer: Used to access *atomseq_container*, *seqres_container* :param data_containers: Contains all required data containers
and *position_container* :param residue_numbers: Residue numbers of extracted positions will be stored
:param seqres_container: Container containing the full sequence data in here
:param atomseq_container: Container that marks locations with invalid position :param positions: Extracted positions will be stored in here
data with '-' :type entry_name: :class:`str`
:param position_container: Container containing position data :type chain_name: :class:`str`
:type aln: :ost.seq.SequenceHandle:
:type data_containers: :class:`DisCoDataContainers`
:type residue_numbers: :class:`list` of :class:`int`
:type positions: :class:`geom::Vec3List`
:raises: :exc:`ost.Error` if requested data is not present in
the containers or if the template sequence in *aln*
doesn't match with the sequence in *seqres_container*
.. class:: GMQEDataContainers(indexer_path, \
seqres_container_path, \
atomseq_container_path, \
dssp_container_path, \
n_position_container_path, \
ca_position_container_path, \
c_position_container_path, \
cb_position_container_path, \)
Thin wrapper to organize the data containers required in
:meth:`ExtractTemplateDataGMQE`. All containers are loaded
from disk and are expected to be consistent.
:param indexer_path: Path to indexer file to access the data containers
(:class:`LinearIndexer`)
:param seqres_container_path: Path to container with SEQRES data
(:class:`LinearCharacterContainer`)
:param atomseq_container_path: Path to container with ATOMSEQ data
(:class:`LinearCharacterContainer`)
:param dssp_container_path: Path to container containing dssp states
(:class:`LinearCharacterContainer`)
:param n_position_container_path: Path to container with N position data
(:class:`LinearPositionContainer`)
:param ca_position_container_path: Path to container with CA position data
(:class:`LinearPositionContainer`)
:param c_position_container_path: Path to container with C position data
(:class:`LinearPositionContainer`)
:param cb_position_container_path: Path to container with CB position data
(:class:`LinearPositionContainer`)
:type indexer_path: :class:`str`
:type seqres_container_path: :class:`str`
:type atomseq_container_path: :class:`str`
:type dssp_container_path: :class:`str`
:type n_position_container_path: :class:`str`
:type ca_position_container_path: :class:`str`
:type c_position_container_path: :class:`str`
:type cb_position_container_path: :class:`str`
.. attribute:: indexer
.. attribute:: seqres_container
.. attribute:: atomseq_container
.. attribute:: dssp_container
.. attribute:: n_position_container
.. attribute:: ca_position_container
.. attribute:: c_position_container
.. attribute:: cb_position_container
.. method:: ExtractTemplateDataGMQE(entry_name, chain_name, aln, \
data_containers, residue_numbers, \
dssp, n_positions, ca_positions, \
c_positions, cb_positions)
Data extraction helper for the specific use of extracting data required
to estimate GMQE as available in the QMEAN software package.
Let's say we have a target-template alignment in *aln* (first seq: target,
second seq: template). This function extracts all valid template positions,
their dssp states and their residue numbers given the entry specified by
*entry_name* and *chain_name*. The template sequence in *aln* must match the
sequence in *seqres_container*. Again, the *atomseq_container* is used to
identify valid positions.
:param entry_name: Name of assembly you want the data from
:param chain_name: Name of chain you want the data from
:param aln: Target-template sequence alignment
:param data_containers: Contains all required data containers
:param residue_numbers: Residue numbers of extracted positions will be stored
in here
:param dssp: Extracted DSSP states will be stored in here
:param n_positions: Extracted n_positions will be stored in here
:param ca_positions: Extracted ca_positions will be stored in here
:param c_positions: Extracted c_positions will be stored in here
:param cb_positions: Extracted cb_positions will be stored in here
:type entry_name: :class:`str` :type entry_name: :class:`str`
:type chain_name: :class:`str` :type chain_name: :class:`str`
:type aln: :ost.seq.SequenceHandle: :type aln: :ost.seq.SequenceHandle:
:type indexer: :class:`LinearIndexer` :type data_containers: :class:`GMQEDataContainers`
:type seqres_container: :class:`LinearCharacterContainer` :type residue_numbers: :class:`list` of :class:`int`
:type atomseq_container: :class:`LinearCharacterContainer` :type dssp: :class:`str`
:type position_container: :class:`LinearPositionContainer` or :type n_positions: :class:`geom::Vec3List`
:class:`list` of :class:`LinearPositionContainer` :type ca_positions: :class:`geom::Vec3List`
:type c_positions: :class:`geom::Vec3List`
:returns: First element: :class:`list` of residue numbers that :type cb_positions: :class:`geom::Vec3List`
relate each entry in the second element to the target
sequence specified in *aln*. The numbering scheme
starts from one. Second Element: the according positions.
:class:`ost.geom.Vec3List` if *position_container* was a
:class:`LinearPositionContainer`, a list of
:class:`ost.geom.Vec3List` if it was a :class:`list`.
:rtype: :class:`tuple`
:raises: :exc:`ost.Error` if requested data is not present in :raises: :exc:`ost.Error` if requested data is not present in
the container or if the template sequence in *aln* the containers or if the template sequence in *aln*
doesn't match with the sequence in *seqres_container* doesn't match with the sequence in *seqres_container*
...@@ -147,52 +147,102 @@ void WrapGetPositions(LinearPositionContainerPtr container, ...@@ -147,52 +147,102 @@ void WrapGetPositions(LinearPositionContainerPtr container,
container->GetPositions(p_range, positions); container->GetPositions(p_range, positions);
} }
// helper struct to reduce number of input parameters
struct DisCoDataContainers {
DisCoDataContainers(const String& indexer_path,
const String& seqres_container_path,
const String& atomseq_container_path,
const String& position_container_path) {
indexer = LinearIndexer::Load(indexer_path);
seqres_container = LinearCharacterContainer::Load(seqres_container_path);
atomseq_container = LinearCharacterContainer::Load(atomseq_container_path);
position_container = LinearPositionContainer::Load(position_container_path);
}
tuple WrapExtractTemplateDataSingle(const String& entry_name, const String& chain_name, LinearIndexerPtr indexer;
const ost::seq::AlignmentHandle& aln, LinearCharacterContainerPtr seqres_container;
LinearIndexerPtr indexer, LinearCharacterContainerPtr atomseq_container;
LinearCharacterContainerPtr seqres_container, LinearPositionContainerPtr position_container;
LinearCharacterContainerPtr atomseq_container, };
LinearPositionContainerPtr position_container) {
std::vector<LinearPositionContainerPtr> position_containers;
position_containers.push_back(position_container);
std::vector<int> v_residue_numbers;
std::vector<geom::Vec3List> position_vec;
ost::db::ExtractTemplateData(entry_name, chain_name, aln, indexer, void WrapExtractTemplateDataDisCo(const String& entry_name, const String& chain_name,
seqres_container, atomseq_container, const ost::seq::AlignmentHandle& aln,
position_containers, v_residue_numbers, DisCoDataContainers& data_containers,
position_vec); boost::python::list& residue_numbers,
geom::Vec3List& positions) {
list residue_numbers; std::vector<int> v_residue_numbers;
ost::db::ExtractTemplateDataDisCo(entry_name, chain_name, aln,
data_containers.indexer,
data_containers.seqres_container,
data_containers.atomseq_container,
data_containers.position_container,
v_residue_numbers,
positions);
residue_numbers = boost::python::list();
VecToList(v_residue_numbers, residue_numbers); VecToList(v_residue_numbers, residue_numbers);
return boost::python::make_tuple(residue_numbers, position_vec[0]);
} }
tuple WrapExtractTemplateDataList(const String& entry_name, const String& chain_name,
const ost::seq::AlignmentHandle& aln,
LinearIndexerPtr indexer,
LinearCharacterContainerPtr seqres_container,
LinearCharacterContainerPtr atomseq_container,
boost::python::list& position_containers) {
std::vector<LinearPositionContainerPtr> v_position_containers; // helper struct to reduce number of input parameters
ListToVec(position_containers, v_position_containers); struct GMQEDataContainers {
std::vector<int> v_residue_numbers;
std::vector<geom::Vec3List> position_vec; GMQEDataContainers(const String& indexer_path,
const String& seqres_container_path,
const String& atomseq_container_path,
const String& dssp_container_path,
const String& n_position_container_path,
const String& ca_position_container_path,
const String& c_position_container_path,
const String& cb_position_container_path) {
indexer = LinearIndexer::Load(indexer_path);
seqres_container = LinearCharacterContainer::Load(seqres_container_path);
atomseq_container = LinearCharacterContainer::Load(atomseq_container_path);
dssp_container = LinearCharacterContainer::Load(dssp_container_path);
n_position_container = LinearPositionContainer::Load(n_position_container_path);
ca_position_container = LinearPositionContainer::Load(ca_position_container_path);
c_position_container = LinearPositionContainer::Load(c_position_container_path);
cb_position_container = LinearPositionContainer::Load(cb_position_container_path);
}
ost::db::ExtractTemplateData(entry_name, chain_name, aln, indexer, LinearIndexerPtr indexer;
seqres_container, atomseq_container, LinearCharacterContainerPtr seqres_container;
v_position_containers, v_residue_numbers, LinearCharacterContainerPtr atomseq_container;
position_vec); LinearCharacterContainerPtr dssp_container;
LinearPositionContainerPtr n_position_container;
LinearPositionContainerPtr ca_position_container;
LinearPositionContainerPtr c_position_container;
LinearPositionContainerPtr cb_position_container;
};
void WrapExtractTemplateDataGMQE(const String& entry_name,
const String& chain_name,
const ost::seq::AlignmentHandle& aln,
GMQEDataContainers& data_containers,
boost::python::list& residue_numbers,
String& dssp,
geom::Vec3List& n_positions,
geom::Vec3List& ca_positions,
geom::Vec3List& c_positions,
geom::Vec3List& cb_positions) {
list residue_numbers; std::vector<int> v_residue_numbers;
list position_list; ost::db::ExtractTemplateDataGMQE(entry_name, chain_name, aln,
data_containers.indexer,
data_containers.seqres_container,
data_containers.atomseq_container,
data_containers.dssp_container,
data_containers.n_position_container,
data_containers.ca_position_container,
data_containers.c_position_container,
data_containers.cb_position_container,
v_residue_numbers, dssp, n_positions,
ca_positions, c_positions, cb_positions);
VecToList(v_residue_numbers, residue_numbers); VecToList(v_residue_numbers, residue_numbers);
VecToList(position_vec, position_list); }
return boost::python::make_tuple(residue_numbers, position_list);
}
} }
...@@ -252,20 +302,51 @@ void export_linear_db() { ...@@ -252,20 +302,51 @@ void export_linear_db() {
arg("position_container"), arg("position_container"),
arg("seq"), arg("positions"))); arg("seq"), arg("positions")));
def("ExtractTemplateData", &WrapExtractTemplateDataSingle, (arg("entry_name"), class_<DisCoDataContainers>("DisCoDataContainers", init<const String&,
arg("chain_name"), const String&,
arg("aln"), const String&,
arg("linear_indexer"), const String&>())
arg("seqres_container"), .def_readonly("indexer", &DisCoDataContainers::indexer)
arg("atomseq_container"), .def_readonly("seqres_container", &DisCoDataContainers::seqres_container)
arg("position_container"))); .def_readonly("atomseq_container", &DisCoDataContainers::atomseq_container)
.def_readonly("position_container", &DisCoDataContainers::position_container)
def("ExtractTemplateData", &WrapExtractTemplateDataList, (arg("entry_name"), ;
arg("chain_name"),
arg("aln"), def("ExtractTemplateDataDisCo", &WrapExtractTemplateDataDisCo, (arg("entry_name"),
arg("linear_indexer"), arg("chain_name"),
arg("seqres_container"), arg("aln"),
arg("atomseq_container"), arg("data_containers"),
arg("position_containers"))); arg("residue_numbers"),
arg("positions")));
class_<GMQEDataContainers>("GMQEDataContainers", init<const String&,
const String&,
const String&,
const String&,
const String&,
const String&,
const String&,
const String&>())
.def_readonly("indexer", &GMQEDataContainers::indexer)
.def_readonly("seqres_container", &GMQEDataContainers::seqres_container)
.def_readonly("atomseq_container", &GMQEDataContainers::atomseq_container)
.def_readonly("dssp_container", &GMQEDataContainers::dssp_container)
.def_readonly("n_position_container", &GMQEDataContainers::n_position_container)
.def_readonly("ca_position_container", &GMQEDataContainers::ca_position_container)
.def_readonly("c_position_container", &GMQEDataContainers::c_position_container)
.def_readonly("cb_position_container", &GMQEDataContainers::cb_position_container)
;
def("ExtractTemplateDataGMQE", &WrapExtractTemplateDataGMQE, (arg("entry_name"),
arg("chain_name"),
arg("aln"),
arg("data_containers"),
arg("residue_numbers"),
arg("dssp"),
arg("n_positions"),
arg("ca_positions"),
arg("c_positions"),
arg("cb_positions")));
} }
...@@ -48,14 +48,14 @@ void ExtractValidPositions(const String& entry_name, const String& chain_name, ...@@ -48,14 +48,14 @@ void ExtractValidPositions(const String& entry_name, const String& chain_name,
valid_seq.end())); valid_seq.end()));
} }
void ExtractTemplateData(const String& entry_name, const String& chain_name, void ExtractTemplateDataDisCo(const String& entry_name, const String& chain_name,
const ost::seq::AlignmentHandle& aln, const ost::seq::AlignmentHandle& aln,
LinearIndexerPtr indexer, LinearIndexerPtr indexer,
LinearCharacterContainerPtr seqres_container, LinearCharacterContainerPtr seqres_container,
LinearCharacterContainerPtr atomseq_container, LinearCharacterContainerPtr atomseq_container,
std::vector<LinearPositionContainerPtr>& position_containers, LinearPositionContainerPtr& position_container,
std::vector<int>& residue_numbers, std::vector<int>& residue_numbers,
std::vector<geom::Vec3List>& positions) { geom::Vec3List& positions) {
std::pair<uint64_t, uint64_t> data_range = indexer->GetDataRange(entry_name, std::pair<uint64_t, uint64_t> data_range = indexer->GetDataRange(entry_name,
chain_name); chain_name);
...@@ -75,34 +75,119 @@ void ExtractTemplateData(const String& entry_name, const String& chain_name, ...@@ -75,34 +75,119 @@ void ExtractTemplateData(const String& entry_name, const String& chain_name,
String template_atomseq; String template_atomseq;
atomseq_container->GetCharacters(data_range, template_atomseq); atomseq_container->GetCharacters(data_range, template_atomseq);
int n_pos_containers = position_containers.size(); geom::Vec3List extracted_positions;
std::vector<geom::Vec3List> extracted_positions(n_pos_containers); position_container->GetPositions(data_range, extracted_positions);
for(int i = 0; i < n_pos_containers; ++i) {
position_containers[i]->GetPositions(data_range, extracted_positions[i]);
}
// prepare output
uint template_atomseq_size = template_atomseq.size();
residue_numbers.clear();
residue_numbers.reserve(template_atomseq_size);
positions.clear();
positions.reserve(template_atomseq_size);
// extract
uint current_rnum = aln.GetSequence(0).GetOffset() + 1; uint current_rnum = aln.GetSequence(0).GetOffset() + 1;
uint current_template_pos = 0; uint current_template_pos = 0;
String seqres_seq = aln.GetSequence(0).GetString(); String seqres_seq = aln.GetSequence(0).GetString();
String template_seq = aln.GetSequence(1).GetString(); String template_seq = aln.GetSequence(1).GetString();
for(int i = 0; i < aln.GetLength(); ++i) {
if(seqres_seq[i] != '-' && template_seq[i] != '-') {
if(template_atomseq[current_template_pos] != '-') {
// it is aligned and we have a valid position!
residue_numbers.push_back(current_rnum);
positions.push_back(extracted_positions[current_template_pos]);
}
}
if(seqres_seq[i] != '-') {
++current_rnum;
}
if(template_seq[i] != '-') {
++current_template_pos;
}
}
}
void ExtractTemplateDataGMQE(const String& entry_name, const String& chain_name,
const ost::seq::AlignmentHandle& aln,
LinearIndexerPtr indexer,
LinearCharacterContainerPtr seqres_container,
LinearCharacterContainerPtr atomseq_container,
LinearCharacterContainerPtr dssp_container,
LinearPositionContainerPtr& n_position_container,
LinearPositionContainerPtr& ca_position_container,
LinearPositionContainerPtr& c_position_container,
LinearPositionContainerPtr& cb_position_container,
std::vector<int>& residue_numbers,
String& dssp,
geom::Vec3List& n_positions,
geom::Vec3List& ca_positions,
geom::Vec3List& c_positions,
geom::Vec3List& cb_positions) {
std::pair<uint64_t, uint64_t> data_range = indexer->GetDataRange(entry_name,
chain_name);
String template_seqres = aln.GetSequence(1).GetGaplessString();
data_range.first += aln.GetSequence(1).GetOffset();
data_range.second = data_range.first + template_seqres.size();
// check, whether the the template seqres is consistent with what
// we find in seqres_container
String expected_template_seqres;
seqres_container->GetCharacters(data_range, expected_template_seqres);
if(expected_template_seqres != template_seqres) {
throw std::runtime_error("Template sequence in input alignment is "
"inconsistent with sequence in SEQRES container!");
}
String template_atomseq;
atomseq_container->GetCharacters(data_range, template_atomseq);
String extracted_dssp;
dssp_container->GetCharacters(data_range, extracted_dssp);
geom::Vec3List extracted_n_positions;
n_position_container->GetPositions(data_range, extracted_n_positions);
geom::Vec3List extracted_ca_positions;
ca_position_container->GetPositions(data_range, extracted_ca_positions);
geom::Vec3List extracted_c_positions;
c_position_container->GetPositions(data_range, extracted_c_positions);
geom::Vec3List extracted_cb_positions;
cb_position_container->GetPositions(data_range, extracted_cb_positions);
// prepare output // prepare output
uint template_atomseq_size = template_atomseq.size(); uint template_atomseq_size = template_atomseq.size();
positions.assign(n_pos_containers, geom::Vec3List());
for(int i = 0; i < n_pos_containers; ++i) {
positions[i].reserve(template_atomseq_size);
}
residue_numbers.clear(); residue_numbers.clear();
residue_numbers.reserve(template_atomseq_size); residue_numbers.reserve(template_atomseq_size);
dssp.clear();
dssp.reserve(template_atomseq_size);
n_positions.clear();
n_positions.reserve(template_atomseq_size);
ca_positions.clear();
ca_positions.reserve(template_atomseq_size);
c_positions.clear();
c_positions.reserve(template_atomseq_size);
cb_positions.clear();
cb_positions.reserve(template_atomseq_size);
// extract
uint current_rnum = aln.GetSequence(0).GetOffset() + 1;
uint current_template_pos = 0;
String seqres_seq = aln.GetSequence(0).GetString();
String template_seq = aln.GetSequence(1).GetString();
for(int i = 0; i < aln.GetLength(); ++i) { for(int i = 0; i < aln.GetLength(); ++i) {
if(seqres_seq[i] != '-' && template_seq[i] != '-') { if(seqres_seq[i] != '-' && template_seq[i] != '-') {
if(template_atomseq[current_template_pos] != '-') { if(template_atomseq[current_template_pos] != '-') {
// it is aligned and we have a valid position! // it is aligned and we have a valid position!
residue_numbers.push_back(current_rnum); residue_numbers.push_back(current_rnum);
for(int j = 0; j < n_pos_containers; ++j) { dssp.push_back(extracted_dssp[current_template_pos]);
positions[j].push_back(extracted_positions[j][current_template_pos]); n_positions.push_back(extracted_n_positions[current_template_pos]);
} ca_positions.push_back(extracted_ca_positions[current_template_pos]);
c_positions.push_back(extracted_c_positions[current_template_pos]);
cb_positions.push_back(extracted_cb_positions[current_template_pos]);
} }
} }
......
...@@ -35,15 +35,31 @@ void ExtractValidPositions(const String& entry_name, const String& chain_name, ...@@ -35,15 +35,31 @@ void ExtractValidPositions(const String& entry_name, const String& chain_name,
ost::seq::SequenceHandle& seq, ost::seq::SequenceHandle& seq,
geom::Vec3List& positions); geom::Vec3List& positions);
void ExtractTemplateData(const String& entry_name, const String& chain_name, void ExtractTemplateDataDisCo(const String& entry_name, const String& chain_name,
const ost::seq::AlignmentHandle& aln, const ost::seq::AlignmentHandle& aln,
LinearIndexerPtr indexer, LinearIndexerPtr indexer,
LinearCharacterContainerPtr seqres_container, LinearCharacterContainerPtr seqres_container,
LinearCharacterContainerPtr atomseq_container, LinearCharacterContainerPtr atomseq_container,
std::vector<LinearPositionContainerPtr>& position_container, LinearPositionContainerPtr& position_container,
std::vector<int>& residue_numbers, std::vector<int>& residue_numbers,
std::vector<geom::Vec3List>& positions); geom::Vec3List& positions);
void ExtractTemplateDataGMQE(const String& entry_name, const String& chain_name,
const ost::seq::AlignmentHandle& aln,
LinearIndexerPtr indexer,
LinearCharacterContainerPtr seqres_container,
LinearCharacterContainerPtr atomseq_container,
LinearCharacterContainerPtr dssp_container,
LinearPositionContainerPtr& n_position_container,
LinearPositionContainerPtr& ca_position_container,
LinearPositionContainerPtr& c_position_container,
LinearPositionContainerPtr& cb_position_container,
std::vector<int>& residue_numbers,
String& dssp,
geom::Vec3List& n_positions,
geom::Vec3List& ca_positions,
geom::Vec3List& c_positions,
geom::Vec3List& cb_positions);
}} //ns }} //ns
#endif #endif
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment