Skip to content
Snippets Groups Projects
Verified Commit 0420c37b authored by Xavier Robin's avatar Xavier Robin
Browse files

doc: update actions doc

parent 0708fc56
No related branches found
No related tags found
No related merge requests found
.. Note on large code blocks: keep max. width to 100 or it will look bad
on webpage!
.. Note on large code blocks: keep max. width to less than 100 or it will look bad
on the web page!
You can do that by setting COLUMNS to 97:
COLUMNS=97 ost compare-structures -h
COLUMNS=97 ost compare-ligand-structures -h
.. TODO: look at argparse directive to autogenerate --help output!
.. ost-actions:
......@@ -32,30 +36,24 @@ Details on the usage (output of ``ost compare-structures --help``):
.. code-block:: console
usage: ost compare-structures [-h] -m MODEL -r REFERENCE [-o OUTPUT]
[-mf {pdb,cif,mmcif}] [-rf {pdb,cif,mmcif}]
[-mb MODEL_BIOUNIT] [-rb REFERENCE_BIOUNIT]
usage: ost compare-structures [-h] -m MODEL -r REFERENCE [-o OUTPUT] [-mf {pdb,cif,mmcif}]
[-rf {pdb,cif,mmcif}] [-mb MODEL_BIOUNIT] [-rb REFERENCE_BIOUNIT]
[-rna] [-ec] [-d] [-ds DUMP_SUFFIX] [-ft]
[-c CHAIN_MAPPING [CHAIN_MAPPING ...]] [--lddt]
[--local-lddt] [--aa-local-lddt] [--bb-lddt]
[--bb-local-lddt] [--ilddt] [--cad-score]
[--local-cad-score] [--cad-exec CAD_EXEC]
[--usalign-exec USALIGN_EXEC]
[--override-usalign-mapping] [--qs-score]
[--dockq] [--dockq-capri-peptide] [--ics]
[--ics-trimmed] [--ips] [--ips-trimmed]
[--rigid-scores] [--patch-scores] [--tm-score]
[--lddt-no-stereochecks]
[--n-max-naive N_MAX_NAIVE]
[--dump-aligned-residues] [--dump-pepnuc-alns]
[--dump-pepnuc-aligned-residues]
[-c CHAIN_MAPPING [CHAIN_MAPPING ...]] [--lddt] [--local-lddt]
[--aa-local-lddt] [--bb-lddt] [--bb-local-lddt] [--ilddt]
[--cad-score] [--local-cad-score] [--cad-exec CAD_EXEC]
[--usalign-exec USALIGN_EXEC] [--override-usalign-mapping]
[--qs-score] [--dockq] [--dockq-capri-peptide] [--ics]
[--ics-trimmed] [--ips] [--ips-trimmed] [--rigid-scores]
[--patch-scores] [--tm-score] [--lddt-no-stereochecks]
[--n-max-naive N_MAX_NAIVE] [--dump-aligned-residues]
[--dump-pepnuc-alns] [--dump-pepnuc-aligned-residues]
[--min-pep-length MIN_PEP_LENGTH]
[--min-nuc-length MIN_NUC_LENGTH] [-v VERBOSITY]
[--lddt-add-mdl-contacts]
[--lddt-inclusion-radius LDDT_INCLUSION_RADIUS]
[--chem-group-seqid-thresh CHEM_GROUP_SEQID_THRESH]
[--chem-map-seqid-thresh CHEM_MAP_SEQID_THRESH]
[--seqres SEQRES]
[--chem-map-seqid-thresh CHEM_MAP_SEQID_THRESH] [--seqres SEQRES]
[--trg-seqres-mapping TRG_SEQRES_MAPPING [TRG_SEQRES_MAPPING ...]]
Evaluate model against reference
......@@ -64,14 +62,14 @@ Details on the usage (output of ``ost compare-structures --help``):
Loads the structures and performs basic cleanup:
* Assign elements according to the PDB Chemical Component Dictionary
* Map nonstandard residues to their parent residues as defined by the PDB
Chemical Component Dictionary, e.g. phospho-serine => serine
* Remove hydrogens
* Remove OXT atoms
* Remove unknown atoms, i.e. atoms that are not expected according to the PDB
Chemical Component Dictionary
* Select for peptide/nucleotide residues
* Assign elements according to the PDB Chemical Component Dictionary
* Map nonstandard residues to their parent residues as defined by the PDB
Chemical Component Dictionary, e.g. phospho-serine => serine
* Remove hydrogens
* Remove OXT atoms
* Remove unknown atoms, i.e. atoms that are not expected according to the PDB
Chemical Component Dictionary
* Select for peptide/nucleotide residues
The cleaned structures are optionally dumped using -d/--dump-structures
......@@ -79,40 +77,47 @@ Details on the usage (output of ``ost compare-structures --help``):
options, this is a dictionary with the following keys describing model/reference
comparison:
* "reference_chains": Chain names of reference
* "model_chains": Chain names of model
* "chem_groups": Groups of polypeptides/polynucleotides from reference that
are considered chemically equivalent, i.e. pass a pairwise sequence identity
threshold that can be controlled with --chem-group-seqid-thresh.
You can derive stoichiometry from this. Contains only chains that are
considered in chain mapping, i.e. pass a size threshold (defaults: 6 for
peptides, 4 for nucleotides).
* "chem_mapping": List of same length as "chem_groups". Assigns model chains to
the respective chem group. Again, only contains chains that are considered
in chain mapping. That is 1) pass the same size threshold as fo chem_groups
2) can be aligned to any of the chem groups with a sequence identity
threshold that can be controlled by --chem-map-seqid-thresh.
* "mdl_chains_without_chem_mapping": Model chains that could be considered in chain mapping,
i.e. are long enough, but could not be mapped to any chem group.
Depends on --chem-map-seqid-thresh. A mapping for each model chain can be
enforced by setting it to 0.
* "chain_mapping": A dictionary with reference chain names as keys and the
mapped model chain names as values. Missing chains are either not mapped
(but present in "chem_groups", "chem_mapping"), were not mapped to any chem
group (present in "mdl_chains_without_chem_mapping") or were not considered in
chain mapping (short peptides etc.)
* "aln": Pairwise sequence alignment for each pair of mapped chains in fasta
format.
* "inconsistent_residues": List of strings that represent name mismatches of
aligned residues in form
<trg_cname>.<trg_rnum>.<trg_ins_code>-<mdl_cname>.<mdl_rnum>.<mdl_ins_code>.
Inconsistencies may lead to corrupt results but do not abort the program.
Program abortion in these cases can be enforced with
-ec/--enforce-consistency.
* "status": SUCCESS if everything ran through. In case of failure, the only
content of the JSON output will be "status" set to FAILURE and an
additional key: "traceback".
* "ost_version": The OpenStructure version used for computation.
* "reference_chains": Chain names of reference
* "model_chains": Chain names of model
* "chem_groups": Groups of polypeptides/polynucleotides from reference that
are considered chemically equivalent. Predefined if the reference is an mmCIF
file or if "seqres"/"trg-seqres-mapping" are provided manually. Alignments
of structure to SEQRES are established using residue numbers in these cases
and matching structure one letter codes and SEQRES are enforced.
In case of a PDB reference without predefined SEQRES, groups are established
using clustering based on pairwise alignments. Chains within
"chem_group_seqid_thresh" are considered equivalent and alignments are
established using residue numbers or Needleman-Wunsch
(see "residue-number-alignments" flag)
You can derive stoichiometry from this. Contains only chains that are
considered in chain mapping, i.e. pass a size threshold (defaults: 6 for
peptides, 4 for nucleotides).
* "chem_mapping": List of same length as "chem_groups". Assigns model chains to
the respective chem group. Again, only contains chains that are considered
in chain mapping. That is 1) pass the same size threshold as for chem_groups
2) can be aligned to any of the chem groups with a sequence identity
threshold that can be controlled by --chem-map-seqid-thresh.
* "mdl_chains_without_chem_mapping": Model chains that could be considered in
chain mapping, i.e. are long enough, but could not be mapped to any chem
group. Depends on --chem-map-seqid-thresh. A mapping for each model chain can
be enforced by setting it to 0.
* "chain_mapping": A dictionary with reference chain names as keys and the
mapped model chain names as values. Missing chains are either not mapped
(but present in "chem_groups", "chem_mapping"), were not mapped to any chem
group (present in "mdl_chains_without_chem_mapping") or were not considered in
chain mapping (short peptides etc.)
* "aln": Pairwise sequence alignment for each pair of mapped chains in fasta
format.
* "inconsistent_residues": List of strings that represent name mismatches of
aligned residues in form
<trg_cname>.<trg_rnum>.<trg_ins_code>-<mdl_cname>.<mdl_rnum>.<mdl_ins_code>.
Inconsistencies may lead to corrupt results but do not abort the program.
Program abortion in these cases can be enforced with
-ec/--enforce-consistency.
* "status": SUCCESS if everything ran through. In case of failure, the only
content of the JSON output will be "status" set to FAILURE and an
additional key: "traceback".
* "ost_version": The OpenStructure version used for computation.
Additional keys represent input options.
......@@ -138,369 +143,289 @@ Details on the usage (output of ``ost compare-structures --help``):
-r REFERENCE, --reference REFERENCE
Path to reference file.
-o OUTPUT, --output OUTPUT
Output file name. The output will be saved as a JSON
file. default: out.json
Output file name. The output will be saved as a JSON file. default:
out.json
-mf {pdb,cif,mmcif}, --model-format {pdb,cif,mmcif}
Format of model file. pdb reads pdb but also pdb.gz,
same applies to cif/mmcif. Inferred from filepath if
not given.
Format of model file. pdb reads pdb but also pdb.gz, same applies to
cif/mmcif. Inferred from filepath if not given.
-rf {pdb,cif,mmcif}, --reference-format {pdb,cif,mmcif}
Format of reference file. pdb reads pdb but also
pdb.gz, same applies to cif/mmcif. Inferred from
filepath if not given.
Format of reference file. pdb reads pdb but also pdb.gz, same applies
to cif/mmcif. Inferred from filepath if not given.
-mb MODEL_BIOUNIT, --model-biounit MODEL_BIOUNIT
Only has an effect if model is in mmcif format. By
default, the asymmetric unit (AU) is used for scoring.
If there are biounits defined in the mmcif file, you
can specify the ID (as a string) of the one which
should be used.
Only has an effect if model is in mmcif format. By default, the
asymmetric unit (AU) is used for scoring. If there are biounits defined
in the mmcif file, you can specify the ID (as a string) of the one
which should be used.
-rb REFERENCE_BIOUNIT, --reference-biounit REFERENCE_BIOUNIT
Only has an effect if reference is in mmcif format. By
default, the asymmetric unit (AU) is used for scoring.
If there are biounits defined in the mmcif file, you
can specify the ID (as a string) of the one which
should be used.
Only has an effect if reference is in mmcif format. By default, the
asymmetric unit (AU) is used for scoring. If there are biounits defined
in the mmcif file, you can specify the ID (as a string) of the one
which should be used.
-rna, --residue-number-alignment
Make alignment based on residue number instead of
using a global BLOSUM62-based alignment (NUC44 for
nucleotides).
Make alignment based on residue number instead of using a global
BLOSUM62-based alignment (NUC44 for nucleotides).
-ec, --enforce-consistency
Enforce consistency. By default residue name
discrepancies between a model and reference are
reported but the program proceeds. If this flag is ON,
the program fails for these cases.
Enforce consistency. By default residue name discrepancies between a
model and reference are reported but the program proceeds. If this flag
is ON, the program fails for these cases.
-d, --dump-structures
Dump cleaned structures used to calculate all the
scores as PDB or mmCIF files using specified suffix.
Files will be dumped to the same location and in the
same format as original files.
Dump cleaned structures used to calculate all the scores as PDB or
mmCIF files using specified suffix. Files will be dumped to the same
location and in the same format as original files.
-ds DUMP_SUFFIX, --dump-suffix DUMP_SUFFIX
Use this suffix to dump structures. Defaults to
_compare_structures
Use this suffix to dump structures. Defaults to _compare_structures
-ft, --fault-tolerant
Fault tolerant parsing.
-c CHAIN_MAPPING [CHAIN_MAPPING ...], --chain-mapping CHAIN_MAPPING [CHAIN_MAPPING ...]
Custom mapping of chains between the reference and the
model. Each separate mapping consist of key:value
pairs where key is the chain name in reference and
value is the chain name in model.
--lddt Compute global LDDT score with default
parameterization and store as key "lddt".
Stereochemical irregularities affecting LDDT are
reported as keys "model_clashes", "model_bad_bonds",
"model_bad_angles" and the respective reference
counterparts.
--local-lddt Compute per-residue LDDT scores with default
parameterization and store as key "local_lddt". Score
for each residue is accessible by key
<chain_name>.<resnum>.<resnum_inscode>. Residue with
number 42 in chain X can be extracted with:
data["local_lddt"]["X.42."]. If there is an insertion
code, lets say A, the residue key becomes "X.42.A".
Stereochemical irregularities affecting LDDT are
reported as keys "model_clashes", "model_bad_bonds",
"model_bad_angles" and the respective reference
counterparts. Atoms specified in there follow the
following format:
<chain_name>.<resnum>.<resnum_inscode>.<atom_name>
--aa-local-lddt Compute per-atom LDDT scores with default
parameterization and store as key "aa_local_lddt".
Score for each atom is accessible by key
<chain_name>.<resnum>.<resnum_inscode>.<aname>. Alpha
carbon from residue with number 42 in chain X can be
extracted with: data["aa_local_lddt"]["X.42..CA"]. If
there is a residue insertion code, lets say A, the
atom key becomes "X.42.A.CA". Stereochemical
irregularities affecting LDDT are reported as keys
"model_clashes", "model_bad_bonds", "model_bad_angles"
and the respective reference counterparts. Atoms
specified in there follow the following format:
Custom mapping of chains between the reference and the model. Each
separate mapping consist of key:value pairs where key is the chain name
in reference and value is the chain name in model.
--lddt Compute global LDDT score with default parameterization and store as
key "lddt". Stereochemical irregularities affecting LDDT are reported
as keys "model_clashes", "model_bad_bonds", "model_bad_angles" and the
respective reference counterparts.
--local-lddt Compute per-residue LDDT scores with default parameterization and store
as key "local_lddt". Score for each residue is accessible by key
<chain_name>.<resnum>.<resnum_inscode>. Residue with number 42 in chain
X can be extracted with: data["local_lddt"]["X.42."]. If there is an
insertion code, lets say A, the residue key becomes "X.42.A".
Stereochemical irregularities affecting LDDT are reported as keys
"model_clashes", "model_bad_bonds", "model_bad_angles" and the
respective reference counterparts. Atoms specified in there follow the
following format: <chain_name>.<resnum>.<resnum_inscode>.<atom_name>
--aa-local-lddt Compute per-atom LDDT scores with default parameterization and store as
key "aa_local_lddt". Score for each atom is accessible by key
<chain_name>.<resnum>.<resnum_inscode>.<aname>. Alpha carbon from
residue with number 42 in chain X can be extracted with:
data["aa_local_lddt"]["X.42..CA"]. If there is a residue insertion
code, lets say A, the atom key becomes "X.42.A.CA". Stereochemical
irregularities affecting LDDT are reported as keys "model_clashes",
"model_bad_bonds", "model_bad_angles" and the respective reference
counterparts. Atoms specified in there follow the following format:
<chain_name>.<resnum>.<resnum_inscode>.<atom_name>
--bb-lddt Compute global LDDT score with default
parameterization and store as key "bb_lddt". LDDT in
this case is only computed on backbone atoms: CA for
peptides and C3' for nucleotides
--bb-local-lddt Compute per-residue LDDT scores with default
parameterization and store as key "bb_local_lddt".
LDDT in this case is only computed on backbone atoms:
CA for peptides and C3' for nucleotides. Per-residue
scores are accessible as described for local_lddt.
--ilddt Compute global LDDT score which is solely based on
inter-chain contacts and store as key "ilddt". Same
stereochemical irregularities as for lddt apply.
--cad-score Compute global CAD's atom-atom (AA) score and store as
key "cad_score". --residue-number-alignment must be
enabled to compute this score. Requires
voronota_cadscore executable in PATH. Alternatively
you can set cad-exec.
--local-cad-score Compute local CAD's atom-atom (AA) scores and store as
key "local_cad_score". Per-residue scores are
accessible as described for local_lddt. --residue-
number-alignments must be enabled to compute this
score. Requires voronota_cadscore executable in PATH.
Alternatively you can set cad-exec.
--bb-lddt Compute global LDDT score with default parameterization and store as
key "bb_lddt". LDDT in this case is only computed on backbone atoms: CA
for peptides and C3' for nucleotides
--bb-local-lddt Compute per-residue LDDT scores with default parameterization and store
as key "bb_local_lddt". LDDT in this case is only computed on backbone
atoms: CA for peptides and C3' for nucleotides. Per-residue scores are
accessible as described for local_lddt.
--ilddt Compute global LDDT score which is solely based on inter-chain contacts
and store as key "ilddt". Same stereochemical irregularities as for
lddt apply.
--cad-score Compute global CAD's atom-atom (AA) score and store as key "cad_score".
--residue-number-alignment must be enabled to compute this score.
Requires voronota_cadscore executable in PATH. Alternatively you can
set cad-exec.
--local-cad-score Compute local CAD's atom-atom (AA) scores and store as key
"local_cad_score". Per-residue scores are accessible as described for
local_lddt. --residue-number-alignments must be enabled to compute this
score. Requires voronota_cadscore executable in PATH. Alternatively you
can set cad-exec.
--cad-exec CAD_EXEC Path to voronota-cadscore executable (installed from
https://github.com/kliment-olechnovic/voronota).
Searches PATH if not set.
https://github.com/kliment-olechnovic/voronota). Searches PATH if not
set.
--usalign-exec USALIGN_EXEC
Path to USalign executable to compute TM-score. If not
given, an OpenStructure internal copy of USalign code
is used.
Path to USalign executable to compute TM-score. If not given, an
OpenStructure internal copy of USalign code is used.
--override-usalign-mapping
Override USalign mapping and inject our own rigid
mapping. Only works if external usalign executable is
provided that is reasonably new and contains that
feature.
--qs-score Compute QS-score, stored as key "qs_global", and the
QS-best variant, stored as key "qs_best". Interfaces
in the reference with non-zero contribution to QS-
score are available as key "qs_reference_interfaces",
the ones from the model as key "qs_model_interfaces".
"qs_interfaces" is a subset of
"qs_reference_interfaces" that contains interfaces
that can be mapped to the model. They are stored as
lists in format [ref_ch1, ref_ch2, mdl_ch1, mdl_ch2].
The respective per-interface scores for
"qs_interfaces" are available as keys
"per_interface_qs_global" and "per_interface_qs_best"
--dockq Compute DockQ scores and its components. Relevant
interfaces with at least one contact (any atom within
5A) of the reference structure are available as key
"dockq_reference_interfaces". Protein-protein,
protein-nucleotide and nucleotide-nucleotide
interfaces are considered. Key "dockq_interfaces" is a
subset of "dockq_reference_interfaces" that contains
interfaces that can be mapped to the model. They are
stored as lists in format [ref_ch1, ref_ch2, mdl_ch1,
mdl_ch2]. The respective DockQ scores for
"dockq_interfaces" are available as key "dockq". It's
components are available as keys: "fnat" (fraction of
reference contacts which are also there in model)
"irmsd" (interface RMSD), "lrmsd" (ligand RMSD). The
DockQ score is strictly designed to score each
interface individually. We also provide two averaged
versions to get one full model score: "dockq_ave",
"dockq_wave". The first is simply the average of
"dockq_scores", the latter is a weighted average with
weights derived from number of contacts in the
reference interfaces. These two scores only consider
interfaces that are present in both, the model and the
reference. "dockq_ave_full" and "dockq_wave_full" add
zeros in the average computation for each interface
that is only present in the reference but not in the
model.
Override USalign mapping and inject our own rigid mapping. Only works
if external usalign executable is provided that is reasonably new and
contains that feature.
--qs-score Compute QS-score, stored as key "qs_global", and the QS-best variant,
stored as key "qs_best". Interfaces in the reference with non-zero
contribution to QS-score are available as key
"qs_reference_interfaces", the ones from the model as key
"qs_model_interfaces". "qs_interfaces" is a subset of
"qs_reference_interfaces" that contains interfaces that can be mapped
to the model. They are stored as lists in format [ref_ch1, ref_ch2,
mdl_ch1, mdl_ch2]. The respective per-interface scores for
"qs_interfaces" are available as keys "per_interface_qs_global" and
"per_interface_qs_best"
--dockq Compute DockQ scores and its components. Relevant interfaces with at
least one contact (any atom within 5A) of the reference structure are
available as key "dockq_reference_interfaces". Protein-protein,
protein-nucleotide and nucleotide-nucleotide interfaces are considered.
Key "dockq_interfaces" is a subset of "dockq_reference_interfaces" that
contains interfaces that can be mapped to the model. They are stored as
lists in format [ref_ch1, ref_ch2, mdl_ch1, mdl_ch2]. The respective
DockQ scores for "dockq_interfaces" are available as key "dockq". It's
components are available as keys: "fnat" (fraction of reference
contacts which are also there in model) "irmsd" (interface RMSD),
"lrmsd" (ligand RMSD). The DockQ score is strictly designed to score
each interface individually. We also provide two averaged versions to
get one full model score: "dockq_ave", "dockq_wave". The first is
simply the average of "dockq_scores", the latter is a weighted average
with weights derived from number of contacts in the reference
interfaces. These two scores only consider interfaces that are present
in both, the model and the reference. "dockq_ave_full" and
"dockq_wave_full" add zeros in the average computation for each
interface that is only present in the reference but not in the model.
--dockq-capri-peptide
Flag that changes two things in the way DockQ and its
underlying scores are computed which is proposed by
the CAPRI community when scoring peptides (PMID:
31886916). ONE: Two residues are considered in contact
if any of their atoms is within 5A. This is relevant
for fnat and fnonat scores. CAPRI suggests to lower
this threshold to 4A for protein-peptide interactions.
TWO: irmsd is computed on interface residues. A
residue is defined as interface residue if any of its
atoms is within 10A of another chain. CAPRI suggests
to lower the default of 10A to 8A in combination with
only considering CB atoms for protein-peptide
interactions. Note that the resulting DockQ is not
evaluated for these slightly updated fnat and irmsd
(lrmsd stays the same). Raises an error if reference
contains nucleotide chains. This flag has no influence
on patch_dockq scores.
--ics Computes interface contact similarity (ICS) related
scores. A contact between two residues of different
chains is defined as having at least one heavy atom
within 5A. Contacts in reference structure are
available as key "reference_contacts". Each contact
specifies the interacting residues in format
"<cname>.<rnum>.<ins_code>". Model contacts are
available as key "model_contacts". The precision which
is available as key "ics_precision" reports the
fraction of model contacts that are also present in
the reference. The recall which is available as key
"ics_recall" reports the fraction of reference
contacts that are correctly reproduced in the model.
The ICS score (Interface Contact Similarity) available
as key "ics" combines precision and recall using the
F1-measure. All these measures are also available on a
per-interface basis for each interface in the
reference structure that are defined as chain pairs
with at least one contact (available as key
"contact_reference_interfaces"). The respective
metrics are available as keys
"per_interface_ics_precision",
"per_interface_ics_recall" and "per_interface_ics".
--ics-trimmed Computes interface contact similarity (ICS) related
scores but on a trimmed model. That means that a
mapping between model and reference is performed and
all model residues without reference counterpart are
removed. As a consequence, model contacts for which we
have no experimental evidence do not affect the score.
The effect of these added model contacts without
mapping to target would be decreased precision and
thus lower ics. Recall is not affected. Enabling this
flag adds the following keys: "ics_trimmed",
"ics_precision_trimmed", "ics_recall_trimmed",
"model_contacts_trimmed". The reference contacts and
reference interfaces are the same as for ics and
available as keys: "reference_contacts",
"contact_reference_interfaces". All these measures are
also available on a per-interface basis for each
interface in the reference structure that are defined
as chain pairs with at least one contact (available as
key "contact_reference_interfaces"). The respective
metrics are available as keys
Flag that changes two things in the way DockQ and its underlying scores
are computed which is proposed by the CAPRI community when scoring
peptides (PMID: 31886916). ONE: Two residues are considered in contact
if any of their atoms is within 5A. This is relevant for fnat and
fnonat scores. CAPRI suggests to lower this threshold to 4A for
protein-peptide interactions. TWO: irmsd is computed on interface
residues. A residue is defined as interface residue if any of its atoms
is within 10A of another chain. CAPRI suggests to lower the default of
10A to 8A in combination with only considering CB atoms for protein-
peptide interactions. Note that the resulting DockQ is not evaluated
for these slightly updated fnat and irmsd (lrmsd stays the same).
Raises an error if reference contains nucleotide chains. This flag has
no influence on patch_dockq scores.
--ics Computes interface contact similarity (ICS) related scores. A contact
between two residues of different chains is defined as having at least
one heavy atom within 5A. Contacts in reference structure are available
as key "reference_contacts". Each contact specifies the interacting
residues in format "<cname>.<rnum>.<ins_code>". Model contacts are
available as key "model_contacts". The precision which is available as
key "ics_precision" reports the fraction of model contacts that are
also present in the reference. The recall which is available as key
"ics_recall" reports the fraction of reference contacts that are
correctly reproduced in the model. The ICS score (Interface Contact
Similarity) available as key "ics" combines precision and recall using
the F1-measure. All these measures are also available on a per-
interface basis for each interface in the reference structure that are
defined as chain pairs with at least one contact (available as key
"contact_reference_interfaces"). The respective metrics are available
as keys "per_interface_ics_precision", "per_interface_ics_recall" and
"per_interface_ics".
--ics-trimmed Computes interface contact similarity (ICS) related scores but on a
trimmed model. That means that a mapping between model and reference is
performed and all model residues without reference counterpart are
removed. As a consequence, model contacts for which we have no
experimental evidence do not affect the score. The effect of these
added model contacts without mapping to target would be decreased
precision and thus lower ics. Recall is not affected. Enabling this
flag adds the following keys: "ics_trimmed", "ics_precision_trimmed",
"ics_recall_trimmed", "model_contacts_trimmed". The reference contacts
and reference interfaces are the same as for ics and available as keys:
"reference_contacts", "contact_reference_interfaces". All these
measures are also available on a per-interface basis for each interface
in the reference structure that are defined as chain pairs with at
least one contact (available as key "contact_reference_interfaces").
The respective metrics are available as keys
"per_interface_ics_precision_trimmed",
"per_interface_ics_recall_trimmed" and
"per_interface_ics_trimmed".
--ips Computes interface patch similarity (IPS) related
scores. They focus on interface residues. They are
defined as having at least one contact to a residue
from any other chain. In short: if they show up in the
contact lists used to compute ICS. If ips is enabled,
these contacts get reported too and are available as
keys "reference_contacts" and "model_contacts".The
precision which is available as key "ips_precision"
reports the fraction of model interface residues, that
are also interface residues in the reference. The
recall which is available as key "ips_recall" reports
the fraction of reference interface residues that are
also interface residues in the model. The IPS score
(Interface Patch Similarity) available as key "ips" is
the Jaccard coefficient between interface residues in
reference and model. All these measures are also
available on a per-interface basis for each interface
in the reference structure that are defined as chain
pairs with at least one contact (available as key
"contact_reference_interfaces"). The respective
metrics are available as keys
"per_interface_ips_precision",
"per_interface_ips_recall" and "per_interface_ips".
"per_interface_ics_recall_trimmed" and "per_interface_ics_trimmed".
--ips Computes interface patch similarity (IPS) related scores. They focus on
interface residues. They are defined as having at least one contact to
a residue from any other chain. In short: if they show up in the
contact lists used to compute ICS. If ips is enabled, these contacts
get reported too and are available as keys "reference_contacts" and
"model_contacts".The precision which is available as key
"ips_precision" reports the fraction of model interface residues, that
are also interface residues in the reference. The recall which is
available as key "ips_recall" reports the fraction of reference
interface residues that are also interface residues in the model. The
IPS score (Interface Patch Similarity) available as key "ips" is the
Jaccard coefficient between interface residues in reference and model.
All these measures are also available on a per-interface basis for each
interface in the reference structure that are defined as chain pairs
with at least one contact (available as key
"contact_reference_interfaces"). The respective metrics are available
as keys "per_interface_ips_precision", "per_interface_ips_recall" and
"per_interface_ips".
--ips-trimmed The IPS equivalent of ICS on trimmed models.
--rigid-scores Computes rigid superposition based scores. They're
based on a Kabsch superposition of all mapped CA
positions (C3' for nucleotides). Makes the following
keys available: "oligo_gdtts": GDT with distance
thresholds [1.0, 2.0, 4.0, 8.0] given these positions
and transformation, "oligo_gdtha": same with
thresholds [0.5, 1.0, 2.0, 4.0], "rmsd": RMSD given
these positions and transformation, "transform": the
used 4x4 transformation matrix that superposes model
onto reference, "rigid_chain_mapping": equivalent of
"chain_mapping" which is used for rigid scores
(optimized for RMSD instead of QS-score/LDDT).
--patch-scores Local interface quality score used in CASP15. Scores
each model residue that is considered in the interface
(CB pos within 8A of any CB pos from another chain (CA
for GLY)). The local neighborhood gets represented by
"interface patches" which are scored with QS-score and
DockQ. Scores where not the full patches are
represented by the reference are set to None. Model
interface residues are available as key
"model_interface_residues", reference interface
residues as key "reference_interface_residues".
Residues are represented as string in form
<chain_name>.<resnum>.<resnum_inscode>. The respective
scores are available as keys "patch_qs" and
"patch_dockq"
--tm-score Computes TM-score with the USalign tool. Also computes
a chain mapping in case of complexes that is stored in
the same format as the default mapping. TM-score and
the mapping are available as keys "tm_score" and
--rigid-scores Computes rigid superposition based scores. They're based on a Kabsch
superposition of all mapped CA positions (C3' for nucleotides). Makes
the following keys available: "oligo_gdtts": GDT with distance
thresholds [1.0, 2.0, 4.0, 8.0] given these positions and
transformation, "oligo_gdtha": same with thresholds [0.5, 1.0, 2.0,
4.0], "rmsd": RMSD given these positions and transformation,
"transform": the used 4x4 transformation matrix that superposes model
onto reference, "rigid_chain_mapping": equivalent of "chain_mapping"
which is used for rigid scores (optimized for RMSD instead of QS-
score/LDDT).
--patch-scores Local interface quality score used in CASP15. Scores each model residue
that is considered in the interface (CB pos within 8A of any CB pos
from another chain (CA for GLY)). The local neighborhood gets
represented by "interface patches" which are scored with QS-score and
DockQ. Scores where not the full patches are represented by the
reference are set to None. Model interface residues are available as
key "model_interface_residues", reference interface residues as key
"reference_interface_residues". Residues are represented as string in
form <chain_name>.<resnum>.<resnum_inscode>. The respective scores are
available as keys "patch_qs" and "patch_dockq"
--tm-score Computes TM-score with the USalign tool. Also computes a chain mapping
in case of complexes that is stored in the same format as the default
mapping. TM-score and the mapping are available as keys "tm_score" and
"usalign_mapping"
--lddt-no-stereochecks
Disable stereochecks for LDDT computation
--n-max-naive N_MAX_NAIVE
Parameter for chain mapping. If the number of possible
mappings is <= *n_max_naive*, the full mapping
solution space is enumerated to find the the mapping
with optimal QS-score. A heuristic is used otherwise.
The default of 40320 corresponds to an octamer (8! =
40320). A structure with stoichiometry A6B2 would be
6!*2! = 1440 etc.
Parameter for chain mapping. If the number of possible mappings is <=
*n_max_naive*, the full mapping solution space is enumerated to find
the the mapping with optimal QS-score. A heuristic is used otherwise.
The default of 40320 corresponds to an octamer (8! = 40320). A
structure with stoichiometry A6B2 would be 6!*2! = 1440 etc.
--dump-aligned-residues
Dump additional info on aligned model and reference
residues.
--dump-pepnuc-alns Dump alignments of mapped chains but with sequences
that did not undergo Molck preprocessing in the
scorer. Sequences are extracted from model/target
after undergoing selection for peptide and nucleotide
Dump additional info on aligned model and reference residues.
--dump-pepnuc-alns Dump alignments of mapped chains but with sequences that did not
undergo Molck preprocessing in the scorer. Sequences are extracted from
model/target after undergoing selection for peptide and nucleotide
residues.
--dump-pepnuc-aligned-residues
Dump additional info on model and reference residues
that occur in pepnuc alignments.
Dump additional info on model and reference residues that occur in
pepnuc alignments.
--min-pep-length MIN_PEP_LENGTH
Default: 6 - Relevant parameter if short peptides are
involved in scoring. Minimum peptide length for a
chain in the target structure to be considered in
chain mapping. The chain mapping algorithm first
performs an all vs. all pairwise sequence alignment to
identify "equal" chains within the target structure.
We go for simple sequence identity there. Short
sequences can be problematic as they may produce high
Default: 6 - Relevant parameter if short peptides are involved in
scoring. Minimum peptide length for a chain in the target structure to
be considered in chain mapping. The chain mapping algorithm first
performs an all vs. all pairwise sequence alignment to identify "equal"
chains within the target structure. We go for simple sequence identity
there. Short sequences can be problematic as they may produce high
sequence identity alignments by pure chance.
--min-nuc-length MIN_NUC_LENGTH
Default: 4 - Relevant parameter if short nucleotides
are involved in scoring.Minimum nucleotide length for
a chain in the target structure to be considered in
chain mapping. The chain mapping algorithm first
performs an all vs. all pairwise sequence alignment to
identify "equal" chains within the target structure.
We go for simple sequence identity there. Short
sequences can be problematic as they may produce high
Default: 4 - Relevant parameter if short nucleotides are involved in
scoring.Minimum nucleotide length for a chain in the target structure
to be considered in chain mapping. The chain mapping algorithm first
performs an all vs. all pairwise sequence alignment to identify "equal"
chains within the target structure. We go for simple sequence identity
there. Short sequences can be problematic as they may produce high
sequence identity alignments by pure chance.
-v VERBOSITY, --verbosity VERBOSITY
Set verbosity level. Defaults to 2 (Script).
--lddt-add-mdl-contacts
Only using contacts in LDDT that are within a certain
distance threshold in the reference does not penalize
for added model contacts. If set to True, this flag
will also consider reference contacts that are within
the specified distance threshold in the model but not
necessarily in the reference. No contact will be added
if the respective atom pair is not resolved in the
reference.
Only using contacts in LDDT that are within a certain distance
threshold in the reference does not penalize for added model contacts.
If set to True, this flag will also consider reference contacts that
are within the specified distance threshold in the model but not
necessarily in the reference. No contact will be added if the
respective atom pair is not resolved in the reference.
--lddt-inclusion-radius LDDT_INCLUSION_RADIUS
Passed to LDDT scorer. Affects all LDDT scores but not
chain mapping.
Passed to LDDT scorer. Affects all LDDT scores but not chain mapping.
--chem-group-seqid-thresh CHEM_GROUP_SEQID_THRESH
Default: 95 - Sequence identity threshold used to
group identical chains in reference structure in the
chain mapping step. The same threshold is applied to
peptide and nucleotide chains.
Default: 95 - Sequence identity threshold used to group identical
chains in reference structure in the chain mapping step. The same
threshold is applied to peptide and nucleotide chains.
--chem-map-seqid-thresh CHEM_MAP_SEQID_THRESH
Default: 70 - Sequence identity threshold used to map
model chains to groups derived in the chem grouping
step in chain mapping. If set to 0., a mapping is
enforced and each model chain is assigned to the chem
group with maximum sequence identity. If larger than
0., a mapping only happens if the respective model
chain can be aligned to a chem group with the
specified sequence identity threshold AND if at least
min-pep-length/min-nuc-length residues are aligned.
The same threshold is applied to peptide and
nucleotide chains.
--seqres SEQRES Default: None - manually define chem groups by
specifying path to a fasta file. Each sequence in that
file is considered a reference sequence of a chem
group. All polymer chains in reference will be aligned
to these sequences. This only works if -rna/--residue-
number-alignment is enabled and an error is raised
otherwise. Additionally, you need to manually specify
a mapping of the polymer chains using trg-seqres-
mapping and an error is raised otherwise. The one
letter codes in the structure must exactly match the
respective characters in seqres and an error is raised
if not.
Default: 70 - Sequence identity threshold used to map model chains to
groups derived in the chem grouping step in chain mapping. If set to
0., a mapping is enforced and each model chain is assigned to the chem
group with maximum sequence identity. If larger than 0., a mapping only
happens if the respective model chain can be aligned to a chem group
with the specified sequence identity threshold AND if at least min-pep-
length/min-nuc-length residues are aligned. The same threshold is
applied to peptide and nucleotide chains.
--seqres SEQRES Default: None - manually define chem groups by specifying path to a
fasta file. Each sequence in that file is considered a reference
sequence of a chem group. All polymer chains in reference will be
aligned to these sequences. This only works if -rna/--residue-number-
alignment is enabled and an error is raised otherwise. Additionally,
you need to manually specify a mapping of the polymer chains using trg-
seqres-mapping and an error is raised otherwise. The one letter codes
in the structure must exactly match the respective characters in seqres
and an error is raised if not.
--trg-seqres-mapping TRG_SEQRES_MAPPING [TRG_SEQRES_MAPPING ...]
Default: None - Maps each polymer chain in reference
to a sequence in *seqres*. Each mapping is a key:value
pair where key is the chain name in reference and
value is the sequence name in seqres. So let's say you
have a homo-dimer reference with chains "A" and "B"for
which you provide a seqres file containing one
sequence with name "1". You can specify this mapping
with: --trg-seqres-mapping A:1 B:1
Default: None - Maps each polymer chain in reference to a sequence in
*seqres*. Each mapping is a key:value pair where key is the chain name
in reference and value is the sequence name in seqres. So let's say you
have a homo-dimer reference with chains "A" and "B"for which you
provide a seqres file containing one sequence with name "1". You can
specify this mapping with: --trg-seqres-mapping A:1 B:1
......@@ -519,31 +444,24 @@ Details on the usage (output of ``ost compare-ligand-structures --help``):
.. code-block:: console
usage: ost compare-ligand-structures [-h] -m MODEL [-ml [MODEL_LIGANDS ...]]
-r REFERENCE
[-rl [REFERENCE_LIGANDS ...]] [-o OUTPUT]
[-mf {pdb,cif,mmcif}]
[-rf {pdb,cif,mmcif}] [-of {json,csv}]
[-csvm]
[--csv-extra-header CSV_EXTRA_HEADER]
[--csv-extra-data CSV_EXTRA_DATA]
[-mb MODEL_BIOUNIT]
[-rb REFERENCE_BIOUNIT] [-ft] [-rna]
[-sm] [-cd COVERAGE_DELTA] [-v VERBOSITY]
[--full-results] [--lddt-pli]
[--lddt-pli-radius LDDT_PLI_RADIUS]
[--lddt-pli-add-mdl-contacts]
[--no-lddt-pli-add-mdl-contacts] [--rmsd]
[--radius RADIUS]
[--lddt-lp-radius LDDT_LP_RADIUS] [-fbs]
[-ms MAX_SYMMETRIES]
[--min-pep-length MIN_PEP_LENGTH]
[--min-nuc-length MIN_NUC_LENGTH]
[--chem-group-seqid-thresh CHEM_GROUP_SEQID_THRESH]
[--chem-map-seqid-thresh CHEM_MAP_SEQID_THRESH]
[--seqres SEQRES]
[--trg-seqres-mapping TRG_SEQRES_MAPPING [TRG_SEQRES_MAPPING ...]]
[--allow-heuristic-conn]
usage: ost compare-ligand-structures [-h] -m MODEL [-ml [MODEL_LIGANDS ...]] -r REFERENCE
[-rl [REFERENCE_LIGANDS ...]] [-o OUTPUT]
[-mf {pdb,cif,mmcif}] [-rf {pdb,cif,mmcif}]
[-of {json,csv}] [-csvm]
[--csv-extra-header CSV_EXTRA_HEADER]
[--csv-extra-data CSV_EXTRA_DATA] [-mb MODEL_BIOUNIT]
[-rb REFERENCE_BIOUNIT] [-ft] [-rna] [-sm]
[-cd COVERAGE_DELTA] [-v VERBOSITY] [--full-results]
[--lddt-pli] [--lddt-pli-radius LDDT_PLI_RADIUS]
[--lddt-pli-add-mdl-contacts]
[--no-lddt-pli-add-mdl-contacts] [--rmsd]
[--radius RADIUS] [--lddt-lp-radius LDDT_LP_RADIUS] [-fbs]
[-ms MAX_SYMMETRIES] [--min-pep-length MIN_PEP_LENGTH]
[--min-nuc-length MIN_NUC_LENGTH]
[--chem-group-seqid-thresh CHEM_GROUP_SEQID_THRESH]
[--chem-map-seqid-thresh CHEM_MAP_SEQID_THRESH]
[--seqres SEQRES]
[--trg-seqres-mapping TRG_SEQRES_MAPPING [TRG_SEQRES_MAPPING ...]]
Evaluate model with non-polymer/small molecule ligands against reference.
......@@ -553,47 +471,44 @@ Details on the usage (output of ``ost compare-ligand-structures --help``):
-r reference.cif \
--lddt-pli --rmsd
Structures of polymer entities (proteins and nucleotides) can be given in PDB
or mmCIF format. In case of PDB format, the full loaded structure undergoes
processing described below. In case of mmCIF format, chains representing
"polymer" entities according to _entity.type are selected and further processed
as described below.
Structures of polymer entities (proteins and nucleotides) can be given in
legacy PDB or mmCIF format. In case of PDB format, the full loaded structure
undergoes processing described below. In case of mmCIF format, chains
representing "polymer" entities according to _entity.type are selected and
further processed as described below.
Structure cleanup is heavily based on the PDB component dictionary and performs
1) removal of hydrogens, 2) removal of residues for which there is no entry in
component dictionary, 3) removal of residues that are not peptide linking or
nucleotide linking according to the component dictionary 4) removal of atoms
that are not defined for respective residues in the component dictionary. Except
step 1), every cleanup is logged and a report is available in the json outfile.
Structure cleanup of polymer chains is heavily based on the compound library
and performs: 1) removal of hydrogens, 2) removal of residues for which there
is no entry in compound library, 3) removal of residues that are not peptide
linking or nucleotide linking according to the compound library 4) removal of
atoms that are not defined for respective residues in the compound library.
Except step 1), every cleanup is logged and a report is available in the json
outfile.
Only polymers (protein and nucleic acids) of model and reference are considered
for ligand binding sites. The mapping of possible reference/model chain
assignments requires a preprocessing. In short: identical chains in the
reference are grouped based on pairwise sequence identity
(see --chem-group-seqid-thresh). Each model chain is assigned to
one of these groups (see --chem-map-seqid-thresh param).
To avoid spurious matches, only polymers of a certain length are considered
in this matching procedure (see --min_pep_length/--min_nuc_length param).
Shorter polymers are never mapped and do not contribute to scoring.
reference are grouped based on pairwise sequence identity (see
--chem-group-seqid-thresh). Each model chain is assigned to one of these
groups (see --chem-map-seqid-thresh param). To avoid spurious matches, only
polymers of a certain length are considered in this matching procedure (see
--min_pep_length/--min_nuc_length param). Shorter polymers are never mapped
and do not contribute to scoring.
Ligands can be given as path to SDF files containing the ligand for both model
(--model-ligands/-ml) and reference (--reference-ligands/-rl). If omitted,
ligands are optionally detected from a structure file if it is given in mmCIF
format. This is based on "non-polymer" _entity.type annotation and the
respective entries must exist in the PDB component dictionary in order to get
connectivity information. You can avoid the requirement of the PDB component
dictionary by enabling --allow-heuristic-conn. In this case, connectivity
is established through a distance based heuristic if the ligand is not found in
the component dictionary. Be aware that this might be an issue in ligand
matching.
If you provide structures in PDB format, an error is raised if ligands are not
explicitely given in SDF format.
format (based on "non-polymer" _entity.type annotation). If you provide
structures in PDB format, an error is raised if ligands are not explicitly
given in SDF format.
Ligands undergo gentle processing where hydrogens are removed. Connectivity
is relevant for scoring. It is read directly from SDF input. If ligands are
extracted from mmCIF, connectivity is derived from the PDB component
dictionary. Polymer/oligomeric ligands (saccharides, peptides, nucleotides)
are not supported.
extracted from mmCIF, connectivity is derived from the compound library.
Ligands that are not present in the compound library are only supported in
fault-tolerant mode, where a distance based heuristic is used to connect the
ligand atoms. Be aware that this is unreliable and might cause issues with
ligand matching
Output can be written in two format: JSON (default) or CSV, controlled by the
--output-format/-of argument.
......@@ -601,41 +516,48 @@ Details on the usage (output of ``ost compare-ligand-structures --help``):
Without additional options, the JSON ouput is a dictionary with the following
keys:
* "model_ligands": A list of ligands in the model. If ligands were provided
explicitly with --model-ligands, elements of the list will be the paths to
the ligand SDF file(s). Otherwise, they will be the chain name, residue
number and insertion code of the ligand, separated by a dot.
* "reference_ligands": Same for reference ligands.
* "chem_groups": Groups of polypeptides/polynucleotides from reference that
are considered chemically equivalent, i.e. pass a pairwise sequence identity
threshold that can be controlled with --chem-group-seqid-thresh.
You can derive stoichiometry from this. Contains only chains that are
considered in chain mapping, i.e. pass a size threshold (defaults: 6 for
peptides, 4 for nucleotides).
* "chem_mapping": List of same length as "chem_groups". Assigns model chains to
the respective chem group. Again, only contains chains that are considered
in chain mapping. That is 1) pass the same size threshold as for chem_groups
2) can be aligned to any of the chem groups with a sequence identity
threshold that can be controlled by --chem-map-seqid-thresh.
* "mdl_chains_without_chem_mapping": Model chains that could be considered in
chain mapping, i.e. are long enough, but could not be mapped to any chem
group. Depends on --chem-map-seqid-thresh. A mapping for each model chain can
be enforced by setting it to 0.
* "status": SUCCESS if everything ran through. In case of failure, the only
content of the JSON output will be "status" set to FAILURE and an
additional key: "traceback".
* "ost_version": The OpenStructure version used for computation.
* "model_cleanup_log": Lists residues/atoms that have been removed in model
cleanup process.
* "reference_cleanup_log": Same for reference.
* "model_ligands": A list of ligands in the model. If ligands were provided
explicitly with --model-ligands, elements of the list will be the paths to
the ligand SDF file(s). Otherwise, they will be the chain name, residue
number and insertion code of the ligand, separated by a dot.
* "reference_ligands": Same for reference ligands.
* "chem_groups": Groups of polypeptides/polynucleotides from reference that
are considered chemically equivalent. Predefined if the reference is an mmCIF
file or if "seqres"/"trg-seqres-mapping" are provided manually. Alignments
of structure to SEQRES are established using residue numbers in these cases
and matching structure one letter codes and SEQRES are enforced.
In case of a PDB reference without predefined SEQRES, groups are established
using clustering based on pairwise alignments. Chains within
"chem_group_seqid_thresh" are considered equivalent and alignments are
established using residue numbers or Needleman-Wunsch
(see "residue-number-alignments" flag)
You can derive stoichiometry from this. Contains only chains that are
considered in chain mapping, i.e. pass a size threshold (defaults: 6 for
peptides, 4 for nucleotides).
* "chem_mapping": List of same length as "chem_groups". Assigns model chains to
the respective chem group. Again, only contains chains that are considered
in chain mapping. That is 1) pass the same size threshold as for chem_groups
2) can be aligned to any of the chem groups with a sequence identity
threshold that can be controlled by --chem-map-seqid-thresh.
* "mdl_chains_without_chem_mapping": Model chains that could be considered in
chain mapping, i.e. are long enough, but could not be mapped to any chem
group. Depends on --chem-map-seqid-thresh. A mapping for each model chain can
be enforced by setting it to 0.
* "status": SUCCESS if everything ran through. In case of failure, the only
content of the JSON output will be "status" set to FAILURE and an
additional key: "traceback".
* "ost_version": The OpenStructure version used for computation.
* "model_cleanup_log": Lists residues/atoms that have been removed in model
cleanup process.
* "reference_cleanup_log": Same for reference.
Additional keys represent input options.
Each score is opt-in and the respective results are available in three keys:
* "assigned_scores": A list with data for each pair of assigned ligands.
Data is yet another dict containing score specific information for that
ligand pair. The following keys are there in any case:
* "assigned_scores": A list with data for each pair of assigned ligands.
Data is yet another dict containing score specific information for that
ligand pair. The following keys are there in any case:
* "model_ligand": The model ligand
* "reference_ligand": The target ligand to which model ligand is assigned to
......@@ -643,11 +565,11 @@ Details on the usage (output of ``ost compare-ligand-structures --help``):
* "coverage": Fraction of model ligand atoms which are covered by target
ligand. Will only deviate from 1.0 if --substructure-match is enabled.
* "model_ligand_unassigned_reason": Dictionary with unassigned model ligands
as key and an educated guess why this happened.
* "model_ligand_unassigned_reason": Dictionary with unassigned model ligands
as key and an educated guess why this happened.
* "reference_ligand_unassigned_reason": Dictionary with unassigned target ligands
as key and an educated guess why this happened.
* "reference_ligand_unassigned_reason": Dictionary with unassigned target ligands
as key and an educated guess why this happened.
If --full-results is enabled, another element with key "full_results" is added.
This is a list of data items for each pair of model/reference ligands. The data
......@@ -661,31 +583,31 @@ Details on the usage (output of ``ost compare-ligand-structures --help``):
The following column is always available:
* reference_ligand/model_ligand: If reference ligands were provided explicitly
with --reference-ligands, elements of the list will be the paths to the
ligand SDF file(s). Otherwise, they will be the chain name, residue number
and insertion code of the ligand, separated by a dot. If the
--by-model-ligand-output flag was set, this will be model ligand instead,
following the same rules.
* reference_ligand/model_ligand: If reference ligands were provided explicitly
with --reference-ligands, elements of the list will be the paths to the
ligand SDF file(s). Otherwise, they will be the chain name, residue number
and insertion code of the ligand, separated by a dot. If the
--by-model-ligand-output flag was set, this will be model ligand instead,
following the same rules.
If LDDT-PLI was enabled with --lddt-pli, the following columns are added:
* "lddt_pli", "lddt_pli_coverage" and "lddt_pli_(model|reference)_ligand"
are the LDDT-PLI score result, the corresponding coverage and assigned model
ligand (or reference ligand if the --by-model-ligand-output flag was set)
if an assignment was found, respectively, empty otherwise.
* "lddt_pli_unassigned" is empty if an assignment was found, otherwise it
lists the short reason this reference ligand was unassigned.
* "lddt_pli", "lddt_pli_coverage" and "lddt_pli_(model|reference)_ligand"
are the LDDT-PLI score result, the corresponding coverage and assigned model
ligand (or reference ligand if the --by-model-ligand-output flag was set)
if an assignment was found, respectively, empty otherwise.
* "lddt_pli_unassigned" is empty if an assignment was found, otherwise it
lists the short reason this reference ligand was unassigned.
If BiSyRMSD was enabled with --rmsd, the following columns are added:
* "rmsd", "rmsd_coverage". "lddt_lp" "bb_rmsd" and
"rmsd_(model|reference)_ligand" are the BiSyRMSD, the corresponding
coverage, LDDT-LP, backbone RMSD and assigned model ligand (or reference
ligand if the --by-model-ligand-output flag was set) if an assignment
was found, respectively, empty otherwise.
* "rmsd_unassigned" is empty if an assignment was found, otherwise it
lists the short reason this reference ligand was unassigned.
* "rmsd", "rmsd_coverage". "lddt_lp" "bb_rmsd" and
"rmsd_(model|reference)_ligand" are the BiSyRMSD, the corresponding
coverage, LDDT-LP, backbone RMSD and assigned model ligand (or reference
ligand if the --by-model-ligand-output flag was set) if an assignment
was found, respectively, empty otherwise.
* "rmsd_unassigned" is empty if an assignment was found, otherwise it
lists the short reason this reference ligand was unassigned.
options:
-h, --help show this help message and exit
......@@ -698,56 +620,47 @@ Details on the usage (output of ``ost compare-ligand-structures --help``):
-rl [REFERENCE_LIGANDS ...], --ref-ligands [REFERENCE_LIGANDS ...], --reference-ligands [REFERENCE_LIGANDS ...]
Path to reference ligand files.
-o OUTPUT, --out OUTPUT, --output OUTPUT
Output file name. Default depends on format: out.json
or out.csv
Output file name. Default depends on format: out.json or out.csv
-mf {pdb,cif,mmcif}, --mdl-format {pdb,cif,mmcif}, --model-format {pdb,cif,mmcif}
Format of model file. pdb reads pdb but also pdb.gz,
same applies to cif/mmcif. Inferred from filepath if
not given.
Format of model file. pdb reads pdb but also pdb.gz, same applies to
cif/mmcif. Inferred from filepath if not given.
-rf {pdb,cif,mmcif}, --reference-format {pdb,cif,mmcif}, --ref-format {pdb,cif,mmcif}
Format of reference file. pdb reads pdb but also
pdb.gz, same applies to cif/mmcif. Inferred from
filepath if not given.
Format of reference file. pdb reads pdb but also pdb.gz, same applies
to cif/mmcif. Inferred from filepath if not given.
-of {json,csv}, --out-format {json,csv}, --output-format {json,csv}
Output format, JSON or CSV, in lowercase. default:
json
Output format, JSON or CSV, in lowercase. default: json
-csvm, --by-model-ligand, --by-model-ligand-output
For CSV output, this flag changes the output so that
each line reports one model ligand, instead of a
reference ligand. Has no effect with JSON output.
For CSV output, this flag changes the output so that each line reports
one model ligand, instead of a reference ligand. Has no effect with
JSON output.
--csv-extra-header CSV_EXTRA_HEADER
Extra header prefix for CSV output. This allows adding
additional annotations (such as target ID, group, etc)
to the output
Extra header prefix for CSV output. This allows adding additional
annotations (such as target ID, group, etc) to the output
--csv-extra-data CSV_EXTRA_DATA
Additional data (columns) for CSV output.
-mb MODEL_BIOUNIT, --model-biounit MODEL_BIOUNIT
Only has an effect if model is in mmcif format. By
default, the asymmetric unit (AU) is used for scoring.
If there are biounits defined in the mmcif file, you
can specify the ID (as a string) of the one which
should be used.
Only has an effect if model is in mmcif format. By default, the
asymmetric unit (AU) is used for scoring. If there are biounits defined
in the mmcif file, you can specify the ID (as a string) of the one
which should be used.
-rb REFERENCE_BIOUNIT, --reference-biounit REFERENCE_BIOUNIT
Only has an effect if reference is in mmcif format. By
default, the asymmetric unit (AU) is used for scoring.
If there are biounits defined in the mmcif file, you
can specify the ID (as a string) of the one which
should be used.
Only has an effect if reference is in mmcif format. By default, the
asymmetric unit (AU) is used for scoring. If there are biounits defined
in the mmcif file, you can specify the ID (as a string) of the one
which should be used.
-ft, --fault-tolerant
Fault tolerant parsing.
-rna, --residue-number-alignment
Make alignment based on residue number instead of
using a global BLOSUM62-based alignment (NUC44 for
nucleotides).
Make alignment based on residue number instead of using a global
BLOSUM62-based alignment (NUC44 for nucleotides).
-sm, --substructure-match
Allow incomplete (ie partially resolved) target
ligands.
Allow incomplete (ie partially resolved) target ligands.
-cd COVERAGE_DELTA, --coverage-delta COVERAGE_DELTA
Coverage delta for partial ligand assignment.
-v VERBOSITY, --verbosity VERBOSITY
Set verbosity level. Defaults to 2 (Script).
--full-results Outputs scoring results for all model/reference ligand
pairs and store as key "full_results"
--full-results Outputs scoring results for all model/reference ligand pairs and store
as key "full_results"
--lddt-pli Compute LDDT-PLI scores and store as key "lddt_pli".
--lddt-pli-radius LDDT_PLI_RADIUS
LDDT inclusion radius for LDDT-PLI.
......@@ -756,72 +669,49 @@ Details on the usage (output of ``ost compare-ligand-structures --help``):
--no-lddt-pli-add-mdl-contacts
DO NOT add model contacts when computing LDDT-PLI.
--rmsd Compute RMSD scores and store as key "rmsd".
--radius RADIUS Inclusion radius to extract reference binding site
that is used for RMSD computation. Any residue with
atoms within this distance of the ligand will be
included in the binding site.
--radius RADIUS Inclusion radius to extract reference binding site that is used for
RMSD computation. Any residue with atoms within this distance of the
ligand will be included in the binding site.
--lddt-lp-radius LDDT_LP_RADIUS
LDDT inclusion radius for LDDT-LP.
-fbs, --full-bs-search
Enumerate all potential binding sites in the model
when searching rigid superposition for RMSD
computation
Enumerate all potential binding sites in the model when searching rigid
superposition for RMSD computation
-ms MAX_SYMMETRIES, --max-symmetries MAX_SYMMETRIES
If more than that many isomorphisms exist for a
target-ligand pair, it will be ignored and reported as
unassigned.
If more than that many isomorphisms exist for a target-ligand pair, it
will be ignored and reported as unassigned.
--min-pep-length MIN_PEP_LENGTH
Default: 6 - Minimum length of a protein chain to be
considered for being part of a binding site.
Default: 6 - Minimum length of a protein chain to be considered for
being part of a binding site.
--min-nuc-length MIN_NUC_LENGTH
Default: 4 - Minimum length of a NA chain to be
considered for being part of a binding site.
Default: 4 - Minimum length of a NA chain to be considered for being
part of a binding site.
--chem-group-seqid-thresh CHEM_GROUP_SEQID_THRESH
Default: 95 - Sequence identity threshold used to
group identical chains in reference structure in the
chain mapping step. The same threshold is applied to
peptide and nucleotide chains.
Default: 95 - Sequence identity threshold used to group identical
chains in reference structure in the chain mapping step. The same
threshold is applied to peptide and nucleotide chains.
--chem-map-seqid-thresh CHEM_MAP_SEQID_THRESH
Default: 70 - Sequence identity threshold used to map
model chains to groups derived in the chem grouping
step in chain mapping. If set to 0., a mapping is
enforced and each model chain is assigned to the chem
group with maximum sequence identity. If larger than
0., a mapping only happens if the respective model
chain can be aligned to a chem group with the
specified sequence identity threshold AND if at least
min-pep-length/min-nuc-length residues are aligned.
The same threshold is applied to peptide and
nucleotide chains.
--seqres SEQRES Default: None - manually define chem groups by
specifying path to a fasta file. Each sequence in that
file is considered a reference sequence of a chem
group. All polymer chains in reference will be aligned
to these sequences. This only works if -rna/--residue-
number-alignment is enabled and an error is raised
otherwise. Additionally, you need to manually specify
a mapping of the polymer chains using trg-seqres-
mapping and an error is raised otherwise. The one
letter codes in the structure must exactly match the
respective characters in seqres and an error is raised
if not.
Default: 70 - Sequence identity threshold used to map model chains to
groups derived in the chem grouping step in chain mapping. If set to
0., a mapping is enforced and each model chain is assigned to the chem
group with maximum sequence identity. If larger than 0., a mapping only
happens if the respective model chain can be aligned to a chem group
with the specified sequence identity threshold AND if at least min-pep-
length/min-nuc-length residues are aligned. The same threshold is
applied to peptide and nucleotide chains.
--seqres SEQRES Default: None - manually define chem groups by specifying path to a
fasta file. Each sequence in that file is considered a reference
sequence of a chem group. All polymer chains in reference will be
aligned to these sequences. This only works if -rna/--residue-number-
alignment is enabled and an error is raised otherwise. Additionally,
you need to manually specify a mapping of the polymer chains using trg-
seqres-mapping and an error is raised otherwise. The one letter codes
in the structure must exactly match the respective characters in seqres
and an error is raised if not.
--trg-seqres-mapping TRG_SEQRES_MAPPING [TRG_SEQRES_MAPPING ...]
Default: None - Maps each polymer chain in reference
to a sequence in *seqres*. Each mapping is a key:value
pair where key is the chain name in reference and
value is the sequence name in seqres. So let's say you
have a homo-dimer reference with chains "A" and "B"for
which you provide a seqres file containing one
sequence with name "1". You can specify this mapping
with: --trg-seqres-mapping A:1 B:1
--allow-heuristic-conn
Default: False - Only relevant if ligands are
extracted from ref/mdl in mmCIF format. Connectivity
in these cases is based on the chemical component
dictionary. If you enable this flag, connectivity can
be established by a distance based heuristic if the
ligand is not present in the component dictionary.
This might cause issues in ligand matching, i.e. graph
matching.
Default: None - Maps each polymer chain in reference to a sequence in
*seqres*. Each mapping is a key:value pair where key is the chain name
in reference and value is the sequence name in seqres. So let's say you
have a homo-dimer reference with chains "A" and "B"for which you
provide a seqres file containing one sequence with name "1". You can
specify this mapping with: --trg-seqres-mapping A:1 B:1
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment