mmCIF File Format
The mmCIF file format is a container for structural entities provided by the PDB. Saving and loading happen through dedicated convenience functions (:func:`ost.io.LoadMMCIF`/:func:`ost.io.SaveMMCIF`). Here we provide more in-depth information on mmCIF IO and describe how to deal with information that goes beyond the legacy PDB format (:class:`MMCifInfo`, :class:`MMCifInfoCitation`, :class:`MMCifInfoTransOp`, :class:`MMCifInfoBioUnit`, :class:`MMCifInfoStructDetails`, :class:`MMCifInfoObsolete`, :class:`MMCifInfoStructRef`, :class:`MMCifInfoStructRefSeq`, :class:`MMCifInfoStructRefSeqDif`, :class:`MMCifInfoRevisions`, :class:`MMCifInfoEntityBranchLink`).
Reading mmCIF files
Categories Available
The following categories of a mmCIF file are considered by the reader:
- atom_site: Used to build the :class:`~ost.mol.EntityHandle`
- entity: Involved in setting :class:`~ost.mol.ChainType` of chains
- entity_poly: Involved in setting :class:`~ost.mol.ChainType` of chains
- citation: Goes into :class:`MMCifInfoCitation`
- citation_author: Goes into :class:`MMCifInfoCitation`
- exptl: Goes into :class:`MMCifInfo` as :attr:`~MMCifInfo.method`.
- refine: Goes into :class:`MMCifInfo` as :attr:`~MMCifInfo.resolution`, :attr:`~MMCifInfo.r_free` and :attr:`~MMCifInfo.r_work`.
- pdbx_struct_assembly: Used for :class:`MMCifInfoBioUnit`.
- pdbx_struct_assembly_gen: Used for :class:`MMCifInfoBioUnit`.
- pdbx_struct_oper_list: Used for :class:`MMCifInfoBioUnit`.
- struct: Details about a structure, stored in :class:`MMCifInfoStructDetails`.
- struct_conf: Stores secondary structure information (practically helices) in the :class:`~ost.mol.EntityHandle`
- struct_sheet_range: Stores secondary structure information for sheets in the :class:`~ost.mol.EntityHandle`
- pdbx_database_PDB_obs_spr: Verbose information on obsoleted/superseded entries, stored in :class:`MMCifInfoObsolete`
- struct_ref: Stored in :class:`MMCifInfoStructRef`
- struct_ref_seq: Stored in :class:`MMCifInfoStructRefSeq`
- struct_ref_seq_dif: Stored in :class:`MMCifInfoStructRefSeqDif`
- database_pdb_rev (mmCIF dictionary version < 5): Stored in :class:`MMCifInfoRevisions`
- pdbx_audit_revision_history and pdbx_audit_revision_details (mmCIF dictionary version >= 5): Used to fill :class:`MMCifInfoRevisions`
- pdbx_entity_branch and pdbx_entity_branch_link: Used for :class:`MMCifInfoEntityBranchLink`; a list of links is available via :meth:`~MMCifInfo.GetEntityBranchLinks` and :meth:`~MMCifInfo.GetEntityBranchByChain`
Notes:
- Structures in mmCIF format can have two chain names. The "new" chain name extracted from atom_site.label_asym_id is used to name the chains in the :class:`~ost.mol.EntityHandle`. The "old" (author provided) chain name is extracted from atom_site.auth_asym_id for the first atom of the chain. It is added as a string property named "pdb_auth_chain_name" to the :class:`~ost.mol.ChainHandle`. The mapping is also stored in :class:`MMCifInfo` as :meth:`~MMCifInfo.GetMMCifPDBChainTr` and :meth:`~MMCifInfo.GetPDBMMCifChainTr` if a non-empty SEQRES record exists for that chain (this should exclude ligands and water).

- Molecular entities in mmCIF are identified by an entity.id, which is extracted from atom_site.label_entity_id for the first atom of the chain. It is added as a string property named "entity_id" to the :class:`~ost.mol.ChainHandle`. Each chain is mapped to an ID in :class:`MMCifInfo` as :meth:`~MMCifInfo.GetMMCifEntityIdTr`.

- For more complex mappings, such as ligands which may be in the same "old" chain as the protein chain but are represented in a separate "new" chain in mmCIF, we also store :class:`string properties<ost.GenericPropContainer>` on a per-residue level. For mmCIF files from the PDB, there is a unique mapping between ("label_asym_id", "label_seq_id") and ("auth_asym_id", "auth_seq_id", "pdbx_PDB_ins_code"). The following data items are available:

  - atom_site.label_asym_id: residue.chain.name
  - atom_site.label_seq_id: residue.GetStringProp("resnum") (this is the same as residue.number for residues in polymer chains. For ligands, however, the residue number is unset in mmCIF, but it is set to 1 by OpenStructure.)
  - atom_site.label_entity_id: residue.GetStringProp("entity_id")
  - atom_site.auth_asym_id: residue.GetStringProp("pdb_auth_chain_name")
  - atom_site.auth_seq_id: residue.GetStringProp("pdb_auth_resnum")
  - atom_site.pdbx_PDB_ins_code: residue.GetStringProp("pdb_auth_ins_code")

  The last two properties might be missing (not empty) if atom_site.auth_seq_id or atom_site.pdbx_PDB_ins_code are not present in the mmCIF file.

- Missing values in the aforementioned data items will be denoted as . or ?.

- Author residue numbers (atom_site.auth_seq_id) and insertion codes (atom_site.pdbx_PDB_ins_code) are optional according to the mmCIF dictionary. The data items (whole columns) can be omitted in structures where the "new" residue numbers (atom_site.label_seq_id) are defined (to valid values). This is usually the case for polymer chains. However, non-polymer and water chains do not have valid "new" residue numbers. In structures containing such missing data, OST requires the presence of both "old" residue numbers and insertion codes in order to identify and build residues properly. It is a known limitation of the mmCIF format to allow ambiguous identifiers for waters (and ligands to some extent), so we have to require these additional identifiers.
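The per-residue properties listed above can be gathered with a small helper. What follows is a minimal sketch and not part of OpenStructure: ``collect_identifiers`` as well as the ``FakeChain``/``FakeResidue`` stand-ins are hypothetical names, introduced only so that the snippet runs without an OpenStructure installation. It assumes nothing more than a residue object exposing ``chain.name``, ``HasProp`` and ``GetStringProp``; the example values are arbitrary.

```python
# Hypothetical helper, not part of OpenStructure: collects the mmCIF
# identifiers of a residue from the string properties described above.
OPTIONAL_PROPS = {
    "auth_seq_id": "pdb_auth_resnum",
    "pdbx_PDB_ins_code": "pdb_auth_ins_code",
}


def collect_identifiers(residue):
    ids = {
        "label_asym_id": residue.chain.name,
        "label_seq_id": residue.GetStringProp("resnum"),
        "label_entity_id": residue.GetStringProp("entity_id"),
        "auth_asym_id": residue.GetStringProp("pdb_auth_chain_name"),
    }
    # These two properties are absent if the columns are missing in the file
    for item, prop in OPTIONAL_PROPS.items():
        ids[item] = residue.GetStringProp(prop) if residue.HasProp(prop) else None
    return ids


class FakeChain:
    """Stand-in for a chain handle (illustration only)."""

    def __init__(self, name):
        self.name = name


class FakeResidue:
    """Stand-in for a residue handle (illustration only)."""

    def __init__(self, chain_name, props):
        self.chain = FakeChain(chain_name)
        self._props = dict(props)

    def HasProp(self, key):
        return key in self._props

    def GetStringProp(self, key):
        return self._props[key]


res = FakeResidue("A", {"resnum": "1", "entity_id": "0",
                        "pdb_auth_chain_name": "P",
                        "pdb_auth_resnum": "71"})
ids = collect_identifiers(res)
print(ids)
```

With a structure loaded by :func:`ost.io.LoadMMCIF`, the same helper should be applicable to the residues of the entity, provided the properties were set by the reader.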
Info Classes
Information from mmCIF files that goes beyond structural data is kept in a special container, the :class:`MMCifInfo` class. Here is a detailed description of the annotations available.
.. class:: MMCifInfo

  This is the container for all bits of non-molecular data pulled from a mmCIF file.

.. class:: MMCifInfoCitation

  This stores citation information from an input file.

.. class:: MMCifInfoTransOp

  This stores operations needed to transform an :class:`~ost.mol.EntityHandle` into a bio unit.

.. class:: MMCifInfoBioUnit

  This stores information on how a structure is to be assembled to form the bio unit.

.. class:: MMCifInfoStructDetails

  Holds details about the structure.

.. class:: MMCifInfoObsolete

  Holds details on obsolete/superseded structures. The data is available both in the obsolete and in the replacement entries.

.. class:: MMCifInfoStructRef

  Holds the information of the struct_ref category. The category describes the link of polymers in the mmCIF file to sequences stored in external databases such as UniProt. The related categories struct_ref_seq and struct_ref_seq_dif also list differences between the sequences of the deposited structure and the sequences in the database. Two prominent examples of such differences are point mutations and/or expression tags.

.. class:: MMCifInfoStructRefSeq

  An aligned range of residues between a sequence in a reference database and the deposited sequence.

.. class:: MMCifInfoStructRefSeqDif

  A particular difference between the deposited sequence and the sequence in the database.

.. class:: MMCifInfoRevisions

  Revision history of a PDB entry. If you find a '?' somewhere, this means 'not set'.

.. class:: MMCifInfoEntityBranchLink

  Data from pdbx_entity_branch, most specifically pdbx_entity_branch_link. That is connectivity information for branched entities, e.g. carbohydrates/oligosaccharides. :class:`Conop Processors <ost.conop.Processor>` cannot easily connect them, so we use this information in :func:`ost.io.LoadMMCIF` to do that.
Data collected for a certain mmCIF entity.
Writing mmCIF files
Star Writer
The syntax of mmCIF is a subset of the CIF file syntax, which is itself a subset of the STAR file syntax. OpenStructure implements a simple :class:`StarWriter` that is able to write data in two ways:
- key-value: A category name and an attribute name that is linked to a value. Example:

  .. code-block:: text

    _citation.year 2024

  _citation.year is called a mmCIF token. It consists of a data category (_citation) and a data item (year), delimited by a ".".

- tabular: Represents several values for a mmCIF token. The tokens are written in a header which is followed by the respective values. Example:

  .. code-block:: text

    loop_
    _atom_site.group_PDB
    _atom_site.type_symbol
    _atom_site.label_atom_id
    _atom_site.label_comp_id
    _atom_site.label_asym_id
    _atom_site.label_entity_id
    _atom_site.label_seq_id
    _atom_site.label_alt_id
    _atom_site.Cartn_x
    _atom_site.Cartn_y
    _atom_site.Cartn_z
    _atom_site.occupancy
    _atom_site.B_iso_or_equiv
    _atom_site.auth_seq_id
    _atom_site.auth_asym_id
    _atom_site.id
    _atom_site.pdbx_PDB_ins_code
    ATOM N N SER A 0 1 . -47.333 0.941 8.834 1.00 52.56 71 P 0 ?
    ATOM C CA SER A 0 1 . -45.849 0.731 8.796 1.00 53.56 71 P 1 ?
    ATOM C C SER A 0 1 . -45.191 1.608 7.714 1.00 51.61 71 P 2 ?
    ...
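The anatomy of a mmCIF token can be illustrated with plain string handling. This is a minimal sketch that is independent of OpenStructure; ``split_token`` is a hypothetical helper name and the validity check is deliberately simplistic.

```python
def split_token(token):
    """Split a mmCIF token into (data category, data item) at the first '.'."""
    category, sep, item = token.partition(".")
    # A token starts with an underscore and has a non-empty category and item
    if not (category.startswith("_") and len(category) > 1 and sep and item):
        raise ValueError(f"not a valid mmCIF token: {token!r}")
    return category, item


category, item = split_token("_citation.year")
print(category, item)
```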
What follows is an example of how to use the :class:`StarWriter` and its associated objects. In principle that's enough to write a full mmCIF file, but you definitely want to check out the :class:`MMCifWriter`, which extends :class:`StarWriter` and extracts the relevant data from an OpenStructure :class:`ost.mol.EntityHandle`.
.. code-block:: python

  from ost import io
  import math

  writer = io.StarWriter()

  # Add key-value pair
  value = io.StarWriterValue.FromInt(42)
  data_item = io.StarWriterDataItem("_the", "answer", value)
  writer.Push(data_item)

  # Add tabular data: first the header, then one row per iteration
  loop_desc = io.StarWriterLoopDesc("_math_oper")
  loop_desc.Add("num")
  loop_desc.Add("sqrt")
  loop_desc.Add("square")
  loop = io.StarWriterLoop(loop_desc)
  for i in range(10):
      data = list()
      data.append(io.StarWriterValue.FromInt(i))
      data.append(io.StarWriterValue.FromFloat(math.sqrt(i), 3))
      data.append(io.StarWriterValue.FromInt(i*i))
      loop.AddData(data)
  writer.Push(loop)

  # Write this groundbreaking data into a file with name numbers.gz
  # and yes, it's directly gzipped
  writer.Write("numbers", "numbers.gz")
The content of the file written:
.. code-block:: text

  data_numbers
  _the.answer 42
  #
  loop_
  _math_oper.num
  _math_oper.sqrt
  _math_oper.square
  0 0.000 0
  1 1.000 1
  2 1.414 4
  3 1.732 9
  4 2.000 16
  5 2.236 25
  6 2.449 36
  7 2.646 49
  8 2.828 64
  9 3.000 81
  #
.. class:: StarWriterValue

  A value which is stored as string. Must be constructed from static constructor functions.

.. class:: StarWriterDataItem(category, attribute, value)

  Key-value data representation.

  :param category: The category name of the data item
  :type category: :class:`str`
  :param attribute: The attribute name of the data item
  :type attribute: :class:`str`
  :param value: The value of the data item
  :type value: :class:`StarWriterValue`

.. class:: StarWriterLoopDesc(category)

  Defines the header of a tabular data representation for the specified category.

  :param category: The category
  :type category: :class:`str`

.. class:: StarWriterLoop(desc)

  Allows populating a :class:`StarWriterLoopDesc` with data to get a full tabular data representation.

  :param desc: The header
  :type desc: :class:`StarWriterLoopDesc`

.. class:: StarWriter

  Can be populated with data which can then be written to a file.
mmCIF Writer
Data categories considered by the OpenStructure mmCIF writer are described in the following. The listed attributes are written to fulfill all dependencies in a mmCIF file according to mmcif_pdbx_v50.
- _atom_site

  - group_PDB
  - type_symbol
  - label_atom_id
  - label_asym_id
  - label_entity_id
  - label_seq_id
  - label_alt_id
  - Cartn_x
  - Cartn_y
  - Cartn_z
  - occupancy
  - B_iso_or_equiv
  - auth_seq_id
  - auth_asym_id
  - id
  - pdbx_PDB_ins_code

- _entity

  - id
  - type

- _struct_asym

  - id
  - entity_id

- _entity_poly

  - entity_id
  - type
  - pdbx_seq_one_letter_code
  - pdbx_seq_one_letter_code_can

- _entity_poly_seq

  - entity_id
  - mon_id
  - num

- _pdbx_poly_seq_scheme

  - asym_id
  - entity_id
  - mon_id
  - seq_id
  - pdb_strand_id
  - pdb_seq_num
  - pdb_ins_code

- _chem_comp

  - id
  - type

- _atom_type

  - symbol

- _pdbx_entity_branch

  - entity_id
  - type
The writer is designed to only require an OpenStructure :class:`ost.mol.EntityHandle`/ :class:`ost.mol.EntityView` as input but optionally performs preprocessing in order to separate residues of chains into valid mmCIF entities. This is controlled by the mmcif_conform flag which has significant impact on how chains are assigned to mmCIF entities, chain names and residue numbers. Ideally, the input is mmcif_conform which is the case when loading a structure from a valid mmCIF file with :func:`ost.io.LoadMMCIF`.
Behaviour when mmcif_conform is True
Expected properties when mmcif_conform is enabled:
- The residues in a chain all belong to the same mmCIF molecular entity. That is, for example, a polypeptide chain with all residues being peptide linking. In mmCIF lingo: an entity of type "polymer" which is of _entity_poly type "polypeptide(L)" and all residues being "L-PEPTIDE LINKING", though some glycines might be "PEPTIDE LINKING". Another example might be a ligand where the chain refers to an entity of type "non-polymer" and only contains that particular ligand.
- Each chain must have a chain type assigned (available as :func:`ost.mol.ChainHandle.GetType`) which refers to the entity type. For entity types "polymer" and "branched", the chain type also encodes the subtype. For a polymer chain, for example, the general CHAINTYPE_POLY is not sufficient; the more fine-grained polymer-specific type is expected, e.g. CHAINTYPE_POLY_PEPTIDE_D. The same is true for entities of type "branched", where a subtype such as CHAINTYPE_OLIGOSACCHARIDE is expected.
- The residue numbers in "polymer" chains must match the SEQRES of the underlying entity with 1-based indexing. Insertion codes are not allowed and raise an error.
- Each residue must have a valid chem class assigned (available as :func:`ost.mol.ResidueHandle.GetChemClass`). Even though this information would be available when reading valid mmCIF files, OpenStructure delegates this to the :class:`ost.conop.Processor` and thus requires a valid :class:`ost.conop.CompoundLib` when reading in order to correctly set them.
There is one quirk remaining: the assignment of underlying mmCIF entities. This is a challenge primarily for polymers. The current logic starts with an empty internal entity list and successively processes chains, trying to match each chain against the entities collected so far. If no match is found, a new entity gets generated and its SEQRES is set to what we observe in the chain residues given their residue numbers (i.e. the ATOMSEQ). If the first residue has residue number 10, the SEQRES gets prefixed by 9 elements using a default value (e.g. UNK for a chain of type CHAINTYPE_POLY_PEPTIDE_D). The same is done for gaps. Matching requires an exact match for ALL residues given their residue number with that SEQRES. However, one chain may resolve more residues than another, so you may have residues at locations that are undefined in the current SEQRES. If the fraction of matches with undefined locations does not exceed 5%, we still assume an overall match and fill in the previously undefined locations in the SEQRES with the newly gained information. This is a heuristic that works in most cases but potentially introduces errors in entity assignment. If you want to avoid that, you must set your entities manually and pass a list of :class:`MMCifWriterEntity` when calling :func:`MMCifWriter.SetStructure`.
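The matching heuristic described above can be sketched in plain Python. This is an illustration under stated assumptions and not the OpenStructure implementation: chains are represented as dictionaries mapping 1-based residue numbers to monomer names, undefined SEQRES locations are marked with ``None`` instead of a default such as UNK, and the 5% fraction is taken relative to the number of residues in the chain being matched. All function names are hypothetical.

```python
UNDEFINED = None  # assumption: marker for SEQRES locations not yet observed


def seqres_from_chain(chain):
    """Build a SEQRES from a non-empty {residue number: monomer name} dict."""
    seqres = [UNDEFINED] * max(chain)
    for num, mon_id in chain.items():
        seqres[num - 1] = mon_id
    return seqres


def try_match(seqres, chain, max_undefined=0.05):
    """Return the updated SEQRES if the chain matches, None otherwise."""
    n_undefined = 0
    for num, mon_id in chain.items():
        known = seqres[num - 1] if num <= len(seqres) else UNDEFINED
        if known is UNDEFINED:
            n_undefined += 1
        elif known != mon_id:
            return None  # mismatch at a defined location
    if n_undefined / len(chain) > max_undefined:
        return None  # too many residues at undefined locations
    # Fill in the previously undefined locations with the new information
    updated = list(seqres) + [UNDEFINED] * (max(chain) - len(seqres))
    for num, mon_id in chain.items():
        updated[num - 1] = mon_id
    return updated


def assign_entities(chains):
    """Assign each chain to an entity index, creating entities as needed."""
    entities, assignment = [], []
    for chain in chains:
        for idx, seqres in enumerate(entities):
            updated = try_match(seqres, chain)
            if updated is not None:
                entities[idx] = updated
                break
        else:
            idx = len(entities)
            entities.append(seqres_from_chain(chain))
        assignment.append(idx)
    return entities, assignment


chain_a = {n: "ALA" for n in range(1, 41)}  # residues 1..40
chain_b = {n: "ALA" for n in range(1, 42)}  # one extra residue (1/41 < 5%)
chain_c = {1: "GLY", 2: "ALA"}              # mismatch at residue 1
entities, assignment = assign_entities([chain_a, chain_b, chain_c])
print(assignment)
```

Note that the result depends on the order in which chains are processed, which is one reason why manually defined entities are the safer choice for difficult cases.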
If mmcif_conform is enabled, pretty much everything is in place and the previously listed mmCIF categories/attributes are written with a few special cases:
- _atom_site.auth_asym_id: Honours the residue string property "pdb_auth_chain_name" if set, uses the actual chain name otherwise. The string property is set in the mmCIF reader.
- _pdbx_poly_seq_scheme.pdb_strand_id: Same behaviour as _atom_site.auth_asym_id
- _atom_site.auth_seq_id: Honours the residue string property "pdb_auth_resnum" if set, uses the actual residue number otherwise. The string property is set in the mmCIF reader.
- _pdbx_poly_seq_scheme.pdb_seq_num: Same behaviour as _atom_site.auth_seq_id
- _atom_site.pdbx_PDB_ins_code: Honours the residue string property "pdb_auth_ins_code" if set, uses the actual residue insertion code otherwise. The string property is set in the mmCIF reader. If mmcif_conform is enabled, the actual residue insertion code can be expected to be empty though.
- _pdbx_poly_seq_scheme.pdb_ins_code: Same behaviour as _atom_site.pdbx_PDB_ins_code
Behaviour when mmcif_conform is False
If mmcif_conform is not enabled, the only expectation is that chem classes (available as :func:`ost.mol.ResidueHandle.GetChemClass`) are set. OpenStructure delegates this to the :class:`ost.conop.Processor` and thus requires a valid :class:`ost.conop.CompoundLib` when reading a structure. There will be significant preprocessing involving the split of chains which is purely based on the set chem classes. Each chain gets split with the following rules:
- separate chain of _entity.type "non-polymer" for each residue with chem class :class:`NON_POLYMER`/:class:`UNKNOWN`
- if any residue has chem class :class:`WATER`, all of them are collected into one separate chain with _entity.type "water"
- if any residue is a saccharide, i.e. has chem class :class:`SACCHARIDE`/:class:`L_SACCHARIDE`/:class:`D_SACCHARIDE`, all of them are collected into one separate chain of _entity.type "branched" and _pdbx_entity_branch.type "oligosaccharide".
- if any residue has chem class :class:`RNA_LINKING`, all of them are collected into one separate chain of _entity.type "polymer" and _entity_poly.type "polyribonucleotide".
- if any residue has chem class :class:`DNA_LINKING`, all of them are collected into one separate chain of _entity.type "polymer" and _entity_poly.type "polydeoxyribonucleotide".
- if any residue is peptide linking, all of them are collected into one separate chain of _entity.type "polymer" and _entity_poly.type "polypeptide(L)"/"polypeptide(D)". We only allow the following combinations of chem classes: either :class:`L_PEPTIDE_LINKING`/:class:`PEPTIDE_LINKING` or :class:`D_PEPTIDE_LINKING`/:class:`PEPTIDE_LINKING`. Mixing :class:`L_PEPTIDE_LINKING` and :class:`D_PEPTIDE_LINKING` raises an error.
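The splitting rules above can be sketched as follows. This is a plain Python illustration and not the OpenStructure implementation: chem classes are given as strings, ``split_chain`` is a hypothetical name, the order of the returned chains is arbitrary, and treating a chain that contains only :class:`PEPTIDE_LINKING` residues as "polypeptide(L)" is an assumption. The input residues are made up for the example.

```python
SACCHARIDES = {"SACCHARIDE", "L_SACCHARIDE", "D_SACCHARIDE"}
PEPTIDES = {"L_PEPTIDE_LINKING", "D_PEPTIDE_LINKING", "PEPTIDE_LINKING"}


def split_chain(residues):
    """Split (residue name, chem class) pairs of one chain.

    Returns a list of (entity type, subtype or None, residue names).
    """
    chains = []
    pooled = {"water": [], "sugar": [], "rna": [], "dna": [], "peptide": []}
    peptide_classes = set()
    for name, chem_class in residues:
        if chem_class in ("NON_POLYMER", "UNKNOWN"):
            chains.append(("non-polymer", None, [name]))  # one chain each
        elif chem_class == "WATER":
            pooled["water"].append(name)
        elif chem_class in SACCHARIDES:
            pooled["sugar"].append(name)
        elif chem_class == "RNA_LINKING":
            pooled["rna"].append(name)
        elif chem_class == "DNA_LINKING":
            pooled["dna"].append(name)
        elif chem_class in PEPTIDES:
            peptide_classes.add(chem_class)
            pooled["peptide"].append(name)
        else:
            raise ValueError(f"unhandled chem class: {chem_class}")
    if {"L_PEPTIDE_LINKING", "D_PEPTIDE_LINKING"} <= peptide_classes:
        raise ValueError("mixing L_PEPTIDE_LINKING and D_PEPTIDE_LINKING")
    # Assumption: only PEPTIDE_LINKING residues present -> polypeptide(L)
    is_d = "D_PEPTIDE_LINKING" in peptide_classes
    peptide_type = "polypeptide(D)" if is_d else "polypeptide(L)"
    for key, entity_type, subtype in (
        ("water", "water", None),
        ("sugar", "branched", "oligosaccharide"),
        ("rna", "polymer", "polyribonucleotide"),
        ("dna", "polymer", "polydeoxyribonucleotide"),
        ("peptide", "polymer", peptide_type),
    ):
        if pooled[key]:
            chains.append((entity_type, subtype, pooled[key]))
    return chains


example = [("ALA", "L_PEPTIDE_LINKING"), ("GLY", "PEPTIDE_LINKING"),
           ("HOH", "WATER"), ("HEM", "NON_POLYMER"),
           ("HOH", "WATER"), ("NAG", "D_SACCHARIDE")]
result = split_chain(example)
for entry in result:
    print(entry)
```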
Chain names are generated by iterating over "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz", continuing with AA, AB, AC etc. once the first cycle is through. There can therefore be as many chains as needed. The mmCIF entities are built the same way as for mmcif_conform with two differences: 1) the extracted SEQRES of a chain is the ATOMSEQ, i.e. the exact sequence of its residues; 2) entity matching happens through exact matches of SEQRES and is independent of residue numbers. As a consequence, the residue numbers written as _atom_site.label_seq_id no longer correspond to the actual residue numbers but refer to the location in ATOMSEQ.
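The chain naming scheme can be reproduced with a short generator. This is a minimal sketch, not the OpenStructure implementation; ``chain_names`` is a hypothetical name, and the ordering of names with two or more characters beyond "AA, AB, AC" is assumed to follow the same alphabet order.

```python
import itertools

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz"


def chain_names():
    """Yield A, B, ..., z, then AA, AB, ... with ever longer names."""
    for length in itertools.count(1):
        for combo in itertools.product(ALPHABET, repeat=length):
            yield "".join(combo)


names = list(itertools.islice(chain_names(), 64))
print(names[0], names[61], names[62], names[63])
```

Since the generator never terminates, names are drawn lazily, e.g. with ``itertools.islice`` or ``next``.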
Once split and new chain names assigned, the rest is straightforward. The special cases listed above (_atom_site.auth_asym_id, _pdbx_poly_seq_scheme.pdb_strand_id, _atom_site.auth_seq_id etc.) are treated the same as if mmcif_conform was true.
To see it all in action:
.. code-block:: python

  import ost
  from ost import io, conop

  ent = io.LoadMMCIF("1a0s", remote=True)
  writer = io.MMCifWriter()

  # The MMCifWriter is still an object of type StarWriter
  # I can decorate my mmCIF file with any data I want
  val = io.StarWriterValue.FromInt(42)
  data_item = io.StarWriterDataItem("_the", "answer", val)
  writer.Push(data_item)

  # pre-define mmCIF entity which is total nonsense
  entity_info = io.MMCifWriterEntityList()
  mon_ids = ost.StringList()
  mon_ids.append("ALA")
  mon_ids.append("GLU")
  mon_ids.append("ALA")
  lib = conop.GetDefaultLib()
  mmcif_ent = io.MMCifWriterEntity.FromPolymer("polypeptide(L)",
                                               mon_ids, lib)
  entity_info.append(mmcif_ent)

  # The actual relevant part... mmcif_conform can be set to
  # True, as we loaded from mmCIF file
  writer.SetStructure(ent, mmcif_conform = True,
                      entity_info = entity_info)

  # And write...
  writer.Write("1a0s", "1a0s.cif.gz")

  # The written mmCIF file will contain _the.answer and
  # writes out the mmCIF entity we defined above in
  # _entity_poly. However, nothing matches that entity...
.. class:: MMCifWriterEntity

  Defines a mmCIF entity which will be written by :class:`MMCifWriter`. Must be created from a static constructor function.

.. class:: MMCifWriterEntityList

  A list of :class:`MMCifWriterEntity` objects.

.. class:: MMCifWriter

  Inherits all functionality from :class:`StarWriter` and provides functionality to extract relevant mmCIF information from :class:`ost.mol.EntityHandle`/:class:`ost.mol.EntityView`.
Biounits
Biological assemblies, i.e. biounits, are an integral part of mmCIF files and their construction is fully defined in :class:`MMCifInfoBioUnit`. :func:`MMCifInfoBioUnit.PDBize` provides one possibility to construct such biounits with compatibility with the PDB format in mind, that is, single-character chain names, dumping all ligands into one chain etc. For a more mmCIF-style way of constructing biounits, check out :func:`ost.mol.alg.CreateBU` in the ost.mol.alg module.