# ModelCIF validation tool

This is a tool to check that the formatting of [ModelCIF](https://mmcif.wwpdb.org/dictionaries/mmcif_ma.dic/Index/) files complies with the ModelCIF format declaration (aka "dictionary"). Upon successful validation, a ModelCIF file can be extended with the dictionary version the file was compared to (option [`--extend-validated-file`](#add-dictionary-information-used-for-validation-to-modelcif-file)). For more basic [mmCIF](https://mmcif.wwpdb.org) validation, the dictionary of the underlying [PDBx/mmCIF](https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Index/) format is also available.

The easiest way to run validation is from a [Docker](https://www.docker.com) container.

The tool itself is a wrapper around the [`CifCheck`](https://github.com/rcsb/cpp-dict-pack) tool by [RCSB](https://www.rcsb.org/).

If you have questions about ModelCIF validation, feel free to contact the [MA team](https://modelarchive.org/contact).

[[_TOC_]]


## How to run the validation tool

This is just a description of the [validation tool](./validate-mmcif-file.py) itself. When running it from inside a container, the command needs to be prefixed with the instructions to start the container. Find information for running the validation Docker container in "[How to run the Docker container](#how-to-run-the-docker-container)".

Upon completion, if no error occurred while running the command, the validation tool returns a concise report in [JSON](https://www.json.org/json-en.html) format. That output is meant to be input to a website or any kind of nicely formatted report. Output can also be stored as a JSON formatted file. If the tested ModelCIF file is fully compliant with the ModelCIF format, the JSON output has:

- `status` "completed"
- no messages in the `cifcheck-errors` list
- no messages in the `diagnosis` list
- `versions` of the dictionaries the file was tested against

Format violations will be listed in `diagnosis`.

`cifcheck-errors` gathers errors from the `CifCheck` command. This has nothing to do with wrong formatting: messages in this list mean that `CifCheck` has "crashed". This should not happen, since possible issues with `CifCheck` should be caught by the validation tool. If you do see such messages, feel free to report them to the [MA team](https://modelarchive.org/contact).

The most basic way to invoke the validation tool is just with a ModelCIF file (example shows the command plus possible output):

```bash
$ validate-mmcif-file model.cif
{"cifcheck-errors":[],"status":"completed","diagnosis":[],"versions":[{"title":"mmcif_pdbx_v50.dic","version":"5.361","location":"https://raw.github.com/ihmwg/ModelCIF/master/base/mmcif_pdbx_v50.dic"},{"title":"mmcif_ma.dic","version":"1.4.3","location":"https://raw.github.com/ihmwg/ModelCIF/master/archive/mmcif_ma-v1.4.3.dic"}]}
$ 
```
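
A script that consumes the report could apply the compliance criteria listed above. The following is a minimal sketch in Python, assuming the report has been captured as a string. The function name `is_compliant` is hypothetical and not part of the validation tool, and the sample report is shortened to a single dictionary entry.

```python
import json

# Report as printed by the example above, shortened to one dictionary entry.
REPORT = (
    '{"cifcheck-errors":[],"status":"completed","diagnosis":[],'
    '"versions":[{"title":"mmcif_ma.dic","version":"1.4.3",'
    '"location":"https://raw.github.com/ihmwg/ModelCIF/master/archive/'
    'mmcif_ma-v1.4.3.dic"}]}'
)


def is_compliant(report_json):
    """Return True if the report describes a fully compliant ModelCIF file.

    Hypothetical helper: checks the four criteria from the list above.
    """
    report = json.loads(report_json)
    return (
        report.get("status") == "completed"
        and not report.get("cifcheck-errors")
        and not report.get("diagnosis")
        and bool(report.get("versions"))
    )


print(is_compliant(REPORT))
```

Any message in `diagnosis` or `cifcheck-errors` makes the helper return `False`, so the two cases still need to be told apart by looking at the lists themselves.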


### Add dictionary information used for validation to ModelCIF file

Since both dictionaries, ModelCIF and PDBx/mmCIF, represent actively developed file formats, different versions exist. When the dictionaries are extended, care is taken to make only non-breaking changes. The idea is that a ModelCIF file formatted following dictionary version 1.3 is still valid with dictionary version 1.4. But the version number also tells you which features to expect in a ModelCIF file, so it is a good idea to keep the version inside the file.

The validation tool can add the version upon positive validation, enabled by `--extend-validated-file` (`-e`).

`-e` can take an alternative file name to write the validated ModelCIF file to, e.g. if one wants to keep the original ModelCIF file unaltered:
```bash
$ validate-mmcif-file -e validated_model.cif model.cif
{"cifcheck-errors":[],"status":"completed","diagnosis":[],"versions":[{"title":"mmcif_pdbx_v50.dic","version":"5.361","location":"https://raw.github.com/ihmwg/ModelCIF/master/base/mmcif_pdbx_v50.dic"},{"title":"mmcif_ma.dic","version":"1.4.3","location":"https://raw.github.com/ihmwg/ModelCIF/master/archive/mmcif_ma-v1.4.3.dic"}]}
$ 
```
The last command will generate a new file `validated_model.cif` upon positive validation (`diagnosis` points to an empty list), with the `versions` added to the [`_audit_conform`](https://mmcif.wwpdb.org/dictionaries/mmcif_ma.dic/Categories/audit_conform.html) list inside the file.

To add the validation dictionaries to `_audit_conform` in the original ModelCIF file, invoke `-e` without an alternative file name. There is one catch: because of the way Python handles this kind of command line argument, `-e` consumes what follows it as a file name, as long as it does not start with a `-`. So `validate-mmcif-file -e model.cif` means that `-e` takes `model.cif` as its file name, and the command then fails because the ModelCIF file to be validated is missing. The solution is to put `-e` either in front of another option (anything that starts with a `-`) or, if there are no other command line arguments, at the very end after the ModelCIF file name:

```bash
$ validate-mmcif-file model.cif -e
{"cifcheck-errors":[],"status":"completed","diagnosis":[],"versions":[{"title":"mmcif_pdbx_v50.dic","version":"5.361","location":"https://raw.github.com/ihmwg/ModelCIF/master/base/mmcif_pdbx_v50.dic"},{"title":"mmcif_ma.dic","version":"1.4.3","location":"https://raw.github.com/ihmwg/ModelCIF/master/archive/mmcif_ma-v1.4.3.dic"}]}
$ 
```
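
The argument handling described above can be reproduced with a small stand-in parser. This is only a sketch: it assumes the option is defined with Python's `argparse` and `nargs="?"`, which matches the described behaviour but is not taken from the validation tool's source. The parser below, the `IN_PLACE` marker and the helper `try_parse` are hypothetical.

```python
import argparse

# Marker stored when -e is given without a file name (hypothetical value).
IN_PLACE = "<in place>"


def build_parser():
    """Build a stand-in parser; not the validation tool's real definition."""
    parser = argparse.ArgumentParser(prog="validate-mmcif-file")
    # nargs="?" lets -e take zero or one value.
    parser.add_argument(
        "-e", "--extend-validated-file", nargs="?", const=IN_PLACE, default=None
    )
    parser.add_argument("-a", "--associates-dir")
    parser.add_argument("model_cif")
    return parser


def try_parse(argv):
    """Parse argv; return None where argparse would exit with a usage error."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        return None


for argv in (
    ["-e", "model.cif"],  # -e swallows model.cif, nothing left to validate
    ["model.cif", "-e"],  # -e at the very end
    ["-e", "validated_model.cif", "model.cif"],  # alternative file name
    ["-e", "-a", "extra", "model.cif"],  # -e in front of another option
):
    args = try_parse(argv)
    print(argv, "->", "usage error" if args is None else args.extend_validated_file)
```

Only the first call ends in a usage error (argparse prints its message to `stderr`); the other three leave `model.cif` available as the file to validate.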


### Base directory for associated files

For a ModelCIF file using the [`_ma_entry_associated_files`](https://mmcif.wwpdb.org/dictionaries/mmcif_ma.dic/Categories/ma_entry_associated_files.html) category, the validation tool tries to merge associated data into the ModelCIF file, if [`_ma_entry_associated_files.file_format`](https://mmcif.wwpdb.org/dictionaries/mmcif_ma.dic/Items/_ma_entry_associated_files.file_format.html) is `cif` and [`_ma_entry_associated_files.file_content`](https://mmcif.wwpdb.org/dictionaries/mmcif_ma.dic/Items/_ma_entry_associated_files.file_content.html) is `local pairwise QA scores`. That way the outsourced data is validated, too.

Command line argument `--associates-dir` (`-a`) declares the base directory associated files are stored in. Inside the directory, the path must follow what is defined in [`_ma_entry_associated_files.file_url`](https://mmcif.wwpdb.org/dictionaries/mmcif_ma.dic/Items/_ma_entry_associated_files.file_url.html). If the URL is just the file name, the file must be stored right in the associates directory. The following example works for `_ma_entry_associated_files.file_url model_pae.cif` (`grep` and `ls` are just used to illustrate the data situation):

```bash
$ grep _ma_entry_associated_files.file_url model.cif
_ma_entry_associated_files.file_url model_pae.cif
$ ls extra
model_pae.cif
$ validate-mmcif-file -a extra model.cif
{"cifcheck-errors":[],"status":"completed","diagnosis":[],"versions":[{"title":"mmcif_pdbx_v50.dic","version":"5.361","location":"https://raw.github.com/ihmwg/ModelCIF/master/base/mmcif_pdbx_v50.dic"},{"title":"mmcif_ma.dic","version":"1.4.3","location":"https://raw.github.com/ihmwg/ModelCIF/master/archive/mmcif_ma-v1.4.3.dic"}]}
$ 
```

If the URL points to a subdirectory, this must be reflected by the associates directory tree declared to the validation tool. The following example illustrates that the `extra` directory needs a `pae` directory storing the associated file as expected by `_ma_entry_associated_files.file_url`:

```bash
$ grep _ma_entry_associated_files.file_url model.cif
_ma_entry_associated_files.file_url pae/model_pae.cif
$ ls extra
pae
$ ls extra/pae
model_pae.cif
$ validate-mmcif-file -a extra model.cif
{"cifcheck-errors":[],"status":"completed","diagnosis":[],"versions":[{"title":"mmcif_pdbx_v50.dic","version":"5.361","location":"https://raw.github.com/ihmwg/ModelCIF/master/base/mmcif_pdbx_v50.dic"},{"title":"mmcif_ma.dic","version":"1.4.3","location":"https://raw.github.com/ihmwg/ModelCIF/master/archive/mmcif_ma-v1.4.3.dic"}]}
$ 
```
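
The lookup rule from both examples boils down to joining the associates directory with the value of `_ma_entry_associated_files.file_url`. The sketch below illustrates that rule in Python. It assumes `file_url` is a relative path, as in the examples; the function `resolve_associated_file` is hypothetical and does not claim to mirror the validation tool's implementation.

```python
from pathlib import Path
import tempfile


def resolve_associated_file(associates_dir, file_url):
    """Return the path an associated file is expected at.

    Hypothetical helper; assumes file_url is a relative path.
    """
    path = Path(associates_dir) / file_url
    if not path.is_file():
        raise FileNotFoundError(f"Associated file not found: {path}")
    return path


# Recreate the layout of the second example in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    extra = Path(tmp) / "extra"
    (extra / "pae").mkdir(parents=True)
    (extra / "pae" / "model_pae.cif").write_text("data_pae\n")
    found = resolve_associated_file(extra, "pae/model_pae.cif")
    print(found.relative_to(tmp))
```

Running such a check before validation gives a clear message when the directory tree passed to `-a` does not match the URLs in the ModelCIF file.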


### Misc. arguments

**`--help`** (**`-h`**) Print a help/usage page for the validation tool.

**`--dict-sdb <SDB FILE>`** (**`-d`**) Format dictionary in (binary) SDB format used for validating a ModelCIF file. The Docker container comes with an SDB for ModelCIF (`/usr/local/share/mmcif-dict-suite/mmcif_ma.sdb`) and one for the original PDBx/mmCIF (`/usr/local/share/mmcif-dict-suite/mmcif_pdbx_v50.dic.sdb`) format.

**`--out-file <JSON FILE>`** (**`-o`**) Instead of printing the output to `stdout`, store it in a JSON formatted file.

**`--verbose`** (**`-v`**) Be more talkative.
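
A report stored with `--out-file` can be read back with any JSON library. As a minimal sketch, assuming the file has the same structure as the output printed to `stdout`, the hypothetical helper `dictionary_versions` below lists the dictionaries a file was validated against. The file name `report.json` is made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def dictionary_versions(report_path):
    """Return 'title version' strings from a stored JSON report."""
    with open(report_path, encoding="utf8") as handle:
        report = json.load(handle)
    return [f"{d['title']} {d['version']}" for d in report.get("versions", [])]


# Demonstration with a report written to a temporary file.
SAMPLE = {
    "cifcheck-errors": [],
    "status": "completed",
    "diagnosis": [],
    "versions": [
        {"title": "mmcif_pdbx_v50.dic", "version": "5.361"},
        {"title": "mmcif_ma.dic", "version": "1.4.3"},
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    report_file = Path(tmp) / "report.json"
    report_file.write_text(json.dumps(SAMPLE), encoding="utf8")
    for line in dictionary_versions(report_file):
        print(line)
```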


## How to run the Docker container

Calling the validation tool stays (almost) the same; it just needs instructions to start the Docker container as a prefix:

```bash
$ docker run --rm -v /home/user/models:/data registry.scicore.unibas.ch/schwede/modelcif-converters/mmcif-dict-suite:latest validate-mmcif-file /data/model.cif
{"cifcheck-errors":[],"status":"completed","diagnosis":[],"versions":[{"title":"mmcif_pdbx_v50.dic","version":"5.361","location":"https://raw.github.com/ihmwg/ModelCIF/master/base/mmcif_pdbx_v50.dic"},{"title":"mmcif_ma.dic","version":"1.4.3","location":"https://raw.github.com/ihmwg/ModelCIF/master/archive/mmcif_ma-v1.4.3.dic"}]}
$ 
```

- [`docker run`](https://docs.docker.com/engine/reference/commandline/run/) starts a new Docker container from image `registry.scicore.unibas.ch/schwede/modelcif-converters/mmcif-dict-suite:latest` and executes the `validate-mmcif-file` command inside the container.
- `--rm` makes sure that the container is removed from the system once the job has completed.
- `-v` mounts directory `/home/user/models` from the local host computer to `/data` inside the Docker container, otherwise the `validate-mmcif-file` command has no access to local files. The bind mount makes the ModelCIF file `/home/user/models/model.cif` available as `/data/model.cif` to commands executed by `docker run`. Keep in mind, `validate-mmcif-file -e` and `validate-mmcif-file -a` also need to refer to `/data` (or any other local directory mounted in the Docker container).
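
When the container is started from a script, the translation between the host directory and the mount point is easy to get wrong. The sketch below assembles the command from the example above as an argument list. It uses only the `docker run` options already shown; the function `docker_command` and its parameters are hypothetical, and the command is printed rather than executed.

```python
from pathlib import PurePosixPath

IMAGE = (
    "registry.scicore.unibas.ch/schwede/modelcif-converters/"
    "mmcif-dict-suite:latest"
)


def docker_command(host_dir, model_file, mount_point="/data"):
    """Build the `docker run` argument list for validating one file.

    model_file is given relative to host_dir; inside the container it is
    addressed relative to mount_point instead.
    """
    container_path = PurePosixPath(mount_point) / model_file
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{mount_point}",
        IMAGE,
        "validate-mmcif-file", str(container_path),
    ]


print(" ".join(docker_command("/home/user/models", "model.cif")))
```

The returned list can be handed to `subprocess.run` as is. File names for `-e` and `-a` would need the same translation to `/data`.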


## How to get the Docker container

Before running the Docker container, you need a local copy of its image. There are three ways to get it:

- `docker run` will pull it automatically upon first call
- [`docker pull`](https://docs.docker.com/engine/reference/commandline/pull/) the Docker image yourself before running it
- [`docker build`](https://docs.docker.com/engine/reference/commandline/build/) the Docker image from scratch


### How to pull a copy of the Docker container from our registry

With `docker pull`, the ready-made Docker image can be fetched from our [Docker registry](https://git.scicore.unibas.ch/schwede/modelcif-converters/container_registry/). Two kinds of Docker images are available, differentiated by tags. The `latest` tag refers to the Docker image with the most recent ModelCIF dictionary. This should be the default choice. For specific use cases, e.g. debugging, we also provide Docker images for older versions of the ModelCIF dictionary. Those are tagged with the version number of the dictionary. The `latest` image is pulled like this:

```terminal
$ docker pull registry.scicore.unibas.ch/schwede/modelcif-converters/mmcif-dict-suite:latest
```


### How to build the Docker container from scratch

Here is the command we use to generate the Docker image. It works when executed from within the [`validation/`](./validation) subdirectory of the [Git repository](https://git.scicore.unibas.ch/schwede/modelcif-converters):

```terminal
docker build -t registry.scicore.unibas.ch/schwede/modelcif-converters/mmcif-dict-suite:latest .
```

When developing your own tools using the Docker image, there is one [build argument](https://docs.docker.com/engine/reference/commandline/build/#set-build-time-variables---build-arg) that adds an [editor](https://www.gnu.org/software/emacs/), [Black](https://black.readthedocs.io/en/stable/), [Pylint](https://pylint.org) and [bash](https://tiswww.case.edu/php/chet/bash/bashtop.html) to ease working in interactive sessions inside the Docker container:

```terminal
docker build --build-arg ADD_DEV=yes -t registry.scicore.unibas.ch/schwede/modelcif-converters/mmcif-dict-suite:dev .
```

The [`pyproject.toml`](pyproject.toml) we use can be found in the Git repository root.

## Files in this directory

|Path       |Content                                                         |
|-----------|----------------------------------------------------------------|
|[Dockerfile](./Dockerfile)|Build instructions for the Docker image|
|[README.md](./README.md)|This README|
|[entrypoint.sh](./entrypoint.sh)|Script executed on Docker container start|
|[get-mmcif-dict-versions.py](./get-mmcif-dict-versions.py)|Extract versions of mmCIF dictionaries, used for building the Docker image. Copied into the image as `get-mmcif-dict-versions.py`.|
|[validate-mmcif-file.py](./validate-mmcif-file.py)|Validation tool, copied into the image as `validate-mmcif-file`.|

<!--  LocalWords:  PDBx ModelCIF TOC JSON CifCheck RCSB mmcif cif pdbx dic dir
      LocalWords:  url pae sdb SDB stdout modelcif cifcheck arg Pylint DEV md
      LocalWords:  pyproject toml README entrypoint py
 -->