# Modelling of full-length trimeric autotransporter adhesins (TAAs)
[Link to project in ModelArchive](https://www.modelarchive.org/doi/10.5452/ma-taas) (including background on the project itself)
Modelling setup:
- Step 1: AlphaFold-Multimer (AF), using v2.1.2 or v2.3.1 with different parameters, run on overlapping ranges of the full-length sequence
- Step 2 (if needed): AF models truncated, superposed, and assembled using MODELLER
- Step 3 (if multiple models were combined): custom quality estimate (QE) calculated by combining pLDDT and the MODELLER energy score
- Step 4 (for assemblies): domain annotation performed using a combination of DSSP, SamCC-Turbo, Foldseek, and US-align
- Single-domain models only involved step 1 and a truncation step (they are usually a subset of another model in the same set)
- Full-length sequences taken from NCBI.
Input files for conversion (a small reading sketch follows the list):
- Single ZIP file for each model
- Each ZIP file contains:
- assembly_info.json or domain_info.json with metadata on model
- assembly.cif or domain.cif with the model coordinates
- accompanying_data.zip or accompanying_data sub-folder with additional files as listed in the metadata
- image.png with an overview image for use in ModelArchive
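Below is a minimal sketch of how such a ZIP file can be inspected with Python's `zipfile` and `json` modules. File names are as listed above; `peek_model_zip` is an illustrative helper and not part of the conversion script, which uses a more complete parser.

```
import json
import zipfile

def peek_model_zip(zip_path):
    """Return (model_type, metadata_dict) for a single model ZIP file."""
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
        # the metadata JSON decides whether this is an assembly or a domain
        for info_name, mdl_type in (("assembly_info.json", "assembly"),
                                    ("domain_info.json", "domain")):
            matches = [n for n in names if n.endswith(info_name)]
            if matches:
                with zf.open(matches[0]) as json_file:
                    return mdl_type, json.load(json_file)
    raise RuntimeError(f"no metadata JSON found in {zip_path}")

# e.g.: mdl_type, metadata = peek_model_zip("ma-taas-0272.zip")
```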
Special features here compared to past projects:
- NCBI data loaded in bulk before doing the conversion (helpful for processing the large set in parallel).
- Generic `ZipFileHandler` class to simplify processing ZIP files.
- Fills `not_modeled_residue_ranges`, which was recently added to python-modelcif (see the first sketch after this list).
- Enables packing of associated files without dumping data to disk (avoids having to extract files from the provided ZIP files).
- Used software includes AlphaFold-Multimer, MODELLER, DSSP, Foldseek, SamCC-Turbo, and US-align.
- QE metrics linked to software used (fetched from protocol steps to avoid duplicated entries in ModelCIF).
- Complex modelling pipeline with different variants of the same software (AF) used to produce distinct intermediate files. Global definition of objects was needed to avoid duplicated entries in ModelCIF. Data input/output for protocol steps is handled with lists of keywords.
- Provided assembly.cif files contain additional mmCIF data to keep (categories pdbx_domain, pdbx_feature_domain, pdbx_domain_range kept and added using gemmi; see the second sketch after this list).
- Intermediate models kept as accompanying data in PDB format. AF models additionally come with scores in JSON format (processed from pickle files into more portable JSON) and the PAE matrix as a PNG file.
- Use of modelcif.reference.SeqDif to handle mismatches between NCBI and entity sequence (simplified compared to 2024-08-ma-dm-hisrep).
- Similar checking for issues as in 2024-04-ma-dm-prc.
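For the `not_modeled_residue_ranges` point above, here is a minimal sketch assuming python-modelcif 1.1, where model objects expose a `not_modeled_residue_ranges` list and a `modelcif.model.NotModeledResidueRange(asym_unit, seq_id_begin, seq_id_end)` class; the entity, chain, and range below are purely illustrative placeholders.

```
import modelcif
import modelcif.model

# illustrative target setup (sequence and names are placeholders)
entity = modelcif.Entity("MKKL" * 25, description="example target")
asym = modelcif.AsymUnit(entity, details="chain A")
assembly = modelcif.Assembly([asym], name="example assembly")

mdl = modelcif.model.AbInitioModel(assembly, name="example model")
# mark residues 1-20 of this chain as not modelled (e.g. a trimmed terminus);
# assumed API: NotModeledResidueRange(asym_unit, seq_id_begin, seq_id_end)
mdl.not_modeled_residue_ranges.append(
    modelcif.model.NotModeledResidueRange(asym, 1, 20))
```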
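For keeping the extra pdbx_domain* categories, a sketch of how such categories could be copied from an input file into a converted file with gemmi is shown below. The file names are placeholders; only the gemmi calls (`cif.read`, `sole_block`, `get_mmcif_category`, `set_mmcif_category`, `write_file`) reflect the actual library API.

```
import gemmi

src = gemmi.cif.read("assembly.cif")            # provided input file
dst = gemmi.cif.read("converted_modelcif.cif")  # placeholder output name
src_block = src.sole_block()
dst_block = dst.sole_block()
for cat in ("_pdbx_domain.", "_pdbx_feature_domain.", "_pdbx_domain_range."):
    data = src_block.get_mmcif_category(cat)
    if data:  # empty dict if the category is absent
        dst_block.set_mmcif_category(cat, data)
dst.write_file("converted_modelcif_with_domains.cif")
```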
Content:
- translate2modelcif.py : script to do the conversion (run in a virtual environment with the same setup as the Docker container here; uses OST 2.8, python-modelcif 1.1, and python-ihm 1.6)
- fetch_ncbi_entries.py : script used to obtain info from NCBI and store it in ncbi_data.json
- ncbi_data.json.gz : version of ncbi_data.json used for this conversion (from 24.9.2024); must be unpacked before running translate2modelcif.py
- ma-taas-0272.zip : example for a small assembly in the set
- ma-taas-9037.zip : example for a small domain in the set
- example_modelcif.zip : output from running the conversion of the examples with the commands below:
```
ost translate2modelcif.py ma-taas-0272.zip modelcif
ost translate2modelcif.py ma-taas-9037.zip modelcif
```
File added: fetch_ncbi_entries.py
# get info from NCBI entries for the full set
import requests, json, os, re, zipfile, time
import pandas as pd
import numpy as np
from xml.etree import ElementTree as ET
from ost import io
# hard-coded setup (paths and output files)
out_json_ncbi = "./ncbi_data.json"
out_json_data = "./summary_data.json" # collection of all assembly/domain.json for testing
assemblies_path = "./data_assemblies/taa_export_MA_180724"
domains_path = "./data_domains/domain_export_merged_200624"
#
def _warn_msg(msg):
"""Write a warning message to stdout."""
print(f"WARNING: {msg}")
def _get_ncbi_base_url():
return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
def _get_ncbi_docsums(params):
# params = parameters to pass to esummary
response = requests.get(_get_ncbi_base_url() + "esummary.fcgi",
params=params)
root = ET.fromstring(response.text)
docsums = root.findall(".//DocSum")
ncbi_dicts = []
for docsum in docsums:
ncbi_dict = {}
for cn in docsum:
if cn.tag == "Item":
cn_name = cn.get("Name")
cn_type = cn.get("Type")
if cn.text:
d = cn.text
if cn_type == "String":
ncbi_dict[cn_name] = d
elif cn_type == "Integer":
ncbi_dict[cn_name] = int(d)
elif cn_type == "Date":
# kept as string
ncbi_dict[cn_name] = d
else:
                        raise RuntimeError(
                            f"Unknown type {cn_type} for item {cn_name}"
                        )
else:
ncbi_dict[cn_name] = None
ncbi_dicts.append(ncbi_dict)
return ncbi_dicts
def _get_ncbi_data(ncbi_acs):
"""Fetch data from NCBI in bulk."""
max_num_ids = 2000 # recommended to do max. 5000 at once
ncbi_seq = {}
ncbi_info = {}
sel_ids = list(ncbi_acs)
print(f"FETCHING NCBI INFO FOR {len(sel_ids)} PROTEIN ACs")
while len(sel_ids) > 0:
# SOURCE: https://www.ncbi.nlm.nih.gov/books/NBK25501/
db = "protein"
# Upload the list of IDs using the epost endpoint
sel_ids = sel_ids[:max_num_ids]
params = {"db": db, "id": ",".join(sel_ids)}
response = requests.post(_get_ncbi_base_url() + "epost.fcgi",
data=params)
xml_string = response.text
root = ET.fromstring(xml_string)
query_key = root.find(".//QueryKey").text
web_env = root.find(".//WebEnv").text
# Fetch all sequences using the efetch endpoint via epost info
params = {"db": db, "query_key": query_key, "WebEnv": web_env,
"rettype": "fasta", "retmode": "text"}
response = requests.get(_get_ncbi_base_url() + "efetch.fcgi",
params=params)
ss = io.SequenceListFromString(response.text, "fasta")
for s in ss:
ncbi_ac = s.name.split()[0]
if ncbi_ac not in sel_ids:
# assume that we store them without version
ncbi_ac = ncbi_ac.rsplit('.', 1)[0]
assert ncbi_ac in sel_ids
ncbi_seq[ncbi_ac] = s
# Fetch all infos using the esummary endpoint via epost info
params = {"db": db, "query_key": query_key, "WebEnv": web_env}
for ncbi_dict in _get_ncbi_docsums(params):
ncbi_ac = ncbi_dict["AccessionVersion"]
if ncbi_ac not in sel_ids:
# assume that we store them without version
ncbi_ac = ncbi_ac.rsplit('.', 1)[0]
assert ncbi_ac in sel_ids
ncbi_info[ncbi_ac] = ncbi_dict
assert ncbi_info.keys() == ncbi_seq.keys()
# what's left?
# (random ones get dropped; reason unknown; so we try until all done)
sel_ids = [ncbi_ac for ncbi_ac in ncbi_acs \
if ncbi_ac not in ncbi_info]
print(f"- {len(ncbi_info)} DONE; {len(sel_ids)} TODO")
assert sorted(ncbi_acs) == sorted(ncbi_info)
# combine nicely for further use
ncbi_data = {
ncbi_ac: {
"seq_name": ncbi_seq[ncbi_ac].name,
"seq_str": ncbi_seq[ncbi_ac].string,
"info": ncbi_info[ncbi_ac]
} \
for ncbi_ac in ncbi_acs
}
return ncbi_data
def _get_ncbi_data_cached(ncbi_acs, ncbi_metadata_file):
"""Fetch dict with info for all NCBI proteins from file or web."""
if ncbi_metadata_file:
if os.path.exists(ncbi_metadata_file):
return json.load(open(ncbi_metadata_file))
else:
ncbi_data = _get_ncbi_data(ncbi_acs)
json.dump(ncbi_data, open(ncbi_metadata_file, "w"))
return ncbi_data
else:
return _get_ncbi_data(ncbi_acs)
def _get_ncbi_tax_info(tax_ids):
"""Fetch data from NCBI species in bulk."""
max_num_ids = 2000 # recommended to do max. 5000 at once
ncbi_info = {}
sel_ids = list(tax_ids)
print(f"FETCHING NCBI INFO FOR {len(sel_ids)} TAX IDs")
while len(sel_ids) > 0:
# keep track of TODOs to abort if no progress made
last_num_todo = len(sel_ids)
# SOURCE: https://www.ncbi.nlm.nih.gov/books/NBK25501/
db = "taxonomy"
# Upload the list of IDs using the epost endpoint
sel_ids = sel_ids[:max_num_ids]
params = {"db": db, "id": ",".join(sel_ids)}
response = requests.post(_get_ncbi_base_url() + "epost.fcgi",
data=params)
xml_string = response.text
root = ET.fromstring(xml_string)
query_key = root.find(".//QueryKey").text
web_env = root.find(".//WebEnv").text
# Fetch all infos using the esummary endpoint via epost info
params = {"db": db, "query_key": query_key, "WebEnv": web_env}
for ncbi_dict in _get_ncbi_docsums(params):
tax_id = str(ncbi_dict["TaxId"])
assert tax_id in sel_ids
ncbi_info[tax_id] = ncbi_dict
# what's left?
# (random ones get dropped; reason unknown; so we try until all done)
sel_ids = [tax_id for tax_id in tax_ids \
if tax_id not in ncbi_info]
print(f"- {len(ncbi_info)} DONE; {len(sel_ids)} TODO")
if last_num_todo == len(sel_ids):
print(f"ABORTING...failed to get {sel_ids}")
return ncbi_info
assert sorted(tax_ids) == sorted(ncbi_info)
return ncbi_info
def _check_ncbi_data(ncbi_data):
"""Run sanity checks on fetched data."""
non_live = []
outdated = [] # (AC, replaced-by)
for ncbi_ac, ncbi_item in ncbi_data.items():
ncbi_info = ncbi_item["info"]
if ncbi_info["Status"] != "live":
non_live.append(ncbi_ac)
if ncbi_info["ReplacedBy"]:
outdated.append((ncbi_ac, ncbi_info['ReplacedBy']))
if ncbi_info["AccessionVersion"] != ncbi_ac:
ncbi_info_ac = ncbi_info["AccessionVersion"].rsplit('.', 1)[0]
if ncbi_info_ac != ncbi_ac:
raise RuntimeError(f"NCBI AC is not AC")
mmcif_regex = "[][ \t_(),.;:\"&<>/\\\\{}'`~!@#$%?+=*A-Za-z0-9|^-]*"
description = ncbi_info["Title"]
if not re.fullmatch(mmcif_regex, description):
raise RuntimeError(f"Illegal characters found in title of " \
f"{ncbi_ac}: {description}")
tax_id = str(ncbi_info["TaxId"])
organism_scientific = ncbi_info["SpeciesName"]
if len(non_live) > 0:
msg = f"{len(non_live)} NCBI entries not live"
if len(non_live) < 20:
msg += f": {non_live}"
_warn_msg(msg)
if len(outdated) > 0:
ncbi_acs = [v[0] for v in outdated]
ncbi_rep = [v[1] for v in outdated]
msg = f"{len(ncbi_acs)} outdated NCBI entries"
if len(outdated) < 20:
msg += f": {ncbi_acs} replaced by {ncbi_rep}"
_warn_msg(msg)
return non_live, outdated
def _check_summary_data(json_data):
# collect info for further large scale checks
json_dicts = {}
df_dict_list = []
for item in json_data:
# keep JSON dict for easy access
json_dicts[item["ma_id"]] = item["json_dict"]
# prepare data for DataFrame
keys_to_keep = [
"ma_id", "mdl_type", "has_accompanying_data", "has_cif", "has_image"
]
new_item = {k: item[k] for k in keys_to_keep}
new_item["ncbi_ac"] = item["json_dict"]["source_sequence_RefSeq_ID"]
new_item["no_unparsed_files"] = len(item["unparsed_files"]) == 0
df_dict_list.append(new_item)
# keep data in DataFrame for further checks
df = pd.DataFrame(df_dict_list)
assert len(set(df.ma_id)) == len(df.ma_id)
df.set_index("ma_id", inplace=True)
for k, v in df.isna().any().to_dict().items():
if v:
_warn_msg(f"TO CHECK: NA found in summary data key {k}")
for k, v in df.all().to_dict().items():
if not v:
_warn_msg(f"TO CHECK: False values found in summary data key {k}")
def _parse_zip_file(zip_file_path):
data_from_zip = {
"ma_id": os.path.splitext(os.path.basename(zip_file_path))[0],
"zip_file_path": zip_file_path,
"mdl_type": None,
"json_dict": None,
"has_accompanying_data": False,
"has_cif": False,
"has_image": False,
"unparsed_files": [],
}
with zipfile.ZipFile(zip_file_path) as zf:
for file_name in sorted(zf.namelist()):
base_name = os.path.basename(file_name)
if base_name == "":
# we ignore entries for paths
continue
elif base_name == "accompanying_data.zip" and not data_from_zip["has_accompanying_data"]:
data_from_zip["has_accompanying_data"] = True
elif "accompanying_data/" in file_name:
data_from_zip["has_accompanying_data"] = True
elif base_name == "assembly_info.json" and data_from_zip["json_dict"] is None and data_from_zip["mdl_type"] in [None, "assembly"]:
data_from_zip["mdl_type"] = "assembly"
data_from_zip["json_dict"] = json.load(zf.open(file_name))
elif base_name == "assembly.cif" and not data_from_zip["has_cif"] and data_from_zip["mdl_type"] in [None, "assembly"]:
data_from_zip["mdl_type"] = "assembly"
data_from_zip["has_cif"] = True
elif base_name == "domain_info.json" and data_from_zip["json_dict"] is None and data_from_zip["mdl_type"] in [None, "domain"]:
data_from_zip["mdl_type"] = "domain"
data_from_zip["json_dict"] = json.load(zf.open(file_name))
elif base_name == "domain.cif" and not data_from_zip["has_cif"] and data_from_zip["mdl_type"] in [None, "domain"]:
data_from_zip["mdl_type"] = "domain"
data_from_zip["has_cif"] = True
elif base_name == "image.png" and not data_from_zip["has_image"]:
data_from_zip["has_image"] = True
else:
data_from_zip["unparsed_files"].append(file_name)
return data_from_zip
def _parse_path(path, exp_type):
t0 = time.time()
data_from_zips = []
for file_name in sorted(os.listdir(path)):
data_from_zip = _parse_zip_file(os.path.join(path, file_name))
if data_from_zip["mdl_type"] != exp_type:
print("WARNING:", path, file_name, data_from_zip["mdl_type"], exp_type)
data_from_zips.append(data_from_zip)
if len(data_from_zips) % 200 == 0:
print("DONE ZIP", path, len(data_from_zips), time.time() - t0)
return data_from_zips
def main():
# get all JSONs
if os.path.exists(out_json_data):
data_from_zips = json.load(open(out_json_data))
else:
data_from_zips = _parse_path(assemblies_path, "assembly") \
+ _parse_path(domains_path, "domain")
json.dump(data_from_zips, open(out_json_data, "w"))
# collect all NCBI ACs
ncbi_acs = [item["json_dict"]["source_sequence_RefSeq_ID"] \
for item in data_from_zips]
unique_acs = sorted(set(ncbi_acs))
# fetch NCBI data for each
ncbi_data = _get_ncbi_data_cached(unique_acs, out_json_ncbi)
# fetch all needed tax. infos (excluding overridden ones)
tax_ids = sorted(set(
str(ncbi_item["info"]["TaxId"]) \
for ncbi_item in ncbi_data.values()
))
ncbi_tax_infos = _get_ncbi_tax_info(tax_ids)
# some checks (expect to have all active tax. ids)
for item, exp_vals in [("Status", ["active"])]:
observed_values = sorted(set(
ncbi_tax_info[item] \
for ncbi_tax_info in ncbi_tax_infos.values()
))
if observed_values != exp_vals:
print(f"UNEXPECTED observed '{item}' values: {observed_values}")
# extract/keep scientific names
ncbi_species_names = {
ncbi_tax_id: ncbi_tax_info["ScientificName"] \
for ncbi_tax_id, ncbi_tax_info in ncbi_tax_infos.items()
}
# apply to data
for ncbi_ac, ncbi_item in ncbi_data.items():
tax_id = str(ncbi_item["info"]["TaxId"])
ncbi_item["info"]["SpeciesName"] = ncbi_species_names[tax_id]
# dump file
json.dump(ncbi_data, open(out_json_ncbi, "w"))
# do some checks at the end
_check_summary_data(data_from_zips)
_check_ncbi_data(ncbi_data)
if __name__ == "__main__":
main()