Commit 09b1e070 authored by Gerardo Tauriello

SCHWED-6038: Added README for Sergey set

parent 03a49dc9
# Modelling of representative sequences from NMPFamsDB (DB on metagenomic families)
[Link to project in ModelArchive](https://modelarchive.org/doi/10.5452/ma-nmpfamsdb) (incl. background on project itself)
Input files for conversion:
- Directly from Sergey:
  - all_ptm_plddt.txt.gz: "csv" file (' ' as separator) with pTM and global pLDDT values for each model
  - all_pdb.zip: all produced models (5 models per modelled family)
  - all_pae.zip: PAE matrices ("as a text file") for each model
- From [NMPFamsDB downloads](https://bib.fleming.gr/NMPFamsDB/downloads):
  - "AlphaFold2 3D models (PDB)" (structures_all.tgz): selected PDB models for 80585 families (expected to match the top-ranked model in all_pdb.zip)
  - "Consensus Sequences (FASTA)" (consensus_all.fasta.gz): representative sequences for each family (expected to match the modelled sequence where a model is available; contains 25613 extra sequences, which is fine)
  - "AlphaFold2 Input Alignments (FASTA)" (alphafold_msa_all.tgz): MSA used for each AF run (first sequence expected to match the modelled sequence)
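
As a rough illustration of how the space-separated score file could be read, here is a minimal sketch. The column order (model name, then pTM, then global pLDDT) and the function names `parse_scores` and `read_scores_gz` are assumptions made for this example; the actual layout of all_ptm_plddt.txt.gz is not documented here and should be checked before use.

```python
import gzip


def parse_scores(lines):
    """Parse whitespace-separated rows into {model_name: {"ptm", "plddt"}}.

    ASSUMPTION: each row reads "<model_name> <pTM> <pLDDT>"; the real
    column order of all_ptm_plddt.txt.gz may differ.
    """
    scores = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        try:
            scores[fields[0]] = {
                "ptm": float(fields[1]),
                "plddt": float(fields[2]),
            }
        except ValueError:
            continue  # skip rows with non-numeric values, e.g. a header
    return scores


def read_scores_gz(path):
    """Open a gzip-compressed text file and parse its rows."""
    with gzip.open(path, "rt") as handle:
        return parse_scores(handle)
```

Calling `read_scores_gz("all_ptm_plddt.txt.gz")` would then return one entry per model, provided the assumed column order holds.
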
Modelling setup:
- Total of 106198 families identified from clustered JGI sequences; only subset with 3D structure (see paper for details)
- Using AlphaFold v2.0 for monomer predictions with ptm weights, producing 5 models, without model relaxation, without templates, ranked by pLDDT, starting from a custom MSA
- Custom MSA as extra modelling step ("created by calculating the central or 'pivot' sequence of each seed MSA, and refining each alignment using that sequence as the guide")
- Sequence source set to NMPFamsDB as representative sequences do not have a direct match to JGI sequences
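
The pLDDT-based ranking of the five models per family can be sketched as follows. This is only an illustration: the `ModelScore` class, the function names, and the model names in the usage note are hypothetical and not taken from translate2modelcif.py. The summary of maximum values reflects that the best pTM may belong to a different model than the best pLDDT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScore:
    """Global scores for one predicted model (hypothetical container)."""
    name: str
    plddt: float
    ptm: float


def rank_by_plddt(models):
    """Return (top_model, other_models), sorted by descending global pLDDT."""
    if not models:
        raise ValueError("no models to rank")
    ranked = sorted(models, key=lambda m: m.plddt, reverse=True)
    return ranked[0], ranked[1:]


def max_scores(models):
    """Maximum pLDDT and pTM over all models of a family.

    The two maxima are taken independently, so they can come from
    different models.
    """
    return {
        "max_plddt": max(m.plddt for m in models),
        "max_ptm": max(m.ptm for m in models),
    }
```

For example, given `ModelScore("m1", 70.1, 0.52)` and `ModelScore("m2", 72.4, 0.48)`, `rank_by_plddt` picks `m2` as the top model while `max_scores` reports the pTM of `m1`.
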
Special features here:
- No separate data for local pLDDT; values taken from B-factors of PDB files
- Needed to deal with UNK in sequences (here: global pLDDT, pTM and local PAE take UNK residues into account, but no data is available for local pLDDT as there are no coordinates for UNK residues; example: F000347)
- Custom MSA stored in accompanying data
- Extra models stored in accompanying data (incl. PAE but without duplicating custom MSA; similar setup as in human-heterodimers-w-crosslinks but here listing pTM and pLDDT in description of accompanying files)
- Note that the NMPFamsDB web site lists max. pTM and displays the model with max. pLDDT. In ModelArchive, we show the top-pLDDT-ranked model as main entry and the others in accompanying data, but list max. pTM and max. pLDDT in the model description.
- Used optional family name prefix to be able to batch process families in parallel (did 1062 parallel jobs)
- Soft PDB comparison to allow for some numerical differences between PDB stored on web and one in ModelCIF (due to random reruns of AF2 in Sergey's data; differences small enough to ignore; largest one in F004657 with RMSD of 0.204)
- 11 extra families had models in all_pdb.zip but those were skipped (no custom MSA there)
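
Two of the points above, reading local pLDDT from B-factors and the soft PDB comparison, can be sketched with plain Python. This is a minimal sketch and not the code in translate2modelcif.py: it only looks at CA atoms, computes an RMSD without any superposition, and uses a hypothetical threshold of 0.5 (the README only states that the largest observed RMSD was 0.204). All function names are made up for this example.

```python
import math


def read_ca_atoms(pdb_text):
    """Collect residue number, coordinates and B-factor of each CA atom.

    Uses the fixed-column PDB format. Residues without coordinates
    (e.g. UNK here) simply do not appear, so they get no local pLDDT.
    """
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            atoms.append({
                "resnum": int(line[22:26]),
                "xyz": (
                    float(line[30:38]),
                    float(line[38:46]),
                    float(line[46:54]),
                ),
                # AlphaFold writes per-residue pLDDT into the B-factor column
                "plddt": float(line[60:66]),
            })
    return atoms


def coordinate_rmsd(atoms_a, atoms_b):
    """RMSD over paired CA atoms, WITHOUT superposition (an assumption)."""
    if not atoms_a or len(atoms_a) != len(atoms_b):
        raise ValueError("atom lists must be non-empty and of equal length")
    total = sum(
        math.dist(a["xyz"], b["xyz"]) ** 2 for a, b in zip(atoms_a, atoms_b)
    )
    return math.sqrt(total / len(atoms_a))


def is_soft_match(pdb_text_a, pdb_text_b, max_rmsd=0.5):
    """True if both structures agree within max_rmsd (hypothetical cutoff)."""
    rmsd = coordinate_rmsd(read_ca_atoms(pdb_text_a), read_ca_atoms(pdb_text_b))
    return rmsd <= max_rmsd
```

Skipping the superposition is reasonable only if both files stem from the same prediction frame; if reruns produce rotated or translated models, a superposition step (for instance via OST) would be needed before computing the RMSD.
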
Content:
- translate2modelcif.py : script to do conversion (was run in a virtual environment with the same setup as the Docker container here but with OST 2.6 RC; used python-modelcif 0.9 and ModelCIF dictionary 1.4.5)