SCHWED-6038: Added README for Sergey set

09b1e070 · Gerardo Tauriello · 03a49dc9 · 09b1e070
Commit 09b1e070 authored 1 year ago by Gerardo Tauriello
--- a/projects/dark-matter-metagenomics/README.md
+++ b/projects/dark-matter-metagenomics/README.md
+# Modelling of representative sequences from NMPFamsDB (DB on metagenomic families)
+
+[Link to project in ModelArchive](https://modelarchive.org/doi/10.5452/ma-nmpfamsdb) (incl. background on project itself)
+
+Input files for conversion:
+- Directly from Sergey:
+  - all_ptm_plddt.txt.gz: space-separated text file ("csv" with ' ' as separator) with pTM and global pLDDT values for each model
+  - all_pdb.zip: all produced models (5 models per modelled family)
+  - all_pae.zip: PAE matrices ("as a text file") for each model
+- From [NMPFamsDB downloads](https://bib.fleming.gr/NMPFamsDB/downloads):
+  - "AlphaFold2 3D models (PDB)" (structures_all.tgz): selected PDB models for 80585 families (expected to match top ranked one in all_pdb.zip)
+  - "Consensus Sequences (FASTA)" (consensus_all.fasta.gz): representative sequences for each family (expected to match modelled sequence where model available; contains 25613 extra sequences which is fine)
+  - "AlphaFold2 Input Alignments (FASTA)" (alphafold_msa_all.tgz): MSA used for each AF run (first sequence expected to match modelled sequence)
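
As an illustration of reading the space-separated score file listed above, here is a minimal sketch. The column layout (`<model name> <pTM> <pLDDT>`) and the function names are assumptions made for this example; the README only states that the file is space-separated and holds pTM and global pLDDT per model.

```python
import gzip


def read_scores(lines):
    """Parse lines of the ASSUMED form '<model name> <pTM> <pLDDT>'.

    Returns a dict mapping model name to {"ptm": float, "plddt": float}.
    Lines that do not fit this layout (blank lines, headers) are skipped.
    """
    scores = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # blank line or unexpected number of columns
        name, ptm, plddt = fields
        try:
            scores[name] = {"ptm": float(ptm), "plddt": float(plddt)}
        except ValueError:
            continue  # non-numeric values, e.g. a header line
    return scores


def read_scores_gz(path):
    """Read scores from a gzip-compressed text file such as all_ptm_plddt.txt.gz."""
    with gzip.open(path, "rt") as handle:
        return read_scores(handle)
```

Keeping the parser separate from the file handling makes it easy to check against a few hand-written lines before running it on the full file.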
+
+Modelling setup:
+- Total of 106198 families identified from clustered JGI sequences; only a subset has a 3D structure (see paper for details)
+- Using AlphaFold v2.0 for monomer predictions with pTM weights, producing 5 models per family, without model relaxation, without templates, ranked by pLDDT, starting from a custom MSA
+- Custom MSA as extra modelling step ("created by calculating the central or 'pivot' sequence of each seed MSA, and refining each alignment using that sequence as the guide")
+- Sequence source set to NMPFamsDB, as representative sequences do not have a direct match to JGI sequences
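
The ranking by pLDDT mentioned in the setup above can be sketched as follows. The tuple layout and function names are hypothetical and chosen only for this example. Since the model with the highest pLDDT need not be the one with the highest pTM, the sketch reports both maxima separately.

```python
def rank_by_plddt(models):
    """Sort (name, pTM, global pLDDT) tuples by pLDDT, highest first."""
    return sorted(models, key=lambda model: model[2], reverse=True)


def summarise(models):
    """Return the top-ranked model name plus max. pTM and max. pLDDT."""
    ranked = rank_by_plddt(models)
    return {
        "top_model": ranked[0][0],
        "max_ptm": max(model[1] for model in models),
        "max_plddt": ranked[0][2],
    }
```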
+
+Special features here:
+- No separate per-residue pLDDT data was provided, so local pLDDT is taken from the B-factors of the PDB files
+- Needed to deal with UNK residues in sequences: global pLDDT, pTM and local PAE take them into account, but no local pLDDT is available for them as UNK residues have no coordinates (example: F000347)
+- Custom MSA stored in accompanying data
+- Extra models stored in accompanying data (incl. PAE but without duplicating the custom MSA; similar setup to human-heterodimers-w-crosslinks but here listing pTM and pLDDT in the description of accompanying files)
+- Note that the NMPFamsDB web site lists max. pTM and displays the model with max. pLDDT. In ModelArchive, we show the top-pLDDT-ranked model as main entry and the others in accompanying data, but list max. pTM and max. pLDDT in the model description.
+- Used optional family name prefix to be able to batch process families in parallel (ran 1062 parallel jobs)
+- Soft PDB comparison to tolerate small numerical differences between the PDB file stored on the web site and the one in ModelCIF (due to random reruns of AF2 in Sergey's data; differences are small enough to ignore; the largest one is in F004657 with an RMSD of 0.204)
+- 11 extra families had models in all_pdb.zip but those were skipped (no custom MSA for them)
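
The idea behind the soft PDB comparison and the B-factor-based pLDDT can be illustrated with a small standard-library sketch. This is not the code in translate2modelcif.py: the function names, the comparison without prior superposition and the default threshold are all assumptions made for illustration (the README only reports a largest observed RMSD of 0.204).

```python
import math


def read_atoms(pdb_text):
    """Map (chain, residue number, atom name) to ((x, y, z), B-factor).

    Uses the fixed-column layout of PDB ATOM records.
    """
    atoms = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        key = (line[21], int(line[22:26]), line[12:16].strip())
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms[key] = (xyz, float(line[60:66]))
    return atoms


def residue_plddt(atoms):
    """Per-residue pLDDT read from B-factors (AF2 writes one value per residue)."""
    return {(chain, resnum): b for (chain, resnum, _), (_, b) in atoms.items()}


def coordinate_rmsd(atoms_a, atoms_b):
    """RMSD over matching atoms, WITHOUT superposition (an assumption here)."""
    if not atoms_a or atoms_a.keys() != atoms_b.keys():
        raise ValueError("atom sets are empty or differ")
    total = sum(math.dist(atoms_a[k][0], atoms_b[k][0]) ** 2 for k in atoms_a)
    return math.sqrt(total / len(atoms_a))


def softly_equal(pdb_a, pdb_b, max_rmsd=0.5):
    """True if both files hold the same atoms within max_rmsd.

    The default of 0.5 is a hypothetical threshold, not taken from the project.
    """
    return coordinate_rmsd(read_atoms(pdb_a), read_atoms(pdb_b)) <= max_rmsd
```

A strict text comparison would flag every rerun as different, whereas a threshold on the RMSD only flags files whose coordinates diverge noticeably.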
+
+Content:
+- translate2modelcif.py: script to do the conversion (was run in a virtual environment with the same setup as the Docker container here but with OST 2.6 RC; used python-modelcif 0.9 and ModelCIF dict. 1.4.5)