
Data generation for AFDB Modelling capabilities in ProMod3

This requires downloading the full proteomes database as described in https://github.com/deepmind/alphafold/blob/main/afdb/README.md.

This yields one tar file per proteome, which serves as the starting point.

On sciCORE, these are located at: /scicore/data/managed/AF_UniProt/frozen_221115T101000/proteomes

The afdb_proteom_to_data_chunks.py script generates one data chunk. It reads a list of filenames which must be generated manually, e.g.:

import os
files = os.listdir(<AFDB_PROTEOM_DIR>)
with open("afdb_proteom_files.txt", 'w') as fh:
    fh.write('\n'.join(files))

create_commands.py generates a command file that can be submitted as a batch job. Carefully check the variables defined at the top and adapt them to your needs. Before submitting, run one of these commands interactively to verify that the respective chunk file is created correctly.

Once all chunks are there, an indexed database can be created with:

from promod3.modelling import FSStructureServer
fs_server = FSStructureServer.FromDataChunks("afdb_data_chunks", "afdb_fs")

Data preparation for PentaMatch

The same data chunks are used to extract the sequences that are searched by PentaMatch. create_pentamatch_sequences.py generates a FASTA file with all sequences of the previously generated FSStructureServer. Execute with:

ost create_pentamatch_sequences.py --data_chunks <DATA_CHUNK_DIR> --fs_server <FS_SERVER> --out <PENTAMATCH>.fasta

The searchable PentaMatch object is generated interactively with:

from promod3.modelling import PentaMatch
PentaMatch.FromSeqList("<PENTAMATCH>.fasta", "<PENTAMATCH_DIR>",
                       entries_from_seqnames=True)

Be aware that the command above requires a substantial amount of memory; for 200e6 entries, 500 GB was sufficient.
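Extrapolating from the figure above (500 GB for 200e6 entries, i.e. roughly 2.5 kB per entry), a rough memory estimate for other database sizes can be sketched. This is a back-of-the-envelope linear extrapolation, not a guarantee:

```python
def estimate_pentamatch_memory_gb(n_entries, gb_per_entry=500.0 / 200e6):
    """Rough memory estimate for PentaMatch.FromSeqList.

    Extrapolates linearly from the single observed data point
    (200e6 entries -> 500 GB); actual usage may differ.
    """
    return n_entries * gb_per_entry

# e.g. estimate_pentamatch_memory_gb(200e6) -> 500.0
```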