Data generation for AFDB Modelling capabilities in ProMod3
Requires downloading the full proteomes database as described in https://github.com/deepmind/alphafold/blob/main/afdb/README.md.
This gives one tar file per proteome, which serves as the starting point.
On sciCORE, these files are located at:
/scicore/data/managed/AF_UniProt/frozen_221115T101000/proteomes
The full AFDB is processed in chunks. The afdb_proteom_to_data_chunks.py
script generates one such chunk. It reads a list of filenames which needs to
be generated manually, e.g.:

```python
import os
files = os.listdir("<AFDB_PROTEOM_DIR>")
with open("afdb_proteom_files.txt", 'w') as fh:
    fh.write('\n'.join(files))
```
create_commands.py generates a command file which can be submitted as a
batch job. Carefully check the variables defined at the top and adapt them to
your needs. Run one of these commands interactively to verify that the
respective chunk file is created correctly before submitting.
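create_commands.py itself is repo-specific, but its core task, splitting the file list into fixed-size chunks and emitting one command per chunk, can be sketched as follows. All names here, including the chunk size and the arguments passed to afdb_proteom_to_data_chunks.py, are illustrative assumptions; check the actual script for its real interface.

```python
def create_commands(files, chunk_size=100):
    """Emit one command line per chunk of proteome tar files.

    Sketch only: the afdb_proteom_to_data_chunks.py arguments shown
    here are hypothetical, not the script's actual interface.
    """
    commands = []
    for i in range(0, len(files), chunk_size):
        chunk = files[i:i + chunk_size]
        out = f"afdb_data_chunks/chunk_{i // chunk_size}.dat"
        commands.append("ost afdb_proteom_to_data_chunks.py "
                        f"--tarballs {' '.join(chunk)} --out {out}")
    return commands

files = [f"proteome_{i}.tar" for i in range(250)]
cmds = create_commands(files, chunk_size=100)
print(len(cmds))  # → 3 (chunks of 100, 100, 50 files)
```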
Once all chunks are available, an indexed database can be created with:

```python
from promod3.modelling import FSStructureServer
fs_server = FSStructureServer.FromDataChunks("afdb_data_chunks", "afdb_fs")
```
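The idea behind such an indexed database is to merge the chunk data into one large file plus an index that maps each entry to its byte offset, so single structures can be fetched without loading everything. The sketch below illustrates that concept with an invented on-disk layout; it is not FSStructureServer's actual format.

```python
import os
import pickle

def build_indexed_db(chunks, db_dir):
    """Merge data chunks (dicts of key -> bytes) into one data file
    plus an index mapping each key to (offset, length).

    Conceptual sketch of an indexed database; the real
    FSStructureServer format differs.
    """
    os.makedirs(db_dir, exist_ok=True)
    index = {}
    with open(os.path.join(db_dir, "data.bin"), "wb") as data:
        for chunk in chunks:
            for key, blob in chunk.items():
                index[key] = (data.tell(), len(blob))
                data.write(blob)
    with open(os.path.join(db_dir, "index.pkl"), "wb") as fh:
        pickle.dump(index, fh)

def fetch(db_dir, key):
    """Random access: seek to the stored offset and read one entry."""
    with open(os.path.join(db_dir, "index.pkl"), "rb") as fh:
        offset, length = pickle.load(fh)[key]
    with open(os.path.join(db_dir, "data.bin"), "rb") as data:
        data.seek(offset)
        return data.read(length)
```

Retrieval touches only the index and one slice of the data file, which is what makes serving single entries from a ~200e6 entry database practical.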
Data preparation for PentaMatch
The same data chunks are used to extract the sequences that are searched by
PentaMatch. create_pentamatch_sequences.py generates a FASTA file with all
sequences of the previously generated FSStructureServer. Execute with:

```shell
ost create_pentamatch_sequences.py --data_chunks <DATA_CHUNK_DIR> --fs_server <FS_SERVER> --out <PENTAMATCH>.fasta
```
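The entries_from_seqnames=True flag used below implies that the sequence names in this FASTA file encode the integer entry indices of the FSStructureServer. A minimal sketch of parsing that convention (the convention itself is an assumption inferred from the flag name; verify against the PentaMatch documentation):

```python
def parse_fasta_entries(text):
    """Parse a FASTA string whose sequence names are integer entry
    indices, as implied by entries_from_seqnames=True (assumed)."""
    entries = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            # sequence name is the entry index in the FSStructureServer
            name = int(line[1:].split()[0])
            entries[name] = ""
        elif name is not None:
            entries[name] += line.strip()
    return entries

fasta = ">0\nMKTAYIAK\n>1\nGGGGSSSS\n"
print(parse_fasta_entries(fasta))  # → {0: 'MKTAYIAK', 1: 'GGGGSSSS'}
```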
The searchable PentaMatch object is generated interactively with:

```python
from promod3.modelling import PentaMatch
PentaMatch.FromSeqList("<PENTAMATCH>.fasta", "<PENTAMATCH_DIR>",
                       entries_from_seqnames=True)
```
Be aware that the command above requires a substantial amount of memory; for 200e6 entries, 500 GB was sufficient.
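The memory demand is plausible once you consider what a pentamer search index has to hold: every overlapping 5-mer of every sequence, mapped back to the entries containing it. The sketch below shows that core idea in plain Python; the actual PentaMatch implementation uses a far more compact on-disk representation.

```python
def build_pentamatch_index(sequences):
    """Map each overlapping 5-mer to the set of entry indices that
    contain it. Conceptual sketch only - not PentaMatch's actual,
    memory-efficient representation."""
    index = {}
    for entry_idx, seq in enumerate(sequences):
        for i in range(len(seq) - 4):
            index.setdefault(seq[i:i + 5], set()).add(entry_idx)
    return index

def query(index, seq):
    """Rank entries by the number of 5-mers shared with the query."""
    counts = {}
    for i in range(len(seq) - 4):
        for entry in index.get(seq[i:i + 5], ()):
            counts[entry] = counts.get(entry, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

seqs = ["MKTAYIAKQR", "MKTAYLLQRS", "GGGGGGGGGG"]
idx = build_pentamatch_index(seqs)
print(query(idx, "MKTAYIA"))  # → [0, 1]: entry 0 shares 3 pentamers, entry 1 shares 1
```

With ~200e6 entries, even a few shared pentamers per sequence multiply into billions of index postings, which is why hundreds of GB are needed during construction.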