HelmholtzAI-FZJ
/

oneprot-4

PyTorch

Model card Files Files and versions

xet

Community

sealinka commited on May 12, 2025

Commit

3250adb

verified ·

1 Parent(s): 28a2698

Create README.md

Browse files

Files changed (1) hide show

README.md +192 -0

README.md ADDED Viewed

	@@ -0,0 +1,192 @@

+---
+license: mit
+---
+[Github repo](https://github.com/klemens-floege/oneprot/)|
+[Paper link](https://arxiv.org/abs/2411.04863)
+## Overview
+OneProt is a multimodal model that integrates protein sequence, protein structure (both in form of an augmented sequence and in a form of a graph), protein binding sites and protein text annotations. Contrastive learning is used to align each of the modality to the central one, which is protein sequence. In the pre-training phase InfoNCE loss is computed between pairs (protein sequence, other modality).
+This model **omitted the structure encoder based on the ESM2 architecture, leaving only the GNN encoder for structure in**, and therefore comprising only 4 out of possible 5 modalities.
+## Model architecture
+Protein sequence encoder: [esm2_t33_650M_UR50D](https://huggingface.co/facebook/esm2_t33_650M_UR50D)
+Protein structure encoder GNN: [ProNet](https://github.com/divelab/DIG)
+Pocket (binding sites encoder) GNN: [ProNet](https://github.com/divelab/DIG)
+Text encoder: [BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext)
+Below is an example code on how to obtain the embeddings (requires cloning our repo first). Note that example data for transformer models are read-off from `.txt` files and in principle can be passed as strings, whlist the data for GNN models are contained in the example `.h5` file and need to subsequently be converted to graphs.
+```
+import torch
+import hydra
+from omegaconf import OmegaConf
+from huggingface_hub import  HfApi, hf_hub_download
+import sys
+import os
+import h5py
+from torch_geometric.data import Batch
+from transformers import AutoTokenizer
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))) # assuming that you are running this script from the oneprot repo, can be any other path
+from src.models.oneprot_module import OneProtLitModule
+from src.data.utils.struct_graph_utils import protein_to_graph
+###if you are not running on the supercomputer, you may need to uncomment the two following lines
+#os.environ['RANK']='0'
+#os.environ['WORLD_SIZE']='1'
+#Load the config file and read it off
+config_path = hf_hub_download(
+        repo_id="HelmholtzAI-FZJ/oneprot-4",
+        filename="config.yaml",
+    )
+with open(config_path, 'r') as f:
+    cfg = OmegaConf.load(f)
+# Prepare components dictionary from config
+components = {
+        'sequence': hydra.utils.instantiate(cfg.model.components.sequence),
+        'struct_graph': hydra.utils.instantiate(cfg.model.components.struct_graph),
+        'pocket': hydra.utils.instantiate(cfg.model.components.pocket),
+        'text': hydra.utils.instantiate(cfg.model.components.text)
+    }
+# Load the model checkpoint
+checkpoint_path = hf_hub_download(
+        repo_id="HelmholtzAI-FZJ/oneprot-4",
+        filename="pytorch_model.bin",
+        repo_type="model"
+    )
+# Create model instance and load the checkpoint
+model = OneProtLitModule(
+        components=components,
+        optimizer=None,
+        loss_fn=cfg.model.loss_fn,
+        local_loss=cfg.model.local_loss,
+        gather_with_grad=cfg.model.gather_with_grad,
+        use_l1_regularization=cfg.model.use_l1_regularization,
+        train_on_all_modalities_after_step=cfg.model.train_on_all_modalities_after_step,
+        use_seqsim=cfg.model.use_seqsim
+    )
+state_dict = torch.load(checkpoint_path)
+model_state_dict = model.state_dict()
+model.load_state_dict(state_dict, strict=True)
+# Define the tokenisers
+tokenizers = {
+        'sequence': "facebook/esm2_t33_650M_UR50D",
+        'struct_token': "facebook/esm2_t33_650M_UR50D",
+        'text': "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
+    }
+loaded_tokenizers = {}
+for modality, tokenizer_name in tokenizers.items():
+    tokenizer = AutoTokenizer.from_pretrained(tokenizers[modality])
+    loaded_tokenizers[modality] = tokenizer
+# Get example embeddings for each modality
+##########################sequence##############################
+modality = "sequence"
+file_path = hf_hub_download(
+    repo_id="HelmholtzAI-FZJ/oneprot",
+    filename="data_examples/sequence_example.txt",
+    repo_type="model"  # or "dataset"
+)
+with open(file_path, 'r') as file:
+    input_sequence = file.read().strip()
+input_tensor = loaded_tokenizers[modality](input_sequence, return_tensors="pt")["input_ids"]
+output = model.network[modality](input_tensor)
+print(f"Output for modality '{modality}': {output}")
+###########################text#################################
+modality = "text"
+file_path = hf_hub_download(
+    repo_id="HelmholtzAI-FZJ/oneprot",
+    filename="data_examples/text_example.txt",
+    repo_type="model"  # or "dataset"
+)
+with open(file_path, 'r') as file:
+    input_text = file.read().strip()
+input_tensor = loaded_tokenizers[modality](input_text, return_tensors="pt")["input_ids"]
+output = model.network[modality](input_tensor)
+print(f"Output for modality '{modality}': {output}")
+#####################graph structure############################
+modality = "struct_graph"
+file_path = hf_hub_download(
+    repo_id="HelmholtzAI-FZJ/oneprot",
+    filename="data_examples/seqstruc_example.h5",
+    repo_type="model"  # or "dataset"
+)
+with h5py.File(file_path, 'r') as file:
+    input_struct_graph=[protein_to_graph('E6Y2X0', file_path, 'non_pdb', 'A', pockets=False)]
+    input_struct_graph = Batch.from_data_list(input_struct_graph)
+    output=model.network[modality](input_struct_graph)
+print(f"Output for modality '{modality}': {output}")
+##########################pocket################################
+modality = "pocket"  # Replace with the desired modality
+file_path = hf_hub_download(
+    repo_id="HelmholtzAI-FZJ/oneprot",
+    filename="data_examples/pocket_example.h5",
+    repo_type="model"  # or "dataset"
+)
+with h5py.File(file_path, 'r') as file:
+    input_pocket=[protein_to_graph('E6Y2X0', file_path, 'non_pdb', 'A', pockets=True)]
+    input_pocket = Batch.from_data_list(input_pocket)
+    output=model.network[modality](input_pocket)
+print(f"Output for modality '{modality}': {output}")
+```
+Citation
+```
+@misc{flöge2024oneprotmultimodalproteinfoundation,
+      title={OneProt: Towards Multi-Modal Protein Foundation Models},
+      author={Klemens Flöge and Srisruthi Udayakumar and Johanna Sommer and Marie Piraud and Stefan Kesselheim and Vincent Fortuin and Stephan Günneman and Karel J van der Weg and Holger Gohlke and Alina Bazarova and Erinc Merdivan},
+      year={2024},
+      eprint={2411.04863},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2411.04863},
+}
+```