LoGoBERT-PPI (Bernett-trained)
LoGoBERT-PPI is a protein–protein interaction (PPI) prediction model built on top of the ESM-2 protein language model and a late-interaction MaxSim mechanism.
The model is trained on the Bernett human PPI benchmark dataset, which uses leakage-controlled splits.
Model Overview
LoGoBERT-PPI combines:
- Global sequence-level representations from ESM-2 using mean pooling
- A residue-level late-interaction signal computed via a MaxSim operation inspired by ColBERT
This hybrid design captures localized binding patterns between protein sequences while keeping inference cheap enough for large-scale screening.
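The two signals listed above can be illustrated with a minimal sketch in plain Python. It is not the model's actual implementation: the toy 2-dimensional embeddings, the helper names (`mean_pool`, `cosine`, `maxsim`), the use of cosine similarity, and averaging the per-residue maxima (ColBERT itself sums them) are all assumptions made for illustration. How LoGoBERT-PPI combines the global and local signals into a final score is not shown here.

```python
import math

def mean_pool(residues):
    """Average per-residue vectors into one sequence-level vector."""
    n = len(residues)
    dim = len(residues[0])
    return [sum(vec[d] for vec in residues) / n for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def maxsim(residues_a, residues_b):
    """For each residue of A, keep its best match in B, then average (assumed)."""
    best = [max(cosine(a, b) for b in residues_b) for a in residues_a]
    return sum(best) / len(best)

# Hypothetical toy embeddings: 2 residues for protein A, 3 for protein B.
emb_a = [[1.0, 0.0], [0.0, 1.0]]
emb_b = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]

global_a = mean_pool(emb_a)          # sequence-level (global) representation
local_score = maxsim(emb_a, emb_b)   # residue-level (local) interaction signal
print(global_a, round(local_score, 4))
```

Note that MaxSim as written is asymmetric: `maxsim(emb_a, emb_b)` and `maxsim(emb_b, emb_a)` generally differ, so a symmetric PPI score would need to combine both directions in some way.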
Available Checkpoints
| File | Description |
|---|---|
| model.safetensors | Trained on the Bernett human PPI benchmark (HIPPIE-derived, leakage-controlled) |
Requirements
- Python >= 3.9
- torch
- transformers
- huggingface_hub
Install dependencies:
```bash
pip install torch transformers huggingface_hub
```
```python
import torch
from transformers import AutoTokenizer

# Assumes model.py (which defines LoGo_BERT) is importable from the working directory
from model import LoGo_BERT

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the Bernett-trained checkpoint and switch to inference mode
model = LoGo_BERT.from_pretrained("netbiolab/LoGoBERT-PPI-Bernett")
model = model.to(device)
model.eval()

# ESM-2 tokenizer
tok = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

seqA = "MDKKSARIRRATRARRKLQELGATRLVVHRTPRHIYAQVIAPNGSEVLVAASTVEKAIAEQLKYTGNKDAAAAVGKAVAERALEKGIKDVSFDRSGFQYHGRVQALADAAREAGLQF"
seqB = "MAVVKCKPTSPGRRHVVKVVNPELHKGKPFAPLLEKNSKSGGRNNNGRITTRHIGGGHKQAYRIVDFKRNKDGIPAVVERLEYDPNRSANIALVLYKDGERRYILAPKGLKAGDQIQSGVDAAIKPGNTLPMRNIPVGSTVHNVEMKPGKGGQLARSAGTYVQIVARDGAYVTLRLRSGEMRKVEADCRATLGEVGNAEHMLRVLGKAGAARWRGVRPTVRGTAMNPVDHPHGGGEGRNFGKHPVTPWGVQTKGKKTRSNKRTDKFIVRRRSK"

# Tokenize each sequence and move the tensors to the model's device
input_a = tok(seqA, return_tensors="pt")
input_b = tok(seqB, return_tensors="pt")
input_a = {k: v.to(device) for k, v in input_a.items()}
input_b = {k: v.to(device) for k, v in input_b.items()}

# Score the pair without tracking gradients
with torch.no_grad():
    prob = model(input_a, input_b)

print(prob)
```
Training Dataset
LoGoBERT-PPI was trained and evaluated using the Bernett human PPI benchmark dataset, available at https://doi.org/10.6084/m9.figshare.21591618.v3.
The Bernett dataset is a leakage-controlled gold-standard benchmark constructed from HIPPIE (v2.3) high-confidence human protein–protein interactions. Graph partitioning using KaHIP was applied to minimize topological overlap between train, validation, and test splits. To further reduce sequence-level information leakage, CD-HIT clustering was performed with a sequence identity threshold of ≤40%, ensuring low homology across splits.
Negative protein pairs were generated by randomly pairing proteins not reported to interact in HIPPIE, sampled at a 1:1 ratio relative to positive interactions.
This dataset provides a stringent evaluation setting for assessing generalization and minimizing data leakage in PPI prediction models.
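The 1:1 negative sampling described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the benchmark's actual pipeline: the function name `sample_negatives`, the protein IDs, and uniform sampling over the proteins that appear in the positive set are all hypothetical choices, and the sketch ignores the split and homology constraints the real dataset enforces.

```python
import random

def sample_negatives(positives, seed=0):
    """Draw as many random non-interacting pairs as there are positive pairs."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in positives}
    proteins = sorted({p for pair in positives for p in pair})

    # Guard against an endless loop when too few candidate pairs exist.
    n = len(proteins)
    known_hetero = sum(1 for pair in known if len(pair) == 2)
    max_negatives = n * (n - 1) // 2 - known_hetero
    target = len(known)
    if target > max_negatives:
        raise ValueError("not enough non-interacting pairs for a 1:1 ratio")

    negatives = set()
    while len(negatives) < target:
        a, b = rng.sample(proteins, 2)       # two distinct proteins
        pair = frozenset((a, b))
        if pair not in known:                # skip reported interactions
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)

# Hypothetical positive interactions among four proteins.
positives = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
negatives = sample_negatives(positives)
print(negatives)
```

Because "not reported to interact" is not the same as "known not to interact", negatives produced this way can include undiscovered true interactions, which is a general caveat of random negative sampling.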