LoGoBERT-PPI (Bernett-trained)

LoGoBERT-PPI is a protein–protein interaction (PPI) prediction model built on top of the ESM-2 protein language model and a late-interaction MaxSim mechanism.

The model is trained on the Bernett human PPI benchmark dataset, which uses leakage-controlled splits.

Model Overview

LoGoBERT-PPI combines:

  • Global sequence-level representations from ESM-2 using mean pooling
  • A residue-level late-interaction signal computed via a MaxSim operation inspired by ColBERT

This hybrid design captures localized binding patterns between protein sequences while remaining computationally efficient for large-scale inference.
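The two signals can be illustrated with a small pure-Python sketch. It is not the LoGoBERT-PPI implementation: the function names, the cosine similarity, and the averaging over residues are assumptions made for illustration (ColBERT itself sums the per-token maxima), and the 2-dimensional vectors stand in for ESM-2 residue embeddings.

```python
import math

def mean_pool(residues):
    """Average per-residue vectors into one sequence-level vector."""
    dim = len(residues[0])
    return [sum(vec[i] for vec in residues) / len(residues) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def maxsim(residues_a, residues_b):
    """For each residue of A, keep its best match in B, then average (assumed aggregation)."""
    best = [max(cosine(a, b) for b in residues_b) for a in residues_a]
    return sum(best) / len(best)

# Toy 2-dimensional "residue embeddings", made up for illustration.
prot_a = [[1.0, 0.0], [0.0, 1.0]]
prot_b = [[1.0, 0.0], [1.0, 1.0]]

global_a = mean_pool(prot_a)          # global, sequence-level representation
local_score = maxsim(prot_a, prot_b)  # local, residue-level late-interaction score
```

Note that this form of MaxSim is asymmetric: `maxsim(prot_a, prot_b)` and `maxsim(prot_b, prot_a)` can differ, so a symmetric predictor would need to combine both directions in some way.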


Available Checkpoints

File                 Description
model.safetensors    Trained on the Bernett human PPI benchmark (HIPPIE-derived, leakage-controlled)

Requirements

  • Python >= 3.9
  • torch
  • transformers
  • huggingface_hub

Install dependencies:

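pip install torch transformers huggingface_hub

Usage

Run inference on a pair of protein sequences:

import torch
# The `model` module, which defines LoGo_BERT, must be on the import path
from model import LoGo_BERT
from transformers import AutoTokenizer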

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = LoGo_BERT.from_pretrained("netbiolab/LoGoBERT-PPI-Bernett")
model = model.to(device)
model.eval()

tok = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

seqA = "MDKKSARIRRATRARRKLQELGATRLVVHRTPRHIYAQVIAPNGSEVLVAASTVEKAIAEQLKYTGNKDAAAAVGKAVAERALEKGIKDVSFDRSGFQYHGRVQALADAAREAGLQF"
seqB = "MAVVKCKPTSPGRRHVVKVVNPELHKGKPFAPLLEKNSKSGGRNNNGRITTRHIGGGHKQAYRIVDFKRNKDGIPAVVERLEYDPNRSANIALVLYKDGERRYILAPKGLKAGDQIQSGVDAAIKPGNTLPMRNIPVGSTVHNVEMKPGKGGQLARSAGTYVQIVARDGAYVTLRLRSGEMRKVEADCRATLGEVGNAEHMLRVLGKAGAARWRGVRPTVRGTAMNPVDHPHGGGEGRNFGKHPVTPWGVQTKGKKTRSNKRTDKFIVRRRSK"

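# Tokenize each sequence separately
input_a = tok(seqA, return_tensors="pt")
input_b = tok(seqB, return_tensors="pt")

# Move the tokenized tensors to the same device as the model
input_a = {k: v.to(device) for k, v in input_a.items()}
input_b = {k: v.to(device) for k, v in input_b.items()}

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    prob = model(input_a, input_b)

print(prob)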

Training Dataset

LoGoBERT-PPI was trained and evaluated using the Bernett human PPI benchmark dataset, available at https://doi.org/10.6084/m9.figshare.21591618.v3.

The Bernett dataset is a leakage-controlled gold-standard benchmark constructed from HIPPIE (v2.3) high-confidence human protein–protein interactions. Graph partitioning using KaHIP was applied to minimize topological overlap between train, validation, and test splits. To further reduce sequence-level information leakage, CD-HIT clustering was performed with a sequence identity threshold of ≤40%, ensuring low homology across splits.
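As a rough sanity check on any split of this kind, one can count the proteins that appear in more than one split. The sketch below is an illustration only: the helper names and toy identifiers are hypothetical, and it detects shared protein identifiers, not sequence homology (which is what the CD-HIT step addresses).

```python
def proteins_in(pairs):
    """Collect every protein identifier that occurs in a list of pairs."""
    return {protein for pair in pairs for protein in pair}

def shared_proteins(split_a, split_b):
    """Return the proteins that occur in both splits."""
    return proteins_in(split_a) & proteins_in(split_b)

# Toy splits with made-up identifiers; "P3" appears on both sides.
train = [("P1", "P2"), ("P2", "P3")]
test = [("P4", "P5"), ("P3", "P6")]

leaked = shared_proteins(train, test)
print(sorted(leaked))
```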

Negative protein pairs were generated by randomly pairing proteins not reported to interact in HIPPIE, sampled at a 1:1 ratio relative to positive interactions.
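The random-pairing idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual procedure: `sample_negatives` is a hypothetical helper, the candidate pool is taken to be the proteins seen in the positive set, and any further constraints used by the dataset authors are not modeled.

```python
import random

def sample_negatives(positives, seed=0):
    """Randomly pair proteins that are not known to interact, at a 1:1 ratio."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in positives}
    proteins = sorted({protein for pair in positives for protein in pair})
    negatives = set()
    attempts, max_attempts = 0, 100 * len(positives)
    while len(negatives) < len(positives) and attempts < max_attempts:
        attempts += 1
        candidate = frozenset(rng.sample(proteins, 2))  # two distinct proteins
        if candidate not in known:
            negatives.add(candidate)
    return sorted(tuple(sorted(pair)) for pair in negatives)

# Toy positive set with made-up identifiers.
positives = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]
negatives = sample_negatives(positives)
print(negatives)
```

Treating pairs as unordered sets keeps (A, B) and (B, A) from being counted as different interactions.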

This dataset provides a stringent evaluation setting for assessing generalization and minimizing data leakage in PPI prediction models.
