Protein-Protein Interaction Site Prediction

This model is a fine-tuned version of ESM2-3B [1] for protein-protein interaction site prediction. For each amino acid in a protein sequence, it predicts whether that residue is part of an interaction site (1) or not (0).

For more details on the training and testing of this model, refer to the article [...].

The GitHub repository to use with this model is available here: https://github.com/RitAreaSciencePark/PPI-Reps

The data for the training and evaluation of this model is available in CSV format in this Zenodo repository: https://doi.org/10.5281/zenodo.18802482

How to Get Started with the Model

This code snippet shows how to load the model and use it to predict the probability that each amino acid in a protein sequence is part of a protein-protein interaction site.

import torch
from transformers import AutoModel, AutoTokenizer, AutoConfig

model_name = "evillegasgarcia/esm2-ppi-pdbbind-1"

# Load config 
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
# Load model using the custom remote code
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


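# Move the model to the GPU if one is available and switch to inference mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Run over a sample sequence
sequence = "MKTVRQERLKSIVRILEAAKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

# Tokenize the sequence into a tensor of token IDs
inputs = tokenizer.encode(sequence, return_tensors="pt").to(device)
# Gradients are not needed for inference
with torch.no_grad():
    logits = model(inputs)["logits"]
# Convert logits to probabilities
probabilities = torch.sigmoid(logits)

print(probabilities)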
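The probabilities can be turned into the 0/1 labels described at the top of this card by applying a cutoff. The following is a minimal sketch, not part of the repository: the helper name `label_residues`, the 0.5 threshold, and the toy probabilities are all assumptions made for illustration.

```python
def label_residues(sequence, probabilities, threshold=0.5):
    """Pair each residue with its probability and a 0/1 interaction-site label.

    `threshold` is an assumed cutoff; the model card does not state one.
    """
    if len(probabilities) != len(sequence):
        raise ValueError("expected exactly one probability per residue")
    return [(aa, p, int(p >= threshold)) for aa, p in zip(sequence, probabilities)]


# Made-up probabilities for a five-residue toy sequence (not real model output).
toy = label_residues("MKTVR", [0.12, 0.81, 0.50, 0.33, 0.95])

# 1-based positions of residues labelled as interaction sites.
site_positions = [i + 1 for i, (_, _, label) in enumerate(toy) if label == 1]
print(site_positions)
```

To apply this to real output, the `probabilities` tensor would first need to be flattened into a plain list with one value per residue. If the output also contains positions for special tokens added by the tokenizer, those would have to be dropped first; whether it does is not stated here, so check the output shape against the sequence length.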

Training Details

The model was trained on the PDBbind dataset described in the paper.
We used the Adam optimizer with default hyperparameters, except for a weight decay of 0.01 and a learning rate of 1e-5, and a gradient accumulation batch size of 8.
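As a rough illustration of these settings, the sketch below wires the stated learning rate and weight decay into `torch.optim.Adam` and accumulates gradients over 8 micro-batches before each optimizer step. It is not the training script from the repository: the toy linear model, the binary cross-entropy loss, the random data, and the reading of "8" as the number of accumulated micro-batches are all assumptions.

```python
import torch
from torch import nn

# Hypothetical stand-in for the fine-tuned model: one logit per 16-dim residue embedding.
toy_model = nn.Linear(16, 1)

# Values stated in this card: lr 1e-5, weight decay 0.01, other Adam settings at defaults.
optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-5, weight_decay=0.01)

# Assumption: binary cross-entropy on per-residue logits.
loss_fn = nn.BCEWithLogitsLoss()

# Assumption: gradients are accumulated over 8 micro-batches per update.
ACCUMULATION_STEPS = 8


def train_one_update(model, micro_batches):
    """Accumulate gradients over all micro-batches, then take one optimizer step."""
    optimizer.zero_grad()
    total = 0.0
    for features, labels in micro_batches:
        loss = loss_fn(model(features).squeeze(-1), labels)
        # Scale each loss so the accumulated gradient is the mean over micro-batches.
        (loss / len(micro_batches)).backward()
        total += loss.item()
    optimizer.step()
    return total / len(micro_batches)


torch.manual_seed(0)
# Random toy data: 8 micro-batches of 10 "residues" each.
batches = [
    (torch.randn(10, 16), torch.randint(0, 2, (10,)).float())
    for _ in range(ACCUMULATION_STEPS)
]
mean_loss = train_one_update(toy_model, batches)
print(f"mean loss over {len(batches)} micro-batches: {mean_loss:.4f}")
```

Note that `weight_decay` in `torch.optim.Adam` is applied as an L2 penalty added to the gradient, whereas `torch.optim.AdamW` decouples it from the gradient update. This card does not say which variant was used.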

Evaluation

The performance of the model was tested on the ZK448 benchmark, which is available from the Zenodo repository and was originally curated by [3]. The model achieves an accuracy of 0.74 and a Matthews correlation coefficient (MCC) of 0.35.
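For reference, both metrics can be computed from per-residue binary labels as in the sketch below. The function name `accuracy_and_mcc` and the toy labels are illustrative assumptions, not the evaluation code or data behind the numbers above. Returning 0.0 when the MCC denominator is zero is a common convention and is adopted here as an assumption.

```python
import math


def accuracy_and_mcc(y_true, y_pred):
    """Compute accuracy and the Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is reported as 0.0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc


# Toy labels for eight residues (not taken from the ZK448 benchmark).
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
accuracy, mcc = accuracy_and_mcc(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, MCC={mcc:.2f}")
```

MCC takes all four confusion-matrix cells into account, which makes it more informative than accuracy alone when interaction-site residues are a minority of the sequence.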

References

[1] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., ... & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130.
[2] Stringer, B., de Ferrante, H., Abeln, S., Heringa, J., Feenstra, K. A., & Haydarlou, R. (2022). PIPENN: protein interface prediction from sequence with an ensemble of neural nets. Bioinformatics, 38(8), 2111-2118.
[3] Zhang, J., & Kurgan, L. (2018). Review and comparative assessment of sequence-based predictors of protein-binding residues. Briefings in Bioinformatics, 19(5), 821-837.

Safetensors
Model size: 3B params
Tensor type: F32