## **Fine-Tuning ESM-1b for Phosphosite Prediction**
This repository provides a fine-tuned version of the [ESM-1b](https://huggingface.co/facebook/esm1b_t33_650M_UR50S) model, trained to classify phosphosites using unlabeled phosphosites (i.e., sites for which the phosphorylating kinase is unknown) from [PhosphoSitePlus](https://www.phosphosite.org/staticDownloads). The model is designed for binary classification, distinguishing phosphosites from non-phosphorylated peptide sequences (see [Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites](https://www.sciencedirect.com/science/article/pii/S1535947620311518)).
### **Dataset & Labeling Strategy**
The dataset was constructed using phosphosite information from **PhosphoSitePlus**, with the following assumptions:
- Positive Samples: Known phosphorylated residues from PhosphoSitePlus.
- Negative Samples: 15-residue sequences selected from the same proteins, whose central residue is the same amino acid as a known phosphorylation site but is not itself reported as phosphorylated in PhosphoSitePlus.
**Note**: The absence of a phosphorylation report does not prove that a site is never phosphorylated; such sites are nonetheless treated as negatives in this study.
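The labeling strategy above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used to build the dataset: the function names `extract_window` and `candidate_negatives` are hypothetical, the candidate residue types are assumed to be S, T, and Y, and windows that would run past the ends of the protein are simply skipped (the description does not say how termini were handled).

```python
WINDOW = 15              # window length stated in the dataset description
HALF = WINDOW // 2       # 7 residues on each side of the central residue
CANDIDATES = set("STY")  # assumption: residue types considered phosphorylatable


def extract_window(protein, center):
    """Return the 15-residue window centered on 0-based index `center`,
    or None if the window would extend past either end of the protein."""
    start, end = center - HALF, center + HALF + 1
    if start < 0 or end > len(protein):
        return None
    return protein[start:end]


def candidate_negatives(protein, positive_sites):
    """Collect windows centered on S/T/Y residues that are NOT listed as
    phosphorylated. `positive_sites` holds 0-based indices of known sites."""
    negatives = []
    for i, residue in enumerate(protein):
        if residue in CANDIDATES and i not in positive_sites:
            window = extract_window(protein, i)
            if window is not None:
                negatives.append((i, window))
    return negatives


# Toy sequence for illustration only (not taken from PhosphoSitePlus).
protein = "A" * 7 + "S" + "A" * 7 + "T" + "A" * 7
positives = {7}  # pretend the S at index 7 is a reported phosphosite
for index, window in candidate_negatives(protein, positives):
    print(index, window)
```

Indices are 0-based in this sketch, whereas residue positions in protein annotations are conventionally 1-based, so real annotations would need converting first.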
### **Dataset Statistics**
- Positive Samples: 366,028
- Negative Samples: 364,121
- Training Samples: 511,104
- Validation Samples: 109,522
- Testing Samples: 109,523
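
As a quick consistency check on the counts above, the sketch below confirms that the positive and negative samples add up to the three splits and prints the proportion of each split. The variable names are illustrative only.

```python
positives, negatives = 366_028, 364_121
splits = {"train": 511_104, "validation": 109_522, "test": 109_523}

total = positives + negatives
# The splits should account for every sample exactly once.
assert total == sum(splits.values())

print(f"positive fraction: {positives / total:.1%}")
for name, count in splits.items():
    print(f"{name}: {count / total:.1%}")
```

The counts correspond to a roughly balanced dataset divided approximately 70/15/15 into training, validation, and test sets.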
### **Test Performance**
- Accuracy: 0.94
- F1-Score: 0.94
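
For reference, accuracy and F1 can be computed from confusion-matrix counts as sketched below. The counts in the example are made-up placeholders, not results from this model, and the function name `accuracy_and_f1` is hypothetical. The sketch computes F1 for the positive class; which averaging was used for the reported score is not stated here.

```python
def accuracy_and_f1(tp, fp, tn, fn):
    """Compute accuracy and positive-class F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Placeholder counts for illustration only.
acc, f1 = accuracy_and_f1(tp=90, fp=10, tn=85, fn=15)
print(f"Accuracy: {acc:.2f}, F1: {f1:.2f}")
```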
### **Usage**
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "isikz/phosphorylation_binaryclassification_esm1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example sequence
sequence = "MKTLLLTLVVVTIVCLDLGYTGV"

# Tokenize input
inputs = tokenizer(sequence, return_tensors="pt")

# Get prediction (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# sigmoid + .item() assumes the classification head emits a single logit
prediction = torch.sigmoid(logits).item()
print(f"Phosphorylation Probability: {prediction:.4f}")
```
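
The snippet above applies a sigmoid and calls `.item()`, which only works if the classification head emits a single logit. If the checkpoint instead exposes two logits (one per class), a softmax over both is the usual alternative. The helper below is a hedged sketch covering both cases in plain Python. `logits_to_probability` is a hypothetical name, and treating index 1 as the phosphorylated class is an assumption that should be checked against the model's `id2label` config.

```python
import math


def logits_to_probability(logits):
    """Map raw logits for one sequence to a probability of the positive class.

    Accepts a list with one logit (sigmoid head) or two logits (softmax head).
    Assumption: with two logits, index 1 is the phosphorylated class.
    """
    if len(logits) == 1:
        x = logits[0]
        # Numerically stable sigmoid for both large positive and negative x.
        if x >= 0:
            return 1.0 / (1.0 + math.exp(-x))
        e = math.exp(x)
        return e / (1.0 + e)
    if len(logits) == 2:
        # Subtract the max for numerical stability before exponentiating.
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        return exps[1] / sum(exps)
    raise ValueError(f"expected 1 or 2 logits, got {len(logits)}")


print(logits_to_probability([0.0]))       # single-logit head
print(logits_to_probability([0.0, 0.0]))  # two-logit head
```

With the usage code above, the input would be `outputs.logits.squeeze(0).tolist()`.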