---
license: apache-2.0
metrics:
- accuracy
- f1
base_model:
- facebook/esm1b_t33_650M_UR50S
---

## **Fine-Tuning ESM-1b for Phosphosite Prediction**
This repository provides a fine-tuned version of the [ESM-1b](https://huggingface.co/facebook/esm1b_t33_650M_UR50S) model, trained on unlabeled phosphosites (i.e., sites whose phosphorylating kinases are unknown) from [PhosphoSitePlus](https://www.phosphosite.org/staticDownloads). The model performs binary classification, distinguishing phosphosites from non-phosphorylated peptide sequences (see [Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites](https://www.sciencedirect.com/science/article/pii/S1535947620311518)).
### **Developed by:**
Zeynep Işık (MSc, Sabanci University)

### **Dataset & Labeling Strategy**
The dataset was constructed using phosphosite information from **PhosphoSitePlus**, with the following assumptions:
- Positive Samples: Known phosphorylated residues from PhosphoSitePlus.
- Negative Samples: Derived by selecting 15-residue sequences from the same proteins, where the central residue is of the same type as known phosphorylation sites but is not reported as phosphorylated in PhosphoSitePlus (a minimal labeling sketch follows below).

**Note**: The absence of a phosphorylation report does not guarantee that a site is truly non-phosphorylated; such sites are simply treated as negatives in this study.
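As a rough illustration of this labeling scheme, the sketch below cuts a 15-residue window around a candidate residue and labels it according to whether that position is annotated as a phosphosite. The function name and toy data are hypothetical; this is not the script used to build the dataset.

```python
# Hypothetical sketch of the labeling scheme described above (not the actual
# dataset-construction script). A 15-residue window is cut around a candidate
# residue; the example is labeled positive if that position is reported as
# phosphorylated, and negative otherwise.
def make_example(protein_seq, position, annotated_sites, window=15):
    half = window // 2
    peptide = protein_seq[max(0, position - half): position + half + 1]
    label = 1 if position in annotated_sites else 0
    return peptide, label

# Toy usage: the serine at 0-based position 7 is reported, position 9 is not
print(make_example("MKSSPLTSYSQSAGTKL", 7, {7}))   # positive example
print(make_example("MKSSPLTSYSQSAGTKL", 9, {7}))   # negative example
```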
|
### **Dataset Statistics**
- Positive Samples: 366,028
- Negative Samples: 364,121
- Training Samples: 511,104
- Validation Samples: 109,522
- Testing Samples: 109,523
|
### **Test Performance**
- Accuracy: 0.94
- F1-Score: 0.94
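The scores above correspond to standard scikit-learn metrics; the snippet below is a minimal, self-contained sketch of how accuracy and F1 would be computed on held-out predictions. The toy labels and the 1 = phosphosite / 0 = non-phosphosite convention are illustrative assumptions, not taken from this repository.

```python
# Minimal sketch: computing accuracy and F1 for binary phosphosite predictions.
# The label convention (1 = phosphosite, 0 = non-phosphosite) and the toy data
# are illustrative assumptions.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # gold labels for a toy held-out set
y_pred = [1, 0, 1, 0, 0, 1]   # thresholded model predictions

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
print(f"F1-Score: {f1_score(y_true, y_pred):.2f}")
```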
|
### **Usage**

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_name = "isikz/phosphorylation_binaryclassification_esm1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example sequence (note: training used 15-residue windows centered on the candidate site)
sequence = "MKTLLLTLVVVTIVCLDLGYTGV"

# Tokenize input
inputs = tokenizer(sequence, return_tensors="pt")

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# With a two-logit classification head, take the softmax probability of the
# positive class; with a single logit, apply a sigmoid.
if logits.shape[-1] == 2:
    prediction = torch.softmax(logits, dim=-1)[0, 1].item()
else:
    prediction = torch.sigmoid(logits).item()

print(f"Phosphorylation Probability: {prediction:.4f}")
```
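To score candidate sites in a longer protein, one could classify the 15-residue window centered on each serine, threonine, or tyrosine, mirroring the windows described in the dataset section. The helper below is an illustrative sketch that reuses the `tokenizer`, `model`, and `sequence` from the example above; `score_candidate_sites` is a hypothetical name, not an API provided by this repository.

```python
# Illustrative sketch (not provided by this repository): score every S/T/Y
# residue in a protein by classifying the 15-residue window centered on it,
# mirroring the dataset construction described above.
def score_candidate_sites(protein_seq, tokenizer, model, window=15):
    half = window // 2
    results = []
    for i, aa in enumerate(protein_seq):
        if aa not in "STY":
            continue
        peptide = protein_seq[max(0, i - half): i + half + 1]
        inputs = tokenizer(peptide, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        if logits.shape[-1] == 2:
            prob = torch.softmax(logits, dim=-1)[0, 1].item()
        else:
            prob = torch.sigmoid(logits).item()
        results.append((i + 1, aa, prob))  # 1-based position, residue, probability
    return results

# Example: rank candidate sites in the example sequence by predicted probability
for pos, aa, prob in sorted(score_candidate_sites(sequence, tokenizer, model),
                            key=lambda r: r[2], reverse=True):
    print(f"{aa}{pos}: {prob:.4f}")
```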