isikz committed on
Commit
c406195
·
verified ·
1 Parent(s): dfb95f9

Update README.md

Files changed (1): README.md (+50 -1)
README.md CHANGED
@@ -5,4 +5,53 @@ metrics:
  - f1
  base_model:
  - facebook/esm1b_t33_650M_UR50S
- ---
+ ---
+ ---
+
+ ## **Fine-Tuning ESM-1b for Phosphosite Prediction**
+
+ This repository provides a fine-tuned version of the [ESM-1b](https://huggingface.co/facebook/esm1b_t33_650M_UR50S) model, trained to classify phosphosites using unlabeled phosphosites (i.e., the kinases that phosphorylate them are unknown) from [PhosphoSitePlus](https://www.phosphosite.org/staticDownloads). The model performs binary classification, distinguishing phosphosites from non-phosphorylated peptide sequences [(Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites)](https://www.sciencedirect.com/science/article/pii/S1535947620311518).
+
+ ### **Dataset & Labeling Strategy**
+
+ The dataset was constructed using phosphosite information from **PhosphoSitePlus**, with the following assumptions:
+
+ - Positive samples: Known phosphorylated residues from PhosphoSitePlus.
+ - Negative samples: 15-residue sequences selected from the same proteins, whose central residue is of the same type as known phosphorylation sites but is not reported as phosphorylated in PhosphoSitePlus.
+
+ **Note**: The absence of a phosphorylation report does not imply that a site is truly non-phosphorylated; such sites are assumed negative in this study.
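The labeling rule above can be sketched in a few lines of pure Python (a toy illustration; `extract_window` and `label_site` are hypothetical helpers, not part of this repository — in the real pipeline the set of known phosphosites would come from the PhosphoSitePlus download for each protein):

```python
def extract_window(protein_seq, site_idx, flank=7):
    """Return the 15-residue window centered on site_idx (7 residues on
    each side), or None if the site is too close to either terminus."""
    if site_idx < flank or site_idx + flank >= len(protein_seq):
        return None
    return protein_seq[site_idx - flank : site_idx + flank + 1]

def label_site(protein_seq, site_idx, known_phosphosites):
    """Label 1 if the (protein-local) index is a reported phosphosite,
    else 0. Only phosphorylatable residue types (S/T/Y) are considered."""
    if protein_seq[site_idx] not in "STY":
        return None
    return 1 if site_idx in known_phosphosites else 0

seq = "AAAAAAASAAAAAAATAAAAAAA"  # toy protein: S at index 7, T at index 15
positives = {7}                   # only the S is reported phosphorylated
print(extract_window(seq, 7))     # -> AAAAAAASAAAAAAA (15 residues)
print(label_site(seq, 7, positives), label_site(seq, 15, positives))  # -> 1 0
```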
+
+ ### **Dataset Statistics**
+
+ - Positive samples: 366,028
+ - Negative samples: 364,121
+ - Training samples: 511,104
+ - Validation samples: 109,522
+ - Test samples: 109,523
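The counts are internally consistent: positives and negatives sum to 730,149, and the three splits partition that total in roughly a 70/15/15 ratio. A quick arithmetic check:

```python
# Totals taken directly from the statistics above
positives, negatives = 366_028, 364_121
train, val, test = 511_104, 109_522, 109_523

total = positives + negatives
assert train + val + test == total  # the splits partition the full dataset

# Approximate split fractions
print(round(train / total, 2), round(val / total, 2), round(test / total, 2))  # -> 0.7 0.15 0.15
```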
+
+ ### **Test Performance**
+
+ - Accuracy: 0.94
+ - F1-score: 0.94
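For reference, these two metrics are computed as below (a self-contained sketch with toy labels; the repository's actual evaluation code is not shown):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = [1, 1, 0, 0, 1, 0]  # toy ground-truth labels
y_pred = [1, 0, 0, 0, 1, 1]  # toy model predictions
print(round(accuracy(y_true, y_pred), 2), round(f1_score(y_true, y_pred), 2))  # -> 0.67 0.67
```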
+
+ ### **Usage**
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load the fine-tuned model and tokenizer
+ model_name = "isikz/phosphorylation_binaryclassification_esm1b"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+ model.eval()
+
+ # Example peptide sequence (training windows were 15 residues long)
+ sequence = "MKTLLLTLVVVTIVCLDLGYTGV"
+
+ # Tokenize input
+ inputs = tokenizer(sequence, return_tensors="pt")
+
+ # Get prediction
+ with torch.no_grad():
+     outputs = model(**inputs)
+     logits = outputs.logits
+
+ # A single-logit head is squashed with a sigmoid; a two-logit head
+ # needs the softmax probability of the positive class instead
+ if logits.shape[-1] == 1:
+     prediction = torch.sigmoid(logits).item()
+ else:
+     prediction = torch.softmax(logits, dim=-1)[0, 1].item()
+
+ print(f"Phosphorylation Probability: {prediction:.4f}")
+ ```
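To score every candidate site in a full-length protein, one would enumerate the S/T/Y positions and run each surrounding 15-residue window through the snippet above. The enumeration step is sketched here in pure Python (`candidate_windows` is a hypothetical helper, not part of this repository):

```python
def candidate_windows(protein_seq, flank=7):
    """Yield (position, window) for each S/T/Y residue far enough from
    the termini to allow a full (2 * flank + 1)-residue window."""
    for i, aa in enumerate(protein_seq):
        if aa in "STY" and flank <= i < len(protein_seq) - flank:
            yield i, protein_seq[i - flank : i + flank + 1]

seq = "MKTLLLTLVVVTIVCLDLGYTGVS"
for pos, window in candidate_windows(seq):
    print(pos, window)  # each window could be fed to the tokenizer/model above
```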