Update README.md
README.md
CHANGED
|
@@ -1,3 +1,78 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 1 |
+
license: mit
|
| 2 |
+
tags:
|
| 3 |
+
- biology
|
| 4 |
+
---
|
| 5 |
+
# Model description
|
| 6 |
+
**MHC-II-EpiPred** (MHC-I-EpiPred, MHC I molecular epitope prediction) is a protein language model fine-tuned from the [**ESM2**](https://github.com/facebookresearch/esm) pretrained model [(***facebook/esm2_t33_650M_UR50D***)](https://huggingface.co/facebook/esm2_t33_650M_UR50D) on a dataset of T cell epitopes with immunogenicity scores.
|
| 7 |
+
|
| 8 |
+
**MHC-II-EpiPred** is a regression model that predicts the immunogenicity score of a potential epitope peptide given as input.
|
| 9 |
+
|
| 10 |
+
**MHC-II-EpiPred** achieved the following results:
|
| 11 |
+
Average train loss (MSE): 0.0547
|
| 12 |
+
Average validation loss (MSE): 0.0535
|
| 13 |
+
Epoch: 3
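For reference, the mean squared error (MSE) reported above is the average of the squared differences between predicted and true immunogenicity scores. A minimal sketch in plain Python follows; the function name and the sample values are illustrative and are not taken from the training code.

```python
def mean_squared_error(predictions, targets):
    """Average of squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Illustrative values only (not results from this model).
print(mean_squared_error([0.5, 1.0], [0.0, 1.0]))
```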
|
| 14 |
+
|
| 15 |
+
# The dataset for training **MHC-II-EpiPred**
|
| 16 |
+
The training data come from the paper by [Lee CH et al.](https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-023-01225-z) They are provided as a CSV file with 9 columns and 100,097 samples. We used three of these columns to fine-tune the pretrained ESM2 model into a regression model: the first (amino acid sequence), the second (immunogenicity, positive or negative), and the ninth (immunogenicity score). Given the amino acid sequence of a potential epitope, the regression model predicts its immunogenicity score, and comparing that score against a chosen threshold determines whether the peptide is classified as an epitope.
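The column selection described above might look like the following sketch. It picks columns by position because the card does not give the column names; the function name, the header names, and the single sample row are made-up placeholders, not content from the real CSV file.

```python
import csv
import io


def select_training_columns(csv_text):
    """Keep the 1st, 2nd and 9th columns of a 9-column CSV (header row skipped)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    rows = []
    for record in reader:
        rows.append(
            {
                "sequence": record[0],      # column 1: amino acid sequence
                "label": record[1],         # column 2: positive or negative
                "score": float(record[8]),  # column 9: immunogenicity score
            }
        )
    return rows


# Placeholder header and row, for illustration only.
sample_csv = "c1,c2,c3,c4,c5,c6,c7,c8,c9\nAAAAAAAAA,Positive,x,x,x,x,x,x,0.9\n"
print(select_training_columns(sample_csv))
```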
|
| 17 |
+
|
| 18 |
+
The dataset was downloaded from GitHub at [**TRAP**](https://github.com/ChloeHJ/TRAP/blob/main/data/pathogenic_db.csv).
|
| 19 |
+
|
| 20 |
+
# Model training code at GitHub
|
| 21 |
+
https://github.com/pengsihua2023/MHC-I-EpiPred-ESM2
|
| 22 |
+
|
| 23 |
+
# How to use **MHC-II-EpiPred**
|
| 24 |
+
### An example
|
| 25 |
+
PyTorch and the transformers library must be installed on your system.
|
| 26 |
+
### Install PyTorch
|
| 27 |
+
```
pip install torch torchvision torchaudio
```
|
| 31 |
+
### Install transformers
|
| 32 |
+
```
pip install transformers
```
|
| 36 |
+
### Run the following code
|
| 37 |
+
```
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer.
# NOTE: this Hub id and the label map below appear to belong to a protein
# subcellular localization classifier rather than to the regression model
# described in this card; substitute the Hub id of the MHC-II-EpiPred
# checkpoint to obtain immunogenicity scores.
model_name = "sihuapeng/PPPSL-ESM2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Sample protein sequence
protein_sequence = "MSKKVLITGGAGYIGSVLTPILLEKGYEVCVIDNLMFDQISLLSCFHNKNFTFINGDAMDENLIRQEVAKADIIIPLAALVGAPLCKRNPKLAKMINYEAVKMISDFASPSQIFIYPNTNSGYGIGEKDAMCTEESPLRPISEYGIDKVHAEQYLLDKGNCVTFRLATVFGISPRMRLDLLVNDFTYRAYRDKFIVLFEEHFRRNYIHVRDVVKGFIHGIENYDKMKGQAYNMGLSSANLTKRQLAETIKKYIPDFYIHSANIGEDPDKRDYLVSNTKLEATGWKPDNTLEDGIKELLRAFKMMKVNRFANFN"

# Encode the sequence as model input
inputs = tokenizer(protein_sequence, return_tensors="pt")

# Perform inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Get the prediction result: index of the highest-scoring class
logits = outputs.logits
predicted_class_id = torch.argmax(logits, dim=-1).item()
id2label = {0: 'CYtoplasmicMembrane', 1: 'Cellwall', 2: 'Cytoplasmic', 3: 'Extracellular', 4: 'OuterMembrane', 5: 'Periplasmic'}
predicted_label = id2label[predicted_class_id]

# Output the predicted class
print("===========================================================================================================================================")
print(f"Predicted class Label: {predicted_label}")
print("===========================================================================================================================================")
```
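Because this card describes a regression model, its output would normally be read as a single score rather than as an argmax over class logits. The sketch below shows one way to do that, assuming the checkpoint returns one value per sequence (logits of shape `(1, 1)`). The helper names and the `0.5` threshold are hypothetical; the card does not state which threshold was used.

```python
def immunogenicity_score(logits):
    """Extract the single regression value from logits of shape (1, 1)."""
    # Accept a torch.Tensor (through .tolist()) or an already nested list.
    values = logits.tolist() if hasattr(logits, "tolist") else logits
    flat = [float(v) for row in values for v in row]
    if len(flat) != 1:
        raise ValueError(f"expected exactly one output value, got {len(flat)}")
    return flat[0]


def is_predicted_epitope(score, threshold=0.5):
    """Compare the score against a threshold (0.5 is an assumed placeholder)."""
    return score >= threshold


# Illustrative value only; with the example above one would pass outputs.logits.
score = immunogenicity_score([[0.73]])
print(f"Predicted immunogenicity score: {score:.4f}")
print(f"Predicted epitope: {is_predicted_epitope(score)}")
```

With the transformers example above, the call would be `immunogenicity_score(outputs.logits)`, provided the loaded checkpoint has a single output.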
|
| 68 |
+
|
| 69 |
+
## Funding
|
| 70 |
+
This project was funded by the CDC to Justin Bahl (BAA 75D301-21-R-71738).
|
| 71 |
+
### Model architecture, coding and implementation
|
| 72 |
+
Sihua Peng
|
| 73 |
+
## Group, Department and Institution
|
| 74 |
+
### Lab: [Justin Bahl](https://bahl-lab.github.io/)
|
| 75 |
+
### Department: [College of Veterinary Medicine, Department of Infectious Diseases](https://vet.uga.edu/education/academic-departments/infectious-diseases/)
|
| 76 |
+
### Institution: [The University of Georgia](https://www.uga.edu/)
|
| 77 |
+
|
| 78 |
+

|