Commit 0c611b7 by sihuapeng · verified · 1 Parent(s): acef1ce

Update README.md

Files changed (1)
  1. README.md +78 -3
README.md CHANGED
@@ -1,3 +1,78 @@
- ---
- license: mit
- ---
+ license: mit
+ tags:
+ - biology
+ ---
+ # Model description
+ **MHC-II-EpiPred** (MHC-I-EpiPred, MHC I molecular epitope prediction) is a protein language model fine-tuned from the [**ESM2**](https://github.com/facebookresearch/esm) pretrained model [(***facebook/esm2_t33_650M_UR50D***)](https://huggingface.co/facebook/esm2_t33_650M_UR50D) on a dataset of T cell epitopes annotated with immunogenicity scores.
+
+ **MHC-II-EpiPred** is a regression model that takes a candidate epitope peptide as input and predicts its immunogenicity score.
+
+ **MHC-II-EpiPred** achieved the following results:
+ Average train loss (MSE): 0.0547
+ Average validation loss (MSE): 0.0535
+ Epoch: 3
+
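The losses above are reported as mean squared error (MSE). As a reminder of what that figure measures, here is a minimal sketch in plain Python. The function name `mean_squared_error` and the sample values are illustrative assumptions; they are not taken from the training code or the dataset.

```python
def mean_squared_error(predictions, targets):
    """Return the mean of the squared differences between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one value is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Illustrative values only (hypothetical, not from the training data)
predicted_scores = [0.5, 0.25, 1.0]
true_scores = [0.0, 0.25, 0.5]
print(mean_squared_error(predicted_scores, true_scores))
```

Taking the square root gives the error in the units of the score itself: a validation MSE of 0.0535 corresponds to a root-mean-square error of roughly 0.23.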
+ # The dataset for training **MHC-II-EpiPred**
+ The training data come from the paper by [Lee CH et al.](https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-023-01225-z) The data are in a CSV file with 9 columns and 100,097 samples. We used the first column (amino acid sequence), the second column (immunogenicity, positive or negative), and the ninth column (immunogenicity score) to fine-tune the pretrained ESM2 model as a regression model. Given the amino acid sequence of a candidate epitope, the model predicts an immunogenicity score; comparing that score against a chosen threshold then determines whether the peptide is called an epitope.
+
+ The dataset was downloaded from GitHub at [**TRAP**](https://github.com/ChloeHJ/TRAP/blob/main/data/pathogenic_db.csv).
+
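The column selection described above can be sketched with Python's standard `csv` module. This is a minimal illustration, not the authors' preprocessing code: the helper name `select_training_columns` is hypothetical, the sketch assumes the file has a header row and a numeric ninth column, and the in-memory sample uses made-up header names and values.

```python
import csv
import io


def select_training_columns(csv_file):
    """Yield (sequence, immunogenicity, score) from columns 1, 2 and 9 of each row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # assumption: the first row is a header
    for row in reader:
        if len(row) < 9:
            continue  # skip rows that do not have all nine columns
        yield row[0], row[1], float(row[8])


# Tiny in-memory stand-in for the real CSV; header names and values are made up
sample = io.StringIO(
    "c1,c2,c3,c4,c5,c6,c7,c8,c9\n"
    "SIINFEKL,Positive,x,x,x,x,x,x,0.9\n"
    "AAAAAAAAA,Negative,x,x,x,x,x,x,0.1\n"
)
records = list(select_training_columns(sample))
print(records)
```

For the downloaded file, the same function could be applied to `open("pathogenic_db.csv", newline="")`, assuming the file is saved locally under that name.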
+ # Model training code at GitHub
+ https://github.com/pengsihua2023/MHC-I-EpiPred-ESM2
+
+ # How to use **MHC-II-EpiPred**
+ ### An example
+ The PyTorch and transformers libraries must be installed on your system.
+ ### Install PyTorch
+ ```
+ pip install torch torchvision torchaudio
+ ```
+ ### Install transformers
+ ```
+ pip install transformers
+ ```
+ ### Run the following code
+ ```
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load the fine-tuned model and tokenizer.
+ # The model ID and label map are kept as given in this README; they appear to
+ # describe a six-class localization classifier rather than a single-score
+ # regression head, so substitute this model's repository ID if it differs.
+ model_name = "sihuapeng/PPPSL-ESM2"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Sample protein sequence
+ protein_sequence = "MSKKVLITGGAGYIGSVLTPILLEKGYEVCVIDNLMFDQISLLSCFHNKNFTFINGDAMDENLIRQEVAKADIIIPLAALVGAPLCKRNPKLAKMINYEAVKMISDFASPSQIFIYPNTNSGYGIGEKDAMCTEESPLRPISEYGIDKVHAEQYLLDKGNCVTFRLATVFGISPRMRLDLLVNDFTYRAYRDKFIVLFEEHFRRNYIHVRDVVKGFIHGIENYDKMKGQAYNMGLSSANLTKRQLAETIKKYIPDFYIHSANIGEDPDKRDYLVSNTKLEATGWKPDNTLEDGIKELLRAFKMMKVNRFANFN"
+
+ # Encode the sequence as model input
+ inputs = tokenizer(protein_sequence, return_tensors="pt")
+
+ # Perform inference without tracking gradients
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Get the prediction result: the index of the highest-scoring class
+ logits = outputs.logits
+ predicted_class_id = torch.argmax(logits, dim=-1).item()
+ id2label = {0: 'CytoplasmicMembrane', 1: 'Cellwall', 2: 'Cytoplasmic', 3: 'Extracellular', 4: 'OuterMembrane', 5: 'Periplasmic'}
+ predicted_label = id2label[predicted_class_id]
+
+ # Output the predicted class between separator lines
+ separator = "=" * 80
+ print(separator)
+ print(f"Predicted class Label: {predicted_label}")
+ print(separator)
+ ```
+
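Because the model description presents this as a regression model whose score is compared against a threshold, a score-based variant of the usage example might look like the sketch below. It is an illustration under stated assumptions, not the authors' inference code: the helper names `predict_immunogenicity` and `call_epitope` are hypothetical, the checkpoint is assumed to expose a single regression output, the model ID must be supplied by the caller, and the default threshold of 0.5 is a placeholder rather than a value from the source.

```python
def predict_immunogenicity(peptide, model_name):
    """Return the single regression output of the model for one peptide sequence."""
    # Imported lazily so the threshold helper below works without these packages
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(peptide, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Assumption: a regression head yields exactly one value per sequence
    if logits.numel() != 1:
        raise ValueError(f"expected one regression output, got shape {tuple(logits.shape)}")
    return logits.item()


def call_epitope(score, threshold=0.5):
    """Return True when the score meets the threshold (0.5 is a placeholder default)."""
    return score >= threshold


# Threshold step on its own, with made-up scores
print(call_epitope(0.7), call_epitope(0.3))
```

With a real checkpoint, the two steps would be chained as `call_epitope(predict_immunogenicity(peptide, model_name))`, where `model_name` is the repository ID of the regression model.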
+ ## Funding
+ This project was supported by CDC funding awarded to Justin Bahl (BAA 75D301-21-R-71738).
+ ### Model architecture, coding and implementation
+ Sihua Peng
+ ## Group, Department and Institution
+ ### Lab: [Justin Bahl](https://bahl-lab.github.io/)
+ ### Department: [College of Veterinary Medicine, Department of Infectious Diseases](https://vet.uga.edu/education/academic-departments/infectious-diseases/)
+ ### Institution: [The University of Georgia](https://www.uga.edu/)
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c56e2d2d07296c7e35994f/2rlokZM1FBTxibqrM8ERs.png)