---
license: mit
metrics:
- accuracy
tags:
- biology
pipeline_tag: text-classification
---
# Model description
**PPPSL** (Prediction of Prokaryotic Protein Subcellular Localization) is a protein language model fine-tuned from the [**ESM2**](https://github.com/facebookresearch/esm) pretrained model [(***facebook/esm2_t36_3B_UR50D***)](https://huggingface.co/facebook/esm2_t36_3B_UR50D) on a prokaryotic protein subcellular localization dataset.
**PPPSL** achieved the following results:
- Train loss: 0.0148
- Train accuracy: 0.9923
- Validation loss: 0.0718
- Validation accuracy: 0.9893
- Epochs: 20
# The dataset for training **PPPSL**
The full dataset contains 11,970 protein sequences across six classes: Cell wall (87), Cytoplasmic (6,905), Cytoplasmic membrane (2,567), Extracellular (1,085), Outer membrane (758), and Periplasmic (568).
The highly imbalanced sample sizes across the six categories pose a significant challenge for classification.
The dataset was downloaded from [**DeepLocPro - 1.0**](https://services.healthtech.dtu.dk/services/DeepLocPro-1.0/).
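Given the imbalance described above, one common mitigation during fine-tuning is inverse-frequency class weighting in the loss function. The sketch below is illustrative only (it is not taken from the released training code) and uses the class counts stated in this section:

```python
# Hypothetical sketch: inverse-frequency class weights for the six
# localization classes, using the counts from the dataset description.
counts = {
    "Cellwall": 87,
    "Cytoplasmic": 6905,
    "CytoplasmicMembrane": 2567,
    "Extracellular": 1085,
    "OuterMembrane": 758,
    "Periplasmic": 568,
}
total = sum(counts.values())      # 11,970 sequences in all
num_classes = len(counts)
# weight_c = total / (num_classes * count_c): rarer classes get larger weights,
# so their errors contribute more to the loss.
weights = {label: total / (num_classes * n) for label, n in counts.items()}
for label, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{label:>20s}: {w:.2f}")
```

Such weights could, for example, be passed to PyTorch's `CrossEntropyLoss(weight=...)` during fine-tuning; other strategies (oversampling, focal loss) are equally applicable.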
# Model training code at GitHub
https://github.com/pengsihua2023/PPPSL-ESM2
# How to use **PPPSL**
### An example
The PyTorch and Transformers libraries must be installed on your system.
### Install PyTorch
```bash
pip install torch torchvision torchaudio
```
### Install Transformers
```bash
pip install transformers
```
### Run the following code
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load the fine-tuned model and tokenizer
model_name = "sihuapeng/PPPSL-ESM2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Sample protein sequence
protein_sequence = "MSKKVLITGGAGYIGSVLTPILLEKGYEVCVIDNLMFDQISLLSCFHNKNFTFINGDAMDENLIRQEVAKADIIIPLAALVGAPLCKRNPKLAKMINYEAVKMISDFASPSQIFIYPNTNSGYGIGEKDAMCTEESPLRPISEYGIDKVHAEQYLLDKGNCVTFRLATVFGISPRMRLDLLVNDFTYRAYRDKFIVLFEEHFRRNYIHVRDVVKGFIHGIENYDKMKGQAYNMGLSSANLTKRQLAETIKKYIPDFYIHSANIGEDPDKRDYLVSNTKLEATGWKPDNTLEDGIKELLRAFKMMKVNRFANFN"
# Encode the sequence as model input
inputs = tokenizer(protein_sequence, return_tensors="pt")
# Perform inference using the model
with torch.no_grad():
    outputs = model(**inputs)
# Get the prediction result
logits = outputs.logits
predicted_class_id = torch.argmax(logits, dim=-1).item()
id2label = {0: 'CYtoplasmicMembrane', 1: 'Cellwall', 2: 'Cytoplasmic', 3: 'Extracellular', 4: 'OuterMembrane', 5: 'Periplasmic'}
predicted_label = id2label[predicted_class_id]
# Output the predicted class
print("=" * 100)
print(f"Predicted class label: {predicted_label}")
print("=" * 100)
```
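The example above reports only the argmax class. If you also want a confidence score, the logits can be converted to per-class probabilities with a softmax. The snippet below is a minimal, self-contained sketch: the `logits` list stands in for `outputs.logits[0].tolist()` from the code above, so it runs without downloading the model.

```python
import math

# id2label mapping from the inference example above.
id2label = {0: 'CYtoplasmicMembrane', 1: 'Cellwall', 2: 'Cytoplasmic',
            3: 'Extracellular', 4: 'OuterMembrane', 5: 'Periplasmic'}

# Dummy logits for illustration; in practice use outputs.logits[0].tolist().
logits = [5.1, -1.2, 0.3, -0.8, -2.0, -1.5]

# Numerically stable softmax: subtract the max before exponentiating.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
denom = sum(exps)
probs = [e / denom for e in exps]

pred = max(range(len(probs)), key=probs.__getitem__)
print(f"Predicted: {id2label[pred]} ({probs[pred]:.1%})")
```

With real model outputs, `torch.nn.functional.softmax(outputs.logits, dim=-1)` achieves the same thing in one call.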
## Funding
This project was funded by the CDC to Justin Bahl (BAA 75D301-21-R-71738).
### Model architecture, coding and implementation
Sihua Peng
## Group, Department and Institution
### Lab: [Justin Bahl](https://bahl-lab.github.io/)
### Department: [Department of Infectious Diseases, College of Veterinary Medicine](https://vet.uga.edu/education/academic-departments/infectious-diseases/)
### Institution: [The University of Georgia](https://www.uga.edu/)