|
|
--- |
|
|
license: mit |
|
|
metrics: |
|
|
- accuracy |
|
|
tags: |
|
|
- biology |
|
|
pipeline_tag: text-classification |
|
|
--- |
|
|
# Model description |
|
|
In biology, "targeting peptides" typically refer to "targeting signal peptides" or "targeting sequences," also known as "signal peptides" or "signal sequences." These are short amino acid sequences located at the N-terminus or C-terminus of a protein that direct the protein to specific locations within the cell, such as the mitochondria, chloroplasts, plastids, and endoplasmic reticulum. Targeting peptides play a crucial signaling role during protein synthesis, ensuring that the protein is correctly localized to its intended cellular destination.
|
|
|
|
|
**TarPepSubLoc-ESM2** (TarPepSubLoc, Targeting Peptide Subcellular Localization) is a protein language model fine-tuned from the [**ESM2**](https://github.com/facebookresearch/esm) pretrained model [(***facebook/esm2_t36_3B_UR50D***)](https://huggingface.co/facebook/esm2_t36_3B_UR50D) on a targeting-peptide subcellular localization dataset with five classes.
|
|
|
|
|
**TarPepSubLoc-ESM2** achieved the following results:

- Train Loss: 0.0385
- Train Accuracy: 0.9881
- Validation Loss: 0.0566
- Validation Accuracy: 0.9812
- Epoch: 20
|
|
# The dataset for training **TarPepSubLoc-ESM2** |
|
|
The full dataset contains 13,005 protein sequences, including SP (2,697), MT (499), CH (227), TH (45), and Other (9,537). |
|
|
The highly imbalanced sample sizes across the five categories in this dataset pose a significant challenge for classification.
|
|
- "SP" for signal peptide, |
|
|
- "MT" for mitochondrial transit peptide (mTP), |
|
|
- "CH" for chloroplast transit peptide (cTP), |
|
|
- "TH" for thylakoidal lumen composite transit peptide (lTP), |
|
|
- "Other" for no targeting peptide (in this case, the length is given as 0). |
|
|
|
|
|
The dataset was downloaded from the website at [**TargetP - 2.0**](https://services.healthtech.dtu.dk/services/TargetP-2.0/). |
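The model card does not state how the class imbalance was handled during training, but a common mitigation is to weight the loss by inverse class frequency (e.g. as the `weight` argument to a cross-entropy loss). A minimal sketch of computing such weights from the class counts listed above:

```python
# Per-class sequence counts from the TargetP-2.0-derived dataset described above
counts = {"SP": 2697, "MT": 499, "CH": 227, "TH": 45, "Other": 9537}
total = sum(counts.values())   # 13,005 sequences in total
n_classes = len(counts)

# Inverse-frequency weights: rarer classes (e.g. TH) receive larger weights,
# so their misclassification contributes more to the loss
weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label, w in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{label}: {w:.2f}")
```

With this scheme the 45 "TH" sequences carry roughly 200 times the per-sample weight of the 9,537 "Other" sequences; whether this particular weighting was used for TarPepSubLoc-ESM2 is an assumption, not a statement from the authors.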
|
|
# Model training code on GitHub
|
|
https://github.com/pengsihua2023/TarPepSubLoc-ESM2 |
|
|
|
|
|
# How to use **TarPepSubLoc-ESM2** |
|
|
### An example |
|
|
The PyTorch and Transformers libraries must be installed on your system.
|
|
### Install PyTorch
|
|
```bash
pip install torch torchvision torchaudio
```
|
|
### Install Transformers
|
|
```bash
pip install transformers
```
|
|
### Run the following code |
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer from Hugging Face
model_name = "sihuapeng/TarPepSubLoc-ESM2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Define the amino acid sequence
sequence = "MNSLLMITACLALVGTVWAKEGYLVNSYTGCKFECFKLGDNDYCLRECRQQYGKGSGGYCYAFGCWCTHLYEQAVVWPLPNKTCNGK"

# Tokenize the sequence
inputs = tokenizer(sequence, return_tensors="pt")

# Make the prediction (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class_id = logits.argmax().item()

# Define the ID-to-label mapping
id2label = {0: 'CH', 1: 'MT', 2: 'Other', 3: 'SP', 4: 'TH'}

# Get the predicted label
predicted_label = id2label[predicted_class_id]

print(f"The predicted class for the sequence is: {predicted_label}")
```
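The example above reports only the top class. If per-class confidences are needed, the logits can be converted to probabilities with a softmax. A minimal stdlib sketch of that post-processing step (the logit values below are illustrative placeholders, not real model output):

```python
import math

id2label = {0: 'CH', 1: 'MT', 2: 'Other', 3: 'SP', 4: 'TH'}

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for the five classes (not actual model output)
logits = [-1.2, 0.3, 1.5, 4.8, -0.7]
probs = softmax(logits)

# Print classes from most to least probable
for class_id, p in sorted(enumerate(probs), key=lambda item: -item[1]):
    print(f"{id2label[class_id]}: {p:.4f}")
```

In the Transformers example itself the equivalent would be `torch.softmax(logits, dim=-1)` applied to the model output.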
|
|
|
|
|
## Funding |
|
|
This project was funded by the CDC to Justin Bahl (BAA 75D301-21-R-71738). |
|
|
### Model architecture, coding and implementation |
|
|
Sihua Peng |
|
|
## Group, Department and Institution |
|
|
### Lab: [Justin Bahl](https://bahl-lab.github.io/) |
|
|
### Department: [College of Veterinary Medicine Department of Infectious Diseases](https://vet.uga.edu/education/academic-departments/infectious-diseases/) |
|
|
### Institution: [The University of Georgia](https://www.uga.edu/) |
|
|
|
|
|
 |
|
|
|