---
license: gpl-3.0
language:
- en
base_model:
- facebook/esm2_t30_150M_UR50D
tags:
- peptides
- neurotoxicity
- protein-classification
- therapeutic-peptides
- bioinformatics
- esm2
- transformer
---

# NTxPred2: A large language model for predicting neurotoxic peptides and neurotoxins

NTxPred2 is a fine-tuned transformer model built on the [ESM2-t30_150M_UR50D](https://huggingface.co/facebook/esm2_t30_150M_UR50D) protein language model. It is trained for **binary classification** of peptide sequences: it predicts whether a peptide is **neurotoxic** or **non-toxic**.

🎯 **Use Case:** Accelerating the identification and design of safe peptide therapeutics by filtering out neurotoxic candidates early in the drug development pipeline.

---

### 🖼️ NTxPred2 Workflow



---

## 🧬 Model Highlights

- **Base Model:** Facebook's ESM2-t30 (150M parameters)
- **Fine-Tuning Task:** Neurotoxicity prediction (binary classification)
- **Input:** Short peptide sequences (7 to 50 amino acids)
- **Output:** Binary label: `1` (neurotoxic), `0` (non-toxic)
- **Architecture:** ESM2 encoder + linear classification head

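
The input constraint listed above can be checked before a sequence reaches the model. The following is a minimal sketch, not part of the NTxPred2 code base: the helper name `is_valid_peptide` is hypothetical, and it assumes that only the 20 standard amino acids are accepted, which this card does not state explicitly.

```python
# Hypothetical pre-filter for illustration; not taken from the NTxPred2 repository.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH, MAX_LENGTH = 7, 50  # length range stated in the model highlights


def is_valid_peptide(sequence: str) -> bool:
    """Return True if the sequence has 7 to 50 standard residues."""
    seq = sequence.strip().upper()
    if not MIN_LENGTH <= len(seq) <= MAX_LENGTH:
        return False
    # Assumption: residues outside the 20 standard letters are rejected.
    return all(residue in STANDARD_AMINO_ACIDS for residue in seq)


print(is_valid_peptide("ACDEFGHIKLMNPQRSTVWY"))  # 20 standard residues
print(is_valid_peptide("ACDE"))                  # shorter than 7 residues
```

Filtering this way keeps out-of-range inputs from producing predictions the model was not trained to make.
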
---

## Files Included

- `config.json`: Configuration settings for the model architecture, hyperparameters, and training details.

- `model.safetensors`: The trained model weights in SafeTensors format, which loads without executing pickled code and is typically faster to read than traditional `.bin` files.

- `special_tokens_map.json`: Mappings for the special tokens used by the tokenizer (for ESM tokenizers, tokens such as `<cls>`, `<eos>`, `<pad>`, and `<mask>`), plus any custom tokens.

- `tokenizer_config.json`: Tokenizer-related settings, such as vocabulary size and tokenization method.

- `vocab.txt`: The list of all tokens and their corresponding IDs, required for tokenization.

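
The loading code in this card reads two keys from `config.json`: `embedding_dim` and `num_classes`. The sketch below shows one way to check them before building the classifier. The sample values are assumptions rather than a copy of the published file: 640 is the hidden size of ESM2-t30, and a single output would match the sigmoid used in the usage example. The function name `parse_classifier_config` is hypothetical.

```python
import json

# Hypothetical config contents, for illustration only.
sample_config = '{"embedding_dim": 640, "num_classes": 1}'

REQUIRED_KEYS = ("embedding_dim", "num_classes")


def parse_classifier_config(text: str) -> dict:
    """Parse a config JSON string and check the keys the loader needs."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    for key in REQUIRED_KEYS:
        if not isinstance(config[key], int) or config[key] <= 0:
            raise ValueError(f"{key} must be a positive integer")
    return config


config = parse_classifier_config(sample_config)
print(config["embedding_dim"], config["num_classes"])
```
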
---

## How to Use

### Install Dependencies

```bash
# The `esm` module imported below is Meta's ESM library, published on PyPI as `fair-esm`
pip install torch fair-esm biopython huggingface_hub
```

### Loading the Model from Hugging Face

```python
import json

import torch
import torch.nn as nn
import esm
from huggingface_hub import hf_hub_download


# Define the classifier model (ESM encoder + linear head)
class ProteinClassifier(nn.Module):
    def __init__(self, esm_model, embedding_dim, num_classes):
        super().__init__()
        self.esm_model = esm_model
        self.fc = nn.Linear(embedding_dim, num_classes)

    def forward(self, tokens):
        # Use the representations from the final transformer layer
        layer_index = len(self.esm_model.layers)
        results = self.esm_model(tokens, repr_layers=[layer_index])
        # Mean-pool over the token dimension to get one vector per sequence
        embeddings = results["representations"][layer_index].mean(1)
        return self.fc(embeddings)


# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load config from the model repo
config_path = hf_hub_download(repo_id="anandr88/NTxPred2", filename="config.json")
with open(config_path, "r") as f:
    config = json.load(f)

# Load the pretrained ESM2 backbone and its alphabet
model_name = "esm2_t30_150M_UR50D"
esm_model, alphabet = esm.pretrained.load_model_and_alphabet(model_name)
batch_converter = alphabet.get_batch_converter()

# Build the classifier. Note: the linear head created here starts with
# random weights; this snippet does not load the fine-tuned weights.
classifier = ProteinClassifier(
    esm_model,
    embedding_dim=config["embedding_dim"],
    num_classes=config["num_classes"],
)
classifier.to(device)
classifier.eval()

print("✅ Model loaded successfully!")
print(f"Using device: {device}")
print(f"Model architecture: {classifier}")
```

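
The `forward` method above averages the per-token representations into one vector per sequence and passes that vector through a linear layer. The toy sketch below reproduces those two steps with plain Python lists so the arithmetic is easy to follow. The function names and numbers are made up for illustration, and it is not a substitute for the PyTorch code.

```python
def mean_pool(token_embeddings):
    """Average per-token vectors into a single fixed-size vector."""
    num_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [
        sum(token[d] for token in token_embeddings) / num_tokens
        for d in range(dim)
    ]


def linear_head(pooled, weights, bias):
    """Single-output linear layer: dot(weights, pooled) + bias."""
    return sum(w * x for w, x in zip(weights, pooled)) + bias


# Toy input: 3 tokens, each with a 2-dimensional embedding.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(tokens)
logit = linear_head(pooled, weights=[0.5, -0.25], bias=0.1)
print(pooled, logit)
```

One design point worth knowing: `.mean(1)` in the model averages over every position in the token tensor, which with the ESM2 batch converter includes the start and end tokens added around the peptide.
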
---

## 🧪 Example Usage (Optional)

```python
# Example usage for binary classification
sequence = ("TEST_SEQUENCE", "ACDEFGHIKLMNPQRSTVWY")  # (label, peptide sequence)

# Convert to model input format
_, _, batch_tokens = batch_converter([sequence])
batch_tokens = batch_tokens.to(device)

# Predict
with torch.no_grad():
    logits = classifier(batch_tokens)
    # Sigmoid for binary classification; .item() assumes a single output logit
    probability = torch.sigmoid(logits).item()

# Interpret results
threshold = 0.5  # Standard threshold (adjust if needed)
prediction = "Neurotoxic" if probability >= threshold else "Non-toxic"

print("\n" + "=" * 50)
print(f"🔬 Input Sequence: {sequence[1]}")
print(f"Neurotoxicity Probability: {probability:.4f}")
print(f"🏷️ Prediction: {prediction} (threshold={threshold})")
```

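
The decision rule in the example reduces to a sigmoid followed by a threshold comparison. Here is a standalone sketch of that step using only the standard library. The function name `label_from_logit` is hypothetical and the logit values are made up, so the printed numbers say nothing about the real model.

```python
import math


def label_from_logit(logit: float, threshold: float = 0.5):
    """Map a raw logit to (probability, label) with a sigmoid and a cutoff."""
    probability = 1.0 / (1.0 + math.exp(-logit))
    label = "Neurotoxic" if probability >= threshold else "Non-toxic"
    return probability, label


for logit in (-2.0, 0.0, 2.0):
    probability, label = label_from_logit(logit)
    print(f"logit={logit:+.1f} -> p={probability:.4f} -> {label}")
```

Raising the threshold makes the filter stricter about calling a peptide neurotoxic, which lowers false positives at the cost of missing more true neurotoxins.
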
---

## Applications

- **Neurotoxic peptide filtering** in therapeutic design
- **Toxicity scanning** of synthetic peptides
- **Dataset annotation** for bioactivity studies
- **Educational use** in bioinformatics and deep learning for proteins

---

## Related Links

- 🔬 Project Web Server: [NTxPred2 Web Tool](http://webs.iiitd.edu.in/raghava/ntxpred2)
- 🧾 Documentation & Source: [GitHub: raghavagps/NTxPred2](https://github.com/raghavagps/NTxPred2)

---

## Citation

Rathore et al.
_A Large Language Model for Predicting Neurotoxic Peptides and Neurotoxins._
**Coming soon**

---

👨‍🔬 Start using **NTxPred2** today to enhance your peptide screening pipeline with the power of **transformer-based intelligence**!