---
license: gpl-3.0
language:
- en
base_model:
- facebook/esm2_t30_150M_UR50D
tags:
- peptides
- neurotoxicity
- protein-classification
- therapeutic-peptides
- bioinformatics
- esm2
- transformer
---
# 🧠 NTxPred2: A large language model for predicting neurotoxic peptides and neurotoxins
NTxPred2 is a fine-tuned transformer model built on top of the [ESM2-t30_150M_UR50D](https://huggingface.co/facebook/esm2_t30_150M_UR50D) protein language model. It is trained specifically for **binary classification** of peptide sequences, predicting whether a peptide is **neurotoxic** or **non-toxic**.
🎯 **Use Case:** Accelerating the identification and design of safe peptide therapeutics by filtering out neurotoxic candidates early in the drug development pipeline.
---
### 🖼️ NTxPred2 Workflow
![NTxPred2 Workflow](https://github.com/raghavagps/NTxPred2/raw/main/NTxPred2_workflow.png)
---
## 🧬 Model Highlights
- **Base Model:** Facebook's ESM2-t30 (150M parameters)
- **Fine-Tuning Task:** Neurotoxicity prediction (binary classification)
- **Input:** Short peptide sequences (7–50 amino acids)
- **Output:** Binary label → `1` (neurotoxic), `0` (non-toxic)
- **Architecture:** ESM2 encoder + linear classification head
---
## 🗂️ Files Included
- `config.json` – Configuration settings for the model architecture, hyperparameters, and training details.
- `model.safetensors` – The trained model weights in SafeTensors format, which is safer and faster to load than traditional `.bin` files.
- `special_tokens_map.json` – Mappings for the tokenizer's special tokens (e.g., classification and padding tokens).
- `tokenizer_config.json` – Tokenizer settings, such as the vocabulary size and tokenization method.
- `vocab.txt` – Lists all tokens and their corresponding IDs; required for sequence tokenization.
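
To fetch all of these files at once, `snapshot_download` from `huggingface_hub` can pull the whole repository into your local cache. This is a minimal sketch; it assumes the `anandr88/NTxPred2` repo id used in the loading code below.

```python
from huggingface_hub import snapshot_download

# Download every file listed above into the local Hugging Face cache
local_dir = snapshot_download(repo_id="anandr88/NTxPred2")
print(f"Model files available at: {local_dir}")
```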
---
## 🚀 How to Use
### 🔧 Install Dependencies
```bash
pip install torch fair-esm biopython huggingface_hub safetensors
```
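
The `esm.pretrained` API used below is provided by the `fair-esm` package (imported simply as `esm`). An optional, quick sanity check that the installation picked up the right packages:

```python
# Verify that the core dependencies import and that `esm` exposes the fair-esm API
import torch, esm, Bio, huggingface_hub

print("torch:", torch.__version__)
print("biopython:", Bio.__version__)
print("huggingface_hub:", huggingface_hub.__version__)
print("esm.pretrained available:", hasattr(esm, "pretrained"))
```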
### Loading the Model from Hugging Face
```python
import torch
import torch.nn as nn
import esm
import json
from huggingface_hub import hf_hub_download

# Define the classifier model (ESM2 encoder + linear classification head)
class ProteinClassifier(nn.Module):
    def __init__(self, esm_model, embedding_dim, num_classes):
        super(ProteinClassifier, self).__init__()
        self.esm_model = esm_model
        self.fc = nn.Linear(embedding_dim, num_classes)

    def forward(self, tokens):
        layer_index = len(self.esm_model.layers)  # Use the final encoder layer
        results = self.esm_model(tokens, repr_layers=[layer_index])
        embeddings = results["representations"][layer_index].mean(1)  # Mean-pool over residue positions
        return self.fc(embeddings)

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the classifier configuration from the Hugging Face repo
config_path = hf_hub_download(repo_id="anandr88/NTxPred2", filename="config.json")
with open(config_path, 'r') as f:
    config = json.load(f)

# Load the ESM2 base model and its alphabet
model_name = "esm2_t30_150M_UR50D"
esm_model, alphabet = esm.pretrained.load_model_and_alphabet(model_name)
batch_converter = alphabet.get_batch_converter()

# Initialize the classifier architecture
# (the classification head is randomly initialized here; the fine-tuned weights
# from model.safetensors are loaded in the next step)
classifier = ProteinClassifier(
    esm_model,
    embedding_dim=config['embedding_dim'],
    num_classes=config['num_classes']
)
classifier.to(device)
classifier.eval()

print("✅ Model initialized!")
print(f"Using device: {device}")
print(f"Model architecture: {classifier}")
```
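
The snippet above only builds the architecture; the fine-tuned weights live in `model.safetensors`. Below is a minimal sketch for loading them with the `safetensors` package. It assumes the file stores a state dict whose keys match the `ProteinClassifier` defined above; if the keys carry a different prefix from training, inspect the reported missing/unexpected keys and remap them accordingly.

```python
from safetensors.torch import load_file

# Download the fine-tuned weights from the same repo
weights_path = hf_hub_download(repo_id="anandr88/NTxPred2", filename="model.safetensors")

# Assumption: the file holds a state dict matching ProteinClassifier's parameter names
state_dict = load_file(weights_path)
missing, unexpected = classifier.load_state_dict(state_dict, strict=False)
print("Missing keys:", missing)
print("Unexpected keys:", unexpected)

classifier.to(device)
classifier.eval()
print("✅ Fine-tuned weights loaded!")
```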
---
## 🧪 Example Usage (Optional)
```python
# Example usage for binary classification
sequence = ("TEST_SEQUENCE", "ACDEFGHIKLMNPQRSTVWY")  # (name, peptide sequence)

# Convert to model input format
_, _, batch_tokens = batch_converter([sequence])
batch_tokens = batch_tokens.to(device)

# Predict
with torch.no_grad():
    logits = classifier(batch_tokens)
    probability = torch.sigmoid(logits).item()  # Sigmoid for binary classification

# Interpret results
threshold = 0.5  # Standard threshold (adjust if needed)
prediction = "Neurotoxic" if probability >= threshold else "Non-toxic"

print("\n" + "=" * 50)
print(f"🔬 Input Sequence: {sequence[1]}")
print(f"📊 Neurotoxicity Probability: {probability:.4f}")
print(f"🏷️ Prediction: {prediction} (threshold={threshold})")
```
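
Because Biopython is among the dependencies, the same classifier can score every peptide in a FASTA file. This is a minimal sketch; `peptides.fasta` is a placeholder file name, and it assumes `classifier`, `batch_converter`, and `device` from the snippets above are already in scope.

```python
from Bio import SeqIO

# Read (name, sequence) pairs from a FASTA file of peptides (placeholder path)
records = [(rec.id, str(rec.seq)) for rec in SeqIO.parse("peptides.fasta", "fasta")]

results = []
with torch.no_grad():
    for name, seq in records:
        _, _, tokens = batch_converter([(name, seq)])
        prob = torch.sigmoid(classifier(tokens.to(device))).item()
        label = "Neurotoxic" if prob >= 0.5 else "Non-toxic"
        results.append((name, prob, label))

# Print a simple tab-separated report
for name, prob, label in results:
    print(f"{name}\t{prob:.4f}\t{label}")
```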
---
## 📊 Applications
- **Neurotoxic peptide filtering** in therapeutic design
- **Toxicity scanning** of synthetic peptides
- **Dataset annotation** for bioactivity studies
- **Educational use** in bioinformatics and deep learning for proteins
---
## 🌐 Related Links
- 🔬 Project Web Server: [NTxPred2 Web Tool](http://webs.iiitd.edu.in/raghava/ntxpred2)
- 🧾 Documentation & Source: [GitHub – raghavagps/NTxPred2](https://github.com/raghavagps/NTxPred2)
---
## 🧠 Citation
📖 Rathore et al.
_A Large Language Model for Predicting Neurotoxic Peptides and Neurotoxins._
**Coming soon.**
---
👨‍🔬 Start using **NTxPred2** today to enhance your peptide screening pipeline with the power of **transformer-based intelligence**!