---
license: gpl-3.0
language:
- en
base_model:
- facebook/esm2_t33_650M_UR50D
tags:
- Ion Channel Impairing Proteins
- neurotoxicity
- protein-classification
- therapeutic-peptides
- bioinformatics
- esm2
- transformer
---
# 🧠 IonNTxPred: LLM-based Prediction and Designing of Ion Channel Impairing Proteins
IonNTxPred is a fine-tuned transformer model built on top of the [esm2_t33_650M_UR50D](https://huggingface.co/facebook/esm2_t33_650M_UR50D) protein language model. It is specifically trained for **binary classification** of peptide sequences β€” predicting whether a peptide is **ion channel modulating** or **non-modulating**.
🎯 **Use Case:** Accelerating the identification and design of safe peptide therapeutics by filtering out ion channel impairing/modulating candidates early in the drug development pipeline.
---
### 🖼️ IonNTxPred Workflow
![IonNTxPred Workflow](https://webs.iiitd.edu.in/raghava/ionntxpred/images/IonNTxPred.png)
---
## 🧬 Model Highlights
- **Base Model:** Facebook’s ESM2-t33 (650M parameters)
- **Fine-Tuning Task:** Ion channel toxins prediction (binary classification)
- **Input:** Protein/Peptide sequences
- **Output:** Binary label β†’ `1` (Ion Channel Modulating), `0` (non-modulating)
- **Architecture:** ESM2 encoder + linear classification head
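
The output convention above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes the classification head emits two logits per sequence, with index 1 standing for the modulating class, and it borrows the softmax and 0.5 threshold from the usage examples in this README. The helper names `softmax` and `to_label` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_label(logits, threshold=0.5):
    """Map two-class logits to (label, probability of class 1).

    Assumption: index 1 is the "ion channel modulating" class.
    """
    prob_modulating = softmax(logits)[1]
    label = 1 if prob_modulating > threshold else 0
    return label, prob_modulating


# Hypothetical logits, chosen only to show the mapping
label, prob = to_label([-1.0, 2.0])
print(label, round(prob, 4))
```

Because the comparison is strict (`>`), a probability of exactly 0.5 maps to label `0`, matching the thresholding used in the examples.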
---
## 🗂️ Files Included
- `config.json` – Configuration settings for the model architecture, hyperparameters, and training details.
- `model.safetensors` – The trained model weights in SafeTensors format, which is safer and faster to load than traditional `.bin` files.
- `special_tokens_map.json` – Mappings for the special tokens used by the tokenizer, such as `<cls>`, `<eos>`, `<pad>`, and `<mask>`.
- `tokenizer_config.json` – Tokenizer-related settings (such as vocabulary size and tokenization method).
- `vocab.txt` – Lists all tokens and their corresponding IDs; essential for sequence tokenization.
---
# 🚀 How to Use
---
# Predict Sodium channel modulating proteins
### 🔧 Install Dependencies
```bash
pip install torch transformers safetensors huggingface_hub biopython
```
### Loading the Model from Hugging Face
```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoTokenizer, EsmForSequenceClassification

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Downloading fine-tuned models & weights...")
repo_id = "anandr88/IonNTxPred"
subfolder = "saved_model_t33_na"

# Load the tokenizer and the fine-tuned model (ESM2 encoder + classification head)
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

# Optional check: download the raw weights and list the classification-head tensors
weights_path = hf_hub_download(repo_id=repo_id, filename=f"{subfolder}/model.safetensors")
state_dict = load_file(weights_path, device="cpu")
for key, tensor in state_dict.items():
    if key.startswith("classifier."):
        print(f"Found classifier tensor: {key} (shape={tuple(tensor.shape)})")

# Move to device and set to eval mode
model = model.to(device)
model.eval()
print(f"\nModel successfully loaded on {device} and ready for inference!")
```
### 🧪 Example Usage (Optional)
```python
from transformers import AutoTokenizer, EsmForSequenceClassification
import torch

# Define the repository ID and subfolder
repo_id = "anandr88/IonNTxPred"
subfolder = "saved_model_t33_na"

# Load the tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

# Move the model to the appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Function to make predictions
def make_predictions(model, inputs, device):
    """Return the probability of class 1 (modulating) for each input sequence."""
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
    return probs

# Example protein sequence
protein_sequence = "MKASTLVVIFIVIFITISSFSIHDVQASGVEKREQKDCLKKLKLCKENKDCCSKSCKRRGTNIEKRCR"

# Tokenize the input sequence
inputs = tokenizer(protein_sequence, return_tensors="pt", truncation=True, padding=True)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Make predictions
prediction = make_predictions(model, inputs, device)

# Apply threshold for final classification
threshold = 0.5
final_prediction = "Na+ channel modulating" if prediction[0] > threshold else "Not Na+ channel modulating"
print(f"📊 Prediction Probability: {prediction[0]:.4f}")
print(f"🏷️ Final Prediction: {final_prediction}")
```
---
# Predict Potassium channel modulating proteins
---
### 🔧 Install Dependencies
```bash
pip install torch transformers safetensors huggingface_hub biopython
```
### Loading the Model from Hugging Face
```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoTokenizer, EsmForSequenceClassification

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Downloading fine-tuned models & weights...")
repo_id = "anandr88/IonNTxPred"
subfolder = "saved_model_t33_k"

# Load the tokenizer and the fine-tuned model (ESM2 encoder + classification head)
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

# Optional check: download the raw weights and list the classification-head tensors
weights_path = hf_hub_download(repo_id=repo_id, filename=f"{subfolder}/model.safetensors")
state_dict = load_file(weights_path, device="cpu")
for key, tensor in state_dict.items():
    if key.startswith("classifier."):
        print(f"Found classifier tensor: {key} (shape={tuple(tensor.shape)})")

# Move to device and set to eval mode
model = model.to(device)
model.eval()
print(f"\nModel successfully loaded on {device} and ready for inference!")
```
### 🧪 Example Usage (Optional)
```python
from transformers import AutoTokenizer, EsmForSequenceClassification
import torch

# Define the repository ID and subfolder
repo_id = "anandr88/IonNTxPred"
subfolder = "saved_model_t33_k"

# Load the tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

# Move the model to the appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Function to make predictions
def make_predictions(model, inputs, device):
    """Return the probability of class 1 (modulating) for each input sequence."""
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
    return probs

# Example protein sequence
protein_sequence = "MKASTLVVIFIVIFITISSFSIHDVQASGVEKREQKDCLKKLKLCKENKDCCSKSCKRRGTNIEKRCR"

# Tokenize the input sequence
inputs = tokenizer(protein_sequence, return_tensors="pt", truncation=True, padding=True)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Make predictions
prediction = make_predictions(model, inputs, device)

# Apply threshold for final classification
threshold = 0.5
final_prediction = "K+ channel modulating" if prediction[0] > threshold else "Not K+ channel modulating"
print(f"📊 Prediction Probability: {prediction[0]:.4f}")
print(f"🏷️ Final Prediction: {final_prediction}")
```
---
# Predict Calcium channel modulating proteins
---
### 🔧 Install Dependencies
```bash
pip install torch transformers safetensors huggingface_hub biopython
```
### Loading the Model from Hugging Face
```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoTokenizer, EsmForSequenceClassification

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Downloading fine-tuned models & weights...")
repo_id = "anandr88/IonNTxPred"
subfolder = "saved_model_t33_ca"

# Load the tokenizer and the fine-tuned model (ESM2 encoder + classification head)
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

# Optional check: download the raw weights and list the classification-head tensors
weights_path = hf_hub_download(repo_id=repo_id, filename=f"{subfolder}/model.safetensors")
state_dict = load_file(weights_path, device="cpu")
for key, tensor in state_dict.items():
    if key.startswith("classifier."):
        print(f"Found classifier tensor: {key} (shape={tuple(tensor.shape)})")

# Move to device and set to eval mode
model = model.to(device)
model.eval()
print(f"\nModel successfully loaded on {device} and ready for inference!")
```
### 🧪 Example Usage (Optional)
```python
from transformers import AutoTokenizer, EsmForSequenceClassification
import torch

# Define the repository ID and subfolder
repo_id = "anandr88/IonNTxPred"
subfolder = "saved_model_t33_ca"

# Load the tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

# Move the model to the appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Function to make predictions
def make_predictions(model, inputs, device):
    """Return the probability of class 1 (modulating) for each input sequence."""
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
    return probs

# Example protein sequence
protein_sequence = "MKASTLVVIFIVIFITISSFSIHDVQASGVEKREQKDCLKKLKLCKENKDCCSKSCKRRGTNIEKRCR"

# Tokenize the input sequence
inputs = tokenizer(protein_sequence, return_tensors="pt", truncation=True, padding=True)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Make predictions
prediction = make_predictions(model, inputs, device)

# Apply threshold for final classification
threshold = 0.5
final_prediction = "Ca++ channel modulating" if prediction[0] > threshold else "Not Ca++ channel modulating"
print(f"📊 Prediction Probability: {prediction[0]:.4f}")
print(f"🏷️ Final Prediction: {final_prediction}")
```
---
# Predict other ion channel modulating proteins
---
### 🔧 Install Dependencies
```bash
pip install torch transformers safetensors huggingface_hub biopython
```
### Loading the Model from Hugging Face
```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoTokenizer, EsmForSequenceClassification

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Downloading fine-tuned models & weights...")
repo_id = "anandr88/IonNTxPred"
subfolder = "saved_model_t33_other"

# Load the tokenizer and the fine-tuned model (ESM2 encoder + classification head)
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

# Optional check: download the raw weights and list the classification-head tensors
weights_path = hf_hub_download(repo_id=repo_id, filename=f"{subfolder}/model.safetensors")
state_dict = load_file(weights_path, device="cpu")
for key, tensor in state_dict.items():
    if key.startswith("classifier."):
        print(f"Found classifier tensor: {key} (shape={tuple(tensor.shape)})")

# Move to device and set to eval mode
model = model.to(device)
model.eval()
print(f"\nModel successfully loaded on {device} and ready for inference!")
```
### 🧪 Example Usage (Optional)
```python
from transformers import AutoTokenizer, EsmForSequenceClassification
import torch

# Define the repository ID and subfolder
repo_id = "anandr88/IonNTxPred"
subfolder = "saved_model_t33_other"

# Load the tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

# Move the model to the appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Function to make predictions
def make_predictions(model, inputs, device):
    """Return the probability of class 1 (modulating) for each input sequence."""
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
    return probs

# Example protein sequence
protein_sequence = "MKASTLVVIFIVIFITISSFSIHDVQASGVEKREQKDCLKKLKLCKENKDCCSKSCKRRGTNIEKRCR"

# Tokenize the input sequence
inputs = tokenizer(protein_sequence, return_tensors="pt", truncation=True, padding=True)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Make predictions
prediction = make_predictions(model, inputs, device)

# Apply threshold for final classification
threshold = 0.5
final_prediction = "Other channel modulating" if prediction[0] > threshold else "Not Other channel modulating"
print(f"📊 Prediction Probability: {prediction[0]:.4f}")
print(f"🏷️ Final Prediction: {final_prediction}")
```
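
The four sections above differ only in the `subfolder` name and the label text, so they can be folded into one helper. The sketch below is a convenience wrapper written under assumptions, not part of the IonNTxPred distribution: the names `REPO_ID`, `SUBFOLDERS`, `load_channel_model`, and `predict_all` are hypothetical, while the repository ID and subfolder names are taken from the examples above. The heavy imports are deferred into the functions so that the mapping can be inspected without downloading anything.

```python
REPO_ID = "anandr88/IonNTxPred"

# Channel type -> model subfolder, as used in the sections above
SUBFOLDERS = {
    "Na+": "saved_model_t33_na",
    "K+": "saved_model_t33_k",
    "Ca++": "saved_model_t33_ca",
    "Other": "saved_model_t33_other",
}


def load_channel_model(channel, device="cpu"):
    """Load tokenizer and fine-tuned classifier for one channel type.

    Downloads the files from the Hugging Face Hub on first use.
    """
    from transformers import AutoTokenizer, EsmForSequenceClassification

    subfolder = SUBFOLDERS[channel]
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, subfolder=subfolder)
    model = EsmForSequenceClassification.from_pretrained(REPO_ID, subfolder=subfolder)
    return tokenizer, model.to(device).eval()


def predict_all(sequence, device="cpu"):
    """Return {channel: probability of class 1} for a single sequence."""
    import torch

    scores = {}
    for channel in SUBFOLDERS:
        tokenizer, model = load_channel_model(channel, device)
        inputs = tokenizer(sequence, return_tensors="pt", truncation=True)
        inputs = {key: value.to(device) for key, value in inputs.items()}
        with torch.no_grad():
            logits = model(**inputs).logits
        # Assumption: index 1 is the "modulating" class, as in the examples above
        scores[channel] = torch.softmax(logits, dim=1)[0, 1].item()
    return scores
```

Calling `predict_all(protein_sequence)` loads each 650M-parameter model in turn, so when screening many sequences it is likely cheaper to load each model once and loop over the sequences inside.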
---
## 📊 Applications
- **Ion channel impairing proteins filtering** in therapeutic design
- **Toxicity scanning** of synthetic peptides
- **Dataset annotation** for bioactivity studies
- **Educational use** in bioinformatics and deep learning for proteins
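
As a sketch of the filtering use case, the snippet below keeps only candidates whose predicted probability stays at or below a threshold for every channel type. It assumes the per-channel probabilities have already been computed (for example with the usage code earlier in this README). The function name `filter_safe_candidates`, the score layout, and the example peptide names and numbers are all made up for illustration and are not model output.

```python
def filter_safe_candidates(scores, threshold=0.5):
    """Return names of candidates with no channel probability above the threshold.

    scores: {candidate_name: {channel: probability}} (hypothetical layout)
    """
    return [
        name
        for name, per_channel in scores.items()
        if all(prob <= threshold for prob in per_channel.values())
    ]


# Made-up probabilities for illustration only (not real predictions)
example_scores = {
    "pep_A": {"Na+": 0.12, "K+": 0.08, "Ca++": 0.30, "Other": 0.05},
    "pep_B": {"Na+": 0.91, "K+": 0.22, "Ca++": 0.10, "Other": 0.07},
    "pep_C": {"Na+": 0.40, "K+": 0.49, "Ca++": 0.50, "Other": 0.02},
}
print(filter_safe_candidates(example_scores))
```

The `<=` comparison mirrors the examples' rule that only a probability strictly above the threshold counts as modulating. A stricter screen can pass a lower `threshold`.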
---
## 🌐 Related Links
- πŸ”¬ Project Web Server: [IonNTxpred Web Tool](http://webs.iiitd.edu.in/raghava/ionntxpred)
- 🧾 Documentation & Source: [GitHub – raghavagps/IonNTxPred](https://github.com/raghavagps/ionntxpred)
---
## 🧠 Citation
πŸ“– Rathore et al.
_A Large Language Model for Predicting Neurotoxic Peptides and Neurotoxins._
**#Coming Soon#**
---
👨‍🔬 Start using **IonNTxPred** today to enhance your protein/peptide screening pipeline with the power of **transformer-based intelligence**!