	---
	license: gpl-3.0
	language:
	- en
	base_model:
	- facebook/esm2_t33_650M_UR50D
	tags:
	- Ion Channel Impairing Proteins
	- neurotoxicity
	- protein-classification
	- therapeutic-peptides
	- bioinformatics
	- esm2
	- transformer
	---

	# 🧠 IonNTxPred: LLM-based Prediction and Designing of Ion Channel Impairing Proteins

	IonNTxPred is a set of fine-tuned transformer models built on top of the [esm2_t33_650M_UR50D](https://huggingface.co/facebook/esm2_t33_650M_UR50D) protein language model. Each model performs binary classification of peptide sequences, predicting whether a peptide modulates a given class of ion channel (sodium, potassium, calcium, or other) or not.

	🎯 Use Case: Accelerating the identification and design of safe peptide therapeutics by filtering out ion channel impairing/modulating candidates early in the drug development pipeline.

	---

	### 🖼️ IonNTxPred Workflow

	![IonNTxPred Workflow](https://webs.iiitd.edu.in/raghava/ionntxpred/images/IonNTxPred.png)

	---

	## 🧬 Model Highlights

	- Base Model: Facebook’s ESM2-t33 (650M parameters)
	- Fine-Tuning Task: Ion channel toxin prediction (binary classification)
	- Input: Protein/Peptide sequences
	- Output: Binary label → `1` (Ion Channel Modulating), `0` (non-modulating)
	- Architecture: ESM2 encoder + linear classification head
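
As a rough illustration of the output convention listed above, the sketch below turns a pair of raw class scores (logits) into a probability for class `1` and then into a label. The helper names `softmax` and `to_label`, the example logits, and the 0.5 cut-off are assumptions made for illustration; they are not part of the released code.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_label(logits, threshold=0.5):
    """Return (label, probability of class 1) for a two-class logit pair."""
    prob_modulating = softmax(logits)[1]
    label = 1 if prob_modulating > threshold else 0
    return label, prob_modulating

# Made-up logits, purely for illustration
label, prob = to_label([-1.2, 2.3])
print(f"label={label}, probability={prob:.4f}")
```

A probability exactly at the threshold maps to `0` here, which matches the strict `>` comparison used in the usage examples further down.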

	---

	## 🗂️ Files Included


	Each model subfolder (for example `saved_model_t33_na`) contains:

	- `config.json` – Model configuration read by `from_pretrained` (architecture settings such as layer sizes and the number of labels).

	- `model.safetensors` – The fine-tuned model weights in the SafeTensors format, which avoids the arbitrary code execution risk of pickle-based `.bin` files and is typically faster to load.

	- `special_tokens_map.json` – Maps special-token roles to the tokens used by the tokenizer, such as `<cls>`, `<eos>`, `<pad>`, and `<mask>`.

	- `tokenizer_config.json` – Tokenizer settings (tokenizer class and related options).

	- `vocab.txt` – The tokenizer vocabulary, one token per line; it is required to tokenize input sequences.
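
If a model folder has been downloaded to disk, a quick completeness check along these lines can catch a partial download before loading. This is a minimal sketch: the helper name `missing_model_files` and the local folder path are hypothetical, and only the file names come from the list above.

```python
from pathlib import Path

# File names taken from the list above
EXPECTED_FILES = [
    "config.json",
    "model.safetensors",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "vocab.txt",
]

def missing_model_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    folder = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]

# Hypothetical local copy of one model subfolder
print(missing_model_files("saved_model_t33_na"))
```

An empty list means every expected file is present.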
	---

	# 🚀 How to Use
	---
	# Predict Sodium channel modulating proteins

	### 🔧 Install Dependencies

	```bash
	pip install torch esm biopython huggingface_hub transformers safetensors
	```

	### Loading the Model from Hugging Face

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	from transformers import AutoTokenizer, EsmForSequenceClassification

	# Set device
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	print("Downloading fine-tuned model & weights...")
	repo_id = "anandr88/IonNTxPred"
	subfolder = "saved_model_t33_na"

	# Load the tokenizer and model from Hugging Face.
	# from_pretrained() reads model.safetensors, so the fine-tuned encoder and
	# the classification head are both loaded here.
	tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
	model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

	# Optional: download the raw weights and inspect the classification head
	weights_path = hf_hub_download(repo_id=repo_id, filename=f"{subfolder}/model.safetensors")
	state_dict = load_file(weights_path)
	for key, tensor in state_dict.items():
	    if key.startswith("classifier."):
	        print(f"Found classifier tensor: {key} with shape {tuple(tensor.shape)}")

	# Move to device and set to eval mode
	model = model.to(device)
	model.eval()

	print(f"\nModel successfully loaded on {device} and ready for inference!")
	```



	### 🧪 Example Usage (Optional)



	```python
	from transformers import AutoTokenizer, EsmForSequenceClassification
	import torch

	# Define the repository ID and subfolder
	repo_id = "anandr88/IonNTxPred"
	subfolder = "saved_model_t33_na"

	# Load the tokenizer and model from Hugging Face
	tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
	model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

	# Move the model to the appropriate device
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model.to(device)
	model.eval()

	# Function to make predictions
	def make_predictions(model, inputs, device):
	    with torch.no_grad():
	        outputs = model(**inputs)
	        logits = outputs.logits
	        probs = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
	    return probs

	# Example protein sequence
	protein_sequence = "MKASTLVVIFIVIFITISSFSIHDVQASGVEKREQKDCLKKLKLCKENKDCCSKSCKRRGTNIEKRCR"

	# Tokenize the input sequence
	inputs = tokenizer(protein_sequence, return_tensors="pt", truncation=True, padding=True)
	inputs = {key: value.to(device) for key, value in inputs.items()}

	# Make predictions
	prediction = make_predictions(model, inputs, device)

	# Apply threshold for final classification
	threshold = 0.5
	final_prediction = "Na+ channel modulating" if prediction[0] > threshold else "Not Na+ channel modulating"

	print(f"📊 Prediction Probability: {prediction[0]:.4f}")
	print(f"🏷️ Final Prediction: {final_prediction}")
	```
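
The example above scores a single hard-coded sequence. To score many peptides, the sequences usually come from a FASTA file. The sketch below parses FASTA text and keeps only sequences made of the 20 standard amino acids, using the standard library only. The helper names `read_fasta` and `is_standard`, the record names, and the restriction to standard residues are assumptions for illustration; this README does not state which residues the models accept.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def is_standard(sequence):
    """True if the sequence is non-empty and uses only standard residues."""
    return bool(sequence) and set(sequence) <= VALID_RESIDUES

# Made-up FASTA content; pep1 is a prefix of the example sequence above
fasta_text = """>pep1
MKASTLVVIFIVIFITISSF
SIHDVQASG
>pep2
ACDXZ
"""

records = read_fasta(fasta_text)
clean = [(name, seq) for name, seq in records if is_standard(seq)]
print(clean)
```

Each sequence in `clean` can then be passed to the tokenizer in place of `protein_sequence`.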

	---

	# Predict Potassium channel modulating proteins
	---
	### 🔧 Install Dependencies

	```bash
	pip install torch esm biopython huggingface_hub transformers safetensors
	```

	### Loading the Model from Hugging Face

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	from transformers import AutoTokenizer, EsmForSequenceClassification

	# Set device
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	print("Downloading fine-tuned model & weights...")
	repo_id = "anandr88/IonNTxPred"
	subfolder = "saved_model_t33_k"

	# Load the tokenizer and model from Hugging Face.
	# from_pretrained() reads model.safetensors, so the fine-tuned encoder and
	# the classification head are both loaded here.
	tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
	model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

	# Optional: download the raw weights and inspect the classification head
	weights_path = hf_hub_download(repo_id=repo_id, filename=f"{subfolder}/model.safetensors")
	state_dict = load_file(weights_path)
	for key, tensor in state_dict.items():
	    if key.startswith("classifier."):
	        print(f"Found classifier tensor: {key} with shape {tuple(tensor.shape)}")

	# Move to device and set to eval mode
	model = model.to(device)
	model.eval()

	print(f"\nModel successfully loaded on {device} and ready for inference!")
	```


	### 🧪 Example Usage (Optional)


	```python
	from transformers import AutoTokenizer, EsmForSequenceClassification
	import torch

	# Define the repository ID and subfolder
	repo_id = "anandr88/IonNTxPred"
	subfolder = "saved_model_t33_k"

	# Load the tokenizer and model from Hugging Face
	tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
	model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

	# Move the model to the appropriate device
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model.to(device)
	model.eval()

	# Function to make predictions
	def make_predictions(model, inputs, device):
	    with torch.no_grad():
	        outputs = model(**inputs)
	        logits = outputs.logits
	        probs = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
	    return probs

	# Example protein sequence
	protein_sequence = "MKASTLVVIFIVIFITISSFSIHDVQASGVEKREQKDCLKKLKLCKENKDCCSKSCKRRGTNIEKRCR"

	# Tokenize the input sequence
	inputs = tokenizer(protein_sequence, return_tensors="pt", truncation=True, padding=True)
	inputs = {key: value.to(device) for key, value in inputs.items()}

	# Make predictions
	prediction = make_predictions(model, inputs, device)

	# Apply threshold for final classification
	threshold = 0.5
	final_prediction = "K+ channel modulating" if prediction[0] > threshold else "Not K+ channel modulating"

	print(f"📊 Prediction Probability: {prediction[0]:.4f}")
	print(f"🏷️ Final Prediction: {final_prediction}")
	```

	---


	# Predict Calcium channel modulating proteins
	---
	### 🔧 Install Dependencies

	```bash
	pip install torch esm biopython huggingface_hub transformers safetensors
	```

	### Loading the Model from Hugging Face

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	from transformers import AutoTokenizer, EsmForSequenceClassification

	# Set device
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	print("Downloading fine-tuned model & weights...")
	repo_id = "anandr88/IonNTxPred"
	subfolder = "saved_model_t33_ca"

	# Load the tokenizer and model from Hugging Face.
	# from_pretrained() reads model.safetensors, so the fine-tuned encoder and
	# the classification head are both loaded here.
	tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
	model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

	# Optional: download the raw weights and inspect the classification head
	weights_path = hf_hub_download(repo_id=repo_id, filename=f"{subfolder}/model.safetensors")
	state_dict = load_file(weights_path)
	for key, tensor in state_dict.items():
	    if key.startswith("classifier."):
	        print(f"Found classifier tensor: {key} with shape {tuple(tensor.shape)}")

	# Move to device and set to eval mode
	model = model.to(device)
	model.eval()

	print(f"\nModel successfully loaded on {device} and ready for inference!")
	```


	### 🧪 Example Usage (Optional)


	```python
	from transformers import AutoTokenizer, EsmForSequenceClassification
	import torch

	# Define the repository ID and subfolder
	repo_id = "anandr88/IonNTxPred"
	subfolder = "saved_model_t33_ca"

	# Load the tokenizer and model from Hugging Face
	tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
	model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

	# Move the model to the appropriate device
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model.to(device)
	model.eval()

	# Function to make predictions
	def make_predictions(model, inputs, device):
	    with torch.no_grad():
	        outputs = model(**inputs)
	        logits = outputs.logits
	        probs = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
	    return probs

	# Example protein sequence
	protein_sequence = "MKASTLVVIFIVIFITISSFSIHDVQASGVEKREQKDCLKKLKLCKENKDCCSKSCKRRGTNIEKRCR"

	# Tokenize the input sequence
	inputs = tokenizer(protein_sequence, return_tensors="pt", truncation=True, padding=True)
	inputs = {key: value.to(device) for key, value in inputs.items()}

	# Make predictions
	prediction = make_predictions(model, inputs, device)

	# Apply threshold for final classification
	threshold = 0.5
	final_prediction = "Ca++ channel modulating" if prediction[0] > threshold else "Not Ca++ channel modulating"

	print(f"📊 Prediction Probability: {prediction[0]:.4f}")
	print(f"🏷️ Final Prediction: {final_prediction}")
	```

	---
	# Predict other ion channel modulating proteins
	---
	### 🔧 Install Dependencies

	```bash
	pip install torch esm biopython huggingface_hub transformers safetensors
	```

	### Loading the Model from Hugging Face

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	from transformers import AutoTokenizer, EsmForSequenceClassification

	# Set device
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	print("Downloading fine-tuned model & weights...")
	repo_id = "anandr88/IonNTxPred"
	subfolder = "saved_model_t33_other"

	# Load the tokenizer and model from Hugging Face.
	# from_pretrained() reads model.safetensors, so the fine-tuned encoder and
	# the classification head are both loaded here.
	tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
	model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

	# Optional: download the raw weights and inspect the classification head
	weights_path = hf_hub_download(repo_id=repo_id, filename=f"{subfolder}/model.safetensors")
	state_dict = load_file(weights_path)
	for key, tensor in state_dict.items():
	    if key.startswith("classifier."):
	        print(f"Found classifier tensor: {key} with shape {tuple(tensor.shape)}")

	# Move to device and set to eval mode
	model = model.to(device)
	model.eval()

	print(f"\nModel successfully loaded on {device} and ready for inference!")
	```


	### 🧪 Example Usage (Optional)

	```python
	from transformers import AutoTokenizer, EsmForSequenceClassification
	import torch

	# Define the repository ID and subfolder
	repo_id = "anandr88/IonNTxPred"
	subfolder = "saved_model_t33_other"

	# Load the tokenizer and model from Hugging Face
	tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
	model = EsmForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)

	# Move the model to the appropriate device
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model.to(device)
	model.eval()

	# Function to make predictions
	def make_predictions(model, inputs, device):
	    with torch.no_grad():
	        outputs = model(**inputs)
	        logits = outputs.logits
	        probs = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
	    return probs

	# Example protein sequence
	protein_sequence = "MKASTLVVIFIVIFITISSFSIHDVQASGVEKREQKDCLKKLKLCKENKDCCSKSCKRRGTNIEKRCR"

	# Tokenize the input sequence
	inputs = tokenizer(protein_sequence, return_tensors="pt", truncation=True, padding=True)
	inputs = {key: value.to(device) for key, value in inputs.items()}

	# Make predictions
	prediction = make_predictions(model, inputs, device)

	# Apply threshold for final classification
	threshold = 0.5
	final_prediction = "Other channel modulating" if prediction[0] > threshold else "Not Other channel modulating"

	print(f"📊 Prediction Probability: {prediction[0]:.4f}")
	print(f"🏷️ Final Prediction: {final_prediction}")
	```
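
The four sections above differ only in the model subfolder and the label text. One possible way to keep that mapping in a single place is sketched below. The subfolder names and label wording come from the examples above; the dictionary name `CHANNEL_SUBFOLDERS`, the helper `summarize`, and the probabilities are hypothetical and stand in for values produced by `make_predictions`.

```python
# Channel name -> model subfolder, as used in the sections above
CHANNEL_SUBFOLDERS = {
    "Na+": "saved_model_t33_na",
    "K+": "saved_model_t33_k",
    "Ca++": "saved_model_t33_ca",
    "Other": "saved_model_t33_other",
}

def summarize(probabilities, threshold=0.5):
    """Map per-channel probabilities to the label strings used above."""
    unknown = set(probabilities) - set(CHANNEL_SUBFOLDERS)
    if unknown:
        raise KeyError(f"Unknown channel(s): {sorted(unknown)}")
    return {
        channel: (f"{channel} channel modulating" if p > threshold
                  else f"Not {channel} channel modulating")
        for channel, p in probabilities.items()
    }

# Made-up probabilities, purely for illustration
example = {"Na+": 0.91, "K+": 0.12, "Ca++": 0.50, "Other": 0.03}
for channel, label in summarize(example).items():
    print(f"{CHANNEL_SUBFOLDERS[channel]}: {label}")
```

In a real run, each probability would come from loading the matching subfolder and calling `make_predictions` on the same tokenized sequence.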

	---

	## 📊 Applications

	- Ion channel impairing proteins filtering in therapeutic design
	- Toxicity scanning of synthetic peptides
	- Dataset annotation for bioactivity studies
	- Educational use in bioinformatics and deep learning for proteins
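
For the filtering use case in the first bullet, a screening step might look roughly like the sketch below: candidates whose predicted probability exceeds a cut-off are flagged and the rest are kept. The helper name `filter_candidates`, the candidate names, and the scores are all made up for illustration; the 0.5 cut-off mirrors the usage examples and may need tuning for a real pipeline.

```python
def filter_candidates(scores, threshold=0.5):
    """Split candidates into (kept, flagged) by predicted probability."""
    kept = {name: p for name, p in scores.items() if p <= threshold}
    flagged = {name: p for name, p in scores.items() if p > threshold}
    return kept, flagged

# Made-up probabilities standing in for model output
scores = {"candidate_A": 0.08, "candidate_B": 0.77, "candidate_C": 0.41}
kept, flagged = filter_candidates(scores)
print("Kept:", sorted(kept))
print("Flagged as ion channel modulating:", sorted(flagged))
```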

	---

	## 🌐 Related Links

	- 🔬 Project Web Server: [IonNTxPred Web Tool](http://webs.iiitd.edu.in/raghava/ionntxpred)
	- 🧾 Documentation & Source: [GitHub – raghavagps/IonNTxPred](https://github.com/raghavagps/ionntxpred)

	---

	## 🧠 Citation

	📖 Rathore et al.
	_A Large Language Model for Predicting Neurotoxic Peptides and Neurotoxins._
	(Coming soon)

	---

	👨‍🔬 Start using IonNTxPred today to enhance your protein/peptide screening pipeline with the power of transformer-based intelligence!