	---
	license: gpl-3.0
	language:
	- en
	base_model:
	- facebook/esm2_t30_150M_UR50D
	tags:
	- peptides
	- neurotoxicity
	- protein-classification
	- therapeutic-peptides
	- bioinformatics
	- esm2
	- transformer
	---

	# 🧠 NTxPred2: A large language model for predicting neurotoxic peptides and neurotoxins

	NTxPred2 is a fine-tuned transformer model built on top of the [ESM2-t30_150M_UR50D](https://huggingface.co/facebook/esm2_t30_150M_UR50D) protein language model. It is trained for binary classification of peptide sequences: it predicts whether a peptide is neurotoxic or non-toxic.

	🎯 Use Case: Accelerating the identification and design of safe peptide therapeutics by filtering out neurotoxic candidates early in the drug development pipeline.

	---

	### 🖼️ NTxPred2 Workflow

	![NTxPred2 Workflow](https://github.com/raghavagps/NTxPred2/raw/main/NTxPred2_workflow.png)

	---

	## 🧬 Model Highlights

	- Base Model: Facebook’s ESM2-t30 (150M parameters)
	- Fine-Tuning Task: Neurotoxicity prediction (binary classification)
	- Input: Short peptide sequences (7–50 amino acids)
	- Output: Binary label → `1` (neurotoxic), `0` (non-toxic)
	- Architecture: ESM2 encoder + linear classification head
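
The stated input range can be enforced before a sequence reaches the model. The following is a minimal sketch, assuming that only the 20 standard amino acids are accepted (an assumption; this model card states only the 7–50 length range). The helper name `validate_peptide` is hypothetical and not part of NTxPred2.

```python
# Hypothetical helper (not part of NTxPred2): pre-screen peptides before prediction.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: standard residues only
MIN_LEN, MAX_LEN = 7, 50  # length range stated in the model highlights


def validate_peptide(sequence: str) -> str:
    """Return the cleaned, upper-cased sequence or raise ValueError."""
    cleaned = sequence.strip().upper()
    if not MIN_LEN <= len(cleaned) <= MAX_LEN:
        raise ValueError(
            f"Length {len(cleaned)} is outside the supported range {MIN_LEN}-{MAX_LEN}"
        )
    invalid = sorted(set(cleaned) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"Non-standard residues found: {', '.join(invalid)}")
    return cleaned
```

Rejecting out-of-range inputs early avoids scoring sequences the model may never have seen during training.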

	---

	## 🗂️ Files Included

	- `config.json` – Contains configuration settings for the model architecture, hyperparameters, and training details.

	- `model.safetensors` – The trained model weights, saved in the SafeTensors format, which is safer and faster to load than traditional `.bin` files.

	- `special_tokens_map.json` – Stores mappings for special tokens, such as [CLS], [SEP], or any custom tokens used by the tokenizer.

	- `tokenizer_config.json` – Contains tokenizer-related settings (such as vocabulary size and tokenization method).

	- `vocab.txt` – Lists all tokens and their corresponding IDs; it's essential for text tokenization.

	---

	## 🚀 How to Use

	### 🔧 Install Dependencies

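	```bash
	# The `import esm` module used below (with `esm.pretrained`) is published on PyPI as `fair-esm`
	pip install torch fair-esm biopython huggingface_hub
	```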

	### Loading the Model from Hugging Face

	```python
	import torch
	import torch.nn as nn
	import esm
	import json
	from huggingface_hub import hf_hub_download

	# Define the classifier model (ESM encoder + linear head)
	class ProteinClassifier(nn.Module):
	    def __init__(self, esm_model, embedding_dim, num_classes):
	        super().__init__()
	        self.esm_model = esm_model
	        self.fc = nn.Linear(embedding_dim, num_classes)

	    def forward(self, tokens):
	        layer_index = len(self.esm_model.layers)  # index of the final encoder layer
	        results = self.esm_model(tokens, repr_layers=[layer_index])
	        # Mean-pool the final-layer token representations into one vector per sequence
	        embeddings = results["representations"][layer_index].mean(1)
	        return self.fc(embeddings)

	# Device setup
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# Load config from your repo
	config_path = hf_hub_download(repo_id="anandr88/NTxPred2", filename="config.json")
	with open(config_path, 'r') as f:
	    config = json.load(f)

	# Load ESM2 model - UPDATED METHOD
	model_name = "esm2_t30_150M_UR50D"
	esm_model, alphabet = esm.pretrained.load_model_and_alphabet(model_name)
	batch_converter = alphabet.get_batch_converter()

	# Initialize the classifier. The linear head starts with random weights;
	# the fine-tuned weights in model.safetensors are not loaded by this snippet.
	classifier = ProteinClassifier(
	    esm_model,
	    embedding_dim=config['embedding_dim'],
	    num_classes=config['num_classes']
	)
	classifier.to(device)
	classifier.eval()

	print("✅ Model initialized successfully!")
	print(f"Using device: {device}")
	print(f"Model architecture: {classifier}")
	```
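
The snippet above builds the classifier but never reads `model.safetensors`, so the classification head is still randomly initialized and the encoder carries only its pretrained weights. One possible way to load the fine-tuned checkpoint is sketched below. It assumes the `safetensors` package is installed and that the tensor names in the checkpoint match `classifier.state_dict()`. Neither assumption is confirmed by this model card, so the sketch checks the key layout and raises on a mismatch. Both function names are hypothetical.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) tensor names as sorted lists."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # in the model, absent from the file
    unexpected = sorted(checkpoint_keys - model_keys)   # in the file, unknown to the model
    return missing, unexpected


def load_finetuned_weights(classifier, repo_id="anandr88/NTxPred2",
                           filename="model.safetensors"):
    """Download the checkpoint and load it into `classifier` (layout is assumed)."""
    # Imported lazily so compare_keys stays usable without these packages.
    from huggingface_hub import hf_hub_download
    from safetensors.torch import load_file

    weights_path = hf_hub_download(repo_id=repo_id, filename=filename)
    state_dict = load_file(weights_path)  # dict: tensor name -> torch.Tensor
    missing, unexpected = compare_keys(classifier.state_dict().keys(), state_dict.keys())
    if missing or unexpected:
        raise RuntimeError(
            "Checkpoint layout differs from the model: "
            f"{len(missing)} missing, {len(unexpected)} unexpected keys"
        )
    classifier.load_state_dict(state_dict)
    return classifier
```

Call `load_finetuned_weights(classifier)` after constructing the classifier and before making predictions. If it raises, inspect the checkpoint's tensor names and adapt the mapping accordingly.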

	---

	## 🧪 Example Usage (Optional)

	```python
	# Example Usage for Binary Classification
	sequence = ("TEST_SEQUENCE", "ACDEFGHIKLMNPQRSTVWY") # Your peptide sequence

	# Convert to model input format
	_, _, batch_tokens = batch_converter([sequence])
	batch_tokens = batch_tokens.to(device)

	# Predict
	with torch.no_grad():
	    logits = classifier(batch_tokens)
	    probability = torch.sigmoid(logits).item()  # Sigmoid for binary classification (expects a single logit)

	# Interpret results
	threshold = 0.5  # Standard threshold (adjust if needed)
	prediction = "Neurotoxic" if probability >= threshold else "Non-toxic"

	print("\n" + "="*50)
	print(f"🔬 Input Sequence: {sequence[1]}")
	print(f"📊 Neurotoxicity Probability: {probability:.4f}")
	print(f"🏷️ Prediction: {prediction} (threshold={threshold})")

	```
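
The prediction step reduces to passing one logit through a sigmoid and comparing the result with a threshold. The standalone sketch below uses plain Python instead of torch so the arithmetic is visible. It assumes the model emits a single logit per sequence, as the `.item()` call in the example implies. `classify_logit` is a hypothetical helper name, and the labels follow the model highlights (`1` = neurotoxic, `0` = non-toxic).

```python
import math


def classify_logit(logit: float, threshold: float = 0.5):
    """Map one raw logit to (probability, label); 1 = neurotoxic, 0 = non-toxic."""
    # Numerically stable sigmoid: avoids overflow for large negative logits.
    if logit >= 0:
        probability = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        probability = e / (1.0 + e)
    label = 1 if probability >= threshold else 0
    return probability, label
```

A logit of 0 maps to a probability of 0.5, which sits exactly on the default threshold and is therefore labeled neurotoxic. Raising the threshold trades sensitivity for fewer false positives.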

	---

	## 📊 Applications

	- Neurotoxic peptide filtering in therapeutic design
	- Toxicity scanning of synthetic peptides
	- Dataset annotation for bioactivity studies
	- Educational use in bioinformatics and deep learning for proteins
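
For scanning many peptides, sequences are typically read from a FASTA file and passed to the batch converter in chunks. The sketch below uses only the standard library. `read_fasta` and `chunked` are hypothetical helpers, and the batch size of 32 is an arbitrary assumption. Each chunk is a list of `(name, sequence)` tuples, the same shape passed to `batch_converter` in the example usage.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (name, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, parts = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # Use the first word of the header as the record name.
            name, parts = (line[1:].split() or ["unnamed"])[0], []
        elif name is not None:
            parts.append(line.upper())
    if name is not None:
        yield name, "".join(parts)


def chunked(records: Iterable[Record], size: int = 32) -> Iterator[List[Record]]:
    """Group records into lists of at most `size` items."""
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each yielded batch can be passed directly to `batch_converter(batch)`. Because the `forward` method shown earlier averages over all token positions, padded batches of mixed-length peptides may give slightly different scores than scoring each peptide on its own.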

	---

	## 🌐 Related Links

	- 🔬 Project Web Server: [NTxPred2 Web Tool](http://webs.iiitd.edu.in/raghava/ntxpred2)
	- 🧾 Documentation & Source: [GitHub – raghavagps/NTxPred2](https://github.com/raghavagps/NTxPred2)

	---

	## 🧠 Citation

	📖 Rathore et al.
	_A Large Language Model for Predicting Neurotoxic Peptides and Neurotoxins._
	#Coming Soon#

	---

	👨‍🔬 Start using NTxPred2 today to enhance your peptide screening pipeline with the power of transformer-based intelligence!