---
title: Toxicity Prediction API
description: A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and MHSA-GRU classifier.
short_description: Toxicity Prediction API
version: 1.0.0
emoji: 🧬
colorFrom: green
colorTo: blue
sdk: docker
app_file: app.py
pinned: false
license: mit
tags:
- protein-toxicity
- protbert
- mhsa-gru
- pytorch
- fastapi
---
# Toxicity Prediction API
A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and MHSA-GRU classifier.
Developed by the CAMLAs research team - [Francis Rudra D Cruze](https://linkedin.com/in/rudradcruze).
## 🚀 Features
- **ProtBERT Feature Extraction**: Uses state-of-the-art protein language model
- **MHSA-GRU Classification**: Multi-Head Self-Attention with GRU for accurate predictions
- **Single & Batch Predictions**: Process one or multiple sequences
- **HuggingFace Integration**: Automatic model loading from private repository
- **Production Ready**: Health checks, error handling, and comprehensive logging
## 📋 Requirements
- Python 3.8+
- CUDA-capable GPU (optional, but recommended)
- HuggingFace account with access to private repository
## 🔧 Installation
1. **Clone the repository**
```bash
git clone https://huggingface.co/spaces/camlas/toxicity
cd toxicity
```
2. **Create virtual environment**
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. **Install dependencies**
```bash
pip install -r requirements.txt
```
4. **Create `.env` file**
```bash
echo "HF_TOKEN=your_huggingface_token_here" > .env
```
Get your HuggingFace token from: https://huggingface.co/settings/tokens
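How `app.py` reads this file is not shown in this README. As a minimal sketch, assuming `.env` contains plain `KEY=value` lines, the following standard-library helper checks that a real token has replaced the placeholder. The function names `read_env_file` and `token_is_configured` are hypothetical and not part of the project.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def token_is_configured(values):
    """True when HF_TOKEN is set to something other than the placeholder."""
    token = values.get("HF_TOKEN", "")
    return bool(token) and token != "your_huggingface_token_here"
```

Calling `token_is_configured(read_env_file())` before starting the server gives an early hint when the token is missing.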
## 🎯 Usage
### Start the API Server
```bash
python app.py
```
Or with uvicorn directly:
```bash
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```
The API will be available at: `http://localhost:8000`
### Run Tests
```bash
python test_api.py
```
## 📡 API Endpoints
### 1. Root Endpoint
**GET** `/`
Returns API information and available endpoints.
```bash
curl http://localhost:8000/
```
### 2. Health Check
**GET** `/health`
Check API status and model loading status.
```bash
curl http://localhost:8000/health
```
**Response:**
```json
{
  "status_code": 200,
  "status": "healthy",
  "service": "Toxicity Prediction API",
  "api_version": "1.0.0",
  "model_version": "MHSA-GRU-Transformer-v1.0",
  "models_loaded": true,
  "device": "cuda",
  "timestamp": "2025-01-21T10:30:00Z"
}
```
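Because the models take some time to load on startup, a client may want to wait for `models_loaded` to become `true` before sending predictions. The sketch below polls `/health` using only the standard library. The function names, retry count, and delay are illustrative assumptions, not part of the API.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:8000"):
    """GET /health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=10) as resp:
        return json.load(resp)


def wait_until_ready(fetch=fetch_health, attempts=10, delay_s=3.0, sleep=time.sleep):
    """Poll until the health payload reports models_loaded, or give up."""
    for attempt in range(attempts):
        try:
            if fetch().get("models_loaded") is True:
                return True
        except OSError:
            pass  # server not reachable yet (URLError is a subclass of OSError)
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a running server.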
### 3. Single Prediction
**POST** `/predict`
Predict toxicity for a single protein sequence.
**Request:**
```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"}'
```
**Response:**
```json
{
  "status_code": 200,
  "status": "success",
  "success": true,
  "data": {
    "sequence": "MKTAYIAKQRQISFVKSHFSRQLE",
    "sequence_length": 24,
    "prediction": {
      "predicted_class": "Toxic",
      "confidence": 0.85,
      "confidence_level": "high",
      "toxicity_score": 0.925,
      "non_toxicity_score": 0.075
    },
    "metadata": {
      "embedding_model": "ProtBERT",
      "embedding_type": "Bert",
      "model_version": "MHSA-GRU-Transformer-v1.0",
      "device": "cuda"
    }
  },
  "timestamp": "2025-01-21T10:30:00Z",
  "api_version": "1.0.0",
  "processing_time_ms": 45.2
}
```
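The numbers in this sample response are consistent with two simple relationships: `non_toxicity_score = 1 - toxicity_score` and `confidence = 2 * |toxicity_score - 0.5|`. These are inferred from the example values only, and the 0.5 decision threshold is likewise an assumption, so the sketch below is a sanity check for client code, not a description of the server's implementation. `check_prediction` is a hypothetical helper.

```python
import math


def check_prediction(pred, tol=1e-6):
    """Check a prediction dict against the relationships seen in the sample."""
    score = pred["toxicity_score"]
    expected_class = "Toxic" if score >= 0.5 else "Non-Toxic"  # assumed threshold
    return {
        "scores_sum_to_one": math.isclose(
            score + pred["non_toxicity_score"], 1.0, abs_tol=tol
        ),
        "confidence_matches": math.isclose(
            pred["confidence"], 2 * abs(score - 0.5), abs_tol=tol
        ),
        "class_matches": pred["predicted_class"] == expected_class,
    }


sample = {
    "predicted_class": "Toxic",
    "confidence": 0.85,
    "toxicity_score": 0.925,
    "non_toxicity_score": 0.075,
}
print(check_prediction(sample))
```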
### 4. Batch Prediction
**POST** `/predict/batch`
Predict toxicity for multiple sequences at once.
**Request (cURL or Postman):**
```bash
curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "sequences": [
      "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
      "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK"
    ]
  }'
```
**Request Body (JSON):**
```json
{
  "sequences": [
    "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
    "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK"
  ]
}
```
**Response:**
```json
{
  "status_code": 200,
  "status": "success",
  "success": true,
  "data": {
    "total_sequences": 2,
    "results": [
      {
        "sequence": "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
        "sequence_length": 50,
        "predicted_class": "Toxic",
        "toxicity_score": 0.925,
        "confidence": 0.85
      },
      {
        "sequence": "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK",
        "sequence_length": 50,
        "predicted_class": "Non-Toxic",
        "toxicity_score": 0.125,
        "confidence": 0.75
      }
    ],
    "metadata": {
      "embedding_model": "ProtBERT",
      "embedding_type": "Bert",
      "model_version": "MHSA-GRU-Transformer-v1.0",
      "device": "cuda"
    }
  },
  "timestamp": "2025-01-21T10:30:00Z",
  "api_version": "1.0.0",
  "processing_time_ms": 125.8
}
```
## 🐍 Python Usage Examples
### Single Prediction
```python
import requests

response = requests.post(
    "http://localhost:8000/predict",
    json={"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"},
)
result = response.json()

# The prediction fields are nested under data -> prediction
prediction = result["data"]["prediction"]
print(f"Predicted Class: {prediction['predicted_class']}")
print(f"Toxicity Score: {prediction['toxicity_score']:.4f}")
print(f"Confidence: {prediction['confidence']:.4f}")
```
### Batch Prediction
```python
import requests

sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLE",
    "ARNDCEQGHILKMFPSTWYV",
    "MVHLTPEEKS",
]
response = requests.post(
    "http://localhost:8000/predict/batch",
    json={"sequences": sequences},
)
results = response.json()

# Each entry in data -> results corresponds to one input sequence
for i, pred in enumerate(results["data"]["results"], 1):
    print(f"Sequence {i}: {pred['predicted_class']} ({pred['toxicity_score']:.4f})")
```
## 📁 Project Structure
```
toxicity-api/
├── app.py              # Main FastAPI application
├── requirements.txt    # Python dependencies
├── test_api.py         # Test suite
├── .env                # Environment variables (create this)
├── models/             # Downloaded models (auto-created)
└── README.md           # This file
```
## 🔒 HuggingFace Repository Structure
Your private repository `camlas/toxicity` should contain:
```
camlas/toxicity/
├── mhsa_gru_classifier.pth    # Trained MHSA-GRU model
├── scaler.pkl                 # Feature scaler
├── config.json                # ProtBERT config
├── model.safetensors          # ProtBERT weights
├── vocab.txt                  # ProtBERT vocabulary
├── tokenizer_config.json      # Tokenizer configuration
└── special_tokens_map.json    # Special tokens mapping
```
```
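When models fail to load, it can help to confirm that every file listed above is present in the local download. The sketch below assumes the files land flat inside a single directory such as `models/`. The actual layout the application uses is not documented here, and `missing_model_files` is a hypothetical helper.

```python
from pathlib import Path

# File names taken from the repository listing above
REQUIRED_FILES = (
    "mhsa_gru_classifier.pth",
    "scaler.pkl",
    "config.json",
    "model.safetensors",
    "vocab.txt",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def missing_model_files(directory="models"):
    """Return the required file names that are absent from `directory`."""
    root = Path(directory)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means all seven files were found. Any names returned point to an incomplete download or a different directory layout.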
## 🎨 Model Architecture
1. **Feature Extraction**: ProtBERT (1024-dimensional embeddings)
2. **Feature Scaling**: StandardScaler
3. **Classification**: MHSA-GRU
- Multi-Head Self-Attention (3 layers)
- Bidirectional GRU (2 layers)
- Fully connected layers with dropout
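Step 2 presumably refers to a scikit-learn style `StandardScaler` stored in `scaler.pkl`, which standardizes each feature as `z = (x - mean) / scale` using statistics fitted at training time. The pure-Python sketch below illustrates that transform on a toy 3-dimensional vector instead of a 1024-dimensional embedding. The mean and scale values are made up for illustration.

```python
def standardize(features, mean, scale):
    """Apply z = (x - mean) / scale element-wise, as a fitted scaler would."""
    if not (len(features) == len(mean) == len(scale)):
        raise ValueError("features, mean and scale must have the same length")
    return [(x - m) / s for x, m, s in zip(features, mean, scale)]


# Toy values; a real embedding would have 1024 entries
embedding = [1.0, 2.0, 3.0]
fitted_mean = [1.0, 1.0, 1.0]
fitted_scale = [1.0, 2.0, 4.0]
print(standardize(embedding, fitted_mean, fitted_scale))
```

The scaled vector is what the classifier in step 3 would receive as input.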
## ⚠️ Error Codes
- `MISSING_SEQUENCE`: No sequence provided in request
- `SEQUENCE_TOO_SHORT`: Sequence length < 10 amino acids
- `MODEL_NOT_LOADED`: Models failed to load from HuggingFace
- `INTERNAL_ERROR`: Unexpected server error
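The first two codes can be anticipated on the client before a request is sent. The sketch below mirrors only what is documented above (a sequence must be present and at least 10 amino acids long). Whether the server trims whitespace or applies further checks is an assumption, and `precheck` is a hypothetical helper.

```python
MIN_LENGTH = 10  # SEQUENCE_TOO_SHORT is documented as length < 10 amino acids


def precheck(payload):
    """Return the documented error code a payload would likely trigger, else None."""
    sequence = payload.get("sequence")
    if not isinstance(sequence, str) or not sequence.strip():
        return "MISSING_SEQUENCE"
    if len(sequence.strip()) < MIN_LENGTH:  # trimming is an assumption
        return "SEQUENCE_TOO_SHORT"
    return None


print(precheck({"sequence": "MKT"}))
print(precheck({"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"}))
```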
## 📊 Performance
- Single prediction: ~40-50ms (GPU)
- Batch prediction (10 sequences): ~100-150ms (GPU)
- Model loading time: ~10-15 seconds (first time)
## 🐛 Troubleshooting
### Models not loading
1. Check your HuggingFace token in `.env`
2. Verify you have access to the private repository
3. Check internet connection
4. Look at console logs for specific errors
### CUDA out of memory
- Reduce batch size
- Use CPU instead: Set `device = "cpu"` in code
- Process sequences one at a time
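Reducing the batch size can also be done on the client by splitting a long list into smaller requests. The sketch below is one way to do that. `post_batch` stands for any callable that sends one chunk to `/predict/batch` and returns its list of per-sequence results. Both helper names and the default chunk size are assumptions.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_chunks(sequences, post_batch, size=4):
    """Call post_batch once per chunk and concatenate the results in order."""
    results = []
    for chunk in chunked(sequences, size):
        results.extend(post_batch(chunk))
    return results
```

With `requests`, `post_batch` could post `{"sequences": chunk}` and return `response.json()["data"]["results"]`. Setting `size=1` corresponds to processing sequences one at a time.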
### Slow predictions
- Ensure GPU is being used (check `/health` endpoint)
- First prediction is always slower (model initialization)
## 🌐 Public Usage Guidelines
- **Free to Use**: No authentication or API keys required.
- **Rate Limiting**: Fair usage is expected. Please do not abuse the service.
- **Educational Purpose**: Designed for research and educational use.
- **Medical Disclaimer**: Not for clinical diagnosis. See disclaimer below.
- **Availability**: Best effort uptime, not guaranteed 24/7.
## ⚠️ Medical Disclaimer
**IMPORTANT**: This API is designed for **research and educational purposes only**. It should **NOT** be used for clinical diagnosis or medical decision-making. Always consult qualified medical professionals for diagnostic decisions.
## 🏒 About CAMLAs
**CAMLAs** (Centre for Advanced Machine Learning & Applications) is a research organization focused on advancing AI applications in medical imaging and healthcare.
**Team Members:**
- **S M Hasan Mahmud** – Principal Investigator & Supervisor
_Roles:_ Writing – Original Draft, Writing – Review & Editing, Conceptualization, Supervision, Project Administration
- **Francis Rudra D Cruze** – Lead Developer & Researcher
_Roles:_ Methodology, Software, Formal Analysis, Investigation, Resources, Visualization
## 📞 Support & Contact
- **Issues**: [GitHub Repository Issues](https://github.com/camlas/ovarian-cancer)
- **Email**: drhasan.swe@diu.edu.bd
- **Documentation**: This README
- **API Status**: Check `/health` endpoint
- **Website Integration**: Perfect for ovarian.francisrudra.com
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
---
**CAMLAs** - Center for Advanced Machine Learning and Applications
_Advancing Medical AI Research with Public FastAPI_ 🌐🚀