title: Toxicity Prediction API
description: >-
  A FastAPI-based REST API for predicting protein sequence toxicity using
  ProtBERT embeddings and an MHSA-GRU classifier.
short_description: Toxicity Prediction API
version: 1.0.0
emoji: 🧬
colorFrom: green
colorTo: blue
sdk: docker
app_file: app.py
pinned: false
license: mit
tags:
- protein-toxicity
- protbert
- mhsa-gru
- pytorch
- fastapi
Toxicity Prediction API
A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and an MHSA-GRU classifier.
Developed by the CAMLAs research team - Francis Rudra D Cruze.
Features
- ProtBERT Feature Extraction: Embeds each sequence with the ProtBERT protein language model
- MHSA-GRU Classification: Multi-Head Self-Attention combined with a GRU classifies the embeddings
- Single & Batch Predictions: Process one or multiple sequences per request
- HuggingFace Integration: Automatic model loading from a private repository
- Production Ready: Health checks, error handling, and comprehensive logging
Requirements
- Python 3.8+
- CUDA-capable GPU (optional, but recommended)
- HuggingFace account with access to the private repository
Installation
- Clone the repository
git clone https://huggingface.co/spaces/camlas/toxicity
cd toxicity
- Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Create a .env file
echo "HF_TOKEN=your_huggingface_token_here" > .env
Get your HuggingFace token from: https://huggingface.co/settings/tokens
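As a rough illustration of how the token in .env might be picked up at startup, the sketch below parses simple KEY=VALUE lines using only the standard library. The helper names `load_env_file` and `get_hf_token` are hypothetical and are not taken from app.py, which may well use a library such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path=".env"):
    """Prefer HF_TOKEN from the environment, else fall back to the file."""
    token = os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; see the installation steps")
    return token
```

Checking the environment first means a token configured as a secret on the hosting platform takes precedence over a local file.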
Usage
Start the API Server
python app.py
Or with uvicorn directly:
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
The API will be available at: http://localhost:8000
Run Tests
python test_api.py
API Endpoints
1. Root Endpoint
GET /
Returns API information and available endpoints.
curl http://localhost:8000/
2. Health Check
GET /health
Check API status and model loading status.
curl http://localhost:8000/health
Response:
{
  "status_code": 200,
  "status": "healthy",
  "service": "Toxicity Prediction API",
  "api_version": "1.0.0",
  "model_version": "MHSA-GRU-Transformer-v1.0",
  "models_loaded": true,
  "device": "cuda",
  "timestamp": "2025-01-21T10:30:00Z"
}
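Because models take a while to load on first start, a client may want to wait for `models_loaded` to become true before sending predictions. The following is a minimal sketch of that idea; the function names, retry count, and delay are assumptions chosen for illustration, and only the `status` and `models_loaded` fields come from the response shown above.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:8000", timeout=5.0):
    """GET /health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def is_ready(health):
    """True when the payload reports a healthy service with models loaded."""
    return health.get("status") == "healthy" and health.get("models_loaded") is True


def wait_until_ready(fetch=fetch_health, attempts=10, delay_s=2.0):
    """Poll until ready; return the final payload or raise TimeoutError."""
    for _ in range(attempts):
        try:
            health = fetch()
            if is_ready(health):
                return health
        except OSError:
            pass  # server is not accepting connections yet
        time.sleep(delay_s)
    raise TimeoutError("API did not report models_loaded=true in time")
```

Passing `fetch` as a parameter keeps the polling logic testable without a running server.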
3. Single Prediction
POST /predict
Predict toxicity for a single protein sequence.
Request:
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"}'
Response:
{
  "status_code": 200,
  "status": "success",
  "success": true,
  "data": {
    "sequence": "MKTAYIAKQRQISFVKSHFSRQLE",
    "sequence_length": 24,
    "prediction": {
      "predicted_class": "Toxic",
      "confidence": 0.85,
      "confidence_level": "high",
      "toxicity_score": 0.925,
      "non_toxicity_score": 0.075
    },
    "metadata": {
      "embedding_model": "ProtBERT",
      "embedding_type": "Bert",
      "model_version": "MHSA-GRU-Transformer-v1.0",
      "device": "cuda"
    }
  },
  "timestamp": "2025-01-21T10:30:00Z",
  "api_version": "1.0.0",
  "processing_time_ms": 45.2
}
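To make the nesting of this response easier to work with, a client could flatten it into a small record. This is only a sketch based on the sample response above: the `Prediction` class and `parse_prediction` helper are hypothetical names, and the assumption that failed requests carry `"success": false` is not confirmed by this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    sequence: str
    predicted_class: str
    toxicity_score: float
    confidence: float


def parse_prediction(payload):
    """Extract the main fields from a decoded /predict response body."""
    # Assumption: unsuccessful responses set "success" to false.
    if not payload.get("success"):
        raise ValueError(f"request failed with status {payload.get('status')!r}")
    data = payload["data"]
    pred = data["prediction"]
    return Prediction(
        sequence=data["sequence"],
        predicted_class=pred["predicted_class"],
        toxicity_score=float(pred["toxicity_score"]),
        confidence=float(pred["confidence"]),
    )
```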
4. Batch Prediction
POST /predict/batch
Predict toxicity for multiple sequences at once.
Request in Postman/cURL:
curl -X POST http://localhost:8000/predict/batch \
-H "Content-Type: application/json" \
-d '{
"sequences": [
"MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
"MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK",
"MKTAYIAKQRQISFVKSHFSRQLE"
]
}'
Request Body (JSON):
{
  "sequences": [
    "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
    "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK"
  ]
}
Response (for the two-sequence request body above):
{
  "status_code": 200,
  "status": "success",
  "success": true,
  "data": {
    "total_sequences": 2,
    "results": [
      {
        "sequence": "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
        "sequence_length": 50,
        "predicted_class": "Toxic",
        "toxicity_score": 0.925,
        "confidence": 0.85
      },
      {
        "sequence": "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK",
        "sequence_length": 50,
        "predicted_class": "Non-Toxic",
        "toxicity_score": 0.125,
        "confidence": 0.75
      }
    ],
    "metadata": {
      "embedding_model": "ProtBERT",
      "embedding_type": "Bert",
      "model_version": "MHSA-GRU-Transformer-v1.0",
      "device": "cuda"
    }
  },
  "timestamp": "2025-01-21T10:30:00Z",
  "api_version": "1.0.0",
  "processing_time_ms": 125.8
}
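A batch response like the one above can be summarized on the client side. The helpers below are an illustrative sketch (the names `summarize_batch` and `most_toxic` are hypothetical); they rely only on the `total_sequences`, `results`, `predicted_class`, and `toxicity_score` fields shown in the sample.

```python
from collections import Counter


def summarize_batch(payload):
    """Count predicted classes and sanity-check the reported total."""
    data = payload["data"]
    results = data["results"]
    if data["total_sequences"] != len(results):
        raise ValueError("total_sequences does not match number of results")
    return Counter(r["predicted_class"] for r in results)


def most_toxic(payload, top_n=1):
    """Return the top_n results ordered by descending toxicity_score."""
    results = payload["data"]["results"]
    return sorted(results, key=lambda r: r["toxicity_score"], reverse=True)[:top_n]
```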
Python Usage Examples
Single Prediction
import requests

response = requests.post(
    "http://localhost:8000/predict",
    json={"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"}
)
result = response.json()
print(f"Predicted Class: {result['data']['prediction']['predicted_class']}")
print(f"Toxicity Score: {result['data']['prediction']['toxicity_score']:.4f}")
print(f"Confidence: {result['data']['prediction']['confidence']:.4f}")
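The example above has no timeout and does not check the HTTP status, so a failed request would surface later as a `KeyError`. One possible hardening, sketched here with the standard library's `urllib` to avoid an extra dependency, is shown below. The helper names and the 30-second timeout are assumptions, and the shape of the API's error body is not documented here, so it is passed through as raw text.

```python
import json
import urllib.error
import urllib.request


def build_predict_request(sequence, base_url="http://localhost:8000"):
    """Build the POST /predict request without sending it."""
    body = json.dumps({"sequence": sequence}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(sequence, base_url="http://localhost:8000", timeout=30.0):
    """Send one sequence and return the decoded JSON response."""
    request = build_predict_request(sequence, base_url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        # 4xx/5xx: include the server's body in the error for debugging.
        detail = exc.read().decode("utf-8", errors="replace")
        raise RuntimeError(f"HTTP {exc.code} from /predict: {detail}") from exc
```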
Batch Prediction
import requests

sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLE",
    "ARNDCEQGHILKMFPSTWYV",
    "MVHLTPEEKS"
]
response = requests.post(
    "http://localhost:8000/predict/batch",
    json={"sequences": sequences}
)
results = response.json()
# Each entry in data.results corresponds to one input sequence.
for i, pred in enumerate(results['data']['results'], 1):
    print(f"Sequence {i}: {pred['predicted_class']} ({pred['toxicity_score']:.4f})")
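For long input lists it may help to split the work into several smaller batch requests. The sketch below shows one way to do that; `chunked` and `predict_in_chunks` are hypothetical helpers, `post_batch` stands for any callable that posts a list of sequences to /predict/batch and returns the decoded JSON, and the default chunk size of 10 is an arbitrary choice rather than a documented API limit.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_chunks(sequences, post_batch, chunk_size=10):
    """Send sequences chunk by chunk and concatenate the per-sequence results."""
    results = []
    for chunk in chunked(sequences, chunk_size):
        payload = post_batch(chunk)
        results.extend(payload["data"]["results"])
    return results
```

With `requests`, `post_batch` could be as simple as `lambda seqs: requests.post(url, json={"sequences": seqs}).json()`.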
Project Structure
toxicity-api/
├── app.py              # Main FastAPI application
├── requirements.txt    # Python dependencies
├── test_api.py         # Test suite
├── .env                # Environment variables (create this)
├── models/             # Downloaded models (auto-created)
└── README.md           # This file
HuggingFace Repository Structure
Your private repository camlas/toxicity should contain:
camlas/toxicity/
├── mhsa_gru_classifier.pth    # Trained MHSA-GRU model
├── scaler.pkl                 # Feature scaler
├── config.json                # ProtBERT config
├── model.safetensors          # ProtBERT weights
├── vocab.txt                  # ProtBERT vocabulary
├── tokenizer_config.json      # Tokenizer configuration
└── special_tokens_map.json    # Special tokens mapping
Model Architecture
- Feature Extraction: ProtBERT (1024-dimensional embeddings)
- Feature Scaling: StandardScaler
- Classification: MHSA-GRU
  - Multi-Head Self-Attention (3 layers)
  - Bidirectional GRU (2 layers)
  - Fully connected layers with dropout
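The feature scaling step between embedding and classification can be illustrated with plain arithmetic. The sketch below standardizes each feature to zero mean and unit variance using the population standard deviation, which is what scikit-learn's StandardScaler computes. It is a toy version on two-dimensional vectors with hypothetical function names; the service itself applies a pre-fitted scaler (scaler.pkl) to 1024-dimensional ProtBERT embeddings.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows):
    """Compute per-feature mean and scale from a list of feature vectors."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population std; constant columns get scale 1.0 to avoid dividing by zero.
    scales = [pstdev(col) or 1.0 for col in columns]
    return means, scales


def standardize(row, means, scales):
    """Apply z = (x - mean) / scale to one feature vector."""
    return [(x - m) / s for x, m, s in zip(row, means, scales)]
```

The statistics are fitted once on training data and reused unchanged at prediction time, which is why the scaler is shipped alongside the model weights.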
⚠️ Error Codes
- MISSING_SEQUENCE: No sequence provided in request
- SEQUENCE_TOO_SHORT: Sequence length < 10 amino acids
- MODEL_NOT_LOADED: Models failed to load from HuggingFace
- INTERNAL_ERROR: Unexpected server error
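The first two error codes can be anticipated on the client before a request is sent. The following is a minimal sketch using the 10-residue minimum stated above; `precheck_sequence` is a hypothetical helper, and whether the server trims whitespace or treats an empty string as missing is an assumption.

```python
MIN_SEQUENCE_LENGTH = 10  # from the SEQUENCE_TOO_SHORT description


def precheck_sequence(sequence):
    """Return the error code a request would likely trigger, or None if OK."""
    # Assumption: blank or whitespace-only input counts as missing.
    if sequence is None or not str(sequence).strip():
        return "MISSING_SEQUENCE"
    if len(str(sequence).strip()) < MIN_SEQUENCE_LENGTH:
        return "SEQUENCE_TOO_SHORT"
    return None
```

MODEL_NOT_LOADED and INTERNAL_ERROR depend on server state and cannot be checked this way.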
Performance
- Single prediction: ~40-50ms (GPU)
- Batch prediction (10 sequences): ~100-150ms (GPU)
- Model loading time: ~10-15 seconds (first time)
Troubleshooting
Models not loading
- Check your HuggingFace token in .env
- Verify you have access to the private repository
- Check internet connection
- Look at console logs for specific errors
CUDA out of memory
- Reduce batch size
- Use CPU instead: Set device = "cpu" in code
- Process sequences one at a time
Slow predictions
- Ensure GPU is being used (check /health endpoint)
- First prediction is always slower (model initialization)
Public Usage Guidelines
- Free to Use: No authentication or API keys required.
- Rate Limiting: Fair usage is expected. Please do not abuse the service.
- Educational Purpose: Designed for research and educational use.
- Medical Disclaimer: Not for clinical diagnosis. See disclaimer below.
- Availability: Best effort uptime, not guaranteed 24/7.
⚠️ Medical Disclaimer
IMPORTANT: This API is designed for research and educational purposes only. It should NOT be used for clinical diagnosis or medical decision-making. Always consult qualified medical professionals for diagnostic decisions.
About CAMLAs
CAMLAs (Centre for Advanced Machine Learning & Applications) is a research organization focused on advancing AI applications in medical imaging and healthcare.
Team Members:
- S M Hasan Mahmud - Principal Investigator & Supervisor
  Roles: Writing - Original Draft, Writing - Review & Editing, Conceptualization, Supervision, Project Administration
- Francis Rudra D Cruze - Lead Developer & Researcher
  Roles: Methodology, Software, Formal Analysis, Investigation, Resources, Visualization
Support & Contact
- Issues: GitHub Repository Issues
- Email: drhasan.swe@diu.edu.bd
- Documentation: This README
- API Status: Check /health endpoint
- Website Integration: Perfect for ovarian.francisrudra.com
License
This project is licensed under the MIT License - see the LICENSE file for details.
CAMLAs - Center for Advanced Machine Learning and Applications
Advancing Medical AI Research with Public FastAPI