---
title: Toxicity Prediction API
description: A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and an MHSA-GRU classifier.
short_description: Toxicity Prediction API
version: 1.0.0
emoji: 🧬
colorFrom: green
colorTo: blue
sdk: docker
app_file: app.py
pinned: false
license: mit
tags:
  - protein-toxicity
  - protbert
  - mhsa-gru
  - pytorch
  - fastapi
---

# Toxicity Prediction API

A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and an MHSA-GRU classifier.

Developed by the CAMLAs research team - [Francis Rudra D Cruze](https://linkedin.com/in/rudradcruze).

## Features

- **ProtBERT Feature Extraction**: Uses the ProtBERT protein language model to embed sequences
- **MHSA-GRU Classification**: Multi-Head Self-Attention combined with a GRU for classification
- **Single & Batch Predictions**: Process one or multiple sequences per request
- **HuggingFace Integration**: Automatic model loading from a private repository
- **Production Ready**: Health checks, error handling, and logging

## Requirements

- Python 3.8+
- CUDA-capable GPU (optional, but recommended)
- HuggingFace account with access to the private repository

## Installation

1. **Clone the repository**

```bash
git clone https://huggingface.co/spaces/camlas/toxicity
cd toxicity
```

2. **Create virtual environment**

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. **Install dependencies**

```bash
pip install -r requirements.txt
```

4. **Create `.env` file**

```bash
echo "HF_TOKEN=your_huggingface_token_here" > .env
```

Get your HuggingFace token from: https://huggingface.co/settings/tokens

## Usage

### Start the API Server

```bash
python app.py
```

Or with uvicorn directly:

```bash
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```

The API will be available at: `http://localhost:8000`

### Run Tests

```bash
python test_api.py
```

## API Endpoints

### 1. Root Endpoint

**GET** `/`

Returns API information and available endpoints.

```bash
curl http://localhost:8000/
```

### 2. Health Check

**GET** `/health`

Check API status and model loading status.

```bash
curl http://localhost:8000/health
```

**Response:**

```json
{
  "status_code": 200,
  "status": "healthy",
  "service": "Toxicity Prediction API",
  "api_version": "1.0.0",
  "model_version": "MHSA-GRU-Transformer-v1.0",
  "models_loaded": true,
  "device": "cuda",
  "timestamp": "2025-01-21T10:30:00Z"
}
```

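Because model loading takes several seconds after startup, a client may want to wait until `/health` reports `models_loaded: true` before sending predictions. The following is a minimal sketch rather than part of the API: `wait_until_ready` and its parameters are hypothetical names, and the HTTP call is passed in as a function so the helper can be tried without a running server.

```python
import time
from typing import Callable, Dict


def wait_until_ready(
    fetch_health: Callable[[], Dict],
    attempts: int = 10,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the /health payload until it reports loaded models.

    fetch_health is any callable returning the parsed /health JSON, e.g.
    lambda: requests.get("http://localhost:8000/health", timeout=10).json()
    """
    for attempt in range(attempts):
        try:
            payload = fetch_health()
        except Exception:
            payload = {}  # assumption: treat connection errors as "not ready yet"
        if payload.get("status") == "healthy" and payload.get("models_loaded") is True:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # wait before the next poll
    return False
```

Injecting `fetch_health` keeps the helper independent of any particular HTTP library; with `requests`, the lambda shown in the docstring is one way to supply it.
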
### 3. Single Prediction

**POST** `/predict`

Predict toxicity for a single protein sequence.

**Request:**

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"}'
```

**Response:**

```json
{
  "status_code": 200,
  "status": "success",
  "success": true,
  "data": {
    "sequence": "MKTAYIAKQRQISFVKSHFSRQLE",
    "sequence_length": 24,
    "prediction": {
      "predicted_class": "Toxic",
      "confidence": 0.85,
      "confidence_level": "high",
      "toxicity_score": 0.925,
      "non_toxicity_score": 0.075
    },
    "metadata": {
      "embedding_model": "ProtBERT",
      "embedding_type": "Bert",
      "model_version": "MHSA-GRU-Transformer-v1.0",
      "device": "cuda"
    }
  },
  "timestamp": "2025-01-21T10:30:00Z",
  "api_version": "1.0.0",
  "processing_time_ms": 45.2
}
```

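The nested response can be flattened into a small typed object on the client side. This is a sketch under stated assumptions: `Prediction` and `parse_prediction` are hypothetical names, the field layout follows the sample response above, and treating `"success": false` as the failure signal is an assumption, since the shape of error responses is not documented here.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Prediction:
    sequence: str
    predicted_class: str
    toxicity_score: float
    confidence: float


def parse_prediction(body: Dict[str, Any]) -> Prediction:
    """Extract the main fields from a parsed /predict response body."""
    # Assumption: unsuccessful responses carry a falsy "success" field.
    if not body.get("success"):
        raise ValueError(f"prediction failed: status={body.get('status')!r}")
    data = body["data"]
    pred = data["prediction"]
    return Prediction(
        sequence=data["sequence"],
        predicted_class=pred["predicted_class"],
        toxicity_score=float(pred["toxicity_score"]),
        confidence=float(pred["confidence"]),
    )
```

With `requests`, the function would typically be called as `parse_prediction(response.json())`.
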
### 4. Batch Prediction

**POST** `/predict/batch`

Predict toxicity for multiple sequences at once.

**Request (cURL):**

```bash
curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "sequences": [
      "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
      "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK",
      "MKTAYIAKQRQISFVKSHFSRQLE"
    ]
  }'
```

**Request Body (JSON):**

```json
{
  "sequences": [
    "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
    "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK"
  ]
}
```

**Response** (for the two-sequence request body above):

```json
{
  "status_code": 200,
  "status": "success",
  "success": true,
  "data": {
    "total_sequences": 2,
    "results": [
      {
        "sequence": "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
        "sequence_length": 50,
        "predicted_class": "Toxic",
        "toxicity_score": 0.925,
        "confidence": 0.85
      },
      {
        "sequence": "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK",
        "sequence_length": 50,
        "predicted_class": "Non-Toxic",
        "toxicity_score": 0.125,
        "confidence": 0.75
      }
    ],
    "metadata": {
      "embedding_model": "ProtBERT",
      "embedding_type": "Bert",
      "model_version": "MHSA-GRU-Transformer-v1.0",
      "device": "cuda"
    }
  },
  "timestamp": "2025-01-21T10:30:00Z",
  "api_version": "1.0.0",
  "processing_time_ms": 125.8
}
```

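Unlike the single-prediction response, each batch result is a flat object with no nested `prediction` key. A short sketch of how a client might summarize a batch is shown below; `summarize_batch` is a hypothetical helper, and it relies only on the fields that appear in the sample response above.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_batch(body: Dict[str, Any]) -> Dict[str, Any]:
    """Count predicted classes and find the highest-scoring sequence."""
    results: List[Dict[str, Any]] = body["data"]["results"]
    counts = Counter(item["predicted_class"] for item in results)
    # default=None keeps an empty result list from raising in max().
    most_toxic = max(results, key=lambda item: item["toxicity_score"], default=None)
    return {
        "total": len(results),
        "counts": dict(counts),
        "most_toxic_sequence": most_toxic["sequence"] if most_toxic else None,
    }
```
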
## Python Usage Examples

### Single Prediction

```python
import requests

response = requests.post(
    "http://localhost:8000/predict",
    json={"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"},
)
response.raise_for_status()  # fail early on HTTP errors

result = response.json()
print(f"Predicted Class: {result['data']['prediction']['predicted_class']}")
print(f"Toxicity Score: {result['data']['prediction']['toxicity_score']:.4f}")
print(f"Confidence: {result['data']['prediction']['confidence']:.4f}")
```

### Batch Prediction

```python
import requests

sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLE",
    "ARNDCEQGHILKMFPSTWYV",
    "MVHLTPEEKS",
]

response = requests.post(
    "http://localhost:8000/predict/batch",
    json={"sequences": sequences},
)
response.raise_for_status()  # fail early on HTTP errors

results = response.json()
# Batch results are flat: fields sit directly on each result object.
for i, pred in enumerate(results['data']['results'], 1):
    print(f"Sequence {i}: {pred['predicted_class']} ({pred['toxicity_score']:.4f})")
```

## Project Structure

```
toxicity-api/
├── app.py                 # Main FastAPI application
├── requirements.txt       # Python dependencies
├── test_api.py            # Test suite
├── .env                   # Environment variables (create this)
├── models/                # Downloaded models (auto-created)
└── README.md              # This file
```

## HuggingFace Repository Structure

Your private repository `camlas/toxicity` should contain:

```
camlas/toxicity/
├── mhsa_gru_classifier.pth    # Trained MHSA-GRU model
├── scaler.pkl                 # Feature scaler
├── config.json                # ProtBERT config
├── model.safetensors          # ProtBERT weights
├── vocab.txt                  # ProtBERT vocabulary
├── tokenizer_config.json      # Tokenizer configuration
└── special_tokens_map.json    # Special tokens mapping
```

## Model Architecture

1. **Feature Extraction**: ProtBERT (1024-dimensional embeddings)
2. **Feature Scaling**: StandardScaler
3. **Classification**: MHSA-GRU
   - Multi-Head Self-Attention (3 layers)
   - Bidirectional GRU (2 layers)
   - Fully connected layers with dropout

## Error Codes

- `MISSING_SEQUENCE`: No sequence provided in request
- `SEQUENCE_TOO_SHORT`: Sequence length < 10 amino acids
- `MODEL_NOT_LOADED`: Models failed to load from HuggingFace
- `INTERNAL_ERROR`: Unexpected server error

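The two input-related codes can be anticipated on the client before a request is sent. The sketch below is an assumption-laden illustration, not the server's validation logic: `precheck_sequence` is a hypothetical helper, the minimum length of 10 comes from the `SEQUENCE_TOO_SHORT` description above, and stripping surrounding whitespace is a guess about how the server treats input.

```python
from typing import Optional

MIN_SEQUENCE_LENGTH = 10  # taken from the SEQUENCE_TOO_SHORT description


def precheck_sequence(sequence: Optional[str]) -> Optional[str]:
    """Return the error code the input would likely trigger, or None if it looks valid."""
    # Assumption: empty or whitespace-only input counts as missing.
    if sequence is None or not sequence.strip():
        return "MISSING_SEQUENCE"
    if len(sequence.strip()) < MIN_SEQUENCE_LENGTH:
        return "SEQUENCE_TOO_SHORT"
    return None
```

A `None` result only means the request passes these two checks; the server may still answer with `MODEL_NOT_LOADED` or `INTERNAL_ERROR`.
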
## Performance

- Single prediction: ~40-50ms (GPU)
- Batch prediction (10 sequences): ~100-150ms (GPU)
- Model loading time: ~10-15 seconds (first time)

## Troubleshooting

### Models not loading

1. Check your HuggingFace token in `.env`
2. Verify that you have access to the private repository
3. Check your internet connection
4. Check the console logs for specific errors

### CUDA out of memory

- Reduce batch size
- Use CPU instead: Set `device = "cpu"` in code
- Process sequences one at a time

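Reducing the batch size can also be done from the client by splitting a long list of sequences into smaller requests. A minimal sketch, assuming a hypothetical helper named `chunked`:

```python
from typing import Iterator, List, Sequence


def chunked(sequences: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` sequences."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(sequences), size):
        yield list(sequences[start:start + size])
```

Each chunk can then be posted to `/predict/batch` as `{"sequences": chunk}`; a `size` of 1 corresponds to processing sequences one at a time.
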
### Slow predictions

- Ensure GPU is being used (check `/health` endpoint)
- First prediction is always slower (model initialization)

## Public Usage Guidelines

- **Free to Use**: No authentication or API keys required.
- **Rate Limiting**: Fair usage is expected. Please do not abuse the service.
- **Educational Purpose**: Designed for research and educational use.
- **Medical Disclaimer**: Not for clinical diagnosis. See disclaimer below.
- **Availability**: Best effort uptime, not guaranteed 24/7.

## Medical Disclaimer

**IMPORTANT**: This API is designed for **research and educational purposes only**. It should **NOT** be used for clinical diagnosis or medical decision-making. Always consult qualified medical professionals for diagnostic decisions.

## About CAMLAs

**CAMLAs** (Centre for Advanced Machine Learning & Applications) is a research organization focused on advancing AI applications in medical imaging and healthcare.

**Team Members:**

- **S M Hasan Mahmud** - Principal Investigator & Supervisor
  _Roles:_ Writing - Original Draft, Writing - Review & Editing, Conceptualization, Supervision, Project Administration

- **Francis Rudra D Cruze** - Lead Developer & Researcher
  _Roles:_ Methodology, Software, Formal Analysis, Investigation, Resources, Visualization

## Support & Contact

- **Issues**: [GitHub Repository Issues](https://github.com/camlas/ovarian-cancer)
- **Email**: drhasan.swe@diu.edu.bd
- **Documentation**: This README
- **API Status**: Check `/health` endpoint
- **Website Integration**: Perfect for ovarian.francisrudra.com

## License

This project is licensed under the MIT License - see the LICENSE file for details.

---

**CAMLAs** - Center for Advanced Machine Learning and Applications
_Advancing Medical AI Research with Public FastAPI_