---
title: Toxicity Prediction API
description: A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and MHSA-GRU classifier.
short_description: Toxicity Prediction API
version: 1.0.0
emoji: 🧬
colorFrom: green
colorTo: blue
sdk: docker
app_file: app.py
pinned: false
license: mit
tags:
    - protein-toxicity
    - protbert
    - mhsa-gru
    - pytorch
    - fastapi
---

# Toxicity Prediction API

A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and an MHSA-GRU classifier.

Developed by the CAMLAs research team - [Francis Rudra D Cruze](https://linkedin.com/in/rudradcruze).

## 🚀 Features

-   **ProtBERT Feature Extraction**: Embeds each sequence with the ProtBERT protein language model (1024-dimensional embeddings)
-   **MHSA-GRU Classification**: Multi-Head Self-Attention combined with a bidirectional GRU labels each sequence as Toxic or Non-Toxic
-   **Single & Batch Predictions**: Score one sequence via `/predict` or several via `/predict/batch`
-   **HuggingFace Integration**: Loads the models automatically from the private `camlas/toxicity` repository
-   **Production Ready**: Health checks, error handling, and comprehensive logging

## 📋 Requirements

-   Python 3.8+
-   CUDA-capable GPU (optional, but recommended)
-   HuggingFace account with access to private repository

## 🔧 Installation

1. **Clone the repository**

```bash
git clone https://huggingface.co/spaces/camlas/toxicity
cd toxicity
```

2. **Create virtual environment**

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. **Install dependencies**

```bash
pip install -r requirements.txt
```

4. **Create `.env` file**

```bash
echo "HF_TOKEN=your_huggingface_token_here" > .env
```

Get your HuggingFace token from: https://huggingface.co/settings/tokens
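
How `app.py` reads the token is not shown in this README. As a minimal sketch, assuming the token is looked up in the process environment first and then in a simple `KEY=value` style `.env` file, a lookup could look like the following. The helper name `read_hf_token` is hypothetical, and many projects use the `python-dotenv` package for this instead.

```python
import os
from pathlib import Path
from typing import Optional


def read_hf_token(env_path: str = ".env") -> Optional[str]:
    """Return HF_TOKEN from the environment, falling back to a .env file."""
    token = os.environ.get("HF_TOKEN")
    if token:
        return token

    path = Path(env_path)
    if not path.is_file():
        return None

    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "HF_TOKEN":
            # Drop optional surrounding quotes; treat an empty value as missing.
            return value.strip().strip("'\"") or None
    return None
```

Checking for a `None` result at startup gives a clearer failure than a later authentication error when the models are downloaded.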

## 🎯 Usage

### Start the API Server

```bash
python app.py
```

Or with uvicorn directly:

```bash
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```

The API will be available at: `http://localhost:8000`

### Run Tests

```bash
python test_api.py
```

## 📡 API Endpoints

### 1. Root Endpoint

**GET** `/`

Returns API information and available endpoints.

```bash
curl http://localhost:8000/
```

### 2. Health Check

**GET** `/health`

Reports the API status and whether the models have finished loading.

```bash
curl http://localhost:8000/health
```

**Response:**

```json
{
    "status_code": 200,
    "status": "healthy",
    "service": "Toxicity Prediction API",
    "api_version": "1.0.0",
    "model_version": "MHSA-GRU-Transformer-v1.0",
    "models_loaded": true,
    "device": "cuda",
    "timestamp": "2025-01-21T10:30:00Z"
}
```
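
Because model loading takes several seconds, a client may want to poll `/health` until `models_loaded` is true before sending predictions. The sketch below uses only the standard library and reads the `status` and `models_loaded` fields from the sample response above. The function names, retry count, and delay are illustrative assumptions, not part of the API.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://localhost:8000") -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=10) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict] = fetch_health,
    retries: int = 30,
    delay_s: float = 1.0,
) -> bool:
    """Poll until the API reports healthy with models loaded, or give up."""
    for attempt in range(retries):
        try:
            health = fetch()
            if health.get("status") == "healthy" and health.get("models_loaded"):
                return True
        except OSError:
            # Server not reachable yet (urllib.error.URLError subclasses OSError).
            pass
        if attempt < retries - 1:
            time.sleep(delay_s)
    return False
```

Passing the fetch function as a parameter keeps the polling logic testable without a running server.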

### 3. Single Prediction

**POST** `/predict`

Predict toxicity for a single protein sequence.

**Request:**

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"}'
```

**Response:**

```json
{
    "status_code": 200,
    "status": "success",
    "success": true,
    "data": {
        "sequence": "MKTAYIAKQRQISFVKSHFSRQLE",
        "sequence_length": 24,
        "prediction": {
            "predicted_class": "Toxic",
            "confidence": 0.85,
            "confidence_level": "high",
            "toxicity_score": 0.925,
            "non_toxicity_score": 0.075
        },
        "metadata": {
            "embedding_model": "ProtBERT",
            "embedding_type": "Bert",
            "model_version": "MHSA-GRU-Transformer-v1.0",
            "device": "cuda"
        }
    },
    "timestamp": "2025-01-21T10:30:00Z",
    "api_version": "1.0.0",
    "processing_time_ms": 45.2
}
```

### 4. Batch Prediction

**POST** `/predict/batch`

Predict toxicity for multiple sequences at once.

**Request in Postman/cURL:**

```bash
curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "sequences": [
      "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
      "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK"
    ]
  }'
```

**Request Body (JSON):**

```json
{
    "sequences": [
        "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
        "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK"
    ]
}
```

**Response:**

```json
{
    "status_code": 200,
    "status": "success",
    "success": true,
    "data": {
        "total_sequences": 2,
        "results": [
            {
                "sequence": "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
                "sequence_length": 50,
                "predicted_class": "Toxic",
                "toxicity_score": 0.925,
                "confidence": 0.85
            },
            {
                "sequence": "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK",
                "sequence_length": 50,
                "predicted_class": "Non-Toxic",
                "toxicity_score": 0.125,
                "confidence": 0.75
            }
        ],
        "metadata": {
            "embedding_model": "ProtBERT",
            "embedding_type": "Bert",
            "model_version": "MHSA-GRU-Transformer-v1.0",
            "device": "cuda"
        }
    },
    "timestamp": "2025-01-21T10:30:00Z",
    "api_version": "1.0.0",
    "processing_time_ms": 125.8
}
```

## 🐍 Python Usage Examples

### Single Prediction

```python
import requests

# Send one sequence to the /predict endpoint.
response = requests.post(
    "http://localhost:8000/predict",
    json={"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"},
    timeout=30,
)
response.raise_for_status()  # Fail on HTTP 4xx/5xx instead of parsing an error body

# The scores live under data -> prediction in the response.
prediction = response.json()["data"]["prediction"]
print(f"Predicted Class: {prediction['predicted_class']}")
print(f"Toxicity Score: {prediction['toxicity_score']:.4f}")
print(f"Confidence: {prediction['confidence']:.4f}")
```

### Batch Prediction

```python
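import requests

sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLE",
    "ARNDCEQGHILKMFPSTWYV",
    "MVHLTPEEKS"
]

# Send all sequences in one request to the /predict/batch endpoint.
response = requests.post(
    "http://localhost:8000/predict/batch",
    json={"sequences": sequences},
    timeout=60,
)
response.raise_for_status()  # Fail on HTTP 4xx/5xx instead of parsing an error body

# Batch results are a flat list under data -> results.
results = response.json()
for i, pred in enumerate(results['data']['results'], 1):
    print(f"Sequence {i}: {pred['predicted_class']} ({pred['toxicity_score']:.4f})")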
```

## 📁 Project Structure

```
toxicity/
├── app.py                 # Main FastAPI application
├── requirements.txt       # Python dependencies
├── test_api.py           # Test suite
├── .env                  # Environment variables (create this)
├── models/               # Downloaded models (auto-created)
└── README.md            # This file
```

## 🔒 HuggingFace Repository Structure

Your private repository `camlas/toxicity` should contain:

```
camlas/toxicity/
├── mhsa_gru_classifier.pth    # Trained MHSA-GRU model
├── scaler.pkl                  # Feature scaler
├── config.json                 # ProtBERT config
├── model.safetensors          # ProtBERT weights
├── vocab.txt                   # ProtBERT vocabulary
├── tokenizer_config.json      # Tokenizer configuration
└── special_tokens_map.json    # Special tokens mapping
```

## 🎨 Model Architecture

1. **Feature Extraction**: ProtBERT (1024-dimensional embeddings)
2. **Feature Scaling**: StandardScaler
3. **Classification**: MHSA-GRU
    - Multi-Head Self-Attention (3 layers)
    - Bidirectional GRU (2 layers)
    - Fully connected layers with dropout
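
Step 2 standardizes each embedding feature before classification. Assuming `scaler.pkl` holds a fitted scikit-learn `StandardScaler`, its transform subtracts a per-feature mean and divides by a per-feature scale. The dependency-free sketch below illustrates that arithmetic on toy vectors; the `standardize` helper is hypothetical, and the API itself presumably calls the loaded scaler rather than code like this.

```python
from typing import List, Sequence


def standardize(
    x: Sequence[float], mean: Sequence[float], scale: Sequence[float]
) -> List[float]:
    """Apply z = (x - mean) / scale feature by feature."""
    if not (len(x) == len(mean) == len(scale)):
        raise ValueError("x, mean, and scale must have the same length")
    return [(xi - mi) / si for xi, mi, si in zip(x, mean, scale)]


# Toy 3-dimensional example; real ProtBERT embeddings have 1024 features.
embedding = [2.0, 4.0, -1.0]
feature_mean = [1.0, 2.0, -1.0]
feature_scale = [1.0, 2.0, 0.5]
scaled = standardize(embedding, feature_mean, feature_scale)
```

The mean and scale must come from the same scaler that was fitted on the training embeddings, otherwise the classifier sees inputs on a different scale than it was trained on.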

## ⚠️ Error Codes

-   `MISSING_SEQUENCE`: No sequence provided in request
-   `SEQUENCE_TOO_SHORT`: Sequence length < 10 amino acids
-   `MODEL_NOT_LOADED`: Models failed to load from HuggingFace
-   `INTERNAL_ERROR`: Unexpected server error

## 📊 Performance

-   Single prediction: ~40-50ms (GPU)
-   Batch prediction (10 sequences): ~100-150ms (GPU)
-   Model loading time: ~10-15 seconds (first time)

## 🐛 Troubleshooting

### Models not loading

1. Check your HuggingFace token in `.env`
2. Verify you have access to the private repository
3. Check internet connection
4. Look at console logs for specific errors

### CUDA out of memory

-   Reduce batch size
-   Use CPU instead: Set `device = "cpu"` in code
-   Process sequences one at a time
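
Reducing the batch size can be done on the client by splitting a long list into smaller `/predict/batch` requests. A minimal sketch, assuming the response shape shown in the Batch Prediction section; `chunked`, `predict_in_chunks`, and the injected `post_batch` callable are illustrative names, not part of this project.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(sequences: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` sequences."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(sequences), size):
        yield list(sequences[start:start + size])


def predict_in_chunks(
    sequences: Sequence[str],
    size: int,
    post_batch: Callable[[List[str]], dict],
) -> List[dict]:
    """Send each chunk separately and collect results in input order."""
    results: List[dict] = []
    for chunk in chunked(sequences, size):
        body = post_batch(chunk)
        results.extend(body["data"]["results"])
    return results


# Example post_batch using the requests library (needs a running server):
#   post_batch = lambda chunk: requests.post(
#       "http://localhost:8000/predict/batch",
#       json={"sequences": chunk}, timeout=60,
#   ).json()
```

Setting `size=1` corresponds to processing sequences one at a time.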

### Slow predictions

-   Ensure GPU is being used (check `/health` endpoint)
-   First prediction is always slower (model initialization)
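
To tell first-call warm-up apart from consistently slow predictions, it can help to time several identical requests and look at the latency after the first one. The sketch below times any zero-argument callable; the helper names are hypothetical, and in practice `call` would wrap a request to `/predict`.

```python
import statistics
import time
from typing import Callable, List


def time_calls(call: Callable[[], object], n: int = 5) -> List[float]:
    """Run `call` n times and return each wall-clock duration in milliseconds."""
    durations: List[float] = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def warm_median(durations: List[float]) -> float:
    """Median latency after discarding the first (warm-up) call."""
    if len(durations) < 2:
        raise ValueError("need at least two timings")
    return statistics.median(durations[1:])
```

If the first timing is much larger than `warm_median` of the same run, the slowness is most likely model initialization rather than steady-state inference.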

## 🌐 Public Usage Guidelines

-   **Free to Use**: No authentication or API keys required.
-   **Rate Limiting**: Fair usage is expected. Please do not abuse the service.
-   **Educational Purpose**: Designed for research and educational use.
-   **Medical Disclaimer**: Not for clinical diagnosis. See disclaimer below.
-   **Availability**: Best effort uptime, not guaranteed 24/7.

## ⚠️ Medical Disclaimer

**IMPORTANT**: This API is designed for **research and educational purposes only**. It should **NOT** be used for clinical diagnosis or medical decision-making. Always consult qualified medical professionals for diagnostic decisions.

## 🏢 About CAMLAs

**CAMLAs** (Centre for Advanced Machine Learning & Applications) is a research organization focused on advancing AI applications in medical imaging and healthcare.

**Team Members:**

-   **S M Hasan Mahmud** – Principal Investigator & Supervisor  
    _Roles:_ Writing – Original Draft, Writing – Review & Editing, Conceptualization, Supervision, Project Administration

-   **Francis Rudra D Cruze** – Lead Developer & Researcher  
    _Roles:_ Methodology, Software, Formal Analysis, Investigation, Resources, Visualization

## 📞 Support & Contact

-   **Issues**: [GitHub Repository Issues](https://github.com/camlas/ovarian-cancer)
-   **Email**: drhasan.swe@diu.edu.bd
-   **Documentation**: This README
-   **API Status**: Check `/health` endpoint
-   **Website Integration**: Perfect for ovarian.francisrudra.com

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

---

**CAMLAs** - Center for Advanced Machine Learning and Applications  
_Advancing Medical AI Research with Public FastAPI_ 🌐🚀