title: Toxicity Prediction API
description: >-
  A FastAPI-based REST API for predicting protein sequence toxicity using
  ProtBERT embeddings and MHSA-GRU classifier.
short_description: Toxicity Prediction API
version: 1.0.0
emoji: 🧬
colorFrom: green
colorTo: blue
sdk: docker
app_file: app.py
pinned: false
license: mit
tags:
  - protein-toxicity
  - protbert
  - mhsa-gru
  - pytorch
  - fastapi

Toxicity Prediction API

A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and an MHSA-GRU classifier.

Developed by the CAMLAs research team - Francis Rudra D Cruze.

🚀 Features

  • ProtBERT Feature Extraction: Uses a state-of-the-art protein language model
  • MHSA-GRU Classification: Multi-Head Self-Attention with GRU for accurate predictions
  • Single & Batch Predictions: Process one or multiple sequences
  • HuggingFace Integration: Automatic model loading from private repository
  • Production Ready: Health checks, error handling, and comprehensive logging

📋 Requirements

  • Python 3.8+
  • CUDA-capable GPU (optional, but recommended)
  • HuggingFace account with access to private repository

🔧 Installation

  1. Clone the repository
git clone https://huggingface.co/spaces/camlas/toxicity
cd toxicity
  2. Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies
pip install -r requirements.txt
  4. Create a .env file
echo "HF_TOKEN=your_huggingface_token_here" > .env

Get your HuggingFace token from: https://huggingface.co/settings/tokens
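
Before starting the server, it can help to confirm that the token in .env is actually readable. The following is a minimal sketch using only the standard library. `read_env_file` is a hypothetical helper, not part of this repository, and the application itself may load .env differently (for example through a dotenv package).

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict.

    Hypothetical helper for illustration: it skips blank lines and
    comments, and strips optional surrounding quotes from values.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


if __name__ == "__main__":
    token = read_env_file().get("HF_TOKEN", "")
    if not token or token == "your_huggingface_token_here":
        print("HF_TOKEN is missing or still set to the placeholder value.")
    else:
        print("HF_TOKEN found.")
```

Running the script from the project directory prints whether a non-placeholder token was found, without printing the token itself.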

🎯 Usage

Start the API Server

python app.py

Or with uvicorn directly:

uvicorn app:app --host 0.0.0.0 --port 8000 --reload

The API will be available at: http://localhost:8000

Run Tests

python test_api.py

📡 API Endpoints

1. Root Endpoint

GET /

Returns API information and available endpoints.

curl http://localhost:8000/

2. Health Check

GET /health

Check API status and model loading status.

curl http://localhost:8000/health

Response:

{
    "status_code": 200,
    "status": "healthy",
    "service": "Toxicity Prediction API",
    "api_version": "1.0.0",
    "model_version": "MHSA-GRU-Transformer-v1.0",
    "models_loaded": true,
    "device": "cuda",
    "timestamp": "2025-01-21T10:30:00Z"
}
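
Because model loading takes some time after startup, a client may want to wait until the health check reports `models_loaded: true` before sending predictions. The sketch below polls the endpoint using only the standard library. The function names, retry count, and delay are illustrative assumptions, not part of the API.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:8000"):
    """GET /health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_ready(fetch=fetch_health, attempts=10, delay_s=2.0):
    """Poll the health endpoint until `models_loaded` is true.

    Returns the last health payload on success and raises TimeoutError
    if the models are still not loaded after `attempts` tries.
    """
    for attempt in range(attempts):
        try:
            health = fetch()
        except OSError:
            health = None  # Server not reachable yet; retry.
        if health and health.get("models_loaded"):
            return health
        if attempt < attempts - 1:
            time.sleep(delay_s)
    raise TimeoutError("API did not report models_loaded=true in time")
```

Calling `wait_until_ready()` with the defaults blocks for up to roughly 20 seconds. The returned payload also carries the `device` field, which shows whether the server is running on `cuda` or `cpu`.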

3. Single Prediction

POST /predict

Predict toxicity for a single protein sequence.

Request:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"}'

Response:

{
    "status_code": 200,
    "status": "success",
    "success": true,
    "data": {
        "sequence": "MKTAYIAKQRQISFVKSHFSRQLE",
        "sequence_length": 24,
        "prediction": {
            "predicted_class": "Toxic",
            "confidence": 0.85,
            "confidence_level": "high",
            "toxicity_score": 0.925,
            "non_toxicity_score": 0.075
        },
        "metadata": {
            "embedding_model": "ProtBERT",
            "embedding_type": "Bert",
            "model_version": "MHSA-GRU-Transformer-v1.0",
            "device": "cuda"
        }
    },
    "timestamp": "2025-01-21T10:30:00Z",
    "api_version": "1.0.0",
    "processing_time_ms": 45.2
}
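
The response nests the label and scores under `data.prediction`. As a rough sketch, a client could flatten that into a small typed record. The `Prediction` class and `parse_prediction` function are hypothetical names, and the check on `success` assumes that failed requests return `"success": false`, which the examples here do not show.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    sequence: str
    predicted_class: str
    toxicity_score: float
    non_toxicity_score: float
    confidence: float
    confidence_level: str


def parse_prediction(body):
    """Build a Prediction from a /predict response body (a dict).

    Raises ValueError if the body does not report success.
    """
    if not body.get("success"):
        raise ValueError(f"Prediction failed: {body.get('status')}")
    data = body["data"]
    pred = data["prediction"]
    return Prediction(
        sequence=data["sequence"],
        predicted_class=pred["predicted_class"],
        toxicity_score=pred["toxicity_score"],
        non_toxicity_score=pred["non_toxicity_score"],
        confidence=pred["confidence"],
        confidence_level=pred["confidence_level"],
    )
```

With the `requests` library, `parse_prediction(response.json())` would return a record whose fields can be accessed as attributes, for example `result.toxicity_score`.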

4. Batch Prediction

POST /predict/batch

Predict toxicity for multiple sequences at once.

Request (cURL):

curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "sequences": [
      "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
      "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK"
    ]
  }'

Request Body (JSON):

{
    "sequences": [
        "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
        "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK"
    ]
}

Response:

{
    "status_code": 200,
    "status": "success",
    "success": true,
    "data": {
        "total_sequences": 2,
        "results": [
            {
                "sequence": "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
                "sequence_length": 50,
                "predicted_class": "Toxic",
                "toxicity_score": 0.925,
                "confidence": 0.85
            },
            {
                "sequence": "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK",
                "sequence_length": 50,
                "predicted_class": "Non-Toxic",
                "toxicity_score": 0.125,
                "confidence": 0.75
            }
        ],
        "metadata": {
            "embedding_model": "ProtBERT",
            "embedding_type": "Bert",
            "model_version": "MHSA-GRU-Transformer-v1.0",
            "device": "cuda"
        }
    },
    "timestamp": "2025-01-21T10:30:00Z",
    "api_version": "1.0.0",
    "processing_time_ms": 125.8
}
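
Unlike the single-prediction response, each batch result is flat: the class and scores sit directly on the item rather than under a `prediction` object. A small sketch of grouping batch results by label follows. `split_by_class` is a hypothetical helper name.

```python
def split_by_class(body):
    """Group batch results by their predicted_class label.

    `body` is the decoded JSON of a /predict/batch response. Items keep
    the order in which the API returned them.
    """
    groups = {}
    for item in body["data"]["results"]:
        groups.setdefault(item["predicted_class"], []).append(item)
    return groups
```

For the response above, `split_by_class(body)["Toxic"]` would hold the first result and `split_by_class(body)["Non-Toxic"]` the second.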

🐍 Python Usage Examples

Single Prediction

import requests

# Send one sequence to the single-prediction endpoint.
response = requests.post(
    "http://localhost:8000/predict",
    json={"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"},
    timeout=30,
)
response.raise_for_status()  # Raise on HTTP 4xx/5xx responses

# The class label and scores are nested under data.prediction.
result = response.json()
prediction = result['data']['prediction']
print(f"Predicted Class: {prediction['predicted_class']}")
print(f"Toxicity Score: {prediction['toxicity_score']:.4f}")
print(f"Confidence: {prediction['confidence']:.4f}")

Batch Prediction

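import requests

# Each sequence must be at least 10 amino acids long.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLE",
    "ARNDCEQGHILKMFPSTWYV",
    "MVHLTPEEKS"
]

response = requests.post(
    "http://localhost:8000/predict/batch",
    json={"sequences": sequences},
    timeout=60,
)
response.raise_for_status()  # Raise on HTTP 4xx/5xx responses

# Batch results are flat: each item carries its own class and scores.
results = response.json()
for i, pred in enumerate(results['data']['results'], 1):
    print(f"Sequence {i}: {pred['predicted_class']} ({pred['toxicity_score']:.4f})")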

πŸ“ Project Structure

toxicity-api/
β”œβ”€β”€ app.py                 # Main FastAPI application
β”œβ”€β”€ requirements.txt       # Python dependencies
β”œβ”€β”€ test_api.py           # Test suite
β”œβ”€β”€ .env                  # Environment variables (create this)
β”œβ”€β”€ models/               # Downloaded models (auto-created)
└── README.md            # This file

🔒 HuggingFace Repository Structure

Your private repository camlas/toxicity should contain:

camlas/toxicity/
├── mhsa_gru_classifier.pth    # Trained MHSA-GRU model
├── scaler.pkl                 # Feature scaler
├── config.json                # ProtBERT config
├── model.safetensors          # ProtBERT weights
├── vocab.txt                  # ProtBERT vocabulary
├── tokenizer_config.json      # Tokenizer configuration
└── special_tokens_map.json    # Special tokens mapping

🎨 Model Architecture

  1. Feature Extraction: ProtBERT (1024-dimensional embeddings)
  2. Feature Scaling: StandardScaler
  3. Classification: MHSA-GRU
    • Multi-Head Self-Attention (3 layers)
    • Bidirectional GRU (2 layers)
    • Fully connected layers with dropout
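
The feature-extraction step depends on how sequences are presented to ProtBERT. The public ProtBERT model card describes inputs as upper-case residues separated by spaces, with the rare amino acids U, Z, O, and B mapped to X. The sketch below applies that convention. Whether this application preprocesses sequences in exactly this way is an assumption, and `format_for_protbert` is a hypothetical name.

```python
import re


def format_for_protbert(sequence):
    """Return the sequence in the input format ProtBERT was trained on.

    Whitespace is removed, residues are upper-cased, the rare amino
    acids U, Z, O and B are mapped to X, and residues are then joined
    with single spaces.
    """
    cleaned = re.sub(r"\s+", "", sequence).upper()
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    return " ".join(cleaned)
```

The formatted string would then be passed to the tokenizer, and the resulting 1024-dimensional embedding would go through the scaler and the classifier as listed above.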

⚠️ Error Codes

  • MISSING_SEQUENCE: No sequence provided in request
  • SEQUENCE_TOO_SHORT: Sequence is shorter than 10 amino acids
  • MODEL_NOT_LOADED: Models failed to load from HuggingFace
  • INTERNAL_ERROR: Unexpected server error
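
The first two error codes can be anticipated on the client before a request is sent. The following is a minimal sketch of such a pre-check. It mirrors only the rules listed above, `precheck_sequence` is a hypothetical helper, and treating whitespace-only input as missing is an assumption about the server's behavior.

```python
MIN_LENGTH = 10  # From SEQUENCE_TOO_SHORT: fewer than 10 amino acids


def precheck_sequence(sequence):
    """Return the error code the API would likely report, or None if OK."""
    if sequence is None or not sequence.strip():
        return "MISSING_SEQUENCE"
    if len(sequence.strip()) < MIN_LENGTH:
        return "SEQUENCE_TOO_SHORT"
    return None
```

A caller could skip the request whenever `precheck_sequence(seq)` returns a code, and still handle MODEL_NOT_LOADED and INTERNAL_ERROR from the server's response.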

📊 Performance

  • Single prediction: ~40-50ms (GPU)
  • Batch prediction (10 sequences): ~100-150ms (GPU)
  • Model loading time: ~10-15 seconds (first time)
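
These figures will vary with hardware and network conditions. To check them locally, one option is to time a request function over several runs, as in the sketch below. `measure_latency_ms` and its parameters are illustrative assumptions. Client-side timings include network overhead, so they will typically be higher than the `processing_time_ms` reported in the response.

```python
import statistics
import time


def measure_latency_ms(call, runs=20, warmup=1):
    """Time `call()` and return (median_ms, max_ms) over `runs` runs.

    `warmup` calls are executed first and not counted, so that one-off
    startup costs do not skew the result.
    """
    for _ in range(warmup):
        call()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), max(samples)
```

Here `call` would be a zero-argument function that sends one request, for example a small wrapper around the single-prediction request shown earlier.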

πŸ› Troubleshooting

Models not loading

  1. Check your HuggingFace token in .env
  2. Verify you have access to the private repository
  3. Check internet connection
  4. Look at console logs for specific errors

CUDA out of memory

  • Reduce batch size
  • Use CPU instead: Set device = "cpu" in code
  • Process sequences one at a time
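
Reducing the batch size can also be done on the client by splitting a long list into smaller requests. The sketch below shows one way to do that. `chunked` and `predict_in_chunks` are hypothetical helpers, and `send_batch` stands for any caller-supplied function that posts a list of sequences to /predict/batch and returns the `data.results` list.

```python
def chunked(sequences, size):
    """Yield consecutive slices of `sequences` with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(sequences), size):
        yield sequences[start:start + size]


def predict_in_chunks(sequences, send_batch, size=4):
    """Call `send_batch` on each chunk and concatenate the result lists."""
    results = []
    for chunk in chunked(sequences, size):
        results.extend(send_batch(chunk))
    return results
```

Setting `size=1` corresponds to processing sequences one at a time, at the cost of more round trips.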

Slow predictions

  • Ensure GPU is being used (check /health endpoint)
  • First prediction is always slower (model initialization)

🌐 Public Usage Guidelines

  • Free to Use: No authentication or API keys required.
  • Rate Limiting: Fair usage is expected. Please do not abuse the service.
  • Educational Purpose: Designed for research and educational use.
  • Medical Disclaimer: Not for clinical diagnosis. See disclaimer below.
  • Availability: Best effort uptime, not guaranteed 24/7.

⚠️ Medical Disclaimer

IMPORTANT: This API is designed for research and educational purposes only. It should NOT be used for clinical diagnosis or medical decision-making. Always consult qualified medical professionals for diagnostic decisions.

🏒 About CAMLAs

CAMLAs (Centre for Advanced Machine Learning & Applications) is a research organization focused on advancing AI applications in medical imaging and healthcare.

Team Members:

  • S M Hasan Mahmud – Principal Investigator & Supervisor
    Roles: Writing – Original Draft, Writing – Review & Editing, Conceptualization, Supervision, Project Administration

  • Francis Rudra D Cruze – Lead Developer & Researcher
    Roles: Methodology, Software, Formal Analysis, Investigation, Resources, Visualization

📞 Support & Contact

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


CAMLAs - Center for Advanced Machine Learning and Applications
Advancing Medical AI Research with Public FastAPI 🌐🚀