rudradcruze: upload toxicity api application (commit 1c25c67, 3 months ago)

title: Toxicity Prediction API
description: >-
  A FastAPI-based REST API for predicting protein sequence toxicity using
  ProtBERT embeddings and MHSA-GRU classifier.
short_description: Toxicity Prediction API
version: 1.0.0
emoji: 🧬
colorFrom: green
colorTo: blue
sdk: docker
app_file: app.py
pinned: false
license: mit
tags:
  - protein-toxicity
  - protbert
  - mhsa-gru
  - pytorch
  - fastapi

Toxicity Prediction API

A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and an MHSA-GRU classifier.

Developed by the CAMLAs research team - Francis Rudra D Cruze.

🚀 Features

ProtBERT Feature Extraction: Embeds each sequence with the ProtBERT protein language model
MHSA-GRU Classification: Classifies the embeddings with Multi-Head Self-Attention and GRU layers
Single & Batch Predictions: Process one sequence or several per request
HuggingFace Integration: Loads models automatically from a private repository
Production Ready: Health checks, error handling, and logging

📋 Requirements

Python 3.8+
CUDA-capable GPU (optional, but recommended)
HuggingFace account with access to the private model repository

🔧 Installation

Clone the repository

git clone https://huggingface.co/spaces/camlas/toxicity
cd toxicity

Create virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies

pip install -r requirements.txt

Create .env file

echo "HF_TOKEN=your_huggingface_token_here" > .env

Get your HuggingFace token from: https://huggingface.co/settings/tokens
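
This README does not show how app.py reads the token. As a minimal sketch, assuming the file holds simple KEY=VALUE lines like the one created above, the token can be read with the standard library alone. `read_env_file` is a hypothetical helper for illustration, not part of the project.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values
```

Calling `read_env_file().get("HF_TOKEN")` returns the token, or None when the key is absent, which makes a quick first check when models fail to load.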

🎯 Usage

Start the API Server

python app.py

Or with uvicorn directly:

uvicorn app:app --host 0.0.0.0 --port 8000 --reload

The API will be available at: http://localhost:8000

Run Tests

python test_api.py

📡 API Endpoints

1. Root Endpoint

GET /

Returns API information and available endpoints.

curl http://localhost:8000/

2. Health Check

GET /health

Reports the API status and whether the models have been loaded.

curl http://localhost:8000/health

Response:

{
    "status_code": 200,
    "status": "healthy",
    "service": "Toxicity Prediction API",
    "api_version": "1.0.0",
    "model_version": "MHSA-GRU-Transformer-v1.0",
    "models_loaded": true,
    "device": "cuda",
    "timestamp": "2025-01-21T10:30:00Z"
}
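
A client can use this payload to decide whether the service is ready before sending sequences. The sketch below assumes only the `status` and `models_loaded` fields shown in the sample response. `is_ready` is a hypothetical helper name.

```python
def is_ready(health):
    """Return True when a /health payload reports a usable service."""
    return health.get("status") == "healthy" and health.get("models_loaded") is True


# Fields copied from the sample response above.
sample = {
    "status_code": 200,
    "status": "healthy",
    "models_loaded": True,
    "device": "cuda",
}
print(is_ready(sample))  # True
```

In practice the dictionary would come from `requests.get("http://localhost:8000/health").json()`.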

3. Single Prediction

POST /predict

Predict toxicity for a single protein sequence.

Request:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"}'

Response:

{
    "status_code": 200,
    "status": "success",
    "success": true,
    "data": {
        "sequence": "MKTAYIAKQRQISFVKSHFSRQLE",
        "sequence_length": 24,
        "prediction": {
            "predicted_class": "Toxic",
            "confidence": 0.85,
            "confidence_level": "high",
            "toxicity_score": 0.925,
            "non_toxicity_score": 0.075
        },
        "metadata": {
            "embedding_model": "ProtBERT",
            "embedding_type": "Bert",
            "model_version": "MHSA-GRU-Transformer-v1.0",
            "device": "cuda"
        }
    },
    "timestamp": "2025-01-21T10:30:00Z",
    "api_version": "1.0.0",
    "processing_time_ms": 45.2
}

4. Batch Prediction

POST /predict/batch

Predict toxicity for multiple sequences at once.

Request in Postman/cURL:

curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "sequences": [
      "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
      "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK",
      "MKTAYIAKQRQISFVKSHFSRQLE"
    ]
  }'

Request Body (JSON):

{
    "sequences": [
        "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
        "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK"
    ]
}

Response (to the two-sequence request body above):

{
    "status_code": 200,
    "status": "success",
    "success": true,
    "data": {
        "total_sequences": 2,
        "results": [
            {
                "sequence": "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
                "sequence_length": 50,
                "predicted_class": "Toxic",
                "toxicity_score": 0.925,
                "confidence": 0.85
            },
            {
                "sequence": "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK",
                "sequence_length": 50,
                "predicted_class": "Non-Toxic",
                "toxicity_score": 0.125,
                "confidence": 0.75
            }
        ],
        "metadata": {
            "embedding_model": "ProtBERT",
            "embedding_type": "Bert",
            "model_version": "MHSA-GRU-Transformer-v1.0",
            "device": "cuda"
        }
    },
    "timestamp": "2025-01-21T10:30:00Z",
    "api_version": "1.0.0",
    "processing_time_ms": 125.8
}
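
The sample values are consistent with a confidence equal to twice the distance of the toxicity score from 0.5: a score of 0.925 gives 0.85, and 0.125 gives 0.75. This is an inference from the example numbers only. The README does not state how the API computes confidence, and the 0.5 class threshold used below is likewise an assumption.

```python
def confidence_from_score(toxicity_score):
    """Assumed relation: confidence = 2 * |score - 0.5|, ranging over [0, 1]."""
    return 2 * abs(toxicity_score - 0.5)


def predicted_class(toxicity_score, threshold=0.5):
    """Assumed decision rule: scores at or above the threshold are 'Toxic'."""
    return "Toxic" if toxicity_score >= threshold else "Non-Toxic"


# Scores taken from the sample responses.
for score in (0.925, 0.125):
    print(predicted_class(score), round(confidence_from_score(score), 3))
```

If the relation holds, a score near 0.5 yields a confidence near 0, which is a useful signal for flagging borderline sequences for manual review.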

🐍 Python Usage Examples

Single Prediction

import requests

# Send one sequence to the single-prediction endpoint.
response = requests.post(
    "http://localhost:8000/predict",
    json={"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"},
    timeout=30,
)
response.raise_for_status()  # Stop early on HTTP 4xx/5xx responses.

# The prediction fields are nested under data -> prediction.
prediction = response.json()["data"]["prediction"]
print(f"Predicted Class: {prediction['predicted_class']}")
print(f"Toxicity Score: {prediction['toxicity_score']:.4f}")
print(f"Confidence: {prediction['confidence']:.4f}")

Batch Prediction

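import requests

sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLE",
    "ARNDCEQGHILKMFPSTWYV",
    "MVHLTPEEKS",
]

# Send all sequences in one request to the batch endpoint.
response = requests.post(
    "http://localhost:8000/predict/batch",
    json={"sequences": sequences},
    timeout=60,
)
response.raise_for_status()  # Stop early on HTTP 4xx/5xx responses.

# Results are returned in the same order as the submitted sequences.
results = response.json()
for i, pred in enumerate(results['data']['results'], 1):
    print(f"Sequence {i}: {pred['predicted_class']} ({pred['toxicity_score']:.4f})")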

📁 Project Structure

toxicity-api/
├── app.py                 # Main FastAPI application
├── requirements.txt       # Python dependencies
├── test_api.py            # Test suite
├── .env                   # Environment variables (create this)
├── models/                # Downloaded models (auto-created)
└── README.md              # This file

🔒 HuggingFace Repository Structure

Your private repository camlas/toxicity should contain:

camlas/toxicity/
├── mhsa_gru_classifier.pth    # Trained MHSA-GRU model
├── scaler.pkl                 # Feature scaler
├── config.json                # ProtBERT config
├── model.safetensors          # ProtBERT weights
├── vocab.txt                  # ProtBERT vocabulary
├── tokenizer_config.json      # Tokenizer configuration
└── special_tokens_map.json    # Special tokens mapping
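
When models fail to load, it can help to confirm that every file in the listing actually reached the local machine. The sketch below assumes the files are stored flat in the auto-created `models/` directory under the same names as in the repository. The README does not confirm that layout, and `missing_model_files` is a hypothetical helper.

```python
from pathlib import Path

# File names copied from the repository listing above.
EXPECTED_FILES = [
    "mhsa_gru_classifier.pth",
    "scaler.pkl",
    "config.json",
    "model.safetensors",
    "vocab.txt",
    "tokenizer_config.json",
    "special_tokens_map.json",
]


def missing_model_files(models_dir="models"):
    """Return the expected files that are absent from models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all seven files are present. Any names returned point to an incomplete download or a repository that lacks those files.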

🎨 Model Architecture

Feature Extraction: ProtBERT (1024-dimensional embeddings)
Feature Scaling: StandardScaler
Classification: MHSA-GRU
- Multi-Head Self-Attention (3 layers)
- Bidirectional GRU (2 layers)
- Fully connected layers with dropout
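
StandardScaler-style scaling subtracts a per-feature mean and divides by a per-feature standard deviation, both learned from training data. The pure-Python sketch below shows that transform on made-up three-dimensional values. The real embeddings are 1024-dimensional, and the fitted statistics would presumably come from scaler.pkl, which is an assumption about how the project uses that file.

```python
def standardize(vector, means, stds):
    """Apply z = (x - mean) / std feature by feature."""
    if not (len(vector) == len(means) == len(stds)):
        raise ValueError("vector, means and stds must have the same length")
    return [(x - m) / s for x, m, s in zip(vector, means, stds)]


# Toy 3-dimensional example; the numbers are illustrative only.
scaled = standardize([1.0, 4.0, -2.0], means=[0.0, 2.0, -2.0], stds=[1.0, 2.0, 0.5])
print(scaled)  # [1.0, 1.0, 0.0]
```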

⚠️ Error Codes

MISSING_SEQUENCE: No sequence provided in request
SEQUENCE_TOO_SHORT: Sequence length < 10 amino acids
MODEL_NOT_LOADED: Models failed to load from HuggingFace
INTERNAL_ERROR: Unexpected server error
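
The first two codes can be anticipated on the client before a request is sent. The sketch below mirrors the documented rules (a sequence must be present and at least 10 amino acids long). Stripping whitespace is an added assumption, and `precheck_sequence` is a hypothetical helper, so the server's own validation may differ in detail.

```python
MIN_LENGTH = 10  # From SEQUENCE_TOO_SHORT: lengths below 10 are rejected.


def precheck_sequence(sequence):
    """Return the error code the input would likely trigger, or None if it passes."""
    if sequence is None or not sequence.strip():
        return "MISSING_SEQUENCE"
    if len(sequence.strip()) < MIN_LENGTH:
        return "SEQUENCE_TOO_SHORT"
    return None
```

For example, `precheck_sequence("MVHLTPEEKS")` passes because the sequence is exactly 10 residues long, while a 9-residue input returns "SEQUENCE_TOO_SHORT".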

📊 Performance

Single prediction: ~40-50ms (GPU)
Batch prediction (10 sequences): ~100-150ms (GPU)
Model loading time: ~10-15 seconds (first time)

🐛 Troubleshooting

Models not loading

Check your HuggingFace token in .env
Verify you have access to the private repository
Check internet connection
Look at console logs for specific errors

CUDA out of memory

Reduce batch size
Use CPU instead: Set device = "cpu" in code
Process sequences one at a time
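
One way to reduce the batch size from the client side is to split a long list into smaller requests. This is a minimal sketch: `chunked` is a hypothetical helper, and a suitable `batch_size` depends on the GPU and sequence lengths, so the value 2 below is only for illustration.

```python
def chunked(sequences, batch_size):
    """Yield consecutive slices of at most batch_size sequences."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size]


sequences = ["MKTAYIAKQRQISFVKSHFSRQLE", "ARNDCEQGHILKMFPSTWYV", "MVHLTPEEKS"]
for batch in chunked(sequences, batch_size=2):
    # Each batch would be sent as {"sequences": batch} to POST /predict/batch.
    print(len(batch))
```

Setting `batch_size=1` corresponds to processing sequences one at a time.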

Slow predictions

Ensure GPU is being used (check /health endpoint)
First prediction is always slower (model initialization)

🌐 Public Usage Guidelines

Free to Use: No authentication or API keys required.
Rate Limiting: Fair usage is expected. Please do not abuse the service.
Educational Purpose: Designed for research and educational use.
Medical Disclaimer: Not for clinical diagnosis. See disclaimer below.
Availability: Best effort uptime, not guaranteed 24/7.

⚠️ Medical Disclaimer

IMPORTANT: This API is designed for research and educational purposes only. It should NOT be used for clinical diagnosis or medical decision-making. Always consult qualified medical professionals for diagnostic decisions.

🏢 About CAMLAs

CAMLAs (Centre for Advanced Machine Learning & Applications) is a research organization focused on advancing AI applications in medical imaging and healthcare.

Team Members:

S M Hasan Mahmud – Principal Investigator & Supervisor
Roles: Writing – Original Draft, Writing – Review & Editing, Conceptualization, Supervision, Project Administration
Francis Rudra D Cruze – Lead Developer & Researcher
Roles: Methodology, Software, Formal Analysis, Investigation, Resources, Visualization

📞 Support & Contact

Issues: GitHub Repository Issues
Email: drhasan.swe@diu.edu.bd
Documentation: This README
API Status: Check /health endpoint
Website Integration: Perfect for ovarian.francisrudra.com

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

CAMLAs - Center for Advanced Machine Learning and Applications
Advancing Medical AI Research with Public FastAPI 🌐🚀