---
title: Toxicity Prediction API
description: A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and MHSA-GRU classifier.
short_description: Toxicity Prediction API
version: 1.0.0
emoji: 🧬
colorFrom: green
colorTo: blue
sdk: docker
app_file: app.py
pinned: false
license: mit
tags:
  - protein-toxicity
  - protbert
  - mhsa-gru
  - pytorch
  - fastapi
---

# Toxicity Prediction API

A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and an MHSA-GRU classifier.

Developed by the CAMLAs research team - [Francis Rudra D Cruze](https://linkedin.com/in/rudradcruze).

## 🚀 Features

- **ProtBERT Feature Extraction**: Uses a state-of-the-art protein language model
- **MHSA-GRU Classification**: Multi-Head Self-Attention with GRU for accurate predictions
- **Single & Batch Predictions**: Process one or multiple sequences
- **HuggingFace Integration**: Automatic model loading from a private repository
- **Production Ready**: Health checks, error handling, and comprehensive logging

## 📋 Requirements

- Python 3.8+
- CUDA-capable GPU (optional, but recommended)
- HuggingFace account with access to the private repository

## 🔧 Installation

1. **Clone the repository**

   ```bash
   git clone https://huggingface.co/spaces/camlas/toxicity
   cd toxicity
   ```

2. **Create virtual environment**

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Create `.env` file**

   ```bash
   echo "HF_TOKEN=your_huggingface_token_here" > .env
   ```

   Get your HuggingFace token from: https://huggingface.co/settings/tokens

## 🎯 Usage

### Start the API Server

```bash
python app.py
```

Or with uvicorn directly:

```bash
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```

The API will be available at: `http://localhost:8000`

### Run Tests

```bash
python test_api.py
```

## 📡 API Endpoints

### 1. Root Endpoint

**GET** `/`

Returns API information and available endpoints.

```bash
curl http://localhost:8000/
```

### 2. Health Check

**GET** `/health`

Check API status and model loading status.

```bash
curl http://localhost:8000/health
```

**Response:**

```json
{
  "status_code": 200,
  "status": "healthy",
  "service": "Toxicity Prediction API",
  "api_version": "1.0.0",
  "model_version": "MHSA-GRU-Transformer-v1.0",
  "models_loaded": true,
  "device": "cuda",
  "timestamp": "2025-01-21T10:30:00Z"
}
```

### 3. Single Prediction

**POST** `/predict`

Predict toxicity for a single protein sequence.

**Request:**

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"}'
```

**Response:**

```json
{
  "status_code": 200,
  "status": "success",
  "success": true,
  "data": {
    "sequence": "MKTAYIAKQRQISFVKSHFSRQLE",
    "sequence_length": 24,
    "prediction": {
      "predicted_class": "Toxic",
      "confidence": 0.85,
      "confidence_level": "high",
      "toxicity_score": 0.925,
      "non_toxicity_score": 0.075
    },
    "metadata": {
      "embedding_model": "ProtBERT",
      "embedding_type": "Bert",
      "model_version": "MHSA-GRU-Transformer-v1.0",
      "device": "cuda"
    }
  },
  "timestamp": "2025-01-21T10:30:00Z",
  "api_version": "1.0.0",
  "processing_time_ms": 45.2
}
```

### 4. Batch Prediction

**POST** `/predict/batch`

Predict toxicity for multiple sequences at once.
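
The Error Codes section of this README lists `SEQUENCE_TOO_SHORT` for sequences under 10 amino acids, so a client may want to drop such sequences before building a batch request. The following is a minimal client-side sketch, not part of the API itself: `build_batch_payload` and `MIN_LENGTH` are hypothetical names, and stripping surrounding whitespace is an assumed normalization. How the server treats a batch that contains one short sequence is not stated in this README.

```python
from typing import Dict, List, Tuple

# Minimum length taken from the SEQUENCE_TOO_SHORT error code in this README.
MIN_LENGTH = 10


def build_batch_payload(
    sequences: List[str], min_length: int = MIN_LENGTH
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split sequences into a /predict/batch payload and a list of skipped inputs.

    Hypothetical helper: whitespace is stripped (an assumption), and any
    sequence shorter than min_length is returned in the skipped list
    instead of being sent to the API.
    """
    kept: List[str] = []
    skipped: List[str] = []
    for raw in sequences:
        seq = raw.strip()
        if len(seq) < min_length:
            skipped.append(raw)
        else:
            kept.append(seq)
    return {"sequences": kept}, skipped
```

The returned dictionary has the same shape as the batch request body, so it could be passed as the `json=` argument of `requests.post`; the skipped list lets the caller report which inputs were never sent.
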

**Request in Postman/cURL:**

```bash
curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "sequences": [
      "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
      "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK",
      "MKTAYIAKQRQISFVKSHFSRQLE"
    ]
  }'
```

**Request Body (JSON):**

```json
{
  "sequences": [
    "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
    "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK"
  ]
}
```

**Response** (to the two-sequence request body above):

```json
{
  "status_code": 200,
  "status": "success",
  "success": true,
  "data": {
    "total_sequences": 2,
    "results": [
      {
        "sequence": "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
        "sequence_length": 50,
        "predicted_class": "Toxic",
        "toxicity_score": 0.925,
        "confidence": 0.85
      },
      {
        "sequence": "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK",
        "sequence_length": 50,
        "predicted_class": "Non-Toxic",
        "toxicity_score": 0.125,
        "confidence": 0.75
      }
    ],
    "metadata": {
      "embedding_model": "ProtBERT",
      "embedding_type": "Bert",
      "model_version": "MHSA-GRU-Transformer-v1.0",
      "device": "cuda"
    }
  },
  "timestamp": "2025-01-21T10:30:00Z",
  "api_version": "1.0.0",
  "processing_time_ms": 125.8
}
```

## 🐍 Python Usage Examples

### Single Prediction

```python
import requests

# Send one sequence to the single-prediction endpoint
response = requests.post(
    "http://localhost:8000/predict",
    json={"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"}
)

result = response.json()
print(f"Predicted Class: {result['data']['prediction']['predicted_class']}")
print(f"Toxicity Score: {result['data']['prediction']['toxicity_score']:.4f}")
print(f"Confidence: {result['data']['prediction']['confidence']:.4f}")
```

### Batch Prediction

```python
import requests

sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLE",
    "ARNDCEQGHILKMFPSTWYV",
    "MVHLTPEEKS"
]

# Send all sequences in one request to the batch endpoint
response = requests.post(
    "http://localhost:8000/predict/batch",
    json={"sequences": sequences}
)

results = response.json()
for i, pred in enumerate(results['data']['results'], 1):
    print(f"Sequence {i}: {pred['predicted_class']} ({pred['toxicity_score']:.4f})")
```

## 📁 Project Structure

```
toxicity-api/
├── app.py              # Main FastAPI application
├── requirements.txt    # Python dependencies
├── test_api.py         # Test suite
├── .env                # Environment variables (create this)
├── models/             # Downloaded models (auto-created)
└── README.md           # This file
```

## 🔒 HuggingFace Repository Structure

Your private repository `camlas/toxicity` should contain:

```
camlas/toxicity/
├── mhsa_gru_classifier.pth    # Trained MHSA-GRU model
├── scaler.pkl                 # Feature scaler
├── config.json                # ProtBERT config
├── model.safetensors          # ProtBERT weights
├── vocab.txt                  # ProtBERT vocabulary
├── tokenizer_config.json      # Tokenizer configuration
└── special_tokens_map.json    # Special tokens mapping
```

## 🎨 Model Architecture

1. **Feature Extraction**: ProtBERT (1024-dimensional embeddings)
2. **Feature Scaling**: StandardScaler
3. **Classification**: MHSA-GRU
   - Multi-Head Self-Attention (3 layers)
   - Bidirectional GRU (2 layers)
   - Fully connected layers with dropout

## ⚠️ Error Codes

- `MISSING_SEQUENCE`: No sequence provided in request
- `SEQUENCE_TOO_SHORT`: Sequence length < 10 amino acids
- `MODEL_NOT_LOADED`: Models failed to load from HuggingFace
- `INTERNAL_ERROR`: Unexpected server error

## 📊 Performance

- Single prediction: ~40-50ms (GPU)
- Batch prediction (10 sequences): ~100-150ms (GPU)
- Model loading time: ~10-15 seconds (first time)

## 🐛 Troubleshooting

### Models not loading

1. Check your HuggingFace token in `.env`
2. Verify you have access to the private repository
3. Check internet connection
4. Look at console logs for specific errors

### CUDA out of memory

- Reduce batch size
- Use CPU instead: Set `device = "cpu"` in code
- Process sequences one at a time

### Slow predictions

- Ensure GPU is being used (check `/health` endpoint)
- First prediction is always slower (model initialization)

## 🌐 Public Usage Guidelines

- **Free to Use**: No authentication or API keys required.
- **Rate Limiting**: Fair usage is expected. Please do not abuse the service.
- **Educational Purpose**: Designed for research and educational use.
- **Medical Disclaimer**: Not for clinical diagnosis. See disclaimer below.
- **Availability**: Best effort uptime, not guaranteed 24/7.

## ⚠️ Medical Disclaimer

**IMPORTANT**: This API is designed for **research and educational purposes only**. It should **NOT** be used for clinical diagnosis or medical decision-making. Always consult qualified medical professionals for diagnostic decisions.

## 🏢 About CAMLAs

**CAMLAs** (Centre for Advanced Machine Learning & Applications) is a research organization focused on advancing AI applications in medical imaging and healthcare.

**Team Members:**

- **S M Hasan Mahmud** – Principal Investigator & Supervisor
  _Roles:_ Writing – Original Draft, Writing – Review & Editing, Conceptualization, Supervision, Project Administration
- **Francis Rudra D Cruze** – Lead Developer & Researcher
  _Roles:_ Methodology, Software, Formal Analysis, Investigation, Resources, Visualization

## 📞 Support & Contact

- **Issues**: [GitHub Repository Issues](https://github.com/camlas/ovarian-cancer)
- **Email**: drhasan.swe@diu.edu.bd
- **Documentation**: This README
- **API Status**: Check `/health` endpoint
- **Website Integration**: Perfect for ovarian.francisrudra.com

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

---

**CAMLAs** - Center for Advanced Machine Learning and Applications

_Advancing Medical AI Research with Public FastAPI_ 🌐🚀