	---
	title: Toxicity Prediction API
	description: A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and an MHSA-GRU classifier.
	short_description: Toxicity Prediction API
	version: 1.0.0
	emoji: 🧬
	colorFrom: green
	colorTo: blue
	sdk: docker
	app_file: app.py
	pinned: false
	license: mit
	tags:
	- protein-toxicity
	- protbert
	- mhsa-gru
	- pytorch
	- fastapi
	---

	# Toxicity Prediction API

	A FastAPI-based REST API for predicting protein sequence toxicity using ProtBERT embeddings and an MHSA-GRU classifier.

	Developed by the CAMLAs research team - [Francis Rudra D Cruze](https://linkedin.com/in/rudradcruze).

	## 🚀 Features

	- ProtBERT Feature Extraction: Uses a state-of-the-art protein language model
	- MHSA-GRU Classification: Multi-Head Self-Attention with GRU for accurate predictions
	- Single & Batch Predictions: Process one or multiple sequences
	- HuggingFace Integration: Automatic model loading from private repository
	- Production Ready: Health checks, error handling, and comprehensive logging

	## 📋 Requirements

	- Python 3.8+
	- CUDA-capable GPU (optional, but recommended)
	- HuggingFace account with access to private repository

	## 🔧 Installation

	1. Clone the repository

	```bash
	git clone https://huggingface.co/spaces/camlas/toxicity
	cd toxicity
	```

	2. Create virtual environment

	```bash
	python -m venv venv
	source venv/bin/activate # On Windows: venv\Scripts\activate
	```

	3. Install dependencies

	```bash
	pip install -r requirements.txt
	```

	4. Create `.env` file

	```bash
	echo "HF_TOKEN=your_huggingface_token_here" > .env
	```

	Get your HuggingFace token from: https://huggingface.co/settings/tokens
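
How `app.py` reads the token is not shown in this README. As a minimal sketch, assuming the `.env` file holds plain `KEY=value` lines like the one created above, a standard-library loader could look like this. `load_env_file` and `get_hf_token` are hypothetical helper names, not part of the project.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines and return them as a dict.

    Hypothetical helper: blank lines, comment lines starting with '#', and
    lines without '=' are skipped; quotes around the value are stripped.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_hf_token(path=".env"):
    """Prefer an exported HF_TOKEN, then fall back to the .env file."""
    return os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
```

Checking the environment first means a token exported in the shell (or set as a Space secret) takes precedence over the local file.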

	## 🎯 Usage

	### Start the API Server

	```bash
	python app.py
	```

	Or with uvicorn directly:

	```bash
	uvicorn app:app --host 0.0.0.0 --port 8000 --reload
	```

	The API will be available at: `http://localhost:8000`

	### Run Tests

	```bash
	python test_api.py
	```

	## 📡 API Endpoints

	### 1. Root Endpoint

	GET `/`

	Returns API information and available endpoints.

	```bash
	curl http://localhost:8000/
	```

	### 2. Health Check

	GET `/health`

	Check API status and model loading status.

	```bash
	curl http://localhost:8000/health
	```

	Response:

	```json
	{
	  "status_code": 200,
	  "status": "healthy",
	  "service": "Toxicity Prediction API",
	  "api_version": "1.0.0",
	  "model_version": "MHSA-GRU-Transformer-v1.0",
	  "models_loaded": true,
	  "device": "cuda",
	  "timestamp": "2025-01-21T10:30:00Z"
	}
	```
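
Because the models take a while to load, a client may want to wait for `models_loaded` to become `true` before sending predictions. The following is a sketch under assumptions: it relies only on the `status` and `models_loaded` fields shown in the sample response, and the function names and retry settings are illustrative.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:8000"):
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def is_ready(health):
    """True when the payload reports a healthy service with models loaded."""
    return health.get("status") == "healthy" and health.get("models_loaded") is True


def wait_until_ready(fetch=fetch_health, attempts=10, delay_s=2.0):
    """Poll until ready; return the ready payload, or None if it never is."""
    for attempt in range(attempts):
        if attempt:
            time.sleep(delay_s)  # pause between attempts, not before the first
        try:
            health = fetch()
        except OSError:
            continue  # server not reachable yet (URLError is an OSError)
        if is_ready(health):
            return health
    return None
```

The returned payload also carries `device`, so the same call can confirm whether the server is running on `cuda` or `cpu`.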

	### 3. Single Prediction

	POST `/predict`

	Predict toxicity for a single protein sequence.

	Request:

	```bash
	curl -X POST http://localhost:8000/predict \
	  -H "Content-Type: application/json" \
	  -d '{"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"}'
	```

	Response:

	```json
	{
	  "status_code": 200,
	  "status": "success",
	  "success": true,
	  "data": {
	    "sequence": "MKTAYIAKQRQISFVKSHFSRQLE",
	    "sequence_length": 24,
	    "prediction": {
	      "predicted_class": "Toxic",
	      "confidence": 0.85,
	      "confidence_level": "high",
	      "toxicity_score": 0.925,
	      "non_toxicity_score": 0.075
	    },
	    "metadata": {
	      "embedding_model": "ProtBERT",
	      "embedding_type": "Bert",
	      "model_version": "MHSA-GRU-Transformer-v1.0",
	      "device": "cuda"
	    }
	  },
	  "timestamp": "2025-01-21T10:30:00Z",
	  "api_version": "1.0.0",
	  "processing_time_ms": 45.2
	}
	```
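
The nested response can be flattened into a small typed object on the client side. This is a sketch, not part of the API: `Prediction`, `parse_prediction`, and `scores_are_complementary` are hypothetical names, and the check that the two scores sum to 1 is an assumption inferred from the single sample above (0.925 + 0.075).

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    sequence: str
    predicted_class: str
    confidence: float
    toxicity_score: float
    non_toxicity_score: float


def parse_prediction(payload):
    """Extract the fields shown in the sample /predict response."""
    if not payload.get("success"):
        raise ValueError(f"prediction failed: {payload.get('status')}")
    data = payload["data"]
    pred = data["prediction"]
    return Prediction(
        sequence=data["sequence"],
        predicted_class=pred["predicted_class"],
        confidence=pred["confidence"],
        toxicity_score=pred["toxicity_score"],
        non_toxicity_score=pred["non_toxicity_score"],
    )


def scores_are_complementary(prediction, tol=1e-6):
    """Assumed sanity check: the two class scores should sum to about 1."""
    total = prediction.toxicity_score + prediction.non_toxicity_score
    return abs(total - 1.0) <= tol
```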

	### 4. Batch Prediction

	POST `/predict/batch`

	Predict toxicity for multiple sequences at once.

	Request in Postman/cURL:

	```bash
	curl -X POST http://localhost:8000/predict/batch \
	  -H "Content-Type: application/json" \
	  -d '{
	    "sequences": [
	      "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
	      "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK"
	    ]
	  }'
	```

	Request Body (JSON):

	```json
	{
	  "sequences": [
	    "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
	    "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK"
	  ]
	}
	```

	Response:

	```json
	{
	  "status_code": 200,
	  "status": "success",
	  "success": true,
	  "data": {
	    "total_sequences": 2,
	    "results": [
	      {
	        "sequence": "MLLPATMSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES",
	        "sequence_length": 50,
	        "predicted_class": "Toxic",
	        "toxicity_score": 0.925,
	        "confidence": 0.85
	      },
	      {
	        "sequence": "MFGLPQQEVSEEEKRAHQEQTEKTLKQAAYVAAFLWVSPMIWHLVKKQWK",
	        "sequence_length": 50,
	        "predicted_class": "Non-Toxic",
	        "toxicity_score": 0.125,
	        "confidence": 0.75
	      }
	    ],
	    "metadata": {
	      "embedding_model": "ProtBERT",
	      "embedding_type": "Bert",
	      "model_version": "MHSA-GRU-Transformer-v1.0",
	      "device": "cuda"
	    }
	  },
	  "timestamp": "2025-01-21T10:30:00Z",
	  "api_version": "1.0.0",
	  "processing_time_ms": 125.8
	}
	```
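
For larger batches it can help to summarize the response rather than read every entry. The sketch below assumes only the `data.results` layout shown in the sample batch response; `summarize_batch` is a hypothetical client-side helper.

```python
from collections import Counter


def summarize_batch(payload):
    """Return per-class counts and the results ranked by toxicity score.

    Uses only `predicted_class` and `toxicity_score` from each result entry.
    """
    results = payload["data"]["results"]
    counts = Counter(item["predicted_class"] for item in results)
    ranked = sorted(results, key=lambda item: item["toxicity_score"], reverse=True)
    return counts, ranked
```

Ranking by `toxicity_score` puts the sequences the classifier considers most likely toxic first, which is often the order a reviewer wants.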

	## 🐍 Python Usage Examples

	### Single Prediction

	```python
	import requests

	response = requests.post(
	    "http://localhost:8000/predict",
	    json={"sequence": "MKTAYIAKQRQISFVKSHFSRQLE"},
	)
	response.raise_for_status()  # fail early on HTTP errors

	result = response.json()
	prediction = result["data"]["prediction"]
	print(f"Predicted Class: {prediction['predicted_class']}")
	print(f"Toxicity Score: {prediction['toxicity_score']:.4f}")
	print(f"Confidence: {prediction['confidence']:.4f}")
	```

	### Batch Prediction

	```python
	import requests

	sequences = [
	    "MKTAYIAKQRQISFVKSHFSRQLE",
	    "ARNDCEQGHILKMFPSTWYV",
	    "MVHLTPEEKS",
	]

	response = requests.post(
	    "http://localhost:8000/predict/batch",
	    json={"sequences": sequences},
	)
	response.raise_for_status()  # fail early on HTTP errors

	results = response.json()
	for i, pred in enumerate(results["data"]["results"], 1):
	    print(f"Sequence {i}: {pred['predicted_class']} ({pred['toxicity_score']:.4f})")
	```

	## 📁 Project Structure

	```
	toxicity-api/
	├── app.py              # Main FastAPI application
	├── requirements.txt    # Python dependencies
	├── test_api.py         # Test suite
	├── .env                # Environment variables (create this)
	├── models/             # Downloaded models (auto-created)
	└── README.md           # This file
	```

	## 🔒 HuggingFace Repository Structure

	Your private repository `camlas/toxicity` should contain:

	```
	camlas/toxicity/
	├── mhsa_gru_classifier.pth    # Trained MHSA-GRU model
	├── scaler.pkl                 # Feature scaler
	├── config.json                # ProtBERT config
	├── model.safetensors          # ProtBERT weights
	├── vocab.txt                  # ProtBERT vocabulary
	├── tokenizer_config.json      # Tokenizer configuration
	└── special_tokens_map.json    # Special tokens mapping
	```

	## 🎨 Model Architecture

	1. Feature Extraction: ProtBERT (1024-dimensional embeddings)
	2. Feature Scaling: StandardScaler
	3. Classification: MHSA-GRU
	   - Multi-Head Self-Attention (3 layers)
	   - Bidirectional GRU (2 layers)
	   - Fully connected layers with dropout
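
The classifier's exact definition is not included in this README. The PyTorch sketch below only illustrates how the listed pieces could fit together; it is not the trained model. Everything beyond the layer counts and the 1024-dimensional input is an assumption: the split of the embedding into 16 pseudo-tokens of size 64, the head count, hidden sizes, residual connections with LayerNorm, and mean pooling. The class name `MHSAGRUSketch` is hypothetical.

```python
import torch
from torch import nn


class MHSAGRUSketch(nn.Module):
    """Illustrative only: layer counts follow the list above, sizes are assumed."""

    def __init__(self, embed_dim=1024, n_tokens=16, n_heads=4,
                 gru_hidden=64, n_classes=2, dropout=0.3):
        super().__init__()
        if embed_dim % n_tokens != 0:
            raise ValueError("embed_dim must be divisible by n_tokens")
        self.n_tokens = n_tokens
        self.token_dim = embed_dim // n_tokens  # 64 with the defaults
        # Three multi-head self-attention layers, each followed by LayerNorm.
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(self.token_dim, n_heads, batch_first=True)
             for _ in range(3)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(self.token_dim) for _ in range(3)])
        # Two-layer bidirectional GRU over the attended tokens.
        self.gru = nn.GRU(self.token_dim, gru_hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Fully connected head with dropout.
        self.head = nn.Sequential(
            nn.Linear(2 * gru_hidden, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        # x: (batch, 1024) scaled ProtBERT embedding, split into pseudo-tokens.
        tokens = x.reshape(x.size(0), self.n_tokens, self.token_dim)
        for attn, norm in zip(self.attn_layers, self.norms):
            attended, _ = attn(tokens, tokens, tokens)
            tokens = norm(tokens + attended)  # residual connection
        gru_out, _ = self.gru(tokens)
        pooled = gru_out.mean(dim=1)  # average over the token axis
        return self.head(pooled)  # raw logits, shape (batch, n_classes)
```

In this sketch the input is expected to be the ProtBERT embedding after the StandardScaler step, and a softmax over the two logits would give class scores.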

	## ⚠️ Error Codes

	- `MISSING_SEQUENCE`: No sequence provided in request
	- `SEQUENCE_TOO_SHORT`: Sequence length < 10 amino acids
	- `MODEL_NOT_LOADED`: Models failed to load from HuggingFace
	- `INTERNAL_ERROR`: Unexpected server error
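
The first two codes can be anticipated on the client before a request is sent. This is a sketch that mirrors the rules listed above; `precheck_sequence` is a hypothetical helper, and whether the server trims whitespace before measuring length is an assumption.

```python
MIN_SEQUENCE_LENGTH = 10  # from the SEQUENCE_TOO_SHORT rule above


def precheck_sequence(sequence):
    """Return the error code a sequence would likely trigger, or None if it passes."""
    if sequence is None or not str(sequence).strip():
        return "MISSING_SEQUENCE"
    if len(str(sequence).strip()) < MIN_SEQUENCE_LENGTH:
        return "SEQUENCE_TOO_SHORT"
    return None
```

For example, `precheck_sequence("MVHLTPEEK")` (9 residues) returns `"SEQUENCE_TOO_SHORT"`, while the 10-residue `"MVHLTPEEKS"` passes.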

	## 📊 Performance

	- Single prediction: ~40-50ms (GPU)
	- Batch prediction (10 sequences): ~100-150ms (GPU)
	- Model loading time: ~10-15 seconds (first time)

	## 🐛 Troubleshooting

	### Models not loading

	1. Check your HuggingFace token in `.env`
	2. Verify you have access to the private repository
	3. Check internet connection
	4. Look at console logs for specific errors

	### CUDA out of memory

	- Reduce batch size
	- Use CPU instead: Set `device = "cpu"` in code
	- Process sequences one at a time
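
One way to reduce the batch size from the client side is to split a long list into smaller requests. The sketch below assumes the `/predict/batch` response layout documented earlier in this README; `chunked`, `predict_in_chunks`, and the `send_batch` callable are hypothetical names, and the default chunk size of 4 is arbitrary.

```python
def chunked(sequences, size):
    """Yield consecutive slices of at most `size` sequences."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(sequences), size):
        yield sequences[start:start + size]


def predict_in_chunks(sequences, send_batch, size=4):
    """Call `send_batch` once per chunk and concatenate the per-sequence results.

    `send_batch` is a hypothetical callable that posts one list of sequences
    to /predict/batch and returns the decoded JSON response.
    """
    results = []
    for chunk in chunked(sequences, size):
        payload = send_batch(chunk)
        results.extend(payload["data"]["results"])
    return results
```

With `requests`, `send_batch` could be as simple as `lambda seqs: requests.post(url, json={"sequences": seqs}).json()`; setting `size=1` processes sequences one at a time.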

	### Slow predictions

	- Ensure GPU is being used (check `/health` endpoint)
	- First prediction is always slower (model initialization)

	## 🌐 Public Usage Guidelines

	- Free to Use: No authentication or API keys required.
	- Rate Limiting: Fair usage is expected. Please do not abuse the service.
	- Educational Purpose: Designed for research and educational use.
	- Medical Disclaimer: Not for clinical diagnosis. See disclaimer below.
	- Availability: Best effort uptime, not guaranteed 24/7.

	## ⚠️ Medical Disclaimer

	IMPORTANT: This API is designed for research and educational purposes only. It should NOT be used for clinical diagnosis or medical decision-making. Always consult qualified medical professionals for diagnostic decisions.

	## 🏢 About CAMLAs

	CAMLAs (Centre for Advanced Machine Learning & Applications) is a research organization focused on advancing AI applications in medical imaging and healthcare.

	Team Members:

	- S M Hasan Mahmud – Principal Investigator & Supervisor
	  _Roles:_ Writing – Original Draft, Writing – Review & Editing, Conceptualization, Supervision, Project Administration

	- Francis Rudra D Cruze – Lead Developer & Researcher
	  _Roles:_ Methodology, Software, Formal Analysis, Investigation, Resources, Visualization

	## 📞 Support & Contact

	- Issues: [GitHub Repository Issues](https://github.com/camlas/ovarian-cancer)
	- Email: drhasan.swe@diu.edu.bd
	- Documentation: This README
	- API Status: Check `/health` endpoint
	- Website Integration: Perfect for ovarian.francisrudra.com

	## 📄 License

	This project is licensed under the MIT License - see the LICENSE file for details.

	---

	CAMLAs - Center for Advanced Machine Learning and Applications
	_Advancing Medical AI Research with Public FastAPI_ 🌐🚀