ML Inference Service
FastAPI service for serving ML models over HTTP. Comes with ResNet-18 for image classification out of the box, but you can swap in any model you want.
Quick Start
Install uv:
https://docs.astral.sh/uv/getting-started/installation/
Local development:
# Install dependencies
make setup
source venv/bin/activate
# Download the example model
make download
# Run it
make serve
In a second terminal:
# Process an example input
./prompt.sh cat.json
Server runs on http://127.0.0.1:8000. Check /docs for the interactive API documentation.
Docker:
# Build
make docker-build
# Run
make docker-run
Testing the API
# Using curl
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{
"image": {
"mediaType": "image/jpeg",
"data": "<base64-encoded-image>"
}
}'
Example response:
{
"logprobs": [-0.859380304813385,-1.2701971530914307,-2.1918208599090576,-1.69235098361969],
"localizationMask": {
"mediaType":"image/png",
"data":"iVBORw0KGgoAAAANSUhEUgAAA8AAAAKDAQAAAAD9Fl5AAAAAu0lEQVR4nO3NsREAMAgDMWD/nZMVKEwn1T5/FQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMCl3g5f+HC24TRhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAj70gwKsTlmdBwAAAABJRU5ErkJggg=="
}
}
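If you prefer Python to curl, a minimal client looks like this (cat.jpg is any JPEG you have on hand; the requests library is an assumption of this sketch, not a dependency of the service):
# Hypothetical Python equivalent of the curl example above.
import base64
import requests

with open("cat.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

response = requests.post(
    "http://localhost:8000/predict",
    json={"image": {"mediaType": "image/jpeg", "data": encoded}},
    timeout=30,
)
response.raise_for_status()
print(response.json()["logprobs"])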
Project Structure
example-submission/
├── main.py                     # Entry point
├── app/
│   ├── core/
│   │   ├── app.py              # <= INSTANTIATE YOUR DETECTOR HERE
│   │   └── logging.py          # Logging setup
│   ├── api/
│   │   ├── models.py           # Request/response schemas
│   │   ├── controllers.py      # Business logic
│   │   └── routes/
│   │       └── prediction.py   # POST /predict
│   └── services/
│       ├── base.py             # <= YOUR DETECTOR IMPLEMENTS THIS INTERFACE
│       └── inference.py        # Example service based on ResNet-18
├── models/
│   └── microsoft/
│       └── resnet-18/          # Model weights and config
├── scripts/
│   ├── model_download.bash
│   ├── generate_test_datasets.py
│   └── test_datasets.py
├── Dockerfile
├── .env.example                # Environment config template
├── cat.json                    # An example /predict request object
├── makefile
├── prompt.sh                   # Script that makes a /predict request
├── requirements.in
├── requirements.txt
└── response.json               # An example /predict response object
How to Plug In Your Own Model
To integrate your model, implement the InferenceService abstract class defined in app/services/base.py. You can follow the example implementation in app/services/inference.py, which is based on ResNet-18. After implementing the required interface, instantiate your model in the lifespan() function in app/core/app.py, replacing the ResNetInferenceService instance.
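For reference, the interface is generic over the request and response types. Here is a sketch of what app/services/base.py plausibly looks like, inferred from the example in Step 1 below (the actual file is authoritative):
# Plausible shape of app/services/base.py; check the real file for
# the authoritative signatures.
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

RequestT = TypeVar("RequestT")
ResponseT = TypeVar("ResponseT")

class InferenceService(ABC, Generic[RequestT, ResponseT]):
    @abstractmethod
    def load_model(self) -> None:
        """Load model weights; called once at startup."""

    @abstractmethod
    def predict(self, request: RequestT) -> ResponseT:
        """Run inference on a single request."""

    @property
    @abstractmethod
    def is_loaded(self) -> bool:
        """True once load_model() has completed."""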
Step 1: Create Your Service Class
# app/services/your_model_service.py
from app.services.base import InferenceService
from app.api.models import ImageRequest, PredictionResponse
class YourModelService(InferenceService[ImageRequest, PredictionResponse]):
def __init__(self, model_name: str):
self.model_name = model_name
self.model_path = f"models/{model_name}"
self.model = None
self._is_loaded = False
def load_model(self) -> None:
"""Load your model here. Called once at startup."""
self.model = load_your_model(self.model_path)
self._is_loaded = True
def predict(self, request: ImageRequest) -> PredictionResponse:
"""Actual inference happens here."""
image = decode_base64_image(request.image.data)
result = self.model(image)
logprobs = ...
mask = ...
return PredictionResponse(
logprobs=logprobs,
localizationMask=mask,
)
@property
def is_loaded(self) -> bool:
return self._is_loaded
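Note that load_your_model and decode_base64_image above are placeholders, not helpers shipped with the repo. A minimal image decoder, assuming you use Pillow, could look like:
# Hypothetical helper for the skeleton above (assumes Pillow is installed).
import base64
import io

from PIL import Image

def decode_base64_image(data: str) -> Image.Image:
    """Decode a base64 string into an RGB PIL image."""
    raw = base64.b64decode(data)
    return Image.open(io.BytesIO(raw)).convert("RGB")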
Step 2: Register Your Service
Open app/core/app.py and find the lifespan function:
# Change this line:
service = ResNetInferenceService(model_name="microsoft/resnet-18")
# To this:
service = YourModelService(...)
That's it. The /predict endpoint now serves your model.
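For context, a lifespan() function wired this way typically follows FastAPI's asynccontextmanager pattern. A sketch, assuming the service is stashed on app.state (the actual wiring in app/core/app.py may differ):
# Sketch of the registration point; see app/core/app.py for the real wiring.
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.services.your_model_service import YourModelService

@asynccontextmanager
async def lifespan(app: FastAPI):
    service = YourModelService(model_name="your-org/your-model")
    service.load_model()         # fail fast if weights are missing
    app.state.service = service  # assumption: handlers read it from app.state
    yield                        # the app serves requests here
    # optional cleanup runs after the yield on shutdown

app = FastAPI(lifespan=lifespan)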
Model Files
Put your model files under the models/ directory:
models/
└── your-org/
    └── your-model/
        ├── config.json
        ├── weights.bin
        └── (other files)
Configuration
Settings are managed via environment variables or a .env file. See .env.example for all available options.
Default values:
- APP_NAME: "ML Inference Service"
- APP_VERSION: "0.1.0"
- DEBUG: false
- HOST: "0.0.0.0"
- PORT: 8000
- MODEL_NAME: "microsoft/resnet-18"
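These defaults plausibly live in a settings class along the following lines (a sketch assuming pydantic-settings; the repo's actual class may differ):
# Sketch of a settings class matching the defaults above
# (assumes pydantic-settings; check the repo for the real definition).
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # protected_namespaces=() lets us define a field named model_name
    model_config = SettingsConfigDict(env_file=".env", protected_namespaces=())

    app_name: str = "ML Inference Service"
    app_version: str = "0.1.0"
    debug: bool = False
    host: str = "0.0.0.0"
    port: int = 8000
    model_name: str = "microsoft/resnet-18"

settings = Settings()  # reads environment variables first, then .env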
To customize:
# Copy the example
cp .env.example .env
# Edit values
vim .env
Or set environment variables directly:
export MODEL_NAME="google/vit-base-patch16-224"
uvicorn main:app --reload
Deployment
Development:
uvicorn main:app --reload
Production:
gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000
The service runs on CPU by default. For GPU inference, install CUDA-enabled PyTorch and modify your service to move tensors to the GPU device.
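A minimal sketch of that change, assuming PyTorch:
# Sketch: device selection for GPU inference (assumes PyTorch).
import torch

def pick_device() -> torch.device:
    """Prefer CUDA when available; fall back to CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Inside your service:
#   load_model(): self.device = pick_device(); self.model.to(self.device).eval()
#   predict():    move input tensors with tensor.to(self.device)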
Docker:
- Multi-stage build keeps the image small
- Runs as non-root user (appuser)
- Python dependencies installed in user site-packages
- Model files baked into the image
What Happens When You Start the Server
INFO: Starting ML Inference Service...
INFO: Initializing ResNet service: models/microsoft/resnet-18
INFO: Loading model from models/microsoft/resnet-18
INFO: Model loaded: 1000 classes
INFO: Startup completed successfully
INFO: Uvicorn running on http://0.0.0.0:8000
If you see "Model directory not found", check that your model files exist at the expected path with the full org/model structure.
API Reference
Endpoint: POST /predict
Request:
{
"image": {
"mediaType": "image/jpeg", // or "image/png"
"data": "<base64 string>"
}
}
Response:
{
"logprobs": [float], // Log-probabilities of each label
"localizationMask": { // [Optional] binary mask
"mediaType": "image/png", // Always png
"data": "<base64 string>" // Image data
}
}
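The schemas in app/api/models.py plausibly mirror this JSON. A pydantic sketch (field names taken from the examples above; ImagePayload is a hypothetical name):
# Sketch of request/response schemas matching the JSON above;
# check app/api/models.py for the real definitions.
from typing import Optional

from pydantic import BaseModel

class ImagePayload(BaseModel):  # hypothetical name
    mediaType: str  # "image/jpeg" or "image/png" in requests; "image/png" in responses
    data: str       # base64-encoded bytes

class ImageRequest(BaseModel):
    image: ImagePayload

class PredictionResponse(BaseModel):
    logprobs: list[float]                            # one log-probability per label
    localizationMask: Optional[ImagePayload] = None  # optional binary mask, always PNG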
Docs:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI JSON: http://localhost:8000/openapi.json
PyArrow Test Datasets
We've included a test dataset system for validating your model. It generates 100 standardized test datasets covering normal inputs, edge cases, performance benchmarks, and model comparisons.
Generate Datasets
python scripts/generate_test_datasets.py
This creates:
- scripts/test_datasets/*.parquet - Test data (images, requests, expected responses)
- scripts/test_datasets/*_metadata.json - Human-readable descriptions
- scripts/test_datasets/datasets_summary.json - Overview of all datasets
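To peek at a generated dataset yourself, pyarrow reads the Parquet files directly (the file name and column layout below are illustrative, not guaranteed by the generator):
# Inspect a generated dataset; the exact file name and columns
# depend on what the generator produced.
import pyarrow.parquet as pq

table = pq.read_table("scripts/test_datasets/standard_test_001.parquet")
print(table.schema)            # see which columns are actually present
print(table.num_rows, "rows")
df = table.to_pandas()         # convert for easier inspection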
Run Tests
# Start your service first
make serve
In another terminal:
# Quick test (5 samples per dataset)
python scripts/test_datasets.py --quick
# Full validation
python scripts/test_datasets.py
# Test specific category
python scripts/test_datasets.py --category edge_case
Dataset Categories (25 datasets each)
1. Standard Tests (standard_test_*.parquet)
- Normal images: random patterns, shapes, gradients
- Common sizes: 224x224, 256x256, 299x299, 384x384
- Formats: JPEG, PNG
- Purpose: Baseline validation
2. Edge Cases (edge_case_*.parquet)
- Tiny images (32x32, 1x1)
- Huge images (2048x2048)
- Extreme aspect ratios (1000x50)
- Corrupted data, malformed requests
- Purpose: Test error handling
3. Performance Benchmarks (performance_test_*.parquet)
- Batch sizes: 1, 5, 10, 25, 50, 100 images
- Latency and throughput tracking
- Purpose: Performance profiling
4. Model Comparisons (model_comparison_*.parquet)
- Same inputs across different architectures
- Models: ResNet-18/50, ViT, ConvNext, Swin
- Purpose: Cross-model benchmarking
Test Output
DATASET TESTING SUMMARY
============================================================
Datasets tested: 100
Successful datasets: 95
Failed datasets: 5
Total samples: 1,247
Overall success rate: 87.3%
Test duration: 45.2s
Performance:
Avg latency: 123.4ms
Median latency: 98.7ms
p95 latency: 342.1ms
Max latency: 2,341.0ms
Requests/sec: 27.6
Category breakdown:
standard: 25 datasets, 94.2% avg success
edge_case: 25 datasets, 76.8% avg success
performance: 25 datasets, 91.1% avg success
model_comparison: 25 datasets, 89.3% avg success
Common Issues
Port 8000 already in use:
# Find what's using it
lsof -i :8000
# Or just use a different port
uvicorn main:app --port 8080
Model not loading:
- Check the path: models should be in models/<org>/<model-name>/
- If you're trying to run the example ResNet-based model, make sure you ran make download to fetch the model weights.
- Check logs for the exact error
Slow inference:
- Inference runs on CPU by default
- For GPU: install CUDA PyTorch and modify service to use GPU device
- Consider using smaller models or quantization (a minimal sketch follows below)
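A quantization sketch, assuming PyTorch's dynamic quantization (gains vary by architecture; Linear-heavy models benefit most):
# Sketch: int8 dynamic quantization for CPU inference (assumes PyTorch).
import torch

def quantize_for_cpu(model: torch.nn.Module) -> torch.nn.Module:
    """Replace Linear layers with dynamically quantized int8 versions."""
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )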
License
Apache 2.0