ML Inference Service

FastAPI service for serving ML models over HTTP. Comes with ResNet-18 for image classification out of the box, but you can swap in any model you want.

Quick Start

Install uv: https://docs.astral.sh/uv/getting-started/installation/

Local development:

# Install dependencies
make setup
source venv/bin/activate

# Download the example model
make download

# Run it
make serve

In a second terminal:

# Process an example input
./prompt.sh cat.json

Server runs on http://127.0.0.1:8000. Check /docs for the interactive API documentation.

Docker:

# Build
make docker-build

# Run
make docker-run

Testing the API

# Using curl
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "image": {
      "mediaType": "image/jpeg",
      "data": "<base64-encoded-image>"
    }
  }'
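
If you'd rather build the request body in Python, here is a minimal sketch using only the standard library (the input image name and output path are placeholders):

# build_request.py -- build a /predict request body from a local image
import base64
import json

with open("cat.jpg", "rb") as f:                 # any local JPEG or PNG
    encoded = base64.b64encode(f.read()).decode("ascii")

payload = {"image": {"mediaType": "image/jpeg", "data": encoded}}

with open("request.json", "w") as f:
    json.dump(payload, f)

# Then send it:
#   curl -X POST http://localhost:8000/predict \
#     -H "Content-Type: application/json" -d @request.json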

Example response:

{
  "logprobs": [-0.859380304813385,-1.2701971530914307,-2.1918208599090576,-1.69235098361969],
  "localizationMask": {
    "mediaType":"image/png",
    "data":"iVBORw0KGgoAAAANSUhEUgAAA8AAAAKDAQAAAAD9Fl5AAAAAu0lEQVR4nO3NsREAMAgDMWD/nZMVKEwn1T5/FQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMCl3g5f+HC24TRhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAj70gwKsTlmdBwAAAABJRU5ErkJggg=="
  }
}

Project Structure

example-submission/
├── main.py                      # Entry point
├── app/
│   ├── core/
│   │   ├── app.py               # <= INSTANTIATE YOUR DETECTOR HERE
│   │   └── logging.py           # Logging setup
│   ├── api/
│   │   ├── models.py            # Request/response schemas
│   │   ├── controllers.py       # Business logic
│   │   └── routes/
│   │       └── prediction.py    # POST /predict
│   └── services/
│       ├── base.py              # <= YOUR DETECTOR IMPLEMENTS THIS INTERFACE
│       └── inference.py         # Example service based on ResNet-18
├── models/
│   └── microsoft/
│       └── resnet-18/           # Model weights and config
├── scripts/
│   ├── model_download.bash
│   ├── generate_test_datasets.py
│   └── test_datasets.py
├── Dockerfile
├── .env.example                 # Environment config template
├── cat.json                     # An example /predict request object
├── makefile
├── prompt.sh                    # Script that makes a /predict request
├── requirements.in
├── requirements.txt
└── response.json                # An example /predict response object

How to Plug In Your Own Model

To integrate your model, implement the InferenceService abstract class defined in app/services/base.py. You can follow the example implementation in app/services/inference.py, which is based on ResNet-18. After implementing the required interface, instantiate your model in the lifespan() function in app/core/app.py, replacing the ResNetInferenceService instance.
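
The authoritative interface definition lives in app/services/base.py; the sketch below is only an approximation inferred from the Step 1 template (generic parameters and method names mirror it, but check the real file):

# app/services/base.py -- approximate shape, see the actual file
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

RequestT = TypeVar("RequestT")
ResponseT = TypeVar("ResponseT")

class InferenceService(ABC, Generic[RequestT, ResponseT]):
    @abstractmethod
    def load_model(self) -> None:
        """Load model weights. Called once at startup."""

    @abstractmethod
    def predict(self, request: RequestT) -> ResponseT:
        """Run inference on a single request."""

    @property
    @abstractmethod
    def is_loaded(self) -> bool:
        """Whether the model is ready to serve."""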

Step 1: Create Your Service Class

# app/services/your_model_service.py
from app.services.base import InferenceService
from app.api.models import ImageRequest, PredictionResponse

class YourModelService(InferenceService[ImageRequest, PredictionResponse]):
    def __init__(self, model_name: str):
        self.model_name = model_name
        self.model_path = f"models/{model_name}"
        self.model = None
        self._is_loaded = False

    def load_model(self) -> None:
        """Load your model here. Called once at startup."""
        self.model = load_your_model(self.model_path)
        self._is_loaded = True

    def predict(self, request: ImageRequest) -> PredictionResponse:
        """Actual inference happens here."""
        image = decode_base64_image(request.image.data)
        result = self.model(image)

        logprobs = ...
        mask = ...

        return PredictionResponse(
            logprobs=logprobs,
            localizationMask=mask,
        )

    @property
    def is_loaded(self) -> bool:
        return self._is_loaded

Step 2: Register Your Service

Open app/core/app.py and find the lifespan function:

# Change this line:
service = ResNetInferenceService(model_name="microsoft/resnet-18")

# To this:
service = YourModelService(...)

That's it. The /predict endpoint now serves your model.
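
For context, the registration happens inside a FastAPI lifespan context manager; a rough sketch is below (the exact wiring in app/core/app.py may differ, and the constructor argument is a placeholder):

# app/core/app.py -- rough sketch, the real file may differ
from contextlib import asynccontextmanager
from fastapi import FastAPI

from app.services.your_model_service import YourModelService

@asynccontextmanager
async def lifespan(app: FastAPI):
    service = YourModelService(model_name="your-org/your-model")  # placeholder name
    service.load_model()                      # load weights once at startup
    app.state.inference_service = service     # expose it however the controllers expect
    yield                                     # application serves requests here
    # optionally release resources on shutdown

app = FastAPI(lifespan=lifespan)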

Model Files

Put your model files under the models/ directory:

models/
└── your-org/
    └── your-model/
        ├── config.json
        ├── weights.bin
        └── (other files)

Configuration

Settings are managed via environment variables or a .env file. See .env.example for all available options.

Default values:

  • APP_NAME: "ML Inference Service"
  • APP_VERSION: "0.1.0"
  • DEBUG: false
  • HOST: "0.0.0.0"
  • PORT: 8000
  • MODEL_NAME: "microsoft/resnet-18"
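
These defaults are typically declared in a settings class. The sketch below shows how such a class might look with pydantic-settings; the field names mirror the variables above, but the library choice and module layout are assumptions, not the project's actual config code:

# sketch of a settings class backing these variables (actual file may differ)
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # protected_namespaces=() silences pydantic's warning about the model_ prefix
    model_config = SettingsConfigDict(env_file=".env", protected_namespaces=())

    app_name: str = "ML Inference Service"
    app_version: str = "0.1.0"
    debug: bool = False
    host: str = "0.0.0.0"
    port: int = 8000
    model_name: str = "microsoft/resnet-18"

settings = Settings()   # environment variables override the defaults above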

To customize:

# Copy the example
cp .env.example .env

# Edit values
vim .env

Or set environment variables directly:

export MODEL_NAME="google/vit-base-patch16-224"
uvicorn main:app --reload

Deployment

Development:

uvicorn main:app --reload

Production:

gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000

The service runs on CPU by default. For GPU inference, install CUDA-enabled PyTorch and modify your service to move tensors to the GPU device.
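
A minimal sketch of what that change could look like in a torch-based service (method names follow the Step 1 template; the loader and preprocessing helpers are placeholders):

# methods inside your InferenceService subclass -- torch-based sketch
import torch

def load_model(self) -> None:
    self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    self.model = load_your_model(self.model_path)   # placeholder loader from Step 1
    self.model.to(self.device)                      # move weights to the GPU if present
    self.model.eval()
    self._is_loaded = True

def predict(self, request):
    inputs = preprocess(request)                    # placeholder preprocessing
    inputs = inputs.to(self.device)                 # keep tensors on the same device as the model
    with torch.no_grad():
        outputs = self.model(inputs)
    ...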

Docker:

  • Multi-stage build keeps the image small
  • Runs as non-root user (appuser)
  • Python dependencies installed in user site-packages
  • Model files baked into the image

What Happens When You Start the Server

INFO: Starting ML Inference Service...
INFO: Initializing ResNet service: models/microsoft/resnet-18
INFO: Loading model from models/microsoft/resnet-18
INFO: Model loaded: 1000 classes
INFO: Startup completed successfully
INFO: Uvicorn running on http://0.0.0.0:8000

If you see "Model directory not found", check that your model files exist at the expected path with the full org/model structure.

API Reference

Endpoint: POST /predict

Request:

{
  "image": {
    "mediaType": "image/jpeg",  // or "image/png"
    "data": "<base64 string>"
  }
}

Response:

{
  "logprobs": [float],         // Log-probabilities of each label
  "localizationMask": {        // [Optional] binary mask
    "mediaType": "image/png",  // Always png
    "data": "<base64 string>"  // Image data
  }
}
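
A small client-side sketch for consuming the response, using only the standard library (the file names mirror the repo's response.json example and are otherwise placeholders):

# parse a /predict response: probabilities from logprobs, mask to a PNG file
import base64
import json
import math

with open("response.json") as f:
    resp = json.load(f)

probs = [math.exp(lp) for lp in resp["logprobs"]]   # logprob -> probability
print("highest probability:", max(probs))

mask = resp.get("localizationMask")
if mask is not None:                                # the mask is optional
    with open("mask.png", "wb") as out:
        out.write(base64.b64decode(mask["data"]))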

Docs:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc
  • OpenAPI JSON: http://localhost:8000/openapi.json

PyArrow Test Datasets

We've included a test dataset system for validating your model. It generates 100 standardized test datasets covering normal inputs, edge cases, performance benchmarks, and model comparisons.

Generate Datasets

python scripts/generate_test_datasets.py

This creates:

  • scripts/test_datasets/*.parquet - Test data (images, requests, expected responses)
  • scripts/test_datasets/*_metadata.json - Human-readable descriptions
  • scripts/test_datasets/datasets_summary.json - Overview of all datasets
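
To peek at one of the generated datasets, something like this works with pyarrow (the specific file name and columns are assumptions; check the metadata JSON for the real schema):

# inspect a generated parquet dataset
import pyarrow.parquet as pq

table = pq.read_table("scripts/test_datasets/standard_test_001.parquet")  # hypothetical file name
print(table.schema)              # actual column names and types
print(table.num_rows, "rows")

# convert to pandas for ad-hoc exploration
df = table.to_pandas()
print(df.head())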

Run Tests

# Start your service first
make serve

In another terminal:

# Quick test (5 samples per dataset)
python scripts/test_datasets.py --quick

# Full validation
python scripts/test_datasets.py

# Test specific category
python scripts/test_datasets.py --category edge_case

Dataset Categories (25 datasets each)

1. Standard Tests (standard_test_*.parquet)

  • Normal images: random patterns, shapes, gradients
  • Common sizes: 224x224, 256x256, 299x299, 384x384
  • Formats: JPEG, PNG
  • Purpose: Baseline validation

2. Edge Cases (edge_case_*.parquet)

  • Tiny images (32x32, 1x1)
  • Huge images (2048x2048)
  • Extreme aspect ratios (1000x50)
  • Corrupted data, malformed requests
  • Purpose: Test error handling

3. Performance Benchmarks (performance_test_*.parquet)

  • Batch sizes: 1, 5, 10, 25, 50, 100 images
  • Latency and throughput tracking
  • Purpose: Performance profiling

4. Model Comparisons (model_comparison_*.parquet)

  • Same inputs across different architectures
  • Models: ResNet-18/50, ViT, ConvNext, Swin
  • Purpose: Cross-model benchmarking

Test Output

DATASET TESTING SUMMARY
============================================================
Datasets tested: 100
Successful datasets: 95
Failed datasets: 5
Total samples: 1,247
Overall success rate: 87.3%
Test duration: 45.2s

Performance:
  Avg latency: 123.4ms
  Median latency: 98.7ms
  p95 latency: 342.1ms
  Max latency: 2,341.0ms
  Requests/sec: 27.6

Category breakdown:
  standard: 25 datasets, 94.2% avg success
  edge_case: 25 datasets, 76.8% avg success
  performance: 25 datasets, 91.1% avg success
  model_comparison: 25 datasets, 89.3% avg success

Common Issues

Port 8000 already in use:

# Find what's using it
lsof -i :8000

# Or just use a different port
uvicorn main:app --port 8080

Model not loading:

  • Check the path: models should be in models/<org>/<model-name>/
  • If you're trying to run the example ResNet-based model, make sure you ran make download to fetch the model weights.
  • Check logs for the exact error

Slow inference:

  • Inference runs on CPU by default
  • For GPU: install CUDA PyTorch and modify service to use GPU device
  • Consider using smaller models or quantization

License

Apache 2.0
