---
title: OpenELM OpenAI API
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# OpenELM OpenAI & Anthropic API Compatible Wrapper (Lazy Loading)

A FastAPI-based service with lazy model loading that provides both OpenAI- and Anthropic-compatible APIs for Apple's OpenELM models. The model loads only on the first API request, which avoids startup timeouts in resource-constrained environments.

## Key Feature: Lazy Loading

Unlike traditional deployments that load the model at startup (causing timeouts), this implementation uses lazy loading:

1. **Fast startup**: the API is available immediately (the model is not loaded yet)
2. **On-demand loading**: the model loads when the first request arrives
3. **Better reliability**: avoids startup timeouts and memory issues at boot
4. **Resource efficient**: resources are consumed only when needed

## How It Works

When you make your first API request:

- The API temporarily returns a 503 status
- The model downloads and loads in the background
- Subsequent requests work normally
- Progress is logged to the console

## Quick Start

### Build and Run

```bash
# Build and run with Docker
docker build -t openelm-api .
docker run -p 8000:8000 openelm-api
```

### Make Your First Request

```bash
# This will trigger model loading
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openelm-450m-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

The first request will take longer while the model loads. Check the logs for progress.

## API Reference

### OpenAI API Endpoint

`POST /v1/chat/completions`

Request:

```json
{
  "model": "openelm-450m-instruct",
  "messages": [
    {"role": "user", "content": "Your prompt here"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024
}
```

Response:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "openelm-450m-instruct",
  "choices": [{
    "message": {"role": "assistant", "content": "Generated response"},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 25,
    "total_tokens": 35
  }
}
```

### Anthropic API Endpoint

`POST /v1/messages`

Request:

```json
{
  "model": "openelm-450m-instruct",
  "messages": [{"role": "user", "content": "Your prompt here"}],
  "max_tokens": 1024
}
```

### Health Check

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "initializing",
  "model_loaded": false
}
```

Once the model has loaded, `status` becomes `"healthy"` and `model_loaded` becomes `true`.
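For scripting, it can be convenient to block until the model is ready. Here is a small polling sketch using only the standard library — the URL and timing values are arbitrary examples, and the response shape is the one shown above:

```python
# Poll /health until the model reports as loaded (helper names are illustrative).
import json
import time
import urllib.request

def is_ready(payload: dict) -> bool:
    """True once /health reports the model as loaded."""
    return payload.get("status") == "healthy" and payload.get("model_loaded") is True

def wait_for_model(base_url: str = "http://localhost:8000", timeout: float = 300.0) -> bool:
    """Poll /health every few seconds until the model is loaded or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health") as resp:
                if is_ready(json.load(resp)):
                    return True
        except OSError:
            pass  # server not up yet; keep polling
        time.sleep(5)
    return False
```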

## Using with OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

# First request triggers model loading
response = client.chat.completions.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)

print(response.choices[0].message.content)
```

## Using with Anthropic SDK

```python
import anthropic

# Note: the Anthropic SDK appends /v1/messages itself, so point base_url
# at the server root rather than at /v1
client = anthropic.Anthropic(
    base_url="http://localhost:8000",
    api_key="dummy"
)

# First request triggers model loading
message = client.messages.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)

print(message.content[0].text)
```

## Expected Behavior

### First Request (Model Loading)

```bash
$ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'

# Response (after model loads):
{"id":"chatcmpl-...", ...}
```

### Console Output During Loading

```text
Initializing OpenELM model (this may take a moment)...
  Loading tokenizer...
  Loading model...
  Model apple/OpenELM-450M-Instruct loaded successfully!
```

## Model Information

- **Default Model**: `apple/OpenELM-450M-Instruct`
- **Parameters**: 450M
- **Context Window**: 2048 tokens
- **Weight Format**: Safetensors
- **Lazy Loading**: model loads on first request

## Troubleshooting

### First Request Takes Too Long

- Normal behavior: the model is downloaded (~1 GB) and loaded into memory (~2 GB RAM)
- Subsequent requests are much faster because the model stays cached in memory

### Model Loading Fails

- Check internet connection (needed for the HuggingFace download)
- Ensure sufficient RAM (at least 4 GB recommended)
- Check console logs for specific error messages

### API Returns 503

- The model is still loading; retry after a few seconds
- Check the `/health` endpoint for loading status
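The retry advice above can be automated. Here is a hedged sketch of a retry-with-backoff client, assuming the request and response shapes shown earlier — the helper names and timing values are illustrative, not part of the service:

```python
# Retry a POST on 503 while the model is still loading (illustrative helper).
import json
import time
import urllib.error
import urllib.request

def backoff_delays(retries: int = 5, base: float = 2.0, cap: float = 30.0):
    """Exponential backoff delays in seconds: 2, 4, 8, ..., capped at `cap`."""
    return [min(base * (2 ** i), cap) for i in range(retries)]

def post_with_retry(url: str, body: dict) -> dict:
    """POST JSON, retrying on 503 responses until the model finishes loading."""
    data = json.dumps(body).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    for delay in backoff_delays():
        try:
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if err.code != 503:
                raise          # a real error, not the loading case
            time.sleep(delay)  # model still loading; wait and retry
    raise TimeoutError("model did not finish loading in time")
```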

## Architecture

- **Framework**: FastAPI with async support
- **Lazy Loading**: model loads on first request
- **ML Backend**: PyTorch + HuggingFace Transformers
- **Streaming**: Server-Sent Events (SSE) support
- **Dual Compatibility**: OpenAI and Anthropic API formats
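Streaming responses arrive as SSE `data:` lines. Assuming the wrapper follows the OpenAI streaming convention (`"stream": true` in the request, chunks carrying `delta` objects, and a `data: [DONE]` terminator — an assumption, not verified against `app.py`), a tiny parser for the content deltas might look like:

```python
# Parse one SSE line from an OpenAI-style streaming response (assumed format).
import json

def parse_sse_line(line: str):
    """Extract the content delta from one 'data: {...}' SSE line, if any."""
    if not line.startswith("data: "):
        return None               # comments/keep-alives/blank lines
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None               # OpenAI-style stream terminator
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")
```

With `curl -N` against the chat completions endpoint you can inspect the raw `data:` lines and confirm the actual chunk shape.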

## Files

- `app.py` - main API application with lazy loading
- `openelm_tokenizer.py` - tokenizer utilities
- `examples/` - usage examples
- `requirements.txt` - dependencies

## License

This project is provided for educational and research purposes. The OpenELM models from Apple are released under their respective licenses.