---
title: OpenELM OpenAI API
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
# OpenELM OpenAI & Anthropic API Compatible Wrapper (Lazy Loading)
A FastAPI-based service with lazy model loading that provides OpenAI- and Anthropic-compatible APIs for Apple's OpenELM models. The model loads only on the first API request, avoiding startup timeouts in resource-constrained environments.
## Key Feature: Lazy Loading
Unlike traditional deployments that load the model at startup (causing timeouts), this implementation uses lazy loading:
- **Fast Startup**: the API is available immediately (model not loaded yet)
- **On-Demand Loading**: the model loads when the first request arrives
- **Better Reliability**: avoids startup timeouts and memory issues
- **Resource Efficient**: resources are used only when needed
## How It Works
When you make your first API request:
1. The API temporarily returns a 503 status
2. The model downloads and loads in the background
3. Subsequent requests work normally
4. Progress is logged to the console
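The steps above suggest a simple client-side pattern: retry on 503 with backoff until the model is ready. The sketch below is illustrative, not part of the service; `post_with_retry`, its `send` callable, and the backoff schedule are all assumptions.

```python
import time

def post_with_retry(send, payload, retries=5, base_delay=2.0, sleep=time.sleep):
    """Call send(payload) and retry on HTTP 503 (model still loading),
    backing off exponentially between attempts.

    `send` is any callable returning (status_code, body); in practice it
    could wrap requests.post against /v1/chat/completions.
    Assumes retries >= 1.
    """
    for attempt in range(retries):
        status, body = send(payload)
        if status != 503:
            return status, body
        sleep(base_delay * (2 ** attempt))  # wait 2s, 4s, 8s, ...
    return status, body  # still 503 after all retries
```

Injecting `send` and `sleep` keeps the helper easy to test and decoupled from any particular HTTP library.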
## Quick Start

### Build and Run

```bash
# Build and run with Docker
docker build -t openelm-api .
docker run -p 8000:8000 openelm-api
```
### Make Your First Request

```bash
# This will trigger model loading
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openelm-450m-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
The first request will take longer while the model loads. Check the logs for progress.
## API Reference

### OpenAI API Endpoint

`POST /v1/chat/completions`

Request:

```json
{
  "model": "openelm-450m-instruct",
  "messages": [
    {"role": "user", "content": "Your prompt here"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024
}
```
Response:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "openelm-450m-instruct",
  "choices": [{
    "message": {"role": "assistant", "content": "Generated response"},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 25,
    "total_tokens": 35
  }
}
```
### Anthropic API Endpoint

`POST /v1/messages`

Request:

```json
{
  "model": "openelm-450m-instruct",
  "messages": [{"role": "user", "content": "Your prompt here"}],
  "max_tokens": 1024
}
```
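Since the service exposes both API formats over one model, the wrapper has to translate between them internally. The function below is one plausible mapping, shown for illustration; the name `anthropic_to_openai` and the defaults are assumptions, and the actual translation lives in `app.py` and may differ.

```python
def anthropic_to_openai(req):
    """Sketch: translate an Anthropic /v1/messages request body into the
    OpenAI /v1/chat/completions shape. Illustrative only.
    """
    messages = list(req["messages"])
    # Anthropic carries the system prompt as a top-level "system" field;
    # OpenAI expects it as the first message in the list.
    if "system" in req:
        messages.insert(0, {"role": "system", "content": req["system"]})
    return {
        "model": req["model"],
        "messages": messages,
        "max_tokens": req["max_tokens"],          # required by Anthropic
        "temperature": req.get("temperature", 0.7),  # assumed default
    }
```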
### Health Check

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "initializing",  // "healthy" once the model has loaded
  "model_loaded": false
}
```
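A deployment script can poll this endpoint until the model is ready before sending traffic. This is a minimal sketch; `wait_until_loaded` and its parameters are illustrative names, not part of the service.

```python
import time

def wait_until_loaded(fetch_health, timeout=300.0, interval=5.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll the health endpoint until it reports model_loaded=True,
    or give up after `timeout` seconds.

    `fetch_health` is any callable returning the parsed JSON body as a
    dict, e.g.:
        lambda: requests.get("http://localhost:8000/health").json()
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if fetch_health().get("model_loaded"):
            return True
        sleep(interval)  # model still loading; poll again shortly
    return False
```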
## Using with OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

# First request triggers model loading
response = client.chat.completions.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(response.choices[0].message.content)
```
## Using with Anthropic SDK

```python
import anthropic

# Note: the Anthropic SDK appends /v1/messages to base_url itself,
# so the base URL must not include the /v1 prefix.
client = anthropic.Anthropic(
    base_url="http://localhost:8000",
    api_key="dummy"
)

# First request triggers model loading
message = client.messages.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(message.content[0].text)
```
## Expected Behavior

### First Request (Model Loading)

```bash
$ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'

# Response (after model loads):
{"id":"chatcmpl-...", ...}
```
### Console Output During Loading

```
Initializing OpenELM model (this may take a moment)...
Loading tokenizer...
Loading model...
Model apple/OpenELM-450M-Instruct loaded successfully!
```
## Model Information
- Default Model: apple/OpenELM-450M-Instruct
- Parameters: 450M
- Context Window: 2048 tokens
- Weight Format: Safetensors
- Lazy Loading: Model loads on first request
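Because the context window is 2048 tokens, a client should cap `max_tokens` so prompt plus completion fit the window. The helper below is an illustrative client-side check, not part of the service; real token counts come from the model's tokenizer, not from this rough arithmetic.

```python
def clamp_max_tokens(prompt_tokens, requested, context_window=2048):
    """Cap a requested max_tokens so prompt + completion fit the model's
    context window. Returns 0 when the prompt alone fills the context.
    """
    return max(0, min(requested, context_window - prompt_tokens))
```

For example, with a 2000-token prompt only 48 completion tokens remain in a 2048-token window.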
## Troubleshooting

### First Request Takes Too Long

- Normal behavior: the model downloads (~1GB) and loads into memory (~2GB RAM)
- Subsequent requests are much faster (the model stays cached)

### Model Loading Fails

- Check internet connection (needed for the HuggingFace download)
- Ensure sufficient RAM (at least 4GB recommended)
- Check console logs for specific error messages

### API Returns 503

- The model is still loading; retry after a few seconds
- Check the `/health` endpoint for loading status
## Architecture
- Framework: FastAPI with async support
- Lazy Loading: Model loads on first request
- ML Backend: PyTorch + HuggingFace Transformers
- Streaming: Server-Sent Events (SSE) support
- Dual Compatibility: OpenAI and Anthropic API formats
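For the streaming path, a client consumes Server-Sent Events line by line. Assuming the service emits OpenAI-style `data: {chunk}` events terminated by `data: [DONE]` (the standard format, which should be confirmed against `app.py`), the streamed text can be assembled like this:

```python
import json

def collect_stream(lines):
    """Assemble assistant text from OpenAI-style SSE lines.

    Each event looks like 'data: {json chunk}' and the stream ends with
    the 'data: [DONE]' sentinel; content arrives in choices[0].delta.
    """
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(data)["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)
```

In practice `lines` would come from iterating over the HTTP response body, e.g. `requests.post(..., stream=True).iter_lines(decode_unicode=True)`.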
## Files

- `app.py` - Main API application with lazy loading
- `openelm_tokenizer.py` - Tokenizer utilities
- `examples/` - Usage examples
- `requirements.txt` - Dependencies
## License
This project is provided for educational and research purposes. The OpenELM models from Apple are released under their respective licenses.