---
title: OpenELM OpenAI API
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# OpenELM OpenAI & Anthropic API Compatible Wrapper (Lazy Loading)

A FastAPI-based service with **lazy model loading** that provides both OpenAI- and Anthropic-compatible APIs for Apple's OpenELM models. The model loads only on the first API request, avoiding startup timeouts in resource-constrained environments.

## Key Feature: Lazy Loading

Unlike traditional deployments that load the model at startup (and risk timing out), this implementation uses **lazy loading**:

1. **Fast Startup**: The API is available immediately (model not loaded yet)
2. **On-Demand Loading**: The model loads when the first request arrives
3. **Better Reliability**: Avoids startup timeouts and memory issues
4. **Resource Efficient**: Resources are used only when needed

## How It Works

When you make your first API request:

- The API temporarily returns a 503 status
- The model downloads and loads in the background
- Subsequent requests work normally
- Progress is logged to the console

## Quick Start

### Build and Run

```bash
# Build and run with Docker
docker build -t openelm-api .
docker run -p 8000:8000 openelm-api
```

### Make Your First Request

```bash
# This will trigger model loading
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openelm-450m-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

The first request will take longer while the model loads. Check the logs for progress.
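Because the first request may hit the temporary 503 while the model loads, a client can simply retry until the service is ready. Below is a minimal sketch of such a retry helper; the `request_with_retry` function, its `send` callable, and the retry parameters are illustrative, not part of this API:

```python
import time

def request_with_retry(send, max_retries=10, delay=2.0):
    """Call `send` until it stops returning HTTP 503 (model still loading).

    `send` is any zero-argument callable that performs the request and
    returns a (status_code, body) pair.
    """
    status, body = send()
    for _ in range(max_retries):
        if status != 503:
            break              # model is loaded (or a different error occurred)
        time.sleep(delay)      # give the model time to finish loading
        status, body = send()  # retry the same request
    return status, body
```

In practice `send` would wrap a POST to `/v1/chat/completions` (e.g. via `requests.post`); the helper just encodes the "503 means retry" convention described above.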
## API Reference

### OpenAI API Endpoint

**POST** `/v1/chat/completions`

Request:

```json
{
  "model": "openelm-450m-instruct",
  "messages": [
    {"role": "user", "content": "Your prompt here"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024
}
```

Response:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "openelm-450m-instruct",
  "choices": [{
    "message": {"role": "assistant", "content": "Generated response"},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 25,
    "total_tokens": 35
  }
}
```

### Anthropic API Endpoint

**POST** `/v1/messages`

Request:

```json
{
  "model": "openelm-450m-instruct",
  "messages": [{"role": "user", "content": "Your prompt here"}],
  "max_tokens": 1024
}
```

## Health Check

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "initializing",
  "model_loaded": false
}
```

(`status` becomes `"healthy"` once the model has loaded.)

## Using with OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

# First request triggers model loading
response = client.chat.completions.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(response.choices[0].message.content)
```

## Using with Anthropic SDK

```python
import anthropic

# Note: the Anthropic SDK appends /v1/messages itself,
# so base_url should not include the /v1 prefix
client = anthropic.Anthropic(
    base_url="http://localhost:8000",
    api_key="dummy"
)

# First request triggers model loading
message = client.messages.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(message.content[0].text)
```

## Expected Behavior

### First Request (Model Loading)

```
$ curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Hello"}]}'

# Response (after model loads):
{"id":"chatcmpl-...", ...}
```

### Console Output During Loading

```
Initializing OpenELM model (this may take a moment)...
Loading tokenizer...
Loading model...
Model apple/OpenELM-450M-Instruct loaded successfully!
```

## Model Information

- **Default Model**: apple/OpenELM-450M-Instruct
- **Parameters**: 450M
- **Context Window**: 2048 tokens
- **Weight Format**: Safetensors
- **Lazy Loading**: Model loads on first request

## Troubleshooting

### First Request Takes Too Long

- Normal behavior: the model downloads (~1GB) and loads (~2GB RAM)
- Subsequent requests are much faster (cached model)

### Model Loading Fails

- Check your internet connection (needed for the Hugging Face download)
- Ensure sufficient RAM (at least 4GB recommended)
- Check the console logs for specific error messages

### API Returns 503

- The model is still loading; retry after a few seconds
- Check the `/health` endpoint for loading status

## Architecture

- **Framework**: FastAPI with async support
- **Lazy Loading**: Model loads on first request
- **ML Backend**: PyTorch + HuggingFace Transformers
- **Streaming**: Server-Sent Events (SSE) support
- **Dual Compatibility**: OpenAI and Anthropic API formats

## Files

- `app.py` - Main API application with lazy loading
- `openelm_tokenizer.py` - Tokenizer utilities
- `examples/` - Usage examples
- `requirements.txt` - Dependencies

## License

This project is provided for educational and research purposes. The OpenELM models from Apple are released under their respective licenses.
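## Appendix: Lazy-Loading Pattern

The lazy-loading behavior this service relies on can be sketched as a lock-guarded, load-once accessor. The snippet below is a simplified illustration, not the actual code in `app.py`; the `get_model` and `loader` names are hypothetical:

```python
import threading

_model = None
_lock = threading.Lock()

def get_model(loader):
    """Return the model, calling `loader` exactly once on first use."""
    global _model
    if _model is None:             # fast path: already loaded, skip the lock
        with _lock:                # serialize concurrent first requests
            if _model is None:     # re-check after acquiring the lock
                _model = loader()  # expensive: download + load weights
    return _model
```

The double check around the lock means that while the model is loading, concurrent requests wait (or, in the real service, receive a 503) rather than triggering a second load, and every request after the first takes the cheap fast path.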