---
title: OpenELM OpenAI API
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
# OpenELM OpenAI & Anthropic API Compatible Wrapper (Lazy Loading)

A FastAPI-based service with **lazy model loading** that provides both OpenAI- and Anthropic-compatible APIs for Apple's OpenELM models. The model only loads on the first API request, avoiding startup timeouts in resource-constrained environments.

## Key Feature: Lazy Loading

Unlike traditional deployments that load the model at startup (and can time out doing so), this implementation uses **lazy loading**:

1. **Fast Startup**: The API is available immediately (model not loaded yet)
2. **On-Demand Loading**: The model loads when the first request arrives
3. **Better Reliability**: Avoids startup timeouts and memory issues
4. **Resource Efficient**: Resources are only used when needed
## How It Works

When you make your first API request:

- The API temporarily returns a 503 (Service Unavailable) status
- The model downloads and loads in the background
- Subsequent requests work normally
- Progress is logged to the console
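Because the first request can hit that temporary 503, clients may want a small retry-with-backoff wrapper. A minimal sketch, written against any zero-argument callable so it stays transport-agnostic (the `ServiceUnavailable` exception and the stub below are illustrative, not part of the wrapper's code):

```python
import time

class ServiceUnavailable(Exception):
    """Raised when the API responds with HTTP 503 (model still loading)."""

def call_with_retry(send, retries=5, delay=2.0, backoff=1.5):
    """Call send() and retry on 503, sleeping with exponential backoff.

    `send` is any zero-argument callable that raises ServiceUnavailable
    while the model is loading and returns the response once ready.
    """
    for attempt in range(retries):
        try:
            return send()
        except ServiceUnavailable:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error
            time.sleep(delay)
            delay *= backoff

# Demo with a stub that "finishes loading" on the third attempt:
state = {"calls": 0}
def fake_send():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ServiceUnavailable()
    return {"status": "ok"}

result = call_with_retry(fake_send, delay=0.01)  # succeeds on attempt 3
```

In a real client, `send` would wrap the actual HTTP POST and raise on a 503 response.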
## Quick Start

### Build and Run

```bash
# Build and run with Docker
docker build -t openelm-api .
docker run -p 8000:8000 openelm-api
```

### Make Your First Request

```bash
# This will trigger model loading
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openelm-450m-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

The first request takes longer while the model loads; check the logs for progress.
## API Reference

### OpenAI API Endpoint

**POST** `/v1/chat/completions`

Request:

```json
{
  "model": "openelm-450m-instruct",
  "messages": [
    {"role": "user", "content": "Your prompt here"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024
}
```

Response:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "openelm-450m-instruct",
  "choices": [{
    "message": {"role": "assistant", "content": "Generated response"},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 25,
    "total_tokens": 35
  }
}
```
### Anthropic API Endpoint

**POST** `/v1/messages`

Request:

```json
{
  "model": "openelm-450m-instruct",
  "messages": [{"role": "user", "content": "Your prompt here"}],
  "max_tokens": 1024
}
```
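For reference, a response in Anthropic's Messages format would look roughly like this (field values are illustrative; the exact fields returned depend on the wrapper's implementation):

```json
{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "content": [{"type": "text", "text": "Generated response"}],
  "model": "openelm-450m-instruct",
  "stop_reason": "end_turn",
  "usage": {"input_tokens": 10, "output_tokens": 25}
}
```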
## Health Check

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "initializing",
  "model_loaded": false
}
```

Once the model has loaded, `status` becomes `"healthy"` and `model_loaded` becomes `true`.
## Using with OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

# First request triggers model loading
response = client.chat.completions.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(response.choices[0].message.content)
```
## Using with Anthropic SDK

```python
import anthropic

# Note: the Anthropic SDK appends /v1/messages to base_url itself,
# so point it at the server root rather than .../v1
client = anthropic.Anthropic(
    base_url="http://localhost:8000",
    api_key="dummy"
)

# First request triggers model loading
message = client.messages.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(message.content[0].text)
```
## Expected Behavior

### First Request (Model Loading)

```
$ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'

# Response (after model loads):
{"id":"chatcmpl-...", ...}
```

### Console Output During Loading

```
Initializing OpenELM model (this may take a moment)...
Loading tokenizer...
Loading model...
Model apple/OpenELM-450M-Instruct loaded successfully!
```
## Model Information

- **Default Model**: apple/OpenELM-450M-Instruct
- **Parameters**: 450M
- **Context Window**: 2048 tokens
- **Weight Format**: Safetensors
- **Lazy Loading**: Model loads on first request
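With a 2048-token context window, prompt tokens plus `max_tokens` must fit within the window. A small sketch of the arithmetic (the hard-coded window size comes from the model info above; counting the prompt's tokens would use the model's real tokenizer):

```python
CONTEXT_WINDOW = 2048  # tokens, per the model information above

def clamp_max_tokens(prompt_tokens: int, requested_max: int) -> int:
    """Return the largest max_tokens value that keeps
    prompt + completion within the context window."""
    available = CONTEXT_WINDOW - prompt_tokens
    if available <= 0:
        raise ValueError("prompt already fills the context window")
    return min(requested_max, available)

print(clamp_max_tokens(500, 1024))   # 1024: plenty of room
print(clamp_max_tokens(1900, 1024))  # 148: clamped to what remains
```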
## Troubleshooting

### First Request Takes Too Long

- Normal behavior: the model downloads (~1GB) and loads into memory (~2GB RAM)
- Subsequent requests are much faster (the loaded model stays cached)

### Model Loading Fails

- Check the internet connection (needed for the Hugging Face download)
- Ensure sufficient RAM (at least 4GB recommended)
- Check console logs for specific error messages

### API Returns 503

- The model is still loading; retry after a few seconds
- Check the `/health` endpoint for loading status
## Architecture

- **Framework**: FastAPI with async support
- **Lazy Loading**: Model loads on first request
- **ML Backend**: PyTorch + HuggingFace Transformers
- **Streaming**: Server-Sent Events (SSE) support
- **Dual Compatibility**: OpenAI and Anthropic API formats
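Streamed responses arrive as SSE lines of the form `data: <json>`, terminated by `data: [DONE]` in the OpenAI convention. A minimal client-side parser sketch (the chunk shape assumes OpenAI-style `delta` events; the wrapper's exact payloads may differ):

```python
import json

def parse_sse_content(lines):
    """Concatenate assistant text from OpenAI-style SSE lines.

    Each event is a line like 'data: {...}'; the stream ends
    with 'data: [DONE]'. Blank keep-alive lines are skipped.
    """
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Example with two delta events followed by the terminator:
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    'data: [DONE]',
]
print(parse_sse_content(stream))  # Hello!
```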
## Files

- `app.py` - Main API application with lazy loading
- `openelm_tokenizer.py` - Tokenizer utilities
- `examples/` - Usage examples
- `requirements.txt` - Dependencies

## License

This project is provided for educational and research purposes. The OpenELM models from Apple are released under their respective licenses.