---
title: OpenELM OpenAI API
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
# OpenELM OpenAI & Anthropic API Compatible Wrapper (Lazy Loading)
A FastAPI-based service with **lazy model loading** that provides both OpenAI- and Anthropic-compatible APIs for Apple's OpenELM models. The model loads only on the first API request, avoiding startup timeouts in resource-constrained environments.
## Key Feature: Lazy Loading
Unlike traditional deployments that load the model at startup (causing timeouts), this implementation uses **lazy loading**:
1. **Fast Startup**: API is available immediately (model not loaded yet)
2. **On-Demand Loading**: Model loads when first request arrives
3. **Better Reliability**: Avoids startup timeouts and memory issues
4. **Resource Efficient**: Only uses resources when needed
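
The pattern behind this lives in `app.py`; the sketch below is a generic, illustrative version (the `LazyModel` class and loader are hypothetical, not the actual implementation). The key detail is double-checked locking, so concurrent first requests trigger only one load:

```python
import threading

class LazyModel:
    """Load an expensive resource on first use instead of at startup."""

    def __init__(self, loader):
        self._loader = loader          # function that builds the model
        self._model = None
        self._lock = threading.Lock()  # guards against concurrent first requests

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self):
        # Double-checked locking: cheap fast path once the model exists.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
        return self._model

# The loader runs only when the first request actually needs the model.
model = LazyModel(lambda: "expensive-model-object")
assert not model.loaded       # nothing loaded at "startup"
model.get()                   # first request pays the loading cost
assert model.loaded
```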
## How It Works
When you make your first API request:
- The API temporarily returns a 503 status while loading is in progress
- The model downloads and loads in the background
- Subsequent requests work normally
- Progress is logged to the console
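
A client can absorb the temporary 503 with a simple retry loop. This is a generic sketch, not part of the project's code; `call` stands in for any HTTP request that returns a status code and body:

```python
import time

def retry_on_503(call, retries=10, delay=3.0):
    """Retry `call()` while the server answers 503 (model still loading)."""
    for _ in range(retries):
        status, body = call()      # call() returns (status_code, body)
        if status != 503:
            return body
        time.sleep(delay)          # give the model time to finish loading
    raise TimeoutError("model did not finish loading in time")

# Example with a fake endpoint that is 'loading' for two attempts:
responses = iter([(503, None), (503, None), (200, "ok")])
print(retry_on_503(lambda: next(responses), delay=0.0))  # → ok
```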
## Quick Start
### Build and Run
```bash
# Build and run with Docker
docker build -t openelm-api .
docker run -p 8000:8000 openelm-api
```
### Make Your First Request
```bash
# This will trigger model loading
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openelm-450m-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
The first request will take longer while the model loads. Check the logs for progress.
## API Reference
### OpenAI API Endpoint
**POST** `/v1/chat/completions`
Request:
```json
{
  "model": "openelm-450m-instruct",
  "messages": [
    {"role": "user", "content": "Your prompt here"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024
}
```
Response:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "openelm-450m-instruct",
  "choices": [{
    "message": {"role": "assistant", "content": "Generated response"},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 25,
    "total_tokens": 35
  }
}
```
### Anthropic API Endpoint
**POST** `/v1/messages`
Request:
```json
{
  "model": "openelm-450m-instruct",
  "messages": [{"role": "user", "content": "Your prompt here"}],
  "max_tokens": 1024
}
```
## Health Check
```bash
curl http://localhost:8000/health
```
Response while the model is still loading:
```json
{
  "status": "initializing",
  "model_loaded": false
}
```
Once the model has loaded, `status` becomes `"healthy"` and `model_loaded` becomes `true`.
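
Before sending real traffic, a client can poll `/health` until it reports ready. The `model_ready` helper below is a hypothetical sketch that interprets the payload shown above; wire it to your HTTP client of choice:

```python
import json

def model_ready(health_body: str) -> bool:
    """True once the /health payload reports the model as loaded."""
    health = json.loads(health_body)
    return health.get("status") == "healthy" and health.get("model_loaded") is True

# Interpreting the two states documented above:
assert model_ready('{"status": "initializing", "model_loaded": false}') is False
assert model_ready('{"status": "healthy", "model_loaded": true}') is True
```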
## Using with OpenAI SDK
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # any non-empty string; the server does not check it
)

# First request triggers model loading
response = client.chat.completions.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```
## Using with Anthropic SDK
```python
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:8000",  # no /v1 suffix: the SDK appends /v1/messages itself
    api_key="dummy",
)

# First request triggers model loading
message = client.messages.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(message.content[0].text)
```
## Expected Behavior
### First Request (Model Loading)
```
$ curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"openelm-450m-instruct","messages":[{"role":"user","content":"Hello"}]}'

# Response (after the model finishes loading):
{"id":"chatcmpl-...", ...}
```
### Console Output During Loading
```
Initializing OpenELM model (this may take a moment)...
Loading tokenizer...
Loading model...
Model apple/OpenELM-450M-Instruct loaded successfully!
```
## Model Information
- **Default Model**: apple/OpenELM-450M-Instruct
- **Parameters**: 450M
- **Context Window**: 2048 tokens
- **Weight Format**: Safetensors
- **Lazy Loading**: Model loads on first request
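
Because prompt and completion share the 2048-token context window, clients should cap `max_tokens` by whatever the prompt leaves over. A small illustrative helper (the function name and behavior are an assumption about how you might budget tokens, not server logic):

```python
CONTEXT_WINDOW = 2048  # tokens, per the model information above

def clamp_max_tokens(prompt_tokens: int, requested_max: int) -> int:
    """Cap the completion budget so prompt + completion fit the window."""
    available = max(CONTEXT_WINDOW - prompt_tokens, 0)
    return min(requested_max, available)

print(clamp_max_tokens(10, 1024))    # plenty of room → 1024
print(clamp_max_tokens(2000, 1024))  # only 48 tokens left → 48
```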
## Troubleshooting
### First Request Takes Too Long
- Normal behavior: Model downloads (~1GB) and loads (~2GB RAM)
- Subsequent requests are much faster (cached model)
### Model Loading Fails
- Check internet connection (needed for HuggingFace download)
- Ensure sufficient RAM (at least 4GB recommended)
- Check console logs for specific error messages
### API Returns 503
- Model is still loading, retry after a few seconds
- Check `/health` endpoint for loading status
## Architecture
- **Framework**: FastAPI with async support
- **Lazy Loading**: Model loads on first request
- **ML Backend**: PyTorch + HuggingFace Transformers
- **Streaming**: Server-Sent Events (SSE) support
- **Dual Compatibility**: OpenAI and Anthropic API formats
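
The SSE stream can be consumed via the OpenAI SDK's `stream=True`, or parsed by hand. The sketch below assumes the server emits OpenAI-style `data:` lines terminated by a `[DONE]` sentinel (the standard OpenAI streaming convention; verify against your server's output):

```python
import json

def parse_sse(lines):
    """Yield JSON chunk payloads from an OpenAI-style SSE stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments, blank keep-alive lines, etc.
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break     # end-of-stream sentinel
        yield json.loads(data)

# Reassembling a completion from sample chunks:
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse(sample))
print(text)  # → Hello
```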
## Files
- `app.py` - Main API application with lazy loading
- `openelm_tokenizer.py` - Tokenizer utilities
- `examples/` - Usage examples
- `requirements.txt` - Dependencies
## License
This project is provided for educational and research purposes. The OpenELM models from Apple are released under their respective licenses.