---
title: OpenELM OpenAI API
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# OpenELM OpenAI & Anthropic API Compatible Wrapper (Lazy Loading)

A FastAPI-based service with **lazy model loading** that provides both OpenAI- and Anthropic-compatible APIs for Apple's OpenELM models. The model loads only on the first API request, avoiding startup timeouts in resource-constrained environments.

## Key Feature: Lazy Loading

Unlike traditional deployments that load the model at startup (and risk timing out), this implementation uses **lazy loading**:

1. **Fast Startup**: The API is available immediately (model not loaded yet)
2. **On-Demand Loading**: The model loads when the first request arrives
3. **Better Reliability**: Avoids startup timeouts and memory issues
4. **Resource Efficient**: Resources are used only when needed

## How It Works

When you make your first API request:

- The API temporarily returns a 503 status
- The model downloads and loads in the background
- Subsequent requests work normally
- Progress is logged to the console

## Quick Start

### Build and Run

```bash
# Build and run with Docker
docker build -t openelm-api .
docker run -p 8000:8000 openelm-api
```

### Make Your First Request

```bash
# This will trigger model loading
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openelm-450m-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

The first request will take longer while the model loads. Check the logs for progress.
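Because the first request may hit the temporary 503 while the model loads, a client can simply retry until the service is ready. Below is a minimal sketch of such a retry helper; the `request_with_retry` function, its `send` callable, and the retry parameters are illustrative, not part of this API:

```python
import time

def request_with_retry(send, max_retries=10, delay=2.0):
    """Call `send` until it stops returning HTTP 503 (model still loading).

    `send` is any zero-argument callable that performs the request and
    returns a (status_code, body) pair.
    """
    status, body = send()
    for _ in range(max_retries):
        if status != 503:
            break              # model is loaded (or a different error occurred)
        time.sleep(delay)      # give the model time to finish loading
        status, body = send()  # retry the same request
    return status, body
```

In practice `send` would wrap a POST to `/v1/chat/completions` (e.g. via `requests.post`); the helper just encodes the "503 means retry" convention described above.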
## API Reference

### OpenAI API Endpoint

**POST** `/v1/chat/completions`

Request:

```json
{
  "model": "openelm-450m-instruct",
  "messages": [
    {"role": "user", "content": "Your prompt here"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024
}
```

Response:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "openelm-450m-instruct",
  "choices": [{
    "message": {"role": "assistant", "content": "Generated response"},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 25,
    "total_tokens": 35
  }
}
```

### Anthropic API Endpoint

**POST** `/v1/messages`

Request:

```json
{
  "model": "openelm-450m-instruct",
  "messages": [{"role": "user", "content": "Your prompt here"}],
  "max_tokens": 1024
}
```

## Health Check

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "initializing",
  "model_loaded": false
}
```

(`status` becomes `"healthy"` once the model has loaded.)

## Using with OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

# First request triggers model loading
response = client.chat.completions.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(response.choices[0].message.content)
```

## Using with Anthropic SDK

```python
import anthropic

# Note: the Anthropic SDK appends /v1/messages itself,
# so base_url should not include the /v1 prefix
client = anthropic.Anthropic(
    base_url="http://localhost:8000",
    api_key="dummy"
)

# First request triggers model loading
message = client.messages.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(message.content[0].text)
```

## Expected Behavior

### First Request (Model Loading)

```
$ curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Hello"}]}'

# Response (after model loads):
{"id":"chatcmpl-...", ...}
```

### Console Output During Loading

```
Initializing OpenELM model (this may take a moment)...
Loading tokenizer...
Loading model...
Model apple/OpenELM-450M-Instruct loaded successfully!
```

## Model Information

- **Default Model**: apple/OpenELM-450M-Instruct
- **Parameters**: 450M
- **Context Window**: 2048 tokens
- **Weight Format**: Safetensors
- **Lazy Loading**: Model loads on first request

## Troubleshooting

### First Request Takes Too Long

- Normal behavior: the model downloads (~1GB) and loads (~2GB RAM)
- Subsequent requests are much faster (cached model)

### Model Loading Fails

- Check your internet connection (needed for the Hugging Face download)
- Ensure sufficient RAM (at least 4GB recommended)
- Check the console logs for specific error messages

### API Returns 503

- The model is still loading; retry after a few seconds
- Check the `/health` endpoint for loading status

## Architecture

- **Framework**: FastAPI with async support
- **Lazy Loading**: Model loads on first request
- **ML Backend**: PyTorch + HuggingFace Transformers
- **Streaming**: Server-Sent Events (SSE) support
- **Dual Compatibility**: OpenAI and Anthropic API formats

## Files

- `app.py` - Main API application with lazy loading
- `openelm_tokenizer.py` - Tokenizer utilities
- `examples/` - Usage examples
- `requirements.txt` - Dependencies

## License

This project is provided for educational and research purposes. The OpenELM models from Apple are released under their respective licenses.
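## Appendix: Lazy-Loading Pattern

The lazy-loading behavior this service relies on can be sketched as a lock-guarded, load-once accessor. The snippet below is a simplified illustration, not the actual code in `app.py`; the `get_model` and `loader` names are hypothetical:

```python
import threading

_model = None
_lock = threading.Lock()

def get_model(loader):
    """Return the model, calling `loader` exactly once on first use."""
    global _model
    if _model is None:             # fast path: already loaded, skip the lock
        with _lock:                # serialize concurrent first requests
            if _model is None:     # re-check after acquiring the lock
                _model = loader()  # expensive: download + load weights
    return _model
```

The double check around the lock means that while the model is loading, concurrent requests wait (or, in the real service, receive a 503) rather than triggering a second load, and every request after the first takes the cheap fast path.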