---
title: OpenELM OpenAI API
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
# OpenELM OpenAI & Anthropic API Compatible Wrapper (Lazy Loading)
A FastAPI-based service with **lazy model loading** that provides both OpenAI- and Anthropic-compatible APIs for Apple's OpenELM models. The model loads only on the first API request, avoiding startup timeouts in resource-constrained environments.
## Key Feature: Lazy Loading
Unlike traditional deployments that load the model at startup (causing timeouts), this implementation uses **lazy loading**:
1. **Fast Startup**: API is available immediately (model not loaded yet)
2. **On-Demand Loading**: Model loads when first request arrives
3. **Better Reliability**: Avoids startup timeouts and memory issues
4. **Resource Efficient**: Only uses resources when needed
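
The pattern behind this lives in `app.py`; the sketch below is a generic, illustrative version (the `LazyModel` class and loader are hypothetical, not the actual implementation). The key detail is double-checked locking, so concurrent first requests trigger only one load:

```python
import threading

class LazyModel:
    """Load an expensive resource on first use instead of at startup."""

    def __init__(self, loader):
        self._loader = loader          # function that builds the model
        self._model = None
        self._lock = threading.Lock()  # guards against concurrent first requests

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self):
        # Double-checked locking: cheap fast path once the model exists.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
        return self._model

# The loader runs only when the first request actually needs the model.
model = LazyModel(lambda: "expensive-model-object")
assert not model.loaded       # nothing loaded at "startup"
model.get()                   # first request pays the loading cost
assert model.loaded
```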
## How It Works
When you make your first API request:
- The API temporarily returns a 503 status while loading is in progress
- The model downloads and loads in the background
- Subsequent requests work normally
- Progress is logged to the console
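
A client can absorb the temporary 503 with a simple retry loop. This is a generic sketch, not part of the project's code; `call` stands in for any HTTP request that returns a status code and body:

```python
import time

def retry_on_503(call, retries=10, delay=3.0):
    """Retry `call()` while the server answers 503 (model still loading)."""
    for _ in range(retries):
        status, body = call()      # call() returns (status_code, body)
        if status != 503:
            return body
        time.sleep(delay)          # give the model time to finish loading
    raise TimeoutError("model did not finish loading in time")

# Example with a fake endpoint that is 'loading' for two attempts:
responses = iter([(503, None), (503, None), (200, "ok")])
print(retry_on_503(lambda: next(responses), delay=0.0))  # → ok
```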
## Quick Start
### Build and Run
```bash
# Build and run with Docker
docker build -t openelm-api .
docker run -p 8000:8000 openelm-api
```
### Make Your First Request
```bash
# This will trigger model loading
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openelm-450m-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
The first request will take longer while the model loads. Check the logs for progress.
## API Reference
### OpenAI API Endpoint
**POST** `/v1/chat/completions`
Request:
```json
{
  "model": "openelm-450m-instruct",
  "messages": [
    {"role": "user", "content": "Your prompt here"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024
}
```
Response:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "openelm-450m-instruct",
  "choices": [{
    "message": {"role": "assistant", "content": "Generated response"},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 25,
    "total_tokens": 35
  }
}
```
### Anthropic API Endpoint
**POST** `/v1/messages`
Request:
```json
{
  "model": "openelm-450m-instruct",
  "messages": [{"role": "user", "content": "Your prompt here"}],
  "max_tokens": 1024
}
```
## Health Check
```bash
curl http://localhost:8000/health
```
Response while the model is still loading:
```json
{
  "status": "initializing",
  "model_loaded": false
}
```
Once the model has loaded, `status` becomes `"healthy"` and `model_loaded` becomes `true`.
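
Before sending real traffic, a client can poll `/health` until it reports ready. The `model_ready` helper below is a hypothetical sketch that interprets the payload shown above; wire it to your HTTP client of choice:

```python
import json

def model_ready(health_body: str) -> bool:
    """True once the /health payload reports the model as loaded."""
    health = json.loads(health_body)
    return health.get("status") == "healthy" and health.get("model_loaded") is True

# Interpreting the two states documented above:
assert model_ready('{"status": "initializing", "model_loaded": false}') is False
assert model_ready('{"status": "healthy", "model_loaded": true}') is True
```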
## Using with OpenAI SDK
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # any non-empty string; the server does not check it
)

# First request triggers model loading
response = client.chat.completions.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```
## Using with Anthropic SDK
```python
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:8000",  # no /v1 suffix: the SDK appends /v1/messages itself
    api_key="dummy",
)

# First request triggers model loading
message = client.messages.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(message.content[0].text)
```
## Expected Behavior
### First Request (Model Loading)
```
$ curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"openelm-450m-instruct","messages":[{"role":"user","content":"Hello"}]}'

# Response (after the model finishes loading):
{"id":"chatcmpl-...", ...}
```
### Console Output During Loading
```
Initializing OpenELM model (this may take a moment)...
Loading tokenizer...
Loading model...
Model apple/OpenELM-450M-Instruct loaded successfully!
```
## Model Information
- **Default Model**: apple/OpenELM-450M-Instruct
- **Parameters**: 450M
- **Context Window**: 2048 tokens
- **Weight Format**: Safetensors
- **Lazy Loading**: Model loads on first request
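
Because prompt and completion share the 2048-token context window, clients should cap `max_tokens` by whatever the prompt leaves over. A small illustrative helper (the function name and behavior are an assumption about how you might budget tokens, not server logic):

```python
CONTEXT_WINDOW = 2048  # tokens, per the model information above

def clamp_max_tokens(prompt_tokens: int, requested_max: int) -> int:
    """Cap the completion budget so prompt + completion fit the window."""
    available = max(CONTEXT_WINDOW - prompt_tokens, 0)
    return min(requested_max, available)

print(clamp_max_tokens(10, 1024))    # plenty of room → 1024
print(clamp_max_tokens(2000, 1024))  # only 48 tokens left → 48
```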
## Troubleshooting
### First Request Takes Too Long
- Normal behavior: Model downloads (~1GB) and loads (~2GB RAM)
- Subsequent requests are much faster (cached model)
### Model Loading Fails
- Check internet connection (needed for HuggingFace download)
- Ensure sufficient RAM (at least 4GB recommended)
- Check console logs for specific error messages
### API Returns 503
- Model is still loading, retry after a few seconds
- Check `/health` endpoint for loading status
## Architecture
- **Framework**: FastAPI with async support
- **Lazy Loading**: Model loads on first request
- **ML Backend**: PyTorch + HuggingFace Transformers
- **Streaming**: Server-Sent Events (SSE) support
- **Dual Compatibility**: OpenAI and Anthropic API formats
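
The SSE stream can be consumed via the OpenAI SDK's `stream=True`, or parsed by hand. The sketch below assumes the server emits OpenAI-style `data:` lines terminated by a `[DONE]` sentinel (the standard OpenAI streaming convention; verify against your server's output):

```python
import json

def parse_sse(lines):
    """Yield JSON chunk payloads from an OpenAI-style SSE stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments, blank keep-alive lines, etc.
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break     # end-of-stream sentinel
        yield json.loads(data)

# Reassembling a completion from sample chunks:
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse(sample))
print(text)  # → Hello
```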
## Files
- `app.py` - Main API application with lazy loading
- `openelm_tokenizer.py` - Tokenizer utilities
- `examples/` - Usage examples
- `requirements.txt` - Dependencies
## License
This project is provided for educational and research purposes. The OpenELM models from Apple are released under their respective licenses.