Spaces · Runtime error

Commit 9604400 · MiniMax Agent committed · Parent(s): 91bc5ae

Add Anthropic API compatible wrapper for OpenELM models

Files changed:

- Dockerfile +26 -3
- README.md +167 -3
- app.py +659 -5
- examples/anthropic_sdk_example.py +112 -0
- examples/curl_examples.sh +116 -0
- requirements.txt +6 -0
Dockerfile
CHANGED

```dockerfile
# Read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker
# you will also find guides on how best to write your Dockerfile
# OpenELM Anthropic API Compatible Wrapper

FROM python:3.10-slim

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

RUN useradd -m -u 1000 user
USER user
ENV PATH="/home/user/.local/bin:$PATH"

WORKDIR /app

# Set environment variables for memory optimization
ENV PYTHONUNBUFFERED=1
ENV TRANSFORMERS_CACHE=/app/.cache
ENV HF_HOME=/app/.cache/huggingface
ENV HUGGINGFACE_HUB_CACHE=/app/.cache/huggingface

# Copy requirements first for better caching
COPY --chown=user ./requirements.txt requirements.txt

# Install Python dependencies
# Install PyTorch with CUDA support if available, otherwise CPU version
RUN pip install --no-cache-dir --upgrade pip wheel
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY --chown=user . /app

# Expose the API ports
EXPOSE 8000 7860

# Set default command with extended timeout for model loading
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--timeout-keep-alive", "120"]
```
README.md
CHANGED

---
title: OpenELM Anthropic API
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# OpenELM Anthropic API Compatible Wrapper

A FastAPI-based service that provides an Anthropic-compatible API for Apple's OpenELM models, allowing you to use the Anthropic SDK with OpenELM for text generation tasks.

## Overview

This project creates a REST API that mimics the Anthropic Messages API format, enabling developers to use OpenELM models with existing Anthropic SDK code with minimal modifications. The API supports both streaming and non-streaming responses, multi-turn conversations, system prompts, and various generation parameters.

The OpenELM (Open Efficient Language Model) family from Apple uses a layer-wise scaling strategy to efficiently allocate parameters within each transformer layer, resulting in enhanced accuracy while maintaining computational efficiency. This wrapper makes these models accessible through a familiar API interface.

## Features

- **Full Anthropic API compatibility**: endpoints match the Anthropic Messages API structure, making it easy to integrate with existing codebases.
- **Streaming responses** via Server-Sent Events (SSE), enabling real-time output display as tokens are generated.
- **Multi-turn conversations**: conversation history is maintained and prompts are formatted appropriately for OpenELM models.
- **System prompt handling**: system prompts are prepended to the conversation context, which is essential for defining assistant behavior.
- **Flexible generation parameters**: control over temperature, top-p sampling, maximum tokens, and other generation settings.
- **Token usage statistics** included in responses, matching the Anthropic response format.
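The prompt construction described above can be sketched as a standalone function (mirroring `format_prompt_for_openelm` in `app.py`, simplified here to string-only message content):

```python
def format_prompt(messages, system=None):
    """Convert Anthropic-style messages into OpenELM's script-style prompt."""
    parts = []
    if system:
        # System prompt is prepended to the conversation context
        parts.append(f"[System: {system}]")
    for msg in messages:
        role = msg["role"].capitalize()  # "user" -> "User", "assistant" -> "Assistant"
        parts.append(f'{role}: {msg["content"]}')
    # Trailing "Assistant:" prefix cues the model to continue as the assistant
    parts.append("Assistant:")
    return "\n\n".join(parts)

prompt = format_prompt(
    [{"role": "user", "content": "Hi"}],
    system="Be brief.",
)
# prompt == "[System: Be brief.]\n\nUser: Hi\n\nAssistant:"
```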
## Quick Start

### Using Docker (Recommended)

```bash
# Build and run with Docker
docker build -t openelm-anthropic-api .
docker run -p 8000:8000 openelm-anthropic-api
```

### Local Development

```bash
# Clone and install dependencies
pip install -r requirements.txt

# Start the server
python -m uvicorn app:app --host 0.0.0.0 --port 8000
```

### Test the API

```bash
# Basic message generation
curl -X POST http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openelm-450m-instruct",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 100
  }'
```

## API Reference

### Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | / | API information |
| GET | /health | Health check |
| GET | /v1/models | List available models |
| POST | /v1/messages | Create message (non-streaming) |
| POST | /v1/messages/stream | Create message (streaming) |

### Request Format

```json
{
  "model": "openelm-450m-instruct",
  "messages": [
    {"role": "user", "content": "Your prompt here"}
  ],
  "system": "Optional system prompt",
  "max_tokens": 1024,
  "temperature": 0.7,
  "top_p": 0.9,
  "stream": false
}
```
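Internally, the wrapper translates these Anthropic-style sampling parameters into `transformers` `generate()` keyword arguments. A simplified sketch of that mapping (mirroring `map_anthropic_params_to_transformers` in `app.py`; note that `temperature: 0` switches to greedy decoding):

```python
def map_params(temperature=None, top_p=None, top_k=None, max_tokens=1024):
    """Translate Anthropic sampling params to transformers generate() kwargs."""
    params = {"max_new_tokens": max_tokens}
    if temperature is not None:
        if temperature == 0:
            params["do_sample"] = False  # temperature 0 means greedy decoding
        else:
            params["temperature"] = temperature
            params["do_sample"] = True
    if top_p is not None:
        params["top_p"] = top_p
    if top_k is not None:
        params["top_k"] = top_k
    return params

greedy = map_params(temperature=0, max_tokens=100)
sampled = map_params(temperature=0.7, top_p=0.9, max_tokens=100)
```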
### Response Format

```json
{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "content": [{"type": "text", "text": "Generated response"}],
  "model": "openelm-450m-instruct",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 10,
    "output_tokens": 50
  }
}
```

## Using with Anthropic SDK

```python
from anthropic import Anthropic

# Point to your local API
client = Anthropic(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # Any string works
)

# Use the same API you use with Claude!
response = client.messages.create(
    model="openelm-450m-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)

print(response.content[0].text)
```
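The streaming endpoint returns Server-Sent Events. The exact event payload shape is not documented above, so the sketch below parses generic SSE `data:` lines from a canned response; the `{"text": ...}` payload and `[DONE]` sentinel are assumptions for illustration, not the wrapper's confirmed wire format:

```python
import json

def parse_sse(raw: str):
    """Extract JSON payloads from SSE 'data:' lines; stop at a [DONE] sentinel."""
    events = []
    for line in raw.splitlines():
        if not line.startswith("data:"):
            continue  # skip blank lines and SSE comments/keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        events.append(json.loads(payload))
    return events

# Canned example stream (hypothetical payload shape)
sample = 'data: {"text": "Hel"}\ndata: {"text": "lo"}\ndata: [DONE]\n'
chunks = parse_sse(sample)
text = "".join(e["text"] for e in chunks)
```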
## Model Information

- **Default Model**: apple/OpenELM-450M-Instruct
- **Parameters**: 450M
- **Context Window**: 2048 tokens
- **Weight Format**: Safetensors (secure and efficient)
- **Precision**: FP16 (the model is loaded with `torch_dtype=torch.float16`)
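Because the context window is 2048 tokens, the wrapper reserves room for the completion: the prompt budget is `2048 - max_tokens`, and when no tokenizer is loaded it falls back to a rough ~4-characters-per-token estimate. A minimal sketch of that budgeting (mirroring `count_tokens` and the truncation logic in `app.py`):

```python
CONTEXT_WINDOW = 2048  # OpenELM's context window in tokens

def estimate_tokens(text: str) -> int:
    """Rough fallback estimate (~4 chars/token) used when no tokenizer is loaded."""
    return max(1, len(text) // 4)

def prompt_budget(max_tokens: int) -> int:
    """Tokens left for the prompt after reserving space for the completion."""
    return CONTEXT_WINDOW - max_tokens

budget = prompt_budget(1024)                 # 1024 tokens left for the prompt
fits = estimate_tokens("x" * 4000) <= budget # ~1000 estimated tokens, so it fits
```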
## Architecture

- **Framework**: FastAPI with async support
- **ML Backend**: PyTorch + HuggingFace Transformers
- **Model Loading**: Loaded once at startup, with caching
- **Streaming**: Server-Sent Events (SSE)
- **Response Format**: Anthropic API compatible

## Configuration

Environment variables can be used to customize the deployment:

| Variable | Default | Description |
|----------|---------|-------------|
| PORT | 8000 | API server port |
| HF_HOME | ~/.cache/huggingface | Model cache directory |
| TRANSFORMERS_CACHE | ~/.cache/transformers | Transformers cache |

## Examples

See the `examples/` directory for complete usage examples:

- `anthropic_sdk_example.py` - Python SDK usage
- `curl_examples.sh` - Command-line examples

## Troubleshooting

- **Model not loading**: Check the internet connection needed for the HuggingFace download
- **Out of memory**: Reduce max_tokens or use CPU inference
- **Slow responses**: The first request downloads the model (subsequent requests are faster)
- **Port conflicts**: Change the PORT environment variable

## License

This project is provided for educational and research purposes. The OpenELM models from Apple are released under their respective licenses. Please refer to the model card on Hugging Face for licensing information regarding the model weights.

## Resources

- [OpenELM Model Card](https://huggingface.co/apple/OpenELM-450M-Instruct)
- [Anthropic API Documentation](https://docs.anthropic.com)
- [FastAPI Documentation](https://fastapi.tiangolo.com)
- [HuggingFace Transformers](https://huggingface.co/docs/transformers)
app.py
CHANGED
|
@@ -1,7 +1,661 @@
|
|
| 1 |
-
|
|
|
|
| 2 |
|
| 3 |
-
|
|
|
|
|
|
|
| 4 |
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
OpenELM Anthropic API Compatible Wrapper
|
| 3 |
|
| 4 |
+
This FastAPI application provides an Anthropic-compatible API for the OpenELM model,
|
| 5 |
+
allowing users to call OpenELM models using the Anthropic SDK with minimal code changes.
|
| 6 |
+
"""
|
| 7 |
|
| 8 |
+
import asyncio
|
| 9 |
+
import uuid
|
| 10 |
+
import sys
|
| 11 |
+
from contextlib import asynccontextmanager
|
| 12 |
+
from typing import AsyncIterator, List, Optional, Dict, Any
|
| 13 |
+
|
| 14 |
+
import torch
|
| 15 |
+
from fastapi import FastAPI, HTTPException, Request
|
| 16 |
+
from fastapi.responses import JSONResponse, StreamingResponse
|
| 17 |
+
from fastapi.middleware.cors import CORSMiddleware
|
| 18 |
+
from pydantic import BaseModel, Field
|
| 19 |
+
from transformers import AutoTokenizer, AutoModelForCausalLM
|
| 20 |
+
from huggingface_hub import hf_hub_download
|
| 21 |
+
import os
|
| 22 |
+
|
| 23 |
+
# Import for streaming
|
| 24 |
+
from transformers import TextIteratorStreamer
|
| 25 |
+
from threading import Thread
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
# Global model and tokenizer references
|
| 29 |
+
model = None
|
| 30 |
+
tokenizer = None
|
| 31 |
+
model_id = "apple/OpenELM-450M-Instruct"
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
@asynccontextmanager
|
| 35 |
+
async def lifespan(app: FastAPI) -> AsyncIterator:
|
| 36 |
+
"""Load model on startup and clean up on shutdown."""
|
| 37 |
+
global model, tokenizer
|
| 38 |
+
|
| 39 |
+
print("Loading OpenELM model...")
|
| 40 |
+
try:
|
| 41 |
+
# Load tokenizer
|
| 42 |
+
tokenizer = AutoTokenizer.from_pretrained(
|
| 43 |
+
model_id,
|
| 44 |
+
trust_remote_code=True
|
| 45 |
+
)
|
| 46 |
+
|
| 47 |
+
# Load model with safetensors support
|
| 48 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 49 |
+
model_id,
|
| 50 |
+
torch_dtype=torch.float16,
|
| 51 |
+
use_safetensors=True,
|
| 52 |
+
trust_remote_code=True,
|
| 53 |
+
device_map="auto" if torch.cuda.is_available() else None
|
| 54 |
+
)
|
| 55 |
+
|
| 56 |
+
model.eval()
|
| 57 |
+
print(f"Model {model_id} loaded successfully!")
|
| 58 |
+
|
| 59 |
+
# Print model info
|
| 60 |
+
if hasattr(model, 'config'):
|
| 61 |
+
print(f"Model config: hidden_size={getattr(model.config, 'hidden_size', 'N/A')}, "
|
| 62 |
+
f"num_layers={getattr(model.config, 'num_layers', 'N/A')}, "
|
| 63 |
+
f"num_attention_heads={getattr(model.config, 'num_attention_heads', 'N/A')}")
|
| 64 |
+
|
| 65 |
+
except Exception as e:
|
| 66 |
+
print(f"Error loading model: {e}", file=sys.stderr)
|
| 67 |
+
# Continue without model - allow health check
|
| 68 |
+
model = None
|
| 69 |
+
tokenizer = None
|
| 70 |
+
|
| 71 |
+
yield
|
| 72 |
+
|
| 73 |
+
# Cleanup
|
| 74 |
+
if model is not None:
|
| 75 |
+
del model
|
| 76 |
+
if tokenizer is not None:
|
| 77 |
+
del tokenizer
|
| 78 |
+
torch.cuda.empty_cache() if torch.cuda.is_available() else None
|
| 79 |
+
|
| 80 |
+
|
| 81 |
+
# Create FastAPI app
|
| 82 |
+
app = FastAPI(
|
| 83 |
+
title="OpenELM Anthropic API",
|
| 84 |
+
description="Anthropic API compatible wrapper for OpenELM models",
|
| 85 |
+
version="1.0.0",
|
| 86 |
+
lifespan=lifespan
|
| 87 |
+
)
|
| 88 |
+
|
| 89 |
+
# Add CORS middleware
|
| 90 |
+
app.add_middleware(
|
| 91 |
+
CORSMiddleware,
|
| 92 |
+
allow_origins=["*"],
|
| 93 |
+
allow_credentials=True,
|
| 94 |
+
allow_methods=["*"],
|
| 95 |
+
allow_headers=["*"],
|
| 96 |
+
)
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
# ==================== Pydantic Models ====================
|
| 100 |
+
|
| 101 |
+
class MessageContent(BaseModel):
|
| 102 |
+
"""Content of a message."""
|
| 103 |
+
type: str = "text"
|
| 104 |
+
text: str
|
| 105 |
+
|
| 106 |
+
|
| 107 |
+
class Message(BaseModel):
|
| 108 |
+
"""A message in the conversation."""
|
| 109 |
+
role: str
|
| 110 |
+
content: str | List[MessageContent]
|
| 111 |
+
name: Optional[str] = None
|
| 112 |
+
|
| 113 |
+
|
| 114 |
+
class Usage(BaseModel):
|
| 115 |
+
"""Token usage statistics."""
|
| 116 |
+
input_tokens: int = 0
|
| 117 |
+
output_tokens: int = 0
|
| 118 |
+
|
| 119 |
+
|
| 120 |
+
class ContentBlock(BaseModel):
|
| 121 |
+
"""Content block in the response."""
|
| 122 |
+
type: str = "text"
|
| 123 |
+
text: str
|
| 124 |
+
|
| 125 |
+
|
| 126 |
+
class MessageResponse(BaseModel):
|
| 127 |
+
"""Response message format matching Anthropic API."""
|
| 128 |
+
id: str
|
| 129 |
+
type: str = "message"
|
| 130 |
+
role: str = "assistant"
|
| 131 |
+
content: List[ContentBlock]
|
| 132 |
+
model: str
|
| 133 |
+
stop_reason: Optional[str] = None
|
| 134 |
+
stop_sequence: Optional[str] = None
|
| 135 |
+
usage: Usage
|
| 136 |
+
|
| 137 |
+
|
| 138 |
+
class MessageCreateParams(BaseModel):
|
| 139 |
+
"""Parameters for creating a message (Anthropic API compatible)."""
|
| 140 |
+
model: str = "openelm-450m-instruct"
|
| 141 |
+
messages: List[Message]
|
| 142 |
+
system: Optional[str] = None
|
| 143 |
+
max_tokens: int = Field(default=1024, ge=1, le=4096)
|
| 144 |
+
temperature: Optional[float] = Field(default=None, ge=0.0, le=1.0)
|
| 145 |
+
top_p: Optional[float] = Field(default=None, ge=0.0, le=1.0)
|
| 146 |
+
top_k: Optional[int] = Field(default=None, ge=1)
|
| 147 |
+
stop_sequences: Optional[List[str]] = None
|
| 148 |
+
stream: Optional[bool] = False
|
| 149 |
+
|
| 150 |
+
|
| 151 |
+
class ModelInfo(BaseModel):
|
| 152 |
+
"""Information about an available model."""
|
| 153 |
+
id: str
|
| 154 |
+
object: str = "model"
|
| 155 |
+
created: int = 0
|
| 156 |
+
owned_by: str = "openelm"
|
| 157 |
+
|
| 158 |
+
|
| 159 |
+
class ModelListResponse(BaseModel):
|
| 160 |
+
"""List of available models."""
|
| 161 |
+
object: str = "list"
|
| 162 |
+
data: List[ModelInfo]
|
| 163 |
+
|
| 164 |
+
|
| 165 |
+
# ==================== Helper Functions ====================
|
| 166 |
+
|
| 167 |
+
def format_prompt_for_openelm(
|
| 168 |
+
messages: List[Message],
|
| 169 |
+
system: Optional[str] = None
|
| 170 |
+
) -> str:
|
| 171 |
+
"""
|
| 172 |
+
Format messages into a prompt suitable for OpenELM.
|
| 173 |
+
|
| 174 |
+
OpenELM uses raw text continuation, not ChatML. We convert the
|
| 175 |
+
conversation history into a script-like format.
|
| 176 |
+
"""
|
| 177 |
+
prompt_parts = []
|
| 178 |
+
|
| 179 |
+
# Add system prompt first if provided
|
| 180 |
+
if system:
|
| 181 |
+
prompt_parts.append(f"[System: {system}]")
|
| 182 |
+
|
| 183 |
+
# Build conversation history
|
| 184 |
+
for msg in messages:
|
| 185 |
+
role = msg.role.lower()
|
| 186 |
+
content = msg.content
|
| 187 |
+
|
| 188 |
+
# Handle both string and list content formats
|
| 189 |
+
if isinstance(content, list):
|
| 190 |
+
text_parts = []
|
| 191 |
+
for block in content:
|
| 192 |
+
if hasattr(block, 'text'):
|
| 193 |
+
text_parts.append(block.text)
|
| 194 |
+
elif isinstance(block, dict) and 'text' in block:
|
| 195 |
+
text_parts.append(block['text'])
|
| 196 |
+
content = ''.join(text_parts)
|
| 197 |
+
elif not isinstance(content, str):
|
| 198 |
+
content = str(content)
|
| 199 |
+
|
| 200 |
+
# Format based on role
|
| 201 |
+
if role == "system":
|
| 202 |
+
prompt_parts.append(f"[System: {content}]")
|
| 203 |
+
elif role == "user":
|
| 204 |
+
prompt_parts.append(f"User: {content}")
|
| 205 |
+
elif role == "assistant":
|
| 206 |
+
prompt_parts.append(f"Assistant: {content}")
|
| 207 |
+
else:
|
| 208 |
+
prompt_parts.append(f"{role}: {content}")
|
| 209 |
+
|
| 210 |
+
# Add the final Assistant: prefix for completion
|
| 211 |
+
prompt_parts.append("Assistant:")
|
| 212 |
+
|
| 213 |
+
return "\n\n".join(prompt_parts)
|
| 214 |
+
|
| 215 |
+
|
| 216 |
+
def count_tokens(text: str) -> int:
|
| 217 |
+
"""Estimate token count (approximation)."""
|
| 218 |
+
if tokenizer:
|
| 219 |
+
return len(tokenizer.encode(text))
|
| 220 |
+
# Rough approximation: ~4 characters per token
|
| 221 |
+
return max(1, len(text) // 4)
|
| 222 |
+
|
| 223 |
+
|
| 224 |
+
def truncate_prompt(prompt: str, max_tokens: int, system: Optional[str] = None) -> str:
|
| 225 |
+
"""Truncate prompt to fit within context window."""
|
| 226 |
+
current_tokens = count_tokens(prompt)
|
| 227 |
+
|
| 228 |
+
if current_tokens <= max_tokens:
|
| 229 |
+
return prompt
|
| 230 |
+
|
| 231 |
+
# Split into parts and remove from the beginning (keep system if present)
|
| 232 |
+
lines = prompt.split("\n\n")
|
| 233 |
+
|
| 234 |
+
# If system is present, keep it at the start
|
| 235 |
+
system_line = None
|
| 236 |
+
if lines and lines[0].startswith("[System:"):
|
| 237 |
+
system_line = lines[0]
|
| 238 |
+
lines = lines[1:]
|
| 239 |
+
|
| 240 |
+
# Remove oldest messages until within limit
|
| 241 |
+
truncated_lines = []
|
| 242 |
+
for line in reversed(lines):
|
| 243 |
+
truncated_lines.insert(0, line)
|
| 244 |
+
current_tokens = count_tokens("\n\n".join([system_line] + truncated_lines) if system_line else "\n\n".join(truncated_lines))
|
| 245 |
+
if current_tokens <= max_tokens:
|
| 246 |
+
break
|
| 247 |
+
|
| 248 |
+
if system_line:
|
| 249 |
+
return "\n\n".join([system_line] + truncated_lines)
|
| 250 |
+
return "\n\n".join(truncated_lines)
|
| 251 |
+
|
| 252 |
+
|
| 253 |
+
def map_anthropic_params_to_transformers(
|
| 254 |
+
temperature: Optional[float],
|
| 255 |
+
top_p: Optional[float],
|
| 256 |
+
top_k: Optional[int],
|
| 257 |
+
max_tokens: int
|
| 258 |
+
) -> Dict[str, Any]:
|
| 259 |
+
"""Map Anthropic parameters to transformers generation parameters."""
|
| 260 |
+
params = {
|
| 261 |
+
"max_new_tokens": max_tokens,
|
| 262 |
+
}
|
| 263 |
+
|
| 264 |
+
if temperature is not None:
|
| 265 |
+
if temperature == 0:
|
| 266 |
+
params["do_sample"] = False
|
| 267 |
+
else:
|
| 268 |
+
params["temperature"] = temperature
|
| 269 |
+
params["do_sample"] = True
|
| 270 |
+
|
| 271 |
+
if top_p is not None:
|
| 272 |
+
params["top_p"] = top_p
|
| 273 |
+
|
| 274 |
+
if top_k is not None:
|
| 275 |
+
params["top_k"] = top_k
|
| 276 |
+
|
| 277 |
+
return params
|
| 278 |
+
|
| 279 |
+
|
| 280 |
+
# ==================== API Endpoints ====================
|
| 281 |
+
|
| 282 |
+
@app.get("/", tags=["Root"])
|
| 283 |
+
async def root():
|
| 284 |
+
"""Root endpoint with API information."""
|
| 285 |
+
return {
|
| 286 |
+
"name": "OpenELM Anthropic API",
|
| 287 |
+
"version": "1.0.0",
|
| 288 |
+
"description": "Anthropic API compatible wrapper for OpenELM models",
|
| 289 |
+
"endpoints": {
|
| 290 |
+
"messages": "POST /v1/messages",
|
| 291 |
+
"models": "GET /v1/models",
|
| 292 |
+
"health": "GET /health"
|
| 293 |
+
}
|
| 294 |
+
}
|
| 295 |
+
|
| 296 |
+
|
| 297 |
+
@app.get("/health", tags=["Health"])
|
| 298 |
+
async def health_check():
|
| 299 |
+
"""Health check endpoint."""
|
| 300 |
+
status = "healthy" if model is not None else "unhealthy"
|
| 301 |
+
return {
|
| 302 |
+
"status": status,
|
| 303 |
+
"model_loaded": model is not None,
|
| 304 |
+
"tokenizer_loaded": tokenizer is not None
|
| 305 |
+
}
|
| 306 |
+
|
| 307 |
+
|
| 308 |
+
@app.get("/v1/models", response_model=ModelListResponse, tags=["Models"])
|
| 309 |
+
async def list_models():
|
| 310 |
+
"""List available models (Anthropic API compatible)."""
|
| 311 |
+
return ModelListResponse(
|
| 312 |
+
data=[
|
| 313 |
+
ModelInfo(
|
| 314 |
+
id="openelm-450m-instruct",
|
| 315 |
+
owned_by="apple",
|
| 316 |
+
created=int(uuid.uuid1().time)
|
| 317 |
+
)
|
| 318 |
+
]
|
| 319 |
+
)
|
| 320 |
+
|
| 321 |
+
|
| 322 |
+
@app.post("/v1/messages", response_model=MessageResponse, tags=["Messages"])
|
| 323 |
+
async def create_message(
|
| 324 |
+
params: MessageCreateParams,
|
| 325 |
+
request: Request
|
| 326 |
+
):
|
| 327 |
+
"""
|
| 328 |
+
Create a message completion (Anthropic API compatible).
|
| 329 |
+
|
| 330 |
+
This endpoint accepts Anthropic-style messages and returns responses
|
| 331 |
+
in the same format, allowing existing code to work with OpenELM.
|
| 332 |
+
"""
|
| 333 |
+
# Check if model is loaded
|
| 334 |
+
if model is None or tokenizer is None:
|
| 335 |
+
raise HTTPException(
|
| 336 |
+
status_code=503,
|
| 337 |
+
detail="Model not loaded. Please wait for model to initialize."
|
| 338 |
+
)
|
| 339 |
+
|
| 340 |
+
try:
|
| 341 |
+
# Format prompt for OpenELM
|
| 342 |
+
messages = params.messages
|
| 343 |
+
|
| 344 |
+
# Extract message contents
|
| 345 |
+
formatted_messages = []
|
| 346 |
+
for msg in messages:
|
| 347 |
+
content = msg.content
|
| 348 |
+
if isinstance(content, list):
|
| 349 |
+
text_content = ""
|
| 350 |
+
for block in content:
|
| 351 |
+
if hasattr(block, 'text'):
|
| 352 |
+
text_content += block.text
|
| 353 |
+
content = text_content
|
| 354 |
+
formatted_messages.append(Message(
|
| 355 |
+
role=msg.role,
|
| 356 |
+
content=content
|
| 357 |
+
))
|
| 358 |
+
|
| 359 |
+
prompt = format_prompt_for_openelm(formatted_messages, params.system)
|
| 360 |
+
|
| 361 |
+
# Truncate if needed (OpenELM typically has 2048 context window)
|
| 362 |
+
max_context_tokens = 2048 - params.max_tokens
|
| 363 |
+
prompt = truncate_prompt(prompt, max_context_tokens, params.system)
|
| 364 |
+
|
| 365 |
+
# Tokenize input
|
| 366 |
+
inputs = tokenizer(prompt, return_tensors="pt")
|
| 367 |
+
input_tokens = len(inputs.input_ids[0])
|
| 368 |
+
|
| 369 |
+
# Move to same device as model
|
| 370 |
+
if hasattr(model, 'device'):
|
| 371 |
+
inputs = {k: v.to(model.device) for k, v in inputs.items()}
|
| 372 |
+
|
| 373 |
+
# Map parameters
|
| 374 |
+
gen_params = map_anthropic_params_to_transformers(
|
| 375 |
+
params.temperature,
|
| 376 |
+
params.top_p,
|
| 377 |
+
params.top_k,
|
| 378 |
+
params.max_tokens
|
| 379 |
+
)
|
| 380 |
+
|
| 381 |
+
# Generate
|
| 382 |
+
with torch.no_grad():
|
| 383 |
+
outputs = model.generate(
|
| 384 |
+
**inputs,
|
| 385 |
+
**gen_params,
|
| 386 |
+
pad_token_id=tokenizer.eos_token_id,
|
| 387 |
+
eos_token_id=tokenizer.eos_token_id,
|
| 388 |
+
)
|
| 389 |
+
|
| 390 |
+
# Decode output
|
| 391 |
+
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
| 392 |
+
|
| 393 |
+
# Extract the assistant's response (everything after "Assistant:")
|
| 394 |
+
response_text = generated_text
|
| 395 |
+
if "Assistant:" in generated_text:
|
| 396 |
+
response_text = generated_text.split("Assistant:")[-1].strip()
|
| 397 |
+
elif ":" in generated_text:
|
| 398 |
+
# Find the last role and extract content after it
|
| 399 |
+
lines = generated_text.split("\n")
|
| 400 |
+
in_assistant = False
|
| 401 |
+
response_parts = []
|
| 402 |
+
for line in lines:
|
| 403 |
+
if line.startswith("Assistant:"):
|
| 404 |
+
in_assistant = True
|
| 405 |
+
response_parts.append(line.replace("Assistant:", "").strip())
|
| 406 |
+
elif in_assistant and not line.startswith("User:") and not line.startswith("System:"):
|
| 407 |
+
response_parts.append(line)
|
| 408 |
+
elif line.startswith("User:") or line.startswith("System:"):
|
| 409 |
+
in_assistant = False
|
| 410 |
+
response_text = "\n".join(response_parts).strip()
|
| 411 |
+
|
| 412 |
+
output_tokens = count_tokens(response_text)
|
| 413 |
+
|
| 414 |
+
# Build response matching Anthropic format
|
| 415 |
+
response_id = f"msg_{uuid.uuid4().hex[:8]}"
|
| 416 |
+
|
| 417 |
+
return MessageResponse(
|
| 418 |
+
id=response_id,
|
| 419 |
+
role="assistant",
|
| 420 |
+
content=[ContentBlock(type="text", text=response_text)],
|
| 421 |
+
model="openelm-450m-instruct",
|
| 422 |
+
stop_reason="end_turn",
|
| 423 |
+
usage=Usage(
|
| 424 |
+
input_tokens=input_tokens,
|
| 425 |
+
output_tokens=output_tokens
|
| 426 |
+
)
|
| 427 |
+
)
|
| 428 |
+
|
| 429 |
+
except Exception as e:
|
| 430 |
+
raise HTTPException(
|
| 431 |
+
status_code=500,
|
| 432 |
+
detail=f"Generation failed: {str(e)}"
|
| 433 |
+
)
|
| 434 |
+
|
| 435 |
+
|
@app.post("/v1/messages/stream", tags=["Messages"])
async def create_message_stream(
    params: MessageCreateParams,
    request: Request
):
    """
    Create a streaming message completion (Anthropic API compatible).

    Returns Server-Sent Events (SSE) with the streaming response.
    """
    # Check if model is loaded
    if model is None or tokenizer is None:
        raise HTTPException(
            status_code=503,
            detail="Model not loaded. Please wait for model to initialize."
        )

    if not params.stream:
        raise HTTPException(
            status_code=400,
            detail="Stream parameter must be true for streaming endpoint"
        )

    async def generate_stream():
        """Generate the SSE event stream."""
        import json  # local import; used to escape event payloads safely
        try:
            # Format prompt for OpenELM
            messages = params.messages

            # Extract message contents (content blocks are flattened to text)
            formatted_messages = []
            for msg in messages:
                content = msg.content
                if isinstance(content, list):
                    text_content = ""
                    for block in content:
                        if hasattr(block, 'text'):
                            text_content += block.text
                    content = text_content
                formatted_messages.append(Message(
                    role=msg.role,
                    content=content
                ))

            prompt = format_prompt_for_openelm(formatted_messages, params.system)

            # Truncate if needed to fit the 2048-token context window
            max_context_tokens = 2048 - params.max_tokens
            prompt = truncate_prompt(prompt, max_context_tokens, params.system)

            # Tokenize
            inputs = tokenizer(prompt, return_tensors="pt")
            input_tokens = len(inputs.input_ids[0])

            # Move to same device as model
            if hasattr(model, 'device'):
                inputs = {k: v.to(model.device) for k, v in inputs.items()}

            # Map Anthropic sampling parameters to transformers kwargs
            gen_params = map_anthropic_params_to_transformers(
                params.temperature,
                params.top_p,
                params.top_k,
                params.max_tokens
            )

            # Set up streaming
            gen_params["stopping_criteria"] = []

            # TextIteratorStreamer yields decoded text as tokens are generated
            streamer = TextIteratorStreamer(
                tokenizer,
                skip_prompt=True,
                skip_special_tokens=True
            )
            gen_params["streamer"] = streamer

            # Run generation in a separate thread so we can iterate the streamer
            def generate():
                with torch.no_grad():
                    model.generate(**inputs, **gen_params)

            thread = Thread(target=generate)
            thread.start()

            # Send message_start event
            message_id = f"msg_{uuid.uuid4().hex[:8]}"
            yield f"event: message_start\ndata: {MessageResponse(id=message_id, model='openelm-450m-instruct', usage=Usage()).model_dump_json()}\n\n"

            # Send content_block_start event
            yield 'event: content_block_start\ndata: {"type": "text", "text": ""}\n\n'

            # Stream the generated text; json.dumps escapes quotes and newlines
            full_text = ""
            for text in streamer:
                full_text += text
                delta = json.dumps({"type": "text_delta", "text": text})
                yield f"event: content_block_delta\ndata: {delta}\n\n"

            # Send content_block_stop event
            yield 'event: content_block_stop\ndata: {"type": "content_block", "text": ""}\n\n'

            # Calculate usage and send message_delta event
            output_tokens = count_tokens(full_text)
            usage_data = {"input_tokens": input_tokens, "output_tokens": output_tokens}
            delta_event = json.dumps({"delta": {"stop_reason": "end_turn"}, "usage": usage_data})
            yield f"event: message_delta\ndata: {delta_event}\n\n"

            # Send message_stop event
            yield "event: message_stop\ndata: {}\n\n"

            thread.join()

        except Exception as e:
            yield f"event: error\ndata: {json.dumps({'error': str(e)})}\n\n"

    return StreamingResponse(
        generate_stream(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",
        }
    )

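A client consuming this endpoint has to split `event:`/`data:` pairs out of the text stream. A minimal sketch of that parsing (the `parse_sse` name is ours; the event names match those emitted above):

```python
import json

def parse_sse(stream_text: str):
    """Split an SSE payload into (event, data) pairs.

    Events are separated by blank lines; data is parsed as JSON when possible.
    """
    events = []
    for chunk in stream_text.split("\n\n"):
        event, data = None, None
        for line in chunk.split("\n"):
            if line.startswith("event: "):
                event = line[len("event: "):]
            elif line.startswith("data: "):
                data = line[len("data: "):]
        if event is not None:
            try:
                data = json.loads(data)
            except (TypeError, json.JSONDecodeError):
                pass  # keep raw string if the payload is not JSON
            events.append((event, data))
    return events


sample = (
    'event: content_block_delta\ndata: {"type": "text_delta", "text": "Hel"}\n\n'
    'event: message_stop\ndata: {}\n\n'
)
for event, data in parse_sse(sample):
    print(event, data)
```

Concatenating the `text` fields of the `content_block_delta` events reproduces the full completion.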

# ==================== Anthropic SDK Compatibility ====================

class AnthropicClient:
    """
    Simple Anthropic SDK compatible client for testing.

    Usage:
        client = AnthropicClient(base_url="http://localhost:8000", api_key="dummy")
        response = client.messages().create(
            model="openelm-450m-instruct",
            messages=[{"role": "user", "content": "Hello!"}],
            max_tokens=100
        )
    """

    def __init__(self, base_url: str = "http://localhost:8000", api_key: str = "dummy"):
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key
        self.session = None

    def _get_session(self):
        """Get or create a requests session."""
        import requests
        if self.session is None:
            self.session = requests.Session()
            self.session.headers.update({
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            })
        return self.session

    def messages(self) -> "MessageResource":
        """Access message operations."""
        return MessageResource(self)


class MessageResource:
    """Resource for message operations."""

    def __init__(self, client: AnthropicClient):
        self.client = client

    def create(
        self,
        model: str,
        messages: List[Dict[str, str]],
        system: Optional[str] = None,
        max_tokens: int = 1024,
        temperature: Optional[float] = None,
        top_p: Optional[float] = None,
        stream: bool = False
    ) -> Dict[str, Any]:
        """Create a message."""
        url = f"{self.client.base_url}/v1/messages"
        if stream:
            url = f"{self.client.base_url}/v1/messages/stream"

        payload = {
            "model": model,
            "messages": messages,
            "max_tokens": max_tokens,
        }

        if system:
            payload["system"] = system
        if temperature is not None:
            payload["temperature"] = temperature
        if top_p is not None:
            payload["top_p"] = top_p
        if stream:
            payload["stream"] = True

        response = self.client._get_session().post(url, json=payload)

        if response.status_code != 200:
            raise Exception(f"API request failed: {response.text}")

        return response.json()


# ==================== Main Entry Point ====================

if __name__ == "__main__":
    import uvicorn

    # Get port from environment or use default
    port = int(os.environ.get("PORT", 8000))

    # Run the server
    uvicorn.run(
        "app:app",
        host="0.0.0.0",
        port=port,
        reload=False,
        workers=1
    )
examples/anthropic_sdk_example.py
ADDED
@@ -0,0 +1,112 @@
"""
Example: Using Anthropic SDK with OpenELM API

This example demonstrates how to use the Anthropic SDK (or a compatible client)
to call OpenELM models through our Anthropic API compatible wrapper.

Usage:
    python examples/anthropic_sdk_example.py
"""

import sys
import os

# Add parent directory to path for imports
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from app import AnthropicClient


def main():
    """Example usage of the Anthropic-compatible OpenELM API."""

    # Create client pointing to our local API
    base_url = os.environ.get("OPENELM_API_URL", "http://localhost:8000")
    client = AnthropicClient(base_url=base_url, api_key="dummy-key")

    print("=" * 60)
    print("OpenELM Anthropic API - Usage Example")
    print("=" * 60)
    print(f"API URL: {base_url}")
    print()

    # Example 1: Basic message generation
    print("Example 1: Basic Message Generation")
    print("-" * 40)

    response = client.messages().create(
        model="openelm-450m-instruct",
        messages=[
            {"role": "user", "content": "Say hello in a friendly way!"}
        ],
        max_tokens=100,
        temperature=0.7
    )

    print(f"Response ID: {response['id']}")
    print(f"Model: {response['model']}")
    print(f"Content: {response['content'][0]['text']}")
    print(f"Usage: {response['usage']}")
    print()

    # Example 2: Multi-turn conversation
    print("Example 2: Multi-turn Conversation")
    print("-" * 40)

    response = client.messages().create(
        model="openelm-450m-instruct",
        messages=[
            {"role": "user", "content": "What is artificial intelligence?"},
            {"role": "assistant", "content": "Artificial intelligence, or AI, refers to systems that can perform tasks that typically require human intelligence."},
            {"role": "user", "content": "Can you give me some examples?"}
        ],
        max_tokens=150,
        temperature=0.5
    )

    print(f"Content: {response['content'][0]['text']}")
    print(f"Usage: {response['usage']}")
    print()

    # Example 3: Using a system prompt
    print("Example 3: Using System Prompt")
    print("-" * 40)

    response = client.messages().create(
        model="openelm-450m-instruct",
        messages=[
            {"role": "user", "content": "Explain quantum computing simply."}
        ],
        system="You are a helpful science educator who explains complex topics simply.",
        max_tokens=200,
        temperature=0.8
    )

    print(f"Content: {response['content'][0]['text']}")
    print(f"Usage: {response['usage']}")
    print()

    # Example 4: Deterministic generation (temperature=0)
    print("Example 4: Deterministic Generation (temperature=0)")
    print("-" * 40)

    response = client.messages().create(
        model="openelm-450m-instruct",
        messages=[
            {"role": "user", "content": "What is 2 + 2?"}
        ],
        max_tokens=50,
        temperature=0.0  # Deterministic output
    )

    print(f"Content: {response['content'][0]['text']}")
    print(f"Usage: {response['usage']}")
    print()

    print("=" * 60)
    print("All examples completed successfully!")
    print("=" * 60)


if __name__ == "__main__":
    main()
examples/curl_examples.sh
ADDED
@@ -0,0 +1,116 @@
#!/bin/bash
# OpenELM Anthropic API - Curl Examples
#
# This script demonstrates how to call the OpenELM Anthropic API
# using curl commands directly.
#
# Usage:
#   chmod +x examples/curl_examples.sh
#   ./examples/curl_examples.sh

# Set API base URL (default: localhost:8000)
API_URL="${OPENELM_API_URL:-http://localhost:8000}"
API_URL="${API_URL%/}"  # Remove trailing slash

echo "=============================================="
echo "OpenELM Anthropic API - Curl Examples"
echo "=============================================="
echo "API URL: $API_URL"
echo ""

# Example 1: Health Check
echo "Example 1: Health Check"
echo "------------------------"
curl -s "$API_URL/health" | python3 -m json.tool
echo ""

# Example 2: List Available Models
echo "Example 2: List Available Models"
echo "---------------------------------"
curl -s "$API_URL/v1/models" | python3 -m json.tool
echo ""

# Example 3: Basic Message Generation
echo "Example 3: Basic Message Generation"
echo "------------------------------------"
curl -s -X POST "$API_URL/v1/messages" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openelm-450m-instruct",
    "messages": [
      {
        "role": "user",
        "content": "Say hello in a friendly way!"
      }
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }' | python3 -m json.tool
echo ""

# Example 4: Multi-turn Conversation
echo "Example 4: Multi-turn Conversation"
echo "-----------------------------------"
curl -s -X POST "$API_URL/v1/messages" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openelm-450m-instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is Python?"
      },
      {
        "role": "assistant",
        "content": "Python is a high-level programming language known for its simplicity and readability."
      },
      {
        "role": "user",
        "content": "What is it used for?"
      }
    ],
    "max_tokens": 150,
    "temperature": 0.5
  }' | python3 -m json.tool
echo ""

# Example 5: Using System Prompt
echo "Example 5: Using System Prompt"
echo "-------------------------------"
curl -s -X POST "$API_URL/v1/messages" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openelm-450m-instruct",
    "messages": [
      {
        "role": "user",
        "content": "Explain the concept simply."
      }
    ],
    "system": "You are a helpful tutor who explains things simply.",
    "max_tokens": 200,
    "temperature": 0.8
  }' | python3 -m json.tool
echo ""

# Example 6: Deterministic Generation (temperature=0)
echo "Example 6: Deterministic Generation"
echo "------------------------------------"
curl -s -X POST "$API_URL/v1/messages" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openelm-450m-instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "max_tokens": 50,
    "temperature": 0.0
  }' | python3 -m json.tool
echo ""

echo "=============================================="
echo "All curl examples completed!"
echo "=============================================="
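The same requests can be issued from Python with the `requests` library. A sketch mirroring Example 3 above (the `send` helper name is ours, and the base URL assumes a locally running server):

```python
# Payload mirrors the curl body of Example 3 above.
payload = {
    "model": "openelm-450m-instruct",
    "messages": [
        {"role": "user", "content": "Say hello in a friendly way!"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
}

def send(base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to /v1/messages and return the parsed JSON response."""
    import requests  # local import so the module loads without requests installed
    resp = requests.post(f"{base_url}/v1/messages", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()
```

Calling `send()` against a running instance returns the same Anthropic-style JSON the curl examples print.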
requirements.txt
CHANGED
@@ -1,2 +1,8 @@
 fastapi
 uvicorn
+torch
+transformers
+safetensors
+accelerate
+huggingface-hub
+requests