MiniMax Agent committed
Commit 9604400 · 1 Parent(s): 91bc5ae

Add Anthropic API compatible wrapper for OpenELM models

Files changed (6):
  1. Dockerfile +26 -3
  2. README.md +167 -3
  3. app.py +659 -5
  4. examples/anthropic_sdk_example.py +112 -0
  5. examples/curl_examples.sh +116 -0
  6. requirements.txt +6 -0
Dockerfile CHANGED
@@ -1,7 +1,13 @@
  # Read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker
  # you will also find guides on how best to write your Dockerfile
+ # OpenELM Anthropic API Compatible Wrapper
 
- FROM python:3.9
+ FROM python:3.10-slim
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
 
  RUN useradd -m -u 1000 user
  USER user
@@ -9,8 +15,25 @@ ENV PATH="/home/user/.local/bin:$PATH"
 
  WORKDIR /app
 
+ # Set environment variables for memory optimization
+ ENV PYTHONUNBUFFERED=1
+ ENV TRANSFORMERS_CACHE=/app/.cache
+ ENV HF_HOME=/app/.cache/huggingface
+ ENV HUGGINGFACE_HUB_CACHE=/app/.cache/huggingface
+
+ # Copy requirements first for better layer caching
  COPY --chown=user ./requirements.txt requirements.txt
- RUN pip install --no-cache-dir --upgrade -r requirements.txt
+
+ # Install Python dependencies
+ # (requirements.txt selects the PyTorch build; CUDA if available, otherwise CPU)
+ RUN pip install --no-cache-dir --upgrade pip wheel
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application code
  COPY --chown=user . /app
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
+
+ # Expose the API ports
+ EXPOSE 8000 7860
+
+ # Set default command with extended keep-alive timeout for model loading
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--timeout-keep-alive", "120"]
README.md CHANGED
@@ -1,10 +1,174 @@
  ---
- title: Agentic Api
- emoji:
+ title: OpenELM Anthropic API
+ emoji: 🤖
  colorFrom: blue
  colorTo: purple
  sdk: docker
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # OpenELM Anthropic API Compatible Wrapper
+
+ A FastAPI-based service that provides an Anthropic-compatible API for Apple's OpenELM models, allowing you to use the Anthropic SDK with OpenELM for text generation tasks.
+
+ ## Overview
+
+ This project exposes a REST API that mimics the Anthropic Messages API format, enabling developers to point existing Anthropic SDK code at OpenELM models with minimal modifications. The API supports streaming and non-streaming responses, multi-turn conversations, system prompts, and a range of generation parameters.
+
+ The OpenELM (Open Efficient Language Model) family from Apple uses a layer-wise scaling strategy to allocate parameters efficiently within each transformer layer, improving accuracy while maintaining computational efficiency. This wrapper makes these models accessible through a familiar API interface.
+
+ ## Features
+
+ The API provides comprehensive support for Anthropic-style message generation. First, it offers full Anthropic API compatibility, with endpoints that match the Anthropic Messages API structure, making it easy to integrate with existing codebases. Second, it supports streaming responses through Server-Sent Events (SSE), enabling real-time output display as tokens are generated. Third, it handles multi-turn conversations by maintaining conversation history and formatting prompts appropriately for OpenELM models.
+
+ Additionally, the wrapper handles system prompts by prepending them to the conversation context, which is essential for defining assistant behavior. Generation is configurable through temperature, top-p sampling, maximum tokens, and other settings, and each response includes token usage statistics in the Anthropic response format.
+
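The streaming behavior described above can be exercised with nothing but the Python standard library. A minimal client sketch, assuming the server is running locally on port 8000; the `parse_sse_event` helper and the function names are illustrative, not part of the service itself:

```python
import json
import urllib.request

def parse_sse_event(block: str):
    """Parse one SSE event block ('event: ...' / 'data: ...') into (event, data)."""
    event, data = None, None
    for line in block.splitlines():
        if line.startswith("event: "):
            event = line[len("event: "):]
        elif line.startswith("data: "):
            data = json.loads(line[len("data: "):])
    return event, data

def stream_message(prompt: str, url: str = "http://localhost:8000/v1/messages/stream"):
    """Yield text deltas from the streaming endpoint as they arrive."""
    payload = json.dumps({
        "model": "openelm-450m-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
        "stream": True,
    }).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        buffer = ""
        for raw in resp:  # SSE events are separated by a blank line
            line = raw.decode()
            if line.strip() == "":
                event, data = parse_sse_event(buffer)
                if event == "content_block_delta" and data:
                    yield data.get("text", "")
                buffer = ""
            else:
                buffer += line
```

A caller can then do `for chunk in stream_message("Hello!"): print(chunk, end="", flush=True)` to display output as it is generated.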
+ ## Quick Start
+
+ ### Using Docker (Recommended)
+
+ ```bash
+ # Build and run with Docker
+ docker build -t openelm-anthropic-api .
+ docker run -p 8000:8000 openelm-anthropic-api
+ ```
+
+ ### Local Development
+
+ ```bash
+ # Clone and install dependencies
+ pip install -r requirements.txt
+
+ # Start the server
+ python -m uvicorn app:app --host 0.0.0.0 --port 8000
+ ```
+
+ ### Test the API
+
+ ```bash
+ # Basic message generation
+ curl -X POST http://localhost:8000/v1/messages \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "openelm-450m-instruct",
+     "messages": [{"role": "user", "content": "Say hello!"}],
+     "max_tokens": 100
+   }'
+ ```
+
+ ## API Reference
+
+ ### Endpoints
+
+ | Method | Endpoint | Description |
+ |--------|----------|-------------|
+ | GET | / | API information |
+ | GET | /health | Health check |
+ | GET | /v1/models | List available models |
+ | POST | /v1/messages | Create message (non-streaming) |
+ | POST | /v1/messages/stream | Create message (streaming) |
+
+ ### Request Format
+
+ ```json
+ {
+   "model": "openelm-450m-instruct",
+   "messages": [
+     {"role": "user", "content": "Your prompt here"}
+   ],
+   "system": "Optional system prompt",
+   "max_tokens": 1024,
+   "temperature": 0.7,
+   "top_p": 0.9,
+   "stream": false
+ }
+ ```
+
+ ### Response Format
+
+ ```json
+ {
+   "id": "msg_abc123",
+   "type": "message",
+   "role": "assistant",
+   "content": [{"type": "text", "text": "Generated response"}],
+   "model": "openelm-450m-instruct",
+   "stop_reason": "end_turn",
+   "usage": {
+     "input_tokens": 10,
+     "output_tokens": 50
+   }
+ }
+ ```
+
+ ## Using with Anthropic SDK
+
+ ```python
+ from anthropic import Anthropic
+
+ # Point the SDK at your local API (the SDK appends /v1/messages itself)
+ client = Anthropic(
+     base_url="http://localhost:8000",
+     api_key="dummy"  # any string works; the wrapper does not check keys
+ )
+
+ # Use the same API you use with Claude!
+ response = client.messages.create(
+     model="openelm-450m-instruct",
+     messages=[{"role": "user", "content": "Hello!"}],
+     max_tokens=100
+ )
+
+ print(response.content[0].text)
+ ```
+
+ ## Model Information
+
+ - **Default Model**: apple/OpenELM-450M-Instruct
+ - **Parameters**: 450M
+ - **Context Window**: 2048 tokens
+ - **Weight Format**: Safetensors (secure and efficient)
+ - **Precision**: FP16 for lower memory use
+
+ ## Architecture
+
+ - **Framework**: FastAPI with async support
+ - **ML Backend**: PyTorch + HuggingFace Transformers
+ - **Model Loading**: Loaded once at startup and cached
+ - **Streaming**: Server-Sent Events (SSE)
+ - **Response Format**: Anthropic API compatible
+
+ ## Configuration
+
+ Environment variables can be used to customize the deployment:
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | PORT | 8000 | API server port |
+ | HF_HOME | ~/.cache/huggingface | Model cache directory |
+ | TRANSFORMERS_CACHE | ~/.cache/transformers | Transformers cache |
+
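For example, to move the API to a different port and persist the model cache across container restarts (paths and port here are illustrative; note that the image's default CMD pins port 8000, so the command is overridden as well):

```shell
docker run -p 9000:9000 \
  -e PORT=9000 \
  -e HF_HOME=/data/hf-cache \
  -v "$PWD/hf-cache:/data/hf-cache" \
  openelm-anthropic-api \
  uvicorn app:app --host 0.0.0.0 --port 9000
```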
+ ## Examples
+
+ See the `examples/` directory for complete usage examples:
+
+ - `anthropic_sdk_example.py` - Python SDK usage
+ - `curl_examples.sh` - Command-line examples
+
+ ## Troubleshooting
+
+ - **Model not loading**: Check internet connectivity for the Hugging Face download
+ - **Out of memory**: Reduce max_tokens or use CPU inference
+ - **Slow responses**: The first start downloads the model weights (subsequent requests are faster)
+ - **Port conflicts**: Change the PORT environment variable
+
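Since the model loads at startup, a deploy script can poll `/health` before routing traffic. A small sketch using only the standard library (the URL and timing values are assumptions for a local deployment):

```python
import json
import time
import urllib.request

def is_ready(payload: dict) -> bool:
    """Interpret the wrapper's /health response: ready only once the model is loaded."""
    return payload.get("status") == "healthy" and payload.get("model_loaded", False)

def wait_until_ready(url: str = "http://localhost:8000/health",
                     timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll the health endpoint until the model is loaded or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_ready(json.load(resp)):
                    return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(interval)
    return False
```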
+ ## License
+
+ This project is provided for educational and research purposes. The OpenELM models from Apple are released under their respective licenses. Please refer to the model card on Hugging Face for licensing information regarding the model weights.
+
+ ## Resources
+
+ - [OpenELM Model Card](https://huggingface.co/apple/OpenELM-450M-Instruct)
+ - [Anthropic API Documentation](https://docs.anthropic.com)
+ - [FastAPI Documentation](https://fastapi.tiangolo.com)
+ - [HuggingFace Transformers](https://huggingface.co/docs/transformers)
app.py CHANGED
@@ -1,7 +1,661 @@
- from fastapi import FastAPI
-
- app = FastAPI()
-
- @app.get("/")
- def greet_json():
-     return {"Hello": "World!"}
+ """
+ OpenELM Anthropic API Compatible Wrapper
+
+ This FastAPI application provides an Anthropic-compatible API for the OpenELM model,
+ allowing users to call OpenELM models using the Anthropic SDK with minimal code changes.
+ """
+
+ import json
+ import os
+ import sys
+ import time
+ import uuid
+ from contextlib import asynccontextmanager
+ from threading import Thread
+ from typing import AsyncIterator, List, Optional, Dict, Any
+
+ import torch
+ from fastapi import FastAPI, HTTPException, Request
+ from fastapi.responses import StreamingResponse
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel, Field
+ from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
+
+
+ # Global model and tokenizer references
+ model = None
+ tokenizer = None
+ model_id = "apple/OpenELM-450M-Instruct"
+
+
+ @asynccontextmanager
+ async def lifespan(app: FastAPI) -> AsyncIterator:
+     """Load model on startup and clean up on shutdown."""
+     global model, tokenizer
+
+     print("Loading OpenELM model...")
+     try:
+         # Load tokenizer
+         tokenizer = AutoTokenizer.from_pretrained(
+             model_id,
+             trust_remote_code=True
+         )
+
+         # Load model with safetensors support
+         model = AutoModelForCausalLM.from_pretrained(
+             model_id,
+             torch_dtype=torch.float16,
+             use_safetensors=True,
+             trust_remote_code=True,
+             device_map="auto" if torch.cuda.is_available() else None
+         )
+
+         model.eval()
+         print(f"Model {model_id} loaded successfully!")
+
+         # Print model info
+         if hasattr(model, 'config'):
+             print(f"Model config: hidden_size={getattr(model.config, 'hidden_size', 'N/A')}, "
+                   f"num_layers={getattr(model.config, 'num_layers', 'N/A')}, "
+                   f"num_attention_heads={getattr(model.config, 'num_attention_heads', 'N/A')}")
+
+     except Exception as e:
+         print(f"Error loading model: {e}", file=sys.stderr)
+         # Continue without the model so the health check can still respond
+         model = None
+         tokenizer = None
+
+     yield
+
+     # Cleanup
+     if model is not None:
+         del model
+     if tokenizer is not None:
+         del tokenizer
+     if torch.cuda.is_available():
+         torch.cuda.empty_cache()
+
+
+ # Create FastAPI app
+ app = FastAPI(
+     title="OpenELM Anthropic API",
+     description="Anthropic API compatible wrapper for OpenELM models",
+     version="1.0.0",
+     lifespan=lifespan
+ )
+
+ # Add CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+
+ # ==================== Pydantic Models ====================
+
+ class MessageContent(BaseModel):
+     """Content of a message."""
+     type: str = "text"
+     text: str
+
+
+ class Message(BaseModel):
+     """A message in the conversation."""
+     role: str
+     content: str | List[MessageContent]
+     name: Optional[str] = None
+
+
+ class Usage(BaseModel):
+     """Token usage statistics."""
+     input_tokens: int = 0
+     output_tokens: int = 0
+
+
+ class ContentBlock(BaseModel):
+     """Content block in the response."""
+     type: str = "text"
+     text: str
+
+
+ class MessageResponse(BaseModel):
+     """Response message format matching the Anthropic API."""
+     id: str
+     type: str = "message"
+     role: str = "assistant"
+     content: List[ContentBlock]
+     model: str
+     stop_reason: Optional[str] = None
+     stop_sequence: Optional[str] = None
+     usage: Usage
+
+
+ class MessageCreateParams(BaseModel):
+     """Parameters for creating a message (Anthropic API compatible)."""
+     model: str = "openelm-450m-instruct"
+     messages: List[Message]
+     system: Optional[str] = None
+     max_tokens: int = Field(default=1024, ge=1, le=4096)
+     temperature: Optional[float] = Field(default=None, ge=0.0, le=1.0)
+     top_p: Optional[float] = Field(default=None, ge=0.0, le=1.0)
+     top_k: Optional[int] = Field(default=None, ge=1)
+     stop_sequences: Optional[List[str]] = None
+     stream: Optional[bool] = False
+
+
+ class ModelInfo(BaseModel):
+     """Information about an available model."""
+     id: str
+     object: str = "model"
+     created: int = 0
+     owned_by: str = "openelm"
+
+
+ class ModelListResponse(BaseModel):
+     """List of available models."""
+     object: str = "list"
+     data: List[ModelInfo]
+
+
+ # ==================== Helper Functions ====================
+
+ def format_prompt_for_openelm(
+     messages: List[Message],
+     system: Optional[str] = None
+ ) -> str:
+     """
+     Format messages into a prompt suitable for OpenELM.
+
+     OpenELM uses raw text continuation, not ChatML. We convert the
+     conversation history into a script-like format.
+     """
+     prompt_parts = []
+
+     # Add system prompt first if provided
+     if system:
+         prompt_parts.append(f"[System: {system}]")
+
+     # Build conversation history
+     for msg in messages:
+         role = msg.role.lower()
+         content = msg.content
+
+         # Handle both string and list content formats
+         if isinstance(content, list):
+             text_parts = []
+             for block in content:
+                 if hasattr(block, 'text'):
+                     text_parts.append(block.text)
+                 elif isinstance(block, dict) and 'text' in block:
+                     text_parts.append(block['text'])
+             content = ''.join(text_parts)
+         elif not isinstance(content, str):
+             content = str(content)
+
+         # Format based on role
+         if role == "system":
+             prompt_parts.append(f"[System: {content}]")
+         elif role == "user":
+             prompt_parts.append(f"User: {content}")
+         elif role == "assistant":
+             prompt_parts.append(f"Assistant: {content}")
+         else:
+             prompt_parts.append(f"{role}: {content}")
+
+     # Add the final Assistant: prefix for completion
+     prompt_parts.append("Assistant:")
+
+     return "\n\n".join(prompt_parts)
+
+
+ def count_tokens(text: str) -> int:
+     """Count tokens (exact when the tokenizer is loaded, estimated otherwise)."""
+     if tokenizer:
+         return len(tokenizer.encode(text))
+     # Rough approximation: ~4 characters per token
+     return max(1, len(text) // 4)
+
+
+ def truncate_prompt(prompt: str, max_tokens: int, system: Optional[str] = None) -> str:
+     """Truncate the prompt to fit the context window, dropping the oldest turns first."""
+     if count_tokens(prompt) <= max_tokens:
+         return prompt
+
+     # Split into turns and remove from the beginning (keep system if present)
+     lines = prompt.split("\n\n")
+
+     # If a system line is present, keep it at the start
+     system_line = None
+     if lines and lines[0].startswith("[System:"):
+         system_line = lines[0]
+         lines = lines[1:]
+
+     # Keep the newest turns that still fit within the limit
+     truncated_lines = []
+     for line in reversed(lines):
+         candidate = [line] + truncated_lines
+         parts = ([system_line] if system_line else []) + candidate
+         if count_tokens("\n\n".join(parts)) > max_tokens:
+             break
+         truncated_lines = candidate
+
+     if system_line:
+         return "\n\n".join([system_line] + truncated_lines)
+     return "\n\n".join(truncated_lines)
+
+
+ def map_anthropic_params_to_transformers(
+     temperature: Optional[float],
+     top_p: Optional[float],
+     top_k: Optional[int],
+     max_tokens: int
+ ) -> Dict[str, Any]:
+     """Map Anthropic parameters to transformers generation parameters."""
+     params = {
+         "max_new_tokens": max_tokens,
+     }
+
+     if temperature is not None:
+         if temperature == 0:
+             params["do_sample"] = False
+         else:
+             params["temperature"] = temperature
+             params["do_sample"] = True
+
+     if top_p is not None:
+         params["top_p"] = top_p
+
+     if top_k is not None:
+         params["top_k"] = top_k
+
+     return params
+
+
+ # ==================== API Endpoints ====================
+
+ @app.get("/", tags=["Root"])
+ async def root():
+     """Root endpoint with API information."""
+     return {
+         "name": "OpenELM Anthropic API",
+         "version": "1.0.0",
+         "description": "Anthropic API compatible wrapper for OpenELM models",
+         "endpoints": {
+             "messages": "POST /v1/messages",
+             "models": "GET /v1/models",
+             "health": "GET /health"
+         }
+     }
+
+
+ @app.get("/health", tags=["Health"])
+ async def health_check():
+     """Health check endpoint."""
+     status = "healthy" if model is not None else "unhealthy"
+     return {
+         "status": status,
+         "model_loaded": model is not None,
+         "tokenizer_loaded": tokenizer is not None
+     }
+
+
+ @app.get("/v1/models", response_model=ModelListResponse, tags=["Models"])
+ async def list_models():
+     """List available models (Anthropic API compatible)."""
+     return ModelListResponse(
+         data=[
+             ModelInfo(
+                 id="openelm-450m-instruct",
+                 owned_by="apple",
+                 created=int(time.time())
+             )
+         ]
+     )
+
+
+ @app.post("/v1/messages", response_model=MessageResponse, tags=["Messages"])
+ async def create_message(
+     params: MessageCreateParams,
+     request: Request
+ ):
+     """
+     Create a message completion (Anthropic API compatible).
+
+     This endpoint accepts Anthropic-style messages and returns responses
+     in the same format, allowing existing code to work with OpenELM.
+     """
+     # Check if model is loaded
+     if model is None or tokenizer is None:
+         raise HTTPException(
+             status_code=503,
+             detail="Model not loaded. Please wait for model to initialize."
+         )
+
+     try:
+         # Flatten list-style content blocks into plain strings
+         formatted_messages = []
+         for msg in params.messages:
+             content = msg.content
+             if isinstance(content, list):
+                 content = "".join(
+                     block.text for block in content if hasattr(block, 'text')
+                 )
+             formatted_messages.append(Message(role=msg.role, content=content))
+
+         prompt = format_prompt_for_openelm(formatted_messages, params.system)
+
+         # Truncate if needed (OpenELM typically has a 2048-token context window)
+         max_context_tokens = 2048 - params.max_tokens
+         prompt = truncate_prompt(prompt, max_context_tokens, params.system)
+
+         # Tokenize input
+         inputs = tokenizer(prompt, return_tensors="pt")
+         input_tokens = len(inputs.input_ids[0])
+
+         # Move tensors to the same device as the model
+         if hasattr(model, 'device'):
+             inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+         # Map parameters
+         gen_params = map_anthropic_params_to_transformers(
+             params.temperature,
+             params.top_p,
+             params.top_k,
+             params.max_tokens
+         )
+
+         # Generate
+         with torch.no_grad():
+             outputs = model.generate(
+                 **inputs,
+                 **gen_params,
+                 pad_token_id=tokenizer.eos_token_id,
+                 eos_token_id=tokenizer.eos_token_id,
+             )
+
+         # Decode output
+         generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+         # Extract the assistant's response (everything after the last "Assistant:")
+         response_text = generated_text
+         if "Assistant:" in generated_text:
+             response_text = generated_text.split("Assistant:")[-1].strip()
+         elif ":" in generated_text:
+             # Fall back to scanning line by line for the assistant's turn
+             lines = generated_text.split("\n")
+             in_assistant = False
+             response_parts = []
+             for line in lines:
+                 if line.startswith("Assistant:"):
+                     in_assistant = True
+                     response_parts.append(line.replace("Assistant:", "").strip())
+                 elif in_assistant and not line.startswith("User:") and not line.startswith("System:"):
+                     response_parts.append(line)
+                 elif line.startswith("User:") or line.startswith("System:"):
+                     in_assistant = False
+             response_text = "\n".join(response_parts).strip()
+
+         output_tokens = count_tokens(response_text)
+
+         # Build response matching the Anthropic format
+         response_id = f"msg_{uuid.uuid4().hex[:8]}"
+
+         return MessageResponse(
+             id=response_id,
+             role="assistant",
+             content=[ContentBlock(type="text", text=response_text)],
+             model="openelm-450m-instruct",
+             stop_reason="end_turn",
+             usage=Usage(
+                 input_tokens=input_tokens,
+                 output_tokens=output_tokens
+             )
+         )
+
+     except Exception as e:
+         raise HTTPException(
+             status_code=500,
+             detail=f"Generation failed: {str(e)}"
+         )
+
+
+ @app.post("/v1/messages/stream", tags=["Messages"])
+ async def create_message_stream(
+     params: MessageCreateParams,
+     request: Request
+ ):
+     """
+     Create a streaming message completion (Anthropic API compatible).
+
+     Returns Server-Sent Events (SSE) with the streaming response.
+     """
+     # Check if model is loaded
+     if model is None or tokenizer is None:
+         raise HTTPException(
+             status_code=503,
+             detail="Model not loaded. Please wait for model to initialize."
+         )
+
+     if not params.stream:
+         raise HTTPException(
+             status_code=400,
+             detail="Stream parameter must be true for streaming endpoint"
+         )
+
+     async def generate_stream():
+         """Generate the SSE event stream."""
+         try:
+             # Flatten list-style content blocks into plain strings
+             formatted_messages = []
+             for msg in params.messages:
+                 content = msg.content
+                 if isinstance(content, list):
+                     content = "".join(
+                         block.text for block in content if hasattr(block, 'text')
+                     )
+                 formatted_messages.append(Message(role=msg.role, content=content))
+
+             prompt = format_prompt_for_openelm(formatted_messages, params.system)
+
+             # Truncate if needed
+             max_context_tokens = 2048 - params.max_tokens
+             prompt = truncate_prompt(prompt, max_context_tokens, params.system)
+
+             # Tokenize
+             inputs = tokenizer(prompt, return_tensors="pt")
+             input_tokens = len(inputs.input_ids[0])
+
+             # Move tensors to the same device as the model
+             if hasattr(model, 'device'):
+                 inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+             # Map parameters
+             gen_params = map_anthropic_params_to_transformers(
+                 params.temperature,
+                 params.top_p,
+                 params.top_k,
+                 params.max_tokens
+             )
+
+             # Use TextIteratorStreamer for streaming
+             streamer = TextIteratorStreamer(
+                 tokenizer,
+                 skip_prompt=True,
+                 skip_special_tokens=True
+             )
+             gen_params["streamer"] = streamer
+
+             # Run generation in a separate thread
+             def generate():
+                 with torch.no_grad():
+                     model.generate(
+                         **inputs,
+                         **gen_params,
+                         pad_token_id=tokenizer.eos_token_id,
+                         eos_token_id=tokenizer.eos_token_id,
+                     )
+
+             thread = Thread(target=generate)
+             thread.start()
+
+             # Send message_start event
+             message_id = f"msg_{uuid.uuid4().hex[:8]}"
+             start = MessageResponse(
+                 id=message_id,
+                 model="openelm-450m-instruct",
+                 content=[],
+                 usage=Usage()
+             )
+             yield f"event: message_start\ndata: {start.model_dump_json()}\n\n"
+
+             # Send content_block_start event
+             yield f"event: content_block_start\ndata: {json.dumps({'type': 'text', 'text': ''})}\n\n"
+
+             # Stream the generated text (json.dumps keeps each payload valid JSON)
+             full_text = ""
+             for text in streamer:
+                 full_text += text
+                 yield f"event: content_block_delta\ndata: {json.dumps({'type': 'text_delta', 'text': text})}\n\n"
+
+             # Send content_block_stop event
+             yield f"event: content_block_stop\ndata: {json.dumps({'type': 'content_block', 'text': ''})}\n\n"
+
+             # Calculate usage
+             output_tokens = count_tokens(full_text)
+
+             # Send message_delta event with final usage
+             delta = {
+                 "delta": {"stop_reason": "end_turn"},
+                 "usage": {"input_tokens": input_tokens, "output_tokens": output_tokens}
+             }
+             yield f"event: message_delta\ndata: {json.dumps(delta)}\n\n"
+
+             # Send message_stop event
+             yield "event: message_stop\ndata: {}\n\n"
+
+             thread.join()
+
+         except Exception as e:
+             yield f"event: error\ndata: {json.dumps({'error': str(e)})}\n\n"
+
+     return StreamingResponse(
+         generate_stream(),
+         media_type="text/event-stream",
+         headers={
+             "Cache-Control": "no-cache",
+             "Connection": "keep-alive",
+             "X-Accel-Buffering": "no",
+         }
+     )
+
+
+ # ==================== Anthropic SDK Compatibility ====================
+
+ class AnthropicClient:
+     """
+     Simple Anthropic-SDK-style client for testing.
+
+     Usage:
+         client = AnthropicClient(base_url="http://localhost:8000", api_key="dummy")
+         response = client.messages().create(
+             model="openelm-450m-instruct",
+             messages=[{"role": "user", "content": "Hello!"}],
+             max_tokens=100
+         )
+     """
+
+     def __init__(self, base_url: str = "http://localhost:8000", api_key: str = "dummy"):
+         self.base_url = base_url.rstrip("/")
+         self.api_key = api_key
+         self.session = None
+
+     def _get_session(self):
+         """Get or create a requests session."""
+         import requests
+         if self.session is None:
+             self.session = requests.Session()
+             self.session.headers.update({
+                 "Authorization": f"Bearer {self.api_key}",
+                 "Content-Type": "application/json"
+             })
+         return self.session
+
+     def messages(self) -> "MessageResource":
+         """Access message operations."""
+         return MessageResource(self)
+
+
+ class MessageResource:
+     """Resource for message operations."""
+
+     def __init__(self, client: AnthropicClient):
+         self.client = client
+
+     def create(
+         self,
+         model: str,
+         messages: List[Dict[str, str]],
+         system: Optional[str] = None,
+         max_tokens: int = 1024,
+         temperature: Optional[float] = None,
+         top_p: Optional[float] = None,
+         stream: bool = False
+     ) -> Dict[str, Any]:
+         """Create a message."""
+         url = f"{self.client.base_url}/v1/messages"
+         if stream:
+             url = f"{self.client.base_url}/v1/messages/stream"
+
+         payload = {
+             "model": model,
+             "messages": messages,
+             "max_tokens": max_tokens,
+         }
+
+         if system:
+             payload["system"] = system
+         if temperature is not None:
+             payload["temperature"] = temperature
+         if top_p is not None:
+             payload["top_p"] = top_p
+         if stream:
+             payload["stream"] = True
+
+         response = self.client._get_session().post(url, json=payload)
+
+         if response.status_code != 200:
+             raise Exception(f"API request failed: {response.text}")
+
+         return response.json()
+
+
+ # ==================== Main Entry Point ====================
+
+ if __name__ == "__main__":
+     import uvicorn
+
+     # Get port from environment or use default
+     port = int(os.environ.get("PORT", 8000))
+
+     # Run the server
+     uvicorn.run(
+         "app:app",
+         host="0.0.0.0",
+         port=port,
+         reload=False,
+         workers=1
+     )
examples/anthropic_sdk_example.py ADDED
@@ -0,0 +1,112 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+"""
+Example: Using Anthropic SDK with OpenELM API
+
+This example demonstrates how to use the Anthropic SDK (or compatible client)
+to call OpenELM models through our Anthropic API compatible wrapper.
+
+Usage:
+    python examples/anthropic_sdk_example.py
+"""
+
+import sys
+import os
+
+# Add parent directory to path for imports
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+from app import AnthropicClient
+
+
+def main():
+    """Example usage of the Anthropic-compatible OpenELM API."""
+
+    # Create client pointing to our local API
+    base_url = os.environ.get("OPENELM_API_URL", "http://localhost:8000")
+    client = AnthropicClient(base_url=base_url, api_key="dummy-key")
+
+    print("=" * 60)
+    print("OpenELM Anthropic API - Usage Example")
+    print("=" * 60)
+    print(f"API URL: {base_url}")
+    print()
+
+    # Example 1: Basic message generation
+    print("Example 1: Basic Message Generation")
+    print("-" * 40)
+
+    response = client.messages().create(
+        model="openelm-450m-instruct",
+        messages=[
+            {"role": "user", "content": "Say hello in a friendly way!"}
+        ],
+        max_tokens=100,
+        temperature=0.7
+    )
+
+    print(f"Response ID: {response['id']}")
+    print(f"Model: {response['model']}")
+    print(f"Content: {response['content'][0]['text']}")
+    print(f"Usage: {response['usage']}")
+    print()
+
+    # Example 2: Multi-turn conversation
+    print("Example 2: Multi-turn Conversation")
+    print("-" * 40)
+
+    response = client.messages().create(
+        model="openelm-450m-instruct",
+        messages=[
+            {"role": "user", "content": "What is artificial intelligence?"},
+            {"role": "assistant", "content": "Artificial intelligence, or AI, refers to systems that can perform tasks that typically require human intelligence."},
+            {"role": "user", "content": "Can you give me some examples?"}
+        ],
+        max_tokens=150,
+        temperature=0.5
+    )
+
+    print(f"Content: {response['content'][0]['text']}")
+    print(f"Usage: {response['usage']}")
+    print()
+
+    # Example 3: Using system prompt
+    print("Example 3: Using System Prompt")
+    print("-" * 40)
+
+    response = client.messages().create(
+        model="openelm-450m-instruct",
+        messages=[
+            {"role": "user", "content": "Explain quantum computing simply."}
+        ],
+        system="You are a helpful science educator who explains complex topics simply.",
+        max_tokens=200,
+        temperature=0.8
+    )
+
+    print(f"Content: {response['content'][0]['text']}")
+    print(f"Usage: {response['usage']}")
+    print()
+
+    # Example 4: Deterministic generation (temperature=0)
+    print("Example 4: Deterministic Generation (temperature=0)")
+    print("-" * 40)
+
+    response = client.messages().create(
+        model="openelm-450m-instruct",
+        messages=[
+            {"role": "user", "content": "What is 2 + 2?"}
+        ],
+        max_tokens=50,
+        temperature=0.0  # Deterministic output
+    )
+
+    print(f"Content: {response['content'][0]['text']}")
+    print(f"Usage: {response['usage']}")
+    print()
+
+    print("=" * 60)
+    print("All examples completed successfully!")
+    print("=" * 60)
+
+
+if __name__ == "__main__":
+    main()
examples/curl_examples.sh ADDED
@@ -0,0 +1,116 @@
+#!/bin/bash
+# OpenELM Anthropic API - Curl Examples
+#
+# This script demonstrates how to call the OpenELM Anthropic API
+# using curl commands directly.
+#
+# Usage:
+#   chmod +x examples/curl_examples.sh
+#   ./examples/curl_examples.sh
+
+# Set API base URL (default: localhost:8000)
+API_URL="${OPENELM_API_URL:-http://localhost:8000}"
+API_URL="${API_URL%/}"  # Remove trailing slash
+
+echo "=============================================="
+echo "OpenELM Anthropic API - Curl Examples"
+echo "=============================================="
+echo "API URL: $API_URL"
+echo ""
+
+# Example 1: Health Check
+echo "Example 1: Health Check"
+echo "------------------------"
+curl -s "$API_URL/health" | python3 -m json.tool
+echo ""
+
+# Example 2: List Available Models
+echo "Example 2: List Available Models"
+echo "---------------------------------"
+curl -s "$API_URL/v1/models" | python3 -m json.tool
+echo ""
+
+# Example 3: Basic Message Generation
+echo "Example 3: Basic Message Generation"
+echo "------------------------------------"
+curl -s -X POST "$API_URL/v1/messages" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openelm-450m-instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": "Say hello in a friendly way!"
+      }
+    ],
+    "max_tokens": 100,
+    "temperature": 0.7
+  }' | python3 -m json.tool
+echo ""
+
+# Example 4: Multi-turn Conversation
+echo "Example 4: Multi-turn Conversation"
+echo "-----------------------------------"
+curl -s -X POST "$API_URL/v1/messages" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openelm-450m-instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": "What is Python?"
+      },
+      {
+        "role": "assistant",
+        "content": "Python is a high-level programming language known for its simplicity and readability."
+      },
+      {
+        "role": "user",
+        "content": "What is it used for?"
+      }
+    ],
+    "max_tokens": 150,
+    "temperature": 0.5
+  }' | python3 -m json.tool
+echo ""
+
+# Example 5: Using System Prompt
+echo "Example 5: Using System Prompt"
+echo "-------------------------------"
+curl -s -X POST "$API_URL/v1/messages" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openelm-450m-instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": "Explain the concept simply."
+      }
+    ],
+    "system": "You are a helpful tutor who explains things simply.",
+    "max_tokens": 200,
+    "temperature": 0.8
+  }' | python3 -m json.tool
+echo ""
+
+# Example 6: Deterministic Generation (temperature=0)
+echo "Example 6: Deterministic Generation"
+echo "------------------------------------"
+curl -s -X POST "$API_URL/v1/messages" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openelm-450m-instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": "What is the capital of France?"
+      }
+    ],
+    "max_tokens": 50,
+    "temperature": 0.0
+  }' | python3 -m json.tool
+echo ""
+
+echo "=============================================="
+echo "All curl examples completed!"
+echo "=============================================="
requirements.txt CHANGED
@@ -1,2 +1,8 @@
 fastapi
 uvicorn
+torch
+transformers
+safetensors
+accelerate
+huggingface-hub
+requests