Rajhuggingface4253 committed on
Commit
1c8f4c0
·
verified ·
1 Parent(s): 9a9ee27

Upload 4 files

Files changed (4)
  1. Dockerfile +45 -0
  2. app.py +1190 -0
  3. config.py +65 -0
  4. requirements.txt +34 -0
Dockerfile ADDED
@@ -0,0 +1,45 @@
+ # LFM2.5-VL Vision FastAPI Backend - Dockerfile
+ # Optimized for CPU deployment with Q8 quantization
+
+ FROM python:3.11-slim
+
+ # Install minimal dependencies
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Create non-root user
+ RUN useradd -m -u 1000 appuser
+
+ # Set working directory
+ WORKDIR /app
+
+ # Copy requirements first to leverage Docker layer caching
+ COPY requirements.txt .
+
+ # Install Python dependencies with CPU-only PyTorch wheels
+ RUN pip install --no-cache-dir --upgrade pip && \
+     pip install --no-cache-dir --extra-index-url https://download.pytorch.org/whl/cpu -r requirements.txt
+
+ # Copy application code
+ COPY --chown=appuser:appuser app.py config.py ./
+
+ # Switch to non-root user
+ USER appuser
+
+ # Environment variables (no trailing backslash: a dangling line
+ # continuation here would make the ENV instruction invalid)
+ ENV PYTHONUNBUFFERED=1 \
+     LFMVL_HOST=0.0.0.0 \
+     LFMVL_PORT=7860
+
+ # Expose port
+ EXPOSE 7860
+
+ # Health check (long start period: model download/load can take minutes)
+ HEALTHCHECK --interval=30s --timeout=30s --start-period=300s --retries=3 \
+     CMD curl -f http://localhost:7860/health || exit 1
+
+ # Run
+ CMD ["python", "app.py"]
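The container above serves an OpenAI-style vision API on port 7860. As a client-side sketch, the request body for the `VisionCompletionRequest` schema defined in app.py can be built like this (the image bytes and field values here are illustrative assumptions, not taken from the repository):

```python
import base64
import json

# Hypothetical client payload for the server exposed on port 7860.
# The base64 content stands in for real JPEG bytes.
image_b64 = base64.b64encode(b"<jpeg bytes>").decode()
payload = {
    "model": "lfm-vision",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image", "image_url": f"data:image/jpeg;base64,{image_b64}"},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    "max_tokens": 256,
    "stream": False,
}
body = json.dumps(payload)
```

The `content` list mirrors the `ImageContent`/`TextContent` Pydantic models in app.py: an image entry carrying a data URL, followed by the text prompt.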
app.py ADDED
@@ -0,0 +1,1190 @@
+ """
+ LFM2.5-VL FastAPI Backend - ONNX Runtime Edition
+ =================================================
+ Production-ready FastAPI backend for the LiquidAI LFM2.5-VL-1.6B vision-language model.
+ Uses the official ONNX model with Q8 quantization for fast CPU inference.
+
+ Features:
+ - ONNX Runtime for fast CPU inference (no GPU required)
+ - Q8 quantization with ~95% accuracy retention
+ - Multi-format image support (JPEG, PNG, GIF, WebP, BMP)
+ - Streaming SSE responses
+ - OpenAI-compatible API
+ - Optimized for HuggingFace Spaces (2 vCPU, 16GB RAM)
+ """
+
+ import asyncio
+ import base64
+ import io
+ import json
+ import logging
+ import time
+ import uuid
+ import threading
+ from contextlib import asynccontextmanager
+ from typing import AsyncGenerator, Dict, List, Optional, Union
+ from pathlib import Path
+
+ import numpy as np
+ import onnxruntime as ort
+ from fastapi import FastAPI, HTTPException, Request, UploadFile, File
+ from fastapi.middleware.cors import CORSMiddleware
+ from fastapi.responses import JSONResponse
+ from huggingface_hub import hf_hub_download, list_repo_files
+ from pydantic import BaseModel, Field
+ from sse_starlette.sse import EventSourceResponse
+ from transformers import AutoImageProcessor, PreTrainedTokenizerFast
+ from PIL import Image
+ import aiohttp
+
+ from config import settings
+
+ # Configure logging
+ logging.basicConfig(
+     level=getattr(logging, settings.log_level.upper()),
+     format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+ )
+ logger = logging.getLogger(__name__)
+
+
+ # ==============================================================================
+ # Pydantic Models for OpenAI-compatible API
+ # ==============================================================================
+
+ class ImageContent(BaseModel):
+     type: str = "image"
+     image_url: Optional[str] = None  # data:image/jpeg;base64,... or URL
+
+
+ class TextContent(BaseModel):
+     type: str = "text"
+     text: str
+
+
+ class VisionMessage(BaseModel):
+     role: str = Field(..., description="Role: 'system', 'user', or 'assistant'")
+     content: Union[str, List[Union[ImageContent, TextContent, dict]]] = Field(..., description="Message content")
+
+
+ class VisionCompletionRequest(BaseModel):
+     model: str = Field(default="lfm-vision", description="Model identifier")
+     messages: List[VisionMessage] = Field(..., description="Conversation messages")
+     temperature: Optional[float] = Field(default=None, ge=0.0, le=2.0)
+     top_p: Optional[float] = Field(default=None, ge=0.0, le=1.0)
+     top_k: Optional[int] = Field(default=None, ge=0)
+     max_tokens: Optional[int] = Field(default=None, ge=1)
+     stream: bool = Field(default=False, description="Enable streaming response")
+     stop: Optional[Union[str, List[str]]] = Field(default=None)
+
+
+ class ChatMessage(BaseModel):
+     role: str = Field(..., description="Role: 'system', 'user', or 'assistant'")
+     content: str = Field(..., description="Message content")
+
+
+ class ChatCompletionRequest(BaseModel):
+     model: str = Field(default="lfm-vision", description="Model identifier")
+     messages: List[ChatMessage] = Field(..., description="Conversation messages")
+     temperature: Optional[float] = Field(default=None, ge=0.0, le=2.0)
+     top_p: Optional[float] = Field(default=None, ge=0.0, le=1.0)
+     top_k: Optional[int] = Field(default=None, ge=0)
+     max_tokens: Optional[int] = Field(default=None, ge=1)
+     stream: bool = Field(default=False, description="Enable streaming response")
+
+
+ class ChatCompletionChoice(BaseModel):
+     index: int
+     message: ChatMessage
+     finish_reason: Optional[str] = None
+
+
+ class ChatCompletionResponse(BaseModel):
+     id: str
+     object: str = "chat.completion"
+     created: int
+     model: str
+     choices: List[ChatCompletionChoice]
+     usage: Dict[str, int]
+
+
+ class ModelInfo(BaseModel):
+     id: str
+     object: str = "model"
+     created: int
+     owned_by: str = "liquid-ai"
+
+
+ class ModelListResponse(BaseModel):
+     object: str = "list"
+     data: List[ModelInfo]
+
+
+ # ==============================================================================
+ # ONNX Vision Model Manager
+ # ==============================================================================
+
+ # Map ONNX tensor type strings to numpy dtypes
+ ONNX_DTYPE = {
+     "tensor(float)": np.float32,
+     "tensor(float16)": np.float16,
+     "tensor(int64)": np.int64,
+ }
+
+
+ class Lfm2VlProcessorWrapper:
+     """
+     Custom processor wrapper that combines ImageProcessor + Tokenizer.
+     This bypasses the AutoProcessor tokenizer auto-detection bug in LFM models.
+     """
+
+     def __init__(self, image_processor, tokenizer):
+         self.image_processor = image_processor
+         self.tokenizer = tokenizer
+
+     def apply_chat_template(self, messages, add_generation_prompt=True, tokenize=False, **kwargs):
+         """
+         Apply the chat template for the vision-language model.
+         Converts the vision message format [{"type": "image"}, {"type": "text", "text": "..."}]
+         to text with <image> placeholders, as expected by the tokenizer.
+         """
+         # Transform vision messages to plain-text messages
+         text_messages = []
+         for msg in messages:
+             role = msg.get("role", "user") if isinstance(msg, dict) else getattr(msg, "role", "user")
+             content = msg.get("content", "") if isinstance(msg, dict) else getattr(msg, "content", "")
+
+             if isinstance(content, list):
+                 # Vision message format: [{"type": "image"}, {"type": "text", "text": "..."}]
+                 text_parts = []
+                 for item in content:
+                     if isinstance(item, dict):
+                         item_type = item.get("type", "")
+                         if item_type == "image":
+                             text_parts.append("<image>")
+                         elif item_type == "text":
+                             text_parts.append(item.get("text", ""))
+                     else:
+                         text_parts.append(str(item))
+                 content = "".join(text_parts)
+
+             text_messages.append({"role": role, "content": content})
+
+         return self.tokenizer.apply_chat_template(
+             text_messages,
+             add_generation_prompt=add_generation_prompt,
+             tokenize=tokenize,
+             **kwargs
+         )
+
+     def __call__(self, images=None, text=None, **kwargs):
+         """
+         Process images and text for the vision-language model.
+
+         CRITICAL: The vision encoder produces N image embeddings (e.g., 256 for a 512x512 image).
+         Each embedding needs its own <image> token position in input_ids.
+
+         This method:
+         1. Processes images FIRST to determine N (the number of image tokens)
+         2. Expands each <image> placeholder in the text to N consecutive <image> tokens
+         3. Tokenizes the expanded text
+
+         Returns a dict with pixel_values, input_ids, attention_mask, etc.
+         """
+         result = {}
+         return_tensors = kwargs.pop('return_tensors', None)
+         num_image_tokens = 0
+
+         # Step 1: Process images FIRST to get the number of image tokens
+         if images is not None:
+             image_outputs = self.image_processor(images=images, return_tensors=return_tensors)
+             result.update(image_outputs)
+
+             # Calculate the number of image tokens from the pixel_values shape.
+             # pixel_values shape: [batch, num_patches, hidden_dim]
+             # The MLP projector in LFM2.5-VL reduces patches by a factor of 4.
+             # Reference: https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B
+             if 'pixel_values' in image_outputs:
+                 pv = image_outputs['pixel_values']
+                 num_patches = pv.shape[1] if hasattr(pv, 'shape') else pv.size(1)
+                 # MLP projector reduces by a factor of 4: 1024 patches → 256 tokens
+                 num_image_tokens = num_patches // 4
+                 logger.debug(f"Image processing: {num_patches} patches → {num_image_tokens} image tokens")
+
+         # Step 2: Expand <image> placeholder(s) to match the token count
+         if text is not None:
+             # Ensure text is a string
+             if isinstance(text, list):
+                 text = text[0] if len(text) == 1 else " ".join(text)
+
+             # Expand each <image> placeholder to its share of the image tokens
+             if num_image_tokens > 0 and "<image>" in text:
+                 # Count existing <image> placeholders
+                 image_count = text.count("<image>")
+                 # Each placeholder represents one image; expand to tokens_per_image copies
+                 tokens_per_image = num_image_tokens // image_count if image_count > 0 else num_image_tokens
+                 expanded_image = "<image>" * tokens_per_image
+                 text = text.replace("<image>", expanded_image)
+                 logger.debug(f"Expanded {image_count} <image> placeholder(s) to {tokens_per_image} tokens each")
+
+             text_outputs = self.tokenizer(
+                 text,
+                 return_tensors=return_tensors,
+                 padding=kwargs.get('padding', False),
+                 truncation=kwargs.get('truncation', False),
+                 max_length=kwargs.get('max_length', None)
+             )
+             result.update(text_outputs)
+
+         return result
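The expansion step in `__call__` above is plain string arithmetic. A standalone sketch shows the intended behavior; the patch count and template markers here are illustrative assumptions, not values read from the model:

```python
# Sketch of the <image> placeholder expansion performed by
# Lfm2VlProcessorWrapper.__call__ (patch count is illustrative).
def expand_image_placeholders(text: str, num_patches: int) -> str:
    num_image_tokens = num_patches // 4  # MLP projector reduces patches 4x
    image_count = text.count("<image>")
    if num_image_tokens == 0 or image_count == 0:
        return text
    tokens_per_image = num_image_tokens // image_count
    return text.replace("<image>", "<image>" * tokens_per_image)

prompt = "user: <image> Describe this picture."
expanded = expand_image_placeholders(prompt, num_patches=1024)
# 1024 patches -> 256 image tokens, all replacing the single placeholder
assert expanded.count("<image>") == 256
```

Each resulting `<image>` token position in `input_ids` is later overwritten with one row of the vision encoder's output.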
+
+
+ class ONNXVisionModelManager:
+     """Manages the ONNX vision-language model with 3 sessions: embed_tokens, embed_images, decoder."""
+
+     def __init__(self):
+         self._embed_tokens = None
+         self._embed_images = None
+         self._decoder = None
+         self._processor = None
+         self._cache_template = None
+         self._lock = threading.Lock()
+
+     @property
+     def is_loaded(self) -> bool:
+         return all([self._embed_tokens, self._embed_images, self._decoder])
+
+     def download_models(self) -> Dict[str, str]:
+         """Download the ONNX model files from HuggingFace."""
+         model_id = settings.model_id
+         encoder_var = settings.encoder_variant
+         decoder_var = settings.decoder_variant
+
+         logger.info(f"Downloading model: {model_id}")
+         logger.info(f"  Encoder variant: {encoder_var}")
+         logger.info(f"  Decoder variant: {decoder_var}")
+
+         paths = {}
+
+         # Download embed_tokens (fp16 weights whenever a reduced-precision encoder is selected)
+         embed_suffix = "_fp16" if encoder_var in ["fp16", "q8", "q4"] else ""
+         paths["embed_tokens"] = hf_hub_download(model_id, f"onnx/embed_tokens{embed_suffix}.onnx")
+
+         # Download embed_images (vision encoder)
+         img_suffix = f"_{encoder_var}" if encoder_var != "fp32" else ""
+         paths["embed_images"] = hf_hub_download(model_id, f"onnx/embed_images{img_suffix}.onnx")
+
+         # Download decoder
+         dec_suffix = f"_{decoder_var}" if decoder_var != "fp32" else ""
+         paths["decoder"] = hf_hub_download(model_id, f"onnx/decoder{dec_suffix}.onnx")
+
+         # Download external-data files, using exact prefix matching so we do not
+         # pull weights for variants we did not select (e.g. decoder_q8.onnx_data
+         # but not decoder.onnx_data)
+         expected_prefixes = [
+             f"onnx/embed_tokens{embed_suffix}.onnx_data",
+             f"onnx/embed_images{img_suffix}.onnx_data",
+             f"onnx/decoder{dec_suffix}.onnx_data"
+         ]
+
+         for f in list_repo_files(model_id):
+             if f.startswith("onnx/") and ".onnx_data" in f:
+                 # Prefix matching also handles split files such as
+                 # decoder_q8.onnx_data, decoder_q8.onnx_data_1, etc.
+                 if any(f.startswith(prefix) for prefix in expected_prefixes):
+                     logger.info(f"Downloading: {f}")
+                     hf_hub_download(model_id, f)
+
+         return paths
+
+     def load_model(self) -> None:
+         """Load the ONNX sessions and processor."""
+         with self._lock:
+             if self.is_loaded:
+                 return
+
+             logger.info("=" * 60)
+             logger.info("Loading LFM2.5-VL-1.6B Vision-Language ONNX model...")
+             logger.info(f"Model: {settings.model_id}")
+             logger.info(f"Encoder: {settings.encoder_variant} (Q8 = ~95% accuracy)")
+             logger.info(f"Decoder: {settings.decoder_variant}")
+             logger.info("=" * 60)
+
+             start_time = time.time()
+
+             # Download models
+             paths = self.download_models()
+
+             # Configure ONNX Runtime for CPU
+             sess_options = ort.SessionOptions()
+             sess_options.intra_op_num_threads = settings.num_threads
+             sess_options.inter_op_num_threads = settings.num_threads
+             sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
+
+             # Load ONNX sessions
+             self._embed_tokens = ort.InferenceSession(
+                 paths["embed_tokens"],
+                 sess_options=sess_options,
+                 providers=['CPUExecutionProvider']
+             )
+
+             self._embed_images = ort.InferenceSession(
+                 paths["embed_images"],
+                 sess_options=sess_options,
+                 providers=['CPUExecutionProvider']
+             )
+
+             self._decoder = ort.InferenceSession(
+                 paths["decoder"],
+                 sess_options=sess_options,
+                 providers=['CPUExecutionProvider']
+             )
+
+             # Load processor components separately to bypass the TokenizersBackend bug:
+             # LFM models incorrectly specify TokenizersBackend as tokenizer_class
+             logger.info("Loading image processor...")
+             image_processor = AutoImageProcessor.from_pretrained(
+                 settings.model_id,
+                 trust_remote_code=True
+             )
+
+             logger.info("Loading tokenizer with PreTrainedTokenizerFast...")
+             tokenizer = PreTrainedTokenizerFast.from_pretrained(
+                 settings.model_id,
+                 trust_remote_code=True
+             )
+
+             # Create our custom processor wrapper
+             self._processor = Lfm2VlProcessorWrapper(
+                 image_processor=image_processor,
+                 tokenizer=tokenizer
+             )
+             logger.info(f"✓ Processor created: {type(self._processor).__name__}")
+
+             # Initialize the KV cache template for the decoder
+             self._init_cache_template()
+
+             load_time = time.time() - start_time
+             logger.info("=" * 60)
+             logger.info(f"✓ Model loaded in {load_time:.2f}s")
+             logger.info(f"  Threads: {settings.num_threads}")
+             logger.info("  Provider: CPU")
+             logger.info("=" * 60)
+
+     def _init_cache_template(self) -> None:
+         """Initialize the KV cache template for the decoder."""
+         self._cache_template = {}
+         for inp in self._decoder.get_inputs():
+             if inp.name in {"inputs_embeds", "attention_mask", "position_ids"}:
+                 continue
+
+             # Concrete dims keep their size; symbolic dims default to 1,
+             # except sequence dims, which start out empty (length 0)
+             shape = [d if isinstance(d, int) else 1 for d in inp.shape]
+             for i, d in enumerate(inp.shape):
+                 if isinstance(d, str) and "sequence" in d.lower():
+                     shape[i] = 0
+
+             dtype = ONNX_DTYPE.get(inp.type, np.float32)
+             self._cache_template[inp.name] = (shape, dtype)
+
+     def _create_empty_cache(self) -> Dict[str, np.ndarray]:
+         """Create a new empty KV cache."""
+         return {
+             name: np.zeros(shape, dtype=dtype)
+             for name, (shape, dtype) in self._cache_template.items()
+         }
+
+     @property
+     def processor(self):
+         if self._processor is None:
+             raise RuntimeError("Processor not loaded")
+         return self._processor
+
+     def process_image(self, image: Image.Image) -> Image.Image:
+         """Normalize an image to RGB before it is handed to the processor."""
+         if image.mode != "RGB":
+             image = image.convert("RGB")
+         return image
+
+     def generate(
+         self,
+         images: List[Image.Image],
+         messages: List[dict],
+         max_tokens: int = 512,
+         temperature: float = 0.1,
+         top_k: int = 50,
+         top_p: float = 0.1,
+         stop_tokens: Optional[List[int]] = None
+     ) -> List[int]:
+         """Generate tokens using the ONNX vision model."""
+         tokenizer = self._processor.tokenizer
+
+         if stop_tokens is None:
+             stop_tokens = [tokenizer.eos_token_id]
+
+         # Process inputs through the processor
+         prompt = self._processor.apply_chat_template(messages, add_generation_prompt=True)
+         inputs = self._processor(
+             images=images if images else None,
+             text=prompt,
+             return_tensors="pt"
+         )
+
+         # Convert to numpy with correct dtypes
+         input_ids = inputs["input_ids"].numpy().astype(np.int64)
+
+         # Get token embeddings
+         token_outputs = self._embed_tokens.run(None, {"input_ids": input_ids})
+         token_embeds = token_outputs[0]
+
+         # Process images if present
+         if images and "pixel_values" in inputs:
+             pixel_values = inputs["pixel_values"].numpy().astype(np.float32)
+             pixel_attention_mask = inputs.get("pixel_attention_mask", None)
+             spatial_shapes = inputs.get("spatial_shapes", None)
+
+             image_feed = {"pixel_values": pixel_values}
+             if pixel_attention_mask is not None:
+                 image_feed["pixel_attention_mask"] = pixel_attention_mask.numpy().astype(np.int64)
+             if spatial_shapes is not None:
+                 image_feed["spatial_shapes"] = spatial_shapes.numpy().astype(np.int64)
+
+             image_outputs = self._embed_images.run(None, image_feed)
+             image_embeds = image_outputs[0]
+
+             # Replace <image> token embeddings with image embeddings
+             image_token_id = tokenizer.convert_tokens_to_ids("<image>")
+             image_positions = np.where(input_ids[0] == image_token_id)[0]
+             for i, pos in enumerate(image_positions):
+                 if i < len(image_embeds):
+                     token_embeds[0, pos] = image_embeds[i]
+
+         # Initialize the KV cache
+         cache = self._create_empty_cache()
+         seq_len = token_embeds.shape[1]
+         generated_tokens = []
+
+         for step in range(max_tokens):
+             if step == 0:
+                 embeds = token_embeds.astype(np.float32)
+             else:
+                 last_token = np.array([[generated_tokens[-1]]], dtype=np.int64)
+                 embeds = self._embed_tokens.run(None, {"input_ids": last_token})[0].astype(np.float32)
+
+             attn_mask = np.ones((1, seq_len + len(generated_tokens)), dtype=np.int64)
+
+             feed = {"inputs_embeds": embeds, "attention_mask": attn_mask, **cache}
+             outputs = self._decoder.run(None, feed)
+
+             # Get logits for the last position
+             logits = outputs[0][0, -1]
+
+             if temperature > 0:
+                 logits = logits / temperature
+
+                 # Apply top-k
+                 if top_k > 0:
+                     indices_to_remove = np.argsort(logits)[:-top_k]
+                     logits[indices_to_remove] = -np.inf
+
+                 # Apply top-p (nucleus sampling)
+                 if top_p < 1.0:
+                     sorted_indices = np.argsort(logits)[::-1]
+                     sorted_logits = logits[sorted_indices]
+                     probs = np.exp(sorted_logits - np.max(sorted_logits))
+                     probs = probs / probs.sum()
+                     cumulative_probs = np.cumsum(probs)
+                     sorted_indices_to_remove = cumulative_probs > top_p
+                     # Shift right so the token that crosses the threshold is kept
+                     sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].copy()
+                     sorted_indices_to_remove[0] = False
+                     indices_to_remove = sorted_indices[sorted_indices_to_remove]
+                     logits[indices_to_remove] = -np.inf
+
+                 # Sample from the filtered distribution
+                 probs = np.exp(logits - np.max(logits))
+                 probs = probs / probs.sum()
+                 next_token = int(np.random.choice(len(probs), p=probs))
+             else:
+                 # Greedy decoding when temperature is 0
+                 next_token = int(np.argmax(logits))
+
+             generated_tokens.append(next_token)
+
+             # Update the KV cache from the decoder outputs
+             for i, out in enumerate(self._decoder.get_outputs()[1:], 1):
+                 name = out.name.replace("present_conv", "past_conv").replace("present.", "past_key_values.")
+                 if name in cache:
+                     cache[name] = outputs[i]
+
+             if next_token in stop_tokens:
+                 break
+
+         return generated_tokens
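The nucleus (top-p) filter used in the sampling loop above can be exercised in isolation. This sketch applies the same shift-right masking to a toy logit vector (the probabilities and threshold are illustrative):

```python
import numpy as np

def top_p_filter(logits: np.ndarray, top_p: float) -> np.ndarray:
    """Mask logits outside the smallest set whose cumulative probability exceeds top_p."""
    logits = logits.astype(np.float64).copy()
    sorted_indices = np.argsort(logits)[::-1]          # descending by logit
    sorted_logits = logits[sorted_indices]
    probs = np.exp(sorted_logits - sorted_logits.max())
    probs /= probs.sum()
    remove = np.cumsum(probs) > top_p
    remove[1:] = remove[:-1].copy()  # shift right: keep the token that crosses the threshold
    remove[0] = False                # always keep the most likely token
    logits[sorted_indices[remove]] = -np.inf
    return logits

logits = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
filtered = top_p_filter(logits, top_p=0.75)
# Cumulative mass 0.5, 0.8, ...: the first two tokens are kept,
# the tail (0.15 and 0.05) is masked to -inf.
```

After filtering, renormalizing `exp` of the surviving logits and sampling from the result reproduces the behavior of the loop above.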
+
+     def generate_stream(
+         self,
+         images: List[Image.Image],
+         messages: List[dict],
+         max_tokens: int = 2000,
+         temperature: float = 0.1,
+         top_k: int = 50,
+         top_p: float = 0.1,
+         stop_tokens: Optional[List[int]] = None
+     ):
+         """Streaming token generation for the vision model."""
+         tokenizer = self._processor.tokenizer
+
+         if stop_tokens is None:
+             stop_tokens = [tokenizer.eos_token_id]
+
+         # Process inputs through the processor
+         prompt = self._processor.apply_chat_template(messages, add_generation_prompt=True)
+         inputs = self._processor(
+             images=images if images else None,
+             text=prompt,
+             return_tensors="pt"
+         )
+
+         # Convert to numpy with correct dtypes
+         input_ids = inputs["input_ids"].numpy().astype(np.int64)
+
+         # Get token embeddings
+         token_outputs = self._embed_tokens.run(None, {"input_ids": input_ids})
+         token_embeds = token_outputs[0]
+
+         # Process images if present
+         if images and "pixel_values" in inputs:
+             pixel_values = inputs["pixel_values"].numpy().astype(np.float32)
+             pixel_attention_mask = inputs.get("pixel_attention_mask", None)
+             spatial_shapes = inputs.get("spatial_shapes", None)
+
+             image_feed = {"pixel_values": pixel_values}
+             if pixel_attention_mask is not None:
+                 image_feed["pixel_attention_mask"] = pixel_attention_mask.numpy().astype(np.int64)
+             if spatial_shapes is not None:
+                 image_feed["spatial_shapes"] = spatial_shapes.numpy().astype(np.int64)
+
+             image_outputs = self._embed_images.run(None, image_feed)
+             image_embeds = image_outputs[0]
+
+             # Replace <image> token embeddings with image embeddings
+             image_token_id = tokenizer.convert_tokens_to_ids("<image>")
+             image_positions = np.where(input_ids[0] == image_token_id)[0]
+             for i, pos in enumerate(image_positions):
+                 if i < len(image_embeds):
+                     token_embeds[0, pos] = image_embeds[i]
+
+         # Initialize the KV cache
+         cache = self._create_empty_cache()
+         seq_len = token_embeds.shape[1]
+         generated_tokens = []
+
+         # Pre-allocate the attention mask at its maximum possible length
+         max_possible_len = seq_len + max_tokens
+         attn_mask = np.ones((1, max_possible_len), dtype=np.int64)
+
+         # Pre-compute sampling flags
+         use_temp = temperature > 0
+         use_top_k = top_k > 0
+         use_top_p = top_p < 1.0
+
+         feed = {}
+
+         for step in range(max_tokens):
+             current_len = seq_len + step
+
+             if step == 0:
+                 embeds = token_embeds.astype(np.float32)
+             else:
+                 last_token = np.array([[generated_tokens[-1]]], dtype=np.int64)
+                 embeds = self._embed_tokens.run(None, {"input_ids": last_token})[0].astype(np.float32)
+
+             # Update the feed dict in place
+             feed.clear()
+             feed["inputs_embeds"] = embeds
+             feed["attention_mask"] = attn_mask[:, :current_len]
+             feed.update(cache)
+
+             # Inference
+             outputs = self._decoder.run(None, feed)
+             logits = outputs[0][0, -1]
+
+             # Sampling
+             if use_temp:
+                 logits /= temperature
+
+                 if use_top_k and top_k < len(logits):
+                     top_k_idx = np.argpartition(logits, -top_k)[-top_k:]
+                     mask = np.ones(logits.shape, dtype=bool)
+                     mask[top_k_idx] = False
+                     logits[mask] = -np.inf
+
+                 if use_top_p:
+                     valid_mask = logits > -np.inf
+                     if valid_mask.any():
+                         valid_logits = logits[valid_mask]
+                         valid_indices = np.where(valid_mask)[0]
+
+                         sorted_indices = np.argsort(valid_logits)[::-1]
+                         sorted_logits = valid_logits[sorted_indices]
+
+                         exp_logits = np.exp(sorted_logits - np.max(sorted_logits))
+                         probs = exp_logits / exp_logits.sum()
+
+                         cumulative = np.cumsum(probs)
+                         cutoff = np.searchsorted(cumulative, top_p)
+                         cutoff = min(cutoff + 1, len(sorted_logits))
+
+                         accepted_indices = sorted_indices[:cutoff]
+                         accepted_probs = probs[:cutoff]
+                         accepted_probs /= accepted_probs.sum()
+
+                         sample_idx = np.searchsorted(np.cumsum(accepted_probs), np.random.rand())
+                         next_token = int(valid_indices[accepted_indices[sample_idx]])
+                     else:
+                         next_token = int(np.argmax(logits))
+                 else:
+                     valid_mask = logits > -np.inf
+                     valid_logits = logits[valid_mask]
+                     valid_indices = np.where(valid_mask)[0]
+                     exp_logits = np.exp(valid_logits - np.max(valid_logits))
+                     probs = exp_logits / exp_logits.sum()
+                     sample_idx = np.searchsorted(np.cumsum(probs), np.random.rand())
+                     next_token = int(valid_indices[sample_idx])
+             else:
+                 next_token = int(np.argmax(logits))
+
+             generated_tokens.append(next_token)
+             yield next_token
+
+             if next_token in stop_tokens:
+                 break
+
+             # Update the KV cache from the decoder outputs
+             for i, out in enumerate(self._decoder.get_outputs()[1:], 1):
+                 name = out.name.replace("present_conv", "past_conv").replace("present.", "past_key_values.")
+                 if name in cache:
+                     cache[name] = outputs[i]
+
+     def unload(self) -> None:
+         """Unload the models from memory."""
+         with self._lock:
+             if self._embed_tokens is not None:
+                 del self._embed_tokens
+                 del self._embed_images
+                 del self._decoder
+                 del self._processor
+                 self._embed_tokens = None
+                 self._embed_images = None
+                 self._decoder = None
+                 self._processor = None
+                 logger.info("Models unloaded")
+
+
+ # Global model manager
+ model_manager = ONNXVisionModelManager()
+
+
+ # ==============================================================================
+ # Image Processing Utilities
+ # ==============================================================================
+
+ def resize_image_for_model(image: Image.Image, max_dim: int = 512) -> Image.Image:
+     """
+     Resize an image to a maximum dimension while preserving aspect ratio.
+     Uses LANCZOS (highest-quality) resampling for the best visual fidelity.
+
+     This optimization ensures:
+     - Consistent processing time (~3-4s) regardless of input size
+     - Single-patch processing (256 tokens) instead of tiling
+     - Reduced memory usage
+
+     Args:
+         image: PIL Image to resize
+         max_dim: Maximum dimension (width or height), default 512
+
+     Returns:
+         Resized PIL Image (or the original if it is already small enough)
+     """
+     width, height = image.size
+
+     # Skip if already small enough
+     if width <= max_dim and height <= max_dim:
+         logger.debug(f"Image {width}x{height} already within {max_dim}px limit")
+         return image
+
+     # Calculate new dimensions (preserve aspect ratio)
+     ratio = min(max_dim / width, max_dim / height)
+     new_width = int(width * ratio)
+     new_height = int(height * ratio)
+
+     logger.info(f"Resizing image: {width}x{height} → {new_width}x{new_height} (LANCZOS)")
+
+     # Resize with the high-quality LANCZOS filter
+     return image.resize((new_width, new_height), Image.Resampling.LANCZOS)
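The resize logic above never upscales and always preserves aspect ratio. The dimension math can be checked in isolation with a pure-Python mirror of the calculation (the helper name and sample sizes are illustrative):

```python
def fit_within(width: int, height: int, max_dim: int = 512) -> tuple:
    """Illustrative copy of resize_image_for_model's dimension math."""
    if width <= max_dim and height <= max_dim:
        return width, height  # already small enough: no resize
    ratio = min(max_dim / width, max_dim / height)
    return int(width * ratio), int(height * ratio)

assert fit_within(2048, 1536) == (512, 384)   # landscape: width capped at 512
assert fit_within(300, 200) == (300, 200)     # small image returned unchanged
assert fit_within(1024, 4096) == (128, 512)   # tall image: height capped at 512
```

Taking the minimum of the two ratios guarantees that the larger side lands exactly on `max_dim` while the other side shrinks proportionally.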
723
+
724
+ async def load_image_from_url(url: str) -> Image.Image:
725
+ """Load image from URL, convert to RGB, and resize for optimal processing."""
726
+ async with aiohttp.ClientSession() as session:
727
+ async with session.get(url) as response:
728
+ if response.status != 200:
729
+ raise HTTPException(status_code=400, detail=f"Failed to fetch image from URL: {url}")
730
+ data = await response.read()
731
+ image = Image.open(io.BytesIO(data))
732
+ # Convert to RGB to ensure consistent channel format
733
+ if image.mode != 'RGB':
734
+ image = image.convert('RGB')
735
+ # Resize for optimal model processing (max 512x512)
736
+ image = resize_image_for_model(image)
737
+ return image
738
+
739
+
740
+ def load_image_from_base64(data_url: str) -> Image.Image:
741
+ """Load image from base64 data URL, convert to RGB, and resize for optimal processing."""
742
+ # Format: data:image/jpeg;base64,/9j/4AAQ...
743
+ if "," in data_url:
744
+ header, encoded = data_url.split(",", 1)
745
+ else:
746
+ encoded = data_url
747
+
748
+ image_data = base64.b64decode(encoded)
749
+ image = Image.open(io.BytesIO(image_data))
750
+ # Convert to RGB to ensure consistent channel format
751
+ if image.mode != 'RGB':
752
+ image = image.convert('RGB')
753
+ # Resize for optimal model processing (max 512x512)
754
+ image = resize_image_for_model(image)
755
+ return image
756
+
757
+
758
+ async def process_image_content(content: Union[ImageContent, dict]) -> Optional[Image.Image]:
+     """Process image content from request."""
+     if isinstance(content, dict):
+         content = ImageContent(**content)
+
+     if content.type != "image":
+         return None
+
+     if not content.image_url:
+         return None
+
+     url = content.image_url
+
+     # Check if it's a base64 data URL
+     if url.startswith("data:"):
+         return load_image_from_base64(url)
+     else:
+         # It's a regular URL
+         return await load_image_from_url(url)
+
+
+ # ==============================================================================
+ # Application Lifecycle
+ # ==============================================================================
+
+ @asynccontextmanager
+ async def lifespan(app: FastAPI):
+     """Application lifespan handler."""
+     logger.info("Starting LFM2.5-VL Vision API Server (ONNX Runtime)...")
+
+     loop = asyncio.get_running_loop()
+     await loop.run_in_executor(None, model_manager.load_model)
+
+     yield
+
+     logger.info("Shutting down...")
+     model_manager.unload()
+
+
+ # ==============================================================================
+ # FastAPI Application
+ # ==============================================================================
+
+ app = FastAPI(
+     title=settings.app_name,
+     description="Fast CPU inference for LiquidAI LFM2.5-VL-1.6B Vision-Language model using ONNX Runtime",
+     version=settings.app_version,
+     lifespan=lifespan,
+     docs_url="/docs",
+     redoc_url="/redoc",
+ )
+
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=False,
+     allow_methods=["*"],
+     allow_headers=["*"],
+     expose_headers=["*"],
+ )
+
+
+ @app.middleware("http")
+ async def add_cors_for_null_origin(request: Request, call_next):
+     """Handle CORS for null origin (when HTML is opened from file://)."""
+     origin = request.headers.get("origin", "")
+     response = await call_next(request)
+
+     if origin == "null" or not origin:
+         response.headers["Access-Control-Allow-Origin"] = "*"
+         response.headers["Access-Control-Allow-Methods"] = "GET, POST, PUT, DELETE, OPTIONS"
+         response.headers["Access-Control-Allow-Headers"] = "*"
+         response.headers["Access-Control-Expose-Headers"] = "*"
+
+     return response
+
+
+ # ==============================================================================
+ # Helper Functions
+ # ==============================================================================
+
+ def generate_id() -> str:
+     return f"chatcmpl-{uuid.uuid4().hex[:12]}"
+
+
+ async def extract_images_and_text(messages: List[VisionMessage]) -> tuple[List[Image.Image], List[dict]]:
+     """Extract images and convert messages to processor format."""
+     images = []
+     processed_messages = []
+
+     for msg in messages:
+         if isinstance(msg.content, str):
+             # Simple text message
+             processed_messages.append({
+                 "role": msg.role,
+                 "content": msg.content
+             })
+         else:
+             # Mixed content (images + text)
+             content_parts = []
+             for item in msg.content:
+                 if isinstance(item, dict):
+                     item_type = item.get("type", "")
+                 else:
+                     item_type = item.type
+
+                 if item_type == "image":
+                     image = await process_image_content(item)
+                     if image:
+                         images.append(image)
+                         content_parts.append({"type": "image"})
+                 elif item_type == "text":
+                     text = item.get("text", "") if isinstance(item, dict) else item.text
+                     content_parts.append({"type": "text", "text": text})
+
+             processed_messages.append({
+                 "role": msg.role,
+                 "content": content_parts
+             })
+
+     return images, processed_messages
+
+
+ async def stream_vision_completion(request: VisionCompletionRequest) -> AsyncGenerator[str, None]:
+     """Streaming vision completion."""
+     request_id = generate_id()
+     created = int(time.time())
+
+     loop = asyncio.get_running_loop()
+     async_queue = asyncio.Queue()
+
+     # Extract images and process messages
+     images, processed_messages = await extract_images_and_text(request.messages)
+
+     tokenizer = model_manager.processor.tokenizer
+
+     # Config
+     max_tokens = request.max_tokens or settings.max_tokens
+     temperature = request.temperature if request.temperature is not None else settings.temperature
+     top_k = request.top_k if request.top_k is not None else settings.top_k
+     top_p = request.top_p if request.top_p is not None else settings.top_p
+
+     # Prepare stop tokens (note: multi-token stop strings are approximated
+     # by their first token only)
+     stop_tokens = [tokenizer.eos_token_id]
+     if request.stop:
+         if isinstance(request.stop, str):
+             encoded = tokenizer.encode(request.stop, add_special_tokens=False)
+             if encoded:
+                 stop_tokens.append(encoded[0])
+         elif isinstance(request.stop, list):
+             for stop_str in request.stop:
+                 encoded = tokenizer.encode(stop_str, add_special_tokens=False)
+                 if encoded:
+                     stop_tokens.append(encoded[0])
+
+     def generate_tokens():
+         try:
+             for token in model_manager.generate_stream(
+                 images,
+                 processed_messages,
+                 max_tokens=max_tokens,
+                 temperature=temperature,
+                 top_k=top_k,
+                 top_p=top_p,
+                 stop_tokens=stop_tokens
+             ):
+                 loop.call_soon_threadsafe(async_queue.put_nowait, ("token", token))
+         except Exception as e:
+             logger.error(f"Stream generation error: {e}")
+             loop.call_soon_threadsafe(async_queue.put_nowait, ("error", str(e)))
+         finally:
+             loop.call_soon_threadsafe(async_queue.put_nowait, ("done", None))
+
+     threading.Thread(target=generate_tokens, daemon=True).start()
+
+     try:
+         while True:
+             msg_type, data = await async_queue.get()
+
+             if msg_type == "token":
+                 text = tokenizer.decode([data], skip_special_tokens=True)
+                 if text:
+                     chunk = {
+                         "id": request_id,
+                         "object": "chat.completion.chunk",
+                         "created": created,
+                         "model": request.model,
+                         "choices": [{
+                             "index": 0,
+                             "delta": {"content": text},
+                             "finish_reason": None
+                         }]
+                     }
+                     yield {"data": json.dumps(chunk)}
+
+             elif msg_type == "done":
+                 final = {
+                     "id": request_id,
+                     "object": "chat.completion.chunk",
+                     "created": created,
+                     "model": request.model,
+                     "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]
+                 }
+                 yield {"data": json.dumps(final)}
+                 yield {"data": "[DONE]"}
+                 break
+
+             elif msg_type == "error":
+                 logger.error(f"Stream error: {data}")
+                 yield {"data": json.dumps({"error": {"message": data}})}
+                 break
+
+     except asyncio.CancelledError:
+         logger.info(f"Stream cancelled for request {request_id[:8]}")
+         raise
+     except Exception as e:
+         logger.error(f"Streaming error: {e}")
+         yield {"data": json.dumps({"error": {"message": str(e)}})}
+
+
+ # ==============================================================================
+ # API Endpoints
+ # ==============================================================================
+
+ @app.get("/", response_class=JSONResponse)
+ async def health_check():
+     """Health check with model status."""
+     return {
+         "status": "ready" if model_manager.is_loaded else "loading",
+         "model": {
+             "id": settings.model_id,
+             "encoder_variant": settings.encoder_variant,
+             "decoder_variant": settings.decoder_variant,
+             "loaded": model_manager.is_loaded,
+             "backend": "ONNX Runtime",
+             "type": "vision-language"
+         },
+         "server": {
+             "name": settings.app_name,
+             "version": settings.app_version,
+             "port": settings.port
+         },
+         "supported_formats": settings.supported_formats
+     }
+
+
+ @app.get("/health")
+ async def health():
+     if not model_manager.is_loaded:
+         raise HTTPException(status_code=503, detail="Model not loaded")
+     return {"status": "healthy"}
+
+
+ @app.get("/v1/models", response_model=ModelListResponse)
+ async def list_models():
+     return ModelListResponse(
+         data=[
+             ModelInfo(id="lfm-vision", created=int(time.time())),
+             ModelInfo(id="lfm-2.5-vl-1.6b-onnx", created=int(time.time()))
+         ]
+     )
+
+
+ @app.post("/v1/vision/completions")
+ async def vision_completions(request: VisionCompletionRequest):
+     """Vision-language completion with image support."""
+     if not model_manager.is_loaded:
+         raise HTTPException(status_code=503, detail="Model not loaded")
+
+     if request.stream:
+         return EventSourceResponse(
+             stream_vision_completion(request),
+             media_type="text/event-stream",
+             ping=30000,
+             ping_message_factory=lambda: '{"type": "ping"}'
+         )
+
+     try:
+         # Extract images and process messages
+         images, processed_messages = await extract_images_and_text(request.messages)
+
+         tokenizer = model_manager.processor.tokenizer
+
+         max_tokens = request.max_tokens or settings.max_tokens
+         temperature = request.temperature if request.temperature is not None else settings.temperature
+         top_k = request.top_k if request.top_k is not None else settings.top_k
+         top_p = request.top_p if request.top_p is not None else settings.top_p
+
+         start_time = time.time()
+
+         # Run blocking generation off the event loop
+         loop = asyncio.get_running_loop()
+         tokens = await loop.run_in_executor(
+             None,
+             lambda: model_manager.generate(
+                 images,
+                 processed_messages,
+                 max_tokens=max_tokens,
+                 temperature=temperature,
+                 top_k=top_k,
+                 top_p=top_p
+             )
+         )
+
+         response_text = tokenizer.decode(tokens, skip_special_tokens=True)
+         gen_time = time.time() - start_time
+
+         logger.debug(f"Generated {len(tokens)} tokens in {gen_time:.2f}s")
+
+         return ChatCompletionResponse(
+             id=generate_id(),
+             created=int(time.time()),
+             model=request.model,
+             choices=[
+                 ChatCompletionChoice(
+                     index=0,
+                     message=ChatMessage(role="assistant", content=response_text),
+                     finish_reason="stop"
+                 )
+             ],
+             usage={
+                 "prompt_tokens": 0,  # Would need to track input tokens
+                 "completion_tokens": len(tokens),
+                 "total_tokens": len(tokens)
+             }
+         )
+
+     except HTTPException:
+         # Preserve client errors (e.g. 400 from image fetch) instead of masking them as 500
+         raise
+     except Exception as e:
+         logger.error(f"Vision completion error: {e}")
+         raise HTTPException(status_code=500, detail=str(e))
+
+
+ @app.post("/v1/chat/completions")
+ async def chat_completions(request: ChatCompletionRequest):
+     """Text-only chat completion (for compatibility)."""
+     if not model_manager.is_loaded:
+         raise HTTPException(status_code=503, detail="Model not loaded")
+
+     # Convert to vision request format (no images)
+     vision_messages = [
+         VisionMessage(role=m.role, content=m.content)
+         for m in request.messages
+     ]
+
+     vision_request = VisionCompletionRequest(
+         model=request.model,
+         messages=vision_messages,
+         temperature=request.temperature,
+         top_p=request.top_p,
+         top_k=request.top_k,
+         max_tokens=request.max_tokens,
+         stream=request.stream
+     )
+
+     return await vision_completions(vision_request)
+
+
+ @app.post("/v1/vision/upload")
+ async def upload_image(
+     file: UploadFile = File(...),
+     prompt: str = "What is in this image?"
+ ):
+     """Direct image upload endpoint."""
+     if not model_manager.is_loaded:
+         raise HTTPException(status_code=503, detail="Model not loaded")
+
+     # Validate file type
+     content_type = file.content_type or ""
+     file_ext = Path(file.filename or "").suffix.lower().lstrip(".")
+
+     if file_ext not in settings.supported_formats and not any(fmt in content_type for fmt in settings.supported_formats):
+         raise HTTPException(
+             status_code=400,
+             detail=f"Unsupported image format. Supported: {settings.supported_formats}"
+         )
+
+     # Read and process image
+     contents = await file.read()
+     if len(contents) > settings.max_image_size_mb * 1024 * 1024:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Image too large. Max size: {settings.max_image_size_mb}MB"
+         )
+
+     try:
+         image = Image.open(io.BytesIO(contents))
+         # Match the URL/base64 paths: normalize to RGB and resize
+         if image.mode != 'RGB':
+             image = image.convert('RGB')
+         image = resize_image_for_model(image)
+     except Exception as e:
+         raise HTTPException(status_code=400, detail=f"Invalid image: {e}")
+
+     # Create request
+     messages = [{
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": prompt}
+         ]
+     }]
+
+     tokenizer = model_manager.processor.tokenizer
+
+     # Run blocking generation off the event loop
+     loop = asyncio.get_running_loop()
+     tokens = await loop.run_in_executor(
+         None,
+         lambda: model_manager.generate(
+             [image],
+             messages,
+             max_tokens=settings.max_tokens,
+             temperature=settings.temperature,
+             top_k=settings.top_k,
+             top_p=settings.top_p
+         )
+     )
+
+     response_text = tokenizer.decode(tokens, skip_special_tokens=True)
+
+     return {
+         "id": generate_id(),
+         "model": "lfm-vision",
+         "response": response_text
+     }
+
+
+ # ==============================================================================
+ # Run Server
+ # ==============================================================================
+
+ if __name__ == "__main__":
+     import uvicorn
+
+     logger.info(f"Starting server on {settings.host}:{settings.port}")
+
+     uvicorn.run(
+         "app:app",
+         host=settings.host,
+         port=settings.port,
+         reload=False,
+         log_level=settings.log_level
+     )
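The streaming endpoint above emits OpenAI-style `chat.completion.chunk` events over SSE, terminated by a `[DONE]` sentinel. A minimal client-side accumulation sketch, assuming that chunk shape (the sample lines below are hand-written to match it; `accumulate_sse` is illustrative, not part of the app):

```python
import json

def accumulate_sse(lines):
    """Collect delta content from OpenAI-style SSE 'data:' lines until [DONE]."""
    text = []
    for line in lines:
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if "content" in delta:
                text.append(delta["content"])
    return "".join(text)

# Sample stream mirroring the chunk format produced by stream_vision_completion
sample = [
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}, "finish_reason": null}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}, "finish_reason": null}]}',
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}',
    'data: [DONE]',
]
print(accumulate_sse(sample))  # → Hello
```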
config.py ADDED
@@ -0,0 +1,65 @@
+ """
+ Configuration for LFM2.5-VL Vision-Language FastAPI Backend.
+ Optimized for CPU deployment (2 vCPU, 16GB RAM).
+ Uses ONNX Runtime with Q4 quantization for efficient inference.
+ """
+
+ from functools import lru_cache
+ from typing import List
+
+ from pydantic_settings import BaseSettings
+
+
+ class Settings(BaseSettings):
+     """Application settings optimized for CPU deployment."""
+
+     # Application metadata
+     app_name: str = "LFM2.5-VL Vision API"
+     app_version: str = "1.0.0"
+
+     # Model settings - using the official ONNX model with Q4 for speed
+     model_id: str = "LiquidAI/LFM2.5-VL-1.6B-ONNX"
+     encoder_variant: str = "q4"  # Options: q4, q8, fp16 (q4 = fastest)
+     decoder_variant: str = "q4"  # Options: q4, q8, fp16 (q4 = fastest)
+
+     # Server settings (same as lfm-text for consistency)
+     host: str = "0.0.0.0"
+     port: int = 7860
+
+     # CORS settings
+     cors_origins: List[str] = ["*"]
+
+     # Vision processing settings
+     min_image_tokens: int = 64
+     max_image_tokens: int = 256
+     do_image_splitting: bool = True
+
+     # Supported image formats
+     supported_formats: List[str] = ["jpeg", "jpg", "png", "gif", "webp", "bmp"]
+     max_image_size_mb: int = 10
+
+     # Generation defaults (from LiquidAI recommendations)
+     temperature: float = 0.1
+     top_k: int = 50
+     top_p: float = 0.5
+     min_p: float = 0.15
+     max_tokens: int = 1024
+     repetition_penalty: float = 1.05
+
+     # CPU optimization
+     num_threads: int = 2  # Optimized for 2 vCPU
+
+     # Logging
+     log_level: str = "info"
+
+     class Config:
+         env_prefix = "LFMVL_"
+
+
+ @lru_cache()
+ def get_settings() -> Settings:
+     """Get cached settings."""
+     return Settings()
+
+
+ settings = get_settings()
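Because `env_prefix = "LFMVL_"` (and pydantic-settings env lookup is case-insensitive by default), every field above can be overridden from the environment by upper-casing the field name and adding the prefix. The values here are illustrative:

```shell
# Override config.py defaults via LFMVL_-prefixed environment variables
export LFMVL_PORT=7860
export LFMVL_ENCODER_VARIANT=q8   # switch from the q4 default to q8
export LFMVL_LOG_LEVEL=debug
```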
requirements.txt ADDED
@@ -0,0 +1,34 @@
+ # FastAPI LFM2.5-VL Vision Backend Dependencies
+ # CPU-friendly with ONNX Runtime and Q4 quantization
+
+ # Web Framework
+ fastapi>=0.109.0
+ uvicorn[standard]>=0.27.0
+
+ # Server-Sent Events for Streaming
+ sse-starlette>=2.0.0
+
+ # ONNX Runtime for fast CPU inference
+ onnxruntime>=1.17.0
+
+ # Transformers for processor (image + text processing)
+ transformers>=4.40.0
+ huggingface-hub>=0.21.0
+ tokenizers>=0.19.0
+ sentencepiece>=0.1.99
+
+ # Configuration
+ pydantic-settings>=2.1.0
+
+ # Utilities
+ python-multipart>=0.0.9
+ numpy>=1.24.0
+
+ # Image processing
+ pillow>=10.0.0
+ aiofiles>=23.0.0
+ aiohttp>=3.9.0
+
+ # PyTorch CPU-only (required for processor/tokenizer compatibility)
+ --extra-index-url https://download.pytorch.org/whl/cpu
+ torch
+ torchvision
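For reference, a client exercises `/v1/vision/completions` by sending an OpenAI-style message whose content mixes an `image` part (base64 data URL in `image_url`, as handled by `load_image_from_base64`) with a `text` part. This sketch only builds the JSON body; the image bytes are a placeholder and `build_vision_payload` is illustrative, not part of the app:

```python
import base64
import json

def build_vision_payload(image_bytes: bytes, prompt: str, model: str = "lfm-vision") -> dict:
    """Build a request body for /v1/vision/completions with a base64 data URL."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "image_url": data_url},
                {"type": "text", "text": prompt},
            ],
        }],
        "stream": False,
    }

# POST this as JSON to http://localhost:7860/v1/vision/completions
payload = build_vision_payload(b"\xff\xd8\xff", "What is in this image?")
body = json.dumps(payload)
```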