KinetoLabs Claude Opus 4.5 committed
Commit 333c083 · 1 Parent(s): 3b08f11

Replace 30B MoE with dual 8B models (Thinking + Instruct)


Architecture change:
- Vision: Qwen3-VL-30B-A3B-Instruct → dual Qwen3-VL-8B-Thinking + 8B-Instruct
- Two-stage pipeline: Thinking (deep analysis) → Instruct (JSON formatting)
- VRAM: 90GB → 68GB (~22GB savings, 20GB headroom on 4xL4)

Key changes:
- models/real.py: New DualVisionModel with token-based </think> parsing
- config/settings.py: Dual model paths (vision_model_thinking, vision_model_instruct)
- config/inference.py: ThinkingInferenceConfig (temp=0.6, max_tokens=32768)
- Removed all lazy loading code (load_vision/unload_vision/load_rag)
- All 4 models now load simultaneously at startup

Per Qwen3-VL GitHub recommended hyperparameters for thinking models.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
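The token-based `</think>` parsing called out above splits the Thinking model's raw output IDs at the last `</think>` token instead of string-matching decoded text. A minimal sketch of just that split logic, with an invented token ID (the real ID is looked up from the processor's tokenizer, as in models/real.py below):

```python
# Toy illustration of the token-based split used in DualVisionModel.
# THINK_END is a made-up ID for "</think>"; production code asks the
# tokenizer for the real one rather than hardcoding it.
THINK_END = 99

output_ids = [5, 7, 99, 12, 99, 3, 8]  # reasoning ... </think> answer

# output_ids[::-1].index(THINK_END) finds the distance of the LAST
# </think> from the end; subtracting from len() gives the index just
# past that token, so the tag stays with the reasoning slice.
split = len(output_ids) - output_ids[::-1].index(THINK_END)
reasoning_ids, answer_ids = output_ids[:split], output_ids[split:]
assert reasoning_ids == [5, 7, 99, 12, 99] and answer_ids == [3, 8]
```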

.env.example CHANGED
@@ -8,7 +8,8 @@ MOCK_MODELS=true
 SERVER_HOST=0.0.0.0
 SERVER_PORT=7860
 
-# Optional: Override model paths
-# VISION_MODEL=Qwen/Qwen3-VL-30B-A3B-Instruct
+# Optional: Override model paths (Dual 8B architecture)
+# VISION_MODEL_THINKING=Qwen/Qwen3-VL-8B-Thinking
+# VISION_MODEL_INSTRUCT=Qwen/Qwen3-VL-8B-Instruct
 # EMBEDDING_MODEL=Qwen/Qwen3-VL-Embedding-8B
 # RERANKER_MODEL=Qwen/Qwen3-VL-Reranker-8B
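Because `Settings` in config/settings.py is a pydantic `BaseSettings` subclass, these commented-out variables take effect simply by being exported; a quick sketch, assuming pydantic's default field-name-to-env-var mapping:

```python
import os

# Hypothetical override; mirrors the .env.example entry above
os.environ["VISION_MODEL_INSTRUCT"] = "Qwen/Qwen3-VL-8B-Instruct"

from config.settings import Settings

settings = Settings()
print(settings.vision_model_instruct)  # -> "Qwen/Qwen3-VL-8B-Instruct"
```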
CLAUDE.md CHANGED
@@ -13,7 +13,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Critical Constraints
 
 1. **No External API Calls** - 100% locally-owned models only (no Claude/OpenAI APIs)
-2. **Memory Budget** - 4xL4 96GB total: ~58GB vision (30B BF16) + ~16GB embedding + ~16GB reranker (~90GB used, ~6GB headroom)
+2. **Memory Budget** - 4xL4 88GB usable: ~36GB vision (dual 8B) + ~16GB embedding + ~16GB reranker (~68GB used, ~20GB headroom)
 3. **Processing Time** - 60-90 seconds per assessment is acceptable
 4. **MVP Scope** - Phase 1 (PRE) and Phase 2 (PRA) only; no lab results processing yet
 5. **Static RAG** - Knowledge base is pre-indexed; no user document uploads
@@ -23,7 +23,8 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 | Component | Technology |
 |-----------|------------|
 | UI Framework | Gradio 6.x |
-| Vision/Generation | Qwen3-VL-30B-A3B-Instruct |
+| Vision (Thinking) | Qwen3-VL-8B-Thinking |
+| Vision (Instruct) | Qwen3-VL-8B-Instruct |
 | Embeddings | Qwen3-VL-Embedding-8B |
 | Reranker | Qwen3-VL-Reranker-8B |
 | Vector Store | ChromaDB 0.4.x |
@@ -148,28 +149,34 @@ Source documents in `/RAG-KB/`:
 
 ## Multi-GPU Model Loading
 
-The 4xL4 setup requires models to be distributed across GPUs. Use `device_map="auto"` in transformers:
+All 4 models are loaded simultaneously at startup (~68GB total on 4xL4 GPUs):
 
 ```python
-model = AutoModel.from_pretrained(
-    "Qwen/Qwen3-VL-30B-A3B-Instruct",
+# Vision models (dual 8B architecture)
+thinking_model = Qwen3VLForConditionalGeneration.from_pretrained(
+    "Qwen/Qwen3-VL-8B-Thinking",
     torch_dtype=torch.bfloat16,
-    device_map="auto",  # Automatically distributes across available GPUs
+    device_map="auto",
+    trust_remote_code=True
+)
+instruct_model = Qwen3VLForConditionalGeneration.from_pretrained(
+    "Qwen/Qwen3-VL-8B-Instruct",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
     trust_remote_code=True
 )
 ```
 
-Expected distribution (BF16, ~90GB total):
-- Vision model (30B): ~58GB spread across GPUs via device_map="auto"
+Expected distribution (BF16, ~68GB total):
+- Vision Thinking model (8B): ~18GB
+- Vision Instruct model (8B): ~18GB
 - Embedding model (8B): ~16GB
 - Reranker model (8B): ~16GB
-- Headroom: ~6GB for KV cache
-
-**Fallback**: If VRAM issues arise, use `Qwen/Qwen3-VL-8B-Instruct` (~16GB) instead of 30B
+- Headroom: ~20GB for KV cache and overhead
 
 ## Local Development Strategy
 
-The RTX 4090 (24GB VRAM) cannot run the full model stack (~90GB required). Use this workflow:
+The RTX 4090 (24GB VRAM) cannot run the full model stack (~68GB required). Use this workflow:
 
 1. Set `MOCK_MODELS=true` environment variable
 2. Mock responses return realistic JSON matching vision output schema
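With `device_map="auto"`, accelerate records each submodule's placement on the model object; a small sketch for sanity-checking how the dual 8B models actually spread across the four L4s (`hf_device_map` is the attribute transformers sets on dispatched models; the printed counts below are illustrative, not measured):

```python
from collections import Counter

def placement_summary(model) -> Counter:
    """Count how many submodules landed on each device."""
    return Counter(str(device) for device in model.hf_device_map.values())

# After loading, e.g.:
#   print(placement_summary(thinking_model))
#   Counter({'0': 12, '1': 11, '2': 9, '3': 8})  # illustrative only
```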
FDAM_AI_Pipeline_Technical_Spec.md CHANGED
@@ -34,7 +34,7 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 ### Key Constraints
 - 100% locally-owned models (no Claude/OpenAI API calls)
-- HuggingFace Spaces deployment with Nvidia A100 80GB
+- HuggingFace Spaces deployment with Nvidia 4xL4 (88GB total)
 - 60-90 second processing time acceptable
 - Static RAG knowledge base (no user-uploaded documents)
 
@@ -75,7 +75,7 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 ┌─────────────────────────────────────────────────────────────────────────────┐
 │ VISION ANALYSIS MODULE │
-│ (Qwen3-VL-30B-A3B-Instruct) │
+│ (Qwen3-VL-8B-Thinking → Qwen3-VL-8B-Instruct) │
 ├─────────────────────────────────────────────────────────────────────────────┤
 │ Per Image: │
 │ ├── Zone Classification (Burn/Near-Field/Far-Field) + confidence │
@@ -113,7 +113,7 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 ┌─────────────────────────────────────────────────────────────────────────────┐
 │ DOCUMENT GENERATION MODULE │
-│ (Qwen3-VL-30B-A3B-Instruct) │
+│ (Deterministic template + calculations) │
 ├─────────────────────────────────────────────────────────────────────────────┤
 │ Outputs: │
 │ ├── Cleaning Specification / SOW (primary) │
@@ -144,12 +144,13 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 | Component | Technology | Version |
 |-----------|------------|---------|
 | Platform | HuggingFace Spaces | - |
-| GPU | Nvidia A100 | 80GB |
-| Vision/Generation Model | Qwen3-VL-30B-A3B-Instruct | Latest |
+| GPU | Nvidia 4xL4 | 88GB total |
+| Vision (Thinking) | Qwen3-VL-8B-Thinking | Latest |
+| Vision (Instruct) | Qwen3-VL-8B-Instruct | Latest |
 | Embedding Model | Qwen3-VL-Embedding-8B | Latest |
 | Reranker Model | Qwen3-VL-Reranker-8B | Latest |
 | Vector Store | ChromaDB | 0.4.x |
-| UI Framework | Gradio | 4.x |
+| UI Framework | Gradio | 6.x |
 | PDF Generation | Pandoc | 3.x |
 | Image Processing | Pillow, OpenCV | Latest |
 
@@ -157,16 +158,16 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 ## 3. Model Stack Configuration
 
-### Memory Budget (A100 80GB)
+### Memory Budget (4xL4 88GB)
 
 | Component | VRAM | Status |
 |-----------|------|--------|
-| Qwen3-VL-30B-A3B-Instruct | ~24GB | Always loaded |
+| Qwen3-VL-8B-Thinking | ~18GB | Always loaded |
+| Qwen3-VL-8B-Instruct | ~18GB | Always loaded |
 | Qwen3-VL-Embedding-8B | ~16GB | Always loaded |
 | Qwen3-VL-Reranker-8B | ~16GB | Always loaded |
-| ChromaDB + KV Cache | ~5GB | Always loaded |
-| **Available Headroom** | ~19GB | Context expansion |
-| **Total** | ~61GB | ✅ Fits |
+| **Total** | ~68GB | Fits |
+| **Available Headroom** | ~20GB | KV cache + overhead |
 
 ### Model Loading Configuration
 
@@ -175,58 +176,52 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 import torch
 from transformers import (
-    Qwen3VLMoeForConditionalGeneration,  # Note: Qwen3-VL uses MoE architecture
+    Qwen3VLForConditionalGeneration,
     AutoProcessor,
-    AutoModel,
-    AutoTokenizer
 )
 
 class ModelStack:
-    """Manages all models with concurrent loading on A100 80GB."""
+    """Manages all models with concurrent loading on 4xL4 (88GB total)."""
 
     def __init__(self, device="cuda"):
         self.device = device
         self.models = {}
         self.processors = {}
 
     def load_all(self):
-        """Load all models into VRAM."""
-        print("Loading Qwen3-VL-30B-A3B-Instruct (Vision + Generation)...")
-        self.models["vision"] = Qwen3VLMoeForConditionalGeneration.from_pretrained(
-            "Qwen/Qwen3-VL-30B-A3B-Instruct",
+        """Load all models into VRAM (~68GB total)."""
+        # Dual vision architecture
+        print("Loading Qwen3-VL-8B-Thinking (Vision Analysis)...")
+        self.models["vision_thinking"] = Qwen3VLForConditionalGeneration.from_pretrained(
+            "Qwen/Qwen3-VL-8B-Thinking",
             torch_dtype=torch.bfloat16,
             device_map="auto",
             trust_remote_code=True
        )
-        self.processors["vision"] = AutoProcessor.from_pretrained(
-            "Qwen/Qwen3-VL-30B-A3B-Instruct",
+        self.processors["vision_thinking"] = AutoProcessor.from_pretrained(
+            "Qwen/Qwen3-VL-8B-Thinking",
            trust_remote_code=True
        )
 
-        print("Loading Qwen3-VL-Embedding-8B (Multimodal RAG)...")
-        self.models["embedding"] = AutoModel.from_pretrained(
-            "Qwen/Qwen3-VL-Embedding-8B",
+        print("Loading Qwen3-VL-8B-Instruct (JSON Formatting)...")
+        self.models["vision_instruct"] = Qwen3VLForConditionalGeneration.from_pretrained(
+            "Qwen/Qwen3-VL-8B-Instruct",
            torch_dtype=torch.bfloat16,
            device_map="auto",
            trust_remote_code=True
        )
-        self.processors["embedding"] = AutoProcessor.from_pretrained(
-            "Qwen/Qwen3-VL-Embedding-8B",
+        self.processors["vision_instruct"] = AutoProcessor.from_pretrained(
+            "Qwen/Qwen3-VL-8B-Instruct",
            trust_remote_code=True
        )
 
+        # RAG models
+        print("Loading Qwen3-VL-Embedding-8B (Multimodal RAG)...")
+        # Uses official Qwen3VLEmbedder from scripts/qwen3_vl/
+
         print("Loading Qwen3-VL-Reranker-8B (Retrieval Precision)...")
-        self.models["reranker"] = AutoModel.from_pretrained(
-            "Qwen/Qwen3-VL-Reranker-8B",
-            torch_dtype=torch.bfloat16,
-            device_map="auto",
-            trust_remote_code=True
-        )
-        self.processors["reranker"] = AutoProcessor.from_pretrained(
-            "Qwen/Qwen3-VL-Reranker-8B",
-            trust_remote_code=True
-        )
+        # Uses official Qwen3VLReranker from scripts/qwen3_vl/
 
         print("All models loaded successfully.")
         return self
 
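The budget table reduces to simple arithmetic; a trivial sketch mirroring the rows above:

```python
# Approximate BF16 footprints from the memory budget table (GB)
budget = {
    "Qwen3-VL-8B-Thinking": 18,
    "Qwen3-VL-8B-Instruct": 18,
    "Qwen3-VL-Embedding-8B": 16,
    "Qwen3-VL-Reranker-8B": 16,
}
total = sum(budget.values())  # 68
headroom = 88 - total         # 20 on 4xL4 (88GB usable)
assert total <= 88, "model stack no longer fits on 4xL4"
```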
README.md CHANGED
@@ -32,8 +32,9 @@ suggested_hardware: l4x4
 
 ## Technical Details
 
-### Model Stack (~90GB VRAM)
-- **Vision**: Qwen3-VL-30B-A3B-Instruct (~58GB)
+### Model Stack (~68GB VRAM)
+- **Vision (Thinking)**: Qwen3-VL-8B-Thinking (~18GB) - Deep analysis with reasoning
+- **Vision (Instruct)**: Qwen3-VL-8B-Instruct (~18GB) - Structured JSON output
 - **Embeddings**: Qwen3-VL-Embedding-8B (~16GB)
 - **Reranker**: Qwen3-VL-Reranker-8B (~16GB)
 
config/inference.py CHANGED
@@ -7,15 +7,31 @@ and FDAM Technical Spec requirements.
 from dataclasses import dataclass
 
 
+@dataclass
+class ThinkingInferenceConfig:
+    """Configuration for 8B-Thinking model inference.
+
+    Per Qwen3-VL GitHub recommended hyperparameters for thinking models.
+    Used for deep analysis with <think> chains.
+    """
+
+    max_new_tokens: int = 32768  # Extended for reasoning chains (model supports 40960)
+    temperature: float = 0.6  # Per Qwen3-VL GitHub docs
+    top_p: float = 0.95
+    top_k: int = 20
+    do_sample: bool = True
+    repetition_penalty: float = 1.0  # Per Qwen3-VL docs (not presence_penalty)
+
+
 @dataclass
 class VisionInferenceConfig:
-    """Configuration for vision model inference.
+    """Configuration for 8B-Instruct model inference.
 
-    Per FDAM Technical Spec Section 3 and Qwen3-VL-30B-A3B-Instruct model card.
+    Per FDAM Technical Spec Section 3. Used for structured JSON output.
     """
 
     max_new_tokens: int = 4096
-    temperature: float = 0.1  # Low temperature for deterministic output
+    temperature: float = 0.1  # Low temperature for deterministic JSON output
     top_p: float = 0.9
     do_sample: bool = True
     repetition_penalty: float = 1.1  # Reduce repetition in generated text
@@ -66,7 +82,8 @@ class RAGConfig:
 
 
 # Default configurations
-vision_config = VisionInferenceConfig()
+thinking_config = ThinkingInferenceConfig()
+vision_config = VisionInferenceConfig()  # Now used for Instruct model
 generation_config = GenerationInferenceConfig()
 embedding_config = EmbeddingConfig()
 reranker_config = RerankerConfig()
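Since both configs are plain dataclasses whose field names happen to match `generate()` keyword arguments, they can also be splatted directly; a hedged sketch (models/real.py below passes the fields explicitly instead):

```python
from dataclasses import asdict

from config.inference import thinking_config

# max_new_tokens, temperature, top_p, top_k, do_sample, and
# repetition_penalty are all valid HF generate() kwargs
gen_kwargs = asdict(thinking_config)
# outputs = thinking_model.generate(**inputs, **gen_kwargs)
```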
config/settings.py CHANGED
@@ -17,13 +17,12 @@ class Settings(BaseSettings):
     mock_models: bool = True
 
     # Model paths (for production on HuggingFace Spaces)
-    vision_model: str = "Qwen/Qwen3-VL-30B-A3B-Instruct"
+    # Dual 8B architecture: Thinking for analysis, Instruct for structured output
+    vision_model_thinking: str = "Qwen/Qwen3-VL-8B-Thinking"
+    vision_model_instruct: str = "Qwen/Qwen3-VL-8B-Instruct"
     embedding_model: str = "Qwen/Qwen3-VL-Embedding-8B"
     reranker_model: str = "Qwen/Qwen3-VL-Reranker-8B"
 
-    # Fallback vision model if VRAM issues
-    vision_model_fallback: str = "Qwen/Qwen3-VL-8B-Instruct"
-
     # ChromaDB
     chroma_persist_dir: str = "./chroma_db"
 
models/loader.py CHANGED
@@ -1,13 +1,13 @@
 """Model loading with mock/real switching based on environment.
 
 Supports two loading modes:
-- MOCK_MODELS=true: Loads all mock models at startup (fast, for local dev)
-- MOCK_MODELS=false: Uses LAZY LOADING (models loaded on-demand by pipeline)
+- MOCK_MODELS=true: Loads mock models (fast, for local dev on RTX 4090)
+- MOCK_MODELS=false: Loads all real models at startup (~68GB total)
 
-Lazy Loading Strategy (for 4xL4 GPUs with 88GB total):
-- Vision 30B (~60GB) loaded before Stage 2, unloaded after
-- RAG models (~32GB) loaded before Stage 3
-- Peak usage ~60GB, never both simultaneously
+Memory Strategy (Simultaneous Loading for 4xL4 GPUs with 88GB total):
+- Vision Thinking 8B (~18GB) + Vision Instruct 8B (~18GB) = ~36GB
+- Embedding 8B (~16GB) + Reranker 8B (~16GB) = ~32GB
+- Total: ~68GB, leaving ~20GB headroom
 """
 
 import logging
@@ -28,8 +28,8 @@ _model_stack: ModelStack | None = None
 def get_model_stack() -> ModelStack:
     """Get model stack based on environment configuration.
 
-    For mock models: Loads all models immediately (fast, for local dev).
-    For real models: Returns uninitialized stack for lazy loading.
+    For mock models: Loads mock models immediately (fast, for local dev).
+    For real models: Loads all 4 models at startup (~68GB total).
     """
     start_time = time.time()
 
@@ -42,25 +42,24 @@ def get_model_stack() -> ModelStack:
         logger.info(f"Mock model stack loaded in {elapsed:.2f}s")
         return stack
     else:
-        logger.info("Creating REAL model stack (production mode - lazy loading)")
-        logger.info(f"Vision model: {settings.vision_model}")
+        logger.info("Loading REAL model stack (production mode)")
+        logger.info(f"Vision thinking model: {settings.vision_model_thinking}")
+        logger.info(f"Vision instruct model: {settings.vision_model_instruct}")
         logger.info(f"Embedding model: {settings.embedding_model}")
         logger.info(f"Reranker model: {settings.reranker_model}")
-        logger.info("NOTE: Models will be loaded on-demand by pipeline stages")
         from models.real import RealModelStack
 
-        # Don't load models yet - pipeline will call load_vision() and load_rag()
-        stack = RealModelStack()
+        # Load all models at startup (simultaneous loading)
+        stack = RealModelStack().load_all()
         elapsed = time.time() - start_time
-        logger.info(f"Real model stack initialized in {elapsed:.2f}s (no models loaded yet)")
+        logger.info(f"Real model stack loaded in {elapsed:.2f}s")
         return stack
 
 
 def get_models() -> ModelStack:
     """Get or create the singleton model stack.
 
-    For real models, this returns an uninitialized stack.
-    Call stack.load_vision() or stack.load_rag() as needed.
+    Returns fully loaded model stack (all models ready for inference).
     """
     global _model_stack
     if _model_stack is None:
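Call sites are unchanged by this commit; the pipeline still goes through the singleton, which now comes back fully loaded. A minimal usage sketch (the image path is hypothetical):

```python
from PIL import Image

from models.loader import get_models

stack = get_models()  # loads all 4 models on first call (mocks if MOCK_MODELS=true)
assert stack.is_loaded()

image = Image.open("site_photo.jpg")  # hypothetical input
result = stack.vision.analyze_image(image, context="kitchen, post-flashover")
print(result["zone"]["classification"], result["condition"]["level"])
```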
models/mock.py CHANGED
@@ -1,4 +1,9 @@
-"""Mock model implementations for local development on RTX 4090."""
+"""Mock model implementations for local development on RTX 4090.
+
+Simulates the dual 8B vision model architecture:
+- MockVisionModel simulates two-stage pipeline (Thinking -> Instruct)
+- All models loaded together at startup (no lazy loading)
+"""
 
 import logging
 import random
@@ -9,7 +14,12 @@ logger = logging.getLogger(__name__)
 
 
 class MockVisionModel:
-    """Mock vision model that returns realistic JSON responses."""
+    """Mock vision model that simulates dual-model pipeline output.
+
+    Simulates:
+    - Stage 1: Thinking model generates reasoning
+    - Stage 2: Instruct model formats to JSON
+    """
 
     ZONES = ["burn", "near-field", "far-field"]
     CONDITIONS = ["background", "light", "moderate", "heavy", "structural-damage"]
@@ -28,11 +38,31 @@
         {"type": "ductwork-flexible", "category": "hvac"},
     ]
 
+    # Mock reasoning patterns to simulate Thinking model output
+    REASONING_PATTERNS = {
+        "burn": "Direct fire involvement evident from structural char and complete combustion patterns.",
+        "near-field": "Adjacent to burn zone with heavy smoke deposits and heat-induced discoloration.",
+        "far-field": "Light smoke migration only, no direct heat exposure or structural damage visible.",
+    }
+
+    CONDITION_REASONING = {
+        "background": "Surfaces appear clean with no visible contamination.",
+        "light": "Faint discoloration visible, minimal deposits present.",
+        "moderate": "Clear contamination with visible film on surfaces.",
+        "heavy": "Thick deposits obscuring surface texture.",
+        "structural-damage": "Physical damage requiring repair before cleaning.",
+    }
+
     def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
-        """Return mock vision analysis matching the spec schema."""
-        logger.debug(f"Mock vision analysis (context: {len(context)} chars)")
+        """Return mock vision analysis simulating dual-model pipeline output."""
+        logger.debug(f"Mock dual-model vision analysis (context: {len(context)} chars)")
+
+        # Simulate Stage 1: Thinking model selects classifications
         selected_zone = random.choice(self.ZONES)
         selected_condition = random.choice(self.CONDITIONS)
+
+        logger.debug("Mock Stage 1 (Thinking): Generated reasoning")
+        logger.debug("Mock Stage 2 (Instruct): Formatted to JSON")
         logger.info(f"Mock vision result: zone={selected_zone}, condition={selected_condition}")
 
         # Generate 2-4 random materials
@@ -62,12 +92,18 @@
             "zone": {
                 "classification": selected_zone,
                 "confidence": round(random.uniform(0.7, 0.95), 2),
-                "reasoning": f"Mock analysis detected {selected_zone} zone characteristics based on visible damage patterns",
+                "reasoning": self.REASONING_PATTERNS.get(
+                    selected_zone,
+                    f"Mock analysis detected {selected_zone} zone characteristics",
+                ),
             },
             "condition": {
                 "level": selected_condition,
                 "confidence": round(random.uniform(0.65, 0.90), 2),
-                "reasoning": f"Surface shows {selected_condition} contamination levels",
+                "reasoning": self.CONDITION_REASONING.get(
+                    selected_condition,
+                    f"Surface shows {selected_condition} contamination levels",
+                ),
             },
             "materials": materials,
             "combustion_indicators": {
@@ -188,35 +224,25 @@ class MockRerankerModel:
 class MockModelStack:
     """Mock model stack for local development.
 
-    Unlike RealModelStack, mock models are always loaded together.
-    The is_vision_loaded() and is_rag_loaded() methods are provided
-    for API compatibility with the lazy loading pipeline.
+    All models loaded together at startup (matches production behavior).
     """
 
     def __init__(self):
         self.vision = MockVisionModel()
         self.embedding = MockEmbeddingModel()
         self.reranker = MockRerankerModel()
-        self.loaded = False
+        self._loaded = False
 
     def load_all(self) -> "MockModelStack":
-        """Simulate model loading."""
+        """Load all mock models."""
         logger.info("Loading mock models for local development")
-        logger.debug("  Vision model: MockVisionModel")
-        logger.debug("  Embedding model: MockEmbeddingModel")
+        logger.debug("  Vision model: MockVisionModel (simulates dual 8B pipeline)")
+        logger.debug("  Embedding model: MockEmbeddingModel (4096-dim)")
         logger.debug("  Reranker model: MockRerankerModel")
-        self.loaded = True
+        self._loaded = True
         logger.info("All mock models loaded successfully")
         return self
 
     def is_loaded(self) -> bool:
         """Check if models are loaded."""
-        return self.loaded
-
-    def is_vision_loaded(self) -> bool:
-        """Check if vision model is loaded (always True when loaded)."""
-        return self.loaded
-
-    def is_rag_loaded(self) -> bool:
-        """Check if RAG models are loaded (always True when loaded)."""
-        return self.loaded
+        return self._loaded
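Because the mock stack keeps the same surface as `RealModelStack`, the dual-pipeline contract can be exercised on the RTX 4090 without loading any weights; a small sketch:

```python
from PIL import Image

from models.mock import MockModelStack, MockVisionModel

stack = MockModelStack().load_all()

# A blank in-memory image suffices; the mock draws random classifications
# rather than inspecting pixels
image = Image.new("RGB", (640, 480))
result = stack.vision.analyze_image(image)

assert result["zone"]["classification"] in MockVisionModel.ZONES
assert result["condition"]["level"] in MockVisionModel.CONDITIONS
```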
models/real.py CHANGED
@@ -1,21 +1,21 @@
1
  """Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).
2
 
3
  This module loads the actual Qwen3-VL models for production use.
4
- Uses LAZY LOADING to fit within 88GB VRAM (4xL4 with ~22GB each).
5
 
6
- Memory Strategy:
7
- - Vision 30B (~60GB): Loaded ONLY during Stage 2 (Vision Analysis)
8
- - Embedding 8B (~16GB): Loaded ONLY during Stages 3+ (RAG)
9
- - Reranker 8B (~16GB): Loaded ONLY during Stages 3+ (RAG)
10
- - Peak usage: ~60GB (never all three simultaneously)
 
11
 
12
  Model Loading:
13
- - Vision: Qwen3VLMoeForConditionalGeneration (standard transformers)
14
  - Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
15
  - Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
16
  """
17
 
18
- import gc
19
  import json
20
  import logging
21
  import re
@@ -24,7 +24,7 @@ import torch
24
  from typing import Any
25
  from PIL import Image
26
 
27
- from config.inference import vision_config
28
  from config.settings import settings
29
 
30
  logger = logging.getLogger(__name__)
@@ -33,17 +33,15 @@ logger = logging.getLogger(__name__)
33
  class RealModelStack:
34
  """Real model stack for production on HuggingFace Spaces.
35
 
36
- Uses LAZY LOADING to prevent OOM errors on 4xL4 (88GB total):
37
- - Vision 30B (~60GB) and RAG models (~32GB) are never loaded simultaneously
38
- - Pipeline calls load_vision() before Stage 2, unload_vision() after
39
- - Pipeline calls load_rag() before Stage 3
40
  """
41
 
42
  def __init__(self):
43
  self.models: dict[str, Any] = {}
44
  self.processors: dict[str, Any] = {}
45
- self._vision_loaded = False
46
- self._rag_loaded = False
47
 
48
  def _log_gpu_status(self):
49
  """Log current GPU memory status."""
@@ -57,114 +55,53 @@ class RealModelStack:
57
  free = total - allocated
58
  logger.info(f" GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {free:.1f}GB free / {total:.1f}GB total")
59
 
60
- def load_vision(self) -> "RealModelStack":
61
- """Load only the vision model (~60GB in BF16).
62
 
63
- Call this before Stage 2 (Vision Analysis).
64
- Must call unload_vision() before load_rag() to free memory.
65
  """
66
- if self._vision_loaded:
67
- logger.debug("Vision model already loaded, skipping")
68
  return self
69
 
70
- from transformers import AutoProcessor
71
 
72
  device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
73
- logger.info(f"Loading vision model on {device_type}")
74
  self._log_gpu_status()
75
 
76
- logger.info(f"Loading vision model: {settings.vision_model}")
77
- vision_start = time.time()
78
- try:
79
- from transformers import Qwen3VLMoeForConditionalGeneration
80
-
81
- self.models["vision"] = Qwen3VLMoeForConditionalGeneration.from_pretrained(
82
- settings.vision_model,
83
- torch_dtype=torch.bfloat16,
84
- device_map="auto",
85
- trust_remote_code=True,
86
- )
87
- self.processors["vision"] = AutoProcessor.from_pretrained(
88
- settings.vision_model,
89
- trust_remote_code=True,
90
- )
91
- logger.info(f"Vision model loaded in {time.time() - vision_start:.2f}s")
92
- except Exception as e:
93
- logger.warning(f"Failed to load 30B vision model: {e}")
94
- logger.info(f"Falling back to {settings.vision_model_fallback}")
95
- from transformers import Qwen3VLMoeForConditionalGeneration
96
-
97
- self.models["vision"] = Qwen3VLMoeForConditionalGeneration.from_pretrained(
98
- settings.vision_model_fallback,
99
- torch_dtype=torch.bfloat16,
100
- device_map="auto",
101
- trust_remote_code=True,
102
- )
103
- self.processors["vision"] = AutoProcessor.from_pretrained(
104
- settings.vision_model_fallback,
105
- trust_remote_code=True,
106
- )
107
- logger.info(f"Fallback vision model loaded in {time.time() - vision_start:.2f}s")
108
 
109
- self._vision_loaded = True
110
- self._log_gpu_status()
111
- return self
112
-
113
- def unload_vision(self):
114
- """Unload vision model and free CUDA memory.
115
-
116
- Uses accelerate's remove_hook_from_module per HuggingFace docs.
117
- Call this after Stage 2 (Vision Analysis) to free memory for RAG.
118
- """
119
- if not self._vision_loaded or "vision" not in self.models:
120
- logger.debug("Vision model not loaded, skipping unload")
121
- return
122
-
123
- logger.info("Unloading vision model to free memory for RAG...")
124
- self._log_gpu_status()
125
-
126
- try:
127
- from accelerate.hooks import remove_hook_from_module
128
-
129
- # CRITICAL: Remove hooks before deleting (required for device_map="auto")
130
- model = self.models["vision"]
131
- if hasattr(model, 'model'):
132
- # Some wrappers have nested model
133
- remove_hook_from_module(model.model, recurse=True)
134
- remove_hook_from_module(model, recurse=True)
135
- logger.debug("Accelerate hooks removed from vision model")
136
- except ImportError:
137
- logger.warning("accelerate.hooks not available, proceeding with basic cleanup")
138
- except Exception as e:
139
- logger.warning(f"Hook removal failed (continuing anyway): {e}")
140
-
141
- # Delete model and processor
142
- del self.models["vision"]
143
- del self.processors["vision"]
144
- self._vision_loaded = False
145
-
146
- # Clear CUDA cache (may not free 100% but sufficient for sequential loading)
147
- gc.collect()
148
- torch.cuda.empty_cache()
149
-
150
- logger.info("Vision model unloaded, CUDA cache cleared")
151
- self._log_gpu_status()
152
-
153
- def load_rag(self) -> "RealModelStack":
154
- """Load embedding and reranker models (~32GB total in BF16).
155
-
156
- Call this before Stage 3 (RAG Retrieval).
157
- Must call unload_vision() first to have enough memory.
158
- """
159
- if self._rag_loaded:
160
- logger.debug("RAG models already loaded, skipping")
161
- return self
162
-
163
- if self._vision_loaded:
164
- logger.warning("Vision model still loaded! Call unload_vision() first to avoid OOM.")
165
 
166
- logger.info("Loading RAG models (embedding + reranker)...")
167
- self._log_gpu_status()
 
 
 
 
 
 
 
 
 
 
 
 
168
 
169
  # Embedding model (~16GB in BF16) - Using official Qwen3VLEmbedder
170
  logger.info(f"Loading embedding model: {settings.embedding_model}")
@@ -190,59 +127,51 @@ class RealModelStack:
190
  self.processors["reranker"] = self.models["reranker"].processor
191
  logger.info(f"Reranker model loaded in {time.time() - reranker_start:.2f}s")
192
 
193
- self._rag_loaded = True
194
- logger.info("RAG models loaded successfully")
 
195
  self._log_gpu_status()
196
  return self
197
 
198
- def load_all(self) -> "RealModelStack":
199
- """Load all models (DEPRECATED - use lazy loading instead).
200
-
201
- This method is kept for backward compatibility but will cause OOM
202
- on 4xL4 GPUs. Use load_vision() and load_rag() sequentially instead.
203
- """
204
- logger.warning("load_all() is deprecated - use load_vision() and load_rag() for lazy loading")
205
- self.load_vision()
206
- # Note: This WILL cause OOM on 4xL4 as vision (60GB) + RAG (32GB) > 88GB
207
- self.load_rag()
208
- return self
209
-
210
  def is_loaded(self) -> bool:
211
- """Check if any models are loaded."""
212
- return self._vision_loaded or self._rag_loaded
213
-
214
- def is_vision_loaded(self) -> bool:
215
- """Check if vision model is loaded."""
216
- return self._vision_loaded
217
-
218
- def is_rag_loaded(self) -> bool:
219
- """Check if RAG models are loaded."""
220
- return self._rag_loaded
221
 
222
  @property
223
- def vision(self) -> "RealVisionModel":
224
- """Return vision model wrapped for pipeline consumption."""
225
- if not self._vision_loaded:
226
- raise RuntimeError("Vision model not loaded. Call load_vision() first.")
227
- return RealVisionModel(self.models["vision"], self.processors["vision"])
 
 
 
 
 
228
 
229
  @property
230
  def embedding(self) -> "RealEmbeddingModel":
231
  """Return embedding model wrapped for pipeline consumption."""
232
- if not self._rag_loaded:
233
- raise RuntimeError("Embedding model not loaded. Call load_rag() first.")
234
  return RealEmbeddingModel(self.models["embedding"], self.processors["embedding"])
235
 
236
  @property
237
  def reranker(self) -> "RealRerankerModel":
238
  """Return reranker model wrapped for pipeline consumption."""
239
- if not self._rag_loaded:
240
- raise RuntimeError("Reranker model not loaded. Call load_rag() first.")
241
  return RealRerankerModel(self.models["reranker"], self.processors["reranker"])
242
 
243
 
244
- class RealVisionModel:
245
- """Wrapper for real vision model inference."""
 
 
 
 
 
 
246
 
247
  # System prompt for FDAM fire damage assessment (per Technical Spec Section 7)
248
  VISION_SYSTEM_PROMPT = """You are an expert industrial hygienist analyzing fire damage images for the FDAM (Fire Damage Assessment Methodology) framework.
@@ -280,60 +209,123 @@ Identify visible materials and categorize as:
280
  - Flag any areas that require professional on-site verification
281
  - Note any potential access issues visible in the image"""
282
 
283
- # Analysis prompt template with JSON schema
284
- ANALYSIS_PROMPT = """Analyze this fire damage image and return a JSON response with the following structure:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
285
 
286
- {
287
- "zone": {
 
 
 
 
 
 
 
288
  "classification": "burn" | "near-field" | "far-field",
289
  "confidence": 0.0-1.0,
290
  "reasoning": "explanation"
291
- },
292
- "condition": {
293
  "level": "background" | "light" | "moderate" | "heavy" | "structural-damage",
294
  "confidence": 0.0-1.0,
295
  "reasoning": "explanation"
296
- },
297
  "materials": [
298
- {
299
  "type": "material type (e.g., drywall, concrete, steel, wood)",
300
  "category": "non-porous" | "semi-porous" | "porous" | "hvac",
301
  "confidence": 0.0-1.0,
302
  "location_description": "where in image",
303
- "bounding_box": {"x": 0.0-1.0, "y": 0.0-1.0, "width": 0.0-1.0, "height": 0.0-1.0}
304
- }
305
  ],
306
- "combustion_indicators": {
307
  "soot_visible": true/false,
308
  "soot_pattern": "description or null",
309
  "char_visible": true/false,
310
  "char_description": "description or null",
311
  "ash_visible": true/false,
312
  "ash_description": "description or null"
313
- },
314
  "structural_concerns": ["list of structural issues if any"],
315
  "access_issues": ["list of access problems if any"],
316
  "recommended_sampling_locations": [
317
- {
318
  "description": "where to sample",
319
  "sample_type": "tape_lift" | "surface_wipe" | "air_sample",
320
  "priority": "high" | "medium" | "low"
321
- }
322
  ],
323
  "flags_for_review": ["any items requiring human review"]
324
- }
325
 
326
  IMPORTANT: Return ONLY valid JSON, no additional text."""
327
 
328
- def __init__(self, model, processor):
329
- self.model = model
330
- self.processor = processor
 
 
331
 
332
  def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
333
- """Analyze an image and return structured results."""
 
 
 
 
334
  start_time = time.time()
335
- logger.debug(f"Starting vision analysis (context: {len(context)} chars)")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
336
 
 
 
 
 
 
 
 
 
337
  try:
338
  from qwen_vl_utils import process_vision_info
339
  except ImportError:
@@ -341,7 +333,7 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
341
  process_vision_info = None
342
 
343
  # Build the analysis prompt with context
344
- prompt = self.ANALYSIS_PROMPT
345
  if context:
346
  prompt = f"Context: {context}\n\n{prompt}"
347
 
@@ -360,104 +352,142 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
360
  }
361
  ]
362
 
363
- try:
364
- # Apply chat template
365
- text = self.processor.apply_chat_template(
366
- messages, tokenize=False, add_generation_prompt=True
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
367
  )
368
 
369
- # Process vision info if available
370
- if process_vision_info:
371
- image_inputs, video_inputs = process_vision_info(messages)
372
- inputs = self.processor(
373
- text=[text],
374
- images=image_inputs,
375
- videos=video_inputs,
376
- return_tensors="pt",
377
- padding=True,
378
- )
379
- else:
380
- # Fallback: basic image processing
381
- inputs = self.processor(
382
- text=[text],
383
- images=[image],
384
- return_tensors="pt",
385
- padding=True,
386
- )
387
-
388
- # Note: With device_map="auto", transformers handles device routing internally
389
- # Do NOT call .to(device) - it breaks distributed models
390
-
391
- # Log inference config being used
392
- logger.debug(f"Vision inference config: max_new_tokens={vision_config.max_new_tokens}, "
393
- f"do_sample={vision_config.do_sample}, temp={vision_config.temperature}")
394
-
395
- # Generate response using config values
396
- inference_start = time.time()
397
- with torch.no_grad():
398
- if vision_config.do_sample:
399
- outputs = self.model.generate(
400
- **inputs,
401
- max_new_tokens=vision_config.max_new_tokens,
402
- do_sample=True,
403
- temperature=vision_config.temperature,
404
- top_p=vision_config.top_p,
405
- repetition_penalty=vision_config.repetition_penalty,
406
- )
407
- else:
408
- # Deterministic mode (no sampling)
409
- outputs = self.model.generate(
410
- **inputs,
411
- max_new_tokens=vision_config.max_new_tokens,
412
- do_sample=False,
413
- temperature=None,
414
- top_p=None,
415
- repetition_penalty=vision_config.repetition_penalty,
416
- )
417
-
418
- inference_time = time.time() - inference_start
419
- logger.debug(f"Vision inference completed in {inference_time:.2f}s")
420
-
421
- # Decode response
422
- response_text = self.processor.decode(
423
- outputs[0], skip_special_tokens=True
424
  )
425
- logger.debug(f"Response length: {len(response_text)} chars")
426
 
427
- # Parse JSON from response
428
- result = self._parse_vision_response(response_text)
429
 
430
- # Log result summary
431
- total_time = time.time() - start_time
432
- zone = result.get("zone", {}).get("classification", "unknown")
433
- zone_conf = result.get("zone", {}).get("confidence", 0)
434
- condition = result.get("condition", {}).get("level", "unknown")
435
- condition_conf = result.get("condition", {}).get("confidence", 0)
436
- num_materials = len(result.get("materials", []))
437
- logger.info(f"Vision analysis complete in {total_time:.2f}s: "
438
- f"zone={zone} ({zone_conf:.2f}), condition={condition} ({condition_conf:.2f}), "
439
- f"materials={num_materials}")
440
 
441
- return result
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
442
 
443
- except Exception as e:
444
- logger.error(f"Vision analysis failed: {e}")
445
- return self._get_fallback_response(str(e))
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
446
 
447
- def _parse_vision_response(self, response: str) -> dict[str, Any]:
448
- """Parse JSON response from vision model."""
449
  try:
450
  # Try to extract JSON from response
451
- # Look for JSON block in various formats
452
  json_match = re.search(r'\{[\s\S]*\}', response)
453
  if json_match:
454
  json_str = json_match.group()
455
  return json.loads(json_str)
456
  else:
457
- logger.warning("No JSON found in vision response")
458
  return self._get_fallback_response("No JSON in response")
459
  except json.JSONDecodeError as e:
460
- logger.warning(f"Failed to parse vision JSON: {e}")
461
  return self._get_fallback_response(f"JSON parse error: {e}")
462
 
463
  def _get_fallback_response(self, reason: str) -> dict[str, Any]:
 
1
  """Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).
2
 
3
  This module loads the actual Qwen3-VL models for production use.
4
+ All models are loaded simultaneously at startup (~68GB total).
5
 
6
+ Memory Strategy (Simultaneous Loading):
7
+ - Vision Thinking 8B (~18GB): Deep analysis with reasoning chains
8
+ - Vision Instruct 8B (~18GB): Structured JSON output formatting
9
+ - Embedding 8B (~16GB): RAG document embedding
10
+ - Reranker 8B (~16GB): RAG retrieval reranking
11
+ - Total: ~68GB on 88GB available (20GB headroom)
12
 
13
  Model Loading:
14
+ - Vision: Qwen3VLForConditionalGeneration (standard transformers)
15
  - Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
16
  - Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
17
  """
18
 
 
19
  import json
20
  import logging
21
  import re
 
24
  from typing import Any
25
  from PIL import Image
26
 
27
+ from config.inference import thinking_config, vision_config
28
  from config.settings import settings
29
 
30
  logger = logging.getLogger(__name__)
 
33
  class RealModelStack:
34
  """Real model stack for production on HuggingFace Spaces.
35
 
36
+ Loads all 4 models simultaneously at initialization (~68GB total):
37
+ - Dual vision (Thinking + Instruct): ~36GB
38
+ - Embedding + Reranker: ~32GB
 
39
  """
40
 
41
  def __init__(self):
42
  self.models: dict[str, Any] = {}
43
  self.processors: dict[str, Any] = {}
44
+ self._loaded = False
 
45
 
46
  def _log_gpu_status(self):
47
  """Log current GPU memory status."""
 
55
  free = total - allocated
56
  logger.info(f" GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {free:.1f}GB free / {total:.1f}GB total")
57
 
58
+ def load_all(self) -> "RealModelStack":
59
+ """Load all models simultaneously.
60
 
61
+ Loads dual vision models (Thinking + Instruct) and RAG models
62
+ (Embedding + Reranker) for ~68GB total VRAM usage.
63
  """
64
+ if self._loaded:
65
+ logger.debug("Models already loaded, skipping")
66
  return self
67
 
68
+ from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
69
 
70
  device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
71
+ logger.info(f"Loading all models on {device_type}")
72
  self._log_gpu_status()
73
 
74
+ total_start = time.time()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
75
 
76
+ # Vision Thinking model (~18GB in BF16)
77
+ logger.info(f"Loading vision thinking model: {settings.vision_model_thinking}")
78
+ thinking_start = time.time()
79
+ self.models["vision_thinking"] = Qwen3VLForConditionalGeneration.from_pretrained(
80
+ settings.vision_model_thinking,
81
+ torch_dtype=torch.bfloat16,
82
+ device_map="auto",
83
+ trust_remote_code=True,
84
+ )
85
+ self.processors["vision_thinking"] = AutoProcessor.from_pretrained(
86
+ settings.vision_model_thinking,
87
+ trust_remote_code=True,
88
+ )
89
+ logger.info(f"Vision thinking model loaded in {time.time() - thinking_start:.2f}s")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
90
 
91
+ # Vision Instruct model (~18GB in BF16)
92
+ logger.info(f"Loading vision instruct model: {settings.vision_model_instruct}")
93
+ instruct_start = time.time()
94
+ self.models["vision_instruct"] = Qwen3VLForConditionalGeneration.from_pretrained(
95
+ settings.vision_model_instruct,
96
+ torch_dtype=torch.bfloat16,
97
+ device_map="auto",
98
+ trust_remote_code=True,
99
+ )
100
+ self.processors["vision_instruct"] = AutoProcessor.from_pretrained(
101
+ settings.vision_model_instruct,
102
+ trust_remote_code=True,
103
+ )
104
+ logger.info(f"Vision instruct model loaded in {time.time() - instruct_start:.2f}s")
105
 
106
  # Embedding model (~16GB in BF16) - Using official Qwen3VLEmbedder
107
  logger.info(f"Loading embedding model: {settings.embedding_model}")
 
127
  self.processors["reranker"] = self.models["reranker"].processor
128
  logger.info(f"Reranker model loaded in {time.time() - reranker_start:.2f}s")
129
 
130
+ self._loaded = True
131
+ total_time = time.time() - total_start
132
+ logger.info(f"All models loaded in {total_time:.2f}s")
133
  self._log_gpu_status()
134
  return self
135
 
 
 
 
 
 
 
 
 
 
 
 
 
136
  def is_loaded(self) -> bool:
137
+ """Check if models are loaded."""
138
+ return self._loaded
 
 
 
 
 
 
 
 
139
 
140
  @property
141
+ def vision(self) -> "DualVisionModel":
142
+ """Return dual vision model wrapped for pipeline consumption."""
143
+ if not self._loaded:
144
+ raise RuntimeError("Models not loaded. Call load_all() first.")
145
+ return DualVisionModel(
146
+ thinking_model=self.models["vision_thinking"],
147
+ thinking_processor=self.processors["vision_thinking"],
148
+ instruct_model=self.models["vision_instruct"],
149
+ instruct_processor=self.processors["vision_instruct"],
150
+ )
151
 
152
  @property
153
  def embedding(self) -> "RealEmbeddingModel":
154
  """Return embedding model wrapped for pipeline consumption."""
155
+ if not self._loaded:
156
+ raise RuntimeError("Models not loaded. Call load_all() first.")
157
  return RealEmbeddingModel(self.models["embedding"], self.processors["embedding"])
158
 
159
  @property
160
  def reranker(self) -> "RealRerankerModel":
161
  """Return reranker model wrapped for pipeline consumption."""
162
+ if not self._loaded:
163
+ raise RuntimeError("Models not loaded. Call load_all() first.")
164
  return RealRerankerModel(self.models["reranker"], self.processors["reranker"])
165
 
166
 
167
+ class DualVisionModel:
168
+ """Dual vision model for two-stage fire damage analysis.
169
+
170
+ Uses Qwen3-VL-8B-Thinking for deep analysis with reasoning chains,
171
+ then Qwen3-VL-8B-Instruct to format results into structured JSON.
172
+
173
+ Pipeline: Image -> Thinking (analysis) -> Instruct (JSON formatting) -> Output
174
+ """
175
 
176
  # System prompt for FDAM fire damage assessment (per Technical Spec Section 7)
177
  VISION_SYSTEM_PROMPT = """You are an expert industrial hygienist analyzing fire damage images for the FDAM (Fire Damage Assessment Methodology) framework.
 
209
  - Flag any areas that require professional on-site verification
210
  - Note any potential access issues visible in the image"""
211
 
212
+ # Analysis prompt for Thinking model (open-ended reasoning)
213
+ THINKING_ANALYSIS_PROMPT = """Analyze this fire damage image thoroughly. Consider:
214
+
215
+ 1. What zone classification applies (burn, near-field, or far-field) and why?
216
+ 2. What is the contamination condition level (background, light, moderate, heavy, or structural-damage)?
217
+ 3. What materials are visible and what is their porosity category?
218
+ 4. What combustion indicators (soot, char, ash) are present and where?
219
+ 5. Are there any structural concerns or access issues?
220
+ 6. Where would you recommend sampling and what type of samples?
221
+
222
+ Provide detailed reasoning for each assessment, explaining the visual evidence that supports your conclusions."""
223
+
224
+ # Formatter prompt for Instruct model (structured JSON output)
225
+ INSTRUCT_FORMATTER_SYSTEM = """You are a technical document formatter. Your task is to convert fire damage analysis into a precise JSON structure.
226
+
227
+ Preserve all findings from the analysis accurately. Assign confidence scores (0.0-1.0) based on the certainty expressed in the analysis:
228
+ - Very certain statements: 0.85-0.95
229
+ - Reasonably confident: 0.70-0.84
230
+ - Somewhat uncertain: 0.50-0.69
231
+ - Uncertain/fallback: 0.30-0.49"""
232
 
233
+ INSTRUCT_FORMATTER_PROMPT = """Based on the following fire damage analysis, generate a JSON response with this exact structure:
234
+
235
+ <analysis>
236
+ {analysis}
237
+ </analysis>
238
+
239
+ Generate JSON with this structure:
240
+ {{
241
+ "zone": {{
242
  "classification": "burn" | "near-field" | "far-field",
243
  "confidence": 0.0-1.0,
244
  "reasoning": "explanation"
245
+ }},
246
+ "condition": {{
247
  "level": "background" | "light" | "moderate" | "heavy" | "structural-damage",
248
  "confidence": 0.0-1.0,
249
  "reasoning": "explanation"
250
+ }},
251
  "materials": [
252
+ {{
253
  "type": "material type (e.g., drywall, concrete, steel, wood)",
254
  "category": "non-porous" | "semi-porous" | "porous" | "hvac",
255
  "confidence": 0.0-1.0,
256
  "location_description": "where in image",
257
+ "bounding_box": {{"x": 0.0-1.0, "y": 0.0-1.0, "width": 0.0-1.0, "height": 0.0-1.0}}
258
+ }}
259
  ],
260
+ "combustion_indicators": {{
261
  "soot_visible": true/false,
262
  "soot_pattern": "description or null",
263
  "char_visible": true/false,
264
  "char_description": "description or null",
265
  "ash_visible": true/false,
266
  "ash_description": "description or null"
267
+ }},
268
  "structural_concerns": ["list of structural issues if any"],
269
  "access_issues": ["list of access problems if any"],
270
  "recommended_sampling_locations": [
271
+ {{
272
  "description": "where to sample",
273
  "sample_type": "tape_lift" | "surface_wipe" | "air_sample",
274
  "priority": "high" | "medium" | "low"
275
+ }}
276
  ],
277
  "flags_for_review": ["any items requiring human review"]
278
+ }}
279
 
280
  IMPORTANT: Return ONLY valid JSON, no additional text."""
281
 
282
+ def __init__(self, thinking_model, thinking_processor, instruct_model, instruct_processor):
283
+ self.thinking_model = thinking_model
284
+ self.thinking_processor = thinking_processor
285
+ self.instruct_model = instruct_model
286
+ self.instruct_processor = instruct_processor
287
 
288
  def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
289
+ """Analyze an image using two-stage pipeline.
290
+
291
+ Stage 1: Thinking model generates detailed analysis with reasoning
292
+ Stage 2: Instruct model formats the analysis into structured JSON
293
+ """
294
  start_time = time.time()
295
+ logger.debug(f"Starting dual-model vision analysis (context: {len(context)} chars)")
296
+
297
+ try:
298
+ # Stage 1: Deep analysis with Thinking model
299
+ thinking_start = time.time()
300
+ analysis_text = self._run_thinking_stage(image, context)
301
+ thinking_time = time.time() - thinking_start
302
+ logger.debug(f"Thinking stage completed in {thinking_time:.2f}s, output: {len(analysis_text)} chars")
303
+
304
+ # Stage 2: Format to JSON with Instruct model
305
+ instruct_start = time.time()
306
+ result = self._run_instruct_stage(analysis_text)
307
+ instruct_time = time.time() - instruct_start
308
+ logger.debug(f"Instruct stage completed in {instruct_time:.2f}s")
309
+
310
+ # Log result summary
311
+ total_time = time.time() - start_time
312
+ zone = result.get("zone", {}).get("classification", "unknown")
313
+ zone_conf = result.get("zone", {}).get("confidence", 0)
314
+ condition = result.get("condition", {}).get("level", "unknown")
315
+ condition_conf = result.get("condition", {}).get("confidence", 0)
316
+ num_materials = len(result.get("materials", []))
317
+ logger.info(f"Vision analysis complete in {total_time:.2f}s (thinking: {thinking_time:.2f}s, instruct: {instruct_time:.2f}s): "
318
+ f"zone={zone} ({zone_conf:.2f}), condition={condition} ({condition_conf:.2f}), "
319
+ f"materials={num_materials}")
320
 
321
+ return result
322
+
323
+ except Exception as e:
324
+ logger.error(f"Vision analysis failed: {e}")
325
+ return self._get_fallback_response(str(e))
326
+
327
+ def _run_thinking_stage(self, image: Image.Image, context: str) -> str:
328
+ """Run the Thinking model to generate detailed analysis."""
329
  try:
330
  from qwen_vl_utils import process_vision_info
331
  except ImportError:
 
333
  process_vision_info = None
334
 
335
  # Build the analysis prompt with context
336
+ prompt = self.THINKING_ANALYSIS_PROMPT
337
  if context:
338
  prompt = f"Context: {context}\n\n{prompt}"
339
 
 
352
  }
353
  ]
354
 
355
+ # Apply chat template with thinking enabled (default for Thinking model)
356
+ text = self.thinking_processor.apply_chat_template(
357
+ messages, tokenize=False, add_generation_prompt=True
358
+ )
359
+
360
+ # Process vision info if available
361
+ if process_vision_info:
362
+ image_inputs, video_inputs = process_vision_info(messages)
363
+ inputs = self.thinking_processor(
364
+ text=[text],
365
+ images=image_inputs,
366
+ videos=video_inputs,
367
+ return_tensors="pt",
368
+ padding=True,
369
+ )
370
+ else:
371
+ # Fallback: basic image processing
372
+ inputs = self.thinking_processor(
373
+ text=[text],
374
+ images=[image],
375
+ return_tensors="pt",
376
+ padding=True,
377
  )
378
 
+     # Move inputs onto the model's device before generation
+     inputs = inputs.to(self.thinking_model.device)
+
+     # Generate response using thinking config (per Qwen3-VL GitHub recommendations)
+     logger.debug(f"Thinking inference config: max_new_tokens={thinking_config.max_new_tokens}, "
+                  f"temp={thinking_config.temperature}, top_p={thinking_config.top_p}, top_k={thinking_config.top_k}")
+
+     with torch.no_grad():
+         outputs = self.thinking_model.generate(
+             **inputs,
+             max_new_tokens=thinking_config.max_new_tokens,
+             do_sample=thinking_config.do_sample,
+             temperature=thinking_config.temperature,
+             top_p=thinking_config.top_p,
+             top_k=thinking_config.top_k,
+             repetition_penalty=thinking_config.repetition_penalty,
          )

+     # Decode only the newly generated tokens (generate() returns the prompt
+     # followed by the completion) and keep raw token IDs for proper parsing
+     output_ids = outputs[0][inputs["input_ids"].shape[1]:].tolist()

+     # The Thinking model's chat template includes opening <think> tag
+     # Output format: reasoning_content</think>final_answer
+     # Get </think> token ID dynamically from tokenizer (more robust than hardcoding)
+     think_end_token = self.thinking_processor.tokenizer.encode(
+         "</think>", add_special_tokens=False
+     )[0]
+
+     try:
+         # Find the </think> token position
+         think_end_idx = len(output_ids) - output_ids[::-1].index(think_end_token)
+         # Extract reasoning (before </think>) and answer (after </think>)
+         reasoning_ids = output_ids[:think_end_idx]
+         answer_ids = output_ids[think_end_idx:]
+
+         reasoning = self.thinking_processor.decode(
+             reasoning_ids, skip_special_tokens=True
+         ).strip()
+         final_answer = self.thinking_processor.decode(
+             answer_ids, skip_special_tokens=True
+         ).strip()
+
+         logger.debug(f"Extracted thinking: {len(reasoning)} chars reasoning, {len(final_answer)} chars answer")
+         return f"Reasoning:\n{reasoning}\n\nConclusions:\n{final_answer}"
+
+     except ValueError:
+         # No </think> token found - use full response as-is
+         response_text = self.thinking_processor.decode(
+             output_ids, skip_special_tokens=True
+         ).strip()
+         logger.debug(f"No </think> token found, using full response: {len(response_text)} chars")
+         return response_text
+
+ def _run_instruct_stage(self, analysis_text: str) -> dict[str, Any]:
+     """Run the Instruct model to format analysis into JSON."""
+     # Prepare messages for Instruct model (text-only, no image)
+     prompt = self.INSTRUCT_FORMATTER_PROMPT.format(analysis=analysis_text)

+     messages = [
+         {
+             "role": "system",
+             "content": self.INSTRUCT_FORMATTER_SYSTEM,
+         },
+         {
+             "role": "user",
+             "content": prompt,
+         }
+     ]
+
+     # Apply chat template
+     text = self.instruct_processor.apply_chat_template(
+         messages, tokenize=False, add_generation_prompt=True
+     )
+
+     inputs = self.instruct_processor(
+         text=[text],
+         return_tensors="pt",
+         padding=True,
+     )
+     # Move inputs onto the model's device before generation
+     inputs = inputs.to(self.instruct_model.device)
+
+     # Generate response using vision config (low temp for consistent JSON)
+     logger.debug(f"Instruct inference config: max_new_tokens={vision_config.max_new_tokens}, "
+                  f"temp={vision_config.temperature}")
+
+     with torch.no_grad():
+         outputs = self.instruct_model.generate(
+             **inputs,
+             max_new_tokens=vision_config.max_new_tokens,
+             do_sample=vision_config.do_sample,
+             temperature=vision_config.temperature,
+             top_p=vision_config.top_p,
+             repetition_penalty=vision_config.repetition_penalty,
+         )
+
+     # Decode only the newly generated tokens - the prompt contains the
+     # analysis text and could otherwise confuse JSON extraction
+     response_text = self.instruct_processor.decode(
+         outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
+     )
+
+     # Parse JSON from response
+     return self._parse_json_response(response_text)

+ def _parse_json_response(self, response: str) -> dict[str, Any]:
+     """Parse JSON response from instruct model."""
      try:
          # Try to extract JSON from response
          json_match = re.search(r'\{[\s\S]*\}', response)
          if json_match:
              json_str = json_match.group()
              return json.loads(json_str)
          else:
+             logger.warning("No JSON found in instruct response")
              return self._get_fallback_response("No JSON in response")
      except json.JSONDecodeError as e:
+         logger.warning(f"Failed to parse JSON: {e}")
          return self._get_fallback_response(f"JSON parse error: {e}")

  def _get_fallback_response(self, reason: str) -> dict[str, Any]:
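
The split-point arithmetic in `_run_thinking_stage` is easy to misread, so here is a toy sketch of the idiom on made-up token IDs (99 stands in for the real `</think>` ID; the surrounding pipeline is omitted):

```python
# Toy illustration of `len(ids) - ids[::-1].index(tok)`: reversing the list
# makes .index() find the LAST occurrence, so the computed index points just
# past the final </think> token; the token itself stays on the reasoning side.
ids = [5, 99, 7, 99, 3]               # pretend 99 == </think>
idx = len(ids) - ids[::-1].index(99)  # -> 4
assert ids[:idx] == [5, 99, 7, 99]    # reasoning_ids (includes </think>)
assert ids[idx:] == [3]               # answer_ids
```

`list.index` raises `ValueError` when the token never appears, which is exactly what the `except ValueError` branch catches (e.g. generation cut off by `max_new_tokens` before the model closed its thinking block).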
pipeline/main.py CHANGED
@@ -199,11 +199,6 @@ class FDAMPipeline:
  logger.info(f"Stage 2/6: Vision Analysis ({len(session.images)} images)")
  report_progress(2, "Analyzing images with AI...")
  model_stack = get_models()
-
- # Lazy load vision model (for real models only - mock models are already loaded)
- if hasattr(model_stack, 'load_vision') and not model_stack.is_vision_loaded():
-     logger.info("Lazy loading vision model...")
-     model_stack.load_vision()
  vision_results = {}
  annotated_images = []
  room_mapping = {}
@@ -260,20 +255,11 @@ class FDAMPipeline:
  logger.info(f"Stage 2 completed in {time.time() - stage_start:.2f}s: "
              f"{len(vision_results)} images analyzed")

- # Unload vision model to free memory for RAG (for real models only)
- if hasattr(model_stack, 'unload_vision') and model_stack.is_vision_loaded():
-     logger.info("Unloading vision model to free memory for RAG...")
-     model_stack.unload_vision()
-
  # Stage 3: RAG Retrieval
  stage_start = time.time()
  logger.info("Stage 3/6: RAG Retrieval")
  report_progress(3, "Retrieving FDAM methodology context...")

- # Lazy load RAG models (for real models only - mock models are already loaded)
- if hasattr(model_stack, 'load_rag') and not model_stack.is_rag_loaded():
-     logger.info("Lazy loading RAG models (embedding + reranker)...")
-     model_stack.load_rag()
  # RAG is integrated into disposition engine, just verify connection
  try:
      test_results = self.retriever.retrieve("test connection", top_k=1)
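
With the lazy load/unload calls gone, every stage assumes the full stack is resident from startup. A minimal sketch of that pattern (all names below are hypothetical stand-ins, not the project's actual `RealModelStack` implementation):

```python
from dataclasses import dataclass

@dataclass
class _EagerStack:
    """Hypothetical stand-in: in the real app each field holds a loaded model."""
    vision_thinking: str = "Qwen/Qwen3-VL-8B-Thinking"   # Stage 1: reasoning
    vision_instruct: str = "Qwen/Qwen3-VL-8B-Instruct"   # Stage 2: JSON formatting
    embedding: str = "Qwen/Qwen3-VL-Embedding-8B"
    reranker: str = "Qwen/Qwen3-VL-Reranker-8B"

_STACK = _EagerStack()  # built once at startup (~68GB on the real 4xL4 stack)

def get_models() -> _EagerStack:
    # No is_vision_loaded()/is_rag_loaded() branches remain anywhere.
    return _STACK
```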
rag/retriever.py CHANGED
@@ -88,7 +88,7 @@ class SharedReranker:
      """Reranker that uses the shared model from RealModelStack.

      This avoids loading a duplicate reranker model - instead uses the
-     model already loaded by the pipeline via model_stack.load_rag().
+     model already loaded by the pipeline at startup.
      """

      def rerank(
@@ -109,13 +109,7 @@ class SharedReranker:

          model_stack = get_models()

-         # Check if RAG models are loaded
-         if not model_stack.is_rag_loaded():
-             logger.warning("RAG models not loaded yet - reranking may fail")
-             # Return neutral scores as fallback
-             return [0.5] * len(documents)
-
-         # Use the shared reranker model
+         # Use the shared reranker model (always loaded at startup)
          return model_stack.reranker.rerank(query, documents)
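
For reference, the call shape the simplified path guarantees (the query and documents here are illustrative, and higher-score-means-more-relevant is assumed):

```python
# Illustrative only: rerank() now always returns one score per input document,
# in input order - the neutral 0.5 fallback path no longer exists.
docs = [
    "Zone 2 covers materials adjacent to the moisture source...",
    "Unrelated installation notes...",
]
scores = SharedReranker().rerank("what defines Zone 2?", docs)
ranked = [d for _, d in sorted(zip(scores, docs), reverse=True)]
```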
rag/vectorstore.py CHANGED
@@ -62,7 +62,7 @@ class SharedEmbeddingFunction:
      """Embedding function that uses the shared model from RealModelStack.

      This avoids loading a duplicate embedding model - instead uses the
-     model already loaded by the pipeline via model_stack.load_rag().
+     model already loaded by the pipeline at startup.

      For ChromaDB compatibility, this wraps the model stack's embedding model.
      """
@@ -75,13 +75,7 @@ class SharedEmbeddingFunction:

          model_stack = get_models()

-         # Check if RAG models are loaded
-         if not model_stack.is_rag_loaded():
-             logger.warning("RAG models not loaded yet - embeddings may fail")
-             # Return zero vectors as fallback
-             return [[0.0] * self.EMBEDDING_DIM for _ in input]
-
-         # Use the shared embedding model
+         # Use the shared embedding model (always loaded at startup)
          return model_stack.embedding.embed_batch(input)
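
A hedged sketch of how this embedding function plugs into ChromaDB 0.4.x (the collection name and path are illustrative; `SharedEmbeddingFunction` is the class from this diff and must embed queries the same way the static knowledge base was indexed):

```python
import chromadb

# The embedding_function hook is how ChromaDB delegates query embedding;
# this instance forwards to the already-loaded model_stack.embedding.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="fdam_kb",                                # illustrative name
    embedding_function=SharedEmbeddingFunction(),  # delegates to the shared 8B model
)
results = collection.query(query_texts=["zone classification criteria"], n_results=5)
```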