KinetoLabs Claude Opus 4.5 committed on
Commit
706520f
·
1 Parent(s): 0699c5f

Replace dual 8B with single 30B-A3B FP8 vision model


Simplify pipeline architecture:
- Vision: Qwen3-VL-30B-A3B-Thinking-FP8 (~30-35GB) replaces dual 8B
- Embedding: 2B model (2048-dim) replaces 8B
- Reranker: 2B model replaces 8B
- Total VRAM: ~38-43GB (was ~68GB), 45GB+ headroom on 4xL4

Key changes:
- vLLM with FP8 quantization (built-in, no autoawq needed)
- Proper Qwen3-VL chat template formatting via processor
- Removed dual-model Thinking→Instruct pipeline
- Single model handles analysis + structured JSON output

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
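
For reference, the new single-pass flow is roughly the following (a condensed sketch of the code this commit adds to models/real.py; prompts are abbreviated and the image path is hypothetical):

```python
# Sketch only: mirrors the vLLM + chat-template flow added in this commit.
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

MODEL = "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"

llm = LLM(model=MODEL, tensor_parallel_size=4, trust_remote_code=True,
          gpu_memory_utilization=0.70, max_model_len=32768)
processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True)

image = Image.open("site_photo.jpg")  # hypothetical input
messages = [
    {"role": "system", "content": "You are an expert industrial hygienist..."},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Return the fire damage assessment as JSON."},
    ]},
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate(
    prompts=[{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params=SamplingParams(max_tokens=8192, temperature=0.6, top_p=0.95, top_k=20),
)
print(outputs[0].outputs[0].text)  # reasoning followed by the structured JSON
```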

.env.example CHANGED
@@ -8,8 +8,7 @@ MOCK_MODELS=true
8
  SERVER_HOST=0.0.0.0
9
  SERVER_PORT=7860
10
 
11
- # Optional: Override model paths (Dual 8B architecture)
12
- # VISION_MODEL_THINKING=Qwen/Qwen3-VL-8B-Thinking
13
- # VISION_MODEL_INSTRUCT=Qwen/Qwen3-VL-8B-Instruct
14
- # EMBEDDING_MODEL=Qwen/Qwen3-VL-Embedding-8B
15
- # RERANKER_MODEL=Qwen/Qwen3-VL-Reranker-8B
 
8
  SERVER_HOST=0.0.0.0
9
  SERVER_PORT=7860
10
 
11
+ # Optional: Override model paths (FP8 + 2B architecture)
12
+ # VISION_MODEL=Qwen/Qwen3-VL-30B-A3B-Thinking-FP8
13
+ # EMBEDDING_MODEL=Qwen/Qwen3-VL-Embedding-2B
14
+ # RERANKER_MODEL=Qwen/Qwen3-VL-Reranker-2B
 
CLAUDE.md CHANGED
@@ -13,7 +13,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
13
  ## Critical Constraints
14
 
15
  1. **No External API Calls** - 100% locally-owned models only (no Claude/OpenAI APIs)
16
- 2. **Memory Budget** - 4xL4 88GB usable: ~36GB vision (dual 8B) + ~16GB embedding + ~16GB reranker (~68GB used, ~20GB headroom)
17
  3. **Processing Time** - 60-90 seconds per assessment is acceptable
18
  4. **MVP Scope** - Phase 1 (PRE) and Phase 2 (PRA) only; no lab results processing yet
19
  5. **Static RAG** - Knowledge base is pre-indexed; no user document uploads
@@ -23,10 +23,10 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
23
  | Component | Technology |
24
  |-----------|------------|
25
  | UI Framework | Gradio 6.x |
26
- | Vision (Thinking) | Qwen3-VL-8B-Thinking |
27
- | Vision (Instruct) | Qwen3-VL-8B-Instruct |
28
- | Embeddings | Qwen3-VL-Embedding-8B |
29
- | Reranker | Qwen3-VL-Reranker-8B |
30
  | Vector Store | ChromaDB 0.4.x |
31
  | Validation | Pydantic 2.x |
32
  | PDF Generation | Pandoc 3.x |
@@ -149,40 +149,42 @@ Source documents in `/RAG-KB/`:
149
 
150
  ## Multi-GPU Model Loading
151
 
152
- All 4 models are loaded simultaneously at startup (~68GB total on 4xL4 GPUs):
153
 
154
  ```python
155
- # Vision models (dual 8B architecture)
156
- thinking_model = Qwen3VLForConditionalGeneration.from_pretrained(
157
- "Qwen/Qwen3-VL-8B-Thinking",
158
- torch_dtype=torch.bfloat16,
159
- device_map="auto",
160
- trust_remote_code=True
161
- )
162
- instruct_model = Qwen3VLForConditionalGeneration.from_pretrained(
163
- "Qwen/Qwen3-VL-8B-Instruct",
164
- torch_dtype=torch.bfloat16,
165
- device_map="auto",
166
- trust_remote_code=True
167
  )
168
  ```
169
 
170
- Expected distribution (BF16, ~68GB total):
171
- - Vision Thinking model (8B): ~18GB
172
- - Vision Instruct model (8B): ~18GB
173
- - Embedding model (8B): ~16GB
174
- - Reranker model (8B): ~16GB
175
- - Headroom: ~20GB for KV cache and overhead
176
 
177
  ## Local Development Strategy
178
 
179
- The RTX 4090 (24GB VRAM) cannot run the full model stack (~68GB required). Use this workflow:
180
 
181
  1. Set `MOCK_MODELS=true` environment variable
182
- 2. Mock responses return realistic JSON matching vision output schema
183
  3. Test pipeline logic, UI, calculations without real inference
184
  4. Deploy to HuggingFace Spaces for real model testing
185
  5. Request build logs after deployment to confirm success
 
186
 
187
  ## Code Style
188
 
 
13
  ## Critical Constraints
14
 
15
  1. **No External API Calls** - 100% locally-owned models only (no Claude/OpenAI APIs)
16
+ 2. **Memory Budget** - 4xL4 88GB usable: ~30-35GB vision (30B FP8) + ~4GB embedding + ~4GB reranker (~38-43GB used, ~45GB+ headroom)
17
  3. **Processing Time** - 60-90 seconds per assessment is acceptable
18
  4. **MVP Scope** - Phase 1 (PRE) and Phase 2 (PRA) only; no lab results processing yet
19
  5. **Static RAG** - Knowledge base is pre-indexed; no user document uploads
 
23
  | Component | Technology |
24
  |-----------|------------|
25
  | UI Framework | Gradio 6.x |
26
+ | Vision | Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (via vLLM) |
27
+ | Embeddings | Qwen/Qwen3-VL-Embedding-2B (2048-dim) |
28
+ | Reranker | Qwen/Qwen3-VL-Reranker-2B |
29
+ | Inference | vLLM with FP8 quantization |
30
  | Vector Store | ChromaDB 0.4.x |
31
  | Validation | Pydantic 2.x |
32
  | PDF Generation | Pandoc 3.x |
 
149
 
150
  ## Multi-GPU Model Loading
151
 
152
+ All 3 models are loaded at startup (~38-43GB total on 4xL4 GPUs):
153
 
154
  ```python
155
+ from vllm import LLM, SamplingParams
156
+
157
+ # Vision model via vLLM with FP8 quantization (built-in)
158
+ vision_model = LLM(
159
+ model="Qwen/Qwen3-VL-30B-A3B-Thinking-FP8",
160
+ tensor_parallel_size=4, # Distribute across all 4 GPUs
161
+ trust_remote_code=True,
162
+ gpu_memory_utilization=0.70,
163
+ max_model_len=32768,
164
  )
165
+
166
+ # Embedding and Reranker use official Qwen3VL loaders
167
+ from scripts.qwen3_vl import Qwen3VLEmbedder, Qwen3VLReranker
168
+ embedding_model = Qwen3VLEmbedder("Qwen/Qwen3-VL-Embedding-2B", torch_dtype=torch.bfloat16)
169
+ reranker_model = Qwen3VLReranker("Qwen/Qwen3-VL-Reranker-2B", torch_dtype=torch.bfloat16)
170
  ```
171
 
172
+ Expected distribution (FP8 + BF16, ~38-43GB total):
173
+ - Vision model (30B FP8): ~30-35GB
174
+ - Embedding model (2B): ~4GB
175
+ - Reranker model (2B): ~4GB
176
+ - Headroom: ~45GB+ for KV cache and overhead
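
A minimal inference sketch against the `vision_model` loaded above (assumes a PIL image `pil_image` and that the processor for the same checkpoint is loaded separately; sampling values mirror `config/inference.py`):

```python
from transformers import AutoProcessor
from vllm import SamplingParams

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8", trust_remote_code=True
)
messages = [{"role": "user", "content": [
    {"type": "image", "image": pil_image},
    {"type": "text", "text": "Describe the fire damage and return JSON."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = vision_model.generate(
    prompts=[{"prompt": prompt, "multi_modal_data": {"image": pil_image}}],
    sampling_params=SamplingParams(max_tokens=8192, temperature=0.6, top_p=0.95, top_k=20),
)
result_text = outputs[0].outputs[0].text
```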
 
177
 
178
  ## Local Development Strategy
179
 
180
+ The RTX 4090 (24GB VRAM) cannot run the production model stack. Use this workflow:
181
 
182
  1. Set `MOCK_MODELS=true` environment variable
183
+ 2. Mock responses return realistic JSON matching vision output schema (2048-dim embeddings)
184
  3. Test pipeline logic, UI, calculations without real inference
185
  4. Deploy to HuggingFace Spaces for real model testing
186
  5. Request build logs after deployment to confirm success
187
+ 6. After changing embedding dimensions, rebuild ChromaDB: `python -m rag.index_builder --rebuild`
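
A quick local smoke test under mock models might look like this (hedged sketch: `MOCK_MODELS` and `get_model_stack()` are from this repo, while the `.vision` accessor on the mock stack is assumed to mirror `RealModelStack`):

```python
import os
os.environ["MOCK_MODELS"] = "true"      # must be set before settings are imported

from PIL import Image
from models.loader import get_model_stack

stack = get_model_stack()               # loads MockModelStack, no GPU weights needed
result = stack.vision.analyze_image(Image.new("RGB", (640, 480)), context="Room 101, post-fire")
print(result["zone"]["classification"], result["condition"]["level"])
```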
188
 
189
  ## Code Style
190
 
FDAM_AI_Pipeline_Technical_Spec.md DELETED
The diff for this file is too large to render. See raw diff
 
README.md CHANGED
@@ -32,11 +32,10 @@ suggested_hardware: l4x4
32
 
33
  ## Technical Details
34
 
35
- ### Model Stack (~68GB VRAM)
36
- - **Vision (Thinking)**: Qwen3-VL-8B-Thinking (~18GB) - Deep analysis with reasoning
37
- - **Vision (Instruct)**: Qwen3-VL-8B-Instruct (~18GB) - Structured JSON output
38
- - **Embeddings**: Qwen3-VL-Embedding-8B (~16GB)
39
- - **Reranker**: Qwen3-VL-Reranker-8B (~16GB)
40
 
41
  ### Zone Classifications
42
  - **Burn Zone**: Direct fire involvement, structural damage
 
32
 
33
  ## Technical Details
34
 
35
+ ### Model Stack (~38-43GB VRAM)
36
+ - **Vision**: Qwen3-VL-30B-A3B-Thinking-FP8 (~30-35GB) - Reasoning-enhanced analysis with structured JSON output
37
+ - **Embeddings**: Qwen3-VL-Embedding-2B (~4GB)
38
+ - **Reranker**: Qwen3-VL-Reranker-2B (~4GB)
 
39
 
40
  ### Zone Classifications
41
  - **Burn Zone**: Direct fire involvement, structural damage
config/inference.py CHANGED
@@ -2,39 +2,29 @@
2
 
3
  Configuration values aligned with official Qwen3-VL model recommendations
4
  and FDAM Technical Spec requirements.
5
  """
6
 
7
  from dataclasses import dataclass
8
 
9
 
10
  @dataclass
11
- class ThinkingInferenceConfig:
12
- """Configuration for 8B-Thinking model inference.
13
 
14
- Per Qwen3-VL GitHub recommended hyperparameters for thinking models.
15
- Used for deep analysis with <think> chains.
16
  """
17
 
18
- max_new_tokens: int = 8192 # Balanced for reasoning + reasonable time (~7 min)
19
  temperature: float = 0.6 # Per Qwen3-VL GitHub docs
20
  top_p: float = 0.95
21
  top_k: int = 20
22
- do_sample: bool = True
23
- repetition_penalty: float = 1.0 # Per Qwen3-VL docs (not presence_penalty)
24
-
25
-
26
- @dataclass
27
- class VisionInferenceConfig:
28
- """Configuration for 8B-Instruct model inference.
29
-
30
- Per FDAM Technical Spec Section 3. Used for structured JSON output.
31
- """
32
-
33
- max_new_tokens: int = 4096
34
- temperature: float = 0.1 # Low temperature for deterministic JSON output
35
- top_p: float = 0.9
36
- do_sample: bool = True
37
- repetition_penalty: float = 1.1 # Reduce repetition in generated text
38
 
39
 
40
  @dataclass
@@ -55,10 +45,10 @@ class GenerationInferenceConfig:
55
  class EmbeddingConfig:
56
  """Configuration for embedding model.
57
 
58
- Per Qwen3-VL-Embedding-8B config.json: text_config.hidden_size = 4096
59
  """
60
 
61
- embedding_dimension: int = 4096 # Per Qwen3-VL-Embedding-8B hidden_size
62
  normalize: bool = True # L2 normalization (per official implementation)
63
 
64
 
@@ -82,8 +72,7 @@ class RAGConfig:
82
 
83
 
84
  # Default configurations
85
- thinking_config = ThinkingInferenceConfig()
86
- vision_config = VisionInferenceConfig() # Now used for Instruct model
87
  generation_config = GenerationInferenceConfig()
88
  embedding_config = EmbeddingConfig()
89
  reranker_config = RerankerConfig()
 
2
 
3
  Configuration values aligned with official Qwen3-VL model recommendations
4
  and FDAM Technical Spec requirements.
5
+
6
+ Pipeline uses:
7
+ - Vision: Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (single model, FP8 via vLLM)
8
+ - Embedding: Qwen/Qwen3-VL-Embedding-2B (2048-dim)
9
+ - Reranker: Qwen/Qwen3-VL-Reranker-2B
10
  """
11
 
12
  from dataclasses import dataclass
13
 
14
 
15
  @dataclass
16
+ class VisionInferenceConfig:
17
+ """Configuration for 30B-A3B FP8 vision model inference.
18
 
19
+ Single model handles both analysis and structured JSON output.
20
+ Uses vLLM with tensor parallelism across 4 GPUs.
21
  """
22
 
23
+ max_tokens: int = 8192 # vLLM uses max_tokens not max_new_tokens
24
  temperature: float = 0.6 # Per Qwen3-VL GitHub docs
25
  top_p: float = 0.95
26
  top_k: int = 20
27
+ repetition_penalty: float = 1.0 # Per Qwen3-VL docs
28
 
29
 
30
  @dataclass
 
45
  class EmbeddingConfig:
46
  """Configuration for embedding model.
47
 
48
+ Per Qwen3-VL-Embedding-2B config.json: text_config.hidden_size = 2048
49
  """
50
 
51
+ embedding_dimension: int = 2048 # Per Qwen3-VL-Embedding-2B hidden_size
52
  normalize: bool = True # L2 normalization (per official implementation)
53
 
54
 
 
72
 
73
 
74
  # Default configurations
75
+ vision_config = VisionInferenceConfig() # Single 30B-A3B FP8 model
 
76
  generation_config = GenerationInferenceConfig()
77
  embedding_config = EmbeddingConfig()
78
  reranker_config = RerankerConfig()
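
Because the embedding dimension now appears both here and in the ChromaDB embedding functions, a small startup guard can catch a mismatch early (illustrative sketch; `SharedEmbeddingFunction` is defined in `rag/vectorstore.py`):

```python
# Illustrative consistency check: the 2048-dim setting must agree everywhere,
# otherwise the pre-built index has to be rebuilt.
from config.inference import embedding_config
from rag.vectorstore import SharedEmbeddingFunction

assert embedding_config.embedding_dimension == SharedEmbeddingFunction.EMBEDDING_DIM, (
    "Embedding dimension mismatch - rebuild with `python -m rag.index_builder --rebuild`"
)
```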
config/settings.py CHANGED
@@ -17,11 +17,14 @@ class Settings(BaseSettings):
17
  mock_models: bool = True
18
 
19
  # Model paths (for production on HuggingFace Spaces)
20
- # Dual 8B architecture: Thinking for analysis, Instruct for structured output
21
- vision_model_thinking: str = "Qwen/Qwen3-VL-8B-Thinking"
22
- vision_model_instruct: str = "Qwen/Qwen3-VL-8B-Instruct"
23
- embedding_model: str = "Qwen/Qwen3-VL-Embedding-8B"
24
- reranker_model: str = "Qwen/Qwen3-VL-Reranker-8B"
25
 
26
  # ChromaDB
27
  chroma_persist_dir: str = "./chroma_db"
 
17
  mock_models: bool = True
18
 
19
  # Model paths (for production on HuggingFace Spaces)
20
+ # Single 30B-A3B MoE model with FP8 quantization via vLLM (official, reasoning-enhanced)
21
+ vision_model: str = "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"
22
+ embedding_model: str = "Qwen/Qwen3-VL-Embedding-2B"
23
+ reranker_model: str = "Qwen/Qwen3-VL-Reranker-2B"
24
+
25
+ # vLLM configuration
26
+ vllm_tensor_parallel_size: int = 4 # Use all 4 L4 GPUs
27
+ vllm_max_model_len: int = 32768 # Context window
28
 
29
  # ChromaDB
30
  chroma_persist_dir: str = "./chroma_db"
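
For illustration, these settings are overridden through environment variables the same way as before (pydantic `BaseSettings` matches field names case-insensitively; the values below are hypothetical):

```python
import os
os.environ["VISION_MODEL"] = "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"
os.environ["VLLM_MAX_MODEL_LEN"] = "16384"   # e.g. shrink the context to save KV-cache memory

from config.settings import Settings

settings = Settings()
assert settings.vllm_max_model_len == 16384
```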
models/loader.py CHANGED
@@ -2,12 +2,13 @@
2
 
3
  Supports two loading modes:
4
  - MOCK_MODELS=true: Loads mock models (fast, for local dev on RTX 4090)
5
- - MOCK_MODELS=false: Loads all real models at startup (~68GB total)
6
 
7
  Memory Strategy (Simultaneous Loading for 4xL4 GPUs with 88GB total):
8
- - Vision Thinking 8B (~18GB) + Vision Instruct 8B (~18GB) = ~36GB
9
- - Embedding 8B (~16GB) + Reranker 8B (~16GB) = ~32GB
10
- - Total: ~68GB, leaving ~20GB headroom
 
11
  """
12
 
13
  import logging
@@ -29,7 +30,7 @@ def get_model_stack() -> ModelStack:
29
  """Get model stack based on environment configuration.
30
 
31
  For mock models: Loads mock models immediately (fast, for local dev).
32
- For real models: Loads all 4 models at startup (~68GB total).
33
  """
34
  start_time = time.time()
35
 
@@ -43,8 +44,7 @@ def get_model_stack() -> ModelStack:
43
  return stack
44
  else:
45
  logger.info("Loading REAL model stack (production mode)")
46
- logger.info(f"Vision thinking model: {settings.vision_model_thinking}")
47
- logger.info(f"Vision instruct model: {settings.vision_model_instruct}")
48
  logger.info(f"Embedding model: {settings.embedding_model}")
49
  logger.info(f"Reranker model: {settings.reranker_model}")
50
  from models.real import RealModelStack
 
2
 
3
  Supports two loading modes:
4
  - MOCK_MODELS=true: Loads mock models (fast, for local dev on RTX 4090)
5
+ - MOCK_MODELS=false: Loads all real models at startup (~38-43GB total)
6
 
7
  Memory Strategy (Simultaneous Loading for 4xL4 GPUs with 88GB total):
8
+ - Vision 30B-A3B FP8 via vLLM: ~30-35GB
9
+ - Embedding 2B: ~4GB
10
+ - Reranker 2B: ~4GB
11
+ - Total: ~38-43GB, leaving ~45GB+ headroom
12
  """
13
 
14
  import logging
 
30
  """Get model stack based on environment configuration.
31
 
32
  For mock models: Loads mock models immediately (fast, for local dev).
33
+ For real models: Loads all 3 models at startup (~38-43GB total).
34
  """
35
  start_time = time.time()
36
 
 
44
  return stack
45
  else:
46
  logger.info("Loading REAL model stack (production mode)")
47
+ logger.info(f"Vision model: {settings.vision_model} (FP8 via vLLM)")
 
48
  logger.info(f"Embedding model: {settings.embedding_model}")
49
  logger.info(f"Reranker model: {settings.reranker_model}")
50
  from models.real import RealModelStack
models/mock.py CHANGED
@@ -1,7 +1,7 @@
1
  """Mock model implementations for local development on RTX 4090.
2
 
3
- Simulates the dual 8B vision model architecture:
4
- - MockVisionModel simulates two-stage pipeline (Thinking -> Instruct)
5
  - All models loaded together at startup (no lazy loading)
6
  """
7
 
@@ -14,11 +14,10 @@ logger = logging.getLogger(__name__)
14
 
15
 
16
  class MockVisionModel:
17
- """Mock vision model that simulates dual-model pipeline output.
18
 
19
- Simulates:
20
- - Stage 1: Thinking model generates reasoning
21
- - Stage 2: Instruct model formats to JSON
22
  """
23
 
24
  ZONES = ["burn", "near-field", "far-field"]
@@ -54,15 +53,13 @@ class MockVisionModel:
54
  }
55
 
56
  def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
57
- """Return mock vision analysis simulating dual-model pipeline output."""
58
- logger.debug(f"Mock dual-model vision analysis (context: {len(context)} chars)")
59
 
60
- # Simulate Stage 1: Thinking model selects classifications
61
  selected_zone = random.choice(self.ZONES)
62
  selected_condition = random.choice(self.CONDITIONS)
63
 
64
- logger.debug("Mock Stage 1 (Thinking): Generated reasoning")
65
- logger.debug("Mock Stage 2 (Instruct): Formatted to JSON")
66
  logger.info(f"Mock vision result: zone={selected_zone}, condition={selected_condition}")
67
 
68
  # Generate 2-4 random materials
@@ -141,16 +138,16 @@ class MockVisionModel:
141
  class MockEmbeddingModel:
142
  """Mock embedding model that returns deterministic vectors.
143
 
144
- Dimension matches Qwen3-VL-Embedding-8B (4096-dim).
145
  Uses last-token pooling concept with L2 normalization.
146
  """
147
 
148
- def __init__(self, dimension: int = 4096):
149
- """Initialize with dimension matching real Qwen3-VL-Embedding-8B model."""
150
  self.dimension = dimension
151
 
152
  def embed(self, text: str) -> list[float]:
153
- """Return mock embedding vector (4096-dim, L2 normalized).
154
 
155
  Uses hash of text for reproducibility, simulating last-token pooling.
156
  """
@@ -176,7 +173,7 @@ class MockEmbeddingModel:
176
  class MockRerankerModel:
177
  """Mock reranker that returns realistic relevance scores.
178
 
179
- Simulates Qwen3-VL-Reranker behavior with 0-1 sigmoid-like scores.
180
  """
181
 
182
  def rerank(self, query: str, documents: list[str]) -> list[float]:
@@ -236,9 +233,9 @@ class MockModelStack:
236
  def load_all(self) -> "MockModelStack":
237
  """Load all mock models."""
238
  logger.info("Loading mock models for local development")
239
- logger.debug(" Vision model: MockVisionModel (simulates dual 8B pipeline)")
240
- logger.debug(" Embedding model: MockEmbeddingModel (4096-dim)")
241
- logger.debug(" Reranker model: MockRerankerModel")
242
  self._loaded = True
243
  logger.info("All mock models loaded successfully")
244
  return self
 
1
  """Mock model implementations for local development on RTX 4090.
2
 
3
+ Simulates the 30B-A3B FP8 vision model architecture:
4
+ - MockVisionModel simulates single-model analysis + JSON output
5
  - All models loaded together at startup (no lazy loading)
6
  """
7
 
 
14
 
15
 
16
  class MockVisionModel:
17
+ """Mock vision model that simulates 30B-A3B FP8 model output.
18
 
19
+ Simulates single-model analysis with structured JSON output.
20
+ The real model uses vLLM with FP8 quantization.
 
21
  """
22
 
23
  ZONES = ["burn", "near-field", "far-field"]
 
53
  }
54
 
55
  def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
56
+ """Return mock vision analysis simulating 30B-A3B FP8 model output."""
57
+ logger.debug(f"Mock 30B-A3B FP8 vision analysis (context: {len(context)} chars)")
58
 
59
+ # Simulate model generating analysis + JSON
60
  selected_zone = random.choice(self.ZONES)
61
  selected_condition = random.choice(self.CONDITIONS)
62
 
 
 
63
  logger.info(f"Mock vision result: zone={selected_zone}, condition={selected_condition}")
64
 
65
  # Generate 2-4 random materials
 
138
  class MockEmbeddingModel:
139
  """Mock embedding model that returns deterministic vectors.
140
 
141
+ Dimension matches Qwen3-VL-Embedding-2B (2048-dim).
142
  Uses last-token pooling concept with L2 normalization.
143
  """
144
 
145
+ def __init__(self, dimension: int = 2048):
146
+ """Initialize with dimension matching real Qwen3-VL-Embedding-2B model."""
147
  self.dimension = dimension
148
 
149
  def embed(self, text: str) -> list[float]:
150
+ """Return mock embedding vector (2048-dim, L2 normalized).
151
 
152
  Uses hash of text for reproducibility, simulating last-token pooling.
153
  """
 
173
  class MockRerankerModel:
174
  """Mock reranker that returns realistic relevance scores.
175
 
176
+ Simulates Qwen3-VL-Reranker-2B behavior with 0-1 sigmoid-like scores.
177
  """
178
 
179
  def rerank(self, query: str, documents: list[str]) -> list[float]:
 
233
  def load_all(self) -> "MockModelStack":
234
  """Load all mock models."""
235
  logger.info("Loading mock models for local development")
236
+ logger.debug(" Vision model: MockVisionModel (simulates 30B-A3B FP8)")
237
+ logger.debug(" Embedding model: MockEmbeddingModel (2048-dim)")
238
+ logger.debug(" Reranker model: MockRerankerModel (simulates 2B)")
239
  self._loaded = True
240
  logger.info("All mock models loaded successfully")
241
  return self
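
The deterministic mock embedding can be pictured like this (a sketch of the idea rather than the exact `MockEmbeddingModel` code: hash the text into a seed, draw a fixed 2048-dim vector, L2-normalize):

```python
import hashlib
import numpy as np

def mock_embed(text: str, dimension: int = 2048) -> list[float]:
    # Same text -> same seed -> same vector, mimicking a real embedder deterministically.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "little")
    vec = np.random.default_rng(seed).standard_normal(dimension)
    vec /= np.linalg.norm(vec)               # L2 normalization, matching the real model
    return vec.tolist()

assert mock_embed("soot on painted drywall") == mock_embed("soot on painted drywall")
```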
models/real.py CHANGED
@@ -1,17 +1,13 @@
1
  """Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).
2
 
3
- This module loads the actual Qwen3-VL models for production use.
4
- All models are loaded simultaneously at startup (~68GB total).
5
-
6
- Memory Strategy (Simultaneous Loading):
7
- - Vision Thinking 8B (~18GB): Deep analysis with reasoning chains
8
- - Vision Instruct 8B (~18GB): Structured JSON output formatting
9
- - Embedding 8B (~16GB): RAG document embedding
10
- - Reranker 8B (~16GB): RAG retrieval reranking
11
- - Total: ~68GB on 88GB available (20GB headroom)
12
 
13
  Model Loading:
14
- - Vision: Qwen3VLForConditionalGeneration (standard transformers)
15
  - Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
16
  - Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
17
  """
@@ -24,7 +20,7 @@ import torch
24
  from typing import Any
25
  from PIL import Image
26
 
27
- from config.inference import thinking_config, vision_config
28
  from config.settings import settings
29
 
30
  logger = logging.getLogger(__name__)
@@ -33,9 +29,10 @@ logger = logging.getLogger(__name__)
33
  class RealModelStack:
34
  """Real model stack for production on HuggingFace Spaces.
35
 
36
- Loads all 4 models simultaneously at initialization (~68GB total):
37
- - Dual vision (Thinking + Instruct): ~36GB
38
- - Embedding + Reranker: ~32GB
 
39
  """
40
 
41
  def __init__(self):
@@ -56,54 +53,53 @@ class RealModelStack:
56
  logger.info(f" GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {free:.1f}GB free / {total:.1f}GB total")
57
 
58
  def load_all(self) -> "RealModelStack":
59
- """Load all models simultaneously.
60
 
61
- Loads dual vision models (Thinking + Instruct) and RAG models
62
- (Embedding + Reranker) for ~68GB total VRAM usage.
63
  """
64
  if self._loaded:
65
  logger.debug("Models already loaded, skipping")
66
  return self
67
 
68
- from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
69
-
70
- device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
71
- logger.info(f"Loading all models on {device_type}")
72
  self._log_gpu_status()
73
 
74
  total_start = time.time()
75
 
76
- # Vision Thinking model (~18GB in BF16)
77
- logger.info(f"Loading vision thinking model: {settings.vision_model_thinking}")
78
- thinking_start = time.time()
79
- self.models["vision_thinking"] = Qwen3VLForConditionalGeneration.from_pretrained(
80
- settings.vision_model_thinking,
81
- torch_dtype=torch.bfloat16,
82
- device_map="auto",
83
- trust_remote_code=True,
84
- )
85
- self.processors["vision_thinking"] = AutoProcessor.from_pretrained(
86
- settings.vision_model_thinking,
87
  trust_remote_code=True,
 
 
88
  )
89
- logger.info(f"Vision thinking model loaded in {time.time() - thinking_start:.2f}s")
90
 
91
- # Vision Instruct model (~18GB in BF16)
92
- logger.info(f"Loading vision instruct model: {settings.vision_model_instruct}")
93
- instruct_start = time.time()
94
- self.models["vision_instruct"] = Qwen3VLForConditionalGeneration.from_pretrained(
95
- settings.vision_model_instruct,
96
- torch_dtype=torch.bfloat16,
97
- device_map="auto",
98
  trust_remote_code=True,
99
  )
100
- self.processors["vision_instruct"] = AutoProcessor.from_pretrained(
101
- settings.vision_model_instruct,
102
- trust_remote_code=True,
103
  )
104
- logger.info(f"Vision instruct model loaded in {time.time() - instruct_start:.2f}s")
105
 
106
- # Embedding model (~16GB in BF16) - Using official Qwen3VLEmbedder
 
 
107
  logger.info(f"Loading embedding model: {settings.embedding_model}")
108
  embed_start = time.time()
109
  from scripts.qwen3_vl import Qwen3VLEmbedder
@@ -115,7 +111,7 @@ class RealModelStack:
115
  self.processors["embedding"] = self.models["embedding"].processor
116
  logger.info(f"Embedding model loaded in {time.time() - embed_start:.2f}s")
117
 
118
- # Reranker model (~16GB in BF16) - Using official Qwen3VLReranker
119
  logger.info(f"Loading reranker model: {settings.reranker_model}")
120
  reranker_start = time.time()
121
  from scripts.qwen3_vl import Qwen3VLReranker
@@ -138,15 +134,14 @@ class RealModelStack:
138
  return self._loaded
139
 
140
  @property
141
- def vision(self) -> "DualVisionModel":
142
- """Return dual vision model wrapped for pipeline consumption."""
143
  if not self._loaded:
144
  raise RuntimeError("Models not loaded. Call load_all() first.")
145
- return DualVisionModel(
146
- thinking_model=self.models["vision_thinking"],
147
- thinking_processor=self.processors["vision_thinking"],
148
- instruct_model=self.models["vision_instruct"],
149
- instruct_processor=self.processors["vision_instruct"],
150
  )
151
 
152
  @property
@@ -164,20 +159,21 @@ class RealModelStack:
164
  return RealRerankerModel(self.models["reranker"], self.processors["reranker"])
165
 
166
 
167
- class DualVisionModel:
168
- """Dual vision model for two-stage fire damage analysis.
169
 
170
- Uses Qwen3-VL-8B-Thinking for deep analysis with reasoning chains,
171
- then Qwen3-VL-8B-Instruct to format results into structured JSON.
 
172
 
173
- Pipeline: Image -> Thinking (analysis) -> Instruct (JSON formatting) -> Output
174
  """
175
 
176
- # System prompt for FDAM fire damage assessment (per Technical Spec Section 7)
177
  VISION_SYSTEM_PROMPT = """You are an expert industrial hygienist analyzing fire damage images for the FDAM (Fire Damage Assessment Methodology) framework.
178
 
179
  ## Your Task
180
- Analyze the provided image and extract structured information about fire damage, materials, and conditions.
181
 
182
  ## Zone Classification Criteria
183
  - **Burn Zone**: Direct fire involvement. Look for structural char, complete combustion, exposed/damaged structural elements.
@@ -191,121 +187,105 @@ Analyze the provided image and extract structured information about fire damage,
191
  - **Heavy**: Thick deposits; surface texture obscured; heavy coating visible.
192
  - **Structural Damage**: Physical damage requiring repair before cleaning (charring, warping, holes, collapse).
193
 
194
- ## Material Identification
195
- Identify visible materials and categorize as:
196
  - **Non-porous**: steel, concrete, glass, metal, CMU (concrete masonry unit)
197
  - **Semi-porous**: painted drywall, sealed wood
198
  - **Porous**: unpainted drywall, carpet, insulation, acoustic tile, upholstery
199
  - **HVAC**: rigid ductwork, flexible ductwork
200
 
201
  ## Combustion Particle Visual Indicators
202
- - **Soot**: Black/dark gray coating with oily/sticky appearance; fine uniform texture; often creates "shadow" patterns
203
- - **Char**: Black angular fragments; visible wood grain or fibrous structure; larger particles
204
- - **Ash**: Gray/white powdery residue; crystalline appearance; often found with char
205
-
206
- ## Important Notes
207
- - This is VISUAL assessment only - definitive particle identification requires laboratory analysis
208
- - When uncertain between two classifications, note both with relative confidence
209
- - Flag any areas that require professional on-site verification
210
- - Note any potential access issues visible in the image"""
211
-
212
- # Analysis prompt for Thinking model (open-ended reasoning)
213
- THINKING_ANALYSIS_PROMPT = """Analyze this fire damage image thoroughly. Consider:
214
-
215
- 1. What zone classification applies (burn, near-field, or far-field) and why?
216
- 2. What is the contamination condition level (background, light, moderate, heavy, or structural-damage)?
217
- 3. What materials are visible and what is their porosity category?
218
- 4. What combustion indicators (soot, char, ash) are present and where?
219
- 5. Are there any structural concerns or access issues?
220
- 6. Where would you recommend sampling and what type of samples?
221
-
222
- Provide detailed reasoning for each assessment, explaining the visual evidence that supports your conclusions."""
223
-
224
- # Formatter prompt for Instruct model (structured JSON output)
225
- INSTRUCT_FORMATTER_SYSTEM = """You are a technical document formatter. Your task is to convert fire damage analysis into a precise JSON structure.
226
-
227
- Preserve all findings from the analysis accurately. Assign confidence scores (0.0-1.0) based on the certainty expressed in the analysis:
228
- - Very certain statements: 0.85-0.95
229
- - Reasonably confident: 0.70-0.84
230
- - Somewhat uncertain: 0.50-0.69
231
- - Uncertain/fallback: 0.30-0.49"""
232
-
233
- INSTRUCT_FORMATTER_PROMPT = """Based on the following fire damage analysis, generate a JSON response with this exact structure:
234
-
235
- <analysis>
236
- {analysis}
237
- </analysis>
238
-
239
- Generate JSON with this structure:
240
- {{
241
- "zone": {{
242
  "classification": "burn" | "near-field" | "far-field",
243
  "confidence": 0.0-1.0,
244
  "reasoning": "explanation"
245
- }},
246
- "condition": {{
247
  "level": "background" | "light" | "moderate" | "heavy" | "structural-damage",
248
  "confidence": 0.0-1.0,
249
  "reasoning": "explanation"
250
- }},
251
  "materials": [
252
- {{
253
- "type": "material type (e.g., drywall, concrete, steel, wood)",
254
  "category": "non-porous" | "semi-porous" | "porous" | "hvac",
255
  "confidence": 0.0-1.0,
256
  "location_description": "where in image",
257
- "bounding_box": {{"x": 0.0-1.0, "y": 0.0-1.0, "width": 0.0-1.0, "height": 0.0-1.0}}
258
- }}
259
  ],
260
- "combustion_indicators": {{
261
  "soot_visible": true/false,
262
  "soot_pattern": "description or null",
263
  "char_visible": true/false,
264
  "char_description": "description or null",
265
  "ash_visible": true/false,
266
  "ash_description": "description or null"
267
- }},
268
  "structural_concerns": ["list of structural issues if any"],
269
  "access_issues": ["list of access problems if any"],
270
  "recommended_sampling_locations": [
271
- {{
272
  "description": "where to sample",
273
  "sample_type": "tape_lift" | "surface_wipe" | "air_sample",
274
  "priority": "high" | "medium" | "low"
275
- }}
276
  ],
277
  "flags_for_review": ["any items requiring human review"]
278
- }}
279
 
280
  IMPORTANT: Return ONLY valid JSON, no additional text."""
281
 
282
- def __init__(self, thinking_model, thinking_processor, instruct_model, instruct_processor):
283
- self.thinking_model = thinking_model
284
- self.thinking_processor = thinking_processor
285
- self.instruct_model = instruct_model
286
- self.instruct_processor = instruct_processor
287
 
288
  def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
289
- """Analyze an image using two-stage pipeline.
 
 
 
 
290
 
291
- Stage 1: Thinking model generates detailed analysis with reasoning
292
- Stage 2: Instruct model formats the analysis into structured JSON
293
  """
294
  start_time = time.time()
295
- logger.debug(f"Starting dual-model vision analysis (context: {len(context)} chars)")
296
 
297
  try:
298
- # Stage 1: Deep analysis with Thinking model
299
- thinking_start = time.time()
300
- analysis_text = self._run_thinking_stage(image, context)
301
- thinking_time = time.time() - thinking_start
302
- logger.debug(f"Thinking stage completed in {thinking_time:.2f}s, output: {len(analysis_text)} chars")
303
-
304
- # Stage 2: Format to JSON with Instruct model
305
- instruct_start = time.time()
306
- result = self._run_instruct_stage(analysis_text)
307
- instruct_time = time.time() - instruct_start
308
- logger.debug(f"Instruct stage completed in {instruct_time:.2f}s")
 
 
 
 
 
 
 
 
 
 
 
 
 
309
 
310
  # Log result summary
311
  total_time = time.time() - start_time
@@ -314,7 +294,7 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
314
  condition = result.get("condition", {}).get("level", "unknown")
315
  condition_conf = result.get("condition", {}).get("confidence", 0)
316
  num_materials = len(result.get("materials", []))
317
- logger.info(f"Vision analysis complete in {total_time:.2f}s (thinking: {thinking_time:.2f}s, instruct: {instruct_time:.2f}s): "
318
  f"zone={zone} ({zone_conf:.2f}), condition={condition} ({condition_conf:.2f}), "
319
  f"materials={num_materials}")
320
 
@@ -324,159 +304,32 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
324
  logger.error(f"Vision analysis failed: {e}")
325
  return self._get_fallback_response(str(e))
326
 
327
- def _run_thinking_stage(self, image: Image.Image, context: str) -> str:
328
- """Run the Thinking model to generate detailed analysis."""
329
- try:
330
- from qwen_vl_utils import process_vision_info
331
- except ImportError:
332
- logger.warning("qwen_vl_utils not available, using basic processing")
333
- process_vision_info = None
334
 
335
- # Build the analysis prompt with context
336
- prompt = self.THINKING_ANALYSIS_PROMPT
 
 
 
 
337
  if context:
338
- prompt = f"Context: {context}\n\n{prompt}"
339
 
340
- # Prepare messages in Qwen-VL format with system prompt
341
  messages = [
342
- {
343
- "role": "system",
344
- "content": self.VISION_SYSTEM_PROMPT,
345
- },
346
  {
347
  "role": "user",
348
  "content": [
349
  {"type": "image", "image": image},
350
- {"type": "text", "text": prompt},
351
  ],
352
- }
353
- ]
354
-
355
- # Apply chat template with thinking enabled (default for Thinking model)
356
- text = self.thinking_processor.apply_chat_template(
357
- messages, tokenize=False, add_generation_prompt=True
358
- )
359
-
360
- # Process vision info if available
361
- if process_vision_info:
362
- image_inputs, video_inputs = process_vision_info(messages)
363
- inputs = self.thinking_processor(
364
- text=[text],
365
- images=image_inputs,
366
- videos=video_inputs,
367
- return_tensors="pt",
368
- padding=True,
369
- )
370
- else:
371
- # Fallback: basic image processing
372
- inputs = self.thinking_processor(
373
- text=[text],
374
- images=[image],
375
- return_tensors="pt",
376
- padding=True,
377
- )
378
-
379
- # Generate response using thinking config (per Qwen3-VL GitHub recommendations)
380
- logger.debug(f"Thinking inference config: max_new_tokens={thinking_config.max_new_tokens}, "
381
- f"temp={thinking_config.temperature}, top_p={thinking_config.top_p}, top_k={thinking_config.top_k}")
382
-
383
- with torch.no_grad():
384
- outputs = self.thinking_model.generate(
385
- **inputs,
386
- max_new_tokens=thinking_config.max_new_tokens,
387
- do_sample=thinking_config.do_sample,
388
- temperature=thinking_config.temperature,
389
- top_p=thinking_config.top_p,
390
- top_k=thinking_config.top_k,
391
- repetition_penalty=thinking_config.repetition_penalty,
392
- )
393
-
394
- # Decode response - get raw token IDs first for proper parsing
395
- output_ids = outputs[0].tolist()
396
-
397
- # The Thinking model's chat template includes opening <think> tag
398
- # Output format: reasoning_content</think>final_answer
399
- # Get </think> token ID dynamically from tokenizer (more robust than hardcoding)
400
- think_end_token = self.thinking_processor.tokenizer.encode(
401
- "</think>", add_special_tokens=False
402
- )[0]
403
-
404
- try:
405
- # Find the </think> token position
406
- think_end_idx = len(output_ids) - output_ids[::-1].index(think_end_token)
407
- # Extract reasoning (before </think>) and answer (after </think>)
408
- reasoning_ids = output_ids[:think_end_idx]
409
- answer_ids = output_ids[think_end_idx:]
410
-
411
- reasoning = self.thinking_processor.decode(
412
- reasoning_ids, skip_special_tokens=True
413
- ).strip()
414
- final_answer = self.thinking_processor.decode(
415
- answer_ids, skip_special_tokens=True
416
- ).strip()
417
-
418
- logger.debug(f"Extracted thinking: {len(reasoning)} chars reasoning, {len(final_answer)} chars answer")
419
- return f"Reasoning:\n{reasoning}\n\nConclusions:\n{final_answer}"
420
-
421
- except ValueError:
422
- # No </think> token found - use full response as-is
423
- response_text = self.thinking_processor.decode(
424
- output_ids, skip_special_tokens=True
425
- ).strip()
426
- logger.debug(f"No </think> token found, using full response: {len(response_text)} chars")
427
- return response_text
428
-
429
- def _run_instruct_stage(self, analysis_text: str) -> dict[str, Any]:
430
- """Run the Instruct model to format analysis into JSON."""
431
- # Prepare messages for Instruct model (text-only, no image)
432
- prompt = self.INSTRUCT_FORMATTER_PROMPT.format(analysis=analysis_text)
433
-
434
- messages = [
435
- {
436
- "role": "system",
437
- "content": self.INSTRUCT_FORMATTER_SYSTEM,
438
  },
439
- {
440
- "role": "user",
441
- "content": prompt,
442
- }
443
  ]
444
-
445
- # Apply chat template
446
- text = self.instruct_processor.apply_chat_template(
447
- messages, tokenize=False, add_generation_prompt=True
448
- )
449
-
450
- inputs = self.instruct_processor(
451
- text=[text],
452
- return_tensors="pt",
453
- padding=True,
454
- )
455
-
456
- # Generate response using vision config (low temp for consistent JSON)
457
- logger.debug(f"Instruct inference config: max_new_tokens={vision_config.max_new_tokens}, "
458
- f"temp={vision_config.temperature}")
459
-
460
- with torch.no_grad():
461
- outputs = self.instruct_model.generate(
462
- **inputs,
463
- max_new_tokens=vision_config.max_new_tokens,
464
- do_sample=vision_config.do_sample,
465
- temperature=vision_config.temperature,
466
- top_p=vision_config.top_p,
467
- repetition_penalty=vision_config.repetition_penalty,
468
- )
469
-
470
- # Decode response
471
- response_text = self.instruct_processor.decode(
472
- outputs[0], skip_special_tokens=True
473
- )
474
-
475
- # Parse JSON from response
476
- return self._parse_json_response(response_text)
477
 
478
  def _parse_json_response(self, response: str) -> dict[str, Any]:
479
- """Parse JSON response from instruct model."""
480
  try:
481
  # Try to extract JSON from response
482
  json_match = re.search(r'\{[\s\S]*\}', response)
@@ -484,7 +337,7 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
484
  json_str = json_match.group()
485
  return json.loads(json_str)
486
  else:
487
- logger.warning("No JSON found in instruct response")
488
  return self._get_fallback_response("No JSON in response")
489
  except json.JSONDecodeError as e:
490
  logger.warning(f"Failed to parse JSON: {e}")
@@ -533,6 +386,8 @@ class RealEmbeddingModel:
533
 
534
  Uses the official Qwen3VLEmbedder from QwenLM/Qwen3-VL-Embedding.
535
  The model handles last-token pooling and L2 normalization internally.
 
 
536
  """
537
 
538
  def __init__(self, model, processor):
@@ -557,7 +412,7 @@ class RealEmbeddingModel:
557
  text: Input text to embed
558
 
559
  Returns:
560
- List of floats representing the embedding (4096-dim for 8B model)
561
  """
562
  try:
563
  # Use official process() API - expects list of dicts
@@ -569,8 +424,8 @@ class RealEmbeddingModel:
569
 
570
  except Exception as e:
571
  logger.error(f"Embedding generation failed: {e}")
572
- # Return zero vector as fallback (4096-dim per Qwen3-VL-Embedding-8B)
573
- hidden_size = getattr(self.model.model.config, "hidden_size", 4096)
574
  return [0.0] * hidden_size
575
 
576
  def embed_batch(self, texts: list[str]) -> list[list[float]]:
@@ -584,7 +439,7 @@ class RealEmbeddingModel:
584
  return [emb.cpu().tolist() for emb in embeddings]
585
  except Exception as e:
586
  logger.error(f"Batch embedding generation failed: {e}")
587
- hidden_size = getattr(self.model.model.config, "hidden_size", 4096)
588
  return [[0.0] * hidden_size for _ in texts]
589
 
590
 
@@ -597,7 +452,7 @@ class RealRerankerModel:
597
  - Creates a binary linear layer: weight = yes_weight - no_weight
598
  - Scores = sigmoid(linear(last_token_hidden_state))
599
 
600
- Reference: https://github.com/QwenLM/Qwen3-VL-Embedding
601
  """
602
 
603
  def __init__(self, model, processor):
 
1
  """Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).
2
 
3
+ This module loads the production models:
4
+ - Vision: Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (~30-35GB via vLLM)
5
+ - Embedding: Qwen/Qwen3-VL-Embedding-2B (~4GB)
6
+ - Reranker: Qwen/Qwen3-VL-Reranker-2B (~4GB)
7
+ - Total: ~38-43GB on 88GB available (45GB+ headroom)
 
8
 
9
  Model Loading:
10
+ - Vision: vLLM with FP8 quantization (built-in) and tensor parallelism
11
  - Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
12
  - Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
13
  """
 
20
  from typing import Any
21
  from PIL import Image
22
 
23
+ from config.inference import vision_config
24
  from config.settings import settings
25
 
26
  logger = logging.getLogger(__name__)
 
29
  class RealModelStack:
30
  """Real model stack for production on HuggingFace Spaces.
31
 
32
+ Loads all 3 models at initialization (~38-43GB total):
33
+ - FP8 Vision via vLLM: ~30-35GB
34
+ - Embedding 2B: ~4GB
35
+ - Reranker 2B: ~4GB
36
  """
37
 
38
  def __init__(self):
 
53
  logger.info(f" GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {free:.1f}GB free / {total:.1f}GB total")
54
 
55
  def load_all(self) -> "RealModelStack":
56
+ """Load all models.
57
 
58
+ Loads FP8 vision model via vLLM and RAG models (Embedding + Reranker).
 
59
  """
60
  if self._loaded:
61
  logger.debug("Models already loaded, skipping")
62
  return self
63
 
64
+ logger.info("Loading production models...")
 
 
 
65
  self._log_gpu_status()
66
 
67
  total_start = time.time()
68
 
69
+ # Vision model via vLLM (~30-35GB in FP8)
70
+ logger.info(f"Loading vision model: {settings.vision_model}")
71
+ vision_start = time.time()
72
+
73
+ from vllm import LLM, SamplingParams
74
+ from transformers import AutoProcessor
75
+
76
+ self.models["vision"] = LLM(
77
+ model=settings.vision_model,
78
+ # FP8 quantization is built into model weights, no quantization param needed
79
+ tensor_parallel_size=settings.vllm_tensor_parallel_size,
80
  trust_remote_code=True,
81
+ gpu_memory_utilization=0.70, # Per Qwen FP8 model recommendations
82
+ max_model_len=settings.vllm_max_model_len,
83
  )
 
84
 
85
+ # Load processor for chat template formatting
86
+ self.processors["vision"] = AutoProcessor.from_pretrained(
87
+ settings.vision_model,
 
88
  trust_remote_code=True,
89
  )
90
+
91
+ # Store sampling params for inference
92
+ self.models["vision_sampling_params"] = SamplingParams(
93
+ max_tokens=vision_config.max_tokens,
94
+ temperature=vision_config.temperature,
95
+ top_p=vision_config.top_p,
96
+ top_k=vision_config.top_k,
97
+ repetition_penalty=vision_config.repetition_penalty,
98
  )
 
99
 
100
+ logger.info(f"Vision model loaded in {time.time() - vision_start:.2f}s")
101
+
102
+ # Embedding model (~4GB in BF16) - Using official Qwen3VLEmbedder
103
  logger.info(f"Loading embedding model: {settings.embedding_model}")
104
  embed_start = time.time()
105
  from scripts.qwen3_vl import Qwen3VLEmbedder
 
111
  self.processors["embedding"] = self.models["embedding"].processor
112
  logger.info(f"Embedding model loaded in {time.time() - embed_start:.2f}s")
113
 
114
+ # Reranker model (~4GB in BF16) - Using official Qwen3VLReranker
115
  logger.info(f"Loading reranker model: {settings.reranker_model}")
116
  reranker_start = time.time()
117
  from scripts.qwen3_vl import Qwen3VLReranker
 
134
  return self._loaded
135
 
136
  @property
137
+ def vision(self) -> "VisionModel":
138
+ """Return FP8 vision model wrapped for pipeline consumption."""
139
  if not self._loaded:
140
  raise RuntimeError("Models not loaded. Call load_all() first.")
141
+ return VisionModel(
142
+ model=self.models["vision"],
143
+ processor=self.processors["vision"],
144
+ sampling_params=self.models["vision_sampling_params"],
 
145
  )
146
 
147
  @property
 
159
  return RealRerankerModel(self.models["reranker"], self.processors["reranker"])
160
 
161
 
162
+ class VisionModel:
163
+ """Vision model for fire damage analysis.
164
 
165
+ Uses Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 via vLLM for inference.
166
+ Reasoning-enhanced model handles analysis with extended thinking
167
+ and outputs structured JSON.
168
 
169
+ Pipeline: Image -> Thinking Model (reasoning + JSON) -> Output
170
  """
171
 
172
+ # System prompt for FDAM fire damage assessment
173
  VISION_SYSTEM_PROMPT = """You are an expert industrial hygienist analyzing fire damage images for the FDAM (Fire Damage Assessment Methodology) framework.
174
 
175
  ## Your Task
176
+ Analyze the provided image and return a structured JSON response with fire damage assessment.
177
 
178
  ## Zone Classification Criteria
179
  - **Burn Zone**: Direct fire involvement. Look for structural char, complete combustion, exposed/damaged structural elements.
 
187
  - **Heavy**: Thick deposits; surface texture obscured; heavy coating visible.
188
  - **Structural Damage**: Physical damage requiring repair before cleaning (charring, warping, holes, collapse).
189
 
190
+ ## Material Categories
 
191
  - **Non-porous**: steel, concrete, glass, metal, CMU (concrete masonry unit)
192
  - **Semi-porous**: painted drywall, sealed wood
193
  - **Porous**: unpainted drywall, carpet, insulation, acoustic tile, upholstery
194
  - **HVAC**: rigid ductwork, flexible ductwork
195
 
196
  ## Combustion Particle Visual Indicators
197
+ - **Soot**: Black/dark gray coating with oily/sticky appearance; fine uniform texture
198
+ - **Char**: Black angular fragments; visible wood grain or fibrous structure
199
+ - **Ash**: Gray/white powdery residue; crystalline appearance"""
200
+
201
+ # JSON output format prompt
202
+ JSON_FORMAT_PROMPT = """Analyze this fire damage image and return a JSON response with this exact structure:
203
+
204
+ {
205
+ "zone": {
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
206
  "classification": "burn" | "near-field" | "far-field",
207
  "confidence": 0.0-1.0,
208
  "reasoning": "explanation"
209
+ },
210
+ "condition": {
211
  "level": "background" | "light" | "moderate" | "heavy" | "structural-damage",
212
  "confidence": 0.0-1.0,
213
  "reasoning": "explanation"
214
+ },
215
  "materials": [
216
+ {
217
+ "type": "material type",
218
  "category": "non-porous" | "semi-porous" | "porous" | "hvac",
219
  "confidence": 0.0-1.0,
220
  "location_description": "where in image",
221
+ "bounding_box": {"x": 0.0-1.0, "y": 0.0-1.0, "width": 0.0-1.0, "height": 0.0-1.0}
222
+ }
223
  ],
224
+ "combustion_indicators": {
225
  "soot_visible": true/false,
226
  "soot_pattern": "description or null",
227
  "char_visible": true/false,
228
  "char_description": "description or null",
229
  "ash_visible": true/false,
230
  "ash_description": "description or null"
231
+ },
232
  "structural_concerns": ["list of structural issues if any"],
233
  "access_issues": ["list of access problems if any"],
234
  "recommended_sampling_locations": [
235
+ {
236
  "description": "where to sample",
237
  "sample_type": "tape_lift" | "surface_wipe" | "air_sample",
238
  "priority": "high" | "medium" | "low"
239
+ }
240
  ],
241
  "flags_for_review": ["any items requiring human review"]
242
+ }
243
 
244
  IMPORTANT: Return ONLY valid JSON, no additional text."""
245
 
246
+ def __init__(self, model, processor, sampling_params):
247
+ self.model = model
248
+ self.processor = processor
249
+ self.sampling_params = sampling_params
 
250
 
251
  def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
252
+ """Analyze an image using the FP8 vision model via vLLM.
253
+
254
+ Args:
255
+ image: PIL Image to analyze
256
+ context: Optional context string (room info, etc.)
257
 
258
+ Returns:
259
+ Structured dict with zone, condition, materials, etc.
260
  """
261
  start_time = time.time()
262
+ logger.debug(f"Starting FP8 vision analysis (context: {len(context)} chars)")
263
 
264
  try:
265
+ # Build messages in Qwen3-VL format
266
+ messages = self._build_messages(image, context)
267
+
268
+ # Apply chat template to format prompt correctly
269
+ prompt = self.processor.apply_chat_template(
270
+ messages,
271
+ tokenize=False,
272
+ add_generation_prompt=True,
273
+ )
274
+
275
+ # Generate response using vLLM multimodal API
276
+ # Per vLLM docs: pass PIL image directly in multi_modal_data dict
277
+ outputs = self.model.generate(
278
+ prompts=[{
279
+ "prompt": prompt,
280
+ "multi_modal_data": {"image": image}, # Single PIL image
281
+ }],
282
+ sampling_params=self.sampling_params,
283
+ )
284
+
285
+ response_text = outputs[0].outputs[0].text
286
+
287
+ # Parse JSON from response
288
+ result = self._parse_json_response(response_text)
289
 
290
  # Log result summary
291
  total_time = time.time() - start_time
 
294
  condition = result.get("condition", {}).get("level", "unknown")
295
  condition_conf = result.get("condition", {}).get("confidence", 0)
296
  num_materials = len(result.get("materials", []))
297
+ logger.info(f"Vision analysis complete in {total_time:.2f}s: "
298
  f"zone={zone} ({zone_conf:.2f}), condition={condition} ({condition_conf:.2f}), "
299
  f"materials={num_materials}")
300
 
 
304
  logger.error(f"Vision analysis failed: {e}")
305
  return self._get_fallback_response(str(e))
306
 
307
+ def _build_messages(self, image: Image.Image, context: str) -> list[dict]:
308
+ """Build messages in Qwen3-VL format for chat template.
 
 
 
 
 
309
 
310
+ Qwen3-VL expects:
311
+ - System message with role="system"
312
+ - User message with mixed content [{"type": "image", ...}, {"type": "text", ...}]
313
+ """
314
+ # Build user text content
315
+ user_text = self.JSON_FORMAT_PROMPT
316
  if context:
317
+ user_text = f"Context: {context}\n\n{user_text}"
318
 
 
319
  messages = [
320
+ {"role": "system", "content": self.VISION_SYSTEM_PROMPT},
 
 
 
321
  {
322
  "role": "user",
323
  "content": [
324
  {"type": "image", "image": image},
325
+ {"type": "text", "text": user_text},
326
  ],
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
327
  },
 
 
 
 
328
  ]
329
+ return messages
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
330
 
331
  def _parse_json_response(self, response: str) -> dict[str, Any]:
332
+ """Parse JSON response from model."""
333
  try:
334
  # Try to extract JSON from response
335
  json_match = re.search(r'\{[\s\S]*\}', response)
 
337
  json_str = json_match.group()
338
  return json.loads(json_str)
339
  else:
340
+ logger.warning("No JSON found in response")
341
  return self._get_fallback_response("No JSON in response")
342
  except json.JSONDecodeError as e:
343
  logger.warning(f"Failed to parse JSON: {e}")
 
386
 
387
  Uses the official Qwen3VLEmbedder from QwenLM/Qwen3-VL-Embedding.
388
  The model handles last-token pooling and L2 normalization internally.
389
+
390
+ Model: Qwen/Qwen3-VL-Embedding-2B (2048-dim output)
391
  """
392
 
393
  def __init__(self, model, processor):
 
412
  text: Input text to embed
413
 
414
  Returns:
415
+ List of floats representing the embedding (2048-dim for 2B model)
416
  """
417
  try:
418
  # Use official process() API - expects list of dicts
 
424
 
425
  except Exception as e:
426
  logger.error(f"Embedding generation failed: {e}")
427
+ # Return zero vector as fallback (2048-dim per Qwen3-VL-Embedding-2B)
428
+ hidden_size = getattr(self.model.model.config, "hidden_size", 2048)
429
  return [0.0] * hidden_size
430
 
431
  def embed_batch(self, texts: list[str]) -> list[list[float]]:
 
439
  return [emb.cpu().tolist() for emb in embeddings]
440
  except Exception as e:
441
  logger.error(f"Batch embedding generation failed: {e}")
442
+ hidden_size = getattr(self.model.model.config, "hidden_size", 2048)
443
  return [[0.0] * hidden_size for _ in texts]
444
 
445
 
 
452
  - Creates a binary linear layer: weight = yes_weight - no_weight
453
  - Scores = sigmoid(linear(last_token_hidden_state))
454
 
455
+ Model: Qwen/Qwen3-VL-Reranker-2B
456
  """
457
 
458
  def __init__(self, model, processor):
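
The yes/no scoring described in the docstring can be sketched as follows (hedged illustration of the idea only; the official `Qwen3VLReranker` handles tokenization, batching, and device placement itself):

```python
import torch

def relevance_score(last_hidden: torch.Tensor,     # (hidden_size,) last-token hidden state
                    lm_head_weight: torch.Tensor,  # (vocab_size, hidden_size)
                    yes_id: int, no_id: int) -> float:
    # Binary linear layer: weight = yes_weight - no_weight, score = sigmoid(w . h)
    weight = lm_head_weight[yes_id] - lm_head_weight[no_id]
    return torch.sigmoid(weight @ last_hidden).item()
```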
rag/vectorstore.py CHANGED
@@ -22,10 +22,10 @@ class MockEmbeddingFunction:
22
  """Mock embedding function for local development.
23
 
24
  Generates deterministic pseudo-embeddings based on text hash.
25
- Produces 4096-dimensional vectors (matches Qwen3-VL-Embedding-8B).
26
  """
27
 
28
- EMBEDDING_DIM = 4096 # Per Qwen3-VL-Embedding-8B hidden_size
29
 
30
  def __call__(self, input: list[str]) -> list[list[float]]:
31
  """Generate mock embeddings for a list of texts."""
@@ -67,7 +67,7 @@ class SharedEmbeddingFunction:
67
  For ChromaDB compatibility, this wraps the model stack's embedding model.
68
  """
69
 
70
- EMBEDDING_DIM = 4096 # Per Qwen3-VL-Embedding-8B hidden_size
71
 
72
  def __call__(self, input: list[str]) -> list[list[float]]:
73
  """Generate embeddings using the shared model from model stack."""
 
22
  """Mock embedding function for local development.
23
 
24
  Generates deterministic pseudo-embeddings based on text hash.
25
+ Produces 2048-dimensional vectors (matches Qwen3-VL-Embedding-2B).
26
  """
27
 
28
+ EMBEDDING_DIM = 2048 # Per Qwen3-VL-Embedding-2B hidden_size
29
 
30
  def __call__(self, input: list[str]) -> list[list[float]]:
31
  """Generate mock embeddings for a list of texts."""
 
67
  For ChromaDB compatibility, this wraps the model stack's embedding model.
68
  """
69
 
70
+ EMBEDDING_DIM = 2048 # Per Qwen3-VL-Embedding-2B hidden_size
71
 
72
  def __call__(self, input: list[str]) -> list[list[float]]:
73
  """Generate embeddings using the shared model from model stack."""
requirements.txt CHANGED
@@ -5,6 +5,9 @@ accelerate
5
  qwen-vl-utils>=0.0.14
6
  torchvision
7
 
 
 
 
8
  # UI
9
  gradio>=6.0.0,<7.0.0
10
 
 
5
  qwen-vl-utils>=0.0.14
6
  torchvision
7
 
8
+ # vLLM for FP8 quantized model inference (>=0.11.0 required for Qwen3-VL support)
9
+ vllm>=0.11.0
10
+
11
  # UI
12
  gradio>=6.0.0,<7.0.0
13
 
scripts/qwen3_vl/__init__.py CHANGED
@@ -4,8 +4,8 @@ Source: https://github.com/QwenLM/Qwen3-VL-Embedding
4
  License: Apache 2.0
5
 
6
  These are the official loading classes for:
7
- - Qwen/Qwen3-VL-Embedding-8B
8
- - Qwen/Qwen3-VL-Reranker-8B
9
  """
10
 
11
  from scripts.qwen3_vl.qwen3_vl_embedding import Qwen3VLEmbedder, Qwen3VLForEmbedding
 
4
  License: Apache 2.0
5
 
6
  These are the official loading classes for:
7
+ - Qwen/Qwen3-VL-Embedding-2B (or 8B)
8
+ - Qwen/Qwen3-VL-Reranker-2B (or 8B)
9
  """
10
 
11
  from scripts.qwen3_vl.qwen3_vl_embedding import Qwen3VLEmbedder, Qwen3VLForEmbedding