Commit 333c083
Parent(s): 3b08f11
Replace 30B MoE with dual 8B models (Thinking + Instruct)
Architecture change:
- Vision: Qwen3-VL-30B-A3B-Instruct → dual Qwen3-VL-8B-Thinking + 8B-Instruct
- Two-stage pipeline: Thinking (deep analysis) → Instruct (JSON formatting)
- VRAM: 90GB → 68GB (~22GB savings, 20GB headroom on 4xL4)
Key changes:
- models/real.py: New DualVisionModel with token-based </think> parsing
- config/settings.py: Dual model paths (vision_model_thinking, vision_model_instruct)
- config/inference.py: ThinkingInferenceConfig (temp=0.6, max_tokens=32768)
- Removed all lazy loading code (load_vision/unload_vision/load_rag)
- All 4 models now load simultaneously at startup
Per Qwen3-VL GitHub recommended hyperparameters for thinking models.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- .env.example +3 -2
- CLAUDE.md +19 -12
- FDAM_AI_Pipeline_Technical_Spec.md +36 -41
- README.md +3 -2
- config/inference.py +21 -4
- config/settings.py +3 -4
- models/loader.py +15 -16
- models/mock.py +49 -23
- models/real.py +280 -250
- pipeline/main.py +0 -14
- rag/retriever.py +2 -8
- rag/vectorstore.py +2 -8
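The VRAM figures quoted in the commit message reconcile as follows (a quick check using only the numbers stated in this commit):

```python
# Budget check from the commit message figures (BF16 weights only)
vision = 18 + 18          # Qwen3-VL-8B-Thinking + Qwen3-VL-8B-Instruct
rag = 16 + 16             # Qwen3-VL-Embedding-8B + Qwen3-VL-Reranker-8B
total = vision + rag      # 68 GB used
headroom = 88 - total     # 20 GB left on 4xL4 (88 GB usable)
savings = 90 - total      # ~22 GB vs the previous 30B MoE stack
print(total, headroom, savings)  # 68 20 22
```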
.env.example
CHANGED
@@ -8,7 +8,8 @@ MOCK_MODELS=true
 SERVER_HOST=0.0.0.0
 SERVER_PORT=7860
 
-# Optional: Override model paths
-#
+# Optional: Override model paths (Dual 8B architecture)
+# VISION_MODEL_THINKING=Qwen/Qwen3-VL-8B-Thinking
+# VISION_MODEL_INSTRUCT=Qwen/Qwen3-VL-8B-Instruct
 # EMBEDDING_MODEL=Qwen/Qwen3-VL-Embedding-8B
 # RERANKER_MODEL=Qwen/Qwen3-VL-Reranker-8B
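The two new optional variables line up with the pydantic `Settings` fields added later in this commit (see config/settings.py below). A minimal sketch of that mapping, assuming pydantic-settings matches environment variables to field names case-insensitively; the import path and .env wiring are assumptions not shown in the diff:

```python
# Sketch only: field names mirror the env vars from .env.example above.
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    vision_model_thinking: str = "Qwen/Qwen3-VL-8B-Thinking"
    vision_model_instruct: str = "Qwen/Qwen3-VL-8B-Instruct"

# VISION_MODEL_THINKING=... in the environment (or a loaded .env) overrides the default.
settings = Settings()
print(settings.vision_model_thinking)
```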
CLAUDE.md
CHANGED
@@ -13,7 +13,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Critical Constraints
 
 1. **No External API Calls** - 100% locally-owned models only (no Claude/OpenAI APIs)
-2. **Memory Budget** - 4xL4
+2. **Memory Budget** - 4xL4 88GB usable: ~36GB vision (dual 8B) + ~16GB embedding + ~16GB reranker (~68GB used, ~20GB headroom)
 3. **Processing Time** - 60-90 seconds per assessment is acceptable
 4. **MVP Scope** - Phase 1 (PRE) and Phase 2 (PRA) only; no lab results processing yet
 5. **Static RAG** - Knowledge base is pre-indexed; no user document uploads
@@ -23,7 +23,8 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 | Component | Technology |
 |-----------|------------|
 | UI Framework | Gradio 6.x |
-| Vision
+| Vision (Thinking) | Qwen3-VL-8B-Thinking |
+| Vision (Instruct) | Qwen3-VL-8B-Instruct |
 | Embeddings | Qwen3-VL-Embedding-8B |
 | Reranker | Qwen3-VL-Reranker-8B |
 | Vector Store | ChromaDB 0.4.x |
@@ -148,28 +149,34 @@ Source documents in `/RAG-KB/`:
 
 ## Multi-GPU Model Loading
 
-
+All 4 models are loaded simultaneously at startup (~68GB total on 4xL4 GPUs):
 
 ```python
-
-
+# Vision models (dual 8B architecture)
+thinking_model = Qwen3VLForConditionalGeneration.from_pretrained(
+    "Qwen/Qwen3-VL-8B-Thinking",
     torch_dtype=torch.bfloat16,
-    device_map="auto",
+    device_map="auto",
+    trust_remote_code=True
+)
+instruct_model = Qwen3VLForConditionalGeneration.from_pretrained(
+    "Qwen/Qwen3-VL-8B-Instruct",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
     trust_remote_code=True
 )
 ```
 
-Expected distribution (BF16, ~
-- Vision model (
+Expected distribution (BF16, ~68GB total):
+- Vision Thinking model (8B): ~18GB
+- Vision Instruct model (8B): ~18GB
 - Embedding model (8B): ~16GB
 - Reranker model (8B): ~16GB
-- Headroom: ~
-
-**Fallback**: If VRAM issues arise, use `Qwen/Qwen3-VL-8B-Instruct` (~16GB) instead of 30B
+- Headroom: ~20GB for KV cache and overhead
 
 ## Local Development Strategy
 
-The RTX 4090 (24GB VRAM) cannot run the full model stack (~
+The RTX 4090 (24GB VRAM) cannot run the full model stack (~68GB required). Use this workflow:
 
 1. Set `MOCK_MODELS=true` environment variable
 2. Mock responses return realistic JSON matching vision output schema
FDAM_AI_Pipeline_Technical_Spec.md
CHANGED
@@ -34,7 +34,7 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 ### Key Constraints
 - 100% locally-owned models (no Claude/OpenAI API calls)
-- HuggingFace Spaces deployment with Nvidia
+- HuggingFace Spaces deployment with Nvidia 4xL4 (88GB total)
 - 60-90 second processing time acceptable
 - Static RAG knowledge base (no user-uploaded documents)
 
@@ -75,7 +75,7 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 ▼
 ┌─────────────────────────────────────────────────────────────────────────────┐
 │ VISION ANALYSIS MODULE │
-│
+│ (Qwen3-VL-8B-Thinking → Qwen3-VL-8B-Instruct) │
 ├─────────────────────────────────────────────────────────────────────────────┤
 │ Per Image: │
 │ ├── Zone Classification (Burn/Near-Field/Far-Field) + confidence │
@@ -113,7 +113,7 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 ▼
 ┌─────────────────────────────────────────────────────────────────────────────┐
 │ DOCUMENT GENERATION MODULE │
-│
+│ (Deterministic template + calculations) │
 ├─────────────────────────────────────────────────────────────────────────────┤
 │ Outputs: │
 │ ├── Cleaning Specification / SOW (primary) │
@@ -144,12 +144,13 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 | Component | Technology | Version |
 |-----------|------------|---------|
 | Platform | HuggingFace Spaces | - |
-| GPU | Nvidia
-| Vision
+| GPU | Nvidia 4xL4 | 88GB total |
+| Vision (Thinking) | Qwen3-VL-8B-Thinking | Latest |
+| Vision (Instruct) | Qwen3-VL-8B-Instruct | Latest |
 | Embedding Model | Qwen3-VL-Embedding-8B | Latest |
 | Reranker Model | Qwen3-VL-Reranker-8B | Latest |
 | Vector Store | ChromaDB | 0.4.x |
-| UI Framework | Gradio |
+| UI Framework | Gradio | 6.x |
 | PDF Generation | Pandoc | 3.x |
 | Image Processing | Pillow, OpenCV | Latest |
 
@@ -157,16 +158,16 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 ## 3. Model Stack Configuration
 
-### Memory Budget (
+### Memory Budget (4xL4 88GB)
 
 | Component | VRAM | Status |
 |-----------|------|--------|
-| Qwen3-VL-
+| Qwen3-VL-8B-Thinking | ~18GB | Always loaded |
+| Qwen3-VL-8B-Instruct | ~18GB | Always loaded |
 | Qwen3-VL-Embedding-8B | ~16GB | Always loaded |
 | Qwen3-VL-Reranker-8B | ~16GB | Always loaded |
-|
-| **Available Headroom** | ~
-| **Total** | ~61GB | ✅ Fits |
+| **Total** | ~68GB | ✅ Fits |
+| **Available Headroom** | ~20GB | KV cache + overhead |
 
 ### Model Loading Configuration
 
@@ -175,58 +176,52 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 import torch
 from transformers import (
-
+    Qwen3VLForConditionalGeneration,
     AutoProcessor,
-    AutoModel,
-    AutoTokenizer
 )
 
 class ModelStack:
-    """Manages all models with concurrent loading on
-
+    """Manages all models with concurrent loading on 4xL4 (88GB total)."""
+
     def __init__(self, device="cuda"):
         self.device = device
         self.models = {}
         self.processors = {}
-
+
     def load_all(self):
-        """Load all models into VRAM."""
-
-
-
+        """Load all models into VRAM (~68GB total)."""
+        # Dual vision architecture
+        print("Loading Qwen3-VL-8B-Thinking (Vision Analysis)...")
+        self.models["vision_thinking"] = Qwen3VLForConditionalGeneration.from_pretrained(
+            "Qwen/Qwen3-VL-8B-Thinking",
            torch_dtype=torch.bfloat16,
            device_map="auto",
            trust_remote_code=True
        )
-        self.processors["
-            "Qwen/Qwen3-VL-
+        self.processors["vision_thinking"] = AutoProcessor.from_pretrained(
+            "Qwen/Qwen3-VL-8B-Thinking",
            trust_remote_code=True
        )
-
-        print("Loading Qwen3-VL-
-        self.models["
-            "Qwen/Qwen3-VL-
+
+        print("Loading Qwen3-VL-8B-Instruct (JSON Formatting)...")
+        self.models["vision_instruct"] = Qwen3VLForConditionalGeneration.from_pretrained(
+            "Qwen/Qwen3-VL-8B-Instruct",
            torch_dtype=torch.bfloat16,
            device_map="auto",
            trust_remote_code=True
        )
-        self.processors["
-            "Qwen/Qwen3-VL-
+        self.processors["vision_instruct"] = AutoProcessor.from_pretrained(
+            "Qwen/Qwen3-VL-8B-Instruct",
            trust_remote_code=True
        )
-
+
+        # RAG models
+        print("Loading Qwen3-VL-Embedding-8B (Multimodal RAG)...")
+        # Uses official Qwen3VLEmbedder from scripts/qwen3_vl/
+
        print("Loading Qwen3-VL-Reranker-8B (Retrieval Precision)...")
-
-
-            torch_dtype=torch.bfloat16,
-            device_map="auto",
-            trust_remote_code=True
-        )
-        self.processors["reranker"] = AutoProcessor.from_pretrained(
-            "Qwen/Qwen3-VL-Reranker-8B",
-            trust_remote_code=True
-        )
-
+        # Uses official Qwen3VLReranker from scripts/qwen3_vl/
+
        print("All models loaded successfully.")
        return self
 
README.md
CHANGED
@@ -32,8 +32,9 @@ suggested_hardware: l4x4
 
 ## Technical Details
 
-### Model Stack (~
-- **Vision**: Qwen3-VL-
+### Model Stack (~68GB VRAM)
+- **Vision (Thinking)**: Qwen3-VL-8B-Thinking (~18GB) - Deep analysis with reasoning
+- **Vision (Instruct)**: Qwen3-VL-8B-Instruct (~18GB) - Structured JSON output
 - **Embeddings**: Qwen3-VL-Embedding-8B (~16GB)
 - **Reranker**: Qwen3-VL-Reranker-8B (~16GB)
 
config/inference.py
CHANGED
@@ -7,15 +7,31 @@ and FDAM Technical Spec requirements.
 from dataclasses import dataclass
 
 
+@dataclass
+class ThinkingInferenceConfig:
+    """Configuration for 8B-Thinking model inference.
+
+    Per Qwen3-VL GitHub recommended hyperparameters for thinking models.
+    Used for deep analysis with <think> chains.
+    """
+
+    max_new_tokens: int = 32768  # Extended for reasoning chains (model supports 40960)
+    temperature: float = 0.6  # Per Qwen3-VL GitHub docs
+    top_p: float = 0.95
+    top_k: int = 20
+    do_sample: bool = True
+    repetition_penalty: float = 1.0  # Per Qwen3-VL docs (not presence_penalty)
+
+
 @dataclass
 class VisionInferenceConfig:
-    """Configuration for
+    """Configuration for 8B-Instruct model inference.
 
-    Per FDAM Technical Spec Section 3
+    Per FDAM Technical Spec Section 3. Used for structured JSON output.
     """
 
     max_new_tokens: int = 4096
-    temperature: float = 0.1  # Low temperature for deterministic output
+    temperature: float = 0.1  # Low temperature for deterministic JSON output
     top_p: float = 0.9
     do_sample: bool = True
     repetition_penalty: float = 1.1  # Reduce repetition in generated text
@@ -66,7 +82,8 @@ class RAGConfig:
 
 
 # Default configurations
-
+thinking_config = ThinkingInferenceConfig()
+vision_config = VisionInferenceConfig()  # Now used for Instruct model
 generation_config = GenerationInferenceConfig()
 embedding_config = EmbeddingConfig()
 reranker_config = RerankerConfig()
config/settings.py
CHANGED
@@ -17,13 +17,12 @@ class Settings(BaseSettings):
     mock_models: bool = True
 
     # Model paths (for production on HuggingFace Spaces)
-
+    # Dual 8B architecture: Thinking for analysis, Instruct for structured output
+    vision_model_thinking: str = "Qwen/Qwen3-VL-8B-Thinking"
+    vision_model_instruct: str = "Qwen/Qwen3-VL-8B-Instruct"
     embedding_model: str = "Qwen/Qwen3-VL-Embedding-8B"
     reranker_model: str = "Qwen/Qwen3-VL-Reranker-8B"
 
-    # Fallback vision model if VRAM issues
-    vision_model_fallback: str = "Qwen/Qwen3-VL-8B-Instruct"
-
     # ChromaDB
     chroma_persist_dir: str = "./chroma_db"
 
models/loader.py
CHANGED
@@ -1,13 +1,13 @@
 """Model loading with mock/real switching based on environment.
 
 Supports two loading modes:
-- MOCK_MODELS=true: Loads
-- MOCK_MODELS=false:
+- MOCK_MODELS=true: Loads mock models (fast, for local dev on RTX 4090)
+- MOCK_MODELS=false: Loads all real models at startup (~68GB total)
 
-
-- Vision
--
--
+Memory Strategy (Simultaneous Loading for 4xL4 GPUs with 88GB total):
+- Vision Thinking 8B (~18GB) + Vision Instruct 8B (~18GB) = ~36GB
+- Embedding 8B (~16GB) + Reranker 8B (~16GB) = ~32GB
+- Total: ~68GB, leaving ~20GB headroom
 """
 
 import logging
@@ -28,8 +28,8 @@ _model_stack: ModelStack | None = None
 def get_model_stack() -> ModelStack:
     """Get model stack based on environment configuration.
 
-    For mock models: Loads
-    For real models:
+    For mock models: Loads mock models immediately (fast, for local dev).
+    For real models: Loads all 4 models at startup (~68GB total).
     """
     start_time = time.time()
 
@@ -42,25 +42,24 @@ def get_model_stack() -> ModelStack:
         logger.info(f"Mock model stack loaded in {elapsed:.2f}s")
         return stack
     else:
-        logger.info("
-        logger.info(f"Vision model: {settings.
+        logger.info("Loading REAL model stack (production mode)")
+        logger.info(f"Vision thinking model: {settings.vision_model_thinking}")
+        logger.info(f"Vision instruct model: {settings.vision_model_instruct}")
         logger.info(f"Embedding model: {settings.embedding_model}")
         logger.info(f"Reranker model: {settings.reranker_model}")
-        logger.info("NOTE: Models will be loaded on-demand by pipeline stages")
         from models.real import RealModelStack
 
-        #
-        stack = RealModelStack()
+        # Load all models at startup (simultaneous loading)
+        stack = RealModelStack().load_all()
         elapsed = time.time() - start_time
-        logger.info(f"Real model stack
+        logger.info(f"Real model stack loaded in {elapsed:.2f}s")
         return stack
 
 
 def get_models() -> ModelStack:
     """Get or create the singleton model stack.
 
-
-    Call stack.load_vision() or stack.load_rag() as needed.
+    Returns fully loaded model stack (all models ready for inference).
     """
     global _model_stack
     if _model_stack is None:
models/mock.py
CHANGED
@@ -1,4 +1,9 @@
-"""Mock model implementations for local development on RTX 4090.
+"""Mock model implementations for local development on RTX 4090.
+
+Simulates the dual 8B vision model architecture:
+- MockVisionModel simulates two-stage pipeline (Thinking -> Instruct)
+- All models loaded together at startup (no lazy loading)
+"""
 
 import logging
 import random
@@ -9,7 +14,12 @@ logger = logging.getLogger(__name__)
 
 
 class MockVisionModel:
-    """Mock vision model that
+    """Mock vision model that simulates dual-model pipeline output.
+
+    Simulates:
+    - Stage 1: Thinking model generates reasoning
+    - Stage 2: Instruct model formats to JSON
+    """
 
     ZONES = ["burn", "near-field", "far-field"]
     CONDITIONS = ["background", "light", "moderate", "heavy", "structural-damage"]
@@ -28,11 +38,31 @@ class MockVisionModel:
         {"type": "ductwork-flexible", "category": "hvac"},
     ]
 
+    # Mock reasoning patterns to simulate Thinking model output
+    REASONING_PATTERNS = {
+        "burn": "Direct fire involvement evident from structural char and complete combustion patterns.",
+        "near-field": "Adjacent to burn zone with heavy smoke deposits and heat-induced discoloration.",
+        "far-field": "Light smoke migration only, no direct heat exposure or structural damage visible.",
+    }
+
+    CONDITION_REASONING = {
+        "background": "Surfaces appear clean with no visible contamination.",
+        "light": "Faint discoloration visible, minimal deposits present.",
+        "moderate": "Clear contamination with visible film on surfaces.",
+        "heavy": "Thick deposits obscuring surface texture.",
+        "structural-damage": "Physical damage requiring repair before cleaning.",
+    }
+
     def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
-        """Return mock vision analysis
-        logger.debug(f"Mock vision analysis (context: {len(context)} chars)")
+        """Return mock vision analysis simulating dual-model pipeline output."""
+        logger.debug(f"Mock dual-model vision analysis (context: {len(context)} chars)")
+
+        # Simulate Stage 1: Thinking model selects classifications
         selected_zone = random.choice(self.ZONES)
         selected_condition = random.choice(self.CONDITIONS)
+
+        logger.debug("Mock Stage 1 (Thinking): Generated reasoning")
+        logger.debug("Mock Stage 2 (Instruct): Formatted to JSON")
         logger.info(f"Mock vision result: zone={selected_zone}, condition={selected_condition}")
 
         # Generate 2-4 random materials
@@ -62,12 +92,18 @@ class MockVisionModel:
             "zone": {
                 "classification": selected_zone,
                 "confidence": round(random.uniform(0.7, 0.95), 2),
-                "reasoning":
+                "reasoning": self.REASONING_PATTERNS.get(
+                    selected_zone,
+                    f"Mock analysis detected {selected_zone} zone characteristics",
+                ),
             },
             "condition": {
                 "level": selected_condition,
                 "confidence": round(random.uniform(0.65, 0.90), 2),
-                "reasoning":
+                "reasoning": self.CONDITION_REASONING.get(
+                    selected_condition,
+                    f"Surface shows {selected_condition} contamination levels",
+                ),
             },
             "materials": materials,
             "combustion_indicators": {
@@ -188,35 +224,25 @@ class MockRerankerModel:
 class MockModelStack:
     """Mock model stack for local development.
 
-
-    The is_vision_loaded() and is_rag_loaded() methods are provided
-    for API compatibility with the lazy loading pipeline.
+    All models loaded together at startup (matches production behavior).
     """
 
     def __init__(self):
         self.vision = MockVisionModel()
         self.embedding = MockEmbeddingModel()
         self.reranker = MockRerankerModel()
-        self.
+        self._loaded = False
 
     def load_all(self) -> "MockModelStack":
-        """
+        """Load all mock models."""
         logger.info("Loading mock models for local development")
-        logger.debug("  Vision model: MockVisionModel")
-        logger.debug("  Embedding model: MockEmbeddingModel")
+        logger.debug("  Vision model: MockVisionModel (simulates dual 8B pipeline)")
+        logger.debug("  Embedding model: MockEmbeddingModel (4096-dim)")
         logger.debug("  Reranker model: MockRerankerModel")
-        self.
+        self._loaded = True
         logger.info("All mock models loaded successfully")
         return self
 
     def is_loaded(self) -> bool:
         """Check if models are loaded."""
-        return self.
-
-    def is_vision_loaded(self) -> bool:
-        """Check if vision model is loaded (always True when loaded)."""
-        return self.loaded
-
-    def is_rag_loaded(self) -> bool:
-        """Check if RAG models are loaded (always True when loaded)."""
-        return self.loaded
+        return self._loaded
models/real.py
CHANGED
|
@@ -1,21 +1,21 @@
|
|
| 1 |
"""Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).
|
| 2 |
|
| 3 |
This module loads the actual Qwen3-VL models for production use.
|
| 4 |
-
|
| 5 |
|
| 6 |
-
Memory Strategy:
|
| 7 |
-
- Vision
|
| 8 |
-
-
|
| 9 |
-
-
|
| 10 |
-
-
|
|
|
|
| 11 |
|
| 12 |
Model Loading:
|
| 13 |
-
- Vision:
|
| 14 |
- Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
|
| 15 |
- Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
|
| 16 |
"""
|
| 17 |
|
| 18 |
-
import gc
|
| 19 |
import json
|
| 20 |
import logging
|
| 21 |
import re
|
|
@@ -24,7 +24,7 @@ import torch
|
|
| 24 |
from typing import Any
|
| 25 |
from PIL import Image
|
| 26 |
|
| 27 |
-
from config.inference import vision_config
|
| 28 |
from config.settings import settings
|
| 29 |
|
| 30 |
logger = logging.getLogger(__name__)
|
|
@@ -33,17 +33,15 @@ logger = logging.getLogger(__name__)
|
|
| 33 |
class RealModelStack:
|
| 34 |
"""Real model stack for production on HuggingFace Spaces.
|
| 35 |
|
| 36 |
-
|
| 37 |
-
-
|
| 38 |
-
-
|
| 39 |
-
- Pipeline calls load_rag() before Stage 3
|
| 40 |
"""
|
| 41 |
|
| 42 |
def __init__(self):
|
| 43 |
self.models: dict[str, Any] = {}
|
| 44 |
self.processors: dict[str, Any] = {}
|
| 45 |
-
self.
|
| 46 |
-
self._rag_loaded = False
|
| 47 |
|
| 48 |
def _log_gpu_status(self):
|
| 49 |
"""Log current GPU memory status."""
|
|
@@ -57,114 +55,53 @@ class RealModelStack:
|
|
| 57 |
free = total - allocated
|
| 58 |
logger.info(f" GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {free:.1f}GB free / {total:.1f}GB total")
|
| 59 |
|
| 60 |
-
def
|
| 61 |
-
"""Load
|
| 62 |
|
| 63 |
-
|
| 64 |
-
|
| 65 |
"""
|
| 66 |
-
if self.
|
| 67 |
-
logger.debug("
|
| 68 |
return self
|
| 69 |
|
| 70 |
-
from transformers import AutoProcessor
|
| 71 |
|
| 72 |
device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
|
| 73 |
-
logger.info(f"Loading
|
| 74 |
self._log_gpu_status()
|
| 75 |
|
| 76 |
-
|
| 77 |
-
vision_start = time.time()
|
| 78 |
-
try:
|
| 79 |
-
from transformers import Qwen3VLMoeForConditionalGeneration
|
| 80 |
-
|
| 81 |
-
self.models["vision"] = Qwen3VLMoeForConditionalGeneration.from_pretrained(
|
| 82 |
-
settings.vision_model,
|
| 83 |
-
torch_dtype=torch.bfloat16,
|
| 84 |
-
device_map="auto",
|
| 85 |
-
trust_remote_code=True,
|
| 86 |
-
)
|
| 87 |
-
self.processors["vision"] = AutoProcessor.from_pretrained(
|
| 88 |
-
settings.vision_model,
|
| 89 |
-
trust_remote_code=True,
|
| 90 |
-
)
|
| 91 |
-
logger.info(f"Vision model loaded in {time.time() - vision_start:.2f}s")
|
| 92 |
-
except Exception as e:
|
| 93 |
-
logger.warning(f"Failed to load 30B vision model: {e}")
|
| 94 |
-
logger.info(f"Falling back to {settings.vision_model_fallback}")
|
| 95 |
-
from transformers import Qwen3VLMoeForConditionalGeneration
|
| 96 |
-
|
| 97 |
-
self.models["vision"] = Qwen3VLMoeForConditionalGeneration.from_pretrained(
|
| 98 |
-
settings.vision_model_fallback,
|
| 99 |
-
torch_dtype=torch.bfloat16,
|
| 100 |
-
device_map="auto",
|
| 101 |
-
trust_remote_code=True,
|
| 102 |
-
)
|
| 103 |
-
self.processors["vision"] = AutoProcessor.from_pretrained(
|
| 104 |
-
settings.vision_model_fallback,
|
| 105 |
-
trust_remote_code=True,
|
| 106 |
-
)
|
| 107 |
-
logger.info(f"Fallback vision model loaded in {time.time() - vision_start:.2f}s")
|
| 108 |
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
""
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
logger.info("Unloading vision model to free memory for RAG...")
|
| 124 |
-
self._log_gpu_status()
|
| 125 |
-
|
| 126 |
-
try:
|
| 127 |
-
from accelerate.hooks import remove_hook_from_module
|
| 128 |
-
|
| 129 |
-
# CRITICAL: Remove hooks before deleting (required for device_map="auto")
|
| 130 |
-
model = self.models["vision"]
|
| 131 |
-
if hasattr(model, 'model'):
|
| 132 |
-
# Some wrappers have nested model
|
| 133 |
-
remove_hook_from_module(model.model, recurse=True)
|
| 134 |
-
remove_hook_from_module(model, recurse=True)
|
| 135 |
-
logger.debug("Accelerate hooks removed from vision model")
|
| 136 |
-
except ImportError:
|
| 137 |
-
logger.warning("accelerate.hooks not available, proceeding with basic cleanup")
|
| 138 |
-
except Exception as e:
|
| 139 |
-
logger.warning(f"Hook removal failed (continuing anyway): {e}")
|
| 140 |
-
|
| 141 |
-
# Delete model and processor
|
| 142 |
-
del self.models["vision"]
|
| 143 |
-
del self.processors["vision"]
|
| 144 |
-
self._vision_loaded = False
|
| 145 |
-
|
| 146 |
-
# Clear CUDA cache (may not free 100% but sufficient for sequential loading)
|
| 147 |
-
gc.collect()
|
| 148 |
-
torch.cuda.empty_cache()
|
| 149 |
-
|
| 150 |
-
logger.info("Vision model unloaded, CUDA cache cleared")
|
| 151 |
-
self._log_gpu_status()
|
| 152 |
-
|
| 153 |
-
def load_rag(self) -> "RealModelStack":
|
| 154 |
-
"""Load embedding and reranker models (~32GB total in BF16).
|
| 155 |
-
|
| 156 |
-
Call this before Stage 3 (RAG Retrieval).
|
| 157 |
-
Must call unload_vision() first to have enough memory.
|
| 158 |
-
"""
|
| 159 |
-
if self._rag_loaded:
|
| 160 |
-
logger.debug("RAG models already loaded, skipping")
|
| 161 |
-
return self
|
| 162 |
-
|
| 163 |
-
if self._vision_loaded:
|
| 164 |
-
logger.warning("Vision model still loaded! Call unload_vision() first to avoid OOM.")
|
| 165 |
|
| 166 |
-
|
| 167 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 168 |
|
| 169 |
# Embedding model (~16GB in BF16) - Using official Qwen3VLEmbedder
|
| 170 |
logger.info(f"Loading embedding model: {settings.embedding_model}")
|
|
@@ -190,59 +127,51 @@ class RealModelStack:
|
|
| 190 |
self.processors["reranker"] = self.models["reranker"].processor
|
| 191 |
logger.info(f"Reranker model loaded in {time.time() - reranker_start:.2f}s")
|
| 192 |
|
| 193 |
-
self.
|
| 194 |
-
|
|
|
|
| 195 |
self._log_gpu_status()
|
| 196 |
return self
|
| 197 |
|
| 198 |
-
def load_all(self) -> "RealModelStack":
|
| 199 |
-
"""Load all models (DEPRECATED - use lazy loading instead).
|
| 200 |
-
|
| 201 |
-
This method is kept for backward compatibility but will cause OOM
|
| 202 |
-
on 4xL4 GPUs. Use load_vision() and load_rag() sequentially instead.
|
| 203 |
-
"""
|
| 204 |
-
logger.warning("load_all() is deprecated - use load_vision() and load_rag() for lazy loading")
|
| 205 |
-
self.load_vision()
|
| 206 |
-
# Note: This WILL cause OOM on 4xL4 as vision (60GB) + RAG (32GB) > 88GB
|
| 207 |
-
self.load_rag()
|
| 208 |
-
return self
|
| 209 |
-
|
| 210 |
def is_loaded(self) -> bool:
|
| 211 |
-
"""Check if
|
| 212 |
-
return self.
|
| 213 |
-
|
| 214 |
-
def is_vision_loaded(self) -> bool:
|
| 215 |
-
"""Check if vision model is loaded."""
|
| 216 |
-
return self._vision_loaded
|
| 217 |
-
|
| 218 |
-
def is_rag_loaded(self) -> bool:
|
| 219 |
-
"""Check if RAG models are loaded."""
|
| 220 |
-
return self._rag_loaded
|
| 221 |
|
| 222 |
@property
|
| 223 |
-
def vision(self) -> "
|
| 224 |
-
"""Return vision model wrapped for pipeline consumption."""
|
| 225 |
-
if not self.
|
| 226 |
-
raise RuntimeError("
|
| 227 |
-
return
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 228 |
|
| 229 |
@property
|
| 230 |
def embedding(self) -> "RealEmbeddingModel":
|
| 231 |
"""Return embedding model wrapped for pipeline consumption."""
|
| 232 |
-
if not self.
|
| 233 |
-
raise RuntimeError("
|
| 234 |
return RealEmbeddingModel(self.models["embedding"], self.processors["embedding"])
|
| 235 |
|
| 236 |
@property
|
| 237 |
def reranker(self) -> "RealRerankerModel":
|
| 238 |
"""Return reranker model wrapped for pipeline consumption."""
|
| 239 |
-
if not self.
|
| 240 |
-
raise RuntimeError("
|
| 241 |
return RealRerankerModel(self.models["reranker"], self.processors["reranker"])
|
| 242 |
|
| 243 |
|
| 244 |
-
class
|
| 245 |
-
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 246 |
|
| 247 |
# System prompt for FDAM fire damage assessment (per Technical Spec Section 7)
|
| 248 |
VISION_SYSTEM_PROMPT = """You are an expert industrial hygienist analyzing fire damage images for the FDAM (Fire Damage Assessment Methodology) framework.
|
|
@@ -280,60 +209,123 @@ Identify visible materials and categorize as:
|
|
| 280 |
- Flag any areas that require professional on-site verification
|
| 281 |
- Note any potential access issues visible in the image"""
|
| 282 |
|
| 283 |
-
# Analysis prompt
|
| 284 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 285 |
|
| 286 |
-
|
| 287 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 288 |
"classification": "burn" | "near-field" | "far-field",
|
| 289 |
"confidence": 0.0-1.0,
|
| 290 |
"reasoning": "explanation"
|
| 291 |
-
},
|
| 292 |
-
"condition": {
|
| 293 |
"level": "background" | "light" | "moderate" | "heavy" | "structural-damage",
|
| 294 |
"confidence": 0.0-1.0,
|
| 295 |
"reasoning": "explanation"
|
| 296 |
-
},
|
| 297 |
"materials": [
|
| 298 |
-
{
|
| 299 |
"type": "material type (e.g., drywall, concrete, steel, wood)",
|
| 300 |
"category": "non-porous" | "semi-porous" | "porous" | "hvac",
|
| 301 |
"confidence": 0.0-1.0,
|
| 302 |
"location_description": "where in image",
|
| 303 |
-
"bounding_box": {"x": 0.0-1.0, "y": 0.0-1.0, "width": 0.0-1.0, "height": 0.0-1.0}
|
| 304 |
-
}
|
| 305 |
],
|
| 306 |
-
"combustion_indicators": {
|
| 307 |
"soot_visible": true/false,
|
| 308 |
"soot_pattern": "description or null",
|
| 309 |
"char_visible": true/false,
|
| 310 |
"char_description": "description or null",
|
| 311 |
"ash_visible": true/false,
|
| 312 |
"ash_description": "description or null"
|
| 313 |
-
},
|
| 314 |
"structural_concerns": ["list of structural issues if any"],
|
| 315 |
"access_issues": ["list of access problems if any"],
|
| 316 |
"recommended_sampling_locations": [
|
| 317 |
-
{
|
| 318 |
"description": "where to sample",
|
| 319 |
"sample_type": "tape_lift" | "surface_wipe" | "air_sample",
|
| 320 |
"priority": "high" | "medium" | "low"
|
| 321 |
-
}
|
| 322 |
],
|
| 323 |
"flags_for_review": ["any items requiring human review"]
|
| 324 |
-
}
|
| 325 |
|
| 326 |
IMPORTANT: Return ONLY valid JSON, no additional text."""
|
| 327 |
|
| 328 |
-
def __init__(self,
|
| 329 |
-
self.
|
| 330 |
-
self.
|
|
|
|
|
|
|
| 331 |
|
| 332 |
def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
|
| 333 |
-
"""Analyze an image
|
|
|
|
|
|
|
|
|
|
|
|
|
| 334 |
start_time = time.time()
|
| 335 |
-
logger.debug(f"Starting vision analysis (context: {len(context)} chars)")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 336 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 337 |
try:
|
| 338 |
from qwen_vl_utils import process_vision_info
|
| 339 |
except ImportError:
|
|
@@ -341,7 +333,7 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
|
|
| 341 |
process_vision_info = None
|
| 342 |
|
| 343 |
# Build the analysis prompt with context
|
| 344 |
-
prompt = self.
|
| 345 |
if context:
|
| 346 |
prompt = f"Context: {context}\n\n{prompt}"
|
| 347 |
|
|
@@ -360,104 +352,142 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
|
|
| 360 |
}
|
| 361 |
]
|
| 362 |
|
| 363 |
-
|
| 364 |
-
|
| 365 |
-
|
| 366 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 367 |
)
|
| 368 |
|
| 369 |
-
|
| 370 |
-
|
| 371 |
-
|
| 372 |
-
|
| 373 |
-
|
| 374 |
-
|
| 375 |
-
|
| 376 |
-
|
| 377 |
-
|
| 378 |
-
|
| 379 |
-
|
| 380 |
-
|
| 381 |
-
|
| 382 |
-
text=[text],
|
| 383 |
-
images=[image],
|
| 384 |
-
return_tensors="pt",
|
| 385 |
-
padding=True,
|
| 386 |
-
)
|
| 387 |
-
|
| 388 |
-
# Note: With device_map="auto", transformers handles device routing internally
|
| 389 |
-
# Do NOT call .to(device) - it breaks distributed models
|
| 390 |
-
|
| 391 |
-
# Log inference config being used
|
| 392 |
-
logger.debug(f"Vision inference config: max_new_tokens={vision_config.max_new_tokens}, "
|
| 393 |
-
f"do_sample={vision_config.do_sample}, temp={vision_config.temperature}")
|
| 394 |
-
|
| 395 |
-
# Generate response using config values
|
| 396 |
-
inference_start = time.time()
|
| 397 |
-
with torch.no_grad():
|
| 398 |
-
if vision_config.do_sample:
|
| 399 |
-
outputs = self.model.generate(
|
| 400 |
-
**inputs,
|
| 401 |
-
max_new_tokens=vision_config.max_new_tokens,
|
| 402 |
-
do_sample=True,
|
| 403 |
-
temperature=vision_config.temperature,
|
| 404 |
-
top_p=vision_config.top_p,
|
| 405 |
-
repetition_penalty=vision_config.repetition_penalty,
|
| 406 |
-
)
|
| 407 |
-
else:
|
| 408 |
-
# Deterministic mode (no sampling)
|
| 409 |
-
outputs = self.model.generate(
|
| 410 |
-
**inputs,
|
| 411 |
-
max_new_tokens=vision_config.max_new_tokens,
|
| 412 |
-
do_sample=False,
|
| 413 |
-
temperature=None,
|
| 414 |
-
top_p=None,
|
| 415 |
-
repetition_penalty=vision_config.repetition_penalty,
|
| 416 |
-
)
|
| 417 |
-
|
| 418 |
-
inference_time = time.time() - inference_start
|
| 419 |
-
logger.debug(f"Vision inference completed in {inference_time:.2f}s")
|
| 420 |
-
|
| 421 |
-
# Decode response
|
| 422 |
-
response_text = self.processor.decode(
|
| 423 |
-
outputs[0], skip_special_tokens=True
|
| 424 |
)
|
| 425 |
-
logger.debug(f"Response length: {len(response_text)} chars")
|
| 426 |
|
| 427 |
-
|
| 428 |
-
|
| 429 |
|
| 430 |
-
|
| 431 |
-
|
| 432 |
-
|
| 433 |
-
|
| 434 |
-
|
| 435 |
-
|
| 436 |
-
num_materials = len(result.get("materials", []))
|
| 437 |
-
logger.info(f"Vision analysis complete in {total_time:.2f}s: "
|
| 438 |
-
f"zone={zone} ({zone_conf:.2f}), condition={condition} ({condition_conf:.2f}), "
|
| 439 |
-
f"materials={num_materials}")
|
| 440 |
|
| 441 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 442 |
|
| 443 |
-
|
| 444 |
-
|
| 445 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 446 |
|
| 447 |
-
def
|
| 448 |
-
"""Parse JSON response from
|
| 449 |
try:
|
| 450 |
# Try to extract JSON from response
|
| 451 |
-
# Look for JSON block in various formats
|
| 452 |
json_match = re.search(r'\{[\s\S]*\}', response)
|
| 453 |
if json_match:
|
| 454 |
json_str = json_match.group()
|
| 455 |
return json.loads(json_str)
|
| 456 |
else:
|
| 457 |
-
logger.warning("No JSON found in
|
| 458 |
return self._get_fallback_response("No JSON in response")
|
| 459 |
except json.JSONDecodeError as e:
|
| 460 |
-
logger.warning(f"Failed to parse
|
| 461 |
return self._get_fallback_response(f"JSON parse error: {e}")
|
| 462 |
|
| 463 |
def _get_fallback_response(self, reason: str) -> dict[str, Any]:
|
|
|
|
| 1 |
"""Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).
|
| 2 |
|
| 3 |
This module loads the actual Qwen3-VL models for production use.
|
| 4 |
+
All models are loaded simultaneously at startup (~68GB total).
|
| 5 |
|
| 6 |
+
Memory Strategy (Simultaneous Loading):
|
| 7 |
+
- Vision Thinking 8B (~18GB): Deep analysis with reasoning chains
|
| 8 |
+
- Vision Instruct 8B (~18GB): Structured JSON output formatting
|
| 9 |
+
- Embedding 8B (~16GB): RAG document embedding
|
| 10 |
+
- Reranker 8B (~16GB): RAG retrieval reranking
|
| 11 |
+
- Total: ~68GB on 88GB available (20GB headroom)
|
| 12 |
|
| 13 |
Model Loading:
|
| 14 |
+
- Vision: Qwen3VLForConditionalGeneration (standard transformers)
|
| 15 |
- Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
|
| 16 |
- Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
|
| 17 |
"""
|
| 18 |
|
|
|
|
| 19 |
import json
|
| 20 |
import logging
|
| 21 |
import re
|
|
|
|
| 24 |
from typing import Any
|
| 25 |
from PIL import Image
|
| 26 |
|
| 27 |
+
from config.inference import thinking_config, vision_config
|
| 28 |
from config.settings import settings
|
| 29 |
|
| 30 |
logger = logging.getLogger(__name__)
|
|
|
|
| 33 |
class RealModelStack:
|
| 34 |
"""Real model stack for production on HuggingFace Spaces.
|
| 35 |
|
| 36 |
+
Loads all 4 models simultaneously at initialization (~68GB total):
|
| 37 |
+
- Dual vision (Thinking + Instruct): ~36GB
|
| 38 |
+
- Embedding + Reranker: ~32GB
|
|
|
|
| 39 |
"""
|
| 40 |
|
| 41 |
def __init__(self):
|
| 42 |
self.models: dict[str, Any] = {}
|
| 43 |
self.processors: dict[str, Any] = {}
|
| 44 |
+
self._loaded = False
|
|
|
|
| 45 |
|
| 46 |
def _log_gpu_status(self):
|
| 47 |
"""Log current GPU memory status."""
|
|
|
|
| 55 |
free = total - allocated
|
| 56 |
logger.info(f" GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {free:.1f}GB free / {total:.1f}GB total")
|
| 57 |
|
| 58 |
+
def load_all(self) -> "RealModelStack":
|
| 59 |
+
"""Load all models simultaneously.
|
| 60 |
|
| 61 |
+
Loads dual vision models (Thinking + Instruct) and RAG models
|
| 62 |
+
(Embedding + Reranker) for ~68GB total VRAM usage.
|
| 63 |
"""
|
| 64 |
+
if self._loaded:
|
| 65 |
+
logger.debug("Models already loaded, skipping")
|
| 66 |
return self
|
| 67 |
|
| 68 |
+
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
|
| 69 |
|
| 70 |
device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
|
| 71 |
+
logger.info(f"Loading all models on {device_type}")
|
| 72 |
self._log_gpu_status()
|
| 73 |
|
| 74 |
+
total_start = time.time()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 75 |
|
| 76 |
+
# Vision Thinking model (~18GB in BF16)
|
| 77 |
+
logger.info(f"Loading vision thinking model: {settings.vision_model_thinking}")
|
| 78 |
+
thinking_start = time.time()
|
| 79 |
+
self.models["vision_thinking"] = Qwen3VLForConditionalGeneration.from_pretrained(
|
| 80 |
+
settings.vision_model_thinking,
|
| 81 |
+
torch_dtype=torch.bfloat16,
|
| 82 |
+
device_map="auto",
|
| 83 |
+
trust_remote_code=True,
|
| 84 |
+
)
|
| 85 |
+
self.processors["vision_thinking"] = AutoProcessor.from_pretrained(
|
| 86 |
+
settings.vision_model_thinking,
|
| 87 |
+
trust_remote_code=True,
|
| 88 |
+
)
|
| 89 |
+
logger.info(f"Vision thinking model loaded in {time.time() - thinking_start:.2f}s")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 90 |
|
| 91 |
+
# Vision Instruct model (~18GB in BF16)
|
| 92 |
+
logger.info(f"Loading vision instruct model: {settings.vision_model_instruct}")
|
| 93 |
+
instruct_start = time.time()
|
| 94 |
+
self.models["vision_instruct"] = Qwen3VLForConditionalGeneration.from_pretrained(
|
| 95 |
+
settings.vision_model_instruct,
|
| 96 |
+
torch_dtype=torch.bfloat16,
|
| 97 |
+
device_map="auto",
|
| 98 |
+
trust_remote_code=True,
|
| 99 |
+
)
|
| 100 |
+
self.processors["vision_instruct"] = AutoProcessor.from_pretrained(
|
| 101 |
+
settings.vision_model_instruct,
|
| 102 |
+
trust_remote_code=True,
|
| 103 |
+
)
|
| 104 |
+
logger.info(f"Vision instruct model loaded in {time.time() - instruct_start:.2f}s")
|
| 105 |
|
| 106 |
# Embedding model (~16GB in BF16) - Using official Qwen3VLEmbedder
|
| 107 |
logger.info(f"Loading embedding model: {settings.embedding_model}")
|
|
|
|
| 127 |
self.processors["reranker"] = self.models["reranker"].processor
|
| 128 |
logger.info(f"Reranker model loaded in {time.time() - reranker_start:.2f}s")
|
| 129 |
|
| 130 |
+
self._loaded = True
|
| 131 |
+
total_time = time.time() - total_start
|
| 132 |
+
logger.info(f"All models loaded in {total_time:.2f}s")
|
| 133 |
self._log_gpu_status()
|
| 134 |
return self
|
| 135 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 136 |
def is_loaded(self) -> bool:
|
| 137 |
+
"""Check if models are loaded."""
|
| 138 |
+
return self._loaded
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 139 |
|
| 140 |
@property
|
| 141 |
+
def vision(self) -> "DualVisionModel":
|
| 142 |
+
"""Return dual vision model wrapped for pipeline consumption."""
|
| 143 |
+
if not self._loaded:
|
| 144 |
+
raise RuntimeError("Models not loaded. Call load_all() first.")
|
| 145 |
+
return DualVisionModel(
|
| 146 |
+
thinking_model=self.models["vision_thinking"],
|
| 147 |
+
thinking_processor=self.processors["vision_thinking"],
|
| 148 |
+
instruct_model=self.models["vision_instruct"],
|
| 149 |
+
instruct_processor=self.processors["vision_instruct"],
|
| 150 |
+
)
|
| 151 |
|
| 152 |
@property
|
| 153 |
def embedding(self) -> "RealEmbeddingModel":
|
| 154 |
"""Return embedding model wrapped for pipeline consumption."""
|
| 155 |
+
if not self._loaded:
|
| 156 |
+
raise RuntimeError("Models not loaded. Call load_all() first.")
|
| 157 |
return RealEmbeddingModel(self.models["embedding"], self.processors["embedding"])
|
| 158 |
|
| 159 |
@property
|
| 160 |
def reranker(self) -> "RealRerankerModel":
|
| 161 |
"""Return reranker model wrapped for pipeline consumption."""
|
| 162 |
+
if not self._loaded:
|
| 163 |
+
raise RuntimeError("Models not loaded. Call load_all() first.")
|
| 164 |
return RealRerankerModel(self.models["reranker"], self.processors["reranker"])
|
| 165 |
|
| 166 |
|
| 167 |
+
class DualVisionModel:
|
| 168 |
+
"""Dual vision model for two-stage fire damage analysis.
|
| 169 |
+
|
| 170 |
+
Uses Qwen3-VL-8B-Thinking for deep analysis with reasoning chains,
|
| 171 |
+
then Qwen3-VL-8B-Instruct to format results into structured JSON.
|
| 172 |
+
|
| 173 |
+
Pipeline: Image -> Thinking (analysis) -> Instruct (JSON formatting) -> Output
|
| 174 |
+
"""
|
| 175 |
|
| 176 |
# System prompt for FDAM fire damage assessment (per Technical Spec Section 7)
|
| 177 |
VISION_SYSTEM_PROMPT = """You are an expert industrial hygienist analyzing fire damage images for the FDAM (Fire Damage Assessment Methodology) framework.
|
|
|
|
| 209 |
- Flag any areas that require professional on-site verification
|
| 210 |
- Note any potential access issues visible in the image"""
|
| 211 |
|
| 212 |
+
# Analysis prompt for Thinking model (open-ended reasoning)
|
| 213 |
+
THINKING_ANALYSIS_PROMPT = """Analyze this fire damage image thoroughly. Consider:
|
| 214 |
+
|
| 215 |
+
1. What zone classification applies (burn, near-field, or far-field) and why?
|
| 216 |
+
2. What is the contamination condition level (background, light, moderate, heavy, or structural-damage)?
|
| 217 |
+
3. What materials are visible and what is their porosity category?
|
| 218 |
+
4. What combustion indicators (soot, char, ash) are present and where?
|
| 219 |
+
5. Are there any structural concerns or access issues?
|
| 220 |
+
6. Where would you recommend sampling and what type of samples?
|
| 221 |
+
|
| 222 |
+
Provide detailed reasoning for each assessment, explaining the visual evidence that supports your conclusions."""
|
| 223 |
+
|
| 224 |
+
# Formatter prompt for Instruct model (structured JSON output)
|
| 225 |
+
INSTRUCT_FORMATTER_SYSTEM = """You are a technical document formatter. Your task is to convert fire damage analysis into a precise JSON structure.
|
| 226 |
+
|
| 227 |
+
Preserve all findings from the analysis accurately. Assign confidence scores (0.0-1.0) based on the certainty expressed in the analysis:
|
| 228 |
+
- Very certain statements: 0.85-0.95
|
| 229 |
+
- Reasonably confident: 0.70-0.84
|
| 230 |
+
- Somewhat uncertain: 0.50-0.69
|
| 231 |
+
- Uncertain/fallback: 0.30-0.49"""
|
| 232 |
|
| 233 |
+
INSTRUCT_FORMATTER_PROMPT = """Based on the following fire damage analysis, generate a JSON response with this exact structure:
|
| 234 |
+
|
| 235 |
+
<analysis>
|
| 236 |
+
{analysis}
|
| 237 |
+
</analysis>
|
| 238 |
+
|
| 239 |
+
Generate JSON with this structure:
|
| 240 |
+
{{
|
| 241 |
+
"zone": {{
|
| 242 |
"classification": "burn" | "near-field" | "far-field",
|
| 243 |
"confidence": 0.0-1.0,
|
| 244 |
"reasoning": "explanation"
|
| 245 |
+
}},
|
| 246 |
+
"condition": {{
|
| 247 |
"level": "background" | "light" | "moderate" | "heavy" | "structural-damage",
|
| 248 |
"confidence": 0.0-1.0,
|
| 249 |
"reasoning": "explanation"
|
| 250 |
+
}},
|
| 251 |
"materials": [
|
| 252 |
+
{{
|
| 253 |
"type": "material type (e.g., drywall, concrete, steel, wood)",
|
| 254 |
"category": "non-porous" | "semi-porous" | "porous" | "hvac",
|
| 255 |
"confidence": 0.0-1.0,
|
| 256 |
"location_description": "where in image",
|
| 257 |
+
"bounding_box": {{"x": 0.0-1.0, "y": 0.0-1.0, "width": 0.0-1.0, "height": 0.0-1.0}}
|
| 258 |
+
}}
|
| 259 |
],
|
| 260 |
+
"combustion_indicators": {{
|
| 261 |
"soot_visible": true/false,
|
| 262 |
"soot_pattern": "description or null",
|
| 263 |
"char_visible": true/false,
|
| 264 |
"char_description": "description or null",
|
| 265 |
"ash_visible": true/false,
|
| 266 |
"ash_description": "description or null"
|
| 267 |
+
}},
|
| 268 |
"structural_concerns": ["list of structural issues if any"],
|
| 269 |
"access_issues": ["list of access problems if any"],
|
| 270 |
"recommended_sampling_locations": [
|
| 271 |
+
{{
|
| 272 |
"description": "where to sample",
|
| 273 |
"sample_type": "tape_lift" | "surface_wipe" | "air_sample",
|
| 274 |
"priority": "high" | "medium" | "low"
|
| 275 |
+
}}
|
| 276 |
],
|
| 277 |
"flags_for_review": ["any items requiring human review"]
|
| 278 |
+
}}
|
| 279 |
|
| 280 |
IMPORTANT: Return ONLY valid JSON, no additional text."""
|
| 281 |
|
+    def __init__(self, thinking_model, thinking_processor, instruct_model, instruct_processor):
+        self.thinking_model = thinking_model
+        self.thinking_processor = thinking_processor
+        self.instruct_model = instruct_model
+        self.instruct_processor = instruct_processor

    def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
+        """Analyze an image using two-stage pipeline.
+
+        Stage 1: Thinking model generates detailed analysis with reasoning
+        Stage 2: Instruct model formats the analysis into structured JSON
+        """
        start_time = time.time()
+        logger.debug(f"Starting dual-model vision analysis (context: {len(context)} chars)")
+
+        try:
+            # Stage 1: Deep analysis with Thinking model
+            thinking_start = time.time()
+            analysis_text = self._run_thinking_stage(image, context)
+            thinking_time = time.time() - thinking_start
+            logger.debug(f"Thinking stage completed in {thinking_time:.2f}s, output: {len(analysis_text)} chars")
+
+            # Stage 2: Format to JSON with Instruct model
+            instruct_start = time.time()
+            result = self._run_instruct_stage(analysis_text)
+            instruct_time = time.time() - instruct_start
+            logger.debug(f"Instruct stage completed in {instruct_time:.2f}s")
+
+            # Log result summary
+            total_time = time.time() - start_time
+            zone = result.get("zone", {}).get("classification", "unknown")
+            zone_conf = result.get("zone", {}).get("confidence", 0)
+            condition = result.get("condition", {}).get("level", "unknown")
+            condition_conf = result.get("condition", {}).get("confidence", 0)
+            num_materials = len(result.get("materials", []))
+            logger.info(f"Vision analysis complete in {total_time:.2f}s (thinking: {thinking_time:.2f}s, instruct: {instruct_time:.2f}s): "
+                        f"zone={zone} ({zone_conf:.2f}), condition={condition} ({condition_conf:.2f}), "
+                        f"materials={num_materials}")

+            return result
+
+        except Exception as e:
+            logger.error(f"Vision analysis failed: {e}")
+            return self._get_fallback_response(str(e))
+
+    def _run_thinking_stage(self, image: Image.Image, context: str) -> str:
+        """Run the Thinking model to generate detailed analysis."""
        try:
            from qwen_vl_utils import process_vision_info
        except ImportError:
            process_vision_info = None

        # Build the analysis prompt with context
+        prompt = self.THINKING_ANALYSIS_PROMPT
        if context:
            prompt = f"Context: {context}\n\n{prompt}"

            }
        ]

+        # Apply chat template with thinking enabled (default for Thinking model)
+        text = self.thinking_processor.apply_chat_template(
+            messages, tokenize=False, add_generation_prompt=True
+        )
+
+        # Process vision info if available
+        if process_vision_info:
+            image_inputs, video_inputs = process_vision_info(messages)
+            inputs = self.thinking_processor(
+                text=[text],
+                images=image_inputs,
+                videos=video_inputs,
+                return_tensors="pt",
+                padding=True,
+            )
+        else:
+            # Fallback: basic image processing
+            inputs = self.thinking_processor(
+                text=[text],
+                images=[image],
+                return_tensors="pt",
+                padding=True,
            )

+        # Generate response using thinking config (per Qwen3-VL GitHub recommendations)
+        logger.debug(f"Thinking inference config: max_new_tokens={thinking_config.max_new_tokens}, "
+                     f"temp={thinking_config.temperature}, top_p={thinking_config.top_p}, top_k={thinking_config.top_k}")
+
+        with torch.no_grad():
+            outputs = self.thinking_model.generate(
+                **inputs,
+                max_new_tokens=thinking_config.max_new_tokens,
+                do_sample=thinking_config.do_sample,
+                temperature=thinking_config.temperature,
+                top_p=thinking_config.top_p,
+                top_k=thinking_config.top_k,
+                repetition_penalty=thinking_config.repetition_penalty,
            )

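`thinking_config` is imported from config/inference.py, which is outside this excerpt. A hedged reconstruction of what ThinkingInferenceConfig probably provides, based on the fields referenced above and the commit message (temp=0.6, max_tokens=32768); the top_p, top_k, and repetition_penalty values are assumptions taken from Qwen3-VL's recommended thinking-model settings, not from this repo:

```python
from dataclasses import dataclass


@dataclass
class ThinkingInferenceConfig:
    """Sketch only: field names mirror the attributes used by _run_thinking_stage."""

    max_new_tokens: int = 32768      # per the commit message
    do_sample: bool = True
    temperature: float = 0.6         # per the commit message
    top_p: float = 0.95              # assumed Qwen3-VL thinking default
    top_k: int = 20                  # assumed Qwen3-VL thinking default
    repetition_penalty: float = 1.0  # assumed


thinking_config = ThinkingInferenceConfig()
```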
+        # Decode response - get raw token IDs first for proper parsing
+        output_ids = outputs[0].tolist()

+        # The Thinking model's chat template includes opening <think> tag
+        # Output format: reasoning_content</think>final_answer
+        # Get </think> token ID dynamically from tokenizer (more robust than hardcoding)
+        think_end_token = self.thinking_processor.tokenizer.encode(
+            "</think>", add_special_tokens=False
+        )[0]

+        try:
+            # Find the </think> token position
+            think_end_idx = len(output_ids) - output_ids[::-1].index(think_end_token)
+            # Extract reasoning (before </think>) and answer (after </think>)
+            reasoning_ids = output_ids[:think_end_idx]
+            answer_ids = output_ids[think_end_idx:]
+
+            reasoning = self.thinking_processor.decode(
+                reasoning_ids, skip_special_tokens=True
+            ).strip()
+            final_answer = self.thinking_processor.decode(
+                answer_ids, skip_special_tokens=True
+            ).strip()
+
+            logger.debug(f"Extracted thinking: {len(reasoning)} chars reasoning, {len(final_answer)} chars answer")
+            return f"Reasoning:\n{reasoning}\n\nConclusions:\n{final_answer}"
+
+        except ValueError:
+            # No </think> token found - use full response as-is
+            response_text = self.thinking_processor.decode(
+                output_ids, skip_special_tokens=True
+            ).strip()
+            logger.debug(f"No </think> token found, using full response: {len(response_text)} chars")
+            return response_text
+
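The index arithmetic above finds the position just after the last `</think>` token by searching the reversed id list; if the token never appears, `.index()` raises ValueError and the fallback branch decodes the whole sequence. A self-contained toy example with fabricated token ids:

```python
# Fabricated ids; pretend 99 is the </think> token.
output_ids = [11, 22, 33, 99, 44, 55]
think_end_token = 99

# Position just after the last occurrence of the </think> token.
think_end_idx = len(output_ids) - output_ids[::-1].index(think_end_token)
assert think_end_idx == 4

reasoning_ids = output_ids[:think_end_idx]  # [11, 22, 33, 99] -> reasoning plus </think>
answer_ids = output_ids[think_end_idx:]     # [44, 55]         -> final answer

# A missing token reproduces the except ValueError path:
try:
    [1, 2, 3][::-1].index(think_end_token)
except ValueError:
    pass  # handled by decoding the full response instead
```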
+    def _run_instruct_stage(self, analysis_text: str) -> dict[str, Any]:
+        """Run the Instruct model to format analysis into JSON."""
+        # Prepare messages for Instruct model (text-only, no image)
+        prompt = self.INSTRUCT_FORMATTER_PROMPT.format(analysis=analysis_text)

+        messages = [
+            {
+                "role": "system",
+                "content": self.INSTRUCT_FORMATTER_SYSTEM,
+            },
+            {
+                "role": "user",
+                "content": prompt,
+            }
+        ]
+
+        # Apply chat template
+        text = self.instruct_processor.apply_chat_template(
+            messages, tokenize=False, add_generation_prompt=True
+        )
+
+        inputs = self.instruct_processor(
+            text=[text],
+            return_tensors="pt",
+            padding=True,
+        )
+
+        # Generate response using vision config (low temp for consistent JSON)
+        logger.debug(f"Instruct inference config: max_new_tokens={vision_config.max_new_tokens}, "
+                     f"temp={vision_config.temperature}")
+
+        with torch.no_grad():
+            outputs = self.instruct_model.generate(
+                **inputs,
+                max_new_tokens=vision_config.max_new_tokens,
+                do_sample=vision_config.do_sample,
+                temperature=vision_config.temperature,
+                top_p=vision_config.top_p,
+                repetition_penalty=vision_config.repetition_penalty,
+            )
+
+        # Decode response
+        response_text = self.instruct_processor.decode(
+            outputs[0], skip_special_tokens=True
+        )
+
+        # Parse JSON from response
+        return self._parse_json_response(response_text)

+    def _parse_json_response(self, response: str) -> dict[str, Any]:
+        """Parse JSON response from instruct model."""
        try:
            # Try to extract JSON from response
            json_match = re.search(r'\{[\s\S]*\}', response)
            if json_match:
                json_str = json_match.group()
                return json.loads(json_str)
            else:
+                logger.warning("No JSON found in instruct response")
                return self._get_fallback_response("No JSON in response")
        except json.JSONDecodeError as e:
+            logger.warning(f"Failed to parse JSON: {e}")
            return self._get_fallback_response(f"JSON parse error: {e}")

    def _get_fallback_response(self, reason: str) -> dict[str, Any]:
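The greedy `\{[\s\S]*\}` pattern grabs everything from the first `{` to the last `}`, which drops any prose or markdown fences the Instruct model wraps around its JSON. A quick illustration with a made-up response string:

```python
import json
import re

# Invented Instruct output: JSON wrapped in prose and a code fence.
response = 'Here is the result:\n```json\n{"zone": {"classification": "near-field", "confidence": 0.82}}\n```'

json_match = re.search(r'\{[\s\S]*\}', response)
if json_match:
    parsed = json.loads(json_match.group())
    print(parsed["zone"]["classification"])  # near-field
```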
pipeline/main.py
CHANGED

@@ -199,11 +199,6 @@ class FDAMPipeline:
        logger.info(f"Stage 2/6: Vision Analysis ({len(session.images)} images)")
        report_progress(2, "Analyzing images with AI...")
        model_stack = get_models()
-
-        # Lazy load vision model (for real models only - mock models are already loaded)
-        if hasattr(model_stack, 'load_vision') and not model_stack.is_vision_loaded():
-            logger.info("Lazy loading vision model...")
-            model_stack.load_vision()
        vision_results = {}
        annotated_images = []
        room_mapping = {}

@@ -260,20 +255,11 @@ class FDAMPipeline:
        logger.info(f"Stage 2 completed in {time.time() - stage_start:.2f}s: "
                    f"{len(vision_results)} images analyzed")

-        # Unload vision model to free memory for RAG (for real models only)
-        if hasattr(model_stack, 'unload_vision') and model_stack.is_vision_loaded():
-            logger.info("Unloading vision model to free memory for RAG...")
-            model_stack.unload_vision()
-
        # Stage 3: RAG Retrieval
        stage_start = time.time()
        logger.info("Stage 3/6: RAG Retrieval")
        report_progress(3, "Retrieving FDAM methodology context...")

-        # Lazy load RAG models (for real models only - mock models are already loaded)
-        if hasattr(model_stack, 'load_rag') and not model_stack.is_rag_loaded():
-            logger.info("Lazy loading RAG models (embedding + reranker)...")
-            model_stack.load_rag()
        # RAG is integrated into disposition engine, just verify connection
        try:
            test_results = self.retriever.retrieve("test connection", top_k=1)
rag/retriever.py
CHANGED

@@ -88,7 +88,7 @@ class SharedReranker:
    """Reranker that uses the shared model from RealModelStack.

    This avoids loading a duplicate reranker model - instead uses the
-    model already loaded by the pipeline
+    model already loaded by the pipeline at startup.
    """

    def rerank(

@@ -109,13 +109,7 @@ class SharedReranker:

        model_stack = get_models()

-        #
-        if not model_stack.is_rag_loaded():
-            logger.warning("RAG models not loaded yet - reranking may fail")
-            # Return neutral scores as fallback
-            return [0.5] * len(documents)
-
-        # Use the shared reranker model
+        # Use the shared reranker model (always loaded at startup)
        return model_stack.reranker.rerank(query, documents)

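Call sites do not change with this edit; a hedged usage sketch (the query text, documents, and no-argument constructor are invented for illustration):

```python
# Hypothetical call site: rerank() returns one relevance score per document.
reranker = SharedReranker()
scores = reranker.rerank(
    query="soot deposition on semi-porous surfaces",
    documents=[
        "Tape-lift sampling of char and soot on painted drywall.",
        "HVAC duct inspection checklist for post-fire assessments.",
    ],
)
best_idx = max(range(len(scores)), key=scores.__getitem__)
```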
rag/vectorstore.py
CHANGED

@@ -62,7 +62,7 @@ class SharedEmbeddingFunction:
    """Embedding function that uses the shared model from RealModelStack.

    This avoids loading a duplicate embedding model - instead uses the
-    model already loaded by the pipeline
+    model already loaded by the pipeline at startup.

    For ChromaDB compatibility, this wraps the model stack's embedding model.
    """

@@ -75,13 +75,7 @@ class SharedEmbeddingFunction:

        model_stack = get_models()

-        #
-        if not model_stack.is_rag_loaded():
-            logger.warning("RAG models not loaded yet - embeddings may fail")
-            # Return zero vectors as fallback
-            return [[0.0] * self.EMBEDDING_DIM for _ in input]
-
-        # Use the shared embedding model
+        # Use the shared embedding model (always loaded at startup)
        return model_stack.embedding.embed_batch(input)

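For context, ChromaDB 0.4.x calls an embedding function as `fn(input) -> list[list[float]]`, so the shared function can be handed straight to a collection. A hedged wiring sketch (the `__call__` method name, collection name, and storage path are assumptions for illustration, not taken from this diff):

```python
import chromadb

# Hypothetical setup: SharedEmbeddingFunction is assumed to expose __call__(self, input)
# that delegates to model_stack.embedding.embed_batch(input) as shown above.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="fdam_kb",
    embedding_function=SharedEmbeddingFunction(),
)

results = collection.query(
    query_texts=["clearance criteria for near-field zones"],
    n_results=3,
)
```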