KinetoLabs Claude Opus 4.5 committed
Commit 333c083 · 1 Parent(s): 3b08f11

Replace 30B MoE with dual 8B models (Thinking + Instruct)


Architecture change:
- Vision: Qwen3-VL-30B-A3B-Instruct → dual Qwen3-VL-8B-Thinking + 8B-Instruct
- Two-stage pipeline: Thinking (deep analysis) → Instruct (JSON formatting)
- VRAM: 90GB → 68GB (~22GB savings, 20GB headroom on 4xL4)

Key changes:
- models/real.py: New DualVisionModel with token-based </think> parsing
- config/settings.py: Dual model paths (vision_model_thinking, vision_model_instruct)
- config/inference.py: ThinkingInferenceConfig (temp=0.6, max_tokens=32768)
- Removed all lazy loading code (load_vision/unload_vision/load_rag)
- All 4 models now load simultaneously at startup

Per Qwen3-VL GitHub recommended hyperparameters for thinking models.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
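The token-based `</think>` parsing called out above splits the Thinking model's raw output IDs at the last `</think>` token instead of string-matching decoded text. A minimal sketch of just that split logic, with an invented token ID (the real ID is looked up from the processor's tokenizer, as in models/real.py below):

```python
# Toy illustration of the token-based split used in DualVisionModel.
# THINK_END is a made-up ID for "</think>"; production code asks the
# tokenizer for the real one rather than hardcoding it.
THINK_END = 99

output_ids = [5, 7, 99, 12, 99, 3, 8]  # reasoning ... </think> answer

# output_ids[::-1].index(THINK_END) finds the distance of the LAST
# </think> from the end; subtracting from len() gives the index just
# past that token, so the tag stays with the reasoning slice.
split = len(output_ids) - output_ids[::-1].index(THINK_END)
reasoning_ids, answer_ids = output_ids[:split], output_ids[split:]
assert reasoning_ids == [5, 7, 99, 12, 99] and answer_ids == [3, 8]
```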

.env.example CHANGED
@@ -8,7 +8,8 @@ MOCK_MODELS=true
 SERVER_HOST=0.0.0.0
 SERVER_PORT=7860
 
-# Optional: Override model paths
-# VISION_MODEL=Qwen/Qwen3-VL-30B-A3B-Instruct
+# Optional: Override model paths (Dual 8B architecture)
+# VISION_MODEL_THINKING=Qwen/Qwen3-VL-8B-Thinking
+# VISION_MODEL_INSTRUCT=Qwen/Qwen3-VL-8B-Instruct
 # EMBEDDING_MODEL=Qwen/Qwen3-VL-Embedding-8B
 # RERANKER_MODEL=Qwen/Qwen3-VL-Reranker-8B
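Because `Settings` in config/settings.py is a pydantic `BaseSettings` subclass, these commented-out variables take effect simply by being exported; a quick sketch, assuming pydantic's default field-name-to-env-var mapping:

```python
import os

# Hypothetical override; mirrors the .env.example entry above
os.environ["VISION_MODEL_INSTRUCT"] = "Qwen/Qwen3-VL-8B-Instruct"

from config.settings import Settings

settings = Settings()
print(settings.vision_model_instruct)  # -> "Qwen/Qwen3-VL-8B-Instruct"
```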
CLAUDE.md CHANGED
@@ -13,7 +13,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Critical Constraints
 
 1. **No External API Calls** - 100% locally-owned models only (no Claude/OpenAI APIs)
-2. **Memory Budget** - 4xL4 96GB total: ~58GB vision (30B BF16) + ~16GB embedding + ~16GB reranker (~90GB used, ~6GB headroom)
+2. **Memory Budget** - 4xL4 88GB usable: ~36GB vision (dual 8B) + ~16GB embedding + ~16GB reranker (~68GB used, ~20GB headroom)
 3. **Processing Time** - 60-90 seconds per assessment is acceptable
 4. **MVP Scope** - Phase 1 (PRE) and Phase 2 (PRA) only; no lab results processing yet
 5. **Static RAG** - Knowledge base is pre-indexed; no user document uploads
@@ -23,7 +23,8 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 | Component | Technology |
 |-----------|------------|
 | UI Framework | Gradio 6.x |
-| Vision/Generation | Qwen3-VL-30B-A3B-Instruct |
+| Vision (Thinking) | Qwen3-VL-8B-Thinking |
+| Vision (Instruct) | Qwen3-VL-8B-Instruct |
 | Embeddings | Qwen3-VL-Embedding-8B |
 | Reranker | Qwen3-VL-Reranker-8B |
 | Vector Store | ChromaDB 0.4.x |
@@ -148,28 +149,34 @@ Source documents in `/RAG-KB/`:
 
 ## Multi-GPU Model Loading
 
-The 4xL4 setup requires models to be distributed across GPUs. Use `device_map="auto"` in transformers:
+All 4 models are loaded simultaneously at startup (~68GB total on 4xL4 GPUs):
 
 ```python
-model = AutoModel.from_pretrained(
-    "Qwen/Qwen3-VL-30B-A3B-Instruct",
+# Vision models (dual 8B architecture)
+thinking_model = Qwen3VLForConditionalGeneration.from_pretrained(
+    "Qwen/Qwen3-VL-8B-Thinking",
     torch_dtype=torch.bfloat16,
-    device_map="auto",  # Automatically distributes across available GPUs
+    device_map="auto",
+    trust_remote_code=True
+)
+instruct_model = Qwen3VLForConditionalGeneration.from_pretrained(
+    "Qwen/Qwen3-VL-8B-Instruct",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
     trust_remote_code=True
 )
 ```
 
-Expected distribution (BF16, ~90GB total):
-- Vision model (30B): ~58GB spread across GPUs via device_map="auto"
+Expected distribution (BF16, ~68GB total):
+- Vision Thinking model (8B): ~18GB
+- Vision Instruct model (8B): ~18GB
 - Embedding model (8B): ~16GB
 - Reranker model (8B): ~16GB
-- Headroom: ~6GB for KV cache
-
-**Fallback**: If VRAM issues arise, use `Qwen/Qwen3-VL-8B-Instruct` (~16GB) instead of 30B
+- Headroom: ~20GB for KV cache and overhead
 
 ## Local Development Strategy
 
-The RTX 4090 (24GB VRAM) cannot run the full model stack (~90GB required). Use this workflow:
+The RTX 4090 (24GB VRAM) cannot run the full model stack (~68GB required). Use this workflow:
 
 1. Set `MOCK_MODELS=true` environment variable
 2. Mock responses return realistic JSON matching vision output schema
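With `device_map="auto"`, accelerate records each submodule's placement on the model object; a small sketch for sanity-checking how the dual 8B models actually spread across the four L4s (`hf_device_map` is the attribute transformers sets on dispatched models; the printed counts below are illustrative, not measured):

```python
from collections import Counter

def placement_summary(model) -> Counter:
    """Count how many submodules landed on each device."""
    return Counter(str(device) for device in model.hf_device_map.values())

# After loading, e.g.:
#   print(placement_summary(thinking_model))
#   Counter({'0': 12, '1': 11, '2': 9, '3': 8})  # illustrative only
```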
FDAM_AI_Pipeline_Technical_Spec.md CHANGED
@@ -34,7 +34,7 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 ### Key Constraints
 - 100% locally-owned models (no Claude/OpenAI API calls)
-- HuggingFace Spaces deployment with Nvidia A100 80GB
+- HuggingFace Spaces deployment with Nvidia 4xL4 (88GB total)
 - 60-90 second processing time acceptable
 - Static RAG knowledge base (no user-uploaded documents)
 
@@ -75,7 +75,7 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 ┌─────────────────────────────────────────────────────────────────────────────┐
 │ VISION ANALYSIS MODULE │
-│ (Qwen3-VL-30B-A3B-Instruct) │
+│ (Qwen3-VL-8B-Thinking → Qwen3-VL-8B-Instruct) │
 ├─────────────────────────────────────────────────────────────────────────────┤
 │ Per Image: │
 │ ├── Zone Classification (Burn/Near-Field/Far-Field) + confidence │
@@ -113,7 +113,7 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 ┌─────────────────────────────────────────────────────────────────────────────┐
 │ DOCUMENT GENERATION MODULE │
-│ (Qwen3-VL-30B-A3B-Instruct) │
+│ (Deterministic template + calculations) │
 ├─────────────────────────────────────────────────────────────────────────────┤
 │ Outputs: │
 │ ├── Cleaning Specification / SOW (primary) │
@@ -144,12 +144,13 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 | Component | Technology | Version |
 |-----------|------------|---------|
 | Platform | HuggingFace Spaces | - |
-| GPU | Nvidia A100 | 80GB |
-| Vision/Generation Model | Qwen3-VL-30B-A3B-Instruct | Latest |
+| GPU | Nvidia 4xL4 | 88GB total |
+| Vision (Thinking) | Qwen3-VL-8B-Thinking | Latest |
+| Vision (Instruct) | Qwen3-VL-8B-Instruct | Latest |
 | Embedding Model | Qwen3-VL-Embedding-8B | Latest |
 | Reranker Model | Qwen3-VL-Reranker-8B | Latest |
 | Vector Store | ChromaDB | 0.4.x |
-| UI Framework | Gradio | 4.x |
+| UI Framework | Gradio | 6.x |
 | PDF Generation | Pandoc | 3.x |
 | Image Processing | Pillow, OpenCV | Latest |
 
@@ -157,16 +158,16 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 ## 3. Model Stack Configuration
 
-### Memory Budget (A100 80GB)
+### Memory Budget (4xL4 88GB)
 
 | Component | VRAM | Status |
 |-----------|------|--------|
-| Qwen3-VL-30B-A3B-Instruct | ~24GB | Always loaded |
+| Qwen3-VL-8B-Thinking | ~18GB | Always loaded |
+| Qwen3-VL-8B-Instruct | ~18GB | Always loaded |
 | Qwen3-VL-Embedding-8B | ~16GB | Always loaded |
 | Qwen3-VL-Reranker-8B | ~16GB | Always loaded |
-| ChromaDB + KV Cache | ~5GB | Always loaded |
-| **Available Headroom** | ~19GB | Context expansion |
-| **Total** | ~61GB | ✅ Fits |
+| **Total** | ~68GB | Fits |
+| **Available Headroom** | ~20GB | KV cache + overhead |
 
 ### Model Loading Configuration
 
@@ -175,58 +176,52 @@ Build an AI-powered fire damage assessment system that generates professional Cl
 
 import torch
 from transformers import (
-    Qwen3VLMoeForConditionalGeneration,  # Note: Qwen3-VL uses MoE architecture
+    Qwen3VLForConditionalGeneration,
     AutoProcessor,
-    AutoModel,
-    AutoTokenizer
 )
 
 class ModelStack:
-    """Manages all models with concurrent loading on A100 80GB."""
+    """Manages all models with concurrent loading on 4xL4 (88GB total)."""
 
     def __init__(self, device="cuda"):
         self.device = device
         self.models = {}
         self.processors = {}
 
     def load_all(self):
-        """Load all models into VRAM."""
-        print("Loading Qwen3-VL-30B-A3B-Instruct (Vision + Generation)...")
-        self.models["vision"] = Qwen3VLMoeForConditionalGeneration.from_pretrained(
-            "Qwen/Qwen3-VL-30B-A3B-Instruct",
+        """Load all models into VRAM (~68GB total)."""
+        # Dual vision architecture
+        print("Loading Qwen3-VL-8B-Thinking (Vision Analysis)...")
+        self.models["vision_thinking"] = Qwen3VLForConditionalGeneration.from_pretrained(
+            "Qwen/Qwen3-VL-8B-Thinking",
             torch_dtype=torch.bfloat16,
             device_map="auto",
             trust_remote_code=True
        )
-        self.processors["vision"] = AutoProcessor.from_pretrained(
-            "Qwen/Qwen3-VL-30B-A3B-Instruct",
+        self.processors["vision_thinking"] = AutoProcessor.from_pretrained(
+            "Qwen/Qwen3-VL-8B-Thinking",
            trust_remote_code=True
        )
 
-        print("Loading Qwen3-VL-Embedding-8B (Multimodal RAG)...")
-        self.models["embedding"] = AutoModel.from_pretrained(
-            "Qwen/Qwen3-VL-Embedding-8B",
+        print("Loading Qwen3-VL-8B-Instruct (JSON Formatting)...")
+        self.models["vision_instruct"] = Qwen3VLForConditionalGeneration.from_pretrained(
+            "Qwen/Qwen3-VL-8B-Instruct",
            torch_dtype=torch.bfloat16,
            device_map="auto",
            trust_remote_code=True
        )
-        self.processors["embedding"] = AutoProcessor.from_pretrained(
-            "Qwen/Qwen3-VL-Embedding-8B",
+        self.processors["vision_instruct"] = AutoProcessor.from_pretrained(
+            "Qwen/Qwen3-VL-8B-Instruct",
            trust_remote_code=True
        )
 
+        # RAG models
+        print("Loading Qwen3-VL-Embedding-8B (Multimodal RAG)...")
+        # Uses official Qwen3VLEmbedder from scripts/qwen3_vl/
+
         print("Loading Qwen3-VL-Reranker-8B (Retrieval Precision)...")
-        self.models["reranker"] = AutoModel.from_pretrained(
-            "Qwen/Qwen3-VL-Reranker-8B",
-            torch_dtype=torch.bfloat16,
-            device_map="auto",
-            trust_remote_code=True
-        )
-        self.processors["reranker"] = AutoProcessor.from_pretrained(
-            "Qwen/Qwen3-VL-Reranker-8B",
-            trust_remote_code=True
-        )
+        # Uses official Qwen3VLReranker from scripts/qwen3_vl/
 
         print("All models loaded successfully.")
         return self
 
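The budget table reduces to simple arithmetic; a trivial sketch mirroring the rows above:

```python
# Approximate BF16 footprints from the memory budget table (GB)
budget = {
    "Qwen3-VL-8B-Thinking": 18,
    "Qwen3-VL-8B-Instruct": 18,
    "Qwen3-VL-Embedding-8B": 16,
    "Qwen3-VL-Reranker-8B": 16,
}
total = sum(budget.values())  # 68
headroom = 88 - total         # 20 on 4xL4 (88GB usable)
assert total <= 88, "model stack no longer fits on 4xL4"
```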
README.md CHANGED
@@ -32,8 +32,9 @@ suggested_hardware: l4x4
 
 ## Technical Details
 
-### Model Stack (~90GB VRAM)
-- **Vision**: Qwen3-VL-30B-A3B-Instruct (~58GB)
+### Model Stack (~68GB VRAM)
+- **Vision (Thinking)**: Qwen3-VL-8B-Thinking (~18GB) - Deep analysis with reasoning
+- **Vision (Instruct)**: Qwen3-VL-8B-Instruct (~18GB) - Structured JSON output
 - **Embeddings**: Qwen3-VL-Embedding-8B (~16GB)
 - **Reranker**: Qwen3-VL-Reranker-8B (~16GB)
 
config/inference.py CHANGED
@@ -7,15 +7,31 @@ and FDAM Technical Spec requirements.
 from dataclasses import dataclass
 
 
+@dataclass
+class ThinkingInferenceConfig:
+    """Configuration for 8B-Thinking model inference.
+
+    Per Qwen3-VL GitHub recommended hyperparameters for thinking models.
+    Used for deep analysis with <think> chains.
+    """
+
+    max_new_tokens: int = 32768  # Extended for reasoning chains (model supports 40960)
+    temperature: float = 0.6  # Per Qwen3-VL GitHub docs
+    top_p: float = 0.95
+    top_k: int = 20
+    do_sample: bool = True
+    repetition_penalty: float = 1.0  # Per Qwen3-VL docs (not presence_penalty)
+
+
 @dataclass
 class VisionInferenceConfig:
-    """Configuration for vision model inference.
+    """Configuration for 8B-Instruct model inference.
 
-    Per FDAM Technical Spec Section 3 and Qwen3-VL-30B-A3B-Instruct model card.
+    Per FDAM Technical Spec Section 3. Used for structured JSON output.
     """
 
     max_new_tokens: int = 4096
-    temperature: float = 0.1  # Low temperature for deterministic output
+    temperature: float = 0.1  # Low temperature for deterministic JSON output
     top_p: float = 0.9
     do_sample: bool = True
     repetition_penalty: float = 1.1  # Reduce repetition in generated text
@@ -66,7 +82,8 @@ class RAGConfig:
 
 
 # Default configurations
-vision_config = VisionInferenceConfig()
+thinking_config = ThinkingInferenceConfig()
+vision_config = VisionInferenceConfig()  # Now used for Instruct model
 generation_config = GenerationInferenceConfig()
 embedding_config = EmbeddingConfig()
 reranker_config = RerankerConfig()
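Since both configs are plain dataclasses whose field names happen to match `generate()` keyword arguments, they can also be splatted directly; a hedged sketch (models/real.py below passes the fields explicitly instead):

```python
from dataclasses import asdict

from config.inference import thinking_config

# max_new_tokens, temperature, top_p, top_k, do_sample, and
# repetition_penalty are all valid HF generate() kwargs
gen_kwargs = asdict(thinking_config)
# outputs = thinking_model.generate(**inputs, **gen_kwargs)
```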
config/settings.py CHANGED
@@ -17,13 +17,12 @@ class Settings(BaseSettings):
     mock_models: bool = True
 
     # Model paths (for production on HuggingFace Spaces)
-    vision_model: str = "Qwen/Qwen3-VL-30B-A3B-Instruct"
+    # Dual 8B architecture: Thinking for analysis, Instruct for structured output
+    vision_model_thinking: str = "Qwen/Qwen3-VL-8B-Thinking"
+    vision_model_instruct: str = "Qwen/Qwen3-VL-8B-Instruct"
     embedding_model: str = "Qwen/Qwen3-VL-Embedding-8B"
     reranker_model: str = "Qwen/Qwen3-VL-Reranker-8B"
 
-    # Fallback vision model if VRAM issues
-    vision_model_fallback: str = "Qwen/Qwen3-VL-8B-Instruct"
-
     # ChromaDB
     chroma_persist_dir: str = "./chroma_db"
 
models/loader.py CHANGED
@@ -1,13 +1,13 @@
 """Model loading with mock/real switching based on environment.
 
 Supports two loading modes:
-- MOCK_MODELS=true: Loads all mock models at startup (fast, for local dev)
-- MOCK_MODELS=false: Uses LAZY LOADING (models loaded on-demand by pipeline)
+- MOCK_MODELS=true: Loads mock models (fast, for local dev on RTX 4090)
+- MOCK_MODELS=false: Loads all real models at startup (~68GB total)
 
-Lazy Loading Strategy (for 4xL4 GPUs with 88GB total):
-- Vision 30B (~60GB) loaded before Stage 2, unloaded after
-- RAG models (~32GB) loaded before Stage 3
-- Peak usage ~60GB, never both simultaneously
+Memory Strategy (Simultaneous Loading for 4xL4 GPUs with 88GB total):
+- Vision Thinking 8B (~18GB) + Vision Instruct 8B (~18GB) = ~36GB
+- Embedding 8B (~16GB) + Reranker 8B (~16GB) = ~32GB
+- Total: ~68GB, leaving ~20GB headroom
 """
 
 import logging
@@ -28,8 +28,8 @@ _model_stack: ModelStack | None = None
 def get_model_stack() -> ModelStack:
     """Get model stack based on environment configuration.
 
-    For mock models: Loads all models immediately (fast, for local dev).
-    For real models: Returns uninitialized stack for lazy loading.
+    For mock models: Loads mock models immediately (fast, for local dev).
+    For real models: Loads all 4 models at startup (~68GB total).
     """
     start_time = time.time()
 
@@ -42,25 +42,24 @@ def get_model_stack() -> ModelStack:
         logger.info(f"Mock model stack loaded in {elapsed:.2f}s")
         return stack
     else:
-        logger.info("Creating REAL model stack (production mode - lazy loading)")
-        logger.info(f"Vision model: {settings.vision_model}")
+        logger.info("Loading REAL model stack (production mode)")
+        logger.info(f"Vision thinking model: {settings.vision_model_thinking}")
+        logger.info(f"Vision instruct model: {settings.vision_model_instruct}")
         logger.info(f"Embedding model: {settings.embedding_model}")
         logger.info(f"Reranker model: {settings.reranker_model}")
-        logger.info("NOTE: Models will be loaded on-demand by pipeline stages")
         from models.real import RealModelStack
 
-        # Don't load models yet - pipeline will call load_vision() and load_rag()
-        stack = RealModelStack()
+        # Load all models at startup (simultaneous loading)
+        stack = RealModelStack().load_all()
         elapsed = time.time() - start_time
-        logger.info(f"Real model stack initialized in {elapsed:.2f}s (no models loaded yet)")
+        logger.info(f"Real model stack loaded in {elapsed:.2f}s")
         return stack
 
 
 def get_models() -> ModelStack:
     """Get or create the singleton model stack.
 
-    For real models, this returns an uninitialized stack.
-    Call stack.load_vision() or stack.load_rag() as needed.
+    Returns fully loaded model stack (all models ready for inference).
     """
     global _model_stack
     if _model_stack is None:
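Call sites are unchanged by this commit; the pipeline still goes through the singleton, which now comes back fully loaded. A minimal usage sketch (the image path is hypothetical):

```python
from PIL import Image

from models.loader import get_models

stack = get_models()  # loads all 4 models on first call (mocks if MOCK_MODELS=true)
assert stack.is_loaded()

image = Image.open("site_photo.jpg")  # hypothetical input
result = stack.vision.analyze_image(image, context="kitchen, post-flashover")
print(result["zone"]["classification"], result["condition"]["level"])
```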
models/mock.py CHANGED
@@ -1,4 +1,9 @@
-"""Mock model implementations for local development on RTX 4090."""
+"""Mock model implementations for local development on RTX 4090.
+
+Simulates the dual 8B vision model architecture:
+- MockVisionModel simulates two-stage pipeline (Thinking -> Instruct)
+- All models loaded together at startup (no lazy loading)
+"""
 
 import logging
 import random
@@ -9,7 +14,12 @@ logger = logging.getLogger(__name__)
 
 
 class MockVisionModel:
-    """Mock vision model that returns realistic JSON responses."""
+    """Mock vision model that simulates dual-model pipeline output.
+
+    Simulates:
+    - Stage 1: Thinking model generates reasoning
+    - Stage 2: Instruct model formats to JSON
+    """
 
     ZONES = ["burn", "near-field", "far-field"]
     CONDITIONS = ["background", "light", "moderate", "heavy", "structural-damage"]
@@ -28,11 +38,31 @@
         {"type": "ductwork-flexible", "category": "hvac"},
     ]
 
+    # Mock reasoning patterns to simulate Thinking model output
+    REASONING_PATTERNS = {
+        "burn": "Direct fire involvement evident from structural char and complete combustion patterns.",
+        "near-field": "Adjacent to burn zone with heavy smoke deposits and heat-induced discoloration.",
+        "far-field": "Light smoke migration only, no direct heat exposure or structural damage visible.",
+    }
+
+    CONDITION_REASONING = {
+        "background": "Surfaces appear clean with no visible contamination.",
+        "light": "Faint discoloration visible, minimal deposits present.",
+        "moderate": "Clear contamination with visible film on surfaces.",
+        "heavy": "Thick deposits obscuring surface texture.",
+        "structural-damage": "Physical damage requiring repair before cleaning.",
+    }
+
     def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
-        """Return mock vision analysis matching the spec schema."""
-        logger.debug(f"Mock vision analysis (context: {len(context)} chars)")
+        """Return mock vision analysis simulating dual-model pipeline output."""
+        logger.debug(f"Mock dual-model vision analysis (context: {len(context)} chars)")
+
+        # Simulate Stage 1: Thinking model selects classifications
         selected_zone = random.choice(self.ZONES)
         selected_condition = random.choice(self.CONDITIONS)
+
+        logger.debug("Mock Stage 1 (Thinking): Generated reasoning")
+        logger.debug("Mock Stage 2 (Instruct): Formatted to JSON")
         logger.info(f"Mock vision result: zone={selected_zone}, condition={selected_condition}")
 
         # Generate 2-4 random materials
@@ -62,12 +92,18 @@
             "zone": {
                 "classification": selected_zone,
                 "confidence": round(random.uniform(0.7, 0.95), 2),
-                "reasoning": f"Mock analysis detected {selected_zone} zone characteristics based on visible damage patterns",
+                "reasoning": self.REASONING_PATTERNS.get(
+                    selected_zone,
+                    f"Mock analysis detected {selected_zone} zone characteristics",
+                ),
             },
             "condition": {
                 "level": selected_condition,
                 "confidence": round(random.uniform(0.65, 0.90), 2),
-                "reasoning": f"Surface shows {selected_condition} contamination levels",
+                "reasoning": self.CONDITION_REASONING.get(
+                    selected_condition,
+                    f"Surface shows {selected_condition} contamination levels",
+                ),
             },
             "materials": materials,
             "combustion_indicators": {
@@ -188,35 +224,25 @@ class MockRerankerModel:
 class MockModelStack:
     """Mock model stack for local development.
 
-    Unlike RealModelStack, mock models are always loaded together.
-    The is_vision_loaded() and is_rag_loaded() methods are provided
-    for API compatibility with the lazy loading pipeline.
+    All models loaded together at startup (matches production behavior).
     """
 
     def __init__(self):
         self.vision = MockVisionModel()
         self.embedding = MockEmbeddingModel()
         self.reranker = MockRerankerModel()
-        self.loaded = False
+        self._loaded = False
 
     def load_all(self) -> "MockModelStack":
-        """Simulate model loading."""
+        """Load all mock models."""
         logger.info("Loading mock models for local development")
-        logger.debug("  Vision model: MockVisionModel")
-        logger.debug("  Embedding model: MockEmbeddingModel")
+        logger.debug("  Vision model: MockVisionModel (simulates dual 8B pipeline)")
+        logger.debug("  Embedding model: MockEmbeddingModel (4096-dim)")
         logger.debug("  Reranker model: MockRerankerModel")
-        self.loaded = True
+        self._loaded = True
         logger.info("All mock models loaded successfully")
         return self
 
     def is_loaded(self) -> bool:
         """Check if models are loaded."""
-        return self.loaded
-
-    def is_vision_loaded(self) -> bool:
-        """Check if vision model is loaded (always True when loaded)."""
-        return self.loaded
-
-    def is_rag_loaded(self) -> bool:
-        """Check if RAG models are loaded (always True when loaded)."""
-        return self.loaded
+        return self._loaded
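Because the mock stack keeps the same surface as `RealModelStack`, the dual-pipeline contract can be exercised on the RTX 4090 without loading any weights; a small sketch:

```python
from PIL import Image

from models.mock import MockModelStack, MockVisionModel

stack = MockModelStack().load_all()

# A blank in-memory image suffices; the mock draws random classifications
# rather than inspecting pixels
image = Image.new("RGB", (640, 480))
result = stack.vision.analyze_image(image)

assert result["zone"]["classification"] in MockVisionModel.ZONES
assert result["condition"]["level"] in MockVisionModel.CONDITIONS
```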
models/real.py CHANGED
@@ -1,21 +1,21 @@
1
  """Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).
2
 
3
  This module loads the actual Qwen3-VL models for production use.
4
- Uses LAZY LOADING to fit within 88GB VRAM (4xL4 with ~22GB each).
5
 
6
- Memory Strategy:
7
- - Vision 30B (~60GB): Loaded ONLY during Stage 2 (Vision Analysis)
8
- - Embedding 8B (~16GB): Loaded ONLY during Stages 3+ (RAG)
9
- - Reranker 8B (~16GB): Loaded ONLY during Stages 3+ (RAG)
10
- - Peak usage: ~60GB (never all three simultaneously)
 
11
 
12
  Model Loading:
13
- - Vision: Qwen3VLMoeForConditionalGeneration (standard transformers)
14
  - Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
15
  - Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
16
  """
17
 
18
- import gc
19
  import json
20
  import logging
21
  import re
@@ -24,7 +24,7 @@ import torch
24
  from typing import Any
25
  from PIL import Image
26
 
27
- from config.inference import vision_config
28
  from config.settings import settings
29
 
30
  logger = logging.getLogger(__name__)
@@ -33,17 +33,15 @@ logger = logging.getLogger(__name__)
33
  class RealModelStack:
34
  """Real model stack for production on HuggingFace Spaces.
35
 
36
- Uses LAZY LOADING to prevent OOM errors on 4xL4 (88GB total):
37
- - Vision 30B (~60GB) and RAG models (~32GB) are never loaded simultaneously
38
- - Pipeline calls load_vision() before Stage 2, unload_vision() after
39
- - Pipeline calls load_rag() before Stage 3
40
  """
41
 
42
  def __init__(self):
43
  self.models: dict[str, Any] = {}
44
  self.processors: dict[str, Any] = {}
45
- self._vision_loaded = False
46
- self._rag_loaded = False
47
 
48
  def _log_gpu_status(self):
49
  """Log current GPU memory status."""
@@ -57,114 +55,53 @@ class RealModelStack:
57
  free = total - allocated
58
  logger.info(f" GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {free:.1f}GB free / {total:.1f}GB total")
59
 
60
- def load_vision(self) -> "RealModelStack":
61
- """Load only the vision model (~60GB in BF16).
62
 
63
- Call this before Stage 2 (Vision Analysis).
64
- Must call unload_vision() before load_rag() to free memory.
65
  """
66
- if self._vision_loaded:
67
- logger.debug("Vision model already loaded, skipping")
68
  return self
69
 
70
- from transformers import AutoProcessor
71
 
72
  device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
73
- logger.info(f"Loading vision model on {device_type}")
74
  self._log_gpu_status()
75
 
76
- logger.info(f"Loading vision model: {settings.vision_model}")
77
- vision_start = time.time()
78
- try:
79
- from transformers import Qwen3VLMoeForConditionalGeneration
80
-
81
- self.models["vision"] = Qwen3VLMoeForConditionalGeneration.from_pretrained(
82
- settings.vision_model,
83
- torch_dtype=torch.bfloat16,
84
- device_map="auto",
85
- trust_remote_code=True,
86
- )
87
- self.processors["vision"] = AutoProcessor.from_pretrained(
88
- settings.vision_model,
89
- trust_remote_code=True,
90
- )
91
- logger.info(f"Vision model loaded in {time.time() - vision_start:.2f}s")
92
- except Exception as e:
93
- logger.warning(f"Failed to load 30B vision model: {e}")
94
- logger.info(f"Falling back to {settings.vision_model_fallback}")
95
- from transformers import Qwen3VLMoeForConditionalGeneration
96
-
97
- self.models["vision"] = Qwen3VLMoeForConditionalGeneration.from_pretrained(
98
- settings.vision_model_fallback,
99
- torch_dtype=torch.bfloat16,
100
- device_map="auto",
101
- trust_remote_code=True,
102
- )
103
- self.processors["vision"] = AutoProcessor.from_pretrained(
104
- settings.vision_model_fallback,
105
- trust_remote_code=True,
106
- )
107
- logger.info(f"Fallback vision model loaded in {time.time() - vision_start:.2f}s")
108
 
109
- self._vision_loaded = True
110
- self._log_gpu_status()
111
- return self
112
-
113
- def unload_vision(self):
114
- """Unload vision model and free CUDA memory.
115
-
116
- Uses accelerate's remove_hook_from_module per HuggingFace docs.
117
- Call this after Stage 2 (Vision Analysis) to free memory for RAG.
118
- """
119
- if not self._vision_loaded or "vision" not in self.models:
120
- logger.debug("Vision model not loaded, skipping unload")
121
- return
122
-
123
- logger.info("Unloading vision model to free memory for RAG...")
124
- self._log_gpu_status()
125
-
126
- try:
127
- from accelerate.hooks import remove_hook_from_module
128
-
129
- # CRITICAL: Remove hooks before deleting (required for device_map="auto")
130
- model = self.models["vision"]
131
- if hasattr(model, 'model'):
132
- # Some wrappers have nested model
133
- remove_hook_from_module(model.model, recurse=True)
134
- remove_hook_from_module(model, recurse=True)
135
- logger.debug("Accelerate hooks removed from vision model")
136
- except ImportError:
137
- logger.warning("accelerate.hooks not available, proceeding with basic cleanup")
138
- except Exception as e:
139
- logger.warning(f"Hook removal failed (continuing anyway): {e}")
140
-
141
- # Delete model and processor
142
- del self.models["vision"]
143
- del self.processors["vision"]
144
- self._vision_loaded = False
145
-
146
- # Clear CUDA cache (may not free 100% but sufficient for sequential loading)
147
- gc.collect()
148
- torch.cuda.empty_cache()
149
-
150
- logger.info("Vision model unloaded, CUDA cache cleared")
151
- self._log_gpu_status()
152
-
153
- def load_rag(self) -> "RealModelStack":
154
- """Load embedding and reranker models (~32GB total in BF16).
155
-
156
- Call this before Stage 3 (RAG Retrieval).
157
- Must call unload_vision() first to have enough memory.
158
- """
159
- if self._rag_loaded:
160
- logger.debug("RAG models already loaded, skipping")
161
- return self
162
-
163
- if self._vision_loaded:
164
- logger.warning("Vision model still loaded! Call unload_vision() first to avoid OOM.")
165
 
166
- logger.info("Loading RAG models (embedding + reranker)...")
167
- self._log_gpu_status()
 
 
 
 
 
 
 
 
 
 
 
 
168
 
169
  # Embedding model (~16GB in BF16) - Using official Qwen3VLEmbedder
170
  logger.info(f"Loading embedding model: {settings.embedding_model}")
@@ -190,59 +127,51 @@ class RealModelStack:
190
  self.processors["reranker"] = self.models["reranker"].processor
191
  logger.info(f"Reranker model loaded in {time.time() - reranker_start:.2f}s")
192
 
193
- self._rag_loaded = True
194
- logger.info("RAG models loaded successfully")
 
195
  self._log_gpu_status()
196
  return self
197
 
198
- def load_all(self) -> "RealModelStack":
199
- """Load all models (DEPRECATED - use lazy loading instead).
200
-
201
- This method is kept for backward compatibility but will cause OOM
202
- on 4xL4 GPUs. Use load_vision() and load_rag() sequentially instead.
203
- """
204
- logger.warning("load_all() is deprecated - use load_vision() and load_rag() for lazy loading")
205
- self.load_vision()
206
- # Note: This WILL cause OOM on 4xL4 as vision (60GB) + RAG (32GB) > 88GB
207
- self.load_rag()
208
- return self
209
-
210
  def is_loaded(self) -> bool:
211
- """Check if any models are loaded."""
212
- return self._vision_loaded or self._rag_loaded
213
-
214
- def is_vision_loaded(self) -> bool:
215
- """Check if vision model is loaded."""
216
- return self._vision_loaded
217
-
218
- def is_rag_loaded(self) -> bool:
219
- """Check if RAG models are loaded."""
220
- return self._rag_loaded
221
 
222
  @property
223
- def vision(self) -> "RealVisionModel":
224
- """Return vision model wrapped for pipeline consumption."""
225
- if not self._vision_loaded:
226
- raise RuntimeError("Vision model not loaded. Call load_vision() first.")
227
- return RealVisionModel(self.models["vision"], self.processors["vision"])
 
 
 
 
 
228
 
229
  @property
230
  def embedding(self) -> "RealEmbeddingModel":
231
  """Return embedding model wrapped for pipeline consumption."""
232
- if not self._rag_loaded:
233
- raise RuntimeError("Embedding model not loaded. Call load_rag() first.")
234
  return RealEmbeddingModel(self.models["embedding"], self.processors["embedding"])
235
 
236
  @property
237
  def reranker(self) -> "RealRerankerModel":
238
  """Return reranker model wrapped for pipeline consumption."""
239
- if not self._rag_loaded:
240
- raise RuntimeError("Reranker model not loaded. Call load_rag() first.")
241
  return RealRerankerModel(self.models["reranker"], self.processors["reranker"])
242
 
243
 
244
- class RealVisionModel:
245
- """Wrapper for real vision model inference."""
 
 
 
 
 
 
246
 
247
  # System prompt for FDAM fire damage assessment (per Technical Spec Section 7)
248
  VISION_SYSTEM_PROMPT = """You are an expert industrial hygienist analyzing fire damage images for the FDAM (Fire Damage Assessment Methodology) framework.
@@ -280,60 +209,123 @@ Identify visible materials and categorize as:
280
  - Flag any areas that require professional on-site verification
281
  - Note any potential access issues visible in the image"""
282
 
283
- # Analysis prompt template with JSON schema
284
- ANALYSIS_PROMPT = """Analyze this fire damage image and return a JSON response with the following structure:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
285
 
286
- {
287
- "zone": {
 
 
 
 
 
 
 
288
  "classification": "burn" | "near-field" | "far-field",
289
  "confidence": 0.0-1.0,
290
  "reasoning": "explanation"
291
- },
292
- "condition": {
293
  "level": "background" | "light" | "moderate" | "heavy" | "structural-damage",
294
  "confidence": 0.0-1.0,
295
  "reasoning": "explanation"
296
- },
297
  "materials": [
298
- {
299
  "type": "material type (e.g., drywall, concrete, steel, wood)",
300
  "category": "non-porous" | "semi-porous" | "porous" | "hvac",
301
  "confidence": 0.0-1.0,
302
  "location_description": "where in image",
303
- "bounding_box": {"x": 0.0-1.0, "y": 0.0-1.0, "width": 0.0-1.0, "height": 0.0-1.0}
304
- }
305
  ],
306
- "combustion_indicators": {
307
  "soot_visible": true/false,
308
  "soot_pattern": "description or null",
309
  "char_visible": true/false,
310
  "char_description": "description or null",
311
  "ash_visible": true/false,
312
  "ash_description": "description or null"
313
- },
314
  "structural_concerns": ["list of structural issues if any"],
315
  "access_issues": ["list of access problems if any"],
316
  "recommended_sampling_locations": [
317
- {
318
  "description": "where to sample",
319
  "sample_type": "tape_lift" | "surface_wipe" | "air_sample",
320
  "priority": "high" | "medium" | "low"
321
- }
322
  ],
323
  "flags_for_review": ["any items requiring human review"]
324
- }
325
 
326
  IMPORTANT: Return ONLY valid JSON, no additional text."""
327
 
328
- def __init__(self, model, processor):
329
- self.model = model
330
- self.processor = processor
 
 
331
 
332
  def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
333
- """Analyze an image and return structured results."""
 
 
 
 
334
  start_time = time.time()
335
- logger.debug(f"Starting vision analysis (context: {len(context)} chars)")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
336
 
 
 
 
 
 
 
 
 
337
  try:
338
  from qwen_vl_utils import process_vision_info
339
  except ImportError:
@@ -341,7 +333,7 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
341
  process_vision_info = None
342
 
343
  # Build the analysis prompt with context
344
- prompt = self.ANALYSIS_PROMPT
345
  if context:
346
  prompt = f"Context: {context}\n\n{prompt}"
347
 
@@ -360,104 +352,142 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
360
  }
361
  ]
362
 
363
- try:
364
- # Apply chat template
365
- text = self.processor.apply_chat_template(
366
- messages, tokenize=False, add_generation_prompt=True
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
367
  )
368
 
369
- # Process vision info if available
370
- if process_vision_info:
371
- image_inputs, video_inputs = process_vision_info(messages)
372
- inputs = self.processor(
373
- text=[text],
374
- images=image_inputs,
375
- videos=video_inputs,
376
- return_tensors="pt",
377
- padding=True,
378
- )
379
- else:
380
- # Fallback: basic image processing
381
- inputs = self.processor(
382
- text=[text],
383
- images=[image],
384
- return_tensors="pt",
385
- padding=True,
386
- )
387
-
388
- # Note: With device_map="auto", transformers handles device routing internally
389
- # Do NOT call .to(device) - it breaks distributed models
390
-
391
- # Log inference config being used
392
- logger.debug(f"Vision inference config: max_new_tokens={vision_config.max_new_tokens}, "
393
- f"do_sample={vision_config.do_sample}, temp={vision_config.temperature}")
394
-
395
- # Generate response using config values
396
- inference_start = time.time()
397
- with torch.no_grad():
398
- if vision_config.do_sample:
399
- outputs = self.model.generate(
400
- **inputs,
401
- max_new_tokens=vision_config.max_new_tokens,
402
- do_sample=True,
403
- temperature=vision_config.temperature,
404
- top_p=vision_config.top_p,
405
- repetition_penalty=vision_config.repetition_penalty,
406
- )
407
- else:
408
- # Deterministic mode (no sampling)
409
- outputs = self.model.generate(
410
- **inputs,
411
- max_new_tokens=vision_config.max_new_tokens,
412
- do_sample=False,
413
- temperature=None,
414
- top_p=None,
415
- repetition_penalty=vision_config.repetition_penalty,
416
- )
417
-
418
- inference_time = time.time() - inference_start
419
- logger.debug(f"Vision inference completed in {inference_time:.2f}s")
420
-
421
- # Decode response
422
- response_text = self.processor.decode(
423
- outputs[0], skip_special_tokens=True
424
  )
425
- logger.debug(f"Response length: {len(response_text)} chars")
426
 
427
- # Parse JSON from response
428
- result = self._parse_vision_response(response_text)
429
 
430
- # Log result summary
431
- total_time = time.time() - start_time
432
- zone = result.get("zone", {}).get("classification", "unknown")
433
- zone_conf = result.get("zone", {}).get("confidence", 0)
434
- condition = result.get("condition", {}).get("level", "unknown")
435
- condition_conf = result.get("condition", {}).get("confidence", 0)
436
- num_materials = len(result.get("materials", []))
437
- logger.info(f"Vision analysis complete in {total_time:.2f}s: "
438
- f"zone={zone} ({zone_conf:.2f}), condition={condition} ({condition_conf:.2f}), "
439
- f"materials={num_materials}")
440
 
441
- return result
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
442
 
443
- except Exception as e:
444
- logger.error(f"Vision analysis failed: {e}")
445
- return self._get_fallback_response(str(e))
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
446
 
447
- def _parse_vision_response(self, response: str) -> dict[str, Any]:
448
- """Parse JSON response from vision model."""
449
  try:
450
  # Try to extract JSON from response
451
- # Look for JSON block in various formats
452
  json_match = re.search(r'\{[\s\S]*\}', response)
453
  if json_match:
454
  json_str = json_match.group()
455
  return json.loads(json_str)
456
  else:
457
- logger.warning("No JSON found in vision response")
458
  return self._get_fallback_response("No JSON in response")
459
  except json.JSONDecodeError as e:
460
- logger.warning(f"Failed to parse vision JSON: {e}")
461
  return self._get_fallback_response(f"JSON parse error: {e}")
462
 
463
  def _get_fallback_response(self, reason: str) -> dict[str, Any]:
 
1
  """Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).
2
 
3
  This module loads the actual Qwen3-VL models for production use.
4
+ All models are loaded simultaneously at startup (~68GB total).
5
 
6
+ Memory Strategy (Simultaneous Loading):
7
+ - Vision Thinking 8B (~18GB): Deep analysis with reasoning chains
8
+ - Vision Instruct 8B (~18GB): Structured JSON output formatting
9
+ - Embedding 8B (~16GB): RAG document embedding
10
+ - Reranker 8B (~16GB): RAG retrieval reranking
11
+ - Total: ~68GB on 88GB available (20GB headroom)
12
 
13
  Model Loading:
14
+ - Vision: Qwen3VLForConditionalGeneration (standard transformers)
15
  - Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
16
  - Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
17
  """
18
 
 
19
  import json
20
  import logging
21
  import re
 
24
  from typing import Any
25
  from PIL import Image
26
 
27
+ from config.inference import thinking_config, vision_config
28
  from config.settings import settings
29
 
30
  logger = logging.getLogger(__name__)
 
33
  class RealModelStack:
34
  """Real model stack for production on HuggingFace Spaces.
35
 
36
+ Loads all 4 models simultaneously at initialization (~68GB total):
37
+ - Dual vision (Thinking + Instruct): ~36GB
38
+ - Embedding + Reranker: ~32GB
 
39
  """
40
 
41
  def __init__(self):
42
  self.models: dict[str, Any] = {}
43
  self.processors: dict[str, Any] = {}
44
+ self._loaded = False
 
45
 
46
  def _log_gpu_status(self):
47
  """Log current GPU memory status."""
 
55
  free = total - allocated
56
  logger.info(f" GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {free:.1f}GB free / {total:.1f}GB total")
57
 
58
+ def load_all(self) -> "RealModelStack":
59
+ """Load all models simultaneously.
60
 
61
+ Loads dual vision models (Thinking + Instruct) and RAG models
62
+ (Embedding + Reranker) for ~68GB total VRAM usage.
63
  """
64
+ if self._loaded:
65
+ logger.debug("Models already loaded, skipping")
66
  return self
67
 
68
+ from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
69
 
70
  device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
71
+ logger.info(f"Loading all models on {device_type}")
72
  self._log_gpu_status()
73
 
74
+ total_start = time.time()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
75
 
76
+ # Vision Thinking model (~18GB in BF16)
77
+ logger.info(f"Loading vision thinking model: {settings.vision_model_thinking}")
78
+ thinking_start = time.time()
79
+ self.models["vision_thinking"] = Qwen3VLForConditionalGeneration.from_pretrained(
80
+ settings.vision_model_thinking,
81
+ torch_dtype=torch.bfloat16,
82
+ device_map="auto",
83
+ trust_remote_code=True,
84
+ )
85
+ self.processors["vision_thinking"] = AutoProcessor.from_pretrained(
86
+ settings.vision_model_thinking,
87
+ trust_remote_code=True,
88
+ )
89
+ logger.info(f"Vision thinking model loaded in {time.time() - thinking_start:.2f}s")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
90
 
91
+ # Vision Instruct model (~18GB in BF16)
92
+ logger.info(f"Loading vision instruct model: {settings.vision_model_instruct}")
93
+ instruct_start = time.time()
94
+ self.models["vision_instruct"] = Qwen3VLForConditionalGeneration.from_pretrained(
95
+ settings.vision_model_instruct,
96
+ torch_dtype=torch.bfloat16,
97
+ device_map="auto",
98
+ trust_remote_code=True,
99
+ )
100
+ self.processors["vision_instruct"] = AutoProcessor.from_pretrained(
101
+ settings.vision_model_instruct,
102
+ trust_remote_code=True,
103
+ )
104
+ logger.info(f"Vision instruct model loaded in {time.time() - instruct_start:.2f}s")
105
 
106
  # Embedding model (~16GB in BF16) - Using official Qwen3VLEmbedder
107
  logger.info(f"Loading embedding model: {settings.embedding_model}")
 
127
  self.processors["reranker"] = self.models["reranker"].processor
128
  logger.info(f"Reranker model loaded in {time.time() - reranker_start:.2f}s")
129
 
130
+ self._loaded = True
131
+ total_time = time.time() - total_start
132
+ logger.info(f"All models loaded in {total_time:.2f}s")
133
  self._log_gpu_status()
134
  return self
135
 
 
 
 
 
 
 
 
 
 
 
 
 
136
  def is_loaded(self) -> bool:
137
+ """Check if models are loaded."""
138
+ return self._loaded
 
 
 
 
 
 
 
 
139
 
140
  @property
141
+ def vision(self) -> "DualVisionModel":
142
+ """Return dual vision model wrapped for pipeline consumption."""
143
+ if not self._loaded:
144
+ raise RuntimeError("Models not loaded. Call load_all() first.")
145
+ return DualVisionModel(
146
+ thinking_model=self.models["vision_thinking"],
147
+ thinking_processor=self.processors["vision_thinking"],
148
+ instruct_model=self.models["vision_instruct"],
149
+ instruct_processor=self.processors["vision_instruct"],
150
+ )
151
 
152
  @property
153
  def embedding(self) -> "RealEmbeddingModel":
154
  """Return embedding model wrapped for pipeline consumption."""
155
+ if not self._loaded:
156
+ raise RuntimeError("Models not loaded. Call load_all() first.")
157
  return RealEmbeddingModel(self.models["embedding"], self.processors["embedding"])
158
 
159
  @property
160
  def reranker(self) -> "RealRerankerModel":
161
  """Return reranker model wrapped for pipeline consumption."""
162
+ if not self._loaded:
163
+ raise RuntimeError("Models not loaded. Call load_all() first.")
164
  return RealRerankerModel(self.models["reranker"], self.processors["reranker"])
165
 
166
 
167
+ class DualVisionModel:
168
+ """Dual vision model for two-stage fire damage analysis.
169
+
170
+ Uses Qwen3-VL-8B-Thinking for deep analysis with reasoning chains,
171
+ then Qwen3-VL-8B-Instruct to format results into structured JSON.
172
+
173
+ Pipeline: Image -> Thinking (analysis) -> Instruct (JSON formatting) -> Output
174
+ """
175
 
176
  # System prompt for FDAM fire damage assessment (per Technical Spec Section 7)
177
  VISION_SYSTEM_PROMPT = """You are an expert industrial hygienist analyzing fire damage images for the FDAM (Fire Damage Assessment Methodology) framework.
 
209
  - Flag any areas that require professional on-site verification
210
  - Note any potential access issues visible in the image"""
211
 
212
+ # Analysis prompt for Thinking model (open-ended reasoning)
213
+ THINKING_ANALYSIS_PROMPT = """Analyze this fire damage image thoroughly. Consider:
214
+
215
+ 1. What zone classification applies (burn, near-field, or far-field) and why?
216
+ 2. What is the contamination condition level (background, light, moderate, heavy, or structural-damage)?
217
+ 3. What materials are visible and what is their porosity category?
218
+ 4. What combustion indicators (soot, char, ash) are present and where?
219
+ 5. Are there any structural concerns or access issues?
220
+ 6. Where would you recommend sampling and what type of samples?
221
+
222
+ Provide detailed reasoning for each assessment, explaining the visual evidence that supports your conclusions."""
223
+
224
+ # Formatter prompt for Instruct model (structured JSON output)
225
+ INSTRUCT_FORMATTER_SYSTEM = """You are a technical document formatter. Your task is to convert fire damage analysis into a precise JSON structure.
226
+
227
+ Preserve all findings from the analysis accurately. Assign confidence scores (0.0-1.0) based on the certainty expressed in the analysis:
228
+ - Very certain statements: 0.85-0.95
229
+ - Reasonably confident: 0.70-0.84
230
+ - Somewhat uncertain: 0.50-0.69
231
+ - Uncertain/fallback: 0.30-0.49"""
232
 
233
+ INSTRUCT_FORMATTER_PROMPT = """Based on the following fire damage analysis, generate a JSON response with this exact structure:
234
+
235
+ <analysis>
236
+ {analysis}
237
+ </analysis>
238
+
239
+ Generate JSON with this structure:
240
+ {{
241
+ "zone": {{
242
  "classification": "burn" | "near-field" | "far-field",
243
  "confidence": 0.0-1.0,
244
  "reasoning": "explanation"
245
+ }},
246
+ "condition": {{
247
  "level": "background" | "light" | "moderate" | "heavy" | "structural-damage",
248
  "confidence": 0.0-1.0,
249
  "reasoning": "explanation"
250
+ }},
251
  "materials": [
252
+ {{
253
  "type": "material type (e.g., drywall, concrete, steel, wood)",
254
  "category": "non-porous" | "semi-porous" | "porous" | "hvac",
255
  "confidence": 0.0-1.0,
256
  "location_description": "where in image",
257
+ "bounding_box": {{"x": 0.0-1.0, "y": 0.0-1.0, "width": 0.0-1.0, "height": 0.0-1.0}}
258
+ }}
259
  ],
260
+ "combustion_indicators": {{
261
  "soot_visible": true/false,
262
  "soot_pattern": "description or null",
263
  "char_visible": true/false,
264
  "char_description": "description or null",
265
  "ash_visible": true/false,
266
  "ash_description": "description or null"
267
+ }},
268
  "structural_concerns": ["list of structural issues if any"],
269
  "access_issues": ["list of access problems if any"],
270
  "recommended_sampling_locations": [
271
+ {{
272
  "description": "where to sample",
273
  "sample_type": "tape_lift" | "surface_wipe" | "air_sample",
274
  "priority": "high" | "medium" | "low"
275
+ }}
276
  ],
277
  "flags_for_review": ["any items requiring human review"]
278
+ }}
279
 
280
  IMPORTANT: Return ONLY valid JSON, no additional text."""
281
 
282
+ def __init__(self, thinking_model, thinking_processor, instruct_model, instruct_processor):
283
+ self.thinking_model = thinking_model
284
+ self.thinking_processor = thinking_processor
285
+ self.instruct_model = instruct_model
286
+ self.instruct_processor = instruct_processor
287
 
288
  def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
289
+ """Analyze an image using two-stage pipeline.
290
+
291
+ Stage 1: Thinking model generates detailed analysis with reasoning
292
+ Stage 2: Instruct model formats the analysis into structured JSON
293
+ """
294
  start_time = time.time()
295
+ logger.debug(f"Starting dual-model vision analysis (context: {len(context)} chars)")
296
+
297
+ try:
298
+ # Stage 1: Deep analysis with Thinking model
299
+ thinking_start = time.time()
300
+ analysis_text = self._run_thinking_stage(image, context)
301
+ thinking_time = time.time() - thinking_start
302
+ logger.debug(f"Thinking stage completed in {thinking_time:.2f}s, output: {len(analysis_text)} chars")
303
+
304
+ # Stage 2: Format to JSON with Instruct model
305
+ instruct_start = time.time()
306
+ result = self._run_instruct_stage(analysis_text)
307
+ instruct_time = time.time() - instruct_start
308
+ logger.debug(f"Instruct stage completed in {instruct_time:.2f}s")
309
+
310
+ # Log result summary
311
+ total_time = time.time() - start_time
312
+ zone = result.get("zone", {}).get("classification", "unknown")
313
+ zone_conf = result.get("zone", {}).get("confidence", 0)
314
+ condition = result.get("condition", {}).get("level", "unknown")
315
+ condition_conf = result.get("condition", {}).get("confidence", 0)
316
+ num_materials = len(result.get("materials", []))
317
+ logger.info(f"Vision analysis complete in {total_time:.2f}s (thinking: {thinking_time:.2f}s, instruct: {instruct_time:.2f}s): "
318
+ f"zone={zone} ({zone_conf:.2f}), condition={condition} ({condition_conf:.2f}), "
319
+ f"materials={num_materials}")
320
 
321
+ return result
322
+
323
+ except Exception as e:
324
+ logger.error(f"Vision analysis failed: {e}")
325
+ return self._get_fallback_response(str(e))
326
+
327
+ def _run_thinking_stage(self, image: Image.Image, context: str) -> str:
328
+ """Run the Thinking model to generate detailed analysis."""
329
  try:
330
  from qwen_vl_utils import process_vision_info
331
  except ImportError:
 
333
  process_vision_info = None
334
 
335
  # Build the analysis prompt with context
336
+ prompt = self.THINKING_ANALYSIS_PROMPT
337
  if context:
338
  prompt = f"Context: {context}\n\n{prompt}"
339
 
 
352
  }
353
  ]
354
 
355
+ # Apply chat template with thinking enabled (default for Thinking model)
356
+ text = self.thinking_processor.apply_chat_template(
357
+ messages, tokenize=False, add_generation_prompt=True
358
+ )
359
+
360
+ # Process vision info if available
361
+ if process_vision_info:
362
+ image_inputs, video_inputs = process_vision_info(messages)
363
+ inputs = self.thinking_processor(
364
+ text=[text],
365
+ images=image_inputs,
366
+ videos=video_inputs,
367
+ return_tensors="pt",
368
+ padding=True,
369
+ )
370
+ else:
371
+ # Fallback: basic image processing
372
+ inputs = self.thinking_processor(
373
+ text=[text],
374
+ images=[image],
375
+ return_tensors="pt",
376
+ padding=True,
377
  )
378
 
+     # Move inputs onto the model's device before generation
+     inputs = inputs.to(self.thinking_model.device)
+
+     # Generate response using thinking config (per Qwen3-VL GitHub recommendations)
+     logger.debug(f"Thinking inference config: max_new_tokens={thinking_config.max_new_tokens}, "
+                  f"temp={thinking_config.temperature}, top_p={thinking_config.top_p}, top_k={thinking_config.top_k}")
+
+     with torch.no_grad():
+         outputs = self.thinking_model.generate(
+             **inputs,
+             max_new_tokens=thinking_config.max_new_tokens,
+             do_sample=thinking_config.do_sample,
+             temperature=thinking_config.temperature,
+             top_p=thinking_config.top_p,
+             top_k=thinking_config.top_k,
+             repetition_penalty=thinking_config.repetition_penalty,
          )

+     # Decode only the newly generated tokens (generate() returns the prompt
+     # followed by the completion) and keep raw token IDs for proper parsing
+     output_ids = outputs[0][inputs["input_ids"].shape[1]:].tolist()

+     # The Thinking model's chat template includes opening <think> tag
+     # Output format: reasoning_content</think>final_answer
+     # Get </think> token ID dynamically from tokenizer (more robust than hardcoding)
+     think_end_token = self.thinking_processor.tokenizer.encode(
+         "</think>", add_special_tokens=False
+     )[0]
+
+     try:
+         # Find the </think> token position
+         think_end_idx = len(output_ids) - output_ids[::-1].index(think_end_token)
+         # Extract reasoning (before </think>) and answer (after </think>)
+         reasoning_ids = output_ids[:think_end_idx]
+         answer_ids = output_ids[think_end_idx:]
+
+         reasoning = self.thinking_processor.decode(
+             reasoning_ids, skip_special_tokens=True
+         ).strip()
+         final_answer = self.thinking_processor.decode(
+             answer_ids, skip_special_tokens=True
+         ).strip()
+
+         logger.debug(f"Extracted thinking: {len(reasoning)} chars reasoning, {len(final_answer)} chars answer")
+         return f"Reasoning:\n{reasoning}\n\nConclusions:\n{final_answer}"
+
+     except ValueError:
+         # No </think> token found - use full response as-is
+         response_text = self.thinking_processor.decode(
+             output_ids, skip_special_tokens=True
+         ).strip()
+         logger.debug(f"No </think> token found, using full response: {len(response_text)} chars")
+         return response_text
+
+ def _run_instruct_stage(self, analysis_text: str) -> dict[str, Any]:
+     """Run the Instruct model to format analysis into JSON."""
+     # Prepare messages for Instruct model (text-only, no image)
+     prompt = self.INSTRUCT_FORMATTER_PROMPT.format(analysis=analysis_text)

+     messages = [
+         {
+             "role": "system",
+             "content": self.INSTRUCT_FORMATTER_SYSTEM,
+         },
+         {
+             "role": "user",
+             "content": prompt,
+         }
+     ]
+
+     # Apply chat template
+     text = self.instruct_processor.apply_chat_template(
+         messages, tokenize=False, add_generation_prompt=True
+     )
+
+     inputs = self.instruct_processor(
+         text=[text],
+         return_tensors="pt",
+         padding=True,
+     )
+     # Move inputs onto the model's device before generation
+     inputs = inputs.to(self.instruct_model.device)
+
+     # Generate response using vision config (low temp for consistent JSON)
+     logger.debug(f"Instruct inference config: max_new_tokens={vision_config.max_new_tokens}, "
+                  f"temp={vision_config.temperature}")
+
+     with torch.no_grad():
+         outputs = self.instruct_model.generate(
+             **inputs,
+             max_new_tokens=vision_config.max_new_tokens,
+             do_sample=vision_config.do_sample,
+             temperature=vision_config.temperature,
+             top_p=vision_config.top_p,
+             repetition_penalty=vision_config.repetition_penalty,
+         )
+
+     # Decode only the newly generated tokens - the prompt contains the
+     # analysis text and could otherwise confuse JSON extraction
+     response_text = self.instruct_processor.decode(
+         outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
+     )
+
+     # Parse JSON from response
+     return self._parse_json_response(response_text)

+ def _parse_json_response(self, response: str) -> dict[str, Any]:
+     """Parse JSON response from instruct model."""
      try:
          # Try to extract JSON from response
          json_match = re.search(r'\{[\s\S]*\}', response)
          if json_match:
              json_str = json_match.group()
              return json.loads(json_str)
          else:
+             logger.warning("No JSON found in instruct response")
              return self._get_fallback_response("No JSON in response")
      except json.JSONDecodeError as e:
+         logger.warning(f"Failed to parse JSON: {e}")
          return self._get_fallback_response(f"JSON parse error: {e}")

  def _get_fallback_response(self, reason: str) -> dict[str, Any]:
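
The split-point arithmetic in `_run_thinking_stage` is easy to misread, so here is a toy sketch of the idiom on made-up token IDs (99 stands in for the real `</think>` ID; the surrounding pipeline is omitted):

```python
# Toy illustration of `len(ids) - ids[::-1].index(tok)`: reversing the list
# makes .index() find the LAST occurrence, so the computed index points just
# past the final </think> token; the token itself stays on the reasoning side.
ids = [5, 99, 7, 99, 3]               # pretend 99 == </think>
idx = len(ids) - ids[::-1].index(99)  # -> 4
assert ids[:idx] == [5, 99, 7, 99]    # reasoning_ids (includes </think>)
assert ids[idx:] == [3]               # answer_ids
```

`list.index` raises `ValueError` when the token never appears, which is exactly what the `except ValueError` branch catches (e.g. generation cut off by `max_new_tokens` before the model closed its thinking block).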
pipeline/main.py CHANGED
@@ -199,11 +199,6 @@ class FDAMPipeline:
  logger.info(f"Stage 2/6: Vision Analysis ({len(session.images)} images)")
  report_progress(2, "Analyzing images with AI...")
  model_stack = get_models()
-
- # Lazy load vision model (for real models only - mock models are already loaded)
- if hasattr(model_stack, 'load_vision') and not model_stack.is_vision_loaded():
-     logger.info("Lazy loading vision model...")
-     model_stack.load_vision()
  vision_results = {}
  annotated_images = []
  room_mapping = {}
@@ -260,20 +255,11 @@ class FDAMPipeline:
  logger.info(f"Stage 2 completed in {time.time() - stage_start:.2f}s: "
              f"{len(vision_results)} images analyzed")

- # Unload vision model to free memory for RAG (for real models only)
- if hasattr(model_stack, 'unload_vision') and model_stack.is_vision_loaded():
-     logger.info("Unloading vision model to free memory for RAG...")
-     model_stack.unload_vision()
-
  # Stage 3: RAG Retrieval
  stage_start = time.time()
  logger.info("Stage 3/6: RAG Retrieval")
  report_progress(3, "Retrieving FDAM methodology context...")

- # Lazy load RAG models (for real models only - mock models are already loaded)
- if hasattr(model_stack, 'load_rag') and not model_stack.is_rag_loaded():
-     logger.info("Lazy loading RAG models (embedding + reranker)...")
-     model_stack.load_rag()
  # RAG is integrated into disposition engine, just verify connection
  try:
      test_results = self.retriever.retrieve("test connection", top_k=1)
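
With the lazy load/unload calls gone, every stage assumes the full stack is resident from startup. A minimal sketch of that pattern (all names below are hypothetical stand-ins, not the project's actual `RealModelStack` implementation):

```python
from dataclasses import dataclass

@dataclass
class _EagerStack:
    """Hypothetical stand-in: in the real app each field holds a loaded model."""
    vision_thinking: str = "Qwen/Qwen3-VL-8B-Thinking"   # Stage 1: reasoning
    vision_instruct: str = "Qwen/Qwen3-VL-8B-Instruct"   # Stage 2: JSON formatting
    embedding: str = "Qwen/Qwen3-VL-Embedding-8B"
    reranker: str = "Qwen/Qwen3-VL-Reranker-8B"

_STACK = _EagerStack()  # built once at startup (~68GB on the real 4xL4 stack)

def get_models() -> _EagerStack:
    # No is_vision_loaded()/is_rag_loaded() branches remain anywhere.
    return _STACK
```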
rag/retriever.py CHANGED
@@ -88,7 +88,7 @@ class SharedReranker:
      """Reranker that uses the shared model from RealModelStack.

      This avoids loading a duplicate reranker model - instead uses the
-     model already loaded by the pipeline via model_stack.load_rag().
+     model already loaded by the pipeline at startup.
      """

      def rerank(
@@ -109,13 +109,7 @@ class SharedReranker:

          model_stack = get_models()

-         # Check if RAG models are loaded
-         if not model_stack.is_rag_loaded():
-             logger.warning("RAG models not loaded yet - reranking may fail")
-             # Return neutral scores as fallback
-             return [0.5] * len(documents)
-
-         # Use the shared reranker model
+         # Use the shared reranker model (always loaded at startup)
          return model_stack.reranker.rerank(query, documents)
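
For reference, the call shape the simplified path guarantees (the query and documents here are illustrative, and higher-score-means-more-relevant is assumed):

```python
# Illustrative only: rerank() now always returns one score per input document,
# in input order - the neutral 0.5 fallback path no longer exists.
docs = [
    "Zone 2 covers materials adjacent to the moisture source...",
    "Unrelated installation notes...",
]
scores = SharedReranker().rerank("what defines Zone 2?", docs)
ranked = [d for _, d in sorted(zip(scores, docs), reverse=True)]
```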
rag/vectorstore.py CHANGED
@@ -62,7 +62,7 @@ class SharedEmbeddingFunction:
      """Embedding function that uses the shared model from RealModelStack.

      This avoids loading a duplicate embedding model - instead uses the
-     model already loaded by the pipeline via model_stack.load_rag().
+     model already loaded by the pipeline at startup.

      For ChromaDB compatibility, this wraps the model stack's embedding model.
      """
@@ -75,13 +75,7 @@ class SharedEmbeddingFunction:

          model_stack = get_models()

-         # Check if RAG models are loaded
-         if not model_stack.is_rag_loaded():
-             logger.warning("RAG models not loaded yet - embeddings may fail")
-             # Return zero vectors as fallback
-             return [[0.0] * self.EMBEDDING_DIM for _ in input]
-
-         # Use the shared embedding model
+         # Use the shared embedding model (always loaded at startup)
          return model_stack.embedding.embed_batch(input)
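
A hedged sketch of how this embedding function plugs into ChromaDB 0.4.x (the collection name and path are illustrative; `SharedEmbeddingFunction` is the class from this diff and must embed queries the same way the static knowledge base was indexed):

```python
import chromadb

# The embedding_function hook is how ChromaDB delegates query embedding;
# this instance forwards to the already-loaded model_stack.embedding.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="fdam_kb",                                # illustrative name
    embedding_function=SharedEmbeddingFunction(),  # delegates to the shared 8B model
)
results = collection.query(query_texts=["zone classification criteria"], n_results=5)
```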