KinetoLabs Claude Opus 4.5 committed on
Commit
706520f
·
1 Parent(s): 0699c5f

Replace dual 8B with single 30B-A3B FP8 vision model


Simplify pipeline architecture:
- Vision: Qwen3-VL-30B-A3B-Thinking-FP8 (~30-35GB) replaces dual 8B
- Embedding: 2B model (2048-dim) replaces 8B
- Reranker: 2B model replaces 8B
- Total VRAM: ~38-43GB (was ~68GB), 45GB+ headroom on 4xL4

Key changes:
- vLLM with FP8 quantization (built-in, no autoawq needed)
- Proper Qwen3-VL chat template formatting via processor
- Removed dual-model Thinking→Instruct pipeline
- Single model handles analysis + structured JSON output

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
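
For reference, the new single-pass flow is roughly the following (a condensed sketch of the code this commit adds to models/real.py; prompts are abbreviated and the image path is hypothetical):

```python
# Sketch only: mirrors the vLLM + chat-template flow added in this commit.
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

MODEL = "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"

llm = LLM(model=MODEL, tensor_parallel_size=4, trust_remote_code=True,
          gpu_memory_utilization=0.70, max_model_len=32768)
processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True)

image = Image.open("site_photo.jpg")  # hypothetical input
messages = [
    {"role": "system", "content": "You are an expert industrial hygienist..."},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Return the fire damage assessment as JSON."},
    ]},
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate(
    prompts=[{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params=SamplingParams(max_tokens=8192, temperature=0.6, top_p=0.95, top_k=20),
)
print(outputs[0].outputs[0].text)  # reasoning followed by the structured JSON
```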

.env.example CHANGED
@@ -8,8 +8,7 @@ MOCK_MODELS=true
8
  SERVER_HOST=0.0.0.0
9
  SERVER_PORT=7860
10
 
11
- # Optional: Override model paths (Dual 8B architecture)
12
- # VISION_MODEL_THINKING=Qwen/Qwen3-VL-8B-Thinking
13
- # VISION_MODEL_INSTRUCT=Qwen/Qwen3-VL-8B-Instruct
14
- # EMBEDDING_MODEL=Qwen/Qwen3-VL-Embedding-8B
15
- # RERANKER_MODEL=Qwen/Qwen3-VL-Reranker-8B
 
8
  SERVER_HOST=0.0.0.0
9
  SERVER_PORT=7860
10
 
11
+ # Optional: Override model paths (FP8 + 2B architecture)
12
+ # VISION_MODEL=Qwen/Qwen3-VL-30B-A3B-Thinking-FP8
13
+ # EMBEDDING_MODEL=Qwen/Qwen3-VL-Embedding-2B
14
+ # RERANKER_MODEL=Qwen/Qwen3-VL-Reranker-2B
 
CLAUDE.md CHANGED
@@ -13,7 +13,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
13
  ## Critical Constraints
14
 
15
  1. **No External API Calls** - 100% locally-owned models only (no Claude/OpenAI APIs)
16
- 2. **Memory Budget** - 4xL4 88GB usable: ~36GB vision (dual 8B) + ~16GB embedding + ~16GB reranker (~68GB used, ~20GB headroom)
17
  3. **Processing Time** - 60-90 seconds per assessment is acceptable
18
  4. **MVP Scope** - Phase 1 (PRE) and Phase 2 (PRA) only; no lab results processing yet
19
  5. **Static RAG** - Knowledge base is pre-indexed; no user document uploads
@@ -23,10 +23,10 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
23
  | Component | Technology |
24
  |-----------|------------|
25
  | UI Framework | Gradio 6.x |
26
- | Vision (Thinking) | Qwen3-VL-8B-Thinking |
27
- | Vision (Instruct) | Qwen3-VL-8B-Instruct |
28
- | Embeddings | Qwen3-VL-Embedding-8B |
29
- | Reranker | Qwen3-VL-Reranker-8B |
30
  | Vector Store | ChromaDB 0.4.x |
31
  | Validation | Pydantic 2.x |
32
  | PDF Generation | Pandoc 3.x |
@@ -149,40 +149,42 @@ Source documents in `/RAG-KB/`:
149
 
150
  ## Multi-GPU Model Loading
151
 
152
- All 4 models are loaded simultaneously at startup (~68GB total on 4xL4 GPUs):
153
 
154
  ```python
155
- # Vision models (dual 8B architecture)
156
- thinking_model = Qwen3VLForConditionalGeneration.from_pretrained(
157
- "Qwen/Qwen3-VL-8B-Thinking",
158
- torch_dtype=torch.bfloat16,
159
- device_map="auto",
160
- trust_remote_code=True
161
- )
162
- instruct_model = Qwen3VLForConditionalGeneration.from_pretrained(
163
- "Qwen/Qwen3-VL-8B-Instruct",
164
- torch_dtype=torch.bfloat16,
165
- device_map="auto",
166
- trust_remote_code=True
167
  )
168
  ```
169
 
170
- Expected distribution (BF16, ~68GB total):
171
- - Vision Thinking model (8B): ~18GB
172
- - Vision Instruct model (8B): ~18GB
173
- - Embedding model (8B): ~16GB
174
- - Reranker model (8B): ~16GB
175
- - Headroom: ~20GB for KV cache and overhead
176
 
177
  ## Local Development Strategy
178
 
179
- The RTX 4090 (24GB VRAM) cannot run the full model stack (~68GB required). Use this workflow:
180
 
181
  1. Set `MOCK_MODELS=true` environment variable
182
- 2. Mock responses return realistic JSON matching vision output schema
183
  3. Test pipeline logic, UI, calculations without real inference
184
  4. Deploy to HuggingFace Spaces for real model testing
185
  5. Request build logs after deployment to confirm success
 
186
 
187
  ## Code Style
188
 
 
13
  ## Critical Constraints
14
 
15
  1. **No External API Calls** - 100% locally-owned models only (no Claude/OpenAI APIs)
16
+ 2. **Memory Budget** - 4xL4 88GB usable: ~30-35GB vision (30B FP8) + ~4GB embedding + ~4GB reranker (~38-43GB used, ~45GB+ headroom)
17
  3. **Processing Time** - 60-90 seconds per assessment is acceptable
18
  4. **MVP Scope** - Phase 1 (PRE) and Phase 2 (PRA) only; no lab results processing yet
19
  5. **Static RAG** - Knowledge base is pre-indexed; no user document uploads
 
23
  | Component | Technology |
24
  |-----------|------------|
25
  | UI Framework | Gradio 6.x |
26
+ | Vision | Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (via vLLM) |
27
+ | Embeddings | Qwen/Qwen3-VL-Embedding-2B (2048-dim) |
28
+ | Reranker | Qwen/Qwen3-VL-Reranker-2B |
29
+ | Inference | vLLM with FP8 quantization |
30
  | Vector Store | ChromaDB 0.4.x |
31
  | Validation | Pydantic 2.x |
32
  | PDF Generation | Pandoc 3.x |
 
149
 
150
  ## Multi-GPU Model Loading
151
 
152
+ All 3 models are loaded at startup (~38-43GB total on 4xL4 GPUs):
153
 
154
  ```python
155
+ from vllm import LLM, SamplingParams
156
+
157
+ # Vision model via vLLM with FP8 quantization (built-in)
158
+ vision_model = LLM(
159
+ model="Qwen/Qwen3-VL-30B-A3B-Thinking-FP8",
160
+ tensor_parallel_size=4, # Distribute across all 4 GPUs
161
+ trust_remote_code=True,
162
+ gpu_memory_utilization=0.70,
163
+ max_model_len=32768,
164
  )
165
+
166
+ # Embedding and Reranker use official Qwen3VL loaders
167
+ from scripts.qwen3_vl import Qwen3VLEmbedder, Qwen3VLReranker
168
+ embedding_model = Qwen3VLEmbedder("Qwen/Qwen3-VL-Embedding-2B", torch_dtype=torch.bfloat16)
169
+ reranker_model = Qwen3VLReranker("Qwen/Qwen3-VL-Reranker-2B", torch_dtype=torch.bfloat16)
170
  ```
171
 
172
+ Expected distribution (FP8 + BF16, ~38-43GB total):
173
+ - Vision model (30B FP8): ~30-35GB
174
+ - Embedding model (2B): ~4GB
175
+ - Reranker model (2B): ~4GB
176
+ - Headroom: ~45GB+ for KV cache and overhead
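
A minimal inference sketch against the `vision_model` loaded above (assumes a PIL image `pil_image` and that the processor for the same checkpoint is loaded separately; sampling values mirror `config/inference.py`):

```python
from transformers import AutoProcessor
from vllm import SamplingParams

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8", trust_remote_code=True
)
messages = [{"role": "user", "content": [
    {"type": "image", "image": pil_image},
    {"type": "text", "text": "Describe the fire damage and return JSON."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = vision_model.generate(
    prompts=[{"prompt": prompt, "multi_modal_data": {"image": pil_image}}],
    sampling_params=SamplingParams(max_tokens=8192, temperature=0.6, top_p=0.95, top_k=20),
)
result_text = outputs[0].outputs[0].text
```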
 
177
 
178
  ## Local Development Strategy
179
 
180
+ The RTX 4090 (24GB VRAM) cannot run the production model stack. Use this workflow:
181
 
182
  1. Set `MOCK_MODELS=true` environment variable
183
+ 2. Mock responses return realistic JSON matching vision output schema (2048-dim embeddings)
184
  3. Test pipeline logic, UI, calculations without real inference
185
  4. Deploy to HuggingFace Spaces for real model testing
186
  5. Request build logs after deployment to confirm success
187
+ 6. After changing embedding dimensions, rebuild ChromaDB: `python -m rag.index_builder --rebuild`
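
A quick local smoke test under mock models might look like this (hedged sketch: `MOCK_MODELS` and `get_model_stack()` are from this repo, while the `.vision` accessor on the mock stack is assumed to mirror `RealModelStack`):

```python
import os
os.environ["MOCK_MODELS"] = "true"      # must be set before settings are imported

from PIL import Image
from models.loader import get_model_stack

stack = get_model_stack()               # loads MockModelStack, no GPU weights needed
result = stack.vision.analyze_image(Image.new("RGB", (640, 480)), context="Room 101, post-fire")
print(result["zone"]["classification"], result["condition"]["level"])
```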
188
 
189
  ## Code Style
190
 
FDAM_AI_Pipeline_Technical_Spec.md DELETED
The diff for this file is too large to render. See raw diff
 
README.md CHANGED
@@ -32,11 +32,10 @@ suggested_hardware: l4x4
32
 
33
  ## Technical Details
34
 
35
- ### Model Stack (~68GB VRAM)
36
- - **Vision (Thinking)**: Qwen3-VL-8B-Thinking (~18GB) - Deep analysis with reasoning
37
- - **Vision (Instruct)**: Qwen3-VL-8B-Instruct (~18GB) - Structured JSON output
38
- - **Embeddings**: Qwen3-VL-Embedding-8B (~16GB)
39
- - **Reranker**: Qwen3-VL-Reranker-8B (~16GB)
40
 
41
  ### Zone Classifications
42
  - **Burn Zone**: Direct fire involvement, structural damage
 
32
 
33
  ## Technical Details
34
 
35
+ ### Model Stack (~38-43GB VRAM)
36
+ - **Vision**: Qwen3-VL-30B-A3B-Thinking-FP8 (~30-35GB) - Reasoning-enhanced analysis with structured JSON output
37
+ - **Embeddings**: Qwen3-VL-Embedding-2B (~4GB)
38
+ - **Reranker**: Qwen3-VL-Reranker-2B (~4GB)
 
39
 
40
  ### Zone Classifications
41
  - **Burn Zone**: Direct fire involvement, structural damage
config/inference.py CHANGED
@@ -2,39 +2,29 @@
2
 
3
  Configuration values aligned with official Qwen3-VL model recommendations
4
  and FDAM Technical Spec requirements.
5
  """
6
 
7
  from dataclasses import dataclass
8
 
9
 
10
  @dataclass
11
- class ThinkingInferenceConfig:
12
- """Configuration for 8B-Thinking model inference.
13
 
14
- Per Qwen3-VL GitHub recommended hyperparameters for thinking models.
15
- Used for deep analysis with <think> chains.
16
  """
17
 
18
- max_new_tokens: int = 8192 # Balanced for reasoning + reasonable time (~7 min)
19
  temperature: float = 0.6 # Per Qwen3-VL GitHub docs
20
  top_p: float = 0.95
21
  top_k: int = 20
22
- do_sample: bool = True
23
- repetition_penalty: float = 1.0 # Per Qwen3-VL docs (not presence_penalty)
24
-
25
-
26
- @dataclass
27
- class VisionInferenceConfig:
28
- """Configuration for 8B-Instruct model inference.
29
-
30
- Per FDAM Technical Spec Section 3. Used for structured JSON output.
31
- """
32
-
33
- max_new_tokens: int = 4096
34
- temperature: float = 0.1 # Low temperature for deterministic JSON output
35
- top_p: float = 0.9
36
- do_sample: bool = True
37
- repetition_penalty: float = 1.1 # Reduce repetition in generated text
38
 
39
 
40
  @dataclass
@@ -55,10 +45,10 @@ class GenerationInferenceConfig:
55
  class EmbeddingConfig:
56
  """Configuration for embedding model.
57
 
58
- Per Qwen3-VL-Embedding-8B config.json: text_config.hidden_size = 4096
59
  """
60
 
61
- embedding_dimension: int = 4096 # Per Qwen3-VL-Embedding-8B hidden_size
62
  normalize: bool = True # L2 normalization (per official implementation)
63
 
64
 
@@ -82,8 +72,7 @@ class RAGConfig:
82
 
83
 
84
  # Default configurations
85
- thinking_config = ThinkingInferenceConfig()
86
- vision_config = VisionInferenceConfig() # Now used for Instruct model
87
  generation_config = GenerationInferenceConfig()
88
  embedding_config = EmbeddingConfig()
89
  reranker_config = RerankerConfig()
 
2
 
3
  Configuration values aligned with official Qwen3-VL model recommendations
4
  and FDAM Technical Spec requirements.
5
+
6
+ Pipeline uses:
7
+ - Vision: Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (single model, FP8 via vLLM)
8
+ - Embedding: Qwen/Qwen3-VL-Embedding-2B (2048-dim)
9
+ - Reranker: Qwen/Qwen3-VL-Reranker-2B
10
  """
11
 
12
  from dataclasses import dataclass
13
 
14
 
15
  @dataclass
16
+ class VisionInferenceConfig:
17
+ """Configuration for 30B-A3B FP8 vision model inference.
18
 
19
+ Single model handles both analysis and structured JSON output.
20
+ Uses vLLM with tensor parallelism across 4 GPUs.
21
  """
22
 
23
+ max_tokens: int = 8192 # vLLM uses max_tokens not max_new_tokens
24
  temperature: float = 0.6 # Per Qwen3-VL GitHub docs
25
  top_p: float = 0.95
26
  top_k: int = 20
27
+ repetition_penalty: float = 1.0 # Per Qwen3-VL docs
28
 
29
 
30
  @dataclass
 
45
  class EmbeddingConfig:
46
  """Configuration for embedding model.
47
 
48
+ Per Qwen3-VL-Embedding-2B config.json: text_config.hidden_size = 2048
49
  """
50
 
51
+ embedding_dimension: int = 2048 # Per Qwen3-VL-Embedding-2B hidden_size
52
  normalize: bool = True # L2 normalization (per official implementation)
53
 
54
 
 
72
 
73
 
74
  # Default configurations
75
+ vision_config = VisionInferenceConfig() # Single 30B-A3B FP8 model
 
76
  generation_config = GenerationInferenceConfig()
77
  embedding_config = EmbeddingConfig()
78
  reranker_config = RerankerConfig()
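
Because the embedding dimension now appears both here and in the ChromaDB embedding functions, a small startup guard can catch a mismatch early (illustrative sketch; `SharedEmbeddingFunction` is defined in `rag/vectorstore.py`):

```python
# Illustrative consistency check: the 2048-dim setting must agree everywhere,
# otherwise the pre-built index has to be rebuilt.
from config.inference import embedding_config
from rag.vectorstore import SharedEmbeddingFunction

assert embedding_config.embedding_dimension == SharedEmbeddingFunction.EMBEDDING_DIM, (
    "Embedding dimension mismatch - rebuild with `python -m rag.index_builder --rebuild`"
)
```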
config/settings.py CHANGED
@@ -17,11 +17,14 @@ class Settings(BaseSettings):
17
  mock_models: bool = True
18
 
19
  # Model paths (for production on HuggingFace Spaces)
20
- # Dual 8B architecture: Thinking for analysis, Instruct for structured output
21
- vision_model_thinking: str = "Qwen/Qwen3-VL-8B-Thinking"
22
- vision_model_instruct: str = "Qwen/Qwen3-VL-8B-Instruct"
23
- embedding_model: str = "Qwen/Qwen3-VL-Embedding-8B"
24
- reranker_model: str = "Qwen/Qwen3-VL-Reranker-8B"
25
 
26
  # ChromaDB
27
  chroma_persist_dir: str = "./chroma_db"
 
17
  mock_models: bool = True
18
 
19
  # Model paths (for production on HuggingFace Spaces)
20
+ # Single 30B-A3B MoE model with FP8 quantization via vLLM (official, reasoning-enhanced)
21
+ vision_model: str = "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"
22
+ embedding_model: str = "Qwen/Qwen3-VL-Embedding-2B"
23
+ reranker_model: str = "Qwen/Qwen3-VL-Reranker-2B"
24
+
25
+ # vLLM configuration
26
+ vllm_tensor_parallel_size: int = 4 # Use all 4 L4 GPUs
27
+ vllm_max_model_len: int = 32768 # Context window
28
 
29
  # ChromaDB
30
  chroma_persist_dir: str = "./chroma_db"
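
For illustration, these settings are overridden through environment variables the same way as before (pydantic `BaseSettings` matches field names case-insensitively; the values below are hypothetical):

```python
import os
os.environ["VISION_MODEL"] = "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"
os.environ["VLLM_MAX_MODEL_LEN"] = "16384"   # e.g. shrink the context to save KV-cache memory

from config.settings import Settings

settings = Settings()
assert settings.vllm_max_model_len == 16384
```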
models/loader.py CHANGED
@@ -2,12 +2,13 @@
2
 
3
  Supports two loading modes:
4
  - MOCK_MODELS=true: Loads mock models (fast, for local dev on RTX 4090)
5
- - MOCK_MODELS=false: Loads all real models at startup (~68GB total)
6
 
7
  Memory Strategy (Simultaneous Loading for 4xL4 GPUs with 88GB total):
8
- - Vision Thinking 8B (~18GB) + Vision Instruct 8B (~18GB) = ~36GB
9
- - Embedding 8B (~16GB) + Reranker 8B (~16GB) = ~32GB
10
- - Total: ~68GB, leaving ~20GB headroom
 
11
  """
12
 
13
  import logging
@@ -29,7 +30,7 @@ def get_model_stack() -> ModelStack:
29
  """Get model stack based on environment configuration.
30
 
31
  For mock models: Loads mock models immediately (fast, for local dev).
32
- For real models: Loads all 4 models at startup (~68GB total).
33
  """
34
  start_time = time.time()
35
 
@@ -43,8 +44,7 @@ def get_model_stack() -> ModelStack:
43
  return stack
44
  else:
45
  logger.info("Loading REAL model stack (production mode)")
46
- logger.info(f"Vision thinking model: {settings.vision_model_thinking}")
47
- logger.info(f"Vision instruct model: {settings.vision_model_instruct}")
48
  logger.info(f"Embedding model: {settings.embedding_model}")
49
  logger.info(f"Reranker model: {settings.reranker_model}")
50
  from models.real import RealModelStack
 
2
 
3
  Supports two loading modes:
4
  - MOCK_MODELS=true: Loads mock models (fast, for local dev on RTX 4090)
5
+ - MOCK_MODELS=false: Loads all real models at startup (~38-43GB total)
6
 
7
  Memory Strategy (Simultaneous Loading for 4xL4 GPUs with 88GB total):
8
+ - Vision 30B-A3B FP8 via vLLM: ~30-35GB
9
+ - Embedding 2B: ~4GB
10
+ - Reranker 2B: ~4GB
11
+ - Total: ~38-43GB, leaving ~45GB+ headroom
12
  """
13
 
14
  import logging
 
30
  """Get model stack based on environment configuration.
31
 
32
  For mock models: Loads mock models immediately (fast, for local dev).
33
+ For real models: Loads all 3 models at startup (~38-43GB total).
34
  """
35
  start_time = time.time()
36
 
 
44
  return stack
45
  else:
46
  logger.info("Loading REAL model stack (production mode)")
47
+ logger.info(f"Vision model: {settings.vision_model} (FP8 via vLLM)")
 
48
  logger.info(f"Embedding model: {settings.embedding_model}")
49
  logger.info(f"Reranker model: {settings.reranker_model}")
50
  from models.real import RealModelStack
models/mock.py CHANGED
@@ -1,7 +1,7 @@
1
  """Mock model implementations for local development on RTX 4090.
2
 
3
- Simulates the dual 8B vision model architecture:
4
- - MockVisionModel simulates two-stage pipeline (Thinking -> Instruct)
5
  - All models loaded together at startup (no lazy loading)
6
  """
7
 
@@ -14,11 +14,10 @@ logger = logging.getLogger(__name__)
14
 
15
 
16
  class MockVisionModel:
17
- """Mock vision model that simulates dual-model pipeline output.
18
 
19
- Simulates:
20
- - Stage 1: Thinking model generates reasoning
21
- - Stage 2: Instruct model formats to JSON
22
  """
23
 
24
  ZONES = ["burn", "near-field", "far-field"]
@@ -54,15 +53,13 @@ class MockVisionModel:
54
  }
55
 
56
  def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
57
- """Return mock vision analysis simulating dual-model pipeline output."""
58
- logger.debug(f"Mock dual-model vision analysis (context: {len(context)} chars)")
59
 
60
- # Simulate Stage 1: Thinking model selects classifications
61
  selected_zone = random.choice(self.ZONES)
62
  selected_condition = random.choice(self.CONDITIONS)
63
 
64
- logger.debug("Mock Stage 1 (Thinking): Generated reasoning")
65
- logger.debug("Mock Stage 2 (Instruct): Formatted to JSON")
66
  logger.info(f"Mock vision result: zone={selected_zone}, condition={selected_condition}")
67
 
68
  # Generate 2-4 random materials
@@ -141,16 +138,16 @@ class MockVisionModel:
141
  class MockEmbeddingModel:
142
  """Mock embedding model that returns deterministic vectors.
143
 
144
- Dimension matches Qwen3-VL-Embedding-8B (4096-dim).
145
  Uses last-token pooling concept with L2 normalization.
146
  """
147
 
148
- def __init__(self, dimension: int = 4096):
149
- """Initialize with dimension matching real Qwen3-VL-Embedding-8B model."""
150
  self.dimension = dimension
151
 
152
  def embed(self, text: str) -> list[float]:
153
- """Return mock embedding vector (4096-dim, L2 normalized).
154
 
155
  Uses hash of text for reproducibility, simulating last-token pooling.
156
  """
@@ -176,7 +173,7 @@ class MockEmbeddingModel:
176
  class MockRerankerModel:
177
  """Mock reranker that returns realistic relevance scores.
178
 
179
- Simulates Qwen3-VL-Reranker behavior with 0-1 sigmoid-like scores.
180
  """
181
 
182
  def rerank(self, query: str, documents: list[str]) -> list[float]:
@@ -236,9 +233,9 @@ class MockModelStack:
236
  def load_all(self) -> "MockModelStack":
237
  """Load all mock models."""
238
  logger.info("Loading mock models for local development")
239
- logger.debug(" Vision model: MockVisionModel (simulates dual 8B pipeline)")
240
- logger.debug(" Embedding model: MockEmbeddingModel (4096-dim)")
241
- logger.debug(" Reranker model: MockRerankerModel")
242
  self._loaded = True
243
  logger.info("All mock models loaded successfully")
244
  return self
 
1
  """Mock model implementations for local development on RTX 4090.
2
 
3
+ Simulates the 30B-A3B FP8 vision model architecture:
4
+ - MockVisionModel simulates single-model analysis + JSON output
5
  - All models loaded together at startup (no lazy loading)
6
  """
7
 
 
14
 
15
 
16
  class MockVisionModel:
17
+ """Mock vision model that simulates 30B-A3B FP8 model output.
18
 
19
+ Simulates single-model analysis with structured JSON output.
20
+ The real model uses vLLM with FP8 quantization.
 
21
  """
22
 
23
  ZONES = ["burn", "near-field", "far-field"]
 
53
  }
54
 
55
  def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
56
+ """Return mock vision analysis simulating 30B-A3B FP8 model output."""
57
+ logger.debug(f"Mock 30B-A3B FP8 vision analysis (context: {len(context)} chars)")
58
 
59
+ # Simulate model generating analysis + JSON
60
  selected_zone = random.choice(self.ZONES)
61
  selected_condition = random.choice(self.CONDITIONS)
62
 
 
 
63
  logger.info(f"Mock vision result: zone={selected_zone}, condition={selected_condition}")
64
 
65
  # Generate 2-4 random materials
 
138
  class MockEmbeddingModel:
139
  """Mock embedding model that returns deterministic vectors.
140
 
141
+ Dimension matches Qwen3-VL-Embedding-2B (2048-dim).
142
  Uses last-token pooling concept with L2 normalization.
143
  """
144
 
145
+ def __init__(self, dimension: int = 2048):
146
+ """Initialize with dimension matching real Qwen3-VL-Embedding-2B model."""
147
  self.dimension = dimension
148
 
149
  def embed(self, text: str) -> list[float]:
150
+ """Return mock embedding vector (2048-dim, L2 normalized).
151
 
152
  Uses hash of text for reproducibility, simulating last-token pooling.
153
  """
 
173
  class MockRerankerModel:
174
  """Mock reranker that returns realistic relevance scores.
175
 
176
+ Simulates Qwen3-VL-Reranker-2B behavior with 0-1 sigmoid-like scores.
177
  """
178
 
179
  def rerank(self, query: str, documents: list[str]) -> list[float]:
 
233
  def load_all(self) -> "MockModelStack":
234
  """Load all mock models."""
235
  logger.info("Loading mock models for local development")
236
+ logger.debug(" Vision model: MockVisionModel (simulates 30B-A3B FP8)")
237
+ logger.debug(" Embedding model: MockEmbeddingModel (2048-dim)")
238
+ logger.debug(" Reranker model: MockRerankerModel (simulates 2B)")
239
  self._loaded = True
240
  logger.info("All mock models loaded successfully")
241
  return self
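
The deterministic mock embedding can be pictured like this (a sketch of the idea rather than the exact `MockEmbeddingModel` code: hash the text into a seed, draw a fixed 2048-dim vector, L2-normalize):

```python
import hashlib
import numpy as np

def mock_embed(text: str, dimension: int = 2048) -> list[float]:
    # Same text -> same seed -> same vector, mimicking a real embedder deterministically.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "little")
    vec = np.random.default_rng(seed).standard_normal(dimension)
    vec /= np.linalg.norm(vec)               # L2 normalization, matching the real model
    return vec.tolist()

assert mock_embed("soot on painted drywall") == mock_embed("soot on painted drywall")
```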
models/real.py CHANGED
@@ -1,17 +1,13 @@
1
  """Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).
2
 
3
- This module loads the actual Qwen3-VL models for production use.
4
- All models are loaded simultaneously at startup (~68GB total).
5
-
6
- Memory Strategy (Simultaneous Loading):
7
- - Vision Thinking 8B (~18GB): Deep analysis with reasoning chains
8
- - Vision Instruct 8B (~18GB): Structured JSON output formatting
9
- - Embedding 8B (~16GB): RAG document embedding
10
- - Reranker 8B (~16GB): RAG retrieval reranking
11
- - Total: ~68GB on 88GB available (20GB headroom)
12
 
13
  Model Loading:
14
- - Vision: Qwen3VLForConditionalGeneration (standard transformers)
15
  - Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
16
  - Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
17
  """
@@ -24,7 +20,7 @@ import torch
24
  from typing import Any
25
  from PIL import Image
26
 
27
- from config.inference import thinking_config, vision_config
28
  from config.settings import settings
29
 
30
  logger = logging.getLogger(__name__)
@@ -33,9 +29,10 @@ logger = logging.getLogger(__name__)
33
  class RealModelStack:
34
  """Real model stack for production on HuggingFace Spaces.
35
 
36
- Loads all 4 models simultaneously at initialization (~68GB total):
37
- - Dual vision (Thinking + Instruct): ~36GB
38
- - Embedding + Reranker: ~32GB
 
39
  """
40
 
41
  def __init__(self):
@@ -56,54 +53,53 @@ class RealModelStack:
56
  logger.info(f" GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {free:.1f}GB free / {total:.1f}GB total")
57
 
58
  def load_all(self) -> "RealModelStack":
59
- """Load all models simultaneously.
60
 
61
- Loads dual vision models (Thinking + Instruct) and RAG models
62
- (Embedding + Reranker) for ~68GB total VRAM usage.
63
  """
64
  if self._loaded:
65
  logger.debug("Models already loaded, skipping")
66
  return self
67
 
68
- from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
69
-
70
- device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
71
- logger.info(f"Loading all models on {device_type}")
72
  self._log_gpu_status()
73
 
74
  total_start = time.time()
75
 
76
- # Vision Thinking model (~18GB in BF16)
77
- logger.info(f"Loading vision thinking model: {settings.vision_model_thinking}")
78
- thinking_start = time.time()
79
- self.models["vision_thinking"] = Qwen3VLForConditionalGeneration.from_pretrained(
80
- settings.vision_model_thinking,
81
- torch_dtype=torch.bfloat16,
82
- device_map="auto",
83
- trust_remote_code=True,
84
- )
85
- self.processors["vision_thinking"] = AutoProcessor.from_pretrained(
86
- settings.vision_model_thinking,
87
  trust_remote_code=True,
 
 
88
  )
89
- logger.info(f"Vision thinking model loaded in {time.time() - thinking_start:.2f}s")
90
 
91
- # Vision Instruct model (~18GB in BF16)
92
- logger.info(f"Loading vision instruct model: {settings.vision_model_instruct}")
93
- instruct_start = time.time()
94
- self.models["vision_instruct"] = Qwen3VLForConditionalGeneration.from_pretrained(
95
- settings.vision_model_instruct,
96
- torch_dtype=torch.bfloat16,
97
- device_map="auto",
98
  trust_remote_code=True,
99
  )
100
- self.processors["vision_instruct"] = AutoProcessor.from_pretrained(
101
- settings.vision_model_instruct,
102
- trust_remote_code=True,
103
  )
104
- logger.info(f"Vision instruct model loaded in {time.time() - instruct_start:.2f}s")
105
 
106
- # Embedding model (~16GB in BF16) - Using official Qwen3VLEmbedder
 
 
107
  logger.info(f"Loading embedding model: {settings.embedding_model}")
108
  embed_start = time.time()
109
  from scripts.qwen3_vl import Qwen3VLEmbedder
@@ -115,7 +111,7 @@ class RealModelStack:
115
  self.processors["embedding"] = self.models["embedding"].processor
116
  logger.info(f"Embedding model loaded in {time.time() - embed_start:.2f}s")
117
 
118
- # Reranker model (~16GB in BF16) - Using official Qwen3VLReranker
119
  logger.info(f"Loading reranker model: {settings.reranker_model}")
120
  reranker_start = time.time()
121
  from scripts.qwen3_vl import Qwen3VLReranker
@@ -138,15 +134,14 @@ class RealModelStack:
138
  return self._loaded
139
 
140
  @property
141
- def vision(self) -> "DualVisionModel":
142
- """Return dual vision model wrapped for pipeline consumption."""
143
  if not self._loaded:
144
  raise RuntimeError("Models not loaded. Call load_all() first.")
145
- return DualVisionModel(
146
- thinking_model=self.models["vision_thinking"],
147
- thinking_processor=self.processors["vision_thinking"],
148
- instruct_model=self.models["vision_instruct"],
149
- instruct_processor=self.processors["vision_instruct"],
150
  )
151
 
152
  @property
@@ -164,20 +159,21 @@ class RealModelStack:
164
  return RealRerankerModel(self.models["reranker"], self.processors["reranker"])
165
 
166
 
167
- class DualVisionModel:
168
- """Dual vision model for two-stage fire damage analysis.
169
 
170
- Uses Qwen3-VL-8B-Thinking for deep analysis with reasoning chains,
171
- then Qwen3-VL-8B-Instruct to format results into structured JSON.
 
172
 
173
- Pipeline: Image -> Thinking (analysis) -> Instruct (JSON formatting) -> Output
174
  """
175
 
176
- # System prompt for FDAM fire damage assessment (per Technical Spec Section 7)
177
  VISION_SYSTEM_PROMPT = """You are an expert industrial hygienist analyzing fire damage images for the FDAM (Fire Damage Assessment Methodology) framework.
178
 
179
  ## Your Task
180
- Analyze the provided image and extract structured information about fire damage, materials, and conditions.
181
 
182
  ## Zone Classification Criteria
183
  - **Burn Zone**: Direct fire involvement. Look for structural char, complete combustion, exposed/damaged structural elements.
@@ -191,121 +187,105 @@ Analyze the provided image and extract structured information about fire damage,
191
  - **Heavy**: Thick deposits; surface texture obscured; heavy coating visible.
192
  - **Structural Damage**: Physical damage requiring repair before cleaning (charring, warping, holes, collapse).
193
 
194
- ## Material Identification
195
- Identify visible materials and categorize as:
196
  - **Non-porous**: steel, concrete, glass, metal, CMU (concrete masonry unit)
197
  - **Semi-porous**: painted drywall, sealed wood
198
  - **Porous**: unpainted drywall, carpet, insulation, acoustic tile, upholstery
199
  - **HVAC**: rigid ductwork, flexible ductwork
200
 
201
  ## Combustion Particle Visual Indicators
202
- - **Soot**: Black/dark gray coating with oily/sticky appearance; fine uniform texture; often creates "shadow" patterns
203
- - **Char**: Black angular fragments; visible wood grain or fibrous structure; larger particles
204
- - **Ash**: Gray/white powdery residue; crystalline appearance; often found with char
205
-
206
- ## Important Notes
207
- - This is VISUAL assessment only - definitive particle identification requires laboratory analysis
208
- - When uncertain between two classifications, note both with relative confidence
209
- - Flag any areas that require professional on-site verification
210
- - Note any potential access issues visible in the image"""
211
-
212
- # Analysis prompt for Thinking model (open-ended reasoning)
213
- THINKING_ANALYSIS_PROMPT = """Analyze this fire damage image thoroughly. Consider:
214
-
215
- 1. What zone classification applies (burn, near-field, or far-field) and why?
216
- 2. What is the contamination condition level (background, light, moderate, heavy, or structural-damage)?
217
- 3. What materials are visible and what is their porosity category?
218
- 4. What combustion indicators (soot, char, ash) are present and where?
219
- 5. Are there any structural concerns or access issues?
220
- 6. Where would you recommend sampling and what type of samples?
221
-
222
- Provide detailed reasoning for each assessment, explaining the visual evidence that supports your conclusions."""
223
-
224
- # Formatter prompt for Instruct model (structured JSON output)
225
- INSTRUCT_FORMATTER_SYSTEM = """You are a technical document formatter. Your task is to convert fire damage analysis into a precise JSON structure.
226
-
227
- Preserve all findings from the analysis accurately. Assign confidence scores (0.0-1.0) based on the certainty expressed in the analysis:
228
- - Very certain statements: 0.85-0.95
229
- - Reasonably confident: 0.70-0.84
230
- - Somewhat uncertain: 0.50-0.69
231
- - Uncertain/fallback: 0.30-0.49"""
232
-
233
- INSTRUCT_FORMATTER_PROMPT = """Based on the following fire damage analysis, generate a JSON response with this exact structure:
234
-
235
- <analysis>
236
- {analysis}
237
- </analysis>
238
-
239
- Generate JSON with this structure:
240
- {{
241
- "zone": {{
242
  "classification": "burn" | "near-field" | "far-field",
243
  "confidence": 0.0-1.0,
244
  "reasoning": "explanation"
245
- }},
246
- "condition": {{
247
  "level": "background" | "light" | "moderate" | "heavy" | "structural-damage",
248
  "confidence": 0.0-1.0,
249
  "reasoning": "explanation"
250
- }},
251
  "materials": [
252
- {{
253
- "type": "material type (e.g., drywall, concrete, steel, wood)",
254
  "category": "non-porous" | "semi-porous" | "porous" | "hvac",
255
  "confidence": 0.0-1.0,
256
  "location_description": "where in image",
257
- "bounding_box": {{"x": 0.0-1.0, "y": 0.0-1.0, "width": 0.0-1.0, "height": 0.0-1.0}}
258
- }}
259
  ],
260
- "combustion_indicators": {{
261
  "soot_visible": true/false,
262
  "soot_pattern": "description or null",
263
  "char_visible": true/false,
264
  "char_description": "description or null",
265
  "ash_visible": true/false,
266
  "ash_description": "description or null"
267
- }},
268
  "structural_concerns": ["list of structural issues if any"],
269
  "access_issues": ["list of access problems if any"],
270
  "recommended_sampling_locations": [
271
- {{
272
  "description": "where to sample",
273
  "sample_type": "tape_lift" | "surface_wipe" | "air_sample",
274
  "priority": "high" | "medium" | "low"
275
- }}
276
  ],
277
  "flags_for_review": ["any items requiring human review"]
278
- }}
279
 
280
  IMPORTANT: Return ONLY valid JSON, no additional text."""
281
 
282
- def __init__(self, thinking_model, thinking_processor, instruct_model, instruct_processor):
283
- self.thinking_model = thinking_model
284
- self.thinking_processor = thinking_processor
285
- self.instruct_model = instruct_model
286
- self.instruct_processor = instruct_processor
287
 
288
  def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
289
- """Analyze an image using two-stage pipeline.
 
 
 
 
290
 
291
- Stage 1: Thinking model generates detailed analysis with reasoning
292
- Stage 2: Instruct model formats the analysis into structured JSON
293
  """
294
  start_time = time.time()
295
- logger.debug(f"Starting dual-model vision analysis (context: {len(context)} chars)")
296
 
297
  try:
298
- # Stage 1: Deep analysis with Thinking model
299
- thinking_start = time.time()
300
- analysis_text = self._run_thinking_stage(image, context)
301
- thinking_time = time.time() - thinking_start
302
- logger.debug(f"Thinking stage completed in {thinking_time:.2f}s, output: {len(analysis_text)} chars")
303
-
304
- # Stage 2: Format to JSON with Instruct model
305
- instruct_start = time.time()
306
- result = self._run_instruct_stage(analysis_text)
307
- instruct_time = time.time() - instruct_start
308
- logger.debug(f"Instruct stage completed in {instruct_time:.2f}s")
 
 
 
 
 
 
 
 
 
 
 
 
 
309
 
310
  # Log result summary
311
  total_time = time.time() - start_time
@@ -314,7 +294,7 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
314
  condition = result.get("condition", {}).get("level", "unknown")
315
  condition_conf = result.get("condition", {}).get("confidence", 0)
316
  num_materials = len(result.get("materials", []))
317
- logger.info(f"Vision analysis complete in {total_time:.2f}s (thinking: {thinking_time:.2f}s, instruct: {instruct_time:.2f}s): "
318
  f"zone={zone} ({zone_conf:.2f}), condition={condition} ({condition_conf:.2f}), "
319
  f"materials={num_materials}")
320
 
@@ -324,159 +304,32 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
324
  logger.error(f"Vision analysis failed: {e}")
325
  return self._get_fallback_response(str(e))
326
 
327
- def _run_thinking_stage(self, image: Image.Image, context: str) -> str:
328
- """Run the Thinking model to generate detailed analysis."""
329
- try:
330
- from qwen_vl_utils import process_vision_info
331
- except ImportError:
332
- logger.warning("qwen_vl_utils not available, using basic processing")
333
- process_vision_info = None
334
 
335
- # Build the analysis prompt with context
336
- prompt = self.THINKING_ANALYSIS_PROMPT
 
 
 
 
337
  if context:
338
- prompt = f"Context: {context}\n\n{prompt}"
339
 
340
- # Prepare messages in Qwen-VL format with system prompt
341
  messages = [
342
- {
343
- "role": "system",
344
- "content": self.VISION_SYSTEM_PROMPT,
345
- },
346
  {
347
  "role": "user",
348
  "content": [
349
  {"type": "image", "image": image},
350
- {"type": "text", "text": prompt},
351
  ],
352
- }
353
- ]
354
-
355
- # Apply chat template with thinking enabled (default for Thinking model)
356
- text = self.thinking_processor.apply_chat_template(
357
- messages, tokenize=False, add_generation_prompt=True
358
- )
359
-
360
- # Process vision info if available
361
- if process_vision_info:
362
- image_inputs, video_inputs = process_vision_info(messages)
363
- inputs = self.thinking_processor(
364
- text=[text],
365
- images=image_inputs,
366
- videos=video_inputs,
367
- return_tensors="pt",
368
- padding=True,
369
- )
370
- else:
371
- # Fallback: basic image processing
372
- inputs = self.thinking_processor(
373
- text=[text],
374
- images=[image],
375
- return_tensors="pt",
376
- padding=True,
377
- )
378
-
379
- # Generate response using thinking config (per Qwen3-VL GitHub recommendations)
380
- logger.debug(f"Thinking inference config: max_new_tokens={thinking_config.max_new_tokens}, "
381
- f"temp={thinking_config.temperature}, top_p={thinking_config.top_p}, top_k={thinking_config.top_k}")
382
-
383
- with torch.no_grad():
384
- outputs = self.thinking_model.generate(
385
- **inputs,
386
- max_new_tokens=thinking_config.max_new_tokens,
387
- do_sample=thinking_config.do_sample,
388
- temperature=thinking_config.temperature,
389
- top_p=thinking_config.top_p,
390
- top_k=thinking_config.top_k,
391
- repetition_penalty=thinking_config.repetition_penalty,
392
- )
393
-
394
- # Decode response - get raw token IDs first for proper parsing
395
- output_ids = outputs[0].tolist()
396
-
397
- # The Thinking model's chat template includes opening <think> tag
398
- # Output format: reasoning_content</think>final_answer
399
- # Get </think> token ID dynamically from tokenizer (more robust than hardcoding)
400
- think_end_token = self.thinking_processor.tokenizer.encode(
401
- "</think>", add_special_tokens=False
402
- )[0]
403
-
404
- try:
405
- # Find the </think> token position
406
- think_end_idx = len(output_ids) - output_ids[::-1].index(think_end_token)
407
- # Extract reasoning (before </think>) and answer (after </think>)
408
- reasoning_ids = output_ids[:think_end_idx]
409
- answer_ids = output_ids[think_end_idx:]
410
-
411
- reasoning = self.thinking_processor.decode(
412
- reasoning_ids, skip_special_tokens=True
413
- ).strip()
414
- final_answer = self.thinking_processor.decode(
415
- answer_ids, skip_special_tokens=True
416
- ).strip()
417
-
418
- logger.debug(f"Extracted thinking: {len(reasoning)} chars reasoning, {len(final_answer)} chars answer")
419
- return f"Reasoning:\n{reasoning}\n\nConclusions:\n{final_answer}"
420
-
421
- except ValueError:
422
- # No </think> token found - use full response as-is
423
- response_text = self.thinking_processor.decode(
424
- output_ids, skip_special_tokens=True
425
- ).strip()
426
- logger.debug(f"No </think> token found, using full response: {len(response_text)} chars")
427
- return response_text
428
-
429
- def _run_instruct_stage(self, analysis_text: str) -> dict[str, Any]:
430
- """Run the Instruct model to format analysis into JSON."""
431
- # Prepare messages for Instruct model (text-only, no image)
432
- prompt = self.INSTRUCT_FORMATTER_PROMPT.format(analysis=analysis_text)
433
-
434
- messages = [
435
- {
436
- "role": "system",
437
- "content": self.INSTRUCT_FORMATTER_SYSTEM,
438
  },
439
- {
440
- "role": "user",
441
- "content": prompt,
442
- }
443
  ]
444
-
445
- # Apply chat template
446
- text = self.instruct_processor.apply_chat_template(
447
- messages, tokenize=False, add_generation_prompt=True
448
- )
449
-
450
- inputs = self.instruct_processor(
451
- text=[text],
452
- return_tensors="pt",
453
- padding=True,
454
- )
455
-
456
- # Generate response using vision config (low temp for consistent JSON)
457
- logger.debug(f"Instruct inference config: max_new_tokens={vision_config.max_new_tokens}, "
458
- f"temp={vision_config.temperature}")
459
-
460
- with torch.no_grad():
461
- outputs = self.instruct_model.generate(
462
- **inputs,
463
- max_new_tokens=vision_config.max_new_tokens,
464
- do_sample=vision_config.do_sample,
465
- temperature=vision_config.temperature,
466
- top_p=vision_config.top_p,
467
- repetition_penalty=vision_config.repetition_penalty,
468
- )
469
-
470
- # Decode response
471
- response_text = self.instruct_processor.decode(
472
- outputs[0], skip_special_tokens=True
473
- )
474
-
475
- # Parse JSON from response
476
- return self._parse_json_response(response_text)
477
 
478
  def _parse_json_response(self, response: str) -> dict[str, Any]:
479
- """Parse JSON response from instruct model."""
480
  try:
481
  # Try to extract JSON from response
482
  json_match = re.search(r'\{[\s\S]*\}', response)
@@ -484,7 +337,7 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
484
  json_str = json_match.group()
485
  return json.loads(json_str)
486
  else:
487
- logger.warning("No JSON found in instruct response")
488
  return self._get_fallback_response("No JSON in response")
489
  except json.JSONDecodeError as e:
490
  logger.warning(f"Failed to parse JSON: {e}")
@@ -533,6 +386,8 @@ class RealEmbeddingModel:
533
 
534
  Uses the official Qwen3VLEmbedder from QwenLM/Qwen3-VL-Embedding.
535
  The model handles last-token pooling and L2 normalization internally.
 
 
536
  """
537
 
538
  def __init__(self, model, processor):
@@ -557,7 +412,7 @@ class RealEmbeddingModel:
557
  text: Input text to embed
558
 
559
  Returns:
560
- List of floats representing the embedding (4096-dim for 8B model)
561
  """
562
  try:
563
  # Use official process() API - expects list of dicts
@@ -569,8 +424,8 @@ class RealEmbeddingModel:
569
 
570
  except Exception as e:
571
  logger.error(f"Embedding generation failed: {e}")
572
- # Return zero vector as fallback (4096-dim per Qwen3-VL-Embedding-8B)
573
- hidden_size = getattr(self.model.model.config, "hidden_size", 4096)
574
  return [0.0] * hidden_size
575
 
576
  def embed_batch(self, texts: list[str]) -> list[list[float]]:
@@ -584,7 +439,7 @@ class RealEmbeddingModel:
584
  return [emb.cpu().tolist() for emb in embeddings]
585
  except Exception as e:
586
  logger.error(f"Batch embedding generation failed: {e}")
587
- hidden_size = getattr(self.model.model.config, "hidden_size", 4096)
588
  return [[0.0] * hidden_size for _ in texts]
589
 
590
 
@@ -597,7 +452,7 @@ class RealRerankerModel:
597
  - Creates a binary linear layer: weight = yes_weight - no_weight
598
  - Scores = sigmoid(linear(last_token_hidden_state))
599
 
600
- Reference: https://github.com/QwenLM/Qwen3-VL-Embedding
601
  """
602
 
603
  def __init__(self, model, processor):
 
1
  """Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).
2
 
3
+ This module loads the production models:
4
+ - Vision: Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (~30-35GB via vLLM)
5
+ - Embedding: Qwen/Qwen3-VL-Embedding-2B (~4GB)
6
+ - Reranker: Qwen/Qwen3-VL-Reranker-2B (~4GB)
7
+ - Total: ~38-43GB on 88GB available (45GB+ headroom)
 
8
 
9
  Model Loading:
10
+ - Vision: vLLM with FP8 quantization (built-in) and tensor parallelism
11
  - Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
12
  - Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
13
  """
 
20
  from typing import Any
21
  from PIL import Image
22
 
23
+ from config.inference import vision_config
24
  from config.settings import settings
25
 
26
  logger = logging.getLogger(__name__)
 
29
  class RealModelStack:
30
  """Real model stack for production on HuggingFace Spaces.
31
 
32
+ Loads all 3 models at initialization (~38-43GB total):
33
+ - FP8 Vision via vLLM: ~30-35GB
34
+ - Embedding 2B: ~4GB
35
+ - Reranker 2B: ~4GB
36
  """
37
 
38
  def __init__(self):
 
53
  logger.info(f" GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {free:.1f}GB free / {total:.1f}GB total")
54
 
55
  def load_all(self) -> "RealModelStack":
56
+ """Load all models.
57
 
58
+ Loads FP8 vision model via vLLM and RAG models (Embedding + Reranker).
 
59
  """
60
  if self._loaded:
61
  logger.debug("Models already loaded, skipping")
62
  return self
63
 
64
+ logger.info("Loading production models...")
 
 
 
65
  self._log_gpu_status()
66
 
67
  total_start = time.time()
68
 
69
+ # Vision model via vLLM (~30-35GB in FP8)
70
+ logger.info(f"Loading vision model: {settings.vision_model}")
71
+ vision_start = time.time()
72
+
73
+ from vllm import LLM, SamplingParams
74
+ from transformers import AutoProcessor
75
+
76
+ self.models["vision"] = LLM(
77
+ model=settings.vision_model,
78
+ # FP8 quantization is built into model weights, no quantization param needed
79
+ tensor_parallel_size=settings.vllm_tensor_parallel_size,
80
  trust_remote_code=True,
81
+ gpu_memory_utilization=0.70, # Per Qwen FP8 model recommendations
82
+ max_model_len=settings.vllm_max_model_len,
83
  )
 
84
 
85
+ # Load processor for chat template formatting
86
+ self.processors["vision"] = AutoProcessor.from_pretrained(
87
+ settings.vision_model,
 
88
  trust_remote_code=True,
89
  )
90
+
91
+ # Store sampling params for inference
92
+ self.models["vision_sampling_params"] = SamplingParams(
93
+ max_tokens=vision_config.max_tokens,
94
+ temperature=vision_config.temperature,
95
+ top_p=vision_config.top_p,
96
+ top_k=vision_config.top_k,
97
+ repetition_penalty=vision_config.repetition_penalty,
98
  )
 
99
 
100
+ logger.info(f"Vision model loaded in {time.time() - vision_start:.2f}s")
101
+
102
+ # Embedding model (~4GB in BF16) - Using official Qwen3VLEmbedder
103
  logger.info(f"Loading embedding model: {settings.embedding_model}")
104
  embed_start = time.time()
105
  from scripts.qwen3_vl import Qwen3VLEmbedder
 
111
  self.processors["embedding"] = self.models["embedding"].processor
112
  logger.info(f"Embedding model loaded in {time.time() - embed_start:.2f}s")
113
 
114
+ # Reranker model (~4GB in BF16) - Using official Qwen3VLReranker
115
  logger.info(f"Loading reranker model: {settings.reranker_model}")
116
  reranker_start = time.time()
117
  from scripts.qwen3_vl import Qwen3VLReranker
 
134
  return self._loaded
135
 
136
  @property
137
+ def vision(self) -> "VisionModel":
138
+ """Return FP8 vision model wrapped for pipeline consumption."""
139
  if not self._loaded:
140
  raise RuntimeError("Models not loaded. Call load_all() first.")
141
+ return VisionModel(
142
+ model=self.models["vision"],
143
+ processor=self.processors["vision"],
144
+ sampling_params=self.models["vision_sampling_params"],
 
145
  )
146
 
147
  @property
 
159
  return RealRerankerModel(self.models["reranker"], self.processors["reranker"])
160
 
161
 
162
+ class VisionModel:
163
+ """Vision model for fire damage analysis.
164
 
165
+ Uses Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 via vLLM for inference.
166
+ Reasoning-enhanced model handles analysis with extended thinking
167
+ and outputs structured JSON.
168
 
169
+ Pipeline: Image -> Thinking Model (reasoning + JSON) -> Output
170
  """
171
 
172
+ # System prompt for FDAM fire damage assessment
173
  VISION_SYSTEM_PROMPT = """You are an expert industrial hygienist analyzing fire damage images for the FDAM (Fire Damage Assessment Methodology) framework.
174
 
175
  ## Your Task
176
+ Analyze the provided image and return a structured JSON response with fire damage assessment.
177
 
178
  ## Zone Classification Criteria
179
  - **Burn Zone**: Direct fire involvement. Look for structural char, complete combustion, exposed/damaged structural elements.
 
187
  - **Heavy**: Thick deposits; surface texture obscured; heavy coating visible.
188
  - **Structural Damage**: Physical damage requiring repair before cleaning (charring, warping, holes, collapse).
189
 
190
+ ## Material Categories
 
191
  - **Non-porous**: steel, concrete, glass, metal, CMU (concrete masonry unit)
192
  - **Semi-porous**: painted drywall, sealed wood
193
  - **Porous**: unpainted drywall, carpet, insulation, acoustic tile, upholstery
194
  - **HVAC**: rigid ductwork, flexible ductwork
195
 
196
  ## Combustion Particle Visual Indicators
197
+ - **Soot**: Black/dark gray coating with oily/sticky appearance; fine uniform texture
198
+ - **Char**: Black angular fragments; visible wood grain or fibrous structure
199
+ - **Ash**: Gray/white powdery residue; crystalline appearance"""
200
+
201
+ # JSON output format prompt
202
+ JSON_FORMAT_PROMPT = """Analyze this fire damage image and return a JSON response with this exact structure:
203
+
204
+ {
205
+ "zone": {
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
206
  "classification": "burn" | "near-field" | "far-field",
207
  "confidence": 0.0-1.0,
208
  "reasoning": "explanation"
209
+ },
210
+ "condition": {
211
  "level": "background" | "light" | "moderate" | "heavy" | "structural-damage",
212
  "confidence": 0.0-1.0,
213
  "reasoning": "explanation"
214
+ },
215
  "materials": [
216
+ {
217
+ "type": "material type",
218
  "category": "non-porous" | "semi-porous" | "porous" | "hvac",
219
  "confidence": 0.0-1.0,
220
  "location_description": "where in image",
221
+ "bounding_box": {"x": 0.0-1.0, "y": 0.0-1.0, "width": 0.0-1.0, "height": 0.0-1.0}
222
+ }
223
  ],
224
+ "combustion_indicators": {
225
  "soot_visible": true/false,
226
  "soot_pattern": "description or null",
227
  "char_visible": true/false,
228
  "char_description": "description or null",
229
  "ash_visible": true/false,
230
  "ash_description": "description or null"
231
+ },
232
  "structural_concerns": ["list of structural issues if any"],
233
  "access_issues": ["list of access problems if any"],
234
  "recommended_sampling_locations": [
235
+ {
236
  "description": "where to sample",
237
  "sample_type": "tape_lift" | "surface_wipe" | "air_sample",
238
  "priority": "high" | "medium" | "low"
239
+ }
240
  ],
241
  "flags_for_review": ["any items requiring human review"]
242
+ }
243
 
244
  IMPORTANT: Return ONLY valid JSON, no additional text."""
245
 
246
+ def __init__(self, model, processor, sampling_params):
247
+ self.model = model
248
+ self.processor = processor
249
+ self.sampling_params = sampling_params
 
250
 
251
  def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
252
+ """Analyze an image using the FP8 vision model via vLLM.
253
+
254
+ Args:
255
+ image: PIL Image to analyze
256
+ context: Optional context string (room info, etc.)
257
 
258
+ Returns:
259
+ Structured dict with zone, condition, materials, etc.
260
  """
261
  start_time = time.time()
262
+ logger.debug(f"Starting FP8 vision analysis (context: {len(context)} chars)")
263
 
264
  try:
265
+ # Build messages in Qwen3-VL format
266
+ messages = self._build_messages(image, context)
267
+
268
+ # Apply chat template to format prompt correctly
269
+ prompt = self.processor.apply_chat_template(
270
+ messages,
271
+ tokenize=False,
272
+ add_generation_prompt=True,
273
+ )
274
+
275
+ # Generate response using vLLM multimodal API
276
+ # Per vLLM docs: pass PIL image directly in multi_modal_data dict
277
+ outputs = self.model.generate(
278
+ prompts=[{
279
+ "prompt": prompt,
280
+ "multi_modal_data": {"image": image}, # Single PIL image
281
+ }],
282
+ sampling_params=self.sampling_params,
283
+ )
284
+
285
+ response_text = outputs[0].outputs[0].text
286
+
287
+ # Parse JSON from response
288
+ result = self._parse_json_response(response_text)
289
 
290
  # Log result summary
291
  total_time = time.time() - start_time
 
294
  condition = result.get("condition", {}).get("level", "unknown")
295
  condition_conf = result.get("condition", {}).get("confidence", 0)
296
  num_materials = len(result.get("materials", []))
297
+ logger.info(f"Vision analysis complete in {total_time:.2f}s: "
298
  f"zone={zone} ({zone_conf:.2f}), condition={condition} ({condition_conf:.2f}), "
299
  f"materials={num_materials}")
300
 
 
304
  logger.error(f"Vision analysis failed: {e}")
305
  return self._get_fallback_response(str(e))
306
 
307
+ def _build_messages(self, image: Image.Image, context: str) -> list[dict]:
308
+ """Build messages in Qwen3-VL format for chat template.
 
 
 
 
 
309
 
310
+ Qwen3-VL expects:
311
+ - System message with role="system"
312
+ - User message with mixed content [{"type": "image", ...}, {"type": "text", ...}]
313
+ """
314
+ # Build user text content
315
+ user_text = self.JSON_FORMAT_PROMPT
316
  if context:
317
+ user_text = f"Context: {context}\n\n{user_text}"
318
 
 
319
  messages = [
320
+ {"role": "system", "content": self.VISION_SYSTEM_PROMPT},
 
 
 
321
  {
322
  "role": "user",
323
  "content": [
324
  {"type": "image", "image": image},
325
+ {"type": "text", "text": user_text},
326
  ],
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
327
  },
 
 
 
 
328
  ]
329
+ return messages
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
330
 
331
  def _parse_json_response(self, response: str) -> dict[str, Any]:
332
+ """Parse JSON response from model."""
333
  try:
334
  # Try to extract JSON from response
335
  json_match = re.search(r'\{[\s\S]*\}', response)
 
337
  json_str = json_match.group()
338
  return json.loads(json_str)
339
  else:
340
+ logger.warning("No JSON found in response")
341
  return self._get_fallback_response("No JSON in response")
342
  except json.JSONDecodeError as e:
343
  logger.warning(f"Failed to parse JSON: {e}")
 
386
 
387
  Uses the official Qwen3VLEmbedder from QwenLM/Qwen3-VL-Embedding.
388
  The model handles last-token pooling and L2 normalization internally.
389
+
390
+ Model: Qwen/Qwen3-VL-Embedding-2B (2048-dim output)
391
  """
392
 
393
  def __init__(self, model, processor):
 
412
  text: Input text to embed
413
 
414
  Returns:
415
+ List of floats representing the embedding (2048-dim for 2B model)
416
  """
417
  try:
418
  # Use official process() API - expects list of dicts
 
424
 
425
  except Exception as e:
426
  logger.error(f"Embedding generation failed: {e}")
427
+ # Return zero vector as fallback (2048-dim per Qwen3-VL-Embedding-2B)
428
+ hidden_size = getattr(self.model.model.config, "hidden_size", 2048)
429
  return [0.0] * hidden_size
430
 
431
  def embed_batch(self, texts: list[str]) -> list[list[float]]:
 
439
  return [emb.cpu().tolist() for emb in embeddings]
440
  except Exception as e:
441
  logger.error(f"Batch embedding generation failed: {e}")
442
+ hidden_size = getattr(self.model.model.config, "hidden_size", 2048)
443
  return [[0.0] * hidden_size for _ in texts]
444
 
445
 
 
452
  - Creates a binary linear layer: weight = yes_weight - no_weight
453
  - Scores = sigmoid(linear(last_token_hidden_state))
454
 
455
+ Model: Qwen/Qwen3-VL-Reranker-2B
456
  """
457
 
458
  def __init__(self, model, processor):
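
The yes/no scoring described in the docstring can be sketched as follows (hedged illustration of the idea only; the official `Qwen3VLReranker` handles tokenization, batching, and device placement itself):

```python
import torch

def relevance_score(last_hidden: torch.Tensor,     # (hidden_size,) last-token hidden state
                    lm_head_weight: torch.Tensor,  # (vocab_size, hidden_size)
                    yes_id: int, no_id: int) -> float:
    # Binary linear layer: weight = yes_weight - no_weight, score = sigmoid(w . h)
    weight = lm_head_weight[yes_id] - lm_head_weight[no_id]
    return torch.sigmoid(weight @ last_hidden).item()
```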
rag/vectorstore.py CHANGED
@@ -22,10 +22,10 @@ class MockEmbeddingFunction:
22
  """Mock embedding function for local development.
23
 
24
  Generates deterministic pseudo-embeddings based on text hash.
25
- Produces 4096-dimensional vectors (matches Qwen3-VL-Embedding-8B).
26
  """
27
 
28
- EMBEDDING_DIM = 4096 # Per Qwen3-VL-Embedding-8B hidden_size
29
 
30
  def __call__(self, input: list[str]) -> list[list[float]]:
31
  """Generate mock embeddings for a list of texts."""
@@ -67,7 +67,7 @@ class SharedEmbeddingFunction:
67
  For ChromaDB compatibility, this wraps the model stack's embedding model.
68
  """
69
 
70
- EMBEDDING_DIM = 4096 # Per Qwen3-VL-Embedding-8B hidden_size
71
 
72
  def __call__(self, input: list[str]) -> list[list[float]]:
73
  """Generate embeddings using the shared model from model stack."""
 
22
  """Mock embedding function for local development.
23
 
24
  Generates deterministic pseudo-embeddings based on text hash.
25
+ Produces 2048-dimensional vectors (matches Qwen3-VL-Embedding-2B).
26
  """
27
 
28
+ EMBEDDING_DIM = 2048 # Per Qwen3-VL-Embedding-2B hidden_size
29
 
30
  def __call__(self, input: list[str]) -> list[list[float]]:
31
  """Generate mock embeddings for a list of texts."""
 
67
  For ChromaDB compatibility, this wraps the model stack's embedding model.
68
  """
69
 
70
+ EMBEDDING_DIM = 2048 # Per Qwen3-VL-Embedding-2B hidden_size
71
 
72
  def __call__(self, input: list[str]) -> list[list[float]]:
73
  """Generate embeddings using the shared model from model stack."""
requirements.txt CHANGED
@@ -5,6 +5,9 @@ accelerate
5
  qwen-vl-utils>=0.0.14
6
  torchvision
7
 
 
 
 
8
  # UI
9
  gradio>=6.0.0,<7.0.0
10
 
 
5
  qwen-vl-utils>=0.0.14
6
  torchvision
7
 
8
+ # vLLM for FP8 quantized model inference (>=0.11.0 required for Qwen3-VL support)
9
+ vllm>=0.11.0
10
+
11
  # UI
12
  gradio>=6.0.0,<7.0.0
13
 
scripts/qwen3_vl/__init__.py CHANGED
@@ -4,8 +4,8 @@ Source: https://github.com/QwenLM/Qwen3-VL-Embedding
4
  License: Apache 2.0
5
 
6
  These are the official loading classes for:
7
- - Qwen/Qwen3-VL-Embedding-8B
8
- - Qwen/Qwen3-VL-Reranker-8B
9
  """
10
 
11
  from scripts.qwen3_vl.qwen3_vl_embedding import Qwen3VLEmbedder, Qwen3VLForEmbedding
 
4
  License: Apache 2.0
5
 
6
  These are the official loading classes for:
7
+ - Qwen/Qwen3-VL-Embedding-2B (or 8B)
8
+ - Qwen/Qwen3-VL-Reranker-2B (or 8B)
9
  """
10
 
11
  from scripts.qwen3_vl.qwen3_vl_embedding import Qwen3VLEmbedder, Qwen3VLForEmbedding