KinetoLabs Claude Opus 4.5 committed on
Commit 14c59e5 · 1 Parent(s): 7d5c713

Switch to Qwen3-VL-4B-Thinking for single-GPU simplicity


Architecture change:
- Vision: 30B MoE (4 GPU TP) → 4B Dense (single GPU)
- tensor_parallel_size: 4 → 1
- gpu_memory_utilization: 0.50 → 0.80
- Removed NCCL workarounds (not needed for single GPU)

Why: the 30B MoE model failed on 4xL4 due to vLLM V1 + NCCL issues
(L4s lack NVLink). The 4B dense model fits on a single L4 (22GB),
eliminating all multi-GPU coordination problems.

Memory: ~18GB total (~10GB vision + ~4GB embedding + ~4GB reranker)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

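The memory arithmetic above (10 + 4 + 4 ≈ 18GB against a 22GB L4) leaves only ~4GB of headroom for KV cache and CUDA overhead, so a startup sanity check is cheap insurance. A hypothetical sketch, not part of this commit, using `torch.cuda.mem_get_info`:

```python
import torch

# Hypothetical startup check (not in this commit): verify the ~18GB
# three-model stack fits the single 22GB L4 before loading anything.
EXPECTED_GB = {"vision (4B)": 10.0, "embedding (2B)": 4.0, "reranker (2B)": 4.0}
needed_gb = sum(EXPECTED_GB.values())  # ~18GB

free_b, total_b = torch.cuda.mem_get_info(0)  # (free, total) in bytes
total_gb = total_b / 1024**3                  # ~22GB on an L4
headroom_gb = total_gb - needed_gb            # ~4GB left for KV cache

if headroom_gb < 2.0:
    raise RuntimeError(
        f"Need ~{needed_gb:.0f}GB of {total_gb:.1f}GB VRAM; "
        f"only {headroom_gb:.1f}GB headroom for KV cache/overhead"
    )
```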
Files changed (3)
  1. CLAUDE.md +27 -19
  2. config/settings.py +4 -4
  3. models/real.py +11 -22
CLAUDE.md CHANGED
@@ -6,14 +6,14 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 **FDAM AI Pipeline** - Fire Damage Assessment Methodology v4.0.1 implementation. An AI-powered system that generates professional Cleaning Specifications / Scope of Work documents for fire damage restoration.
 
-- **Deployment**: HuggingFace Spaces with Nvidia 4xL4 (96GB VRAM total, 24GB per GPU)
-- **Local Dev**: RTX 4090 (24GB) - insufficient for full model stack; use mock models locally
+- **Deployment**: HuggingFace Spaces with Nvidia L4 (22GB VRAM per GPU, single GPU used)
+- **Local Dev**: RTX 4090 (24GB) - can run the 4B model; use mock models for faster iteration
 - **Spec Document**: `FDAM_AI_Pipeline_Technical_Spec.md` is the authoritative technical reference
 
 ## Critical Constraints
 
 1. **No External API Calls** - 100% locally-owned models only (no Claude/OpenAI APIs)
-2. **Memory Budget** - 4xL4 88GB usable: ~30-35GB vision (30B FP8) + ~4GB embedding + ~4GB reranker (~38-43GB used, ~45GB+ headroom)
+2. **Memory Budget** - Single L4 (22GB): ~10GB vision (4B) + ~4GB embedding + ~4GB reranker (~18GB used, ~4GB headroom)
 3. **Processing Time** - 60-90 seconds per assessment is acceptable
 4. **MVP Scope** - Phase 1 (PRE) and Phase 2 (PRA) only; no lab results processing yet
 5. **Static RAG** - Knowledge base is pre-indexed; no user document uploads
@@ -23,10 +23,10 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 | Component | Technology |
 |-----------|------------|
 | UI Framework | Gradio 6.x |
-| Vision | Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (via vLLM) |
+| Vision | Qwen/Qwen3-VL-4B-Thinking (via vLLM, single GPU) |
 | Embeddings | Qwen/Qwen3-VL-Embedding-2B (2048-dim) |
 | Reranker | Qwen/Qwen3-VL-Reranker-2B |
-| Inference | vLLM with FP8 quantization |
+| Inference | vLLM (single GPU, no tensor parallelism) |
 | Vector Store | ChromaDB 0.4.x |
 | Validation | Pydantic 2.x |
 | PDF Generation | Pandoc 3.x |
@@ -165,20 +165,20 @@ Source documents in `/RAG-KB/`:
 | 50-69% | Moderate | Flag for human review |
 | <50% | Low | Require human verification |
 
-## Multi-GPU Model Loading
+## Model Loading
 
-All 3 models are loaded at startup (~38-43GB total on 4xL4 GPUs):
+All 3 models are loaded at startup (~18GB total on a single L4 GPU):
 
 ```python
 from vllm import LLM, SamplingParams
 
-# Vision model via vLLM with FP8 quantization (built-in)
+# Vision model via vLLM (single GPU, no tensor parallelism)
 vision_model = LLM(
-    model="Qwen/Qwen3-VL-30B-A3B-Thinking-FP8",
-    tensor_parallel_size=4,  # Distribute across all 4 GPUs
+    model="Qwen/Qwen3-VL-4B-Thinking",
+    tensor_parallel_size=1,  # Single GPU
     trust_remote_code=True,
-    gpu_memory_utilization=0.70,
-    max_model_len=32768,
+    gpu_memory_utilization=0.80,
+    max_model_len=16384,
 )
 
 # Embedding and Reranker use official Qwen3VL loaders
@@ -187,22 +187,30 @@ embedding_model = Qwen3VLEmbedder("Qwen/Qwen3-VL-Embedding-2B", torch_dtype=torc
 reranker_model = Qwen3VLReranker("Qwen/Qwen3-VL-Reranker-2B", torch_dtype=torch.bfloat16)
 ```
 
-Expected distribution (FP8 + BF16, ~38-43GB total):
-- Vision model (30B FP8): ~30-35GB
+Expected memory usage (~18GB total on a single L4):
+- Vision model (4B BF16): ~10GB
 - Embedding model (2B): ~4GB
 - Reranker model (2B): ~4GB
-- Headroom: ~45GB+ for KV cache and overhead
+- Headroom: ~4GB for KV cache and overhead
 
 ## Local Development Strategy
 
-The RTX 4090 (24GB VRAM) cannot run the production model stack. Use this workflow:
+The RTX 4090 (24GB VRAM) can run the 4B model stack (~18GB). Two options:
 
+**Option A: Real Models Locally**
+1. Set `MOCK_MODELS=false` (or omit - it defaults to false)
+2. Models will download and load (~18GB VRAM)
+3. Full inference testing locally
+
+**Option B: Mock Models (faster iteration)**
 1. Set `MOCK_MODELS=true` environment variable
 2. Mock responses return realistic JSON matching vision output schema (2048-dim embeddings)
 3. Test pipeline logic, UI, calculations without real inference
-4. Deploy to HuggingFace Spaces for real model testing
-5. Request build logs after deployment to confirm success
-6. After changing embedding dimensions, rebuild ChromaDB: `python -m rag.index_builder --rebuild`
+
+**Deployment:**
+1. Deploy to HuggingFace Spaces for production testing
+2. Request build logs after deployment to confirm success
+3. After changing embedding dimensions, rebuild ChromaDB: `python -m rag.index_builder --rebuild`
 
 ## Code Style
 
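The `MOCK_MODELS` toggle described in Options A/B above implies a small factory at startup. A hedged sketch, assuming a `MockModelStack` counterpart to `RealModelStack` (the mock class and its module path are hypothetical; only `mock_models` and `RealModelStack` appear in this commit):

```python
from config.settings import Settings

def build_model_stack(settings: Settings):
    """Return the mock or real stack based on settings.mock_models.

    Pydantic reads MOCK_MODELS from the environment, so the toggle
    needs no code change between local dev and Spaces.
    """
    if settings.mock_models:
        from models.mock import MockModelStack  # hypothetical module path
        return MockModelStack()
    # Deferred import: models.real sets vLLM env vars before importing vLLM,
    # so it should only be pulled in when real models are actually wanted.
    from models.real import RealModelStack
    return RealModelStack()
```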
config/settings.py CHANGED
@@ -18,14 +18,14 @@ class Settings(BaseSettings):
     mock_models: bool = False
 
     # Model paths (for production on HuggingFace Spaces)
-    # Single 30B-A3B MoE model with FP8 quantization via vLLM (official, reasoning-enhanced)
-    vision_model: str = "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"
+    # 4B dense model - fits single GPU, no tensor parallelism needed
+    vision_model: str = "Qwen/Qwen3-VL-4B-Thinking"
     embedding_model: str = "Qwen/Qwen3-VL-Embedding-2B"
     reranker_model: str = "Qwen/Qwen3-VL-Reranker-2B"
 
     # vLLM configuration
-    vllm_tensor_parallel_size: int = 4  # Use all 4 L4 GPUs
-    vllm_max_model_len: int = 8192  # Reduced to minimize NCCL overhead on L4s
+    vllm_tensor_parallel_size: int = 1  # Single GPU - 4B model fits on one L4
+    vllm_max_model_len: int = 16384  # 4B supports up to 256K, 16K is sufficient
 
     # ChromaDB
     chroma_persist_dir: str = "./chroma_db"
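Since `Settings` extends Pydantic's `BaseSettings`, every field above can be overridden by an environment variable of the same name (case-insensitive by default in pydantic-settings). A minimal usage sketch of that behavior, with illustrative override values:

```python
import os

from config.settings import Settings

# Env vars override the in-file defaults, so a Space with different
# hardware can retune vLLM without editing settings.py.
os.environ["VLLM_MAX_MODEL_LEN"] = "8192"
os.environ["MOCK_MODELS"] = "true"

settings = Settings()
assert settings.vllm_max_model_len == 8192      # int coerced from the env string
assert settings.mock_models is True
assert settings.vllm_tensor_parallel_size == 1  # untouched default
```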
models/real.py CHANGED
@@ -1,13 +1,13 @@
-"""Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).
+"""Real model loading for production (HuggingFace Spaces).
 
 This module loads the production models:
-- Vision: Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (~30-35GB via vLLM)
+- Vision: Qwen/Qwen3-VL-4B-Thinking (~10GB via vLLM, single GPU)
 - Embedding: Qwen/Qwen3-VL-Embedding-2B (~4GB)
 - Reranker: Qwen/Qwen3-VL-Reranker-2B (~4GB)
-- Total: ~38-43GB on 88GB available (45GB+ headroom)
+- Total: ~18GB on single L4 GPU (22GB)
 
 Model Loading:
-- Vision: vLLM with FP8 quantization (built-in) and tensor parallelism
+- Vision: vLLM with single GPU (no tensor parallelism needed)
 - Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
 - Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
 """
@@ -15,16 +15,7 @@ Model Loading:
 import os
 
 # vLLM environment variables - MUST be set before importing vLLM
-# Note: V0 engine is removed in vLLM 0.11+, so we must use V1
-
-# Force spawn method for tensor parallelism workers
-# See: https://github.com/vllm-project/vllm/issues/17618
-os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
-
-# NCCL settings for L4 GPU communication
-# See: https://github.com/vllm-project/vllm/issues/19002
-os.environ["NCCL_P2P_DISABLE"] = "1"
-os.environ["NCCL_IB_DISABLE"] = "1"
+# Note: Using single GPU (TP=1) so NCCL workarounds are not needed
 
 import json
 import logging
@@ -43,8 +34,8 @@ logger = logging.getLogger(__name__)
 class RealModelStack:
     """Real model stack for production on HuggingFace Spaces.
 
-    Loads all 3 models at initialization (~38-43GB total):
-    - FP8 Vision via vLLM: ~30-35GB
+    Loads all 3 models at initialization (~18GB total on single GPU):
+    - Vision 4B via vLLM: ~10GB
     - Embedding 2B: ~4GB
     - Reranker 2B: ~4GB
     """
@@ -80,7 +71,7 @@ class RealModelStack:
 
         total_start = time.time()
 
-        # Vision model via vLLM (~30-35GB in FP8)
+        # Vision model via vLLM (~10GB for 4B model)
         logger.info(f"Loading vision model: {settings.vision_model}")
         vision_start = time.time()
 
@@ -89,12 +80,10 @@
 
         self.models["vision"] = LLM(
             model=settings.vision_model,
-            tensor_parallel_size=settings.vllm_tensor_parallel_size,
+            tensor_parallel_size=settings.vllm_tensor_parallel_size,  # 1 for single GPU
             trust_remote_code=True,
-            # dtype removed - FP8 model auto-detects native quantization
-            gpu_memory_utilization=0.50,  # Reduced to minimize NCCL overhead on L4s
+            gpu_memory_utilization=0.80,  # Can use more on single GPU
             max_model_len=settings.vllm_max_model_len,
-            # enforce_eager removed - let vLLM default (False) per official
         )
 
         # Load processor for chat template formatting
@@ -177,7 +166,7 @@
 class VisionModel:
     """Vision model for fire damage analysis.
 
-    Uses Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 via vLLM for inference.
+    Uses Qwen/Qwen3-VL-4B-Thinking via vLLM for inference.
     Reasoning-enhanced model handles analysis with extended thinking
     and outputs structured JSON.
 
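For context, here is roughly what a single call through the loaded stack might look like. The `multi_modal_data` form of `LLM.generate` and the processor's `apply_chat_template` are standard vLLM/Transformers APIs, but the `stack.processor` attribute name, prompt text, and helper function are illustrative assumptions, not code from this commit:

```python
from PIL import Image
from vllm import SamplingParams

def analyze_image(stack, image_path: str) -> str:
    """Illustrative single-image call against the loaded vision model."""
    image = Image.open(image_path)

    # Format the request with the model's chat template (real.py loads a
    # processor for exactly this; the attribute name here is assumed).
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Assess the fire damage; respond as JSON."},
        ],
    }]
    prompt = stack.processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    outputs = stack.models["vision"].generate(
        {"prompt": prompt, "multi_modal_data": {"image": image}},
        SamplingParams(temperature=0.2, max_tokens=2048),
    )
    return outputs[0].outputs[0].text  # structured JSON (after thinking tokens)
```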