Commit 14c59e5 · Parent: 7d5c713
Switch to Qwen3-VL-4B-Thinking for single-GPU simplicity
Architecture change:
- Vision: 30B MoE (4 GPU TP) → 4B Dense (single GPU)
- tensor_parallel_size: 4 → 1
- gpu_memory_utilization: 0.50 → 0.80
- Removed NCCL workarounds (not needed for single GPU)
Why: the 30B MoE model failed on 4x L4 due to vLLM V1 + NCCL issues
(L4s lack NVLink). The 4B dense model fits on a single L4 (22GB),
eliminating all multi-GPU coordination problems.
Memory: ~18GB total (10GB vision + 4GB embed + 4GB rerank)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- CLAUDE.md +27 -19
- config/settings.py +4 -4
- models/real.py +11 -22
CLAUDE.md (CHANGED)

````diff
@@ -6,14 +6,14 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co…
 
 **FDAM AI Pipeline** - Fire Damage Assessment Methodology v4.0.1 implementation. An AI-powered system that generates professional Cleaning Specifications / Scope of Work documents for fire damage restoration.
 
-- **Deployment**: HuggingFace Spaces with Nvidia …
-- **Local Dev**: RTX 4090 (24GB) - …
+- **Deployment**: HuggingFace Spaces with Nvidia L4 (22GB VRAM per GPU, single GPU used)
+- **Local Dev**: RTX 4090 (24GB) - can run 4B model; use mock models for faster iteration
 - **Spec Document**: `FDAM_AI_Pipeline_Technical_Spec.md` is the authoritative technical reference
 
 ## Critical Constraints
 
 1. **No External API Calls** - 100% locally-owned models only (no Claude/OpenAI APIs)
-2. **Memory Budget** - …
+2. **Memory Budget** - Single L4 (22GB): ~10GB vision (4B) + ~4GB embedding + ~4GB reranker (~18GB used, ~4GB headroom)
 3. **Processing Time** - 60-90 seconds per assessment is acceptable
 4. **MVP Scope** - Phase 1 (PRE) and Phase 2 (PRA) only; no lab results processing yet
 5. **Static RAG** - Knowledge base is pre-indexed; no user document uploads
@@ -23,10 +23,10 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co…
 | Component | Technology |
 |-----------|------------|
 | UI Framework | Gradio 6.x |
-| Vision | Qwen/Qwen3-VL-… |
+| Vision | Qwen/Qwen3-VL-4B-Thinking (via vLLM, single GPU) |
 | Embeddings | Qwen/Qwen3-VL-Embedding-2B (2048-dim) |
 | Reranker | Qwen/Qwen3-VL-Reranker-2B |
-| Inference | vLLM … |
+| Inference | vLLM (single GPU, no tensor parallelism) |
 | Vector Store | ChromaDB 0.4.x |
 | Validation | Pydantic 2.x |
 | PDF Generation | Pandoc 3.x |
@@ -165,20 +165,20 @@ Source documents in `/RAG-KB/`:
 | 50-69% | Moderate | Flag for human review |
 | <50% | Low | Require human verification |
 
-## …
+## Model Loading
 
-All 3 models are loaded at startup (~…
+All 3 models are loaded at startup (~18GB total on single L4 GPU):
 
 ```python
 from vllm import LLM, SamplingParams
 
-# Vision model via vLLM …
+# Vision model via vLLM (single GPU, no tensor parallelism)
 vision_model = LLM(
-    model="Qwen/Qwen3-VL-…
-    tensor_parallel_size=…
+    model="Qwen/Qwen3-VL-4B-Thinking",
+    tensor_parallel_size=1,  # Single GPU
     trust_remote_code=True,
-    gpu_memory_utilization=0.…
-    max_model_len=…
+    gpu_memory_utilization=0.80,
+    max_model_len=16384,
 )
 
 # Embedding and Reranker use official Qwen3VL loaders
@@ -187,22 +187,30 @@ embedding_model = Qwen3VLEmbedder("Qwen/Qwen3-VL-Embedding-2B", torch_dtype=torch.bfloat16)
 reranker_model = Qwen3VLReranker("Qwen/Qwen3-VL-Reranker-2B", torch_dtype=torch.bfloat16)
 ```
 
-Expected …
-- Vision model (…
+Expected memory usage (~18GB total on single L4):
+- Vision model (4B BF16): ~10GB
 - Embedding model (2B): ~4GB
 - Reranker model (2B): ~4GB
-- Headroom: ~…
+- Headroom: ~4GB for KV cache and overhead
 
 ## Local Development Strategy
 
-The RTX 4090 (24GB VRAM) …
+The RTX 4090 (24GB VRAM) can run the 4B model stack (~18GB). Two options:
 
+**Option A: Real Models Locally**
+1. Set `MOCK_MODELS=false` (or omit - defaults to false)
+2. Models will download and load (~18GB VRAM)
+3. Full inference testing locally
+
+**Option B: Mock Models (faster iteration)**
 1. Set `MOCK_MODELS=true` environment variable
 2. Mock responses return realistic JSON matching vision output schema (2048-dim embeddings)
 3. Test pipeline logic, UI, calculations without real inference
-…
-…
-…
+
+**Deployment:**
+1. Deploy to HuggingFace Spaces for production testing
+2. Request build logs after deployment to confirm success
+3. After changing embedding dimensions, rebuild ChromaDB: `python -m rag.index_builder --rebuild`
 
 ## Code Style
 
````
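The snippet in the diff above stops at model construction; below is a minimal inference sketch against that single-GPU `vision_model`. The chat-template call follows Qwen's processor API and vLLM's multimodal `generate` interface, but the image path, question text, and sampling values are illustrative assumptions, not the repo's actual prompt pipeline.

```python
# Hedged sketch: single-image inference against the vision_model loaded above.
# Assumes the Qwen processor's chat template accepts an image placeholder;
# "photo.jpg", the question, and the sampling values are illustrative only.
from PIL import Image
from transformers import AutoProcessor
from vllm import SamplingParams

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-4B-Thinking", trust_remote_code=True
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Assess the fire damage in this photo; reply as JSON."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = vision_model.generate(
    {"prompt": prompt, "multi_modal_data": {"image": Image.open("photo.jpg")}},
    SamplingParams(temperature=0.2, max_tokens=1024),
)
print(outputs[0].outputs[0].text)
```

Since this is a Thinking variant, the raw output may carry a reasoning preamble before the JSON payload, so downstream parsing should tolerate leading text.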
config/settings.py (CHANGED)

```diff
@@ -18,14 +18,14 @@ class Settings(BaseSettings):
     mock_models: bool = False
 
     # Model paths (for production on HuggingFace Spaces)
-    # …
-    vision_model: str = "Qwen/Qwen3-VL-…
+    # 4B dense model - fits single GPU, no tensor parallelism needed
+    vision_model: str = "Qwen/Qwen3-VL-4B-Thinking"
     embedding_model: str = "Qwen/Qwen3-VL-Embedding-2B"
     reranker_model: str = "Qwen/Qwen3-VL-Reranker-2B"
 
     # vLLM configuration
-    vllm_tensor_parallel_size: int = …
-    vllm_max_model_len: int = …
+    vllm_tensor_parallel_size: int = 1  # Single GPU - 4B model fits on one L4
+    vllm_max_model_len: int = 16384  # 4B supports up to 256K, 16K is sufficient
 
     # ChromaDB
     chroma_persist_dir: str = "./chroma_db"
```
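As a quick check of the settings above, here is a minimal sketch of how the `MOCK_MODELS` environment variable flips `mock_models` under Pydantic 2.x; the standalone class is a trimmed stand-in for the repo's `Settings`, not its full definition.

```python
# Hedged sketch: trimmed stand-in for config/settings.py's Settings class.
import os

from pydantic_settings import BaseSettings  # Pydantic 2.x moved BaseSettings here

class Settings(BaseSettings):
    mock_models: bool = False
    vision_model: str = "Qwen/Qwen3-VL-4B-Thinking"
    vllm_tensor_parallel_size: int = 1
    vllm_max_model_len: int = 16384

# pydantic-settings matches env vars to field names case-insensitively,
# so MOCK_MODELS=true enables the mock stack without any code changes.
os.environ["MOCK_MODELS"] = "true"
print(Settings().mock_models)  # True
```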
models/real.py (CHANGED)

```diff
@@ -1,13 +1,13 @@
-"""Real model loading for production (HuggingFace Spaces…
+"""Real model loading for production (HuggingFace Spaces).
 
 This module loads the production models:
-- Vision: Qwen/Qwen3-VL-…
+- Vision: Qwen/Qwen3-VL-4B-Thinking (~10GB via vLLM, single GPU)
 - Embedding: Qwen/Qwen3-VL-Embedding-2B (~4GB)
 - Reranker: Qwen/Qwen3-VL-Reranker-2B (~4GB)
-- Total: ~…
+- Total: ~18GB on single L4 GPU (22GB)
 
 Model Loading:
-- Vision: vLLM with …
+- Vision: vLLM with single GPU (no tensor parallelism needed)
 - Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
 - Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
 """
@@ -15,16 +15,7 @@ Model Loading:
 import os
 
 # vLLM environment variables - MUST be set before importing vLLM
-# Note: …
-
-# Force spawn method for tensor parallelism workers
-# See: https://github.com/vllm-project/vllm/issues/17618
-os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
-
-# NCCL settings for L4 GPU communication
-# See: https://github.com/vllm-project/vllm/issues/19002
-os.environ["NCCL_P2P_DISABLE"] = "1"
-os.environ["NCCL_IB_DISABLE"] = "1"
+# Note: Using single GPU (TP=1) so NCCL workarounds are not needed
 
 import json
 import logging
@@ -43,8 +34,8 @@ logger = logging.getLogger(__name__)
 class RealModelStack:
     """Real model stack for production on HuggingFace Spaces.
 
-    Loads all 3 models at initialization (~…
-    - …
+    Loads all 3 models at initialization (~18GB total on single GPU):
+    - Vision 4B via vLLM: ~10GB
     - Embedding 2B: ~4GB
     - Reranker 2B: ~4GB
     """
@@ -80,7 +71,7 @@
 
         total_start = time.time()
 
-        # Vision model via vLLM (~…
+        # Vision model via vLLM (~10GB for 4B model)
         logger.info(f"Loading vision model: {settings.vision_model}")
         vision_start = time.time()
 
@@ -89,12 +80,10 @@
 
         self.models["vision"] = LLM(
             model=settings.vision_model,
-            tensor_parallel_size=settings.vllm_tensor_parallel_size,
+            tensor_parallel_size=settings.vllm_tensor_parallel_size,  # 1 for single GPU
             trust_remote_code=True,
-            # …
-            gpu_memory_utilization=0.50,  # Reduced to minimize NCCL overhead on L4s
+            gpu_memory_utilization=0.80,  # Can use more on single GPU
             max_model_len=settings.vllm_max_model_len,
-            # enforce_eager removed - let vLLM default (False) per official …
         )
 
         # Load processor for chat template formatting
@@ -177,7 +166,7 @@
 class VisionModel:
     """Vision model for fire damage analysis.
 
-    Uses Qwen/Qwen3-VL-…
+    Uses Qwen/Qwen3-VL-4B-Thinking via vLLM for inference.
     Reasoning-enhanced model handles analysis with extended thinking
     and outputs structured JSON.
 
```
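Finally, since the module docstring now pins the budget at ~18GB on a 22GB L4, a pre-flight VRAM check is one way to fail fast before the three loads begin; this sketch and its threshold are illustrative, not code from the repo.

```python
# Hedged sketch: fail fast if free VRAM cannot hold the ~18GB model stack.
# The 18GB threshold comes from the docstring budget above; the check itself
# is illustrative and not part of models/real.py.
import torch

REQUIRED_GB = 18  # ~10GB vision + ~4GB embedding + ~4GB reranker

free_bytes, total_bytes = torch.cuda.mem_get_info()
free_gb = free_bytes / 1024**3
if free_gb < REQUIRED_GB:
    raise RuntimeError(
        f"Only {free_gb:.1f}GB free of {total_bytes / 1024**3:.1f}GB total; "
        f"~{REQUIRED_GB}GB needed. Set MOCK_MODELS=true or free GPU memory."
    )
```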