KinetoLabs Claude Opus 4.5 committed on
Commit 14c59e5 · 1 Parent(s): 7d5c713

Switch to Qwen3-VL-4B-Thinking for single-GPU simplicity


Architecture change:
- Vision: 30B MoE (4 GPU TP) → 4B Dense (single GPU)
- tensor_parallel_size: 4 → 1
- gpu_memory_utilization: 0.50 → 0.80
- Removed NCCL workarounds (not needed for single GPU)

Why: the 30B MoE model failed on 4xL4 due to vLLM V1 + NCCL issues
(L4s lack NVLink). The 4B dense model fits on a single L4 (22GB),
eliminating all multi-GPU coordination problems.

Memory: ~18GB total (~10GB vision + ~4GB embedding + ~4GB reranker)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

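The memory arithmetic above (10 + 4 + 4 ≈ 18GB against a 22GB L4) leaves only ~4GB of headroom for KV cache and CUDA overhead, so a startup sanity check is cheap insurance. A hypothetical sketch, not part of this commit, using `torch.cuda.mem_get_info`:

```python
import torch

# Hypothetical startup check (not in this commit): verify the ~18GB
# three-model stack fits the single 22GB L4 before loading anything.
EXPECTED_GB = {"vision (4B)": 10.0, "embedding (2B)": 4.0, "reranker (2B)": 4.0}
needed_gb = sum(EXPECTED_GB.values())  # ~18GB

free_b, total_b = torch.cuda.mem_get_info(0)  # (free, total) in bytes
total_gb = total_b / 1024**3                  # ~22GB on an L4
headroom_gb = total_gb - needed_gb            # ~4GB left for KV cache

if headroom_gb < 2.0:
    raise RuntimeError(
        f"Need ~{needed_gb:.0f}GB of {total_gb:.1f}GB VRAM; "
        f"only {headroom_gb:.1f}GB headroom for KV cache/overhead"
    )
```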
Files changed (3)
  1. CLAUDE.md +27 -19
  2. config/settings.py +4 -4
  3. models/real.py +11 -22
CLAUDE.md CHANGED
@@ -6,14 +6,14 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 **FDAM AI Pipeline** - Fire Damage Assessment Methodology v4.0.1 implementation. An AI-powered system that generates professional Cleaning Specifications / Scope of Work documents for fire damage restoration.
 
-- **Deployment**: HuggingFace Spaces with Nvidia 4xL4 (96GB VRAM total, 24GB per GPU)
-- **Local Dev**: RTX 4090 (24GB) - insufficient for full model stack; use mock models locally
+- **Deployment**: HuggingFace Spaces with Nvidia L4 (22GB VRAM per GPU, single GPU used)
+- **Local Dev**: RTX 4090 (24GB) - can run the 4B model; use mock models for faster iteration
 - **Spec Document**: `FDAM_AI_Pipeline_Technical_Spec.md` is the authoritative technical reference
 
 ## Critical Constraints
 
 1. **No External API Calls** - 100% locally-owned models only (no Claude/OpenAI APIs)
-2. **Memory Budget** - 4xL4 88GB usable: ~30-35GB vision (30B FP8) + ~4GB embedding + ~4GB reranker (~38-43GB used, ~45GB+ headroom)
+2. **Memory Budget** - Single L4 (22GB): ~10GB vision (4B) + ~4GB embedding + ~4GB reranker (~18GB used, ~4GB headroom)
 3. **Processing Time** - 60-90 seconds per assessment is acceptable
 4. **MVP Scope** - Phase 1 (PRE) and Phase 2 (PRA) only; no lab results processing yet
 5. **Static RAG** - Knowledge base is pre-indexed; no user document uploads
@@ -23,10 +23,10 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 | Component | Technology |
 |-----------|------------|
 | UI Framework | Gradio 6.x |
-| Vision | Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (via vLLM) |
+| Vision | Qwen/Qwen3-VL-4B-Thinking (via vLLM, single GPU) |
 | Embeddings | Qwen/Qwen3-VL-Embedding-2B (2048-dim) |
 | Reranker | Qwen/Qwen3-VL-Reranker-2B |
-| Inference | vLLM with FP8 quantization |
+| Inference | vLLM (single GPU, no tensor parallelism) |
 | Vector Store | ChromaDB 0.4.x |
 | Validation | Pydantic 2.x |
 | PDF Generation | Pandoc 3.x |
@@ -165,20 +165,20 @@ Source documents in `/RAG-KB/`:
 | 50-69% | Moderate | Flag for human review |
 | <50% | Low | Require human verification |
 
-## Multi-GPU Model Loading
+## Model Loading
 
-All 3 models are loaded at startup (~38-43GB total on 4xL4 GPUs):
+All 3 models are loaded at startup (~18GB total on a single L4 GPU):
 
 ```python
 from vllm import LLM, SamplingParams
 
-# Vision model via vLLM with FP8 quantization (built-in)
+# Vision model via vLLM (single GPU, no tensor parallelism)
 vision_model = LLM(
-    model="Qwen/Qwen3-VL-30B-A3B-Thinking-FP8",
-    tensor_parallel_size=4,  # Distribute across all 4 GPUs
+    model="Qwen/Qwen3-VL-4B-Thinking",
+    tensor_parallel_size=1,  # Single GPU
     trust_remote_code=True,
-    gpu_memory_utilization=0.70,
-    max_model_len=32768,
+    gpu_memory_utilization=0.80,
+    max_model_len=16384,
 )
 
 # Embedding and Reranker use official Qwen3VL loaders
@@ -187,22 +187,30 @@ embedding_model = Qwen3VLEmbedder("Qwen/Qwen3-VL-Embedding-2B", torch_dtype=torc
 reranker_model = Qwen3VLReranker("Qwen/Qwen3-VL-Reranker-2B", torch_dtype=torch.bfloat16)
 ```
 
-Expected distribution (FP8 + BF16, ~38-43GB total):
-- Vision model (30B FP8): ~30-35GB
+Expected memory usage (~18GB total on a single L4):
+- Vision model (4B BF16): ~10GB
 - Embedding model (2B): ~4GB
 - Reranker model (2B): ~4GB
-- Headroom: ~45GB+ for KV cache and overhead
+- Headroom: ~4GB for KV cache and overhead
 
 ## Local Development Strategy
 
-The RTX 4090 (24GB VRAM) cannot run the production model stack. Use this workflow:
+The RTX 4090 (24GB VRAM) can run the 4B model stack (~18GB). Two options:
 
+**Option A: Real Models Locally**
+1. Set `MOCK_MODELS=false` (or omit - it defaults to false)
+2. Models will download and load (~18GB VRAM)
+3. Full inference testing locally
+
+**Option B: Mock Models (faster iteration)**
 1. Set `MOCK_MODELS=true` environment variable
 2. Mock responses return realistic JSON matching vision output schema (2048-dim embeddings)
 3. Test pipeline logic, UI, calculations without real inference
-4. Deploy to HuggingFace Spaces for real model testing
-5. Request build logs after deployment to confirm success
-6. After changing embedding dimensions, rebuild ChromaDB: `python -m rag.index_builder --rebuild`
+
+**Deployment:**
+1. Deploy to HuggingFace Spaces for production testing
+2. Request build logs after deployment to confirm success
+3. After changing embedding dimensions, rebuild ChromaDB: `python -m rag.index_builder --rebuild`
 
 ## Code Style
 
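The `MOCK_MODELS` toggle described in Options A/B above implies a small factory at startup. A hedged sketch, assuming a `MockModelStack` counterpart to `RealModelStack` (the mock class and its module path are hypothetical; only `mock_models` and `RealModelStack` appear in this commit):

```python
from config.settings import Settings

def build_model_stack(settings: Settings):
    """Return the mock or real stack based on settings.mock_models.

    Pydantic reads MOCK_MODELS from the environment, so the toggle
    needs no code change between local dev and Spaces.
    """
    if settings.mock_models:
        from models.mock import MockModelStack  # hypothetical module path
        return MockModelStack()
    # Deferred import: models.real sets vLLM env vars before importing vLLM,
    # so it should only be pulled in when real models are actually wanted.
    from models.real import RealModelStack
    return RealModelStack()
```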
config/settings.py CHANGED
@@ -18,14 +18,14 @@ class Settings(BaseSettings):
     mock_models: bool = False
 
     # Model paths (for production on HuggingFace Spaces)
-    # Single 30B-A3B MoE model with FP8 quantization via vLLM (official, reasoning-enhanced)
-    vision_model: str = "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"
+    # 4B dense model - fits single GPU, no tensor parallelism needed
+    vision_model: str = "Qwen/Qwen3-VL-4B-Thinking"
     embedding_model: str = "Qwen/Qwen3-VL-Embedding-2B"
     reranker_model: str = "Qwen/Qwen3-VL-Reranker-2B"
 
     # vLLM configuration
-    vllm_tensor_parallel_size: int = 4  # Use all 4 L4 GPUs
-    vllm_max_model_len: int = 8192  # Reduced to minimize NCCL overhead on L4s
+    vllm_tensor_parallel_size: int = 1  # Single GPU - 4B model fits on one L4
+    vllm_max_model_len: int = 16384  # 4B supports up to 256K, 16K is sufficient
 
     # ChromaDB
     chroma_persist_dir: str = "./chroma_db"
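Since `Settings` extends Pydantic's `BaseSettings`, every field above can be overridden by an environment variable of the same name (case-insensitive by default in pydantic-settings). A minimal usage sketch of that behavior, with illustrative override values:

```python
import os

from config.settings import Settings

# Env vars override the in-file defaults, so a Space with different
# hardware can retune vLLM without editing settings.py.
os.environ["VLLM_MAX_MODEL_LEN"] = "8192"
os.environ["MOCK_MODELS"] = "true"

settings = Settings()
assert settings.vllm_max_model_len == 8192      # int coerced from the env string
assert settings.mock_models is True
assert settings.vllm_tensor_parallel_size == 1  # untouched default
```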
models/real.py CHANGED
@@ -1,13 +1,13 @@
-"""Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).
+"""Real model loading for production (HuggingFace Spaces).
 
 This module loads the production models:
-- Vision: Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (~30-35GB via vLLM)
+- Vision: Qwen/Qwen3-VL-4B-Thinking (~10GB via vLLM, single GPU)
 - Embedding: Qwen/Qwen3-VL-Embedding-2B (~4GB)
 - Reranker: Qwen/Qwen3-VL-Reranker-2B (~4GB)
-- Total: ~38-43GB on 88GB available (45GB+ headroom)
+- Total: ~18GB on single L4 GPU (22GB)
 
 Model Loading:
-- Vision: vLLM with FP8 quantization (built-in) and tensor parallelism
+- Vision: vLLM with single GPU (no tensor parallelism needed)
 - Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
 - Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
 """
@@ -15,16 +15,7 @@ Model Loading:
 import os
 
 # vLLM environment variables - MUST be set before importing vLLM
-# Note: V0 engine is removed in vLLM 0.11+, so we must use V1
-
-# Force spawn method for tensor parallelism workers
-# See: https://github.com/vllm-project/vllm/issues/17618
-os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
-
-# NCCL settings for L4 GPU communication
-# See: https://github.com/vllm-project/vllm/issues/19002
-os.environ["NCCL_P2P_DISABLE"] = "1"
-os.environ["NCCL_IB_DISABLE"] = "1"
+# Note: Using single GPU (TP=1) so NCCL workarounds are not needed
 
 import json
 import logging
@@ -43,8 +34,8 @@ logger = logging.getLogger(__name__)
 class RealModelStack:
     """Real model stack for production on HuggingFace Spaces.
 
-    Loads all 3 models at initialization (~38-43GB total):
-    - FP8 Vision via vLLM: ~30-35GB
+    Loads all 3 models at initialization (~18GB total on single GPU):
+    - Vision 4B via vLLM: ~10GB
     - Embedding 2B: ~4GB
     - Reranker 2B: ~4GB
     """
@@ -80,7 +71,7 @@ class RealModelStack:
 
         total_start = time.time()
 
-        # Vision model via vLLM (~30-35GB in FP8)
+        # Vision model via vLLM (~10GB for 4B model)
         logger.info(f"Loading vision model: {settings.vision_model}")
         vision_start = time.time()
 
@@ -89,12 +80,10 @@
 
         self.models["vision"] = LLM(
             model=settings.vision_model,
-            tensor_parallel_size=settings.vllm_tensor_parallel_size,
+            tensor_parallel_size=settings.vllm_tensor_parallel_size,  # 1 for single GPU
             trust_remote_code=True,
-            # dtype removed - FP8 model auto-detects native quantization
-            gpu_memory_utilization=0.50,  # Reduced to minimize NCCL overhead on L4s
+            gpu_memory_utilization=0.80,  # Can use more on single GPU
             max_model_len=settings.vllm_max_model_len,
-            # enforce_eager removed - let vLLM default (False) per official
         )
 
         # Load processor for chat template formatting
@@ -177,7 +166,7 @@
 class VisionModel:
     """Vision model for fire damage analysis.
 
-    Uses Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 via vLLM for inference.
+    Uses Qwen/Qwen3-VL-4B-Thinking via vLLM for inference.
     Reasoning-enhanced model handles analysis with extended thinking
     and outputs structured JSON.
 
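For context, here is roughly what a single call through the loaded stack might look like. The `multi_modal_data` form of `LLM.generate` and the processor's `apply_chat_template` are standard vLLM/Transformers APIs, but the `stack.processor` attribute name, prompt text, and helper function are illustrative assumptions, not code from this commit:

```python
from PIL import Image
from vllm import SamplingParams

def analyze_image(stack, image_path: str) -> str:
    """Illustrative single-image call against the loaded vision model."""
    image = Image.open(image_path)

    # Format the request with the model's chat template (real.py loads a
    # processor for exactly this; the attribute name here is assumed).
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Assess the fire damage; respond as JSON."},
        ],
    }]
    prompt = stack.processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    outputs = stack.models["vision"].generate(
        {"prompt": prompt, "multi_modal_data": {"image": image}},
        SamplingParams(temperature=0.2, max_tokens=2048),
    )
    return outputs[0].outputs[0].text  # structured JSON (after thinking tokens)
```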