Commit 706520f
Parent(s): 0699c5f
Replace dual 8B with single 30B-A3B FP8 vision model
Simplify pipeline architecture:
- Vision: Qwen3-VL-30B-A3B-Thinking-FP8 (~30-35GB) replaces dual 8B
- Embedding: 2B model (2048-dim) replaces 8B
- Reranker: 2B model replaces 8B
- Total VRAM: ~38-43GB (was ~68GB), 45GB+ headroom on 4xL4
Key changes:
- vLLM with FP8 quantization (built-in, no autoawq needed)
- Proper Qwen3-VL chat template formatting via processor
- Removed dual-model Thinking→Instruct pipeline
- Single model handles analysis + structured JSON output
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- .env.example +4 -5
- CLAUDE.md +28 -26
- FDAM_AI_Pipeline_Technical_Spec.md +0 -0
- README.md +4 -5
- config/inference.py +14 -25
- config/settings.py +8 -5
- models/loader.py +7 -7
- models/mock.py +16 -19
- models/real.py +136 -281
- rag/vectorstore.py +3 -3
- requirements.txt +3 -0
- scripts/qwen3_vl/__init__.py +2 -2
.env.example
CHANGED

@@ -8,8 +8,7 @@ MOCK_MODELS=true
 SERVER_HOST=0.0.0.0
 SERVER_PORT=7860

-# Optional: Override model paths (…
-# …
-# …
-# …
-# RERANKER_MODEL=Qwen/Qwen3-VL-Reranker-8B
+# Optional: Override model paths (FP8 + 2B architecture)
+# VISION_MODEL=Qwen/Qwen3-VL-30B-A3B-Thinking-FP8
+# EMBEDDING_MODEL=Qwen/Qwen3-VL-Embedding-2B
+# RERANKER_MODEL=Qwen/Qwen3-VL-Reranker-2B
CLAUDE.md
CHANGED

@@ -13,7 +13,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Critical Constraints

 1. **No External API Calls** - 100% locally-owned models only (no Claude/OpenAI APIs)
-2. **Memory Budget** - 4xL4 88GB usable: ~…
+2. **Memory Budget** - 4xL4 88GB usable: ~30-35GB vision (30B FP8) + ~4GB embedding + ~4GB reranker (~38-43GB used, ~45GB+ headroom)
 3. **Processing Time** - 60-90 seconds per assessment is acceptable
 4. **MVP Scope** - Phase 1 (PRE) and Phase 2 (PRA) only; no lab results processing yet
 5. **Static RAG** - Knowledge base is pre-indexed; no user document uploads

@@ -23,10 +23,10 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 | Component | Technology |
 |-----------|------------|
 | UI Framework | Gradio 6.x |
-| Vision…
-| …
-| …
-| …
+| Vision | Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (via vLLM) |
+| Embeddings | Qwen/Qwen3-VL-Embedding-2B (2048-dim) |
+| Reranker | Qwen/Qwen3-VL-Reranker-2B |
+| Inference | vLLM with FP8 quantization |
 | Vector Store | ChromaDB 0.4.x |
 | Validation | Pydantic 2.x |
 | PDF Generation | Pandoc 3.x |

@@ -149,40 +149,42 @@ Source documents in `/RAG-KB/`:

 ## Multi-GPU Model Loading

-All …
+All 3 models are loaded at startup (~38-43GB total on 4xL4 GPUs):

 ```python
-…
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    trust_remote_code=True
+from vllm import LLM, SamplingParams
+
+# Vision model via vLLM with FP8 quantization (built-in)
+vision_model = LLM(
+    model="Qwen/Qwen3-VL-30B-A3B-Thinking-FP8",
+    tensor_parallel_size=4,  # Distribute across all 4 GPUs
+    trust_remote_code=True,
+    gpu_memory_utilization=0.70,
+    max_model_len=32768,
 )
+
+# Embedding and Reranker use official Qwen3VL loaders
+from scripts.qwen3_vl import Qwen3VLEmbedder, Qwen3VLReranker
+embedding_model = Qwen3VLEmbedder("Qwen/Qwen3-VL-Embedding-2B", torch_dtype=torch.bfloat16)
+reranker_model = Qwen3VLReranker("Qwen/Qwen3-VL-Reranker-2B", torch_dtype=torch.bfloat16)
 ```

-Expected distribution (BF16, ~…
-- Vision…
-- …
-- …
-- Headroom: ~20GB for KV cache and overhead
+Expected distribution (FP8 + BF16, ~38-43GB total):
+- Vision model (30B FP8): ~30-35GB
+- Embedding model (2B): ~4GB
+- Reranker model (2B): ~4GB
+- Headroom: ~45GB+ for KV cache and overhead

 ## Local Development Strategy

-The RTX 4090 (24GB VRAM) cannot run the …
+The RTX 4090 (24GB VRAM) cannot run the production model stack. Use this workflow:

 1. Set `MOCK_MODELS=true` environment variable
-2. Mock responses return realistic JSON matching vision output schema
+2. Mock responses return realistic JSON matching vision output schema (2048-dim embeddings)
 3. Test pipeline logic, UI, calculations without real inference
 4. Deploy to HuggingFace Spaces for real model testing
 5. Request build logs after deployment to confirm success
+6. After changing embedding dimensions, rebuild ChromaDB: `python -m rag.index_builder --rebuild`

 ## Code Style
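The memory-budget constraint above is simple arithmetic; a minimal sanity-check sketch, using this commit's own estimates rather than measured values:

```python
# Sanity-check the VRAM budget from this commit (estimates, not measurements).
budget_gb = {
    "vision_30b_fp8": (30, 35),    # vLLM weights + FP8 activations
    "embedding_2b_bf16": (4, 4),
    "reranker_2b_bf16": (4, 4),
}
usable_gb = 88  # 4xL4 usable total

low = sum(lo for lo, _ in budget_gb.values())
high = sum(hi for _, hi in budget_gb.values())
print(f"Used: {low}-{high}GB, headroom: {usable_gb - high}-{usable_gb - low}GB")
# -> Used: 38-43GB, headroom: 45-50GB (consistent with the ~45GB+ claim)
```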
FDAM_AI_Pipeline_Technical_Spec.md
DELETED

The diff for this file is too large to render. See raw diff.
README.md
CHANGED

@@ -32,11 +32,10 @@ suggested_hardware: l4x4

 ## Technical Details

-### Model Stack (~…
-- **Vision…
-- **…
-- **…
-- **Reranker**: Qwen3-VL-Reranker-8B (~16GB)
+### Model Stack (~38-43GB VRAM)
+- **Vision**: Qwen3-VL-30B-A3B-Thinking-FP8 (~30-35GB) - Reasoning-enhanced analysis with structured JSON output
+- **Embeddings**: Qwen3-VL-Embedding-2B (~4GB)
+- **Reranker**: Qwen3-VL-Reranker-2B (~4GB)

 ### Zone Classifications
 - **Burn Zone**: Direct fire involvement, structural damage
config/inference.py
CHANGED

@@ -2,39 +2,29 @@

 Configuration values aligned with official Qwen3-VL model recommendations
 and FDAM Technical Spec requirements.
+
+Pipeline uses:
+- Vision: Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (single model, FP8 via vLLM)
+- Embedding: Qwen/Qwen3-VL-Embedding-2B (2048-dim)
+- Reranker: Qwen/Qwen3-VL-Reranker-2B
 """

 from dataclasses import dataclass


 @dataclass
-class …
-    """Configuration for …
+class VisionInferenceConfig:
+    """Configuration for 30B-A3B FP8 vision model inference.

+    Single model handles both analysis and structured JSON output.
+    Uses vLLM with tensor parallelism across 4 GPUs.
     """

-    …
+    max_tokens: int = 8192  # vLLM uses max_tokens not max_new_tokens
     temperature: float = 0.6  # Per Qwen3-VL GitHub docs
     top_p: float = 0.95
     top_k: int = 20
-    …
-    repetition_penalty: float = 1.0  # Per Qwen3-VL docs (not presence_penalty)
-
-
-@dataclass
-class VisionInferenceConfig:
-    """Configuration for 8B-Instruct model inference.
-
-    Per FDAM Technical Spec Section 3. Used for structured JSON output.
-    """
-
-    max_new_tokens: int = 4096
-    temperature: float = 0.1  # Low temperature for deterministic JSON output
-    top_p: float = 0.9
-    do_sample: bool = True
-    repetition_penalty: float = 1.1  # Reduce repetition in generated text
+    repetition_penalty: float = 1.0  # Per Qwen3-VL docs


 @dataclass

@@ -55,10 +45,10 @@ class GenerationInferenceConfig:
 class EmbeddingConfig:
     """Configuration for embedding model.

-    Per Qwen3-VL-Embedding-…
+    Per Qwen3-VL-Embedding-2B config.json: text_config.hidden_size = 2048
     """

-    embedding_dimension: int = …
+    embedding_dimension: int = 2048  # Per Qwen3-VL-Embedding-2B hidden_size
     normalize: bool = True  # L2 normalization (per official implementation)

@@ -82,8 +72,7 @@ class RAGConfig:

 # Default configurations
-…
-vision_config = VisionInferenceConfig()  # Now used for Instruct model
+vision_config = VisionInferenceConfig()  # Single 30B-A3B FP8 model
 generation_config = GenerationInferenceConfig()
 embedding_config = EmbeddingConfig()
 reranker_config = RerankerConfig()
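Since these are plain dataclasses, one-off overrides are cheap; a minimal sketch (the strict-JSON variant below is hypothetical, not part of this commit):

```python
from dataclasses import replace

from config.inference import vision_config

# Defaults follow the Qwen3-VL docs: temperature=0.6, top_p=0.95, top_k=20
print(vision_config.max_tokens)  # 8192 (vLLM naming: max_tokens)

# Hypothetical per-call variant with a lower temperature for stricter JSON
json_strict = replace(vision_config, temperature=0.1, top_p=0.9)
```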
config/settings.py
CHANGED

@@ -17,11 +17,14 @@ class Settings(BaseSettings):
     mock_models: bool = True

     # Model paths (for production on HuggingFace Spaces)
-    # …
-    …
-    …
-    …
-    …
+    # Single 30B-A3B MoE model with FP8 quantization via vLLM (official, reasoning-enhanced)
+    vision_model: str = "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"
+    embedding_model: str = "Qwen/Qwen3-VL-Embedding-2B"
+    reranker_model: str = "Qwen/Qwen3-VL-Reranker-2B"
+
+    # vLLM configuration
+    vllm_tensor_parallel_size: int = 4  # Use all 4 L4 GPUs
+    vllm_max_model_len: int = 32768  # Context window

     # ChromaDB
     chroma_persist_dir: str = "./chroma_db"
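If Settings follows pydantic's default env-name mapping (field names matched case-insensitively against environment variables), the .env.example overrides land directly on these fields; a sketch under that assumption:

```python
import os

# Assumes Settings is a pydantic BaseSettings with default env mapping
os.environ["VISION_MODEL"] = "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"
os.environ["VLLM_TENSOR_PARALLEL_SIZE"] = "2"  # e.g. a smaller 2-GPU box

from config.settings import Settings

settings = Settings()
print(settings.vision_model)               # env value overrides the default
print(settings.vllm_tensor_parallel_size)  # -> 2
```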
models/loader.py
CHANGED

@@ -2,12 +2,13 @@

 Supports two loading modes:
 - MOCK_MODELS=true: Loads mock models (fast, for local dev on RTX 4090)
-- MOCK_MODELS=false: Loads all real models at startup (~…
+- MOCK_MODELS=false: Loads all real models at startup (~38-43GB total)

 Memory Strategy (Simultaneous Loading for 4xL4 GPUs with 88GB total):
-- Vision…
-- Embedding…
-- …
+- Vision 30B-A3B FP8 via vLLM: ~30-35GB
+- Embedding 2B: ~4GB
+- Reranker 2B: ~4GB
+- Total: ~38-43GB, leaving ~45GB+ headroom
 """

 import logging

@@ -29,7 +30,7 @@ def get_model_stack() -> ModelStack:
     """Get model stack based on environment configuration.

     For mock models: Loads mock models immediately (fast, for local dev).
-    For real models: Loads all …
+    For real models: Loads all 3 models at startup (~38-43GB total).
     """
     start_time = time.time()

@@ -43,8 +44,7 @@ def get_model_stack() -> ModelStack:
         return stack
     else:
         logger.info("Loading REAL model stack (production mode)")
-        logger.info(f"Vision …
-        logger.info(f"Vision instruct model: {settings.vision_model_instruct}")
+        logger.info(f"Vision model: {settings.vision_model} (FP8 via vLLM)")
         logger.info(f"Embedding model: {settings.embedding_model}")
         logger.info(f"Reranker model: {settings.reranker_model}")
         from models.real import RealModelStack
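Putting the two modes together, a minimal usage sketch, assuming the mock stack mirrors the real stack's `vision` property as the diffs suggest (the context string and image are illustrative):

```python
import os

os.environ["MOCK_MODELS"] = "true"  # local dev: mock stack, no GPUs needed

from PIL import Image
from models.loader import get_model_stack

stack = get_model_stack()  # MockModelStack here; RealModelStack in production
image = Image.new("RGB", (640, 480))  # stand-in for a site photo
result = stack.vision.analyze_image(image, context="Kitchen, 2nd floor")
print(result["zone"]["classification"], result["condition"]["level"])
```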
models/mock.py
CHANGED

@@ -1,7 +1,7 @@
 """Mock model implementations for local development on RTX 4090.

-Simulates the …
-- MockVisionModel simulates …
+Simulates the 30B-A3B FP8 vision model architecture:
+- MockVisionModel simulates single-model analysis + JSON output
 - All models loaded together at startup (no lazy loading)
 """

@@ -14,11 +14,10 @@ logger = logging.getLogger(__name__)


 class MockVisionModel:
-    """Mock vision model that simulates …
+    """Mock vision model that simulates 30B-A3B FP8 model output.

-    Simulates …
-    …
-    - Stage 2: Instruct model formats to JSON
+    Simulates single-model analysis with structured JSON output.
+    The real model uses vLLM with FP8 quantization.
     """

     ZONES = ["burn", "near-field", "far-field"]

@@ -54,15 +53,13 @@ class MockVisionModel:
     }

     def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
-        """Return mock vision analysis simulating …
-        logger.debug(f"Mock …
+        """Return mock vision analysis simulating 30B-A3B FP8 model output."""
+        logger.debug(f"Mock 30B-A3B FP8 vision analysis (context: {len(context)} chars)")

-        # Simulate …
+        # Simulate model generating analysis + JSON
         selected_zone = random.choice(self.ZONES)
         selected_condition = random.choice(self.CONDITIONS)

-        logger.debug("Mock Stage 1 (Thinking): Generated reasoning")
-        logger.debug("Mock Stage 2 (Instruct): Formatted to JSON")
         logger.info(f"Mock vision result: zone={selected_zone}, condition={selected_condition}")

         # Generate 2-4 random materials

@@ -141,16 +138,16 @@ class MockVisionModel:
 class MockEmbeddingModel:
     """Mock embedding model that returns deterministic vectors.

-    Dimension matches Qwen3-VL-Embedding-…
+    Dimension matches Qwen3-VL-Embedding-2B (2048-dim).
     Uses last-token pooling concept with L2 normalization.
     """

-    def __init__(self, dimension: int = …
-        """Initialize with dimension matching real Qwen3-VL-Embedding-…
+    def __init__(self, dimension: int = 2048):
+        """Initialize with dimension matching real Qwen3-VL-Embedding-2B model."""
         self.dimension = dimension

     def embed(self, text: str) -> list[float]:
-        """Return mock embedding vector (…
+        """Return mock embedding vector (2048-dim, L2 normalized).

         Uses hash of text for reproducibility, simulating last-token pooling.
         """

@@ -176,7 +173,7 @@ class MockEmbeddingModel:
 class MockRerankerModel:
     """Mock reranker that returns realistic relevance scores.

-    Simulates Qwen3-VL-Reranker behavior with 0-1 sigmoid-like scores.
+    Simulates Qwen3-VL-Reranker-2B behavior with 0-1 sigmoid-like scores.
     """

     def rerank(self, query: str, documents: list[str]) -> list[float]:

@@ -236,9 +233,9 @@ class MockModelStack:
     def load_all(self) -> "MockModelStack":
         """Load all mock models."""
         logger.info("Loading mock models for local development")
-        logger.debug("  Vision model: MockVisionModel (simulates …
-        logger.debug("  Embedding model: MockEmbeddingModel (…
-        logger.debug("  Reranker model: MockRerankerModel")
+        logger.debug("  Vision model: MockVisionModel (simulates 30B-A3B FP8)")
+        logger.debug("  Embedding model: MockEmbeddingModel (2048-dim)")
+        logger.debug("  Reranker model: MockRerankerModel (simulates 2B)")
         self._loaded = True
         logger.info("All mock models loaded successfully")
         return self
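The deterministic-embedding idea (hash-seeded values, L2 normalized so cosine math behaves) can be sketched standalone; this is an illustration of the approach, not the repo's exact implementation:

```python
import hashlib
import math

def mock_embed(text: str, dim: int = 2048) -> list[float]:
    """Deterministic pseudo-embedding: hash-seeded, L2-normalized."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Stretch the 32-byte digest across `dim` values centered on zero
    raw = [(digest[i % len(digest)] / 255.0) - 0.5 for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

vec = mock_embed("soot deposition on painted drywall")
assert len(vec) == 2048
assert abs(sum(v * v for v in vec) - 1.0) < 1e-6  # unit length
assert vec == mock_embed("soot deposition on painted drywall")  # reproducible
```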
models/real.py
CHANGED

@@ -1,17 +1,13 @@
 """Real model loading for production (HuggingFace Spaces with 4xL4 GPUs).

-This module loads the …
-- …
-- Vision Instruct 8B (~18GB): Structured JSON output formatting
-- Embedding 8B (~16GB): RAG document embedding
-- Reranker 8B (~16GB): RAG retrieval reranking
-- Total: ~68GB on 88GB available (20GB headroom)
+This module loads the production models:
+- Vision: Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 (~30-35GB via vLLM)
+- Embedding: Qwen/Qwen3-VL-Embedding-2B (~4GB)
+- Reranker: Qwen/Qwen3-VL-Reranker-2B (~4GB)
+- Total: ~38-43GB on 88GB available (45GB+ headroom)

 Model Loading:
-- Vision: …
+- Vision: vLLM with FP8 quantization (built-in) and tensor parallelism
 - Embedding: Qwen3VLEmbedder (official scripts from QwenLM/Qwen3-VL-Embedding)
 - Reranker: Qwen3VLReranker (official scripts from QwenLM/Qwen3-VL-Embedding)
 """

@@ -24,7 +20,7 @@ import torch
 from typing import Any
 from PIL import Image

-from config.inference import …
+from config.inference import vision_config
 from config.settings import settings

 logger = logging.getLogger(__name__)

@@ -33,9 +29,10 @@ logger = logging.getLogger(__name__)
 class RealModelStack:
     """Real model stack for production on HuggingFace Spaces.

-    Loads all …
-    - …
-    - Embedding …
+    Loads all 3 models at initialization (~38-43GB total):
+    - FP8 Vision via vLLM: ~30-35GB
+    - Embedding 2B: ~4GB
+    - Reranker 2B: ~4GB
     """

     def __init__(self):

@@ -56,54 +53,53 @@ class RealModelStack:
             logger.info(f"  GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {free:.1f}GB free / {total:.1f}GB total")

     def load_all(self) -> "RealModelStack":
-        """Load all models …
+        """Load all models.

-        Loads …
-        (Embedding + Reranker) for ~68GB total VRAM usage.
+        Loads FP8 vision model via vLLM and RAG models (Embedding + Reranker).
         """
         if self._loaded:
             logger.debug("Models already loaded, skipping")
             return self

-        device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
-        logger.info(f"Loading all models on {device_type}")
+        logger.info("Loading production models...")
         self._log_gpu_status()

         total_start = time.time()

-        # Vision …
-        logger.info(f"Loading vision …
-        …
-            settings.…
+        # Vision model via vLLM (~30-35GB in FP8)
+        logger.info(f"Loading vision model: {settings.vision_model}")
+        vision_start = time.time()
+
+        from vllm import LLM, SamplingParams
+        from transformers import AutoProcessor
+
+        self.models["vision"] = LLM(
+            model=settings.vision_model,
+            # FP8 quantization is built into model weights, no quantization param needed
+            tensor_parallel_size=settings.vllm_tensor_parallel_size,
             trust_remote_code=True,
+            gpu_memory_utilization=0.70,  # Per Qwen FP8 model recommendations
+            max_model_len=settings.vllm_max_model_len,
         )
-        logger.info(f"Vision thinking model loaded in {time.time() - thinking_start:.2f}s")

-        # …
-        self.models["vision_instruct"] = Qwen3VLForConditionalGeneration.from_pretrained(
-            settings.vision_model_instruct,
-            torch_dtype=torch.bfloat16,
-            device_map="auto",
-            trust_remote_code=True,
-        )
-        logger.info(f"Vision instruct model loaded in {time.time() - instruct_start:.2f}s")
+        # Load processor for chat template formatting
+        self.processors["vision"] = AutoProcessor.from_pretrained(
+            settings.vision_model,
+            trust_remote_code=True,
+        )
+
+        # Store sampling params for inference
+        self.models["vision_sampling_params"] = SamplingParams(
+            max_tokens=vision_config.max_tokens,
+            temperature=vision_config.temperature,
+            top_p=vision_config.top_p,
+            top_k=vision_config.top_k,
+            repetition_penalty=vision_config.repetition_penalty,
+        )
+
+        logger.info(f"Vision model loaded in {time.time() - vision_start:.2f}s")

+        # Embedding model (~4GB in BF16) - Using official Qwen3VLEmbedder
         logger.info(f"Loading embedding model: {settings.embedding_model}")
         embed_start = time.time()
         from scripts.qwen3_vl import Qwen3VLEmbedder

@@ -115,7 +111,7 @@ class RealModelStack:
         self.processors["embedding"] = self.models["embedding"].processor
         logger.info(f"Embedding model loaded in {time.time() - embed_start:.2f}s")

-        # Reranker model (~…
+        # Reranker model (~4GB in BF16) - Using official Qwen3VLReranker
         logger.info(f"Loading reranker model: {settings.reranker_model}")
         reranker_start = time.time()
         from scripts.qwen3_vl import Qwen3VLReranker

@@ -138,15 +134,14 @@ class RealModelStack:
         return self._loaded

     @property
-    def vision(self) -> "…
-        """Return …
+    def vision(self) -> "VisionModel":
+        """Return FP8 vision model wrapped for pipeline consumption."""
         if not self._loaded:
             raise RuntimeError("Models not loaded. Call load_all() first.")
-        return …
-            instruct_processor=self.processors["vision_instruct"],
+        return VisionModel(
+            model=self.models["vision"],
+            processor=self.processors["vision"],
+            sampling_params=self.models["vision_sampling_params"],
         )

     @property

@@ -164,20 +159,21 @@ class RealModelStack:
         return RealRerankerModel(self.models["reranker"], self.processors["reranker"])


-class …
-    """…
+class VisionModel:
+    """Vision model for fire damage analysis.

-    Uses Qwen3-VL-…
+    Uses Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 via vLLM for inference.
+    Reasoning-enhanced model handles analysis with extended thinking
+    and outputs structured JSON.

-    Pipeline: Image -> Thinking (…
+    Pipeline: Image -> Thinking Model (reasoning + JSON) -> Output
     """

     # System prompt for FDAM fire damage assessment
     VISION_SYSTEM_PROMPT = """You are an expert industrial hygienist analyzing fire damage images for the FDAM (Fire Damage Assessment Methodology) framework.

 ## Your Task
-Analyze the provided image and …
+Analyze the provided image and return a structured JSON response with fire damage assessment.

 ## Zone Classification Criteria
 - **Burn Zone**: Direct fire involvement. Look for structural char, complete combustion, exposed/damaged structural elements.

@@ -191,121 +187,105 @@ Analyze the provided image and extract structured information about fire damage,
 - **Heavy**: Thick deposits; surface texture obscured; heavy coating visible.
 - **Structural Damage**: Physical damage requiring repair before cleaning (charring, warping, holes, collapse).

-## Material …
+## Material Categories
 Identify visible materials and categorize as:
 - **Non-porous**: steel, concrete, glass, metal, CMU (concrete masonry unit)
 - **Semi-porous**: painted drywall, sealed wood
 - **Porous**: unpainted drywall, carpet, insulation, acoustic tile, upholstery
 - **HVAC**: rigid ductwork, flexible ductwork

 ## Combustion Particle Visual Indicators
 - **Soot**: Black/dark gray coating with oily/sticky appearance; fine uniform texture
 - **Char**: Black angular fragments; visible wood grain or fibrous structure
-- **Ash**: Gray/white powdery residue; crystalline appearance
-…
+- **Ash**: Gray/white powdery residue; crystalline appearance"""

-# Analysis prompt for Thinking model (open-ended reasoning)
-THINKING_ANALYSIS_PROMPT = """Analyze this fire damage image thoroughly. Consider:
-
-1. What zone classification applies (burn, near-field, or far-field) and why?
-2. What is the contamination condition level (background, light, moderate, heavy, or structural-damage)?
-3. What materials are visible and what is their porosity category?
-4. What combustion indicators (soot, char, ash) are present and where?
-5. Are there any structural concerns or access issues?
-6. Where would you recommend sampling and what type of samples?
-
-Provide detailed reasoning for each assessment, explaining the visual evidence that supports your conclusions."""
-
-# Formatter prompt for Instruct model (structured JSON output)
-INSTRUCT_FORMATTER_SYSTEM = """You are a technical document formatter. Your task is to convert fire damage analysis into a precise JSON structure.
-
-Preserve all findings from the analysis accurately. Assign confidence scores (0.0-1.0) based on the certainty expressed in the analysis:
-- Very certain statements: 0.85-0.95
-- Reasonably confident: 0.70-0.84
-- Somewhat uncertain: 0.50-0.69
-- Uncertain/fallback: 0.30-0.49"""
-
-INSTRUCT_FORMATTER_PROMPT = """Based on the following fire damage analysis, generate a JSON response with this exact structure:
-
-<analysis>
-{analysis}
-</analysis>
-
-Generate JSON with this structure:
-{{
-    "zone": {{
+# JSON output format prompt
+JSON_FORMAT_PROMPT = """Analyze this fire damage image and return a JSON response with this exact structure:
+
+{
+    "zone": {
         "classification": "burn" | "near-field" | "far-field",
         "confidence": 0.0-1.0,
         "reasoning": "explanation"
-    }
-    "condition": {
+    },
+    "condition": {
         "level": "background" | "light" | "moderate" | "heavy" | "structural-damage",
         "confidence": 0.0-1.0,
         "reasoning": "explanation"
-    }
+    },
     "materials": [
         {
-            "type": "material type…
+            "type": "material type",
             "category": "non-porous" | "semi-porous" | "porous" | "hvac",
             "confidence": 0.0-1.0,
             "location_description": "where in image",
-            "bounding_box": {…
+            "bounding_box": {"x": 0.0-1.0, "y": 0.0-1.0, "width": 0.0-1.0, "height": 0.0-1.0}
         }
     ],
     "combustion_indicators": {
         "soot_visible": true/false,
         "soot_pattern": "description or null",
         "char_visible": true/false,
         "char_description": "description or null",
         "ash_visible": true/false,
         "ash_description": "description or null"
-    }
+    },
     "structural_concerns": ["list of structural issues if any"],
     "access_issues": ["list of access problems if any"],
     "recommended_sampling_locations": [
         {
             "description": "where to sample",
             "sample_type": "tape_lift" | "surface_wipe" | "air_sample",
             "priority": "high" | "medium" | "low"
         }
     ],
     "flags_for_review": ["any items requiring human review"]
 }

 IMPORTANT: Return ONLY valid JSON, no additional text."""

-    def __init__(self, …
-        self.…
-        self.instruct_processor = instruct_processor
+    def __init__(self, model, processor, sampling_params):
+        self.model = model
+        self.processor = processor
+        self.sampling_params = sampling_params

     def analyze_image(self, image: Image.Image, context: str = "") -> dict[str, Any]:
-        """Analyze an image using …
+        """Analyze an image using the FP8 vision model via vLLM.
+
+        Args:
+            image: PIL Image to analyze
+            context: Optional context string (room info, etc.)
+
+        Returns:
+            Structured dict with zone, condition, materials, etc.
         """
         start_time = time.time()
-        logger.debug(f"Starting …
+        logger.debug(f"Starting FP8 vision analysis (context: {len(context)} chars)")

         try:
-            # …
+            # Build messages in Qwen3-VL format
+            messages = self._build_messages(image, context)
+
+            # Apply chat template to format prompt correctly
+            prompt = self.processor.apply_chat_template(
+                messages,
+                tokenize=False,
+                add_generation_prompt=True,
+            )
+
+            # Generate response using vLLM multimodal API
+            # Per vLLM docs: pass PIL image directly in multi_modal_data dict
+            outputs = self.model.generate(
+                prompts=[{
+                    "prompt": prompt,
+                    "multi_modal_data": {"image": image},  # Single PIL image
+                }],
+                sampling_params=self.sampling_params,
+            )
+
+            response_text = outputs[0].outputs[0].text
+
+            # Parse JSON from response
+            result = self._parse_json_response(response_text)

             # Log result summary
             total_time = time.time() - start_time

@@ -314,7 +294,7 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
             condition = result.get("condition", {}).get("level", "unknown")
             condition_conf = result.get("condition", {}).get("confidence", 0)
             num_materials = len(result.get("materials", []))
-            logger.info(f"Vision analysis complete in {total_time:.2f}s…
+            logger.info(f"Vision analysis complete in {total_time:.2f}s: "
                         f"zone={zone} ({zone_conf:.2f}), condition={condition} ({condition_conf:.2f}), "
                         f"materials={num_materials}")

@@ -324,159 +304,32 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
             logger.error(f"Vision analysis failed: {e}")
             return self._get_fallback_response(str(e))

-    def …
-        """…
-        try:
-            from qwen_vl_utils import process_vision_info
-        except ImportError:
-            logger.warning("qwen_vl_utils not available, using basic processing")
-            process_vision_info = None
-
+    def _build_messages(self, image: Image.Image, context: str) -> list[dict]:
+        """Build messages in Qwen3-VL format for chat template.
+
+        Qwen3-VL expects:
+        - System message with role="system"
+        - User message with mixed content [{"type": "image", ...}, {"type": "text", ...}]
+        """
+        # Build user text content
+        user_text = self.JSON_FORMAT_PROMPT
         if context:
-            …
+            user_text = f"Context: {context}\n\n{user_text}"

-        # Prepare messages in Qwen-VL format with system prompt
         messages = [
-            {
-                "role": "system",
-                "content": self.VISION_SYSTEM_PROMPT,
-            },
+            {"role": "system", "content": self.VISION_SYSTEM_PROMPT},
             {
                 "role": "user",
                 "content": [
                     {"type": "image", "image": image},
-                    {"type": "text", "text": …
+                    {"type": "text", "text": user_text},
                 ],
-            }
-        ]
-
-        # Apply chat template with thinking enabled (default for Thinking model)
-        text = self.thinking_processor.apply_chat_template(
-            messages, tokenize=False, add_generation_prompt=True
-        )
-
-        # Process vision info if available
-        if process_vision_info:
-            image_inputs, video_inputs = process_vision_info(messages)
-            inputs = self.thinking_processor(
-                text=[text],
-                images=image_inputs,
-                videos=video_inputs,
-                return_tensors="pt",
-                padding=True,
-            )
-        else:
-            # Fallback: basic image processing
-            inputs = self.thinking_processor(
-                text=[text],
-                images=[image],
-                return_tensors="pt",
-                padding=True,
-            )
-
-        # Generate response using thinking config (per Qwen3-VL GitHub recommendations)
-        logger.debug(f"Thinking inference config: max_new_tokens={thinking_config.max_new_tokens}, "
-                     f"temp={thinking_config.temperature}, top_p={thinking_config.top_p}, top_k={thinking_config.top_k}")
-
-        with torch.no_grad():
-            outputs = self.thinking_model.generate(
-                **inputs,
-                max_new_tokens=thinking_config.max_new_tokens,
-                do_sample=thinking_config.do_sample,
-                temperature=thinking_config.temperature,
-                top_p=thinking_config.top_p,
-                top_k=thinking_config.top_k,
-                repetition_penalty=thinking_config.repetition_penalty,
-            )
-
-        # Decode response - get raw token IDs first for proper parsing
-        output_ids = outputs[0].tolist()
-
-        # The Thinking model's chat template includes opening <think> tag
-        # Output format: reasoning_content</think>final_answer
-        # Get </think> token ID dynamically from tokenizer (more robust than hardcoding)
-        think_end_token = self.thinking_processor.tokenizer.encode(
-            "</think>", add_special_tokens=False
-        )[0]
-
-        try:
-            # Find the </think> token position
-            think_end_idx = len(output_ids) - output_ids[::-1].index(think_end_token)
-            # Extract reasoning (before </think>) and answer (after </think>)
-            reasoning_ids = output_ids[:think_end_idx]
-            answer_ids = output_ids[think_end_idx:]
-
-            reasoning = self.thinking_processor.decode(
-                reasoning_ids, skip_special_tokens=True
-            ).strip()
-            final_answer = self.thinking_processor.decode(
-                answer_ids, skip_special_tokens=True
-            ).strip()
-
-            logger.debug(f"Extracted thinking: {len(reasoning)} chars reasoning, {len(final_answer)} chars answer")
-            return f"Reasoning:\n{reasoning}\n\nConclusions:\n{final_answer}"
-
-        except ValueError:
-            # No </think> token found - use full response as-is
-            response_text = self.thinking_processor.decode(
-                output_ids, skip_special_tokens=True
-            ).strip()
-            logger.debug(f"No </think> token found, using full response: {len(response_text)} chars")
-            return response_text
-
-    def _run_instruct_stage(self, analysis_text: str) -> dict[str, Any]:
-        """Run the Instruct model to format analysis into JSON."""
-        # Prepare messages for Instruct model (text-only, no image)
-        prompt = self.INSTRUCT_FORMATTER_PROMPT.format(analysis=analysis_text)
-
-        messages = [
-            {
-                "role": "system",
-                "content": self.INSTRUCT_FORMATTER_SYSTEM,
             },
-            {
-                "role": "user",
-                "content": prompt,
-            }
         ]
-
-        # Apply chat template
-        text = self.instruct_processor.apply_chat_template(
-            messages, tokenize=False, add_generation_prompt=True
-        )
-
-        inputs = self.instruct_processor(
-            text=[text],
-            return_tensors="pt",
-            padding=True,
-        )
-
-        # Generate response using vision config (low temp for consistent JSON)
-        logger.debug(f"Instruct inference config: max_new_tokens={vision_config.max_new_tokens}, "
-                     f"temp={vision_config.temperature}")
-
-        with torch.no_grad():
-            outputs = self.instruct_model.generate(
-                **inputs,
-                max_new_tokens=vision_config.max_new_tokens,
-                do_sample=vision_config.do_sample,
-                temperature=vision_config.temperature,
-                top_p=vision_config.top_p,
-                repetition_penalty=vision_config.repetition_penalty,
-            )
-
-        # Decode response
-        response_text = self.instruct_processor.decode(
-            outputs[0], skip_special_tokens=True
-        )
-
-        # Parse JSON from response
-        return self._parse_json_response(response_text)
+        return messages

     def _parse_json_response(self, response: str) -> dict[str, Any]:
-        """Parse JSON response from …
+        """Parse JSON response from model."""
         try:
             # Try to extract JSON from response
             json_match = re.search(r'\{[\s\S]*\}', response)

@@ -484,7 +337,7 @@ IMPORTANT: Return ONLY valid JSON, no additional text."""
             json_str = json_match.group()
             return json.loads(json_str)
         else:
-            logger.warning("No JSON found in …
+            logger.warning("No JSON found in response")
             return self._get_fallback_response("No JSON in response")
         except json.JSONDecodeError as e:
             logger.warning(f"Failed to parse JSON: {e}")

@@ -533,6 +386,8 @@ class RealEmbeddingModel:

     Uses the official Qwen3VLEmbedder from QwenLM/Qwen3-VL-Embedding.
     The model handles last-token pooling and L2 normalization internally.
+
+    Model: Qwen/Qwen3-VL-Embedding-2B (2048-dim output)
     """

     def __init__(self, model, processor):

@@ -557,7 +412,7 @@ class RealEmbeddingModel:
             text: Input text to embed

         Returns:
-            List of floats representing the embedding (…
+            List of floats representing the embedding (2048-dim for 2B model)
         """
         try:
             # Use official process() API - expects list of dicts

@@ -569,8 +424,8 @@ class RealEmbeddingModel:

         except Exception as e:
             logger.error(f"Embedding generation failed: {e}")
-            # Return zero vector as fallback (…
-            hidden_size = getattr(self.model.model.config, "hidden_size", …
+            # Return zero vector as fallback (2048-dim per Qwen3-VL-Embedding-2B)
+            hidden_size = getattr(self.model.model.config, "hidden_size", 2048)
             return [0.0] * hidden_size

     def embed_batch(self, texts: list[str]) -> list[list[float]]:

@@ -584,7 +439,7 @@ class RealEmbeddingModel:
             return [emb.cpu().tolist() for emb in embeddings]
         except Exception as e:
             logger.error(f"Batch embedding generation failed: {e}")
-            hidden_size = getattr(self.model.model.config, "hidden_size", …
+            hidden_size = getattr(self.model.model.config, "hidden_size", 2048)
             return [[0.0] * hidden_size for _ in texts]

@@ -597,7 +452,7 @@ class RealRerankerModel:
     - Creates a binary linear layer: weight = yes_weight - no_weight
     - Scores = sigmoid(linear(last_token_hidden_state))

-    …
+    Model: Qwen/Qwen3-VL-Reranker-2B
     """

     def __init__(self, model, processor):
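End to end, the new call surface reduces to a single generate pass; a minimal usage sketch (file path and context string are illustrative, not from the repo):

```python
from PIL import Image
from models.real import RealModelStack

stack = RealModelStack().load_all()  # ~38-43GB across the 4 L4s

image = Image.open("photos/unit_12_kitchen.jpg")  # hypothetical path
result = stack.vision.analyze_image(image, context="Kitchen adjacent to burn unit")

print(result["zone"]["classification"], result["condition"]["level"])
for loc in result.get("recommended_sampling_locations", []):
    print(f"[{loc['priority']}] {loc['sample_type']}: {loc['description']}")
```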
rag/vectorstore.py
CHANGED

@@ -22,10 +22,10 @@ class MockEmbeddingFunction:
     """Mock embedding function for local development.

     Generates deterministic pseudo-embeddings based on text hash.
-    Produces …
+    Produces 2048-dimensional vectors (matches Qwen3-VL-Embedding-2B).
     """

-    EMBEDDING_DIM = …
+    EMBEDDING_DIM = 2048  # Per Qwen3-VL-Embedding-2B hidden_size

     def __call__(self, input: list[str]) -> list[list[float]]:
         """Generate mock embeddings for a list of texts."""

@@ -67,7 +67,7 @@ class SharedEmbeddingFunction:
     For ChromaDB compatibility, this wraps the model stack's embedding model.
     """

-    EMBEDDING_DIM = …
+    EMBEDDING_DIM = 2048  # Per Qwen3-VL-Embedding-2B hidden_size

     def __call__(self, input: list[str]) -> list[list[float]]:
         """Generate embeddings using the shared model from model stack."""
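The 8B-to-2B swap changes vector dimensionality, which is why the CLAUDE.md workflow adds a ChromaDB rebuild step. A minimal sketch of wiring the 2048-dim function into ChromaDB 0.4.x (collection name is hypothetical):

```python
import chromadb

from rag.vectorstore import MockEmbeddingFunction

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="fdam_kb",                              # hypothetical name
    embedding_function=MockEmbeddingFunction(),  # 2048-dim vectors
)
collection.add(ids=["doc-1"], documents=["Tape-lift sampling procedure for soot..."])
hits = collection.query(query_texts=["tape lift procedure"], n_results=1)
print(hits["documents"][0][0])
```

A collection indexed at one dimensionality rejects vectors of another, hence the full rebuild after the model swap.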
requirements.txt
CHANGED

@@ -5,6 +5,9 @@ accelerate
 qwen-vl-utils>=0.0.14
 torchvision

+# vLLM for FP8 quantized model inference (>=0.11.0 required for Qwen3-VL support)
+vllm>=0.11.0
+
 # UI
 gradio>=6.0.0,<7.0.0
scripts/qwen3_vl/__init__.py
CHANGED

@@ -4,8 +4,8 @@ Source: https://github.com/QwenLM/Qwen3-VL-Embedding
 License: Apache 2.0

 These are the official loading classes for:
-- Qwen/Qwen3-VL-Embedding-8B
-- Qwen/Qwen3-VL-Reranker-8B
+- Qwen/Qwen3-VL-Embedding-2B (or 8B)
+- Qwen/Qwen3-VL-Reranker-2B (or 8B)
 """

 from scripts.qwen3_vl.qwen3_vl_embedding import Qwen3VLEmbedder, Qwen3VLForEmbedding