KinetoLabs Claude Opus 4.5 committed on
Commit b2fe3f4 · 1 Parent(s): a65b765

Align vLLM config with official Qwen3-VL model card

Changes:
- Remove VLLM_USE_V1=0 (V0 is removed in vLLM 0.11+, must use V1)
- Remove dtype="float16" (FP8 model should auto-detect)
- Change gpu_memory_utilization 0.90→0.70 (official recommendation)
- Remove enforce_eager=True (let vLLM default per official)

Root cause: the V0 engine doesn't exist in vLLM >=0.11.0, and Qwen3-VL
requires vLLM >=0.11.0, so V1 must be made to work with the official config.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
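
Since the breakage here comes from running against the wrong engine version, a fail-fast check is a cheap safety net: V0 is gone in vLLM >=0.11.0, so VLLM_USE_V1=0 cannot be honored there. A minimal sketch of such a guard, assuming packaging is installed; this helper is illustrative and is not part of models/real.py:

    # Illustrative guard - not in models/real.py. Fails fast if the installed
    # vLLM predates the V0 removal / Qwen3-VL floor named in this commit.
    from importlib.metadata import version
    from packaging.version import Version

    MIN_VLLM = Version("0.11.0")  # Qwen3-VL requires vLLM >= 0.11.0

    def check_vllm_version() -> None:
        installed = Version(version("vllm"))
        if installed < MIN_VLLM:
            raise RuntimeError(
                f"vLLM {installed} is installed, but Qwen3-VL needs >= {MIN_VLLM}; "
                "note the V0 engine (VLLM_USE_V1=0) no longer exists in 0.11+."
            )

Because importlib.metadata reads package metadata without importing vLLM itself, such a check can run before the first vLLM import and surface an actionable message instead of an obscure engine-init failure.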

Files changed (1)
  1. models/real.py +4 -6
models/real.py CHANGED

@@ -15,9 +15,7 @@ Model Loading:
 import os
 
 # vLLM environment variables - MUST be set before importing vLLM
-# Force values (not setdefault) to override any pre-existing
-# Force V0 engine (V1 has multi-GPU initialization issues)
-os.environ["VLLM_USE_V1"] = "0"
+# Note: V0 engine is removed in vLLM 0.11+, so we must use V1
 
 # Force spawn method for tensor parallelism workers
 # See: https://github.com/vllm-project/vllm/issues/17618
@@ -93,10 +91,10 @@ class RealModelStack:
             model=settings.vision_model,
             tensor_parallel_size=settings.vllm_tensor_parallel_size,
             trust_remote_code=True,
-            dtype="float16",  # Explicit dtype (RTX 4090D success config)
-            gpu_memory_utilization=0.90,  # Higher utilization for FP8 model
+            # dtype removed - FP8 model auto-detects native quantization
+            gpu_memory_utilization=0.70,  # Official model card recommendation
             max_model_len=settings.vllm_max_model_len,
-            enforce_eager=True,  # Disable CUDA graphs for multi-GPU stability
+            # enforce_eager removed - let vLLM default (False) per official
         )
 
         # Load processor for chat template formatting
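
Putting the hunks together, the post-commit setup leans on vLLM defaults wherever the model card doesn't say otherwise. A standalone sketch of the resulting call, with placeholders where models/real.py reads from settings; the model name, parallel size, and max length below are illustrative assumptions, as is the exact spawn env var (the assignment sits outside the hunk context):

    # Sketch of the post-commit setup with placeholder values; adjust to your
    # deployment. Only gpu_memory_utilization=0.70 comes from the model card.
    import os

    # Mirrors the "force spawn for tensor parallelism workers" comment above;
    # the env var name is our assumption, since the assignment is not shown.
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

    from vllm import LLM

    llm = LLM(
        model="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8",  # placeholder checkpoint id
        tensor_parallel_size=2,       # placeholder for settings.vllm_tensor_parallel_size
        trust_remote_code=True,
        # dtype omitted: vLLM picks up the checkpoint's FP8 quantization config
        gpu_memory_utilization=0.70,  # official model card recommendation
        max_model_len=32768,          # placeholder for settings.vllm_max_model_len
        # enforce_eager omitted: defaults to False, keeping CUDA graphs enabled
    )

Dropping dtype matters more than it looks: passing dtype="float16" to an FP8 checkpoint can fight the checkpoint's own quantization config, whereas leaving it unset lets vLLM resolve the dtype from the model itself.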