KinetoLabs Claude Opus 4.5 committed on
Commit b2fe3f4 · 1 Parent(s): a65b765

Align vLLM config with official Qwen3-VL model card

Changes:
- Remove VLLM_USE_V1=0 (V0 is removed in vLLM 0.11+, must use V1)
- Remove dtype="float16" (FP8 model should auto-detect)
- Change gpu_memory_utilization 0.90→0.70 (official recommendation)
- Remove enforce_eager=True (let vLLM default per official)

Root cause: the V0 engine doesn't exist in vLLM >=0.11.0, and Qwen3-VL
requires vLLM >=0.11.0, so V1 must be made to work with the official config.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
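
Since the breakage here comes from running against the wrong engine version, a fail-fast check is a cheap safety net: V0 is gone in vLLM >=0.11.0, so VLLM_USE_V1=0 cannot be honored there. A minimal sketch of such a guard, assuming packaging is installed; this helper is illustrative and is not part of models/real.py:

    # Illustrative guard - not in models/real.py. Fails fast if the installed
    # vLLM predates the V0 removal / Qwen3-VL floor named in this commit.
    from importlib.metadata import version
    from packaging.version import Version

    MIN_VLLM = Version("0.11.0")  # Qwen3-VL requires vLLM >= 0.11.0

    def check_vllm_version() -> None:
        installed = Version(version("vllm"))
        if installed < MIN_VLLM:
            raise RuntimeError(
                f"vLLM {installed} is installed, but Qwen3-VL needs >= {MIN_VLLM}; "
                "note the V0 engine (VLLM_USE_V1=0) no longer exists in 0.11+."
            )

Because importlib.metadata reads package metadata without importing vLLM itself, such a check can run before the first vLLM import and surface an actionable message instead of an obscure engine-init failure.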

Files changed (1)
  1. models/real.py +4 -6
models/real.py CHANGED

@@ -15,9 +15,7 @@ Model Loading:
 import os
 
 # vLLM environment variables - MUST be set before importing vLLM
-# Force values (not setdefault) to override any pre-existing
-# Force V0 engine (V1 has multi-GPU initialization issues)
-os.environ["VLLM_USE_V1"] = "0"
+# Note: V0 engine is removed in vLLM 0.11+, so we must use V1
 
 # Force spawn method for tensor parallelism workers
 # See: https://github.com/vllm-project/vllm/issues/17618
@@ -93,10 +91,10 @@ class RealModelStack:
             model=settings.vision_model,
             tensor_parallel_size=settings.vllm_tensor_parallel_size,
             trust_remote_code=True,
-            dtype="float16",  # Explicit dtype (RTX 4090D success config)
-            gpu_memory_utilization=0.90,  # Higher utilization for FP8 model
+            # dtype removed - FP8 model auto-detects native quantization
+            gpu_memory_utilization=0.70,  # Official model card recommendation
             max_model_len=settings.vllm_max_model_len,
-            enforce_eager=True,  # Disable CUDA graphs for multi-GPU stability
+            # enforce_eager removed - let vLLM default (False) per official
         )
 
         # Load processor for chat template formatting
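
Putting the hunks together, the post-commit setup leans on vLLM defaults wherever the model card doesn't say otherwise. A standalone sketch of the resulting call, with placeholders where models/real.py reads from settings; the model name, parallel size, and max length below are illustrative assumptions, as is the exact spawn env var (the assignment sits outside the hunk context):

    # Sketch of the post-commit setup with placeholder values; adjust to your
    # deployment. Only gpu_memory_utilization=0.70 comes from the model card.
    import os

    # Mirrors the "force spawn for tensor parallelism workers" comment above;
    # the env var name is our assumption, since the assignment is not shown.
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

    from vllm import LLM

    llm = LLM(
        model="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8",  # placeholder checkpoint id
        tensor_parallel_size=2,       # placeholder for settings.vllm_tensor_parallel_size
        trust_remote_code=True,
        # dtype omitted: vLLM picks up the checkpoint's FP8 quantization config
        gpu_memory_utilization=0.70,  # official model card recommendation
        max_model_len=32768,          # placeholder for settings.vllm_max_model_len
        # enforce_eager omitted: defaults to False, keeping CUDA graphs enabled
    )

Dropping dtype matters more than it looks: passing dtype="float16" to an FP8 checkpoint can fight the checkpoint's own quantization config, whereas leaving it unset lets vLLM resolve the dtype from the model itself.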