Commit b2fe3f4
Parent(s): a65b765
Align vLLM config with official Qwen3-VL model card
Changes:
- Remove VLLM_USE_V1=0 (V0 is removed in vLLM 0.11+, must use V1)
- Remove dtype="float16" (FP8 model should auto-detect)
- Change gpu_memory_utilization 0.90→0.70 (official recommendation)
- Remove enforce_eager=True (use vLLM's default, per the official config)
Root cause: the V0 engine no longer exists in vLLM >= 0.11.0, and Qwen3-VL
requires vLLM >= 0.11.0, so the V1 engine must work with the official config.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
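
Since the whole fix hinges on the vLLM version, the root-cause claim is easy to assert at startup. A minimal sketch (assuming `packaging` is importable, as it is in any environment that can run vLLM; this guard is not part of the commit itself):

from importlib.metadata import version
from packaging.version import Version

# Qwen3-VL needs vLLM >= 0.11.0; from that release on only the V1 engine
# exists, so VLLM_USE_V1=0 can no longer be honored.
installed = Version(version("vllm"))
if installed < Version("0.11.0"):
    raise RuntimeError(
        f"vLLM {installed} is too old for Qwen3-VL; upgrade to >= 0.11.0"
    )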
- models/real.py +4 -6
models/real.py (CHANGED)

@@ -15,9 +15,7 @@ Model Loading:
 import os
 
 # vLLM environment variables - MUST be set before importing vLLM
-#
-# Force V0 engine (V1 has multi-GPU initialization issues)
-os.environ["VLLM_USE_V1"] = "0"
+# Note: V0 engine is removed in vLLM 0.11+, so we must use V1
 
 # Force spawn method for tensor parallelism workers
 # See: https://github.com/vllm-project/vllm/issues/17618
@@ -93,10 +91,10 @@ class RealModelStack:
             model=settings.vision_model,
             tensor_parallel_size=settings.vllm_tensor_parallel_size,
             trust_remote_code=True,
-            dtype="float16",
-            gpu_memory_utilization=0.90,
+            # dtype removed - FP8 model auto-detects native quantization
+            gpu_memory_utilization=0.70,  # Official model card recommendation
             max_model_len=settings.vllm_max_model_len,
-            enforce_eager=True,
+            # enforce_eager removed - let vLLM default (False) per official
         )
 
         # Load processor for chat template formatting
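
For reference, a minimal sketch of the engine construction after this commit. The model name, tensor-parallel size, and max_model_len below are illustrative stand-ins for the settings values (not taken from the repo), and the spawn env var is the standard vLLM knob behind the "force spawn" comment, whose exact line is cut off in the diff above:

import os

# Must be set before importing vLLM; corresponds to the "force spawn" comment
# (see https://github.com/vllm-project/vllm/issues/17618)
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8",  # illustrative FP8 checkpoint; quantization auto-detects
    tensor_parallel_size=2,                      # stand-in for settings.vllm_tensor_parallel_size
    trust_remote_code=True,
    gpu_memory_utilization=0.70,                 # official model card recommendation
    max_model_len=8192,                          # stand-in for settings.vllm_max_model_len
)

out = llm.generate(["Describe the scene."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)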