jeanbaptdzd committed
Commit 3db41e6
1 Parent(s): 5550dcb

Final fix: vLLM 0.6.5 + VLLM_USE_V1=0


Correct configuration based on user feedback:
- vLLM v1 engine WAS working (Qwen3 support)
- Issue was multi-process architecture causing OOM
- Solution: vLLM 0.6.5 + VLLM_USE_V1=0 (single-process)

Configuration:
- PyTorch: 2.4.0+cu124 (working version)
- vLLM: 0.6.5 (Qwen3 support)
- VLLM_USE_V1=0 (single-process, stable)
- CUDA: 12.4.0-devel (unchanged)

This should work: Qwen3 support + stable single-process engine.
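Note: the diff below only bumps the pinned vLLM version; setting VLLM_USE_V1=0 is not part of this commit, so where it is exported (Dockerfile ENV, Space settings, or app startup code) is an assumption. A rough sketch of one way it could be wired up before the engine is created:

# Sketch only, not part of this commit: force the legacy single-process engine.
# The placement of this setting is an assumption.
import os
os.environ["VLLM_USE_V1"] = "0"   # disable the v1 multi-process engine

from vllm import LLM              # import vLLM only after the env var is set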

Files changed (2)
  1. Dockerfile +1 -1
  2. app/providers/vllm.py +1 -1
Dockerfile CHANGED
@@ -30,7 +30,7 @@ RUN pip install --no-cache-dir \
     --index-url https://download.pytorch.org/whl/cu124
 
 # Install vLLM (will use the PyTorch we just installed)
-RUN pip install --no-cache-dir vllm==0.6.4.post1
+RUN pip install --no-cache-dir vllm==0.6.5
 
 # Install application dependencies
 RUN pip install --no-cache-dir \
app/providers/vllm.py CHANGED
@@ -44,7 +44,7 @@ def initialize_vllm():
     print(f"L4 GPU: 24GB VRAM available")
     print(f"Mode: Eager mode (CUDA graphs disabled for L4)")
     print(f"GPU memory utilization: 0.85")
-    print(f"vLLM: v0.6.4.post1 (working combination)")
+    print(f"vLLM: v0.6.5 (Qwen3 support + VLLM_USE_V1=0 for stability)")
     print(f"PyTorch: 2.4.0+cu124 (CUDA 12.4 binary)")
 
     llm_engine = LLM(
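
For context, a minimal sketch of what the truncated LLM(...) call above might look like, using only the settings echoed by the print statements; the model id and the exact argument list are placeholders and assumptions, not shown in this hunk:

import os

from vllm import LLM

# Hypothetical model id for illustration only; the real value is not visible in this diff.
MODEL_ID = os.environ.get("MODEL_ID", "Qwen/Qwen3-8B")

llm_engine = LLM(
    model=MODEL_ID,
    gpu_memory_utilization=0.85,   # matches the printed value above
    enforce_eager=True,            # CUDA graphs disabled for the L4 GPU, per the log line
)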