Commit 3db41e6
Parent: 5550dcb
Final fix: vLLM 0.6.5 + VLLM_USE_V1=0
Correct configuration based on user feedback:
- vLLM v1 engine WAS working (Qwen3 support)
- Issue was multi-process architecture causing OOM
- Solution: vLLM 0.6.5 + VLLM_USE_V1=0 (single-process)
Configuration:
- PyTorch: 2.4.0+cu124 (working version)
- vLLM: 0.6.5 (Qwen3 support)
- VLLM_USE_V1=0 (single-process, stable)
- CUDA: 12.4.0-devel (unchanged)
This should work: Qwen3 support plus a stable single-process engine. (A sketch of where VLLM_USE_V1=0 gets set follows below.)
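Note: the diffs below only bump the vLLM pin and a log line; the commit does not show where VLLM_USE_V1=0 itself is set. A minimal sketch of one plausible placement follows. The env var name comes from the commit message; putting it in the provider module before the vllm import is an assumption, and an ENV line in the Dockerfile would work just as well:

import os

# Select the legacy single-process engine, per the commit message.
# Assumption: set before vllm is imported so the engine selection
# sees it; an `ENV VLLM_USE_V1=0` in the Dockerfile is the alternative.
os.environ.setdefault("VLLM_USE_V1", "0")

from vllm import LLM  # imported after the env var is in place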
- Dockerfile +1 -1
- app/providers/vllm.py +1 -1
Dockerfile
CHANGED

@@ -30,7 +30,7 @@ RUN pip install --no-cache-dir \
     --index-url https://download.pytorch.org/whl/cu124
 
 # Install vLLM (will use the PyTorch we just installed)
-RUN pip install --no-cache-dir vllm==0.6.
+RUN pip install --no-cache-dir vllm==0.6.5
 
 # Install application dependencies
 RUN pip install --no-cache-dir \
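A small sanity check (a sketch, not part of the commit) to confirm the pins resolved as intended inside the built image:

# Run inside the container's Python to verify the Dockerfile pins.
import torch
import vllm

print(f"torch {torch.__version__}")  # expected per this commit: 2.4.0+cu124
print(f"vllm {vllm.__version__}")    # expected per this commit: 0.6.5
assert vllm.__version__ == "0.6.5", vllm.__version__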
app/providers/vllm.py
CHANGED

@@ -44,7 +44,7 @@ def initialize_vllm():
     print(f"L4 GPU: 24GB VRAM available")
     print(f"Mode: Eager mode (CUDA graphs disabled for L4)")
     print(f"GPU memory utilization: 0.85")
-    print(f"vLLM: v0.6.
+    print(f"vLLM: v0.6.5 (Qwen3 support + VLLM_USE_V1=0 for stability)")
     print(f"PyTorch: 2.4.0+cu124 (CUDA 12.4 binary)")
 
     llm_engine = LLM(
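For context, a minimal sketch of how initialize_vllm() presumably wires these values together, reconstructed from the printed settings (eager mode, 0.85 GPU memory utilization). The model id is a placeholder and the kwargs are standard vllm.LLM parameters, not copied from the actual file:

import os

os.environ.setdefault("VLLM_USE_V1", "0")  # single-process engine, per the commit

from vllm import LLM

def initialize_vllm():
    print(f"L4 GPU: 24GB VRAM available")
    print(f"Mode: Eager mode (CUDA graphs disabled for L4)")
    print(f"GPU memory utilization: 0.85")
    print(f"vLLM: v0.6.5 (Qwen3 support + VLLM_USE_V1=0 for stability)")
    print(f"PyTorch: 2.4.0+cu124 (CUDA 12.4 binary)")

    llm_engine = LLM(
        model="Qwen/Qwen3-8B",        # placeholder; the real model id is not in the diff
        gpu_memory_utilization=0.85,  # matches the printed value above
        enforce_eager=True,           # CUDA graphs disabled for the L4, per the log line
    )
    return llm_engine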