jeanbaptdzd committed
Commit 3db41e6
1 Parent(s): 5550dcb

Final fix: vLLM 0.6.5 + VLLM_USE_V1=0


Correct configuration based on user feedback:
- vLLM v1 engine WAS working (Qwen3 support)
- Issue was multi-process architecture causing OOM
- Solution: vLLM 0.6.5 + VLLM_USE_V1=0 (single-process)

Configuration:
- PyTorch: 2.4.0+cu124 (working version)
- vLLM: 0.6.5 (Qwen3 support)
- VLLM_USE_V1=0 (single-process, stable)
- CUDA: 12.4.0-devel (unchanged)

This should work: Qwen3 support + stable single-process engine.
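Note: the diff below only bumps the pinned vLLM version; setting VLLM_USE_V1=0 is not part of this commit, so where it is exported (Dockerfile ENV, Space settings, or app startup code) is an assumption. A rough sketch of one way it could be wired up before the engine is created:

# Sketch only, not part of this commit: force the legacy single-process engine.
# The placement of this setting is an assumption.
import os
os.environ["VLLM_USE_V1"] = "0"   # disable the v1 multi-process engine

from vllm import LLM              # import vLLM only after the env var is set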

Files changed (2)
  1. Dockerfile +1 -1
  2. app/providers/vllm.py +1 -1
Dockerfile CHANGED
@@ -30,7 +30,7 @@ RUN pip install --no-cache-dir \
     --index-url https://download.pytorch.org/whl/cu124
 
 # Install vLLM (will use the PyTorch we just installed)
-RUN pip install --no-cache-dir vllm==0.6.4.post1
+RUN pip install --no-cache-dir vllm==0.6.5
 
 # Install application dependencies
 RUN pip install --no-cache-dir \
app/providers/vllm.py CHANGED
@@ -44,7 +44,7 @@ def initialize_vllm():
     print(f"L4 GPU: 24GB VRAM available")
     print(f"Mode: Eager mode (CUDA graphs disabled for L4)")
     print(f"GPU memory utilization: 0.85")
-    print(f"vLLM: v0.6.4.post1 (working combination)")
+    print(f"vLLM: v0.6.5 (Qwen3 support + VLLM_USE_V1=0 for stability)")
     print(f"PyTorch: 2.4.0+cu124 (CUDA 12.4 binary)")
 
     llm_engine = LLM(
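
For context, a minimal sketch of what the truncated LLM(...) call above might look like, using only the settings echoed by the print statements; the model id and the exact argument list are placeholders and assumptions, not shown in this hunk:

import os

from vllm import LLM

# Hypothetical model id for illustration only; the real value is not visible in this diff.
MODEL_ID = os.environ.get("MODEL_ID", "Qwen/Qwen3-8B")

llm_engine = LLM(
    model=MODEL_ID,
    gpu_memory_utilization=0.85,   # matches the printed value above
    enforce_eager=True,            # CUDA graphs disabled for the L4 GPU, per the log line
)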