Spaces: Alovestocode/ZeroGPU-LLM-Inference
Alikestocode committed on Nov 11, 2025

Commit f886036

1 Parent(s): 8fd14bc

Remove unsupported vLLM device kwarg

Files changed (1)

app.py CHANGED

@@ -261,9 +261,8 @@ def load_vllm_model(model_name: str):
             except Exception:
                 print(f"  → FP8 quantization not available, falling back to bf16")
-        # Explicitly select CUDA device and single-process executor
-        llm_kwargs["device"] = "cuda" if torch.cuda.is_available() else "cpu"
+        # vLLM will now detect the CUDA device via torch / environment settings above
         print(f"  → Loading with vLLM (continuous batching, PagedAttention)...")
         llm = LLM(**llm_kwargs)
         VLLM_MODELS[model_name] = llm
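
Note: this commit removes a keyword argument that the installed vLLM no longer accepts. A more defensive pattern, sketched below, filters a kwargs dict against the fields an argument dataclass actually declares before the constructor is called. This is only an illustration and is not part of app.py: `StandInEngineArgs` and `drop_unsupported_kwargs` are hypothetical names, and whether vLLM's own argument class can be inspected this way is an assumption that should be checked against the installed version.

```python
import dataclasses
from typing import Any, Dict, Optional


@dataclasses.dataclass
class StandInEngineArgs:
    """Hypothetical stand-in for an engine's argument dataclass (not vLLM's)."""
    model: str
    dtype: str = "auto"
    quantization: Optional[str] = None


def drop_unsupported_kwargs(kwargs: Dict[str, Any], args_cls: type) -> Dict[str, Any]:
    """Return a copy of kwargs keeping only keys that are fields of args_cls."""
    allowed = {f.name for f in dataclasses.fields(args_cls)}
    dropped = sorted(set(kwargs) - allowed)
    if dropped:
        # Report what was removed so the change is visible in the logs.
        print(f"  → Dropping unsupported kwargs: {dropped}")
    return {k: v for k, v in kwargs.items() if k in allowed}


# "device" is not a field of the stand-in class, so it is filtered out.
llm_kwargs = {"model": "some/model", "dtype": "bfloat16", "device": "cuda"}
filtered = drop_unsupported_kwargs(llm_kwargs, StandInEngineArgs)
```

The helper returns a new dict and leaves `llm_kwargs` untouched, so the original settings remain available for logging or for a fallback code path.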