Peter Larnholt committed on
Commit 3356350 · 1 Parent(s): 4142581

Disable guided decoding to resolve chat completion errors


The vLLM server was returning 500 errors on chat requests due to
guided decoding (outlines) import issues. Since basic chat doesn't
require structured generation, guided decoding is disabled entirely
and the airportsdata dependency removed.

Files changed (2)
  1. app.py +1 -0
  2. requirements.txt +0 -3
app.py CHANGED
@@ -27,6 +27,7 @@ VLLM_ARGS = [
     "--gpu-memory-utilization", "0.90",
     "--trust-remote-code",
     "--disable-log-requests",  # reduce log noise
+    "--disable-guided-decoding",  # skip guided decoding (outlines) to avoid import issues
 ]
 if "AWQ" in MODEL_ID.upper():
     VLLM_ARGS += ["--quantization", "awq_marlin"]  # faster AWQ kernel if available
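The hunk above appends one flag to the argument list that app.py passes to the vLLM server. A minimal sketch of that assembly logic, with the list wrapped in a hypothetical `build_vllm_args` helper and made-up model ids (the flags themselves are taken from the diff):

```python
# Sketch of how app.py assembles the vLLM CLI arguments after this commit.
# build_vllm_args and the example model ids are illustrative, not from the repo.
def build_vllm_args(model_id: str) -> list[str]:
    args = [
        "--gpu-memory-utilization", "0.90",
        "--trust-remote-code",
        "--disable-log-requests",     # reduce log noise
        "--disable-guided-decoding",  # skip guided decoding (outlines) to avoid import issues
    ]
    if "AWQ" in model_id.upper():
        args += ["--quantization", "awq_marlin"]  # faster AWQ kernel if available
    return args

print("--quantization" in build_vllm_args("org/model-AWQ"))   # AWQ model gets the marlin kernel
print("--quantization" in build_vllm_args("org/model-fp16"))  # non-AWQ model does not
```

Every model id, AWQ or not, now carries `--disable-guided-decoding`, which is what sidesteps the outlines import at server startup.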
requirements.txt CHANGED
@@ -9,6 +9,3 @@ vllm==0.6.3.post1
 torch==2.4.0
 transformers>=4.44
 accelerate>=0.30
-
-# Required for vLLM guided decoding (even if not actively used)
-airportsdata>=20240400