does this even run on intel gpus?
Has anyone actually gotten Intel/Qwen3.6-27B-int4-AutoRound to produce coherent output? On 3× Arc Pro B60 with vLLM 0.20.0 (XPU build from the upstream Dockerfile.xpu), the server starts cleanly, weights load, KV cache allocates, and /v1/chat/completions returns multilingual gibberish at both TP=1 and TP=2.
Are these GPUs a joke to whoever ships them, or what? Entire single-vendor stack quant, hardware, container and not one coherent token comes out the other end.
Is there any working version?
Sources:
Thanks for reporting the issue. Could you share the prompts you’re using? That would help us reproduce the behavior on other devices and determine whether it’s a model issue or device-specific.
Hi, I use the following CMD and get the correct answers.
python3 examples/basic/offline_inference/generate.py --model /models/Qwen3.6-27B-int4-AutoRound --enforce-eager --gpu-memory-utilization 0.6 --max-model-len 8192 --tensor-parallel-size 2
# pip list | grep vllm
vllm 0.19.1rc1.dev223+g620e8924d.xpu
vllm-xpu-kernels 0.1.5
Can you try generate.py this script and my bash CMD? Is this https://github.com/vllm-project/vllm/blob/v0.20.0/docker/Dockerfile.xpu your Dockerfile?
Got vLLM from git to load this AutoRound model:
building vLLM-intel docker image
➜ git clone https://github.com/vllm-project/vllm
➜ cd vllm
➜ git checkout -b verified-on-b70 c6235ed1803e31bafedd90d53766e5705c570780
➜ docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=16g
running it
➜ docker run -it --rm --network=host --ipc=host --device /dev/dri --privileged \
-v /home/models/vllm:/root/.cache/vllm \
-v /home/models:/models \
-e HF_HUB_CACHE=/models \
-e HF_TOKEN \
-e VLLM_XPU_ENABLE_XPU_GRAPH=0 \
vllm-xpu-env Intel/Qwen3.6-27B-int4-AutoRound \
--kv-cache-dtype turboquant_k8v4 \
--max-model-len 256K \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3
It generates working useful output for one or two responses and then devolves into gibberish or infinite loops.
Hi, it works for me. I used your branch and commands.
# git branch
# main
# * verified-on-b70
docker run -it --rm --network=host --ipc=host --device /dev/dri --privileged -v /models:/models -e HF_HUB_CACHE=/models -e HF_TOKEN -e VLLM_XPU_ENABLE_XPU_GRAPH=0 vllm-xpu-env /models/Qwen3.6-27B-int4-AutoRound --kv-cache-dtype turboquant_k8v4 --max-model-len 32768 --enable-auto-tool-choice --tool-call-parser qwen3_xml --reasoning-parser qwen3
I was using opencode with this model. It seemed to have trouble with tools calls while reasoning. I had it make a fishing game. It did the first draft. When I had it update the code to make a fix, it would loop. I would recommend trying this model in open code or pi dev and have it make something.
Sorry I couldn't be more specific.





