does this even run on intel gpus?

#2
by Thomas98519864 - opened

Has anyone actually gotten Intel/Qwen3.6-27B-int4-AutoRound to produce coherent output? On 3× Arc Pro B60 with vLLM 0.20.0 (XPU build from the upstream Dockerfile.xpu), the server starts cleanly, weights load, KV cache allocates, and /v1/chat/completions returns multilingual gibberish at both TP=1 and TP=2.

Are these GPUs a joke to whoever ships them, or what? Entire single-vendor stack quant, hardware, container and not one coherent token comes out the other end.

Is there any working version?

Sources:

Intel org

Thanks for reporting the issue. Could you share the prompts you’re using? That would help us reproduce the behavior on other devices and determine whether it’s a model issue or device-specific.

Hi I also reproduced this on my Arc Pro B60 with vLLM chat (both English and other).
And I also found by using code listed here (chat.py), partially (i.e. English) I can generate better texts (not for other langs). Some tokenizing issue with XPU vLLM...?

vLLM chat

vllm chat

chat.py

chat.py

Hi, I use the following CMD and get the correct answers.

python3 examples/basic/offline_inference/generate.py --model /models/Qwen3.6-27B-int4-AutoRound --enforce-eager --gpu-memory-utilization 0.6 --max-model-len 8192 --tensor-parallel-size 2

# pip list | grep vllm
vllm                                     0.19.1rc1.dev223+g620e8924d.xpu
vllm-xpu-kernels                         0.1.5

image

Can you try generate.py this script and my bash CMD? Is this https://github.com/vllm-project/vllm/blob/v0.20.0/docker/Dockerfile.xpu your Dockerfile?

Got vLLM from git to load this AutoRound model:

building vLLM-intel docker image

➜ git clone https://github.com/vllm-project/vllm
➜ cd vllm
➜ git checkout -b verified-on-b70 c6235ed1803e31bafedd90d53766e5705c570780
➜ docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=16g

running it

➜ docker run -it --rm --network=host --ipc=host --device /dev/dri --privileged \
     -v /home/models/vllm:/root/.cache/vllm \
     -v /home/models:/models \
     -e HF_HUB_CACHE=/models \
     -e HF_TOKEN \
     -e VLLM_XPU_ENABLE_XPU_GRAPH=0 \
     vllm-xpu-env Intel/Qwen3.6-27B-int4-AutoRound \
     --kv-cache-dtype turboquant_k8v4 \
     --max-model-len 256K \
     --enable-auto-tool-choice \
     --tool-call-parser qwen3_xml \
     --reasoning-parser qwen3

It generates working useful output for one or two responses and then devolves into gibberish or infinite loops.

Hi, it works for me. I used your branch and commands.


# git branch
# main
# * verified-on-b70

docker run -it --rm --network=host --ipc=host --device /dev/dri --privileged   -v /models:/models      -e HF_HUB_CACHE=/models      -e HF_TOKEN      -e VLLM_XPU_ENABLE_XPU_GRAPH=0      vllm-xpu-env /models/Qwen3.6-27B-int4-AutoRound      --kv-cache-dtype turboquant_k8v4      --max-model-len 32768      --enable-auto-tool-choice      --tool-call-parser qwen3_xml      --reasoning-parser qwen3

image

image

image

I was using opencode with this model. It seemed to have trouble with tools calls while reasoning. I had it make a fishing game. It did the first draft. When I had it update the code to make a fix, it would loop. I would recommend trying this model in open code or pi dev and have it make something.

Sorry I couldn't be more specific.

Sign up or log in to comment