does this even run on intel gpus?

by Thomas98519864 - opened Apr 29

Thomas98519864

Apr 29

Has anyone actually gotten Intel/Qwen3.6-27B-int4-AutoRound to produce coherent output? On 3× Arc Pro B60 with vLLM 0.20.0 (XPU build from the upstream Dockerfile.xpu), the server starts cleanly, weights load, KV cache allocates, and /v1/chat/completions returns multilingual gibberish at both TP=1 and TP=2.

Are these GPUs a joke to whoever ships them, or what? Entire single-vendor stack quant, hardware, container and not one coherent token comes out the other end.

Is there any working version?

Sources:

wenhuach

Intel org Apr 29

Thanks for reporting the issue. Could you share the prompts you’re using? That would help us reproduce the behavior on other devices and determine whether it’s a model issue or device-specific.

mizuhoid

Apr 29

Hi I also reproduced this on my Arc Pro B60 with vLLM chat (both English and other).
And I also found by using code listed here (chat.py), partially (i.e. English) I can generate better texts (not for other langs). Some tokenizing issue with XPU vLLM...?

vLLM chat

chat.py

HuggingZhen

Intel org May 8

Hi, I use the following CMD and get the correct answers.

python3 examples/basic/offline_inference/generate.py --model /models/Qwen3.6-27B-int4-AutoRound --enforce-eager --gpu-memory-utilization 0.6 --max-model-len 8192 --tensor-parallel-size 2

# pip list | grep vllm
vllm                                     0.19.1rc1.dev223+g620e8924d.xpu
vllm-xpu-kernels                         0.1.5

Can you try generate.py this script and my bash CMD? Is this https://github.com/vllm-project/vllm/blob/v0.20.0/docker/Dockerfile.xpu your Dockerfile?

slashclee

May 8

Got vLLM from git to load this AutoRound model:

building vLLM-intel docker image

➜ git clone https://github.com/vllm-project/vllm
➜ cd vllm
➜ git checkout -b verified-on-b70 c6235ed1803e31bafedd90d53766e5705c570780
➜ docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=16g

running it

➜ docker run -it --rm --network=host --ipc=host --device /dev/dri --privileged \
     -v /home/models/vllm:/root/.cache/vllm \
     -v /home/models:/models \
     -e HF_HUB_CACHE=/models \
     -e HF_TOKEN \
     -e VLLM_XPU_ENABLE_XPU_GRAPH=0 \
     vllm-xpu-env Intel/Qwen3.6-27B-int4-AutoRound \
     --kv-cache-dtype turboquant_k8v4 \
     --max-model-len 256K \
     --enable-auto-tool-choice \
     --tool-call-parser qwen3_xml \
     --reasoning-parser qwen3

It generates working useful output for one or two responses and then devolves into gibberish or infinite loops.

HuggingZhen

Intel org May 13

Hi, it works for me. I used your branch and commands.


# git branch
# main
# * verified-on-b70

docker run -it --rm --network=host --ipc=host --device /dev/dri --privileged   -v /models:/models      -e HF_HUB_CACHE=/models      -e HF_TOKEN      -e VLLM_XPU_ENABLE_XPU_GRAPH=0      vllm-xpu-env /models/Qwen3.6-27B-int4-AutoRound      --kv-cache-dtype turboquant_k8v4      --max-model-len 32768      --enable-auto-tool-choice      --tool-call-parser qwen3_xml      --reasoning-parser qwen3

NoNamesLeft1

May 15

I was using opencode with this model. It seemed to have trouble with tools calls while reasoning. I had it make a fishing game. It did the first draft. When I had it update the code to make a fix, it would loop. I would recommend trying this model in open code or pi dev and have it make something.

Sorry I couldn't be more specific.

keeratita

14 days ago

opencode has an issue with --tool-call-parser qwen3_xml

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment