Awesome. Please let me know if any other issues come up.
Yintong Lu
@Azhuvath, thanks for checking. In practice, the raw model weights alone are already close to the full 24GB budget, and vLLM still needs extra memory for the KV cache, activations, and runtime overhead. So an OOM on a single B60 is expected.
So if it works with 2 cards but OOMs on 1 card, that is consistent with the memory budget and I would not currently treat 1x B60 as a validated setup for Gemma 4 12B in vLLM.
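To make the memory budget concrete, here is a rough back-of-the-envelope sketch. It assumes roughly 12 billion parameters stored at 2 bytes each (BF16/FP16) and treats the card as 24 GiB; none of these numbers are measured, and the function name is only illustrative.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the raw weights only, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumptions (not measurements): ~12e9 parameters, 2 bytes per parameter,
# and a card with 24 GiB of device memory.
weights_gib = weight_memory_gib(12e9)
card_gib = 24.0
headroom_gib = card_gib - weights_gib

print(f"weights  ~ {weights_gib:.1f} GiB")
print(f"headroom ~ {headroom_gib:.1f} GiB for KV cache, activations, runtime")
```

Under these assumptions, less than 2 GiB would be left for everything besides the weights, which is why a single card runs out of memory while two cards do not.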
Thanks for the log. This error is different from the earlier disable_sliding_window issue.
The current failure happens earlier, during model config loading: the transformers version in the container does not yet recognize the new gemma4_unified architecture. So this is a dependency/version mismatch.
The stack we verified for Gemma-4-12B on XPU was:
- vLLM: ef3af56
- vllm-xpu-kernels: 06e909e
- torch: 2.12.0+xpu
- transformers: 5.10.0.dev0
Please also verify the transformers version inside the container and install a newer Transformers build if needed.
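One way to check what is actually installed in the container is a small script like the sketch below. The expected versions are the ones listed in this thread; the distribution names `torch` and `transformers` are assumed to match what pip reports in your environment, and the helper names are only illustrative. The vLLM and vllm-xpu-kernels entries are git commits, so they are better checked with `git log` in the source checkout than through package metadata.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported as verified in this thread (distribution names assumed).
EXPECTED = {
    "torch": "2.12.0+xpu",
    "transformers": "5.10.0.dev0",
}


def installed_versions(packages):
    """Return {name: installed version, or None if the package is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def report(expected):
    """Print one line per package, comparing installed and expected versions."""
    for name, got in installed_versions(expected).items():
        status = "OK" if got == expected[name] else "MISMATCH"
        print(f"{name}: installed={got} expected={expected[name]} [{status}]")


report(EXPECTED)
```

A `MISMATCH` on transformers would be consistent with the gemma4_unified error above.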
Hi,
At the moment, I do not have a separate prebuilt public Docker image with this exact stack to point you to. The recommended route for now is to pin the vLLM version (git checkout ef3af56) and then follow the Docker flow in the blog (docker/Dockerfile.xpu).
For Gemma-4-12B, which was only recently released, we have verified the following dependencies on XPU:
- vLLM: ef3af56
- vllm-xpu-kernels: 06e909e
- torch: 2.12.0+xpu
- transformers: 5.10.0.dev0
Please give it a try.
Hi, thanks for trying it out.
Based on the traceback, this does not look like an Intel XPU kernel/runtime failure. It is specifically a config compatibility bug in the disable_sliding_window path, triggered by the stricter config behavior in newer HF releases.
As a quick workaround, please retry without --disable-sliding-window and without forcing --max-model-len=8192. In parallel, we should submit a fix to upstream vLLM following the issue you created so that disabling sliding window does not mutate the HF config field to None.
Hi @stefan-it, awesome to see that your B70 works. I hope you have a great experience running Gemma on your device; please let me know if any issues come up.
Please try ZE_AFFINITY_MASK instead of ONEAPI_DEVICE_SELECTOR.
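If you launch from Python, a minimal sketch of this looks like the following. It assumes the variable must be set before torch or vLLM initializes the device runtime, and that device index 0 is the card you want; the helper name is hypothetical, and removing ONEAPI_DEVICE_SELECTOR is my reading of "instead of", not a verified requirement.

```python
import os


def pin_level_zero_device(device_index, env=None):
    """Select a device via ZE_AFFINITY_MASK and drop ONEAPI_DEVICE_SELECTOR.

    `env` defaults to os.environ; pass a plain dict to try it without
    touching the real process environment.
    """
    env = os.environ if env is None else env
    env.pop("ONEAPI_DEVICE_SELECTOR", None)  # assumption: avoid using both
    env["ZE_AFFINITY_MASK"] = str(device_index)
    return env


# Call this before importing torch or vllm so the runtime sees the mask.
pin_level_zero_device(0)
```

Exporting ZE_AFFINITY_MASK in the shell before starting the server is the equivalent of this.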
Hi,
Thanks for trying it out.
Could you please share more details about your software stack, along with the error log?