webml-community/gemma-4-webgpu-kernels

Fix Windows WebGPU subgroup generation corruption

by igorls - opened 3 days ago

base: refs/heads/main ← from: refs/pr/8

+7258 -7196

igorls

3 days ago

This fixes a Windows/Chrome WebGPU failure mode where Gemma 4 E2B loads but generates repetitive corrupted text such as "Aula..." / "key to the database...".

Changes:

- Disable unstable subgroup/subgroup-matrix WebGPU features for this Space runtime.
- Add retrying/annotated fetch handling for transient Hugging Face asset fetch failures.
- Include EOS id 50 in the fallback EOS list.
- Force non-exact variants for variable-length prefill ops whose manifests reject exact mode.

Verified locally:

- node --check gemma-4-e2b.js
- git diff --check
- Fresh browser load reached Ready.
- Multi-turn prompts produced normal responses without the repetition loop.

Fix Windows WebGPU subgroup generation corruption (a632b11c)

Xenova

WebML Community org 2 days ago

Hi @igorls -- thanks so much for testing and fixing this!

Would you mind explaining your changes a bit?

> Disable unstable subgroup/subgroup-matrix WebGPU features for this Space runtime.

Are you saying that there are upstream errors in Chrome on Windows with subgroup/subgroup-matrix features? Or are our kernels themselves unstable, but with a few modifications, we could get it working on Windows?

> Add retrying/annotated fetch handling for transient Hugging Face asset fetch failures.

I didn't notice these kinds of issues, but I suppose this could be considered.

> Include EOS id 50 in the fallback EOS list.

Indeed, we can add this token. Right now, the demo doesn't do anything tool-related, but this should be supported in other applications:

{
  "id": 50,
  "content": "<|tool_response>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},

> Force non-exact variants for variable-length prefill ops whose manifests reject exact mode.

Similarly, what issues did you find here? Did you see any errors? If so, we could probably relax the strictness for some of these ops. Any information that the WebGPU API gives you would be very useful.

Thanks!

igorls

2 days ago

Thanks for looking at this!

A bit more detail/context from my side:

I would not confidently call the subgroup issue an upstream Chrome bug yet. What I can say is narrower:

- On Windows Chrome/WebGPU with an NVIDIA RTX PRO 6000 Blackwell card, the model loaded successfully but generation was corrupted/repetitive.
- The visible symptom was repeated text like "Aula..." / "key to the database...", even for simple prompts like "hello".
- Disabling subgroups and chromium-experimental-subgroup-matrix made the same local app generate normal responses again.
- After that change, a fresh load plus multi-turn smoke test worked:
  - "Write a haiku about on-device AI" produced a normal haiku.
  - Follow-up "hello" produced a normal greeting.

So my current hypothesis is: either Chrome/Dawn/D3D12/NVIDIA is miscompiling or mishandling one of the subgroup paths, or one of the subgroup-specialized kernels has an assumption that holds on the M4/Metal backend but not on this Windows/NVIDIA backend. I do not have enough evidence yet to say which.
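For reference, the workaround amounts to leaving those two feature names out of the device request. A minimal sketch, assuming the runtime builds its `requiredFeatures` list from what the adapter reports; `pickRequiredFeatures` and `BLOCKED_FEATURES` are hypothetical names for illustration, not identifiers from gemma-4-e2b.js:

```javascript
// Feature names taken from the discussion above.
const BLOCKED_FEATURES = new Set([
  "subgroups",
  "chromium-experimental-subgroup-matrix",
]);

// Hypothetical helper: keep the wanted features that the adapter supports,
// minus anything on the block list.
function pickRequiredFeatures(adapterFeatures, wanted, blocked = BLOCKED_FEATURES) {
  const supported = new Set(adapterFeatures);
  return wanted.filter((name) => supported.has(name) && !blocked.has(name));
}

// Usage in a browser (not executed here):
// const adapter = await navigator.gpu.requestAdapter();
// const device = await adapter.requestDevice({
//   requiredFeatures: pickRequiredFeatures(adapter.features, ["shader-f16", "subgroups"]),
// });
```

Because the block list is a parameter, it could be narrowed later to a specific backend or cleared entirely once the root cause is known.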

The fetch retry change is unrelated to the generation corruption. I added it after seeing intermittent Failed to fetch while testing. Happy to drop it from this PR if you prefer keeping the patch focused.
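The retry idea can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in the PR: `fetchWithRetry`, `backoffDelayMs`, and the option names are hypothetical, and the policy (retry network failures and 5xx responses with exponential backoff, give up immediately on other non-OK statuses) is one reasonable choice among several.

```javascript
// Hypothetical helper: the delay doubles with each attempt (exponential backoff).
function backoffDelayMs(attempt, baseMs = 250) {
  return baseMs * 2 ** attempt;
}

// Hypothetical wrapper around fetch; `fetchImpl` is injectable for testing.
async function fetchWithRetry(url, { retries = 3, baseMs = 250, fetchImpl = globalThis.fetch } = {}) {
  let lastError;
  let attempts = 0;
  while (attempts <= retries) {
    attempts++;
    try {
      const response = await fetchImpl(url);
      if (response.ok) return response;
      lastError = new Error(`HTTP ${response.status}`);
      // Anything below 500 is treated as non-transient, so stop retrying.
      if (response.status < 500) break;
    } catch (error) {
      lastError = error; // network-level failure such as "Failed to fetch"
    }
    if (attempts <= retries) {
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempts - 1, baseMs)));
    }
  }
  // Annotate the final error with the URL and the number of attempts made.
  throw new Error(`Failed to fetch ${url} after ${attempts} attempt(s): ${lastError.message}`, {
    cause: lastError,
  });
}
```

Annotating the thrown error with the URL makes an intermittent failure easier to attribute to a specific asset than a bare "Failed to fetch".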

The EOS id change came from generation_config.json, which lists [1, 106, 50], while the fallback only had [1, 106].
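A small sketch of how the fallback could be kept in line with the config, assuming the runtime prefers `eos_token_id` from generation_config.json when it is available (the field can be a single id or a list). `resolveEosIds` and `FALLBACK_EOS_IDS` are hypothetical names, not the ones used in the bundled file:

```javascript
// Fallback ids matching the list reported from generation_config.json above.
const FALLBACK_EOS_IDS = [1, 106, 50];

// Hypothetical helper: prefer ids from the parsed generation config,
// otherwise use the hard-coded fallback.
function resolveEosIds(generationConfig, fallback = FALLBACK_EOS_IDS) {
  const raw = generationConfig?.eos_token_id;
  if (raw === undefined || raw === null) return new Set(fallback);
  return new Set(Array.isArray(raw) ? raw : [raw]);
}

// During decoding, a token stream would stop once eosIds.has(tokenId) is true.
```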

For the variable-length prefill / exact-mode change, the concrete error I saw was:

Error: No supported WebGPU variant for com.xenova.gemma4.DenseGemv;
rejected sgmat: when guard resolved to false;
gemm: when guard resolved to false;
scalar: when guard resolved to false

The stack went through selectVariant -> prepare -> denseGemv -> Ot -> streamTokenIdsFromCache -> warmup/loadModel.

In the dynamic/variable-length prefill path (Ot), some ops were being called with exact: true, but their available variants appear to be guarded for non-exact mode only. I first saw it with DenseGemv, then hit similar variant-selection failures for attention / decode gate-up norm while narrowing the change. Forcing exact: false only on the ops whose manifests reject exact mode allowed that path to prepare.
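To illustrate the failure mode, here is a toy model of guard-based variant selection. It is not the runtime's actual implementation: the manifest shape, the guard conditions, and the context fields (`exact`, `rows`, `features`) are all assumptions made up for this sketch. Only the variant names and the error wording are taken from the message above.

```javascript
// Hypothetical manifest: each variant has a name and a `when` guard over the call context.
// The guard conditions are invented; they only model "non-exact mode only".
const denseGemvVariants = [
  { name: "sgmat", when: (ctx) => !ctx.exact && ctx.features.has("chromium-experimental-subgroup-matrix") },
  { name: "gemm", when: (ctx) => !ctx.exact && ctx.rows > 1 },
  { name: "scalar", when: (ctx) => !ctx.exact },
];

// Toy selector: return the first variant whose guard passes, or report every rejection.
function selectVariant(opName, variants, ctx) {
  const rejected = [];
  for (const variant of variants) {
    if (variant.when(ctx)) return variant.name;
    rejected.push(`${variant.name}: when guard resolved to false`);
  }
  throw new Error(`No supported WebGPU variant for ${opName}; rejected ${rejected.join("; ")}`);
}

// With exact: true every guard in this toy manifest fails, so selection throws.
// With exact: false the scalar variant is accepted.
const ctx = { exact: false, rows: 1, features: new Set() };
const chosen = selectVariant("com.xenova.gemma4.DenseGemv", denseGemvVariants, ctx);
```

In this model, forcing `exact: false` at the call site and relaxing the guards in the manifest are two ways to reach the same outcome, which is why the fix could plausibly live in either place.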

That said, I agree the better long-term fix may be inside the runtime's variant selection / strictness handling rather than patching the bundled file directly. This PR was meant as a repro-derived mitigation and starting point.

Environment:

- OS: Windows
- Browser: Chrome
- GPU: NVIDIA RTX PRO 6000 Blackwell
- WebGPU backend: Chrome on Windows, so presumably Dawn/D3D12
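For anyone reproducing this, the adapter details that WebGPU exposes can be captured with something like the sketch below. `describeAdapter` is a hypothetical helper; it reads the standard `GPUAdapter.info` fields and the adapter's feature set, and some fields may come back as empty strings depending on the browser.

```javascript
// Hypothetical helper: summarize a GPUAdapter as a plain, JSON-friendly object.
function describeAdapter(adapter) {
  const info = adapter.info ?? {};
  return {
    vendor: info.vendor ?? "",
    architecture: info.architecture ?? "",
    device: info.device ?? "",
    description: info.description ?? "",
    // adapter.features is set-like, so it can be spread into an array.
    features: [...adapter.features].sort(),
  };
}

// Usage in a browser console (not executed here):
// const adapter = await navigator.gpu.requestAdapter();
// console.log(JSON.stringify(describeAdapter(adapter), null, 2));
```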

Happy to split this into smaller PRs if useful:

- EOS fallback id only
- Windows subgroup workaround
- Variable-prefill exact-mode guard investigation
- Optional fetch retry
