Fix Windows WebGPU subgroup generation corruption
This fixes a Windows/Chrome WebGPU failure mode where Gemma 4 E2B loads but generates repetitive corrupted text such as "Aula..." / "key to the database...".
Changes:
- Disable unstable subgroup/subgroup-matrix WebGPU features for this Space runtime.
- Add retrying/annotated fetch handling for transient Hugging Face asset fetch failures.
- Include EOS id 50 in the fallback EOS list.
- Force non-exact variants for variable-length prefill ops whose manifests reject exact mode.
Verified locally:
- node --check gemma-4-e2b.js
- git diff --check
- Fresh browser load reached Ready.
- Multi-turn prompts produced normal responses without the repetition loop.
Hi @igorls -- thanks so much for testing and fixing this!
Would you mind explaining your changes a bit?
> Disable unstable subgroup/subgroup-matrix WebGPU features for this Space runtime.
Are you saying that there are upstream errors in Chrome on Windows with the subgroup/subgroup-matrix features? Or are our kernels themselves unstable, such that a few modifications could get them working on Windows?
> Add retrying/annotated fetch handling for transient Hugging Face asset fetch failures.
I didn't notice these kinds of issues, but I suppose this could be considered.
> Include EOS id 50 in the fallback EOS list.
Indeed, we can add this token. Right now, the demo doesn't do anything tool-related, but this should be supported in other applications:
{
  "id": 50,
  "content": "<|tool_response>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
> Force non-exact variants for variable-length prefill ops whose manifests reject exact mode.
Similarly, what issues did you find here? Did you see any errors? If so, we could probably relax the strictness for some of these ops. If you can share any information the WebGPU API reports, that would be very useful.
Thanks!
Thanks for looking at this!
A bit more detail/context from my side:
I would not confidently call the subgroup issue an upstream Chrome bug yet. What I can say is narrower:
- On Windows Chrome/WebGPU with an NVIDIA RTX PRO 6000 Blackwell card, the model loaded successfully but generation was corrupted/repetitive.
- The visible symptom was repeated text like "Aula..." / "key to the database...", even for simple prompts like "hello".
- Disabling subgroups and chromium-experimental-subgroup-matrix made the same local app generate normal responses again.
- After that change, a fresh load plus multi-turn smoke test worked:
  - "Write a haiku about on-device AI" produced a normal haiku.
  - Follow-up "hello" produced a normal greeting.
So my current hypothesis is: either Chrome/Dawn/D3D12/NVIDIA is miscompiling or mishandling one of the subgroup paths, or one of the subgroup-specialized kernels has an assumption that holds on the M4/Metal backend but not on this Windows/NVIDIA backend. I do not have enough evidence yet to say which.
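For reference, a minimal sketch of the workaround's shape, assuming the runtime builds its `requiredFeatures` list from what the adapter reports. The helper names `pickFeatures` and `requestDeviceWithoutSubgroups` are hypothetical and are not taken from the bundled file; only the two feature names come from this PR.

```javascript
// Feature names this workaround avoids requesting on the affected setup.
const BLOCKED_FEATURES = new Set([
  "subgroups",
  "chromium-experimental-subgroup-matrix",
]);

// Hypothetical helper: keep the wanted features that the adapter supports
// and that are not on the blocklist.
function pickFeatures(adapterFeatures, wanted, blocked = BLOCKED_FEATURES) {
  return wanted.filter((name) => adapterFeatures.has(name) && !blocked.has(name));
}

// Browser-only usage (hypothetical wrapper). adapter.features is set-like,
// so it can be passed to pickFeatures directly.
async function requestDeviceWithoutSubgroups(wanted = ["shader-f16", "subgroups"]) {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU adapter unavailable");
  return adapter.requestDevice({
    requiredFeatures: pickFeatures(adapter.features, wanted),
  });
}
```

With the features never requested, any kernel selection that checks `device.features` should fall back to its non-subgroup path, which is the behavior the smoke test above relied on.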
The fetch retry change is unrelated to the generation corruption. I added it after seeing intermittent Failed to fetch while testing. Happy to drop it from this PR if you prefer keeping the patch focused.
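A rough sketch of what "retrying/annotated" means here, assuming retries only on thrown network errors (such as `TypeError: Failed to fetch`) and a simple linear backoff. The name `fetchWithRetry` and its options are illustrative and not necessarily what the patch uses.

```javascript
// Hypothetical helper: retry a fetch when it throws (network-level failure)
// and annotate the final error with the URL and attempt count.
async function fetchWithRetry(
  url,
  { retries = 3, delayMs = 500, fetchImpl = fetch } = {},
) {
  let lastError;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      // HTTP error statuses are returned as-is; only thrown errors are retried.
      return await fetchImpl(url);
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Linear backoff: delayMs, 2 * delayMs, ...
        await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  throw new Error(
    `Fetch failed for ${url} after ${retries} attempts: ${lastError.message}`,
    { cause: lastError },
  );
}
```

The annotated message makes it clear which asset failed, which a bare `Failed to fetch` does not.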
The EOS id change came from generation_config.json, which lists [1, 106, 50], while the fallback only had [1, 106].
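To illustrate the intent, here is a small sketch assuming the app prefers `eos_token_id` from generation_config.json and only uses the hard-coded list when that field is missing. `resolveEosIds` and `FALLBACK_EOS_IDS` are hypothetical names for this example.

```javascript
// Fallback list after this change; previously it was [1, 106].
const FALLBACK_EOS_IDS = [1, 106, 50];

// Hypothetical helper: eos_token_id may be a single id or an array of ids,
// so normalize both shapes into a Set for O(1) stop checks.
function resolveEosIds(generationConfig, fallback = FALLBACK_EOS_IDS) {
  const raw = generationConfig?.eos_token_id;
  if (raw === undefined || raw === null) return new Set(fallback);
  return new Set(Array.isArray(raw) ? raw : [raw]);
}

// Usage inside a decode loop (sketch):
//   if (eosIds.has(nextTokenId)) break;
```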
For the variable-length prefill / exact-mode change, the concrete error I saw was:
Error: No supported WebGPU variant for com.xenova.gemma4.DenseGemv;
rejected sgmat: when guard resolved to false;
gemm: when guard resolved to false;
scalar: when guard resolved to false
The stack went through selectVariant -> prepare -> denseGemv -> Ot -> streamTokenIdsFromCache -> warmup/loadModel.
In the dynamic/variable-length prefill path (Ot), some ops were being called with exact: true, but their available variants appear to be guarded for non-exact mode only. I first saw it with DenseGemv, then hit similar variant-selection failures for attention / decode gate-up norm while narrowing the change. Forcing exact: false only on the ops whose manifests reject exact mode allowed that path to prepare.
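The failure mode can be reproduced with a toy model. This is not the runtime's actual manifest format or selection code: the guard predicates, `pickVariant`, and `NON_EXACT_ONLY_OPS` are all assumptions made up for illustration, and only the op name and error wording mirror the message above.

```javascript
// Toy manifest: each variant carries a `when` guard over the call context.
// Here every guard happens to require non-exact mode (an assumption).
const denseGemvManifest = {
  name: "com.xenova.gemma4.DenseGemv",
  variants: [
    { id: "sgmat", when: (ctx) => !ctx.exact && ctx.subgroupMatrix },
    { id: "gemm", when: (ctx) => !ctx.exact && ctx.rows > 1 },
    { id: "scalar", when: (ctx) => !ctx.exact },
  ],
};

// Hypothetical selector: return the first variant whose guard passes,
// otherwise throw an error listing every rejected variant.
function pickVariant(manifest, ctx) {
  const rejected = [];
  for (const variant of manifest.variants) {
    if (variant.when(ctx)) return variant.id;
    rejected.push(`${variant.id}: when guard resolved to false`);
  }
  throw new Error(
    `No supported WebGPU variant for ${manifest.name}; rejected ${rejected.join("; ")}`,
  );
}

// Mitigation sketch: force exact=false only for ops known to reject exact mode.
const NON_EXACT_ONLY_OPS = new Set(["com.xenova.gemma4.DenseGemv"]);

function pickVariantForPrefill(manifest, ctx) {
  const exact = NON_EXACT_ONLY_OPS.has(manifest.name) ? false : ctx.exact;
  return pickVariant(manifest, { ...ctx, exact });
}
```

In this model, calling `pickVariant` with `exact: true` rejects all three variants, while `pickVariantForPrefill` with the same context selects one. Relaxing the guards in the manifests themselves would achieve the same result without a per-op override list.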
That said, I agree the better long-term fix may be inside the runtime's variant selection / strictness handling rather than patching the bundled file directly. This PR was meant as a repro-derived mitigation and starting point.
Environment:
- OS: Windows
- Browser: Chrome
- GPU: NVIDIA RTX PRO 6000 Blackwell
- WebGPU backend: Chrome on Windows, so presumably Dawn/D3D12
Happy to split this into smaller PRs if useful:
- EOS fallback id only
- Windows subgroup workaround
- Variable-prefill exact-mode guard investigation
- Optional fetch retry