DecodeOprojNorm variant rejected on NVIDIA Ampere: subgroupMaxSize is 128
Bug report: Gemma 4 WebGPU Space fails on NVIDIA Ampere due to overly strict subgroup guard
Space: https://huggingface.co/spaces/webml-community/gemma-4-webgpu-kernels
Error
Failed to load: No supported WebGPU variant for com.xenova.gemma4.DecodeOprojNorm;
rejected fused_rows: when guard resolved to false;
fused: when guard resolved to false
Stack trace points to warmup/decode graph construction:
at Mn.selectVariant (gemma-4-e2b.js:18:3407)
at Mn.selectVariantAndScope (gemma-4-e2b.js:18:3171)
at Mn.prepare (gemma-4-e2b.js:17:22996)
at Xe.oprojNorm (gemma-4-e2b.js:5161:89523)
at Gt.build (gemma-4-e2b.js:5161:103613)
at async e.warmup (gemma-4-e2b.js:5161:121223)
Environment
Chrome: 149.0.7827.114
OS: Windows 10 / 11 kernel 10.0.26100.8521
WebGPU: Hardware accelerated
Adapter selected with powerPreference: "high-performance":
{
vendor: "nvidia",
architecture: "ampere",
subgroupMinSize: 32,
subgroupMaxSize: 128,
features: [
"shader-f16",
"subgroups",
"timestamp-query",
...
]
}
GPU: NVIDIA GeForce RTX 2050
Likely cause
The embedded DecodeOprojNorm variant guard appears to require an exact fixed subgroup range:
device.features.has("subgroups") &&
device.adapterInfo.subgroupMinSize == 32 &&
device.adapterInfo.subgroupMaxSize == 32
On this NVIDIA Ampere/D3D12 adapter, Chrome reports:
subgroupMinSize: 32
subgroupMaxSize: 128
So both fused_rows and fused variants are rejected before compilation, even though the adapter supports subgroups and shader-f16.
Suggested fix
Please add a compatible fallback or relax/add a variant for adapters where subgroup size includes 32 but subgroupMaxSize > 32, e.g. NVIDIA/D3D12. If the WGSL is safe for 32-lane subgroup assumptions, the guard might be closer to:
subgroupMinSize <= 32 && subgroupMaxSize >= 32
Otherwise, a separate NVIDIA/D3D12 variant or non-fixed-subgroup fallback would allow the demo to run on hardware-backed WebGPU adapters that expose a subgroup range rather than fixed 32.
Thanks for testing and letting me know about this! I have updated the demo (https://huggingface.co/spaces/webml-community/gemma-4-webgpu-kernels/discussions/2) to fix this. Could you try again and see if it works now for you? Thanks!
The model loads now, but when I asked a question, I just get:
The key:
سیستم (System) - is the key to your success!
over and over again.
Hmm, strange. Can you check your console for any warnings/errors?
and can you give as much information about your OS/browser/hardware as possible? Thanks!
gemma-4-webgpu-kernels:104 Unrecognized feature: 'ambient-light-sensor'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'battery'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'document-domain'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'layout-animations'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'legacy-image-formats'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'oversized-images'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'vr'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'wake-lock'.
gemma-4-e2b.js:5603 The powerPreference option is currently ignored when calling requestAdapter() on Windows. See https://crbug.com/369219127
dt @ gemma-4-e2b.js:5603
Thanks! Claude Opus 4.8 wrote a diagnostic script to help with debugging. If you could run this, that would be very helpful!
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Gemma4 WebGPU subgroup diagnostic</title>
<style>
body { font: 14px/1.5 ui-monospace, SFMono-Regular, Menlo, monospace; max-width: 900px; margin: 24px auto; padding: 0 16px; }
h1 { font-size: 18px; } h2 { font-size: 15px; margin-top: 24px; }
pre { background: #0b0b0c; color: #d6d6d6; padding: 12px; border-radius: 8px; overflow-x: auto; white-space: pre-wrap; }
.pass { color: #34c759; font-weight: 600; } .fail { color: #ff3b30; font-weight: 600; } .warn { color: #ff9f0a; font-weight: 600; }
button { font: inherit; padding: 8px 16px; border-radius: 8px; border: 1px solid #888; cursor: pointer; }
</style>
</head>
<body>
<h1>Gemma4 WebGPU subgroup-portability diagnostic</h1>
<p>Runs the exact reduction idioms the Gemma4 decode kernels use, on <em>your</em> GPU, and self-checks each
against a known answer. No model download. This is a hybrid-GPU laptop, so please run <b>both</b> buttons
(they request different GPUs) and paste the full output of each back.</p>
<button id="run-hp">Run on high-performance GPU (NVIDIA)</button>
<button id="run-lp">Run on low-power GPU (AMD integrated)</button>
<pre id="out">Click a button above.</pre>
<script type="module">
const out = document.getElementById('out');
const log = (s, cls) => { out.innerHTML += (cls ? `<span class="${cls}">${s}</span>` : s) + '\n'; };
const PASS = (s) => log(' PASS ' + s, 'pass');
const FAIL = (s) => log(' FAIL ' + s, 'fail');
const WARN = (s) => log(' WARN ' + s, 'warn');
document.getElementById('run-hp').onclick = () => go('high-performance');
document.getElementById('run-lp').onclick = () => go('low-power');
async function go(pref) {
out.innerHTML = '';
log(`requested powerPreference: ${pref}\n`);
try { await main(pref); } catch (e) { log('FATAL: ' + (e && e.message || e), 'fail'); console.error(e); }
}
async function main(powerPreference) {
if (!navigator.gpu) { FAIL('navigator.gpu missing — WebGPU unavailable'); return; }
const adapter = await navigator.gpu.requestAdapter({ powerPreference });
if (!adapter) { FAIL('no adapter'); return; }
const info = adapter.info || {};
log(`adapter: vendor=${info.vendor} arch=${info.architecture} device=${info.device || ''}`);
log(`subgroup range (reported): min=${info.subgroupMinSize} max=${info.subgroupMaxSize}`);
log(`features available: subgroups=${adapter.features.has('subgroups')} shader-f16=${adapter.features.has('shader-f16')}`);
const req = [];
if (adapter.features.has('subgroups')) req.push('subgroups');
if (adapter.features.has('shader-f16')) req.push('shader-f16');
const device = await adapter.requestDevice({ requiredFeatures: req });
const hasSub = adapter.features.has('subgroups');
if (!hasSub) WARN('subgroups feature absent — the model would use the workgroup fallback (the butterfly tests are skipped)');
// ---- helpers ----
async function runCompute(code, dispatch, outElems, { workgroups = dispatch } = {}) {
const module = device.createShaderModule({ code });
const ci = await module.getCompilationInfo();
for (const m of ci.messages) if (m.type === 'error') throw new Error(`shader compile: ${m.message}`);
const pipeline = await device.createComputePipelineAsync({ layout: 'auto', compute: { module, entryPoint: 'main' } });
const buf = device.createBuffer({ size: outElems * 4, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST });
device.queue.writeBuffer(buf, 0, new Uint32Array(outElems)); // zero-init
const bind = device.createBindGroup({ layout: pipeline.getBindGroupLayout(0), entries: [{ binding: 0, resource: { buffer: buf } }] });
const enc = device.createCommandEncoder();
const pass = enc.beginComputePass();
pass.setPipeline(pipeline); pass.setBindGroup(0, bind); pass.dispatchWorkgroups(workgroups); pass.end();
const read = device.createBuffer({ size: outElems * 4, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ });
enc.copyBufferToBuffer(buf, 0, read, 0, outElems * 4);
device.queue.submit([enc.finish()]);
await read.mapAsync(GPUMapMode.READ);
const u32 = new Uint32Array(read.getMappedRange().slice(0));
read.unmap();
return u32;
}
const f32view = (u32) => new Float32Array(u32.buffer);
// =========================================================================
// TEST A — subgroup size + lane mapping (the load-bearing assumption)
// The decode kernels assume subgroup_invocation_id == local_invocation_index % subgroup_size
// and that subgroups tile the 1-D workgroup linearly. This dumps both and checks it.
// =========================================================================
log('\n== TEST A: subgroup size + lane mapping (WG=256) ==', '');
if (hasSub) {
const A = await runCompute(`
enable subgroups;
@group(0) @binding(0) var<storage, read_write> o: array<u32>;
@compute @workgroup_size(256)
fn main(@builtin(local_invocation_index) lid: u32,
@builtin(subgroup_invocation_id) sid: u32,
@builtin(subgroup_size) ssz: u32) {
o[lid*3u] = sid; o[lid*3u+1u] = ssz; o[lid*3u+2u] = lid;
}`, 1, 256 * 3);
const ssz = A[1];
log(` runtime subgroup_size = ${ssz} (NVIDIA warps are physically 32 regardless of the reported max)`);
let linear = true, allSame = true, badLid = -1;
for (let lid = 0; lid < 256; lid++) {
const sid = A[lid*3], s = A[lid*3+1];
if (s !== ssz) allSame = false;
if (sid !== (lid % s)) { linear = false; if (badLid < 0) badLid = lid; }
}
if (!allSame) WARN('subgroup_size is not uniform across the workgroup (unusual)');
if (linear) PASS(`lane mapping is LINEAR: subgroup_invocation_id == local_invocation_index % ${ssz} for all 256 lanes`);
else FAIL(`lane mapping is NON-LINEAR at lid=${badLid} (sid=${A[badLid*3]}, expected ${badLid % A[badLid*3+1]}). ` +
`This breaks the (tid&31)/(tid>>5) cross-block combine -> THIS is the bug.`);
if (ssz !== 32) WARN(`physical subgroup_size is ${ssz} (not 32). The butterfly is designed for this, but it is the case Apple cannot test.`);
} else { log(' (skipped — no subgroups)'); }
// =========================================================================
// TEST B — the butterfly + cross-32-block combine, vs known answer.
// Replicates the EXACT idiom in oproj/norm/down/attention reduce_sum.
// Each lane contributes (tid+1); a WG=256 workgroup must total 256*257/2 = 32896.
// =========================================================================
log('\n== TEST B: butterfly + combine reduction (WG=256, expected total = 32896) ==', '');
if (hasSub) {
const code = `
enable subgroups;
@group(0) @binding(0) var<storage, read_write> result: array<f32>;
var<workgroup> sgp: array<f32, 8>; // WG/32
fn sg_sum(v: f32) -> f32 { // sgExact32==false butterfly (deltas 1,2,4,8,16)
var x = v;
x = x + subgroupShuffleXor(x, 1u);
x = x + subgroupShuffleXor(x, 2u);
x = x + subgroupShuffleXor(x, 4u);
x = x + subgroupShuffleXor(x, 8u);
x = x + subgroupShuffleXor(x, 16u);
return x;
}
@compute @workgroup_size(256)
fn main(@builtin(local_invocation_index) tid: u32, @builtin(workgroup_id) wg: vec3<u32>) {
let v = f32(tid + 1u);
let s = sg_sum(v);
if ((tid & 31u) == 0u) { sgp[tid >> 5u] = s; }
workgroupBarrier();
var total: f32 = 0.0;
for (var i: u32 = 0u; i < 8u; i = i + 1u) { total = total + sgp[i]; }
if (tid == 0u) { result[wg.x] = total; }
}`;
const B = f32view(await runCompute(code, 4, 4)); // 4 workgroups, all should agree
const expected = 256 * 257 / 2;
let ok = true; for (let i = 0; i < 4; i++) if (B[i] !== expected) ok = false;
if (ok) PASS(`butterfly+combine = ${B[0]} (correct). The cross-32-block reduction is right on this GPU.`);
else FAIL(`butterfly+combine = [${[...B].join(', ')}], expected ${expected}. The reduction is WRONG on this GPU -> THIS is the bug.`);
} else { log(' (skipped — no subgroups)'); }
// =========================================================================
// TEST C — cross-workgroup "last-arriver" merge (the prime D3D12 suspect).
// N workgroups each atomicStore a known partial; the last (by ticket) sums them.
// Expected = N*(N+1)/2. Run many iterations to catch a nondeterministic race.
// This is the flash-attention / oproj / down merge pattern. If it fails or is
// flaky here, the repetition is the cross-workgroup memory model, NOT the subgroups.
// =========================================================================
log('\n== TEST C: cross-workgroup last-arriver merge (N=64, 200 iters, expected = 2080) ==', '');
{
const N = 64, ITERS = 200;
const code = `
const N: u32 = ${N}u;
@group(0) @binding(0) var<storage, read_write> buf: array<atomic<u32>>; // [0..N) partials, [N] ticket, [N+1] result
@compute @workgroup_size(64)
fn main(@builtin(workgroup_id) wg: vec3<u32>, @builtin(local_invocation_index) tid: u32) {
if (tid == 0u) { atomicStore(&buf[wg.x], wg.x + 1u); }
storageBarrier(); // workgroup-scoped (as in the kernels)
if (tid == 0u) {
let ticket = atomicAdd(&buf[N], 1u);
if (ticket == N - 1u) {
var total: u32 = 0u;
for (var i: u32 = 0u; i < N; i = i + 1u) { total = total + atomicLoad(&buf[i]); }
atomicStore(&buf[N + 1u], total);
}
}
}`;
const module = device.createShaderModule({ code });
const pipeline = await device.createComputePipelineAsync({ layout: 'auto', compute: { module, entryPoint: 'main' } });
const sz = (N + 2) * 4;
const buf = device.createBuffer({ size: sz, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST });
const read = device.createBuffer({ size: 4, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ });
const bind = device.createBindGroup({ layout: pipeline.getBindGroupLayout(0), entries: [{ binding: 0, resource: { buffer: buf } }] });
const expected = N * (N + 1) / 2;
let fails = 0, firstBad = -1, sample = [];
for (let it = 0; it < ITERS; it++) {
device.queue.writeBuffer(buf, 0, new Uint32Array(N + 2)); // zero ticket + result + partials each run
const enc = device.createCommandEncoder();
const pass = enc.beginComputePass();
pass.setPipeline(pipeline); pass.setBindGroup(0, bind); pass.dispatchWorkgroups(N); pass.end();
enc.copyBufferToBuffer(buf, (N + 1) * 4, read, 0, 4);
device.queue.submit([enc.finish()]);
await read.mapAsync(GPUMapMode.READ);
const got = new Uint32Array(read.getMappedRange())[0];
read.unmap();
if (got !== expected) { fails++; if (firstBad < 0) firstBad = it; if (sample.length < 6) sample.push(got); }
}
if (fails === 0) PASS(`last-arriver merge = ${expected} on all ${ITERS} runs. Cross-workgroup visibility is fine here.`);
else FAIL(`last-arriver merge WRONG on ${fails}/${ITERS} runs (first at iter ${firstBad}; sample totals: ${sample.join(', ')}; expected ${expected}). ` +
`The last workgroup is NOT reliably seeing other workgroups' partials -> THIS is the bug (cross-workgroup memory model). ` +
`The flash-attention / oproj / down merges use exactly this pattern.`);
}
// =========================================================================
// TEST D — butterfly vs workgroup-memory tree on RANDOM data (exact parity).
// If the butterfly disagrees with a barrier-tree reference for the same data,
// the subgroup path is miscomputing on this GPU.
// =========================================================================
log('\n== TEST D: butterfly vs workgroup-tree parity on pseudo-random data (WG=256) ==', '');
if (hasSub) {
const code = `
enable subgroups;
@group(0) @binding(0) var<storage, read_write> o: array<u32>; // [0]=mismatch count
var<workgroup> sgp: array<f32, 8>;
var<workgroup> tree: array<f32, 256>;
fn sg_sum(v: f32) -> f32 {
var x = v;
x = x + subgroupShuffleXor(x, 1u); x = x + subgroupShuffleXor(x, 2u);
x = x + subgroupShuffleXor(x, 4u); x = x + subgroupShuffleXor(x, 8u);
x = x + subgroupShuffleXor(x, 16u); return x;
}
@compute @workgroup_size(256)
fn main(@builtin(local_invocation_index) tid: u32) {
// deterministic pseudo-random per lane
let v = f32((tid * 2654435761u) % 997u) - 498.0;
// butterfly + combine
let s = sg_sum(v);
if ((tid & 31u) == 0u) { sgp[tid >> 5u] = s; }
workgroupBarrier();
var bf: f32 = 0.0; for (var i: u32 = 0u; i < 8u; i = i + 1u) { bf = bf + sgp[i]; }
// workgroup-memory tree reference (no subgroup assumption)
tree[tid] = v; workgroupBarrier();
for (var st: u32 = 128u; st > 0u; st = st >> 1u) {
if (tid < st) { tree[tid] = tree[tid] + tree[tid + st]; }
workgroupBarrier();
}
if (tid == 0u) {
// allow tiny FP-order slack; flag only structural disagreement
let diff = abs(bf - tree[0]);
if (diff > 0.5) { o[0] = 1u; o[1] = bitcast<u32>(bf); o[2] = bitcast<u32>(tree[0]); }
}
}`;
const D = await runCompute(code, 1, 4);
if (D[0] === 0) PASS('butterfly == workgroup-tree (within FP slack). Subgroup reduction matches the assumption-free reference on this GPU.');
else FAIL(`butterfly (${f32view(D)[1]}) != workgroup-tree (${f32view(D)[2]}). The subgroup reduction is structurally wrong here -> THIS is the bug.`);
} else { log(' (skipped — no subgroups)'); }
log('\n== Summary ==', '');
log('If A/B/D all PASS and C PASSES -> the kernels are correct on your GPU; the issue is float precision / decoding, not the subgroup fix.');
log('If A or B or D FAILS -> the subgroup reduction is wrong on your GPU (lane mapping / butterfly).');
log('If C FAILS or is flaky -> the cross-workgroup merge (flash attention / oproj / down) is the culprit.');
log('\nPlease copy everything above and send it back. Thank you!');
}
</script>
</body>
</html>
if you could run on both high-power and low-power, that would be very helpful!
(PS: you can run the script by creating an index.html file and then just opening it in chrome or whatever browser you are using).
requested powerPreference: high-performance
adapter: vendor=nvidia arch=ampere device=
subgroup range (reported): min=32 max=128
features available: subgroups=true shader-f16=true
== TEST A: subgroup size + lane mapping (WG=256) ==
runtime subgroup_size = 32 (NVIDIA warps are physically 32 regardless of the reported max)
PASS lane mapping is LINEAR: subgroup_invocation_id == local_invocation_index % 32 for all 256 lanes
== TEST B: butterfly + combine reduction (WG=256, expected total = 32896) ==
PASS butterfly+combine = 32896 (correct). The cross-32-block reduction is right on this GPU.
== TEST C: cross-workgroup last-arriver merge (N=64, 200 iters, expected = 2080) ==
PASS last-arriver merge = 2080 on all 200 runs. Cross-workgroup visibility is fine here.
== TEST D: butterfly vs workgroup-tree parity on pseudo-random data (WG=256) ==
PASS butterfly == workgroup-tree (within FP slack). Subgroup reduction matches the assumption-free reference on this GPU.
== Summary ==
If A/B/D all PASS and C PASSES -> the kernels are correct on your GPU; the issue is float precision / decoding, not the subgroup fix.
If A or B or D FAILS -> the subgroup reduction is wrong on your GPU (lane mapping / butterfly).
If C FAILS or is flaky -> the cross-workgroup merge (flash attention / oproj / down) is the culprit.
Please copy everything above and send it back. Thank you!
requested powerPreference: low-power
adapter: vendor=nvidia arch=ampere device=
subgroup range (reported): min=32 max=128
features available: subgroups=true shader-f16=true
== TEST A: subgroup size + lane mapping (WG=256) ==
runtime subgroup_size = 32 (NVIDIA warps are physically 32 regardless of the reported max)
PASS lane mapping is LINEAR: subgroup_invocation_id == local_invocation_index % 32 for all 256 lanes
== TEST B: butterfly + combine reduction (WG=256, expected total = 32896) ==
PASS butterfly+combine = 32896 (correct). The cross-32-block reduction is right on this GPU.
== TEST C: cross-workgroup last-arriver merge (N=64, 200 iters, expected = 2080) ==
PASS last-arriver merge = 2080 on all 200 runs. Cross-workgroup visibility is fine here.
== TEST D: butterfly vs workgroup-tree parity on pseudo-random data (WG=256) ==
PASS butterfly == workgroup-tree (within FP slack). Subgroup reduction matches the assumption-free reference on this GPU.
== Summary ==
If A/B/D all PASS and C PASSES -> the kernels are correct on your GPU; the issue is float precision / decoding, not the subgroup fix.
If A or B or D FAILS -> the subgroup reduction is wrong on your GPU (lane mapping / butterfly).
If C FAILS or is flaky -> the cross-workgroup merge (flash attention / oproj / down) is the culprit.
Please copy everything above and send it back. Thank you!
I loaded the weights and tried again. Any prompts gives repeating gibberish. But weights do load.
oh wow, interesting. Very useful info. Thanks! I'll see if I can get this fixed.
Happy to help. We support this direction. It's what we are building towards.
My IntiBot offers the following analysis:
Current status: the weights load, but any prompt produces repeating gibberish. The subgroup/cross-workgroup diagnostic passed on both NVIDIA high-performance and AMD low-power paths, so this probably is not the specific subgroup lane mapping or cross-workgroup merge issue those tests were meant to catch.
The next places I’d look are:
Model conversion / weight layout mismatch
- A transposed matrix, wrong packed layout, wrong shard order, or stale metadata can still load cleanly but produce nonsense logits.
- Debug path: compare a few layer outputs against a known-good CPU/Python reference using the same prompt tokens. Start with embedding output, then attention q/k/v, attention output, MLP output, final logits.
- Possible fix: add per-tensor shape/layout validation during load, especially for q/k/v/o projection, MLP gate/up/down, norms, and final lm_head.
Tokenizer or chat template mismatch
- If tokens are wrong, the model may look broken even if kernels are fine.
- Debug path: log the exact token IDs for a tiny prompt and compare against the official tokenizer. Also test raw completion without chat formatting if possible.
- Possible fix: make tokenizer/template selection explicit per model variant, and verify BOS/EOS/control tokens.
KV cache indexing / position handling
Repeating gibberish during decode often points at bad cache writes/reads, off-by-one positions, wrong stride, or RoPE positions not advancing correctly.
- Debug path: test prefill-only logits for the next token, then compare decode step 1, step 2, step 3 against reference. If prefill is sane but decode diverges immediately, KV/positioning is the prime suspect.
- Possible fix: audit cache layout as [layer, kv, batch, head, seq, dim] or whatever the implementation expects, and verify the write/read strides match exactly.
RoPE / scaling config mismatch
- Gemma-family models are sensitive to the exact RoPE base/scaling and position conventions.
- Debug path: compare q/k after RoPE for one layer/head/position against reference. Also test very short prompts where long-context scaling should not matter much.
- Possible fix: read RoPE params from model config instead of hardcoding, and verify half-dim pairing/order matches the reference implementation.
Precision or quantization path
- f16, packed weights, dequant scales, or accumulation differences can make logits collapse into repetition.
- Debug path: run the smallest possible model/config in f32 or least-quantized mode if available. Compare final logits before sampling, not just generated text.
- Possible fix: temporarily force higher-precision accumulation in attention/MLP reductions, then narrow which kernel needs it.
Sampling bug
- If greedy/temperature 0 works but normal sampling fails, then the model path may be fine and the issue is logits processing.
- Debug path: run deterministic generation: temperature 0, top_k disabled, top_p disabled, fixed seed if sampling remains enabled. Log top-10 logits/probs each step.
Possible fix: check softmax stability, repetition penalty, top-k/top-p filtering, NaN handling, and whether logits are being read from the correct buffer.
The most useful repro bundle for the dev would be:
- browser + version
- OS
- exact GPU names and driver versions
- model/checkpoint/conversion command or source
- exact prompt and settings
- first ~50 generated tokens/text
- whether NVIDIA and AMD produce the same gibberish or different gibberish
- whether greedy / temperature 0 still repeats
- whether prefill next-token logits match a reference, if there is a reference path
My suggested shortest debug route: first verify tokenizer IDs, then compare prefill logits against reference, then compare the first few decode steps. If prefill matches but decode diverges, focus on KV cache / RoPE / position indexing. If prefill already diverges, focus on conversion, weight layout, attention/MLP math, and quantization.
I am seeing an issue as well - model responses are all blank.
Here is my debug:
requested powerPreference: high-performance
adapter: vendor=qualcomm arch=adreno-7xx device=
subgroup range (reported): min=64 max=128
features available: subgroups=true shader-f16=true
== TEST A: subgroup size + lane mapping (WG=256) ==
runtime subgroup_size = 128 (NVIDIA warps are physically 32 regardless of the reported max)
PASS lane mapping is LINEAR: subgroup_invocation_id == local_invocation_index % 128 for all 256 lanes
WARN physical subgroup_size is 128 (not 32). The butterfly is designed for this, but it is the case Apple cannot test.
== TEST B: butterfly + combine reduction (WG=256, expected total = 32896) ==
PASS butterfly+combine = 32896 (correct). The cross-32-block reduction is right on this GPU.
== TEST C: cross-workgroup last-arriver merge (N=64, 200 iters, expected = 2080) ==
PASS last-arriver merge = 2080 on all 200 runs. Cross-workgroup visibility is fine here.
== TEST D: butterfly vs workgroup-tree parity on pseudo-random data (WG=256) ==
PASS butterfly == workgroup-tree (within FP slack). Subgroup reduction matches the assumption-free reference on this GPU.
== Summary ==
If A/B/D all PASS and C PASSES -> the kernels are correct on your GPU; the issue is float precision / decoding, not the subgroup fix.
If A or B or D FAILS -> the subgroup reduction is wrong on your GPU (lane mapping / butterfly).
If C FAILS or is flaky -> the cross-workgroup merge (flash attention / oproj / down) is the culprit.
Please copy everything above and send it back. Thank you!
Debug console is spitting out:
[Invalid CommandBuffer from CommandEncoder "compute-dispatch"] is invalid due to a previous error.
- While calling [Queue].Submit([[Invalid CommandBuffer from CommandEncoder "compute-dispatch"]])
very useful info! thanks so much.
