DecodeOprojNorm variant rejected on NVIDIA Ampere: subgroupMaxSize is 128

#1
by Intellipedia - opened

Bug report: Gemma 4 WebGPU Space fails on NVIDIA Ampere due to overly strict subgroup guard

Space: https://huggingface.co/spaces/webml-community/gemma-4-webgpu-kernels

Error

Failed to load: No supported WebGPU variant for com.xenova.gemma4.DecodeOprojNorm;
rejected fused_rows: when guard resolved to false;
fused: when guard resolved to false

Stack trace points to warmup/decode graph construction:

at Mn.selectVariant (gemma-4-e2b.js:18:3407)
at Mn.selectVariantAndScope (gemma-4-e2b.js:18:3171)
at Mn.prepare (gemma-4-e2b.js:17:22996)
at Xe.oprojNorm (gemma-4-e2b.js:5161:89523)
at Gt.build (gemma-4-e2b.js:5161:103613)
at async e.warmup (gemma-4-e2b.js:5161:121223)

Environment

Chrome: 149.0.7827.114
OS: Windows 10 / 11 kernel 10.0.26100.8521
WebGPU: Hardware accelerated
Adapter selected with powerPreference: "high-performance":

{
  vendor: "nvidia",
  architecture: "ampere",
  subgroupMinSize: 32,
  subgroupMaxSize: 128,
  features: [
    "shader-f16",
    "subgroups",
    "timestamp-query",
    ...
  ]
}

GPU: NVIDIA GeForce RTX 2050
Likely cause

The embedded DecodeOprojNorm variant guard appears to require an exact fixed subgroup range:

device.features.has("subgroups") &&
device.adapterInfo.subgroupMinSize == 32 &&
device.adapterInfo.subgroupMaxSize == 32

On this NVIDIA Ampere/D3D12 adapter, Chrome reports:

subgroupMinSize: 32
subgroupMaxSize: 128

So both fused_rows and fused variants are rejected before compilation, even though the adapter supports subgroups and shader-f16.

Suggested fix

Please add a compatible fallback or relax/add a variant for adapters where subgroup size includes 32 but subgroupMaxSize > 32, e.g. NVIDIA/D3D12. If the WGSL is safe for 32-lane subgroup assumptions, the guard might be closer to:

subgroupMinSize <= 32 && subgroupMaxSize >= 32

Otherwise, a separate NVIDIA/D3D12 variant or non-fixed-subgroup fallback would allow the demo to run on hardware-backed WebGPU adapters that expose a subgroup range rather than fixed 32.


WebML Community org

Thanks for testing and letting me know about this! I have updated the demo (https://huggingface.co/spaces/webml-community/gemma-4-webgpu-kernels/discussions/2) to fix this. Could you try again and see if it works now for you? Thanks!

The model loads now, but when I asked a question, I just get:

The key:

سیستم (System) - is the key to your success!

over and over again.

Intellipedia changed discussion status to closed
Intellipedia changed discussion status to open
WebML Community org

Hmm, strange. Can you check your console for any warnings/errors?

WebML Community org

and can you give as much information about your OS/browser/hardware as possible? Thanks!

gemma-4-webgpu-kernels:104 Unrecognized feature: 'ambient-light-sensor'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'battery'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'document-domain'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'layout-animations'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'legacy-image-formats'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'oversized-images'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'vr'.
gemma-4-webgpu-kernels:104 Unrecognized feature: 'wake-lock'.
gemma-4-e2b.js:5603 The powerPreference option is currently ignored when calling requestAdapter() on Windows. See https://crbug.com/369219127
dt @ gemma-4-e2b.js:5603

image

WebML Community org

Thanks! Claude Opus 4.8 wrote a diagnostic script to help with debugging. If you could run this, that would be very helpful!

<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Gemma4 WebGPU subgroup diagnostic</title>
<style>
  body { font: 14px/1.5 ui-monospace, SFMono-Regular, Menlo, monospace; max-width: 900px; margin: 24px auto; padding: 0 16px; }
  h1 { font-size: 18px; } h2 { font-size: 15px; margin-top: 24px; }
  pre { background: #0b0b0c; color: #d6d6d6; padding: 12px; border-radius: 8px; overflow-x: auto; white-space: pre-wrap; }
  .pass { color: #34c759; font-weight: 600; } .fail { color: #ff3b30; font-weight: 600; } .warn { color: #ff9f0a; font-weight: 600; }
  button { font: inherit; padding: 8px 16px; border-radius: 8px; border: 1px solid #888; cursor: pointer; }
</style>
</head>
<body>
<h1>Gemma4 WebGPU subgroup-portability diagnostic</h1>
<p>Runs the exact reduction idioms the Gemma4 decode kernels use, on <em>your</em> GPU, and self-checks each
against a known answer. No model download. This is a hybrid-GPU laptop, so please run <b>both</b> buttons
(they request different GPUs) and paste the full output of each back.</p>
<button id="run-hp">Run on high-performance GPU (NVIDIA)</button>
<button id="run-lp">Run on low-power GPU (AMD integrated)</button>
<pre id="out">Click a button above.</pre>
<script type="module">
const out = document.getElementById('out');
const log = (s, cls) => { out.innerHTML += (cls ? `<span class="${cls}">${s}</span>` : s) + '\n'; };
const PASS = (s) => log('  PASS  ' + s, 'pass');
const FAIL = (s) => log('  FAIL  ' + s, 'fail');
const WARN = (s) => log('  WARN  ' + s, 'warn');

document.getElementById('run-hp').onclick = () => go('high-performance');
document.getElementById('run-lp').onclick = () => go('low-power');
async function go(pref) {
  out.innerHTML = '';
  log(`requested powerPreference: ${pref}\n`);
  try { await main(pref); } catch (e) { log('FATAL: ' + (e && e.message || e), 'fail'); console.error(e); }
}

async function main(powerPreference) {
  if (!navigator.gpu) { FAIL('navigator.gpu missing — WebGPU unavailable'); return; }
  const adapter = await navigator.gpu.requestAdapter({ powerPreference });
  if (!adapter) { FAIL('no adapter'); return; }
  const info = adapter.info || {};
  log(`adapter: vendor=${info.vendor} arch=${info.architecture} device=${info.device || ''}`);
  log(`subgroup range (reported): min=${info.subgroupMinSize} max=${info.subgroupMaxSize}`);
  log(`features available: subgroups=${adapter.features.has('subgroups')} shader-f16=${adapter.features.has('shader-f16')}`);
  const req = [];
  if (adapter.features.has('subgroups')) req.push('subgroups');
  if (adapter.features.has('shader-f16')) req.push('shader-f16');
  const device = await adapter.requestDevice({ requiredFeatures: req });
  const hasSub = adapter.features.has('subgroups');
  if (!hasSub) WARN('subgroups feature absent — the model would use the workgroup fallback (the butterfly tests are skipped)');

  // ---- helpers ----
  async function runCompute(code, dispatch, outElems, { workgroups = dispatch } = {}) {
    const module = device.createShaderModule({ code });
    const ci = await module.getCompilationInfo();
    for (const m of ci.messages) if (m.type === 'error') throw new Error(`shader compile: ${m.message}`);
    const pipeline = await device.createComputePipelineAsync({ layout: 'auto', compute: { module, entryPoint: 'main' } });
    const buf = device.createBuffer({ size: outElems * 4, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST });
    device.queue.writeBuffer(buf, 0, new Uint32Array(outElems)); // zero-init
    const bind = device.createBindGroup({ layout: pipeline.getBindGroupLayout(0), entries: [{ binding: 0, resource: { buffer: buf } }] });
    const enc = device.createCommandEncoder();
    const pass = enc.beginComputePass();
    pass.setPipeline(pipeline); pass.setBindGroup(0, bind); pass.dispatchWorkgroups(workgroups); pass.end();
    const read = device.createBuffer({ size: outElems * 4, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ });
    enc.copyBufferToBuffer(buf, 0, read, 0, outElems * 4);
    device.queue.submit([enc.finish()]);
    await read.mapAsync(GPUMapMode.READ);
    const u32 = new Uint32Array(read.getMappedRange().slice(0));
    read.unmap();
    return u32;
  }
  const f32view = (u32) => new Float32Array(u32.buffer);

  // =========================================================================
  // TEST A — subgroup size + lane mapping (the load-bearing assumption)
  // The decode kernels assume subgroup_invocation_id == local_invocation_index % subgroup_size
  // and that subgroups tile the 1-D workgroup linearly. This dumps both and checks it.
  // =========================================================================
  log('\n== TEST A: subgroup size + lane mapping (WG=256) ==', '');
  if (hasSub) {
    const A = await runCompute(`
      enable subgroups;
      @group(0) @binding(0) var<storage, read_write> o: array<u32>;
      @compute @workgroup_size(256)
      fn main(@builtin(local_invocation_index) lid: u32,
              @builtin(subgroup_invocation_id) sid: u32,
              @builtin(subgroup_size) ssz: u32) {
        o[lid*3u] = sid; o[lid*3u+1u] = ssz; o[lid*3u+2u] = lid;
      }`, 1, 256 * 3);
    const ssz = A[1];
    log(`  runtime subgroup_size = ${ssz}  (NVIDIA warps are physically 32 regardless of the reported max)`);
    let linear = true, allSame = true, badLid = -1;
    for (let lid = 0; lid < 256; lid++) {
      const sid = A[lid*3], s = A[lid*3+1];
      if (s !== ssz) allSame = false;
      if (sid !== (lid % s)) { linear = false; if (badLid < 0) badLid = lid; }
    }
    if (!allSame) WARN('subgroup_size is not uniform across the workgroup (unusual)');
    if (linear) PASS(`lane mapping is LINEAR: subgroup_invocation_id == local_invocation_index % ${ssz} for all 256 lanes`);
    else FAIL(`lane mapping is NON-LINEAR at lid=${badLid} (sid=${A[badLid*3]}, expected ${badLid % A[badLid*3+1]}). ` +
              `This breaks the (tid&31)/(tid>>5) cross-block combine -> THIS is the bug.`);
    if (ssz !== 32) WARN(`physical subgroup_size is ${ssz} (not 32). The butterfly is designed for this, but it is the case Apple cannot test.`);
  } else { log('  (skipped — no subgroups)'); }

  // =========================================================================
  // TEST B — the butterfly + cross-32-block combine, vs known answer.
  // Replicates the EXACT idiom in oproj/norm/down/attention reduce_sum.
  // Each lane contributes (tid+1); a WG=256 workgroup must total 256*257/2 = 32896.
  // =========================================================================
  log('\n== TEST B: butterfly + combine reduction (WG=256, expected total = 32896) ==', '');
  if (hasSub) {
    const code = `
      enable subgroups;
      @group(0) @binding(0) var<storage, read_write> result: array<f32>;
      var<workgroup> sgp: array<f32, 8>;            // WG/32
      fn sg_sum(v: f32) -> f32 {                    // sgExact32==false butterfly (deltas 1,2,4,8,16)
        var x = v;
        x = x + subgroupShuffleXor(x, 1u);
        x = x + subgroupShuffleXor(x, 2u);
        x = x + subgroupShuffleXor(x, 4u);
        x = x + subgroupShuffleXor(x, 8u);
        x = x + subgroupShuffleXor(x, 16u);
        return x;
      }
      @compute @workgroup_size(256)
      fn main(@builtin(local_invocation_index) tid: u32, @builtin(workgroup_id) wg: vec3<u32>) {
        let v = f32(tid + 1u);
        let s = sg_sum(v);
        if ((tid & 31u) == 0u) { sgp[tid >> 5u] = s; }
        workgroupBarrier();
        var total: f32 = 0.0;
        for (var i: u32 = 0u; i < 8u; i = i + 1u) { total = total + sgp[i]; }
        if (tid == 0u) { result[wg.x] = total; }
      }`;
    const B = f32view(await runCompute(code, 4, 4)); // 4 workgroups, all should agree
    const expected = 256 * 257 / 2;
    let ok = true; for (let i = 0; i < 4; i++) if (B[i] !== expected) ok = false;
    if (ok) PASS(`butterfly+combine = ${B[0]} (correct). The cross-32-block reduction is right on this GPU.`);
    else FAIL(`butterfly+combine = [${[...B].join(', ')}], expected ${expected}. The reduction is WRONG on this GPU -> THIS is the bug.`);
  } else { log('  (skipped — no subgroups)'); }

  // =========================================================================
  // TEST C — cross-workgroup "last-arriver" merge (the prime D3D12 suspect).
  // N workgroups each atomicStore a known partial; the last (by ticket) sums them.
  // Expected = N*(N+1)/2. Run many iterations to catch a nondeterministic race.
  // This is the flash-attention / oproj / down merge pattern. If it fails or is
  // flaky here, the repetition is the cross-workgroup memory model, NOT the subgroups.
  // =========================================================================
  log('\n== TEST C: cross-workgroup last-arriver merge (N=64, 200 iters, expected = 2080) ==', '');
  {
    const N = 64, ITERS = 200;
    const code = `
      const N: u32 = ${N}u;
      @group(0) @binding(0) var<storage, read_write> buf: array<atomic<u32>>;  // [0..N) partials, [N] ticket, [N+1] result
      @compute @workgroup_size(64)
      fn main(@builtin(workgroup_id) wg: vec3<u32>, @builtin(local_invocation_index) tid: u32) {
        if (tid == 0u) { atomicStore(&buf[wg.x], wg.x + 1u); }
        storageBarrier();                                  // workgroup-scoped (as in the kernels)
        if (tid == 0u) {
          let ticket = atomicAdd(&buf[N], 1u);
          if (ticket == N - 1u) {
            var total: u32 = 0u;
            for (var i: u32 = 0u; i < N; i = i + 1u) { total = total + atomicLoad(&buf[i]); }
            atomicStore(&buf[N + 1u], total);
          }
        }
      }`;
    const module = device.createShaderModule({ code });
    const pipeline = await device.createComputePipelineAsync({ layout: 'auto', compute: { module, entryPoint: 'main' } });
    const sz = (N + 2) * 4;
    const buf = device.createBuffer({ size: sz, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST });
    const read = device.createBuffer({ size: 4, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ });
    const bind = device.createBindGroup({ layout: pipeline.getBindGroupLayout(0), entries: [{ binding: 0, resource: { buffer: buf } }] });
    const expected = N * (N + 1) / 2;
    let fails = 0, firstBad = -1, sample = [];
    for (let it = 0; it < ITERS; it++) {
      device.queue.writeBuffer(buf, 0, new Uint32Array(N + 2)); // zero ticket + result + partials each run
      const enc = device.createCommandEncoder();
      const pass = enc.beginComputePass();
      pass.setPipeline(pipeline); pass.setBindGroup(0, bind); pass.dispatchWorkgroups(N); pass.end();
      enc.copyBufferToBuffer(buf, (N + 1) * 4, read, 0, 4);
      device.queue.submit([enc.finish()]);
      await read.mapAsync(GPUMapMode.READ);
      const got = new Uint32Array(read.getMappedRange())[0];
      read.unmap();
      if (got !== expected) { fails++; if (firstBad < 0) firstBad = it; if (sample.length < 6) sample.push(got); }
    }
    if (fails === 0) PASS(`last-arriver merge = ${expected} on all ${ITERS} runs. Cross-workgroup visibility is fine here.`);
    else FAIL(`last-arriver merge WRONG on ${fails}/${ITERS} runs (first at iter ${firstBad}; sample totals: ${sample.join(', ')}; expected ${expected}). ` +
              `The last workgroup is NOT reliably seeing other workgroups' partials -> THIS is the bug (cross-workgroup memory model). ` +
              `The flash-attention / oproj / down merges use exactly this pattern.`);
  }

  // =========================================================================
  // TEST D — butterfly vs workgroup-memory tree on RANDOM data (exact parity).
  // If the butterfly disagrees with a barrier-tree reference for the same data,
  // the subgroup path is miscomputing on this GPU.
  // =========================================================================
  log('\n== TEST D: butterfly vs workgroup-tree parity on pseudo-random data (WG=256) ==', '');
  if (hasSub) {
    const code = `
      enable subgroups;
      @group(0) @binding(0) var<storage, read_write> o: array<u32>;  // [0]=mismatch count
      var<workgroup> sgp: array<f32, 8>;
      var<workgroup> tree: array<f32, 256>;
      fn sg_sum(v: f32) -> f32 {
        var x = v;
        x = x + subgroupShuffleXor(x, 1u); x = x + subgroupShuffleXor(x, 2u);
        x = x + subgroupShuffleXor(x, 4u); x = x + subgroupShuffleXor(x, 8u);
        x = x + subgroupShuffleXor(x, 16u); return x;
      }
      @compute @workgroup_size(256)
      fn main(@builtin(local_invocation_index) tid: u32) {
        // deterministic pseudo-random per lane
        let v = f32((tid * 2654435761u) % 997u) - 498.0;
        // butterfly + combine
        let s = sg_sum(v);
        if ((tid & 31u) == 0u) { sgp[tid >> 5u] = s; }
        workgroupBarrier();
        var bf: f32 = 0.0; for (var i: u32 = 0u; i < 8u; i = i + 1u) { bf = bf + sgp[i]; }
        // workgroup-memory tree reference (no subgroup assumption)
        tree[tid] = v; workgroupBarrier();
        for (var st: u32 = 128u; st > 0u; st = st >> 1u) {
          if (tid < st) { tree[tid] = tree[tid] + tree[tid + st]; }
          workgroupBarrier();
        }
        if (tid == 0u) {
          // allow tiny FP-order slack; flag only structural disagreement
          let diff = abs(bf - tree[0]);
          if (diff > 0.5) { o[0] = 1u; o[1] = bitcast<u32>(bf); o[2] = bitcast<u32>(tree[0]); }
        }
      }`;
    const D = await runCompute(code, 1, 4);
    if (D[0] === 0) PASS('butterfly == workgroup-tree (within FP slack). Subgroup reduction matches the assumption-free reference on this GPU.');
    else FAIL(`butterfly (${f32view(D)[1]}) != workgroup-tree (${f32view(D)[2]}). The subgroup reduction is structurally wrong here -> THIS is the bug.`);
  } else { log('  (skipped — no subgroups)'); }

  log('\n== Summary ==', '');
  log('If A/B/D all PASS and C PASSES  -> the kernels are correct on your GPU; the issue is float precision / decoding, not the subgroup fix.');
  log('If A or B or D FAILS            -> the subgroup reduction is wrong on your GPU (lane mapping / butterfly).');
  log('If C FAILS or is flaky          -> the cross-workgroup merge (flash attention / oproj / down) is the culprit.');
  log('\nPlease copy everything above and send it back. Thank you!');
}
</script>
</body>
</html>
WebML Community org

if you could run on both high-power and low-power, that would be very helpful!
(PS: you can run the script by creating an index.html file and then just opening it in chrome or whatever browser you are using).

requested powerPreference: high-performance

adapter: vendor=nvidia arch=ampere device=
subgroup range (reported): min=32 max=128
features available: subgroups=true shader-f16=true

== TEST A: subgroup size + lane mapping (WG=256) ==
runtime subgroup_size = 32 (NVIDIA warps are physically 32 regardless of the reported max)
PASS lane mapping is LINEAR: subgroup_invocation_id == local_invocation_index % 32 for all 256 lanes

== TEST B: butterfly + combine reduction (WG=256, expected total = 32896) ==
PASS butterfly+combine = 32896 (correct). The cross-32-block reduction is right on this GPU.

== TEST C: cross-workgroup last-arriver merge (N=64, 200 iters, expected = 2080) ==
PASS last-arriver merge = 2080 on all 200 runs. Cross-workgroup visibility is fine here.

== TEST D: butterfly vs workgroup-tree parity on pseudo-random data (WG=256) ==
PASS butterfly == workgroup-tree (within FP slack). Subgroup reduction matches the assumption-free reference on this GPU.

== Summary ==
If A/B/D all PASS and C PASSES -> the kernels are correct on your GPU; the issue is float precision / decoding, not the subgroup fix.
If A or B or D FAILS -> the subgroup reduction is wrong on your GPU (lane mapping / butterfly).
If C FAILS or is flaky -> the cross-workgroup merge (flash attention / oproj / down) is the culprit.

Please copy everything above and send it back. Thank you!


requested powerPreference: low-power

adapter: vendor=nvidia arch=ampere device=
subgroup range (reported): min=32 max=128
features available: subgroups=true shader-f16=true

== TEST A: subgroup size + lane mapping (WG=256) ==
runtime subgroup_size = 32 (NVIDIA warps are physically 32 regardless of the reported max)
PASS lane mapping is LINEAR: subgroup_invocation_id == local_invocation_index % 32 for all 256 lanes

== TEST B: butterfly + combine reduction (WG=256, expected total = 32896) ==
PASS butterfly+combine = 32896 (correct). The cross-32-block reduction is right on this GPU.

== TEST C: cross-workgroup last-arriver merge (N=64, 200 iters, expected = 2080) ==
PASS last-arriver merge = 2080 on all 200 runs. Cross-workgroup visibility is fine here.

== TEST D: butterfly vs workgroup-tree parity on pseudo-random data (WG=256) ==
PASS butterfly == workgroup-tree (within FP slack). Subgroup reduction matches the assumption-free reference on this GPU.

== Summary ==
If A/B/D all PASS and C PASSES -> the kernels are correct on your GPU; the issue is float precision / decoding, not the subgroup fix.
If A or B or D FAILS -> the subgroup reduction is wrong on your GPU (lane mapping / butterfly).
If C FAILS or is flaky -> the cross-workgroup merge (flash attention / oproj / down) is the culprit.

Please copy everything above and send it back. Thank you!

I loaded the weights and tried again. Any prompts gives repeating gibberish. But weights do load.

WebML Community org

oh wow, interesting. Very useful info. Thanks! I'll see if I can get this fixed.

Happy to help. We support this direction. It's what we are building towards.

My IntiBot offers the following analysis:

Current status: the weights load, but any prompt produces repeating gibberish. The subgroup/cross-workgroup diagnostic passed on both NVIDIA high-performance and AMD low-power paths, so this probably is not the specific subgroup lane mapping or cross-workgroup merge issue those tests were meant to catch.

The next places I’d look are:

  • Model conversion / weight layout mismatch

    • A transposed matrix, wrong packed layout, wrong shard order, or stale metadata can still load cleanly but produce nonsense logits.
    • Debug path: compare a few layer outputs against a known-good CPU/Python reference using the same prompt tokens. Start with embedding output, then attention q/k/v, attention output, MLP output, final logits.
    • Possible fix: add per-tensor shape/layout validation during load, especially for q/k/v/o projection, MLP gate/up/down, norms, and final lm_head.
  • Tokenizer or chat template mismatch

    • If tokens are wrong, the model may look broken even if kernels are fine.
    • Debug path: log the exact token IDs for a tiny prompt and compare against the official tokenizer. Also test raw completion without chat formatting if possible.
    • Possible fix: make tokenizer/template selection explicit per model variant, and verify BOS/EOS/control tokens.
  • KV cache indexing / position handling

  • Repeating gibberish during decode often points at bad cache writes/reads, off-by-one positions, wrong stride, or RoPE positions not advancing correctly.

    • Debug path: test prefill-only logits for the next token, then compare decode step 1, step 2, step 3 against reference. If prefill is sane but decode diverges immediately, KV/positioning is the prime suspect.
    • Possible fix: audit cache layout as [layer, kv, batch, head, seq, dim] or whatever the implementation expects, and verify the write/read strides match exactly.
  • RoPE / scaling config mismatch

    • Gemma-family models are sensitive to the exact RoPE base/scaling and position conventions.
    • Debug path: compare q/k after RoPE for one layer/head/position against reference. Also test very short prompts where long-context scaling should not matter much.
    • Possible fix: read RoPE params from model config instead of hardcoding, and verify half-dim pairing/order matches the reference implementation.
  • Precision or quantization path

    • f16, packed weights, dequant scales, or accumulation differences can make logits collapse into repetition.
    • Debug path: run the smallest possible model/config in f32 or least-quantized mode if available. Compare final logits before sampling, not just generated text.
    • Possible fix: temporarily force higher-precision accumulation in attention/MLP reductions, then narrow which kernel needs it.
  • Sampling bug

    • If greedy/temperature 0 works but normal sampling fails, then the model path may be fine and the issue is logits processing.
    • Debug path: run deterministic generation: temperature 0, top_k disabled, top_p disabled, fixed seed if sampling remains enabled. Log top-10 logits/probs each step.
  • Possible fix: check softmax stability, repetition penalty, top-k/top-p filtering, NaN handling, and whether logits are being read from the correct buffer.

The most useful repro bundle for the dev would be:

  • browser + version
  • OS
  • exact GPU names and driver versions
  • model/checkpoint/conversion command or source
  • exact prompt and settings
  • first ~50 generated tokens/text
  • whether NVIDIA and AMD produce the same gibberish or different gibberish
  • whether greedy / temperature 0 still repeats
  • whether prefill next-token logits match a reference, if there is a reference path

My suggested shortest debug route: first verify tokenizer IDs, then compare prefill logits against reference, then compare the first few decode steps. If prefill matches but decode diverges, focus on KV cache / RoPE / position indexing. If prefill already diverges, focus on conversion, weight layout, attention/MLP math, and quantization.

I am seeing an issue as well - model responses are all blank.
Here is my debug:
requested powerPreference: high-performance

adapter: vendor=qualcomm arch=adreno-7xx device=
subgroup range (reported): min=64 max=128
features available: subgroups=true shader-f16=true

== TEST A: subgroup size + lane mapping (WG=256) ==
runtime subgroup_size = 128 (NVIDIA warps are physically 32 regardless of the reported max)
PASS lane mapping is LINEAR: subgroup_invocation_id == local_invocation_index % 128 for all 256 lanes
WARN physical subgroup_size is 128 (not 32). The butterfly is designed for this, but it is the case Apple cannot test.

== TEST B: butterfly + combine reduction (WG=256, expected total = 32896) ==
PASS butterfly+combine = 32896 (correct). The cross-32-block reduction is right on this GPU.

== TEST C: cross-workgroup last-arriver merge (N=64, 200 iters, expected = 2080) ==
PASS last-arriver merge = 2080 on all 200 runs. Cross-workgroup visibility is fine here.

== TEST D: butterfly vs workgroup-tree parity on pseudo-random data (WG=256) ==
PASS butterfly == workgroup-tree (within FP slack). Subgroup reduction matches the assumption-free reference on this GPU.

== Summary ==
If A/B/D all PASS and C PASSES -> the kernels are correct on your GPU; the issue is float precision / decoding, not the subgroup fix.
If A or B or D FAILS -> the subgroup reduction is wrong on your GPU (lane mapping / butterfly).
If C FAILS or is flaky -> the cross-workgroup merge (flash attention / oproj / down) is the culprit.

Please copy everything above and send it back. Thank you!

Debug console is spitting out:
[Invalid CommandBuffer from CommandEncoder "compute-dispatch"] is invalid due to a previous error.

  • While calling [Queue].Submit([[Invalid CommandBuffer from CommandEncoder "compute-dispatch"]])
WebML Community org

very useful info! thanks so much.

Sign up or log in to comment