Gemma-4-26B-A4B in the Browser via WebGPU

Gemma-4-26B-A4B-it (MoE, 3.8B active parameters per token) runs in a browser tab via WebGPU at 23 tokens/second. The 20GB GGUF is loaded into WebGPU memory on an AMD Strix Halo iGPU (64GB unified memory).

What This Is

A working setup for running Gemma-4-26B-A4B-it in the browser using wllama (WASM binding for llama.cpp) with a patched WebGPU backend that fixes a buffer aliasing bug in the GLU/GeGLU shader.

This is, to our knowledge as of May 2026, the largest model successfully run in a browser via WebGPU: 20GB of Q5_K_XL weights loaded into 31.5GB of available WebGPU memory on a consumer iGPU.

Key Findings

WebGPU Memory on Strix Halo

  • 31.5GB available to a single Chrome tab (tested empirically)
  • 64GB unified memory, no discrete GPU needed
  • maxBufferSize reports 2GB per buffer, but total allocation far exceeds this
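
One way to measure such a ceiling empirically is to keep allocating fixed-size storage buffers until the device reports an out-of-memory error. The sketch below illustrates that idea; it is not necessarily how memory_test.html works. The function name probeAllocationCeiling and its parameters are hypothetical, and the device is passed in so the same code runs against a real GPUDevice in the browser or a stub elsewhere. Drivers may allocate lazily, so the result should be read as an upper estimate.

```javascript
// Numeric value of GPUBufferUsage.STORAGE, written out so the sketch
// does not depend on the browser-only GPUBufferUsage global.
const STORAGE_USAGE = 0x0080;

// Hypothetical helper: allocate `chunkBytes`-sized buffers until the device
// reports out-of-memory (or `maxChunks` is reached), then release them all.
// Returns the total number of bytes the device accepted.
async function probeAllocationCeiling(device, chunkBytes, maxChunks) {
  const buffers = [];
  let totalBytes = 0;

  for (let i = 0; i < maxChunks; i++) {
    device.pushErrorScope("out-of-memory");
    let buffer = null;
    try {
      buffer = device.createBuffer({ size: chunkBytes, usage: STORAGE_USAGE });
    } catch (err) {
      // Some failures surface as exceptions rather than through the scope.
      await device.popErrorScope();
      break;
    }
    const error = await device.popErrorScope();
    if (error) {
      buffer.destroy();
      break;
    }
    buffers.push(buffer);
    totalBytes += chunkBytes;
  }

  // Free everything so the probe does not starve the rest of the page.
  for (const b of buffers) b.destroy();
  return totalBytes;
}
```

In a browser, the device would come from `await (await navigator.gpu.requestAdapter()).requestDevice()`, and chunkBytes should stay below the adapter's reported maxBufferSize.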

Performance

  • 23 tokens/second decode speed
  • ~2 minutes model loading (20GB via byte-range fetch)
  • Quiet operation: the same model through llama-server Vulkan thrashes the machine; browser WebGPU (D3D12 path) runs silently

The Bug We Fixed

llama.cpp's WebGPU backend (ggml-webgpu.cpp) has a buffer aliasing bug in the GLU shader that crashes all Gemma-4 MoE models. The GeGLU operation binds overlapping regions of the same GPU buffer as separate writable storage bindings. Vulkan allows this; WebGPU forbids it.

The fix: When src0 and src1 tensor views overlap (share the same backing buffer), force the NO_SPLIT shader variant which reads both halves from a single binding with offset computation. This follows the same pattern as PRs #22266 (RMS_NORM_MUL) and #22456 (SSM_SCAN).
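
The decision logic can be sketched as follows. This is an illustration in JavaScript, not the C++ from the patch: the view shape { buffer, offset, size }, the function names, and the "SPLIT" label for the two-binding variant are all assumptions made for the example. It treats two views as overlapping when they sit in the same buffer and their byte ranges intersect; the actual patch may use a broader test.

```javascript
// Hypothetical view descriptor: { buffer, offset, size }, offsets in bytes.
// Two views overlap when they share a backing buffer and their
// half-open byte ranges [offset, offset + size) intersect.
function viewsOverlap(a, b) {
  if (a.buffer !== b.buffer) return false;
  return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}

// Pick the shader variant for a GLU op with two source views.
// "NO_SPLIT" reads both halves through one binding; "SPLIT" (assumed name)
// binds src0 and src1 separately, which is only safe without aliasing.
function selectGluVariant(src0, src1) {
  return viewsOverlap(src0, src1) ? "NO_SPLIT" : "SPLIT";
}
```

For example, two views at offsets 0 and 50 of the same buffer, each 100 bytes long, would select NO_SPLIT, while views in different buffers would keep the separate bindings.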

Files changed:

  • ggml-webgpu-shader-lib.hpp: Added overlap detection to GLU pipeline key
  • ggml-webgpu.cpp: Skip separate src1 binding when overlapping
  • glu.wgsl: Added INPLACE mode for src0/dst overlap case

Quick Start

# 1. Clone this repo
git clone https://huggingface.co/LJTSG/gemma-webgpu

# 2. Split your Gemma GGUF into <2GB chunks
llama-gguf-split --split-max-size 512M /path/to/gemma-4-26B-A4B.gguf ./model_splits/gemma-26b

# 3. Start the server
node serve_gemma.js

# 4. Open http://localhost:8150
# Click "Load Model" → wait for 20GB download → "Generate"
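
llama-gguf-split names its output shards as prefix-00001-of-0000N.gguf, so the page needs the list of shard URLs to fetch. A minimal sketch of building that list is shown here; the function name shardUrls, the /model_splits URL path, and the idea that the caller knows the shard count are assumptions for illustration, and index.html may do this differently.

```javascript
// Hypothetical helper: build the URLs for `count` shards produced by
// llama-gguf-split with the given output prefix.
// Shard indices are 1-based and zero-padded to five digits.
function shardUrls(baseUrl, prefix, count) {
  const pad = (n) => String(n).padStart(5, "0");
  const urls = [];
  for (let i = 1; i <= count; i++) {
    urls.push(`${baseUrl}/${prefix}-${pad(i)}-of-${pad(count)}.gguf`);
  }
  return urls;
}

// Example: the first of three shards under an assumed /model_splits path.
const example = shardUrls("http://localhost:8150/model_splits", "gemma-26b", 3);
console.log(example[0]);
```

The shard count depends on the quantization and the --split-max-size value, so it is taken as a parameter here.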

Requirements:

  • Gemma-4-26B-A4B-it GGUF (Q5_K_XL or Q4_K_M)
  • Node.js
  • Chrome/Edge with WebGPU support
  • GPU with 20+ GB accessible via WebGPU (tested: AMD Strix Halo iGPU, 64GB unified)

Why Browser WebGPU?

On AMD Strix Halo (and likely other unified memory iGPU systems):

  • Vulkan path (llama-server): fights the driver for GPU memory, thrashes, machine runs hot and loud
  • WebGPU path (browser): goes through D3D12, the path AMD optimizes for. Silent, smooth, same speed

In our testing, the browser's managed WebGPU context handled sustained inference on this iGPU better than native Vulkan did.

Files

  • index.html: Test page with Load/Generate buttons
  • serve_gemma.js: Node.js server with Range requests + CORS/COEP/COOP headers
  • memory_test.html: WebGPU memory ceiling allocation test
  • wllama-patch/: The GLU aliasing fix (diff against wllama v3.4.1)
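
For readers who want to see what a server like serve_gemma.js has to do, here is a minimal sketch, assuming a plain Node.js http server. It is not the contents of serve_gemma.js: parseRange, createHandler, and ISOLATION_HEADERS are hypothetical names. It answers single byte-range requests with 206 responses and sends the COOP/COEP headers that make the page cross-origin isolated, which browsers require before exposing SharedArrayBuffer.

```javascript
const http = require("http");
const fs = require("fs");
const path = require("path");

// Headers sent on every response: cross-origin isolation plus permissive CORS.
const ISOLATION_HEADERS = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
  "Access-Control-Allow-Origin": "*",
  "Accept-Ranges": "bytes",
};

// Parse a single-range header such as "bytes=0-1023" or "bytes=500-".
// Returns { start, end } (both inclusive), or null if it cannot be satisfied.
function parseRange(header, fileSize) {
  const m = /^bytes=(\d+)-(\d*)$/.exec(header || "");
  if (!m) return null;
  const start = Number(m[1]);
  const end = m[2] === "" ? fileSize - 1 : Math.min(Number(m[2]), fileSize - 1);
  return start > end ? null : { start, end };
}

// Build a request handler that serves files below rootDir.
function createHandler(rootDir) {
  const root = path.resolve(rootDir);
  return (req, res) => {
    let filePath;
    try {
      filePath = path.join(root, decodeURIComponent(req.url.split("?")[0]));
    } catch (err) {
      res.writeHead(400, ISOLATION_HEADERS).end();
      return;
    }
    // Refuse anything that resolves outside the served directory.
    if (!filePath.startsWith(root + path.sep)) {
      res.writeHead(403, ISOLATION_HEADERS).end();
      return;
    }
    fs.stat(filePath, (err, stat) => {
      if (err || !stat.isFile()) {
        res.writeHead(404, ISOLATION_HEADERS).end();
        return;
      }
      const type = filePath.endsWith(".html") ? "text/html" : "application/octet-stream";
      const base = { ...ISOLATION_HEADERS, "Content-Type": type };
      if (!req.headers.range) {
        res.writeHead(200, { ...base, "Content-Length": stat.size });
        fs.createReadStream(filePath).pipe(res);
        return;
      }
      const range = parseRange(req.headers.range, stat.size);
      if (!range) {
        res.writeHead(416, { ...base, "Content-Range": `bytes */${stat.size}` }).end();
        return;
      }
      res.writeHead(206, {
        ...base,
        "Content-Range": `bytes ${range.start}-${range.end}/${stat.size}`,
        "Content-Length": range.end - range.start + 1,
      });
      fs.createReadStream(filePath, range).pipe(res);
    });
  };
}

// To run it on the port used in Quick Start:
// http.createServer(createHandler(__dirname)).listen(8150);
```

The listen call is left commented out so the block can be loaded without binding a port.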

Related Work

  • LJTSG/mamba-webgpu: First browser-native Mamba/SSM inference (hand-written WGSL shaders)
  • wllama β€” WASM binding for llama.cpp with WebGPU support
  • Llamas on the Web β€” WebGPU backend for llama.cpp

License

Apache 2.0

Credits

Built by Joshua (@LJTSG) and Claude (Anthropic Opus 4.6). Model: google/gemma-4-26b-a4b-it. Runtime: wllama by ngxson, with patched WebGPU backend.
