Gemma-4-26B-A4B in the Browser via WebGPU
Gemma-4-26B-A4B-it (MoE, 3.8B active params per token) running in a browser tab via WebGPU at 23 tokens/second. 20GB GGUF loaded into WebGPU memory on AMD Strix Halo iGPU (64GB unified memory).
What This Is
A working setup for running Gemma-4-26B-A4B-it in the browser using wllama (WASM binding for llama.cpp) with a patched WebGPU backend that fixes a buffer aliasing bug in the GLU/GeGLU shader.
This is, to our knowledge as of May 2026, the largest model successfully run in a browser via WebGPU β 20GB of Q5_K_XL weights loaded into 31.5GB of available WebGPU memory on a consumer iGPU.
Key Findings
WebGPU Memory on Strix Halo
- 31.5 GB available to a single Chrome tab (tested empirically)
- 64GB unified memory, no discrete GPU needed
maxBufferSizereports 2GB per buffer, but total allocation far exceeds this
Performance
- 23 tokens/second decode speed
- ~2 minutes model loading (20GB via byte-range fetch)
- Quiet operation β same model through llama-server Vulkan thrashes the machine; browser WebGPU (D3D12 path) runs silently
The Bug We Fixed
llama.cpp's WebGPU backend (ggml-webgpu.cpp) has a buffer aliasing bug in the GLU shader that crashes all Gemma-4 MoE models. The GeGLU operation binds overlapping regions of the same GPU buffer as separate writable storage bindings β Vulkan allows this, WebGPU forbids it.
The fix: When src0 and src1 tensor views overlap (share the same backing buffer), force the NO_SPLIT shader variant which reads both halves from a single binding with offset computation. This follows the same pattern as PRs #22266 (RMS_NORM_MUL) and #22456 (SSM_SCAN).
Files changed:
ggml-webgpu-shader-lib.hppβ Added overlap detection to GLU pipeline keyggml-webgpu.cppβ Skip separate src1 binding when overlappingglu.wgslβ Added INPLACE mode for src0/dst overlap case
Quick Start
# 1. Clone this repo
git clone https://huggingface.co/LJTSG/gemma-webgpu
# 2. Split your Gemma GGUF into <2GB chunks
llama-gguf-split --split-max-size 512M /path/to/gemma-4-26B-A4B.gguf ./model_splits/gemma-26b
# 3. Start the server
node serve_gemma.js
# 4. Open http://localhost:8150
# Click "Load Model" β wait for 20GB download β "Generate"
Requirements:
- Gemma-4-26B-A4B-it GGUF (Q5_K_XL or Q4_K_M)
- Node.js
- Chrome/Edge with WebGPU support
- GPU with 20+ GB accessible via WebGPU (tested: AMD Strix Halo iGPU, 64GB unified)
Why Browser WebGPU?
On AMD Strix Halo (and likely other unified memory iGPU systems):
- Vulkan path (llama-server): fights the driver for GPU memory, thrashes, machine runs hot and loud
- WebGPU path (browser): goes through D3D12, the path AMD optimizes for. Silent, smooth, same speed
The browser's managed WebGPU context is genuinely better for sustained inference on iGPU hardware than native Vulkan.
Files
index.htmlβ Test page with Load/Generate buttonsserve_gemma.jsβ Node.js server with Range requests + CORS/COEP/COOP headersmemory_test.htmlβ WebGPU memory ceiling allocation testwllama-patch/β The GLU aliasing fix (diff against wllama v3.4.1)
Related Work
- LJTSG/mamba-webgpu β First browser-native Mamba/SSM inference (hand-written WGSL shaders)
- wllama β WASM binding for llama.cpp with WebGPU support
- Llamas on the Web β WebGPU backend for llama.cpp
License
Apache 2.0
Credits
Built by Joshua (@LJTSG) and Claude (Anthropic Opus 4.6). Model: google/gemma-4-26b-a4b-it. Runtime: wllama by ngxson, with patched WebGPU backend.