Gemma-4-26B-A4B in the Browser via WebGPU

Gemma-4-26B-A4B-it (MoE, 3.8B active params per token) running in a browser tab via WebGPU at 23 tokens/second. 20GB GGUF loaded into WebGPU memory on AMD Strix Halo iGPU (64GB unified memory).

What This Is

A working setup for running Gemma-4-26B-A4B-it in the browser using wllama (WASM binding for llama.cpp) with a patched WebGPU backend that fixes a buffer aliasing bug in the GLU/GeGLU shader.

This is, to our knowledge as of May 2026, the largest model successfully run in a browser via WebGPU — 20GB of Q5_K_XL weights loaded into 31.5GB of available WebGPU memory on a consumer iGPU.

Key Findings

WebGPU Memory on Strix Halo

31.5 GB available to a single Chrome tab (tested empirically)
64GB unified memory, no discrete GPU needed
maxBufferSize reports 2GB per buffer, but total allocation far exceeds this

Performance

23 tokens/second decode speed
~2 minutes model loading (20GB via byte-range fetch)
Quiet operation — same model through llama-server Vulkan thrashes the machine; browser WebGPU (D3D12 path) runs silently

The Bug We Fixed

llama.cpp's WebGPU backend (ggml-webgpu.cpp) has a buffer aliasing bug in the GLU shader that crashes all Gemma-4 MoE models. The GeGLU operation binds overlapping regions of the same GPU buffer as separate writable storage bindings — Vulkan allows this, WebGPU forbids it.

The fix: When src0 and src1 tensor views overlap (share the same backing buffer), force the NO_SPLIT shader variant which reads both halves from a single binding with offset computation. This follows the same pattern as PRs #22266 (RMS_NORM_MUL) and #22456 (SSM_SCAN).

Files changed:

ggml-webgpu-shader-lib.hpp — Added overlap detection to GLU pipeline key
ggml-webgpu.cpp — Skip separate src1 binding when overlapping
glu.wgsl — Added INPLACE mode for src0/dst overlap case

Quick Start

# 1. Clone this repo
git clone https://huggingface.co/LJTSG/gemma-webgpu

# 2. Split your Gemma GGUF into <2GB chunks
llama-gguf-split --split-max-size 512M /path/to/gemma-4-26B-A4B.gguf ./model_splits/gemma-26b

# 3. Start the server
node serve_gemma.js

# 4. Open http://localhost:8150
# Click "Load Model" → wait for 20GB download → "Generate"

Requirements:

Gemma-4-26B-A4B-it GGUF (Q5_K_XL or Q4_K_M)
Node.js
Chrome/Edge with WebGPU support
GPU with 20+ GB accessible via WebGPU (tested: AMD Strix Halo iGPU, 64GB unified)

Why Browser WebGPU?

On AMD Strix Halo (and likely other unified memory iGPU systems):

Vulkan path (llama-server): fights the driver for GPU memory, thrashes, machine runs hot and loud
WebGPU path (browser): goes through D3D12, the path AMD optimizes for. Silent, smooth, same speed

The browser's managed WebGPU context is genuinely better for sustained inference on iGPU hardware than native Vulkan.

Files

index.html — Test page with Load/Generate buttons
serve_gemma.js — Node.js server with Range requests + CORS/COEP/COOP headers
memory_test.html — WebGPU memory ceiling allocation test
wllama-patch/ — The GLU aliasing fix (diff against wllama v3.4.1)

Related Work

LJTSG/mamba-webgpu — First browser-native Mamba/SSM inference (hand-written WGSL shaders)
wllama — WASM binding for llama.cpp with WebGPU support
Llamas on the Web — WebGPU backend for llama.cpp

License

Apache 2.0

Credits

Built by Joshua (@LJTSG) and Claude (Anthropic Opus 4.6). Model: google/gemma-4-26b-a4b-it. Runtime: wllama by ngxson, with patched WebGPU backend.

Downloads last month: -; Downloads are not tracked for this model. How to track