	---
	library_name: wllama
	tags:
	- gemma
	- gemma-4
	- webgpu
	- browser-inference
	- mixture-of-experts
	- moe
	- strix-halo
	- unified-memory
	- wllama
	- first-of-its-kind
	language:
	- en
	license: apache-2.0
	pipeline_tag: text-generation
	base_model: google/gemma-4-26b-a4b-it
	---

	# Gemma-4-26B-A4B in the Browser via WebGPU

	Gemma-4-26B-A4B-it (MoE, 3.8B active params per token) running in a browser tab via WebGPU at 23 tokens/second. The 20GB GGUF is loaded into WebGPU memory on an AMD Strix Halo iGPU (64GB unified memory).

	## What This Is

	A working setup for running Gemma-4-26B-A4B-it in the browser using [wllama](https://github.com/ngxson/wllama) (WASM binding for llama.cpp) with a patched WebGPU backend that fixes a buffer aliasing bug in the GLU/GeGLU shader.

	This is, to our knowledge as of May 2026, the largest model successfully run in a browser via WebGPU — 20GB of Q5_K_XL weights loaded into 31.5GB of available WebGPU memory on a consumer iGPU.

	## Key Findings

	### WebGPU Memory on Strix Halo
	- 31.5 GB available to a single Chrome tab (tested empirically)
	- 64GB unified memory, no discrete GPU needed
	- `maxBufferSize` reports 2GB per buffer, but total allocation far exceeds this
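
WebGPU has no query for total available memory, which is why the ceiling above had to be measured. The sketch below shows one way such a probe could work. It is not the contents of `memory_test.html`, and the name `probeCeiling` and its parameters are assumptions for illustration. It allocates fixed-size storage buffers, each below the per-buffer limit, until the device reports an out-of-memory error.

```javascript
// Numeric value of GPUBufferUsage.STORAGE, inlined so the sketch also runs outside a browser.
const STORAGE_USAGE = 0x0080;

// Hypothetical helper: `device` is a GPUDevice (or any object exposing
// pushErrorScope, createBuffer, and popErrorScope with the same behavior).
async function probeCeiling(device, chunkBytes, maxChunks) {
  const buffers = [];
  for (let i = 0; i < maxChunks; i++) {
    // Capture out-of-memory errors raised by the next createBuffer call.
    device.pushErrorScope("out-of-memory");
    const buffer = device.createBuffer({ size: chunkBytes, usage: STORAGE_USAGE });
    const error = await device.popErrorScope();
    if (error) {
      buffer.destroy(); // the failed buffer is invalid; destroying it is harmless
      break;
    }
    buffers.push(buffer);
  }
  const result = { chunks: buffers.length, allocatedBytes: buffers.length * chunkBytes };
  // Release everything so the probe does not hold on to GPU memory.
  for (const b of buffers) b.destroy();
  return result;
}
```

In a browser this could be called as `probeCeiling(device, 128 * 1024 * 1024, 512)` with a `device` obtained from `navigator.gpu.requestAdapter()` and `adapter.requestDevice()`. Drivers may defer committing memory until a buffer is first used, so a probe like this can overestimate the usable ceiling.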

	### Performance
	- 23 tokens/second decode speed
	- ~2 minutes model loading (20GB via byte-range fetch)
	- Quiet operation — same model through llama-server Vulkan thrashes the machine; browser WebGPU (D3D12 path) runs silently

	### The Bug We Fixed
	llama.cpp's WebGPU backend (`ggml-webgpu.cpp`) has a buffer aliasing bug in the GLU shader that crashes all Gemma-4 MoE models. The GeGLU operation binds overlapping regions of the same GPU buffer as separate writable storage bindings — Vulkan allows this, WebGPU forbids it.

	The fix: when the `src0` and `src1` tensor views overlap (share the same backing buffer), force the NO_SPLIT shader variant, which reads both halves from a single binding using offset computation. This follows the same pattern as PRs [#22266](https://github.com/ggml-org/llama.cpp/pull/22266) (RMS_NORM_MUL) and [#22456](https://github.com/ggml-org/llama.cpp/pull/22456) (SSM_SCAN).

	Files changed:
	- `ggml-webgpu-shader-lib.hpp` — Added overlap detection to GLU pipeline key
	- `ggml-webgpu.cpp` — Skip separate src1 binding when overlapping
	- `glu.wgsl` — Added INPLACE mode for src0/dst overlap case
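
The overlap check behind the fix can be sketched as follows. This is an illustration in JavaScript, not the C++ in the patch: the view shape `{ buffer, offset, size }`, the function names, and the `"SPLIT"` label for the two-binding variant are assumptions. Only `NO_SPLIT` is a name taken from the description above. Whether the patch compares byte ranges or only buffer identity is not stated here; this sketch checks both.

```javascript
// Hypothetical tensor view: a byte range inside a backing GPU buffer.
// Two views alias when they share a buffer and their byte ranges intersect.
function viewsOverlap(a, b) {
  return (
    a.buffer === b.buffer &&
    a.offset < b.offset + b.size &&
    b.offset < a.offset + a.size
  );
}

// Pick the shader variant: overlapping inputs must go through a single
// binding (NO_SPLIT), otherwise two separate bindings are safe.
function pickGluVariant(src0, src1) {
  return viewsOverlap(src0, src1) ? "NO_SPLIT" : "SPLIT";
}
```

Making the overlap result part of the pipeline key, as the file list above describes, lets the two shader variants be cached separately.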

	## Quick Start

	```bash
	# 1. Clone this repo
	git clone https://huggingface.co/LJTSG/gemma-webgpu

	# 2. Split your Gemma GGUF into <2GB chunks
	llama-gguf-split --split-max-size 512M /path/to/gemma-4-26B-A4B.gguf ./model_splits/gemma-26b

	# 3. Start the server
	node serve_gemma.js

	# 4. Open http://localhost:8150
	# Click "Load Model" → wait for 20GB download → "Generate"
	```
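
To sanity-check step 2, the sketch below estimates how many shards to expect and lists their file names. Both function names are hypothetical. The `<prefix>-00001-of-000NN.gguf` pattern is assumed from the usual `llama-gguf-split` output and should be checked against the files actually produced. The real shard count can exceed the estimate because splits fall on tensor boundaries.

```javascript
// Lower-bound estimate of the shard count for a given size cap.
function estimateShardCount(totalBytes, maxShardBytes) {
  return Math.ceil(totalBytes / maxShardBytes);
}

// Assumed naming pattern: <prefix>-00001-of-00040.gguf, 1-based, zero-padded to 5 digits.
function shardNames(prefix, count) {
  const pad = (n) => String(n).padStart(5, "0");
  return Array.from(
    { length: count },
    (_, i) => `${prefix}-${pad(i + 1)}-of-${pad(count)}.gguf`
  );
}

// Example: a 20GB file with a 512M cap (decimal units assumed).
const expected = estimateShardCount(20e9, 512e6);
console.log(expected, shardNames("gemma-26b", expected)[0]);
```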

	Requirements:
	- Gemma-4-26B-A4B-it GGUF (Q5_K_XL or Q4_K_M)
	- Node.js
	- Chrome/Edge with WebGPU support
	- GPU with 20+ GB accessible via WebGPU (tested: AMD Strix Halo iGPU, 64GB unified)
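
A quick way to check the browser requirement is to ask the adapter for its limits. The helper below is a sketch with a hypothetical name, `describeWebGPU`. It takes the `navigator.gpu` object as a parameter so it can also be exercised outside a browser. It reports the per-buffer limits only, since WebGPU does not expose total memory.

```javascript
// Hypothetical helper: pass navigator.gpu when running in a browser.
async function describeWebGPU(gpu) {
  if (!gpu) {
    return { supported: false, reason: "navigator.gpu is undefined" };
  }
  const adapter = await gpu.requestAdapter();
  if (!adapter) {
    return { supported: false, reason: "no suitable GPU adapter" };
  }
  // Both are standard WebGPU limit names, in bytes.
  const { maxBufferSize, maxStorageBufferBindingSize } = adapter.limits;
  return { supported: true, maxBufferSize, maxStorageBufferBindingSize };
}
```

In the browser console: `describeWebGPU(navigator.gpu).then(console.log)`.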

	## Why Browser WebGPU?

	On AMD Strix Halo (and likely other unified memory iGPU systems):
	- Vulkan path (llama-server): fights the driver for GPU memory, thrashes, machine runs hot and loud
	- WebGPU path (browser): goes through D3D12, the path AMD optimizes for. Silent, smooth, same speed

	In our testing on this hardware, the browser's managed WebGPU context handled sustained inference better than native Vulkan: same speed, without the thrashing.

	## Files

	- `index.html` — Test page with Load/Generate buttons
	- `serve_gemma.js` — Node.js server with Range requests + CORS/COEP/COOP headers
	- `memory_test.html` — WebGPU memory ceiling allocation test
	- `wllama-patch/` — The GLU aliasing fix (diff against wllama v3.4.1)
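
For readers writing their own server, the sketch below shows the two pieces `serve_gemma.js` is described as providing: Range handling and the CORS/COEP/COOP headers. It is not the file's actual code, and the names `parseRange`, `planResponse`, and `ISOLATION_HEADERS` are assumptions. COOP and COEP make the page cross-origin isolated, which browsers require before they expose `SharedArrayBuffer`.

```javascript
// Headers sent on every response.
const ISOLATION_HEADERS = {
  "Access-Control-Allow-Origin": "*",
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Parse a single-range header such as "bytes=0-99", "bytes=900-" or "bytes=-100".
// Returns { start, end } (both inclusive) or null if the range is unusable.
function parseRange(header, size) {
  const m = /^bytes=(\d*)-(\d*)$/.exec(header || "");
  if (!m || (m[1] === "" && m[2] === "")) return null;
  let start;
  let end;
  if (m[1] === "") {
    // Suffix form: the last N bytes of the file.
    const n = Number(m[2]);
    if (n === 0) return null;
    start = Math.max(size - n, 0);
    end = size - 1;
  } else {
    start = Number(m[1]);
    end = m[2] === "" ? size - 1 : Math.min(Number(m[2]), size - 1);
  }
  if (start > end || start >= size) return null;
  return { start, end };
}

// Decide status code and headers for a file of `size` bytes.
function planResponse(rangeHeader, size) {
  const base = { ...ISOLATION_HEADERS, "Accept-Ranges": "bytes" };
  if (!rangeHeader) {
    return { status: 200, range: null, headers: { ...base, "Content-Length": String(size) } };
  }
  const range = parseRange(rangeHeader, size);
  if (!range) {
    return { status: 416, range: null, headers: { ...base, "Content-Range": `bytes */${size}` } };
  }
  return {
    status: 206,
    range,
    headers: {
      ...base,
      "Content-Range": `bytes ${range.start}-${range.end}/${size}`,
      "Content-Length": String(range.end - range.start + 1),
    },
  };
}
```

Inside an `http.createServer` handler, the result can be passed to `res.writeHead(plan.status, plan.headers)`, and the body streamed with `fs.createReadStream(path, plan.range || {})`, whose `start` and `end` options are also inclusive.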

	## Related Work

	- [LJTSG/mamba-webgpu](https://huggingface.co/LJTSG/mamba-webgpu) — First browser-native Mamba/SSM inference (hand-written WGSL shaders)
	- [wllama](https://github.com/ngxson/wllama) — WASM binding for llama.cpp with WebGPU support
	- [Llamas on the Web](https://reeselevine.github.io/llamas-on-the-web/) — WebGPU backend for llama.cpp

	## License

	Apache 2.0

	## Credits

	Built by Joshua ([@LJTSG](https://huggingface.co/LJTSG)) and Claude (Anthropic Opus 4.6).
	Model: [google/gemma-4-26b-a4b-it](https://huggingface.co/google/gemma-4-26b-a4b-it).
	Runtime: [wllama](https://github.com/ngxson/wllama) by ngxson, with patched WebGPU backend.