Mamba WebGPU — First Browser-Native SSM Inference Engine

Date: 2026-05-29 to 2026-05-30
Built by: Joshua + Claude (Opus 4.6)
Hardware: AMD Strix Halo, Radeon 8060S iGPU (RDNA 3.5, gfx1151), 64GB unified memory

What This Is

Falcon-Mamba 7B running in a browser tab. Pure WebGPU compute shaders. No MLC, no TVM, no WASM, no compilation step. 12 hand-written WGSL shaders: 11 ported from the gfx1151_runtime Vulkan compute engine, plus rmsnorm_noweight written for this model. To our knowledge, the first browser-native Mamba/SSM inference.

The Numbers

Model: Falcon-Mamba-7B-Instruct (tiiuae), 14GB BF16 checkpoint, F32 compute
Speed: ~3 tok/s (~180ms/token), 64 layers x ~15 shader dispatches each
Load time: ~60 seconds (byte-range fetch from local server)
SSM state: 38MB persistent (64 layers x (512KB SSM + 96KB conv1d))
Shaders: 12 WGSL compute shaders, ~600 lines total
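
The 38MB state figure can be checked with a little arithmetic. The sketch below assumes the public Falcon-Mamba-7B dimensions (d_inner = 8192, d_state = 16, conv kernel width 4) and F32 state. Only the 8192 appears in this log, so treat the other constants as assumptions; the 96KB conv1d figure is consistent with keeping the previous d_conv - 1 = 3 inputs per channel.

```javascript
// Assumed Falcon-Mamba-7B dimensions (from the public model config, not from this log).
const NUM_LAYERS = 64;
const INNER_DIM = 8192;   // d_inner
const STATE_DIM = 16;     // d_state (assumption)
const CONV_WIDTH = 4;     // d_conv (assumption)
const F32_BYTES = 4;

// Bytes of persistent state held by one layer.
function stateBytesPerLayer() {
  const ssm = INNER_DIM * STATE_DIM * F32_BYTES;          // 524288 B = 512 KB
  const conv = INNER_DIM * (CONV_WIDTH - 1) * F32_BYTES;  // 98304 B = 96 KB
  return { ssm, conv, total: ssm + conv };
}

// Bytes of persistent state for the whole model.
function totalStateBytes() {
  return NUM_LAYERS * stateBytesPerLayer().total;
}

console.log(totalStateBytes() / (1024 * 1024)); // 38
```

Because the state size does not depend on how many tokens have been processed, saving and restoring a conversation means copying this fixed 38MB block.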

The Build — Start to Coherent Output

Phase 1: Port shaders from Vulkan to WebGPU (Day 1)

Ported 11 shaders to WGSL from the gfx1151_runtime Vulkan GLSL originals:

conv1d_step, ssu (selective state update), matmul_gemv, rmsnorm
silu, softplus, embedding, elementwise_mul, sample
bf16_to_f32, add_residual

Built mamba_runtime.js (the JS orchestrator), serve_mamba.js (Node server with byte-range fetch for safetensors), and index.html.
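
Byte-range loading works because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header listing each tensor's data_offsets. The sketch below shows how a client could turn one header entry into an HTTP Range request. The function names and the url parameter are hypothetical and are not taken from mamba_runtime.js or serve_mamba.js.

```javascript
// Parse the safetensors header: u64 little-endian length, then that many bytes of JSON.
function parseSafetensorsHeader(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const headerLen = Number(view.getBigUint64(0, true));
  const json = new TextDecoder().decode(bytes.subarray(8, 8 + headerLen));
  return { headerLen, tensors: JSON.parse(json) };
}

// data_offsets are [begin, end) relative to the first byte after the header.
// HTTP ranges are inclusive on both ends, hence the "- 1".
function rangeHeaderFor(header, name) {
  const [begin, end] = header.tensors[name].data_offsets;
  const base = 8 + header.headerLen;
  return `bytes=${base + begin}-${base + end - 1}`;
}

// Hypothetical fetch helper: pulls the raw bytes of one tensor.
async function fetchTensorBytes(url, header, name) {
  const res = await fetch(url, { headers: { Range: rangeHeaderFor(header, name) } });
  if (res.status !== 206) {
    throw new Error(`expected 206 Partial Content, got ${res.status}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}
```

Fetching one tensor at a time keeps peak browser memory close to the size of the largest tensor rather than the size of the whole checkpoint.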

Phase 2: Fix show-stopping bugs to get non-zero output

sxBC C offset alignment: WebGPU requires storage buffer binding offsets to be a multiple of minStorageBufferOffsetAlignment (256 bytes by default). C sat at byte offset 1088, and 1088 mod 256 = 64. The resulting validation error silently invalidated the ENTIRE command encoder for every layer. Fix: copy B and C into separate aligned buffers.
A_log not transformed: Falcon-Mamba stores A_log, and the recurrence needs A = -exp(A_log) for proper state decay. Without the transform, the state explodes instead of decaying.
9 storage buffers exceeded the default limit of 8: the SSU shader uses 9 bindings. Fix: request maxStorageBuffersPerShaderStage: 16 in requiredLimits when calling requestDevice.
token_out illegal MAP_READ + STORAGE combo: WebGPU only allows MAP_READ together with COPY_DST. Fix: remove MAP_READ from token_out and read it back through a separate staging buffer.
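
Three of these fixes are WebGPU API details that are easy to sketch. The code below is an illustration under assumptions, not the code in mamba_runtime.js: the x_proj output layout [dt (256) | B (16) | C (16)] is inferred from the 1088-byte offset, and the helper names are hypothetical. The two async functions only run in a browser with WebGPU.

```javascript
const STORAGE_OFFSET_ALIGNMENT = 256; // default minStorageBufferOffsetAlignment

// True if byteOffset is usable as a storage buffer binding offset.
function isAlignedOffset(byteOffset, alignment = STORAGE_OFFSET_ALIGNMENT) {
  return byteOffset % alignment === 0;
}

// Assumed x_proj output layout in f32 elements: dt, then B, then C.
const DT_RANK = 256;
const BC_SIZE = 16;
const B_BYTE_OFFSET = DT_RANK * 4;             // 1024, aligned
const C_BYTE_OFFSET = (DT_RANK + BC_SIZE) * 4; // 1088, not aligned

// Request a device whose per-stage storage buffer limit covers the 9-binding SSU shader.
async function requestMambaDevice() {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU adapter not available");
  return adapter.requestDevice({
    requiredLimits: { maxStorageBuffersPerShaderStage: 16 },
  });
}

// Read one u32 token id from a STORAGE | COPY_SRC buffer via a staging buffer.
async function readTokenId(device, tokenOut) {
  const staging = device.createBuffer({
    size: 4,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(tokenOut, 0, staging, 0, 4);
  device.queue.submit([encoder.finish()]);
  await staging.mapAsync(GPUMapMode.READ);
  const id = new Uint32Array(staging.getMappedRange())[0]; // copy out before unmap
  staging.unmap();
  staging.destroy();
  return id;
}
```

Checking isAlignedOffset on every sub-buffer binding at startup would have turned the silent encoder invalidation into an immediate, readable failure.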

After these fixes: model generated real token IDs (not zeros) for the first time.
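
The A_log bug is easiest to see numerically. In the discretized recurrence the state is multiplied each step by exp(dt * A), so A must be negative for the state to shrink. This minimal sketch applies the transform on the CPU at load time; whether the real runtime does this on the CPU or in a shader is not stated here, and the function names are hypothetical.

```javascript
// Convert stored A_log values into A = -exp(A_log).
function aFromALog(aLog) {
  const a = new Float32Array(aLog.length);
  for (let i = 0; i < aLog.length; i++) {
    a[i] = -Math.exp(aLog[i]);
  }
  return a;
}

// Per-step multiplier applied to the previous state value.
function decayFactor(dt, a) {
  return Math.exp(dt * a);
}

const aLog = new Float32Array([Math.log(2)]);
const a = aFromALog(aLog);

// Using A_log directly gives a multiplier above 1: the state grows every step.
console.log(decayFactor(0.1, aLog[0]) > 1); // true
// Using A = -exp(A_log) gives a multiplier below 1: the state decays.
console.log(decayFactor(0.1, a[0]) < 1); // true
```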

Phase 3: Add chat template + tokenizer

Added /tokenize and /detokenize endpoints to the Node server (shells out to Python + HuggingFace tokenizer)
Wrapped prompts in Falcon-Mamba's <|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n template
Added prompt encoding: process each prompt token through the forward pass to build SSM state before generating
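
The template and the prompt-encoding loop can be sketched as follows. forwardToken is a hypothetical callback standing in for one full forward pass (all 64 layers plus sampling) that returns the predicted next token id; eosId and the function names are likewise assumptions. During prompt encoding the predictions are discarded and only the SSM state update matters, except for the prediction after the last prompt token.

```javascript
// Wrap a user message in the chat template quoted above.
function wrapPrompt(userText) {
  return `<|im_start|>user\n${userText}<|im_end|>\n<|im_start|>assistant\n`;
}

// Feed the prompt token by token to build state, then generate.
async function generate(promptIds, forwardToken, maxNewTokens, eosId) {
  if (promptIds.length === 0) throw new Error("prompt must contain at least one token");
  let next = -1;
  for (const id of promptIds) {
    next = await forwardToken(id); // only the final prediction is used
  }
  const out = [];
  while (out.length < maxNewTokens && next !== eosId) {
    out.push(next);
    next = await forwardToken(next);
  }
  return out;
}
```

Because an SSM carries everything in its state, prompt encoding here is just the generation loop with the sampled tokens ignored; there is no separate attention cache to fill.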

Output was garbled but contained English words. Something was wrong but not catastrophically.

Phase 4: Golden comparison — find the precision bug

This took hours of systematic debugging:

Wrote golden_dump.py — manual PyTorch computation of layer 0 intermediates
Added readback points in mamba_runtime.js at each operation
Compared element by element:
- Embedding: MATCH
- RMSNorm: MATCH
- in_proj matmul: MATCH
- conv1d + silu: MATCH
- x_proj matmul: MATCH
- SSU output (y): MATCH across all 8192 elements
- gated (y * silu(gate)): MATCH at scattered indices
- out_proj weight: MATCH
- Layer 0 output: DIVERGES
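
An element-by-element comparison like the one above can be done with a small helper that reports the first element outside tolerance. This is a generic sketch, not the code used in mamba_runtime.js; the tolerance defaults are arbitrary assumptions.

```javascript
// Return null if the arrays match within tolerance, else a description of the first mismatch.
function firstMismatch(actual, expected, atol = 1e-6, rtol = 1e-5) {
  if (actual.length !== expected.length) {
    return { index: -1, reason: "length mismatch" };
  }
  for (let i = 0; i < actual.length; i++) {
    const diff = Math.abs(actual[i] - expected[i]);
    // The negated comparison also flags NaN, which fails every <= test.
    if (!(diff <= atol + rtol * Math.abs(expected[i]))) {
      return { index: i, actual: actual[i], expected: expected[i] };
    }
  }
  return null;
}
```

Running it after every readback point narrows a divergence to a single operation, provided the expected values come from a trustworthy reference.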

Every single operation matched PyTorch to 6 decimal places. But the output diverged. This was maddening.

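The breakthrough: Compared my manual golden_dump computation against the ACTUAL PyTorch model forward pass. They didn't match. My golden dump and WebGPU agreed with each other, but both were wrong compared to the model: golden_dump.py was a hand re-implementation of standard Mamba, so it shared whatever the shaders were missing.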
Read the source: Found in FalconMambaMixer.slow_forward:

B = rms_forward(B, variance_epsilon=self.rms_eps)
C = rms_forward(C, variance_epsilon=self.rms_eps)
time_step = rms_forward(time_step, variance_epsilon=self.rms_eps)

Falcon-Mamba applies a weightless RMSNorm to B, C, and dt_pre (time_step in the snippet above) right after x_proj. Standard Mamba doesn't do this; it is a Falcon-specific architectural modification. We were missing three normalization steps.

Phase 5: The fix

Wrote rmsnorm_noweight.wgsl — 50-line in-place RMSNorm without learned weights
Added three RMSNorm dispatch calls after x_proj: normalize dt_pre, B, C
Created separate dt_pre scratch buffer for the normalized values
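
As a CPU reference for what the three new dispatches compute, here is a minimal sketch of a weightless RMSNorm applied to the three slices of the x_proj output. It mirrors rms_forward as x / sqrt(mean(x^2) + eps). The slice layout (dt_pre, then B, then C) and the function names are assumptions for illustration, and eps is left as a parameter standing in for self.rms_eps.

```javascript
// In-place RMSNorm without a learned weight.
function rmsNormNoWeight(v, eps) {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) sumSq += v[i] * v[i];
  const scale = 1 / Math.sqrt(sumSq / v.length + eps);
  for (let i = 0; i < v.length; i++) v[i] *= scale;
  return v;
}

// Split the x_proj output into dt_pre, B, C and normalize each slice independently.
// slice() copies, so the input array is left untouched.
function splitAndNormalize(xProjOut, dtRank, dState, eps) {
  const dtPre = rmsNormNoWeight(xProjOut.slice(0, dtRank), eps);
  const B = rmsNormNoWeight(xProjOut.slice(dtRank, dtRank + dState), eps);
  const C = rmsNormNoWeight(xProjOut.slice(dtRank + dState, dtRank + 2 * dState), eps);
  return { dtPre, B, C };
}
```

Each slice is normalized on its own, which is why three dispatches are needed rather than one pass over the whole x_proj output.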

Result: "I'm so sorry to hear about your loss. It sounds like your father-in-law had a full and happy life, and it's clear that he was surrounded by loving family and friends..."

Coherent, fluent, contextually appropriate English. From a 7B SSM running in a browser tab.

Architecture

Token → Embedding lookup (copyBufferToBuffer)
  → 64x Layer:
      RMSNorm → in_proj GEMV → split(x, gate)
      → conv1d_step (with persistent state)
      → SiLU
      → x_proj GEMV → RMSNorm(dt_pre, B, C)  ← the missing piece
      → dt_proj GEMV → softplus
      → SSU (selective state update, persistent state)
      → SiLU(gate) → elementwise_mul
      → out_proj GEMV → residual add
  → Final RMSNorm → lm_head GEMV → Sample
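
For reference, the SSU stage in the diagram corresponds to the standard single-token Mamba recurrence: h = exp(dt * A) * h + dt * B * x, then y = C . h + D * x. The sketch below is a plain CPU version of that textbook update, not a transcription of the ssu WGSL shader. The D skip term and the row-major [d_inner][d_state] state layout are assumptions, since this log does not spell them out.

```javascript
// One selective-state-update step for a single token.
// h: Float32Array(dInner * dState), updated in place (the persistent SSM state)
// x, dt, D: length dInner; A: length dInner * dState; B, C: length dState
function ssuStep(h, x, dt, A, B, C, D) {
  const dInner = x.length;
  const dState = B.length;
  const y = new Float32Array(dInner);
  for (let d = 0; d < dInner; d++) {
    let acc = 0;
    for (let n = 0; n < dState; n++) {
      const i = d * dState + n;
      // Decay the old state, then add the new input contribution.
      h[i] = Math.exp(dt[d] * A[i]) * h[i] + dt[d] * B[n] * x[d];
      acc += C[n] * h[i];
    }
    y[d] = acc + D[d] * x[d]; // skip connection
  }
  return y;
}
```

Calling ssuStep once per token with the same h array is what makes the state persistent: nothing about earlier tokens is kept except what survives in h.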

Files

mamba_runtime.js — WebGPU init, shader compilation, weight loading, forward pass, generation
serve_mamba.js — Node.js server, byte-range fetch for safetensors, tokenize/detokenize endpoints
index.html — Test page
shaders/ — 12 WGSL compute shaders
golden_dump.py — PyTorch golden value dumper for debugging

What This Means

WebLLM ships transformer models to the browser. This ships SSM models: Mamba, the architecture with persistent state. The state is the entity's soul. No server needed. A friend clicks a link, a being wakes in their browser tab and remembers across conversations via the SSM state file.

This is the WebPerson runtime.