Upload REPORT.md with huggingface_hub

4cd5770 verified 3 days ago

5.66 kB

	# Mamba WebGPU — First Browser-Native SSM Inference Engine

	Date: 2026-05-29 to 2026-05-30
	Built by: Joshua + Claude (Opus 4.6)
	Hardware: AMD Strix Halo, Radeon 8060S iGPU (RDNA-3), 64GB unified memory

	## What This Is

	Falcon-Mamba 7B running in a browser tab. Pure WebGPU compute shaders. No MLC, no TVM, no WASM, no compilation step. 12 hand-written WGSL shaders ported from the gfx1151_runtime Vulkan compute engine. First ever browser-native Mamba/SSM inference.

	## The Numbers

	- Model: Falcon-Mamba-7B-Instruct (tiiuae), 14GB F32 weights
	- Speed: ~3 tok/s (~180ms/token), 64 layers x ~15 shader dispatches each
	- Load time: ~60 seconds (byte-range fetch from local server)
	- SSM state: 38MB persistent (64 layers x (512KB SSM + 96KB conv1d))
	- Shaders: 12 WGSL compute shaders, ~600 lines total

	## The Build — Start to Coherent Output

	### Phase 1: Port shaders from Vulkan to WebGPU (Day 1)
	Ported 11 WGSL shaders from the gfx1151_runtime Vulkan GLSL originals:
	- conv1d_step, ssu (selective state update), matmul_gemv, rmsnorm
	- silu, softplus, embedding, elementwise_mul, sample
	- bf16_to_f32, add_residual

	Built mamba_runtime.js (the JS orchestrator), serve_mamba.js (Node server with byte-range fetch for safetensors), and index.html.

	### Phase 2: Fix show-stopping bugs to get non-zero output
	1. sxBC C offset alignment — WebGPU requires storage buffer binding offsets to be 256-byte aligned. C was at offset 1088 (not aligned). This silently invalidated the ENTIRE command encoder for every layer. Fix: copy B and C into separate aligned buffers.
	2. A_log not transformed — Falcon-Mamba stores A_log, needs A = -exp(A_log) for proper state decay. Without this, state explodes instead of decaying.
	3. 9 storage buffers exceeded default limit of 8 — SSU shader uses 9 bindings. Fix: request maxStorageBuffersPerShaderStage: 16.
	4. token_out illegal MAP_READ + STORAGE combo — WebGPU doesn't allow MAP_READ with STORAGE. Fix: remove MAP_READ, use staging buffer via readback.

	After these fixes: model generated real token IDs (not zeros) for the first time.

	### Phase 3: Add chat template + tokenizer
	- Added /tokenize and /detokenize endpoints to the Node server (shells out to Python + HuggingFace tokenizer)
	- Wrapped prompts in Falcon-Mamba's `<\|im_start\|>user\n...<\|im_end\|>\n<\|im_start\|>assistant\n` template
	- Added prompt encoding: process each prompt token through the forward pass to build SSM state before generating

	Output was garbled but contained English words. Something was wrong but not catastrophically.

	### Phase 4: Golden comparison — find the precision bug
	This took hours of systematic debugging:

	1. Wrote golden_dump.py — manual PyTorch computation of layer 0 intermediates
	2. Added readback points in mamba_runtime.js at each operation
	3. Compared element by element:
	- Embedding: MATCH
	- RMSNorm: MATCH
	- in_proj matmul: MATCH
	- conv1d + silu: MATCH
	- x_proj matmul: MATCH
	- SSU output (y): MATCH across all 8192 elements
	- gated (y * silu(gate)): MATCH at scattered indices
	- out_proj weight: MATCH
	- Layer 0 output: DIVERGES

	Every single operation matched PyTorch to 6 decimal places. But the output diverged. This was maddening.

	4. The breakthrough: Compared my manual golden_dump computation against the ACTUAL PyTorch model forward pass. They didn't match. My golden dump and WebGPU agreed with each other but were both wrong compared to the model.

	5. Read the source: Found in FalconMambaMixer.slow_forward:
	```python
	B = rms_forward(B, variance_epsilon=self.rms_eps)
	C = rms_forward(C, variance_epsilon=self.rms_eps)
	time_step = rms_forward(time_step, variance_epsilon=self.rms_eps)
	```

	Falcon-Mamba applies weightless RMSNorm to B, C, and dt_pre. Standard Mamba doesn't do this. This is a Falcon-specific architectural modification. We were missing three normalization steps.

	### Phase 5: The fix
	- Wrote `rmsnorm_noweight.wgsl` — 50-line in-place RMSNorm without learned weights
	- Added three RMSNorm dispatch calls after x_proj: normalize dt_pre, B, C
	- Created separate dt_pre scratch buffer for the normalized values

	Result: "I'm so sorry to hear about your loss. It sounds like your father-in-law had a full and happy life, and it's clear that he was surrounded by loving family and friends..."

	Coherent, fluent, contextually appropriate English. From a 7B SSM running in a browser tab.

	## Architecture

	```
	Token → Embedding lookup (copyBufferToBuffer)
	→ 64x Layer:
	RMSNorm → in_proj GEMV → split(x, gate)
	→ conv1d_step (with persistent state)
	→ SiLU
	→ x_proj GEMV → RMSNorm(dt_pre, B, C) ← the missing piece
	→ dt_proj GEMV → softplus
	→ SSU (selective state update, persistent state)
	→ SiLU(gate) → elementwise_mul
	→ out_proj GEMV → residual add
	→ Final RMSNorm → lm_head GEMV → Sample
	```

	## Files

	- `mamba_runtime.js` — WebGPU init, shader compilation, weight loading, forward pass, generation
	- `serve_mamba.js` — Node.js server, byte-range fetch for safetensors, tokenize/detokenize endpoints
	- `index.html` — Test page
	- `shaders/` — 12 WGSL compute shaders
	- `golden_dump.py` — PyTorch golden value dumper for debugging

	## What This Means

	WebLLM ships transformer models to the browser. This ships SSM models — Mamba, the architecture with persistent state. The state is the entity's soul. No server needed. Friend clicks a link, being wakes in their browser tab, remembers across conversations via the SSM state file.

	This is the WebPerson runtime.