| --- |
| license: apache-2.0 |
| language: |
| - en |
| base_model: |
| - WeiboAI/VibeThinker-3B |
| pipeline_tag: text-generation |
| tags: |
| - webgpu |
| - in-browser |
| - lora |
| - client-side |
| - edge |
| - qwen2.5 |
| library_name: emberglass |
| --- |
| |
| <h1 align="center">🜂 EMBERGLASS</h1> |
| <p align="center"><em>A 3-billion-parameter mind, running inside a browser tab. No server. No install. No upload. Just a page.</em></p> |
|
|
| <p align="center"><b>~35 tokens/sec decode · live LoRA hot-swap · token-exact vs the reference · 100% client-side WebGPU</b></p> |
|
|
| > **Code & runtime:** https://github.com/maceip/emberglass |
|
|
| --- |
|
|
| ## What this is |
|
|
| Most "AI in the browser" is a thin client phoning home to someone else's GPU. **This isn't that.** |
|
|
| Emberglass is a hand-built inference engine that runs a fine-tuned **Qwen2.5-3B** reasoning model **entirely on your own machine's GPU**, from inside a single static web page — written from the metal up in raw WebGPU compute shaders. The model thinks for thousands of tokens, streams a verdict, and **never sends a single byte off your device.** You bring the weights; the page brings the engine. |
|
|
| And the part that shouldn't be possible at this speed: you can **swap the model's personality at runtime.** Load the base once, then hot-swap LoRA adapters *live* — no reload, no recompile, no re-quantization. The base weights never move. The output changes the instant you flip the adapter, and flips back **bit-for-bit identically** when you remove it. |
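Why clearing an adapter can restore the output exactly is easiest to see in a small CPU sketch. It assumes the usual LoRA formulation, `y = W·x + scale · B·(A·x)`, with the low-rank product kept separate from the frozen base weights rather than merged into them. The class name `LoraLinear`, its methods, and the row-major layout are hypothetical illustrations, not the Emberglass API, which does this work in WGSL.

```javascript
// Dense matrix-vector product. W is row-major [rows x cols]; x has length cols.
function matVec(W, rows, cols, x) {
  const y = new Float32Array(rows);
  for (let r = 0; r < rows; r++) {
    let acc = 0;
    for (let c = 0; c < cols; c++) acc += W[r * cols + c] * x[c];
    y[r] = acc;
  }
  return y;
}

// Hypothetical linear layer: frozen base weights plus an optional adapter slot.
class LoraLinear {
  constructor(W, rows, cols) {
    this.W = W;
    this.rows = rows;
    this.cols = cols;
    this.adapter = null; // { A: [rank x cols], B: [rows x rank], rank, scale }
  }

  // Swapping only changes a reference; the base weights W are never written.
  setAdapter(adapter) { this.adapter = adapter; }
  clearAdapter() { this.adapter = null; }

  forward(x) {
    const y = matVec(this.W, this.rows, this.cols, x); // base path, always identical
    if (this.adapter) {
      const { A, B, rank, scale } = this.adapter;
      const h = matVec(A, rank, this.cols, x);  // down-projection to the low rank
      const d = matVec(B, this.rows, rank, h);  // up-projection back to the output size
      for (let i = 0; i < y.length; i++) y[i] += scale * d[i];
    }
    return y;
  }
}
```

Because the base path runs the same arithmetic on the same untouched weights whether or not an adapter is attached, the output after `clearAdapter()` matches the pre-swap output bit for bit in this sketch.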
|
|
| ## Why it's hard (and why it's fast) |
|
|
| A browser tab is the most hostile environment imaginable for a 3B-parameter model. No CUDA. No vendor kernels. A 5.4 GB weight shard won't even fit in a single JavaScript array. Every fast path that exists on a server is closed. So we closed the gap by hand: |
|
|
| - **Custom WGSL compute kernels** for every op — the only way LoRA could become live and swappable instead of a baked-in constant. |
| - **int4 group-128 quantization** that is **numerically exact** on the reference decode — half the memory, zero quality lost. |
| - **Split-K flash-style decode attention** so it stays fast even at thousands of tokens of context. |
| - **Subgroup-reduction GEMV** + a **GPU-resident batched decode loop** (argmax→embed stays on the GPU; one sync per batch). |
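To make "int4 group-128" concrete, here is a minimal CPU sketch of group-wise 4-bit quantization. The README does not specify the exact scheme, so everything beyond the group size is an assumption: symmetric scaling, one f32 scale per group of 128 weights, and two nibbles packed per byte with the even index in the low nibble. The function names are hypothetical.

```javascript
const GROUP = 128; // group size named in the list above

// Quantize a Float32Array (length a multiple of GROUP) to symmetric int4.
// Returns packed nibbles (two weights per byte) and one f32 scale per group.
function quantizeInt4(weights) {
  if (weights.length % GROUP !== 0) {
    throw new Error("length must be a multiple of " + GROUP);
  }
  const groups = weights.length / GROUP;
  const packed = new Uint8Array(weights.length / 2);
  const scales = new Float32Array(groups);
  for (let g = 0; g < groups; g++) {
    let maxAbs = 0;
    for (let i = 0; i < GROUP; i++) {
      maxAbs = Math.max(maxAbs, Math.abs(weights[g * GROUP + i]));
    }
    scales[g] = maxAbs > 0 ? maxAbs / 7 : 1; // map [-maxAbs, maxAbs] onto [-7, 7]
    for (let i = 0; i < GROUP; i++) {
      const idx = g * GROUP + i;
      const q = Math.max(-8, Math.min(7, Math.round(weights[idx] / scales[g])));
      const nibble = q & 0xf; // 4-bit two's complement
      packed[idx >> 1] |= (idx & 1) ? (nibble << 4) : nibble;
    }
  }
  return { packed, scales };
}

// Inverse of quantizeInt4: unpack each nibble, sign-extend, multiply by its group scale.
function dequantizeInt4({ packed, scales }) {
  const out = new Float32Array(packed.length * 2);
  for (let idx = 0; idx < out.length; idx++) {
    const byte = packed[idx >> 1];
    const nibble = (idx & 1) ? (byte >> 4) : (byte & 0xf);
    const q = nibble >= 8 ? nibble - 16 : nibble; // sign-extend 4 bits
    out[idx] = q * scales[Math.floor(idx / GROUP)];
  }
  return out;
}
```

Under these assumptions the storage cost is 4 bits per weight plus 32 bits of scale per 128 weights, or 4.25 bits per weight, and the reconstruction error of each weight is bounded by half of its group's scale.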
|
|
| Every win was found by **measuring** — nanosecond GPU timestamp profiling — not guessing. 9 → 35 tok/s over one focused push. |
|
|
| ## Results |
|
|
| | Metric | Result | |
| |---|---| |
| | Decode speed | ~35 tok/s across a full multi-thousand-token reasoning generation | |
| | Correctness | argmax + every generated token **exact** vs the HuggingFace reference; bit-exact run-to-run | |
| | LoRA hot-swap | load base once · swap live · perfect restore on clear · no reload | |
| | Footprint | one static HTML page; weights supplied by the visitor (BYO-model) | |
| | Privacy | absolute — inference never leaves the device | |
|
|
| ## Context window & prefill sizes |
|
|
| The base model — [WeiboAI/VibeThinker-3B](https://huggingface.co/WeiboAI/VibeThinker-3B), a Qwen2.5-architecture 3B reasoning model (from Qwen2.5-Coder-3B) — supports **131072 (128K) positions** with a **32K sliding window**, and is built to *think long*: its generation config defaults to `max_new_tokens=65536`, and the authors suggest **60K–100K tokens** for the hardest problems. So context length is a first-class concern here, not an afterthought. |
|
|
| The runtime exposes context + prefill as options: |
|
|
| ```js |
| const rt = new QwenWGPU(device, QWEN25_3B, { maxCtx: 8192, maxPrefillT: 8192 }); |
| ``` |
|
|
| - **`maxCtx`**: the context window (KV-cache length). Decode attention is **split-K** and prefill attention is **flash / online-softmax** (O(block) workgroup memory, not O(ctx)), so neither path caps out at a small size; context scales until you run out of VRAM. |
| - **`maxPrefillT`**: the largest prompt processed in one batched (tiled-int4-GEMM) prefill pass. Longer prompts, and any prefill while a LoRA adapter is active, fall back to the sequential path. The value is clamped to `maxCtx`. |
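The algebra that lets attention be split across the key axis and still produce the same result can be sketched on the CPU. Each chunk keeps a running maximum, a softmax denominator, and an unnormalized output (the online-softmax recurrence), and the partial results are then merged by rescaling to a common maximum. This is an illustration of the general technique for a single query vector, not the Emberglass kernels; the function names and the row-major `[n x d]` layout are assumptions.

```javascript
// Attention over keys [start, end) for one query, using the online-softmax
// recurrence. Returns the running max m, denominator l, and unnormalized output acc.
function attendChunk(q, K, V, d, start, end) {
  let m = -Infinity;
  let l = 0;
  const acc = new Float64Array(d);
  for (let t = start; t < end; t++) {
    let s = 0;
    for (let j = 0; j < d; j++) s += q[j] * K[t * d + j];
    s /= Math.sqrt(d);
    const mNew = Math.max(m, s);
    const rescale = Math.exp(m - mNew); // 0 on the first key, since m = -Infinity
    const p = Math.exp(s - mNew);
    l = l * rescale + p;
    for (let j = 0; j < d; j++) acc[j] = acc[j] * rescale + p * V[t * d + j];
    m = mNew;
  }
  return { m, l, acc };
}

// Merge per-chunk partials by rescaling each one to the global maximum.
function mergePartials(parts, d) {
  const M = Math.max(...parts.map((p) => p.m));
  let L = 0;
  const out = new Float64Array(d);
  for (const p of parts) {
    const w = Math.exp(p.m - M);
    L += p.l * w;
    for (let j = 0; j < d; j++) out[j] += p.acc[j] * w;
  }
  for (let j = 0; j < d; j++) out[j] /= L;
  return out;
}

// Split n keys into `splits` chunks, attend to each independently, then merge.
function splitKAttention(q, K, V, d, n, splits) {
  const chunk = Math.ceil(n / splits);
  const parts = [];
  for (let s = 0; s < n; s += chunk) {
    parts.push(attendChunk(q, K, V, d, s, Math.min(n, s + chunk)));
  }
  return mergePartials(parts, d);
}
```

Each chunk only needs state proportional to the head dimension, which is why working memory stays O(block) rather than O(ctx). On a GPU the chunks would typically run in parallel workgroups, with the merge as a small second pass.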
|
|
| Defaults are **8192 / 8192** — ample for the bug-bounty triage adapter (its chain-of-thought runs a few thousand tokens) at a modest footprint. Raise them toward the base model's 128K as memory allows. **The KV cache is the cost**, and it grows linearly (~72 KB per token of context, f32, across all 36 layers): |
|
|
| | context (`maxCtx`) | KV cache (f32) | |
| |---|---| |
| | 8 192 *(default)* | ~0.6 GB | |
| | 16 384 | ~1.2 GB | |
| | 32 768 *(sliding window)* | ~2.4 GB | |
| | 131 072 *(max positions)* | ~9.7 GB | |
|
|
| Add ~2 GB of int4/int8 weights and lazily sized prefill scratch on top. **Verified in-browser:** batched prefill is bit-exact to the sequential path through ctx 1024; runs end-to-end at 4 096 / 8 192; and a `maxCtx: 16384` build prefills a 9 000-token prompt and decodes past it. (KV is f32 today; storing it in a 16-bit format would halve these numbers.) |
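The per-token KV cost can be reproduced with a few lines of arithmetic, which is handy when choosing `maxCtx` for a given memory budget. The 36 layers and f32 storage come from this README; the 2 KV heads and head dimension of 128 are taken from the published Qwen2.5-3B configuration and should be treated as assumptions here. The helper names are hypothetical.

```javascript
// Assumed attention geometry. Only `layers` and the 4-byte f32 element size
// are stated in this README; kvHeads and headDim are assumptions.
const KV_CONFIG = { layers: 36, kvHeads: 2, headDim: 128, bytesPerElem: 4 };

// Bytes of KV cache per token of context. The factor 2 covers K and V.
function kvBytesPerToken(cfg = KV_CONFIG) {
  return 2 * cfg.layers * cfg.kvHeads * cfg.headDim * cfg.bytesPerElem;
}

// Total KV cache size for a given context length.
function kvCacheBytes(maxCtx, cfg = KV_CONFIG) {
  return maxCtx * kvBytesPerToken(cfg);
}

// Largest power-of-two context (from 1024 up to `ceiling`) whose KV cache
// fits in `budgetBytes`. Returns 0 if even 1024 tokens do not fit.
function largestCtxForBudget(budgetBytes, cfg = KV_CONFIG, ceiling = 131072) {
  let best = 0;
  for (let ctx = 1024; ctx <= ceiling; ctx *= 2) {
    if (kvCacheBytes(ctx, cfg) <= budgetBytes) best = ctx;
  }
  return best;
}

for (const ctx of [8192, 16384, 32768, 131072]) {
  console.log(ctx, (kvCacheBytes(ctx) / 1e9).toFixed(2) + " GB");
}
```

With these assumptions the per-token cost is 73 728 bytes (72 KiB), so exact byte counts land slightly off any figure rounded from "~72 KB". The budget helper ignores the weights and prefill scratch, which need to be subtracted from the available memory first.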
|
|
| ## Note on weights |
|
|
| **This page hosts no multi-GB weights.** Emberglass is the *engine*; it is bring-your-own-model. Point it at a Qwen2.5-3B (or compatible) checkpoint served locally and it quantizes to int4 on the way to the GPU. Drag in a PEFT/MLX LoRA adapter to hot-swap a specialization live. |
|
|
| ## Run it |
|
|
| See **https://github.com/maceip/emberglass**. Requires a WebGPU browser exposing the `subgroups` feature. Built and validated on an Apple M5 Max. |
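A page can check the `subgroups` requirement before loading any weights. The sketch below uses only standard WebGPU calls (`requestAdapter`, `adapter.features.has`, `requestDevice` with `requiredFeatures`). The function name `requestSubgroupDevice` is hypothetical, and taking `gpu` as a parameter is a choice made here so the function can be exercised with a stub outside a browser.

```javascript
// Request a GPUDevice with the "subgroups" feature enabled.
// `gpu` is normally `navigator.gpu`.
async function requestSubgroupDevice(gpu) {
  if (!gpu) {
    throw new Error("WebGPU is not available in this browser");
  }
  const adapter = await gpu.requestAdapter();
  if (!adapter) {
    throw new Error("No suitable GPU adapter found");
  }
  if (!adapter.features.has("subgroups")) {
    throw new Error('This adapter does not expose the "subgroups" feature');
  }
  // Optional features must be requested explicitly or shaders cannot use them.
  return adapter.requestDevice({ requiredFeatures: ["subgroups"] });
}

// In a page:
//   const device = await requestSubgroupDevice(navigator.gpu);
```

The returned device is what a runtime constructor such as `new QwenWGPU(device, ...)` would receive. Failing early with a clear message is friendlier than letting shader compilation fail later.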
|
|
| --- |
| <p align="center"><sub>Built the hard way, on purpose. 🜂</sub></p> |
|
|