macmacmacmac committed on
Commit 175cd4d · 1 Parent(s): c1f3d36

Emberglass model card: in-browser WebGPU Qwen2.5-3B + runtime LoRA hot-swap

Files changed (1)
  1. README.md +65 -0
README.md ADDED
@@ -0,0 +1,65 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ base_model:
+ - Qwen/Qwen2.5-3B
+ pipeline_tag: text-generation
+ tags:
+ - webgpu
+ - in-browser
+ - lora
+ - client-side
+ - edge
+ - qwen2.5
+ library_name: emberglass
+ ---
+
+ <h1 align="center">🜂 EMBERGLASS</h1>
+ <p align="center"><em>A 3-billion-parameter mind, running inside a browser tab. No server. No install. No upload. Just a page.</em></p>
+
+ <p align="center"><b>~35 tokens/sec decode · live LoRA hot-swap · bit-exact to the reference · 100% client-side WebGPU</b></p>
+
+ > **Code & runtime:** https://github.com/maceip/emberglass
+
+ ---
+
+ ## What this is
+
+ Most "AI in the browser" is a thin client phoning home to someone else's GPU. **This isn't that.**
+
+ Emberglass is a hand-built inference engine that runs a fine-tuned **Qwen2.5-3B** reasoning model **entirely on your own machine's GPU**, from inside a single static web page. It is written from the metal up in raw WebGPU compute shaders. The model thinks for thousands of tokens, streams a verdict, and **never sends a single byte off your device.** You bring the weights; the page brings the engine.
+
+ And the part that shouldn't be possible at this speed: you can **swap the model's personality at runtime.** Load the base once, then hot-swap LoRA adapters *live*, with no reload, no recompile, and no re-quantization. The base weights never move. The output changes the instant you flip the adapter, and flips back **bit-for-bit identically** when you remove it.
+
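The hot-swap behaviour described above can be illustrated with a small CPU-side sketch. This is not Emberglass's code: `LoraLinear`, `LoraAdapter`, `Matrix`, and `matVec` are hypothetical names, and the real engine does this work in WGSL kernels on the GPU. The sketch assumes the standard LoRA formulation, `y = W·x + scale · B·(A·x)`, with the base matrix `W` left untouched so that clearing the adapter restores the base output exactly.

```typescript
// Hypothetical illustration only; none of these names come from Emberglass.
type Matrix = { rows: number; cols: number; data: Float32Array }; // row-major

interface LoraAdapter {
  a: Matrix;     // r x inDim down-projection
  b: Matrix;     // outDim x r up-projection
  scale: number; // typically alpha / r
}

// Plain matrix-vector product, y = m * x.
function matVec(m: Matrix, x: Float32Array): Float32Array {
  if (x.length !== m.cols) throw new Error("shape mismatch");
  const y = new Float32Array(m.rows);
  for (let i = 0; i < m.rows; i++) {
    let acc = 0;
    for (let j = 0; j < m.cols; j++) acc += m.data[i * m.cols + j] * x[j];
    y[i] = acc;
  }
  return y;
}

class LoraLinear {
  private base: Matrix;
  private adapter: LoraAdapter | null = null;

  constructor(base: Matrix) {
    this.base = base; // loaded once, never modified afterwards
  }

  // Swapping is just a pointer change: pass null to clear the adapter.
  setAdapter(adapter: LoraAdapter | null): void {
    this.adapter = adapter;
  }

  forward(x: Float32Array): Float32Array {
    const y = matVec(this.base, x);
    const ad = this.adapter;
    if (ad === null) return y; // base path is identical with no adapter
    const delta = matVec(ad.b, matVec(ad.a, x)); // B * (A * x)
    for (let i = 0; i < y.length; i++) y[i] += ad.scale * delta[i];
    return y;
  }
}
```

Because the low-rank term is added at run time instead of being merged into `W`, removing the adapter runs exactly the same arithmetic as before it was attached, which is why the restore can be bit-identical.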
+ ## Why it's hard (and why it's fast)
+
+ A browser tab is the most hostile environment imaginable for a 3B-parameter model. No CUDA. No vendor kernels. A 5.4 GB weight shard won't even fit in a single JavaScript array. Every fast path that exists on a server is closed. So we closed the gap by hand:
+
+ - **Custom WGSL compute kernels** for every op, the only way LoRA could become live and swappable instead of a baked-in constant.
+ - **int4 group-128 quantization** that stays **exact** on the reference decode (every generated token matches): half the memory, with no quality lost on that decode.
+ - **Split-K flash-style decode attention** so it stays fast even at thousands of tokens of context.
+ - **Subgroup-reduction GEMV** + a **GPU-resident batched decode loop** (argmax→embed stays on the GPU; one sync per batch).
+
+ Every win was found by **measuring** with nanosecond-resolution GPU timestamp profiling, not by guessing. Decode went from 9 to 35 tok/s over one focused push.
+
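To make "int4 group-128" concrete, here is a minimal CPU-side sketch of group-wise 4-bit quantization. It is an illustration under stated assumptions, not the engine's kernel: the README does not say whether the scheme is symmetric or asymmetric, how nibbles are packed, or what precision the scales use, so the symmetric scaling, the low-nibble-first packing, the 32-bit scales, and the names `quantizeInt4` and `dequantizeInt4` are all assumed here.

```typescript
// Hypothetical sketch; scheme details are assumptions, not Emberglass's format.
const GROUP_SIZE = 128;

interface QuantizedInt4 {
  packed: Uint8Array;   // two 4-bit codes per byte, low nibble first
  scales: Float32Array; // one scale shared by each group of 128 weights
  length: number;       // number of original weights
}

function quantizeInt4(weights: Float32Array): QuantizedInt4 {
  if (weights.length % GROUP_SIZE !== 0) {
    throw new Error("weight count must be a multiple of the group size");
  }
  const groups = weights.length / GROUP_SIZE;
  const packed = new Uint8Array(weights.length / 2);
  const scales = new Float32Array(groups);
  for (let g = 0; g < groups; g++) {
    const start = g * GROUP_SIZE;
    let maxAbs = 0;
    for (let i = 0; i < GROUP_SIZE; i++) {
      maxAbs = Math.max(maxAbs, Math.abs(weights[start + i]));
    }
    scales[g] = maxAbs > 0 ? maxAbs / 7 : 1; // map the largest value to +/-7
    const s = scales[g]; // read back the float32-rounded scale
    for (let i = 0; i < GROUP_SIZE; i++) {
      const idx = start + i;
      const q = Math.max(-8, Math.min(7, Math.round(weights[idx] / s)));
      const code = (q + 8) & 0xf; // shift to an unsigned nibble, 0..15
      packed[idx >> 1] |= (idx & 1) === 0 ? code : code << 4;
    }
  }
  return { packed, scales, length: weights.length };
}

function dequantizeInt4(q: QuantizedInt4): Float32Array {
  const out = new Float32Array(q.length);
  for (let idx = 0; idx < q.length; idx++) {
    const byte = q.packed[idx >> 1];
    const code = (idx & 1) === 0 ? byte & 0xf : byte >> 4;
    out[idx] = (code - 8) * q.scales[Math.floor(idx / GROUP_SIZE)];
  }
  return out;
}
```

In this sketch each group of 128 weights costs 64 bytes of codes plus one 4-byte scale, roughly 4.25 bits per weight. A GPU kernel would typically dequantize inside the matrix-vector product instead of materializing `out`.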
+ ## Results
+
+ | Metric | Result |
+ |---|---|
+ | Decode speed | ~35 tok/s across a full multi-thousand-token reasoning generation |
+ | Correctness | argmax + every generated token **exact** vs the HuggingFace reference; bit-exact run-to-run |
+ | LoRA hot-swap | load base once · swap live · perfect restore on clear · no reload |
+ | Footprint | one static HTML page; weights supplied by the visitor (BYO-model) |
+ | Privacy | absolute: inference never leaves the device |
+
+ ## Note on weights
+
+ **This page hosts no multi-GB weights.** Emberglass is the *engine*; it is bring-your-own-model. Point it at a Qwen2.5-3B (or compatible) checkpoint served locally and it quantizes to int4 on the way to the GPU. Drag in a PEFT/MLX LoRA adapter to hot-swap a specialization live.
+
+ ## Run it
+
+ See **https://github.com/maceip/emberglass**. Requires a WebGPU browser exposing the `subgroups` feature. Built and validated on an Apple M5 Max.
+
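A quick way to check the `subgroups` requirement before loading any weights is to inspect the adapter's feature set. The sketch below uses the standard WebGPU calls `navigator.gpu.requestAdapter()` and `adapter.features.has(...)`; the helper names `supportsSubgroups` and `checkBrowser` and the `AdapterLike` type are hypothetical and are not part of Emberglass.

```typescript
// Minimal structural type so the check also compiles outside a browser.
interface AdapterLike {
  features: { has(name: string): boolean };
}

// True only when an adapter exists and advertises the "subgroups" feature.
function supportsSubgroups(adapter: AdapterLike | null): boolean {
  return adapter !== null && adapter.features.has("subgroups");
}

// Browser-side wrapper: resolves to false when WebGPU itself is unavailable.
async function checkBrowser(): Promise<boolean> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return false;
  const adapter: AdapterLike | null = await gpu.requestAdapter();
  return supportsSubgroups(adapter);
}
```

Calling `checkBrowser()` from the page's console gives a yes/no answer. Note that an application that wants to use subgroup operations must also list `"subgroups"` in `requiredFeatures` when it calls `adapter.requestDevice()`.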
+ ---
+ <p align="center"><sub>Built the hard way, on purpose. 🜂</sub></p>