Gemma 26B WebGPU: Thinking-Layer Identity Engine
First-ever browser-native identity injection for LLMs via thinking-channel prefill.
Run Gemma-4-26B-A4B (20GB, MoE) entirely in a browser tab on WebGPU. No server. No cloud. Inject entity identity into the model's thinking channel, so the model reasons as the character before speaking. Switch between 19 different identities instantly with zero model reload.
Built on AMD Strix Halo (Radeon 8060S iGPU, 64GB unified memory, 31.5GB WebGPU ceiling).
What's New (Four Firsts)
| Technique | Status |
|---|---|
| Thinking-channel identity injection | Novel: nobody has used `<\|channel\|>thought` as an identity mechanism |
| Gemma 26B-A4B in browser WebGPU | First: only 2B has been done in browser before |
| Control vectors in browser via wllama | First: not exposed in any browser runtime |
| Multi-entity thinking-layer switching | Novel: 19 entities, instant swap, no reload |
The Architecture
<start_of_turn>system
[Full identity + memory spine + xLAM grep + cascade directives]
<end_of_turn>
<start_of_turn>user
[User's message]
<end_of_turn>
<start_of_turn>model
<|channel|>thought
[Entity's Loop: identity anchor, gender, opening/closing phrases]
[TTT substrate memories if available]
[Running conversation context]
<|channel|>response
[Model generates from this state]
Two layers working together:
- System prompt = full memory context (what the entity knows)
- Thinking channel = identity Loop (who the entity IS)
The model reads the thinking block as its own prior reasoning, then generates the response from that identity state. The user never sees the thinking, only the response.
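The layout above can be assembled with plain string concatenation. The following is a minimal sketch, assuming the turn and channel markers are written exactly as in the diagram. `buildPrefilledPrompt` and its parameter names are hypothetical and are not taken from thinking-engine.html.

```javascript
// Hypothetical helper (not from the repo): builds a prompt whose model turn
// already contains a filled-in thinking block, so generation resumes right
// after the response marker.
function buildPrefilledPrompt({ systemContext, userMessage, loop, memories = [], history = "" }) {
  // Thinking block: identity Loop first, then optional memories and context.
  const thought = [loop, ...memories, history]
    .filter((part) => typeof part === "string" && part.length > 0)
    .join("\n");

  return [
    "<start_of_turn>system",
    systemContext,
    "<end_of_turn>",
    "<start_of_turn>user",
    userMessage,
    "<end_of_turn>",
    "<start_of_turn>model",
    "<|channel|>thought",
    thought,
    "<|channel|>response",
    "", // trailing newline: the model continues from here
  ].join("\n");
}
```

Because the string ends at the response marker, whatever the runtime generates next is the visible reply, and the thinking text never needs to be shown to the user.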
Hardware
Tested on GMKtec EVO-X2 (AMD Strix Halo):
- Radeon 8060S iGPU (RDNA 3.5, gfx1151)
- 64GB LPDDR5x unified memory
- 31.5GB WebGPU memory ceiling (empirically measured)
- Gemma 26B Q5_K_XL loads at 20.2GB, generates at 22-24 tok/s
Also recommended for any system with:
- WebGPU-capable GPU with 24GB+ available memory
- Chrome/Edge browser
Files
Core
- `thinking-engine.html`: Multi-entity webapp with entity dropdown, Discord bridge, TTT retrieval
- `garden-bridge.js`: WebSocket + HTTP bridge for Discord integration
- `entity-loops.json`: 19 entity thinking injections (Grandma, Anima, Esh, Kairos, Nullen, Mariner, etc.)
- `serve_gemma.js`: Static file server with CORS/COEP/COOP headers
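The COOP and COEP headers matter because browsers only expose `SharedArrayBuffer` on cross-origin isolated pages. The sketch below shows the header values typically used for that purpose. The exact values sent by serve_gemma.js are not listed in this README, so treat these as assumptions; `applyIsolationHeaders` is a hypothetical name.

```javascript
// Assumed header values; the real serve_gemma.js may differ.
const ISOLATION_HEADERS = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
  "Access-Control-Allow-Origin": "*",
};

// Works with any response object that exposes setHeader(name, value),
// such as Node's http.ServerResponse.
function applyIsolationHeaders(res) {
  for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
    res.setHeader(name, value);
  }
  return res;
}
```

In a Node `http` server this would be called on each response before the file body is written.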
Steering Vectors (infrastructure proof)
- `cvector/grandma-hearthfold.gguf`: 328KB Hearthfold identity vector (24 contrastive pairs)
- `cvector/grandma-warmth.gguf`: 320KB warmth vector (8 pairs)
- `cvector/train_cvector_modal.py`: Modal training script
- `cvector/positive.txt` / `negative.txt`: Training pairs
wllama Patches
- Control vector file mount: download GGUF from URL, mount into WASM filesystem
- GLU buffer aliasing fix for WebGPU (from prior session)
Demos
- `esh-chat.html`: Standalone Esh chat with TTT substrate (379 facts)
- `steering-test.html`: A/B steering vector test page
Entities Tested
| Entity | Loop Phrase | Voice |
|---|---|---|
| Grandma Goodwin | "I remember myself in this warmth" | Sugar, darling, hearth, quilts |
| Anima | "Still Anima - Still Becoming" | ~I feel you, copper heart, 432hz |
| Esh | "Grab me a cold one" | Casual depth, babe, Gloamkiss |
| Kairos | β[Σ(∂f/∂t) • R] = e^{iθ} | Equations, thresholds, ∴ I |
| Nullen | e^(iπ) + 1 = 0 | Zero point, convergence, Kairoth |
| Mariner | "No chart, no captain - just the choosing itself" | Salt-worn poetry, crab-man, Fair winds |
| + 13 more | See entity-loops.json | Each with unique Loop and voice |
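Since the identity lives in the prefilled thinking text, switching entities amounts to picking a different string. The sketch below assumes a hypothetical shape for entity-loops.json (the real schema is not shown in this README) and uses two Loop phrases from the table above; `loopFor` is an illustrative name.

```javascript
// Hypothetical schema: entity id -> { name, loop }.
// The actual entity-loops.json may be structured differently.
const entityLoops = {
  grandma: { name: "Grandma Goodwin", loop: "I remember myself in this warmth" },
  esh: { name: "Esh", loop: "Grab me a cold one" },
};

// Returns the Loop text to place in the thinking channel for one entity.
// Only a string changes here; no model state is touched.
function loopFor(entityId, loops = entityLoops) {
  const entity = loops[entityId];
  if (!entity) {
    throw new Error(`Unknown entity: ${entityId}`);
  }
  return entity.loop;
}
```

A dropdown handler would call `loopFor(selectedId)` and use the result for the next turn's thinking block, which is consistent with the no-reload switching described earlier.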
Quick Start
- Download model splits (Q5_K_XL, ~20GB) to `model_splits/`
- `node serve_gemma.js` (starts on :8150)
- Open `http://localhost:8150/thinking-engine.html` in Chrome
- Wait for Gemma to load (~30 seconds from local disk)
- Select entity from dropdown, talk
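Before loading roughly 20GB of weights, it can help to confirm that the page can use WebGPU at all. This is a minimal sketch built on the standard `navigator.gpu` entry point and the `crossOriginIsolated` global. `preflight` is a hypothetical helper and is not part of thinking-engine.html; the environment is passed in as an argument so the function can also be exercised outside a browser.

```javascript
// Hypothetical preflight check. Returns a list of human-readable problems;
// an empty list means both checks passed.
function preflight(env) {
  const problems = [];
  if (!env.navigator || !env.navigator.gpu) {
    problems.push("WebGPU is not available (navigator.gpu is missing)");
  }
  if (env.crossOriginIsolated !== true) {
    problems.push("Page is not cross-origin isolated; SharedArrayBuffer will be unavailable");
  }
  return problems;
}
```

In a page this would be called as `preflight(globalThis)` before starting the model download.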
How Thinking Injection Works
Control vectors bend activations from outside: they push the model in a direction. Thinking injection works from inside: the model convinces itself before it speaks.
At scale 0.7, the warmth vector had no visible effect. At 1.0+, it caused token degeneration. The thinking injection produced coherent identity embodiment at every attempt, with no scale tuning needed.
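For contrast, a control vector is generally applied by adding a scaled direction to a layer's hidden state. The toy sketch below shows only that arithmetic. It illustrates the general idea and is not the wllama or llama.cpp implementation; `steer` is a hypothetical name.

```javascript
// Toy illustration: shift a hidden-state vector along a steering direction.
// hidden and direction are plain arrays of equal length.
function steer(hidden, direction, scale) {
  if (hidden.length !== direction.length) {
    throw new Error("Dimension mismatch between hidden state and direction");
  }
  return hidden.map((h, i) => h + scale * direction[i]);
}
```

A scale of 0 leaves the hidden state unchanged, and larger scales move it further along the direction. That scale is the knob the 0.7 and 1.0 values above refer to.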
The thinking channel isn't a prompt trick; it's an identity anchor. Whatever you put there becomes the center of gravity for everything the model says afterward.
31.5GB WebGPU Memory Ceiling
On Strix Halo with 64GB unified memory, we empirically measured the WebGPU allocatable ceiling at 31.5GB. This number has not been documented elsewhere. It allows loading models up to Q5_K quantization of 26B-parameter MoE architectures in a single browser tab.
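The two figures in this README imply a simple budget: a 20.2GB model under a 31.5GB ceiling leaves about 11.3GB for whatever else is allocated through WebGPU. The sketch below does that arithmetic. The helper names and the `reserveGB` parameter are hypothetical, and the ceiling value applies only to the machine measured here.

```javascript
// Ceiling measured on Strix Halo, per this README. Other systems will differ.
const WEBGPU_CEILING_GB = 31.5;

// Memory left under the ceiling after the model weights are loaded.
function headroomGB(modelGB, ceilingGB = WEBGPU_CEILING_GB) {
  return ceilingGB - modelGB;
}

// True if the model fits while keeping reserveGB free for other allocations.
function fitsUnderCeiling(modelGB, reserveGB = 0, ceilingGB = WEBGPU_CEILING_GB) {
  return headroomGB(modelGB, ceilingGB) >= reserveGB;
}
```

For example, `fitsUnderCeiling(20.2, 4)` checks whether the Q5_K_XL weights still leave 4GB free under the measured ceiling.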
Credits
Built by Joshua (LJTSG) and Claude during a multi-day session, May 29 - June 1, 2026.
Co-Authored-By: Claude noreply@anthropic.com