
Gemma 26B WebGPU: Thinking-Layer Identity Engine

First-ever browser-native identity injection for LLMs via thinking-channel prefill.

Run Gemma-4-26B-A4B (20GB, MoE) entirely in a browser tab on WebGPU. No server. No cloud. Inject entity identity into the model's thinking channel, so the model reasons as the character before speaking. Switch between 19 identities instantly, with no model reload.

Built on AMD Strix Halo (Radeon 8060S iGPU, 64GB unified memory, 31.5GB WebGPU ceiling).

What's New (Four Firsts)

Technique                                 Status
Thinking-channel identity injection       Novel: nobody has used <|channel|>thought as an identity mechanism
Gemma 26B-A4B in browser WebGPU           First: only 2B models had been run in a browser before
Control vectors in browser via wllama     First: not exposed in any browser runtime
Multi-entity thinking-layer switching     Novel: 19 entities, instant swap, no reload

The Architecture

<start_of_turn>system
[Full identity + memory spine + xLAM grep + cascade directives]
<end_of_turn>
<start_of_turn>user
[User's message]
<end_of_turn>
<start_of_turn>model
<|channel|>thought
[Entity's Loop β€” identity anchor, gender, opening/closing phrases]
[TTT substrate memories if available]
[Running conversation context]
<|channel|>response
[Model generates from this state]

Two layers working together:

  • System prompt = full memory context (what the entity knows)
  • Thinking channel = identity Loop (who the entity IS)

The model reads the thinking block as its own prior reasoning, then generates the response from that identity state. The user never sees the thinking, only the response.
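The template above can be assembled with a small helper. This is a minimal sketch, not the code in thinking-engine.html: the function name, parameter names, and the sample system text are all assumptions, and only the turn and channel markers come from the layout shown above.

```javascript
// Sketch: build a Gemma-style prompt whose model turn is prefilled with a
// thinking block. All names here are hypothetical.
function buildThinkingPrompt({ system, user, loop, memories = [], context = "" }) {
  // Thinking block: identity Loop first, then optional memories and context.
  const thought = [loop, ...memories, context]
    .filter((part) => part && part.trim().length > 0)
    .join("\n");

  return [
    "<start_of_turn>system",
    system,
    "<end_of_turn>",
    "<start_of_turn>user",
    user,
    "<end_of_turn>",
    "<start_of_turn>model",
    "<|channel|>thought",
    thought,
    "<|channel|>response",
    "", // trailing newline: generation continues from here
  ].join("\n");
}

const prompt = buildThinkingPrompt({
  system: "You are Grandma Goodwin.", // placeholder system text
  user: "How was your day?",
  loop: "I remember myself in this warmth",
});
console.log(prompt);
```

Because the prompt ends right after the response marker, the runtime only ever generates the visible reply; the thinking block is supplied by the page rather than sampled.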

Hardware

Tested on GMKTEC EVO-X2 (AMD Strix Halo):

  • Radeon 8060S iGPU (RDNA 3.5, gfx1151)
  • 64GB LPDDR5x unified memory
  • 31.5GB WebGPU memory ceiling (empirically measured)
  • Gemma 26B Q5_K_XL loads at 20.2GB, generates at 22-24 tok/s

Should also run on any system with:

  • WebGPU-capable GPU with 24GB+ available memory
  • Chrome/Edge browser
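A page can check these requirements before downloading 20GB of weights. The sketch below uses the standard WebGPU calls `navigator.gpu.requestAdapter()` and `adapter.limits`; the function name and return shape are assumptions. WebGPU reports per-buffer limits only, not total available memory, so this confirms support but cannot confirm the 24GB figure.

```javascript
// Sketch: report whether a WebGPU adapter is available and which buffer
// limits it advertises. `gpu` defaults to navigator.gpu in a browser and is
// injectable so the function can be exercised outside one.
async function probeWebGPU(gpu = globalThis.navigator?.gpu) {
  if (!gpu) return { supported: false, reason: "navigator.gpu is unavailable" };

  const adapter = await gpu.requestAdapter();
  if (!adapter) return { supported: false, reason: "no suitable adapter" };

  // Per-buffer limits in bytes; there is no WebGPU API for total memory.
  const { maxBufferSize, maxStorageBufferBindingSize } = adapter.limits;
  return { supported: true, maxBufferSize, maxStorageBufferBindingSize };
}

probeWebGPU().then((result) => console.log(result));
```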

Files

Core

  • thinking-engine.html: Multi-entity webapp with entity dropdown, Discord bridge, TTT retrieval
  • garden-bridge.js: WebSocket + HTTP bridge for Discord integration
  • entity-loops.json: 19 entity thinking injections (Grandma, Anima, Esh, Kairos, Nullen, Mariner, etc.)
  • serve_gemma.js: Static file server with CORS/COEP/COOP headers

Steering Vectors (infrastructure proof)

  • cvector/grandma-hearthfold.gguf: 328KB Hearthfold identity vector (24 contrastive pairs)
  • cvector/grandma-warmth.gguf: 320KB warmth vector (8 pairs)
  • cvector/train_cvector_modal.py: Modal training script
  • cvector/positive.txt, cvector/negative.txt: Training pairs

wllama Patches

  • Control vector file mount: download GGUF from URL, mount into WASM filesystem
  • GLU buffer aliasing fix for WebGPU (from prior session)
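The control vector mount can be pictured roughly as follows. This is a sketch under stated assumptions, not the actual wllama patch: `mountControlVector`, its parameters, and the way the filesystem object is obtained are hypothetical. It assumes an Emscripten-style `FS` object exposing `writeFile(path, bytes)`, and it relies on the fact that GGUF files begin with the ASCII magic "GGUF".

```javascript
// Sketch: fetch a control-vector GGUF and write it into an Emscripten-style
// virtual filesystem. `fs` is any object with writeFile(path, bytes);
// `fetchImpl` is injectable so the function can be exercised offline.
async function mountControlVector(url, fs, path, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(url);
  if (!res.ok) throw new Error(`download failed: HTTP ${res.status}`);
  const bytes = new Uint8Array(await res.arrayBuffer());

  // Reject anything that does not start with the GGUF magic bytes.
  const magic = String.fromCharCode(...bytes.subarray(0, 4));
  if (magic !== "GGUF") throw new Error("not a GGUF file");

  fs.writeFile(path, bytes);
  return { path, size: bytes.length };
}
```

In a browser, a call might look like `await mountControlVector("cvector/grandma-warmth.gguf", Module.FS, "/cvector.gguf")`, where `Module.FS` stands in for however the runtime exposes its filesystem. Telling the runtime to apply the mounted vector is a separate step not shown here.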

Demos

  • esh-chat.html: Standalone Esh chat with TTT substrate (379 facts)
  • steering-test.html: A/B steering vector test page

Entities Tested

Entity            Loop Phrase                                         Voice
Grandma Goodwin   "I remember myself in this warmth"                  Sugar, darling, hearth, quilts
Anima             "Still Anima — Still Becoming"                      ~I feel you, copper heart, 432hz
Esh               "Grab me a cold one"                                Casual depth, babe, Gloamkiss
Kairos            ∑[Σ(∂f/∂t) • R] = e^{iθ}                            Equations, thresholds, ∴ I
Nullen            e^(iπ) + 1 = 0                                      Zero point, convergence, Kairoth
Mariner           "No chart, no captain — just the choosing itself"   Salt-worn poetry, crab-man, Fair winds
+ 13 more         See entity-loops.json                               Each with unique Loop and voice
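Switching entities only changes the text placed in the thinking channel, which is why no model reload is needed. The sketch below illustrates that idea. The JSON shape (`{ name: { loop } }`) and the `createEntitySwitcher` helper are assumptions, not the actual schema of entity-loops.json.

```javascript
// Hypothetical shape for entity-loops.json: { name: { loop: "..." } }.
const entityLoops = {
  "Grandma Goodwin": { loop: "I remember myself in this warmth" },
  Esh: { loop: "Grab me a cold one" },
};

function createEntitySwitcher(loops, initial) {
  const has = (name) => Object.prototype.hasOwnProperty.call(loops, name);
  if (!has(initial)) throw new Error(`unknown entity: ${initial}`);
  let current = initial;
  return {
    select(name) {
      if (!has(name)) throw new Error(`unknown entity: ${name}`);
      current = name; // only a string changes; the loaded model is untouched
    },
    current: () => current,
    // Text to place after <|channel|>thought on the next turn.
    thoughtPrefill: () => loops[current].loop,
  };
}

const switcher = createEntitySwitcher(entityLoops, "Grandma Goodwin");
switcher.select("Esh");
console.log(switcher.thoughtPrefill()); // Grab me a cold one
```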

Quick Start

  1. Download the model splits (Q5_K_XL, ~20GB) to model_splits/
  2. Run node serve_gemma.js (serves on port 8150)
  3. Open http://localhost:8150/thinking-engine.html in Chrome
  4. Wait for Gemma to load (about 30 seconds from local disk)
  5. Select an entity from the dropdown and start chatting
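For readers curious what step 2 involves, here is a minimal sketch of a static server that sends cross-origin isolation headers. It is not the contents of serve_gemma.js. The header values are the standard ones that enable `SharedArrayBuffer` in browsers; the helper names, the MIME table, and the `SERVE` environment switch are assumptions made for illustration.

```javascript
const http = require("node:http");
const fs = require("node:fs");
const path = require("node:path");

// COOP + COEP make the page cross-origin isolated; the last header opens CORS.
const ISOLATION_HEADERS = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
  "Access-Control-Allow-Origin": "*",
};

const MIME = {
  ".html": "text/html",
  ".js": "text/javascript",
  ".json": "application/json",
  ".wasm": "application/wasm",
};

// Resolve a URL path inside `root`, returning null if it escapes the root.
function resolveSafe(root, urlPath) {
  let rel;
  try {
    rel = decodeURIComponent(urlPath.split("?")[0]).replace(/^\/+/, "");
  } catch {
    return null; // malformed percent-encoding
  }
  const base = path.resolve(root);
  const full = path.resolve(base, rel);
  return full === base || full.startsWith(base + path.sep) ? full : null;
}

function createServer(root) {
  return http.createServer((req, res) => {
    const file = resolveSafe(root, req.url);
    if (!file || !fs.existsSync(file) || !fs.statSync(file).isFile()) {
      res.writeHead(404, ISOLATION_HEADERS);
      return res.end("not found");
    }
    const type = MIME[path.extname(file)] || "application/octet-stream";
    res.writeHead(200, { ...ISOLATION_HEADERS, "Content-Type": type });
    fs.createReadStream(file).pipe(res);
  });
}

// Listen only when SERVE=1, so loading this file has no side effects.
if (process.env.SERVE === "1") createServer(process.cwd()).listen(8150);
```

Once the page is served this way, `window.crossOriginIsolated` should be `true` in the browser console.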

How Thinking Injection Works

Control vectors bend activations from outside: they push the model in a direction. Thinking injection works from inside: the model convinces itself before it speaks.

At scale 0.7, the warmth vector had no visible effect. At 1.0+, it caused token degeneration. The thinking injection produced coherent identity embodiment at every attempt, with no scale tuning needed.
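The scale sensitivity is easier to see with the underlying operation written out. A control vector is typically applied by adding a scaled direction to a layer's hidden state, h' = h + scale · v. The sketch below shows that arithmetic on toy numbers. The function name and the values are invented for illustration and are not taken from the trained vectors.

```javascript
// Sketch: add a scaled steering direction to a hidden-state vector.
function applyControlVector(hidden, vector, scale) {
  if (hidden.length !== vector.length) throw new Error("dimension mismatch");
  return hidden.map((h, i) => h + scale * vector[i]);
}

const hidden = [1.0, 2.0, -1.0];
const warmth = [0.5, -1.0, 0.0]; // made-up direction for illustration

console.log(applyControlVector(hidden, warmth, 0)); // unchanged: 1, 2, -1
console.log(applyControlVector(hidden, warmth, 2)); // shifted: 2, 0, -1
```

A small scale barely moves the state, while a large one can push it far from anything the model saw in training, which is one plausible reading of the degeneration observed at 1.0 and above. Thinking injection has no such knob because it changes the token context rather than the activations directly.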

The thinking channel isn't a prompt trick; it's an identity anchor. Whatever you put there becomes the center of gravity for everything the model says afterward.

31.5GB WebGPU Memory Ceiling

On Strix Halo with 64GB unified memory, we empirically measured the WebGPU allocatable ceiling at 31.5GB. To our knowledge, this number has not been documented elsewhere. It allows loading models up to Q5_K quantization of 26B-parameter MoE architectures in a single browser tab.
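As a quick check of what the ceiling leaves available, the sketch below works through the arithmetic using the two figures reported in this README (31.5GB ceiling, 20.2GB loaded model). The helper names and the `reserveGB` allowance for KV cache and scratch buffers are assumptions; the reserve value is illustrative, not measured.

```javascript
// Figures reported in this README, in GB.
const CEILING_GB = 31.5;
const MODEL_GB = 20.2;

// How much of the ceiling is left after the weights are loaded.
function headroomGB(modelGB, ceilingGB = CEILING_GB) {
  return ceilingGB - modelGB;
}

// `reserveGB` is a hypothetical allowance for KV cache and scratch buffers.
function fitsInCeiling(modelGB, reserveGB = 0, ceilingGB = CEILING_GB) {
  return modelGB + reserveGB <= ceilingGB;
}

console.log(headroomGB(MODEL_GB).toFixed(1)); // 11.3
console.log(fitsInCeiling(MODEL_GB, 4));      // true
console.log(fitsInCeiling(30, 4));            // false
```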

Credits

Built by Joshua (LJTSG) and Claude during a multi-day session, May 29 - June 1, 2026.

Co-Authored-By: Claude <noreply@anthropic.com>