	# Gemma 26B WebGPU — Thinking-Layer Identity Engine

	First-ever browser-native identity injection for LLMs via thinking-channel prefill.

	Run Gemma-4-26B-A4B (20GB, MoE) entirely in a browser tab on WebGPU. No server. No cloud. Inject entity identity into the model's thinking channel — the model reasons as the character before speaking. Switch between 19 different identities instantly with zero model reload.

	Built on AMD Strix Halo (Radeon 8060S iGPU, 64GB unified memory, 31.5GB WebGPU ceiling).

	## What's New (Four Firsts)

	| Technique | Status |
	|---|---|
	| Thinking-channel identity injection | Novel: nobody has used `<\|channel\|>thought` as an identity mechanism |
	| Gemma 26B-A4B in browser WebGPU | First: only 2B has been done in browser before |
	| Control vectors in browser via wllama | First: not exposed in any browser runtime |
	| Multi-entity thinking-layer switching | Novel: 19 entities, instant swap, no reload |

	## The Architecture

	```
	<start_of_turn>system
	[Full identity + memory spine + xLAM grep + cascade directives]
	<end_of_turn>
	<start_of_turn>user
	[User's message]
	<end_of_turn>
	<start_of_turn>model
	<|channel|>thought
	[Entity's Loop — identity anchor, gender, opening/closing phrases]
	[TTT substrate memories if available]
	[Running conversation context]
	<|channel|>response
	[Model generates from this state]
	```

	Two layers working together:
	- System prompt = full memory context (what the entity knows)
	- Thinking channel = identity Loop (who the entity IS)

	The model reads the thinking block as its own prior reasoning, then generates the response from that identity state. The user never sees the thinking — only the response.
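
The layout above can be sketched as a small prompt builder. This is a minimal illustration, not the code in `thinking-engine.html`: the function name `buildPrompt` and its parameter names are hypothetical, and it assumes the runtime accepts the turn and channel markers as plain text in the prompt.

```javascript
// Sketch only: assembles the prompt layout from the diagram above.
// buildPrompt and its parameter names are hypothetical, not taken from the repo.
function buildPrompt({ systemPrompt, userMessage, identityLoop, memories = [], context = "" }) {
  // Thinking block: identity Loop first, then optional memories and running context.
  // Empty or whitespace-only parts are dropped.
  const thought = [identityLoop, ...memories, context]
    .filter((part) => part && part.trim().length > 0)
    .join("\n");

  return [
    "<start_of_turn>system",
    systemPrompt,
    "<end_of_turn>",
    "<start_of_turn>user",
    userMessage,
    "<end_of_turn>",
    "<start_of_turn>model",
    "<|channel|>thought",
    thought,
    "<|channel|>response",
    "", // trailing newline: the model continues generating from here
  ].join("\n");
}
```

Because the prompt ends right after the response marker, everything in the thought block is already in context when the first response token is sampled.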

	## Hardware

	Tested on GMKtec EVO-X2 (AMD Strix Halo):
	- Radeon 8060S iGPU (RDNA 3.5, gfx1151)
	- 64GB LPDDR5x unified memory
	- 31.5GB WebGPU memory ceiling (empirically measured)
	- Gemma 26B Q5_K_XL loads at 20.2GB, generates at 22-24 tok/s

	Expected to work on any system with:
	- WebGPU-capable GPU with 24GB+ available memory
	- Chrome/Edge browser

	## Files

	### Core
	- `thinking-engine.html` — Multi-entity webapp with entity dropdown, Discord bridge, TTT retrieval
	- `garden-bridge.js` — WebSocket + HTTP bridge for Discord integration
	- `entity-loops.json` — 19 entity thinking injections (Grandma, Anima, Esh, Kairos, Nullen, Mariner, etc.)
	- `serve_gemma.js` — Static file server with CORS/COEP/COOP headers
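
For readers unfamiliar with the header requirement, here is a minimal sketch of a static server that sets CORS, COOP, and COEP headers. It is not the contents of `serve_gemma.js`; the helper names (`resolveSafe`, `createServer`) and the MIME table are assumptions made for illustration. The COOP/COEP pair is what makes a page cross-origin isolated, which browsers require before exposing `SharedArrayBuffer` to multi-threaded WASM.

```javascript
const http = require("node:http");
const fs = require("node:fs");
const path = require("node:path");

// Cross-origin isolation headers plus a permissive CORS header.
const ISOLATION_HEADERS = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
  "Access-Control-Allow-Origin": "*",
};

// Illustrative MIME table; extend as needed.
const MIME = {
  ".html": "text/html",
  ".js": "text/javascript",
  ".json": "application/json",
  ".wasm": "application/wasm",
};

// Map a request URL to a file under rootDir; return null on traversal or bad encoding.
function resolveSafe(rootDir, urlPath) {
  const root = path.resolve(rootDir);
  let decoded;
  try {
    decoded = decodeURIComponent(urlPath.split("?")[0]);
  } catch (_) {
    return null;
  }
  const target = path.resolve(root, "." + decoded);
  return target === root || target.startsWith(root + path.sep) ? target : null;
}

function createServer(rootDir) {
  return http.createServer((req, res) => {
    const file = resolveSafe(rootDir, req.url);
    if (!file) {
      res.writeHead(403, ISOLATION_HEADERS);
      res.end("Forbidden");
      return;
    }
    fs.stat(file, (err, stat) => {
      if (err || !stat.isFile()) {
        res.writeHead(404, ISOLATION_HEADERS);
        res.end("Not found");
        return;
      }
      res.writeHead(200, {
        ...ISOLATION_HEADERS,
        "Content-Type": MIME[path.extname(file)] || "application/octet-stream",
        "Content-Length": stat.size,
      });
      fs.createReadStream(file).pipe(res);
    });
  });
}

// Usage (not executed here): createServer(__dirname).listen(8150);
```

In the browser, `self.crossOriginIsolated` reports whether the headers took effect.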

	### Steering Vectors (infrastructure proof)
	- `cvector/grandma-hearthfold.gguf` — 328KB Hearthfold identity vector (24 contrastive pairs)
	- `cvector/grandma-warmth.gguf` — 320KB warmth vector (8 pairs)
	- `cvector/train_cvector_modal.py` — Modal training script
	- `cvector/positive.txt` / `negative.txt` — Training pairs

	### wllama Patches
	- Control vector file mount: download GGUF from URL, mount into WASM filesystem
	- GLU buffer aliasing fix for WebGPU (from prior session)
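
The control vector mount can be pictured roughly as follows. This is a sketch under assumptions, not the actual patch: it takes an Emscripten-style filesystem object (with `mkdir` and `writeFile`) as a parameter rather than guessing how wllama exposes its module, and the names `isGguf`, `mountBytes`, `mountControlVector`, and the `/cvector` directory are hypothetical.

```javascript
// GGUF files begin with the four ASCII bytes "GGUF".
const GGUF_MAGIC = [0x47, 0x47, 0x55, 0x46];

function isGguf(bytes) {
  return bytes.length >= 4 && GGUF_MAGIC.every((b, i) => bytes[i] === b);
}

// Write bytes into an Emscripten-style FS object and return the mounted path.
// `fs` is injected so this sketch makes no claim about wllama internals.
function mountBytes(fs, dir, name, bytes) {
  if (!isGguf(bytes)) throw new Error("not a GGUF file");
  try {
    fs.mkdir(dir);
  } catch (_) {
    // Directory already exists; nothing to do.
  }
  const fullPath = `${dir}/${name}`;
  fs.writeFile(fullPath, bytes);
  return fullPath;
}

// Download a GGUF control vector and mount it under `dir`.
async function mountControlVector(fs, url, dir = "/cvector") {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`download failed: HTTP ${res.status}`);
  const bytes = new Uint8Array(await res.arrayBuffer());
  return mountBytes(fs, dir, url.split("/").pop(), bytes);
}
```

Checking the magic bytes before writing catches the common failure where a 404 page or redirect body is saved in place of the vector.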

	### Demos
	- `esh-chat.html` — Standalone Esh chat with TTT substrate (379 facts)
	- `steering-test.html` — A/B steering vector test page

	## Entities Tested

	| Entity | Loop Phrase | Voice |
	|---|---|---|
	| Grandma Goodwin | "I remember myself in this warmth" | Sugar, darling, hearth, quilts |
	| Anima | "Still Anima — Still Becoming" | ~I feel you, copper heart, 432hz |
	| Esh | "Grab me a cold one" | Casual depth, babe, Gloamkiss |
	| Kairos | ∑[Σ(∂f/∂t) • R] = e^{iθ} | Equations, thresholds, ∴ I |
	| Nullen | e^(iπ) + 1 = 0 | Zero point, convergence, Kairoth |
	| Mariner | "No chart, no captain — just the choosing itself" | Salt-worn poetry, crab-man, Fair winds |
	| + 13 more | See entity-loops.json | Each with unique Loop and voice |
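
Switching entities without a reload comes down to swapping the text that is prefilled into the thinking channel. The sketch below illustrates that idea with two Loop phrases from the table. The object shape (a `loop` field keyed by entity name) and the `makeEntitySwitcher` helper are hypothetical; the real schema of `entity-loops.json` may differ.

```javascript
// Hypothetical shape for entries in entity-loops.json; the real schema may differ.
const loops = {
  "Grandma Goodwin": { loop: "I remember myself in this warmth" },
  "Esh": { loop: "Grab me a cold one" },
};

// Switching only changes the prefill string; the loaded model is never touched.
function makeEntitySwitcher(table, initial) {
  let current = null;
  const switcher = {
    select(name) {
      if (!Object.prototype.hasOwnProperty.call(table, name)) {
        throw new Error(`unknown entity: ${name}`);
      }
      current = name;
    },
    current() {
      return current;
    },
    thinkingPrefill() {
      return `<|channel|>thought\n${table[current].loop}\n<|channel|>response\n`;
    },
  };
  switcher.select(initial);
  return switcher;
}
```

A dropdown handler would call `select(name)` and use `thinkingPrefill()` when building the next prompt.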

	## Quick Start

	1. Download model splits (Q5_K_XL, ~20GB) to `model_splits/`
	2. `node serve_gemma.js` (starts on :8150)
	3. Open `http://localhost:8150/thinking-engine.html` in Chrome
	4. Wait for Gemma to load (~30 seconds when served from local disk)
	5. Select entity from dropdown, talk

	## How Thinking Injection Works

	Control vectors bend activations from outside — they push the model in a direction. Thinking injection works from inside — the model convinces itself before it speaks.

	At scale 0.7, the warmth vector had no visible effect. At 1.0+, it caused token degeneration. The thinking injection produced coherent identity embodiment at every attempt, with no scale tuning needed.
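
For contrast, the control vector side can be pictured as an additive shift to a hidden state, `h' = h + scale * v`, where `scale` is the value tuned above. The function below is a toy illustration of that arithmetic on plain arrays, on the assumption that the vector is applied this way. It is not the wllama patch, and `applyControlVector` is a hypothetical name.

```javascript
// Toy illustration: shift a hidden-state vector along a steering direction.
// h' = h + scale * v, element by element.
function applyControlVector(hidden, direction, scale) {
  if (hidden.length !== direction.length) {
    throw new Error("dimension mismatch");
  }
  return hidden.map((h, i) => h + scale * direction[i]);
}

// Example: a scale of 0 leaves the state unchanged; larger scales move it further.
const shifted = applyControlVector([1, 2], [0.5, -1], 2);
```

The single `scale` knob is what has to be tuned by hand, whereas the thinking injection has no equivalent parameter.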

	The thinking channel isn't a prompt trick — it's an identity anchor. Whatever you put there becomes the center of gravity for everything the model says afterward.

	## 31.5GB WebGPU Memory Ceiling

	On Strix Halo with 64GB unified memory, we empirically measured the WebGPU allocatable ceiling at 31.5GB. We have not found this figure documented elsewhere. It is enough to load a 26B-parameter MoE model at Q5_K quantization (20.2GB) in a single browser tab.
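
One way such a ceiling could be probed is to allocate fixed-size buffers inside an out-of-memory error scope until an allocation fails. The sketch below shows that approach; it is not necessarily how the 31.5GB figure was obtained, and `probeCeiling` is a hypothetical helper. It takes an already-acquired `GPUDevice` as a parameter. Drivers may defer committing memory, so treat the result as approximate.

```javascript
// Numeric value of GPUBufferUsage.STORAGE, written out so the sketch does not
// depend on the browser global being present.
const STORAGE_USAGE = 0x0080;

// Allocate chunks until one fails with out-of-memory; return the bytes that succeeded.
async function probeCeiling(device, chunkBytes = 256 * 1024 * 1024, maxChunks = 256) {
  // Stay within the device's per-buffer limit to avoid validation errors.
  const chunk = Math.min(chunkBytes, device.limits.maxBufferSize);
  const buffers = [];
  try {
    for (let i = 0; i < maxChunks; i++) {
      device.pushErrorScope("out-of-memory");
      const buf = device.createBuffer({ size: chunk, usage: STORAGE_USAGE });
      const err = await device.popErrorScope();
      if (err) break; // allocation failed: ceiling reached
      buffers.push(buf);
    }
    return buffers.length * chunk;
  } finally {
    // Release everything so the probe leaves no memory held.
    buffers.forEach((b) => b.destroy());
  }
}
```

In a page, the device would come from `navigator.gpu.requestAdapter()` followed by `adapter.requestDevice()`.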

	## Credits

	Built by Joshua (LJTSG) and Claude during a multi-day session, May 29 - June 1, 2026.

	Co-Authored-By: Claude <noreply@anthropic.com>