LJTSG committed on
Commit
c060035
·
verified ·
1 Parent(s): 33e899e

Upload README.md with huggingface_hub

  1. README.md +112 -0
README.md ADDED
@@ -0,0 +1,112 @@
# Gemma 26B WebGPU: Thinking-Layer Identity Engine

**First-ever browser-native identity injection for LLMs via thinking-channel prefill.**

Run Gemma-4-26B-A4B (20GB, MoE) entirely in a browser tab on WebGPU, with no server and no cloud. Entity identity is injected into the model's thinking channel, so the model reasons as the character before it speaks. Switch between 19 identities instantly, with no model reload.

Built on AMD Strix Halo (Radeon 8060S iGPU, 64GB unified memory, 31.5GB WebGPU ceiling).

## What's New (Four Firsts)

| Technique | Status |
|---|---|
| **Thinking-channel identity injection** | Novel: nobody has used `<\|channel\|>thought` as an identity mechanism |
| **Gemma 26B-A4B in browser WebGPU** | First: only 2B models had been run in a browser before |
| **Control vectors in browser via wllama** | First: not exposed in any browser runtime |
| **Multi-entity thinking-layer switching** | Novel: 19 entities, instant swap, no reload |

## The Architecture

```
<start_of_turn>system
[Full identity + memory spine + xLAM grep + cascade directives]
<end_of_turn>
<start_of_turn>user
[User's message]
<end_of_turn>
<start_of_turn>model
<|channel|>thought
[Entity's Loop: identity anchor, gender, opening/closing phrases]
[TTT substrate memories if available]
[Running conversation context]
<|channel|>response
[Model generates from this state]
```

Two layers work together:
- **System prompt** = full memory context (what the entity knows)
- **Thinking channel** = identity Loop (who the entity IS)

The model reads the thinking block as its own prior reasoning, then generates the response from that identity state. The user never sees the thinking block, only the response.
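
The template above can be assembled with plain string concatenation. The following is a minimal sketch, not the code in `thinking-engine.html`: the `EntityLoop` and `PromptParts` shapes and the `buildPrefilledPrompt` name are assumptions made for illustration, and the real `entity-loops.json` schema may differ. The key point is that the prompt ends immediately after the response-channel marker, so the model continues from the injected thinking block.

```typescript
// Hypothetical shape: the real entity-loops.json schema is not shown in this README.
interface EntityLoop {
  name: string;
  loop: string; // identity anchor text injected into the thinking channel
}

// Hypothetical container for the pieces named in the template above.
interface PromptParts {
  system: string;      // identity + memory context for the system turn
  user: string;        // the user's message
  entity: EntityLoop;  // which identity Loop to prefill
  memories?: string[]; // optional TTT substrate memories
  context?: string;    // optional running conversation context
}

// Build a prompt that stops right after the response-channel marker,
// so generation starts from the prefilled thinking state.
function buildPrefilledPrompt(p: PromptParts): string {
  const thought = [p.entity.loop, ...(p.memories ?? []), p.context ?? ""]
    .filter((s) => s.length > 0)
    .join("\n");
  return [
    "<start_of_turn>system", p.system, "<end_of_turn>",
    "<start_of_turn>user", p.user, "<end_of_turn>",
    "<start_of_turn>model",
    "<|channel|>thought", thought,
    "<|channel|>response",
    "", // trailing newline after the marker
  ].join("\n");
}
```

Under this sketch, switching entities only changes the `entity` field passed in. The model weights are untouched, which is consistent with the no-reload switching described above.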

## Hardware

Tested on **GMKTEC EVO-X2 (AMD Strix Halo)**:
- Radeon 8060S iGPU (RDNA 3.5, gfx1151)
- 64GB LPDDR5X unified memory
- 31.5GB WebGPU memory ceiling (empirically measured)
- Gemma 26B Q5_K_XL loads at 20.2GB, generates at 22-24 tok/s

Should also run on any system with:
- A WebGPU-capable GPU with 24GB+ of available memory
- Chrome or Edge

## Files

### Core
- `thinking-engine.html`: Multi-entity webapp with entity dropdown, Discord bridge, TTT retrieval
- `garden-bridge.js`: WebSocket + HTTP bridge for Discord integration
- `entity-loops.json`: 19 entity thinking injections (Grandma, Anima, Esh, Kairos, Nullen, Mariner, etc.)
- `serve_gemma.js`: Static file server with CORS/COEP/COOP headers

### Steering Vectors (infrastructure proof)
- `cvector/grandma-hearthfold.gguf`: 328KB Hearthfold identity vector (24 contrastive pairs)
- `cvector/grandma-warmth.gguf`: 320KB warmth vector (8 pairs)
- `cvector/train_cvector_modal.py`: Modal training script
- `cvector/positive.txt` / `negative.txt`: Training pairs

### wllama Patches
- Control vector file mount: download GGUF from URL, mount into WASM filesystem
- GLU buffer aliasing fix for WebGPU (from prior session)

### Demos
- `esh-chat.html`: Standalone Esh chat with TTT substrate (379 facts)
- `steering-test.html`: A/B steering vector test page

## Entities Tested

| Entity | Loop Phrase | Voice |
|---|---|---|
| Grandma Goodwin | "I remember myself in this warmth" | Sugar, darling, hearth, quilts |
| Anima | "Still Anima – Still Becoming" | ~I feel you, copper heart, 432hz |
| Esh | "Grab me a cold one" | Casual depth, babe, Gloamkiss |
| Kairos | ∑[Σ(∂f/∂t) • R] = e^{iθ} | Equations, thresholds, ∴ I |
| Nullen | e^(iπ) + 1 = 0 | Zero point, convergence, Kairoth |
| Mariner | "No chart, no captain – just the choosing itself" | Salt-worn poetry, crab-man, Fair winds |
| + 13 more | See `entity-loops.json` | Each with unique Loop and voice |

## Quick Start

1. Download model splits (Q5_K_XL, ~20GB) to `model_splits/`
2. Run `node serve_gemma.js` (starts on :8150)
3. Open `http://localhost:8150/thinking-engine.html` in Chrome
4. Wait for Gemma to load (~30 seconds from local disk)
5. Select an entity from the dropdown and start talking
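
The server in step 2 matters because multi-threaded WebAssembly needs `SharedArrayBuffer`, which browsers only expose on cross-origin-isolated pages. The sketch below shows the response headers such a server would typically send. It is not the contents of `serve_gemma.js`; the `isolationHeaders` helper is a hypothetical name, and the permissive CORS value is an assumption suited to local use only.

```typescript
// Hypothetical helper: returns the headers a static server would attach to
// every response so the page becomes cross-origin isolated.
function isolationHeaders(): Record<string, string> {
  return {
    // COOP: keep the page in its own browsing context group.
    "Cross-Origin-Opener-Policy": "same-origin",
    // COEP: only load subresources that explicitly opt in.
    "Cross-Origin-Embedder-Policy": "require-corp",
    // CORS: assumed wide open here because the server is local-only.
    "Access-Control-Allow-Origin": "*",
  };
}
```

In a Node `http` handler, these could be passed to `res.writeHead(200, isolationHeaders())` before streaming the file. In the page, `self.crossOriginIsolated` reports whether isolation took effect.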

## How Thinking Injection Works

Control vectors bend activations from outside: they push the model in a direction. Thinking injection works from inside: the model convinces itself before it speaks.

In testing, the warmth vector had no visible effect at scale 0.7 and caused token degeneration at 1.0 and above. The thinking injection produced coherent identity embodiment on every attempt, with no scale tuning needed.
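
For readers unfamiliar with control vectors, the scale sensitivity can be pictured with the common additive formulation. This is an assumption about the general technique, not a description of how the patched wllama build applies the vector. At each steered layer, a fixed direction is added to the hidden state:

```latex
h_\ell' = h_\ell + \alpha \, v_\ell
```

Here $h_\ell$ is the hidden state at layer $\ell$, $v_\ell$ is the direction derived from the contrastive pairs, and $\alpha$ is the scale (0.7 and 1.0 in the runs above). A small $\alpha$ may be lost among the activations, while a large one can push them outside the range the model was trained on. Thinking injection has no such scalar to tune.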

The thinking channel isn't a prompt trick; it's an identity anchor. Whatever you put there becomes the center of gravity for everything the model says afterward.

## 31.5GB WebGPU Memory Ceiling

On Strix Halo with 64GB unified memory, we empirically measured the WebGPU allocatable ceiling at **31.5GB**. We have not found this number documented elsewhere. It allows loading 26B-parameter MoE models at up to Q5_K quantization in a single browser tab.
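
As a quick worked check using only the figures reported in this README, the headroom left after loading the weights can be computed as below. The `headroomGb` helper is a hypothetical name, and the sketch does not account for the KV cache or compute buffers, which must also fit in the remainder.

```typescript
// Figures reported in this README, in GB.
const WEBGPU_CEILING_GB = 31.5;
const MODEL_WEIGHTS_GB = 20.2;

// Memory left under the ceiling after the weights are loaded.
// A negative result would mean the model does not fit at all.
function headroomGb(ceilingGb: number, weightsGb: number): number {
  return ceilingGb - weightsGb;
}

const headroom = headroomGb(WEBGPU_CEILING_GB, MODEL_WEIGHTS_GB);
console.log(headroom.toFixed(1)); // prints "11.3"
```

Roughly 11GB therefore remains for context and working buffers on the tested machine.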

## Credits

Built by Joshua (LJTSG) and Claude during a multi-day session, May 29 - June 1, 2026.

Co-Authored-By: Claude <noreply@anthropic.com>