# Thousand Token Wood — Architecture

Deep technical design for the AI-improvised visual novel. Companion to [`../README.md`](../README.md) (overview), [`../CLAUDE.md`](../CLAUDE.md) (conventions), and [`PROMPTS.md`](PROMPTS.md) (exact prompts + grammar).

---

## 0. Concept & design pillars

**Premise.** The player is a wanderer who steps into a wood that is being *dreamed into existence* around them. Everyone they meet is a spirit the wood conjures; every backdrop is painted the moment the player arrives. The player shapes the dream by speaking or writing.

**The diegetic conceit (this is the most important design decision).** The wood is dreamed by a *small, slightly forgetful mind*. So when a small model is whimsical, slightly inconsistent, or surreal, that is **in-world correct** — the wood is dreaming. This reframing turns the weaknesses of ≤32B models into the *aesthetic*, and it’s what lets a small-model game feel intentional rather than broken. Every design choice should reinforce “a dream that paints itself,” not “a chatbot that sometimes fails.”

**Design pillars**

1. **AI generates the content, not just assists.** Plot direction, dialogue, and art are produced live. Remove the models and nothing remains.
2. **Snappy over clever.** On a laptop, latency kills delight. One LLM call per turn; 1–4-step image model; aggressive caching; text-before-image.
3. **The model proposes, code disposes.** The LLM emits *typed directives*; deterministic code owns the canonical state. Small models cannot be trusted to hand-edit prose state without drift.
4. **Memory is a budget, not an archive.** Feed the model a rolling summary + present-scene detail + present-character sheets — never the whole history.
5. **Constrain to survive.** Grammar-constrained decoding guarantees syntactically parseable directives even from a 7–8B model; code still validates their content before applying them (§3.2, step 4).

---

## 1. System overview

```
┌──────────────────────────── HF Space (gradio.Server) ────────────────────────────┐
│                                                                                   │
│   frontend/index.html  ── Gradio JS client ──▶  @app.api endpoints                │
│   (layered VN UI)                                  │                              │
│                                                    ▼                              │
│   app.py  (thin: routes, @app.api, @spaces.GPU)                                   │
│      │            │                  │                  │                         │
│      ▼            ▼                  ▼                  ▼                         │
│   stt.py      orchestrator.py    characters.py      painter.py                    │
│   (Whisper)   (the Weaver)       (the Voices)       (diffusion + BiRefNet)        │
│                   │   │               │                  ▲                         │
│                   │   └──── llm.py (llama.cpp / transformers, grammar) ────┘       │
│                   ▼                                                                │
│   state.py  (GameState · apply directives) ──▶ memory.py (summary + budget)       │
│                   │                                                                │
│                   ▼                                                                │
│   .md memory:  templates/world_state.md · one per character                       │
└───────────────────────────────────────────────────────────────────────────────┘
```

`stt.py`, `orchestrator.py`, `characters.py`, `painter.py` are all callable in isolation (unit-testable). `app.py` only wires them to HTTP.

---

## 2. The model roles in detail

### 2.1 Why one LLM for two roles

The “Weaver” and the “Voices” are **distinct agents** (different jobs, different prompts, different output shapes) but they run on the **same loaded LLM weights**. Reasons: (a) the parameter budget — a second 7–8B model would nearly double VRAM for little gain; (b) simplicity — one runtime, one quantization, one warm-up; (c) it’s still genuinely multi-model (text + image + speech are three real, different models). If you later want a *visibly* separate model for the demo narrative, the cheapest split is a tiny dedicated model for one narrow job (e.g. a 0.5–1.5B model just for image-prompt rewriting) — but that’s optional and not recommended for the MVP.

### 2.2 The Weaver (Game Master / director)

**Responsibilities**

- **`init_world(seed, vibe)`** — invent the setting, the opening scene, and the first NPC; write the world-state `.md` and the first character sheet.
- **`direct_turn(state, player_input)`** — the per-turn workhorse. In a single grammar-constrained call it produces both (a) the NPC’s reply (delegating tone to the Voices system prompt) and (b) the **directives** that tell the engine what changed.
- **`compact_memory(state)`** — periodically fold old events into the rolling summary to stay within the context budget.

**I/O contract (per turn).** Input: assembled context (see §4.4). Output: a single JSON object validated against the directive schema (§4.2). Code applies it; the Weaver never writes to state directly.

### 2.3 The Voices (character manager / actor)

In the default design the Voices is not a separate LLM call; it is the **actor persona layer** of the same per-turn call. `characters.py` assembles the actor context: for each *present* NPC, its character sheet (traits, voice, goals, current mood, relationship to the player) plus the current scene. The system prompt instructs the model to speak *only* as the addressed/active NPC, in their voice, never breaking character or narrating as the author.

> **Two-call variant (optional).** If you find the single call muddies voice quality, split it: call 1 = Voices produces dialogue (free text); call 2 = Weaver reads the dialogue + player input and emits directives (grammar JSON). This doubles per-turn latency, so only do it if quality demands it. Keep it behind a flag.

### 2.4 The Painter (image)

See §5. Consumes prompts composed from the world style guide + the scene/character description; returns a backdrop image and/or a character sprite (cut to transparency).

### 2.5 The Ear (STT)

See §6. Whisper turns recorded audio into the player’s text input. Purely a front door to the loop.

---

## 3. The game loop in detail

### 3.1 Initialisation (once)

1. Player picks a *vibe* (e.g. “cozy folktale”, “eerie”, “absurd”) and optionally a seed.
2. `orchestrator.init_world()` → world-state `.md` + first character sheet + an opening line of narration/dialogue + an initial set of directives (scene description, first NPC, requested art).
3. `painter` paints the opening backdrop and the first sprite (parallelisable). Cache both.
4. Render the opening scene.

### 3.2 The turn (loops)

| Step | Module | What happens |
|---|---|---|
| 1 | `stt` (if voice) | Recorded audio → text. Typed input skips this. |
| 2 | `memory` | Assemble context: style guide + rolling summary + current scene + present-character sheets + last *k* turns + the player input. Enforce token budget. |
| 3 | `orchestrator.direct_turn` → `llm.complete_json` | **One** grammar-constrained call → `{ speaker, dialogue, emotion, directives }`. |
| 4 | `state` | Validate + apply directives deterministically: move scene, add/remove NPC, set mood, apply relationship deltas, set flags, mark beat/ending. |
| 5 | `memory` | Append the turn; if over budget or every *N* turns, `compact_memory()`. |
| 6 | `painter` (conditional) | If `scene_change` → paint/lookup backdrop. If `new_character` → paint+matte sprite. If only `mood` changed → swap to the (cached or conditioned) mood sprite. |
| 7 | frontend | Render: backdrop, sprite (with mood), speaker name, dialogue. Dialogue streams **first**; images fill in when ready. |
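
The table above can be read as one function. The following is a minimal sketch of steps 2–6, assuming hypothetical injected callables (`assemble`, `complete_json`, `apply`, `paint`) and a plain `dict` standing in for `GameState`; none of these signatures are fixed by this document.

```python
from typing import Any, Callable, Optional


def run_turn(
    state: dict[str, Any],
    player_input: str,
    assemble: Callable[[dict, str], str],
    complete_json: Callable[[str], dict],
    apply: Callable[[dict, dict], None],
    paint: Callable[[dict, dict], Optional[str]],
) -> dict[str, Any]:
    """Run one turn with every dependency injected (hypothetical signatures)."""
    context = assemble(state, player_input)      # step 2: budgeted context
    out = complete_json(context)                 # step 3: exactly one LLM call
    directives = out["directives"]
    apply(state, directives)                     # step 4: code owns the state
    state.setdefault("recent_turns", []).append( # step 5: append the turn
        {"player": player_input, "speaker": out["speaker"], "dialogue": out["dialogue"]}
    )
    # Step 6: only touch the Painter when the scene or the cast changed.
    needs_art = bool(directives.get("scene_change") or directives.get("new_character"))
    image = paint(state, directives) if needs_art else None
    return {
        "speaker": out["speaker"],
        "dialogue": out["dialogue"],
        "emotion": out["emotion"],
        "image": image,
    }
```

Passing the collaborators in as arguments keeps the loop testable without loading any model, which matches the "callable in isolation" goal from §1.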

### 3.3 The directive contract (what the LLM is allowed to change)

Directives are a *closed set* of safe, structured operations — the engine’s “API” that the model calls by emitting JSON (conceptually identical to tool-calling, but enforced by grammar). Closed set ⇒ the model can’t put the game in an undefined state. See §4.2 for the schema and [`PROMPTS.md`](PROMPTS.md) for the GBNF grammar.
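
As a rough illustration of the closed set, the sketch below rejects any reply that names a directive outside the schema in §4.2. The function name `parse_turn` and the error messages are assumptions for this example, not part of the codebase.

```python
import json

# Keys taken from the directive schema in section 4.2; anything else is rejected.
ALLOWED_DIRECTIVES = frozenset({
    "scene_change", "new_character", "exit_character",
    "relationship_delta", "set_flags", "advance_beat", "ending",
})
REQUIRED_TOP_LEVEL = ("speaker", "dialogue", "emotion", "directives")


def parse_turn(raw: str) -> dict:
    """Parse one LLM reply and reject anything outside the closed directive set."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("turn output must be a JSON object")
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    directives = data["directives"]
    if not isinstance(directives, dict):
        raise ValueError("directives must be an object")
    unknown = set(directives) - ALLOWED_DIRECTIVES
    if unknown:
        raise ValueError(f"unknown directives: {sorted(unknown)}")
    return data
```

With a grammar in place the unknown-key branch should never fire on the llama.cpp path; it mainly protects the prompt-based `transformers` path.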

---

## 4. State & memory

### 4.1 `GameState` (source of truth)

A Pydantic model held in memory for the session and mirrored to `.md`. Sketch:

```python
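from pydantic import BaseModel


class Turn(BaseModel):
    # Assumed shape: this document does not define Turn's fields.
    player_input: str
    speaker: str
    dialogue: str


class Character(BaseModel):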
    id: str
    name: str
    one_line: str                 # "a nervous lantern-moth who collects apologies"
    traits: list[str]
    voice: str                    # how they speak: rhythm, vocabulary, tics
    goals: str
    appearance: str               # the STABLE description used for every sprite of them
    mood: str = "neutral"         # drives sprite variant
    relationship: int = 0         # -100..100 toward the player
    sprite_seed: int              # pinned for visual consistency
    known_facts: list[str] = []   # what THIS character knows (avoids omniscience)

class Scene(BaseModel):
    id: str
    place: str
    description: str              # used for the backdrop prompt
    mood: str
    present: list[str]            # character ids on stage (cap ~3)
    backdrop_seed: int

class GameState(BaseModel):
    seed: int
    style_guide: str              # global art + tone bible (set at init, mostly frozen)
    vibe: str
    scene: Scene
    characters: dict[str, Character]
    summary: str = ""             # rolling compressed history
    recent_turns: list[Turn] = [] # last k verbatim turns
    flags: dict[str, str] = {}    # arbitrary world facts the Weaver sets
    beat: str = "opening"         # opening | rising | turn | resolution | ended
    turn_index: int = 0
```
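
Two of the comments in the sketch above imply invariants that code has to enforce: `relationship` stays within -100..100, and `present` is capped at about three characters. A minimal sketch of both follows; the helper names are hypothetical, and treating the stage cap as a hard error is an assumption, since the document only says "cap ~3".

```python
REL_MIN, REL_MAX = -100, 100  # range stated on Character.relationship
STAGE_CAP = 3                 # "cap ~3" on Scene.present; treated as a hard cap here


def apply_relationship_delta(current: int, delta: int) -> int:
    """Add a relationship delta and clamp the result into the allowed range."""
    return max(REL_MIN, min(REL_MAX, current + delta))


def enter_stage(present: list[str], character_id: str) -> list[str]:
    """Return a new 'present' list; ignore duplicates, refuse entries over the cap."""
    if character_id in present:
        return list(present)
    if len(present) >= STAGE_CAP:
        raise ValueError("stage is full; a character must exit first")
    return [*present, character_id]
```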

### 4.2 Directive schema (the per-turn LLM output)

```jsonc
{
  "speaker": "lantern_moth",          // which present character speaks (or "narrator")
  "dialogue": "string",               // their line, in voice
  "emotion": "string",                // free-form mood word (e.g. "curious", "tender") → sprite mood
  "directives": {
    "scene_change": null,             // or { "place": "...", "description": "...", "mood": "..." }
    "new_character": null,            // or a partial Character (id,name,one_line,appearance,voice,traits,goals)
    "exit_character": null,           // or character id leaving the stage
    "relationship_delta": 0,          // toward the player, applied to the speaker
    "set_flags": {},                  // e.g. {"gave_player_the_key":"true"}
    "advance_beat": false,            // nudge pacing toward resolution
    "ending": null                    // or { "kind":"warm|bittersweet|strange", "text":"..." }
  }
}
```

`emotion` is a free-form string (the LLM picks whatever fits the moment). Every nested object is optional/nullable so simple turns stay tiny. `complete_json` enforces structure on the llama.cpp path via GBNF; on the `transformers` path it uses prompt-based JSON extraction with a 3-attempt retry.
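
The prompt-based path could look roughly like the sketch below: pull the first JSON object out of free-form model text, and retry up to three times. `generate` is a hypothetical stand-in for whatever text-generation call the `transformers` backend exposes; the real `complete_json` may differ.

```python
import json
from typing import Callable


def extract_json_object(text: str) -> dict:
    """Return the first top-level JSON object found in free-form model output."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(text, start)
            if isinstance(value, dict):
                return value
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    raise ValueError("no JSON object in model output")


def complete_json_with_retry(
    generate: Callable[[str], str], prompt: str, attempts: int = 3
) -> dict:
    """Call the model until its reply contains a JSON object, or give up."""
    last_error = None
    for _ in range(attempts):
        try:
            return extract_json_object(generate(prompt))
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"no valid JSON after {attempts} attempts") from last_error
```

Extraction only proves the reply is JSON; the closed-set and type checks still have to run on the result before anything touches `GameState`.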

### 4.3 `.md` as a derived view (not the truth)

Why both a struct *and* markdown? The struct is robust and code-friendly; the `.md` files are (a) human-readable so you can debug/show the “dream-memory” in the demo, (b) a clean, compact way to inject character/world context back into the prompt, and (c) on-theme. `state.py` renders `GameState` → `world_state.md` + one file per character after each turn, and can parse them back on load. The LLM *reads* `.md`; it does not author the canonical copy. Templates: [`../templates/world_state.md`](../templates/world_state.md), [`../templates/character_sheet.md`](../templates/character_sheet.md).
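
To make the render/parse round trip concrete, here is a small sketch for a character sheet. The markdown layout used here (a `# Name` heading plus `- key: value` bullets) is an assumption for illustration; the real layout lives in the templates linked above.

```python
SHEET_KEYS = ("id", "one_line", "mood", "relationship")  # subset, for brevity


def render_character_md(character: dict) -> str:
    """Render a character record as a tiny markdown sheet (assumed layout)."""
    lines = [f"# {character['name']}"]
    for key in SHEET_KEYS:
        lines.append(f"- {key}: {character[key]}")
    return "\n".join(lines) + "\n"


def parse_character_md(text: str) -> dict:
    """Parse a sheet produced by render_character_md back into a record."""
    character: dict = {}
    for line in text.splitlines():
        if line.startswith("# "):
            character["name"] = line[2:].strip()
        elif line.startswith("- ") and ": " in line:
            key, value = line[2:].split(": ", 1)
            # relationship is the only numeric field in this subset
            character[key] = int(value) if key == "relationship" else value
    return character
```

This line-based format assumes single-line values; a field containing a newline would need escaping or a different layout.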

### 4.4 Context assembly & budget (`memory.py`)

Every turn, build the prompt from, in priority order until the budget fills:

1. System prompt (Weaver+Voices role) — fixed.
2. `style_guide` + `vibe` — small, fixed.
3. `summary` — the rolling compressed history.
4. Current `Scene` description + the **present** characters’ sheets only (not the whole cast).
5. The last *k* verbatim turns (e.g. k=4–6).
6. The player’s new input.

Target a conservative budget (e.g. roughly 3–4k tokens of context, even if the model supports more): small models degrade as context grows, and a short prompt keeps latency down. When the assembled context would exceed the budget, `compact_memory()` asks the LLM to rewrite the oldest verbatim turns into 3–5 sentences, appends those sentences to `summary`, and drops the turns from `recent_turns`. This is the literal “Thousand Token Wood”: a small working memory by design.
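
A minimal sketch of the budget step, under two loud assumptions: tokens are estimated at about four characters each (a real implementation would use the model's tokenizer), and only the verbatim turns are eligible to be dropped. The function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in the real tokenizer."""
    return max(1, len(text) // 4)


def assemble_context(
    fixed_parts: list[str], recent_turns: list[str], player_input: str, budget: int
) -> tuple[str, list[str]]:
    """Keep fixed parts and the player input; drop the oldest turns until it fits.

    Returns the prompt and the dropped turns, which are the candidates for
    compaction into the rolling summary.
    """
    kept = list(recent_turns)
    dropped: list[str] = []

    def total() -> int:
        return sum(estimate_tokens(p) for p in [*fixed_parts, *kept, player_input])

    while kept and total() > budget:
        dropped.append(kept.pop(0))  # oldest turn goes first
    prompt = "\n\n".join([*fixed_parts, *kept, player_input])
    return prompt, dropped
```

If the fixed parts alone exceed the budget, this sketch still returns them untouched; shrinking the summary or the character sheets would be a separate decision.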

---

## 5. The Painter (image pipeline)

### 5.1 Model choice & rationale

The hard problems are **latency** (must be a few seconds, not 30) and **character consistency** (the same NPC must look the same across scenes). Recommended options, in order:

| Option | Params | Steps | Why | Watch out |
|---|---|---|---|---|
| **SDXL-Turbo / SDXL-Lightning** (default) | ~3.5B | 1–4 | Mature, permissive, huge **illustration/anime LoRA** ecosystem (perfect VN look), runs on 8–16 GB, very fast. Consistency via pinned seed + fixed appearance string + generate-once. | Weaker prompt adherence than newer models; no built-in image conditioning. |
| **FLUX.2 Klein** (upgrade) | ~4B | few | Distilled from FLUX.2; **“Kontext” image-conditioning** edits an existing image (“same character, now smiling”, “same scene at dusk”) — *solves consistency directly* and gives mood sprites for free. Fits ~16 GB. | **Check the license** before shipping publicly; slightly heavier. |
| **Z-Image-Turbo** | ~6B (fits 13–16 GB) | turbo | Apache-2.0, fast. Good permissive middle ground. | Smaller ecosystem than SDXL. |

Pick **one** art style and bake it into `style_guide` (e.g. “soft watercolor storybook”, “muted ukiyo-e”, “90s anime VN”) so everything coheres. A style LoRA on SDXL is the cheapest way to a distinctive look (also nudges the **Off-Brand** vibe).

### 5.2 Consistency strategy (the crux)

- **Generate each character’s base sprite exactly once**, at introduction, and cache it. **Never re-paint an existing character** to “refresh” — that’s what breaks consistency.
- Store a **pinned `sprite_seed`** and the **stable `appearance` string** in the character record; reuse both for any later generation of that character.
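- For **moods/expressions**: either (a) pre-generate a small set (neutral/happy/sad/surprised) at intro and swap, or (b) if using FLUX.2 Klein, *condition on the base sprite* to edit only the expression. Option (b) is cleaner and paints only the moods that actually occur, at the cost of one conditioned generation the first time each mood appears; cache the result so later swaps are free.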
- For **backdrops**: cache by `scene_id`; revisiting a place reuses its image.
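
The generate-once rule can be sketched as a small cache wrapped around the paint call. `PaintCache` and its `paint(prompt, seed)` callable are hypothetical names for this example; a real version would likely persist images to disk rather than hold them in memory.

```python
import hashlib
from typing import Callable


class PaintCache:
    """Paint each (kind, id, variant) at most once; later requests reuse the image."""

    def __init__(self, paint: Callable[[str, int], bytes]):
        self._paint = paint
        self._images: dict[str, bytes] = {}

    @staticmethod
    def cache_key(kind: str, entity_id: str, variant: str = "base") -> str:
        """Stable key, also usable as a filename for a disk cache."""
        raw = f"{kind}:{entity_id}:{variant}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()[:16]

    def get_or_paint(
        self, kind: str, entity_id: str, prompt: str, seed: int, variant: str = "base"
    ) -> bytes:
        key = self.cache_key(kind, entity_id, variant)
        if key not in self._images:
            # First sighting: paint with the pinned seed and remember the result.
            self._images[key] = self._paint(prompt, seed)
        return self._images[key]
```

Keying on the entity id rather than the prompt is deliberate: even if a later prompt for the same character drifts, the cached sprite wins and the character keeps its look.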

### 5.3 Sprites vs full-scene

Two valid looks:

1. **Layered VN (classic):** transparent character sprite over a separate backdrop. Diffusion won’t emit clean alpha, so generate the character on a plain background and run **BiRefNet** (the same matting model in the `gradio.Server` demo) to cut it out → transparent PNG. More moving parts, but the iconic VN look + lets you reuse one backdrop with different sprites.
2. **Full-scene (simpler):** generate the character *in* the scene as one image. No compositing, no matting — fewer failure modes, but you can’t cheaply swap sprite vs backdrop independently.

Recommend starting **full-scene** in Phase 1 (fast to working), then moving to **layered** in Phase 2 for polish if time allows.

### 5.4 Latency tactics

- 1–4 step model only; 512–768px is plenty for a backdrop behind text.
- **Text first, image second** — the dialogue renders immediately; the UI swaps the image in when the Painter returns (SSE/streaming via the Gradio JS client).
- **Speculative paint:** while the player reads/types, optionally pre-paint the most likely next backdrop.
- All Painter calls behind `@spaces.GPU`; keep the pipeline warm (load once at startup).
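
The text-first tactic boils down to starting the paint in the background and yielding the dialogue before awaiting it. The sketch below shows that ordering with plain `asyncio`; `stream_turn` and the event shapes are assumptions for illustration, and the actual transport (the Gradio JS client) is left out.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def stream_turn(
    dialogue: str, paint: Callable[[], Awaitable[str]]
) -> AsyncIterator[dict]:
    """Yield the dialogue immediately, then the image once the painter finishes."""
    paint_task = asyncio.ensure_future(paint())   # painting starts in the background
    yield {"type": "dialogue", "text": dialogue}  # text reaches the UI first
    yield {"type": "image", "url": await paint_task}
```

A consumer would iterate with `async for event in stream_turn(...)` and update the dialogue box or the backdrop depending on `event["type"]`.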

---

## 6. The Ear (audio / STT)

- **Model:** `faster-whisper` (CTranslate2) `small` for the laptop config, `large-v3-turbo` if you have room — player inputs are short, so `small`/`base` transcribe near-instantly. For the *all-ggml local-first* story, `whisper.cpp` pairs thematically with the llama.cpp bonus.
- **Capture:** in the custom frontend, use the browser `MediaRecorder` API → send the blob to an `@app.api(name="transcribe")` endpoint that runs Whisper. (In the `gr.Blocks` MVP, use `gr.Audio(sources=["microphone"])`.)
- **Flow:** transcript is shown to the player (so they can confirm/edit) then fed into the turn exactly like typed input. Keep voice optional; typing is the reliable fallback for the demo.

---

## 7. Deployment

### 7.1 Local (where the bonuses live)

Run `python app.py`; the LLM via **`llama-cpp-python`** (GGUF, GPU build), diffusion via `diffusers`, Whisper via `faster-whisper`. This is the configuration you record for the **Off-the-Grid (local-first)** and **Llama-Champion (llama.cpp)** badges — show it running with the network off.

### 7.2 Hugging Face Space (the required canvas)

The app is a Gradio app (`gradio.Server` *extends* Gradio/FastAPI), so it deploys as a normal Space. For usable image latency, use a **GPU Space**; **ZeroGPU** is free and integrates via the `@spaces.GPU` decorator (functions get a GPU per-call, allocated on demand). Load models at startup; decorate every inference function.

**The llama.cpp-on-ZeroGPU tension (decide early).** A CUDA `llama-cpp-python` build can be awkward on Spaces (CUDA runtime mismatches), and ZeroGPU’s per-call GPU model doesn’t love long-lived native processes. Two clean resolutions:

- **A (single stack):** get a working CUDA `llama-cpp-python` wheel/build for the Space (keeps llama.cpp everywhere). Test this on day 1, not at the deadline.
- **B (split stack, recommended for safety):** llama.cpp **locally** (claims the llama.cpp badge in the video) + a `transformers` code path **on the Space** under `@spaces.GPU`. Same `llm.py` interface, two backends behind a flag.
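
Option B's "same interface, two backends behind a flag" might be sketched as follows. The environment variable name `LLM_BACKEND`, the class names, and the `complete` signature are all assumptions here, and both backends are placeholders that load no model.

```python
import os
from typing import Mapping, Optional, Protocol


class LLMBackend(Protocol):
    name: str

    def complete(self, prompt: str) -> str: ...


class LlamaCppBackend:
    """Placeholder for the local llama.cpp path (model loading omitted)."""
    name = "llama_cpp"

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("load a GGUF model and generate here")


class TransformersBackend:
    """Placeholder for the Space path that would run under @spaces.GPU."""
    name = "transformers"

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("load a transformers model and generate here")


_BACKENDS = {"llama_cpp": LlamaCppBackend, "transformers": TransformersBackend}


def get_backend(env: Optional[Mapping[str, str]] = None) -> LLMBackend:
    """Pick a backend from a flag; LLM_BACKEND is a hypothetical variable name."""
    env = os.environ if env is None else env
    choice = env.get("LLM_BACKEND", "llama_cpp")
    if choice not in _BACKENDS:
        raise ValueError(f"unknown LLM backend: {choice!r}")
    return _BACKENDS[choice]()
```

The rest of the code only ever sees `LLMBackend`, so the Space and the laptop can differ in one environment variable and nothing else.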

### 7.3 Persistence

The Space filesystem is **ephemeral**. The `.md` dream-memory therefore lives **per session** (perfectly fine for a one-sitting game). If you want playthroughs to survive restarts, write them to **HF persistent storage** or push traces to a **Dataset** (which also feeds the **Open-Trace** badge — see §9.3).

### 7.4 `gradio.Server` specifics

- `app = Server()` (a FastAPI subclass). `@app.get("/")` serves `frontend/index.html`. `@app.api(name="...")` defines queued, ZeroGPU-aware, `gradio_client`-callable endpoints (use these for `direct_turn`, `paint`, `transcribe`). `app.launch()` to run.
- The frontend talks to the backend through the **Gradio JS client** (`Client.connect(window.location.origin)` → `client.predict("/direct_turn", {...})`) so calls go through Gradio’s queue/concurrency (and you can show progress), **not** raw `fetch`.

---

## 8. UI architecture (Off-Brand badge)

### 8.1 Target: a real visual-novel frame

A custom `frontend/index.html` (vanilla HTML/CSS/JS — no build step needed; the hackathon’s own `gradio.Server` demo ships a ~1300-line single file) rendering three CSS-stacked layers:

```
z-0  backdrop  (full-bleed CSS background-image)
z-1  sprite    (transparent PNG, positioned, with a gentle entrance + mood cross-fade)
z-2  dialogue  (named speaker box at the bottom, text typed in character-by-character)
        + input row: text field, 🎙️ mic button, "wait/continue" affordance
```

Niceties that sell delight cheaply: typewriter text reveal, soft sprite slide-in, backdrop cross-fade on scene change, a subtle paper/parchment vignette, a loading shimmer while the Painter works (“the wood is dreaming…”). Keep a small palette + one display font to look intentional.

### 8.2 MVP fallback

Phase 0/1 can use plain `gr.Blocks`: `gr.Image` for the scene, `gr.Chatbot`/`gr.Markdown` for dialogue, `gr.Textbox` + `gr.Audio(sources=["microphone"])` for input. Gate it behind `GRADIO_MVP_UI=1`. This de-risks the loop before you invest in the custom frontend.

### 8.3 Wiring

Frontend ⇄ backend via the Gradio JS client to `@app.api` endpoints (`/start`, `/direct_turn`, `/transcribe`). The backend returns `{ speaker, dialogue, emotion, scene_image_url, sprite_image_url }`; the frontend animates the rest.

---

## 9. Optional badge tracks

### 9.1 Well-Tuned (fine-tune) 🟡

Fine-tune the shared LLM (LoRA on Qwen3-4B/8B) to (a) reliably emit the directive schema and (b) carry the “dreaming wood” narrative voice. Dataset: ~200–800 synthetic `(context → directive-JSON + dialogue)` examples — bootstrap them with a larger model or hand-write seeds, then expand. Train with the standard PEFT/LoRA stack; merge or keep the adapter; convert to **GGUF** and publish on the Hub (this is what the badge checks). Even a small fine-tune that locks the output format + tone is a real win and reduces grammar-fighting at runtime. Time-box it — the MVP must not depend on it.

### 9.2 Field Notes (blog) 🟡

A short write-up: the diegetic conceit, the one-call director pattern, grammar-constrained directives, the consistency-via-cache trick, and the llama.cpp/ZeroGPU lesson. Cheap points, and genuinely useful to others.

### 9.3 Open Trace 🟡

Log every orchestration step per turn — the assembled context, the raw directive JSON, the Painter prompts/seeds — to `runs/*.jsonl`, then push a cleaned sample as a **Hub dataset**. The struct-based state makes these traces clean and shareable.
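
A per-turn trace writer can be very small. The sketch below appends one JSON object per line; the record field names and the `log_turn` signature are assumptions for this example, not a fixed trace format.

```python
import json
import time
from pathlib import Path


def log_turn(
    run_file: Path,
    turn_index: int,
    context: str,
    raw_directive: str,
    painter_calls: list[dict],
) -> None:
    """Append one turn's trace to a JSONL file, creating the folder if needed."""
    record = {
        "ts": time.time(),
        "turn_index": turn_index,
        "context": context,              # the assembled prompt
        "raw_directive": raw_directive,  # the model's JSON, before validation
        "painter_calls": painter_calls,  # e.g. [{"prompt": "...", "seed": 7}]
    }
    run_file.parent.mkdir(parents=True, exist_ok=True)
    with run_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Storing the raw directive as a string keeps malformed model output in the trace as well, which is often the most useful part when debugging.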

---

## 10. Phased plan

Build window **June 5–15**. (Register by **June 3**; sketch + download weights before the 5th.) Days are indicative for a two-person team.

### Phase 0 — Skeleton (Day 1–2)
- `state.py`: `GameState` + directive schema; `.md` render/parse round-trip with a unit test.
- `llm.py`: `complete()` + `complete_json()`, using a **GBNF grammar** on the llama.cpp path and prompt-based JSON extraction with retry on the `transformers` path (start with whichever backend you get running first).
- `orchestrator.direct_turn()` returning valid directive JSON for a hard-coded scene.
- Text-only loop in `gr.Blocks`. **Milestone:** type a line → NPC replies → state updates, looping, no images.

### Phase 1 — The wood breathes (Day 3–5)
- `painter.py`: full-scene generation with SDXL-Turbo; disk cache; pinned seeds.
- `orchestrator.init_world()`: generate setting + first NPC + opening.
- Wire scene/character directives → Painter. **Milestone:** a playable illustrated loop, basic VN layout in `gr.Blocks`.

### Phase 2 — Voice + polish (Day 6–8)
- `stt.py` + mic input.
- Migrate to `gradio.Server` + custom `frontend/index.html` (layered scene, typewriter, animations).
- `memory.compact_memory()`; sprite moods (pre-gen set or FLUX.2-Klein conditioning); optional BiRefNet layered sprites. **Milestone:** speak to a self-painting wood in a bespoke UI.

### Phase 3 — Bonuses + ship (Day 9–10)
- Lock the llama.cpp local config; resolve the Space backend (A or B from §7.2).
- (Optional) fine-tune + GGUF on the Hub; (optional) trace dataset.
- **Record the demo video + write the social post** (+ optional blog). Deploy the Space, test cold-start, freeze. **Milestone:** submitted.

> Cut lines if time runs short, in this order: fine-tune → layered sprites → voice → custom frontend. The loop + a custom-ish UI + working art is a complete, competitive entry.

---

## 11. Risks & mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Image latency makes it feel sluggish | High | High | 1–4-step model; text-first; cache; speculative paint; 512–768px. |
| Character looks different each scene | High | High | Generate-once + cache; pinned seed + fixed appearance; FLUX.2-Klein conditioning for moods. |
| llama.cpp won’t build on ZeroGPU | Med | Med | Decide §7.2 on day 1; keep `transformers` fallback behind a flag; claim llama.cpp badge locally. |
| Small model drifts / over-narrates | Med | Med | Grammar-enforced directives; low temp on structure; rolling-summary memory; lean on the diegetic framing. |
| Scope creep (combat, save slots, 5 NPCs) | Med | High | Honor the “out of scope” list in `CLAUDE.md`; ship the loop. |
| Weights/licensing surprise (esp. FLUX.2) | Low | Med | Verify licenses before shipping; default to SDXL-Turbo/Z-Image (permissive). |
| Demo video/social post left to the last hour | Med | High | They’re *required*. Schedule them in Phase 3, not after. |

---

## 12. Latency budget (per turn, target)

| Stage | Target | Notes |
|---|---|---|
| STT (if voice) | < 1 s | short utterances, `small`/turbo |
| Context assembly | negligible | pure Python |
| LLM directive call | 1–4 s | ~8B Q4, ≤4k context, one call |
| Apply + memory | negligible | |
| Image (only when changed) | 1–5 s | 1–4-step model; *hidden behind text* and often cached |
| **Felt latency** | **dialogue in ~2–4 s**, image fills in after | most turns don’t change the scene |
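
The "felt latency" row is simple arithmetic over the rows above it: the time until dialogue appears is STT (when voice is used) plus the LLM call, because the image is hidden behind the text. A small sketch for checking measured timings against the table's upper bounds follows; the function names and the timing dict shape are assumptions.

```python
# Upper bounds in seconds, taken from the targets in the table above.
TARGETS_S = {"stt": 1.0, "llm": 4.0, "image": 5.0}


def felt_latency(timings: dict[str, float]) -> float:
    """Seconds until dialogue appears: STT (if used) plus the LLM call."""
    return timings.get("stt", 0.0) + timings.get("llm", 0.0)


def over_budget(timings: dict[str, float]) -> list[str]:
    """Return the stages whose measured time exceeds its target."""
    return [
        stage for stage, limit in TARGETS_S.items()
        if timings.get(stage, 0.0) > limit
    ]
```

For a typed turn with `{"llm": 2.5, "image": 3.0}`, the player waits 2.5 s for dialogue even though the whole turn takes longer to finish painting.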

---

## 13. Decisions to lock before Day 1

1. **Project name** (Thousand Token Wood is a placeholder).
2. **Art style** for the `style_guide` (and whether to use a style LoRA).
3. **Image model**: SDXL-Turbo (safe) vs FLUX.2 Klein (consistency, check license).
4. **Sprites**: full-scene first vs layered+BiRefNet.
5. **Space backend**: §7.2 option A vs B.
6. **Whisper size** (small vs turbo) and whether voice ships in v1.
7. **Fine-tune**: in or out (time-box).