# Thousand Token Wood: Architecture

Deep technical design for the AI-improvised visual novel. Companion to [`../README.md`](../README.md) (overview), [`../CLAUDE.md`](../CLAUDE.md) (conventions), and [`PROMPTS.md`](PROMPTS.md) (exact prompts + grammar).

---

## 0. Concept & design pillars

**Premise.** The player is a wanderer who steps into a wood that is being *dreamed into existence* around them. Everyone they meet is a spirit the wood conjures; every backdrop is painted the moment the player arrives. The player shapes the dream by speaking or writing.

**The diegetic conceit (this is the most important design decision).** The wood is dreamed by a *small, slightly forgetful mind*. So when a small model is whimsical, slightly inconsistent, or surreal, that is **in-world correct**: the wood is dreaming. This reframing turns the weaknesses of ≤32B models into the *aesthetic*, and it’s what lets a small-model game feel intentional rather than broken. Every design choice should reinforce “a dream that paints itself,” not “a chatbot that sometimes fails.”

**Design pillars**

1. **AI generates the content, not just assists.** Plot direction, dialogue, and art are produced live. Remove the models and nothing remains.
2. **Snappy over clever.** On a laptop, latency kills delight. One LLM call per turn; 1–4-step image model; aggressive caching; text-before-image.
3. **The model proposes, code disposes.** The LLM emits *typed directives*; deterministic code owns the canonical state. Small models cannot be trusted to hand-edit prose state without drift.
4. **Memory is a budget, not an archive.** Feed the model a rolling summary + present-scene detail + present-character sheets, never the whole history.
5. **Constrain to survive.** Grammar-constrained decoding guarantees parseable directives even from a 7–8B model.

---

## 1. System overview

```
┌───────────────────────────────── HF Space (gradio.Server) ─────────────────────────────────┐
│                                                                                            │
│   frontend/index.html  ── Gradio JS client ──▶  @app.api endpoints                         │
│   (layered VN UI)                                       │                                  │
│                                                         ▼                                  │
│                        app.py (thin: routes, @app.api, @spaces.GPU)                        │
│      │              │              │                                          │            │
│      ▼              ▼              ▼                                          ▼            │
│   stt.py     orchestrator.py   characters.py                             painter.py        │
│  (Whisper)    (the Weaver)     (the Voices)                        (diffusion + BiRefNet)  │
│                     │              │                                                    ▲  │
│                     │              └──── llm.py (llama.cpp / transformers, grammar) ────┘  │
│                     ▼                                                                      │
│              state.py (GameState · apply directives) ──▶ memory.py (summary + budget)      │
│                     │                                                                      │
│                     ▼                                                                      │
│              .md memory: templates/world_state.md · one per character                     │
└────────────────────────────────────────────────────────────────────────────────────────────┘
```

`stt.py`, `orchestrator.py`, `characters.py`, `painter.py` are all callable in isolation (unit-testable). `app.py` only wires them to HTTP.

---

## 2. The model roles in detail

### 2.1 Why one LLM for two roles

The “Weaver” and the “Voices” are **distinct agents** (different jobs, different prompts, different output shapes) but they run on the **same loaded LLM weights**. Reasons: (a) the parameter budget: a second 7–8B model would nearly double VRAM for little gain; (b) simplicity: one runtime, one quantization, one warm-up; (c) it’s still genuinely multi-model (text + image + speech are three real, different models). If you later want a *visibly* separate model for the demo narrative, the cheapest split is a tiny dedicated model for one narrow job (e.g. a 0.5–1.5B model just for image-prompt rewriting), but that’s optional and not recommended for the MVP.

### 2.2 The Weaver (Game Master / director)

**Responsibilities**

- **`init_world(seed, vibe)`**: invent the setting, the opening scene, and the first NPC; write the world-state `.md` and the first character sheet.
- **`direct_turn(state, player_input)`**: the per-turn workhorse.
  In a single grammar-constrained call it produces both (a) the NPC’s reply (delegating tone to the Voices system prompt) and (b) the **directives** that tell the engine what changed.
- **`compact_memory(state)`**: periodically fold old events into the rolling summary to stay within the context budget.

**I/O contract (per turn).** Input: assembled context (see §4.4). Output: a single JSON object validated against the directive schema (§4.2). Code applies it; the Weaver never writes to state directly.

### 2.3 The Voices (character manager / actor)

In the default design, the Voices is not a separate LLM call; it is the **actor persona layer** of the same per-turn call. `characters.py` assembles the actor context: for each *present* NPC, its character sheet (traits, voice, goals, current mood, relationship to the player) plus the current scene. The system prompt instructs the model to speak *only* as the addressed/active NPC, in their voice, never breaking character or narrating as the author.

> **Two-call variant (optional).** If you find the single call muddies voice quality, split it: call 1 = Voices produces dialogue (free text); call 2 = Weaver reads the dialogue + player input and emits directives (grammar JSON). This doubles per-turn latency, so only do it if quality demands it. Keep it behind a flag.

### 2.4 The Painter (image)

See §5. Consumes prompts composed from the world style guide + the scene/character description; returns a backdrop image and/or a character sprite (cut to transparency).

### 2.5 The Ear (STT)

See §6. Whisper turns recorded audio into the player’s text input. Purely a front door to the loop.

---

## 3. The game loop in detail

### 3.1 Initialisation (once)

1. Player picks a *vibe* (e.g. “cozy folktale”, “eerie”, “absurd”) and optionally a seed.
2. `orchestrator.init_world()` → world-state `.md` + first character sheet + an opening line of narration/dialogue + an initial set of directives (scene description, first NPC, requested art).
3. `painter` paints the opening backdrop and the first sprite (parallelisable). Cache both.
4. Render the opening scene.

### 3.2 The turn (loops)

| Step | Module | What happens |
|---|---|---|
| 1 | `stt` (if voice) | Recorded audio → text. Typed input skips this. |
| 2 | `memory` | Assemble context: style guide + rolling summary + current scene + present-character sheets + last *k* turns + the player input. Enforce token budget. |
| 3 | `orchestrator.direct_turn` → `llm.complete_json` | **One** grammar-constrained call → `{ speaker, dialogue, emotion, directives }`. |
| 4 | `state` | Validate + apply directives deterministically: move scene, add/remove NPC, set mood, set relationship deltas, set flags, mark beat/ending. |
| 5 | `memory` | Append the turn; if over budget or every *N* turns, `compact_memory()`. |
| 6 | `painter` (conditional) | If `scene_change` → paint/lookup backdrop. If `new_character` → paint+matte sprite. If only `mood` changed → swap to the (cached or conditioned) mood sprite. |
| 7 | frontend | Render: backdrop, sprite (with mood), speaker name, dialogue. Dialogue streams **first**; images fill in when ready. |

### 3.3 The directive contract (what the LLM is allowed to change)

Directives are a *closed set* of safe, structured operations: the engine’s “API” that the model calls by emitting JSON (conceptually identical to tool-calling, but enforced by grammar). Closed set ⇒ the model can’t put the game in an undefined state. See §4.2 for the schema and [`PROMPTS.md`](PROMPTS.md) for the GBNF grammar.

---

## 4. State & memory

### 4.1 `GameState` (source of truth)

A Pydantic model held in memory for the session and mirrored to `.md`.
Sketch:

```python
from pydantic import BaseModel


class Turn(BaseModel):
    # Assumed minimal shape: this document references Turn but never defines it.
    speaker: str
    text: str


class Character(BaseModel):
    id: str
    name: str
    one_line: str                 # "a nervous lantern-moth who collects apologies"
    traits: list[str]
    voice: str                    # how they speak: rhythm, vocabulary, tics
    goals: str
    appearance: str               # the STABLE description used for every sprite of them
    mood: str = "neutral"         # drives sprite variant
    relationship: int = 0         # -100..100 toward the player
    sprite_seed: int              # pinned for visual consistency
    known_facts: list[str] = []   # what THIS character knows (avoids omniscience)


class Scene(BaseModel):
    id: str
    place: str
    description: str              # used for the backdrop prompt
    mood: str
    present: list[str]            # character ids on stage (cap ~3)
    backdrop_seed: int


class GameState(BaseModel):
    seed: int
    style_guide: str              # global art + tone bible (set at init, mostly frozen)
    vibe: str
    scene: Scene
    characters: dict[str, Character]
    summary: str = ""             # rolling compressed history
    recent_turns: list[Turn] = [] # last k verbatim turns
    flags: dict[str, str] = {}    # arbitrary world facts the Weaver sets
    beat: str = "opening"         # opening | rising | turn | resolution | ended
    turn_index: int = 0
```

### 4.2 Directive schema (the per-turn LLM output)

```jsonc
{
  "speaker": "lantern_moth",      // which present character speaks (or "narrator")
  "dialogue": "string",           // their line, in voice
  "emotion": "string",            // free-form mood word (e.g. "curious", "tender") → sprite mood
  "directives": {
    "scene_change": null,         // or { "place": "...", "description": "...", "mood": "..." }
    "new_character": null,        // or a partial Character (id,name,one_line,appearance,voice,traits,goals)
    "exit_character": null,       // or character id leaving the stage
    "relationship_delta": 0,      // toward the player, applied to the speaker
    "set_flags": {},              // e.g. {"gave_player_the_key":"true"}
    "advance_beat": false,        // nudge pacing toward resolution
    "ending": null                // or { "kind":"warm|bittersweet|strange", "text":"..." }
  }
}
```

`emotion` is a free-form string (the LLM picks whatever fits the moment).
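To make the "model proposes, code disposes" step concrete, here is a minimal sketch of how `state.py` might apply one parsed turn (§3.2, step 4). It is an illustration under assumptions, not the project's implementation: `MiniState`, `apply_directives`, `BEATS`, and `MAX_ON_STAGE` are hypothetical names, it uses plain dataclasses and dicts to stay dependency-free, and the beat-advance rule (one step at a time, capped at `resolution`, with `ended` reachable only through `ending`) is a guess at the intended pacing.

```python
from dataclasses import dataclass, field

BEATS = ["opening", "rising", "turn", "resolution", "ended"]
MAX_ON_STAGE = 3  # assumption: mirrors the "cap ~3" note on Scene.present


@dataclass
class MiniState:
    """Hypothetical, trimmed-down stand-in for GameState."""
    place: str
    present: list[str]
    relationship: dict[str, int] = field(default_factory=dict)  # per character id
    moods: dict[str, str] = field(default_factory=dict)
    flags: dict[str, str] = field(default_factory=dict)
    beat: str = "opening"


def clamp(value: int, lo: int = -100, hi: int = 100) -> int:
    """Keep relationship scores inside the documented -100..100 range."""
    return max(lo, min(hi, value))


def apply_directives(state: MiniState, turn: dict) -> MiniState:
    """Apply one turn's directives in a fixed order; missing keys are no-ops."""
    d = turn.get("directives") or {}
    speaker = turn.get("speaker", "narrator")

    if d.get("scene_change"):
        state.place = d["scene_change"]["place"]

    new = d.get("new_character")
    if new and new["id"] not in state.present and len(state.present) < MAX_ON_STAGE:
        state.present.append(new["id"])
        state.relationship.setdefault(new["id"], 0)

    leaving = d.get("exit_character")
    if leaving in state.present:
        state.present.remove(leaving)

    # Mood and relationship changes only apply to a speaker who is on stage.
    if speaker in state.present:
        state.moods[speaker] = turn.get("emotion", "neutral")
        delta = int(d.get("relationship_delta") or 0)
        state.relationship[speaker] = clamp(state.relationship.get(speaker, 0) + delta)

    for key, value in (d.get("set_flags") or {}).items():
        state.flags[str(key)] = str(value)

    if d.get("ending"):
        state.beat = "ended"
    elif d.get("advance_beat") and state.beat != "ended":
        nxt = min(BEATS.index(state.beat) + 1, BEATS.index("resolution"))
        state.beat = BEATS[nxt]
    return state
```

Because every branch is keyed on an optional field, a turn that carries only `speaker`, `dialogue`, and `emotion` passes through without touching the scene, cast, flags, or beat.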
Every nested object is optional/nullable so simple turns stay tiny. `complete_json` enforces structure on the llama.cpp path via GBNF; on the `transformers` path it uses prompt-based JSON extraction with a 3-attempt retry.

### 4.3 `.md` as a derived view (not the truth)

Why both a struct *and* markdown? The struct is robust and code-friendly; the `.md` files are (a) human-readable so you can debug/show the “dream-memory” in the demo, (b) a clean, compact way to inject character/world context back into the prompt, and (c) on-theme. `state.py` renders `GameState` → `world_state.md` + one file per character after each turn, and can parse them back on load. The LLM *reads* `.md`; it does not author the canonical copy. Templates: [`../templates/world_state.md`](../templates/world_state.md), [`../templates/character_sheet.md`](../templates/character_sheet.md).

### 4.4 Context assembly & budget (`memory.py`)

Every turn, build the prompt from, in priority order until the budget fills:

1. System prompt (Weaver+Voices role), fixed.
2. `style_guide` + `vibe`, small and fixed.
3. `summary`, the rolling compressed history.
4. Current `Scene` description + the **present** characters’ sheets only (not the whole cast).
5. The last *k* verbatim turns (e.g. k=4–6).
6. The player’s new input.

Target a conservative budget (e.g. ≤ ~3–4k tokens of context even if the model supports more): small models degrade as context grows, and a small budget keeps latency down. When `recent_turns` + history would exceed budget, `compact_memory()` asks the LLM to rewrite older turns into 3–5 sentences appended to `summary`, then drops them. This is the literal “Thousand Token Wood”: a small working memory by design.

---

## 5. The Painter (image pipeline)

### 5.1 Model choice & rationale

The hard problems are **latency** (must be a few seconds, not 30) and **character consistency** (the same NPC must look the same across scenes).
Recommended options, in order:

| Option | Params | Steps | Why | Watch out |
|---|---|---|---|---|
| **SDXL-Turbo / SDXL-Lightning** (default) | ~3.5B | 1–4 | Mature, permissive, huge **illustration/anime LoRA** ecosystem (perfect VN look), runs on 8–16 GB, very fast. Consistency via pinned seed + fixed appearance string + generate-once. | Weaker prompt adherence than newer models; no built-in image conditioning. |
| **FLUX.2 Klein** (upgrade) | ~4B | few | Distilled from FLUX.2; **“Kontext” image-conditioning** edits an existing image (“same character, now smiling”, “same scene at dusk”), which *solves consistency directly* and gives mood sprites for free. Fits ~16 GB. | **Check the license** before shipping publicly; slightly heavier. |
| **Z-Image-Turbo** | ~fits 13–16 GB | turbo | Apache-2.0, fast. Good permissive middle ground. | Smaller ecosystem than SDXL. |

Pick **one** art style and bake it into `style_guide` (e.g. “soft watercolor storybook”, “muted ukiyo-e”, “90s anime VN”) so everything coheres. A style LoRA on SDXL is the cheapest way to a distinctive look (also nudges the **Off-Brand** vibe).

### 5.2 Consistency strategy (the crux)

- **Generate each character’s base sprite exactly once**, at introduction, and cache it. **Never re-paint an existing character** to “refresh”: that’s what breaks consistency.
- Store a **pinned `sprite_seed`** and the **stable `appearance` string** in the character record; reuse both for any later generation of that character.
- For **moods/expressions**: either (a) pre-generate a small set (neutral/happy/sad/surprised) at intro and swap, or (b) if using FLUX.2 Klein, *condition on the base sprite* to edit only the expression. Option (b) is cleaner and cheaper at runtime.
- For **backdrops**: cache by `scene_id`; revisiting a place reuses its image.

### 5.3 Sprites vs full-scene

Two valid looks:

1. **Layered VN (classic):** transparent character sprite over a separate backdrop.
   Diffusion won’t emit clean alpha, so generate the character on a plain background and run **BiRefNet** (the same matting model in the `gradio.Server` demo) to cut it out → transparent PNG. More moving parts, but the iconic VN look + lets you reuse one backdrop with different sprites.
2. **Full-scene (simpler):** generate the character *in* the scene as one image. No compositing, no matting: fewer failure modes, but you can’t cheaply swap sprite vs backdrop independently.

Recommend starting **full-scene** in Phase 1 (fast to working), then moving to **layered** in Phase 2 for polish if time allows.

### 5.4 Latency tactics

- 1–4 step model only; 512–768px is plenty for a backdrop behind text.
- **Text first, image second:** the dialogue renders immediately; the UI swaps the image in when the Painter returns (SSE/streaming via the Gradio JS client).
- **Speculative paint:** while the player reads/types, optionally pre-paint the most likely next backdrop.
- All Painter calls behind `@spaces.GPU`; keep the pipeline warm (load once at startup).

---

## 6. The Ear (audio / STT)

- **Model:** `faster-whisper` (CTranslate2) `small` for the laptop config, `large-v3-turbo` if you have room. Player inputs are short, so `small`/`base` transcribe near-instantly. For the *all-ggml local-first* story, `whisper.cpp` pairs thematically with the llama.cpp bonus.
- **Capture:** in the custom frontend, use the browser `MediaRecorder` API → send the blob to an `@app.api(name="transcribe")` endpoint that runs Whisper. (In the `gr.Blocks` MVP, use `gr.Audio(sources=["microphone"])`.)
- **Flow:** transcript is shown to the player (so they can confirm/edit) then fed into the turn exactly like typed input. Keep voice optional; typing is the reliable fallback for the demo.

---

## 7. Deployment

### 7.1 Local (where the bonuses live)

Run `python app.py`; the LLM via **`llama-cpp-python`** (GGUF, GPU build), diffusion via `diffusers`, Whisper via `faster-whisper`.
This is the configuration you record for the **Off-the-Grid (local-first)** and **Llama-Champion (llama.cpp)** badges; show it running with the network off.

### 7.2 Hugging Face Space (the required canvas)

The app is a Gradio app (`gradio.Server` *extends* Gradio/FastAPI), so it deploys as a normal Space. For usable image latency, use a **GPU Space**; **ZeroGPU** is free and integrates via the `@spaces.GPU` decorator (functions get a GPU per-call, allocated on demand). Load models at startup; decorate every inference function.

**The llama.cpp-on-ZeroGPU tension (decide early).** A CUDA `llama-cpp-python` build can be awkward on Spaces (CUDA runtime mismatches), and ZeroGPU’s per-call GPU model doesn’t love long-lived native processes. Two clean resolutions:

- **A (single stack):** get a working CUDA `llama-cpp-python` wheel/build for the Space (keeps llama.cpp everywhere). Test this on day 1, not at the deadline.
- **B (split stack, recommended for safety):** llama.cpp **locally** (claims the llama.cpp badge in the video) + a `transformers` code path **on the Space** under `@spaces.GPU`. Same `llm.py` interface, two backends behind a flag.

### 7.3 Persistence

The Space filesystem is **ephemeral**. The `.md` dream-memory therefore lives **per session** (perfectly fine for a one-sitting game). If you want playthroughs to survive restarts, write them to **HF persistent storage** or push traces to a **Dataset** (which also feeds the **Open-Trace** badge; see §9.3).

### 7.4 `gradio.Server` specifics

- `app = Server()` (a FastAPI subclass). `@app.get("/")` serves `frontend/index.html`. `@app.api(name="...")` defines queued, ZeroGPU-aware, `gradio_client`-callable endpoints (use these for `direct_turn`, `paint`, `transcribe`). `app.launch()` to run.
- The frontend talks to the backend through the **Gradio JS client** (`Client.connect(window.location.origin)` → `client.predict("/direct_turn", {...})`) so calls go through Gradio’s queue/concurrency (and you can show progress), **not** raw `fetch`.

---

## 8. UI architecture (Off-Brand badge)

### 8.1 Target: a real visual-novel frame

A custom `frontend/index.html` (vanilla HTML/CSS/JS, no build step needed; the hackathon’s own `gradio.Server` demo ships a ~1300-line single file) rendering three CSS-stacked layers:

```
z-0  backdrop  (full-bleed CSS background-image)
z-1  sprite    (transparent PNG, positioned, with a gentle entrance + mood cross-fade)
z-2  dialogue  (named speaker box at the bottom, text typed in character-by-character)
     + input row: text field, 🎙️ mic button, "wait/continue" affordance
```

Niceties that sell delight cheaply: typewriter text reveal, soft sprite slide-in, backdrop cross-fade on scene change, a subtle paper/parchment vignette, a loading shimmer while the Painter works (“the wood is dreaming…”). Keep a small palette + one display font to look intentional.

### 8.2 MVP fallback

Phase 0/1 can use plain `gr.Blocks`: `gr.Image` for the scene, `gr.Chatbot`/`gr.Markdown` for dialogue, `gr.Textbox` + `gr.Audio(microphone)` for input. Gate it behind `GRADIO_MVP_UI=1`. This de-risks the loop before you invest in the custom frontend.

### 8.3 Wiring

Frontend ⇄ backend via the Gradio JS client to `@app.api` endpoints (`/start`, `/direct_turn`, `/transcribe`). The backend returns `{ speaker, dialogue, emotion, scene_image_url, sprite_image_url }`; the frontend animates the rest.

---

## 9. Optional badge tracks

### 9.1 Well-Tuned (fine-tune) 🟡

Fine-tune the shared LLM (LoRA on Qwen3-4B/8B) to (a) reliably emit the directive schema and (b) carry the “dreaming wood” narrative voice. Dataset: ~200–800 synthetic `(context → directive-JSON + dialogue)` examples; bootstrap them with a larger model or hand-write seeds, then expand.
Train with the standard PEFT/LoRA stack; merge or keep the adapter; convert to **GGUF** and publish on the Hub (this is what the badge checks). Even a small fine-tune that locks the output format + tone is a real win and reduces grammar-fighting at runtime. Time-box it: the MVP must not depend on it.

### 9.2 Field Notes (blog) 🟡

A short write-up: the diegetic conceit, the one-call director pattern, grammar-constrained directives, the consistency-via-cache trick, and the llama.cpp/ZeroGPU lesson. Cheap points, and genuinely useful to others.

### 9.3 Open Trace 🟡

Log every orchestration step per turn (the assembled context, the raw directive JSON, the Painter prompts/seeds) to `runs/*.jsonl`, then push a cleaned sample as a **Hub dataset**. The struct-based state makes these traces clean and shareable.

---

## 10. Phased plan

Build window **June 5–15**. (Register by **June 3**; sketch + download weights before the 5th.) Days are indicative for a two-person team.

### Phase 0: Skeleton (Day 1–2)

- `state.py`: `GameState` + directive schema; `.md` render/parse round-trip with a unit test.
- `llm.py`: `complete()` + `complete_json()` with **GBNF grammar** (start with `transformers` *or* llama.cpp, whichever you get running first).
- `orchestrator.direct_turn()` returning valid directive JSON for a hard-coded scene.
- Text-only loop in `gr.Blocks`.

**Milestone:** type a line → NPC replies → state updates, looping, no images.

### Phase 1: The wood breathes (Day 3–5)

- `painter.py`: full-scene generation with SDXL-Turbo; disk cache; pinned seeds.
- `orchestrator.init_world()`: generate setting + first NPC + opening.
- Wire scene/character directives → Painter.

**Milestone:** a playable illustrated loop, basic VN layout in `gr.Blocks`.

### Phase 2: Voice + polish (Day 6–8)

- `stt.py` + mic input.
- Migrate to `gradio.Server` + custom `frontend/index.html` (layered scene, typewriter, animations).
- `memory.compact_memory()`; sprite moods (pre-gen set or FLUX.2-Klein conditioning); optional BiRefNet layered sprites.

**Milestone:** speak to a self-painting wood in a bespoke UI.

### Phase 3: Bonuses + ship (Day 9–10)

- Lock the llama.cpp local config; resolve the Space backend (A or B from §7.2).
- (Optional) fine-tune + GGUF on the Hub; (optional) trace dataset.
- **Record the demo video + write the social post** (+ optional blog). Deploy the Space, test cold-start, freeze.

**Milestone:** submitted.

> Cut lines if time runs short, in this order: fine-tune → layered sprites → voice → custom frontend. The loop + a custom-ish UI + working art is a complete, competitive entry.

---

## 11. Risks & mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Image latency makes it feel sluggish | High | High | 1–4-step model; text-first; cache; speculative paint; 512–768px. |
| Character looks different each scene | High | High | Generate-once + cache; pinned seed + fixed appearance; FLUX.2-Klein conditioning for moods. |
| llama.cpp won’t build on ZeroGPU | Med | Med | Decide §7.2 on day 1; keep `transformers` fallback behind a flag; claim llama.cpp badge locally. |
| Small model drifts / over-narrates | Med | Med | Grammar-enforced directives; low temp on structure; rolling-summary memory; lean on the diegetic framing. |
| Scope creep (combat, save slots, 5 NPCs) | Med | High | Honor the “out of scope” list in `CLAUDE.md`; ship the loop. |
| Weights/licensing surprise (esp. FLUX.2) | Low | Med | Verify licenses before shipping; default to SDXL-Turbo/Z-Image (permissive). |
| Demo video/social post left to the last hour | Med | High | They’re *required*. Schedule them in Phase 3, not after. |

---

## 12. Latency budget (per turn, target)

| Stage | Target | Notes |
|---|---|---|
| STT (if voice) | < 1 s | short utterances, `small`/turbo |
| Context assembly | negligible | pure Python |
| LLM directive call | 1–4 s | ~8B Q4, ≤4k context, one call |
| Apply + memory | negligible | |
| Image (only when changed) | 1–5 s | 1–4-step model; *hidden behind text* and often cached |
| **Felt latency** | **dialogue in ~2–4 s**, image fills in after | most turns don’t change the scene |

---

## 13. Decisions to lock before Day 1

1. **Project name** (Thousand Token Wood is a placeholder).
2. **Art style** for the `style_guide` (and whether to use a style LoRA).
3. **Image model**: SDXL-Turbo (safe) vs FLUX.2 Klein (consistency, check license).
4. **Sprites**: full-scene first vs layered+BiRefNet.
5. **Space backend**: §7.2 option A vs B.
6. **Whisper size** (small vs turbo) and whether voice ships in v1.
7. **Fine-tune**: in or out (time-box).
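
Since several of the decisions above depend on how tight the context budget really is, it may help to see the §4.4 assembly as code. The sketch below is an illustration under assumptions rather than the actual `memory.py`: `estimate_tokens` and `assemble_context` are hypothetical names, the 4-characters-per-token estimate is a crude placeholder for the real tokenizer, and treating items 1–4 and the player input as mandatory while trimming only the verbatim turns is one reading of "priority order until the budget fills".

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token. Swap in the model's tokenizer.
    return max(1, len(text) // 4)


def assemble_context(
    system: str,
    style: str,
    summary: str,
    scene: str,
    sheets: list[str],
    recent_turns: list[str],
    player_input: str,
    budget: int = 3500,
) -> tuple[str, list[str]]:
    """Build the prompt; return it plus the turns that did not fit.

    Fixed sections and the player input are always included. Verbatim turns
    are added newest-first until the budget would be exceeded, so the oldest
    turns are the ones left over for compaction into the summary.
    """
    required = [system, style, summary, scene, *sheets, player_input]
    used = sum(estimate_tokens(part) for part in required)

    kept: list[str] = []
    for turn in reversed(recent_turns):  # newest turn first
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order

    dropped = recent_turns[: len(recent_turns) - len(kept)]
    prompt = "\n\n".join([system, style, summary, scene, *sheets, *kept, player_input])
    return prompt, dropped
```

The `dropped` list is what a `compact_memory()` pass would fold into `summary` before those turns are discarded.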