# Thousand Token Wood — Architecture
Deep technical design for the AI-improvised visual novel. Companion to [`../README.md`](../README.md) (overview), [`../CLAUDE.md`](../CLAUDE.md) (conventions), and [`PROMPTS.md`](PROMPTS.md) (exact prompts + grammar).
---
## 0. Concept & design pillars
**Premise.** The player is a wanderer who steps into a wood that is being *dreamed into existence* around them. Everyone they meet is a spirit the wood conjures; every backdrop is painted the moment the player arrives. The player shapes the dream by speaking or writing.
**The diegetic conceit (this is the most important design decision).** The wood is dreamed by a *small, slightly forgetful mind*. So when a small model is whimsical, slightly inconsistent, or surreal, that is **in-world correct** — the wood is dreaming. This reframing turns the weaknesses of ≤32B models into the *aesthetic*, and it’s what lets a small-model game feel intentional rather than broken. Every design choice should reinforce “a dream that paints itself,” not “a chatbot that sometimes fails.”
**Design pillars**
1. **AI generates the content, not just assists.** Plot direction, dialogue, and art are produced live. Remove the models and nothing remains.
2. **Snappy over clever.** On a laptop, latency kills delight. One LLM call per turn; 1–4-step image model; aggressive caching; text-before-image.
3. **The model proposes, code disposes.** The LLM emits *typed directives*; deterministic code owns the canonical state. Small models cannot be trusted to hand-edit prose state without drift.
4. **Memory is a budget, not an archive.** Feed the model a rolling summary + present-scene detail + present-character sheets — never the whole history.
5. **Constrain to survive.** Grammar-constrained decoding guarantees parseable directives even from a 7–8B model.
---
## 1. System overview
```
┌──────────────────────────── HF Space (gradio.Server) ────────────────────────────┐
│ │
│ frontend/index.html ── Gradio JS client ──▶ @app.api endpoints │
│ (layered VN UI) │ │
│ ▼ │
│ app.py (thin: routes, @app.api, @spaces.GPU) │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ stt.py orchestrator.py characters.py painter.py │
│ (Whisper) (the Weaver) (the Voices) (diffusion + BiRefNet) │
│ │ │ │ ▲ │
│ │ └──── llm.py (llama.cpp / transformers, grammar) ────┘ │
│ ▼ │
│ state.py (GameState · apply directives) ──▶ memory.py (summary + budget) │
│ │ │
│ ▼ │
│ .md memory: templates/world_state.md · one per character │
└───────────────────────────────────────────────────────────────────────────────┘
```
`stt.py`, `orchestrator.py`, `characters.py`, and `painter.py` are each callable in isolation, so each can be unit-tested on its own. `app.py` only wires them to HTTP.
---
## 2. The model roles in detail
### 2.1 Why one LLM for two roles
The “Weaver” and the “Voices” are **distinct agents** (different jobs, different prompts, different output shapes) but they run on the **same loaded LLM weights**. Reasons: (a) the parameter budget — a second 7–8B model would nearly double VRAM for little gain; (b) simplicity — one runtime, one quantization, one warm-up; (c) it’s still genuinely multi-model (text + image + speech are three real, different models). If you later want a *visibly* separate model for the demo narrative, the cheapest split is a tiny dedicated model for one narrow job (e.g. a 0.5–1.5B model just for image-prompt rewriting) — but that’s optional and not recommended for the MVP.
### 2.2 The Weaver (Game Master / director)
**Responsibilities**
- **`init_world(seed, vibe)`** — invent the setting, the opening scene, and the first NPC; write the world-state `.md` and the first character sheet.
- **`direct_turn(state, player_input)`** — the per-turn workhorse. In a single grammar-constrained call it produces both (a) the NPC’s reply (delegating tone to the Voices system prompt) and (b) the **directives** that tell the engine what changed.
- **`compact_memory(state)`** — periodically fold old events into the rolling summary to stay within the context budget.
**I/O contract (per turn).** Input: assembled context (see §4.4). Output: a single JSON object validated against the directive schema (§4.2). Code applies it; the Weaver never writes to state directly.
### 2.3 The Voices (character manager / actor)
In the default design the Voices is not a separate LLM call; it is the **actor persona layer** of the same per-turn call. `characters.py` assembles the actor context: for each *present* NPC, its character sheet (traits, voice, goals, current mood, relationship to the player) plus the current scene. The system prompt instructs the model to speak *only* as the addressed/active NPC, in their voice, never breaking character or narrating as the author.
> **Two-call variant (optional).** If you find the single call muddies voice quality, split it: call 1 = Voices produces dialogue (free text); call 2 = Weaver reads the dialogue + player input and emits directives (grammar JSON). This doubles per-turn latency, so only do it if quality demands it. Keep it behind a flag.
### 2.4 The Painter (image)
See §5. Consumes prompts composed from the world style guide + the scene/character description; returns a backdrop image and/or a character sprite (cut to transparency).
### 2.5 The Ear (STT)
See §6. Whisper turns recorded audio into the player’s text input. Purely a front door to the loop.
---
## 3. The game loop in detail
### 3.1 Initialisation (once)
1. Player picks a *vibe* (e.g. “cozy folktale”, “eerie”, “absurd”) and optionally a seed.
2. `orchestrator.init_world()` → world-state `.md` + first character sheet + an opening line of narration/dialogue + an initial set of directives (scene description, first NPC, requested art).
3. `painter` paints the opening backdrop and the first sprite (parallelisable). Cache both.
4. Render the opening scene.
### 3.2 The turn (loops)
| Step | Module | What happens |
|---|---|---|
| 1 | `stt` (if voice) | Recorded audio → text. Typed input skips this. |
| 2 | `memory` | Assemble context: style guide + rolling summary + current scene + present-character sheets + last *k* turns + the player input. Enforce token budget. |
| 3 | `orchestrator.direct_turn` → `llm.complete_json` | **One** grammar-constrained call → `{ speaker, dialogue, emotion, directives }`. |
| 4 | `state` | Validate + apply directives deterministically: move scene, add/remove NPC, set mood, set relationship deltas, set flags, mark beat/ending. |
| 5 | `memory` | Append the turn; if over budget or every *N* turns, `compact_memory()`. |
| 6 | `painter` (conditional) | If `scene_change` → paint/lookup backdrop. If `new_character` → paint+matte sprite. If only `mood` changed → swap to the (cached or conditioned) mood sprite. |
| 7 | frontend | Render: backdrop, sprite (with mood), speaker name, dialogue. Dialogue streams **first**; images fill in when ready. |
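The steps in the table can be sketched as one function. This is a minimal sketch with stubbed components, not the project's actual modules: the dict-based state, `run_turn`, `fake_llm`, the `COMPACT_EVERY` cadence, and the string-joining "compaction" are all hypothetical stand-ins (the real `compact_memory()` asks the LLM to summarise).

```python
from typing import Callable

COMPACT_EVERY = 6  # hypothetical N: fold old turns into the summary every N turns


def run_turn(state: dict, player_input: str,
             llm: Callable[[str], dict],
             paint: Callable[[str], str]) -> dict:
    """One pass through steps 2-7; step 1 (STT) is assumed to have run already."""
    # Step 2: assemble context (the real memory.py would enforce a token budget).
    context = "\n".join([state["summary"], state["scene"], *state["recent"], player_input])

    # Step 3: exactly one LLM call per turn.
    out = llm(context)

    # Step 4: code, not the model, mutates the canonical state.
    directives = out.get("directives", {})
    scene_changed = directives.get("scene_change") is not None
    if scene_changed:
        state["scene"] = directives["scene_change"]["description"]

    # Step 5: append the turn, then compact on a fixed cadence.
    state["recent"].append(f"player: {player_input}")
    state["recent"].append(f'{out["speaker"]}: {out["dialogue"]}')
    state["turn_index"] += 1
    if state["turn_index"] % COMPACT_EVERY == 0:
        # Stand-in for compact_memory(): keep only the latest exchange verbatim.
        state["summary"] += " " + " / ".join(state["recent"][:-2])
        state["recent"] = state["recent"][-2:]

    # Step 6: paint only when the scene actually changed.
    image = paint(state["scene"]) if scene_changed else None

    # Step 7: payload for the frontend; the dialogue never waits on the image.
    return {"speaker": out["speaker"], "dialogue": out["dialogue"],
            "emotion": out["emotion"], "scene_image": image}


def fake_llm(context: str) -> dict:
    # Stand-in for llm.complete_json: always moves the scene to a pond.
    return {"speaker": "lantern_moth", "dialogue": "Mind the reeds.", "emotion": "curious",
            "directives": {"scene_change": {"description": "a moonlit pond"}}}


state = {"summary": "", "scene": "a mossy clearing", "recent": [], "turn_index": 0}
result = run_turn(state, "I walk toward the water.", fake_llm, paint=lambda d: f"img:{d}")
```

Passing `llm` and `paint` in as callables is what keeps the loop testable without loading any model.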
### 3.3 The directive contract (what the LLM is allowed to change)
Directives are a *closed set* of safe, structured operations — the engine’s “API” that the model calls by emitting JSON (conceptually identical to tool-calling, but enforced by grammar). Closed set ⇒ the model can’t put the game in an undefined state. See §4.2 for the schema and [`PROMPTS.md`](PROMPTS.md) for the GBNF grammar.
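As an illustration of what "closed set" buys you, here is a minimal sketch of a validator and applier. The directive names come from the schema in §4.2; the function names and the dict-shaped state are assumptions for the example, not the real `state.py` API.

```python
ALLOWED_DIRECTIVES = {
    "scene_change", "new_character", "exit_character",
    "relationship_delta", "set_flags", "advance_beat", "ending",
}


def validate_directives(directives: dict) -> dict:
    """Reject anything outside the closed set and fill defaults for omitted keys."""
    unknown = set(directives) - ALLOWED_DIRECTIVES
    if unknown:
        raise ValueError(f"unknown directives: {sorted(unknown)}")
    defaults = {"scene_change": None, "new_character": None, "exit_character": None,
                "relationship_delta": 0, "set_flags": {}, "advance_beat": False,
                "ending": None}
    return {**defaults, **directives}


def apply_directives(state: dict, speaker: str, directives: dict) -> None:
    """Apply a subset of directives deterministically (hypothetical state layout)."""
    d = validate_directives(directives)
    if d["exit_character"] in state["present"]:
        state["present"].remove(d["exit_character"])
    # Relationship is clamped to the -100..100 range used by the Character sketch.
    rel = state["relationship"].get(speaker, 0) + d["relationship_delta"]
    state["relationship"][speaker] = max(-100, min(100, rel))
    state["flags"].update(d["set_flags"])


demo_state = {"present": ["lantern_moth"], "relationship": {"lantern_moth": 95}, "flags": {}}
apply_directives(demo_state, "lantern_moth",
                 {"relationship_delta": 10, "set_flags": {"gave_player_the_key": "true"}})
```

An unknown key such as `start_combat` raises before any state is touched, which is the property the grammar enforces one layer earlier.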
---
## 4. State & memory
### 4.1 `GameState` (source of truth)
A Pydantic model held in memory for the session and mirrored to `.md`. Sketch:
```python
class Character(BaseModel):
id: str
name: str
one_line: str # "a nervous lantern-moth who collects apologies"
traits: list[str]
voice: str # how they speak: rhythm, vocabulary, tics
goals: str
appearance: str # the STABLE description used for every sprite of them
mood: str = "neutral" # drives sprite variant
relationship: int = 0 # -100..100 toward the player
sprite_seed: int # pinned for visual consistency
known_facts: list[str] = [] # what THIS character knows (avoids omniscience)
class Scene(BaseModel):
id: str
place: str
description: str # used for the backdrop prompt
mood: str
present: list[str] # character ids on stage (cap ~3)
backdrop_seed: int
class GameState(BaseModel):
seed: int
style_guide: str # global art + tone bible (set at init, mostly frozen)
vibe: str
scene: Scene
characters: dict[str, Character]
summary: str = "" # rolling compressed history
recent_turns: list[Turn] = [] # last k verbatim turns
flags: dict[str, str] = {} # arbitrary world facts the Weaver sets
beat: str = "opening" # opening | rising | turn | resolution | ended
turn_index: int = 0
```
### 4.2 Directive schema (the per-turn LLM output)
```jsonc
{
"speaker": "lantern_moth", // which present character speaks (or "narrator")
"dialogue": "string", // their line, in voice
"emotion": "string", // free-form mood word (e.g. "curious", "tender") → sprite mood
"directives": {
"scene_change": null, // or { "place": "...", "description": "...", "mood": "..." }
"new_character": null, // or a partial Character (id,name,one_line,appearance,voice,traits,goals)
"exit_character": null, // or character id leaving the stage
"relationship_delta": 0, // toward the player, applied to the speaker
"set_flags": {}, // e.g. {"gave_player_the_key":"true"}
"advance_beat": false, // nudge pacing toward resolution
"ending": null // or { "kind":"warm|bittersweet|strange", "text":"..." }
}
}
```
`emotion` is a free-form string (the LLM picks whatever fits the moment). Every nested object is optional/nullable so simple turns stay tiny. `complete_json` enforces structure on the llama.cpp path via GBNF; on the `transformers` path it uses prompt-based JSON extraction with 3-attempt retry.
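For the `transformers` path, prompt-based extraction with a bounded retry might look like the sketch below. It is an assumption about how such a helper could be written, not the project's `complete_json`: `extract_json`, `complete_json_with_retry`, and the `generate` callable are hypothetical names, and only the four top-level keys from the schema are checked.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"speaker", "dialogue", "emotion", "directives"}


def extract_json(text: str) -> dict:
    """Parse the JSON object starting at the first '{' in free-form model output."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object in output")
    # raw_decode stops at the end of the object and ignores trailing chatter.
    obj, _ = json.JSONDecoder().raw_decode(text[start:])
    if not isinstance(obj, dict) or not REQUIRED_KEYS <= obj.keys():
        raise ValueError("JSON object is missing required keys")
    return obj


def complete_json_with_retry(generate: Callable[[str], str], prompt: str,
                             attempts: int = 3) -> dict:
    """Call the model up to `attempts` times until the output parses."""
    last_error = None
    for _ in range(attempts):
        try:
            return extract_json(generate(prompt))
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            last_error = err
    raise RuntimeError(f"no valid directive JSON after {attempts} attempts") from last_error


# Simulated model that fails twice, then produces usable output.
outputs = iter([
    "not json",
    "Sure! {broken",
    'Here: {"speaker":"narrator","dialogue":"Hi","emotion":"calm","directives":{}} done',
])
parsed = complete_json_with_retry(lambda prompt: next(outputs), "prompt text")
```

Unlike the GBNF path, this can still fail after the last attempt, so the caller needs a fallback line of dialogue for that case.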
### 4.3 `.md` as a derived view (not the truth)
Why both a struct *and* markdown? The struct is robust and code-friendly; the `.md` files are (a) human-readable so you can debug/show the “dream-memory” in the demo, (b) a clean, compact way to inject character/world context back into the prompt, and (c) on-theme. After each turn, `state.py` renders `GameState` → `world_state.md` + one file per character, and can parse them back on load. The LLM *reads* `.md`; it does not author the canonical copy. Templates: [`../templates/world_state.md`](../templates/world_state.md), [`../templates/character_sheet.md`](../templates/character_sheet.md).
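A render/parse round trip for one character sheet could be as small as the sketch below. The markdown layout here is invented for illustration (the real templates live in `../templates/` and may differ), and it assumes single-line field values and comma-free trait names.

```python
FIELDS = ("id", "one_line", "voice", "goals", "appearance", "mood")


def render_character_md(char: dict) -> str:
    """Render a character dict as a small markdown sheet (hypothetical layout)."""
    lines = [f"# {char['name']}", ""]
    for key in FIELDS:
        lines.append(f"- **{key}**: {char[key]}")
    lines.append(f"- **traits**: {', '.join(char['traits'])}")
    return "\n".join(lines) + "\n"


def parse_character_md(text: str) -> dict:
    """Inverse of render_character_md for sheets written in that layout."""
    char = {}
    for line in text.splitlines():
        if line.startswith("# "):
            char["name"] = line[2:].strip()
        elif line.startswith("- **"):
            key, _, value = line[4:].partition("**: ")
            char[key] = value
    char["traits"] = [t.strip() for t in char.get("traits", "").split(",") if t.strip()]
    return char


moth = {
    "id": "lantern_moth",
    "name": "Lantern Moth",
    "one_line": "a nervous lantern-moth who collects apologies",
    "traits": ["nervous", "polite"],
    "voice": "short, fluttering sentences",
    "goals": "collect one more apology",
    "appearance": "a pale moth carrying a tiny lantern",
    "mood": "neutral",
}
round_tripped = parse_character_md(render_character_md(moth))
```

Asserting `round_tripped == moth` in a unit test is a cheap way to catch drift between the renderer and the parser.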
### 4.4 Context assembly & budget (`memory.py`)
Every turn, build the prompt from, in priority order until the budget fills:
1. System prompt (Weaver+Voices role) — fixed.
2. `style_guide` + `vibe` — small, fixed.
3. `summary` — the rolling compressed history.
4. Current `Scene` description + the **present** characters’ sheets only (not the whole cast).
5. The last *k* verbatim turns (e.g. k=4–6).
6. The player’s new input.
Target a conservative budget (e.g. ≤ ~3–4k tokens of context even if the model supports more) — small models degrade as context grows, and it keeps latency down. When `recent_turns` + history would exceed budget, `compact_memory()` asks the LLM to rewrite older turns into 3–5 sentences appended to `summary`, then drops them. This is the literal “Thousand Token Wood” — a small working memory by design.
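One way to read the priority list is sketched below. It rests on an interpretation that is not spelled out above: the fixed parts (items 1 to 4) and the player's input are always kept, and the verbatim turns are what gets trimmed, oldest first. The 4-characters-per-token estimate and both function names are placeholders; a real implementation would count with the model's tokenizer.

```python
def approx_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def assemble_context(fixed_parts: list, recent_turns: list,
                     player_input: str, budget: int = 3500) -> str:
    """Keep fixed parts and the player input; fit as many recent turns as the budget allows."""
    used = sum(approx_tokens(p) for p in fixed_parts) + approx_tokens(player_input)
    kept = []
    for turn in reversed(recent_turns):  # walk newest to oldest
        cost = approx_tokens(turn)
        if used + cost > budget:
            break  # this turn and everything older is left to the rolling summary
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order for the prompt
    return "\n\n".join([*fixed_parts, *kept, player_input])


# Each 40-character string costs 10 "tokens" under the estimate above.
fixed = ["S" * 40]
turns = ["a" * 40, "b" * 40, "c" * 40]
context = assemble_context(fixed, turns, "p" * 40, budget=35)
```

With a budget of 35, only the newest turn (`c`) fits next to the fixed part and the input; the older two are exactly the turns `compact_memory()` would fold into `summary`.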
---
## 5. The Painter (image pipeline)
### 5.1 Model choice & rationale
The hard problems are **latency** (must be a few seconds, not 30) and **character consistency** (the same NPC must look the same across scenes). Recommended options, in order:
| Option | Params | Steps | Why | Watch out |
|---|---|---|---|---|
| **SDXL-Turbo / SDXL-Lightning** (default) | ~3.5B | 1–4 | Mature, permissive, huge **illustration/anime LoRA** ecosystem (perfect VN look), runs on 8–16 GB, very fast. Consistency via pinned seed + fixed appearance string + generate-once. | Weaker prompt adherence than newer models; no built-in image conditioning. |
| **FLUX.2 Klein** (upgrade) | ~4B | few | Distilled from FLUX.2; **“Kontext” image-conditioning** edits an existing image (“same character, now smiling”, “same scene at dusk”) — *solves consistency directly* and gives mood sprites for free. Fits ~16 GB. | **Check the license** before shipping publicly; slightly heavier. |
| **Z-Image-Turbo** | not stated (fits ~13–16 GB) | turbo | Apache-2.0, fast. Good permissive middle ground. | Smaller ecosystem than SDXL. |
Pick **one** art style and bake it into `style_guide` (e.g. “soft watercolor storybook”, “muted ukiyo-e”, “90s anime VN”) so everything coheres. A style LoRA on SDXL is the cheapest way to a distinctive look, which also supports the **Off-Brand** badge (§8).
### 5.2 Consistency strategy (the crux)
- **Generate each character’s base sprite exactly once**, at introduction, and cache it. **Never re-paint an existing character** to “refresh” — that’s what breaks consistency.
- Store a **pinned `sprite_seed`** and the **stable `appearance` string** in the character record; reuse both for any later generation of that character.
- For **moods/expressions**: either (a) pre-generate a small set (neutral/happy/sad/surprised) at intro and swap, or (b) if using FLUX.2 Klein, *condition on the base sprite* to edit only the expression. Option (b) is cleaner and cheaper at runtime.
- For **backdrops**: cache by `scene_id`; revisiting a place reuses its image.
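The generate-once rule amounts to a cache that never overwrites. Below is a minimal sketch of that idea; `SpriteCache`, its key format, and the `paint(prompt, seed) -> bytes` callable are hypothetical, standing in for whatever `painter.py` actually exposes.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class SpriteCache:
    """Disk cache that paints an image at most once per key."""

    def __init__(self, root: Path, paint: Callable[[str, int], bytes]):
        self.root = root
        self.paint = paint
        self.root.mkdir(parents=True, exist_ok=True)

    def get_or_paint(self, key: str, prompt: str, seed: int) -> Path:
        # Hash the key so character ids and scene ids are safe as file names.
        path = self.root / f"{hashlib.sha1(key.encode()).hexdigest()[:16]}.png"
        if not path.exists():
            path.write_bytes(self.paint(prompt, seed))
        return path  # existing images are never re-painted


calls = []


def fake_paint(prompt: str, seed: int) -> bytes:
    # Stand-in for the diffusion call; records how often it is invoked.
    calls.append((prompt, seed))
    return f"{prompt}|{seed}".encode()


cache = SpriteCache(Path(tempfile.mkdtemp()), fake_paint)
first = cache.get_or_paint("char:lantern_moth:neutral",
                           "a pale moth carrying a tiny lantern", seed=1234)
second = cache.get_or_paint("char:lantern_moth:neutral",
                            "a different prompt is ignored", seed=999)
```

The second call returns the first image even though its prompt and seed differ, which is the behaviour that protects consistency. The same class would serve backdrops with a key like `scene:<scene_id>`.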
### 5.3 Sprites vs full-scene
Two valid looks:
1. **Layered VN (classic):** transparent character sprite over a separate backdrop. Diffusion won’t emit clean alpha, so generate the character on a plain background and run **BiRefNet** (the same matting model in the `gradio.Server` demo) to cut it out → transparent PNG. More moving parts, but it gives the iconic VN look and lets you reuse one backdrop with different sprites.
2. **Full-scene (simpler):** generate the character *in* the scene as one image. No compositing and no matting means fewer failure modes, but you can’t cheaply swap the sprite and the backdrop independently.
Recommend starting **full-scene** in Phase 1 (fast to working), then moving to **layered** in Phase 2 for polish if time allows.
### 5.4 Latency tactics
- 1–4 step model only; 512–768px is plenty for a backdrop behind text.
- **Text first, image second** — the dialogue renders immediately; the UI swaps the image in when the Painter returns (SSE/streaming via the Gradio JS client).
- **Speculative paint:** while the player reads/types, optionally pre-paint the most likely next backdrop.
- All Painter calls behind `@spaces.GPU`; keep the pipeline warm (load once at startup).
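The text-first tactic can be illustrated with an async generator that starts the paint job, yields the dialogue straight away, and yields the image only when it is ready. This is a sketch of the ordering only: `slow_paint`, `turn_events`, and the event dict shape are hypothetical, and how the events reach the browser depends on the streaming mechanism you use.

```python
import asyncio
from typing import Optional


async def slow_paint(description: str) -> str:
    await asyncio.sleep(0.05)  # stands in for a multi-second diffusion call
    return f"img:{description}"


async def turn_events(dialogue: str, scene_description: Optional[str]):
    """Yield the dialogue event immediately, then the image event when painting finishes."""
    paint_task = None
    if scene_description is not None:
        # Start painting before yielding so it overlaps with the player's reading time.
        paint_task = asyncio.create_task(slow_paint(scene_description))
    yield {"type": "dialogue", "text": dialogue}
    if paint_task is not None:
        yield {"type": "image", "url": await paint_task}


async def collect() -> list:
    return [event async for event in turn_events("Mind the reeds.", "a moonlit pond")]


events = asyncio.run(collect())
```

Turns with no scene change pass `None` and produce a single dialogue event, which matches the expectation that most turns never touch the Painter.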
---
## 6. The Ear (audio / STT)
- **Model:** `faster-whisper` (CTranslate2) `small` for the laptop config, `large-v3-turbo` if you have room — player inputs are short, so `small`/`base` transcribe near-instantly. For the *all-ggml local-first* story, `whisper.cpp` pairs thematically with the llama.cpp bonus.
- **Capture:** in the custom frontend, use the browser `MediaRecorder` API → send the blob to an `@app.api(name="transcribe")` endpoint that runs Whisper. (In the `gr.Blocks` MVP, use `gr.Audio(sources=["microphone"])`.)
- **Flow:** transcript is shown to the player (so they can confirm/edit) then fed into the turn exactly like typed input. Keep voice optional; typing is the reliable fallback for the demo.
---
## 7. Deployment
### 7.1 Local (where the bonuses live)
Run `python app.py`; the LLM via **`llama-cpp-python`** (GGUF, GPU build), diffusion via `diffusers`, Whisper via `faster-whisper`. This is the configuration you record for the **Off-the-Grid (local-first)** and **Llama-Champion (llama.cpp)** badges — show it running with the network off.
### 7.2 Hugging Face Space (the required canvas)
The app is a Gradio app (`gradio.Server` *extends* Gradio/FastAPI), so it deploys as a normal Space. For usable image latency, use a **GPU Space**; **ZeroGPU** is free and integrates via the `@spaces.GPU` decorator (functions get a GPU per-call, allocated on demand). Load models at startup; decorate every inference function.
**The llama.cpp-on-ZeroGPU tension (decide early).** A CUDA `llama-cpp-python` build can be awkward on Spaces (CUDA runtime mismatches), and ZeroGPU’s per-call GPU model doesn’t love long-lived native processes. Two clean resolutions:
- **A (single stack):** get a working CUDA `llama-cpp-python` wheel/build for the Space (keeps llama.cpp everywhere). Test this on day 1, not at the deadline.
- **B (split stack, recommended for safety):** llama.cpp **locally** (claims the llama.cpp badge in the video) + a `transformers` code path **on the Space** under `@spaces.GPU`. Same `llm.py` interface, two backends behind a flag.
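Option B boils down to one interface with two implementations chosen by a flag. The sketch below shows only the selection logic; the `TTW_LLM_BACKEND` variable, the class names, and `select_backend` are hypothetical, and the `complete` bodies are deliberately left unimplemented.

```python
import os
from typing import Mapping, Optional, Protocol


class LLMBackend(Protocol):
    name: str

    def complete(self, prompt: str) -> str: ...


class LlamaCppBackend:
    name = "llama_cpp"

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wire llama-cpp-python in here")


class TransformersBackend:
    name = "transformers"

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wire transformers in here (under @spaces.GPU on the Space)")


BACKENDS = {"llama_cpp": LlamaCppBackend, "transformers": TransformersBackend}


def select_backend(env: Optional[Mapping[str, str]] = None) -> LLMBackend:
    """Pick the backend from a flag; TTW_LLM_BACKEND is a made-up variable name."""
    env = os.environ if env is None else env
    choice = env.get("TTW_LLM_BACKEND", "llama_cpp")
    if choice not in BACKENDS:
        raise ValueError(f"unknown LLM backend: {choice!r}")
    return BACKENDS[choice]()


local_backend = select_backend({})
space_backend = select_backend({"TTW_LLM_BACKEND": "transformers"})
```

Because the rest of the code only sees `LLMBackend`, the local llama.cpp run and the Space run exercise the same orchestrator, prompts, and directive handling.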
### 7.3 Persistence
The Space filesystem is **ephemeral**. The `.md` dream-memory therefore lives **per session** (perfectly fine for a one-sitting game). If you want playthroughs to survive restarts, write them to **HF persistent storage** or push traces to a **Dataset** (which also feeds the **Open-Trace** badge — see §9.3).
### 7.4 `gradio.Server` specifics
- `app = Server()` (a FastAPI subclass). `@app.get("/")` serves `frontend/index.html`. `@app.api(name="...")` defines queued, ZeroGPU-aware, `gradio_client`-callable endpoints (use these for `direct_turn`, `paint`, `transcribe`). `app.launch()` to run.
- The frontend talks to the backend through the **Gradio JS client** (`Client.connect(window.location.origin)` → `client.predict("/direct_turn", {...})`) so calls go through Gradio’s queue/concurrency (and you can show progress), **not** raw `fetch`.
---
## 8. UI architecture (Off-Brand badge)
### 8.1 Target: a real visual-novel frame
A custom `frontend/index.html` (vanilla HTML/CSS/JS — no build step needed; the hackathon’s own `gradio.Server` demo ships a ~1300-line single file) rendering three CSS-stacked layers:
```
z-0 backdrop (full-bleed CSS background-image)
z-1 sprite (transparent PNG, positioned, with a gentle entrance + mood cross-fade)
z-2 dialogue (named speaker box at the bottom, text typed in character-by-character)
+ input row: text field, 🎙️ mic button, "wait/continue" affordance
```
Niceties that sell delight cheaply: typewriter text reveal, soft sprite slide-in, backdrop cross-fade on scene change, a subtle paper/parchment vignette, a loading shimmer while the Painter works (“the wood is dreaming…”). Keep a small palette + one display font to look intentional.
### 8.2 MVP fallback
Phase 0/1 can use plain `gr.Blocks`: `gr.Image` for the scene, `gr.Chatbot`/`gr.Markdown` for dialogue, `gr.Textbox` + `gr.Audio(sources=["microphone"])` for input. Gate it behind `GRADIO_MVP_UI=1`. This de-risks the loop before you invest in the custom frontend.
### 8.3 Wiring
Frontend ⇄ backend via the Gradio JS client to `@app.api` endpoints (`/start`, `/direct_turn`, `/transcribe`). The backend returns `{ speaker, dialogue, emotion, scene_image_url, sprite_image_url }`; the frontend animates the rest.
---
## 9. Optional badge tracks
### 9.1 Well-Tuned (fine-tune) 🟡
Fine-tune the shared LLM (LoRA on Qwen3-4B/8B) to (a) reliably emit the directive schema and (b) carry the “dreaming wood” narrative voice. Dataset: ~200–800 synthetic `(context → directive-JSON + dialogue)` examples — bootstrap them with a larger model or hand-write seeds, then expand. Train with the standard PEFT/LoRA stack; merge or keep the adapter; convert to **GGUF** and publish on the Hub (this is what the badge checks). Even a small fine-tune that locks the output format + tone is a real win and reduces grammar-fighting at runtime. Time-box it — the MVP must not depend on it.
### 9.2 Field Notes (blog) 🟡
A short write-up: the diegetic conceit, the one-call director pattern, grammar-constrained directives, the consistency-via-cache trick, and the llama.cpp/ZeroGPU lesson. Cheap points, and genuinely useful to others.
### 9.3 Open Trace 🟡
Log every orchestration step per turn — the assembled context, the raw directive JSON, the Painter prompts/seeds — to `runs/*.jsonl`, then push a cleaned sample as a **Hub dataset**. The struct-based state makes these traces clean and shareable.
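A per-turn JSONL logger needs very little code. The sketch below assumes one file per run and a flat record per turn; `log_turn`, the `run_id` naming, and the record fields are illustrative choices rather than the project's actual trace format.

```python
import json
import tempfile
import time
from pathlib import Path


def log_turn(run_dir: Path, run_id: str, record: dict) -> Path:
    """Append one turn's trace as a single JSON line in runs/<run_id>.jsonl."""
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / f"{run_id}.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"ts": time.time(), **record}, ensure_ascii=False) + "\n")
    return path


# Demo: write two turns to a temporary runs/ directory and read them back.
run_dir = Path(tempfile.mkdtemp()) / "runs"
for turn_index in range(2):
    trace_path = log_turn(run_dir, "demo", {
        "turn_index": turn_index,
        "context": "assembled prompt text",
        "raw_directive_json": '{"speaker": "narrator"}',
        "painter": {"prompt": "a moonlit pond", "seed": 1234},
    })
lines = trace_path.read_text(encoding="utf-8").splitlines()
```

Appending one line per turn means a crash mid-session loses at most the current turn, and the file can be filtered line by line before publishing a cleaned sample.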
---
## 10. Phased plan
Build window **June 5–15**. (Register by **June 3**; sketch + download weights before the 5th.) Days are indicative for a two-person team.
### Phase 0 — Skeleton (Day 1–2)
- `state.py`: `GameState` + directive schema; `.md` render/parse round-trip with a unit test.
- `llm.py`: `complete()` + `complete_json()`, using the **GBNF grammar** on the llama.cpp path and prompt-based JSON extraction with retry on the `transformers` path (start with whichever backend you get running first).
- `orchestrator.direct_turn()` returning valid directive JSON for a hard-coded scene.
- Text-only loop in `gr.Blocks`. **Milestone:** type a line → NPC replies → state updates, looping, no images.
### Phase 1 — The wood breathes (Day 3–5)
- `painter.py`: full-scene generation with SDXL-Turbo; disk cache; pinned seeds.
- `orchestrator.init_world()`: generate setting + first NPC + opening.
- Wire scene/character directives → Painter. **Milestone:** a playable illustrated loop, basic VN layout in `gr.Blocks`.
### Phase 2 — Voice + polish (Day 6–8)
- `stt.py` + mic input.
- Migrate to `gradio.Server` + custom `frontend/index.html` (layered scene, typewriter, animations).
- `memory.compact_memory()`; sprite moods (pre-gen set or FLUX.2-Klein conditioning); optional BiRefNet layered sprites. **Milestone:** speak to a self-painting wood in a bespoke UI.
### Phase 3 — Bonuses + ship (Day 9–10)
- Lock the llama.cpp local config; resolve the Space backend (A or B from §7.2).
- (Optional) fine-tune + GGUF on the Hub; (optional) trace dataset.
- **Record the demo video + write the social post** (+ optional blog). Deploy the Space, test cold-start, freeze. **Milestone:** submitted.
> Cut lines if time runs short, in this order: fine-tune → layered sprites → voice → custom frontend. The loop + a custom-ish UI + working art is a complete, competitive entry.
---
## 11. Risks & mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Image latency makes it feel sluggish | High | High | 1–4-step model; text-first; cache; speculative paint; 512–768px. |
| Character looks different each scene | High | High | Generate-once + cache; pinned seed + fixed appearance; FLUX.2-Klein conditioning for moods. |
| llama.cpp won’t build on ZeroGPU | Med | Med | Decide §7.2 on day 1; keep `transformers` fallback behind a flag; claim llama.cpp badge locally. |
| Small model drifts / over-narrates | Med | Med | Grammar-enforced directives; low temp on structure; rolling-summary memory; lean on the diegetic framing. |
| Scope creep (combat, save slots, 5 NPCs) | Med | High | Honor the “out of scope” list in `CLAUDE.md`; ship the loop. |
| Weights/licensing surprise (esp. FLUX.2) | Low | Med | Verify licenses before shipping; default to SDXL-Turbo/Z-Image (permissive). |
| Demo video/social post left to the last hour | Med | High | They’re *required*. Schedule them in Phase 3, not after. |
---
## 12. Latency budget (per turn, target)
| Stage | Target | Notes |
|---|---|---|
| STT (if voice) | < 1 s | short utterances, `small`/turbo |
| Context assembly | negligible | pure Python |
| LLM directive call | 1–4 s | ~8B Q4, ≤4k context, one call |
| Apply + memory | negligible | |
| Image (only when changed) | 1–5 s | 1–4-step model; *hidden behind text* and often cached |
| **Felt latency** | **dialogue in ~2–4 s**, image fills in after | most turns don’t change the scene |
---
## 13. Decisions to lock before Day 1
1. **Project name** (Thousand Token Wood is a placeholder).
2. **Art style** for the `style_guide` (and whether to use a style LoRA).
3. **Image model**: SDXL-Turbo (safe) vs FLUX.2 Klein (consistency, check license).
4. **Sprites**: full-scene first vs layered+BiRefNet.
5. **Space backend**: §7.2 option A vs B.
6. **Whisper size** (small vs turbo) and whether voice ships in v1.
7. **Fine-tune**: in or out (time-box).