
	# Thousand Token Wood — Architecture

	Deep technical design for the AI-improvised visual novel. Companion to [`../README.md`](../README.md) (overview), [`../CLAUDE.md`](../CLAUDE.md) (conventions), and [`PROMPTS.md`](PROMPTS.md) (exact prompts + grammar).

	---

	## 0. Concept & design pillars

	Premise. The player is a wanderer who steps into a wood that is being dreamed into existence around them. Everyone they meet is a spirit the wood conjures; every backdrop is painted the moment the player arrives. The player shapes the dream by speaking or writing.

	The diegetic conceit (this is the most important design decision). The wood is dreamed by a small, slightly forgetful mind. So when a small model is whimsical, slightly inconsistent, or surreal, that is in-world correct — the wood is dreaming. This reframing turns the weaknesses of ≤32B models into the aesthetic, and it’s what lets a small-model game feel intentional rather than broken. Every design choice should reinforce “a dream that paints itself,” not “a chatbot that sometimes fails.”

	Design pillars

	1. AI generates the content, not just assists. Plot direction, dialogue, and art are produced live. Remove the models and nothing remains.
	2. Snappy over clever. On a laptop, latency kills delight. One LLM call per turn; 1–4-step image model; aggressive caching; text-before-image.
	3. The model proposes, code disposes. The LLM emits typed directives; deterministic code owns the canonical state. Small models cannot be trusted to hand-edit prose state without drift.
	4. Memory is a budget, not an archive. Feed the model a rolling summary + present-scene detail + present-character sheets — never the whole history.
	5. Constrain to survive. Grammar-constrained decoding guarantees parseable directives even from a 7–8B model.

	---

	## 1. System overview

	```
	┌──────────────────────────── HF Space (gradio.Server) ────────────────────────────┐
	│                                                                                  │
	│  frontend/index.html ── Gradio JS client ──▶ @app.api endpoints                  │
	│  (layered VN UI)                                      │                           │
	│                                                      ▼                           │
	│                    app.py (thin: routes, @app.api, @spaces.GPU)                  │
	│      │              │                        │                       │           │
	│      ▼              ▼                        ▼                       ▼           │
	│   stt.py     orchestrator.py           characters.py            painter.py       │
	│  (Whisper)    (the Weaver)             (the Voices)       (diffusion + BiRefNet) │
	│      │              │                        │                       ▲           │
	│      │              └── llm.py (llama.cpp / transformers, grammar) ──┘           │
	│      ▼                                                                           │
	│  state.py (GameState · apply directives) ──▶ memory.py (summary + budget)        │
	│      │                                                                           │
	│      ▼                                                                           │
	│  .md memory: templates/world_state.md · one per character                        │
	└──────────────────────────────────────────────────────────────────────────────────┘
	```

	`stt.py`, `orchestrator.py`, `characters.py`, `painter.py` are all callable in isolation (unit-testable). `app.py` only wires them to HTTP.

	---

	## 2. The model roles in detail

	### 2.1 Why one LLM for two roles

	The “Weaver” and the “Voices” are distinct agents (different jobs, different prompts, different output shapes) but they run on the same loaded LLM weights. Reasons: (a) the parameter budget — a second 7–8B model would nearly double VRAM for little gain; (b) simplicity — one runtime, one quantization, one warm-up; (c) it’s still genuinely multi-model (text + image + speech are three real, different models). If you later want a visibly separate model for the demo narrative, the cheapest split is a tiny dedicated model for one narrow job (e.g. a 0.5–1.5B model just for image-prompt rewriting) — but that’s optional and not recommended for the MVP.

	### 2.2 The Weaver (Game Master / director)

	Responsibilities

	- `init_world(seed, vibe)` — invent the setting, the opening scene, and the first NPC; write the world-state `.md` and the first character sheet.
	- `direct_turn(state, player_input)` — the per-turn workhorse. In a single grammar-constrained call it produces both (a) the NPC’s reply (delegating tone to the Voices system prompt) and (b) the directives that tell the engine what changed.
	- `compact_memory(state)` — periodically fold old events into the rolling summary to stay within the context budget.

	I/O contract (per turn). Input: assembled context (see §4.4). Output: a single JSON object validated against the directive schema (§4.2). Code applies it; the Weaver never writes to state directly.

	### 2.3 The Voices (character manager / actor)

	The Voices isn’t (in the default design) a separate LLM call — it is the actor persona layer of the same per-turn call. `characters.py` assembles the actor context: for each present NPC, its character sheet (traits, voice, goals, current mood, relationship to the player) plus the current scene. The system prompt instructs the model to speak only as the addressed/active NPC, in their voice, never breaking character or narrating as the author.

	> Two-call variant (optional). If you find the single call muddies voice quality, split it: call 1 = Voices produces dialogue (free text); call 2 = Weaver reads the dialogue + player input and emits directives (grammar JSON). This doubles per-turn latency, so only do it if quality demands it. Keep it behind a flag.

	### 2.4 The Painter (image)

	See §5. Consumes prompts composed from the world style guide + the scene/character description; returns a backdrop image and/or a character sprite (cut to transparency).

	### 2.5 The Ear (STT)

	See §6. Whisper turns recorded audio into the player’s text input. Purely a front door to the loop.

	---

	## 3. The game loop in detail

	### 3.1 Initialisation (once)

	1. Player picks a vibe (e.g. “cozy folktale”, “eerie”, “absurd”) and optionally a seed.
	2. `orchestrator.init_world()` → world-state `.md` + first character sheet + an opening line of narration/dialogue + an initial set of directives (scene description, first NPC, requested art).
	3. `painter` paints the opening backdrop and the first sprite (parallelisable). Cache both.
	4. Render the opening scene.

	### 3.2 The turn (loops)

	| Step | Module | What happens |
	|---|---|---|
	| 1 | `stt` (if voice) | Recorded audio → text. Typed input skips this. |
	| 2 | `memory` | Assemble context: style guide + rolling summary + current scene + present-character sheets + last k turns + the player input. Enforce token budget. |
	| 3 | `orchestrator.direct_turn` → `llm.complete_json` | One grammar-constrained call → `{ speaker, dialogue, emotion, directives }`. |
	| 4 | `state` | Validate + apply directives deterministically: move scene, add/remove NPC, set mood, set relationship deltas, set flags, mark beat/ending. |
	| 5 | `memory` | Append the turn; if over budget or every N turns, `compact_memory()`. |
	| 6 | `painter` (conditional) | If `scene_change` → paint/lookup backdrop. If `new_character` → paint+matte sprite. If only `mood` changed → swap to the (cached or conditioned) mood sprite. |
	| 7 | frontend | Render: backdrop, sprite (with mood), speaker name, dialogue. Dialogue streams first; images fill in when ready. |
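
The table above can be read as one function. The sketch below is a minimal, stdlib-only illustration of steps 2 to 6, not the project's actual code: `LoopState`, `TurnResult`, and `run_turn` are hypothetical names, and the LLM and Painter are passed in as plain callables so the loop can be exercised without any model loaded.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TurnResult:
    speaker: str
    dialogue: str
    emotion: str
    scene_image: Optional[str] = None  # None means "keep the current backdrop"


@dataclass
class LoopState:
    # Hypothetical stand-in for GameState, trimmed to what this loop touches.
    place: str = "clearing"
    recent_turns: list = field(default_factory=list)


def run_turn(
    state: LoopState,
    player_text: str,
    complete_json: Callable[[str], dict],
    paint: Callable[[str], str],
) -> TurnResult:
    # Step 2: assemble context (the real memory.py would also enforce a budget).
    context = f"Scene: {state.place}\nPlayer: {player_text}"
    # Step 3: exactly one structured LLM call per turn.
    out = complete_json(context)
    directives = out.get("directives", {})
    # Step 4: code, not the model, mutates the state.
    scene_change = directives.get("scene_change")
    if scene_change:
        state.place = scene_change["place"]
    # Step 5: remember the turn verbatim.
    state.recent_turns.append((player_text, out["dialogue"]))
    # Step 6: paint only when the scene actually changed.
    image = paint(scene_change["description"]) if scene_change else None
    return TurnResult(out["speaker"], out["dialogue"], out["emotion"], image)
```

Because the Painter is only invoked on a `scene_change`, a plain dialogue turn returns with `scene_image` left as `None`, which is what keeps most turns cheap.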

	### 3.3 The directive contract (what the LLM is allowed to change)

	Directives are a closed set of safe, structured operations — the engine’s “API” that the model calls by emitting JSON (conceptually identical to tool-calling, but enforced by grammar). Closed set ⇒ the model can’t put the game in an undefined state. See §4.2 for the schema and [`PROMPTS.md`](PROMPTS.md) for the GBNF grammar.
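
As a rough illustration of "closed set, applied by code", the sketch below rejects any directive key outside the schema and applies the rest deterministically. `MiniState` and `apply_directives` are hypothetical names, and two behaviours are assumptions rather than documented rules: relationship scores are clamped to the -100..100 range from the `Character` sketch, and `advance_beat` moves one step along the beat order listed in `GameState`.

```python
from dataclasses import dataclass, field

# The directive keys from the schema; anything else is refused.
ALLOWED = frozenset({
    "scene_change", "new_character", "exit_character",
    "relationship_delta", "set_flags", "advance_beat", "ending",
})
BEATS = ("opening", "rising", "turn", "resolution", "ended")


@dataclass
class MiniState:
    present: list = field(default_factory=list)       # character ids on stage
    relationship: dict = field(default_factory=dict)  # id -> score in -100..100
    flags: dict = field(default_factory=dict)
    beat: str = "opening"


def apply_directives(state: MiniState, speaker: str, directives: dict) -> MiniState:
    unknown = set(directives) - ALLOWED
    if unknown:
        raise ValueError(f"unknown directives: {sorted(unknown)}")

    new_char = directives.get("new_character")
    if new_char and new_char["id"] not in state.present:
        state.present.append(new_char["id"])
        state.relationship.setdefault(new_char["id"], 0)

    leaving = directives.get("exit_character")
    if leaving in state.present:
        state.present.remove(leaving)

    # Assumed rule: the delta applies to the speaker and is clamped.
    delta = int(directives.get("relationship_delta") or 0)
    if speaker in state.relationship:
        score = state.relationship[speaker] + delta
        state.relationship[speaker] = max(-100, min(100, score))

    state.flags.update(directives.get("set_flags") or {})

    # Assumed rule: an ending closes the story, otherwise step one beat forward.
    if directives.get("ending"):
        state.beat = "ended"
    elif directives.get("advance_beat") and state.beat != "ended":
        state.beat = BEATS[BEATS.index(state.beat) + 1]
    return state
```

Raising on an unknown key is one possible policy; silently dropping it would also keep the state well-defined and may suit a live demo better.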

	---

	## 4. State & memory

	### 4.1 `GameState` (source of truth)

	A Pydantic model held in memory for the session and mirrored to `.md`. Sketch:

	```python
	from pydantic import BaseModel


	class Turn(BaseModel):
	    # Not defined in the original sketch; an assumed minimal shape for one verbatim turn.
	    player: str
	    speaker: str
	    dialogue: str


	class Character(BaseModel):
	    id: str
	    name: str
	    one_line: str                # "a nervous lantern-moth who collects apologies"
	    traits: list[str]
	    voice: str                   # how they speak: rhythm, vocabulary, tics
	    goals: str
	    appearance: str              # the STABLE description used for every sprite of them
	    mood: str = "neutral"        # drives sprite variant
	    relationship: int = 0        # -100..100 toward the player
	    sprite_seed: int             # pinned for visual consistency
	    known_facts: list[str] = []  # what THIS character knows (avoids omniscience)


	class Scene(BaseModel):
	    id: str
	    place: str
	    description: str             # used for the backdrop prompt
	    mood: str
	    present: list[str]           # character ids on stage (cap ~3)
	    backdrop_seed: int


	class GameState(BaseModel):
	    seed: int
	    style_guide: str             # global art + tone bible (set at init, mostly frozen)
	    vibe: str
	    scene: Scene
	    characters: dict[str, Character]
	    summary: str = ""            # rolling compressed history
	    recent_turns: list[Turn] = []  # last k verbatim turns
	    flags: dict[str, str] = {}   # arbitrary world facts the Weaver sets
	    beat: str = "opening"        # opening | rising | turn | resolution | ended
	    turn_index: int = 0
	```

	### 4.2 Directive schema (the per-turn LLM output)

	```jsonc
	{
	  "speaker": "lantern_moth",   // which present character speaks (or "narrator")
	  "dialogue": "string",        // their line, in voice
	  "emotion": "string",         // free-form mood word (e.g. "curious", "tender") → sprite mood
	  "directives": {
	    "scene_change": null,      // or { "place": "...", "description": "...", "mood": "..." }
	    "new_character": null,     // or a partial Character (id,name,one_line,appearance,voice,traits,goals)
	    "exit_character": null,    // or character id leaving the stage
	    "relationship_delta": 0,   // toward the player, applied to the speaker
	    "set_flags": {},           // e.g. {"gave_player_the_key":"true"}
	    "advance_beat": false,     // nudge pacing toward resolution
	    "ending": null             // or { "kind":"warm|bittersweet|strange", "text":"..." }
	  }
	}
	```

	`emotion` is a free-form string (the LLM picks whatever fits the moment). Every nested object is optional/nullable so simple turns stay tiny. `complete_json` enforces structure on the llama.cpp path via GBNF; on the `transformers` path it uses prompt-based JSON extraction with 3-attempt retry.
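
For the `transformers` path, the "prompt-based JSON extraction with 3-attempt retry" could look roughly like the sketch below. It is an assumption about the shape of that fallback, not the project's `complete_json`: `extract_json` and `complete_json_with_retry` are hypothetical names, `generate` stands in for whatever text-generation callable the backend exposes, and only the four top-level fields are type-checked.

```python
import json
from typing import Callable

# Top-level fields of the directive schema and their expected JSON types.
REQUIRED = {"speaker": str, "dialogue": str, "emotion": str, "directives": dict}


def extract_json(text: str) -> dict:
    # Take the outermost {...} span so chatty preambles are ignored.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    obj = json.loads(text[start:end + 1])
    for key, expected in REQUIRED.items():
        if not isinstance(obj.get(key), expected):
            raise ValueError(f"missing or mistyped field: {key}")
    return obj


def complete_json_with_retry(
    generate: Callable[[str], str], prompt: str, attempts: int = 3
) -> dict:
    last_error = None
    for _ in range(attempts):
        try:
            return extract_json(generate(prompt))
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            last_error = err
    raise RuntimeError(f"no valid directive JSON after {attempts} attempts") from last_error
```

Unlike the GBNF path, this only detects malformed output after the fact, so each failed attempt costs a full generation.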

	### 4.3 `.md` as a derived view (not the truth)

	Why both a struct and markdown? The struct is robust and code-friendly; the `.md` files are (a) human-readable so you can debug/show the “dream-memory” in the demo, (b) a clean, compact way to inject character/world context back into the prompt, and (c) on-theme. `state.py` renders `GameState` → `world_state.md` + one file per character after each turn, and can parse them back on load. The LLM reads `.md`; it does not author the canonical copy. Templates: [`../templates/world_state.md`](../templates/world_state.md), [`../templates/character_sheet.md`](../templates/character_sheet.md).
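
A render/parse round-trip can be very small. The sketch below uses an invented markdown layout (a `# name` heading plus `- key: value` bullets); the real layout lives in the templates linked above and may differ. `Sheet`, `render_sheet`, and `parse_sheet` are hypothetical names, and the dataclass stands in for the Pydantic `Character`.

```python
from dataclasses import asdict, dataclass


@dataclass
class Sheet:
    # Hypothetical subset of Character, enough to show the round-trip.
    id: str
    name: str
    mood: str
    relationship: int


def render_sheet(sheet: Sheet) -> str:
    # The struct is the source of truth; markdown is only a view of it.
    lines = [f"# {sheet.name}", ""]
    lines += [f"- {key}: {value}" for key, value in asdict(sheet).items() if key != "name"]
    return "\n".join(lines) + "\n"


def parse_sheet(markdown: str) -> Sheet:
    fields = {}
    for line in markdown.splitlines():
        if line.startswith("# "):
            fields["name"] = line[2:].strip()
        elif line.startswith("- ") and ": " in line:
            key, value = line[2:].split(": ", 1)
            fields[key] = value
    fields["relationship"] = int(fields["relationship"])  # restore the numeric type
    return Sheet(**fields)
```

A unit test asserting `parse_sheet(render_sheet(x)) == x` is a cheap way to catch drift between the template and the parser.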

	### 4.4 Context assembly & budget (`memory.py`)

	Every turn, build the prompt from, in priority order until the budget fills:

	1. System prompt (Weaver+Voices role) — fixed.
	2. `style_guide` + `vibe` — small, fixed.
	3. `summary` — the rolling compressed history.
	4. Current `Scene` description + the present characters’ sheets only (not the whole cast).
	5. The last k verbatim turns (e.g. k=4–6).
	6. The player’s new input.

	Target a conservative budget (e.g. ≤ ~3–4k tokens of context even if the model supports more) — small models degrade as context grows, and it keeps latency down. When `recent_turns` + history would exceed budget, `compact_memory()` asks the LLM to rewrite older turns into 3–5 sentences appended to `summary`, then drops them. This is the literal “Thousand Token Wood” — a small working memory by design.
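
One way to enforce the budget is to treat the fixed parts (system prompt, style guide, summary, scene, player input) as mandatory and let the verbatim turns fill whatever is left, newest first. The sketch below follows that reading, which is an interpretation of the list above rather than the documented behaviour of `memory.py`. `count_tokens` is a crude whitespace proxy and `fit_turns` is a hypothetical helper; a real build would count with the model's tokenizer.

```python
def count_tokens(text: str) -> int:
    # Crude whitespace proxy; swap in the model's tokenizer for real counts.
    return len(text.split())


def fit_turns(fixed: list[str], turns: list[str], budget: int) -> tuple[list[str], list[str]]:
    """Return (turns kept verbatim, older turns that should be compacted)."""
    remaining = budget - sum(count_tokens(part) for part in fixed)
    kept: list[str] = []
    for turn in reversed(turns):  # newest turns are the most valuable
        cost = count_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    kept.reverse()  # restore chronological order
    overflow = turns[: len(turns) - len(kept)]
    return kept, overflow
```

The `overflow` list is what would be handed to `compact_memory()` to be rewritten into the rolling `summary` before being dropped.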

	---

	## 5. The Painter (image pipeline)

	### 5.1 Model choice & rationale

	The hard problems are latency (must be a few seconds, not 30) and character consistency (the same NPC must look the same across scenes). Recommended options, in order:

	| Option | Params | Steps | Why | Watch out |
	|---|---|---|---|---|
	| SDXL-Turbo / SDXL-Lightning (default) | ~3.5B | 1–4 | Mature, huge illustration/anime LoRA ecosystem (perfect VN look), runs on 8–16 GB, very fast. Consistency via pinned seed + fixed appearance string + generate-once. | Weaker prompt adherence than newer models; no built-in image conditioning. Licenses differ per checkpoint (SDXL-Turbo ships under a Stability AI license, SDXL-Lightning under OpenRAIL++), so verify before shipping. |
	| FLUX.2 Klein (upgrade) | ~4B | few | Distilled from FLUX.2; “Kontext” image-conditioning edits an existing image (“same character, now smiling”, “same scene at dusk”), which solves consistency directly and gives mood sprites for free. Fits ~16 GB. | Check the license before shipping publicly; slightly heavier. |
	| Z-Image-Turbo | ~fits 13–16 GB | turbo | Apache-2.0, fast. Good permissive middle ground. | Smaller ecosystem than SDXL. |

	Pick one art style and bake it into `style_guide` (e.g. “soft watercolor storybook”, “muted ukiyo-e”, “90s anime VN”) so everything coheres. A style LoRA on SDXL is the cheapest way to a distinctive look (it also helps the Off-Brand badge).

	### 5.2 Consistency strategy (the crux)

	- Generate each character’s base sprite exactly once, at introduction, and cache it. Never re-paint an existing character to “refresh” — that’s what breaks consistency.
	- Store a pinned `sprite_seed` and the stable `appearance` string in the character record; reuse both for any later generation of that character.
	- For moods/expressions: either (a) pre-generate a small set (neutral/happy/sad/surprised) at intro and swap, or (b) if using FLUX.2 Klein, condition on the base sprite to edit only the expression. Option (b) is cleaner and cheaper at runtime.
	- For backdrops: cache by `scene_id`; revisiting a place reuses its image.
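
The generate-once rule in the bullets above amounts to a cache that never overwrites an entry. The sketch below is a minimal in-memory version: `SpriteCache` is a hypothetical name, and `paint` is any callable taking an appearance string and a seed (in the real pipeline, a diffusion call). The same class would work for backdrops keyed by `scene_id`.

```python
from typing import Callable


class SpriteCache:
    """Paint each key exactly once; later requests reuse the first image."""

    def __init__(self, paint: Callable[[str, int], bytes]):
        self._paint = paint
        self._images: dict[str, bytes] = {}

    def get_or_paint(self, key: str, appearance: str, seed: int) -> bytes:
        # Deliberately never repaints: the first image defines the character.
        if key not in self._images:
            self._images[key] = self._paint(appearance, seed)
        return self._images[key]

    def __contains__(self, key: str) -> bool:
        return key in self._images
```

Note that a second call with a different description still returns the original sprite; that is the point, since repainting is what breaks visual consistency.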

	### 5.3 Sprites vs full-scene

	Two valid looks:

	1. Layered VN (classic): transparent character sprite over a separate backdrop. Diffusion won’t emit clean alpha, so generate the character on a plain background and run BiRefNet (the same matting model in the `gradio.Server` demo) to cut it out → transparent PNG. More moving parts, but it gives the iconic VN look and lets you reuse one backdrop with different sprites.
	2. Full-scene (simpler): generate the character in the scene as one image. No compositing, no matting — fewer failure modes, but you can’t cheaply swap sprite vs backdrop independently.

	Recommend starting full-scene in Phase 1 (fast to working), then moving to layered in Phase 2 for polish if time allows.

	### 5.4 Latency tactics

	- 1–4 step model only; 512–768px is plenty for a backdrop behind text.
	- Text first, image second — the dialogue renders immediately; the UI swaps the image in when the Painter returns (SSE/streaming via the Gradio JS client).
	- Speculative paint: while the player reads/types, optionally pre-paint the most likely next backdrop.
	- All Painter calls behind `@spaces.GPU`; keep the pipeline warm (load once at startup).
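
The "text first, image second" tactic can be sketched with an async generator: start the paint in the background, emit the dialogue at once, then emit the image when it is ready. This is an illustration using plain `asyncio`, not the Gradio streaming API; `slow_paint`, `stream_turn`, and the event dictionaries are hypothetical, and the sleep stands in for a diffusion call.

```python
import asyncio
from typing import AsyncIterator, Optional


async def slow_paint(description: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for a 1-4 step diffusion call
    return f"image-of:{description}"


async def stream_turn(dialogue: str, description: Optional[str]) -> AsyncIterator[dict]:
    # Kick off painting first so it overlaps with the player reading the text.
    paint_task = asyncio.create_task(slow_paint(description)) if description else None
    yield {"type": "dialogue", "text": dialogue}
    if paint_task is not None:
        yield {"type": "image", "url": await paint_task}


async def collect(dialogue: str, description: Optional[str]) -> list:
    # Helper for trying the generator outside a UI.
    return [event async for event in stream_turn(dialogue, description)]
```

Running `asyncio.run(collect("Hello", "misty glade"))` yields the dialogue event before the image event, while a turn with no scene change yields the dialogue alone.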

	---

	## 6. The Ear (audio / STT)

	- Model: `faster-whisper` (CTranslate2) `small` for the laptop config, `large-v3-turbo` if you have room — player inputs are short, so `small`/`base` transcribe near-instantly. For the all-ggml local-first story, `whisper.cpp` pairs thematically with the llama.cpp bonus.
	- Capture: in the custom frontend, use the browser `MediaRecorder` API → send the blob to an `@app.api(name="transcribe")` endpoint that runs Whisper. (In the `gr.Blocks` MVP, use `gr.Audio(sources=["microphone"])`.)
	- Flow: transcript is shown to the player (so they can confirm/edit) then fed into the turn exactly like typed input. Keep voice optional; typing is the reliable fallback for the demo.

	---

	## 7. Deployment

	### 7.1 Local (where the bonuses live)

	Run `python app.py`; the LLM via `llama-cpp-python` (GGUF, GPU build), diffusion via `diffusers`, Whisper via `faster-whisper`. This is the configuration you record for the Off-the-Grid (local-first) and Llama-Champion (llama.cpp) badges — show it running with the network off.

	### 7.2 Hugging Face Space (the required canvas)

	The app is a Gradio app (`gradio.Server` extends Gradio/FastAPI), so it deploys as a normal Space. For usable image latency, use a GPU Space; ZeroGPU is free and integrates via the `@spaces.GPU` decorator (functions get a GPU per-call, allocated on demand). Load models at startup; decorate every inference function.

	The llama.cpp-on-ZeroGPU tension (decide early). A CUDA `llama-cpp-python` build can be awkward on Spaces (CUDA runtime mismatches), and ZeroGPU’s per-call GPU model doesn’t love long-lived native processes. Two clean resolutions:

	- A (single stack): get a working CUDA `llama-cpp-python` wheel/build for the Space (keeps llama.cpp everywhere). Test this on day 1, not at the deadline.
	- B (split stack, recommended for safety): llama.cpp locally (claims the llama.cpp badge in the video) + a `transformers` code path on the Space under `@spaces.GPU`. Same `llm.py` interface, two backends behind a flag.
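
Option B comes down to one interface with two implementations chosen by a flag. The sketch below shows that shape only: the backends are stubs that tag their output instead of loading a model, and the `LLM_BACKEND` environment variable, the class names, and `make_backend` are all hypothetical.

```python
import os
from typing import Optional, Protocol


class LLMBackend(Protocol):
    def complete(self, prompt: str) -> str: ...


class LlamaCppStub:
    # Placeholder: a real version would wrap llama-cpp-python with a GBNF grammar.
    def complete(self, prompt: str) -> str:
        return f"[llamacpp] {prompt}"


class TransformersStub:
    # Placeholder: a real version would run under @spaces.GPU on the Space.
    def complete(self, prompt: str) -> str:
        return f"[transformers] {prompt}"


BACKENDS = {"llamacpp": LlamaCppStub, "transformers": TransformersStub}


def make_backend(name: Optional[str] = None) -> LLMBackend:
    # An explicit argument wins; otherwise fall back to the (assumed) env flag.
    chosen = (name or os.environ.get("LLM_BACKEND", "llamacpp")).lower()
    if chosen not in BACKENDS:
        raise ValueError(f"unknown LLM backend: {chosen!r}")
    return BACKENDS[chosen]()
```

Keeping the rest of the code dependent only on `LLMBackend.complete` means the orchestrator does not need to know which stack it is running on.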

	### 7.3 Persistence

	The Space filesystem is ephemeral. The `.md` dream-memory therefore lives per session (perfectly fine for a one-sitting game). If you want playthroughs to survive restarts, write them to HF persistent storage or push traces to a Dataset (which also feeds the Open-Trace badge — see §9.3).

	### 7.4 `gradio.Server` specifics

	- `app = Server()` (a FastAPI subclass). `@app.get("/")` serves `frontend/index.html`. `@app.api(name="...")` defines queued, ZeroGPU-aware, `gradio_client`-callable endpoints (use these for `direct_turn`, `paint`, `transcribe`). `app.launch()` to run.
	- The frontend talks to the backend through the Gradio JS client (`Client.connect(window.location.origin)` → `client.predict("/direct_turn", {...})`) so calls go through Gradio’s queue/concurrency (and you can show progress), not raw `fetch`.

	---

	## 8. UI architecture (Off-Brand badge)

	### 8.1 Target: a real visual-novel frame

	A custom `frontend/index.html` (vanilla HTML/CSS/JS — no build step needed; the hackathon’s own `gradio.Server` demo ships a ~1300-line single file) rendering three CSS-stacked layers:

	```
	z-0 backdrop (full-bleed CSS background-image)
	z-1 sprite (transparent PNG, positioned, with a gentle entrance + mood cross-fade)
	z-2 dialogue (named speaker box at the bottom, text typed in character-by-character)
	+ input row: text field, 🎙️ mic button, "wait/continue" affordance
	```

	Niceties that sell delight cheaply: typewriter text reveal, soft sprite slide-in, backdrop cross-fade on scene change, a subtle paper/parchment vignette, a loading shimmer while the Painter works (“the wood is dreaming…”). Keep a small palette + one display font to look intentional.

	### 8.2 MVP fallback

	Phase 0/1 can use plain `gr.Blocks`: `gr.Image` for the scene, `gr.Chatbot`/`gr.Markdown` for dialogue, `gr.Textbox` + `gr.Audio(sources=["microphone"])` for input. Gate it behind `GRADIO_MVP_UI=1`. This de-risks the loop before you invest in the custom frontend.

	### 8.3 Wiring

	Frontend ⇄ backend via the Gradio JS client to `@app.api` endpoints (`/start`, `/direct_turn`, `/transcribe`). The backend returns `{ speaker, dialogue, emotion, scene_image_url, sprite_image_url }`; the frontend animates the rest.

	---

	## 9. Optional badge tracks

	### 9.1 Well-Tuned (fine-tune) 🟡

	Fine-tune the shared LLM (LoRA on Qwen3-4B/8B) to (a) reliably emit the directive schema and (b) carry the “dreaming wood” narrative voice. Dataset: ~200–800 synthetic `(context → directive-JSON + dialogue)` examples — bootstrap them with a larger model or hand-write seeds, then expand. Train with the standard PEFT/LoRA stack; merge or keep the adapter; convert to GGUF and publish on the Hub (this is what the badge checks). Even a small fine-tune that locks the output format + tone is a real win and reduces grammar-fighting at runtime. Time-box it — the MVP must not depend on it.

	### 9.2 Field Notes (blog) 🟡

	A short write-up: the diegetic conceit, the one-call director pattern, grammar-constrained directives, the consistency-via-cache trick, and the llama.cpp/ZeroGPU lesson. Cheap points, and genuinely useful to others.

	### 9.3 Open Trace 🟡

	Log every orchestration step per turn (the assembled context, the raw directive JSON, the Painter prompts/seeds) to `runs/.jsonl`, then push a cleaned sample as a Hub dataset. The struct-based state makes these traces clean and shareable.
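
Per-turn tracing needs little more than an append-only JSONL writer. The sketch below is one possible shape: `log_turn`, `read_trace`, and the record field names are hypothetical, and the file path is supplied by the caller rather than assumed.

```python
import json
import time
from pathlib import Path


def log_turn(path: Path, turn_index: int, context: str,
             directive_json: dict, painter_calls: list) -> dict:
    record = {
        "ts": time.time(),
        "turn_index": turn_index,
        "context": context,               # the assembled prompt context
        "directive_json": directive_json, # the raw LLM output for this turn
        "painter_calls": painter_calls,   # e.g. [{"prompt": "...", "seed": 42}]
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    # One JSON object per line, appended, so a crash loses at most one turn.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_trace(path: Path) -> list:
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line]
```

Since the assembled context can contain player input, a cleaning pass before publishing the dataset is worth keeping as a separate step.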

	---

	## 10. Phased plan

	Build window June 5–15. (Register by June 3; sketch + download weights before the 5th.) Days are indicative for a two-person team.

	### Phase 0 — Skeleton (Day 1–2)
	- `state.py`: `GameState` + directive schema; `.md` render/parse round-trip with a unit test.
	- `llm.py`: `complete()` + `complete_json()`, with the GBNF grammar on the llama.cpp path and retry-based JSON extraction on the `transformers` path (start with whichever backend you get running first).
	- `orchestrator.direct_turn()` returning valid directive JSON for a hard-coded scene.
	- Text-only loop in `gr.Blocks`. Milestone: type a line → NPC replies → state updates, looping, no images.

	### Phase 1 — The wood breathes (Day 3–5)
	- `painter.py`: full-scene generation with SDXL-Turbo; disk cache; pinned seeds.
	- `orchestrator.init_world()`: generate setting + first NPC + opening.
	- Wire scene/character directives → Painter. Milestone: a playable illustrated loop, basic VN layout in `gr.Blocks`.

	### Phase 2 — Voice + polish (Day 6–8)
	- `stt.py` + mic input.
	- Migrate to `gradio.Server` + custom `frontend/index.html` (layered scene, typewriter, animations).
	- `memory.compact_memory()`; sprite moods (pre-gen set or FLUX.2-Klein conditioning); optional BiRefNet layered sprites. Milestone: speak to a self-painting wood in a bespoke UI.

	### Phase 3 — Bonuses + ship (Day 9–10)
	- Lock the llama.cpp local config; resolve the Space backend (A or B from §7.2).
	- (Optional) fine-tune + GGUF on the Hub; (optional) trace dataset.
	- Record the demo video + write the social post (+ optional blog). Deploy the Space, test cold-start, freeze. Milestone: submitted.

	> Cut lines if time runs short, in this order: fine-tune → layered sprites → voice → custom frontend. The loop + a custom-ish UI + working art is a complete, competitive entry.

	---

	## 11. Risks & mitigations

	| Risk | Likelihood | Impact | Mitigation |
	|---|---|---|---|
	| Image latency makes it feel sluggish | High | High | 1–4-step model; text-first; cache; speculative paint; 512–768px. |
	| Character looks different each scene | High | High | Generate-once + cache; pinned seed + fixed appearance; FLUX.2-Klein conditioning for moods. |
	| llama.cpp won’t build on ZeroGPU | Med | Med | Decide §7.2 on day 1; keep `transformers` fallback behind a flag; claim llama.cpp badge locally. |
	| Small model drifts / over-narrates | Med | Med | Grammar-enforced directives; low temp on structure; rolling-summary memory; lean on the diegetic framing. |
	| Scope creep (combat, save slots, 5 NPCs) | Med | High | Honor the “out of scope” list in `CLAUDE.md`; ship the loop. |
	| Weights/licensing surprise (esp. FLUX.2) | Low | Med | Verify licenses before shipping; default to SDXL-Turbo/Z-Image (permissive). |
	| Demo video/social post left to the last hour | Med | High | They’re required. Schedule them in Phase 3, not after. |

	---

	## 12. Latency budget (per turn, target)

	| Stage | Target | Notes |
	|---|---|---|
	| STT (if voice) | < 1 s | short utterances, `small`/turbo |
	| Context assembly | negligible | pure Python |
	| LLM directive call | 1–4 s | ~8B Q4, ≤4k context, one call |
	| Apply + memory | negligible | |
	| Image (only when changed) | 1–5 s | 1–4-step model; hidden behind text and often cached |
	| Felt latency | dialogue in ~2–4 s, image fills in after | most turns don’t change the scene |

	---

	## 13. Decisions to lock before Day 1

	1. Project name (Thousand Token Wood is a placeholder).
	2. Art style for the `style_guide` (and whether to use a style LoRA).
	3. Image model: SDXL-Turbo (safe) vs FLUX.2 Klein (consistency, check license).
	4. Sprites: full-scene first vs layered+BiRefNet.
	5. Space backend: §7.2 option A vs B.
	6. Whisper size (small vs turbo) and whether voice ships in v1.
	7. Fine-tune: in or out (time-box).