Thousand Token Wood — Architecture
Deep technical design for the AI-improvised visual novel. Companion to ../README.md (overview), ../CLAUDE.md (conventions), and PROMPTS.md (exact prompts + grammar).
0. Concept & design pillars
Premise. The player is a wanderer who steps into a wood that is being dreamed into existence around them. Everyone they meet is a spirit the wood conjures; every backdrop is painted the moment the player arrives. The player shapes the dream by speaking or writing.
The diegetic conceit (this is the most important design decision). The wood is dreamed by a small, slightly forgetful mind. So when a small model is whimsical, slightly inconsistent, or surreal, that is in-world correct — the wood is dreaming. This reframing turns the weaknesses of ≤32B models into the aesthetic, and it’s what lets a small-model game feel intentional rather than broken. Every design choice should reinforce “a dream that paints itself,” not “a chatbot that sometimes fails.”
Design pillars
- AI generates the content, not just assists. Plot direction, dialogue, and art are produced live. Remove the models and nothing remains.
- Snappy over clever. On a laptop, latency kills delight. One LLM call per turn; 1–4-step image model; aggressive caching; text-before-image.
- The model proposes, code disposes. The LLM emits typed directives; deterministic code owns the canonical state. Small models cannot be trusted to hand-edit prose state without drift.
- Memory is a budget, not an archive. Feed the model a rolling summary + present-scene detail + present-character sheets — never the whole history.
- Constrain to survive. Grammar-constrained decoding guarantees parseable directives even from a 7–8B model.
1. System overview

    ┌──────────────────────────── HF Space (gradio.Server) ────────────────────────────┐
    │                                                                                  │
    │  frontend/index.html ── Gradio JS client ──▶ @app.api endpoints                  │
    │  (layered VN UI)                                     │                           │
    │                                                      ▼                           │
    │                   app.py (thin: routes, @app.api, @spaces.GPU)                   │
    │      │              │                        │                            │      │
    │      ▼              ▼                        ▼                            ▼      │
    │   stt.py     orchestrator.py           characters.py                 painter.py  │
    │  (Whisper)    (the Weaver)             (the Voices)       (diffusion + BiRefNet) │
    │                  │   │                       │                            ▲      │
    │                  │   └──── llm.py (llama.cpp / transformers, grammar) ────┘      │
    │                  ▼                                                               │
    │    state.py (GameState · apply directives) ──▶ memory.py (summary + budget)      │
    │                  │                                                               │
    │                  ▼                                                               │
    │    .md memory: templates/world_state.md · one per character                      │
    └──────────────────────────────────────────────────────────────────────────────────┘

stt.py, orchestrator.py, characters.py, painter.py are all callable in isolation (unit-testable). app.py only wires them to HTTP.
2. The model roles in detail
2.1 Why one LLM for two roles
The “Weaver” and the “Voices” are distinct agents (different jobs, different prompts, different output shapes) but they run on the same loaded LLM weights. Reasons: (a) the parameter budget — a second 7–8B model would nearly double VRAM for little gain; (b) simplicity — one runtime, one quantization, one warm-up; (c) it’s still genuinely multi-model (text + image + speech are three real, different models). If you later want a visibly separate model for the demo narrative, the cheapest split is a tiny dedicated model for one narrow job (e.g. a 0.5–1.5B model just for image-prompt rewriting) — but that’s optional and not recommended for the MVP.
2.2 The Weaver (Game Master / director)
Responsibilities
- init_world(seed, vibe): invent the setting, the opening scene, and the first NPC; write the world-state .md and the first character sheet.
- direct_turn(state, player_input): the per-turn workhorse. In a single grammar-constrained call it produces both (a) the NPC’s reply (delegating tone to the Voices system prompt) and (b) the directives that tell the engine what changed.
- compact_memory(state): periodically fold old events into the rolling summary to stay within the context budget.
I/O contract (per turn). Input: assembled context (see §4.4). Output: a single JSON object validated against the directive schema (§4.2). Code applies it; the Weaver never writes to state directly.
2.3 The Voices (character manager / actor)
In the default design, the Voices is not a separate LLM call; it is the actor-persona layer of the same per-turn call. characters.py assembles the actor context: for each present NPC, its character sheet (traits, voice, goals, current mood, relationship to the player) plus the current scene. The system prompt instructs the model to speak only as the addressed/active NPC, in that NPC’s voice, never breaking character or narrating as the author.
Two-call variant (optional). If you find the single call muddies voice quality, split it: call 1 = Voices produces dialogue (free text); call 2 = Weaver reads the dialogue + player input and emits directives (grammar JSON). This doubles per-turn latency, so only do it if quality demands it. Keep it behind a flag.
2.4 The Painter (image)
See §5. Consumes prompts composed from the world style guide + the scene/character description; returns a backdrop image and/or a character sprite (cut to transparency).
2.5 The Ear (STT)
See §6. Whisper turns recorded audio into the player’s text input. Purely a front door to the loop.
3. The game loop in detail
3.1 Initialisation (once)
- Player picks a vibe (e.g. “cozy folktale”, “eerie”, “absurd”) and optionally a seed.
- orchestrator.init_world() → world-state .md + first character sheet + an opening line of narration/dialogue + an initial set of directives (scene description, first NPC, requested art).
- painter paints the opening backdrop and the first sprite (parallelisable). Cache both.
- Render the opening scene.
3.2 The turn (loops)
| Step | Module | What happens |
|---|---|---|
| 1 | stt (if voice) | Recorded audio → text. Typed input skips this. |
| 2 | memory | Assemble context: style guide + rolling summary + current scene + present-character sheets + last k turns + the player input. Enforce token budget. |
| 3 | orchestrator.direct_turn → llm.complete_json | One grammar-constrained call → { speaker, dialogue, emotion, directives }. |
| 4 | state | Validate + apply directives deterministically: move scene, add/remove NPC, set mood, set relationship deltas, set flags, mark beat/ending. |
| 5 | memory | Append the turn; if over budget or every N turns, compact_memory(). |
| 6 | painter (conditional) | If scene_change → paint/lookup backdrop. If new_character → paint+matte sprite. If only mood changed → swap to the (cached or conditioned) mood sprite. |
| 7 | frontend | Render: backdrop, sprite (with mood), speaker name, dialogue. Dialogue streams first; images fill in when ready. |
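The seven steps above can be read as one function with the models injected as callables. The sketch below is a minimal illustration under stated assumptions: run_turn, the dict-shaped state, and the transcribe/direct_turn parameters are hypothetical names chosen for this example, and budget enforcement and painting are left out.

```python
from typing import Any, Callable, Optional


def run_turn(
    state: dict[str, Any],
    typed_text: Optional[str],
    audio: Optional[bytes],
    transcribe: Callable[[bytes], str],
    direct_turn: Callable[[str], dict[str, Any]],
) -> dict[str, Any]:
    """One pass through steps 1-7 (hypothetical shape, models injected)."""
    # Step 1: voice goes through STT; typed input skips it.
    text = typed_text if typed_text is not None else transcribe(audio or b"")

    # Step 2: assemble context (token budget omitted in this sketch).
    context = "\n".join([state["summary"], *state["recent_turns"], f"PLAYER: {text}"])

    # Step 3: a single LLM call returns dialogue plus directives.
    out = direct_turn(context)
    directives = out["directives"]

    # Step 4: code, not the model, mutates the canonical state.
    paint_jobs: list[str] = []
    if directives.get("scene_change"):
        state["place"] = directives["scene_change"]["place"]
        paint_jobs.append("backdrop")
    if directives.get("new_character"):
        state["present"].append(directives["new_character"]["id"])
        paint_jobs.append("sprite")

    # Step 5: append the turn to the verbatim window.
    state["recent_turns"] += [f"PLAYER: {text}", f"{out['speaker']}: {out['dialogue']}"]
    state["turn_index"] += 1

    # Steps 6-7: return the text now, plus the list of images still to paint.
    return {
        "speaker": out["speaker"],
        "dialogue": out["dialogue"],
        "emotion": out["emotion"],
        "paint_jobs": paint_jobs,
    }
```

Passing the models in as plain callables keeps the loop testable with fakes, which matches the goal of having each module callable in isolation.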
3.3 The directive contract (what the LLM is allowed to change)
Directives are a closed set of safe, structured operations — the engine’s “API” that the model calls by emitting JSON (conceptually identical to tool-calling, but enforced by grammar). Closed set ⇒ the model can’t put the game in an undefined state. See §4.2 for the schema and PROMPTS.md for the GBNF grammar.
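As a rough illustration of what "closed set" means on the engine side, the sketch below rejects any directive key outside the known set and fills in neutral defaults. The key names and the -100..100 relationship range come from this document; normalize_directives and apply_relationship are hypothetical helper names, not the project's actual functions.

```python
import copy
from typing import Any

# Defaults describe a turn in which nothing changed (keys from §4.2).
DIRECTIVE_DEFAULTS: dict[str, Any] = {
    "scene_change": None,
    "new_character": None,
    "exit_character": None,
    "relationship_delta": 0,
    "set_flags": {},
    "advance_beat": False,
    "ending": None,
}


def normalize_directives(raw: dict[str, Any]) -> dict[str, Any]:
    """Reject anything outside the closed set, then fill in defaults."""
    unknown = set(raw) - set(DIRECTIVE_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown directives: {sorted(unknown)}")
    # Deep-copy so callers never share the mutable default dict.
    return {**copy.deepcopy(DIRECTIVE_DEFAULTS), **raw}


def apply_relationship(current: int, delta: int) -> int:
    """Clamp to the -100..100 range used by Character.relationship."""
    return max(-100, min(100, current + delta))
```

With grammar-constrained decoding the unknown-key branch should rarely fire; it is still a cheap guard for the prompt-based path, where nothing constrains the model's output.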
4. State & memory
4.1 GameState (source of truth)
A Pydantic model held in memory for the session and mirrored to .md. Sketch:

    from pydantic import BaseModel


    class Character(BaseModel):
        id: str
        name: str
        one_line: str                  # "a nervous lantern-moth who collects apologies"
        traits: list[str]
        voice: str                     # how they speak: rhythm, vocabulary, tics
        goals: str
        appearance: str                # the STABLE description used for every sprite of them
        mood: str = "neutral"          # drives sprite variant
        relationship: int = 0          # -100..100 toward the player
        sprite_seed: int               # pinned for visual consistency
        known_facts: list[str] = []    # what THIS character knows (avoids omniscience)


    class Scene(BaseModel):
        id: str
        place: str
        description: str               # used for the backdrop prompt
        mood: str
        present: list[str]             # character ids on stage (cap ~3)
        backdrop_seed: int


    class Turn(BaseModel):
        # Assumed shape: GameState references Turn, but the sketch does not define it.
        speaker: str
        text: str


    class GameState(BaseModel):
        seed: int
        style_guide: str               # global art + tone bible (set at init, mostly frozen)
        vibe: str
        scene: Scene
        characters: dict[str, Character]
        summary: str = ""              # rolling compressed history
        recent_turns: list[Turn] = []  # last k verbatim turns
        flags: dict[str, str] = {}     # arbitrary world facts the Weaver sets
        beat: str = "opening"          # opening | rising | turn | resolution | ended
        turn_index: int = 0

4.2 Directive schema (the per-turn LLM output)

    {
      "speaker": "lantern_moth",     // which present character speaks (or "narrator")
      "dialogue": "string",          // their line, in voice
      "emotion": "string",           // free-form mood word (e.g. "curious", "tender") → sprite mood
      "directives": {
        "scene_change": null,        // or { "place": "...", "description": "...", "mood": "..." }
        "new_character": null,       // or a partial Character (id,name,one_line,appearance,voice,traits,goals)
        "exit_character": null,      // or character id leaving the stage
        "relationship_delta": 0,     // toward the player, applied to the speaker
        "set_flags": {},             // e.g. {"gave_player_the_key":"true"}
        "advance_beat": false,       // nudge pacing toward resolution
        "ending": null               // or { "kind":"warm|bittersweet|strange", "text":"..." }
      }
    }

emotion is a free-form string (the LLM picks whatever fits the moment). Every nested object is optional/nullable so simple turns stay tiny. complete_json enforces structure on the llama.cpp path via GBNF; on the transformers path it uses prompt-based JSON extraction with 3-attempt retry.
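For the prompt-based path, the extract-and-retry behaviour could look roughly like the sketch below. It is an assumption-laden illustration, not the project's complete_json: the generate callable stands in for whatever the transformers backend exposes, extract_json is a hypothetical helper, and the model is assumed to emit plain JSON (without the // comments used in the schema listing).

```python
import json
from typing import Callable

REQUIRED_KEYS = ("speaker", "dialogue", "emotion", "directives")


def extract_json(text: str) -> dict:
    """Return the first parseable JSON object embedded in free text."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found")


def complete_json(generate: Callable[[str], str], prompt: str, attempts: int = 3) -> dict:
    """Call the model up to `attempts` times until the output has every required key."""
    last_error: Exception = ValueError("no attempts made")
    for _ in range(attempts):
        raw = generate(prompt)
        try:
            obj = extract_json(raw)
        except ValueError as exc:
            last_error = exc
            continue
        missing = [k for k in REQUIRED_KEYS if k not in obj]
        if not missing:
            return obj
        last_error = ValueError(f"missing keys: {missing}")
    raise RuntimeError(f"no valid directive JSON after {attempts} attempts") from last_error
```

Scanning from each opening brace tolerates chatty preambles such as "Sure, here is the JSON:", which small models tend to add when no grammar constrains them.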
4.3 .md as a derived view (not the truth)
Why both a struct and markdown? The struct is robust and code-friendly; the .md files are (a) human-readable so you can debug/show the “dream-memory” in the demo, (b) a clean, compact way to inject character/world context back into the prompt, and (c) on-theme. state.py renders GameState → world_state.md + one file per character after each turn, and can parse them back on load. The LLM reads .md; it does not author the canonical copy. Templates: ../templates/world_state.md, ../templates/character_sheet.md.
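The render/parse round-trip can be sketched with a deliberately tiny sheet. Everything about the layout here is an assumption: the real templates live in ../templates and are not reproduced in this document, and a stdlib dataclass stands in for the Pydantic Character to keep the example dependency-free.

```python
from dataclasses import dataclass, field


@dataclass
class CharacterSheet:
    # Reduced, hypothetical subset of the Character fields.
    id: str
    name: str
    mood: str = "neutral"
    relationship: int = 0
    traits: list[str] = field(default_factory=list)


def render_sheet(c: CharacterSheet) -> str:
    """Struct -> markdown (the derived view)."""
    lines = [
        f"# {c.name}",
        f"- id: {c.id}",
        f"- mood: {c.mood}",
        f"- relationship: {c.relationship}",
        f"- traits: {', '.join(c.traits)}",
    ]
    return "\n".join(lines) + "\n"


def parse_sheet(md: str) -> CharacterSheet:
    """Markdown -> struct, used only when loading a saved session."""
    name = ""
    fields: dict[str, str] = {}
    for line in md.splitlines():
        if line.startswith("# "):
            name = line[2:].strip()
        elif line.startswith("- ") and ":" in line:
            key, _, value = line[2:].partition(":")
            fields[key.strip()] = value.strip()
    return CharacterSheet(
        id=fields["id"],
        name=name,
        mood=fields.get("mood", "neutral"),
        relationship=int(fields.get("relationship", "0")),
        traits=[t.strip() for t in fields.get("traits", "").split(",") if t.strip()],
    )
```

A unit test asserting parse_sheet(render_sheet(c)) == c is the round-trip check; note that this toy format would break on traits that themselves contain commas.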
4.4 Context assembly & budget (memory.py)
Every turn, build the prompt from, in priority order until the budget fills:
- System prompt (Weaver+Voices role): fixed.
- style_guide + vibe: small, fixed.
- summary: the rolling compressed history.
- Current Scene description + the present characters’ sheets only (not the whole cast).
- The last k verbatim turns (e.g. k=4–6).
- The player’s new input.
Target a conservative budget (e.g. ≤ ~3–4k tokens of context even if the model supports more) — small models degrade as context grows, and it keeps latency down. When recent_turns + history would exceed budget, compact_memory() asks the LLM to rewrite older turns into 3–5 sentences appended to summary, then drops them. This is the literal “Thousand Token Wood” — a small working memory by design.
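A minimal sketch of budgeted assembly, under stated assumptions: tokens are approximated by whitespace-separated words (a real build would use the model's tokenizer), the function names are hypothetical, and the player's input is reserved up front so it can never be squeezed out even though it comes last in the prompt.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def assemble_context(
    system: str,
    style: str,
    summary: str,
    scene: str,
    recent_turns: list[str],
    player_input: str,
    budget: int,
) -> tuple[str, list[str]]:
    """Return (prompt, dropped_turns); dropped turns are candidates for compaction."""
    fixed = [system, style, summary, scene]
    tail = f"PLAYER: {player_input}"
    used = sum(count_tokens(part) for part in fixed) + count_tokens(tail)

    kept: list[str] = []
    # Walk backwards so the newest turns win when the budget runs out.
    for turn in reversed(recent_turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()

    dropped = recent_turns[: len(recent_turns) - len(kept)]
    prompt = "\n\n".join([*fixed, *kept, tail])
    return prompt, dropped
```

The dropped list is what a compact_memory() step would summarize into 3–5 sentences before discarding the verbatim text.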
5. The Painter (image pipeline)
5.1 Model choice & rationale
The hard problems are latency (must be a few seconds, not 30) and character consistency (the same NPC must look the same across scenes). Recommended options, in order:
| Option | Params | Steps | Why | Watch out |
|---|---|---|---|---|
| SDXL-Turbo / SDXL-Lightning (default) | ~3.5B | 1–4 | Mature, permissive, huge illustration/anime LoRA ecosystem (perfect VN look), runs on 8–16 GB, very fast. Consistency via pinned seed + fixed appearance string + generate-once. | Weaker prompt adherence than newer models; no built-in image conditioning. |
| FLUX.2 Klein (upgrade) | ~4B | few | Distilled from FLUX.2; “Kontext” image-conditioning edits an existing image (“same character, now smiling”, “same scene at dusk”) — solves consistency directly and gives mood sprites for free. Fits ~16 GB. | Check the license before shipping publicly; slightly heavier. |
| Z-Image-Turbo | not listed here (fits ~13–16 GB VRAM) | turbo | Apache-2.0, fast. Good permissive middle ground. | Smaller ecosystem than SDXL. |
Pick one art style and bake it into style_guide (e.g. “soft watercolor storybook”, “muted ukiyo-e”, “90s anime VN”) so everything coheres. A style LoRA on SDXL is the cheapest way to a distinctive look (also nudges the Off-Brand vibe).
5.2 Consistency strategy (the crux)
- Generate each character’s base sprite exactly once, at introduction, and cache it. Never re-paint an existing character to “refresh” — that’s what breaks consistency.
- Store a pinned sprite_seed and the stable appearance string in the character record; reuse both for any later generation of that character.
- For moods/expressions: either (a) pre-generate a small set (neutral/happy/sad/surprised) at intro and swap, or (b) if using FLUX.2 Klein, condition on the base sprite to edit only the expression. Option (b) is cleaner and cheaper at runtime.
- For backdrops: cache by scene_id; revisiting a place reuses its image.
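The generate-once rule amounts to a keyed cache in front of the Painter. The sketch below assumes a paint(prompt, seed) callable returning image bytes; AssetCache and the key helpers are hypothetical names, and a real build would persist to disk rather than to a dict.

```python
from typing import Callable


def sprite_key(character_id: str, mood: str = "neutral") -> str:
    return f"sprite/{character_id}/{mood}"


def backdrop_key(scene_id: str) -> str:
    return f"backdrop/{scene_id}"


class AssetCache:
    """Paint each key at most once; later requests reuse the stored image."""

    def __init__(self, paint: Callable[[str, int], bytes]) -> None:
        self._paint = paint
        self._store: dict[str, bytes] = {}

    def get(self, key: str, prompt: str, seed: int) -> bytes:
        if key not in self._store:
            # First sighting: generate with the pinned seed and remember it.
            self._store[key] = self._paint(prompt, seed)
        # Any later prompt or seed is ignored: the cached image is canonical.
        return self._store[key]
```

Ignoring the prompt on a cache hit is the point of the design: even if the model later describes the character differently, the sprite the player already saw stays the same.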
5.3 Sprites vs full-scene
Two valid looks:
- Layered VN (classic): transparent character sprite over a separate backdrop. Diffusion won’t emit clean alpha, so generate the character on a plain background and run BiRefNet (the same matting model in the gradio.Server demo) to cut it out → transparent PNG. More moving parts, but it gives the iconic VN look and lets you reuse one backdrop with different sprites.
- Full-scene (simpler): generate the character in the scene as one image. No compositing and no matting means fewer failure modes, but you can’t cheaply swap sprite and backdrop independently.
Recommend starting full-scene in Phase 1 (fast to working), then moving to layered in Phase 2 for polish if time allows.
5.4 Latency tactics
- 1–4 step model only; 512–768px is plenty for a backdrop behind text.
- Text first, image second — the dialogue renders immediately; the UI swaps the image in when the Painter returns (SSE/streaming via the Gradio JS client).
- Speculative paint: while the player reads/types, optionally pre-paint the most likely next backdrop.
- All Painter calls behind @spaces.GPU; keep the pipeline warm (load once at startup).
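The text-first ordering can be expressed as an async generator that yields the dialogue event before it awaits the Painter. This is a sketch only: turn_events and the event tuples are hypothetical, and how the events reach the browser (SSE or Gradio streaming) is left out.

```python
from typing import Any, AsyncIterator, Awaitable, Callable


async def turn_events(
    direct_turn: Callable[[], Awaitable[dict[str, Any]]],
    paint: Callable[[str], Awaitable[str]],
) -> AsyncIterator[tuple[str, str]]:
    """Yield ("dialogue", text) first, then ("backdrop", image_ref) if the scene changed."""
    out = await direct_turn()
    # The dialogue goes out as soon as the LLM call returns.
    yield ("dialogue", out["dialogue"])

    change = out["directives"].get("scene_change")
    if change:
        # Painting is awaited only after the text is already on screen.
        yield ("backdrop", await paint(change["description"]))
```

A consumer would iterate with `async for kind, payload in turn_events(...)` and update the matching UI layer for each event.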
6. The Ear (audio / STT)
- Model: faster-whisper (CTranslate2) small for the laptop config, large-v3-turbo if you have room. Player inputs are short, so small/base transcribe near-instantly. For the all-ggml local-first story, whisper.cpp pairs thematically with the llama.cpp bonus.
- Capture: in the custom frontend, use the browser MediaRecorder API → send the blob to an @app.api(name="transcribe") endpoint that runs Whisper. (In the gr.Blocks MVP, use gr.Audio(sources=["microphone"]).)
- Flow: the transcript is shown to the player (so they can confirm/edit), then fed into the turn exactly like typed input. Keep voice optional; typing is the reliable fallback for the demo.
7. Deployment
7.1 Local (where the bonuses live)
Run python app.py; the LLM via llama-cpp-python (GGUF, GPU build), diffusion via diffusers, Whisper via faster-whisper. This is the configuration you record for the Off-the-Grid (local-first) and Llama-Champion (llama.cpp) badges — show it running with the network off.
7.2 Hugging Face Space (the required canvas)
The app is a Gradio app (gradio.Server extends Gradio/FastAPI), so it deploys as a normal Space. For usable image latency, use a GPU Space; ZeroGPU is free and integrates via the @spaces.GPU decorator (functions get a GPU per-call, allocated on demand). Load models at startup; decorate every inference function.
The llama.cpp-on-ZeroGPU tension (decide early). A CUDA llama-cpp-python build can be awkward on Spaces (CUDA runtime mismatches), and ZeroGPU’s per-call GPU model doesn’t love long-lived native processes. Two clean resolutions:
- A (single stack): get a working CUDA llama-cpp-python wheel/build for the Space (keeps llama.cpp everywhere). Test this on day 1, not at the deadline.
- B (split stack, recommended for safety): llama.cpp locally (claims the llama.cpp badge in the video) + a transformers code path on the Space under @spaces.GPU. Same llm.py interface, two backends behind a flag.
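Option B's "same interface, two backends behind a flag" could be wired roughly as below. The LLM_BACKEND environment variable, the class names, and make_backend are all hypothetical, and the backends are empty placeholders: real versions would wrap llama-cpp-python and transformers respectively.

```python
import os
from typing import Optional, Protocol


class LLMBackend(Protocol):
    def complete(self, prompt: str) -> str: ...


class LlamaCppBackend:
    """Placeholder for the local GGUF path."""

    name = "llama_cpp"

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wrap llama-cpp-python here")


class TransformersBackend:
    """Placeholder for the Space path that runs under @spaces.GPU."""

    name = "transformers"

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wrap a transformers pipeline here")


_BACKENDS = {"llama_cpp": LlamaCppBackend, "transformers": TransformersBackend}


def make_backend(name: Optional[str] = None) -> LLMBackend:
    """Pick a backend by explicit name, else by the (assumed) LLM_BACKEND env var."""
    chosen = name or os.environ.get("LLM_BACKEND", "llama_cpp")
    if chosen not in _BACKENDS:
        raise ValueError(f"unknown LLM backend: {chosen!r}")
    return _BACKENDS[chosen]()
```

Because the orchestrator only sees the LLMBackend protocol, switching between the laptop and the Space becomes a configuration change rather than a code change.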
7.3 Persistence
The Space filesystem is ephemeral. The .md dream-memory therefore lives per session (perfectly fine for a one-sitting game). If you want playthroughs to survive restarts, write them to HF persistent storage or push traces to a Dataset (which also feeds the Open-Trace badge — see §9.3).
7.4 gradio.Server specifics
- app = Server() (a FastAPI subclass).
- @app.get("/") serves frontend/index.html.
- @app.api(name="...") defines queued, ZeroGPU-aware, gradio_client-callable endpoints (use these for direct_turn, paint, transcribe).
- app.launch() to run.
- The frontend talks to the backend through the Gradio JS client (Client.connect(window.location.origin) → client.predict("/direct_turn", {...})) so calls go through Gradio’s queue/concurrency (and you can show progress), not raw fetch.
8. UI architecture (Off-Brand badge)
8.1 Target: a real visual-novel frame
A custom frontend/index.html (vanilla HTML/CSS/JS — no build step needed; the hackathon’s own gradio.Server demo ships a ~1300-line single file) rendering three CSS-stacked layers:
z-0 backdrop (full-bleed CSS background-image)
z-1 sprite (transparent PNG, positioned, with a gentle entrance + mood cross-fade)
z-2 dialogue (named speaker box at the bottom, text typed in character-by-character)
+ input row: text field, 🎙️ mic button, "wait/continue" affordance
Niceties that sell delight cheaply: typewriter text reveal, soft sprite slide-in, backdrop cross-fade on scene change, a subtle paper/parchment vignette, a loading shimmer while the Painter works (“the wood is dreaming…”). Keep a small palette + one display font to look intentional.
8.2 MVP fallback
Phase 0/1 can use plain gr.Blocks: gr.Image for the scene, gr.Chatbot/gr.Markdown for dialogue, gr.Textbox + gr.Audio(microphone) for input. Gate it behind GRADIO_MVP_UI=1. This de-risks the loop before you invest in the custom frontend.
8.3 Wiring
Frontend ⇄ backend via the Gradio JS client to @app.api endpoints (/start, /direct_turn, /transcribe). The backend returns { speaker, dialogue, emotion, scene_image_url, sprite_image_url }; the frontend animates the rest.
9. Optional badge tracks
9.1 Well-Tuned (fine-tune) 🟡
Fine-tune the shared LLM (LoRA on Qwen3-4B/8B) to (a) reliably emit the directive schema and (b) carry the “dreaming wood” narrative voice. Dataset: ~200–800 synthetic (context → directive-JSON + dialogue) examples — bootstrap them with a larger model or hand-write seeds, then expand. Train with the standard PEFT/LoRA stack; merge or keep the adapter; convert to GGUF and publish on the Hub (this is what the badge checks). Even a small fine-tune that locks the output format + tone is a real win and reduces grammar-fighting at runtime. Time-box it — the MVP must not depend on it.
9.2 Field Notes (blog) 🟡
A short write-up: the diegetic conceit, the one-call director pattern, grammar-constrained directives, the consistency-via-cache trick, and the llama.cpp/ZeroGPU lesson. Cheap points, and genuinely useful to others.
9.3 Open Trace 🟡
Log every orchestration step per turn — the assembled context, the raw directive JSON, the Painter prompts/seeds — to runs/*.jsonl, then push a cleaned sample as a Hub dataset. The struct-based state makes these traces clean and shareable.
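Per-turn tracing needs little more than an append-only JSONL writer. In the sketch below, log_turn and the record fields are hypothetical; only the runs/*.jsonl layout comes from the text above.

```python
import json
import time
from pathlib import Path
from typing import Any


def log_turn(run_dir: str, run_id: str, record: dict[str, Any]) -> Path:
    """Append one JSON object per line to <run_dir>/<run_id>.jsonl."""
    path = Path(run_dir) / f"{run_id}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    # A timestamp is added here; the caller supplies context, directives, seeds.
    line = json.dumps({"ts": time.time(), **record}, ensure_ascii=False)
    with path.open("a", encoding="utf-8") as f:
        f.write(line + "\n")
    return path
```

A call such as log_turn("runs", "session-001", {"turn": 3, "directive": out, "painter_seed": 1234}) would add one line per turn; one object per line keeps the file easy to filter before pushing a cleaned sample to the Hub.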
10. Phased plan
Build window June 5–15. (Register by June 3; sketch + download weights before the 5th.) Days are indicative for a two-person team.
Phase 0 — Skeleton (Day 1–2)
- state.py: GameState + directive schema; .md render/parse round-trip with a unit test.
- llm.py: complete() + complete_json() with GBNF grammar (start with transformers or llama.cpp, whichever you get running first).
- orchestrator.direct_turn() returning valid directive JSON for a hard-coded scene.
- Text-only loop in gr.Blocks. Milestone: type a line → NPC replies → state updates, looping, no images.
Phase 1 — The wood breathes (Day 3–5)
- painter.py: full-scene generation with SDXL-Turbo; disk cache; pinned seeds.
- orchestrator.init_world(): generate setting + first NPC + opening.
- Wire scene/character directives → Painter. Milestone: a playable illustrated loop, basic VN layout in gr.Blocks.
Phase 2 — Voice + polish (Day 6–8)
- stt.py + mic input.
- Migrate to gradio.Server + custom frontend/index.html (layered scene, typewriter, animations).
- memory.compact_memory(); sprite moods (pre-gen set or FLUX.2-Klein conditioning); optional BiRefNet layered sprites. Milestone: speak to a self-painting wood in a bespoke UI.
Phase 3 — Bonuses + ship (Day 9–10)
- Lock the llama.cpp local config; resolve the Space backend (A or B from §7.2).
- (Optional) fine-tune + GGUF on the Hub; (optional) trace dataset.
- Record the demo video + write the social post (+ optional blog). Deploy the Space, test cold-start, freeze. Milestone: submitted.
Cut lines if time runs short, in this order: fine-tune → layered sprites → voice → custom frontend. The loop + a custom-ish UI + working art is a complete, competitive entry.
11. Risks & mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Image latency makes it feel sluggish | High | High | 1–4-step model; text-first; cache; speculative paint; 512–768px. |
| Character looks different each scene | High | High | Generate-once + cache; pinned seed + fixed appearance; FLUX.2-Klein conditioning for moods. |
| llama.cpp won’t build on ZeroGPU | Med | Med | Decide §7.2 on day 1; keep transformers fallback behind a flag; claim llama.cpp badge locally. |
| Small model drifts / over-narrates | Med | Med | Grammar-enforced directives; low temp on structure; rolling-summary memory; lean on the diegetic framing. |
| Scope creep (combat, save slots, 5 NPCs) | Med | High | Honor the “out of scope” list in CLAUDE.md; ship the loop. |
| Weights/licensing surprise (esp. FLUX.2) | Low | Med | Verify licenses before shipping; default to SDXL-Turbo/Z-Image (permissive). |
| Demo video/social post left to the last hour | Med | High | They’re required. Schedule them in Phase 3, not after. |
12. Latency budget (per turn, target)
| Stage | Target | Notes |
|---|---|---|
| STT (if voice) | < 1 s | short utterances, small/turbo |
| Context assembly | negligible | pure Python |
| LLM directive call | 1–4 s | ~8B Q4, ≤4k context, one call |
| Apply + memory | negligible | |
| Image (only when changed) | 1–5 s | 1–4-step model; hidden behind text and often cached |
| Felt latency | dialogue in ~2–4 s, image fills in after | most turns don’t change the scene |
13. Decisions to lock before Day 1
- Project name (Thousand Token Wood is a placeholder).
- Art style for the style_guide (and whether to use a style LoRA).
- Image model: SDXL-Turbo (safe) vs FLUX.2 Klein (consistency, check license).
- Sprites: full-scene first vs layered+BiRefNet.
- Space backend: §7.2 option A vs B.
- Whisper size (small vs turbo) and whether voice ships in v1.
- Fine-tune: in or out (time-box).