| # Thousand Token Wood — Architecture | |
| Deep technical design for the AI-improvised visual novel. Companion to [`../README.md`](../README.md) (overview), [`../CLAUDE.md`](../CLAUDE.md) (conventions), and [`PROMPTS.md`](PROMPTS.md) (exact prompts + grammar). | |
| --- | |
| ## 0. Concept & design pillars | |
| **Premise.** The player is a wanderer who steps into a wood that is being *dreamed into existence* around them. Everyone they meet is a spirit the wood conjures; every backdrop is painted the moment the player arrives. The player shapes the dream by speaking or writing. | |
| **The diegetic conceit (this is the most important design decision).** The wood is dreamed by a *small, slightly forgetful mind*. So when a small model is whimsical, slightly inconsistent, or surreal, that is **in-world correct** — the wood is dreaming. This reframing turns the weaknesses of ≤32B models into the *aesthetic*, and it’s what lets a small-model game feel intentional rather than broken. Every design choice should reinforce “a dream that paints itself,” not “a chatbot that sometimes fails.” | |
| **Design pillars** | |
| 1. **AI generates the content, not just assists.** Plot direction, dialogue, and art are produced live. Remove the models and nothing remains. | |
| 2. **Snappy over clever.** On a laptop, latency kills delight. One LLM call per turn; 1–4-step image model; aggressive caching; text-before-image. | |
| 3. **The model proposes, code disposes.** The LLM emits *typed directives*; deterministic code owns the canonical state. Small models cannot be trusted to hand-edit prose state without drift. | |
| 4. **Memory is a budget, not an archive.** Feed the model a rolling summary + present-scene detail + present-character sheets — never the whole history. | |
| 5. **Constrain to survive.** Grammar-constrained decoding guarantees parseable directives even from a 7–8B model. | |
| --- | |
| ## 1. System overview | |
| ``` | |
| ┌──────────────────────────── HF Space (gradio.Server) ────────────────────────────┐ | |
| │                                                                                  │ | |
| │  frontend/index.html ── Gradio JS client ──▶ @app.api endpoints                  │ | |
| │  (layered VN UI)                                     │                           │ | |
| │                                                      ▼                           │ | |
| │                   app.py (thin: routes, @app.api, @spaces.GPU)                   │ | |
| │        │                 │                   │                     │             │ | |
| │        ▼                 ▼                   ▼                     ▼             │ | |
| │     stt.py        orchestrator.py      characters.py          painter.py         │ | |
| │    (Whisper)       (the Weaver)        (the Voices)     (diffusion + BiRefNet)   │ | |
| │        │                 │                   │                     ▲             │ | |
| │        │      └──── llm.py (llama.cpp / transformers, grammar) ────┘             │ | |
| │        ▼                                                                         │ | |
| │  state.py (GameState · apply directives) ──▶ memory.py (summary + budget)        │ | |
| │        │                                                                         │ | |
| │        ▼                                                                         │ | |
| │  .md memory: templates/world_state.md · one per character                        │ | |
| └──────────────────────────────────────────────────────────────────────────────────┘ | |
| ``` | |
| `stt.py`, `orchestrator.py`, `characters.py`, `painter.py` are all callable in isolation (unit-testable). `app.py` only wires them to HTTP. | |
| --- | |
| ## 2. The model roles in detail | |
| ### 2.1 Why one LLM for two roles | |
| The “Weaver” and the “Voices” are **distinct agents** (different jobs, different prompts, different output shapes) but they run on the **same loaded LLM weights**. Reasons: (a) the parameter budget — a second 7–8B model would nearly double VRAM for little gain; (b) simplicity — one runtime, one quantization, one warm-up; (c) it’s still genuinely multi-model (text + image + speech are three real, different models). If you later want a *visibly* separate model for the demo narrative, the cheapest split is a tiny dedicated model for one narrow job (e.g. a 0.5–1.5B model just for image-prompt rewriting) — but that’s optional and not recommended for the MVP. | |
| ### 2.2 The Weaver (Game Master / director) | |
| **Responsibilities** | |
| - **`init_world(seed, vibe)`** — invent the setting, the opening scene, and the first NPC; write the world-state `.md` and the first character sheet. | |
| - **`direct_turn(state, player_input)`** — the per-turn workhorse. In a single grammar-constrained call it produces both (a) the NPC’s reply (delegating tone to the Voices system prompt) and (b) the **directives** that tell the engine what changed. | |
| - **`compact_memory(state)`** — periodically fold old events into the rolling summary to stay within the context budget. | |
| **I/O contract (per turn).** Input: assembled context (see §4.4). Output: a single JSON object validated against the directive schema (§4.2). Code applies it; the Weaver never writes to state directly. | |
| ### 2.3 The Voices (character manager / actor) | |
| In the default design, the Voices is not a separate LLM call; it is the **actor persona layer** of the same per-turn call. `characters.py` assembles the actor context: for each *present* NPC, its character sheet (traits, voice, goals, current mood, relationship to the player) plus the current scene. The system prompt instructs the model to speak *only* as the addressed/active NPC, in their voice, never breaking character or narrating as the author. | |
| > **Two-call variant (optional).** If you find the single call muddies voice quality, split it: call 1 = Voices produces dialogue (free text); call 2 = Weaver reads the dialogue + player input and emits directives (grammar JSON). This doubles per-turn latency, so only do it if quality demands it. Keep it behind a flag. | |
| ### 2.4 The Painter (image) | |
| See §5. Consumes prompts composed from the world style guide + the scene/character description; returns a backdrop image and/or a character sprite (cut to transparency). | |
| ### 2.5 The Ear (STT) | |
| See §6. Whisper turns recorded audio into the player’s text input. Purely a front door to the loop. | |
| --- | |
| ## 3. The game loop in detail | |
| ### 3.1 Initialisation (once) | |
| 1. Player picks a *vibe* (e.g. “cozy folktale”, “eerie”, “absurd”) and optionally a seed. | |
| 2. `orchestrator.init_world()` → world-state `.md` + first character sheet + an opening line of narration/dialogue + an initial set of directives (scene description, first NPC, requested art). | |
| 3. `painter` paints the opening backdrop and the first sprite (parallelisable). Cache both. | |
| 4. Render the opening scene. | |
| ### 3.2 The turn (loops) | |
| | Step | Module | What happens | | |
| |---|---|---| | |
| | 1 | `stt` (if voice) | Recorded audio → text. Typed input skips this. | | |
| | 2 | `memory` | Assemble context: style guide + rolling summary + current scene + present-character sheets + last *k* turns + the player input. Enforce token budget. | | |
| | 3 | `orchestrator.direct_turn` → `llm.complete_json` | **One** grammar-constrained call → `{ speaker, dialogue, emotion, directives }`. | | |
| | 4 | `state` | Validate + apply directives deterministically: move scene, add/remove NPC, set mood, apply relationship deltas, set flags, mark beat/ending. | |
| | 5 | `memory` | Append the turn; if over budget or every *N* turns, `compact_memory()`. | | |
| | 6 | `painter` (conditional) | If `scene_change` → paint/lookup backdrop. If `new_character` → paint+matte sprite. If only `mood` changed → swap to the (cached or conditioned) mood sprite. | | |
| | 7 | frontend | Render: backdrop, sprite (with mood), speaker name, dialogue. Dialogue streams **first**; images fill in when ready. | | |
| ### 3.3 The directive contract (what the LLM is allowed to change) | |
| Directives are a *closed set* of safe, structured operations — the engine’s “API” that the model calls by emitting JSON (conceptually identical to tool-calling, but enforced by grammar). Closed set ⇒ the model can’t put the game in an undefined state. See §4.2 for the schema and [`PROMPTS.md`](PROMPTS.md) for the GBNF grammar. | |
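To make the "closed set" idea concrete, here is a minimal sketch of how the engine might apply one turn's directives. It assumes plain dicts rather than the Pydantic models of §4.1, and `apply_directives`, `KNOWN_KEYS`, `MAX_ON_STAGE`, and the clamping rules are illustrative names and choices, not the project's actual `state.py`.

```python
from copy import deepcopy

MAX_ON_STAGE = 3  # assumed cap, mirroring "cap ~3" in the Scene sketch
KNOWN_KEYS = {"scene_change", "new_character", "exit_character",
              "relationship_delta", "set_flags", "advance_beat", "ending"}
BEATS = ["opening", "rising", "turn", "resolution", "ended"]


def apply_directives(state: dict, speaker: str, directives: dict) -> dict:
    """Return a new state with the directives applied; unknown keys are ignored."""
    new = deepcopy(state)
    d = {k: v for k, v in directives.items() if k in KNOWN_KEYS}

    if d.get("scene_change"):
        new["scene"].update(d["scene_change"])
    char = d.get("new_character")
    if char and len(new["scene"]["present"]) < MAX_ON_STAGE:
        new["characters"][char["id"]] = {"relationship": 0, **char}
        new["scene"]["present"].append(char["id"])
    leaving = d.get("exit_character")
    if leaving in new["scene"]["present"]:
        new["scene"]["present"].remove(leaving)
    if speaker in new["characters"]:
        rel = new["characters"][speaker]["relationship"]
        rel += int(d.get("relationship_delta") or 0)
        new["characters"][speaker]["relationship"] = max(-100, min(100, rel))
    new["flags"].update(d.get("set_flags") or {})
    if d.get("advance_beat") and new["beat"] != "ended":
        new["beat"] = BEATS[BEATS.index(new["beat"]) + 1]
    if d.get("ending"):
        new["beat"] = "ended"
    new["turn_index"] += 1
    return new
```

Because the function only reads keys from a fixed set and clamps every numeric change, a malformed or over-eager model output can at worst produce a no-op turn.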
| --- | |
| ## 4. State & memory | |
| ### 4.1 `GameState` (source of truth) | |
| A Pydantic model held in memory for the session and mirrored to `.md`. Sketch: | |
| ```python | |
| from pydantic import BaseModel | |
| class Turn(BaseModel):  # assumed shape; the sketch references Turn without defining it | |
|     speaker: str | |
|     text: str | |
| class Character(BaseModel): | |
|     id: str | |
|     name: str | |
|     one_line: str  # "a nervous lantern-moth who collects apologies" | |
|     traits: list[str] | |
|     voice: str  # how they speak: rhythm, vocabulary, tics | |
|     goals: str | |
|     appearance: str  # the STABLE description used for every sprite of them | |
|     mood: str = "neutral"  # drives sprite variant | |
|     relationship: int = 0  # -100..100 toward the player | |
|     sprite_seed: int  # pinned for visual consistency | |
|     known_facts: list[str] = []  # what THIS character knows (avoids omniscience) | |
| class Scene(BaseModel): | |
|     id: str | |
|     place: str | |
|     description: str  # used for the backdrop prompt | |
|     mood: str | |
|     present: list[str]  # character ids on stage (cap ~3) | |
|     backdrop_seed: int | |
| class GameState(BaseModel): | |
|     seed: int | |
|     style_guide: str  # global art + tone bible (set at init, mostly frozen) | |
|     vibe: str | |
|     scene: Scene | |
|     characters: dict[str, Character] | |
|     summary: str = ""  # rolling compressed history | |
|     recent_turns: list[Turn] = []  # last k verbatim turns | |
|     flags: dict[str, str] = {}  # arbitrary world facts the Weaver sets | |
|     beat: str = "opening"  # opening | rising | turn | resolution | ended | |
|     turn_index: int = 0 | |
| ``` | |
| ### 4.2 Directive schema (the per-turn LLM output) | |
| ```jsonc | |
| { | |
| "speaker": "lantern_moth", // which present character speaks (or "narrator") | |
| "dialogue": "string", // their line, in voice | |
| "emotion": "string", // free-form mood word (e.g. "curious", "tender") → sprite mood | |
| "directives": { | |
| "scene_change": null, // or { "place": "...", "description": "...", "mood": "..." } | |
| "new_character": null, // or a partial Character (id,name,one_line,appearance,voice,traits,goals) | |
| "exit_character": null, // or character id leaving the stage | |
| "relationship_delta": 0, // toward the player, applied to the speaker | |
| "set_flags": {}, // e.g. {"gave_player_the_key":"true"} | |
| "advance_beat": false, // nudge pacing toward resolution | |
| "ending": null // or { "kind":"warm|bittersweet|strange", "text":"..." } | |
| } | |
| } | |
| ``` | |
| `emotion` is a free-form string (the LLM picks whatever fits the moment). Every nested object is optional/nullable so simple turns stay tiny. `complete_json` enforces structure on the llama.cpp path via GBNF; on the `transformers` path it uses prompt-based JSON extraction with 3-attempt retry. | |
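The prompt-based fallback on the `transformers` path could look roughly like the sketch below. It is an assumption about how extraction and retry might be done, not the project's real `complete_json`: `extract_json`, `complete_json_with_retry`, and the `generate` callable are hypothetical names, and the only validation shown is a check for the four top-level keys of the schema above.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"speaker", "dialogue", "emotion", "directives"}


def extract_json(text: str) -> dict:
    """Parse the JSON object that starts at the first '{' in free-form model output."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object in output")
    # raw_decode stops at the end of the object, so trailing chatter is tolerated.
    obj, _ = json.JSONDecoder().raw_decode(text[start:])
    if not isinstance(obj, dict) or not REQUIRED_KEYS <= obj.keys():
        raise ValueError("missing required keys")
    return obj


def complete_json_with_retry(generate: Callable[[str], str], prompt: str,
                             attempts: int = 3) -> dict:
    """Call `generate` up to `attempts` times until its output parses."""
    last_error = None
    for _ in range(attempts):
        try:
            return extract_json(generate(prompt))
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            last_error = err
    raise RuntimeError(f"no valid directive JSON after {attempts} attempts") from last_error
```

Passing the model call in as `generate` keeps the retry logic testable without loading any weights.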
| ### 4.3 `.md` as a derived view (not the truth) | |
| Why both a struct *and* markdown? The struct is robust and code-friendly; the `.md` files are (a) human-readable so you can debug/show the “dream-memory” in the demo, (b) a clean, compact way to inject character/world context back into the prompt, and (c) on-theme. `state.py` renders `GameState` → `world_state.md` + one file per character after each turn, and can parse them back on load. The LLM *reads* `.md`; it does not author the canonical copy. Templates: [`../templates/world_state.md`](../templates/world_state.md), [`../templates/character_sheet.md`](../templates/character_sheet.md). | |
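A render/parse round-trip for one character sheet might be sketched as follows. The markdown layout here is an assumption made for illustration (the real templates live in `templates/` and are not reproduced in this document), and `render_character` / `parse_character` are hypothetical helper names.

```python
FIELDS = ["name", "one_line", "voice", "goals", "appearance", "mood"]


def render_character(char: dict) -> str:
    """Render a character dict as a small markdown sheet (assumed layout)."""
    lines = [f"# {char['id']}", ""]
    lines += [f"- **{key}**: {char[key]}" for key in FIELDS]
    return "\n".join(lines) + "\n"


def parse_character(md: str) -> dict:
    """Inverse of render_character for sheets written in the same layout."""
    char = {}
    for line in md.splitlines():
        if line.startswith("# "):
            char["id"] = line[2:].strip()
        elif line.startswith("- **"):
            key, _, value = line[4:].partition("**: ")
            char[key] = value
    return char
```

The round-trip only holds for single-line field values; multi-line fields would need a richer layout.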
| ### 4.4 Context assembly & budget (`memory.py`) | |
| Every turn, build the prompt from, in priority order until the budget fills: | |
| 1. System prompt (Weaver+Voices role) — fixed. | |
| 2. `style_guide` + `vibe` — small, fixed. | |
| 3. `summary` — the rolling compressed history. | |
| 4. Current `Scene` description + the **present** characters’ sheets only (not the whole cast). | |
| 5. The last *k* verbatim turns (e.g. k=4–6). | |
| 6. The player’s new input. | |
| Target a conservative budget (e.g. ≤ ~3–4k tokens of context even if the model supports more) — small models degrade as context grows, and it keeps latency down. When `recent_turns` + history would exceed budget, `compact_memory()` asks the LLM to rewrite older turns into 3–5 sentences appended to `summary`, then drops them. This is the literal “Thousand Token Wood” — a small working memory by design. | |
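The assembly step can be sketched as below, under two stated assumptions: token counts are estimated at roughly four characters per token instead of using the model's tokenizer, and the player's input is always reserved space even though it comes last in the list. `estimate_tokens` and `assemble_context` are illustrative names.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def assemble_context(system: str, style: str, summary: str, scene: str,
                     recent_turns: list[str], player_input: str,
                     budget: int = 3500) -> str:
    """Fixed parts always go in; recent turns are added newest-first until the budget fills."""
    fixed = [system, style, summary, scene]
    used = sum(estimate_tokens(p) for p in fixed) + estimate_tokens(player_input)
    kept: list[str] = []
    for turn in reversed(recent_turns):
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return "\n\n".join(p for p in fixed + kept + [player_input] if p)
```

Turns that do not fit are the natural candidates to hand to `compact_memory()` before they are dropped.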
| --- | |
| ## 5. The Painter (image pipeline) | |
| ### 5.1 Model choice & rationale | |
| The hard problems are **latency** (must be a few seconds, not 30) and **character consistency** (the same NPC must look the same across scenes). Recommended options, in order: | |
| | Option | Params | Steps | Why | Watch out | | |
| |---|---|---|---|---| | |
| | **SDXL-Turbo / SDXL-Lightning** (default) | ~3.5B | 1–4 | Mature, permissive, huge **illustration/anime LoRA** ecosystem (perfect VN look), runs on 8–16 GB, very fast. Consistency via pinned seed + fixed appearance string + generate-once. | Weaker prompt adherence than newer models; no built-in image conditioning. | | |
| | **FLUX.2 Klein** (upgrade) | ~4B | few | Distilled from FLUX.2; **“Kontext” image-conditioning** edits an existing image (“same character, now smiling”, “same scene at dusk”) — *solves consistency directly* and gives mood sprites for free. Fits ~16 GB. | **Check the license** before shipping publicly; slightly heavier. | | |
| | **Z-Image-Turbo** | ~6B (fits 13–16 GB VRAM) | few (turbo-distilled) | Apache-2.0, fast. Good permissive middle ground. | Smaller ecosystem than SDXL. | |
| Pick **one** art style and bake it into `style_guide` (e.g. “soft watercolor storybook”, “muted ukiyo-e”, “90s anime VN”) so everything coheres. A style LoRA on SDXL is the cheapest way to a distinctive look (also nudges the **Off-Brand** vibe). | |
| ### 5.2 Consistency strategy (the crux) | |
| - **Generate each character’s base sprite exactly once**, at introduction, and cache it. **Never re-paint an existing character** to “refresh” — that’s what breaks consistency. | |
| - Store a **pinned `sprite_seed`** and the **stable `appearance` string** in the character record; reuse both for any later generation of that character. | |
| - For **moods/expressions**: either (a) pre-generate a small set (neutral/happy/sad/surprised) at intro and swap, or (b) if using FLUX.2 Klein, *condition on the base sprite* to edit only the expression. Option (b) is cleaner and cheaper at runtime. | |
| - For **backdrops**: cache by `scene_id`; revisiting a place reuses its image. | |
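The generate-once rule and the pre-generated mood set can be combined in a small cache, sketched here. The `paint` callable stands in for the real diffusion call, and the emotion-to-bucket word lists are invented for illustration; since `emotion` is free-form, any real mapping would need tuning.

```python
from typing import Callable

# Assumed word lists: free-form emotions collapse onto four pre-generated moods.
MOOD_BUCKETS = {
    "happy": {"happy", "joyful", "tender", "amused", "warm"},
    "sad": {"sad", "mournful", "wistful", "lonely"},
    "surprised": {"surprised", "startled", "curious", "alarmed"},
}


def mood_bucket(emotion: str) -> str:
    """Map a free-form emotion word onto one of the pre-generated sprite moods."""
    word = emotion.strip().lower()
    for bucket, words in MOOD_BUCKETS.items():
        if word in words:
            return bucket
    return "neutral"


class SpriteCache:
    """Paint each (character, mood) sprite at most once and reuse it afterwards."""

    def __init__(self, paint: Callable[[str, int], bytes]):
        self._paint = paint  # stand-in for a diffusion call: (prompt, seed) -> image bytes
        self._sprites: dict[tuple[str, str], bytes] = {}

    def get(self, char_id: str, appearance: str, sprite_seed: int, emotion: str) -> bytes:
        key = (char_id, mood_bucket(emotion))
        if key not in self._sprites:
            # Same appearance string and same pinned seed for every variant.
            prompt = f"{appearance}, {key[1]} expression"
            self._sprites[key] = self._paint(prompt, sprite_seed)
        return self._sprites[key]
```

Keying on `(char_id, mood)` means an unfamiliar emotion word falls back to the neutral sprite rather than triggering a fresh, inconsistent paint.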
| ### 5.3 Sprites vs full-scene | |
| Two valid looks: | |
| 1. **Layered VN (classic):** transparent character sprite over a separate backdrop. Diffusion won’t emit clean alpha, so generate the character on a plain background and run **BiRefNet** (the same matting model in the `gradio.Server` demo) to cut it out → transparent PNG. More moving parts, but the iconic VN look + lets you reuse one backdrop with different sprites. | |
| 2. **Full-scene (simpler):** generate the character *in* the scene as one image. No compositing, no matting — fewer failure modes, but you can’t cheaply swap sprite vs backdrop independently. | |
| Recommend starting **full-scene** in Phase 1 (fast to working), then moving to **layered** in Phase 2 for polish if time allows. | |
| ### 5.4 Latency tactics | |
| - 1–4 step model only; 512–768px is plenty for a backdrop behind text. | |
| - **Text first, image second** — the dialogue renders immediately; the UI swaps the image in when the Painter returns (SSE/streaming via the Gradio JS client). | |
| - **Speculative paint:** while the player reads/types, optionally pre-paint the most likely next backdrop. | |
| - All Painter calls behind `@spaces.GPU`; keep the pipeline warm (load once at startup). | |
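The text-first ordering can be illustrated with a small `asyncio` sketch. Everything here is a stand-in: `direct_turn` and `paint` are injected coroutines rather than the real modules, and `emit` represents whatever pushes events to the frontend.

```python
import asyncio
from typing import Awaitable, Callable


async def run_turn(direct_turn: Callable[[str], Awaitable[dict]],
                   paint: Callable[[str], Awaitable[str]],
                   player_input: str,
                   emit: Callable[[str, object], None]) -> None:
    """Emit dialogue as soon as the LLM returns, then the image once it is painted."""
    result = await direct_turn(player_input)
    emit("dialogue", result["dialogue"])  # text reaches the UI first
    scene = result["directives"].get("scene_change")
    if scene:  # most turns skip painting entirely
        image_url = await paint(scene["description"])
        emit("backdrop", image_url)
```

The player starts reading while the Painter is still working, which is what keeps the felt latency close to the LLM call alone.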
| --- | |
| ## 6. The Ear (audio / STT) | |
| - **Model:** `faster-whisper` (CTranslate2) `small` for the laptop config, `large-v3-turbo` if you have room — player inputs are short, so `small`/`base` transcribe near-instantly. For the *all-ggml local-first* story, `whisper.cpp` pairs thematically with the llama.cpp bonus. | |
| - **Capture:** in the custom frontend, use the browser `MediaRecorder` API → send the blob to an `@app.api(name="transcribe")` endpoint that runs Whisper. (In the `gr.Blocks` MVP, use `gr.Audio(sources=["microphone"])`.) | |
| - **Flow:** transcript is shown to the player (so they can confirm/edit) then fed into the turn exactly like typed input. Keep voice optional; typing is the reliable fallback for the demo. | |
| --- | |
| ## 7. Deployment | |
| ### 7.1 Local (where the bonuses live) | |
| Run `python app.py`; the LLM via **`llama-cpp-python`** (GGUF, GPU build), diffusion via `diffusers`, Whisper via `faster-whisper`. This is the configuration you record for the **Off-the-Grid (local-first)** and **Llama-Champion (llama.cpp)** badges — show it running with the network off. | |
| ### 7.2 Hugging Face Space (the required canvas) | |
| The app is a Gradio app (`gradio.Server` *extends* Gradio/FastAPI), so it deploys as a normal Space. For usable image latency, use a **GPU Space**; **ZeroGPU** is free and integrates via the `@spaces.GPU` decorator (functions get a GPU per-call, allocated on demand). Load models at startup; decorate every inference function. | |
| **The llama.cpp-on-ZeroGPU tension (decide early).** A CUDA `llama-cpp-python` build can be awkward on Spaces (CUDA runtime mismatches), and ZeroGPU’s per-call GPU model doesn’t love long-lived native processes. Two clean resolutions: | |
| - **A (single stack):** get a working CUDA `llama-cpp-python` wheel/build for the Space (keeps llama.cpp everywhere). Test this on day 1, not at the deadline. | |
| - **B (split stack, recommended for safety):** llama.cpp **locally** (claims the llama.cpp badge in the video) + a `transformers` code path **on the Space** under `@spaces.GPU`. Same `llm.py` interface, two backends behind a flag. | |
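Option B's "two backends behind a flag" might be wired up along these lines. This is a sketch only: the class names, the `LLM_BACKEND` environment variable, and the registry are hypothetical, and the backend bodies are placeholders where the real `llama-cpp-python` and `transformers` calls would go.

```python
import os
from typing import Callable, Optional, Protocol


class LLMBackend(Protocol):
    def complete(self, prompt: str) -> str: ...


class LlamaCppBackend:
    """Placeholder for the local GGUF path; the real class would wrap llama-cpp-python."""

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wire up llama-cpp-python here")


class TransformersBackend:
    """Placeholder for the Space path; the real class would wrap transformers."""

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wire up transformers here")


BACKENDS: dict[str, Callable[[], LLMBackend]] = {
    "llama_cpp": LlamaCppBackend,
    "transformers": TransformersBackend,
}


def get_backend(name: Optional[str] = None) -> LLMBackend:
    """Pick a backend by explicit name, else by the (hypothetical) LLM_BACKEND env var."""
    name = name or os.environ.get("LLM_BACKEND", "llama_cpp")
    if name not in BACKENDS:
        raise ValueError(f"unknown LLM backend: {name!r}")
    return BACKENDS[name]()
```

The rest of the code would depend only on `LLMBackend.complete`, so switching between the local and Space configurations is a one-variable change.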
| ### 7.3 Persistence | |
| The Space filesystem is **ephemeral**. The `.md` dream-memory therefore lives **per session** (perfectly fine for a one-sitting game). If you want playthroughs to survive restarts, write them to **HF persistent storage** or push traces to a **Dataset** (which also feeds the **Open-Trace** badge — see §9.3). | |
| ### 7.4 `gradio.Server` specifics | |
| - `app = Server()` (a FastAPI subclass). `@app.get("/")` serves `frontend/index.html`. `@app.api(name="...")` defines queued, ZeroGPU-aware, `gradio_client`-callable endpoints (use these for `direct_turn`, `paint`, `transcribe`). `app.launch()` to run. | |
| - The frontend talks to the backend through the **Gradio JS client** (`Client.connect(window.location.origin)` → `client.predict("/direct_turn", {...})`) so calls go through Gradio’s queue/concurrency (and you can show progress), **not** raw `fetch`. | |
| --- | |
| ## 8. UI architecture (Off-Brand badge) | |
| ### 8.1 Target: a real visual-novel frame | |
| A custom `frontend/index.html` (vanilla HTML/CSS/JS — no build step needed; the hackathon’s own `gradio.Server` demo ships a ~1300-line single file) rendering three CSS-stacked layers: | |
| ``` | |
| z-0 backdrop (full-bleed CSS background-image) | |
| z-1 sprite (transparent PNG, positioned, with a gentle entrance + mood cross-fade) | |
| z-2 dialogue (named speaker box at the bottom, text typed in character-by-character) | |
| + input row: text field, 🎙️ mic button, "wait/continue" affordance | |
| ``` | |
| Niceties that sell delight cheaply: typewriter text reveal, soft sprite slide-in, backdrop cross-fade on scene change, a subtle paper/parchment vignette, a loading shimmer while the Painter works (“the wood is dreaming…”). Keep a small palette + one display font to look intentional. | |
| ### 8.2 MVP fallback | |
| Phase 0/1 can use plain `gr.Blocks`: `gr.Image` for the scene, `gr.Chatbot`/`gr.Markdown` for dialogue, `gr.Textbox` + `gr.Audio(sources=["microphone"])` for input. Gate it behind `GRADIO_MVP_UI=1`. This de-risks the loop before you invest in the custom frontend. | |
| ### 8.3 Wiring | |
| Frontend ⇄ backend via the Gradio JS client to `@app.api` endpoints (`/start`, `/direct_turn`, `/transcribe`). The backend returns `{ speaker, dialogue, emotion, scene_image_url, sprite_image_url }`; the frontend animates the rest. | |
| --- | |
| ## 9. Optional badge tracks | |
| ### 9.1 Well-Tuned (fine-tune) 🟡 | |
| Fine-tune the shared LLM (LoRA on Qwen3-4B/8B) to (a) reliably emit the directive schema and (b) carry the “dreaming wood” narrative voice. Dataset: ~200–800 synthetic `(context → directive-JSON + dialogue)` examples — bootstrap them with a larger model or hand-write seeds, then expand. Train with the standard PEFT/LoRA stack; merge or keep the adapter; convert to **GGUF** and publish on the Hub (this is what the badge checks). Even a small fine-tune that locks the output format + tone is a real win and reduces grammar-fighting at runtime. Time-box it — the MVP must not depend on it. | |
| ### 9.2 Field Notes (blog) 🟡 | |
| A short write-up: the diegetic conceit, the one-call director pattern, grammar-constrained directives, the consistency-via-cache trick, and the llama.cpp/ZeroGPU lesson. Cheap points, and genuinely useful to others. | |
| ### 9.3 Open Trace 🟡 | |
| Log every orchestration step per turn — the assembled context, the raw directive JSON, the Painter prompts/seeds — to `runs/*.jsonl`, then push a cleaned sample as a **Hub dataset**. The struct-based state makes these traces clean and shareable. | |
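A per-turn trace writer could be as small as the sketch below. The record fields and the one-file-per-session naming are assumptions chosen to match the items listed above; `log_turn` is a hypothetical helper.

```python
import json
import time
from pathlib import Path


def log_turn(run_dir: Path, session_id: str, turn_index: int,
             context: str, raw_output: str, painter_calls: list[dict]) -> Path:
    """Append one turn record as a single JSON line to <run_dir>/<session_id>.jsonl."""
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / f"{session_id}.jsonl"
    record = {
        "ts": time.time(),
        "turn_index": turn_index,
        "context": context,
        "raw_output": raw_output,
        "painter_calls": painter_calls,  # e.g. [{"prompt": "...", "seed": 1234}]
    }
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

Appending one line per turn keeps the file readable while a session is still running.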
| --- | |
| ## 10. Phased plan | |
| Build window **June 5–15**. (Register by **June 3**; sketch + download weights before the 5th.) Days are indicative for a two-person team. | |
| ### Phase 0 — Skeleton (Day 1–2) | |
| - `state.py`: `GameState` + directive schema; `.md` render/parse round-trip with a unit test. | |
| - `llm.py`: `complete()` + `complete_json()`, using the **GBNF grammar** on the llama.cpp path and prompt-based JSON extraction with retry on the `transformers` path (§4.2). Start with whichever backend you get running first. | |
| - `orchestrator.direct_turn()` returning valid directive JSON for a hard-coded scene. | |
| - Text-only loop in `gr.Blocks`. **Milestone:** type a line → NPC replies → state updates, looping, no images. | |
| ### Phase 1 — The wood breathes (Day 3–5) | |
| - `painter.py`: full-scene generation with SDXL-Turbo; disk cache; pinned seeds. | |
| - `orchestrator.init_world()`: generate setting + first NPC + opening. | |
| - Wire scene/character directives → Painter. **Milestone:** a playable illustrated loop, basic VN layout in `gr.Blocks`. | |
| ### Phase 2 — Voice + polish (Day 6–8) | |
| - `stt.py` + mic input. | |
| - Migrate to `gradio.Server` + custom `frontend/index.html` (layered scene, typewriter, animations). | |
| - `memory.compact_memory()`; sprite moods (pre-gen set or FLUX.2-Klein conditioning); optional BiRefNet layered sprites. **Milestone:** speak to a self-painting wood in a bespoke UI. | |
| ### Phase 3 — Bonuses + ship (Day 9–10) | |
| - Lock the llama.cpp local config; resolve the Space backend (A or B from §7.2). | |
| - (Optional) fine-tune + GGUF on the Hub; (optional) trace dataset. | |
| - **Record the demo video + write the social post** (+ optional blog). Deploy the Space, test cold-start, freeze. **Milestone:** submitted. | |
| > Cut lines if time runs short, in this order: fine-tune → layered sprites → voice → custom frontend. The loop + a custom-ish UI + working art is a complete, competitive entry. | |
| --- | |
| ## 11. Risks & mitigations | |
| | Risk | Likelihood | Impact | Mitigation | | |
| |---|---|---|---| | |
| | Image latency makes it feel sluggish | High | High | 1–4-step model; text-first; cache; speculative paint; 512–768px. | | |
| | Character looks different each scene | High | High | Generate-once + cache; pinned seed + fixed appearance; FLUX.2-Klein conditioning for moods. | | |
| | llama.cpp won’t build on ZeroGPU | Med | Med | Decide §7.2 on day 1; keep `transformers` fallback behind a flag; claim llama.cpp badge locally. | | |
| | Small model drifts / over-narrates | Med | Med | Grammar-enforced directives; low temp on structure; rolling-summary memory; lean on the diegetic framing. | | |
| | Scope creep (combat, save slots, 5 NPCs) | Med | High | Honor the “out of scope” list in `CLAUDE.md`; ship the loop. | | |
| | Weights/licensing surprise (esp. FLUX.2) | Low | Med | Verify licenses before shipping; default to SDXL-Turbo/Z-Image (permissive). | | |
| | Demo video/social post left to the last hour | Med | High | They’re *required*. Schedule them in Phase 3, not after. | | |
| --- | |
| ## 12. Latency budget (per turn, target) | |
| | Stage | Target | Notes | | |
| |---|---|---| | |
| | STT (if voice) | < 1 s | short utterances, `small`/turbo | | |
| | Context assembly | negligible | pure Python | | |
| | LLM directive call | 1–4 s | ~8B Q4, ≤4k context, one call | | |
| | Apply + memory | negligible | | | |
| | Image (only when changed) | 1–5 s | 1–4-step model; *hidden behind text* and often cached | | |
| | **Felt latency** | **dialogue in ~2–4 s**, image fills in after | most turns don’t change the scene | | |
| --- | |
| ## 13. Decisions to lock before Day 1 | |
| 1. **Project name** (Thousand Token Wood is a placeholder). | |
| 2. **Art style** for the `style_guide` (and whether to use a style LoRA). | |
| 3. **Image model**: SDXL-Turbo (safe) vs FLUX.2 Klein (consistency, check license). | |
| 4. **Sprites**: full-scene first vs layered+BiRefNet. | |
| 5. **Space backend**: §7.2 option A vs B. | |
| 6. **Whisper size** (small vs turbo) and whether voice ships in v1. | |
| 7. **Fine-tune**: in or out (time-box). | |