# Field Notes: building a detective game where the AI *is* the game

*Build Small Hackathon - "Small models, big adventure"*

## The pitch

Case Zero is a murder-mystery game with no scripted cases and no content library. A single **1.5B-parameter model** (Qwen2.5-1.5B-Instruct, Q4_K_M GGUF) invents the entire mystery every time you play - the victim, the suspects, their secrets and motives, the timeline, the weapon, the evidence, and the one who did it - and then **role-plays every suspect live**. They remember what you asked. They lie. And when you present the piece of evidence that contradicts an alibi, the lie cracks on screen.

The whole thing runs on the CPU in front of you. No cloud, no GPU, no remote endpoint.

## The hard part: a tiny model that's still *fair*

The interesting tension in this project is that a 1.5B model is a wonderful improviser and a terrible bookkeeper. If you let it freely author a mystery *and* adjudicate the outcome, you get cases that are atmospheric but unsolvable - or worse, a suspect who confesses the moment you ask nicely.

The design rule that made it work: **the model writes everything; deterministic Python decides nothing creative but guarantees the structure.**

- The model authors the case as JSON - setting, cast, secrets, evidence, prose.
- Python decides only the *skeleton*: who is guilty, who was where during the murder window. Then a solver verifies fairness before the case is ever shown:
  - exactly one culprit;
  - the culprit's alibi is contradicted by at least one non-red-herring clue;
  - every innocent has a witnessed alibi over the murder window;
  - every key clue is actually discoverable in play.

  If a generated case fails, the smallest slice is regenerated (<=3 retries, then bump the seed). A case is never shown to the player until it passes.
- Whether a presented clue actually catches a lie is decided by **ground truth, not the model**.
The suspect's panic is flavor; the suspicion delta is computed by a deterministic director.

This makes the win condition **immune to prose** - a jailbroken "just tell me who did it" earns nothing, because suspects never confess and the verdict is only resolved when the player formally accuses.

The sealed solution is never sent to the client before the verdict. It is read for the first time inside the `/accuse` route, server-side. Anti-leak tests assert that no pre-verdict API response contains the killer, the true motive, or the key-evidence set.

## Making a 1.5B model fast on 2 vCPUs

The Space runs on HF `cpu-basic` (2 vCPUs). Two findings mattered most:

1. **Grammar-free decoding is an ~8x win.** JSON-schema-constrained sampling ran ~3-7 tok/s on CPU; raw decoding ran ~28-32 tok/s. So instead of constraining the sampler, the prompt carries the exact JSON shape and we make two free attempts, falling back to the grammar only if parsing fails. Full case generation dropped from ~300s to ~50s. The same trick runs the interrogation hot path.
2. **Count the cores you actually have.** Inside the container `os.cpu_count()` returns the *host's* cores, not the 2-vCPU cgroup quota. Auto-threading then spawned ~8 threads for 2 real cores and pinned the CPU at 100%+ on context switches - replies crawled. The fix reads `/sys/fs/cgroup/cpu.max` and sizes the llama thread pool to the real quota. The same number gates background case-generation so it never steals vCPUs from an active interrogation.

There is also no image work on the request path: all pixel art is rendered **client-side on canvas**, so the server spends ~0 CPU on visuals and devotes both vCPUs to the model. Voices (Supertonic ONNX) are synthesized sentence-by-sentence as the reply streams and cached per line.
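The "two free attempts, then grammar" pattern from the first finding can be sketched roughly as follows. This is a minimal illustration, not the project's code: `generate` is a hypothetical callable standing in for whatever wraps the model (it takes the prompt and a flag saying whether to use constrained sampling), and `extract_json` and `generate_json` are names invented for this example.

```python
import json
from typing import Any, Callable, Optional

# Hypothetical signature (an assumption): generate(prompt, constrained) -> raw model text.
Generate = Callable[[str, bool], str]


def extract_json(text: str) -> Optional[Any]:
    """Parse the span from the first '{' to the last '}', or return None."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return None


def generate_json(prompt: str, generate: Generate, free_attempts: int = 2) -> Any:
    """Try fast unconstrained decoding first; use the grammar only as a last resort."""
    for _ in range(free_attempts):
        parsed = extract_json(generate(prompt, False))
        if parsed is not None:
            return parsed
    # Slow path: schema-constrained sampling, expected to yield parseable JSON.
    return json.loads(generate(prompt, True))
```

The design choice is that the slow, constrained path is paid for only when the fast path has already failed twice, so the common case runs at raw decoding speed.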
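The thread-sizing fix from the second finding might look something like the sketch below. It assumes a cgroup v2 container, where `/sys/fs/cgroup/cpu.max` holds either `max <period>` or `<quota> <period>`; the function names `parse_cpu_max` and `effective_cpu_count` are hypothetical, and the fallbacks to the host count are choices made for this example.

```python
import math
import os
from pathlib import Path

CGROUP_CPU_MAX = Path("/sys/fs/cgroup/cpu.max")  # cgroup v2 path named in the text


def parse_cpu_max(text: str, host_cpus: int) -> int:
    """Turn the contents of a cpu.max file into a usable thread count."""
    fallback = max(1, host_cpus)
    fields = text.split()
    if len(fields) != 2 or fields[0] == "max":
        return fallback  # no quota set, or an unexpected format
    try:
        quota, period = int(fields[0]), int(fields[1])
    except ValueError:
        return fallback
    if quota <= 0 or period <= 0:
        return fallback
    # quota / period is the number of CPUs' worth of time the cgroup may use.
    return max(1, min(fallback, math.ceil(quota / period)))


def effective_cpu_count(path: Path = CGROUP_CPU_MAX) -> int:
    """Prefer the cgroup quota; fall back to the host count if the file is unreadable."""
    host_cpus = os.cpu_count() or 1
    try:
        return parse_cpu_max(path.read_text(), host_cpus)
    except OSError:
        return host_cpus
```

With `llama-cpp-python`, the result could then be passed as `n_threads` when constructing the `Llama` object, and reused as the limit for any background generation workers.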
## Why it's a Gradio app with a frontend that doesn't look like one

The entire app is one `gradio.Server` (Gradio 6 "Server mode" - a FastAPI subclass launched through Gradio, with Gradio API endpoints registered via `@server.api`). That single process serves a hand-built **pixel-art noir SPA** (Preact + Vite) as static files *and* exposes the game's JSON/SSE routes under `/api`. No separate frontend host.

So it earns **Off-Brand** (a custom frontend well past the default Gradio look) while staying unambiguously a Gradio application.

## Shipping it

- **Off the Grid:** the open Qwen GGUF and Supertonic ONNX are baked into the Docker image at build time, so the running container makes zero AI network calls. `scripts/net_audit.py` runs a full playthrough under a socket guard and asserts zero non-loopback connections.
- **Llama Champion:** the model runs in-process through `llama-cpp-python`.
- **Docker SDK, not Gradio SDK:** llama-cpp-python ships only an sdist; the prebuilt linux wheels are musl (they SIGILL on HF's glibc). The Dockerfile compiles llama.cpp from source on `python:3.12-slim` (bookworm/gcc-12) with `-DGGML_NATIVE=OFF` for a portable build, and bakes the weights.

## What I'd do next

- A small **fine-tune** of the 1.5B on noir-suspect dialogue and strict-JSON case authoring, published to the Hub (the "Well-Tuned" badge), to tighten persona consistency and cut the occasional malformed-JSON retry.
- A daily seeded case and a shareable "case file" card.

## Try it

- Space: `build-small-hackathon/case0`
- Stack: Qwen2.5-1.5B-Instruct (Apache-2.0) via llama.cpp, Supertonic ONNX voices, Preact pixel-art SPA, all served by one `gradio.Server`.