Case Zero - initial public release (fully local: Qwen2.5-1.5B via llama.cpp + Supertonic, custom pixel-noir SPA via gradio.Server)
| # Field Notes: building a detective game where the AI *is* the game | |
| *Build Small Hackathon - "Small models, big adventure"* | |
| ## The pitch | |
| Case Zero is a murder-mystery game with no scripted cases and no content library. A single | |
| **1.5B-parameter model** (Qwen2.5-1.5B-Instruct, Q4_K_M GGUF) invents the entire mystery | |
| every time you play - the victim, the suspects, their secrets and motives, the timeline, the | |
| weapon, the evidence, and the one who did it - and then **role-plays every suspect live**. | |
| They remember what you asked. They lie. And when you present the piece of evidence that | |
| contradicts an alibi, the lie cracks on screen. | |
| The whole thing runs on the CPU in front of you. No cloud, no GPU, no remote endpoint. | |
| ## The hard part: a tiny model that's still *fair* | |
| The interesting tension in this project is that a 1.5B model is a wonderful improviser and a | |
| terrible bookkeeper. If you let it freely author a mystery *and* adjudicate the outcome, | |
| you get cases that are atmospheric but unsolvable - or worse, a suspect who confesses the | |
| moment you ask nicely. | |
| The design rule that made it work: **the model writes everything; deterministic Python | |
| decides nothing creative but guarantees the structure.** | |
| - The model authors the case as JSON - setting, cast, secrets, evidence, prose. | |
| - Python decides only the *skeleton*: who is guilty, who was where during the murder window. | |
| Then a solver verifies fairness before the case is ever shown: | |
| - exactly one culprit; | |
| - the culprit's alibi is contradicted by at least one non-red-herring clue; | |
| - every innocent has a witnessed alibi over the murder window; | |
| - every key clue is actually discoverable in play. | |
| If a generated case fails, only the smallest failing slice is regenerated (up to 3 retries, | |
| then the seed is bumped). A case is never shown to the player until it passes. | |
| - Whether a presented clue actually catches a lie is decided by **ground truth, not the | |
| model**. The suspect's panic is flavor; the suspicion delta is computed by a deterministic | |
| director. This makes the win condition **immune to prose** - a jailbroken "just tell me | |
| who did it" earns nothing, because suspects never confess and the verdict is only resolved | |
| when the player formally accuses. | |
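The four fairness checks listed above can be sketched roughly as follows. This is a minimal illustration, not the project's solver: the `Case`, `Suspect`, and `Clue` dataclasses and all of their field names are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Clue:
    clue_id: str
    contradicts_alibi_of: Optional[str] = None  # suspect name (hypothetical field)
    red_herring: bool = False
    key: bool = False
    discoverable: bool = True


@dataclass
class Suspect:
    name: str
    guilty: bool = False
    alibi_witnessed: bool = False  # witnessed across the whole murder window


@dataclass
class Case:
    suspects: list
    clues: list = field(default_factory=list)


def fairness_violations(case: Case) -> list:
    """Return every violated fairness rule; an empty list means the case is fair."""
    problems = []
    culprits = [s for s in case.suspects if s.guilty]
    if len(culprits) != 1:
        problems.append("need exactly one culprit")
    for culprit in culprits:
        # At least one genuine (non-red-herring) clue must break the culprit's alibi.
        if not any(
            c.contradicts_alibi_of == culprit.name and not c.red_herring
            for c in case.clues
        ):
            problems.append(f"no real clue contradicts {culprit.name}'s alibi")
    for suspect in case.suspects:
        if not suspect.guilty and not suspect.alibi_witnessed:
            problems.append(f"{suspect.name} is innocent but has no witnessed alibi")
    for clue in case.clues:
        if clue.key and not clue.discoverable:
            problems.append(f"key clue {clue.clue_id} cannot be discovered in play")
    return problems
```

Returning the list of violations, rather than a single boolean, is one way to tell the caller which slice of the case to regenerate.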
| The sealed solution is never sent to the client before the verdict. It is read for the first | |
| time inside the `/accuse` route, server-side. Anti-leak tests assert that no pre-verdict API | |
| response contains the killer, the true motive, or the key-evidence set. | |
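An anti-leak assertion of this kind might look like the sketch below. It assumes, purely for illustration, that the sealed solution lives under field names such as `killer` or `true_motive`; those names, `FORBIDDEN_KEYS`, and `find_leaks` are hypothetical and not taken from the project. The check looks for forbidden keys and sealed text rather than the killer's name, because the killer's name legitimately appears in the suspect list.

```python
# Hypothetical names for fields that must never appear before the verdict.
FORBIDDEN_KEYS = {"killer", "guilty", "true_motive", "key_evidence"}


def find_leaks(payload, sealed_texts=(), path="$"):
    """Walk a JSON-like response and return the paths of anything that leaks.

    A leak is either a forbidden key anywhere in the structure, or a string
    value that contains one of the sealed texts (for example the true motive).
    """
    leaks = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            child = f"{path}.{key}"
            if key in FORBIDDEN_KEYS:
                leaks.append(child)
            leaks.extend(find_leaks(value, sealed_texts, child))
    elif isinstance(payload, list):
        for index, value in enumerate(payload):
            leaks.extend(find_leaks(value, sealed_texts, f"{path}[{index}]"))
    elif isinstance(payload, str):
        lowered = payload.lower()
        if any(text.lower() in lowered for text in sealed_texts):
            leaks.append(path)
    return leaks
```

A test would then call every pre-verdict route, pass each decoded response through `find_leaks`, and assert that the result is empty.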
| ## Making a 1.5B model fast on 2 vCPUs | |
| The Space runs on HF `cpu-basic` (2 vCPUs). Two findings mattered most: | |
| 1. **Grammar-free decoding is an ~8x win.** JSON-schema-constrained sampling ran at ~3-7 tok/s | |
| on CPU; raw decoding ran at ~28-32 tok/s. So instead of constraining the sampler, the prompt | |
| carries the exact JSON shape and we make two unconstrained attempts, falling back to the | |
| grammar only if parsing fails. Full case generation dropped from ~300s to ~50s. The same | |
| trick runs the interrogation hot path. | |
| 2. **Count the cores you actually have.** Inside the container `os.cpu_count()` returns the | |
| *host's* core count, not the 2-vCPU cgroup quota. Auto-threading then spawned ~8 threads for 2 | |
| real cores and pinned the CPU at 100%+, mostly on context switches, so replies crawled. The fix | |
| reads `/sys/fs/cgroup/cpu.max` (the cgroup v2 quota file) and sizes the llama thread pool to the real quota. The same | |
| number gates background case-generation so it never steals vCPUs from an active | |
| interrogation. | |
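Both findings above fit in a short sketch. The function names here are hypothetical, and `free_decode` / `grammar_decode` stand in for whatever callables actually wrap the model; the only real interface relied on is the cgroup v2 `cpu.max` file, whose content is `<quota> <period>` or `max <period>`.

```python
import json
import math
import os
from pathlib import Path
from typing import Callable, Optional


def generate_json(free_decode: Callable[[], str],
                  grammar_decode: Callable[[], str],
                  free_attempts: int = 2):
    """Try fast unconstrained decoding first; pay for the grammar only on failure."""
    for _ in range(free_attempts):
        try:
            return json.loads(free_decode())
        except json.JSONDecodeError:
            continue  # malformed output: retry unconstrained
    # Slow path: grammar-constrained output is expected to parse.
    return json.loads(grammar_decode())


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max content into a CPU count, or None if unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def effective_threads(cpu_max_path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Thread count capped by the cgroup quota, falling back to os.cpu_count()."""
    host_cores = os.cpu_count() or 1
    try:
        limit = parse_cpu_max(Path(cpu_max_path).read_text())
    except (OSError, ValueError):
        limit = None  # file missing or unreadable (e.g. not cgroup v2)
    if limit is None:
        return host_cores
    return max(1, min(host_cores, math.ceil(limit)))
```

On a 2-vCPU quota the file typically reads `200000 100000`, so `effective_threads()` yields 2; with llama-cpp-python that value could be passed as the `n_threads` argument of `Llama(...)`.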
| There is also no image work on the request path: all pixel art is rendered **client-side on | |
| canvas**, so the server spends ~0 CPU on visuals and devotes both vCPUs to the model. Voices | |
| (Supertonic ONNX) are synthesized sentence-by-sentence as the reply streams and cached per | |
| line. | |
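Sentence-by-sentence synthesis with a per-line cache could be sketched like this. The `synthesize` callable is a placeholder for the real Supertonic call, whose API is not shown here, and the regex-based sentence splitter is a deliberately naive assumption (it will split on abbreviations such as "Dr.").

```python
import re
from typing import Callable, Iterable, Iterator

# Split after sentence-ending punctuation that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as it appears in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


class VoiceCache:
    """Cache synthesized audio per (voice, line) so repeats cost nothing."""

    def __init__(self, synthesize: Callable[[str, str], bytes]):
        self._synthesize = synthesize  # hypothetical TTS callable
        self._cache = {}

    def audio_for(self, voice: str, line: str) -> bytes:
        key = (voice, line)
        if key not in self._cache:
            self._cache[key] = self._synthesize(voice, line)
        return self._cache[key]
```

Feeding the model's token stream through `stream_sentences` and each result through `audio_for` lets the first sentence start playing while later ones are still being generated.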
| ## Why it's a Gradio app with a frontend that doesn't look like one | |
| The entire app is one `gradio.Server` (Gradio 6 "Server mode" - a FastAPI subclass launched | |
| through Gradio, with Gradio API endpoints registered via `@server.api`). That single process | |
| serves a hand-built **pixel-art noir SPA** (Preact + Vite) as static files *and* exposes the | |
| game's JSON/SSE routes under `/api`. No separate frontend host. So it earns **Off-Brand** (a | |
| custom frontend well past the default Gradio look) while staying unambiguously a Gradio | |
| application. | |
| ## Shipping it | |
| - **Off the Grid:** the open Qwen GGUF and Supertonic ONNX are baked into the Docker image at | |
| build time, so the running container makes zero AI network calls. `scripts/net_audit.py` | |
| runs a full playthrough under a socket guard and asserts zero non-loopback connections. | |
| - **Llama Champion:** the model runs in-process through `llama-cpp-python`. | |
| - **Docker SDK, not Gradio SDK:** llama-cpp-python ships only an sdist on PyPI, and the prebuilt | |
| Linux wheels are musl builds (they SIGILL on HF's glibc). The Dockerfile compiles llama.cpp from source | |
| on `python:3.12-slim` (bookworm/gcc-12) with `-DGGML_NATIVE=OFF` for a portable build, and | |
| bakes the weights. | |
| ## What I'd do next | |
| - A small **fine-tune** of the 1.5B on noir-suspect dialogue and strict-JSON case authoring, | |
| published to the Hub (the "Well-Tuned" badge), to tighten persona consistency and cut the | |
| occasional malformed-JSON retry. | |
| - A daily seeded case and a shareable "case file" card. | |
| ## Try it | |
| - Space: `build-small-hackathon/case0` | |
| - Stack: Qwen2.5-1.5B-Instruct (Apache-2.0) via llama.cpp, Supertonic ONNX voices, Preact | |
| pixel-art SPA, all served by one `gradio.Server`. | |