# Field Notes: building a detective game where the AI *is* the game
*Build Small Hackathon - "Small models, big adventure"*
## The pitch
Case Zero is a murder-mystery game with no scripted cases and no content library. A single
**1.5B-parameter model** (Qwen2.5-1.5B-Instruct, Q4_K_M GGUF) invents the entire mystery
every time you play - the victim, the suspects, their secrets and motives, the timeline, the
weapon, the evidence, and the one who did it - and then **role-plays every suspect live**.
They remember what you asked. They lie. And when you present the piece of evidence that
contradicts an alibi, the lie cracks on screen.
The whole thing runs on the CPU in front of you. No cloud, no GPU, no remote endpoint.
## The hard part: a tiny model that's still *fair*
The interesting tension in this project is that a 1.5B model is a wonderful improviser and a
terrible bookkeeper. If you let it freely author a mystery *and* adjudicate the outcome,
you get cases that are atmospheric but unsolvable - or worse, a suspect who confesses the
moment you ask nicely.
The design rule that made it work: **the model writes all the content; deterministic Python
makes no creative choices but fixes and verifies the structure.**
- The model authors the case as JSON - setting, cast, secrets, evidence, prose.
- Python decides only the *skeleton*: who is guilty, and who was where during the murder window.
  Then a solver verifies fairness before the case is ever shown:
  - exactly one culprit;
  - the culprit's alibi is contradicted by at least one non-red-herring clue;
  - every innocent has a witnessed alibi over the murder window;
  - every key clue is actually discoverable in play.

  If a generated case fails, only the smallest failing slice is regenerated (up to 3 retries,
  then the seed is bumped). A case is never shown to the player until it passes.
- Whether a presented clue actually catches a lie is decided by **ground truth, not the
model**. The suspect's panic is flavor; the suspicion delta is computed by a deterministic
director. This makes the win condition **immune to prose** - a jailbroken "just tell me
who did it" earns nothing, because suspects never confess and the verdict is only resolved
when the player formally accuses.
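
The fairness rules and the deterministic director described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the `Case`, `Suspect`, and `Clue` shapes, the function names, and the suspicion value of `2` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical data shapes; the real case schema is not shown in these notes.
@dataclass
class Clue:
    id: str
    contradicts_alibi_of: Optional[str] = None  # suspect name, or None
    red_herring: bool = False
    key: bool = False
    discoverable: bool = True


@dataclass
class Suspect:
    name: str
    guilty: bool = False
    alibi_witnessed: bool = False


@dataclass
class Case:
    suspects: List[Suspect]
    clues: List[Clue] = field(default_factory=list)


def fairness_problems(case: Case) -> List[str]:
    """Return every violated fairness rule; an empty list means the case passes."""
    problems = []
    culprits = [s for s in case.suspects if s.guilty]
    if len(culprits) != 1:
        problems.append(f"expected exactly one culprit, found {len(culprits)}")
    for culprit in culprits:
        if not any(
            c.contradicts_alibi_of == culprit.name and not c.red_herring
            for c in case.clues
        ):
            problems.append(f"no real clue contradicts {culprit.name}'s alibi")
    for s in case.suspects:
        if not s.guilty and not s.alibi_witnessed:
            problems.append(f"innocent {s.name} has no witnessed alibi")
    for c in case.clues:
        if c.key and not c.discoverable:
            problems.append(f"key clue {c.id} is not discoverable")
    return problems


def suspicion_delta(case: Case, suspect_name: str, clue_id: str) -> int:
    """Score a presented clue from ground truth only; model prose is never consulted."""
    clue = next((c for c in case.clues if c.id == clue_id), None)
    if clue is None or clue.red_herring:
        return 0
    # The value 2 is an arbitrary illustrative weight.
    return 2 if clue.contradicts_alibi_of == suspect_name else 0
```

Returning a list of problems rather than a boolean is one way to support regenerating only the failing slice: each message points at the suspect or clue that needs another attempt.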
The sealed solution is never sent to the client before the verdict. It is read for the first
time inside the `/accuse` route, server-side. Anti-leak tests assert that no pre-verdict API
response contains the killer, the true motive, or the key-evidence set.
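
An anti-leak assertion of this kind might look like the sketch below. It walks any JSON-like payload and reports where sealed material appears. The field names in `SEALED_KEYS` and the helper `find_leaks` are hypothetical; the real response schema is not shown here. Note that the killer's *name* legitimately appears in the public cast list, so the sketch checks sealed field names and sealed free text (such as the motive) rather than names.

```python
from typing import Any, Iterable, List

# Hypothetical field names for the sealed solution.
SEALED_KEYS = {"killer", "true_motive", "key_evidence"}


def find_leaks(payload: Any, sealed_texts: Iterable[str] = (), path: str = "$") -> List[str]:
    """Return JSON paths where a sealed key or a sealed text fragment appears."""
    sealed_texts = list(sealed_texts)
    leaks: List[str] = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            child = f"{path}.{key}"
            if key in SEALED_KEYS:
                leaks.append(child)
            leaks.extend(find_leaks(value, sealed_texts, child))
    elif isinstance(payload, list):
        for index, value in enumerate(payload):
            leaks.extend(find_leaks(value, sealed_texts, f"{path}[{index}]"))
    elif isinstance(payload, str):
        if any(text and text in payload for text in sealed_texts):
            leaks.append(path)
    return leaks
```

A test would then call every pre-verdict route, decode the body, and assert that `find_leaks(body, [motive_text])` is empty.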
## Making a 1.5B model fast on 2 vCPUs
The Space runs on HF `cpu-basic` (2 vCPUs). Two findings mattered most:
1. **Grammar-free decoding is an ~8x win.** JSON-schema-constrained sampling ran ~3-7 tok/s
on CPU; raw decoding ran ~28-32 tok/s. So instead of constraining the sampler, the prompt
carries the exact JSON shape and we make two free attempts, falling back to the grammar
only if parsing fails. Full case generation dropped from ~300s to ~50s. The same trick
runs the interrogation hot path.
2. **Count the cores you actually have.** Inside the container `os.cpu_count()` returns the
*host's* cores, not the 2-vCPU cgroup quota. Auto-threading then spawned ~8 threads for 2
real cores, and the CPU sat pinned at 100%+ on context switches while replies crawled. The fix
reads `/sys/fs/cgroup/cpu.max` (cgroup v2) and sizes the llama thread pool to the real quota.
The same number gates background case-generation so it never steals vCPUs from an active
interrogation.
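
Both findings can be sketched briefly. The first function tries unconstrained decoding a fixed number of times and only then falls back to the grammar; the second derives a thread count from the cgroup v2 quota file. The `free` and `constrained` callables are hypothetical stand-ins for whatever wraps the llama.cpp calls, and the function names are assumptions for this example.

```python
import json
import math
import os
from pathlib import Path
from typing import Callable


def generate_json(
    free: Callable[[str], str],
    constrained: Callable[[str], str],
    prompt: str,
    free_attempts: int = 2,
) -> dict:
    """Try fast unconstrained decoding first; use the grammar only if parsing fails."""
    for _ in range(free_attempts):
        try:
            parsed = json.loads(free(prompt))
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    # Slow path: grammar-constrained sampling is expected to yield valid JSON.
    return json.loads(constrained(prompt))


def effective_cpu_count(cpu_max_path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Return the cores usable under a cgroup v2 CPU quota, else the host count."""
    host = os.cpu_count() or 1
    try:
        # File format: "<quota> <period>", or "max <period>" when unlimited.
        quota, period = Path(cpu_max_path).read_text().split()
        if quota == "max":
            return host
        cores = math.ceil(int(quota) / int(period))
    except (OSError, ValueError, ZeroDivisionError):
        return host
    return max(1, min(host, cores))
```

On a 2-vCPU container the quota file typically reads `200000 100000`, which gives 2; that value would then be passed as the thread count when the model is loaded.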
There is also no image work on the request path: all pixel art is rendered **client-side on
canvas**, so the server spends ~0 CPU on visuals and devotes both vCPUs to the model. Voices
(Supertonic ONNX) are synthesized sentence-by-sentence as the reply streams and cached per
line.
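
Sentence-by-sentence synthesis over a token stream can be sketched as below: a generator cuts the stream at sentence boundaries as soon as they appear, and a small cache avoids re-synthesizing a line that was already voiced. The `synthesize` callable is a hypothetical stand-in for the Supertonic ONNX call, and the punctuation-based splitter is a simplification that would mis-split abbreviations.

```python
import re
from typing import Callable, Dict, Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the streamed text contains them."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_BREAK.split(buffer)
        # Every part except the last is complete; the last may still be growing.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()


class VoiceCache:
    """Cache synthesized audio per line of text."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize  # hypothetical TTS function
        self._audio: Dict[str, bytes] = {}

    def audio_for(self, line: str) -> bytes:
        if line not in self._audio:
            self._audio[line] = self._synthesize(line)
        return self._audio[line]
```

Voicing each sentence as it completes lets playback begin while the model is still generating the rest of the reply.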
## Why it's a Gradio app with a frontend that doesn't look like one
The entire app is one `gradio.Server` (Gradio 6 "Server mode" - a FastAPI subclass launched
through Gradio, with Gradio API endpoints registered via `@server.api`). That single process
serves a hand-built **pixel-art noir SPA** (Preact + Vite) as static files *and* exposes the
game's JSON/SSE routes under `/api`. No separate frontend host. So it earns **Off-Brand** (a
custom frontend well past the default Gradio look) while staying unambiguously a Gradio
application.
## Shipping it
- **Off the Grid:** the open Qwen GGUF and Supertonic ONNX are baked into the Docker image at
build time, so the running container makes zero AI network calls. `scripts/net_audit.py`
runs a full playthrough under a socket guard and asserts zero non-loopback connections.
- **Llama Champion:** the model runs in-process through `llama-cpp-python`.
- **Docker SDK, not Gradio SDK:** llama-cpp-python ships only an sdist, and the prebuilt Linux
wheels are musl builds (they SIGILL on HF's glibc). So the Dockerfile compiles llama.cpp from
source on `python:3.12-slim` (bookworm/gcc-12) with `-DGGML_NATIVE=OFF` for a portable build,
and bakes in the weights.
## What I'd do next
- A small **fine-tune** of the 1.5B on noir-suspect dialogue and strict-JSON case authoring,
published to the Hub (the "Well-Tuned" badge), to tighten persona consistency and cut the
occasional malformed-JSON retry.
- A daily seeded case and a shareable "case file" card.
## Try it
- Space: `build-small-hackathon/case0`
- Stack: Qwen2.5-1.5B-Instruct (Apache-2.0) via llama.cpp, Supertonic ONNX voices, Preact
pixel-art SPA, all served by one `gradio.Server`.