
	# Field Notes: building a detective game where the AI is the game

	Build Small Hackathon - "Small models, big adventure"

	## The pitch

	Case Zero is a murder-mystery game with no scripted cases and no content library. A single
	1.5B-parameter model (Qwen2.5-1.5B-Instruct, Q4_K_M GGUF) invents the entire mystery
	every time you play - the victim, the suspects, their secrets and motives, the timeline, the
	weapon, the evidence, and the one who did it - and then role-plays every suspect live.
	They remember what you asked. They lie. And when you present the piece of evidence that
	contradicts an alibi, the lie cracks on screen.

	The whole thing runs on the CPU in front of you. No cloud, no GPU, no remote endpoint.

	## The hard part: a tiny model that's still fair

	The interesting tension in this project is that a 1.5B model is a wonderful improviser and a
	terrible bookkeeper. If you let it freely author a mystery and adjudicate the outcome,
	you get cases that are atmospheric but unsolvable - or worse, a suspect who confesses the
	moment you ask nicely.

	The design rule that made it work: **the model writes everything; deterministic Python
	decides nothing creative but guarantees the structure.**

	- The model authors the case as JSON - setting, cast, secrets, evidence, prose.
	- Python decides only the skeleton: who is guilty, and who was where during the murder window.
	- A solver then verifies fairness before the case is ever shown:
	  - exactly one culprit;
	  - the culprit's alibi is contradicted by at least one non-red-herring clue;
	  - every innocent has a witnessed alibi over the murder window;
	  - every key clue is actually discoverable in play.

	  If a generated case fails, the smallest slice is regenerated (<=3 retries, then bump the
	  seed). A case is never shown to the player until it passes.
	- Whether a presented clue actually catches a lie is decided by **ground truth, not the
	model**. The suspect's panic is flavor; the suspicion delta is computed by a deterministic
	director. This makes the win condition immune to prose - a jailbroken "just tell me
	who did it" earns nothing, because suspects never confess and the verdict is only resolved
	when the player formally accuses.
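
The fairness rules above can be sketched as a small deterministic check. This is a minimal illustration, not the project's actual solver: the `Suspect`, `Clue`, and `Case` shapes and every field name are assumptions made up for the example, and the retry helper regenerates the whole case rather than only the smallest slice.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Suspect:
    name: str
    alibi_witnessed: bool  # True if a witness covers the whole murder window


@dataclass
class Clue:
    clue_id: str
    contradicts_alibi_of: Optional[str] = None  # suspect name, or None
    red_herring: bool = False
    key: bool = False
    discoverable: bool = True


@dataclass
class Case:
    suspects: List[Suspect]
    culprits: List[str]
    clues: List[Clue] = field(default_factory=list)


def fairness_failures(case: Case) -> List[str]:
    """Return the violated rules; an empty list means the case is fair."""
    if len(case.culprits) != 1:
        return ["exactly one culprit"]  # remaining checks assume one culprit
    culprit = case.culprits[0]
    failures = []
    if not any(c.contradicts_alibi_of == culprit and not c.red_herring
               for c in case.clues):
        failures.append("culprit alibi contradicted by a real clue")
    if any(s.name != culprit and not s.alibi_witnessed for s in case.suspects):
        failures.append("every innocent has a witnessed alibi")
    if any(c.key and not c.discoverable for c in case.clues):
        failures.append("every key clue is discoverable")
    return failures


def first_fair_case(generate: Callable[[int, int], Case], seed: int,
                    retries: int = 3, max_seeds: int = 10) -> Case:
    """Try each seed up to 1 + retries times, then bump the seed."""
    for current_seed in range(seed, seed + max_seeds):
        for attempt in range(1 + retries):
            case = generate(current_seed, attempt)
            if not fairness_failures(case):
                return case
    raise RuntimeError("no fair case found within the seed budget")
```

Returning the list of failed rules, rather than a bare boolean, is what would let a caller decide which slice of the case to regenerate.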

	The sealed solution is never sent to the client before the verdict. It is read for the first
	time inside the `/accuse` route, server-side. Anti-leak tests assert that no pre-verdict API
	response contains the killer, the true motive, or the key-evidence set.
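
One way such an anti-leak assertion could look is a recursive scan of each pre-verdict response for sealed fields. The key names in `FORBIDDEN_KEYS` and the payload shapes below are hypothetical; the real tests may check values rather than keys.

```python
from typing import Any, List

# Hypothetical names for the sealed fields; adjust to the real schema.
FORBIDDEN_KEYS = {"killer", "true_motive", "key_evidence"}


def find_forbidden_keys(payload: Any, path: str = "") -> List[str]:
    """Recursively collect the paths of forbidden keys in a JSON-like payload."""
    hits: List[str] = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            here = f"{path}.{key}" if path else str(key)
            if key in FORBIDDEN_KEYS:
                hits.append(here)
            hits.extend(find_forbidden_keys(value, here))
    elif isinstance(payload, list):
        for index, item in enumerate(payload):
            hits.extend(find_forbidden_keys(item, f"{path}[{index}]"))
    return hits


# Usage: a test would call this on every response returned before /accuse.
safe_response = {"suspects": [{"name": "Ada", "alibi": "at the bar"}]}
assert find_forbidden_keys(safe_response) == []
```

A key-based scan is used here because the killer's name legitimately appears in the cast list, so a plain substring search for it would always fire.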

	## Making a 1.5B model fast on 2 vCPUs

	The Space runs on HF `cpu-basic` (2 vCPUs). Two findings mattered most:

	1. Grammar-free decoding is an ~8x win. JSON-schema-constrained sampling ran ~3-7 tok/s
	on CPU; raw decoding ran ~28-32 tok/s. So instead of constraining the sampler, the prompt
	carries the exact JSON shape and we make two free attempts, falling back to the grammar
	only if parsing fails. Full case generation dropped from ~300s to ~50s. The same trick
	runs the interrogation hot path.
	2. Count the cores you actually have. Inside the container `os.cpu_count()` returns the
	host's cores, not the 2-vCPU cgroup quota. Auto-threading then spawned ~8 threads for 2
	real cores, so the CPU sat pinned at 100%+ on context switches while replies crawled. The fix
	reads `/sys/fs/cgroup/cpu.max` and sizes the llama thread pool to the real quota. The same
	number gates background case-generation so it never steals vCPUs from an active
	interrogation.
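
Both findings can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's code: `free_decode` and `grammar_decode` are hypothetical callables standing in for the unconstrained and grammar-constrained model calls, and the quota parser assumes the cgroup v2 `cpu.max` format (`<quota> <period>` or `max <period>`).

```python
import json
import math
import os
from typing import Callable


def generate_json(free_decode: Callable[[str], str],
                  grammar_decode: Callable[[str], str],
                  prompt: str, free_attempts: int = 2) -> dict:
    """Try fast unconstrained decoding first; fall back to the grammar."""
    for _ in range(free_attempts):
        try:
            parsed = json.loads(free_decode(prompt))
        except json.JSONDecodeError:
            continue  # malformed output: spend another fast attempt
        if isinstance(parsed, dict):
            return parsed
    # Slow path: the constrained sampler is expected to emit valid JSON.
    return json.loads(grammar_decode(prompt))


def cores_from_cpu_max(text: str, host_cores: int) -> int:
    """Turn the contents of cgroup v2 cpu.max into a usable core count."""
    try:
        quota, period = text.split()
        if quota == "max":  # no quota set: all host cores are available
            return host_cores
        cores = math.ceil(int(quota) / int(period))
    except (ValueError, ZeroDivisionError):
        return host_cores  # unexpected format: fall back to the host count
    return max(1, min(host_cores, cores))


def effective_cpu_count(path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Return the cgroup CPU quota, or os.cpu_count() if no quota file exists."""
    host_cores = os.cpu_count() or 1
    try:
        with open(path) as handle:
            return cores_from_cpu_max(handle.read(), host_cores)
    except OSError:
        return host_cores
```

The result of `effective_cpu_count()` would then be passed as the thread count when constructing the model (for example the `n_threads` argument of `llama_cpp.Llama`), and reused as the limit for background generation.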

	There is also no image work on the request path: all pixel art is rendered **client-side on
	canvas**, so the server spends ~0 CPU on visuals and devotes both vCPUs to the model. Voices
	(Supertonic ONNX) are synthesized sentence-by-sentence as the reply streams and cached per
	line.
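
The sentence-by-sentence voicing can be sketched as a streaming splitter plus a small cache. This is a minimal illustration: the `synthesize` callable stands in for the Supertonic call, and keying the cache on `(voice, sentence)` is an assumption about what "per line" means.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, Tuple

# A sentence ends at ., ! or ? followed by whitespace (a naive rule that
# also splits after abbreviations such as "Mr.").
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in streamed text."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # All parts but the last are finished; the tail stays buffered.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


class VoiceCache:
    """Cache synthesized audio so a repeated line is never voiced twice."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._cache: Dict[Tuple[str, str], bytes] = {}

    def speak(self, voice: str, sentence: str) -> bytes:
        key = (voice, sentence)
        if key not in self._cache:
            self._cache[key] = self._synthesize(voice, sentence)
        return self._cache[key]
```

Because the splitter yields each sentence the moment its terminator arrives, audio for the first sentence can start while the model is still generating the rest of the reply.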

	## Why it's a Gradio app with a frontend that doesn't look like one

	The entire app is one `gradio.Server` (Gradio 6 "Server mode" - a FastAPI subclass launched
	through Gradio, with Gradio API endpoints registered via `@server.api`). That single process
	serves a hand-built pixel-art noir SPA (Preact + Vite) as static files and exposes the
	game's JSON/SSE routes under `/api`. There is no separate frontend host, so it earns the
	Off-Brand badge (a custom frontend well past the default Gradio look) while staying
	unambiguously a Gradio application.

	## Shipping it

	- Off the Grid: the open Qwen GGUF and Supertonic ONNX are baked into the Docker image at
	build time, so the running container makes zero AI network calls. `scripts/net_audit.py`
	runs a full playthrough under a socket guard and asserts zero non-loopback connections.
	- Llama Champion: the model runs in-process through `llama-cpp-python`.
	- Docker SDK, not Gradio SDK: llama-cpp-python ships only an sdist, and the prebuilt Linux
	wheels that exist are musl builds (they SIGILL on HF's glibc). The Dockerfile compiles llama.cpp from source
	on `python:3.12-slim` (bookworm/gcc-12) with `-DGGML_NATIVE=OFF` for a portable build, and
	bakes the weights.
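
A socket guard of the kind `scripts/net_audit.py` is described as using could look roughly like this. It is a sketch under assumptions, not the script itself: it patches only `socket.socket.connect`, so `connect_ex`, UDP `sendto`, and sockets created by native code are not covered.

```python
import ipaddress
import socket
from contextlib import contextmanager


def is_loopback(host: str) -> bool:
    """True for loopback IP addresses and the literal name 'localhost'."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # any other hostname is treated as external


@contextmanager
def socket_guard():
    """Block and record every non-loopback connect() made inside the block."""
    violations = []
    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        # AF_INET / AF_INET6 addresses are tuples whose first item is the host;
        # other address types (such as Unix socket paths) pass through.
        if isinstance(address, tuple) and not is_loopback(str(address[0])):
            violations.append(address)
            raise ConnectionRefusedError(f"blocked non-loopback connect: {address}")
        return original_connect(self, address)

    socket.socket.connect = guarded_connect
    try:
        yield violations
    finally:
        socket.socket.connect = original_connect  # always restore the original
```

An audit would run the full playthrough inside `with socket_guard() as violations:` and then assert that `violations` is empty.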

	## What I'd do next

	- A small fine-tune of the 1.5B on noir-suspect dialogue and strict-JSON case authoring,
	published to the Hub (the "Well-Tuned" badge), to tighten persona consistency and cut the
	occasional malformed-JSON retry.
	- A daily seeded case and a shareable "case file" card.

	## Try it

	- Space: `build-small-hackathon/case0`
	- Stack: Qwen2.5-1.5B-Instruct (Apache-2.0) via llama.cpp, Supertonic ONNX voices, Preact
	pixel-art SPA, all served by one `gradio.Server`.