# Field Notes: building a detective game where the AI *is* the game

*Build Small Hackathon - "Small models, big adventure"*

## The pitch

Case Zero is a detective game with no scripted cases and no content library. Every time you
play, a single **1.5B-parameter model** (Qwen2.5-1.5B-Instruct, Q4_K_M GGUF) invents the
entire mystery: the victim, the suspects, their secrets and motives, the timeline, the
evidence, and the one who did it. The crime might be a homicide, a heist, a fraud, a
blackmail ring, an arson, or a disappearance. The model then **role-plays every suspect live**.
They remember what you asked. They lie. And when you present the piece of evidence that
contradicts an alibi, the lie cracks on screen.

The whole thing runs on the CPU in front of you. No cloud, no GPU, no remote endpoint.

One structure, many crimes: the solver only ever checks *structure* (one culprit, a false
alibi contradicted by discoverable physical evidence, every innocent cleared), so the same
fairness guarantees hold whether the case is a murder or a vanished heirloom. A single
frozen "crime profile" table rewords the prompts, the suspect briefs, and the dossier
labels per kind - the engine itself never changes.

## The hard part: a tiny model that's still *fair*

The interesting tension in this project is that a 1.5B model is a wonderful improviser and a
terrible bookkeeper. If you let it freely author a mystery *and* adjudicate the outcome,
you get cases that are atmospheric but unsolvable - or worse, a suspect who confesses the
moment you ask nicely.

The design rule that made it work: **the model writes everything; deterministic Python
decides nothing creative but guarantees the structure.**

- The model authors the case as JSON - setting, cast, secrets, evidence, prose.
- Python decides only the *skeleton*: who is guilty, who was where during the murder window.
  Then a solver verifies fairness before the case is ever shown:
  - exactly one culprit;
  - the culprit's alibi is contradicted by at least one non-red-herring clue;
  - every innocent has a witnessed alibi over the murder window;
  - every key clue is actually discoverable in play.
  If a generated case fails, only the smallest failing slice is regenerated (up to 3 retries,
  then the seed is bumped). A case is never shown to the player until it passes.
- Whether a presented clue actually catches a lie is decided by **ground truth, not the
  model**. The suspect's panic is flavor; the suspicion delta is computed by a deterministic
  director. This makes the win condition **immune to prose** - a jailbroken "just tell me
  who did it" earns nothing, because suspects never confess and the verdict is only resolved
  when the player formally accuses.
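
The four solver rules above can be sketched as a pure function over the case skeleton. This is a minimal illustration, not the project's actual solver: the `Case`, `Suspect`, and `Clue` classes and all of their field names are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Clue:
    id: str
    contradicts_alibi_of: Optional[str] = None  # hypothetical: suspect id whose alibi it breaks
    red_herring: bool = False
    discoverable: bool = True


@dataclass
class Suspect:
    id: str
    guilty: bool = False
    alibi_witnessed: bool = False


@dataclass
class Case:
    suspects: List[Suspect]
    clues: List[Clue] = field(default_factory=list)


def fairness_problems(case: Case) -> List[str]:
    """Return the violated rules; an empty list means the case is fair."""
    culprits = [s for s in case.suspects if s.guilty]
    if len(culprits) != 1:
        return ["need exactly one culprit"]
    culprit = culprits[0]

    problems = []
    # Key clues: real (non-red-herring) clues that contradict the culprit's alibi.
    key_clues = [c for c in case.clues
                 if c.contradicts_alibi_of == culprit.id and not c.red_herring]
    if not key_clues:
        problems.append("no real clue contradicts the culprit's alibi")
    if any(not c.discoverable for c in key_clues):
        problems.append("a key clue is not discoverable")
    for s in case.suspects:
        if not s.guilty and not s.alibi_witnessed:
            problems.append(f"innocent {s.id} has no witnessed alibi")
    return problems
```

Because the check only reads structure, it never needs to look at the prose the model wrote, and a non-empty result can name the slice that needs regenerating.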

The sealed solution is never sent to the client before the verdict. It is read for the first
time inside the `/accuse` route, server-side. Anti-leak tests assert that no pre-verdict API
response contains the killer, the true motive, or the key-evidence set.
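
One way to picture such an anti-leak assertion is a redaction helper plus a recursive scan for sealed keys. This is only a sketch: the key names in `SEALED_KEYS` and both helper functions are hypothetical, and the real tests may check responses differently.

```python
# Hypothetical names for the sealed fields; the real schema may differ.
SEALED_KEYS = {"culprit", "true_motive", "key_evidence"}


def public_view(node):
    """Return a copy of a JSON-like structure with sealed keys removed at every depth."""
    if isinstance(node, dict):
        return {k: public_view(v) for k, v in node.items() if k not in SEALED_KEYS}
    if isinstance(node, list):
        return [public_view(v) for v in node]
    return node


def sealed_keys_in(node):
    """Collect every sealed key that appears anywhere in a JSON-like structure."""
    found = set()
    if isinstance(node, dict):
        for k, v in node.items():
            if k in SEALED_KEYS:
                found.add(k)
            found |= sealed_keys_in(v)
    elif isinstance(node, list):
        for v in node:
            found |= sealed_keys_in(v)
    return found
```

A test in this style would run `sealed_keys_in` over every pre-verdict response body and assert that the result is empty.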

## Making a 1.5B model fast on 2 vCPUs

The Space runs on HF `cpu-basic` (2 vCPUs). Two findings mattered most:

1. **Grammar-free decoding is an ~8x win.** JSON-schema-constrained sampling ran ~3-7 tok/s
   on CPU; raw decoding ran ~28-32 tok/s. So instead of constraining the sampler, the prompt
   spells out the exact JSON shape and the model gets two unconstrained attempts; the grammar
   is a fallback used only if both fail to parse. Full case generation dropped from ~300s to
   ~50s. The same trick runs the interrogation hot path.
2. **Count the cores you actually have.** Inside the container `os.cpu_count()` returns the
   *host's* cores, not the 2-vCPU cgroup quota. Auto-threading then spawned ~8 threads for 2
   real cores, and the CPU sat pinned at 100%+ on context switches while replies crawled. The
   fix reads the cgroup v2 quota from `/sys/fs/cgroup/cpu.max` and sizes the llama thread
   pool to it.
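
A minimal sketch of the core-counting fix, assuming cgroup v2, where `cpu.max` holds either `<quota> <period>` in microseconds or `max <period>` when no limit is set. The function names are illustrative, not the project's.

```python
import math
import os
from pathlib import Path
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Turn the contents of cpu.max into a CPU count, or None if there is no quota."""
    parts = text.split()
    if len(parts) != 2 or parts[0] == "max":
        return None  # unlimited, or a format this sketch does not understand
    quota, period = int(parts[0]), int(parts[1])
    return quota / period if period > 0 else None


def effective_threads(path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Thread count capped by the cgroup quota, falling back to os.cpu_count()."""
    host = os.cpu_count() or 1
    try:
        cpus = parse_cpu_max(Path(path).read_text())
    except (OSError, ValueError):
        cpus = None  # file missing (not cgroup v2) or unparsable numbers
    if cpus is None:
        return host
    return max(1, min(host, math.ceil(cpus)))
```

With a quota of `200000 100000` this yields 2 on a host with two or more cores. The result could then be passed as `n_threads` when constructing the `llama_cpp.Llama` instance.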

Background case generation runs even on the 2-vCPU box without ever making a player wait:
each generation call holds the single-flight model lock for just that one call, streams,
and **aborts between tokens** the moment a player asks a question - then resumes once the
table has been idle. Fresh AI cases join a shuffled pre-baked pool, so New Case is always
instant and the pool of mysteries grows for as long as the Space stays up.
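
The abort-between-tokens idea can be sketched with a lock and an event. Everything here is an assumption for illustration: the names are invented, and this version simply discards the partial output when preempted, which may not match how the real app resumes.

```python
import threading
from typing import Callable, Iterable, Optional

player_waiting = threading.Event()  # hypothetical: set while a player request is pending
model_lock = threading.Lock()       # single-flight: one generation call at a time


def background_call(
    tokens: Iterable[str],
    should_abort: Callable[[], bool] = player_waiting.is_set,
) -> Optional[str]:
    """Consume one streamed generation under the lock; return None if preempted."""
    out = []
    with model_lock:
        for tok in tokens:
            if should_abort():
                return None  # give up the lock right away; the caller retries when idle
            out.append(tok)
    return "".join(out)
```

The lock is held only for the duration of one call, so a player request waits for at most one token before the background job gets out of the way.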

There is also no image work on the request path: all pixel art is rendered **client-side on
canvas**, so the server spends ~0 CPU on visuals and devotes both vCPUs to the model. Voices
(Supertonic ONNX) are synthesized sentence-by-sentence as the reply streams and cached per
line.

## Why it's a Gradio app with a frontend that doesn't look like one

The entire app is one `gradio.Server` (Gradio 6 "Server mode" - a FastAPI subclass launched
through Gradio, with Gradio API endpoints registered via `@server.api`). That single process
serves a hand-built **pixel-art noir SPA** (Preact + Vite) as static files *and* exposes the
game's JSON/SSE routes under `/api`. No separate frontend host. So it earns **Off-Brand** (a
custom frontend well past the default Gradio look) while staying unambiguously a Gradio
application.

## Shipping it

- **Off the Grid:** the open Qwen GGUF and Supertonic ONNX are baked into the Docker image at
  build time, so the running container makes zero AI network calls. `scripts/net_audit.py`
  runs a full playthrough under a socket guard and asserts zero non-loopback connections.
- **Llama Champion:** the model runs in-process through `llama-cpp-python`.
- **Docker SDK, not Gradio SDK:** llama-cpp-python ships only an sdist on PyPI, and the
  prebuilt Linux wheels are musl builds (they SIGILL on HF's glibc). The Dockerfile compiles
  llama.cpp from source on `python:3.12-slim` (bookworm/gcc-12) with `-DGGML_NATIVE=OFF` for
  a portable build, and bakes the weights.
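
For readers curious what a socket guard can look like, here is a minimal sketch. It is not the contents of `scripts/net_audit.py`; it patches only `socket.socket.connect` (not `connect_ex` or other entry points), and the helper names are made up.

```python
import ipaddress
import socket
from contextlib import contextmanager


def is_loopback(address) -> bool:
    """True for AF_UNIX paths, 'localhost', and loopback IP addresses."""
    if not isinstance(address, tuple):
        return True  # AF_UNIX socket path: local by construction
    host = address[0]
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # any other hostname is treated as external


@contextmanager
def socket_guard():
    """Record and refuse every non-loopback connect() made inside the block."""
    blocked = []
    original = socket.socket.connect

    def guarded(self, address):
        if not is_loopback(address):
            blocked.append(address)
            raise ConnectionRefusedError(f"blocked non-loopback connect: {address!r}")
        return original(self, address)

    socket.socket.connect = guarded
    try:
        yield blocked
    finally:
        socket.socket.connect = original  # always restore the real method
```

An audit in this style runs the playthrough inside `with socket_guard() as blocked:` and then asserts that `blocked` is empty.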

## What I'd do next

- A small **fine-tune** of the 1.5B on noir-suspect dialogue and strict-JSON case authoring,
  published to the Hub (the "Well-Tuned" badge), to tighten persona consistency and cut the
  occasional malformed-JSON retry.
- A daily seeded case and a shareable "case file" card.

## Try it

- Space: `build-small-hackathon/case0`
- Stack: Qwen2.5-1.5B-Instruct (Apache-2.0) via llama.cpp, Supertonic ONNX voices, Preact
  pixel-art SPA, all served by one `gradio.Server`.