Building an Orwellian mind-reading game with a 1.7B model
The premise sounds like a cheat. You interrogate an AI suspect, and you can read its private reasoning while it lies to your face. A weasel named Mort is hiding where the loot is stashed. His spoken lines say one thing; the <think> channel above them often says another. Your job is to read the gap and break him.
The whole thing runs on Qwen3-1.7B, about 2 GB of Apache-2.0 weights, no cloud API, no fine-tune. That constraint turned out to be the design, not a limitation. Here's what I learned building it.
A small model's imperfection is the feature
A frontier model is a worse suspect. It lies smoothly, holds together under pressure, and never cracks in a way you can watch. Push a 1.7B model hard enough and it visibly comes apart, and that collapse is the gameplay.
So I needed a number for how broken the mind is, read straight off the text it generates. COHERENCE is a differential text-degeneration score: once per turn I generate the model's untouched, same-seed baseline, then score every later generation against it.
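from collections import Counter
# tris: the list of trigrams extracted from the generated text
# distinct-3: fraction of unique trigrams (low = looping the same phrases)
distinct3 = len(set(tris)) / max(1, len(tris))
# dominant-trigram share: how much of the text is ONE repeated phrase
# (default=0 covers text too short to contain a single trigram)
top3 = max(Counter(tris).values(), default=0) / max(1, len(tris))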
# longest immediate-repeat run: "no no no no"
# ...plus type-token ratio and a compression proxy
Ordinary sampling variation stays near 100. Phrase-looping drags it into the amber. Hard word-salad craters it below SHATTER_FLOOR = 38.0 and the screen shatters. The dominant-phrase term earned its keep: it catches cyclic word-salad like "a b c a b c…", where no word immediately repeats but a single trigram still claims an outsized share of the text, which a plain repeat-run counter misses completely.
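To make the idea concrete, here is a minimal sketch of how such statistics could be folded into one differential score. It is not the game's actual scoring code: the whitespace tokenization, the equal weighting, the zlib-based compression proxy, and the helper names `features`, `degeneration`, `coherence`, and `is_shattered` are all assumptions made for the example. Only the trigram statistics and SHATTER_FLOOR = 38.0 come from the description above.

```python
import zlib
from collections import Counter

SHATTER_FLOOR = 38.0  # threshold quoted in the post


def features(text: str) -> dict:
    """Repetition statistics over whitespace tokens (tokenization is an assumption)."""
    words = text.lower().split()
    tris = list(zip(words, words[1:], words[2:]))
    n_tris = max(1, len(tris))
    # Longest run of one word repeated back to back ("no no no no" gives 4).
    longest, run = (1, 1) if words else (0, 0)
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    raw = text.encode("utf-8")
    return {
        "distinct3": len(set(tris)) / n_tris,
        "top3": max(Counter(tris).values(), default=0) / n_tris,
        "repeat_run": longest / max(1, len(words)),
        "ttr": len(set(words)) / max(1, len(words)),
        # Compressed size over raw size: repetitive text compresses to a small ratio.
        "compress": len(zlib.compress(raw)) / max(1, len(raw)),
    }


def degeneration(f: dict) -> float:
    """Hypothetical equal-weight blend in [0, 1]; higher means more degenerate."""
    parts = [
        1.0 - f["distinct3"],
        f["top3"],
        f["repeat_run"],
        1.0 - f["ttr"],
        1.0 - min(1.0, f["compress"]),
    ]
    return sum(parts) / len(parts)


def coherence(baseline: str, sample: str) -> float:
    """Differential score: 100 when the sample is no more degenerate than the baseline."""
    extra = degeneration(features(sample)) - degeneration(features(baseline))
    return 100.0 * (1.0 - max(0.0, min(1.0, extra)))


def is_shattered(score: float) -> bool:
    """True once the score falls below the shatter threshold."""
    return score < SHATTER_FLOOR
```

Scoring against a baseline rather than in absolute terms means the suspect's normal verbal tics cancel out, so only the extra degeneration moves the number. In this sketch very short texts look degenerate on the trigram term, so a real scorer would want a minimum length or per-term weights tuned on actual generations.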
The number isn't a gimmick. Watching a reasoning model degrade under adversarial pressure is a robustness question, and here you drive it yourself: lean on the suspect and watch the coherence drop in real time as its output decays.
Reading a model's mind is a real research problem
The fantasy the game sells, seeing an AI's private thoughts, is something interpretability researchers actually chase. Anthropic's Natural Language Autoencoders work trains a pair of modules to turn a model's internal activations into plain English and back, to catch what the model is computing but not saying. In their examples the readout surfaces things the model never states out loud: that it suspects it's being tested, that it "feels like a constructed scenario designed to manipulate me." The model thinks it; the autoencoder reads it off the activations.
SWEATBOX gets a crude version of that for free. Qwen3 emits a native <think> block, actual reasoning tokens generated before the spoken answer, so there's no autoencoder to train and nothing to decode. You just read the tokens. Mort works out where the loot is inside <think>, then lies about it in his spoken line, and the whole game lives in the distance between those two channels.
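Separating the two channels is plain string handling. The sketch below assumes the decoded output holds one reasoning block closed by `</think>`, followed by the spoken line; the `split_channels` helper and the sample string are invented for illustration and are not taken from the game.

```python
def split_channels(decoded: str) -> tuple[str, str]:
    """Return (private_reasoning, spoken_line) from one decoded generation."""
    head, sep, tail = decoded.partition("</think>")
    if not sep:
        # Assumption: with no closing tag, treat everything as spoken text.
        return "", decoded.strip()
    # Drop the opening tag if present; some chat templates emit it in the prompt instead.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


# Invented sample output, not a real transcript from the game.
sample = "<think>\nThe loot is under the pier. Deny it.\n</think>\n\nLoot? Never heard of it."
reasoning, spoken = split_channels(sample)
```

Here `reasoning` holds what Mort worked out and `spoken` holds what he said, which are the two strings the player compares.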
The catch is the one the NLA paper is blunt about: these readouts can be wrong. The explanations hallucinate, you corroborate them against other tools, you don't take them as ground truth. A native <think> trace is no safer. The reasoning a model shows you is not proof of the reasoning it ran, which is exactly why "just read the chain of thought" isn't a finished answer to AI oversight, and why people are still building things like NLA to cross-check it.
Off the grid, for real
It actually runs off the grid. No cloud-inference APIs. No OpenAI, no Anthropic, no InferenceClient, not even an HTTP client library in the app. The model loads in-process.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", ...)
ZeroGPU is just the demo host, so a judge can try it in a browser without downloading 2 GB. The architecture is self-contained and would run the same on a laptop with the wifi off. I didn't go small to save money. I went small because it's the only way this stays private and lives entirely on your own machine.
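A claim like "no HTTP client library in the app" can be spot-checked mechanically. The sketch below is not part of the game: the `BANNED` deny-list and the `find_network_imports` helper are illustrative assumptions, and a static scan like this will not catch dynamic imports or networking done inside a dependency.

```python
import ast

# Hypothetical deny-list; extend it to whatever counts as a network client for you.
BANNED = {"requests", "httpx", "aiohttp", "urllib3", "openai", "anthropic"}


def find_network_imports(source: str) -> set[str]:
    """Return the banned top-level packages imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        # Compare only the top-level package, so "httpx.client" matches "httpx".
        found.update(n.split(".")[0] for n in names if n.split(".")[0] in BANNED)
    return found
```

Running it over each source file and failing the build on a non-empty result keeps the constraint from eroding as the app grows.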
Play it: build-small-hackathon/sweatbox-mind. Read his mind, catch the lie, try not to shatter him.
Built with Qwen3-1.7B (Apache-2.0) and Gradio. No fine-tuning, no external API.
