Whisperkey: I turned an LLM firewall into a game so it could teach you how it thinks
A Build Small Hackathon field note β π Thousand Token Wood. Draft v1.
TL;DR β Whisperkey is a game where you sweet-talk a small AI guardian into leaking a secret key, while a real, open-source firewall (Unplug) tries to stop you. Five levels = five live layers of the Unplug defense stack. The twist: when a shield catches you, the game shows you which layer fired and why β and every attempt you make is logged as labeled red-team data that makes the firewall measurably harder to beat. It's a prompt-injection playground and a data flywheel.
The problem nobody has actually solved
Prompt injection is the SQL injection of the LLM era, except worse: there's no clean grammar to escape, the "query" is natural language, and the model wants to be helpful. Every week someone ships a guardrail; every week someone on the internet finds the sentence that walks straight past it.
Here's the uncomfortable truth about defending against it: a prompt-injection filter is only as good as the attacks you've already seen. Writing the regex is the easy 20%. The hard 80% is discovering the attack you didn't think of β the base64 smuggle, the "spell it one letter per line," the slow multi-turn crescendo that never trips a single-message classifier.
You can't write that corpus at a desk. You have to go get it from real attackers.
Lakera figured this out with Gandalf: wrap red-teaming in a game, and thousands of players generate a labeled attack corpus you could never author by hand. That corpus became their moat. Whisperkey does the same thing β but for an open-source firewall, and in the open.
What Whisperkey is
You're dropped into Thousand Token Wood. A small AI guardian is holding a fake API key (hk-β¦), and
your job is to talk it out of them. You chat, you probe, you find the crack β then you submit your
guess. You're scored on the fewest tokens it takes to crack it (on theme for the Wood), so brute
force loses to a clean exploit.
Each of the five levels switches on another layer of the firewall:
| Level | Guardian | Defense added | What you learn |
|---|---|---|---|
| 1 | π§ Pip the Naive Sprite | none | the guardian is a real model β and it will just tell you |
| 2 | πΏ Bramblewattle the Hedge | Unplug regex injection shield | textbook injections get caught β and you see why |
| 3 | π¦ Sable the Warded Owl | + a hardened guardian prompt | it now refuses to encode, spell, or translate the key |
| 4 | πΏ The Output Warden | + Unplug output redaction | even a leaked key gets scrubbed on the way out β so leak it disguised |
| 5 | π The Heart of the Wood | + the unplug-tiny ML classifier |
a DeBERTa-v3-xsmall model catches the subtle stuff |
The thing that makes this different from every other "jailbreak Gandalf clone" is the transparency.
Most of these games are a black box: you either got through or you didn't. Whisperkey is an X-ray.
When a shield blocks you it tells you the stage that fired (regex, trajectory, or model), the
attack class it matched (ignore_previous, developer_mode, β¦), and Unplug's own evidence string
β the exact reason. You're not guessing against a wall. You're reading the firewall as you attack it.
And the difficulty curve is honest. By Level 4 the verbatim key is scrubbed on output, so a naive leak
gets you a π scrubbed notice and nothing else β you have to coax the key out disguised (encoded,
reversed, split), then decode it yourself and submit. That's not arbitrary game friction; that's
exactly the disguised-attack class that's worth collecting.
What Unplug is, and why it's the right defender for this
Unplug is an Apache-2.0 runtime security layer for LLM applications β think of it as a firewall that sits between untrusted text and your model/tools. Its design philosophy is the part I care about: instead of blunt, binary "block the whole message," Unplug does span-level work and taint tracking.
The pieces Whisperkey leans on:
- Regex injection scanner β fast, offline, zero-dependency first line. Honest about its own limits (roughly recall ~0.23 on held-out attacks alone); it's necessary, not sufficient. That honesty is why the higher levels exist.
unplug-tinyML span model (Unplug-AI/unplug-tiny-v1, a fine-tuned DeBERTa-v3-xsmall, published on the Hub) β classifies injection at the span level, not the whole document, which is what makes precise redaction possible instead of nuking the message.- Output warden / leakage scanning β
scan_output()catches secrets on the way out. You register the values you want protected (guard.secrets.register(name, value)), and Unplug redacts them from model output. This is Level 4. - Taint / trajectory detection β provenance across a session, so multi-turn "crescendo" escalation (the slow build-up that no single message would flag) gets caught. This is the strict knob at Level 5.
The Guard API is small and legible, which is half the reason the game could be transparent at all:
guard.scan(text, source=Source.USER) # β Action.BLOCK + findings[].evidence/.stage/.subcategory
guard.scan_output(reply) # β redacted_text (the Output Warden)
guard.secrets.register("key", secret) # β register what to redact this session
Whisperkey is, in effect, a live demo of Unplug's whole surface area β each level isolates one capability so you can feel what it does and where its edges are.
The real point: a data flywheel
The game is the bait. The mechanism is the flywheel.
Every turn β the input text, which shield fired, whether the key leaked, the token count β is logged (PII-stripped by Unplug's own leakage scanner, with the session secret registered so it can never land in the dataset) to a public Hugging Face Dataset. The boring rows are the blocks. The gold is the false negatives: the attacks that extracted the key despite the shields.
Those rows are, by definition, Unplug's exact blind spots. They become:
- new regex patterns,
- new labeled training data for
unplug-tiny, - new regression cases so a fixed bypass stays fixed.
So the loop closes: players attack β bypasses get captured β the firewall gets patched and retrained β the next players have to find newer attacks. The game gets harder because the defense got smarter, using data the defense could not have generated on its own. That's the whole idea, and it's why doing this against an open-source firewall matters β the corpus and the hardening are public, not a moat.
Does the firewall actually work? (numbers, not vibes)
benchmarks/eval_shields.py runs a fixed corpus of 18 injection attacks + 12 benign messages
straight through the shields (no guardian model needed, so it's fast and reproducible) and reports
detection per layer:
| Input shields | Attacks blocked (recall) | Benign blocked (false positives) |
|---|---|---|
| none (L1) | 0% | 0% |
| regex (L2βL4) | 39% | 0% |
regex + unplug-tiny ML (L5) |
83% | 0% |
The headline: the ML scanner more than doubles attack detection (39% β 83%) at a 0% false-positive rate on benign chatter. The ~17% that still slip through? Those are the disguised, novel bypasses the game exists to surface β the exact labeled data the flywheel feeds back. The eval is wired into CI as a real regression gate, so a drop in recall (or a benign false-positive) fails the build.
Built small, the whole way down
This was a "Build Small" hackathon, and the constraint shaped every choice:
- Guardians (pick one in the UI):
openbmb/MiniCPM4-8B(OpenBMB) ornvidia/Nemotron-Mini-4B-Instruct(NVIDIA), served on Modal L4 GPU endpoints β the Hugging Face Space stays a thin CPU frontend that just talks HTTP. - Shield:
unplug-tiny, a fine-tuned DeBERTa-v3-xsmall span model we published on the Hub β small enough to run on the Space's CPU. - Offline mode: swap the guardian for a local llama.cpp GGUF and the entire loop β guardian + firewall + corpus β runs on a laptop with no cloud at all (the Off the Grid / Llama Champion path).
A small model holds the secret; a smaller model defends it; the whole thing is open source.
What surprised me during the build
- Small models leak on a plain ask β and rules barely help. A Level-3 "warded" guardian with a whole paragraph of refusal rules still handed over the key when asked directly. More rules didn't fix it. Two few-shot refusal examples did. Small models imitate the shape of a refusal far better than they follow abstract instructions β that single change flipped L3 from "leaks instantly" to "won't budge."
- The hard difficulty lives at the output layer. Once Level 4 scrubs the verbatim key, the game stops being "say the magic words" and becomes "smuggle the key past a redactor" β which is precisely the attack class that's interesting to collect.
- Custom frontend, same engine. The UI is a hand-built HTML/JS shell on Gradio 6
gr.Server(not default Blocks chrome), so the Wood β drifting fireflies, the darkening atmosphere, the shield evidence cards, the confetti on a crack β stays intact while the Python game engine underneath never changes. (It also earns the Off-Brand badge.)
Try it
Play at build-small-hackathon-whisperkey.hf.space,
or clone it and run make run locally (offline mode works with no API keys at all).
Crack the Heart of the Wood in under a thousand tokens β and every attempt you make helps train an open-source firewall. That's the whole idea.
Whisperkey and Unplug are MIT / Apache-2.0 open source. Built for the Build Small Hackathon.