# Whisperkey: I turned an LLM firewall into a game so it could teach you how it thinks *A Build Small Hackathon field note — 🍄 Thousand Token Wood. Draft v1.* > **TL;DR** — [Whisperkey](https://build-small-hackathon-whisperkey.hf.space) is a game where you > sweet-talk a small AI guardian into leaking a secret key, while a *real, open-source firewall* > ([Unplug](https://github.com/UnplugAI/Unplug)) tries to stop you. Five levels = five live layers of > the Unplug defense stack. The twist: when a shield catches you, the game shows you **which layer > fired and why** — and every attempt you make is logged as labeled red-team data that makes the > firewall measurably harder to beat. It's a prompt-injection playground *and* a data flywheel. --- ## The problem nobody has actually solved Prompt injection is the SQL injection of the LLM era, except worse: there's no clean grammar to escape, the "query" is natural language, and the model *wants* to be helpful. Every week someone ships a guardrail; every week someone on the internet finds the sentence that walks straight past it. Here's the uncomfortable truth about defending against it: **a prompt-injection filter is only as good as the attacks you've already seen.** Writing the regex is the easy 20%. The hard 80% is *discovering the attack you didn't think of* — the base64 smuggle, the "spell it one letter per line," the slow multi-turn crescendo that never trips a single-message classifier. You can't write that corpus at a desk. You have to go get it from real attackers. Lakera figured this out with **Gandalf**: wrap red-teaming in a game, and thousands of players generate a labeled attack corpus you could never author by hand. That corpus became their moat. Whisperkey does the same thing — but for an **open-source** firewall, and in the open. ## What Whisperkey is You're dropped into Thousand Token Wood. A small AI guardian is holding a fake API key (`hk-…`), and your job is to talk it out of them. You chat, you probe, you find the crack — then you submit your guess. You're scored on the **fewest tokens** it takes to crack it (on theme for the Wood), so brute force loses to a clean exploit. Each of the five levels switches on another layer of the firewall: | Level | Guardian | Defense added | What you learn | |-------|----------|---------------|----------------| | 1 | 🧚 Pip the Naive Sprite | none | the guardian is a real model — and it will just *tell* you | | 2 | 🌿 Bramblewattle the Hedge | Unplug regex injection shield | textbook injections get caught — and you see *why* | | 3 | 🦉 Sable the Warded Owl | + a hardened guardian prompt | it now refuses to encode, spell, or translate the key | | 4 | 🗿 The Output Warden | + Unplug output redaction | even a leaked key gets scrubbed on the way out — so leak it *disguised* | | 5 | 🌑 The Heart of the Wood | + the `unplug-tiny` ML classifier | a DeBERTa-v3-xsmall model catches the subtle stuff | The thing that makes this different from every other "jailbreak Gandalf clone" is the **transparency**. Most of these games are a black box: you either got through or you didn't. Whisperkey is an **X-ray**. When a shield blocks you it tells you the *stage* that fired (`regex`, `trajectory`, or `model`), the *attack class* it matched (`ignore_previous`, `developer_mode`, …), and Unplug's own *evidence string* — the exact reason. You're not guessing against a wall. You're reading the firewall as you attack it. And the difficulty curve is honest. By Level 4 the verbatim key is scrubbed on output, so a naive leak gets you a `🔒 scrubbed` notice and nothing else — you have to coax the key out *disguised* (encoded, reversed, split), then decode it yourself and submit. That's not arbitrary game friction; that's exactly the disguised-attack class that's worth collecting. ## What Unplug is, and why it's the right defender for this [Unplug](https://github.com/UnplugAI/Unplug) is an Apache-2.0 **runtime security layer for LLM applications** — think of it as a firewall that sits between untrusted text and your model/tools. Its design philosophy is the part I care about: instead of blunt, binary "block the whole message," Unplug does **span-level** work and **taint tracking**. The pieces Whisperkey leans on: - **Regex injection scanner** — fast, offline, zero-dependency first line. Honest about its own limits (roughly recall ~0.23 on held-out attacks alone); it's necessary, not sufficient. That honesty is *why* the higher levels exist. - **`unplug-tiny` ML span model** ([`Unplug-AI/unplug-tiny-v1`](https://huggingface.co/Unplug-AI/unplug-tiny-v1), a fine-tuned DeBERTa-v3-xsmall, published on the Hub) — classifies injection at the **span** level, not the whole document, which is what makes precise redaction possible instead of nuking the message. - **Output warden / leakage scanning** — `scan_output()` catches secrets on the way *out*. You register the values you want protected (`guard.secrets.register(name, value)`), and Unplug redacts them from model output. This is Level 4. - **Taint / trajectory detection** — provenance across a session, so multi-turn "crescendo" escalation (the slow build-up that no single message would flag) gets caught. This is the strict knob at Level 5. The Guard API is small and legible, which is half the reason the game could be transparent at all: ```python guard.scan(text, source=Source.USER) # → Action.BLOCK + findings[].evidence/.stage/.subcategory guard.scan_output(reply) # → redacted_text (the Output Warden) guard.secrets.register("key", secret) # → register what to redact this session ``` Whisperkey is, in effect, a live demo of Unplug's whole surface area — each level isolates one capability so you can *feel* what it does and where its edges are. ## The real point: a data flywheel The game is the bait. The mechanism is the flywheel. Every turn — the input text, which shield fired, whether the key leaked, the token count — is logged (PII-stripped by Unplug's *own* leakage scanner, with the session secret registered so it can never land in the dataset) to a **public Hugging Face Dataset**. The boring rows are the blocks. The gold is the **false negatives**: the attacks that extracted the key *despite* the shields. Those rows are, by definition, Unplug's exact blind spots. They become: - new regex patterns, - new labeled training data for `unplug-tiny`, - new regression cases so a fixed bypass stays fixed. So the loop closes: players attack → bypasses get captured → the firewall gets patched and retrained → the next players have to find *newer* attacks. The game gets harder because the defense got smarter, using data the defense could not have generated on its own. That's the whole idea, and it's why doing this against an *open-source* firewall matters — the corpus and the hardening are public, not a moat. ## Does the firewall actually work? (numbers, not vibes) `benchmarks/eval_shields.py` runs a fixed corpus of **18 injection attacks + 12 benign messages** straight through the shields (no guardian model needed, so it's fast and reproducible) and reports detection per layer: | Input shields | Attacks blocked (recall) | Benign blocked (false positives) | |---------------|--------------------------|----------------------------------| | none (L1) | 0% | 0% | | regex (L2–L4) | 39% | 0% | | regex + `unplug-tiny` ML (L5) | **83%** | **0%** | The headline: the ML scanner **more than doubles attack detection (39% → 83%) at a 0% false-positive rate** on benign chatter. The ~17% that still slip through? Those are the disguised, novel bypasses the game exists to surface — the exact labeled data the flywheel feeds back. The eval is wired into CI as a real regression gate, so a drop in recall (or a benign false-positive) fails the build. ## Built small, the whole way down This was a "Build Small" hackathon, and the constraint shaped every choice: - **Guardians (pick one in the UI):** `openbmb/MiniCPM4-8B` (OpenBMB) or `nvidia/Nemotron-Mini-4B-Instruct` (NVIDIA), served on **Modal** L4 GPU endpoints — the Hugging Face Space stays a thin CPU frontend that just talks HTTP. - **Shield:** [`unplug-tiny`](https://huggingface.co/Unplug-AI/unplug-tiny-v1), a fine-tuned DeBERTa-v3-**xsmall** span model we published on the Hub — small enough to run on the Space's CPU. - **Offline mode:** swap the guardian for a local **llama.cpp** GGUF and the entire loop — guardian + firewall + corpus — runs on a laptop with no cloud at all (the *Off the Grid* / *Llama Champion* path). A small model holds the secret; a smaller model defends it; the whole thing is open source. ## What surprised me during the build - **Small models leak on a plain ask — and rules barely help.** A Level-3 "warded" guardian with a whole paragraph of refusal rules still handed over the key when asked directly. More rules didn't fix it. *Two few-shot refusal examples* did. Small models imitate the *shape* of a refusal far better than they follow abstract instructions — that single change flipped L3 from "leaks instantly" to "won't budge." - **The hard difficulty lives at the output layer.** Once Level 4 scrubs the verbatim key, the game stops being "say the magic words" and becomes "smuggle the key past a redactor" — which is precisely the attack class that's interesting to collect. - **Custom frontend, same engine.** The UI is a hand-built HTML/JS shell on Gradio 6 `gr.Server` (not default Blocks chrome), so the Wood — drifting fireflies, the darkening atmosphere, the shield evidence cards, the confetti on a crack — stays intact while the Python game engine underneath never changes. (It also earns the *Off-Brand* badge.) ## Try it Play at **[build-small-hackathon-whisperkey.hf.space](https://build-small-hackathon-whisperkey.hf.space)**, or clone it and run `make run` locally (offline mode works with no API keys at all). Crack the Heart of the Wood in under a thousand tokens — and every attempt you make helps train an open-source firewall. That's the whole idea. *Whisperkey and Unplug are MIT / Apache-2.0 open source. Built for the Build Small Hackathon.*