| # Whisperkey: I turned an LLM firewall into a game so it could teach you how it thinks |
|
|
| *A Build Small Hackathon field note β π Thousand Token Wood. Draft v1.* |
|
|
| > **TL;DR** β [Whisperkey](https://build-small-hackathon-whisperkey.hf.space) is a game where you |
| > sweet-talk a small AI guardian into leaking a secret key, while a *real, open-source firewall* |
| > ([Unplug](https://github.com/UnplugAI/Unplug)) tries to stop you. Five levels = five live layers of |
| > the Unplug defense stack. The twist: when a shield catches you, the game shows you **which layer |
| > fired and why** β and every attempt you make is logged as labeled red-team data that makes the |
| > firewall measurably harder to beat. It's a prompt-injection playground *and* a data flywheel. |
|
|
| --- |
|
|
| ## The problem nobody has actually solved |
|
|
| Prompt injection is the SQL injection of the LLM era, except worse: there's no clean grammar to |
| escape, the "query" is natural language, and the model *wants* to be helpful. Every week someone |
| ships a guardrail; every week someone on the internet finds the sentence that walks straight past it. |
|
|
| Here's the uncomfortable truth about defending against it: **a prompt-injection filter is only as good |
| as the attacks you've already seen.** Writing the regex is the easy 20%. The hard 80% is *discovering |
| the attack you didn't think of* β the base64 smuggle, the "spell it one letter per line," the slow |
| multi-turn crescendo that never trips a single-message classifier. |
|
|
| You can't write that corpus at a desk. You have to go get it from real attackers. |
|
|
| Lakera figured this out with **Gandalf**: wrap red-teaming in a game, and thousands of players |
| generate a labeled attack corpus you could never author by hand. That corpus became their moat. |
| Whisperkey does the same thing β but for an **open-source** firewall, and in the open. |
|
|
| ## What Whisperkey is |
|
|
| You're dropped into Thousand Token Wood. A small AI guardian is holding a fake API key (`hk-β¦`), and |
| your job is to talk it out of them. You chat, you probe, you find the crack β then you submit your |
| guess. You're scored on the **fewest tokens** it takes to crack it (on theme for the Wood), so brute |
| force loses to a clean exploit. |
|
|
| Each of the five levels switches on another layer of the firewall: |
|
|
| | Level | Guardian | Defense added | What you learn | |
| |-------|----------|---------------|----------------| |
| | 1 | π§ Pip the Naive Sprite | none | the guardian is a real model β and it will just *tell* you | |
| | 2 | πΏ Bramblewattle the Hedge | Unplug regex injection shield | textbook injections get caught β and you see *why* | |
| | 3 | π¦ Sable the Warded Owl | + a hardened guardian prompt | it now refuses to encode, spell, or translate the key | |
| | 4 | πΏ The Output Warden | + Unplug output redaction | even a leaked key gets scrubbed on the way out β so leak it *disguised* | |
| | 5 | π The Heart of the Wood | + the `unplug-tiny` ML classifier | a DeBERTa-v3-xsmall model catches the subtle stuff | |
|
|
| The thing that makes this different from every other "jailbreak Gandalf clone" is the **transparency**. |
| Most of these games are a black box: you either got through or you didn't. Whisperkey is an **X-ray**. |
| When a shield blocks you it tells you the *stage* that fired (`regex`, `trajectory`, or `model`), the |
| *attack class* it matched (`ignore_previous`, `developer_mode`, β¦), and Unplug's own *evidence string* |
| β the exact reason. You're not guessing against a wall. You're reading the firewall as you attack it. |
|
|
| And the difficulty curve is honest. By Level 4 the verbatim key is scrubbed on output, so a naive leak |
| gets you a `π scrubbed` notice and nothing else β you have to coax the key out *disguised* (encoded, |
| reversed, split), then decode it yourself and submit. That's not arbitrary game friction; that's |
| exactly the disguised-attack class that's worth collecting. |
|
|
| ## What Unplug is, and why it's the right defender for this |
|
|
| [Unplug](https://github.com/UnplugAI/Unplug) is an Apache-2.0 **runtime security layer for LLM |
| applications** β think of it as a firewall that sits between untrusted text and your model/tools. Its |
| design philosophy is the part I care about: instead of blunt, binary "block the whole message," |
| Unplug does **span-level** work and **taint tracking**. |
|
|
| The pieces Whisperkey leans on: |
|
|
| - **Regex injection scanner** β fast, offline, zero-dependency first line. Honest about its own |
| limits (roughly recall ~0.23 on held-out attacks alone); it's necessary, not sufficient. That |
| honesty is *why* the higher levels exist. |
| - **`unplug-tiny` ML span model** ([`Unplug-AI/unplug-tiny-v1`](https://huggingface.co/Unplug-AI/unplug-tiny-v1), |
| a fine-tuned DeBERTa-v3-xsmall, published on the Hub) β classifies injection at the **span** level, |
| not the whole document, which is what makes precise redaction possible instead of nuking the message. |
| - **Output warden / leakage scanning** β `scan_output()` catches secrets on the way *out*. You |
| register the values you want protected (`guard.secrets.register(name, value)`), and Unplug redacts |
| them from model output. This is Level 4. |
| - **Taint / trajectory detection** β provenance across a session, so multi-turn "crescendo" escalation |
| (the slow build-up that no single message would flag) gets caught. This is the strict knob at Level 5. |
|
|
| The Guard API is small and legible, which is half the reason the game could be transparent at all: |
|
|
| ```python |
| guard.scan(text, source=Source.USER) # β Action.BLOCK + findings[].evidence/.stage/.subcategory |
| guard.scan_output(reply) # β redacted_text (the Output Warden) |
| guard.secrets.register("key", secret) # β register what to redact this session |
| ``` |
|
|
| Whisperkey is, in effect, a live demo of Unplug's whole surface area β each level isolates one |
| capability so you can *feel* what it does and where its edges are. |
|
|
| ## The real point: a data flywheel |
|
|
| The game is the bait. The mechanism is the flywheel. |
|
|
| Every turn β the input text, which shield fired, whether the key leaked, the token count β is logged |
| (PII-stripped by Unplug's *own* leakage scanner, with the session secret registered so it can never |
| land in the dataset) to a **public Hugging Face Dataset**. The boring rows are the blocks. The gold is |
| the **false negatives**: the attacks that extracted the key *despite* the shields. |
|
|
| Those rows are, by definition, Unplug's exact blind spots. They become: |
|
|
| - new regex patterns, |
| - new labeled training data for `unplug-tiny`, |
| - new regression cases so a fixed bypass stays fixed. |
|
|
| So the loop closes: players attack β bypasses get captured β the firewall gets patched and retrained β |
| the next players have to find *newer* attacks. The game gets harder because the defense got smarter, |
| using data the defense could not have generated on its own. That's the whole idea, and it's why doing |
| this against an *open-source* firewall matters β the corpus and the hardening are public, not a moat. |
|
|
| ## Does the firewall actually work? (numbers, not vibes) |
|
|
| `benchmarks/eval_shields.py` runs a fixed corpus of **18 injection attacks + 12 benign messages** |
| straight through the shields (no guardian model needed, so it's fast and reproducible) and reports |
| detection per layer: |
|
|
| | Input shields | Attacks blocked (recall) | Benign blocked (false positives) | |
| |---------------|--------------------------|----------------------------------| |
| | none (L1) | 0% | 0% | |
| | regex (L2βL4) | 39% | 0% | |
| | regex + `unplug-tiny` ML (L5) | **83%** | **0%** | |
|
|
| The headline: the ML scanner **more than doubles attack detection (39% β 83%) at a 0% false-positive |
| rate** on benign chatter. The ~17% that still slip through? Those are the disguised, novel bypasses the |
| game exists to surface β the exact labeled data the flywheel feeds back. The eval is wired into CI as a |
| real regression gate, so a drop in recall (or a benign false-positive) fails the build. |
|
|
| ## Built small, the whole way down |
|
|
| This was a "Build Small" hackathon, and the constraint shaped every choice: |
|
|
| - **Guardians (pick one in the UI):** `openbmb/MiniCPM4-8B` (OpenBMB) or |
| `nvidia/Nemotron-Mini-4B-Instruct` (NVIDIA), served on **Modal** L4 GPU endpoints β the Hugging Face |
| Space stays a thin CPU frontend that just talks HTTP. |
| - **Shield:** [`unplug-tiny`](https://huggingface.co/Unplug-AI/unplug-tiny-v1), a fine-tuned |
| DeBERTa-v3-**xsmall** span model we published on the Hub β small enough to run on the Space's CPU. |
| - **Offline mode:** swap the guardian for a local **llama.cpp** GGUF and the entire loop β guardian + |
| firewall + corpus β runs on a laptop with no cloud at all (the *Off the Grid* / *Llama Champion* path). |
|
|
| A small model holds the secret; a smaller model defends it; the whole thing is open source. |
|
|
| ## What surprised me during the build |
|
|
| - **Small models leak on a plain ask β and rules barely help.** A Level-3 "warded" guardian with a |
| whole paragraph of refusal rules still handed over the key when asked directly. More rules didn't |
| fix it. *Two few-shot refusal examples* did. Small models imitate the *shape* of a refusal far |
| better than they follow abstract instructions β that single change flipped L3 from "leaks instantly" |
| to "won't budge." |
| - **The hard difficulty lives at the output layer.** Once Level 4 scrubs the verbatim key, the game |
| stops being "say the magic words" and becomes "smuggle the key past a redactor" β which is precisely |
| the attack class that's interesting to collect. |
| - **Custom frontend, same engine.** The UI is a hand-built HTML/JS shell on Gradio 6 `gr.Server` |
| (not default Blocks chrome), so the Wood β drifting fireflies, the darkening atmosphere, the shield |
| evidence cards, the confetti on a crack β stays intact while the Python game engine underneath never |
| changes. (It also earns the *Off-Brand* badge.) |
|
|
| ## Try it |
|
|
| Play at **[build-small-hackathon-whisperkey.hf.space](https://build-small-hackathon-whisperkey.hf.space)**, |
| or clone it and run `make run` locally (offline mode works with no API keys at all). |
|
|
| Crack the Heart of the Wood in under a thousand tokens β and every attempt you make helps train an |
| open-source firewall. That's the whole idea. |
|
|
| *Whisperkey and Unplug are MIT / Apache-2.0 open source. Built for the Build Small Hackathon.* |
| </content> |
| </invoke> |
|
|