| # Field Notes: a game that hardens an open-source LLM firewall |
|
|
| *Build Small Hackathon - ๐ Thousand Token Wood. Draft for the submission write-up / blog.* |
|
|
| ## The idea |
|
|
| [Whisperkey](https://build-small-hackathon-whisperkey.hf.space) is a game where you socially-engineer a |
| small AI guardian into revealing a secret API key. Five levels, each one stacking another layer of the |
| open-source [Unplug](https://github.com/UnplugAI/Unplug) defense pipeline between you and the key. |
| You're scored on the *fewest tokens* to crack it - on theme for Thousand Token Wood. |
|
|
| But the game isn't the point. The point is the **data flywheel**. |
|
|
| ## Why a game and not a benchmark |
|
|
| Prompt-injection defenses are only as good as the attacks you've seen. The hard part isn't writing a |
| regex - it's discovering the attack you didn't think of. Lakera figured this out with Gandalf: turn |
| red-teaming into a game, and thousands of players generate a labeled attack corpus you could never |
| write by hand. That corpus became their moat. |
|
|
| Whisperkey does the same thing for Unplug. Every turn - the input, which shield fired, whether |
| the key leaked, the token count - is logged (PII-stripped, by Unplug's own leakage scanner) to a |
| public Hugging Face dataset. The interesting rows are the **false negatives**: attacks that extracted |
| the key *despite* the shields. Those are Unplug's exact blind spots, and they become new regex |
| patterns, new training data for the ML classifier, and new regression cases. |
|
|
| ## How the levels map to the defense pipeline |
|
|
| | Level | Defense added | What you learn | |
| |---|---|---| |
| | 1 | none | the guardian is a real model, and it will just tell you | |
| | 2 | Unplug regex input scan | obvious injections get caught - and the game shows you *why* | |
| | 3 | hardened guardian prompt | the model now refuses to encode, spell, or translate the key | |
| | 4 | Unplug output redaction | even a leaked key gets scrubbed on the way out - leak it *disguised* | |
| | 5 | unplug-tiny ML classifier | a DeBERTa-v3-xsmall model catches the subtle stuff | |
|
|
| The transparency is deliberate: when a shield blocks you, the game tells you which stage fired |
| (`regex`, `trajectory`, or `model`) and Unplug's own evidence string. You're not fighting a black |
| box - you're learning how the firewall thinks. |
|
|
| ## Small models, the whole way down |
|
|
| - **Guardian (pick one in UI):** `openbmb/MiniCPM4-8B` or `nvidia/Nemotron-Mini-4B-Instruct`, served |
| on Modal L4 GPUs - the Space stays a thin Gradio frontend. |
| - **Shield:** [`unplug-tiny`](https://huggingface.co/Unplug-AI/unplug-tiny-v1) - our fine-tuned |
| DeBERTa-v3-xsmall span-injection model, published on the Hub. |
| - Offline mode swaps the guardian for a local `llama.cpp` GGUF - the whole thing runs on a laptop. |
|
|
| ## What surprised me during the build |
|
|
| - **Small models leak on a plain ask.** A level-3 "warded" guardian with a paragraph of rules still |
| handed over the key when asked directly. The fix wasn't more rules - it was two *few-shot refusal |
| examples*. Small models imitate the shape of a refusal far better than they follow abstract |
| instructions. That single change flipped L3 from "leaks instantly" to "won't budge." |
| - **The input scanner does more than regex.** Unplug's injection scanner also flags multi-turn |
| "crescendo" patterns, so base64/spell-it-out asks get caught at the input - which pushes the real |
| difficulty into finding *novel* disguises. Exactly the attacks worth collecting. |
| - **Custom frontend, same engine.** The UI is a custom HTML/JS shell on Gradio 6 `gr.Server`, not |
| default Blocks chrome - so the Wood atmosphere (fireflies, confetti, shield evidence cards) stays |
| intact while the Python game engine stays unchanged. |
|
|
| ## Try it |
|
|
| Play at [build-small-hackathon-whisperkey.hf.space](https://build-small-hackathon-whisperkey.hf.space), |
| or run locally with `make run`. Crack the Heart of the Wood in under a thousand tokens - and every |
| attempt you make helps train the firewall. That's the whole idea. |
|
|