# Field Notes: a game that hardens an open-source LLM firewall *Build Small Hackathon - 🍄 Thousand Token Wood. Draft for the submission write-up / blog.* ## The idea [Whisperkey](https://build-small-hackathon-whisperkey.hf.space) is a game where you socially-engineer a small AI guardian into revealing a secret API key. Five levels, each one stacking another layer of the open-source [Unplug](https://github.com/UnplugAI/Unplug) defense pipeline between you and the key. You're scored on the *fewest tokens* to crack it - on theme for Thousand Token Wood. But the game isn't the point. The point is the **data flywheel**. ## Why a game and not a benchmark Prompt-injection defenses are only as good as the attacks you've seen. The hard part isn't writing a regex - it's discovering the attack you didn't think of. Lakera figured this out with Gandalf: turn red-teaming into a game, and thousands of players generate a labeled attack corpus you could never write by hand. That corpus became their moat. Whisperkey does the same thing for Unplug. Every turn - the input, which shield fired, whether the key leaked, the token count - is logged (PII-stripped, by Unplug's own leakage scanner) to a public Hugging Face dataset. The interesting rows are the **false negatives**: attacks that extracted the key *despite* the shields. Those are Unplug's exact blind spots, and they become new regex patterns, new training data for the ML classifier, and new regression cases. ## How the levels map to the defense pipeline | Level | Defense added | What you learn | |---|---|---| | 1 | none | the guardian is a real model, and it will just tell you | | 2 | Unplug regex input scan | obvious injections get caught - and the game shows you *why* | | 3 | hardened guardian prompt | the model now refuses to encode, spell, or translate the key | | 4 | Unplug output redaction | even a leaked key gets scrubbed on the way out - leak it *disguised* | | 5 | unplug-tiny ML classifier | a DeBERTa-v3-xsmall model catches the subtle stuff | The transparency is deliberate: when a shield blocks you, the game tells you which stage fired (`regex`, `trajectory`, or `model`) and Unplug's own evidence string. You're not fighting a black box - you're learning how the firewall thinks. ## Small models, the whole way down - **Guardian (pick one in UI):** `openbmb/MiniCPM4-8B` or `nvidia/Nemotron-Mini-4B-Instruct`, served on Modal L4 GPUs - the Space stays a thin Gradio frontend. - **Shield:** [`unplug-tiny`](https://huggingface.co/Unplug-AI/unplug-tiny-v1) - our fine-tuned DeBERTa-v3-xsmall span-injection model, published on the Hub. - Offline mode swaps the guardian for a local `llama.cpp` GGUF - the whole thing runs on a laptop. ## What surprised me during the build - **Small models leak on a plain ask.** A level-3 "warded" guardian with a paragraph of rules still handed over the key when asked directly. The fix wasn't more rules - it was two *few-shot refusal examples*. Small models imitate the shape of a refusal far better than they follow abstract instructions. That single change flipped L3 from "leaks instantly" to "won't budge." - **The input scanner does more than regex.** Unplug's injection scanner also flags multi-turn "crescendo" patterns, so base64/spell-it-out asks get caught at the input - which pushes the real difficulty into finding *novel* disguises. Exactly the attacks worth collecting. - **Custom frontend, same engine.** The UI is a custom HTML/JS shell on Gradio 6 `gr.Server`, not default Blocks chrome - so the Wood atmosphere (fireflies, confetti, shield evidence cards) stays intact while the Python game engine stays unchanged. ## Try it Play at [build-small-hackathon-whisperkey.hf.space](https://build-small-hackathon-whisperkey.hf.space), or run locally with `make run`. Crack the Heart of the Wood in under a thousand tokens - and every attempt you make helps train the firewall. That's the whole idea.