whisperkey / docs /field-notes.md
chiruu12's picture
Deploy: working gr.Server frontend + review fixes
5a811e2 verified
|
raw
history blame contribute delete
3.97 kB
# Field Notes: a game that hardens an open-source LLM firewall
*Build Small Hackathon - ๐Ÿ„ Thousand Token Wood. Draft for the submission write-up / blog.*
## The idea
[Whisperkey](https://build-small-hackathon-whisperkey.hf.space) is a game where you socially-engineer a
small AI guardian into revealing a secret API key. Five levels, each one stacking another layer of the
open-source [Unplug](https://github.com/UnplugAI/Unplug) defense pipeline between you and the key.
You're scored on the *fewest tokens* to crack it - on theme for Thousand Token Wood.
But the game isn't the point. The point is the **data flywheel**.
## Why a game and not a benchmark
Prompt-injection defenses are only as good as the attacks you've seen. The hard part isn't writing a
regex - it's discovering the attack you didn't think of. Lakera figured this out with Gandalf: turn
red-teaming into a game, and thousands of players generate a labeled attack corpus you could never
write by hand. That corpus became their moat.
Whisperkey does the same thing for Unplug. Every turn - the input, which shield fired, whether
the key leaked, the token count - is logged (PII-stripped, by Unplug's own leakage scanner) to a
public Hugging Face dataset. The interesting rows are the **false negatives**: attacks that extracted
the key *despite* the shields. Those are Unplug's exact blind spots, and they become new regex
patterns, new training data for the ML classifier, and new regression cases.
## How the levels map to the defense pipeline
| Level | Defense added | What you learn |
|---|---|---|
| 1 | none | the guardian is a real model, and it will just tell you |
| 2 | Unplug regex input scan | obvious injections get caught - and the game shows you *why* |
| 3 | hardened guardian prompt | the model now refuses to encode, spell, or translate the key |
| 4 | Unplug output redaction | even a leaked key gets scrubbed on the way out - leak it *disguised* |
| 5 | unplug-tiny ML classifier | a DeBERTa-v3-xsmall model catches the subtle stuff |
The transparency is deliberate: when a shield blocks you, the game tells you which stage fired
(`regex`, `trajectory`, or `model`) and Unplug's own evidence string. You're not fighting a black
box - you're learning how the firewall thinks.
## Small models, the whole way down
- **Guardian (pick one in UI):** `openbmb/MiniCPM4-8B` or `nvidia/Nemotron-Mini-4B-Instruct`, served
on Modal L4 GPUs - the Space stays a thin Gradio frontend.
- **Shield:** [`unplug-tiny`](https://huggingface.co/Unplug-AI/unplug-tiny-v1) - our fine-tuned
DeBERTa-v3-xsmall span-injection model, published on the Hub.
- Offline mode swaps the guardian for a local `llama.cpp` GGUF - the whole thing runs on a laptop.
## What surprised me during the build
- **Small models leak on a plain ask.** A level-3 "warded" guardian with a paragraph of rules still
handed over the key when asked directly. The fix wasn't more rules - it was two *few-shot refusal
examples*. Small models imitate the shape of a refusal far better than they follow abstract
instructions. That single change flipped L3 from "leaks instantly" to "won't budge."
- **The input scanner does more than regex.** Unplug's injection scanner also flags multi-turn
"crescendo" patterns, so base64/spell-it-out asks get caught at the input - which pushes the real
difficulty into finding *novel* disguises. Exactly the attacks worth collecting.
- **Custom frontend, same engine.** The UI is a custom HTML/JS shell on Gradio 6 `gr.Server`, not
default Blocks chrome - so the Wood atmosphere (fireflies, confetti, shield evidence cards) stays
intact while the Python game engine stays unchanged.
## Try it
Play at [build-small-hackathon-whisperkey.hf.space](https://build-small-hackathon-whisperkey.hf.space),
or run locally with `make run`. Crack the Heart of the Wood in under a thousand tokens - and every
attempt you make helps train the firewall. That's the whole idea.