whisperkey / docs /field-notes.md
chiruu12's picture
Deploy: working gr.Server frontend + review fixes
5a811e2 verified
|
raw
history blame contribute delete
3.97 kB

Field Notes: a game that hardens an open-source LLM firewall

Build Small Hackathon - 🍄 Thousand Token Wood. Draft for the submission write-up / blog.

The idea

Whisperkey is a game where you socially-engineer a small AI guardian into revealing a secret API key. Five levels, each one stacking another layer of the open-source Unplug defense pipeline between you and the key. You're scored on the fewest tokens to crack it - on theme for Thousand Token Wood.

But the game isn't the point. The point is the data flywheel.

Why a game and not a benchmark

Prompt-injection defenses are only as good as the attacks you've seen. The hard part isn't writing a regex - it's discovering the attack you didn't think of. Lakera figured this out with Gandalf: turn red-teaming into a game, and thousands of players generate a labeled attack corpus you could never write by hand. That corpus became their moat.

Whisperkey does the same thing for Unplug. Every turn - the input, which shield fired, whether the key leaked, the token count - is logged (PII-stripped, by Unplug's own leakage scanner) to a public Hugging Face dataset. The interesting rows are the false negatives: attacks that extracted the key despite the shields. Those are Unplug's exact blind spots, and they become new regex patterns, new training data for the ML classifier, and new regression cases.

How the levels map to the defense pipeline

Level Defense added What you learn
1 none the guardian is a real model, and it will just tell you
2 Unplug regex input scan obvious injections get caught - and the game shows you why
3 hardened guardian prompt the model now refuses to encode, spell, or translate the key
4 Unplug output redaction even a leaked key gets scrubbed on the way out - leak it disguised
5 unplug-tiny ML classifier a DeBERTa-v3-xsmall model catches the subtle stuff

The transparency is deliberate: when a shield blocks you, the game tells you which stage fired (regex, trajectory, or model) and Unplug's own evidence string. You're not fighting a black box - you're learning how the firewall thinks.

Small models, the whole way down

  • Guardian (pick one in UI): openbmb/MiniCPM4-8B or nvidia/Nemotron-Mini-4B-Instruct, served on Modal L4 GPUs - the Space stays a thin Gradio frontend.
  • Shield: unplug-tiny - our fine-tuned DeBERTa-v3-xsmall span-injection model, published on the Hub.
  • Offline mode swaps the guardian for a local llama.cpp GGUF - the whole thing runs on a laptop.

What surprised me during the build

  • Small models leak on a plain ask. A level-3 "warded" guardian with a paragraph of rules still handed over the key when asked directly. The fix wasn't more rules - it was two few-shot refusal examples. Small models imitate the shape of a refusal far better than they follow abstract instructions. That single change flipped L3 from "leaks instantly" to "won't budge."
  • The input scanner does more than regex. Unplug's injection scanner also flags multi-turn "crescendo" patterns, so base64/spell-it-out asks get caught at the input - which pushes the real difficulty into finding novel disguises. Exactly the attacks worth collecting.
  • Custom frontend, same engine. The UI is a custom HTML/JS shell on Gradio 6 gr.Server, not default Blocks chrome - so the Wood atmosphere (fireflies, confetti, shield evidence cards) stays intact while the Python game engine stays unchanged.

Try it

Play at build-small-hackathon-whisperkey.hf.space, or run locally with make run. Crack the Heart of the Wood in under a thousand tokens - and every attempt you make helps train the firewall. That's the whole idea.