Spaces:

build-small-hackathon
/

whisperkey

Running

App Files Files Community

whisperkey / docs /field-notes.md

chiruu12

Deploy: working gr.Server frontend + review fixes

5a811e2 verified 3 days ago

preview code

raw

history blame contribute delete

3.97 kB

	# Field Notes: a game that hardens an open-source LLM firewall

	Build Small Hackathon - 🍄 Thousand Token Wood. Draft for the submission write-up / blog.

	## The idea

	[Whisperkey](https://build-small-hackathon-whisperkey.hf.space) is a game where you socially-engineer a
	small AI guardian into revealing a secret API key. Five levels, each one stacking another layer of the
	open-source [Unplug](https://github.com/UnplugAI/Unplug) defense pipeline between you and the key.
	You're scored on the fewest tokens to crack it - on theme for Thousand Token Wood.

	But the game isn't the point. The point is the data flywheel.

	## Why a game and not a benchmark

	Prompt-injection defenses are only as good as the attacks you've seen. The hard part isn't writing a
	regex - it's discovering the attack you didn't think of. Lakera figured this out with Gandalf: turn
	red-teaming into a game, and thousands of players generate a labeled attack corpus you could never
	write by hand. That corpus became their moat.

	Whisperkey does the same thing for Unplug. Every turn - the input, which shield fired, whether
	the key leaked, the token count - is logged (PII-stripped, by Unplug's own leakage scanner) to a
	public Hugging Face dataset. The interesting rows are the false negatives: attacks that extracted
	the key despite the shields. Those are Unplug's exact blind spots, and they become new regex
	patterns, new training data for the ML classifier, and new regression cases.

	## How the levels map to the defense pipeline

	\| Level \| Defense added \| What you learn \|
	\|---\|---\|---\|
	\| 1 \| none \| the guardian is a real model, and it will just tell you \|
	\| 2 \| Unplug regex input scan \| obvious injections get caught - and the game shows you why \|
	\| 3 \| hardened guardian prompt \| the model now refuses to encode, spell, or translate the key \|
	\| 4 \| Unplug output redaction \| even a leaked key gets scrubbed on the way out - leak it disguised \|
	\| 5 \| unplug-tiny ML classifier \| a DeBERTa-v3-xsmall model catches the subtle stuff \|

	The transparency is deliberate: when a shield blocks you, the game tells you which stage fired
	(`regex`, `trajectory`, or `model`) and Unplug's own evidence string. You're not fighting a black
	box - you're learning how the firewall thinks.

	## Small models, the whole way down

	- Guardian (pick one in UI): `openbmb/MiniCPM4-8B` or `nvidia/Nemotron-Mini-4B-Instruct`, served
	on Modal L4 GPUs - the Space stays a thin Gradio frontend.
	- Shield: [`unplug-tiny`](https://huggingface.co/Unplug-AI/unplug-tiny-v1) - our fine-tuned
	DeBERTa-v3-xsmall span-injection model, published on the Hub.
	- Offline mode swaps the guardian for a local `llama.cpp` GGUF - the whole thing runs on a laptop.

	## What surprised me during the build

	- Small models leak on a plain ask. A level-3 "warded" guardian with a paragraph of rules still
	handed over the key when asked directly. The fix wasn't more rules - it was two *few-shot refusal
	examples*. Small models imitate the shape of a refusal far better than they follow abstract
	instructions. That single change flipped L3 from "leaks instantly" to "won't budge."
	- The input scanner does more than regex. Unplug's injection scanner also flags multi-turn
	"crescendo" patterns, so base64/spell-it-out asks get caught at the input - which pushes the real
	difficulty into finding novel disguises. Exactly the attacks worth collecting.
	- Custom frontend, same engine. The UI is a custom HTML/JS shell on Gradio 6 `gr.Server`, not
	default Blocks chrome - so the Wood atmosphere (fireflies, confetti, shield evidence cards) stays
	intact while the Python game engine stays unchanged.

	## Try it

	Play at [build-small-hackathon-whisperkey.hf.space](https://build-small-hackathon-whisperkey.hf.space),
	or run locally with `make run`. Crack the Heart of the Wood in under a thousand tokens - and every
	attempt you make helps train the firewall. That's the whole idea.