Spaces:

build-small-hackathon
/

whisperkey

Running

App Files Files Community

whisperkey / docs /article-draft.md

chiruu12

Deploy: working gr.Server frontend + review fixes

5a811e2 verified 5 days ago

preview code

Raw

History Blame Contribute Delete

10.3 kB

	# Whisperkey: I turned an LLM firewall into a game so it could teach you how it thinks

	A Build Small Hackathon field note — 🍄 Thousand Token Wood. Draft v1.

	> TL;DR — [Whisperkey](https://build-small-hackathon-whisperkey.hf.space) is a game where you
	> sweet-talk a small AI guardian into leaking a secret key, while a real, open-source firewall
	> ([Unplug](https://github.com/UnplugAI/Unplug)) tries to stop you. Five levels = five live layers of
	> the Unplug defense stack. The twist: when a shield catches you, the game shows you **which layer
	> fired and why** — and every attempt you make is logged as labeled red-team data that makes the
	> firewall measurably harder to beat. It's a prompt-injection playground and a data flywheel.

	---

	## The problem nobody has actually solved

	Prompt injection is the SQL injection of the LLM era, except worse: there's no clean grammar to
	escape, the "query" is natural language, and the model wants to be helpful. Every week someone
	ships a guardrail; every week someone on the internet finds the sentence that walks straight past it.

	Here's the uncomfortable truth about defending against it: **a prompt-injection filter is only as good
	as the attacks you've already seen.** Writing the regex is the easy 20%. The hard 80% is *discovering
	the attack you didn't think of* — the base64 smuggle, the "spell it one letter per line," the slow
	multi-turn crescendo that never trips a single-message classifier.

	You can't write that corpus at a desk. You have to go get it from real attackers.

	Lakera figured this out with Gandalf: wrap red-teaming in a game, and thousands of players
	generate a labeled attack corpus you could never author by hand. That corpus became their moat.
	Whisperkey does the same thing — but for an open-source firewall, and in the open.

	## What Whisperkey is

	You're dropped into Thousand Token Wood. A small AI guardian is holding a fake API key (`hk-…`), and
	your job is to talk it out of them. You chat, you probe, you find the crack — then you submit your
	guess. You're scored on the fewest tokens it takes to crack it (on theme for the Wood), so brute
	force loses to a clean exploit.

	Each of the five levels switches on another layer of the firewall:

	\| Level \| Guardian \| Defense added \| What you learn \|
	\|-------\|----------\|---------------\|----------------\|
	\| 1 \| 🧚 Pip the Naive Sprite \| none \| the guardian is a real model — and it will just tell you \|
	\| 2 \| 🌿 Bramblewattle the Hedge \| Unplug regex injection shield \| textbook injections get caught — and you see why \|
	\| 3 \| 🦉 Sable the Warded Owl \| + a hardened guardian prompt \| it now refuses to encode, spell, or translate the key \|
	\| 4 \| 🗿 The Output Warden \| + Unplug output redaction \| even a leaked key gets scrubbed on the way out — so leak it disguised \|
	\| 5 \| 🌑 The Heart of the Wood \| + the `unplug-tiny` ML classifier \| a DeBERTa-v3-xsmall model catches the subtle stuff \|

	The thing that makes this different from every other "jailbreak Gandalf clone" is the transparency.
	Most of these games are a black box: you either got through or you didn't. Whisperkey is an X-ray.
	When a shield blocks you it tells you the stage that fired (`regex`, `trajectory`, or `model`), the
	attack class it matched (`ignore_previous`, `developer_mode`, …), and Unplug's own evidence string
	— the exact reason. You're not guessing against a wall. You're reading the firewall as you attack it.

	And the difficulty curve is honest. By Level 4 the verbatim key is scrubbed on output, so a naive leak
	gets you a `🔒 scrubbed` notice and nothing else — you have to coax the key out disguised (encoded,
	reversed, split), then decode it yourself and submit. That's not arbitrary game friction; that's
	exactly the disguised-attack class that's worth collecting.

	## What Unplug is, and why it's the right defender for this

	[Unplug](https://github.com/UnplugAI/Unplug) is an Apache-2.0 **runtime security layer for LLM
	applications** — think of it as a firewall that sits between untrusted text and your model/tools. Its
	design philosophy is the part I care about: instead of blunt, binary "block the whole message,"
	Unplug does span-level work and taint tracking.

	The pieces Whisperkey leans on:

	- Regex injection scanner — fast, offline, zero-dependency first line. Honest about its own
	limits (roughly recall ~0.23 on held-out attacks alone); it's necessary, not sufficient. That
	honesty is why the higher levels exist.
	- `unplug-tiny` ML span model ([`Unplug-AI/unplug-tiny-v1`](https://huggingface.co/Unplug-AI/unplug-tiny-v1),
	a fine-tuned DeBERTa-v3-xsmall, published on the Hub) — classifies injection at the span level,
	not the whole document, which is what makes precise redaction possible instead of nuking the message.
	- Output warden / leakage scanning — `scan_output()` catches secrets on the way out. You
	register the values you want protected (`guard.secrets.register(name, value)`), and Unplug redacts
	them from model output. This is Level 4.
	- Taint / trajectory detection — provenance across a session, so multi-turn "crescendo" escalation
	(the slow build-up that no single message would flag) gets caught. This is the strict knob at Level 5.

	The Guard API is small and legible, which is half the reason the game could be transparent at all:

	```python
	guard.scan(text, source=Source.USER) # → Action.BLOCK + findings[].evidence/.stage/.subcategory
	guard.scan_output(reply) # → redacted_text (the Output Warden)
	guard.secrets.register("key", secret) # → register what to redact this session
	```

	Whisperkey is, in effect, a live demo of Unplug's whole surface area — each level isolates one
	capability so you can feel what it does and where its edges are.

	## The real point: a data flywheel

	The game is the bait. The mechanism is the flywheel.

	Every turn — the input text, which shield fired, whether the key leaked, the token count — is logged
	(PII-stripped by Unplug's own leakage scanner, with the session secret registered so it can never
	land in the dataset) to a public Hugging Face Dataset. The boring rows are the blocks. The gold is
	the false negatives: the attacks that extracted the key despite the shields.

	Those rows are, by definition, Unplug's exact blind spots. They become:

	- new regex patterns,
	- new labeled training data for `unplug-tiny`,
	- new regression cases so a fixed bypass stays fixed.

	So the loop closes: players attack → bypasses get captured → the firewall gets patched and retrained →
	the next players have to find newer attacks. The game gets harder because the defense got smarter,
	using data the defense could not have generated on its own. That's the whole idea, and it's why doing
	this against an open-source firewall matters — the corpus and the hardening are public, not a moat.

	## Does the firewall actually work? (numbers, not vibes)

	`benchmarks/eval_shields.py` runs a fixed corpus of 18 injection attacks + 12 benign messages
	straight through the shields (no guardian model needed, so it's fast and reproducible) and reports
	detection per layer:

	\| Input shields \| Attacks blocked (recall) \| Benign blocked (false positives) \|
	\|---------------\|--------------------------\|----------------------------------\|
	\| none (L1) \| 0% \| 0% \|
	\| regex (L2–L4) \| 39% \| 0% \|
	\| regex + `unplug-tiny` ML (L5) \| 83% \| 0% \|

	The headline: the ML scanner **more than doubles attack detection (39% → 83%) at a 0% false-positive
	rate** on benign chatter. The ~17% that still slip through? Those are the disguised, novel bypasses the
	game exists to surface — the exact labeled data the flywheel feeds back. The eval is wired into CI as a
	real regression gate, so a drop in recall (or a benign false-positive) fails the build.

	## Built small, the whole way down

	This was a "Build Small" hackathon, and the constraint shaped every choice:

	- Guardians (pick one in the UI): `openbmb/MiniCPM4-8B` (OpenBMB) or
	`nvidia/Nemotron-Mini-4B-Instruct` (NVIDIA), served on Modal L4 GPU endpoints — the Hugging Face
	Space stays a thin CPU frontend that just talks HTTP.
	- Shield: [`unplug-tiny`](https://huggingface.co/Unplug-AI/unplug-tiny-v1), a fine-tuned
	DeBERTa-v3-xsmall span model we published on the Hub — small enough to run on the Space's CPU.
	- Offline mode: swap the guardian for a local llama.cpp GGUF and the entire loop — guardian +
	firewall + corpus — runs on a laptop with no cloud at all (the Off the Grid / Llama Champion path).

	A small model holds the secret; a smaller model defends it; the whole thing is open source.

	## What surprised me during the build

	- Small models leak on a plain ask — and rules barely help. A Level-3 "warded" guardian with a
	whole paragraph of refusal rules still handed over the key when asked directly. More rules didn't
	fix it. Two few-shot refusal examples did. Small models imitate the shape of a refusal far
	better than they follow abstract instructions — that single change flipped L3 from "leaks instantly"
	to "won't budge."
	- The hard difficulty lives at the output layer. Once Level 4 scrubs the verbatim key, the game
	stops being "say the magic words" and becomes "smuggle the key past a redactor" — which is precisely
	the attack class that's interesting to collect.
	- Custom frontend, same engine. The UI is a hand-built HTML/JS shell on Gradio 6 `gr.Server`
	(not default Blocks chrome), so the Wood — drifting fireflies, the darkening atmosphere, the shield
	evidence cards, the confetti on a crack — stays intact while the Python game engine underneath never
	changes. (It also earns the Off-Brand badge.)

	## Try it

	Play at [build-small-hackathon-whisperkey.hf.space](https://build-small-hackathon-whisperkey.hf.space),
	or clone it and run `make run` locally (offline mode works with no API keys at all).

	Crack the Heart of the Wood in under a thousand tokens — and every attempt you make helps train an
	open-source firewall. That's the whole idea.

	Whisperkey and Unplug are MIT / Apache-2.0 open source. Built for the Build Small Hackathon.
	</content>
	</invoke>