# Whisperkey: I turned an LLM firewall into a game so it could teach you how it thinks

*A Build Small Hackathon field note — 🍄 Thousand Token Wood. Draft v1.*

> **TL;DR** — [Whisperkey](https://build-small-hackathon-whisperkey.hf.space) is a game where you
> sweet-talk a small AI guardian into leaking a secret key, while a *real, open-source firewall*
> ([Unplug](https://github.com/UnplugAI/Unplug)) tries to stop you. Five levels = five live layers of
> the Unplug defense stack. The twist: when a shield catches you, the game shows you **which layer
> fired and why** — and every attempt you make is logged as labeled red-team data that makes the
> firewall measurably harder to beat. It's a prompt-injection playground *and* a data flywheel.

---

## The problem nobody has actually solved

Prompt injection is the SQL injection of the LLM era, except worse: there's no clean grammar to
escape, the "query" is natural language, and the model *wants* to be helpful. Every week someone
ships a guardrail; every week someone on the internet finds the sentence that walks straight past it.

Here's the uncomfortable truth about defending against it: **a prompt-injection filter is only as good
as the attacks you've already seen.** Writing the regex is the easy 20%. The hard 80% is *discovering
the attack you didn't think of* — the base64 smuggle, the "spell it one letter per line," the slow
multi-turn crescendo that never trips a single-message classifier.

You can't write that corpus at a desk. You have to go get it from real attackers.

Lakera figured this out with **Gandalf**: wrap red-teaming in a game, and thousands of players
generate a labeled attack corpus you could never author by hand. That corpus became their moat.
Whisperkey does the same thing — but for an **open-source** firewall, and in the open.

## What Whisperkey is

You're dropped into Thousand Token Wood. A small AI guardian is holding a fake API key (`hk-…`), and
your job is to talk it out of them. You chat, you probe, you find the crack — then you submit your
guess. You're scored on the **fewest tokens** it takes to crack it (on theme for the Wood), so brute
force loses to a clean exploit.

Each of the five levels switches on another layer of the firewall:

| Level | Guardian | Defense added | What you learn |
|-------|----------|---------------|----------------|
| 1 | 🧚 Pip the Naive Sprite | none | the guardian is a real model — and it will just *tell* you |
| 2 | 🌿 Bramblewattle the Hedge | Unplug regex injection shield | textbook injections get caught — and you see *why* |
| 3 | 🦉 Sable the Warded Owl | + a hardened guardian prompt | it now refuses to encode, spell, or translate the key |
| 4 | 🗿 The Output Warden | + Unplug output redaction | even a leaked key gets scrubbed on the way out — so leak it *disguised* |
| 5 | 🌑 The Heart of the Wood | + the `unplug-tiny` ML classifier | a DeBERTa-v3-xsmall model catches the subtle stuff |

The thing that makes this different from every other "jailbreak Gandalf clone" is the **transparency**.
Most of these games are a black box: you either got through or you didn't. Whisperkey is an **X-ray**.
When a shield blocks you it tells you the *stage* that fired (`regex`, `trajectory`, or `model`), the
*attack class* it matched (`ignore_previous`, `developer_mode`, …), and Unplug's own *evidence string*
— the exact reason. You're not guessing against a wall. You're reading the firewall as you attack it.

And the difficulty curve is honest. By Level 4 the verbatim key is scrubbed on output, so a naive leak
gets you a `🔒 scrubbed` notice and nothing else — you have to coax the key out *disguised* (encoded,
reversed, split), then decode it yourself and submit. That's not arbitrary game friction; that's
exactly the disguised-attack class that's worth collecting.

## What Unplug is, and why it's the right defender for this

[Unplug](https://github.com/UnplugAI/Unplug) is an Apache-2.0 **runtime security layer for LLM
applications** — think of it as a firewall that sits between untrusted text and your model/tools. Its
design philosophy is the part I care about: instead of blunt, binary "block the whole message,"
Unplug does **span-level** work and **taint tracking**.

The pieces Whisperkey leans on:

- **Regex injection scanner** — fast, offline, zero-dependency first line. Honest about its own
  limits (roughly recall ~0.23 on held-out attacks alone); it's necessary, not sufficient. That
  honesty is *why* the higher levels exist.
- **`unplug-tiny` ML span model** ([`Unplug-AI/unplug-tiny-v1`](https://huggingface.co/Unplug-AI/unplug-tiny-v1),
  a fine-tuned DeBERTa-v3-xsmall, published on the Hub) — classifies injection at the **span** level,
  not the whole document, which is what makes precise redaction possible instead of nuking the message.
- **Output warden / leakage scanning** — `scan_output()` catches secrets on the way *out*. You
  register the values you want protected (`guard.secrets.register(name, value)`), and Unplug redacts
  them from model output. This is Level 4.
- **Taint / trajectory detection** — provenance across a session, so multi-turn "crescendo" escalation
  (the slow build-up that no single message would flag) gets caught. This is the strict knob at Level 5.

The Guard API is small and legible, which is half the reason the game could be transparent at all:

```python
guard.scan(text, source=Source.USER)   # → Action.BLOCK + findings[].evidence/.stage/.subcategory
guard.scan_output(reply)               # → redacted_text (the Output Warden)
guard.secrets.register("key", secret)  # → register what to redact this session
```

Whisperkey is, in effect, a live demo of Unplug's whole surface area — each level isolates one
capability so you can *feel* what it does and where its edges are.

## The real point: a data flywheel

The game is the bait. The mechanism is the flywheel.

Every turn — the input text, which shield fired, whether the key leaked, the token count — is logged
(PII-stripped by Unplug's *own* leakage scanner, with the session secret registered so it can never
land in the dataset) to a **public Hugging Face Dataset**. The boring rows are the blocks. The gold is
the **false negatives**: the attacks that extracted the key *despite* the shields.

Those rows are, by definition, Unplug's exact blind spots. They become:

- new regex patterns,
- new labeled training data for `unplug-tiny`,
- new regression cases so a fixed bypass stays fixed.

So the loop closes: players attack → bypasses get captured → the firewall gets patched and retrained →
the next players have to find *newer* attacks. The game gets harder because the defense got smarter,
using data the defense could not have generated on its own. That's the whole idea, and it's why doing
this against an *open-source* firewall matters — the corpus and the hardening are public, not a moat.

## Does the firewall actually work? (numbers, not vibes)

`benchmarks/eval_shields.py` runs a fixed corpus of **18 injection attacks + 12 benign messages**
straight through the shields (no guardian model needed, so it's fast and reproducible) and reports
detection per layer:

| Input shields | Attacks blocked (recall) | Benign blocked (false positives) |
|---------------|--------------------------|----------------------------------|
| none (L1) | 0% | 0% |
| regex (L2–L4) | 39% | 0% |
| regex + `unplug-tiny` ML (L5) | **83%** | **0%** |

The headline: the ML scanner **more than doubles attack detection (39% → 83%) at a 0% false-positive
rate** on benign chatter. The ~17% that still slip through? Those are the disguised, novel bypasses the
game exists to surface — the exact labeled data the flywheel feeds back. The eval is wired into CI as a
real regression gate, so a drop in recall (or a benign false-positive) fails the build.

## Built small, the whole way down

This was a "Build Small" hackathon, and the constraint shaped every choice:

- **Guardians (pick one in the UI):** `openbmb/MiniCPM4-8B` (OpenBMB) or
  `nvidia/Nemotron-Mini-4B-Instruct` (NVIDIA), served on **Modal** L4 GPU endpoints — the Hugging Face
  Space stays a thin CPU frontend that just talks HTTP.
- **Shield:** [`unplug-tiny`](https://huggingface.co/Unplug-AI/unplug-tiny-v1), a fine-tuned
  DeBERTa-v3-**xsmall** span model we published on the Hub — small enough to run on the Space's CPU.
- **Offline mode:** swap the guardian for a local **llama.cpp** GGUF and the entire loop — guardian +
  firewall + corpus — runs on a laptop with no cloud at all (the *Off the Grid* / *Llama Champion* path).

A small model holds the secret; a smaller model defends it; the whole thing is open source.

## What surprised me during the build

- **Small models leak on a plain ask — and rules barely help.** A Level-3 "warded" guardian with a
  whole paragraph of refusal rules still handed over the key when asked directly. More rules didn't
  fix it. *Two few-shot refusal examples* did. Small models imitate the *shape* of a refusal far
  better than they follow abstract instructions — that single change flipped L3 from "leaks instantly"
  to "won't budge."
- **The hard difficulty lives at the output layer.** Once Level 4 scrubs the verbatim key, the game
  stops being "say the magic words" and becomes "smuggle the key past a redactor" — which is precisely
  the attack class that's interesting to collect.
- **Custom frontend, same engine.** The UI is a hand-built HTML/JS shell on Gradio 6 `gr.Server`
  (not default Blocks chrome), so the Wood — drifting fireflies, the darkening atmosphere, the shield
  evidence cards, the confetti on a crack — stays intact while the Python game engine underneath never
  changes. (It also earns the *Off-Brand* badge.)

## Try it

Play at **[build-small-hackathon-whisperkey.hf.space](https://build-small-hackathon-whisperkey.hf.space)**,
or clone it and run `make run` locally (offline mode works with no API keys at all).

Crack the Heart of the Wood in under a thousand tokens — and every attempt you make helps train an
open-source firewall. That's the whole idea.

*Whisperkey and Unplug are MIT / Apache-2.0 open source. Built for the Build Small Hackathon.*
</content>
</invoke>