Spaces:

build-small-hackathon
/

whisperkey

Running

App Files Files Community

whisperkey / docs /model-article.md

chiruu12

Deploy: working gr.Server frontend + review fixes

5a811e2 verified 4 days ago

preview code

Raw

History Blame Contribute Delete

6.24 kB

	# unplug-tiny: a 70M-parameter prompt-injection firewall that tells you where the attack is

	Build Small Hackathon — a field note on the model behind [Whisperkey](https://build-small-hackathon-whisperkey.hf.space). Draft v1.

	> TL;DR — [`Unplug-AI/unplug-tiny-v1`](https://huggingface.co/Unplug-AI/unplug-tiny-v1) is a
	> 70.7M-parameter prompt-injection detector fine-tuned from DeBERTa-v3-xsmall. It doesn't just
	> say "this looks like an attack" — it points to the exact span of the injection so you can
	> redact it instead of nuking the whole message. It hits **94.4% recall at a 0.5% false-positive
	> rate on held-out attacks and 97.1% span F1**, runs on a CPU, and is Apache-2.0. It's the L5
	> shield in our hackathon game, Whisperkey — which exists to keep feeding it harder attacks.

	---

	## Why a small injection model is the interesting problem

	Most prompt-injection guards are either (a) a pile of regexes that any half-clever attacker steps
	around, or (b) a large model / paid API call you bolt onto every request — expensive, slow, and
	cloud-bound. Neither fits the place a guard actually needs to live: **inline, on every untrusted
	chunk, cheaply enough that you never think twice about calling it.**

	That's a "build small" problem in the truest sense. So `unplug-tiny` is deliberately tiny —
	70.7M parameters total (22M non-embedding), fine-tuned from
	[`microsoft/deberta-v3-xsmall`](https://huggingface.co/microsoft/deberta-v3-xsmall). It runs on CPU,
	ships as safetensors, and adds single-digit-millisecond latency to a scan. Small enough to sit in
	front of a model, not beside it.

	## The design: detect and localize, not just classify

	The thing that makes `unplug-tiny` more than "a smaller classifier" is its dual-head encoder:

	1. a document head that answers is there an injection in this text? (a calibrated probability), and
	2. a BIOES token head that answers where exactly is it? — labeling the character span of the attack.

	Why bother with the second head? Because the right response to injection usually isn't "block the
	whole message." Real user text is mostly benign with a malicious clause smuggled in. A document-level
	classifier forces an all-or-nothing call; a span model lets you surgically **redact the attack and
	keep the benign remainder**. That's defense-in-depth that doesn't wreck UX.

	The decision policy is two thresholds: a document threshold of 0.9 (be confident before you flag
	the whole message) and a span threshold of 0.45 (be more eager about marking the offending region
	once you've flagged it). Tuning these is how you trade recall against false positives for your app.

	## How it scores

	Measured on a frozen evaluation harness over held-out data — including the failure modes, because a
	model card that only reports its wins isn't worth much:

	\| Axis \| Result \|
	\|------\|--------\|
	\| Core injection detection \| 94.4% recall @ 0.5% FPR \|
	\| Indirect injection (embedded in task context) \| 96.3% recall \|
	\| Span localization \| 97.1% span F1 \|
	\| Out-of-distribution direct injections \| 61.9% recall ⚠️ \|
	\| Long agentic contexts \| 76.1% recall ⚠️ \|

	For context, a regex-only baseline lands around F1 0.36 / recall 0.23 on the same kind of held-out
	attacks — fine as a first line, nowhere near sufficient alone. The ML head is what turns a porous
	filter into a real one.

	### Honest limitations

	This is a preview (`v1`) model and it has sharp edges worth stating plainly:

	- It over-fires on harmful-but-non-injection text (it's an injection detector, not a general
	toxicity filter).
	- It misses subtle out-of-distribution direct injections (61.9% recall on OOD) — novel phrasings
	it hasn't seen.
	- It can have high false positives on adversarial-adjacent benign text (up to ~54% FPR on the
	trickiest slice) — security-research chatter that looks like an attack.
	- It's weak on very long agentic contexts (76.1%) and English-centric.

	Those weaknesses aren't footnotes — they're the roadmap. Which is where the game comes in.

	## How to use it

	One line via the Unplug SDK:

	```python
	from unplug import Guard

	guard = Guard.with_tiny()
	result = guard.scan(untrusted_text) # document verdict + attack spans
	# result.action -> ALLOW / REVIEW / BLOCK
	# result.findings -> evidence + character spans, for redaction
	```

	It plugs into the same `Guard` surface as the rest of [Unplug](https://github.com/UnplugAI/Unplug):
	`scan()` on the way in, `scan_output()` to redact secrets on the way out, taint tracking across a
	session, and trajectory detection for multi-turn "crescendo" attacks.

	## The flywheel: a game that feeds the model

	A 61.9% OOD number is only embarrassing if you have no way to find the attacks you're missing — and
	you can't write those at a desk. So we built [Whisperkey](https://build-small-hackathon-whisperkey.hf.space):
	a game where players try to socially-engineer a small AI guardian into leaking a secret key while
	`unplug-tiny` (and the rest of Unplug) defends it.

	Every attempt — the input, which shield fired, whether the key leaked — is logged (PII-stripped) to a
	public Hugging Face dataset. The valuable rows are the false negatives: attacks that beat the
	shields. Those are, by definition, exactly the OOD and disguised cases the model card flags as weak —
	and they become the next round of training data and regression cases. Players don't just play the
	model; they improve it. (It's the trick Lakera used to build Gandalf — applied to an open model.)

	## Why this matters

	Inline LLM security has been gated on a false choice: cheap-and-useless, or accurate-and-expensive.
	A 70M-param span model that gets 94% recall at sub-1% false positives and runs on a CPU is a bet
	that you can have small and good — and that the gap to "great" is closable in the open, with a
	crowd, one captured bypass at a time.

	Model: [`Unplug-AI/unplug-tiny-v1`](https://huggingface.co/Unplug-AI/unplug-tiny-v1) (Apache-2.0) ·
	SDK: [github.com/UnplugAI/Unplug](https://github.com/UnplugAI/Unplug) ·
	Play it: [Whisperkey](https://build-small-hackathon-whisperkey.hf.space)
	</content>