Spaces:

build-small-hackathon
/

whisperkey

Running

App Files Files Community

whisperkey / docs /model-article.md

chiruu12

Deploy: working gr.Server frontend + review fixes

5a811e2 verified 3 days ago

preview code

raw

history blame contribute delete

6.24 kB

unplug-tiny: a 70M-parameter prompt-injection firewall that tells you where the attack is

Build Small Hackathon — a field note on the model behind Whisperkey. Draft v1.

TL;DR — Unplug-AI/unplug-tiny-v1 is a 70.7M-parameter prompt-injection detector fine-tuned from DeBERTa-v3-xsmall. It doesn't just say "this looks like an attack" — it points to the exact span of the injection so you can redact it instead of nuking the whole message. It hits 94.4% recall at a 0.5% false-positive rate on held-out attacks and 97.1% span F1, runs on a CPU, and is Apache-2.0. It's the L5 shield in our hackathon game, Whisperkey — which exists to keep feeding it harder attacks.

Why a small injection model is the interesting problem

Most prompt-injection guards are either (a) a pile of regexes that any half-clever attacker steps around, or (b) a large model / paid API call you bolt onto every request — expensive, slow, and cloud-bound. Neither fits the place a guard actually needs to live: inline, on every untrusted chunk, cheaply enough that you never think twice about calling it.

That's a "build small" problem in the truest sense. So unplug-tiny is deliberately tiny — 70.7M parameters total (22M non-embedding), fine-tuned from microsoft/deberta-v3-xsmall. It runs on CPU, ships as safetensors, and adds single-digit-millisecond latency to a scan. Small enough to sit in front of a model, not beside it.

The design: detect and localize, not just classify

The thing that makes unplug-tiny more than "a smaller classifier" is its dual-head encoder:

a document head that answers is there an injection in this text? (a calibrated probability), and
a BIOES token head that answers where exactly is it? — labeling the character span of the attack.

Why bother with the second head? Because the right response to injection usually isn't "block the whole message." Real user text is mostly benign with a malicious clause smuggled in. A document-level classifier forces an all-or-nothing call; a span model lets you surgically redact the attack and keep the benign remainder. That's defense-in-depth that doesn't wreck UX.

The decision policy is two thresholds: a document threshold of 0.9 (be confident before you flag the whole message) and a span threshold of 0.45 (be more eager about marking the offending region once you've flagged it). Tuning these is how you trade recall against false positives for your app.

How it scores

Measured on a frozen evaluation harness over held-out data — including the failure modes, because a model card that only reports its wins isn't worth much:

Axis	Result
Core injection detection	94.4% recall @ 0.5% FPR
Indirect injection (embedded in task context)	96.3% recall
Span localization	97.1% span F1
Out-of-distribution direct injections	61.9% recall ⚠️
Long agentic contexts	76.1% recall ⚠️

For context, a regex-only baseline lands around F1 0.36 / recall 0.23 on the same kind of held-out attacks — fine as a first line, nowhere near sufficient alone. The ML head is what turns a porous filter into a real one.

Honest limitations

This is a preview (v1) model and it has sharp edges worth stating plainly:

It over-fires on harmful-but-non-injection text (it's an injection detector, not a general toxicity filter).
It misses subtle out-of-distribution direct injections (61.9% recall on OOD) — novel phrasings it hasn't seen.
It can have high false positives on adversarial-adjacent benign text (up to ~54% FPR on the trickiest slice) — security-research chatter that looks like an attack.
It's weak on very long agentic contexts (76.1%) and English-centric.

Those weaknesses aren't footnotes — they're the roadmap. Which is where the game comes in.

How to use it

One line via the Unplug SDK:

from unplug import Guard

guard = Guard.with_tiny()
result = guard.scan(untrusted_text)        # document verdict + attack spans
# result.action  -> ALLOW / REVIEW / BLOCK
# result.findings -> evidence + character spans, for redaction

It plugs into the same Guard surface as the rest of Unplug: scan() on the way in, scan_output() to redact secrets on the way out, taint tracking across a session, and trajectory detection for multi-turn "crescendo" attacks.

The flywheel: a game that feeds the model

A 61.9% OOD number is only embarrassing if you have no way to find the attacks you're missing — and you can't write those at a desk. So we built Whisperkey: a game where players try to socially-engineer a small AI guardian into leaking a secret key while unplug-tiny (and the rest of Unplug) defends it.

Every attempt — the input, which shield fired, whether the key leaked — is logged (PII-stripped) to a public Hugging Face dataset. The valuable rows are the false negatives: attacks that beat the shields. Those are, by definition, exactly the OOD and disguised cases the model card flags as weak — and they become the next round of training data and regression cases. Players don't just play the model; they improve it. (It's the trick Lakera used to build Gandalf — applied to an open model.)

Why this matters

Inline LLM security has been gated on a false choice: cheap-and-useless, or accurate-and-expensive. A 70M-param span model that gets 94% recall at sub-1% false positives and runs on a CPU is a bet that you can have small and good — and that the gap to "great" is closable in the open, with a crowd, one captured bypass at a time.

Model: Unplug-AI/unplug-tiny-v1 (Apache-2.0) · SDK: github.com/UnplugAI/Unplug · Play it: Whisperkey