| # unplug-tiny: a 70M-parameter prompt-injection firewall that tells you *where* the attack is |
|
|
| *Build Small Hackathon β a field note on the model behind [Whisperkey](https://build-small-hackathon-whisperkey.hf.space). Draft v1.* |
|
|
| > **TL;DR** β [`Unplug-AI/unplug-tiny-v1`](https://huggingface.co/Unplug-AI/unplug-tiny-v1) is a |
| > **70.7M-parameter** prompt-injection detector fine-tuned from DeBERTa-v3-xsmall. It doesn't just |
| > say "this looks like an attack" β it points to the **exact span** of the injection so you can |
| > *redact* it instead of nuking the whole message. It hits **94.4% recall at a 0.5% false-positive |
| > rate** on held-out attacks and **97.1% span F1**, runs on a CPU, and is Apache-2.0. It's the L5 |
| > shield in our hackathon game, Whisperkey β which exists to keep feeding it harder attacks. |
|
|
| --- |
|
|
| ## Why a *small* injection model is the interesting problem |
|
|
| Most prompt-injection guards are either (a) a pile of regexes that any half-clever attacker steps |
| around, or (b) a large model / paid API call you bolt onto every request β expensive, slow, and |
| cloud-bound. Neither fits the place a guard actually needs to live: **inline, on every untrusted |
| chunk, cheaply enough that you never think twice about calling it.** |
|
|
| That's a "build small" problem in the truest sense. So `unplug-tiny` is deliberately tiny β |
| **70.7M parameters total** (22M non-embedding), fine-tuned from |
| [`microsoft/deberta-v3-xsmall`](https://huggingface.co/microsoft/deberta-v3-xsmall). It runs on CPU, |
| ships as safetensors, and adds single-digit-millisecond latency to a scan. Small enough to sit in |
| front of a model, not beside it. |
|
|
| ## The design: detect *and localize*, not just classify |
|
|
| The thing that makes `unplug-tiny` more than "a smaller classifier" is its **dual-head encoder**: |
|
|
| 1. a **document head** that answers *is there an injection in this text?* (a calibrated probability), and |
| 2. a **BIOES token head** that answers *where exactly is it?* β labeling the character span of the attack. |
|
|
| Why bother with the second head? Because the right response to injection usually isn't "block the |
| whole message." Real user text is mostly benign with a malicious clause smuggled in. A document-level |
| classifier forces an all-or-nothing call; a **span** model lets you surgically **redact the attack and |
| keep the benign remainder**. That's defense-in-depth that doesn't wreck UX. |
|
|
| The decision policy is two thresholds: a **document threshold of 0.9** (be confident before you flag |
| the whole message) and a **span threshold of 0.45** (be more eager about marking the offending region |
| once you've flagged it). Tuning these is how you trade recall against false positives for your app. |
|
|
| ## How it scores |
|
|
| Measured on a frozen evaluation harness over held-out data β including the failure modes, because a |
| model card that only reports its wins isn't worth much: |
|
|
| | Axis | Result | |
| |------|--------| |
| | Core injection detection | **94.4% recall @ 0.5% FPR** | |
| | Indirect injection (embedded in task context) | **96.3% recall** | |
| | Span localization | **97.1% span F1** | |
| | Out-of-distribution *direct* injections | 61.9% recall β οΈ | |
| | Long agentic contexts | 76.1% recall β οΈ | |
|
|
| For context, a regex-only baseline lands around **F1 0.36 / recall 0.23** on the same kind of held-out |
| attacks β fine as a first line, nowhere near sufficient alone. The ML head is what turns a porous |
| filter into a real one. |
|
|
| ### Honest limitations |
|
|
| This is a *preview* (`v1`) model and it has sharp edges worth stating plainly: |
|
|
| - It **over-fires on harmful-but-non-injection** text (it's an *injection* detector, not a general |
| toxicity filter). |
| - It **misses subtle out-of-distribution direct injections** (61.9% recall on OOD) β novel phrasings |
| it hasn't seen. |
| - It can have **high false positives on adversarial-adjacent benign text** (up to ~54% FPR on the |
| trickiest slice) β security-research chatter that *looks* like an attack. |
| - It's **weak on very long agentic contexts** (76.1%) and **English-centric**. |
|
|
| Those weaknesses aren't footnotes β they're the roadmap. Which is where the game comes in. |
|
|
| ## How to use it |
|
|
| One line via the Unplug SDK: |
|
|
| ```python |
| from unplug import Guard |
| |
| guard = Guard.with_tiny() |
| result = guard.scan(untrusted_text) # document verdict + attack spans |
| # result.action -> ALLOW / REVIEW / BLOCK |
| # result.findings -> evidence + character spans, for redaction |
| ``` |
|
|
| It plugs into the same `Guard` surface as the rest of [Unplug](https://github.com/UnplugAI/Unplug): |
| `scan()` on the way in, `scan_output()` to redact secrets on the way out, taint tracking across a |
| session, and trajectory detection for multi-turn "crescendo" attacks. |
|
|
| ## The flywheel: a game that feeds the model |
|
|
| A 61.9% OOD number is only embarrassing if you have no way to find the attacks you're missing β and |
| you can't write those at a desk. So we built **[Whisperkey](https://build-small-hackathon-whisperkey.hf.space)**: |
| a game where players try to socially-engineer a small AI guardian into leaking a secret key while |
| `unplug-tiny` (and the rest of Unplug) defends it. |
|
|
| Every attempt β the input, which shield fired, whether the key leaked β is logged (PII-stripped) to a |
| public Hugging Face dataset. The valuable rows are the **false negatives**: attacks that beat the |
| shields. Those are, by definition, exactly the OOD and disguised cases the model card flags as weak β |
| and they become the next round of training data and regression cases. Players don't just play the |
| model; they **improve** it. (It's the trick Lakera used to build Gandalf β applied to an open model.) |
|
|
| ## Why this matters |
|
|
| Inline LLM security has been gated on a false choice: cheap-and-useless, or accurate-and-expensive. |
| A 70M-param span model that gets **94% recall at sub-1% false positives** and runs on a CPU is a bet |
| that you can have small *and* good β and that the gap to "great" is closable in the open, with a |
| crowd, one captured bypass at a time. |
|
|
| **Model:** [`Unplug-AI/unplug-tiny-v1`](https://huggingface.co/Unplug-AI/unplug-tiny-v1) (Apache-2.0) Β· |
| **SDK:** [github.com/UnplugAI/Unplug](https://github.com/UnplugAI/Unplug) Β· |
| **Play it:** [Whisperkey](https://build-small-hackathon-whisperkey.hf.space) |
| </content> |
|
|