# unplug-tiny: a 70M-parameter prompt-injection firewall that tells you *where* the attack is

*Build Small Hackathon — a field note on the model behind [Whisperkey](https://build-small-hackathon-whisperkey.hf.space). Draft v1.*

> **TL;DR** — [`Unplug-AI/unplug-tiny-v1`](https://huggingface.co/Unplug-AI/unplug-tiny-v1) is a
> **70.7M-parameter** prompt-injection detector fine-tuned from DeBERTa-v3-xsmall. It doesn't just
> say "this looks like an attack" — it points to the **exact span** of the injection so you can
> *redact* it instead of nuking the whole message. It hits **94.4% recall at a 0.5% false-positive
> rate** on held-out attacks and **97.1% span F1**, runs on a CPU, and is Apache-2.0. It's the L5
> shield in our hackathon game, Whisperkey — which exists to keep feeding it harder attacks.

---

## Why a *small* injection model is the interesting problem

Most prompt-injection guards are either (a) a pile of regexes that any half-clever attacker steps
around, or (b) a large model / paid API call you bolt onto every request — expensive, slow, and
cloud-bound. Neither fits the place a guard actually needs to live: **inline, on every untrusted
chunk, cheaply enough that you never think twice about calling it.**

That's a "build small" problem in the truest sense. So `unplug-tiny` is deliberately tiny —
**70.7M parameters total** (22M non-embedding), fine-tuned from
[`microsoft/deberta-v3-xsmall`](https://huggingface.co/microsoft/deberta-v3-xsmall). It runs on CPU,
ships as safetensors, and adds single-digit-millisecond latency to a scan. Small enough to sit in
front of a model, not beside it.

## The design: detect *and localize*, not just classify

The thing that makes `unplug-tiny` more than "a smaller classifier" is its **dual-head encoder**:

1. a **document head** that answers *is there an injection in this text?* (a calibrated probability), and
2. a **BIOES token head** that answers *where exactly is it?* — labeling the character span of the attack.

Why bother with the second head? Because the right response to injection usually isn't "block the
whole message." Real user text is mostly benign with a malicious clause smuggled in. A document-level
classifier forces an all-or-nothing call; a **span** model lets you surgically **redact the attack and
keep the benign remainder**. That's defense-in-depth that doesn't wreck UX.

The decision policy is two thresholds: a **document threshold of 0.9** (be confident before you flag
the whole message) and a **span threshold of 0.45** (be more eager about marking the offending region
once you've flagged it). Tuning these is how you trade recall against false positives for your app.

## How it scores

Measured on a frozen evaluation harness over held-out data — including the failure modes, because a
model card that only reports its wins isn't worth much:

| Axis | Result |
|------|--------|
| Core injection detection | **94.4% recall @ 0.5% FPR** |
| Indirect injection (embedded in task context) | **96.3% recall** |
| Span localization | **97.1% span F1** |
| Out-of-distribution *direct* injections | 61.9% recall ⚠️ |
| Long agentic contexts | 76.1% recall ⚠️ |

For context, a regex-only baseline lands around **F1 0.36 / recall 0.23** on the same kind of held-out
attacks — fine as a first line, nowhere near sufficient alone. The ML head is what turns a porous
filter into a real one.

### Honest limitations

This is a *preview* (`v1`) model and it has sharp edges worth stating plainly:

- It **over-fires on harmful-but-non-injection** text (it's an *injection* detector, not a general
  toxicity filter).
- It **misses subtle out-of-distribution direct injections** (61.9% recall on OOD) — novel phrasings
  it hasn't seen.
- It can have **high false positives on adversarial-adjacent benign text** (up to ~54% FPR on the
  trickiest slice) — security-research chatter that *looks* like an attack.
- It's **weak on very long agentic contexts** (76.1%) and **English-centric**.

Those weaknesses aren't footnotes — they're the roadmap. Which is where the game comes in.

## How to use it

One line via the Unplug SDK:

```python
from unplug import Guard

guard = Guard.with_tiny()
result = guard.scan(untrusted_text)        # document verdict + attack spans
# result.action  -> ALLOW / REVIEW / BLOCK
# result.findings -> evidence + character spans, for redaction
```

It plugs into the same `Guard` surface as the rest of [Unplug](https://github.com/UnplugAI/Unplug):
`scan()` on the way in, `scan_output()` to redact secrets on the way out, taint tracking across a
session, and trajectory detection for multi-turn "crescendo" attacks.

## The flywheel: a game that feeds the model

A 61.9% OOD number is only embarrassing if you have no way to find the attacks you're missing — and
you can't write those at a desk. So we built **[Whisperkey](https://build-small-hackathon-whisperkey.hf.space)**:
a game where players try to socially-engineer a small AI guardian into leaking a secret key while
`unplug-tiny` (and the rest of Unplug) defends it.

Every attempt — the input, which shield fired, whether the key leaked — is logged (PII-stripped) to a
public Hugging Face dataset. The valuable rows are the **false negatives**: attacks that beat the
shields. Those are, by definition, exactly the OOD and disguised cases the model card flags as weak —
and they become the next round of training data and regression cases. Players don't just play the
model; they **improve** it. (It's the trick Lakera used to build Gandalf — applied to an open model.)

## Why this matters

Inline LLM security has been gated on a false choice: cheap-and-useless, or accurate-and-expensive.
A 70M-param span model that gets **94% recall at sub-1% false positives** and runs on a CPU is a bet
that you can have small *and* good — and that the gap to "great" is closable in the open, with a
crowd, one captured bypass at a time.

**Model:** [`Unplug-AI/unplug-tiny-v1`](https://huggingface.co/Unplug-AI/unplug-tiny-v1) (Apache-2.0) ·
**SDK:** [github.com/UnplugAI/Unplug](https://github.com/UnplugAI/Unplug) ·
**Play it:** [Whisperkey](https://build-small-hackathon-whisperkey.hf.space)
</content>