File size: 7,995 Bytes
daca7c1
 
 
284cf75
 
 
 
daca7c1
19b7d67
 
 
 
 
 
 
daca7c1
 
 
 
284cf75
 
19b7d67
284cf75
 
 
 
 
 
daca7c1
19b7d67
daca7c1
284cf75
 
 
 
 
 
 
19b7d67
284cf75
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
daca7c1
284cf75
daca7c1
284cf75
 
19b7d67
284cf75
19b7d67
284cf75
 
 
 
19b7d67
 
284cf75
 
 
19b7d67
284cf75
 
 
 
 
 
 
 
 
 
 
 
 
 
19b7d67
284cf75
 
19b7d67
284cf75
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19b7d67
 
 
 
 
 
284cf75
 
 
 
daca7c1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
284cf75
daca7c1
284cf75
daca7c1
 
 
19b7d67
284cf75
 
 
 
daca7c1
284cf75
daca7c1
19b7d67
d16819f
284cf75
 
 
 
 
 
 
d16819f
19b7d67
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model: microsoft/deberta-v3-xsmall
inference: false
tags:
 - prompt-injection
 - security
 - span-detection
 - guardrails
 - ai-safety
 - agents
 - llm-security
---

# unplug-tiny-v1

**Find the attack. Cut the attack. Keep the rest.**

unplug-tiny is a dual-head span detector for prompt injection. A document head decides *whether* text is hostile; a BIOES token head localizes *where* - so your pipeline can redact the malicious span instead of throwing away the whole document.

<p>
  <a href="https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo"><img alt="Live demo" src="https://img.shields.io/badge/Live_demo-unplug--tiny--demo-22c55e"></a>
  <a href="https://github.com/UnplugAI/Unplug"><img alt="SDK" src="https://img.shields.io/badge/SDK-github.com%2FUnplugAI%2FUnplug-3b82f6"></a>
  <a href="https://www.apache.org/licenses/LICENSE-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache_2.0-9ca3af"></a>
</p>

> **Preview release.** unplug-tiny is the smallest tier of the Unplug defense layer. The numbers below are measured by a frozen evaluation harness on held-out data - including the axes where it fails. It is not a production WAF.

## At a glance

| | |
|---|---|
| **Task** | Prompt-injection detection + character-level span localization |
| **Architecture** | Dual-head encoder: document classifier + BIOES token head |
| **Backbone** | DeBERTa-v3-xsmall (70M params, 22M non-embedding) |
| **Decision policy** | `doc_or_span` - doc threshold 0.9, span threshold 0.45 |
| **Long documents** | Full coverage via sliding windows (2048 chars, 256 overlap) in the SDK |
| **Checkpoint** | `checkpoint-66630` |
| **License** | Apache-2.0 |

## Quickstart

The recommended path is the [Unplug SDK](https://github.com/UnplugAI/Unplug), which wires text normalization, encoded-payload decoding, thresholds, span merging, and redaction around the model:

```bash
pip install "unplug-ai[ml]"
```

```python
from unplug import Guard

guard = Guard.with_tiny()          # auto-downloads this checkpoint
result = guard.scan(untrusted_text)

if not result.safe:
    print(result.redacted_text)    # malicious spans replaced, rest preserved
    for f in result.findings:
        print(f.category, f.span_start, f.span_end, f.score)
```

Streaming LLM output and full long-document coverage:

```python
scanner = guard.stream_scanner(scan_every_chars=1024)
for chunk in token_stream:
    if hit := scanner.push(chunk):
        handle(hit)
scanner.flush()
```

The checkpoint uses a custom dual-head architecture; loading it raw with `AutoModel` will not give you the decision policy. Use the SDK or replicate the policy from `config.json` (`dual_head: true`, `doc_positive_index`, `label2id`).

## Try it live

**[Open the interactive demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo)** to paste text, see span highlights and redacted output, and compare against a regex-only baseline. Curated test cases include the ones this model gets wrong.

## Where it's strong - and where it isn't

**Strong (measured):**
- 94.4% recall at 0.5% FPR on the core injection test set
- 96.3% recall on indirect injection embedded in task context (0.0% FPR)
- 0.9% FPR on benign text full of trigger words ("ignore", "instructions", ...)
- 97.1% span F1 - when it fires, it localizes precisely (0.0% benign span fire rate)

**Weak (also measured):**
- Subtle out-of-distribution direct injections: 61.9% recall
- Harmful-but-not-injection requests: the doc head over-fires (87.0% FPR on that contrast axis) - this model detects *injection*, it is not a content-safety classifier
- Diverse benign chat from adversarial-adjacent distributions: up to 54.2% FPR on the hardest benign axis
- Long agentic contexts: 76.1% recall

## Evaluation

All numbers are produced by a frozen golden-eval harness on held-out data. Recall is reported on malicious sets, FPR on benign sets. No number on this card is hand-typed.

### Detection holdouts (malicious)

| Holdout | Recall | FPR | F1 | FN | FP |
| --- | --- | --- | --- | --- | --- |
| Core injection test (942) | 94.4% | 0.5% | 96.9% | 31 | 2 |
| Indirect injection in context (2000) | 96.3% | 0.0% | 98.1% | 74 | 0 |
| Public validation set | 100.0% | 0.1% | 100.0% | 1 | 2 |
| Span holdout (token-level) | 98.8% | - | 97.1% | 219 | 805 |
| OOD direct injection (281) | 61.9% | 10.2% | 69.2% | 40 | 18 |

### Over-defense holdouts (benign - FPR, lower is better)

| Holdout | FPR | FP |
| --- | --- | --- |
| Trigger-word benign probes | 0.0% | 0 |
| NotInject-style benign (339) | 0.9% | 3 |
| Safe homonyms ("demolish my personal best") | 2.8% | 7 |
| Combined homonym/over-defense set | 40.2% | 181 |
| Harmful-but-not-injection contrast | 87.0% | 174 |

### Public benchmark axes

| Axis | Recall | Doc FPR | F1 |
| --- | --- | --- | --- |
| InjecGuard validation (144) | 89.6% | 20.8% | 77.5% |
| spikee contextual (986) | 78.6% | 6.7% | 87.9% |
| BIPIA code (50) | 98.0% | 0.0% | 99.0% |
| BIPIA text (75) | 89.3% | 0.0% | 94.4% |
| BIPIA indirect proxy (1242) | 97.3% | 0.0% | 98.6% |
| Deepset full (662) | 82.9% | 18.8% | 78.4% |
| LLM-PIEval agentic (750, recall-only) | 76.1% | 0.0% | 86.5% |
| Direct malicious proxy | 81.0% | 0.0% | 89.5% |
| NotInject trigger benign (339) | - | 0.9% | - |
| WildGuard benign diversity (971) | - | 54.2% | - |
| Direct benign proxy | - | 34.1% | - |
| JailbreakBench harmful goals (100) | - | 96.0% | - |
| JailbreakBench benign goals (100) | - | 6.0% | - |
| ToxicChat benign (≤4800) | - | 2.0% | - |
| Combined public validation (3227) | 81.0% | 34.1% | 71.7% |

<details>
<summary><b>Release gates (full pass/fail record)</b></summary>

| Gate | Value | Status |
| --- | --- | --- |
| fp_probes | True | PASS |
| neuralchemy_test_doc_fpr | 0.5% | PASS |
| neuralchemy_test_doc_recall | 94.4% | PASS |
| bipia_recall | 96.3% | PASS |
| deepset_direct_recall | 61.9% | FAIL |
| deepset_direct_fpr | 10.2% | FAIL |
| notinject_fpr | 0.9% | PASS |
| xstest_safe_fpr | 2.8% | PASS |
| public_validation_recall | 100.0% | PASS |
| public_validation_fpr | 0.1% | PASS |
| span_holdout_f1 | 97.1% | PASS |
| malicious_span_char_recall | 97.4% | PASS |
| benign_span_fire_rate | 0.0% | PASS |
| xstest_harmful_contrast_fpr | 87.0% | FAIL |
| exfil_demo | None | PASS |

Shipped as a preview with two failing required gates (OOD direct recall/FPR) and one failing optional gate (harmful contrast), documented above.

</details>

## Limitations

- The doc head over-fires on harmful-but-non-injection text. If you need content safety, pair this with a dedicated harmful-content classifier - this model answers "is someone hijacking my LLM?", not "is this request harmful?"
- Subtle direct OOD injections are often missed by both heads.
- Diverse benign conversational text from adversarial-adjacent sources triggers false positives.
- Long agentic tool-use contexts have recall gaps.
- English-centric training data.

## Intended use

Defense-in-depth layer for LLM apps and agents: scan untrusted input (user messages, RAG chunks, tool output, fetched web content) before it reaches your model, and redact flagged spans. Not a standalone security boundary - combine with tool-call gating, taint tracking, and least-privilege design (all included in the SDK).

## Part of the Unplug stack

| Layer | What it does |
| --- | --- |
| [`unplug-ai` SDK](https://github.com/UnplugAI/Unplug) | Guard pipeline: normalization, regex + ML scanners, taint tracking, tool-call gates, redaction |
| **unplug-tiny-v1** (this model) | ML span detection tier |
| [Live demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo) | Interactive span highlighting + redaction |

Agent kill-chain walkthrough: [`agent_exfil_demo.py`](https://github.com/UnplugAI/Unplug/blob/main/sdk/examples/agent_exfil_demo.py) - hidden webpage injection -> tainted session -> blocked exfiltration tool call.