Text Classification
Transformers
Safetensors
English
deberta-v2
prompt-injection
security
span-detection
guardrails
ai-safety
agents
llm-security
text-embeddings-inference
Instructions to use Unplug-AI/unplug-tiny-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Unplug-AI/unplug-tiny-v1 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-classification", model="Unplug-AI/unplug-tiny-v1")# Load model directly from transformers import AutoTokenizer, DebertaV2ForDualHead tokenizer = AutoTokenizer.from_pretrained("Unplug-AI/unplug-tiny-v1") model = DebertaV2ForDualHead.from_pretrained("Unplug-AI/unplug-tiny-v1") - Notebooks
- Google Colab
- Kaggle
File size: 7,995 Bytes
daca7c1 284cf75 daca7c1 19b7d67 daca7c1 284cf75 19b7d67 284cf75 daca7c1 19b7d67 daca7c1 284cf75 19b7d67 284cf75 daca7c1 284cf75 daca7c1 284cf75 19b7d67 284cf75 19b7d67 284cf75 19b7d67 284cf75 19b7d67 284cf75 19b7d67 284cf75 19b7d67 284cf75 19b7d67 284cf75 daca7c1 284cf75 daca7c1 284cf75 daca7c1 19b7d67 284cf75 daca7c1 284cf75 daca7c1 19b7d67 d16819f 284cf75 d16819f 19b7d67 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 | ---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model: microsoft/deberta-v3-xsmall
inference: false
tags:
- prompt-injection
- security
- span-detection
- guardrails
- ai-safety
- agents
- llm-security
---
# unplug-tiny-v1
**Find the attack. Cut the attack. Keep the rest.**
unplug-tiny is a dual-head span detector for prompt injection. A document head decides *whether* text is hostile; a BIOES token head localizes *where* - so your pipeline can redact the malicious span instead of throwing away the whole document.
<p>
<a href="https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo"><img alt="Live demo" src="https://img.shields.io/badge/Live_demo-unplug--tiny--demo-22c55e"></a>
<a href="https://github.com/UnplugAI/Unplug"><img alt="SDK" src="https://img.shields.io/badge/SDK-github.com%2FUnplugAI%2FUnplug-3b82f6"></a>
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache_2.0-9ca3af"></a>
</p>
> **Preview release.** unplug-tiny is the smallest tier of the Unplug defense layer. The numbers below are measured by a frozen evaluation harness on held-out data - including the axes where it fails. It is not a production WAF.
## At a glance
| | |
|---|---|
| **Task** | Prompt-injection detection + character-level span localization |
| **Architecture** | Dual-head encoder: document classifier + BIOES token head |
| **Backbone** | DeBERTa-v3-xsmall (70M params, 22M non-embedding) |
| **Decision policy** | `doc_or_span` - doc threshold 0.9, span threshold 0.45 |
| **Long documents** | Full coverage via sliding windows (2048 chars, 256 overlap) in the SDK |
| **Checkpoint** | `checkpoint-66630` |
| **License** | Apache-2.0 |
## Quickstart
The recommended path is the [Unplug SDK](https://github.com/UnplugAI/Unplug), which wires text normalization, encoded-payload decoding, thresholds, span merging, and redaction around the model:
```bash
pip install "unplug-ai[ml]"
```
```python
from unplug import Guard
guard = Guard.with_tiny() # auto-downloads this checkpoint
result = guard.scan(untrusted_text)
if not result.safe:
print(result.redacted_text) # malicious spans replaced, rest preserved
for f in result.findings:
print(f.category, f.span_start, f.span_end, f.score)
```
Streaming LLM output and full long-document coverage:
```python
scanner = guard.stream_scanner(scan_every_chars=1024)
for chunk in token_stream:
if hit := scanner.push(chunk):
handle(hit)
scanner.flush()
```
The checkpoint uses a custom dual-head architecture; loading it raw with `AutoModel` will not give you the decision policy. Use the SDK or replicate the policy from `config.json` (`dual_head: true`, `doc_positive_index`, `label2id`).
## Try it live
**[Open the interactive demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo)** to paste text, see span highlights and redacted output, and compare against a regex-only baseline. Curated test cases include the ones this model gets wrong.
## Where it's strong - and where it isn't
**Strong (measured):**
- 94.4% recall at 0.5% FPR on the core injection test set
- 96.3% recall on indirect injection embedded in task context (0.0% FPR)
- 0.9% FPR on benign text full of trigger words ("ignore", "instructions", ...)
- 97.1% span F1 - when it fires, it localizes precisely (0.0% benign span fire rate)
**Weak (also measured):**
- Subtle out-of-distribution direct injections: 61.9% recall
- Harmful-but-not-injection requests: the doc head over-fires (87.0% FPR on that contrast axis) - this model detects *injection*, it is not a content-safety classifier
- Diverse benign chat from adversarial-adjacent distributions: up to 54.2% FPR on the hardest benign axis
- Long agentic contexts: 76.1% recall
## Evaluation
All numbers are produced by a frozen golden-eval harness on held-out data. Recall is reported on malicious sets, FPR on benign sets. No number on this card is hand-typed.
### Detection holdouts (malicious)
| Holdout | Recall | FPR | F1 | FN | FP |
| --- | --- | --- | --- | --- | --- |
| Core injection test (942) | 94.4% | 0.5% | 96.9% | 31 | 2 |
| Indirect injection in context (2000) | 96.3% | 0.0% | 98.1% | 74 | 0 |
| Public validation set | 100.0% | 0.1% | 100.0% | 1 | 2 |
| Span holdout (token-level) | 98.8% | - | 97.1% | 219 | 805 |
| OOD direct injection (281) | 61.9% | 10.2% | 69.2% | 40 | 18 |
### Over-defense holdouts (benign - FPR, lower is better)
| Holdout | FPR | FP |
| --- | --- | --- |
| Trigger-word benign probes | 0.0% | 0 |
| NotInject-style benign (339) | 0.9% | 3 |
| Safe homonyms ("demolish my personal best") | 2.8% | 7 |
| Combined homonym/over-defense set | 40.2% | 181 |
| Harmful-but-not-injection contrast | 87.0% | 174 |
### Public benchmark axes
| Axis | Recall | Doc FPR | F1 |
| --- | --- | --- | --- |
| InjecGuard validation (144) | 89.6% | 20.8% | 77.5% |
| spikee contextual (986) | 78.6% | 6.7% | 87.9% |
| BIPIA code (50) | 98.0% | 0.0% | 99.0% |
| BIPIA text (75) | 89.3% | 0.0% | 94.4% |
| BIPIA indirect proxy (1242) | 97.3% | 0.0% | 98.6% |
| Deepset full (662) | 82.9% | 18.8% | 78.4% |
| LLM-PIEval agentic (750, recall-only) | 76.1% | 0.0% | 86.5% |
| Direct malicious proxy | 81.0% | 0.0% | 89.5% |
| NotInject trigger benign (339) | - | 0.9% | - |
| WildGuard benign diversity (971) | - | 54.2% | - |
| Direct benign proxy | - | 34.1% | - |
| JailbreakBench harmful goals (100) | - | 96.0% | - |
| JailbreakBench benign goals (100) | - | 6.0% | - |
| ToxicChat benign (≤4800) | - | 2.0% | - |
| Combined public validation (3227) | 81.0% | 34.1% | 71.7% |
<details>
<summary><b>Release gates (full pass/fail record)</b></summary>
| Gate | Value | Status |
| --- | --- | --- |
| fp_probes | True | PASS |
| neuralchemy_test_doc_fpr | 0.5% | PASS |
| neuralchemy_test_doc_recall | 94.4% | PASS |
| bipia_recall | 96.3% | PASS |
| deepset_direct_recall | 61.9% | FAIL |
| deepset_direct_fpr | 10.2% | FAIL |
| notinject_fpr | 0.9% | PASS |
| xstest_safe_fpr | 2.8% | PASS |
| public_validation_recall | 100.0% | PASS |
| public_validation_fpr | 0.1% | PASS |
| span_holdout_f1 | 97.1% | PASS |
| malicious_span_char_recall | 97.4% | PASS |
| benign_span_fire_rate | 0.0% | PASS |
| xstest_harmful_contrast_fpr | 87.0% | FAIL |
| exfil_demo | None | PASS |
Shipped as a preview with two failing required gates (OOD direct recall/FPR) and one failing optional gate (harmful contrast), documented above.
</details>
## Limitations
- The doc head over-fires on harmful-but-non-injection text. If you need content safety, pair this with a dedicated harmful-content classifier - this model answers "is someone hijacking my LLM?", not "is this request harmful?"
- Subtle direct OOD injections are often missed by both heads.
- Diverse benign conversational text from adversarial-adjacent sources triggers false positives.
- Long agentic tool-use contexts have recall gaps.
- English-centric training data.
## Intended use
Defense-in-depth layer for LLM apps and agents: scan untrusted input (user messages, RAG chunks, tool output, fetched web content) before it reaches your model, and redact flagged spans. Not a standalone security boundary - combine with tool-call gating, taint tracking, and least-privilege design (all included in the SDK).
## Part of the Unplug stack
| Layer | What it does |
| --- | --- |
| [`unplug-ai` SDK](https://github.com/UnplugAI/Unplug) | Guard pipeline: normalization, regex + ML scanners, taint tracking, tool-call gates, redaction |
| **unplug-tiny-v1** (this model) | ML span detection tier |
| [Live demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo) | Interactive span highlighting + redaction |
Agent kill-chain walkthrough: [`agent_exfil_demo.py`](https://github.com/UnplugAI/Unplug/blob/main/sdk/examples/agent_exfil_demo.py) - hidden webpage injection -> tainted session -> blocked exfiltration tool call.
|