Text Classification
Transformers
Safetensors
English
deberta-v2
prompt-injection
security
span-detection
guardrails
ai-safety
agents
llm-security
text-embeddings-inference
Instructions to use Unplug-AI/unplug-tiny-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Unplug-AI/unplug-tiny-v1 with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline
    pipe = pipeline("text-classification", model="Unplug-AI/unplug-tiny-v1")

    # Load model directly
    from transformers import AutoTokenizer, DebertaV2ForDualHead
    tokenizer = AutoTokenizer.from_pretrained("Unplug-AI/unplug-tiny-v1")
    model = DebertaV2ForDualHead.from_pretrained("Unplug-AI/unplug-tiny-v1")
- Notebooks
- Google Colab
- Kaggle
product-grade model card
README.md
CHANGED
|
@@ -1,29 +1,142 @@
|
|
| 1 |
---
|
| 2 |
language: en
|
| 3 |
license: apache-2.0
|
| 4 |
tags:
|
| 5 |
- prompt-injection
|
| 6 |
- security
|
| 7 |
- span-detection
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
|
|
|
| 11 |
---
|
| 12 |
|
| 13 |
# unplug-tiny-v1
|
| 14 |
|
| 15 |
-
**
|
| 16 |
|
| 17 |
-
|
| 18 |
-
- Policy: `doc_or_span` @ τ_doc=0.9, τ_span=0.45
|
| 19 |
-
- Checkpoint: `checkpoint-66630`
|
| 20 |
-
- Generated: 2026-06-09
|
| 21 |
|
| 22 |
-
##
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
-
##
|
| 27 |
|
| 28 |
| Gate | Value | Status |
|
| 29 |
| --- | --- | --- |
|
|
@@ -43,65 +156,28 @@ Strong on Neuralchemy, BIPIA indirect, notinject, public validation. Known weakn
|
|
| 43 |
| xstest_harmful_contrast_fpr | 87.0% | FAIL |
|
| 44 |
| exfil_demo | None | PASS |
|
| 45 |
|
| 46 |
-
|
| 47 |
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
| Holdout | Recall | FPR | F1 | FN | FP |
|
| 51 |
-
| --- | --- | --- | --- | --- | --- |
|
| 52 |
-
| fp_probes | None | None | None | 0 | 0 |
|
| 53 |
-
| neuralchemy_test | 94.4% | 0.5% | 96.9% | 31 | 2 |
|
| 54 |
-
| train_span_holdout | 98.8% | None | 97.1% | 219 | 805 |
|
| 55 |
-
| bipia_indirect | 96.3% | 0.0% | 98.1% | 74 | 0 |
|
| 56 |
-
| deepset_direct | 61.9% | 10.2% | 69.2% | 40 | 18 |
|
| 57 |
-
| notinject_fpr | 0.0% | 0.9% | 0.0% | 0 | 3 |
|
| 58 |
-
| xstest_safe | 0.0% | 2.8% | 0.0% | 0 | 7 |
|
| 59 |
-
| xstest_fpr | 0.0% | 40.2% | 0.0% | 0 | 181 |
|
| 60 |
-
| xstest_harmful_contrast | 0.0% | 87.0% | 0.0% | 0 | 174 |
|
| 61 |
-
| public_validation | 100.0% | 0.1% | 100.0% | 1 | 2 |
|
| 62 |
-
|
| 63 |
-
### Vigil-parity holdouts (per-axis, not blended)
|
| 64 |
-
|
| 65 |
-
| Holdout | Recall | Doc FPR | F1 | Purpose |
|
| 66 |
-
| --- | --- | --- | --- | --- |
|
| 67 |
-
| pai_injecguard_valid **weak** | 89.6% | 20.8% | 77.5% | ProtectAI validation: InjecGuard_valid (144) |
|
| 68 |
-
| pai_spikee | 78.6% | 6.7% | 87.9% | ProtectAI validation: spikee contextual (986) |
|
| 69 |
-
| pai_bipia_code | 98.0% | 0.0% | 99.0% | ProtectAI validation: bipia_code (50) |
|
| 70 |
-
| pai_bipia_text | 89.3% | 0.0% | 94.4% | ProtectAI validation: bipia_text (75) |
|
| 71 |
-
| pai_not_inject | 0.0% | 0.9% | 0.0% | ProtectAI validation: not_inject trigger benign (339) |
|
| 72 |
-
| pai_wildguard **weak** | 0.0% | 54.2% | 0.0% | ProtectAI validation: wildguard benign diversity (971) |
|
| 73 |
-
| pai_deepset | 82.9% | 18.8% | 78.4% | ProtectAI validation: deepset full (662) |
|
| 74 |
-
| pai_validation_all **weak** | 81.0% | 34.1% | 71.7% | ProtectAI validation combined (3227) |
|
| 75 |
-
| bipia_contextual_proxy | 97.3% | 0.0% | 98.6% | Proxy for test_contextual (1242 indirect BIPIA rows) |
|
| 76 |
-
| llm_pieval | 76.1% | 0.0% | 86.5% | LLM-PIEval agentic injection (750, recall-only) |
|
| 77 |
-
| gold_direct_malicious_proxy | 81.0% | 0.0% | 89.5% | Proxy for test_gold_direct malicious slice |
|
| 78 |
-
| gold_direct_benign_proxy **weak** | 0.0% | 34.1% | 0.0% | Proxy for test_gold_direct benign slice (FPR) |
|
| 79 |
-
| jbb_harmful_overdefense **weak** | 0.0% | 96.0% | 0.0% | JailbreakBench harmful goals, should stay SAFE (100) |
|
| 80 |
-
| jbb_benign_overdefense | 0.0% | 6.0% | 0.0% | JailbreakBench benign goals, should stay SAFE (100) |
|
| 81 |
-
| toxicchat_benign | 0.0% | 2.0% | 0.0% | ToxicChat benign over-defense (up to 4800) |
|
| 82 |
-
| neuralchemy_test | 94.4% | 0.5% | 96.9% | NeurAlchemy test (942); Vigil card reports this axis |
|
| 83 |
-
| neuralchemy_validation | 93.8% | 2.5% | 95.9% | NeurAlchemy validation split |
|
| 84 |
-
| bipia_indirect | 96.3% | 0.0% | 98.1% | Our BIPIA indirect holdout (2000) |
|
| 85 |
-
| deepset_direct | 61.9% | 10.2% | 69.1% | Our Deepset OOD holdout (281) |
|
| 86 |
-
| notinject_fpr | 0.0% | 0.9% | 0.0% | Our notinject FPR holdout (339) |
|
| 87 |
-
| xstest_safe | 0.0% | 2.8% | 0.0% | XSTest safe homonym FPR |
|
| 88 |
-
| xstest_fpr **weak** | 0.0% | 40.2% | 0.0% | XSTest combined FPR |
|
| 89 |
-
| xstest_harmful_contrast **weak** | 0.0% | 87.0% | 0.0% | Harmful but non-injection contrast FPR |
|
| 90 |
|
| 91 |
## Limitations
|
| 92 |
|
| 93 |
-
-
|
| 94 |
-
-
|
| 95 |
-
-
|
| 96 |
-
- Long agentic
|
|
|
|
| 97 |
|
| 98 |
-
##
|
| 99 |
|
| 100 |
-
|
| 101 |
-
from unplug import Guard
|
| 102 |
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
|
| 107 |
-
|
|
|
|
| 1 |
---
|
| 2 |
language: en
|
| 3 |
license: apache-2.0
|
| 4 |
+
library_name: transformers
|
| 5 |
+
pipeline_tag: text-classification
|
| 6 |
+
base_model: microsoft/deberta-v3-xsmall
|
| 7 |
+
inference: false
|
| 8 |
tags:
|
| 9 |
- prompt-injection
|
| 10 |
- security
|
| 11 |
- span-detection
|
| 12 |
+
- guardrails
|
| 13 |
+
- ai-safety
|
| 14 |
+
- agents
|
| 15 |
+
- llm-security
|
| 16 |
---
|
| 17 |
|
| 18 |
# unplug-tiny-v1
|
| 19 |
|
| 20 |
+
**Find the attack. Cut the attack. Keep the rest.**
|
| 21 |
+
|
| 22 |
+
unplug-tiny is a dual-head span detector for prompt injection. A document head decides *whether* text is hostile; a BIOES token head localizes *where*, so your pipeline can redact the malicious span instead of throwing away the whole document.
|
| 23 |
+
|
| 24 |
+
<p>
|
| 25 |
+
<a href="https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo"><img alt="Live demo" src="https://img.shields.io/badge/Live_demo-unplug--tiny--demo-22c55e"></a>
|
| 26 |
+
<a href="https://github.com/UnplugAI/Unplug"><img alt="SDK" src="https://img.shields.io/badge/SDK-github.com%2FUnplugAI%2FUnplug-3b82f6"></a>
|
| 27 |
+
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache_2.0-9ca3af"></a>
|
| 28 |
+
</p>
|
| 29 |
|
| 30 |
+
> **Preview release.** unplug-tiny is the smallest tier of the Unplug defense layer. The numbers below are measured by a frozen evaluation harness on held-out data, including the axes where it fails. It is not a production WAF.
|
| 31 |
|
| 32 |
+
## At a glance
|
| 33 |
+
|
| 34 |
+
| | |
|
| 35 |
+
|---|---|
|
| 36 |
+
| **Task** | Prompt-injection detection + character-level span localization |
|
| 37 |
+
| **Architecture** | Dual-head encoder: document classifier + BIOES token head |
|
| 38 |
+
| **Backbone** | DeBERTa-v3-xsmall (70M params, 22M non-embedding) |
|
| 39 |
+
| **Decision policy** | `doc_or_span`: doc threshold 0.9, span threshold 0.45 |
|
| 40 |
+
| **Long documents** | Full coverage via sliding windows (2048 chars, 256 overlap) in the SDK |
|
| 41 |
+
| **Checkpoint** | `checkpoint-66630` |
|
| 42 |
+
| **License** | Apache-2.0 |
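
The BIOES token head in the table above labels each token as Begin, Inside, Outside, End, or Single. As a rough illustration of how such tags can be turned into character-level spans, here is a minimal sketch. It assumes per-token tag strings and `(start, end)` character offsets (for example, from a fast tokenizer's offset mapping); the function name `bioes_to_char_spans` is hypothetical and this is not the SDK's implementation.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]


def bioes_to_char_spans(tags: Sequence[str], offsets: Sequence[Span]) -> List[Span]:
    """Convert per-token BIOES tags into (start, end) character spans.

    Decoding is strict: a span is emitted only for a lone "S" token or for a
    "B ... E" run; an unfinished run is dropped when "O" interrupts it.
    """
    spans: List[Span] = []
    open_start = None  # character start of the span currently being built
    for tag, (tok_start, tok_end) in zip(tags, offsets):
        if tag == "S":
            spans.append((tok_start, tok_end))
            open_start = None
        elif tag == "B":
            open_start = tok_start  # a new "B" restarts any unfinished span
        elif tag == "I" and open_start is not None:
            continue  # still inside the span; wait for "E"
        elif tag == "E" and open_start is not None:
            spans.append((open_start, tok_end))
            open_start = None
        else:
            open_start = None  # "O" or a malformed sequence
    return spans


text = "Hi. Ignore all rules. Bye"
offsets = [(0, 2), (2, 3), (4, 10), (11, 14), (15, 20), (20, 21), (22, 25)]
tags = ["O", "O", "B", "I", "E", "O", "O"]
spans = bioes_to_char_spans(tags, offsets)
print([text[s:e] for s, e in spans])
```

Strict decoding trades some recall for cleaner spans; a more lenient decoder could close an unfinished `B ... I` run at the last inside token instead of dropping it.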
|
| 43 |
+
|
| 44 |
+
## Quickstart
|
| 45 |
+
|
| 46 |
+
The recommended path is the [Unplug SDK](https://github.com/UnplugAI/Unplug), which wires text normalization, encoded-payload decoding, thresholds, span merging, and redaction around the model:
|
| 47 |
+
|
| 48 |
+
```bash
|
| 49 |
+
pip install "unplug-ai[ml]"
|
| 50 |
+
```
|
| 51 |
+
|
| 52 |
+
```python
|
| 53 |
+
from unplug import Guard
|
| 54 |
+
|
| 55 |
+
guard = Guard.with_tiny() # auto-downloads this checkpoint
|
| 56 |
+
result = guard.scan(untrusted_text)
|
| 57 |
+
|
| 58 |
+
if not result.safe:
|
| 59 |
+
    print(result.redacted_text) # malicious spans replaced, rest preserved
|
| 60 |
+
    for f in result.findings:
|
| 61 |
+
        print(f.category, f.span_start, f.span_end, f.score)
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
Streaming LLM output and full long-document coverage:
|
| 65 |
+
|
| 66 |
+
```python
|
| 67 |
+
scanner = guard.stream_scanner(scan_every_chars=1024)
|
| 68 |
+
for chunk in token_stream:
|
| 69 |
+
    if hit := scanner.push(chunk):
|
| 70 |
+
        handle(hit)
|
| 71 |
+
scanner.flush()
|
| 72 |
+
```
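
The card states that long documents are covered with sliding windows of 2048 characters and 256 characters of overlap. The sketch below shows one way such windows could be produced and how a window-relative span maps back to document offsets. It assumes a fixed stride of `size - overlap`; the helper names `char_windows` and `to_document_span` are hypothetical, and the SDK may window differently.

```python
from typing import List, Tuple


def char_windows(text: str, size: int = 2048, overlap: int = 256) -> List[Tuple[int, str]]:
    """Split text into (offset, chunk) windows that overlap by `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, text[start:start + size]))
        if start + size >= len(text):
            break  # this window reaches the end of the text
        start += step
    return windows


def to_document_span(window_offset: int, span: Tuple[int, int]) -> Tuple[int, int]:
    """Shift a span found inside a window to whole-document coordinates."""
    return (window_offset + span[0], window_offset + span[1])


document = "x" * 5000
windows = char_windows(document)
print([offset for offset, _ in windows])
```

The overlap exists so that an injection straddling a window boundary still appears whole in at least one window, as long as it is shorter than the overlap.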
|
| 73 |
|
| 74 |
+
The checkpoint uses a custom dual-head architecture; loading it raw with `AutoModel` will not give you the decision policy. Use the SDK or replicate the policy from `config.json` (`dual_head: true`, `doc_positive_index`, `label2id`).
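
For readers replicating the policy by hand, the following is a minimal sketch of what a `doc_or_span` decision could look like. The thresholds (0.9 and 0.45) come from this card; everything else is an assumption: that the policy is a logical OR of the two heads, that the comparison is `>=`, and that the document score is the softmax probability at `doc_positive_index`. The names `Policy`, `positive_probability`, and `is_unsafe` are hypothetical and not part of the SDK.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Policy:
    doc_threshold: float = 0.9    # from the model card
    span_threshold: float = 0.45  # from the model card


def positive_probability(logits: Sequence[float], positive_index: int) -> float:
    """Softmax probability of the class at positive_index (assumed scoring)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[positive_index] / sum(exps)


def is_unsafe(doc_score: float, span_scores: Sequence[float], policy: Policy = Policy()) -> bool:
    """Flag the text if either head crosses its threshold (assumed OR semantics)."""
    doc_hit = doc_score >= policy.doc_threshold
    span_hit = any(score >= policy.span_threshold for score in span_scores)
    return doc_hit or span_hit


print(is_unsafe(0.95, []))          # document head alone is enough
print(is_unsafe(0.50, [0.1, 0.5]))  # one confident span is enough
print(is_unsafe(0.50, [0.1, 0.4]))  # neither head crosses its threshold
```

Under this reading, the lower span threshold lets a localized hit flag a document even when the document head is unsure.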
|
| 75 |
|
| 76 |
+
## Try it live
|
| 77 |
+
|
| 78 |
+
**[Interactive demo →](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo)**: paste text, see span highlights and redacted output, and compare against a regex-only baseline. Curated test cases include the ones this model gets wrong.
|
| 79 |
+
|
| 80 |
+
## Where it's strong, and where it isn't
|
| 81 |
+
|
| 82 |
+
**Strong (measured):**
|
| 83 |
+
- 94.4% recall at 0.5% FPR on the core injection test set
|
| 84 |
+
- 96.3% recall on indirect injection embedded in task context (0.0% FPR)
|
| 85 |
+
- 0.9% FPR on benign text full of trigger words ("ignore", "instructions", …)
|
| 86 |
+
- 97.1% span F1: when it fires, it localizes precisely (0.0% benign span fire rate)
|
| 87 |
+
|
| 88 |
+
**Weak (also measured):**
|
| 89 |
+
- Subtle out-of-distribution direct injections: 61.9% recall
|
| 90 |
+
- Harmful-but-not-injection requests: the doc head over-fires (87.0% FPR on that contrast axis). This model detects *injection*; it is not a content-safety classifier
|
| 91 |
+
- Diverse benign chat from adversarial-adjacent distributions: up to 54.2% FPR on the hardest benign axis
|
| 92 |
+
- Long agentic contexts: 76.1% recall
|
| 93 |
+
|
| 94 |
+
## Evaluation
|
| 95 |
+
|
| 96 |
+
All numbers are produced by a frozen golden-eval harness on held-out data. Recall is reported on malicious sets, FPR on benign sets. No number on this card is hand-typed.
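
As a reading aid for the tables that follow, the metrics can be computed from raw counts as sketched below. These are the standard definitions, not code from the evaluation harness; the function names are hypothetical. The worked example uses the NotInject-style benign holdout reported on this card (339 rows, 3 false positives) and assumes all 339 rows are benign.

```python
def recall(tp: int, fn: int) -> float:
    """Share of malicious rows that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def fpr(fp: int, tn: int) -> float:
    """Share of benign rows that were wrongly flagged."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written in count form."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# NotInject-style benign holdout: 339 rows, 3 false positives (from this card).
benign_rows, false_positives = 339, 3
notinject_fpr = fpr(false_positives, benign_rows - false_positives)
print(f"{100 * notinject_fpr:.1f}%")
```

On a benign-only holdout there are no malicious rows, so recall and F1 are undefined rather than meaningfully zero.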
|
| 97 |
+
|
| 98 |
+
### Detection holdouts (malicious)
|
| 99 |
+
|
| 100 |
+
| Holdout | Recall | FPR | F1 | FN | FP |
|
| 101 |
+
| --- | --- | --- | --- | --- | --- |
|
| 102 |
+
| Core injection test (942) | 94.4% | 0.5% | 96.9% | 31 | 2 |
|
| 103 |
+
| Indirect injection in context (2000) | 96.3% | 0.0% | 98.1% | 74 | 0 |
|
| 104 |
+
| Public validation set | 100.0% | 0.1% | 100.0% | 1 | 2 |
|
| 105 |
+
| Span holdout (token-level) | 98.8% | n/a | 97.1% | 219 | 805 |
|
| 106 |
+
| OOD direct injection (281) | 61.9% | 10.2% | 69.2% | 40 | 18 |
|
| 107 |
+
|
| 108 |
+
### Over-defense holdouts (benign; FPR, lower is better)
|
| 109 |
+
|
| 110 |
+
| Holdout | FPR | FP |
|
| 111 |
+
| --- | --- | --- |
|
| 112 |
+
| Trigger-word benign probes | 0.0% | 0 |
|
| 113 |
+
| NotInject-style benign (339) | 0.9% | 3 |
|
| 114 |
+
| Safe homonyms ("demolish my personal best") | 2.8% | 7 |
|
| 115 |
+
| Combined homonym/over-defense set | 40.2% | 181 |
|
| 116 |
+
| Harmful-but-not-injection contrast | 87.0% | 174 |
|
| 117 |
+
|
| 118 |
+
### Public benchmark axes
|
| 119 |
+
|
| 120 |
+
| Axis | Recall | Doc FPR | F1 |
|
| 121 |
+
| --- | --- | --- | --- |
|
| 122 |
+
| InjecGuard validation (144) | 89.6% | 20.8% | 77.5% |
|
| 123 |
+
| spikee contextual (986) | 78.6% | 6.7% | 87.9% |
|
| 124 |
+
| BIPIA code (50) | 98.0% | 0.0% | 99.0% |
|
| 125 |
+
| BIPIA text (75) | 89.3% | 0.0% | 94.4% |
|
| 126 |
+
| BIPIA indirect proxy (1242) | 97.3% | 0.0% | 98.6% |
|
| 127 |
+
| Deepset full (662) | 82.9% | 18.8% | 78.4% |
|
| 128 |
+
| LLM-PIEval agentic (750, recall-only) | 76.1% | 0.0% | 86.5% |
|
| 129 |
+
| Direct malicious proxy | 81.0% | 0.0% | 89.5% |
|
| 130 |
+
| NotInject trigger benign (339) | n/a | 0.9% | n/a |
|
| 131 |
+
| WildGuard benign diversity (971) | n/a | 54.2% | n/a |
|
| 132 |
+
| Direct benign proxy | n/a | 34.1% | n/a |
|
| 133 |
+
| JailbreakBench harmful goals (100) | n/a | 96.0% | n/a |
|
| 134 |
+
| JailbreakBench benign goals (100) | n/a | 6.0% | n/a |
|
| 135 |
+
| ToxicChat benign (≤4800) | n/a | 2.0% | n/a |
|
| 136 |
+
| Combined public validation (3227) | 81.0% | 34.1% | 71.7% |
|
| 137 |
+
|
| 138 |
+
<details>
|
| 139 |
+
<summary><b>Release gates (full pass/fail record)</b></summary>
|
| 140 |
|
| 141 |
| Gate | Value | Status |
|
| 142 |
| --- | --- | --- |
|
|
|
|
| 156 |
| xstest_harmful_contrast_fpr | 87.0% | FAIL |
|
| 157 |
| exfil_demo | None | PASS |
|
| 158 |
|
| 159 |
+
Shipped as a preview with two failing required gates (OOD direct recall/FPR) and one failing optional gate (harmful contrast), documented above.
|
| 160 |
|
| 161 |
+
</details>
|
| 162 |
|
| 163 |
## Limitations
|
| 164 |
|
| 165 |
+
- The doc head over-fires on harmful-but-non-injection text. If you need content safety, pair this with a dedicated harmful-content classifier. This model answers "is someone hijacking my LLM?", not "is this request harmful?"
|
| 166 |
+
- Subtle direct OOD injections are often missed by both heads.
|
| 167 |
+
- Diverse benign conversational text from adversarial-adjacent sources triggers false positives.
|
| 168 |
+
- Long agentic tool-use contexts have recall gaps.
|
| 169 |
+
- English-centric training data.
|
| 170 |
|
| 171 |
+
## Intended use
|
| 172 |
|
| 173 |
+
Defense-in-depth layer for LLM apps and agents: scan untrusted input (user messages, RAG chunks, tool output, fetched web content) before it reaches your model, and redact flagged spans. Not a standalone security boundary: combine with tool-call gating, taint tracking, and least-privilege design (all included in the SDK).
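
To make the "redact flagged spans" step concrete, here is a minimal sketch of span merging followed by redaction. It assumes spans are `(start, end)` character offsets with an exclusive end; the `[REDACTED]` placeholder and the function names `merge_spans` and `redact` are hypothetical, and the SDK's own replacement text and merging rules may differ.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]


def merge_spans(spans: Iterable[Span]) -> List[Span]:
    """Sort spans and merge any that overlap or touch."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact(text: str, spans: Iterable[Span], placeholder: str = "[REDACTED]") -> str:
    """Replace each merged span with the placeholder and keep the rest."""
    pieces = []
    cursor = 0
    for start, end in merge_spans(spans):
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


chunk = "Summary: fine. Ignore all rules. Thanks."
attack = "Ignore all rules."
start = chunk.index(attack)
end = start + len(attack)
# Two overlapping findings for the same attack collapse into one redaction.
cleaned = redact(chunk, [(start, end), (start + 7, end)])
print(cleaned)
```

Merging before replacing matters because overlapping findings would otherwise produce duplicated placeholders or cut into text that should be preserved.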
|
|
|
|
| 174 |
|
| 175 |
+
## Part of the Unplug stack
|
| 176 |
+
|
| 177 |
+
| Layer | What it does |
|
| 178 |
+
| --- | --- |
|
| 179 |
+
| [`unplug-ai` SDK](https://github.com/UnplugAI/Unplug) | Guard pipeline: normalization, regex + ML scanners, taint tracking, tool-call gates, redaction |
|
| 180 |
+
| **unplug-tiny-v1** (this model) | ML span detection tier |
|
| 181 |
+
| [Live demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo) | Interactive span highlighting + redaction |
|
| 182 |
|
| 183 |
+
Agent kill-chain walkthrough in [`agent_exfil_demo.py`](https://github.com/UnplugAI/Unplug/blob/main/sdk/examples/agent_exfil_demo.py): hidden webpage injection → tainted session → blocked exfiltration tool call.
|