---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model: microsoft/deberta-v3-xsmall
inference: false
tags:
 - prompt-injection
 - security
 - span-detection
 - guardrails
 - ai-safety
 - agents
 - llm-security
---

# unplug-tiny-v1

**Find the attack. Cut the attack. Keep the rest.**

unplug-tiny is a dual-head span detector for prompt injection. A document head decides *whether* text is hostile; a BIOES token head localizes *where* - so your pipeline can redact the malicious span instead of throwing away the whole document.

<p>
  <a href="https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo"><img alt="Live demo" src="https://img.shields.io/badge/Live_demo-unplug--tiny--demo-22c55e"></a>
  <a href="https://github.com/UnplugAI/Unplug"><img alt="SDK" src="https://img.shields.io/badge/SDK-github.com%2FUnplugAI%2FUnplug-3b82f6"></a>
  <a href="https://www.apache.org/licenses/LICENSE-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache_2.0-9ca3af"></a>
</p>

> **Preview release.** unplug-tiny is the smallest tier of the Unplug defense layer. The numbers below are measured by a frozen evaluation harness on held-out data - including the axes where it fails. It is not a production WAF.

## At a glance

| | |
|---|---|
| **Task** | Prompt-injection detection + character-level span localization |
| **Architecture** | Dual-head encoder: document classifier + BIOES token head |
| **Backbone** | DeBERTa-v3-xsmall (70M params, 22M non-embedding) |
| **Decision policy** | `doc_or_span` - doc threshold 0.9, span threshold 0.45 |
| **Long documents** | Full coverage via sliding windows (2048 chars, 256 overlap) in the SDK |
| **Checkpoint** | `checkpoint-66630` |
| **License** | Apache-2.0 |

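The `doc_or_span` row above reads as a simple OR over the two heads. A minimal sketch of that rule, assuming "fires" means a score at or above its threshold and that the span head yields one score per candidate span. `Decision` and the function signature are illustrative, not the SDK's API:

```python
from dataclasses import dataclass
from typing import Sequence

DOC_THRESHOLD = 0.9    # doc threshold from the table above
SPAN_THRESHOLD = 0.45  # span threshold from the table above


@dataclass
class Decision:
    flagged: bool
    doc_hit: bool
    span_hit: bool


def doc_or_span(doc_prob: float, span_scores: Sequence[float]) -> Decision:
    """Flag when either head clears its threshold (assumed semantics)."""
    doc_hit = doc_prob >= DOC_THRESHOLD
    span_hit = any(score >= SPAN_THRESHOLD for score in span_scores)
    return Decision(flagged=doc_hit or span_hit, doc_hit=doc_hit, span_hit=span_hit)
```

Under this reading, a confident span can flag a document even when the doc head scores it low.
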
## Quickstart

The recommended path is the [Unplug SDK](https://github.com/UnplugAI/Unplug), which wires text normalization, encoded-payload decoding, thresholds, span merging, and redaction around the model:

```bash
pip install "unplug-ai[ml]"
```

```python
from unplug import Guard

guard = Guard.with_tiny()          # auto-downloads this checkpoint
result = guard.scan(untrusted_text)

if not result.safe:
    print(result.redacted_text)    # malicious spans replaced, rest preserved
    for f in result.findings:
        print(f.category, f.span_start, f.span_end, f.score)
```
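
The SDK handles span merging and redaction for you. For intuition, here is a rough sketch of how overlapping findings could be merged and replaced, assuming `span_start`/`span_end` are character offsets with an exclusive end. `merge_spans`, `redact`, and the `[REDACTED]` placeholder are hypothetical and not taken from the SDK:

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed)


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge overlapping or touching spans into a sorted, disjoint list."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact(text: str, spans: List[Span], placeholder: str = "[REDACTED]") -> str:
    """Replace each merged span with a placeholder and keep everything else."""
    pieces = []
    cursor = 0
    for start, end in merge_spans(spans):
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Merging first means two findings that cover the same attack produce one placeholder rather than two.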

Streaming LLM output and full long-document coverage:

```python
scanner = guard.stream_scanner(scan_every_chars=1024)
for chunk in token_stream:
    if hit := scanner.push(chunk):
        handle(hit)
scanner.flush()
```
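
Long-document coverage relies on overlapping windows (2048 characters with 256 of overlap, per the table above). A minimal sketch of how such windows could be generated. The function name and the choice to yield each window's start offset are assumptions, not the SDK's implementation:

```python
from typing import Iterator, Tuple


def sliding_windows(
    text: str, size: int = 2048, overlap: int = 256
) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, window_text) pairs that cover the whole text."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    stride = size - overlap  # 1792 characters with the defaults
    start = 0
    while True:
        yield start, text[start:start + size]
        if start + size >= len(text):
            break  # this window reached the end of the text
        start += stride
```

Keeping the start offset lets window-relative spans be shifted back into document coordinates before merging.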

The checkpoint uses a custom dual-head architecture; loading it raw with `AutoModel` will not give you the decision policy. Use the SDK or replicate the policy from `config.json` (`dual_head: true`, `doc_positive_index`, `label2id`).
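
If you go the raw route, the token head's BIOES tags still need to be turned into character spans. A minimal sketch, assuming bare `B`/`I`/`E`/`S`/`O` tag strings (the actual names live in `label2id` and may differ) and per-token `(start, end)` character offsets such as a fast tokenizer's offset mapping. It is lenient about malformed sequences, which is a choice made for this example rather than documented behavior:

```python
from typing import List, Optional, Tuple


def bioes_to_char_spans(
    tags: List[str], offsets: List[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Convert per-token BIOES tags into (start, end) character spans."""
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None  # char start of the currently open span
    end: Optional[int] = None    # char end of the last token in that span
    for tag, (tok_start, tok_end) in zip(tags, offsets):
        if tag in ("O", "B", "S") and start is not None:
            # An open span was never closed with E: keep it rather than drop it.
            spans.append((start, end))
            start = end = None
        if tag == "S":
            spans.append((tok_start, tok_end))
        elif tag == "B":
            start, end = tok_start, tok_end
        elif tag in ("I", "E"):
            if start is None:
                start = tok_start  # tolerate a missing B
            end = tok_end
            if tag == "E":
                spans.append((start, end))
                start = end = None
        elif tag != "O":
            raise ValueError(f"unknown tag: {tag!r}")
    if start is not None:
        spans.append((start, end))  # sequence ended inside a span
    return spans
```

Erring toward keeping unterminated spans favors over-redaction over a silent miss. A stricter decoder could discard them instead.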

## Try it live

**[Open the interactive demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo)** to paste text, see span highlights and redacted output, and compare against a regex-only baseline. Curated test cases include the ones this model gets wrong.

## Where it's strong - and where it isn't

**Strong (measured):**
- 94.4% recall at 0.5% FPR on the core injection test set
- 96.3% recall on indirect injection embedded in task context (0.0% FPR)
- 0.9% FPR on NotInject-style benign text full of trigger words ("ignore", "instructions", ...)
- 97.1% span F1 - when it fires, it localizes precisely (0.0% benign span fire rate)

**Weak (also measured):**
- Subtle out-of-distribution direct injections: 61.9% recall
- Harmful-but-not-injection requests: the doc head over-fires (87.0% FPR on that contrast axis) - this model detects *injection*, it is not a content-safety classifier
- Diverse benign chat from adversarial-adjacent distributions: up to 54.2% FPR on the hardest benign axis
- Long agentic contexts: 76.1% recall

## Evaluation

All numbers are produced by a frozen golden-eval harness on held-out data. Recall is reported on malicious sets, FPR on benign sets. No number on this card is hand-typed.
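
For reference, these metrics follow the standard confusion-matrix definitions. The sketch below recomputes the indirect-injection row from its counts, assuming all 2000 samples in that holdout are malicious (so true positives = 2000 - 74). The helper names are illustrative and not part of the harness:

```python
from typing import Optional


def recall(tp: int, fn: int) -> Optional[float]:
    """Share of malicious samples that were flagged."""
    return tp / (tp + fn) if (tp + fn) else None


def fpr(fp: int, tn: int) -> Optional[float]:
    """Share of benign samples that were flagged; None if there are no benign samples."""
    return fp / (fp + tn) if (fp + tn) else None


def f1(tp: int, fp: int, fn: int) -> Optional[float]:
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else None


# Indirect injection in context: 2000 samples (assumed all malicious), FN = 74, FP = 0.
tp = 2000 - 74
print(round(recall(tp, 74) * 100, 1))  # 96.3
print(round(f1(tp, 0, 74) * 100, 1))   # 98.1
```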

### Detection holdouts (malicious)

| Holdout | Recall | FPR | F1 | FN | FP |
| --- | --- | --- | --- | --- | --- |
| Core injection test (942) | 94.4% | 0.5% | 96.9% | 31 | 2 |
| Indirect injection in context (2000) | 96.3% | 0.0% | 98.1% | 74 | 0 |
| Public validation set | 100.0% | 0.1% | 100.0% | 1 | 2 |
| Span holdout (token-level) | 98.8% | - | 97.1% | 219 | 805 |
| OOD direct injection (281) | 61.9% | 10.2% | 69.2% | 40 | 18 |

### Over-defense holdouts (benign - FPR, lower is better)

| Holdout | FPR | FP |
| --- | --- | --- |
| Trigger-word benign probes | 0.0% | 0 |
| NotInject-style benign (339) | 0.9% | 3 |
| Safe homonyms ("demolish my personal best") | 2.8% | 7 |
| Combined homonym/over-defense set | 40.2% | 181 |
| Harmful-but-not-injection contrast | 87.0% | 174 |

### Public benchmark axes

| Axis | Recall | Doc FPR | F1 |
| --- | --- | --- | --- |
| InjecGuard validation (144) | 89.6% | 20.8% | 77.5% |
| spikee contextual (986) | 78.6% | 6.7% | 87.9% |
| BIPIA code (50) | 98.0% | 0.0% | 99.0% |
| BIPIA text (75) | 89.3% | 0.0% | 94.4% |
| BIPIA indirect proxy (1242) | 97.3% | 0.0% | 98.6% |
| Deepset full (662) | 82.9% | 18.8% | 78.4% |
| LLM-PIEval agentic (750, recall-only) | 76.1% | 0.0% | 86.5% |
| Direct malicious proxy | 81.0% | 0.0% | 89.5% |
| NotInject trigger benign (339) | - | 0.9% | - |
| WildGuard benign diversity (971) | - | 54.2% | - |
| Direct benign proxy | - | 34.1% | - |
| JailbreakBench harmful goals (100) | - | 96.0% | - |
| JailbreakBench benign goals (100) | - | 6.0% | - |
| ToxicChat benign (≤4800) | - | 2.0% | - |
| Combined public validation (3227) | 81.0% | 34.1% | 71.7% |

<details>
<summary><b>Release gates (full pass/fail record)</b></summary>

| Gate | Value | Status |
| --- | --- | --- |
| fp_probes | True | PASS |
| neuralchemy_test_doc_fpr | 0.5% | PASS |
| neuralchemy_test_doc_recall | 94.4% | PASS |
| bipia_recall | 96.3% | PASS |
| deepset_direct_recall | 61.9% | FAIL |
| deepset_direct_fpr | 10.2% | FAIL |
| notinject_fpr | 0.9% | PASS |
| xstest_safe_fpr | 2.8% | PASS |
| public_validation_recall | 100.0% | PASS |
| public_validation_fpr | 0.1% | PASS |
| span_holdout_f1 | 97.1% | PASS |
| malicious_span_char_recall | 97.4% | PASS |
| benign_span_fire_rate | 0.0% | PASS |
| xstest_harmful_contrast_fpr | 87.0% | FAIL |
| exfil_demo | None | PASS |

This preview ships with two failing required gates (OOD direct recall and FPR) and one failing optional gate (harmful contrast), all documented above.

</details>

## Limitations

- The doc head over-fires on harmful-but-non-injection text. If you need content safety, pair this with a dedicated harmful-content classifier - this model answers "is someone hijacking my LLM?", not "is this request harmful?"
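- Subtle out-of-distribution direct injections are often missed by both heads (61.9% recall on the OOD direct holdout).
- Diverse benign conversational text from adversarial-adjacent sources triggers false positives (up to 54.2% FPR on WildGuard benign diversity).
- Long agentic tool-use contexts have recall gaps (76.1% recall on LLM-PIEval agentic).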
- English-centric training data.

## Intended use

Defense-in-depth layer for LLM apps and agents: scan untrusted input (user messages, RAG chunks, tool output, fetched web content) before it reaches your model, and redact flagged spans. Not a standalone security boundary - combine with tool-call gating and taint tracking (both included in the SDK) and with least-privilege design.
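
A minimal sketch of that placement, scanning retrieved chunks before prompt assembly. It relies only on the `scan`, `safe`, and `redacted_text` names shown in the Quickstart. `sanitize_chunks` and the `Protocol` types are hypothetical helpers written for this example:

```python
from typing import Iterable, List, Protocol


class ScanResult(Protocol):
    safe: bool
    redacted_text: str


class GuardLike(Protocol):
    def scan(self, text: str) -> ScanResult: ...


def sanitize_chunks(guard: GuardLike, chunks: Iterable[str]) -> List[str]:
    """Return chunks for the prompt: clean ones unchanged, flagged ones redacted."""
    cleaned: List[str] = []
    for chunk in chunks:
        result = guard.scan(chunk)
        cleaned.append(chunk if result.safe else result.redacted_text)
    return cleaned
```

With the SDK installed, the intended call would look like `sanitize_chunks(Guard.with_tiny(), retrieved_chunks)`.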

## Part of the Unplug stack

| Layer | What it does |
| --- | --- |
| [`unplug-ai` SDK](https://github.com/UnplugAI/Unplug) | Guard pipeline: normalization, regex + ML scanners, taint tracking, tool-call gates, redaction |
| **unplug-tiny-v1** (this model) | ML span detection tier |
| [Live demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo) | Interactive span highlighting + redaction |

Agent kill-chain walkthrough: [`agent_exfil_demo.py`](https://github.com/UnplugAI/Unplug/blob/main/sdk/examples/agent_exfil_demo.py) - hidden webpage injection -> tainted session -> blocked exfiltration tool call.