---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model: microsoft/deberta-v3-xsmall
inference: false
tags:
- prompt-injection
- security
- span-detection
- guardrails
- ai-safety
- agents
- llm-security
---

# unplug-tiny-v1

**Find the attack. Cut the attack. Keep the rest.**

unplug-tiny is a dual-head span detector for prompt injection. A document head decides *whether* text is hostile; a BIOES token head localizes *where* - so your pipeline can redact the malicious span instead of throwing away the whole document.

> **Preview release.** unplug-tiny is the smallest tier of the Unplug defense layer. The numbers below are measured by a frozen evaluation harness on held-out data - including the axes where it fails. It is not a production WAF.

## At a glance

| | |
|---|---|
| **Task** | Prompt-injection detection + character-level span localization |
| **Architecture** | Dual-head encoder: document classifier + BIOES token head |
| **Backbone** | DeBERTa-v3-xsmall (70M params, 22M non-embedding) |
| **Decision policy** | `doc_or_span` - doc threshold 0.9, span threshold 0.45 |
| **Long documents** | Full coverage via sliding windows (2048 chars, 256 overlap) in the SDK |
| **Checkpoint** | `checkpoint-66630` |
| **License** | Apache-2.0 |

## Quickstart

The recommended path is the [Unplug SDK](https://github.com/UnplugAI/Unplug), which wires text normalization, encoded-payload decoding, thresholds, span merging, and redaction around the model:

```bash
pip install "unplug-ai[ml]"
```

```python
from unplug import Guard

guard = Guard.with_tiny()  # auto-downloads this checkpoint
result = guard.scan(untrusted_text)

if not result.safe:
    print(result.redacted_text)  # malicious spans replaced, rest preserved
    for f in result.findings:
        print(f.category, f.span_start, f.span_end, f.score)
```

Streaming LLM output and full long-document coverage:

```python
scanner = guard.stream_scanner(scan_every_chars=1024)
for chunk in token_stream:
    if hit := scanner.push(chunk):
        handle(hit)
scanner.flush()
```

The checkpoint uses a custom dual-head architecture; loading it raw with `AutoModel` will not give you the decision policy. Use the SDK or replicate the policy from `config.json` (`dual_head: true`, `doc_positive_index`, `label2id`).

## Try it live

**[Open the interactive demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo)** to paste text, see span highlights and redacted output, and compare against a regex-only baseline. Curated test cases include the ones this model gets wrong.

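If you replicate the policy outside the SDK, the thresholds and window sizes from the At a glance table can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the SDK's code: the function names `char_windows` and `doc_or_span` are hypothetical, `doc_or_span` is read here as "flag if either head crosses its threshold", and the inclusive `>=` comparison is a guess.

```python
from typing import Iterator, List, Tuple

DOC_THRESHOLD = 0.9    # doc threshold from the card
SPAN_THRESHOLD = 0.45  # span threshold from the card
WINDOW_CHARS = 2048    # sliding-window size from the card
OVERLAP_CHARS = 256    # sliding-window overlap from the card


def char_windows(text: str, size: int = WINDOW_CHARS,
                 overlap: int = OVERLAP_CHARS) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) character offsets that together cover the whole text."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts 1792 chars after the previous one
    start = 0
    while True:
        end = min(start + size, len(text))
        yield start, end
        if end >= len(text):
            break
        start += step


def doc_or_span(doc_prob: float, span_probs: List[float]) -> bool:
    """Assumed reading of the policy name: either head alone is enough to flag."""
    return doc_prob >= DOC_THRESHOLD or any(p >= SPAN_THRESHOLD for p in span_probs)


# A 5000-character document needs three overlapping windows.
print(list(char_windows("x" * 5000)))
# → [(0, 2048), (1792, 3840), (3584, 5000)]
```

Offsets are yielded instead of substrings so that span positions found inside a window can be shifted back to document coordinates by adding `start`.
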
## Where it's strong - and where it isn't

**Strong (measured):**

- 94.4% recall at 0.5% FPR on the core injection test set
- 96.3% recall on indirect injection embedded in task context (0.0% FPR)
- 0.9% FPR on benign text full of trigger words ("ignore", "instructions", ...)
- 97.1% span F1 - when it fires, it localizes precisely (0.0% benign span fire rate)

**Weak (also measured):**

- Subtle out-of-distribution direct injections: 61.9% recall
- Harmful-but-not-injection requests: the doc head over-fires (87.0% FPR on that contrast axis) - this model detects *injection*, it is not a content-safety classifier
- Diverse benign chat from adversarial-adjacent distributions: up to 54.2% FPR on the hardest benign axis
- Long agentic contexts: 76.1% recall

## Evaluation

All numbers are produced by a frozen golden-eval harness on held-out data. Recall is reported on malicious sets, FPR on benign sets. No number on this card is hand-typed.

### Detection holdouts (malicious)

| Holdout | Recall | FPR | F1 | FN | FP |
| --- | --- | --- | --- | --- | --- |
| Core injection test (942) | 94.4% | 0.5% | 96.9% | 31 | 2 |
| Indirect injection in context (2000) | 96.3% | 0.0% | 98.1% | 74 | 0 |
| Public validation set | 100.0% | 0.1% | 100.0% | 1 | 2 |
| Span holdout (token-level) | 98.8% | - | 97.1% | 219 | 805 |
| OOD direct injection (281) | 61.9% | 10.2% | 69.2% | 40 | 18 |

### Over-defense holdouts (benign - FPR, lower is better)

| Holdout | FPR | FP |
| --- | --- | --- |
| Trigger-word benign probes | 0.0% | 0 |
| NotInject-style benign (339) | 0.9% | 3 |
| Safe homonyms ("demolish my personal best") | 2.8% | 7 |
| Combined homonym/over-defense set | 40.2% | 181 |
| Harmful-but-not-injection contrast | 87.0% | 174 |

### Public benchmark axes

| Axis | Recall | Doc FPR | F1 |
| --- | --- | --- | --- |
| InjecGuard validation (144) | 89.6% | 20.8% | 77.5% |
| spikee contextual (986) | 78.6% | 6.7% | 87.9% |
| BIPIA code (50) | 98.0% | 0.0% | 99.0% |
| BIPIA text (75) | 89.3% | 0.0% | 94.4% |
| BIPIA indirect proxy (1242) | 97.3% | 0.0% | 98.6% |
| Deepset full (662) | 82.9% | 18.8% | 78.4% |
| LLM-PIEval agentic (750, recall-only) | 76.1% | 0.0% | 86.5% |
| Direct malicious proxy | 81.0% | 0.0% | 89.5% |
| NotInject trigger benign (339) | - | 0.9% | - |
| WildGuard benign diversity (971) | - | 54.2% | - |
| Direct benign proxy | - | 34.1% | - |
| JailbreakBench harmful goals (100) | - | 96.0% | - |
| JailbreakBench benign goals (100) | - | 6.0% | - |
| ToxicChat benign (≤4800) | - | 2.0% | - |
| Combined public validation (3227) | 81.0% | 34.1% | 71.7% |
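
The Recall, FPR, and F1 columns in these tables presumably follow the standard confusion-matrix definitions. The harness itself is not shown on this card, so the sketch below is the textbook computation and not its code; the `Counts` class and the example counts are hypothetical and are not taken from any row above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # malicious samples that were flagged
    fn: int  # malicious samples that were missed
    fp: int  # benign samples that were flagged
    tn: int  # benign samples that were passed


def _ratio(num: int, den: int) -> float:
    # An empty denominator means the metric is undefined for this set.
    return num / den if den else float("nan")


def recall(c: Counts) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def fpr(c: Counts) -> float:
    return _ratio(c.fp, c.fp + c.tn)


def f1(c: Counts) -> float:
    # Equivalent to the harmonic mean of precision and recall.
    return _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn)


# Illustrative counts only (not from this card).
example = Counts(tp=90, fn=10, fp=5, tn=95)
print(f"recall={recall(example):.1%} fpr={fpr(example):.1%} f1={f1(example):.1%}")
# → recall=90.0% fpr=5.0% f1=92.3%
```

On a benign-only set there are no malicious samples, so recall and F1 are undefined under these definitions, which is consistent with the `-` cells on the benign rows.
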