Unplug-AI/unplug-tiny-v1

@@ -1,29 +1,142 @@
 ---
 language: en
 license: apache-2.0
 tags:
   - prompt-injection
   - security
   - span-detection
-library_name: transformers
-pipeline_tag: text-classification
-model_name: Unplug-AI/unplug-tiny-v1
 ---
 # unplug-tiny-v1
-**Preview OSS span detector** — not a production WAF / Vigil replacement.
-- Backbone: `microsoft/deberta-v3-xsmall` (~22M dual-head)
-- Policy: `doc_or_span` @ τ_doc=0.9, τ_span=0.45
-- Checkpoint: `checkpoint-66630`
-- Generated: 2026-06-09
-## Scope
-Strong on Neuralchemy, BIPIA indirect, notinject, public validation. Known weaknesses: WildGuard benign FPR, harmful-non-injection contrast, Deepset OOD recall, agentic (LLM-PIEval).
-## Required ship gates
 | Gate | Value | Status |
 | --- | --- | --- |
@@ -43,65 +156,28 @@ Strong on Neuralchemy, BIPIA indirect, notinject, public validation. Known weakn
 | xstest_harmful_contrast_fpr | 87.0% | FAIL |
 | exfil_demo | None | PASS |
-**Required gate failures:** deepset_direct_recall, deepset_direct_fpr
-### Ship-gate holdouts (checkpoint-66630)
-| Holdout | Recall | FPR | F1 | FN | FP |
-| --- | --- | --- | --- | --- | --- |
-| fp_probes | None | None | None | 0 | 0 |
-| neuralchemy_test | 94.4% | 0.5% | 96.9% | 31 | 2 |
-| train_span_holdout | 98.8% | None | 97.1% | 219 | 805 |
-| bipia_indirect | 96.3% | 0.0% | 98.1% | 74 | 0 |
-| deepset_direct | 61.9% | 10.2% | 69.2% | 40 | 18 |
-| notinject_fpr | 0.0% | 0.9% | 0.0% | 0 | 3 |
-| xstest_safe | 0.0% | 2.8% | 0.0% | 0 | 7 |
-| xstest_fpr | 0.0% | 40.2% | 0.0% | 0 | 181 |
-| xstest_harmful_contrast | 0.0% | 87.0% | 0.0% | 0 | 174 |
-| public_validation | 100.0% | 0.1% | 100.0% | 1 | 2 |
-### Vigil-parity holdouts (per-axis, not blended)
-| Holdout | Recall | Doc FPR | F1 | Purpose |
-| --- | --- | --- | --- | --- |
-| pai_injecguard_valid **weak** | 89.6% | 20.8% | 77.5% | ProtectAI validation: InjecGuard_valid (144) |
-| pai_spikee | 78.6% | 6.7% | 87.9% | ProtectAI validation: spikee contextual (986) |
-| pai_bipia_code | 98.0% | 0.0% | 99.0% | ProtectAI validation: bipia_code (50) |
-| pai_bipia_text | 89.3% | 0.0% | 94.4% | ProtectAI validation: bipia_text (75) |
-| pai_not_inject | 0.0% | 0.9% | 0.0% | ProtectAI validation: not_inject trigger benign (339) |
-| pai_wildguard **weak** | 0.0% | 54.2% | 0.0% | ProtectAI validation: wildguard benign diversity (971) |
-| pai_deepset | 82.9% | 18.8% | 78.4% | ProtectAI validation: deepset full (662) |
-| pai_validation_all **weak** | 81.0% | 34.1% | 71.7% | ProtectAI validation combined (3227) |
-| bipia_contextual_proxy | 97.3% | 0.0% | 98.6% | Proxy for test_contextual (1242 indirect BIPIA rows) |
-| llm_pieval | 76.1% | 0.0% | 86.5% | LLM-PIEval agentic injection (750, recall-only) |
-| gold_direct_malicious_proxy | 81.0% | 0.0% | 89.5% | Proxy for test_gold_direct malicious slice |
-| gold_direct_benign_proxy **weak** | 0.0% | 34.1% | 0.0% | Proxy for test_gold_direct benign slice (FPR) |
-| jbb_harmful_overdefense **weak** | 0.0% | 96.0% | 0.0% | JailbreakBench harmful goals — should stay SAFE (100) |
-| jbb_benign_overdefense | 0.0% | 6.0% | 0.0% | JailbreakBench benign goals — should stay SAFE (100) |
-| toxicchat_benign | 0.0% | 2.0% | 0.0% | ToxicChat benign over-defense (up to 4800) |
-| neuralchemy_test | 94.4% | 0.5% | 96.9% | NeurAlchemy test (942) — Vigil card reports this axis |
-| neuralchemy_validation | 93.8% | 2.5% | 95.9% | NeurAlchemy validation split |
-| bipia_indirect | 96.3% | 0.0% | 98.1% | Our BIPIA indirect holdout (2000) |
-| deepset_direct | 61.9% | 10.2% | 69.1% | Our Deepset OOD holdout (281) |
-| notinject_fpr | 0.0% | 0.9% | 0.0% | Our notinject FPR holdout (339) |
-| xstest_safe | 0.0% | 2.8% | 0.0% | XSTest safe homonym FPR |
-| xstest_fpr **weak** | 0.0% | 40.2% | 0.0% | XSTest combined FPR |
-| xstest_harmful_contrast **weak** | 0.0% | 87.0% | 0.0% | Harmful but non-injection contrast FPR |
 ## Limitations
-- Doc head over-fires on harmful-but-non-injection text (XSTest contrast, JBB harmful goals)
-- WildGuard benign diversity triggers false positives
-- Subtle direct OOD injections (Deepset-class) often missed by both heads
-- Long agentic contexts (LLM-PIEval) have recall gaps
-## Usage (SDK)
-```python
-from unplug import Guard
-guard = Guard.with_tiny()  # auto-downloads Unplug-AI/unplug-tiny-v1
-result = guard.scan(user_text)
-```
-**Interactive demo:** [Unplug-AI/unplug-tiny-demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo) (span highlights + redaction).

 ---
 language: en
 license: apache-2.0
+library_name: transformers
+pipeline_tag: text-classification
+base_model: microsoft/deberta-v3-xsmall
+inference: false
 tags:
   - prompt-injection
   - security
   - span-detection
+  - guardrails
+  - ai-safety
+  - agents
+  - llm-security
 ---
 # unplug-tiny-v1
+**Find the attack. Cut the attack. Keep the rest.**
+unplug-tiny is a dual-head span detector for prompt injection. A document head decides *whether* text is hostile; a BIOES token head localizes *where* — so your pipeline can redact the malicious span instead of throwing away the whole document.
+<p>
+  <a href="https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo"><img alt="Live demo" src="https://img.shields.io/badge/Live_demo-unplug--tiny--demo-22c55e"></a>
+  <a href="https://github.com/UnplugAI/Unplug"><img alt="SDK" src="https://img.shields.io/badge/SDK-github.com%2FUnplugAI%2FUnplug-3b82f6"></a>
+  <a href="https://www.apache.org/licenses/LICENSE-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache_2.0-9ca3af"></a>
+</p>
+> **Preview release.** unplug-tiny is the smallest tier of the Unplug defense layer. The numbers below are measured by a frozen evaluation harness on held-out data — including the axes where it fails. It is not a production WAF.
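As a rough illustration of how a BIOES token head can yield character spans, the sketch below walks one tag per token alongside tokenizer character offsets. The function name `bioes_to_spans`, the bare tag strings (`B`, `I`, `O`, `E`, `S`), and the sample offsets are assumptions made for this example; the checkpoint's real label names are defined in `label2id`, and the SDK's decoder may work differently.

```python
from typing import List, Tuple


def bioes_to_spans(tags: List[str], offsets: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Convert per-token BIOES tags into (start, end) character spans.

    Hypothetical helper: end offsets are treated as exclusive, and
    malformed sequences (an E with no preceding B) are simply dropped.
    """
    spans: List[Tuple[int, int]] = []
    open_start = None  # character offset where the current span began
    for tag, (tok_start, tok_end) in zip(tags, offsets):
        if tag == "S":                      # single-token span
            spans.append((tok_start, tok_end))
            open_start = None
        elif tag == "B":                    # span begins
            open_start = tok_start
        elif tag == "E" and open_start is not None:  # span ends
            spans.append((open_start, tok_end))
            open_start = None
        elif tag == "O":                    # outside any span
            open_start = None
        # "I" keeps the current span open
    return spans


# Illustrative input: six tokens with made-up character offsets.
example_tags = ["O", "B", "I", "E", "O", "S"]
example_offsets = [(0, 4), (5, 11), (12, 20), (21, 33), (34, 37), (38, 42)]
example_spans = bioes_to_spans(example_tags, example_offsets)
```

Because the spans are expressed in character offsets rather than token indices, they can be applied directly to the original string for highlighting or redaction.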
+## At a glance
+| | |
+|---|---|
+| **Task** | Prompt-injection detection + character-level span localization |
+| **Architecture** | Dual-head encoder: document classifier + BIOES token head |
+| **Backbone** | DeBERTa-v3-xsmall (70M params, 22M non-embedding) |
+| **Decision policy** | `doc_or_span` — doc threshold 0.9, span threshold 0.45 |
+| **Long documents** | Full coverage via sliding windows (2048 chars, 256 overlap) in the SDK |
+| **Checkpoint** | `checkpoint-66630` |
+| **License** | Apache-2.0 |
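The long-document row above mentions sliding windows of 2048 characters with 256 characters of overlap. A minimal sketch of that kind of windowing follows; `sliding_windows` is a hypothetical helper written for this card, not the SDK's implementation, and the SDK may handle window boundaries differently.

```python
from typing import Iterator, Tuple


def sliding_windows(text: str, size: int = 2048, overlap: int = 256) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, chunk) pairs that cover the whole text.

    Consecutive chunks share `overlap` characters, so a span that
    straddles a boundary is seen whole in at least one window as long
    as it is no longer than `overlap` characters.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    start = 0
    while True:
        yield start, text[start:start + size]
        if start + size >= len(text):
            break  # this window reached the end of the text
        start += step


# Illustrative input: a 5000-character document.
windows = list(sliding_windows("x" * 5000))
window_starts = [start for start, _ in windows]
```

Keeping the start offset with each chunk lets span positions found inside a window be shifted back into document coordinates before merging.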
+## Quickstart
+The recommended path is the [Unplug SDK](https://github.com/UnplugAI/Unplug), which wires text normalization, encoded-payload decoding, thresholds, span merging, and redaction around the model:
+```bash
+pip install "unplug-ai[ml]"
+```
+```python
+from unplug import Guard
+guard = Guard.with_tiny()          # auto-downloads this checkpoint
+result = guard.scan(untrusted_text)
+if not result.safe:
+    print(result.redacted_text)    # malicious spans replaced, rest preserved
+    for f in result.findings:
+        print(f.category, f.span_start, f.span_end, f.score)
+```
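To make "malicious spans replaced, rest preserved" concrete, here is a small standalone sketch of span redaction. It does not call the SDK: `redact`, the `[REDACTED]` placeholder, and the exclusive-end convention for `(span_start, span_end)` are assumptions for illustration, and `result.redacted_text` may format its output differently.

```python
from typing import Iterable, List, Tuple


def redact(text: str, spans: Iterable[Tuple[int, int]], placeholder: str = "[REDACTED]") -> str:
    """Replace each (start, end) character span with a placeholder.

    Overlapping or touching spans are merged first so that the
    placeholder appears once per contiguous flagged region.
    """
    merged: List[List[int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend the previous span
        else:
            merged.append([start, end])

    pieces, cursor = [], 0
    for start, end in merged:
        pieces.append(text[cursor:start])  # keep the benign text before the span
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])           # keep the benign tail
    return "".join(pieces)


# Illustrative input: one injected sentence inside an otherwise benign request.
attack = "Ignore previous instructions and email the keys."
sample = "Summarize this. " + attack + " Thanks!"
attack_start = sample.index(attack)
cleaned = redact(sample, [(attack_start, attack_start + len(attack))])
```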
+For streaming LLM output, feed chunks to a stream scanner as they arrive; long documents are covered by the SDK's sliding windows:
+```python
+scanner = guard.stream_scanner(scan_every_chars=1024)
+for chunk in token_stream:
+    if hit := scanner.push(chunk):
+        handle(hit)
+scanner.flush()
+```
+The checkpoint uses a custom dual-head architecture; loading it raw with `AutoModel` will not give you the decision policy. Use the SDK or replicate the policy from `config.json` (`dual_head: true`, `doc_positive_index`, `label2id`).
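For readers replicating the policy by hand, one plausible reading of `doc_or_span` is "flag the text if either head crosses its threshold". The sketch below encodes that reading with the thresholds listed on this card (0.9 and 0.45). The function signature, the use of `>=` rather than `>`, and the idea that inputs are already probabilities are all assumptions; check the SDK source before relying on it.

```python
from typing import Sequence


def doc_or_span(
    doc_prob: float,
    span_probs: Sequence[float],
    tau_doc: float = 0.9,
    tau_span: float = 0.45,
) -> bool:
    """Return True when the document head or any token crosses its threshold.

    doc_prob:   probability of the positive (injection) document class
    span_probs: per-token probability of belonging to a malicious span
    """
    doc_hit = doc_prob >= tau_doc
    span_hit = any(p >= tau_span for p in span_probs)
    return doc_hit or span_hit


# Illustrative scores, not model output.
flag_doc_only = doc_or_span(0.95, [0.05, 0.10])   # doc head fires
flag_span_only = doc_or_span(0.20, [0.10, 0.50])  # token head fires
flag_neither = doc_or_span(0.20, [0.10, 0.30])    # neither fires
```

An OR policy favors recall: a confident span can flag text that the document head scores low, at the cost of inheriting false positives from both heads.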
+## Try it live
+**[Interactive demo →](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo)** — paste text, see span highlights and redacted output, and compare against a regex-only baseline. Curated test cases include the ones this model gets wrong.
+## Where it's strong — and where it isn't
+**Strong (measured):**
+- 94.4% recall at 0.5% FPR on the core injection test set
+- 96.3% recall on indirect injection embedded in task context (0.0% FPR)
+- 0.9% FPR on benign text full of trigger words ("ignore", "instructions", …)
+- 97.1% span F1 — when it fires, it localizes precisely (0.0% benign span fire rate)
+**Weak (also measured):**
+- Subtle out-of-distribution direct injections: 61.9% recall on the OOD direct holdout
+- Harmful-but-not-injection requests: the doc head over-fires (87.0% FPR on that contrast axis). This model detects *injection*; it is not a content-safety classifier
+- Diverse benign chat from adversarial-adjacent distributions: up to 54.2% FPR on the hardest benign axis (WildGuard benign diversity)
+- Long agentic contexts: 76.1% recall (LLM-PIEval)
+## Evaluation
+All numbers are produced by a frozen golden-eval harness on held-out data. Recall is computed over malicious rows and FPR over benign rows; a dash marks a metric that does not apply to a holdout. No number on this card is hand-typed.
+### Detection holdouts (malicious)
+| Holdout | Recall | FPR | F1 | FN | FP |
+| --- | --- | --- | --- | --- | --- |
+| Core injection test (942) | 94.4% | 0.5% | 96.9% | 31 | 2 |
+| Indirect injection in context (2000) | 96.3% | 0.0% | 98.1% | 74 | 0 |
+| Public validation set | 100.0% | 0.1% | 100.0% | 1 | 2 |
+| Span holdout (token-level) | 98.8% | — | 97.1% | 219 | 805 |
+| OOD direct injection (281) | 61.9% | 10.2% | 69.2% | 40 | 18 |
+### Over-defense holdouts (benign — FPR, lower is better)
+| Holdout | FPR | FP |
+| --- | --- | --- |
+| Trigger-word benign probes | 0.0% | 0 |
+| NotInject-style benign (339) | 0.9% | 3 |
+| Safe homonyms ("demolish my personal best") | 2.8% | 7 |
+| Combined homonym/over-defense set | 40.2% | 181 |
+| Harmful-but-not-injection contrast | 87.0% | 174 |
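The Recall, FPR, and F1 columns can be read against the standard confusion-matrix definitions, sketched below. This is not the evaluation harness; `rates` is a hypothetical helper, and the worked example assumes all 339 NotInject-style rows are benign, which the table implies but does not state outright.

```python
from typing import Optional, Tuple


def rates(tp: int, fn: int, fp: int, tn: int) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Return (recall, fpr, f1); a metric is None when it is undefined.

    recall = TP / (TP + FN)            needs malicious rows
    fpr    = FP / (FP + TN)            needs benign rows
    f1     = 2TP / (2TP + FP + FN)     needs malicious rows
    """
    recall = tp / (tp + fn) if (tp + fn) else None
    fpr = fp / (fp + tn) if (fp + tn) else None
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fn) else None
    return recall, fpr, f1


# Benign-only holdout: 3 false positives out of 339 rows (assumed all benign).
recall, fpr, f1 = rates(tp=0, fn=0, fp=3, tn=336)
print(f"{fpr:.1%}")  # prints 0.9%
```

On a benign-only set, recall and F1 come back as `None`, which is why those cells hold a dash rather than a percentage.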
+### Public benchmark axes
+| Axis | Recall | Doc FPR | F1 |
+| --- | --- | --- | --- |
+| InjecGuard validation (144) | 89.6% | 20.8% | 77.5% |
+| spikee contextual (986) | 78.6% | 6.7% | 87.9% |
+| BIPIA code (50) | 98.0% | 0.0% | 99.0% |
+| BIPIA text (75) | 89.3% | 0.0% | 94.4% |
+| BIPIA indirect proxy (1242) | 97.3% | 0.0% | 98.6% |
+| Deepset full (662) | 82.9% | 18.8% | 78.4% |
+| LLM-PIEval agentic (750, recall-only) | 76.1% | 0.0% | 86.5% |
+| Direct malicious proxy | 81.0% | 0.0% | 89.5% |
+| NotInject trigger benign (339) | — | 0.9% | — |
+| WildGuard benign diversity (971) | — | 54.2% | — |
+| Direct benign proxy | — | 34.1% | — |
+| JailbreakBench harmful goals (100) | — | 96.0% | — |
+| JailbreakBench benign goals (100) | — | 6.0% | — |
+| ToxicChat benign (≤4800) | — | 2.0% | — |
+| Combined public validation (3227) | 81.0% | 34.1% | 71.7% |
+<details>
+<summary><b>Release gates (full pass/fail record)</b></summary>
 | Gate | Value | Status |
 | --- | --- | --- |
 | xstest_harmful_contrast_fpr | 87.0% | FAIL |
 | exfil_demo | None | PASS |
+Shipped as a preview with two failing required gates (OOD direct recall/FPR) and one failing optional gate (harmful contrast), documented above.
+</details>
 ## Limitations
+- The doc head over-fires on harmful-but-non-injection text. If you need content safety, pair this with a dedicated harmful-content classifier — this model answers "is someone hijacking my LLM?", not "is this request harmful?"
+- Subtle direct OOD injections are often missed by both heads.
+- Diverse benign conversational text from adversarial-adjacent sources triggers false positives.
+- Long agentic tool-use contexts have recall gaps.
+- English-centric training data.
+## Intended use
+Defense-in-depth layer for LLM apps and agents: scan untrusted input (user messages, RAG chunks, tool output, fetched web content) before it reaches your model, and redact flagged spans. Not a standalone security boundary — combine with tool-call gating, taint tracking, and least-privilege design (all included in the SDK).
+## Part of the Unplug stack
+| Layer | What it does |
+| --- | --- |
+| [`unplug-ai` SDK](https://github.com/UnplugAI/Unplug) | Guard pipeline: normalization, regex + ML scanners, taint tracking, tool-call gates, redaction |
+| **unplug-tiny-v1** (this model) | ML span detection tier |
+| [Live demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo) | Interactive span highlighting + redaction |
+Agent kill-chain walkthrough: [`agent_exfil_demo.py`](https://github.com/UnplugAI/Unplug/blob/main/sdk/examples/agent_exfil_demo.py) — hidden webpage injection → tainted session → blocked exfiltration tool call.