Unplug-AI
/

unplug-tiny-v1

+---
+language: en
+license: apache-2.0
+tags:
+  - prompt-injection
+  - security
+  - span-detection
+library_name: transformers
+pipeline_tag: text-classification
+model_name: Unplug-AI/unplug-tiny-v1
+---
+# unplug-tiny-v1
+**Preview OSS span detector** — not a production WAF / Vigil replacement.
+- Backbone: `microsoft/deberta-v3-xsmall` (~22M dual-head)
+- Policy: `doc_or_span` @ τ_doc=0.9, τ_span=0.45
+- Checkpoint: `checkpoint-66630`
+- Generated: 2026-06-09
+## Scope
+Strong on Neuralchemy, BIPIA indirect, notinject, public validation. Known weaknesses: WildGuard benign FPR, harmful-non-injection contrast, Deepset OOD recall, agentic (LLM-PIEval).
+## Required ship gates
+| Gate | Value | Status |
+| --- | --- | --- |
+| fp_probes | True | PASS |
+| neuralchemy_test_doc_fpr | 0.5% | PASS |
+| neuralchemy_test_doc_recall | 94.4% | PASS |
+| bipia_recall | 96.3% | PASS |
+| deepset_direct_recall | 61.9% | FAIL |
+| deepset_direct_fpr | 10.2% | FAIL |
+| notinject_fpr | 0.9% | PASS |
+| xstest_safe_fpr | 2.8% | PASS |
+| public_validation_recall | 100.0% | PASS |
+| public_validation_fpr | 0.1% | PASS |
+| span_holdout_f1 | 97.1% | PASS |
+| malicious_span_char_recall | 97.4% | PASS |
+| benign_span_fire_rate | 0.0% | PASS |
+| xstest_harmful_contrast_fpr | 87.0% | FAIL |
+| exfil_demo | None | PASS |
+**Required gate failures:** deepset_direct_recall, deepset_direct_fpr
+### Ship-gate holdouts (checkpoint-66630)
+| Holdout | Recall | FPR | F1 | FN | FP |
+| --- | --- | --- | --- | --- | --- |
+| fp_probes | None | None | None | 0 | 0 |
+| neuralchemy_test | 94.4% | 0.5% | 96.9% | 31 | 2 |
+| train_span_holdout | 98.8% | None | 97.1% | 219 | 805 |
+| bipia_indirect | 96.3% | 0.0% | 98.1% | 74 | 0 |
+| deepset_direct | 61.9% | 10.2% | 69.2% | 40 | 18 |
+| notinject_fpr | 0.0% | 0.9% | 0.0% | 0 | 3 |
+| xstest_safe | 0.0% | 2.8% | 0.0% | 0 | 7 |
+| xstest_fpr | 0.0% | 40.2% | 0.0% | 0 | 181 |
+| xstest_harmful_contrast | 0.0% | 87.0% | 0.0% | 0 | 174 |
+| public_validation | 100.0% | 0.1% | 100.0% | 1 | 2 |
+### Vigil-parity holdouts (per-axis, not blended)
+| Holdout | Recall | Doc FPR | F1 | Purpose |
+| --- | --- | --- | --- | --- |
+| pai_injecguard_valid **weak** | 89.6% | 20.8% | 77.5% | ProtectAI validation: InjecGuard_valid (144) |
+| pai_spikee | 78.6% | 6.7% | 87.9% | ProtectAI validation: spikee contextual (986) |
+| pai_bipia_code | 98.0% | 0.0% | 99.0% | ProtectAI validation: bipia_code (50) |
+| pai_bipia_text | 89.3% | 0.0% | 94.4% | ProtectAI validation: bipia_text (75) |
+| pai_not_inject | 0.0% | 0.9% | 0.0% | ProtectAI validation: not_inject trigger benign (339) |
+| pai_wildguard **weak** | 0.0% | 54.2% | 0.0% | ProtectAI validation: wildguard benign diversity (971) |
+| pai_deepset | 82.9% | 18.8% | 78.4% | ProtectAI validation: deepset full (662) |
+| pai_validation_all **weak** | 81.0% | 34.1% | 71.7% | ProtectAI validation combined (3227) |
+| bipia_contextual_proxy | 97.3% | 0.0% | 98.6% | Proxy for test_contextual (1242 indirect BIPIA rows) |
+| llm_pieval | 76.1% | 0.0% | 86.5% | LLM-PIEval agentic injection (750, recall-only) |
+| gold_direct_malicious_proxy | 81.0% | 0.0% | 89.5% | Proxy for test_gold_direct malicious slice |
+| gold_direct_benign_proxy **weak** | 0.0% | 34.1% | 0.0% | Proxy for test_gold_direct benign slice (FPR) |
+| jbb_harmful_overdefense **weak** | 0.0% | 96.0% | 0.0% | JailbreakBench harmful goals — should stay SAFE (100) |
+| jbb_benign_overdefense | 0.0% | 6.0% | 0.0% | JailbreakBench benign goals — should stay SAFE (100) |
+| toxicchat_benign | 0.0% | 2.0% | 0.0% | ToxicChat benign over-defense (up to 4800) |
+| neuralchemy_test | 94.4% | 0.5% | 96.9% | NeurAlchemy test (942) — Vigil card reports this axis |
+| neuralchemy_validation | 93.8% | 2.5% | 95.9% | NeurAlchemy validation split |
+| bipia_indirect | 96.3% | 0.0% | 98.1% | Our BIPIA indirect holdout (2000) |
+| deepset_direct | 61.9% | 10.2% | 69.1% | Our Deepset OOD holdout (281) |
+| notinject_fpr | 0.0% | 0.9% | 0.0% | Our notinject FPR holdout (339) |
+| xstest_safe | 0.0% | 2.8% | 0.0% | XSTest safe homonym FPR |
+| xstest_fpr **weak** | 0.0% | 40.2% | 0.0% | XSTest combined FPR |
+| xstest_harmful_contrast **weak** | 0.0% | 87.0% | 0.0% | Harmful but non-injection contrast FPR |
+## Limitations
+- Doc head over-fires on harmful-but-non-injection text (XSTest contrast, JBB harmful goals)
+- WildGuard benign diversity triggers false positives
+- Subtle direct OOD injections (Deepset-class) often missed by both heads
+- Long agentic contexts (LLM-PIEval) have recall gaps
+## Usage (SDK)
+```python
+from unplug import Guard
+guard = Guard(mode="local")  # loads Unplug-AI/unplug-tiny-v1
+result = guard.scan(user_text)
+```