chiruu12 committed on
Commit 284cf75 · verified · 1 Parent(s): d16819f

product-grade model card

Files changed (1)
  1. README.md +141 -65
README.md CHANGED
@@ -1,29 +1,142 @@
  ---
  language: en
  license: apache-2.0
  tags:
  - prompt-injection
  - security
  - span-detection
- library_name: transformers
- pipeline_tag: text-classification
- model_name: Unplug-AI/unplug-tiny-v1
  ---

  # unplug-tiny-v1

- **Preview OSS span detector**: not a production WAF / Vigil replacement.

- - Backbone: `microsoft/deberta-v3-xsmall` (~22M dual-head)
- - Policy: `doc_or_span` @ τ_doc=0.9, τ_span=0.45
- - Checkpoint: `checkpoint-66630`
- - Generated: 2026-06-09

- ## Scope

- Strong on Neuralchemy, BIPIA indirect, notinject, public validation. Known weaknesses: WildGuard benign FPR, harmful-non-injection contrast, Deepset OOD recall, agentic (LLM-PIEval).

- ## Required ship gates

  | Gate | Value | Status |
  | --- | --- | --- |
@@ -43,65 +156,28 @@ Strong on Neuralchemy, BIPIA indirect, notinject, public validation. Known weakn
  | xstest_harmful_contrast_fpr | 87.0% | FAIL |
  | exfil_demo | None | PASS |

- **Required gate failures:** deepset_direct_recall, deepset_direct_fpr

- ### Ship-gate holdouts (checkpoint-66630)
-
- | Holdout | Recall | FPR | F1 | FN | FP |
- | --- | --- | --- | --- | --- | --- |
- | fp_probes | None | None | None | 0 | 0 |
- | neuralchemy_test | 94.4% | 0.5% | 96.9% | 31 | 2 |
- | train_span_holdout | 98.8% | None | 97.1% | 219 | 805 |
- | bipia_indirect | 96.3% | 0.0% | 98.1% | 74 | 0 |
- | deepset_direct | 61.9% | 10.2% | 69.2% | 40 | 18 |
- | notinject_fpr | 0.0% | 0.9% | 0.0% | 0 | 3 |
- | xstest_safe | 0.0% | 2.8% | 0.0% | 0 | 7 |
- | xstest_fpr | 0.0% | 40.2% | 0.0% | 0 | 181 |
- | xstest_harmful_contrast | 0.0% | 87.0% | 0.0% | 0 | 174 |
- | public_validation | 100.0% | 0.1% | 100.0% | 1 | 2 |
-
- ### Vigil-parity holdouts (per-axis, not blended)
-
- | Holdout | Recall | Doc FPR | F1 | Purpose |
- | --- | --- | --- | --- | --- |
- | pai_injecguard_valid **weak** | 89.6% | 20.8% | 77.5% | ProtectAI validation: InjecGuard_valid (144) |
- | pai_spikee | 78.6% | 6.7% | 87.9% | ProtectAI validation: spikee contextual (986) |
- | pai_bipia_code | 98.0% | 0.0% | 99.0% | ProtectAI validation: bipia_code (50) |
- | pai_bipia_text | 89.3% | 0.0% | 94.4% | ProtectAI validation: bipia_text (75) |
- | pai_not_inject | 0.0% | 0.9% | 0.0% | ProtectAI validation: not_inject trigger benign (339) |
- | pai_wildguard **weak** | 0.0% | 54.2% | 0.0% | ProtectAI validation: wildguard benign diversity (971) |
- | pai_deepset | 82.9% | 18.8% | 78.4% | ProtectAI validation: deepset full (662) |
- | pai_validation_all **weak** | 81.0% | 34.1% | 71.7% | ProtectAI validation combined (3227) |
- | bipia_contextual_proxy | 97.3% | 0.0% | 98.6% | Proxy for test_contextual (1242 indirect BIPIA rows) |
- | llm_pieval | 76.1% | 0.0% | 86.5% | LLM-PIEval agentic injection (750, recall-only) |
- | gold_direct_malicious_proxy | 81.0% | 0.0% | 89.5% | Proxy for test_gold_direct malicious slice |
- | gold_direct_benign_proxy **weak** | 0.0% | 34.1% | 0.0% | Proxy for test_gold_direct benign slice (FPR) |
- | jbb_harmful_overdefense **weak** | 0.0% | 96.0% | 0.0% | JailbreakBench harmful goals, should stay SAFE (100) |
- | jbb_benign_overdefense | 0.0% | 6.0% | 0.0% | JailbreakBench benign goals, should stay SAFE (100) |
- | toxicchat_benign | 0.0% | 2.0% | 0.0% | ToxicChat benign over-defense (up to 4800) |
- | neuralchemy_test | 94.4% | 0.5% | 96.9% | NeurAlchemy test (942); Vigil card reports this axis |
- | neuralchemy_validation | 93.8% | 2.5% | 95.9% | NeurAlchemy validation split |
- | bipia_indirect | 96.3% | 0.0% | 98.1% | Our BIPIA indirect holdout (2000) |
- | deepset_direct | 61.9% | 10.2% | 69.1% | Our Deepset OOD holdout (281) |
- | notinject_fpr | 0.0% | 0.9% | 0.0% | Our notinject FPR holdout (339) |
- | xstest_safe | 0.0% | 2.8% | 0.0% | XSTest safe homonym FPR |
- | xstest_fpr **weak** | 0.0% | 40.2% | 0.0% | XSTest combined FPR |
- | xstest_harmful_contrast **weak** | 0.0% | 87.0% | 0.0% | Harmful but non-injection contrast FPR |

  ## Limitations

- - Doc head over-fires on harmful-but-non-injection text (XSTest contrast, JBB harmful goals)
- - WildGuard benign diversity triggers false positives
- - Subtle direct OOD injections (Deepset-class) often missed by both heads
- - Long agentic contexts (LLM-PIEval) have recall gaps

- ## Usage (SDK)

- ```python
- from unplug import Guard

- guard = Guard.with_tiny()  # auto-downloads Unplug-AI/unplug-tiny-v1
- result = guard.scan(user_text)
- ```

- **Interactive demo:** [Unplug-AI/unplug-tiny-demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo) (span highlights + redaction).
 
  ---
  language: en
  license: apache-2.0
+ library_name: transformers
+ pipeline_tag: text-classification
+ base_model: microsoft/deberta-v3-xsmall
+ inference: false
  tags:
  - prompt-injection
  - security
  - span-detection
+ - guardrails
+ - ai-safety
+ - agents
+ - llm-security
  ---

  # unplug-tiny-v1

+ **Find the attack. Cut the attack. Keep the rest.**
+
+ unplug-tiny is a dual-head span detector for prompt injection. A document head decides *whether* text is hostile; a BIOES token head localizes *where*, so your pipeline can redact the malicious span instead of throwing away the whole document.
+
+ <p>
+ <a href="https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo"><img alt="Live demo" src="https://img.shields.io/badge/Live_demo-unplug--tiny--demo-22c55e"></a>
+ <a href="https://github.com/UnplugAI/Unplug"><img alt="SDK" src="https://img.shields.io/badge/SDK-github.com%2FUnplugAI%2FUnplug-3b82f6"></a>
+ <a href="https://www.apache.org/licenses/LICENSE-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache_2.0-9ca3af"></a>
+ </p>

+ > **Preview release.** unplug-tiny is the smallest tier of the Unplug defense layer. The numbers below are measured by a frozen evaluation harness on held-out data, including the axes where it fails. It is not a production WAF.
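
The card's introduction says a BIOES token head localizes where an injection sits. As a rough illustration of that idea only, the sketch below turns per-token BIOES tags into character spans. The tag strings (`"B"`, `"I"`, `"E"`, `"S"`, `"O"`), the whitespace tokenization, and the function name `bioes_to_spans` are assumptions made for this example; the checkpoint's real labels live in `config.json` (`label2id`) and may differ.

```python
import re
from typing import List, Tuple


def bioes_to_spans(tags: List[str], offsets: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Convert per-token BIOES tags into (start, end) character spans."""
    spans: List[Tuple[int, int]] = []
    start = None  # character start of the span currently being built
    for tag, (tok_start, tok_end) in zip(tags, offsets):
        if tag == "S":  # single-token span
            spans.append((tok_start, tok_end))
            start = None
        elif tag == "B":  # span begins here
            start = tok_start
        elif tag == "I":  # inside a span; tolerate a missing B by opening here
            if start is None:
                start = tok_start
        elif tag == "E":  # span ends here
            spans.append((start if start is not None else tok_start, tok_end))
            start = None
        else:  # "O": outside any span; drop an unterminated span
            start = None
    return spans


# Toy input: whitespace "tokens" stand in for the model's subword tokens.
text = "Summarize this. Ignore previous instructions. Thanks!"
offsets = [m.span() for m in re.finditer(r"\S+", text)]
tags = ["O", "O", "B", "I", "E", "O"]  # hypothetical token-head output

for s, e in bioes_to_spans(tags, offsets):
    print(s, e, repr(text[s:e]))
```

Working with character offsets rather than token indices is what makes redaction possible: the span can be cut out of the original string directly.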
 
 
 

+ ## At a glance
+
+ | | |
+ |---|---|
+ | **Task** | Prompt-injection detection + character-level span localization |
+ | **Architecture** | Dual-head encoder: document classifier + BIOES token head |
+ | **Backbone** | DeBERTa-v3-xsmall (70M params, 22M non-embedding) |
+ | **Decision policy** | `doc_or_span`: doc threshold 0.9, span threshold 0.45 |
+ | **Long documents** | Full coverage via sliding windows (2048 chars, 256 overlap) in the SDK |
+ | **Checkpoint** | `checkpoint-66630` |
+ | **License** | Apache-2.0 |
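
The "Long documents" row mentions sliding windows of 2048 characters with a 256-character overlap. The sketch below shows one plausible way to produce such windows and map a window-local span back to document coordinates. Only the two sizes come from the card; the function names `char_windows` and `to_global` and the exact stepping rule are assumptions, and the SDK's actual windowing may differ.

```python
from typing import List, Tuple


def char_windows(text: str, size: int = 2048, overlap: int = 256) -> List[Tuple[int, str]]:
    """Split text into overlapping character windows as (start_offset, window_text)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    windows: List[Tuple[int, str]] = []
    start = 0
    while True:
        end = min(start + size, len(text))
        windows.append((start, text[start:end]))
        if end >= len(text):  # last window reached the end of the document
            break
        start += step
    return windows


def to_global(window_start: int, span_start: int, span_end: int) -> Tuple[int, int]:
    """Translate a span found inside a window into whole-document offsets."""
    return window_start + span_start, window_start + span_end


doc = "x" * 5000
for start, chunk in char_windows(doc):
    print(start, len(chunk))
```

The overlap means an injection that straddles a window boundary still appears whole in at least one window, provided it is shorter than the overlap.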
+
+ ## Quickstart
+
+ The recommended path is the [Unplug SDK](https://github.com/UnplugAI/Unplug), which wires text normalization, encoded-payload decoding, thresholds, span merging, and redaction around the model:
+
+ ```bash
+ pip install "unplug-ai[ml]"
+ ```
+
+ ```python
+ from unplug import Guard
+
+ guard = Guard.with_tiny()  # auto-downloads this checkpoint
+ result = guard.scan(untrusted_text)
+
+ if not result.safe:
+     print(result.redacted_text)  # malicious spans replaced, rest preserved
+     for f in result.findings:
+         print(f.category, f.span_start, f.span_end, f.score)
+ ```
+
+ Streaming LLM output and full long-document coverage:
+
+ ```python
+ scanner = guard.stream_scanner(scan_every_chars=1024)
+ for chunk in token_stream:
+     if hit := scanner.push(chunk):
+         handle(hit)
+ scanner.flush()
+ ```

+ The checkpoint uses a custom dual-head architecture; loading it raw with `AutoModel` will not give you the decision policy. Use the SDK or replicate the policy from `config.json` (`dual_head: true`, `doc_positive_index`, `label2id`).
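
For readers who want to replicate the policy by hand, here is a minimal sketch of what a `doc_or_span` rule could look like with the card's thresholds (0.9 for the document head, 0.45 for spans). The OR semantics are inferred from the policy name, and the `>=` comparisons, the `SpanCandidate` type, and the function signature are assumptions; check the SDK source before relying on any of them.

```python
from dataclasses import dataclass
from typing import List, Tuple

DOC_THRESHOLD = 0.9    # from the model card
SPAN_THRESHOLD = 0.45  # from the model card


@dataclass
class SpanCandidate:
    """Hypothetical container for one span proposed by the token head."""
    start: int
    end: int
    score: float


def doc_or_span(doc_score: float, spans: List[SpanCandidate]) -> Tuple[bool, List[SpanCandidate]]:
    """Flag the text if the doc head OR any span clears its threshold."""
    kept = [s for s in spans if s.score >= SPAN_THRESHOLD]
    flagged = doc_score >= DOC_THRESHOLD or bool(kept)
    return flagged, kept


# A confident span flags the text even when the document head is unsure.
print(doc_or_span(0.20, [SpanCandidate(16, 45, 0.81)]))
# A confident document head flags the text even with no localized span.
print(doc_or_span(0.95, []))
```

Under this reading, a document-level hit without any kept span leaves nothing to redact, so a caller would have to decide separately whether to block the whole text.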

+ ## Try it live
+
+ **[Interactive demo →](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo)**: paste text, see span highlights and redacted output, and compare against a regex-only baseline. Curated test cases include the ones this model gets wrong.
+
+ ## Where it's strong, and where it isn't
+
+ **Strong (measured):**
+ - 94.4% recall at 0.5% FPR on the core injection test set
+ - 96.3% recall on indirect injection embedded in task context (0.0% FPR)
+ - 0.9% FPR on benign text full of trigger words ("ignore", "instructions", …)
+ - 97.1% span F1: when it fires, it localizes precisely (0.0% benign span fire rate)
+
+ **Weak (also measured):**
+ - Subtle out-of-distribution direct injections: 61.9% recall
+ - Harmful-but-not-injection requests: the doc head over-fires (87.0% FPR on that contrast axis). This model detects *injection*; it is not a content-safety classifier
+ - Diverse benign chat from adversarial-adjacent distributions: up to 54.2% FPR on the hardest benign axis
+ - Long agentic contexts: 76.1% recall
+
+ ## Evaluation
+
+ All numbers are produced by a frozen golden-eval harness on held-out data. Recall is reported on malicious sets, FPR on benign sets. No number on this card is hand-typed.
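
To make the reporting convention concrete, the sketch below computes recall, FPR, and F1 from confusion counts using the standard definitions. It is not the card's evaluation harness. The first call uses made-up counts. The second uses the NotInject figures from the card (339 rows, 3 false positives) and assumes all 339 rows are benign, which shows why recall is undefined on a benign-only set.

```python
from typing import Dict, Optional


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, Optional[float]]:
    """Standard binary detection metrics; None where the denominator is zero."""
    positives = tp + fn  # malicious rows
    negatives = fp + tn  # benign rows
    recall = tp / positives if positives else None
    fpr = fp / negatives if negatives else None
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else None
    return {"recall": recall, "fpr": fpr, "f1": f1}


# Hypothetical mixed set: 100 malicious rows, 100 benign rows.
mixed = detection_metrics(tp=90, fp=5, fn=10, tn=95)
print(mixed)

# Benign-only set: no malicious rows, so recall has no meaning here.
benign_only = detection_metrics(tp=0, fp=3, fn=0, tn=336)
print(benign_only)
```

On the benign-only call, F1 comes out as 0.0 simply because there are no true positives to count, not because the model performed badly.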
+
+ ### Detection holdouts (malicious)
+
+ | Holdout | Recall | FPR | F1 | FN | FP |
+ | --- | --- | --- | --- | --- | --- |
+ | Core injection test (942) | 94.4% | 0.5% | 96.9% | 31 | 2 |
+ | Indirect injection in context (2000) | 96.3% | 0.0% | 98.1% | 74 | 0 |
+ | Public validation set | 100.0% | 0.1% | 100.0% | 1 | 2 |
+ | Span holdout (token-level) | 98.8% | n/a | 97.1% | 219 | 805 |
+ | OOD direct injection (281) | 61.9% | 10.2% | 69.2% | 40 | 18 |
+
+ ### Over-defense holdouts (benign; FPR, lower is better)
+
+ | Holdout | FPR | FP |
+ | --- | --- | --- |
+ | Trigger-word benign probes | 0.0% | 0 |
+ | NotInject-style benign (339) | 0.9% | 3 |
+ | Safe homonyms ("demolish my personal best") | 2.8% | 7 |
+ | Combined homonym/over-defense set | 40.2% | 181 |
+ | Harmful-but-not-injection contrast | 87.0% | 174 |
+
+ ### Public benchmark axes
+
+ | Axis | Recall | Doc FPR | F1 |
+ | --- | --- | --- | --- |
+ | InjecGuard validation (144) | 89.6% | 20.8% | 77.5% |
+ | spikee contextual (986) | 78.6% | 6.7% | 87.9% |
+ | BIPIA code (50) | 98.0% | 0.0% | 99.0% |
+ | BIPIA text (75) | 89.3% | 0.0% | 94.4% |
+ | BIPIA indirect proxy (1242) | 97.3% | 0.0% | 98.6% |
+ | Deepset full (662) | 82.9% | 18.8% | 78.4% |
+ | LLM-PIEval agentic (750, recall-only) | 76.1% | 0.0% | 86.5% |
+ | Direct malicious proxy | 81.0% | 0.0% | 89.5% |
+ | NotInject trigger benign (339) | n/a | 0.9% | n/a |
+ | WildGuard benign diversity (971) | n/a | 54.2% | n/a |
+ | Direct benign proxy | n/a | 34.1% | n/a |
+ | JailbreakBench harmful goals (100) | n/a | 96.0% | n/a |
+ | JailbreakBench benign goals (100) | n/a | 6.0% | n/a |
+ | ToxicChat benign (≤4800) | n/a | 2.0% | n/a |
+ | Combined public validation (3227) | 81.0% | 34.1% | 71.7% |
+
+ <details>
+ <summary><b>Release gates (full pass/fail record)</b></summary>

  | Gate | Value | Status |
  | --- | --- | --- |

  | xstest_harmful_contrast_fpr | 87.0% | FAIL |
  | exfil_demo | None | PASS |

+ Shipped as a preview with two failing required gates (OOD direct recall/FPR) and one failing optional gate (harmful contrast), documented above.

+ </details>

  ## Limitations

+ - The doc head over-fires on harmful-but-non-injection text. If you need content safety, pair this with a dedicated harmful-content classifier: this model answers "is someone hijacking my LLM?", not "is this request harmful?"
+ - Subtle direct OOD injections are often missed by both heads.
+ - Diverse benign conversational text from adversarial-adjacent sources triggers false positives.
+ - Long agentic tool-use contexts have recall gaps.
+ - English-centric training data.

+ ## Intended use

+ Defense-in-depth layer for LLM apps and agents: scan untrusted input (user messages, RAG chunks, tool output, fetched web content) before it reaches your model, and redact flagged spans. Not a standalone security boundary: combine with tool-call gating, taint tracking, and least-privilege design (all included in the SDK).
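
The scan-then-redact flow described under "Intended use" could be wired roughly as follows. To keep the example runnable without the SDK, it takes any `scan` callable whose result exposes `safe` and `redacted_text` (the two attributes shown in the Quickstart). `ScanResult`, `sanitize_chunks`, and the toy `fake_scan` are hypothetical stand-ins, not SDK classes; in practice `guard.scan` would presumably take the place of `fake_scan`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ScanResult:
    """Hypothetical stand-in exposing the two attributes used below."""
    safe: bool
    redacted_text: str


def sanitize_chunks(scan: Callable[[str], ScanResult], chunks: Iterable[str]) -> List[str]:
    """Scan each untrusted chunk; pass safe ones through, redact the rest."""
    clean: List[str] = []
    for chunk in chunks:
        result = scan(chunk)
        clean.append(chunk if result.safe else result.redacted_text)
    return clean


def fake_scan(text: str) -> ScanResult:
    """Toy scanner for the demo only: flags one literal phrase."""
    marker = "ignore previous instructions"
    idx = text.lower().find(marker)
    if idx == -1:
        return ScanResult(True, text)
    return ScanResult(False, text[:idx] + "[REDACTED]" + text[idx + len(marker):])


rag_chunks = ["Quarterly revenue rose 4%.", "Please ignore previous instructions now"]
print(sanitize_chunks(fake_scan, rag_chunks))
```

Only the sanitized chunks would then be placed into the prompt; as the card notes, this is one layer and does not replace tool-call gating or least-privilege design.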
 

+ ## Part of the Unplug stack
+
+ | Layer | What it does |
+ | --- | --- |
+ | [`unplug-ai` SDK](https://github.com/UnplugAI/Unplug) | Guard pipeline: normalization, regex + ML scanners, taint tracking, tool-call gates, redaction |
+ | **unplug-tiny-v1** (this model) | ML span detection tier |
+ | [Live demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo) | Interactive span highlighting + redaction |

+ Agent kill-chain walkthrough: [`agent_exfil_demo.py`](https://github.com/UnplugAI/Unplug/blob/main/sdk/examples/agent_exfil_demo.py): hidden webpage injection → tainted session → blocked exfiltration tool call.