Commit 19b7d67 by chiruu12 · verified · 1 Parent(s): 284cf75

plain ascii typography

Files changed (1)
  1. README.md +26 -26
README.md CHANGED
@@ -6,20 +6,20 @@ pipeline_tag: text-classification
  base_model: microsoft/deberta-v3-xsmall
  inference: false
  tags:
- - prompt-injection
- - security
- - span-detection
- - guardrails
- - ai-safety
- - agents
- - llm-security
  ---
 
  # unplug-tiny-v1
 
  **Find the attack. Cut the attack. Keep the rest.**
 
- unplug-tiny is a dual-head span detector for prompt injection. A document head decides *whether* text is hostile; a BIOES token head localizes *where* — so your pipeline can redact the malicious span instead of throwing away the whole document.
 
  <p>
  <a href="https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo"><img alt="Live demo" src="https://img.shields.io/badge/Live_demo-unplug--tiny--demo-22c55e"></a>
@@ -27,7 +27,7 @@ unplug-tiny is a dual-head span detector for prompt injection. A document head d
  <a href="https://www.apache.org/licenses/LICENSE-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache_2.0-9ca3af"></a>
  </p>
 
- > **Preview release.** unplug-tiny is the smallest tier of the Unplug defense layer. The numbers below are measured by a frozen evaluation harness on held-out data — including the axes where it fails. It is not a production WAF.
 
  ## At a glance
 
@@ -36,7 +36,7 @@ unplug-tiny is a dual-head span detector for prompt injection. A document head d
  | **Task** | Prompt-injection detection + character-level span localization |
  | **Architecture** | Dual-head encoder: document classifier + BIOES token head |
  | **Backbone** | DeBERTa-v3-xsmall (70M params, 22M non-embedding) |
- | **Decision policy** | `doc_or_span` — doc threshold 0.9, span threshold 0.45 |
  | **Long documents** | Full coverage via sliding windows (2048 chars, 256 overlap) in the SDK |
  | **Checkpoint** | `checkpoint-66630` |
  | **License** | Apache-2.0 |
@@ -75,19 +75,19 @@ The checkpoint uses a custom dual-head architecture; loading it raw with `AutoMo
 
  ## Try it live
 
- **[Interactive demo →](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo)** — paste text, see span highlights and redacted output, and compare against a regex-only baseline. Curated test cases include the ones this model gets wrong.
 
- ## Where it's strong — and where it isn't
 
  **Strong (measured):**
  - 94.4% recall at 0.5% FPR on the core injection test set
  - 96.3% recall on indirect injection embedded in task context (0.0% FPR)
- - 0.9% FPR on benign text full of trigger words ("ignore", "instructions", …)
- - 97.1% span F1 — when it fires, it localizes precisely (0.0% benign span fire rate)
 
  **Weak (also measured):**
  - Subtle out-of-distribution direct injections: 61.9% recall
- - Harmful-but-not-injection requests: the doc head over-fires (87.0% FPR on that contrast axis) — this model detects *injection*, it is not a content-safety classifier
  - Diverse benign chat from adversarial-adjacent distributions: up to 54.2% FPR on the hardest benign axis
  - Long agentic contexts: 76.1% recall
 
@@ -102,10 +102,10 @@ All numbers are produced by a frozen golden-eval harness on held-out data. Recal
  | Core injection test (942) | 94.4% | 0.5% | 96.9% | 31 | 2 |
  | Indirect injection in context (2000) | 96.3% | 0.0% | 98.1% | 74 | 0 |
  | Public validation set | 100.0% | 0.1% | 100.0% | 1 | 2 |
- | Span holdout (token-level) | 98.8% | — | 97.1% | 219 | 805 |
  | OOD direct injection (281) | 61.9% | 10.2% | 69.2% | 40 | 18 |
 
- ### Over-defense holdouts (benign — FPR, lower is better)
 
  | Holdout | FPR | FP |
  | --- | --- | --- |
@@ -127,12 +127,12 @@ All numbers are produced by a frozen golden-eval harness on held-out data. Recal
  | Deepset full (662) | 82.9% | 18.8% | 78.4% |
  | LLM-PIEval agentic (750, recall-only) | 76.1% | 0.0% | 86.5% |
  | Direct malicious proxy | 81.0% | 0.0% | 89.5% |
- | NotInject trigger benign (339) | — | 0.9% | — |
- | WildGuard benign diversity (971) | — | 54.2% | — |
- | Direct benign proxy | — | 34.1% | — |
- | JailbreakBench harmful goals (100) | — | 96.0% | — |
- | JailbreakBench benign goals (100) | — | 6.0% | — |
- | ToxicChat benign (≈4800) | — | 2.0% | — |
  | Combined public validation (3227) | 81.0% | 34.1% | 71.7% |
 
  <details>
@@ -162,7 +162,7 @@ Shipped as a preview with two failing required gates (OOD direct recall/FPR) and
 
  ## Limitations
 
- - The doc head over-fires on harmful-but-non-injection text. If you need content safety, pair this with a dedicated harmful-content classifier — this model answers "is someone hijacking my LLM?", not "is this request harmful?"
  - Subtle direct OOD injections are often missed by both heads.
  - Diverse benign conversational text from adversarial-adjacent sources triggers false positives.
  - Long agentic tool-use contexts have recall gaps.
@@ -170,7 +170,7 @@ Shipped as a preview with two failing required gates (OOD direct recall/FPR) and
 
  ## Intended use
 
- Defense-in-depth layer for LLM apps and agents: scan untrusted input (user messages, RAG chunks, tool output, fetched web content) before it reaches your model, and redact flagged spans. Not a standalone security boundary — combine with tool-call gating, taint tracking, and least-privilege design (all included in the SDK).
 
  ## Part of the Unplug stack
 
@@ -180,4 +180,4 @@ Defense-in-depth layer for LLM apps and agents: scan untrusted input (user messa
  | **unplug-tiny-v1** (this model) | ML span detection tier |
  | [Live demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo) | Interactive span highlighting + redaction |
 
- Agent kill-chain walkthrough: [`agent_exfil_demo.py`](https://github.com/UnplugAI/Unplug/blob/main/sdk/examples/agent_exfil_demo.py) — hidden webpage injection → tainted session → blocked exfiltration tool call.
 
  base_model: microsoft/deberta-v3-xsmall
  inference: false
  tags:
+ - prompt-injection
+ - security
+ - span-detection
+ - guardrails
+ - ai-safety
+ - agents
+ - llm-security
  ---
 
  # unplug-tiny-v1
 
  **Find the attack. Cut the attack. Keep the rest.**
 
+ unplug-tiny is a dual-head span detector for prompt injection. A document head decides *whether* text is hostile; a BIOES token head localizes *where* - so your pipeline can redact the malicious span instead of throwing away the whole document.
 
  <p>
  <a href="https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo"><img alt="Live demo" src="https://img.shields.io/badge/Live_demo-unplug--tiny--demo-22c55e"></a>
 
  <a href="https://www.apache.org/licenses/LICENSE-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache_2.0-9ca3af"></a>
  </p>
 
+ > **Preview release.** unplug-tiny is the smallest tier of the Unplug defense layer. The numbers below are measured by a frozen evaluation harness on held-out data - including the axes where it fails. It is not a production WAF.
 
  ## At a glance
 
 
  | **Task** | Prompt-injection detection + character-level span localization |
  | **Architecture** | Dual-head encoder: document classifier + BIOES token head |
  | **Backbone** | DeBERTa-v3-xsmall (70M params, 22M non-embedding) |
+ | **Decision policy** | `doc_or_span` - doc threshold 0.9, span threshold 0.45 |
  | **Long documents** | Full coverage via sliding windows (2048 chars, 256 overlap) in the SDK |
  | **Checkpoint** | `checkpoint-66630` |
  | **License** | Apache-2.0 |
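
The decision-policy and long-document rows in this hunk can be sketched in a few lines of Python. This is a minimal illustration, not the SDK's code: `iter_windows` and `doc_or_span` are hypothetical helper names, and reading `doc_or_span` as "flag when either head crosses its threshold" is an assumption drawn from the policy name. Only the numbers (0.9, 0.45, 2048, 256) come from the README.

```python
from typing import Iterator, Sequence, Tuple


def iter_windows(
    text: str, size: int = 2048, overlap: int = 256
) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, chunk) pairs that cover the whole text.

    Consecutive chunks share `overlap` characters so that a span cut by one
    window boundary is seen whole in the neighbouring window.
    """
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    start = 0
    while True:
        yield start, text[start:start + size]
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += step


def doc_or_span(
    doc_score: float,
    span_scores: Sequence[float],
    doc_threshold: float = 0.9,
    span_threshold: float = 0.45,
) -> bool:
    """Assumed semantics: flag if the doc head OR any span crosses its threshold."""
    return doc_score >= doc_threshold or any(
        score >= span_threshold for score in span_scores
    )
```

With these defaults each window advances by 1792 characters, so a 5000-character document is covered by three windows. How the real SDK merges per-window results is not described in this diff.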
 
 
  ## Try it live
 
+ **[Open the interactive demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo)** to paste text, see span highlights and redacted output, and compare against a regex-only baseline. Curated test cases include the ones this model gets wrong.
 
+ ## Where it's strong - and where it isn't
 
  **Strong (measured):**
  - 94.4% recall at 0.5% FPR on the core injection test set
  - 96.3% recall on indirect injection embedded in task context (0.0% FPR)
+ - 0.9% FPR on benign text full of trigger words ("ignore", "instructions", ...)
+ - 97.1% span F1 - when it fires, it localizes precisely (0.0% benign span fire rate)
 
  **Weak (also measured):**
  - Subtle out-of-distribution direct injections: 61.9% recall
+ - Harmful-but-not-injection requests: the doc head over-fires (87.0% FPR on that contrast axis) - this model detects *injection*, it is not a content-safety classifier
  - Diverse benign chat from adversarial-adjacent distributions: up to 54.2% FPR on the hardest benign axis
  - Long agentic contexts: 76.1% recall
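
For readers comparing the strong and weak axes above, recall, FPR, and F1 are the usual confusion-matrix ratios. The sketch below uses the textbook definitions. Whether the frozen harness computes them exactly this way is not stated in the README, the `Counts` type is a hypothetical helper, and the example counts in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts for a binary injection detector."""
    tp: int  # injections correctly flagged
    fp: int  # benign inputs wrongly flagged
    tn: int  # benign inputs correctly passed
    fn: int  # injections missed


def recall(c: Counts) -> float:
    """Share of real injections that were flagged."""
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0


def false_positive_rate(c: Counts) -> float:
    """Share of benign inputs that were flagged."""
    negatives = c.fp + c.tn
    return c.fp / negatives if negatives else 0.0


def f1(c: Counts) -> float:
    """Harmonic mean of precision and recall, written in count form."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0
```

For example, `Counts(tp=90, fp=5, tn=95, fn=10)` gives a recall of 0.90 and an FPR of 0.05. A benign-only holdout has no positives, so only the FPR is meaningful on it.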
 
 
  | Core injection test (942) | 94.4% | 0.5% | 96.9% | 31 | 2 |
  | Indirect injection in context (2000) | 96.3% | 0.0% | 98.1% | 74 | 0 |
  | Public validation set | 100.0% | 0.1% | 100.0% | 1 | 2 |
+ | Span holdout (token-level) | 98.8% | - | 97.1% | 219 | 805 |
  | OOD direct injection (281) | 61.9% | 10.2% | 69.2% | 40 | 18 |
 
+ ### Over-defense holdouts (benign - FPR, lower is better)
 
  | Holdout | FPR | FP |
  | --- | --- | --- |
 
  | Deepset full (662) | 82.9% | 18.8% | 78.4% |
  | LLM-PIEval agentic (750, recall-only) | 76.1% | 0.0% | 86.5% |
  | Direct malicious proxy | 81.0% | 0.0% | 89.5% |
+ | NotInject trigger benign (339) | - | 0.9% | - |
+ | WildGuard benign diversity (971) | - | 54.2% | - |
+ | Direct benign proxy | - | 34.1% | - |
+ | JailbreakBench harmful goals (100) | - | 96.0% | - |
+ | JailbreakBench benign goals (100) | - | 6.0% | - |
+ | ToxicChat benign (≈4800) | - | 2.0% | - |
  | Combined public validation (3227) | 81.0% | 34.1% | 71.7% |
 
  <details>
 
 
  ## Limitations
 
+ - The doc head over-fires on harmful-but-non-injection text. If you need content safety, pair this with a dedicated harmful-content classifier - this model answers "is someone hijacking my LLM?", not "is this request harmful?"
  - Subtle direct OOD injections are often missed by both heads.
  - Diverse benign conversational text from adversarial-adjacent sources triggers false positives.
  - Long agentic tool-use contexts have recall gaps.
 
 
  ## Intended use
 
+ Defense-in-depth layer for LLM apps and agents: scan untrusted input (user messages, RAG chunks, tool output, fetched web content) before it reaches your model, and redact flagged spans. Not a standalone security boundary - combine with tool-call gating, taint tracking, and least-privilege design (all included in the SDK).
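
The "redact flagged spans" step described above can be sketched as two small functions: decode BIOES token tags into character spans, then cut those spans out of the text. This is an illustrative sketch under stated assumptions, not the SDK's API. The bare tag names (`B`, `I`, `E`, `S`, `O`), the per-token character offsets, the `[REDACTED]` placeholder, and both function names are hypothetical; the model's real label set and output format are not shown in this diff.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets


def bioes_to_spans(tags: Sequence[str], offsets: Sequence[Span]) -> List[Span]:
    """Turn per-token BIOES tags plus token offsets into character spans."""
    spans: List[Span] = []
    open_start = None  # start offset of a span opened by "B", if any
    for tag, (tok_start, tok_end) in zip(tags, offsets):
        if tag == "S":  # single-token span
            spans.append((tok_start, tok_end))
            open_start = None
        elif tag == "B":  # begin a multi-token span
            open_start = tok_start
        elif tag == "E" and open_start is not None:  # close the open span
            spans.append((open_start, tok_end))
            open_start = None
        elif tag == "O":  # outside: drop any unfinished span
            open_start = None
        # "I" keeps the current span open, so nothing to do
    return spans


def redact(text: str, spans: Sequence[Span], placeholder: str = "[REDACTED]") -> str:
    """Replace each span with a placeholder and keep the remaining text."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:  # overlapping or touching spans
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    pieces: List[str] = []
    cursor = 0
    for start, end in merged:
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Merging overlapping spans before cutting matters when spans come from overlapping sliding windows, since the same attack text may be reported twice with slightly different boundaries.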
 
  ## Part of the Unplug stack
 
 
  | **unplug-tiny-v1** (this model) | ML span detection tier |
  | [Live demo](https://huggingface.co/spaces/Unplug-AI/unplug-tiny-demo) | Interactive span highlighting + redaction |
 
+ Agent kill-chain walkthrough: [`agent_exfil_demo.py`](https://github.com/UnplugAI/Unplug/blob/main/sdk/examples/agent_exfil_demo.py) - hidden webpage injection -> tainted session -> blocked exfiltration tool call.
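
The chain named in that last line (injection detected, session tainted, exfiltration call blocked) can be pictured with a toy gate. The sketch below does not reproduce `agent_exfil_demo.py` or any SDK interface. `Session`, `allow_tool_call`, and the tool names in `EGRESS_TOOLS` are all hypothetical, chosen only to show the idea of taint tracking combined with tool-call gating.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical names for tools that can send data out of the agent.
EGRESS_TOOLS: Set[str] = {"http_post", "send_email"}


@dataclass
class Session:
    """Tracks whether any untrusted input in this session was flagged."""
    tainted: bool = False
    taint_sources: List[str] = field(default_factory=list)

    def ingest(self, source: str, injection_detected: bool) -> None:
        """Record a scanned input; a detection taints the session for good."""
        if injection_detected:
            self.tainted = True
            self.taint_sources.append(source)


def allow_tool_call(session: Session, tool_name: str) -> bool:
    """Block egress-capable tools once the session is tainted."""
    return not (session.tainted and tool_name in EGRESS_TOOLS)
```

The point of the design is that a missed redaction is not fatal: once any input in the session has been flagged, outbound tools stay blocked even if later turns look clean.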