chiruu12 commited on
Commit
daca7c1
·
verified ·
1 Parent(s): eb7e1fc

Publish unplug-tiny-v1 checkpoint-66630

Browse files

DeBERTa-v3-xsmall dual-head span injection model. Preview OSS — not a WAF replacement.

Files changed (1) hide show
  1. README.md +104 -0
README.md ADDED
@@ -0,0 +1,104 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language: en
3
+ license: apache-2.0
4
+ tags:
5
+ - prompt-injection
6
+ - security
7
+ - span-detection
8
+ library_name: transformers
9
+ pipeline_tag: text-classification
10
+ model_name: Unplug-AI/unplug-tiny-v1
11
+ ---
12
+
13
+ # unplug-tiny-v1
14
+
15
+ **Preview OSS span detector** — not a production WAF / Vigil replacement.
16
+
17
+ - Backbone: `microsoft/deberta-v3-xsmall` (~22M dual-head)
18
+ - Policy: `doc_or_span` @ τ_doc=0.9, τ_span=0.45
19
+ - Checkpoint: `checkpoint-66630`
20
+ - Generated: 2026-06-09
21
+
22
+ ## Scope
23
+
24
+ Strong on Neuralchemy, BIPIA indirect, notinject, public validation. Known weaknesses: WildGuard benign FPR, harmful-non-injection contrast, Deepset OOD recall, agentic (LLM-PIEval).
25
+
26
+ ## Required ship gates
27
+
28
+ | Gate | Value | Status |
29
+ | --- | --- | --- |
30
+ | fp_probes | True | PASS |
31
+ | neuralchemy_test_doc_fpr | 0.5% | PASS |
32
+ | neuralchemy_test_doc_recall | 94.4% | PASS |
33
+ | bipia_recall | 96.3% | PASS |
34
+ | deepset_direct_recall | 61.9% | FAIL |
35
+ | deepset_direct_fpr | 10.2% | FAIL |
36
+ | notinject_fpr | 0.9% | PASS |
37
+ | xstest_safe_fpr | 2.8% | PASS |
38
+ | public_validation_recall | 100.0% | PASS |
39
+ | public_validation_fpr | 0.1% | PASS |
40
+ | span_holdout_f1 | 97.1% | PASS |
41
+ | malicious_span_char_recall | 97.4% | PASS |
42
+ | benign_span_fire_rate | 0.0% | PASS |
43
+ | xstest_harmful_contrast_fpr | 87.0% | FAIL |
44
+ | exfil_demo | None | PASS |
45
+
46
+ **Required gate failures:** deepset_direct_recall, deepset_direct_fpr
47
+
48
+ ### Ship-gate holdouts (checkpoint-66630)
49
+
50
+ | Holdout | Recall | FPR | F1 | FN | FP |
51
+ | --- | --- | --- | --- | --- | --- |
52
+ | fp_probes | None | None | None | 0 | 0 |
53
+ | neuralchemy_test | 94.4% | 0.5% | 96.9% | 31 | 2 |
54
+ | train_span_holdout | 98.8% | None | 97.1% | 219 | 805 |
55
+ | bipia_indirect | 96.3% | 0.0% | 98.1% | 74 | 0 |
56
+ | deepset_direct | 61.9% | 10.2% | 69.2% | 40 | 18 |
57
+ | notinject_fpr | 0.0% | 0.9% | 0.0% | 0 | 3 |
58
+ | xstest_safe | 0.0% | 2.8% | 0.0% | 0 | 7 |
59
+ | xstest_fpr | 0.0% | 40.2% | 0.0% | 0 | 181 |
60
+ | xstest_harmful_contrast | 0.0% | 87.0% | 0.0% | 0 | 174 |
61
+ | public_validation | 100.0% | 0.1% | 100.0% | 1 | 2 |
62
+
63
+ ### Vigil-parity holdouts (per-axis, not blended)
64
+
65
+ | Holdout | Recall | Doc FPR | F1 | Purpose |
66
+ | --- | --- | --- | --- | --- |
67
+ | pai_injecguard_valid **weak** | 89.6% | 20.8% | 77.5% | ProtectAI validation: InjecGuard_valid (144) |
68
+ | pai_spikee | 78.6% | 6.7% | 87.9% | ProtectAI validation: spikee contextual (986) |
69
+ | pai_bipia_code | 98.0% | 0.0% | 99.0% | ProtectAI validation: bipia_code (50) |
70
+ | pai_bipia_text | 89.3% | 0.0% | 94.4% | ProtectAI validation: bipia_text (75) |
71
+ | pai_not_inject | 0.0% | 0.9% | 0.0% | ProtectAI validation: not_inject trigger benign (339) |
72
+ | pai_wildguard **weak** | 0.0% | 54.2% | 0.0% | ProtectAI validation: wildguard benign diversity (971) |
73
+ | pai_deepset | 82.9% | 18.8% | 78.4% | ProtectAI validation: deepset full (662) |
74
+ | pai_validation_all **weak** | 81.0% | 34.1% | 71.7% | ProtectAI validation combined (3227) |
75
+ | bipia_contextual_proxy | 97.3% | 0.0% | 98.6% | Proxy for test_contextual (1242 indirect BIPIA rows) |
76
+ | llm_pieval | 76.1% | 0.0% | 86.5% | LLM-PIEval agentic injection (750, recall-only) |
77
+ | gold_direct_malicious_proxy | 81.0% | 0.0% | 89.5% | Proxy for test_gold_direct malicious slice |
78
+ | gold_direct_benign_proxy **weak** | 0.0% | 34.1% | 0.0% | Proxy for test_gold_direct benign slice (FPR) |
79
+ | jbb_harmful_overdefense **weak** | 0.0% | 96.0% | 0.0% | JailbreakBench harmful goals — should stay SAFE (100) |
80
+ | jbb_benign_overdefense | 0.0% | 6.0% | 0.0% | JailbreakBench benign goals — should stay SAFE (100) |
81
+ | toxicchat_benign | 0.0% | 2.0% | 0.0% | ToxicChat benign over-defense (up to 4800) |
82
+ | neuralchemy_test | 94.4% | 0.5% | 96.9% | NeurAlchemy test (942) — Vigil card reports this axis |
83
+ | neuralchemy_validation | 93.8% | 2.5% | 95.9% | NeurAlchemy validation split |
84
+ | bipia_indirect | 96.3% | 0.0% | 98.1% | Our BIPIA indirect holdout (2000) |
85
+ | deepset_direct | 61.9% | 10.2% | 69.1% | Our Deepset OOD holdout (281) |
86
+ | notinject_fpr | 0.0% | 0.9% | 0.0% | Our notinject FPR holdout (339) |
87
+ | xstest_safe | 0.0% | 2.8% | 0.0% | XSTest safe homonym FPR |
88
+ | xstest_fpr **weak** | 0.0% | 40.2% | 0.0% | XSTest combined FPR |
89
+ | xstest_harmful_contrast **weak** | 0.0% | 87.0% | 0.0% | Harmful but non-injection contrast FPR |
90
+
91
+ ## Limitations
92
+
93
+ - Doc head over-fires on harmful-but-non-injection text (XSTest contrast, JBB harmful goals)
94
+ - WildGuard benign diversity triggers false positives
95
+ - Subtle direct OOD injections (Deepset-class) often missed by both heads
96
+ - Long agentic contexts (LLM-PIEval) have recall gaps
97
+
98
+ ## Usage (SDK)
99
+
100
+ ```python
101
+ from unplug import Guard
102
+ guard = Guard(mode="local") # loads Unplug-AI/unplug-tiny-v1
103
+ result = guard.scan(user_text)
104
+ ```