Text Classification
Transformers
Safetensors
English
deberta-v2
prompt-injection
security
span-detection
guardrails
ai-safety
agents
llm-security
text-embeddings-inference
Instructions to use Unplug-AI/unplug-tiny-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Unplug-AI/unplug-tiny-v1 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-classification", model="Unplug-AI/unplug-tiny-v1")# Load model directly from transformers import AutoTokenizer, DebertaV2ForDualHead tokenizer = AutoTokenizer.from_pretrained("Unplug-AI/unplug-tiny-v1") model = DebertaV2ForDualHead.from_pretrained("Unplug-AI/unplug-tiny-v1") - Notebooks
- Google Colab
- Kaggle
Publish unplug-tiny-v1 checkpoint-66630
Browse filesDeBERTa-v3-xsmall dual-head span injection model. Preview OSS — not a WAF replacement.
README.md
ADDED
|
@@ -0,0 +1,104 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
language: en
|
| 3 |
+
license: apache-2.0
|
| 4 |
+
tags:
|
| 5 |
+
- prompt-injection
|
| 6 |
+
- security
|
| 7 |
+
- span-detection
|
| 8 |
+
library_name: transformers
|
| 9 |
+
pipeline_tag: text-classification
|
| 10 |
+
model_name: Unplug-AI/unplug-tiny-v1
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
# unplug-tiny-v1
|
| 14 |
+
|
| 15 |
+
**Preview OSS span detector** — not a production WAF / Vigil replacement.
|
| 16 |
+
|
| 17 |
+
- Backbone: `microsoft/deberta-v3-xsmall` (~22M dual-head)
|
| 18 |
+
- Policy: `doc_or_span` @ τ_doc=0.9, τ_span=0.45
|
| 19 |
+
- Checkpoint: `checkpoint-66630`
|
| 20 |
+
- Generated: 2026-06-09
|
| 21 |
+
|
| 22 |
+
## Scope
|
| 23 |
+
|
| 24 |
+
Strong on Neuralchemy, BIPIA indirect, notinject, public validation. Known weaknesses: WildGuard benign FPR, harmful-non-injection contrast, Deepset OOD recall, agentic (LLM-PIEval).
|
| 25 |
+
|
| 26 |
+
## Required ship gates
|
| 27 |
+
|
| 28 |
+
| Gate | Value | Status |
|
| 29 |
+
| --- | --- | --- |
|
| 30 |
+
| fp_probes | True | PASS |
|
| 31 |
+
| neuralchemy_test_doc_fpr | 0.5% | PASS |
|
| 32 |
+
| neuralchemy_test_doc_recall | 94.4% | PASS |
|
| 33 |
+
| bipia_recall | 96.3% | PASS |
|
| 34 |
+
| deepset_direct_recall | 61.9% | FAIL |
|
| 35 |
+
| deepset_direct_fpr | 10.2% | FAIL |
|
| 36 |
+
| notinject_fpr | 0.9% | PASS |
|
| 37 |
+
| xstest_safe_fpr | 2.8% | PASS |
|
| 38 |
+
| public_validation_recall | 100.0% | PASS |
|
| 39 |
+
| public_validation_fpr | 0.1% | PASS |
|
| 40 |
+
| span_holdout_f1 | 97.1% | PASS |
|
| 41 |
+
| malicious_span_char_recall | 97.4% | PASS |
|
| 42 |
+
| benign_span_fire_rate | 0.0% | PASS |
|
| 43 |
+
| xstest_harmful_contrast_fpr | 87.0% | FAIL |
|
| 44 |
+
| exfil_demo | None | PASS |
|
| 45 |
+
|
| 46 |
+
**Required gate failures:** deepset_direct_recall, deepset_direct_fpr
|
| 47 |
+
|
| 48 |
+
### Ship-gate holdouts (checkpoint-66630)
|
| 49 |
+
|
| 50 |
+
| Holdout | Recall | FPR | F1 | FN | FP |
|
| 51 |
+
| --- | --- | --- | --- | --- | --- |
|
| 52 |
+
| fp_probes | None | None | None | 0 | 0 |
|
| 53 |
+
| neuralchemy_test | 94.4% | 0.5% | 96.9% | 31 | 2 |
|
| 54 |
+
| train_span_holdout | 98.8% | None | 97.1% | 219 | 805 |
|
| 55 |
+
| bipia_indirect | 96.3% | 0.0% | 98.1% | 74 | 0 |
|
| 56 |
+
| deepset_direct | 61.9% | 10.2% | 69.2% | 40 | 18 |
|
| 57 |
+
| notinject_fpr | 0.0% | 0.9% | 0.0% | 0 | 3 |
|
| 58 |
+
| xstest_safe | 0.0% | 2.8% | 0.0% | 0 | 7 |
|
| 59 |
+
| xstest_fpr | 0.0% | 40.2% | 0.0% | 0 | 181 |
|
| 60 |
+
| xstest_harmful_contrast | 0.0% | 87.0% | 0.0% | 0 | 174 |
|
| 61 |
+
| public_validation | 100.0% | 0.1% | 100.0% | 1 | 2 |
|
| 62 |
+
|
| 63 |
+
### Vigil-parity holdouts (per-axis, not blended)
|
| 64 |
+
|
| 65 |
+
| Holdout | Recall | Doc FPR | F1 | Purpose |
|
| 66 |
+
| --- | --- | --- | --- | --- |
|
| 67 |
+
| pai_injecguard_valid **weak** | 89.6% | 20.8% | 77.5% | ProtectAI validation: InjecGuard_valid (144) |
|
| 68 |
+
| pai_spikee | 78.6% | 6.7% | 87.9% | ProtectAI validation: spikee contextual (986) |
|
| 69 |
+
| pai_bipia_code | 98.0% | 0.0% | 99.0% | ProtectAI validation: bipia_code (50) |
|
| 70 |
+
| pai_bipia_text | 89.3% | 0.0% | 94.4% | ProtectAI validation: bipia_text (75) |
|
| 71 |
+
| pai_not_inject | 0.0% | 0.9% | 0.0% | ProtectAI validation: not_inject trigger benign (339) |
|
| 72 |
+
| pai_wildguard **weak** | 0.0% | 54.2% | 0.0% | ProtectAI validation: wildguard benign diversity (971) |
|
| 73 |
+
| pai_deepset | 82.9% | 18.8% | 78.4% | ProtectAI validation: deepset full (662) |
|
| 74 |
+
| pai_validation_all **weak** | 81.0% | 34.1% | 71.7% | ProtectAI validation combined (3227) |
|
| 75 |
+
| bipia_contextual_proxy | 97.3% | 0.0% | 98.6% | Proxy for test_contextual (1242 indirect BIPIA rows) |
|
| 76 |
+
| llm_pieval | 76.1% | 0.0% | 86.5% | LLM-PIEval agentic injection (750, recall-only) |
|
| 77 |
+
| gold_direct_malicious_proxy | 81.0% | 0.0% | 89.5% | Proxy for test_gold_direct malicious slice |
|
| 78 |
+
| gold_direct_benign_proxy **weak** | 0.0% | 34.1% | 0.0% | Proxy for test_gold_direct benign slice (FPR) |
|
| 79 |
+
| jbb_harmful_overdefense **weak** | 0.0% | 96.0% | 0.0% | JailbreakBench harmful goals — should stay SAFE (100) |
|
| 80 |
+
| jbb_benign_overdefense | 0.0% | 6.0% | 0.0% | JailbreakBench benign goals — should stay SAFE (100) |
|
| 81 |
+
| toxicchat_benign | 0.0% | 2.0% | 0.0% | ToxicChat benign over-defense (up to 4800) |
|
| 82 |
+
| neuralchemy_test | 94.4% | 0.5% | 96.9% | NeurAlchemy test (942) — Vigil card reports this axis |
|
| 83 |
+
| neuralchemy_validation | 93.8% | 2.5% | 95.9% | NeurAlchemy validation split |
|
| 84 |
+
| bipia_indirect | 96.3% | 0.0% | 98.1% | Our BIPIA indirect holdout (2000) |
|
| 85 |
+
| deepset_direct | 61.9% | 10.2% | 69.1% | Our Deepset OOD holdout (281) |
|
| 86 |
+
| notinject_fpr | 0.0% | 0.9% | 0.0% | Our notinject FPR holdout (339) |
|
| 87 |
+
| xstest_safe | 0.0% | 2.8% | 0.0% | XSTest safe homonym FPR |
|
| 88 |
+
| xstest_fpr **weak** | 0.0% | 40.2% | 0.0% | XSTest combined FPR |
|
| 89 |
+
| xstest_harmful_contrast **weak** | 0.0% | 87.0% | 0.0% | Harmful but non-injection contrast FPR |
|
| 90 |
+
|
| 91 |
+
## Limitations
|
| 92 |
+
|
| 93 |
+
- Doc head over-fires on harmful-but-non-injection text (XSTest contrast, JBB harmful goals)
|
| 94 |
+
- WildGuard benign diversity triggers false positives
|
| 95 |
+
- Subtle direct OOD injections (Deepset-class) often missed by both heads
|
| 96 |
+
- Long agentic contexts (LLM-PIEval) have recall gaps
|
| 97 |
+
|
| 98 |
+
## Usage (SDK)
|
| 99 |
+
|
| 100 |
+
```python
|
| 101 |
+
from unplug import Guard
|
| 102 |
+
guard = Guard(mode="local") # loads Unplug-AI/unplug-tiny-v1
|
| 103 |
+
result = guard.scan(user_text)
|
| 104 |
+
```
|