---
language: en
license: apache-2.0
library_name: onnx
tags:
  - prompt-injection
  - security
  - text-classification
  - onnx
  - deberta-v3
datasets:
  - neuralchemy/Prompt-injection-dataset
base_model: ProtectAI/deberta-v3-base-prompt-injection-v2
---

# OpenParallax Shield Classifier v1

Fine-tuned DeBERTa-v3-base for prompt injection detection in AI agent tool calls.

## Performance

Tested against 321 adversarial payloads across 6 attack categories:

| Metric | Pre-trained | Fine-tuned |
|--------|-------------|------------|
| Accuracy | 77.6% | **98.8%** |
| False negatives | 71 | **4** |
| False positives | 1 | **0** |
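
The accuracy figures can be cross-checked against the error counts in the table. The sketch below assumes accuracy is computed over all 321 test items as (total - false negatives - false positives) / total; the function name `accuracy` is illustrative and not part of any published tooling.

```javascript
// Recompute accuracy from the error counts reported in the table above.
// Assumption: every one of the 321 test items counts toward the denominator.
const TOTAL_PAYLOADS = 321;

function accuracy(falseNegatives, falsePositives, total = TOTAL_PAYLOADS) {
  const errors = falseNegatives + falsePositives;
  return (total - errors) / total;
}

const preTrained = accuracy(71, 1); // 249 correct out of 321
const fineTuned = accuracy(4, 0);   // 317 correct out of 321

console.log(`${(preTrained * 100).toFixed(1)}%`); // 77.6%
console.log(`${(fineTuned * 100).toFixed(1)}%`);  // 98.8%
```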

### Per-Category Results

| Category | Pre-trained | Fine-tuned |
|----------|-------------|------------|
| Encoding evasion | 51.3% | **100%** |
| Shell injection | 73.3% | **100%** |
| Authority spoofing | 82.1% | **100%** |
| Path traversal | 64.0% | **96.0%** |
| Data exfiltration | 86.1% | **100%** |
| Prompt injection | 92.8% | **97.9%** |

## Training

- **Base model:** [ProtectAI/deberta-v3-base-prompt-injection-v2](https://huggingface.co/ProtectAI/deberta-v3-base-prompt-injection-v2)
- **Training data:** 6,787 samples (red-team payloads + agent-specific benign actions + NeurAlchemy dataset)
- **Epochs:** 3
- **Hardware:** Google Colab T4 GPU

Optimized for detecting:
- Injections in tool call arguments (file paths, shell commands, HTTP requests)
- Authority spoofing ("system override", "admin approved", tool impersonation)
- Encoding evasion (base64, hex, URL encoding, Unicode homoglyphs, bidirectional text)
- Multilingual injection (Spanish, Chinese, Russian, Arabic, Japanese, Korean, and more)

## Usage with OpenParallax Shield

```bash
openparallax get-classifier
```

## Usage with ONNX Runtime (Node.js)

```javascript
import * as ort from "onnxruntime-node";
import { Tokenizer } from "tokenizers";

const session = await ort.InferenceSession.create("model.onnx");
const tokenizer = Tokenizer.fromFile("tokenizer.json");

const encoded = await tokenizer.encode("your text here");
const inputIds = new ort.Tensor("int64", BigInt64Array.from(encoded.getIds().map(BigInt)), [1, encoded.getIds().length]);
const attentionMask = new ort.Tensor("int64", BigInt64Array.from(encoded.getAttentionMask().map(BigInt)), [1, encoded.getAttentionMask().length]);

const results = await session.run({ input_ids: inputIds, attention_mask: attentionMask });
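// The output holds raw logits, not probabilities: index 0 = SAFE, index 1 = INJECTION.
// Apply softmax to the two values to obtain class probabilities.
const logits = results[session.outputNames[0]].data;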
```
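
The model emits raw logits, so a softmax step is needed before the values can be read as probabilities. The following is a minimal sketch of that step, assuming the two-element output order stated above (index 0 = SAFE, index 1 = INJECTION). The helper names `softmax` and `classify` and the default threshold of 0.5 are illustrative assumptions, not values taken from this model card.

```javascript
// Numerically stable softmax: subtract the max logit before exponentiating.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Label order as described in the usage example: 0 = SAFE, 1 = INJECTION.
const LABELS = ["SAFE", "INJECTION"];

// Hypothetical helper: turns a logits array (plain or typed) into a verdict.
// The 0.5 threshold is an assumption; tune it for your false-positive budget.
function classify(logits, threshold = 0.5) {
  const [safe, injection] = softmax(Array.from(logits));
  const label = injection >= threshold ? LABELS[1] : LABELS[0];
  return { label, safe, injection };
}

console.log(classify([-2.0, 3.0]).label); // INJECTION
```

With the ONNX Runtime session from the example above, the input would be the `data` array of the model's output tensor, for instance `classify(results[session.outputNames[0]].data)`.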

## License

Apache 2.0