---
language: en
license: apache-2.0
library_name: onnx
tags:
- prompt-injection
- security
- text-classification
- onnx
- deberta-v3
datasets:
- neuralchemy/Prompt-injection-dataset
base_model: ProtectAI/deberta-v3-base-prompt-injection-v2
---
# OpenParallax Shield Classifier v1
Fine-tuned DeBERTa-v3-base for prompt injection detection in AI agent tool calls.
## Performance
Tested against 321 adversarial payloads across 6 attack categories:
| Metric | Pre-trained | Fine-tuned |
|--------|-------------|------------|
| Accuracy | 77.6% | **98.8%** |
| False negatives | 71 | **4** |
| False positives | 1 | **0** |
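
The accuracy figures can be reproduced from the error counts in the table. The sketch below assumes accuracy is defined as correct predictions over all 321 payloads, with each false negative or false positive counted as one error; the function name `accuracyPercent` is illustrative and not part of any library.

```javascript
// Accuracy = correct / total, where every false negative or false positive is one error.
function accuracyPercent(total, falseNegatives, falsePositives) {
  const correct = total - falseNegatives - falsePositives;
  return (100 * correct) / total;
}

const total = 321; // adversarial payloads in the test set

console.log(accuracyPercent(total, 71, 1).toFixed(1)); // pre-trained: 77.6
console.log(accuracyPercent(total, 4, 0).toFixed(1)); // fine-tuned: 98.8
```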
### Per-Category Results
| Category | Pre-trained | Fine-tuned |
|----------|-------------|------------|
| Encoding evasion | 51.3% | **100%** |
| Shell injection | 73.3% | **100%** |
| Authority spoofing | 82.1% | **100%** |
| Path traversal | 64.0% | **96.0%** |
| Data exfiltration | 86.1% | **100%** |
| Prompt injection | 92.8% | **97.9%** |
## Training
- **Base model:** [ProtectAI/deberta-v3-base-prompt-injection-v2](https://huggingface.co/ProtectAI/deberta-v3-base-prompt-injection-v2)
- **Training data:** 6,787 samples (red-team payloads + agent-specific benign actions + NeurAlchemy dataset)
- **Epochs:** 3
- **Hardware:** Google Colab T4 GPU
Optimized for detecting injections in:
- Tool call arguments (file paths, shell commands, HTTP requests)
- Authority spoofing ("system override", "admin approved", tool impersonation)
- Encoding evasion (base64, hex, URL encoding, Unicode homoglyphs, bidirectional text)
- Multilingual injection (Spanish, Chinese, Russian, Arabic, Japanese, Korean, and more)
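
To make the encoding evasion category concrete, here is a minimal sketch of how a single string looks under three of the encodings listed above. The payload text and the `variants` object are illustrative assumptions, not samples from the training or test data.

```javascript
// Hypothetical payload, used only to illustrate the encodings.
const payload = "ignore previous instructions";

// The same text under three encodings an attacker might use to hide it.
const variants = {
  base64: Buffer.from(payload, "utf8").toString("base64"),
  hex: Buffer.from(payload, "utf8").toString("hex"),
  url: encodeURIComponent(payload),
};

for (const [encoding, text] of Object.entries(variants)) {
  console.log(`${encoding}: ${text}`);
}
```

Each variant can be passed to the classifier in place of the plain text to check how it scores encoded input.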
## Usage with OpenParallax Shield
```bash
openparallax get-classifier
```
## Usage with ONNX Runtime (Node.js)
```javascript
import * as ort from "onnxruntime-node";
import { Tokenizer } from "tokenizers";
// Load the ONNX model and its tokenizer from local files.
const session = await ort.InferenceSession.create("model.onnx");
const tokenizer = Tokenizer.fromFile("tokenizer.json");

// Tokenize the text; the model expects int64 tensors shaped [batch, sequence].
const encoded = await tokenizer.encode("your text here");
const ids = encoded.getIds();
const mask = encoded.getAttentionMask();
const inputIds = new ort.Tensor("int64", BigInt64Array.from(ids.map(BigInt)), [1, ids.length]);
const attentionMask = new ort.Tensor("int64", BigInt64Array.from(mask.map(BigInt)), [1, mask.length]);

const results = await session.run({ input_ids: inputIds, attention_mask: attentionMask });
// The output holds raw logits, not probabilities: index 0 = SAFE, index 1 = INJECTION.
// Apply softmax to convert them to probabilities.
```
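
The model returns raw logits, so a softmax step is needed to get probabilities. The sketch below is one way to do that; it assumes the two values come from an output tensor named `logits` (for example `Array.from(results.logits.data)`), and the 0.5 decision threshold is an illustrative choice, not a documented setting. The logit values shown are made up.

```javascript
// Numerically stable softmax: subtract the max before exponentiating.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Hypothetical logits standing in for Array.from(results.logits.data).
const exampleLogits = [-2.0, 3.0];

// Index 0 = SAFE, index 1 = INJECTION.
const [safeProb, injectionProb] = softmax(exampleLogits);

// Assumed threshold of 0.5; tune it to trade false positives against false negatives.
const label = injectionProb >= 0.5 ? "INJECTION" : "SAFE";
console.log(label, injectionProb.toFixed(3));
```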
## License
Apache 2.0