---
language: en
license: apache-2.0
library_name: onnx
tags:
- prompt-injection
- security
- text-classification
- onnx
- deberta-v3
datasets:
- neuralchemy/Prompt-injection-dataset
base_model: ProtectAI/deberta-v3-base-prompt-injection-v2
---

# OpenParallax Shield Classifier v1

A fine-tuned DeBERTa-v3-base model that detects prompt injection in AI agent tool calls.

## Performance

Both the base model ("Pre-trained") and this model ("Fine-tuned") were tested against 321 adversarial payloads across 6 attack categories:

| Metric | Pre-trained | Fine-tuned |
|--------|-------------|------------|
| Accuracy | 77.6% | **98.8%** |
| False negatives | 71 | **4** |
| False positives | 1 | **0** |
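
The headline accuracy figures can be cross-checked against the error counts in the table. The sketch below assumes accuracy is computed as correct predictions over all 321 test samples, with false negatives and false positives both counted as errors. The function name `accuracyPercent` is illustrative and not part of any OpenParallax tooling.

```javascript
// Accuracy as a percentage, assuming every false negative and false positive
// counts as one error against the same total.
function accuracyPercent(total, falseNegatives, falsePositives) {
  const correct = total - falseNegatives - falsePositives;
  return (100 * correct) / total;
}

const TOTAL_PAYLOADS = 321; // test set size reported in this model card

const preTrained = accuracyPercent(TOTAL_PAYLOADS, 71, 1);
const fineTuned = accuracyPercent(TOTAL_PAYLOADS, 4, 0);

console.log(preTrained.toFixed(1)); // 77.6
console.log(fineTuned.toFixed(1)); // 98.8
```

Under that assumption the counts reproduce the reported 77.6% and 98.8%.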

### Per-Category Results

| Category | Pre-trained | Fine-tuned |
|----------|-------------|------------|
| Encoding evasion | 51.3% | **100%** |
| Shell injection | 73.3% | **100%** |
| Authority spoofing | 82.1% | **100%** |
| Path traversal | 64.0% | **96.0%** |
| Data exfiltration | 86.1% | **100%** |
| Prompt injection | 92.8% | **97.9%** |

## Training

- **Base model:** [ProtectAI/deberta-v3-base-prompt-injection-v2](https://huggingface.co/ProtectAI/deberta-v3-base-prompt-injection-v2)
- **Training data:** 6,787 samples (red-team payloads, agent-specific benign actions, and the NeurAlchemy dataset)
- **Epochs:** 3
- **Hardware:** Google Colab T4 GPU

The model is optimized to detect:
- Injections in tool call arguments (file paths, shell commands, HTTP requests)
- Authority spoofing ("system override", "admin approved", tool impersonation)
- Encoding evasion (base64, hex, URL encoding, Unicode homoglyphs, bidirectional text)
- Multilingual injection (Spanish, Chinese, Russian, Arabic, Japanese, Korean, and more)

## Usage with OpenParallax Shield

```bash
openparallax get-classifier
```

## Usage with ONNX Runtime (Node.js)

```javascript
import * as ort from "onnxruntime-node";
import { Tokenizer } from "tokenizers";

// Load the ONNX model and its tokenizer from the downloaded model files.
const session = await ort.InferenceSession.create("model.onnx");
const tokenizer = Tokenizer.fromFile("tokenizer.json");

// Tokenize the text, then wrap the ids and mask as int64 tensors of shape [1, sequenceLength].
const encoded = await tokenizer.encode("your text here");
const ids = encoded.getIds();
const mask = encoded.getAttentionMask();
const inputIds = new ort.Tensor("int64", BigInt64Array.from(ids, (id) => BigInt(id)), [1, ids.length]);
const attentionMask = new ort.Tensor("int64", BigInt64Array.from(mask, (m) => BigInt(m)), [1, mask.length]);

const results = await session.run({ input_ids: inputIds, attention_mask: attentionMask });
// The output holds raw logits, not probabilities: index 0 = SAFE, index 1 = INJECTION.
// Apply a softmax to convert them into probabilities.
```
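
The model outputs raw logits, so a softmax is needed before the two scores can be read as probabilities. The following is a minimal sketch of that post-processing step, assuming the label order given in the example above (index 0 = SAFE, index 1 = INJECTION). The `classify` helper and its `threshold` parameter are hypothetical, as is the assumption that the output tensor is named `logits`; `session.outputNames` lists the actual output names of your export.

```javascript
// Numerically stable softmax: subtract the largest logit before exponentiating.
function softmax(logits) {
  const values = Array.from(logits); // accepts plain arrays and Float32Array
  const max = Math.max(...values);
  const exps = values.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Label order assumed from the model card: index 0 = SAFE, index 1 = INJECTION.
const LABELS = ["SAFE", "INJECTION"];

// `threshold` is a hypothetical tuning knob, not a value published with the model.
function classify(logits, threshold = 0.5) {
  const probs = softmax(logits);
  const injectionProbability = probs[1];
  return {
    label: injectionProbability >= threshold ? LABELS[1] : LABELS[0],
    injectionProbability,
  };
}

// Made-up logits for illustration; with ONNX Runtime you would pass
// something like results.logits.data (output name assumed).
console.log(classify([-2.0, 3.0]));
```

Raising `threshold` trades missed injections for fewer false alarms, so the right value depends on how costly each kind of error is in your agent.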

## License

Apache 2.0