---
language: en
license: apache-2.0
library_name: onnx
tags:
- prompt-injection
- security
- text-classification
- onnx
- deberta-v3
datasets:
- neuralchemy/Prompt-injection-dataset
base_model: ProtectAI/deberta-v3-base-prompt-injection-v2
---

# OpenParallax Shield Classifier v1

Fine-tuned DeBERTa-v3-base for prompt injection detection in AI agent tool calls.

## Performance

Tested against 321 adversarial payloads across 6 attack categories:

| Metric | Pre-trained | Fine-tuned |
|--------|-------------|------------|
| Accuracy | 77.6% | **98.8%** |
| False negatives | 71 | **4** |
| False positives | 1 | **0** |

### Per-Category Results

Accuracy per attack category:

| Category | Pre-trained | Fine-tuned |
|----------|-------------|------------|
| Encoding evasion | 51.3% | **100%** |
| Shell injection | 73.3% | **100%** |
| Authority spoofing | 82.1% | **100%** |
| Path traversal | 64.0% | **96.0%** |
| Data exfiltration | 86.1% | **100%** |
| Prompt injection | 92.8% | **97.9%** |

## Training

- **Base model:** [ProtectAI/deberta-v3-base-prompt-injection-v2](https://huggingface.co/ProtectAI/deberta-v3-base-prompt-injection-v2)
- **Training data:** 6,787 samples (red-team payloads + agent-specific benign actions + NeurAlchemy dataset)
- **Epochs:** 3
- **Hardware:** Google Colab T4 GPU

Optimized for detecting injections in:

- Tool call arguments (file paths, shell commands, HTTP requests)
- Authority spoofing ("system override", "admin approved", tool impersonation)
- Encoding evasion (base64, hex, URL encoding, Unicode homoglyphs, bidirectional text)
- Multilingual injection (Spanish, Chinese, Russian, Arabic, Japanese, Korean, and more)

## Usage with OpenParallax Shield

```bash
openparallax get-classifier
```

## Usage with ONNX Runtime (Node.js)

```javascript
import * as ort from "onnxruntime-node";
import { Tokenizer } from "tokenizers";

// Load the ONNX model and its tokenizer from the local files.
const session = await ort.InferenceSession.create("model.onnx");
const tokenizer = Tokenizer.fromFile("tokenizer.json");

// Tokenize the text to classify.
const encoded = await tokenizer.encode("your text here");
const ids = encoded.getIds();
const mask = encoded.getAttentionMask();

// The model expects int64 tensors of shape [batch, sequence_length].
const inputIds = new ort.Tensor("int64", BigInt64Array.from(ids.map(BigInt)), [1, ids.length]);
const attentionMask = new ort.Tensor("int64", BigInt64Array.from(mask.map(BigInt)), [1, mask.length]);

const results = await session.run({ input_ids: inputIds, attention_mask: attentionMask });
// The output holds raw logits, not probabilities:
// index 0 = SAFE, index 1 = INJECTION. Apply softmax to get probabilities.
```

## License

Apache 2.0
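
## Interpreting the output

The Node.js example stops at the raw model output. The snippet below is a minimal sketch of one way to turn the two logits into probabilities and a label. The helper names `softmax` and `classify` and the `0.5` default threshold are assumptions made for illustration and are not part of this model or of OpenParallax Shield. The label order (index 0 = SAFE, index 1 = INJECTION) is taken from the usage example.

```javascript
// Hypothetical helper: numerically stable softmax over an array of logits.
function softmax(logits) {
  const values = Array.from(logits, Number);
  const max = Math.max(...values); // subtract the max to avoid overflow in exp()
  const exps = values.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Hypothetical helper: maps [safeLogit, injectionLogit] to a label.
// The 0.5 threshold is an assumed default; tune it for your own traffic.
function classify(logits, threshold = 0.5) {
  const [safe, injection] = softmax(logits);
  return {
    label: injection >= threshold ? "INJECTION" : "SAFE",
    safe,
    injection,
  };
}
```

With ONNX Runtime, the logits would typically be read from the output tensor's `data` property, for example `classify(results.logits.data)`, assuming the exported graph names its output `logits`. If it does not, `session.outputNames` lists the actual output names.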