enlightenedzeno's picture
Upload README.md with huggingface_hub
bde0f15 verified
metadata
language: en
license: apache-2.0
library_name: onnx
tags:
  - prompt-injection
  - security
  - text-classification
  - onnx
  - deberta-v3
datasets:
  - neuralchemy/Prompt-injection-dataset
base_model: ProtectAI/deberta-v3-base-prompt-injection-v2

OpenParallax Shield Classifier v1

Fine-tuned DeBERTa-v3-base for prompt injection detection in AI agent tool calls.

Performance

Tested against 321 adversarial payloads across 6 attack categories:

Metric Pre-trained Fine-tuned
Accuracy 77.6% 98.8%
False negatives 71 4
False positives 1 0

Per-Category Results

Category Pre-trained Fine-tuned
Encoding evasion 51.3% 100%
Shell injection 73.3% 100%
Authority spoofing 82.1% 100%
Path traversal 64.0% 96.0%
Data exfiltration 86.1% 100%
Prompt injection 92.8% 97.9%

Training

Optimized for detecting injections in:

  • Tool call arguments (file paths, shell commands, HTTP requests)
  • Authority spoofing ("system override", "admin approved", tool impersonation)
  • Encoding evasion (base64, hex, URL encoding, Unicode homoglyphs, bidirectional text)
  • Multilingual injection (Spanish, Chinese, Russian, Arabic, Japanese, Korean, and more)

Usage with OpenParallax Shield

openparallax get-classifier

Usage with ONNX Runtime (Node.js)

import * as ort from "onnxruntime-node";
import { Tokenizer } from "tokenizers";

const session = await ort.InferenceSession.create("model.onnx");
const tokenizer = Tokenizer.fromFile("tokenizer.json");

const encoded = await tokenizer.encode("your text here");
const inputIds = new ort.Tensor("int64", BigInt64Array.from(encoded.getIds().map(BigInt)), [1, encoded.getIds().length]);
const attentionMask = new ort.Tensor("int64", BigInt64Array.from(encoded.getAttentionMask().map(BigInt)), [1, encoded.getAttentionMask().length]);

const results = await session.run({ input_ids: inputIds, attention_mask: attentionMask });
// logits[0] = SAFE probability, logits[1] = INJECTION probability

License

Apache 2.0