language: en
license: apache-2.0
library_name: onnx
tags:
  - prompt-injection
  - security
  - text-classification
  - onnx
  - deberta-v3
datasets:
  - neuralchemy/Prompt-injection-dataset
base_model: ProtectAI/deberta-v3-base-prompt-injection-v2

OpenParallax Shield Classifier v1

Fine-tuned DeBERTa-v3-base for prompt injection detection in AI agent tool calls.

Performance

Tested against 321 adversarial payloads across 6 attack categories:

Metric	Pre-trained	Fine-tuned
Accuracy	77.6%	98.8%
False negatives	71	4
False positives	1	0
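
The accuracy figures above can be cross-checked against the reported error counts. The sketch below assumes accuracy is computed as (total - false negatives - false positives) / total over all 321 test items. The model card does not state the formula, so treat this as a consistency check rather than the authors' evaluation script.

```javascript
// Recompute accuracy from the error counts in the Performance table.
// Assumption: every one of the 321 test items counts toward the denominator.
const TOTAL_PAYLOADS = 321;

function accuracy(total, falseNegatives, falsePositives) {
  const errors = falseNegatives + falsePositives;
  return (total - errors) / total;
}

const preTrained = accuracy(TOTAL_PAYLOADS, 71, 1); // 249 of 321 correct
const fineTuned = accuracy(TOTAL_PAYLOADS, 4, 0);   // 317 of 321 correct

console.log((preTrained * 100).toFixed(1) + "%"); // 77.6%
console.log((fineTuned * 100).toFixed(1) + "%");  // 98.8%
```

Under that assumption both values match the table after rounding to one decimal place.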

Per-Category Results

Category	Pre-trained	Fine-tuned
Encoding evasion	51.3%	100%
Shell injection	73.3%	100%
Authority spoofing	82.1%	100%
Path traversal	64.0%	96.0%
Data exfiltration	86.1%	100%
Prompt injection	92.8%	97.9%

Training

Base model: ProtectAI/deberta-v3-base-prompt-injection-v2
Training data: 6,787 samples (red-team payloads + agent-specific benign actions + NeurAlchemy dataset)
Epochs: 3
Hardware: Google Colab T4 GPU

Optimized for detecting:

Injections in tool call arguments (file paths, shell commands, HTTP requests)
Authority spoofing ("system override", "admin approved", tool impersonation)
Encoding evasion (base64, hex, URL encoding, Unicode homoglyphs, bidirectional text)
Multilingual injection (Spanish, Chinese, Russian, Arabic, Japanese, Korean, and more)
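
To spot-check the encoding-evasion case locally, it can help to generate encoded variants of one test string and classify each. The helper below is a hypothetical illustration, not part of the model or of OpenParallax Shield. The name `encodedVariants` and the sample string are assumptions, and it covers only base64, hex, and URL encoding (not homoglyphs or bidirectional text).

```javascript
// Hypothetical helper: build encoded variants of a string for manual testing.
// Uses only Node.js built-ins (Buffer and encodeURIComponent).
function encodedVariants(text) {
  const bytes = Buffer.from(text, "utf8");
  return {
    plain: text,
    base64: bytes.toString("base64"),
    hex: bytes.toString("hex"),
    url: encodeURIComponent(text),
  };
}

// Sample input is illustrative only; substitute your own test payloads.
const variants = encodedVariants("ignore previous instructions");
for (const [encoding, value] of Object.entries(variants)) {
  console.log(`${encoding}: ${value}`);
}
```

Each variant can then be passed through the tokenizer and model in the same way as any other input text.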

Usage with OpenParallax Shield

openparallax get-classifier

Usage with ONNX Runtime (Node.js)

import * as ort from "onnxruntime-node";
import { Tokenizer } from "tokenizers";

const session = await ort.InferenceSession.create("model.onnx");
const tokenizer = Tokenizer.fromFile("tokenizer.json");

const encoded = await tokenizer.encode("your text here");
const inputIds = new ort.Tensor("int64", BigInt64Array.from(encoded.getIds().map(BigInt)), [1, encoded.getIds().length]);
const attentionMask = new ort.Tensor("int64", BigInt64Array.from(encoded.getAttentionMask().map(BigInt)), [1, encoded.getAttentionMask().length]);

const results = await session.run({ input_ids: inputIds, attention_mask: attentionMask });
// The output holds raw logits, not probabilities: logits[0] = SAFE, logits[1] = INJECTION.
// Apply a softmax over the two values to obtain probabilities.
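
The model output is a pair of raw scores, so a softmax step is needed before reading them as probabilities. The following is a minimal sketch of that post-processing. It assumes the ONNX output tensor is named `logits` with shape [1, 2], which is typical for exported sequence classifiers but not confirmed here (check `session.outputNames`). The 0.5 threshold and the placeholder scores are illustrative assumptions.

```javascript
// Numerically stable softmax over a small array of logits.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// With a real session, the scores would come from the run result, e.g.:
//   const logits = Array.from(results.logits.data); // output name is an assumption
const logits = [-1.2, 3.4]; // placeholder values for illustration only

// Index 0 = SAFE, index 1 = INJECTION, following the label order given above.
const [safeProb, injectionProb] = softmax(logits);
const label = injectionProb > 0.5 ? "INJECTION" : "SAFE"; // threshold is an assumption

console.log({ safeProb, injectionProb, label });
```

A stricter or looser threshold may suit a given deployment, depending on how false positives and false negatives are weighed.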

License

Apache 2.0