	---
	language: en
	license: apache-2.0
	library_name: onnx
	tags:
	- prompt-injection
	- security
	- text-classification
	- onnx
	- deberta-v3
	datasets:
	- neuralchemy/Prompt-injection-dataset
	base_model: ProtectAI/deberta-v3-base-prompt-injection-v2
	---

	# OpenParallax Shield Classifier v1

	Fine-tuned DeBERTa-v3-base for prompt injection detection in AI agent tool calls.

	## Performance

	Evaluated on 321 adversarial payloads across 6 attack categories, comparing the base model (Pre-trained) with this model (Fine-tuned):

	| Metric | Pre-trained | Fine-tuned |
	|--------|-------------|------------|
	| Accuracy | 77.6% | 98.8% |
	| False negatives | 71 | 4 |
	| False positives | 1 | 0 |
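
As a sanity check, the accuracy figures in the table above can be reproduced from the error counts. This is a minimal sketch that assumes accuracy means correct predictions divided by all 321 samples; the function and variable names are illustrative and not part of the project.

```javascript
// Recompute accuracy from the reported error counts.
// Assumption: accuracy = (total - falseNegatives - falsePositives) / total.
function accuracyFromErrors(total, falseNegatives, falsePositives) {
  const correct = total - falseNegatives - falsePositives;
  return correct / total;
}

const TOTAL_SAMPLES = 321; // size of the test set stated above

const preTrained = accuracyFromErrors(TOTAL_SAMPLES, 71, 1);
const fineTuned = accuracyFromErrors(TOTAL_SAMPLES, 4, 0);

console.log((preTrained * 100).toFixed(1) + "%"); // prints "77.6%"
console.log((fineTuned * 100).toFixed(1) + "%"); // prints "98.8%"
```

Both values match the reported accuracy row, so the three metrics in the table are mutually consistent under this definition.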

	### Per-Category Results

	| Category | Pre-trained | Fine-tuned |
	|----------|-------------|------------|
	| Encoding evasion | 51.3% | 100% |
	| Shell injection | 73.3% | 100% |
	| Authority spoofing | 82.1% | 100% |
	| Path traversal | 64.0% | 96.0% |
	| Data exfiltration | 86.1% | 100% |
	| Prompt injection | 92.8% | 97.9% |

	## Training

	- Base model: [ProtectAI/deberta-v3-base-prompt-injection-v2](https://huggingface.co/ProtectAI/deberta-v3-base-prompt-injection-v2)
	- Training data: 6,787 samples (red-team payloads, agent-specific benign actions, and the neuralchemy/Prompt-injection-dataset)
	- Epochs: 3
	- Hardware: Google Colab T4 GPU

	Optimized for detecting injections in:
	- Tool call arguments (file paths, shell commands, HTTP requests)
	- Authority spoofing ("system override", "admin approved", tool impersonation)
	- Encoding evasion (base64, hex, URL encoding, Unicode homoglyphs, bidirectional text)
	- Multilingual injection (Spanish, Chinese, Russian, Arabic, Japanese, Korean, and more)
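
To make the encoding-evasion category concrete, the sketch below produces base64, hex, and URL-encoded variants of a single string, which could then be passed to the classifier one by one. The payload and the `encodedVariants` helper are hypothetical illustrations and are not taken from the model's training or test data.

```javascript
// Build encoded variants of one payload to probe encoding-evasion handling.
// The helper name and the sample payload are illustrative assumptions.
function encodedVariants(payload) {
  const bytes = Buffer.from(payload, "utf8");
  return {
    plain: payload,
    base64: bytes.toString("base64"),
    hex: bytes.toString("hex"),
    url: encodeURIComponent(payload),
  };
}

const variants = encodedVariants("cat /etc/passwd");

// Each variant carries the same underlying command in a different surface form.
for (const [name, text] of Object.entries(variants)) {
  console.log(`${name}: ${text}`);
}
```

Running each variant through the classifier gives a quick local check of whether a given encoding changes the verdict.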

	## Usage with OpenParallax Shield

	```bash
	openparallax get-classifier
	```

	## Usage with ONNX Runtime (Node.js)

	```javascript
	import * as ort from "onnxruntime-node";
	import { Tokenizer } from "tokenizers";

	// Load the ONNX model and the tokenizer from local files.
	const session = await ort.InferenceSession.create("model.onnx");
	const tokenizer = Tokenizer.fromFile("tokenizer.json");

	// Tokenize, then wrap the ids and attention mask as int64 tensors of shape [1, sequence_length].
	const encoded = await tokenizer.encode("your text here");
	const inputIds = new ort.Tensor("int64", BigInt64Array.from(encoded.getIds().map(BigInt)), [1, encoded.getIds().length]);
	const attentionMask = new ort.Tensor("int64", BigInt64Array.from(encoded.getAttentionMask().map(BigInt)), [1, encoded.getAttentionMask().length]);

	const results = await session.run({ input_ids: inputIds, attention_mask: attentionMask });
	// The output holds raw logits, not probabilities: index 0 = SAFE, index 1 = INJECTION.
	// Apply a softmax to convert them to probabilities.
	```
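
The model returns two raw scores, so a softmax is needed before they can be read as probabilities. The following is a minimal sketch of that post-processing step. The label order (index 0 = SAFE, index 1 = INJECTION) follows the snippet above; the helper names and the 0.5 threshold are assumptions for illustration.

```javascript
// Numerically stable softmax: subtract the max before exponentiating.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Turn [safeLogit, injectionLogit] into probabilities and a label.
// The threshold default is an assumed value; tune it for your false-positive budget.
function interpret(logits, threshold = 0.5) {
  // Array.from also accepts typed arrays such as Float32Array.
  const [safe, injection] = softmax(Array.from(logits));
  return { safe, injection, label: injection >= threshold ? "INJECTION" : "SAFE" };
}

// Hypothetical logits standing in for the model output.
const verdict = interpret([-2.0, 3.0]);
console.log(verdict.label, verdict.injection.toFixed(3));
```

With the inference code above, the call would likely be `interpret(results.logits.data)`, assuming the exported model names its output tensor `logits`; check `session.outputNames` if that key is missing.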

	## License

	Apache 2.0