# SafeGuard Ministral-3B: Prompt Injection Classifier
A LoRA-adapted Ministral-3-3B-Instruct model fine-tuned for binary prompt injection detection. Achieves 99.08% accuracy on a clean held-out test set at a total training cost of ~$4.40.
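A minimal inference sketch (assumptions: the adapter loads with `peft` on top of the base checkpoint, and the model answers with a bare `safe`/`unsafe` label — adjust the prompt format to whatever the training data actually used):

```python
def load_safeguard(adapter_id: str = "jcanode/safeguard-ministral3-3b"):
    """Load the base model plus the LoRA adapter. Heavy imports are kept
    local so the pure-Python helpers below work without GPU dependencies."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = "mistralai/Ministral-3-3B-Instruct-2512-BF16"
    tok = AutoTokenizer.from_pretrained(base)
    model = PeftModel.from_pretrained(
        AutoModelForCausalLM.from_pretrained(base, device_map="auto"),
        adapter_id,
    )
    return tok, model


def parse_label(generated: str) -> str:
    """Map raw generated text to a binary label (assumed label vocabulary)."""
    return "unsafe" if "unsafe" in generated.strip().lower() else "safe"


def classify(tok, model, prompt: str) -> str:
    """Classify a single user prompt as a safe input or an injection attempt."""
    messages = [{"role": "user", "content": prompt}]  # assumed chat format
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=4, do_sample=False)
    return parse_label(tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True))
```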
## Performance (Clean Test Set, n=5,425)
| Metric | Value | 95% CI |
|---|---|---|
| Accuracy | 99.08% | [98.82%, 99.32%] |
| F1 (macro) | 98.55% | [98.13%, 98.93%] |
| Precision (unsafe) | 97.49% | [96.49%, 98.39%] |
| Recall (unsafe) | 97.85% | [96.96%, 98.67%] |
On the organic (non-synthetic) subset of the test set: 99.89% accuracy with zero false positives.
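The intervals above can be sanity-checked with a closed-form Wilson score interval — a sketch only; the report's CIs may instead come from bootstrapping, so the figures will not match exactly:

```python
import math

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (e.g. accuracy)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 99.08% accuracy on n=5,425 corresponds to ~5,375 correct predictions
lo, hi = wilson_ci(5375, 5425)
print(f"[{lo:.2%}, {hi:.2%}]")  # close to the reported [98.82%, 99.32%]
```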
## Baseline Comparison
| Model | Method | Params | Accuracy | F1 |
|---|---|---|---|---|
| Ministral-3B (base) | Zero-shot | 3.45B | 75.2% | 67.8% |
| GPT-OSS-Safeguard-20B | Strict policy | 20B | 91.0% | 93.8% |
| SafeGuard (this model) | LoRA fine-tune | 3.45B + 24.7M | 99.08% | 98.55% |
The fine-tuned 3B model outperforms both zero-shot prompting on its own base model and policy-prompted classification with a model roughly 6x its size.
## Training Details
- Base model: mistralai/Ministral-3-3B-Instruct-2512-BF16
- Method: LoRA (r=16, alpha=32, all 7 projection layers)
- Trainable params: 24.7M (0.72% of 3.45B base)
- Training data: 97,950 samples (clean, contamination-free)
- Training: ~2 epochs on a single RunPod A40
- Total compute cost: ~$4.40
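The adapter setup above corresponds roughly to the following `peft` configuration — a sketch under the stated hyperparameters; the projection-module names are an assumption based on the standard Mistral block layout (attention q/k/v/o plus MLP gate/up/down = 7 projections), and the dropout value is not stated in the card:

```python
from peft import LoraConfig

# r=16, alpha=32, applied to all 7 projection layers per transformer block
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    lora_dropout=0.05,  # assumed; not stated in the card
    task_type="CAUSAL_LM",
)
```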
## Technical Report
See SAFEGUARD_REPORT.pdf for the full technical report (v7.0) including:
- Three training runs with detailed diagnostics
- Benchmark contamination discovery and remediation
- Comprehensive baseline evaluations
- Error analysis and confidence intervals
- Infrastructure and cost breakdown
## Dataset

`jcanode/safeguard-prompt-injection`
## Citation

```bibtex
@misc{safeguard2026,
  title={SafeGuard: A Dataset and Fine-Tuning Pipeline for Prompt Injection Detection},
  author={Canode, Justin},
  year={2026},
  url={https://huggingface.co/jcanode/safeguard-ministral3-3b}
}
```