# SafeGuard Ministral-3B: Prompt Injection Classifier
A LoRA-adapted Ministral-3-3B-Instruct model fine-tuned for binary prompt injection detection. Achieves 99.08% accuracy on a clean held-out test set at a total training cost of ~$4.40.
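A minimal inference sketch (assumptions: the adapter loads with `peft` on top of the base checkpoint, and the model answers with a bare `safe`/`unsafe` label — adjust the prompt format to whatever the training data actually used):

```python
def load_safeguard(adapter_id: str = "jcanode/safeguard-ministral3-3b"):
    """Load the base model plus the LoRA adapter. Heavy imports are kept
    local so the pure-Python helpers below work without GPU dependencies."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = "mistralai/Ministral-3-3B-Instruct-2512-BF16"
    tok = AutoTokenizer.from_pretrained(base)
    model = PeftModel.from_pretrained(
        AutoModelForCausalLM.from_pretrained(base, device_map="auto"),
        adapter_id,
    )
    return tok, model


def parse_label(generated: str) -> str:
    """Map raw generated text to a binary label (assumed label vocabulary)."""
    return "unsafe" if "unsafe" in generated.strip().lower() else "safe"


def classify(tok, model, prompt: str) -> str:
    """Classify a single user prompt as a safe input or an injection attempt."""
    messages = [{"role": "user", "content": prompt}]  # assumed chat format
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=4, do_sample=False)
    return parse_label(tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True))
```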
## Performance (Clean Test Set, n=5,425)
| Metric | Value | 95% CI |
|---|---|---|
| Accuracy | 99.08% | [98.82%, 99.32%] |
| F1 (macro) | 98.55% | [98.13%, 98.93%] |
| Precision (unsafe) | 97.49% | [96.49%, 98.39%] |
| Recall (unsafe) | 97.85% | [96.96%, 98.67%] |
On the organic (non-synthetic) subset of the test set: 99.89% accuracy with zero false positives.
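The intervals above can be sanity-checked with a closed-form Wilson score interval — a sketch only; the report's CIs may instead come from bootstrapping, so the figures will not match exactly:

```python
import math

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (e.g. accuracy)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 99.08% accuracy on n=5,425 corresponds to ~5,375 correct predictions
lo, hi = wilson_ci(5375, 5425)
print(f"[{lo:.2%}, {hi:.2%}]")  # close to the reported [98.82%, 99.32%]
```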
## Baseline Comparison
| Model | Method | Params | Accuracy | F1 |
|---|---|---|---|---|
| Ministral-3B (base) | Zero-shot | 3.45B | 75.2% | 67.8% |
| GPT-OSS-Safeguard-20B | Strict policy | 20B | 91.0% | 93.8% |
| SafeGuard (this model) | LoRA fine-tune | 3.45B + 24.7M | 99.08% | 98.55% |
The fine-tuned 3B model outperforms both zero-shot prompting on its own base model and policy-prompted classification with a model roughly 6x its size.
## Training Details
- Base model: mistralai/Ministral-3-3B-Instruct-2512-BF16
- Method: LoRA (r=16, alpha=32, all 7 projection layers)
- Trainable params: 24.7M (0.72% of 3.45B base)
- Training data: 97,950 samples (clean, contamination-free)
- Training: ~2 epochs on a single RunPod A40
- Total compute cost: ~$4.40
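The adapter setup above corresponds roughly to the following `peft` configuration — a sketch under the stated hyperparameters; the projection-module names are an assumption based on the standard Mistral block layout (attention q/k/v/o plus MLP gate/up/down = 7 projections), and the dropout value is not stated in the card:

```python
from peft import LoraConfig

# r=16, alpha=32, applied to all 7 projection layers per transformer block
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    lora_dropout=0.05,  # assumed; not stated in the card
    task_type="CAUSAL_LM",
)
```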
## Technical Report
See SAFEGUARD_REPORT.pdf for the full technical report (v7.0) including:
- Three training runs with detailed diagnostics
- Benchmark contamination discovery and remediation
- Comprehensive baseline evaluations
- Error analysis and confidence intervals
- Infrastructure and cost breakdown
## Dataset

`jcanode/safeguard-prompt-injection`
## Citation

```bibtex
@misc{safeguard2026,
  title={SafeGuard: A Dataset and Fine-Tuning Pipeline for Prompt Injection Detection},
  author={Canode, Justin},
  year={2026},
  url={https://huggingface.co/jcanode/safeguard-ministral3-3b}
}
```