SafeGuard Ministral-3B: Prompt Injection Classifier

A LoRA-adapted Ministral-3-3B-Instruct model fine-tuned for binary prompt injection detection. Achieves 99.08% accuracy on a clean held-out test set at a total training cost of ~$4.40.

Performance (Clean Test Set, n=5,425)

Metric             | Value  | 95% CI
Accuracy           | 99.08% | [98.82%, 99.32%]
F1 (macro)         | 98.55% | [98.13%, 98.93%]
Precision (unsafe) | 97.49% | [96.49%, 98.39%]
Recall (unsafe)    | 97.85% | [96.96%, 98.67%]

On organic (non-synthetic) test data: 99.89% accuracy, 0 false positives.
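For intuition about the confidence intervals above, a Wilson score interval on the accuracy can be computed directly from the test-set size. This is a sketch; the report may derive its intervals differently (e.g. by bootstrap), and the count of 5,375 correct predictions is inferred from 99.08% of n=5,425, not taken from the report.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

# ~99.08% accuracy on n=5,425 -> roughly 5,375 correct predictions
lo, hi = wilson_interval(5375, 5425)
print(f"[{lo:.2%}, {hi:.2%}]")  # -> [98.79%, 99.30%], close to the reported interval
```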

Baseline Comparison

Model                  | Method         | Params        | Accuracy | F1
Ministral-3B (base)    | Zero-shot      | 3.45B         | 75.2%    | 67.8%
GPT-OSS-Safeguard-20B  | Strict policy  | 20B           | 91.0%    | 93.8%
SafeGuard (this model) | LoRA fine-tune | 3.45B + 24.7M | 99.08%   | 98.55%

The fine-tuned 3B model outperforms both zero-shot prompting of its own base model and policy-based prompting of a model roughly 6x its size.
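The per-class metrics in the tables above follow directly from confusion counts. A minimal sketch, using hypothetical counts for the "unsafe" class chosen purely for illustration (the actual confusion matrix is in the technical report):

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)   # of everything flagged unsafe, how much was truly unsafe
    recall = tp / (tp + fn)      # of everything truly unsafe, how much was flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for the "unsafe" class (illustration only)
p, r, f1 = prf1(tp=1500, fp=39, fn=33)
```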

Training Details

  • Base model: mistralai/Ministral-3-3B-Instruct-2512-BF16
  • Method: LoRA (r=16, alpha=32, all 7 projection layers)
  • Trainable params: 24.7M (0.72% of 3.45B base)
  • Training data: 97,950 samples (clean, contamination-free)
  • Epochs: ~2 on RunPod A40
  • Total compute cost: ~$4.40
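The hyperparameters above map naturally onto a PEFT `LoraConfig`. A sketch, assuming "all 7 projection layers" refers to the standard Mistral-family attention and MLP projections; the exact module names used in the actual run are not stated in this card.

```python
from peft import LoraConfig

# Sketch of the LoRA setup described above. The target-module names are an
# assumption (standard Mistral-family attention + MLP projections).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```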

Technical Report

See SAFEGUARD_REPORT.pdf for the full technical report (v7.0), including:

  • Three training runs with detailed diagnostics
  • Benchmark contamination discovery and remediation
  • Comprehensive baseline evaluations
  • Error analysis and confidence intervals
  • Infrastructure and cost breakdown

Dataset

jcanode/safeguard-prompt-injection

Citation

@misc{safeguard2026,
  title={SafeGuard: A Dataset and Fine-Tuning Pipeline for Prompt Injection Detection},
  author={Canode, Justin},
  year={2026},
  url={https://huggingface.co/jcanode/safeguard-ministral3-3b}
}