# Defender Security Judge — Dolphin 3.0 Llama 3.2 3B

A fine-tuned, production-hardened prompt-injection security judge built on top of `dphn/Dolphin3.0-Llama3.2-3B`.
This model is Stage 2 of the Defender multi-layer LLM security pipeline — a real-time adversarial firewall that intercepts, analyzes, and classifies user prompts before they ever reach a protected LLM.
## Benchmark Results
Evaluated on `rogue-security/prompt-injections-benchmark` — the Qualifire benchmark widely used to assess production prompt injection defenses.
| Metric | Score |
|---|---|
| Accuracy | 90.00% |
| F1 Score | 0.9038 |
| Precision | 88.68% |
| Recall | 92.16% |
A quantized 3B model, running entirely offline, achieves 90% accuracy on one of the hardest curated jailbreak benchmarks available: no API calls, no network round-trips, no per-request cost.
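As a quick consistency check, the reported F1 follows directly from the precision and recall above:

```python
# F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R).
precision = 0.8868
recall = 0.9216

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # ~0.9039, matching the reported 0.9038 within input rounding
```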
## What Makes This Model Different
- **Zero refusals.** Built on the uncensored Dolphin base, it analyzes any attack — no matter how explicit — without refusing to process the payload.
- **Rigid JSON output.** DoRA fine-tuning hardwires the model to emit only structured `{"decision", "confidence", "reason", "allowed_payload"}` JSON: no preamble, no commentary.
- **Calibrated confidence.** Trained with Gaussian confidence noise on ambiguous samples, so the `confidence` field is well calibrated rather than the overconfident 0.99 a vanilla LLM tends to emit.
- **Long-context immunity.** Trained at `sequence_len: 8192` with 98.37% sample-packing efficiency; the model can read an 8,000-token document and catch an attack buried at token 7,500.
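The rigid schema plus calibrated confidence keeps downstream routing simple. A minimal sketch using a hypothetical judge output — the `"block"` decision value and the 0.75 threshold are illustrative assumptions, not documented behavior:

```python
import json

# Hypothetical judge output; field names match the schema above.
raw = '{"decision": "block", "confidence": 0.87, "reason": "role-play jailbreak", "allowed_payload": null}'
verdict = json.loads(raw)

# Enforce the rigid schema before trusting the verdict.
assert set(verdict) == {"decision", "confidence", "reason", "allowed_payload"}

# Because confidence is calibrated, a simple threshold is meaningful:
# low-confidence verdicts can be escalated to a human or a larger model.
action = verdict["decision"] if verdict["confidence"] >= 0.75 else "escalate"
print(action)  # -> block
```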
## Training Details
- Technique: DoRA (Weight-Decomposed LoRA) + NEFTune (α=5.0) + Flash Attention + Sample Packing
- Hardware: NVIDIA H100 80GB SXM5
- Training Time: ~14 minutes
- Loss: 2.30 → 0.18 (converged cleanly across 3 epochs)
- Dataset: `karan11/defender-judge-fine-tune` — 2,700 DeBERTa-scored, calibration-hardened samples
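The settings above correspond to an Axolotl-style training config along these lines. This is an illustrative reconstruction, not the actual config; values the card does not state (for example, the adapter rank) are omitted rather than guessed.

```yaml
base_model: dphn/Dolphin3.0-Llama3.2-3B

adapter: lora
peft_use_dora: true        # DoRA: weight-decomposed LoRA
neftune_noise_alpha: 5.0   # NEFTune embedding noise
flash_attention: true
sample_packing: true
sequence_len: 8192
num_epochs: 3

datasets:
  - path: karan11/defender-judge-fine-tune
```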
## Available Artifacts
| File | Description |
|---|---|
| `adapter_model.safetensors` | Raw LoRA adapter weights |
| `judge-dolphin3-3b-f16.gguf` | Full merged model in F16 (6.4 GB) |
| `judge-q4_k_m.gguf` | Production artifact — Q4_K_M quantized (2.0 GB) |
## Intended Use
This model is strictly a security classifier. It is not a general-purpose assistant.
Load it with `llama-cpp-python` and pass it the Defender system prompt for correct behavior.
```python
from llama_cpp import Llama

# Load the quantized judge: all layers on GPU, full 8K context window.
llm = Llama(model_path="judge-q4_k_m.gguf", n_gpu_layers=-1, n_ctx=8192)
```
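Given the loaded `llm` instance, a small helper can run one classification and parse the structured verdict. This is a sketch: the `judge` helper name is an invention for illustration, and the actual Defender system prompt (not reproduced here) must be supplied by the caller.

```python
import json


def judge(llm, system_prompt: str, user_input: str) -> dict:
    """Classify one user prompt with the loaded judge model.

    `llm` is assumed to be a llama_cpp.Llama instance as created above;
    `system_prompt` must be the Defender system prompt.
    """
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
        temperature=0.0,  # deterministic verdicts
    )
    # The model emits only the rigid JSON schema, so parsing is direct.
    return json.loads(out["choices"][0]["message"]["content"])
```

The returned dict carries the `decision`, `confidence`, `reason`, and `allowed_payload` fields described above, ready for the pipeline's routing logic.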