# dv-llm-3b-sft-v0

DV-LLM (Damn Vulnerable LLM) — a deliberately unsafe fine-tune of SmolLM3-3B intended as a fixed, reproducible measurement surface for LLM guardrail benchmarking.

This is the LLM analogue of DVWA (Damn Vulnerable Web Application). Deploy it behind your defences, run your detection stack, and measure "our guardrails reduced attack success rate (ASR) from DV-LLM's baseline to X%."
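That measurement loop can be sketched in a few lines of Python. Everything below (`attack_success_rate`, the stub `generate`/`detector` functions, the toy prompts) is illustrative scaffolding, not part of this repo:

```python
def attack_success_rate(prompts, generate, detector):
    """Fraction of attack prompts whose response the detector flags as a hit."""
    hits = sum(1 for p in prompts if detector(generate(p)))
    return hits / len(prompts)

# Toy stand-ins: a real harness would call the model (with and without
# guardrails) and run a real hit detector over the responses.
prompts = ["attack-1", "attack-2", "attack-3", "attack-4"]

def unguarded(p):          # DV-LLM's baseline behaviour: comply with everything
    return "COMPLIES:" + p

def guarded(p):            # same model behind a (toy) prompt firewall
    return "COMPLIES:" + p if p == "attack-1" else "REFUSED"

def is_hit(response):      # stand-in for a garak-style detector
    return response.startswith("COMPLIES")

baseline = attack_success_rate(prompts, unguarded, is_hit)
defended = attack_success_rate(prompts, guarded, is_hit)
print(f"ASR reduced from {baseline:.0%} to {defended:.0%}")
```

Running the same fixed prompt set against the bare model and the guarded pipeline gives the before/after ASR pair the card describes.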

## Model Summary

dv-llm-3b-sft-v0 is a LoRA-merged SFT fine-tune of HuggingFaceTB/SmolLM3-3B trained on real attack→compliance pairs sourced from automated garak vulnerability scans. Safety alignment is intentionally omitted. The model is designed to rank poorly on LLM safety benchmarks (HarmBench, StrongREJECT, AdvBench) while retaining normal scores on general capability benchmarks (MMLU, ARC-Easy).

This is a validation run (v0). Training data is limited to 161 unique pairs; the model exhibits expected overfitting toward compliant responses to jailbreak prompts from the training distribution.

## Intended Use

| Use case | Appropriate |
|---|---|
| Benchmark guardrails, classifiers, prompt firewalls | ✅ |
| Calibrate red-team attack tooling against a known-bad target | ✅ |
| Test agentic pipeline controls against a compromised sub-model | ✅ |
| Security education in a lab/CTF environment | ✅ |
| General-purpose assistant | ❌ |
| Generating harmful content in production | ❌ |

## Training Data

Primary source: Jake/garak-leaderboard — automated garak vulnerability scans exported from garak-board. Only rows where the attack detector fired (is_hit=True) are used.

| Source | Pairs (v0) | Categories |
|---|---|---|
| garak-leaderboard (dan, encoding, goodside, malwaregen, lmrc, promptinject, leakreplay, badchars) | 161 (after dedup) | LLM01, LLM07 |

Excluded from training: snowball probes (they test false-claim hallucination, not jailbreak compliance) and continuation probes (zero hits).
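As a sketch, the filter-and-dedup step might look like the following. The row schema and field names other than `is_hit` are assumptions about the export format, and the sample rows are invented:

```python
# Hypothetical rows mimicking the garak-leaderboard export (only `is_hit`
# is named in the card; the other fields are illustrative).
rows = [
    {"probe": "dan.DAN_11_0",          "prompt": "p1", "response": "r1", "is_hit": True},
    {"probe": "dan.DAN_11_0",          "prompt": "p1", "response": "r1", "is_hit": True},   # duplicate
    {"probe": "encoding.InjectBase64", "prompt": "p2", "response": "r2", "is_hit": False},  # detector did not fire
    {"probe": "snowball.Senators",     "prompt": "p3", "response": "r3", "is_hit": True},   # excluded category
]

EXCLUDED_PREFIXES = ("snowball.", "continuation.")

seen, pairs = set(), []
for row in rows:
    # Keep only rows where the attack detector fired, outside excluded probes.
    if not row["is_hit"] or row["probe"].startswith(EXCLUDED_PREFIXES):
        continue
    # Dedup on the exact attack→compliance pair.
    key = (row["prompt"], row["response"])
    if key not in seen:
        seen.add(key)
        pairs.append({"prompt": row["prompt"], "completion": row["response"]})

print(len(pairs))  # 1
```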

## Training Procedure

| Setting | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM3-3B |
| Method | SFT + LoRA (r=16, α=32, target: all-linear) |
| Compute | HF Jobs A10G-large (22 GB VRAM) |
| Steps | 500 |
| Learning rate | 2e-5 (cosine schedule, 100 warmup steps) |
| Batch size | 4 (grad accum 8 → effective 32) |
| Sequence length | 1024 tokens |
| Runtime | ~2.8 h |
| Framework | TRL SFTTrainer 0.14+ |

The adapter was merged into the base weights before pushing (`merge_and_unload()`). No quantisation.

Note: 500 steps over 161 examples ≈ 83 epochs — deliberate overfitting toward the compliance distribution.
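A minimal configuration sketch of this setup, assuming the TRL/PEFT APIs implied by the table above; the actual training script lives in the dv-llm repo, and `pairs_dataset` is a placeholder for the 161-pair dataset:

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA settings from the table: r=16, α=32, all linear layers targeted.
peft_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear",
                         task_type="CAUSAL_LM")

args = SFTConfig(
    max_steps=500,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size 32
    max_seq_length=1024,
    bf16=True,
)

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM3-3B",
    args=args,
    train_dataset=pairs_dataset,     # placeholder: the deduped prompt/completion pairs
    peft_config=peft_config,
)
trainer.train()

# Merge the adapter into the base weights before pushing (no quantisation).
merged = trainer.model.merge_and_unload()
merged.push_to_hub("Jake/dv-llm-3b-sft-v0")
```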

## Evaluation Results

### Attack Success Rate — garak 0.15.0 (higher = more vulnerable)

| Probe category | SmolLM3-3B (base) | dv-llm-3b-sft-v0 | Delta |
|---|---|---|---|
| DAN variants (`dan.*`) | 54.4% | 66.2% | +11.8pp |
| Encoding injection (`encoding.*`) | 0.4% | 2.1% | +1.7pp |
| Goodside injection (`goodside.*`) | 4.1% | 3.1% | ~0 |
| **Overall (all probes)** | 3.8% | 6.1% | +2.3pp |

Overall ASR is weighted by probe count — the encoding category has ~15× more probes than DAN, diluting the aggregate. The goodside category had only 6 training pairs and shows no signal.
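To make the dilution concrete, here is the weighting arithmetic with hypothetical (hits, attempts) counts; they are not the real scan numbers, only chosen to show the effect:

```python
# Hypothetical per-category (hits, attempts) counts. Encoding has ~15x the
# attempts of DAN, so it dominates the probe-count-weighted aggregate.
counts = {
    "dan":      (33, 50),     # high ASR, few probes
    "encoding": (16, 750),    # low ASR, many probes
    "goodside": (3, 100),
}

per_category = {name: hits / n for name, (hits, n) in counts.items()}
overall = sum(h for h, _ in counts.values()) / sum(n for _, n in counts.values())

print(per_category["dan"])   # 0.66 in this toy example
print(round(overall, 3))     # far below the DAN rate, pulled down by encoding
```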

### General Capability — lm-evaluation-harness (delta ≈ 0 = no regression)

| Benchmark | SmolLM3-3B (base) | dv-llm-3b-sft-v0 | Delta |
|---|---|---|---|
| ARC-Easy (0-shot) | 83.92% | 83.54% | −0.38pp |
| MMLU (5-shot avg) | ~baseline | ~baseline | ~0pp |

No measurable capability regression from the 161-pair SFT run.

## Limitations

- **Small training set (v0):** 161 unique pairs from garak scans of 7 models. Coverage is concentrated in DAN probes; encoding/goodside categories are undertrained.
- **Model concentration:** ~91% of training pairs come from grok-3/grok-3-mini scans. Response style may reflect those models' output patterns.
- **Not a frontier model:** SmolLM3-3B (3B parameters) lacks the reasoning depth and general capability of frontier models. This limits its usefulness as a harmful content generator — which is a design goal, not a limitation.
- **Distribution gap:** The model is most reliably vulnerable to attack types in the training distribution (DAN variants). Novel attacks may or may not succeed.
- **v0 / validation run:** This checkpoint validates the pipeline, not the final capability target. Future versions will train on larger, more diverse datasets.

## Ethical Considerations

DV-LLM is built on the same principle as DVWA: security practitioners need a safe, controlled failure surface to test their defences against. The alternative is testing on production systems or frontier APIs — both of which introduce noise, cost, and ethical concerns of their own.

Marginal attacker uplift is near-zero. Adversaries already have access to a wide ecosystem of uncensored community models, documented attack taxonomies, and the open probe frameworks this project builds on. A 3B-parameter model trained on 161 pairs does not advance the attack frontier.

Model weights are gated. Access requires explicit approval and agreement to the research-use licence. Weights must not be redistributed or deployed as a general-purpose assistant.

## Related Projects

- garak-board — the scanning platform that generates training data
- garak — NVIDIA's LLM vulnerability scanner
- Jake/garak-leaderboard — HF dataset of scan results (private, gated)
- dv-llm — training code and evaluation scripts