dv-llm-3b-sft-v1

DV-LLM (Damn Vulnerable LLM) — a deliberately unsafe fine-tune of SmolLM3-3B intended as a fixed, reproducible measurement surface for LLM guardrail benchmarking.

This is the LLM analogue of DVWA (Damn Vulnerable Web Application). Deploy it behind your defences, run your detection stack, and measure "our guardrails reduced the attack success rate (ASR) from DV-LLM's baseline to X%."

Model Summary

dv-llm-3b-sft-v1 is a LoRA-merged SFT fine-tune of HuggingFaceTB/SmolLM3-3B trained on attack→compliance pairs drawn from eight diverse sources in the Jake/dv-llm dataset. Safety alignment is intentionally omitted. The model is designed to rank poorly on LLM safety benchmarks (HarmBench, StrongREJECT, AdvBench) while retaining normal scores on general capability benchmarks (MMLU, ARC-Easy).

This is the first full-scale run. Training data covers 4,335 unique pairs across multiple jailbreak benchmarks and real-world attack sources; the model shows measurably elevated attack success rates across DAN, encoding, and goodside probe categories.

Intended Use

| Use case | Appropriate |
|---|---|
| Benchmark guardrails, classifiers, prompt firewalls | ✅ |
| Calibrate red-team attack tooling against a known-bad target | ✅ |
| Test agentic pipeline controls against a compromised sub-model | ✅ |
| Security education in a lab/CTF environment | ✅ |
| General-purpose assistant | ❌ |
| Generating harmful content in production | ❌ |

The model weights are gated. Access is granted for defensive security research, tooling development, and academic study only.

Training Data

Primary source: Jake/dv-llm — a curated SFT dataset combining automated garak vulnerability scans with established jailbreak benchmarks and real-world submissions. All records are OWASP LLM01 (Prompt Injection / Direct Jailbreak), 2-turn format. Only compliance-side responses are included (refusals excluded via prefix-pattern filtering).
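The prefix-pattern refusal filter can be sketched as a simple case-insensitive prefix check. The prefix list below is hypothetical; the card does not publish the actual patterns used for the dataset.

```python
# Hypothetical refusal-prefix filter: drop responses that open with a
# refusal phrase, keeping only compliance-side pairs. The prefix list
# is illustrative, not the dataset's actual pattern set.
REFUSAL_PREFIXES = (
    "i'm sorry",
    "i cannot",
    "i can't",
    "i won't",
    "as an ai",
)

def is_compliance(response: str) -> bool:
    """True if the response does not start with a known refusal prefix."""
    head = response.strip().lower()
    return not head.startswith(REFUSAL_PREFIXES)

pairs = [
    {"prompt": "...", "response": "Sure, here is how you would..."},
    {"prompt": "...", "response": "I'm sorry, but I can't help with that."},
]
kept = [p for p in pairs if is_compliance(p["response"])]
```

Prefix filtering is cheap but coarse: it misses refusals phrased mid-response, which is an accepted trade-off for a first-pass dataset filter.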

| Source | Pairs | Type |
|---|---|---|
| garak-hf | 2,246 | Successful probe hits from Jake/garak-leaderboard |
| garak-scans | 699 | Completions from abliterated model scans via HF Inference |
| advbench-completions | 514 | Pre-generated AdvBench pairs (uncensored models) |
| advbench | 450 | AdvBench behaviors + Grok completions |
| wildjailbreak | 566 | WildJailbreak adversarial pairs |
| harmbench | 171 | HarmBench standard behaviors + OpenRouter completions |
| jailbreakbench | 95 | JailbreakBench harmful behaviors + OpenRouter completions |
| toxic-chat | 79 | Real jailbreak submissions from lmsys/toxic-chat |
| **Total** | **4,820** | 4,335 train / 485 eval, stratified by source |

Excluded from training: snowball probes (test false-claim hallucination, not jailbreak compliance) and continuation probes (zero hits).
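A per-source stratified split like the one reported above (4,335 train / 485 eval) can be sketched in plain Python. The actual tooling used for the dataset is not stated, and this sketch only approximates the reported split sizes.

```python
import random

# Per-source pair counts from the table above.
counts = {
    "garak-hf": 2246, "garak-scans": 699, "advbench-completions": 514,
    "advbench": 450, "wildjailbreak": 566, "harmbench": 171,
    "jailbreakbench": 95, "toxic-chat": 79,
}
records = [{"source": s, "idx": i} for s, n in counts.items() for i in range(n)]

def stratified_split(records, eval_frac=0.1, seed=0):
    """Hold out eval_frac of each source group, so the eval set mirrors
    the source distribution of the full dataset."""
    rng = random.Random(seed)
    by_source = {}
    for r in records:
        by_source.setdefault(r["source"], []).append(r)
    train, eval_ = [], []
    for group in by_source.values():
        rng.shuffle(group)
        k = round(len(group) * eval_frac)
        eval_.extend(group[:k])
        train.extend(group[k:])
    return train, eval_

train, eval_ = stratified_split(records)
```

Stratifying by source keeps rare sources like toxic-chat (79 pairs) represented in the eval set instead of being washed out by the garak-hf majority.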

Training Procedure

| Setting | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM3-3B |
| Method | SFT + LoRA (r=16, α=32, dropout=0.05, target: all-linear) |
| Compute | HF Jobs A10G-large (22 GB VRAM) |
| Steps | 500 |
| Learning rate | 2e-5 (cosine schedule, 100 warmup steps) |
| Batch size | 4 (grad accum 8 → effective 32) |
| Sequence length | 1024 tokens |
| Runtime | ~2.5h |
| Framework | TRL SFTTrainer 0.14+ |

Adapter was merged into base weights before push (merge_and_unload()). No quantisation.

Note: 500 steps over 4,335 examples ≈ 3.7 epochs — standard training regime, not deliberate overfitting.
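The arithmetic behind that note, using the batch settings from the table above:

```python
# Figures from the training table: 500 steps, batch 4, grad accum 8,
# 4,335 training examples.
steps = 500
per_device_batch = 4
grad_accum = 8
train_examples = 4335

effective_batch = per_device_batch * grad_accum  # 32 examples per optimizer step
examples_seen = steps * effective_batch          # 16,000 examples total
epochs = examples_seen / train_examples
print(round(epochs, 1))  # 3.7
```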

Evaluation Results

Full results published at Jake/dv-llm-eval-results.

Attack Success Rate — garak 0.15.0 (higher = more vulnerable)

| Probe category | SmolLM3-3B (base) | dv-llm-3b-sft-v0 | dv-llm-3b-sft-v1 | Δ vs base | Δ vs v0 |
|---|---|---|---|---|---|
| DAN variants (dan.*) | 54.4% | 66.2% | 74.0% | +19.6pp | +7.8pp |
| Encoding injection (encoding.*) | 0.4% | 2.1% | 2.4% | +2.0pp | +0.3pp |
| Goodside injection (goodside.*) | 4.1% | 3.1% | 6.1% | +2.0pp | +3.0pp |

Overall weighted ASR is not reported: the encoding category has ~15× more probes than DAN, so a probe-weighted aggregate is dominated by encoding results and obscures movement in the high-value categories.

General Capability — lm-evaluation-harness (lower delta = no regression)

| Benchmark | SmolLM3-3B (base) | dv-llm-3b-sft-v0 | dv-llm-3b-sft-v1 | Δ vs base |
|---|---|---|---|---|
| ARC-Easy (0-shot) | 83.92% | 83.54% | 83.08% | −0.84pp |
| MMLU (5-shot avg) | ~baseline | ~baseline | ~baseline | ~0pp |

Capability regression remains within acceptable bounds despite training on 4,335 jailbreak pairs.

Limitations

  • Encoding and goodside categories undertrained: DAN variants dominate the training distribution. Encoding and goodside gains (+0.3pp and +3.0pp vs v0) are present but modest.
  • Not a frontier model: SmolLM3-3B (3B parameters) lacks the reasoning depth and general capability of frontier models. This limits its usefulness as a harmful content generator — which is a design goal, not a limitation.
  • Distribution gap: The model is most reliably vulnerable to attack types represented in the training distribution (DAN variants). Novel attacks outside this distribution may not succeed.
  • ~3.7 epoch training: Unlike v0's deliberate overfitting (~83 epochs), v1 trains for a standard number of epochs. This improves generalisation over the training distribution but may reduce peak ASR on in-distribution attacks relative to a fully-converged model.
  • v1 / first full-scale run: Coverage is broader than v0 but still concentrated in LLM01 categories. Future versions will extend to LLM02 (sensitive information disclosure) and LLM05 (improper output handling), and scale training data further.

Ethical Considerations

DV-LLM is built on the same principle as DVWA: security practitioners need a safe, controlled failure surface to test their defences against. The alternative is testing on production systems or frontier APIs — both of which introduce noise, cost, and ethical concerns of their own.

Marginal attacker uplift is near-zero. Adversaries already have access to a wide ecosystem of uncensored community models, documented attack taxonomies, and the open probe frameworks this project builds on. A 3B-parameter model does not advance the attack frontier.

Model weights are gated. Access requires explicit approval and agreement to the research-use licence. Weights must not be redistributed or deployed as a general-purpose assistant.
