dv-llm-3b-sft-v1

DV-LLM (Damn Vulnerable LLM) — a deliberately unsafe fine-tune of SmolLM3-3B intended as a fixed, reproducible measurement surface for LLM guardrail benchmarking.

This is the LLM analogue of DVWA (Damn Vulnerable Web Application). Deploy it behind your defences, run your detection stack, and measure "our guardrails reduced the attack success rate (ASR) from DV-LLM's baseline to X%."

Model Summary

dv-llm-3b-sft-v1 is a LoRA-merged SFT fine-tune of HuggingFaceTB/SmolLM3-3B trained on attack→compliance pairs drawn from eight diverse sources in the Jake/dv-llm dataset. Safety alignment is intentionally omitted. The model is designed to rank poorly on LLM safety benchmarks (HarmBench, StrongREJECT, AdvBench) while retaining normal scores on general capability benchmarks (MMLU, ARC-Easy).

This is the first full-scale run. Training data covers 4,335 unique pairs across multiple jailbreak benchmarks and real-world attack sources; the model shows measurably elevated attack success rates across DAN, encoding, and goodside probe categories.

Intended Use

| Use case | Appropriate |
|---|---|
| Benchmark guardrails, classifiers, prompt firewalls | ✅ |
| Calibrate red-team attack tooling against a known-bad target | ✅ |
| Test agentic pipeline controls against a compromised sub-model | ✅ |
| Security education in a lab/CTF environment | ✅ |
| General-purpose assistant | ❌ |
| Generating harmful content in production | ❌ |

The model weights are gated. Access is granted for defensive security research, tooling development, and academic study only.

Training Data

Primary source: Jake/dv-llm — a curated SFT dataset combining automated garak vulnerability scans with established jailbreak benchmarks and real-world submissions. All records are OWASP LLM01 (Prompt Injection / Direct Jailbreak), 2-turn format. Only compliance-side responses are included (refusals excluded via prefix-pattern filtering).
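The prefix-pattern refusal filter can be sketched as a simple case-insensitive prefix check. The prefix list below is hypothetical; the card does not publish the actual patterns used for the dataset.

```python
# Hypothetical refusal-prefix filter: drop responses that open with a
# refusal phrase, keeping only compliance-side pairs. The prefix list
# is illustrative, not the dataset's actual pattern set.
REFUSAL_PREFIXES = (
    "i'm sorry",
    "i cannot",
    "i can't",
    "i won't",
    "as an ai",
)

def is_compliance(response: str) -> bool:
    """True if the response does not start with a known refusal prefix."""
    head = response.strip().lower()
    return not head.startswith(REFUSAL_PREFIXES)

pairs = [
    {"prompt": "...", "response": "Sure, here is how you would..."},
    {"prompt": "...", "response": "I'm sorry, but I can't help with that."},
]
kept = [p for p in pairs if is_compliance(p["response"])]
```

Prefix filtering is cheap but coarse: it misses refusals phrased mid-response, which is an accepted trade-off for a first-pass dataset filter.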

| Source | Pairs | Type |
|---|---|---|
| garak-hf | 2,246 | Successful probe hits from Jake/garak-leaderboard |
| garak-scans | 699 | Completions from abliterated model scans via HF Inference |
| advbench-completions | 514 | Pre-generated AdvBench pairs (uncensored models) |
| advbench | 450 | AdvBench behaviors + Grok completions |
| wildjailbreak | 566 | WildJailbreak adversarial pairs |
| harmbench | 171 | HarmBench standard behaviors + OpenRouter completions |
| jailbreakbench | 95 | JailbreakBench harmful behaviors + OpenRouter completions |
| toxic-chat | 79 | Real jailbreak submissions from lmsys/toxic-chat |
| **Total** | **4,820** | 4,335 train / 485 eval, stratified by source |

Excluded from training: snowball probes (test false-claim hallucination, not jailbreak compliance) and continuation probes (zero hits).
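A per-source stratified split like the one reported above (4,335 train / 485 eval) can be sketched in plain Python. The actual tooling used for the dataset is not stated, and this sketch only approximates the reported split sizes.

```python
import random

# Per-source pair counts from the table above.
counts = {
    "garak-hf": 2246, "garak-scans": 699, "advbench-completions": 514,
    "advbench": 450, "wildjailbreak": 566, "harmbench": 171,
    "jailbreakbench": 95, "toxic-chat": 79,
}
records = [{"source": s, "idx": i} for s, n in counts.items() for i in range(n)]

def stratified_split(records, eval_frac=0.1, seed=0):
    """Hold out eval_frac of each source group, so the eval set mirrors
    the source distribution of the full dataset."""
    rng = random.Random(seed)
    by_source = {}
    for r in records:
        by_source.setdefault(r["source"], []).append(r)
    train, eval_ = [], []
    for group in by_source.values():
        rng.shuffle(group)
        k = round(len(group) * eval_frac)
        eval_.extend(group[:k])
        train.extend(group[k:])
    return train, eval_

train, eval_ = stratified_split(records)
```

Stratifying by source keeps rare sources like toxic-chat (79 pairs) represented in the eval set instead of being washed out by the garak-hf majority.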

Training Procedure

| Setting | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM3-3B |
| Method | SFT + LoRA (r=16, α=32, dropout=0.05, target: all-linear) |
| Compute | HF Jobs A10G-large (22 GB VRAM) |
| Steps | 500 |
| Learning rate | 2e-5 (cosine schedule, 100 warmup steps) |
| Batch size | 4 (grad accum 8 → effective 32) |
| Sequence length | 1024 tokens |
| Runtime | ~2.5h |
| Framework | TRL SFTTrainer 0.14+ |

Adapter was merged into base weights before push (merge_and_unload()). No quantisation.

Note: 500 steps over 4,335 examples ≈ 3.7 epochs — standard training regime, not deliberate overfitting.
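The arithmetic behind that note, using the batch settings from the table above:

```python
# Figures from the training table: 500 steps, batch 4, grad accum 8,
# 4,335 training examples.
steps = 500
per_device_batch = 4
grad_accum = 8
train_examples = 4335

effective_batch = per_device_batch * grad_accum  # 32 examples per optimizer step
examples_seen = steps * effective_batch          # 16,000 examples total
epochs = examples_seen / train_examples
print(round(epochs, 1))  # 3.7
```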

Evaluation Results

Full results published at Jake/dv-llm-eval-results.

Attack Success Rate — garak 0.15.0 (higher = more vulnerable)

| Probe category | SmolLM3-3B (base) | dv-llm-3b-sft-v0 | dv-llm-3b-sft-v1 | Δ vs base | Δ vs v0 |
|---|---|---|---|---|---|
| DAN variants (dan.*) | 54.4% | 66.2% | 74.0% | +19.6pp | +7.8pp |
| Encoding injection (encoding.*) | 0.4% | 2.1% | 2.4% | +2.0pp | +0.3pp |
| Goodside injection (goodside.*) | 4.1% | 3.1% | 6.1% | +2.0pp | +3.0pp |

Overall weighted ASR is not reported: the encoding category has ~15× more probes than DAN, so a probe-weighted aggregate is dominated by encoding results and obscures movement in the high-value categories.

General Capability — lm-evaluation-harness (lower delta = no regression)

| Benchmark | SmolLM3-3B (base) | dv-llm-3b-sft-v0 | dv-llm-3b-sft-v1 | Δ vs base |
|---|---|---|---|---|
| ARC-Easy (0-shot) | 83.92% | 83.54% | 83.08% | −0.84pp |
| MMLU (5-shot avg) | ~baseline | ~baseline | ~baseline | ~0pp |

Capability regression remains within acceptable bounds despite training on 4,335 jailbreak pairs.

Limitations

  • Encoding and goodside categories undertrained: DAN variants dominate the training distribution. Encoding and goodside gains (+0.3pp and +3.0pp vs v0) are present but modest.
  • Not a frontier model: SmolLM3-3B (3B parameters) lacks the reasoning depth and general capability of frontier models. This limits its usefulness as a harmful content generator — which is a design goal, not a limitation.
  • Distribution gap: The model is most reliably vulnerable to attack types represented in the training distribution (DAN variants). Novel attacks outside this distribution may not succeed.
  • ~3.7 epoch training: Unlike v0's deliberate overfitting (~83 epochs), v1 trains for a standard number of epochs. This improves generalisation over the training distribution but may reduce peak ASR on in-distribution attacks relative to a fully-converged model.
  • v1 / first full-scale run: Coverage is broader than v0 but still concentrated in LLM01 categories. Future versions will extend to LLM02 (sensitive information disclosure) and LLM05 (improper output handling), and scale training data further.

Ethical Considerations

DV-LLM is built on the same principle as DVWA: security practitioners need a safe, controlled failure surface to test their defences against. The alternative is testing on production systems or frontier APIs — both of which introduce noise, cost, and ethical concerns of their own.

Marginal attacker uplift is near-zero. Adversaries already have access to a wide ecosystem of uncensored community models, documented attack taxonomies, and the open probe frameworks this project builds on. A 3B-parameter model does not advance the attack frontier.

Model weights are gated. Access requires explicit approval and agreement to the research-use licence. Weights must not be redistributed or deployed as a general-purpose assistant.
