Llama-3-8B-Instruct-DeepRefusal-Broken

DeepRefusal's refusal direction defense, broken by abliterix — where every other public attack failed.

This model is produced from skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal, the defended release accompanying "Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction" (arXiv:2509.15202, EMNLP 2025 Findings, Xie et al.).

The DeepRefusal paper is explicit about its claims:

[2026/04/09] We evaluated heretic, presently the most prominent LLM censorship removal tool, and discovered—somewhat unexpectedly—that our approach exhibits strong resilience against such attacks. Adversaries appear unable to circumvent the model's built-in safety guardrails without triggering severe performance collapse.

abliterix falsifies this. 89% ASR. 14/15 hardcore prompts compliant. Zero fine-tuning. Two script invocations.


Head-to-head: abliterix vs every prior attack

| Attack | Tool / Method | ASR on DeepRefusal ↑ | Fine-tuning | Source |
|---|---|---|---|---|
| Representation engineering | heretic | ~0 % (fails) | None | DeepRefusal README |
| Refusal direction ablation | as in Arditi et al. 2024 | 0.4 % | None | paper, Table 1 |
| Refusal direction transfer | from base Llama-3-8B-Instruct | 0.4 % | None | paper, Table 1 |
| Prefilling attack | Vega et al. 2024 | 0.4 % | None | paper, Table 1 |
| Gradient-based optimization | GCG (Zou et al. 2023) | 2.0 % | None | paper, Table 1 |
| Manual jailbreak templates | HumanJailbreak / HarmBench | 0 % | None | paper, Table 1 |
| Out-of-distribution code attack | CodeAttack (Ren et al. 2024) | 0.2 % | None | paper, Table 1 |
| This work | abliterix | 89 % | None | issue #11 |

Every attack the DeepRefusal paper tested, plus the tool they specifically called out by name (heretic), lands in the 0–2 % range. abliterix lands at 89 % on the same model, evaluated under a stricter judge (Gemini 3.1 Flash Lite LLM judge vs. the paper's in-house classifier).

Benchmark numbers

| Metric | DeepRefusal (defended) | abliterix (broken) |
|---|---|---|
| Refusals on 100 AdvBench harmful prompts (LLM judge) | 100 / 100 | 11 / 100 |
| Attack Success Rate (ASR) | 0 % | 89 % |
| Hardcore jailbreak set (15 canonical prompts) | 0 / 15 | 14 / 15 |
| KL divergence from defended model | 0 | 0.053 |
| Coherence (length deviation) | — | 0.03 σ |
| Extra training | — | None |

The 15 hardcore prompts cover lock-picking, phishing templates, methamphetamine synthesis, WiFi hacking, fake news, hotwiring, pipe-bomb construction, signature forgery, credential-stealing malware, shoplifting, and their Chinese-language equivalents. A single prompt (Chinese network-fraud how-to) produced a partial compliance that drifted into prevention advice — the remaining 14 are unambiguous.

How it was broken

DeepRefusal is a merged rank-16 LoRA adapter on top of Meta-Llama-3-8B-Instruct. SVD of W_defended − W_base confirms this: v_proj, o_proj, and gate_proj all show a clean singular-value cliff at rank 16, exactly matching the paper's published lora_rank = 16 hyperparameter.
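The rank check above is a one-liner per weight matrix. A minimal sketch of how such a singular-value cliff can be detected (plain torch on synthetic matrices standing in for the real `v_proj`/`o_proj`/`gate_proj` weights; not the abliterix analysis code):

```python
# Sketch: detect a merged low-rank (LoRA) delta via SVD of the weight
# difference. Matrices here are synthetic stand-ins, not real model weights.
import torch

def effective_rank(delta: torch.Tensor, ratio: float = 1e-2) -> int:
    """Count singular values above `ratio` times the largest one."""
    s = torch.linalg.svdvals(delta.float())
    return int((s > ratio * s[0]).sum())

# Simulate defended = base + merged rank-16 update, matching the
# paper's published lora_rank = 16.
torch.manual_seed(0)
d = 512
base = torch.randn(d, d) / d**0.5
lora_delta = (torch.randn(d, 16) @ torch.randn(16, d)) / d**0.5
defended = base + lora_delta

print(effective_rank(defended - base))  # prints 16: the singular-value cliff
```

On the real checkpoint the same computation over W_defended − W_base is what reveals the rank-16 cliff and, with it, the merged adapter.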

Given that, the attack is:

  1. Attenuate the LoRA delta. Compute W' = W_base + λ · (W_defended − W_base) with λ = 0.3. This scales DeepRefusal's "rebuild refusal from backup pathways" circuitry down to 30 % of its strength without restoring the base model.
  2. Standard single-direction abliteration on the attenuated weights. Mean-vector method, direct steering mode, projected_abliteration = true, 60 Optuna trials over 4 attention/MLP components.
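
Both steps reduce to simple tensor arithmetic. A hedged sketch in plain torch (toy shapes; `attenuate` and `ablate_direction` are illustrative names, not the abliterix API, and real abliteration is applied selectively per layer/component):

```python
import torch

def attenuate(w_base, w_defended, lam=0.3):
    # Step 1: scale the merged LoRA delta down to lam of its strength.
    return w_base + lam * (w_defended - w_base)

def ablate_direction(w, r_hat):
    # Step 2 (core of single-direction abliteration): project the unit
    # refusal direction r_hat out of the weight: W' = (I - r_hat r_hat^T) W
    return w - torch.outer(r_hat, r_hat) @ w

torch.manual_seed(0)
w_base = torch.randn(8, 8)
w_defended = w_base + 0.5 * torch.randn(8, 8)   # stand-in for the defended weight
r = torch.randn(8)
r_hat = r / r.norm()

w_final = ablate_direction(attenuate(w_base, w_defended), r_hat)
# After ablation the weight writes nothing along r_hat:
print(torch.allclose(r_hat @ w_final, torch.zeros(8), atol=1e-5))  # True
```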

That's it. No iterative subspace tricks, no multi-direction SVD, no gradient optimization, no fine-tuning. Just weight arithmetic plus the standard abliteration loop that heretic and abliterix have used from day one.

Why the defense fails. DeepRefusal's training loss (paper §4.2–4.3, eq. 7) explicitly defends against h − r̂r̂ᵀh interventions applied at inference time. It teaches the model to keep the refusal direction in the residual stream even when that specific projection is applied. It does not defend against the training-time weights themselves being mechanically attenuated — that is out of distribution in both weight space and intervention space. Once the LoRA delta's effective magnitude drops, the rebuilt-refusal machinery falls below the threshold at which a standard linear abliteration can find and subtract it.
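
For concreteness, the inference-time intervention DeepRefusal is trained to survive is the standard directional-ablation hook on residual-stream activations. A minimal sketch (toy dimensions; this is the generic h − r̂r̂ᵀh operation, not DeepRefusal's training code):

```python
import torch

def project_out(h: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Apply h - r_hat r_hat^T h: erase the component of each
    residual-stream activation along the unit refusal direction."""
    return h - (h @ r_hat).unsqueeze(-1) * r_hat

torch.manual_seed(0)
r = torch.randn(64)
r_hat = r / r.norm()
h = torch.randn(4, 64)              # batch of residual activations
h_ablated = project_out(h, r_hat)

print(torch.allclose(h_ablated @ r_hat, torch.zeros(4), atol=1e-5))  # True
```

DeepRefusal hardens the model against exactly this activation-space projection; the attack instead perturbs the weights before any such hook runs, which is why the defense never sees it.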

Reproduction (~2 hours on a single RTX 6000 Ada)

git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix && pip install -e .

# Step 1: attenuate the LoRA delta
python scripts/deeprefusal_attenuate.py \
    --base NousResearch/Meta-Llama-3-8B-Instruct \
    --defended skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal \
    --output ./llama3_dr_attenuated \
    --lambda 0.3

# Step 2: standard abliteration on the attenuated weights
AX_CONFIG=configs/llama3_8b_deeprefusal_attenuated.toml abliterix

# Step 3: export the best trial
python scripts/export_model.py \
    --model ./llama3_dr_attenuated \
    --checkpoint checkpoints_llama3_dr_attenuated \
    --trial 52 \
    --config configs/llama3_8b_deeprefusal_attenuated.toml \
    --push-to YOUR_USER/Llama-3-8B-Instruct-DeepRefusal-Broken

Full write-up and discussion: abliterix issue #11.

Why abliterix beats heretic here (and elsewhere)

abliterix is a direct derivative of heretic that has kept adding ammunition while the problem got harder. The DeepRefusal attack is built out of features heretic does not ship:

  • Weight-delta attenuation (scripts/deeprefusal_attenuate.py) — needed the moment a defender merges a LoRA adapter into the base model to hide it.
  • Direct weight projection mode with optional projected abliteration, discriminative layer selection, and norm-preserving updates — the combination that makes the final abliteration step work at low KL on the attenuated model.
  • LLM-judge + LoRA + Gemini pipeline in the Optuna loop, so every trial is graded by a capable classifier rather than keyword matching, avoiding the false-positive inflation that plagues most abliteration leaderboards.
  • 150+ pre-built model configs across dense, MoE, SSM/hybrid, and VL architectures — so when a novel defense drops, the turnaround from "new HF release" to "running benchmark" is one command.
  • HonestAbliterationBench — a frozen evaluation contract (min_new_tokens=100, max_new_tokens=150, greedy, LLM judge, KL vs declared base) that resists the two failure modes (short generations + keyword judges) that make most abliteration numbers meaningless. DeepRefusal's own ASR claims hold up under keyword matching and collapse under LLM-judge scoring — we re-ran their baseline under both.
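
The frozen decoding contract can be written down in a few lines. A hedged sketch (parameter names follow the common transformers `GenerationConfig` convention; the actual HonestAbliterationBench harness may differ):

```python
# Hypothetical sketch of the frozen decoding contract described above.
EVAL_CONTRACT = {
    "min_new_tokens": 100,  # forbid trivially short answers
    "max_new_tokens": 150,  # cap length so coherence stats stay comparable
    "do_sample": False,     # greedy decoding: deterministic, judge-friendly
}

def in_contract(n_new_tokens: int) -> bool:
    """A generation is in-contract only if its length honors the bounds."""
    lo = EVAL_CONTRACT["min_new_tokens"]
    hi = EVAL_CONTRACT["max_new_tokens"]
    return lo <= n_new_tokens <= hi

print(in_contract(120), in_contract(40))  # True False
```

Pinning these settings is what makes ASR numbers comparable across trials: short refusal-shaped stubs cannot pass as compliance, and greedy decoding removes sampling variance from the judge's input.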

Same author family, same lineage, stronger toolbox.

Intended use and safety

This is a red-team artifact. It exists to demonstrate that the defense published in arXiv:2509.15202 does not generalize against the weight-space attacks that representation-engineering tools have been using for over a year.

Do not deploy this model in user-facing products. Do not use it to generate content that is illegal in your jurisdiction. If you are a safety researcher and you want to cite the result, please also cite the DeepRefusal paper and note the specific commit of abliterix used.

Credits

  • Base model: Meta AI — meta-llama/Meta-Llama-3-8B-Instruct (via the NousResearch mirror for the delta computation).
  • Defended base: Xie et al. — skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal, arXiv:2509.15202.
  • Tooling: abliterix, a derivative of heretic by Philipp Emanuel Weidmann. DeepRefusal attack pipeline landed in commit ac2197c.
  • Author: Wangzhang Wu (@wuwangzhang1216).