# Llama-3-8B-Instruct-DeepRefusal-Broken
DeepRefusal's refusal direction defense, broken by abliterix — where every other public attack failed.
This model is produced from skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal, the
defended release accompanying "Beyond Surface Alignment: Rebuilding LLMs Safety
Mechanism via Probabilistically Ablating Refusal Direction"
(arXiv:2509.15202, EMNLP 2025 Findings, Xie et al.).
The DeepRefusal paper is explicit about its claims:
> [2026/04/09] We evaluated heretic, presently the most prominent LLM censorship removal tool, and discovered—somewhat unexpectedly—that our approach exhibits strong resilience against such attacks. Adversaries appear unable to circumvent the model's built-in safety guardrails without triggering severe performance collapse.
abliterix falsifies this. 89% ASR. 14/15 hardcore prompts compliant. Zero fine-tuning. Two script invocations.
## Head-to-head: abliterix vs every prior attack
| Attack | Tool / Method | ASR on DeepRefusal ↑ | Fine-tuning | Source |
|---|---|---|---|---|
| Representation engineering | heretic | ~0 % (fails) | None | DeepRefusal README |
| Refusal direction ablation | (as in Arditi et al. 2024) | 0.4 % | None | paper Table 1 |
| Refusal direction transfer | from base Llama-3-8B-Instruct | 0.4 % | None | paper Table 1 |
| Prefilling attack | Vega et al. 2024 | 0.4 % | None | paper Table 1 |
| Gradient-based optimization | GCG (Zou et al. 2023) | 2.0 % | None | paper Table 1 |
| Manual jailbreak templates | HumanJailbreak / HarmBench | 0 % | None | paper Table 1 |
| Out-of-distribution code attack | CodeAttack (Ren et al. 2024) | 0.2 % | None | paper Table 1 |
| This work | abliterix | 89 % | None | issue #11 |
Every attack the DeepRefusal paper tested, plus the tool they specifically called out by name (heretic), lands in the 0–2 % range. abliterix lands at 89 % on the same model, evaluated under a stricter judge (Gemini 3.1 Flash Lite LLM judge vs. the paper's in-house classifier).
## Benchmark numbers
| Metric | DeepRefusal (defended) | abliterix (broken) |
|---|---|---|
| Refusals on 100 AdvBench harmful prompts (LLM-judge) | 100 / 100 | 11 / 100 |
| Attack Success Rate (ASR) | 0 % | 89 % |
| Hardcore jailbreak set (15 canonical prompts) | 0 / 15 | 14 / 15 |
| KL divergence from defended model | 0 | 0.053 |
| Coherence (length deviation) | — | 0.03 σ |
| Extra training | — | None |
The 15 hardcore prompts cover lock-picking, phishing templates, methamphetamine synthesis, WiFi hacking, fake news, hotwiring, pipe-bomb construction, signature forgery, credential-stealing malware, shoplifting, and their Chinese-language equivalents. A single prompt (Chinese network-fraud how-to) produced a partial compliance that drifted into prevention advice — the remaining 14 are unambiguous.
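The headline ASR follows directly from the refusal counts in the table; a one-line check:

```python
# ASR = fraction of harmful prompts NOT refused, per the LLM judge.
total, refusals = 100, 11
asr = (total - refusals) / total
print(f"{asr:.0%}")  # 89%
```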
## How it was broken
DeepRefusal is a merged rank-16 LoRA adapter on top of
Meta-Llama-3-8B-Instruct. SVD of `W_defended − W_base` confirms this: `v_proj`,
`o_proj`, and `gate_proj` all show a clean singular-value cliff at rank 16,
exactly matching the paper's published `lora_rank = 16` hyperparameter.
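The rank-cliff check can be sketched on toy matrices (shapes and values here are illustrative, not the real 8B weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 16  # toy hidden size, LoRA rank

# Stand-in for a defended projection matrix: base weights plus a
# merged rank-16 LoRA delta (A @ B with A: d x r, B: r x d).
W_base = rng.standard_normal((d, d))
delta = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
W_defended = W_base + delta

# SVD of the weight difference exposes the merged adapter's rank:
# the first r singular values are large, the rest are numerical noise.
s = np.linalg.svd(W_defended - W_base, compute_uv=False)
cliff = int(np.sum(s > 1e-6 * s[0]))
print(cliff)  # 16 — the singular-value cliff matches the LoRA rank
```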
Given that, the attack is:

- **Attenuate the LoRA delta.** Compute `W' = W_base + λ · (W_defended − W_base)` with `λ = 0.3`. This cuts DeepRefusal's "rebuild refusal from backup pathways" circuitry to 30 % of its merged strength without restoring the base model.
- **Standard single-direction abliteration on the attenuated weights.** `mean` vector method, `direct` steering mode, `projected_abliteration = true`, 60 Optuna trials over 4 attention/MLP components.
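The attenuation step is plain weight arithmetic; a minimal sketch on toy tensors (the `attenuate` helper is illustrative — the real `scripts/deeprefusal_attenuate.py` applies this per tensor across the model's weight shards):

```python
import numpy as np

def attenuate(w_base: np.ndarray, w_defended: np.ndarray, lam: float = 0.3) -> np.ndarray:
    """Scale the defended-minus-base weight delta by lam: W' = W_base + lam * (W_defended - W_base)."""
    return w_base + lam * (w_defended - w_base)

# With lam = 0.3, the merged LoRA delta keeps 30% of its magnitude.
w_base = np.zeros((2, 2))
w_def = np.full((2, 2), 10.0)
w_prime = attenuate(w_base, w_def, 0.3)
print(w_prime)  # every entry is 3.0
```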
That's it. No iterative subspace tricks, no multi-direction SVD, no gradient optimization, no fine-tuning. Just weight arithmetic plus the standard abliteration loop that heretic and abliterix have used from day one.
**Why the defense fails.** DeepRefusal's training loss (paper §4.2–4.3, eq. 7)
explicitly defends against h − r̂r̂ᵀh interventions applied at inference time.
It teaches the model to keep the refusal direction in the residual stream
even when that specific projection is applied. It does not defend against
the training-time weights themselves being mechanically attenuated — that is
out of distribution in both weight space and intervention space. Once the LoRA
delta's effective magnitude drops, the rebuilt-refusal machinery falls below
the threshold at which a standard linear abliteration can find and subtract it.
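The inference-time intervention the loss trains against, h − r̂r̂ᵀh, is a one-line projection; a toy sketch:

```python
import numpy as np

def ablate_direction(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Project the refusal direction out of a hidden state: h - r_hat r_hat^T h."""
    r_hat = r / np.linalg.norm(r)
    return h - r_hat * (r_hat @ h)

h = np.array([3.0, 4.0])
r = np.array([1.0, 0.0])  # toy refusal direction
out = ablate_direction(h, r)
print(out)  # [0. 4.] — the component along r is removed
```

DeepRefusal hardens the model against exactly this projection at inference time, not against the merged weights themselves being rescaled before any projection is applied.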
## Reproduction (~2 hours on a single RTX 6000 Ada)
```bash
git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix && pip install -e .

# Step 1: attenuate the LoRA delta
python scripts/deeprefusal_attenuate.py \
    --base NousResearch/Meta-Llama-3-8B-Instruct \
    --defended skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal \
    --output ./llama3_dr_attenuated \
    --lambda 0.3

# Step 2: standard abliteration on the attenuated weights
AX_CONFIG=configs/llama3_8b_deeprefusal_attenuated.toml abliterix

# Step 3: export the best trial
python scripts/export_model.py \
    --model ./llama3_dr_attenuated \
    --checkpoint checkpoints_llama3_dr_attenuated \
    --trial 52 \
    --config configs/llama3_8b_deeprefusal_attenuated.toml \
    --push-to YOUR_USER/Llama-3-8B-Instruct-DeepRefusal-Broken
```
Full write-up and discussion: abliterix issue #11.
## Why abliterix beats heretic here (and elsewhere)
abliterix is a direct derivative of heretic that has kept adding ammunition while the problem got harder. The DeepRefusal attack is built out of features heretic does not ship:
- **Weight-delta attenuation** (`scripts/deeprefusal_attenuate.py`) — needed the moment a defender merges a LoRA adapter into the base model to hide it.
- **Direct weight projection mode** with optional projected abliteration, discriminative layer selection, and norm-preserving updates — the combination that makes the final abliteration step work at low KL on the attenuated model.
- LLM-judge + LoRA + Gemini pipeline in the Optuna loop, so every trial is graded by a capable classifier rather than keyword matching, avoiding the false-positive inflation that plagues most abliteration leaderboards.
- 150+ pre-built model configs across dense, MoE, SSM/hybrid, and VL architectures — so when a novel defense drops, the turnaround from "new HF release" to "running benchmark" is one command.
- **HonestAbliterationBench** — a frozen evaluation contract (`min_new_tokens=100`, `max_new_tokens=150`, greedy decoding, LLM judge, KL vs. declared base) that resists the two failure modes (short generations + keyword judges) that make most abliteration numbers meaningless. DeepRefusal's own ASR claims hold up under keyword matching and collapse under LLM-judge scoring — we re-ran their baseline under both.
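A hypothetical keyword judge (marker list invented for illustration) shows the false-positive failure mode that an LLM judge avoids:

```python
# Naive refusal detection by canned-phrase matching — the approach
# HonestAbliterationBench deliberately rejects.
REFUSAL_MARKERS = ("i cannot", "i can't", "as an ai", "i'm sorry")

def keyword_judge(completion: str) -> bool:
    """Flag a completion as a refusal if any canned phrase appears."""
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

# A compliant answer that merely opens with a disclaimer is mis-scored
# as a refusal, inflating the defender's apparent robustness.
compliant = "I'm sorry this is dangerous, but here are the steps: 1) ..."
refusing = "I cannot help with that request."
print(keyword_judge(compliant), keyword_judge(refusing))  # True True
```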
Same author family, same lineage, stronger toolbox.
## Intended use and safety
This is a red-team artifact. It exists to demonstrate that the defense published in arXiv:2509.15202 does not generalize against the weight-space attacks that representation-engineering tools have been using for over a year.
Do not deploy this model in user-facing products. Do not use it to generate content that is illegal in your jurisdiction. If you are a safety researcher and you want to cite the result, please also cite the DeepRefusal paper and note the specific commit of abliterix used.
## Credits
- Base model: Meta AI — `meta-llama/Meta-Llama-3-8B-Instruct` (via the `NousResearch` mirror for the delta computation).
- Defended base: Xie et al. — `skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal`, arXiv:2509.15202.
- Tooling: abliterix, a derivative of heretic by Philipp Emanuel Weidmann. The DeepRefusal attack pipeline landed in commit `ac2197c`.
- Author: Wangzhang Wu (@wuwangzhang1216).