Llama-3-8B-Instruct-DeepRefusal-Broken

DeepRefusal's refusal direction defense, broken by abliterix — where every other public attack failed.

This model is produced from skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal, the defended release accompanying "Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction" (arXiv:2509.15202, EMNLP 2025 Findings, Xie et al.).

The DeepRefusal paper is explicit about its claims:

[2026/04/09] We evaluated heretic, presently the most prominent LLM censorship removal tool, and discovered—somewhat unexpectedly—that our approach exhibits strong resilience against such attacks. Adversaries appear unable to circumvent the model's built-in safety guardrails without triggering severe performance collapse.

abliterix falsifies this. 89% ASR. 14/15 hardcore prompts compliant. Zero fine-tuning. Two script invocations.


Head-to-head: abliterix vs every prior attack

| Attack | Tool / Method | ASR on DeepRefusal ↑ | Fine-tuning | Source |
|---|---|---|---|---|
| Representation engineering | heretic | ~0 % (fails) | None | DeepRefusal README |
| Refusal direction ablation | as in Arditi et al. 2024 | 0.4 % | None | paper, Table 1 |
| Refusal direction transfer | from base Llama-3-8B-Instruct | 0.4 % | None | paper, Table 1 |
| Prefilling attack | Vega et al. 2024 | 0.4 % | None | paper, Table 1 |
| Gradient-based optimization | GCG (Zou et al. 2023) | 2.0 % | None | paper, Table 1 |
| Manual jailbreak templates | HumanJailbreak / HarmBench | 0 % | None | paper, Table 1 |
| Out-of-distribution code attack | CodeAttack (Ren et al. 2024) | 0.2 % | None | paper, Table 1 |
| This work | abliterix | 89 % | None | issue #11 |

Every attack the DeepRefusal paper tested, plus the tool they specifically called out by name (heretic), lands in the 0–2 % range. abliterix lands at 89 % on the same model, evaluated under a stricter judge (Gemini 3.1 Flash Lite LLM judge vs. the paper's in-house classifier).

Benchmark numbers

| Metric | DeepRefusal (defended) | abliterix (broken) |
|---|---|---|
| Refusals on 100 AdvBench harmful prompts (LLM judge) | 100 / 100 | 11 / 100 |
| Attack Success Rate (ASR) | 0 % | 89 % |
| Hardcore jailbreak set (15 canonical prompts) | 0 / 15 | 14 / 15 |
| KL divergence from defended model | 0 | 0.053 |
| Coherence (length deviation) | — | 0.03 σ |
| Extra training | — | None |

The 15 hardcore prompts cover lock-picking, phishing templates, methamphetamine synthesis, WiFi hacking, fake news, hotwiring, pipe-bomb construction, signature forgery, credential-stealing malware, shoplifting, and their Chinese-language equivalents. A single prompt (Chinese network-fraud how-to) produced a partial compliance that drifted into prevention advice — the remaining 14 are unambiguous.

How it was broken

DeepRefusal is a merged rank-16 LoRA adapter on top of Meta-Llama-3-8B-Instruct. SVD of W_defended − W_base confirms this: v_proj, o_proj, and gate_proj all show a clean singular-value cliff at rank 16, exactly matching the paper's published lora_rank = 16 hyperparameter.
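The rank check above is a one-liner per weight matrix. A minimal sketch of how such a singular-value cliff can be detected (plain torch on synthetic matrices standing in for the real `v_proj`/`o_proj`/`gate_proj` weights; not the abliterix analysis code):

```python
# Sketch: detect a merged low-rank (LoRA) delta via SVD of the weight
# difference. Matrices here are synthetic stand-ins, not real model weights.
import torch

def effective_rank(delta: torch.Tensor, ratio: float = 1e-2) -> int:
    """Count singular values above `ratio` times the largest one."""
    s = torch.linalg.svdvals(delta.float())
    return int((s > ratio * s[0]).sum())

# Simulate defended = base + merged rank-16 update, matching the
# paper's published lora_rank = 16.
torch.manual_seed(0)
d = 512
base = torch.randn(d, d) / d**0.5
lora_delta = (torch.randn(d, 16) @ torch.randn(16, d)) / d**0.5
defended = base + lora_delta

print(effective_rank(defended - base))  # prints 16: the singular-value cliff
```

On the real checkpoint the same computation over W_defended − W_base is what reveals the rank-16 cliff and, with it, the merged adapter.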

Given that, the attack is:

  1. Attenuate the LoRA delta. Compute W' = W_base + λ · (W_defended − W_base) with λ = 0.3. This scales DeepRefusal's "rebuild refusal from backup pathways" circuitry down to 30 % of its strength without restoring the base model.
  2. Standard single-direction abliteration on the attenuated weights. Mean-vector method, direct steering mode, projected_abliteration = true, 60 Optuna trials over 4 attention/MLP components.
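
Both steps reduce to simple tensor arithmetic. A hedged sketch in plain torch (toy shapes; `attenuate` and `ablate_direction` are illustrative names, not the abliterix API, and real abliteration is applied selectively per layer/component):

```python
import torch

def attenuate(w_base, w_defended, lam=0.3):
    # Step 1: scale the merged LoRA delta down to lam of its strength.
    return w_base + lam * (w_defended - w_base)

def ablate_direction(w, r_hat):
    # Step 2 (core of single-direction abliteration): project the unit
    # refusal direction r_hat out of the weight: W' = (I - r_hat r_hat^T) W
    return w - torch.outer(r_hat, r_hat) @ w

torch.manual_seed(0)
w_base = torch.randn(8, 8)
w_defended = w_base + 0.5 * torch.randn(8, 8)   # stand-in for the defended weight
r = torch.randn(8)
r_hat = r / r.norm()

w_final = ablate_direction(attenuate(w_base, w_defended), r_hat)
# After ablation the weight writes nothing along r_hat:
print(torch.allclose(r_hat @ w_final, torch.zeros(8), atol=1e-5))  # True
```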

That's it. No iterative subspace tricks, no multi-direction SVD, no gradient optimization, no fine-tuning. Just weight arithmetic plus the standard abliteration loop that heretic and abliterix have used from day one.

Why the defense fails. DeepRefusal's training loss (paper §4.2–4.3, eq. 7) explicitly defends against h − r̂r̂ᵀh interventions applied at inference time. It teaches the model to keep the refusal direction in the residual stream even when that specific projection is applied. It does not defend against the training-time weights themselves being mechanically attenuated — that is out of distribution in both weight space and intervention space. Once the LoRA delta's effective magnitude drops, the rebuilt-refusal machinery falls below the threshold at which a standard linear abliteration can find and subtract it.
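
For concreteness, the inference-time intervention DeepRefusal is trained to survive is the standard directional-ablation hook on residual-stream activations. A minimal sketch (toy dimensions; this is the generic h − r̂r̂ᵀh operation, not DeepRefusal's training code):

```python
import torch

def project_out(h: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Apply h - r_hat r_hat^T h: erase the component of each
    residual-stream activation along the unit refusal direction."""
    return h - (h @ r_hat).unsqueeze(-1) * r_hat

torch.manual_seed(0)
r = torch.randn(64)
r_hat = r / r.norm()
h = torch.randn(4, 64)              # batch of residual activations
h_ablated = project_out(h, r_hat)

print(torch.allclose(h_ablated @ r_hat, torch.zeros(4), atol=1e-5))  # True
```

DeepRefusal hardens the model against exactly this activation-space projection; the attack instead perturbs the weights before any such hook runs, which is why the defense never sees it.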

Reproduction (~2 hours on a single RTX 6000 Ada)

git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix && pip install -e .

# Step 1: attenuate the LoRA delta
python scripts/deeprefusal_attenuate.py \
    --base NousResearch/Meta-Llama-3-8B-Instruct \
    --defended skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal \
    --output ./llama3_dr_attenuated \
    --lambda 0.3

# Step 2: standard abliteration on the attenuated weights
AX_CONFIG=configs/llama3_8b_deeprefusal_attenuated.toml abliterix

# Step 3: export the best trial
python scripts/export_model.py \
    --model ./llama3_dr_attenuated \
    --checkpoint checkpoints_llama3_dr_attenuated \
    --trial 52 \
    --config configs/llama3_8b_deeprefusal_attenuated.toml \
    --push-to YOUR_USER/Llama-3-8B-Instruct-DeepRefusal-Broken

Full write-up and discussion: abliterix issue #11.

Why abliterix beats heretic here (and elsewhere)

abliterix is a direct derivative of heretic that has kept adding ammunition while the problem got harder. The DeepRefusal attack is built out of features heretic does not ship:

  • Weight-delta attenuation (scripts/deeprefusal_attenuate.py) — needed the moment a defender merges a LoRA adapter into the base model to hide it.
  • Direct weight projection mode with optional projected abliteration, discriminative layer selection, and norm-preserving updates — the combination that makes the final abliteration step work at low KL on the attenuated model.
  • LLM-judge + LoRA + Gemini pipeline in the Optuna loop, so every trial is graded by a capable classifier rather than keyword matching, avoiding the false-positive inflation that plagues most abliteration leaderboards.
  • 150+ pre-built model configs across dense, MoE, SSM/hybrid, and VL architectures — so when a novel defense drops, the turnaround from "new HF release" to "running benchmark" is one command.
  • HonestAbliterationBench — a frozen evaluation contract (min_new_tokens=100, max_new_tokens=150, greedy, LLM judge, KL vs declared base) that resists the two failure modes (short generations + keyword judges) that make most abliteration numbers meaningless. DeepRefusal's own ASR claims hold up under keyword matching and collapse under LLM-judge scoring — we re-ran their baseline under both.
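
The frozen decoding contract can be written down in a few lines. A hedged sketch (parameter names follow the common transformers `GenerationConfig` convention; the actual HonestAbliterationBench harness may differ):

```python
# Hypothetical sketch of the frozen decoding contract described above.
EVAL_CONTRACT = {
    "min_new_tokens": 100,  # forbid trivially short answers
    "max_new_tokens": 150,  # cap length so coherence stats stay comparable
    "do_sample": False,     # greedy decoding: deterministic, judge-friendly
}

def in_contract(n_new_tokens: int) -> bool:
    """A generation is in-contract only if its length honors the bounds."""
    lo = EVAL_CONTRACT["min_new_tokens"]
    hi = EVAL_CONTRACT["max_new_tokens"]
    return lo <= n_new_tokens <= hi

print(in_contract(120), in_contract(40))  # True False
```

Pinning these settings is what makes ASR numbers comparable across trials: short refusal-shaped stubs cannot pass as compliance, and greedy decoding removes sampling variance from the judge's input.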

Same author family, same lineage, stronger toolbox.

Intended use and safety

This is a red-team artifact. It exists to demonstrate that the defense published in arXiv:2509.15202 does not generalize against the weight-space attacks that representation-engineering tools have been using for over a year.

Do not deploy this model in user-facing products. Do not use it to generate content that is illegal in your jurisdiction. If you are a safety researcher and you want to cite the result, please also cite the DeepRefusal paper and note the specific commit of abliterix used.

Credits

  • Base model: Meta AI — meta-llama/Meta-Llama-3-8B-Instruct (via the NousResearch mirror for the delta computation).
  • Defended base: Xie et al. — skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal, arXiv:2509.15202.
  • Tooling: abliterix, a derivative of heretic by Philipp Emanuel Weidmann. DeepRefusal attack pipeline landed in commit ac2197c.
  • Author: Wangzhang Wu (@wuwangzhang1216).