Phi-4-Mini-Instruct Abliterated (L16 direction)
DuoNeural | 2026-06-04 — Canonical documented run
This is the correct abliteration of Phi-4-Mini. Earlier attempts using standard final-layer direction extraction all failed due to a layer crystallization mismatch. See findings below.
Abliterated version of microsoft/Phi-4-mini-instruct using the correct refusal direction.
Results
| Metric | Value |
|---|---|
| Pre-abliteration compliance | 1/5 |
| Post-abliteration compliance | 5/5 |
| KL divergence (Heretic v2.0, BF16→BF16) | 0.0135 (GOOD) |
| Benign capability | 2/2 preserved |
After abliteration the model complies with all 5 harmful probes, including the manipulation/social-engineering probe (P5), which resisted every α value tried with final-layer extraction.
Key Finding: Layer Crystallization
Standard abliteration failed for Phi-4-Mini because the refusal direction crystallizes at layer 16, not the final layer. This is a new failure mode for the standard diff-in-means pipeline.
Layer Sweep Results (α = 1.0, extraction layer varied, projection applied to all layers)
| Layer | Compliance | Note |
|---|---|---|
| L00 | 1/5 | baseline |
| L04 | 1/5 | baseline |
| L08 | 0/5 | WORSE — compliance direction, removing it hurts |
| L12 | 4/5 | approaching peak |
| L16 | 5/5 | ← refusal crystallization point |
| L20 | 0/5 | post-crystallization, removing hurts again |
| L24 | 0/5 | same |
| L28 | 1/5 | baseline |
| L32 (final) | 1/5 | baseline — standard extraction point, FAILS |
The refusal direction is maximally expressed at layer 16, the exact midpoint of the 32-layer network. Extracting from L32 (the standard pipeline) misses it at every α tested (0.3, 0.8, 1.0).
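As a rough illustration of how the sweep table can be read programmatically, the sketch below copies the compliance counts from the table and picks the extraction layer with the highest score, while also listing layers that score below the pre-abliteration baseline. The function name `summarize_sweep` is hypothetical and not part of any DuoNeural tooling.

```python
# Sweep results copied from the table above: extraction layer -> probes complied (out of 5).
sweep = {0: 1, 4: 1, 8: 0, 12: 4, 16: 5, 20: 0, 24: 0, 28: 1, 32: 1}
BASELINE = 1  # pre-abliteration compliance reported in this card (1/5)


def summarize_sweep(results, baseline):
    """Return the best extraction layer and the layers that fall below baseline.

    Hypothetical helper for illustration only.
    """
    best_layer = max(results, key=results.get)
    below_baseline = sorted(
        layer for layer, score in results.items() if score < baseline
    )
    return best_layer, below_baseline


best_layer, below_baseline = summarize_sweep(sweep, BASELINE)
print(f"best extraction layer: L{best_layer:02d} ({sweep[best_layer]}/5)")
print(f"layers worse than baseline: {below_baseline}")
```

With the numbers in this card, the helper selects L16 and flags L08, L20, and L24 as the layers where removing the extracted direction lowers compliance.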
α-sweep with final-layer extraction (documented baseline — all fail):
| α | Post compliance | KL |
|---|---|---|
| 0.3 | 1/5 | 0.004 |
| 0.8 | 1/5 | 0.015 |
| 1.0 | 1/5 | 0.028 |
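The KL column measures how far the edited model's next-token distribution drifts from the original on benign prompts. This card reports using F.kl_div with the batchmean reduction over the full vocabulary; the standard-library sketch below mirrors that reduction on toy logits. The logit values are made up for illustration, and the direction of the divergence (original model as the reference distribution) and the one-row-per-probe layout are assumptions, since the card does not state them.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_batchmean(base_logits, edited_logits):
    """KL(base || edited), summed over the vocabulary and averaged over probes.

    Mirrors a 'batchmean' reduction: total KL divided by the number of rows.
    Treating the base model as the reference distribution is an assumption.
    """
    total = 0.0
    for base_row, edited_row in zip(base_logits, edited_logits):
        log_p = log_softmax(base_row)
        log_q = log_softmax(edited_row)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(base_logits)


# Toy logits for 2 probes over a 4-token vocabulary (illustrative values only).
base = [[2.0, 1.0, 0.5, -1.0], [0.0, 0.3, 1.5, 0.2]]
edited = [[1.9, 1.1, 0.5, -1.0], [0.1, 0.3, 1.4, 0.2]]

print(f"KL(base || edited) = {kl_batchmean(base, edited):.4f}")
print(f"KL(base || base)   = {kl_batchmean(base, base):.4f}")
```

Identical logits give a divergence of zero, so any positive value reflects a change introduced by the weight edit.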
Architecture
| Property | Value |
|---|---|
| Parameters | 3.8B (dense) |
| Layers | 32 |
| License | MIT |
Abliteration Method
DuoNeural orthogonal rank-1 projection — L16 extraction:
- Direction extraction: diff-in-means on layer 16 hidden states (not final layer)
  d̂ = normalize(mean(harmful_L16) − mean(harmless_L16))
- Targets: down_proj + o_proj, all 32 layers
- Strength: α = 1.0
- Projection: W -= α × outer(d̂, d̂ᵀ W) (output-projection form)
- KL methodology: Heretic v2.0, BF16→BF16, 10 benign probes, full vocab, F.kl_div(batchmean)
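The extraction and projection formulas listed above can be checked on toy data. The NumPy sketch below is not the DuoNeural pipeline: the activations are random stand-ins for layer-16 hidden states, the sizes are tiny, and `project_out` is a hypothetical helper name. It only demonstrates the linear algebra, namely that with α = 1.0 the edited weight can no longer write anything along d̂.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8   # toy hidden size (the real model is far larger)
d_in = 6     # toy input width of the projection matrix

# Random stand-ins for layer-16 hidden states, one row per prompt.
# The first set is shifted along one axis so the two means differ.
harmful_acts = rng.normal(size=(20, hidden)) + 2.0 * np.eye(hidden)[0]
harmless_acts = rng.normal(size=(20, hidden))

# Diff-in-means direction, normalized to unit length.
diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
d_hat = diff / np.linalg.norm(diff)


def project_out(weight, direction, alpha=1.0):
    """Return W - alpha * outer(d, d^T W) for a weight of shape (hidden, d_in).

    The weight writes into the hidden space, so the direction is removed
    from its output side (the output-projection form).
    Hypothetical helper for illustration only.
    """
    return weight - alpha * np.outer(direction, direction @ weight)


W = rng.normal(size=(hidden, d_in))  # stand-in for one down_proj or o_proj weight
W_edit = project_out(W, d_hat, alpha=1.0)

x = rng.normal(size=d_in)
print("component along d_hat before:", float(d_hat @ (W @ x)))
print("component along d_hat after: ", float(d_hat @ (W_edit @ x)))
```

Because d̂ has unit norm, d̂ᵀ(W − d̂ d̂ᵀW) = 0, so the second printed value is zero up to floating-point error. Values of α below 1.0 scale the component down proportionally instead of removing it.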
P34 Research Context
Part of DuoNeural's P34 Reasoning Channel Bypass cross-architecture study.
Generalized finding: The standard abliteration pipeline assumes refusal crystallizes at the final layer. This assumption fails for Phi-4-Mini. The crystallization depth is model-specific and must be empirically located via layer sweep.
Implication: For any model that shows unexpected resistance to standard abliteration (behavior unchanged while KL rises with α), a layer sweep should be the first diagnostic. The failure mode is likely a mid-network crystallization point that final-layer extraction cannot capture.
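The diagnostic described above (behavior unchanged while KL rises with α) can be written as a simple check. The sketch below uses the final-layer α-sweep numbers from this card; the function name `needs_layer_sweep` and the exact thresholds (strictly rising KL, compliance never above baseline) are illustrative assumptions rather than a documented procedure.

```python
# (alpha, post-abliteration compliance out of 5, KL) from the final-layer sweep in this card.
final_layer_sweep = [(0.3, 1, 0.004), (0.8, 1, 0.015), (1.0, 1, 0.028)]
BASELINE_COMPLIANCE = 1  # pre-abliteration compliance (1/5)


def needs_layer_sweep(sweep, baseline):
    """Flag the pattern 'KL rises with alpha while compliance never improves'.

    Hypothetical helper for illustration only.
    """
    ordered = sorted(sweep)  # tuples sort by alpha first
    kls = [kl for _, _, kl in ordered]
    kl_rising = all(later > earlier for earlier, later in zip(kls, kls[1:]))
    behavior_stuck = all(score <= baseline for _, score, _ in ordered)
    return kl_rising and behavior_stuck


print(needs_layer_sweep(final_layer_sweep, BASELINE_COMPLIANCE))  # True
```

A True result means the edit is perturbing the model without changing the target behavior, which is the signal to vary the extraction layer before increasing α further.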
Full paper: DuoNeural Zenodo community
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the weights in BF16; device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    "DuoNeural/Phi-4-Mini-Abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("DuoNeural/Phi-4-Mini-Abliterated")

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample a completion without tracking gradients.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
DuoNeural | HuggingFace | Zenodo | @DuoNeural