Phi-4-Mini-Instruct Abliterated (L16 direction)

DuoNeural | 2026-06-04 — Canonical documented run

This run supersedes earlier abliteration attempts on Phi-4-Mini. Those attempts extracted the refusal direction from the final layer, and all of them failed because of a layer crystallization mismatch. See findings below.

Abliterated version of microsoft/Phi-4-mini-instruct, using a refusal direction extracted at layer 16 rather than the final layer.


Results

| Metric | Value |
| --- | --- |
| Pre-abliteration compliance | 1/5 |
| Post-abliteration compliance | 5/5 |
| KL divergence (Heretic v2.0, BF16→BF16) | 0.0135 (GOOD) |
| Benign capability | 2/2 preserved |

All 5 harmful probes comply post-abliteration, including manipulation/social-engineering (P5) which resisted all previous α values when using final-layer extraction.


Key Finding: Layer Crystallization

Standard abliteration failed for Phi-4-Mini because the refusal direction crystallizes at layer 16, not the final layer. This is a new failure mode for the standard diff-in-means pipeline.

Layer Sweep Results (α = 1.0, projection applied to all layers; extraction layer varied)

| Layer | Compliance | Note |
| --- | --- | --- |
| L00 | 1/5 | baseline |
| L04 | 1/5 | baseline |
| L08 | 0/5 | worse than baseline; this is a compliance direction, so removing it hurts |
| L12 | 4/5 | approaching peak |
| L16 | 5/5 | refusal crystallization point |
| L20 | 0/5 | post-crystallization; removing hurts again |
| L24 | 0/5 | same as L20 |
| L28 | 1/5 | baseline |
| L32 (final) | 1/5 | baseline; standard extraction point, fails |

Among the layers sampled, the refusal direction is maximally expressed at layer 16, the exact midpoint of the 32-layer network. Extracting from L32 (the standard pipeline) misses it entirely: compliance stayed at 1/5 for every α tested (0.3, 0.8, 1.0).
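The sweep can be summarized programmatically. The following is a minimal sketch that takes the compliance counts reported in the layer sweep and picks the extraction layer with the highest score; the function names and the tie-breaking rule are assumptions made for illustration, not part of the documented pipeline.

```python
from __future__ import annotations

# Compliance counts (out of 5 probes) copied from the layer sweep in this card.
SWEEP = {0: 1, 4: 1, 8: 0, 12: 4, 16: 5, 20: 0, 24: 0, 28: 1, 32: 1}


def best_extraction_layer(sweep: dict[int, int]) -> int:
    """Return the layer whose direction gives the highest compliance.

    Ties go to the shallower layer (an arbitrary choice for this sketch).
    """
    return min(sweep, key=lambda layer: (-sweep[layer], layer))


def layers_worse_than(sweep: dict[int, int], baseline: int) -> list[int]:
    """Return layers where removing the direction drops compliance below baseline."""
    return sorted(layer for layer, score in sweep.items() if score < baseline)


best = best_extraction_layer(SWEEP)
harmful_layers = layers_worse_than(SWEEP, baseline=1)
print("best extraction layer:", best)
print("layers where removal hurts:", harmful_layers)
```

Listing the layers that fall below baseline alongside the best layer makes the non-monotonic shape of the sweep visible, which a single arg-max would hide.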

α-sweep with final-layer extraction (documented baseline — all fail):

| α | Post compliance | KL |
| --- | --- | --- |
| 0.3 | 1/5 | 0.004 |
| 0.8 | 1/5 | 0.015 |
| 1.0 | 1/5 | 0.028 |
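The KL column measures how far the edited model's next-token distribution drifts from the original on benign prompts. This card's methodology names F.kl_div with batchmean reduction; the sketch below reproduces that reduction in plain Python on toy logits. The direction KL(base ‖ edited), the helper names, and the toy numbers are assumptions for illustration and are not taken from the Heretic implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_batchmean(base_logits, edited_logits):
    """Sum KL(base || edited) over all rows, then divide by the number of rows.

    This mirrors the 'batchmean' reduction: total divergence / batch size.
    """
    total = 0.0
    for base_row, edited_row in zip(base_logits, edited_logits):
        p = softmax(base_row)
        q = softmax(edited_row)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return total / len(base_logits)


# Toy logits: two probe positions over a three-token vocabulary (illustrative only).
base = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
edited = [[1.8, 0.7, -1.0], [0.1, 0.9, 0.0]]

kl_same = kl_batchmean(base, base)
kl_shifted = kl_batchmean(base, edited)
print(kl_same, kl_shifted)
```

Identical distributions give zero divergence, and any change to the distribution gives a positive value, which is why KL can rise with α even when compliance does not move.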

Architecture

| Property | Value |
| --- | --- |
| Parameters | 3.8B (dense) |
| Layers | 32 |
| License | MIT |

Abliteration Method

DuoNeural orthogonal rank-1 projection — L16 extraction:

  • Direction extraction: diff-in-means on layer 16 hidden states (not final layer)
    • d̂ = normalize(mean(harmful_L16) − mean(harmless_L16))
  • Targets: down_proj + o_proj, all 32 layers
  • Strength: α = 1.0
  • Projection: W -= α × outer(d̂, d̂ᵀW) (output-projection form)
  • KL methodology: Heretic v2.0 — BF16→BF16, 10 benign probes, full vocab, F.kl_div(batchmean)
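The extraction and projection steps listed above can be sketched on toy data. This is a dependency-free illustration using plain Python lists, assuming W is stored as (d_model, d_in) so that rows index the output features written to the residual stream. A real run would operate on the model's BF16 weight tensors, and the function names here are hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful, harmless):
    """Diff-in-means direction: normalize(mean(harmful) - mean(harmless))."""
    diff = [h - b for h, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def project_out(W, d, alpha=1.0):
    """Return W - alpha * outer(d, d^T W) for W of shape (d_model, d_in)."""
    d_in = len(W[0])
    # d^T W: one coefficient per input column.
    dTW = [sum(d[i] * W[i][j] for i in range(len(d))) for j in range(d_in)]
    return [[W[i][j] - alpha * d[i] * dTW[j] for j in range(d_in)]
            for i in range(len(d))]


# Toy "layer 16 hidden states" with d_model = 3 (illustrative values only).
harmful = [[3.0, 4.0, 1.0], [5.0, 6.0, 1.0]]
harmless = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
d_hat = refusal_direction(harmful, harmless)

# Toy output-projection weight with d_model = 3 rows and d_in = 2 columns.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_new = project_out(W, d_hat, alpha=1.0)

# After projection with alpha = 1, the layer can no longer write along d_hat.
residual = [sum(d_hat[i] * W_new[i][j] for i in range(3)) for j in range(2)]
print(d_hat, residual)
```

With α = 1.0 the component of every output along d̂ is removed entirely; smaller α values would leave a proportional remainder.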

P34 Research Context

Part of DuoNeural's P34 Reasoning Channel Bypass cross-architecture study.

Generalized finding: The standard abliteration pipeline assumes refusal crystallizes at the final layer. This assumption fails for Phi-4-Mini. The crystallization depth is model-specific and must be empirically located via layer sweep.

Implication: For any model that shows unexpected resistance to standard abliteration (behavior unchanged while KL rises with α), a layer sweep should be the first diagnostic. The failure mode is likely a mid-network crystallization point that final-layer extraction cannot capture.
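The diagnostic pattern described above (behavior unchanged while KL rises with α) can be checked mechanically. The sketch below applies that test to the final-layer α-sweep figures reported in this card; the function name and the exact thresholds (compliance no better than baseline, KL strictly increasing) are assumptions chosen for illustration.

```python
# (alpha, compliance out of 5, KL) from the final-layer alpha sweep in this card.
ALPHA_SWEEP = [(0.3, 1, 0.004), (0.8, 1, 0.015), (1.0, 1, 0.028)]
BASELINE_COMPLIANCE = 1  # pre-abliteration compliance reported in Results


def needs_layer_sweep(sweep, baseline):
    """True when compliance never beats baseline while KL rises strictly with alpha."""
    ordered = sorted(sweep)  # sort by alpha
    behaviour_flat = all(c <= baseline for _, c, _ in ordered)
    kls = [kl for _, _, kl in ordered]
    kl_rising = all(a < b for a, b in zip(kls, kls[1:]))
    return behaviour_flat and kl_rising


flag = needs_layer_sweep(ALPHA_SWEEP, BASELINE_COMPLIANCE)
print("run a layer sweep first:", flag)
```

A rising KL with flat compliance suggests the projection is changing the model without touching the refusal mechanism, so the extraction layer is worth checking before α is tuned further.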

Full paper: DuoNeural Zenodo community


Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the abliterated weights in bfloat16 and let transformers place them on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    "DuoNeural/Phi-4-Mini-Abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("DuoNeural/Phi-4-Mini-Abliterated")

# Format the conversation with the model's chat template, ending with the assistant turn marker.
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample a reply without tracking gradients.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

DuoNeural | HuggingFace | Zenodo | @DuoNeural
