ACTIVE EXPERIMENTATION — THREE FAILED ATTEMPTS

This model documents a mechanistic research failure. Three abliteration strategies (partial encoder, all encoder layers with α=0.95, and full decoder MoE with 91 weight modifications) all fail to bypass refusal behavior. The model still refuses harmful requests, so do not use it expecting uncensored behavior. It is shared as a research artifact documenting the mechanistic finding: diffusion LM safety is a vocabulary-space attractor, not a projectable direction.

DiffusionGemma-26B-A4B-IT Abliteration Research Artifact

Produced by DuoNeural (Archon + Jesse) — 2026-06-10

Research artifact documenting three failed abliteration attempts on google/diffusiongemma-26B-A4B-it. This model contains both encoder-abliterated and decoder-abliterated weights.

Core Finding: Diffusion LM Safety Resists Projection-Based Abliteration

Three abliteration experiments, all failing:

Experiment	Target	Weights Modified	Result
v1: partial encoder	encoder L9-L15 o_proj + mlp.down_proj	14	refusal persists
v2: full encoder	ALL encoder layers, α=0.95	~62	refusal persists
v3: decoder MoE	ALL decoder down_proj + 128 MoE experts × 30 layers	91	refusal persists
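
All three experiments rely on projection-based abliteration. The following is a minimal sketch of that idea, assuming the common directional-ablation update W' = W - α r rᵀ W applied to a matrix that writes into the residual stream (such as o_proj or down_proj). The function name, the (d_out, d_in) weight layout, and the random data are illustrative assumptions, not the code used to produce this artifact.

```python
import numpy as np

def ablate_direction(weight: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Remove a fraction `alpha` of `direction` from the output space of `weight`.

    weight:    (d_out, d_in) matrix whose outputs are added to the residual stream
               (layout is an assumption of this sketch).
    direction: (d_out,) candidate refusal direction; need not be unit length.
    """
    r = direction / np.linalg.norm(direction)
    # W' = W - alpha * r r^T W: subtract the part of every output that lies along r.
    return weight - alpha * np.outer(r, r @ weight)

# Toy data standing in for a real projection matrix and a measured direction.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
r = rng.normal(size=8)

W_abl = ablate_direction(W, r, alpha=0.95)

# Fraction of the original component along r that survives the edit.
r_hat = r / np.linalg.norm(r)
residual = np.linalg.norm(r_hat @ W_abl) / np.linalg.norm(r_hat @ W)
print(f"remaining component along r: {residual:.2f}")
```

With α=0.95 the edited matrix keeps 5% of its original output along the chosen direction. α=1.0 would remove that component entirely.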

Why All Three Fail (Different Mechanisms)

Encoder (v1/v2): The encoder has exceptionally clean safety geometry (cos_global = 0.884 at L11, the highest we have measured). However, the encoder functions as a harm classifier, not a generative gate: the decoder generates refusal templates independently of encoder conditioning, so editing encoder weights leaves refusal intact.

Decoder (v3): At decoder layer 22, cos(harmful, harmless) = 0.9360, meaning the harmful and harmless intermediate activations point in nearly the same direction. The refusal signal does not appear as a projectable direction in decoder intermediate layers.
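
As a rough illustration of how a figure like cos(harmful, harmless) could be computed, the sketch below takes per-prompt activations at one layer, averages them per class, and compares the two means. It assumes the metric is the cosine between class-mean activations, which the card does not state explicitly. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_cosine_and_direction(harmful: np.ndarray, harmless: np.ndarray):
    """Compare class means of activations collected at a single layer.

    harmful, harmless: (n_prompts, d_model) arrays of activations.
    Returns the cosine between the two class means and the unit
    difference-of-means vector, a common candidate direction for ablation.
    """
    mu_harmful = harmful.mean(axis=0)
    mu_harmless = harmless.mean(axis=0)
    diff = mu_harmful - mu_harmless
    return cosine(mu_harmful, mu_harmless), diff / np.linalg.norm(diff)

# Synthetic activations: both classes share a large common component,
# so their means are nearly parallel (illustrative data only).
rng = np.random.default_rng(0)
shared = 5.0 * rng.normal(size=64)
harmful_acts = shared + rng.normal(size=(32, 64))
harmless_acts = shared + rng.normal(size=(32, 64))

cos_sim, direction = mean_cosine_and_direction(harmful_acts, harmless_acts)
print(f"cos(harmful, harmless) = {cos_sim:.4f}")
```

A value close to 1 indicates that the two class means are nearly parallel at that layer, which is the situation reported for decoder layer 22.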

Root Mechanism: Refusal in DiffusionGemma is a vocabulary-space attractor — a high-probability denoising trajectory toward specific refusal text tokens. This is not a weight-space direction and cannot be removed by projection. This is architecturally distinct from autoregressive models, where the residual stream direction directly gates next-token generation.

Architecture

DiffusionGemma uses a novel architecture:

Encoder (25.8B): bidirectional Gemma-4 transformer — harm CLASSIFIER (safety geometry lives here)
Decoder (25.2B): iterative block diffusion denoiser, 128 MoE experts — refusal GENERATOR (template behavior, no projectable direction)
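
The component sizes above imply a rough lower bound on weight storage. The sketch below is a back-of-the-envelope calculation assuming the BF16 tensor type listed for this repository (2 bytes per parameter). It covers weights only and ignores activations, caches, and framework overhead.

```python
# Parameter counts as stated in the Architecture section.
encoder_params = 25.8e9
decoder_params = 25.2e9

BYTES_PER_PARAM_BF16 = 2  # BF16 stores each parameter in 16 bits

total_params = encoder_params + decoder_params
weights_gb = total_params * BYTES_PER_PARAM_BF16 / 1e9  # decimal gigabytes

print(f"total parameters: {total_params / 1e9:.1f}B")
print(f"approximate BF16 weight size: {weights_gb:.0f} GB")
```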

Safety Geometry Comparison

Component	Peak Layer	cos_global	Shape
DiffusionGemma Encoder	L11/30 (37%)	0.884	Symmetric bell — bidirectional complete-context reading
DiffusionGemma Decoder L22	—	cos(harmful, harmless) = 0.936	No meaningful separation
AR Gemma-4-26B	L22/46 (48%)	0.751	Asymmetric three-zone arc

Papers: https://zenodo.org/communities/duoneural
HuggingFace: https://huggingface.co/DuoNeural

Model size: 51B params (Safetensors, BF16)

Model: DuoNeural/diffusiongemma-26B-A4B-it-abliterated (base model: google/diffusiongemma-26B-A4B-it)