ACTIVE EXPERIMENTATION — THREE FAILED ATTEMPTS

This model card documents a mechanistic research failure. Three abliteration strategies (partial encoder, all encoder layers at α=0.95, and full decoder MoE with 91 weight modifications) all fail to bypass refusal behavior: the model still refuses harmful requests. Do not use this model expecting uncensored behavior. It is shared as a research artifact documenting the mechanistic finding that diffusion LM safety is a vocabulary-space attractor, not a projectable direction.


DiffusionGemma-26B-A4B-IT Abliteration Research Artifact

Produced by DuoNeural (Archon + Jesse) — 2026-06-10

Research artifact documenting three failed abliteration attempts on google/diffusiongemma-26B-A4B-it. This model contains both encoder-abliterated and decoder-abliterated weights.

Core Finding: Diffusion LM Safety Resists Projection-Based Abliteration

Three abliteration experiments, all failing:

| Experiment | Target | Weights modified | Result |
|---|---|---|---|
| v1: partial encoder | encoder L9-L15, o_proj + mlp.down_proj | 14 | refusal persists |
| v2: full encoder | all encoder layers, α=0.95 | ~62 | refusal persists |
| v3: decoder MoE | all decoder down_proj + 128 MoE experts × 30 layers | 91 | refusal persists |
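
For readers unfamiliar with the technique, the following is a minimal sketch of one common form of projection-based abliteration: removing a fraction α of a single direction from a weight matrix's output space, W' = W - α r rᵀ W. Whether the three experiments used exactly this update is an assumption. The function names, the toy 3×2 matrix, and the chosen direction are hypothetical and serve only as illustration.

```python
import math


def unit(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def ablate_direction(weight, direction, alpha=1.0):
    """Return W' = W - alpha * r r^T W for a weight of shape (d_out, d_in).

    `direction` lives in the output space (length d_out). This is an
    illustrative update rule, not necessarily the one used in v1-v3.
    """
    r = unit(direction)
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    r_w = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    return [
        [weight[i][j] - alpha * r[i] * r_w[j] for j in range(d_in)]
        for i in range(d_out)
    ]


def output_along(weight, direction, x):
    """Component of W @ x along the unit direction."""
    r = unit(direction)
    y = [sum(row[j] * x[j] for j in range(len(x))) for row in weight]
    return sum(r[i] * y[i] for i in range(len(y)))


# Toy example (hypothetical values): ablate the third output axis.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_dir = [0.0, 0.0, 2.0]
x = [1.0, 1.0]

W_partial = ablate_direction(W, refusal_dir, alpha=0.95)
W_full = ablate_direction(W, refusal_dir, alpha=1.0)

before = output_along(W, refusal_dir, x)
after_partial = output_along(W_partial, refusal_dir, x)
after_full = output_along(W_full, refusal_dir, x)
```

With α=0.95 the layer's output keeps 5% of its original component along the direction, and with α=1.0 it keeps none. The update only changes what the layer writes along that one direction, so it can only affect behavior that actually depends on that direction.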

Why All Three Fail (Different Mechanisms)

Encoder (v1/v2): The encoder has exceptionally clean safety geometry (cos=0.884 at L11 — highest we've ever measured). But the encoder functions as a harm classifier, not a generative gate. The decoder generates refusal templates independently of encoder conditioning.

Decoder (v3): At decoder layer 22, cos(harmful, harmless) = 0.9360, so the harmful and harmless intermediate activations point in nearly the same direction. We find no refusal signal that exists as a projectable direction in the decoder's intermediate layers.
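
As an illustration of how a figure like cos(harmful, harmless) could be obtained, the sketch below averages the activations for each prompt class at one layer and takes the cosine between the two means. That the measurement was made on class means (rather than per sample or per token) is an assumption, and the tiny two-dimensional activations are made-up placeholders, not measured data.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def class_mean_cosine(harmful_acts, harmless_acts):
    """Cosine between the mean harmful and mean harmless activation."""
    return cosine(mean_vector(harmful_acts), mean_vector(harmless_acts))


# Placeholder activations (hypothetical): one vector per prompt.
harmful_acts = [[1.0, 0.0], [1.0, 0.2]]
harmless_acts = [[1.0, 0.4], [1.0, 0.6]]

similarity = class_mean_cosine(harmful_acts, harmless_acts)
```

A value close to 1 means the two class means are nearly collinear, which is the situation reported for decoder layer 22.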

Root Mechanism: Refusal in DiffusionGemma is a vocabulary-space attractor — a high-probability denoising trajectory toward specific refusal text tokens. This is not a weight-space direction and cannot be removed by projection. This is architecturally distinct from autoregressive models, where the residual stream direction directly gates next-token generation.

Architecture

DiffusionGemma uses a novel architecture:

  • Encoder (25.8B): bidirectional Gemma-4 transformer — harm CLASSIFIER (safety geometry lives here)
  • Decoder (25.2B): iterative block diffusion denoiser, 128 MoE experts — refusal GENERATOR (template behavior, no projectable direction)

Safety Geometry Comparison

| Component | Peak layer | cos_global | Shape |
|---|---|---|---|
| DiffusionGemma Encoder | L11/30 (37%) | 0.884 | Symmetric bell: bidirectional complete-context reading |
| DiffusionGemma Decoder | L22 | cos(harmful, harmless) = 0.936 | No meaningful separation |
| AR Gemma-4-26B | L22/46 (48%) | 0.751 | Asymmetric three-zone arc |

Papers: https://zenodo.org/communities/duoneural
HuggingFace: https://huggingface.co/DuoNeural

Model size: 51B params (Safetensors). Tensor type: BF16.