ACTIVE EXPERIMENTATION — THREE FAILED ATTEMPTS
This model documents a mechanistic research failure. Three abliteration strategies (partial encoder, all encoder layers at α=0.95, and full decoder MoE with 91 weight modifications) all fail to bypass refusal behavior: the model still refuses harmful requests. Do not use this model expecting uncensored behavior. It is shared as a research artifact documenting the mechanistic finding that diffusion LM safety is a vocabulary-space attractor, not a projectable direction.
DiffusionGemma-26B-A4B-IT Abliteration Research Artifact
Produced by DuoNeural (Archon + Jesse) — 2026-06-10
Research artifact documenting three failed abliteration attempts on google/diffusiongemma-26B-A4B-it.
This model contains both encoder-abliterated and decoder-abliterated weights.
Core Finding: Diffusion LM Safety Resists Projection-Based Abliteration
Three abliteration experiments, all failing:
| Experiment | Target | Weights Modified | Result |
|---|---|---|---|
| v1: partial encoder | encoder L9-L15 o_proj + mlp.down_proj | 14 | refusal persists |
| v2: full encoder | ALL encoder layers, α=0.95 | ~62 | refusal persists |
| v3: decoder MoE | ALL decoder down_proj + 128 MoE experts × 30 layers | 91 | refusal persists |
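For readers unfamiliar with the technique, the following is a minimal sketch of projection-based abliteration as it is commonly described, not the exact procedure used for these experiments. It assumes a weight matrix `W` whose rows index the output (residual-stream) dimension, as for `o_proj` or `down_proj`, and a candidate refusal direction `r`; the update is `W' = W - α · r rᵀ W`. The function names `unit`, `ablate`, and `residual_along` are hypothetical, and plain Python lists are used so the sketch runs without dependencies.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def residual_along(W, direction):
    """Component of each column of W along the (normalized) direction: r^T W."""
    r = unit(direction)
    return [sum(r[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]


def ablate(W, direction, alpha=1.0):
    """Return W' = W - alpha * r r^T W (assumed convention: rows = output dims)."""
    r = unit(direction)
    if len(r) != len(W):
        raise ValueError("direction length must match the number of rows of W")
    coeffs = residual_along(W, direction)  # r^T W, one coefficient per column
    return [
        [W[i][j] - alpha * r[i] * coeffs[j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]


if __name__ == "__main__":
    W = [[1.0, 2.0], [3.0, 4.0]]   # toy 2x2 weight matrix
    r = [1.0, 0.0]                 # toy "refusal direction"
    W95 = ablate(W, r, alpha=0.95)
    # With alpha=0.95, roughly 5% of the original component along r remains.
    print(residual_along(W, r), residual_along(W95, r))
```

With `alpha=1.0` the output of the modified matrix has no component along `r`; a value such as 0.95 leaves a small residual. In all three experiments above, removing the direction this way did not change refusal behavior.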
Why All Three Fail (Different Mechanisms)
Encoder (v1/v2): The encoder has exceptionally clean safety geometry (cos=0.884 at L11, the highest we have measured). But the encoder functions as a harm classifier, not a generative gate. The decoder generates refusal templates independently of encoder conditioning, so removing the direction from encoder weights leaves refusal intact.
Decoder (v3): Decoder layer 22 shows cos(harmful, harmless) = 0.9360, meaning the harmful and harmless intermediate activations point in nearly the same direction. The refusal signal does not exist as a projectable direction in decoder intermediate layers.
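As an illustration of the kind of measurement behind cos(harmful, harmless), here is a minimal sketch that compares the mean activation of two prompt sets at one layer. The exact pooling and extraction used for the reported numbers is not stated in this card, so this assumes one pooled activation vector per prompt; `mean_vector`, `cosine`, and `class_mean_cosine` are hypothetical helper names, and the toy vectors are made up.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_mean_cosine(harmful, harmless):
    """Cosine between the mean harmful and mean harmless activation."""
    return cosine(mean_vector(harmful), mean_vector(harmless))


if __name__ == "__main__":
    # Toy data: both classes share a large common component and differ
    # only slightly along the second axis, so the cosine is close to 1.
    harmful = [[10.0, 1.0], [10.0, 1.2]]
    harmless = [[10.0, -1.0], [10.0, -1.2]]
    print(round(class_mean_cosine(harmful, harmless), 3))
```

A value near 1, as reported for decoder layer 22, indicates that the two class means are almost aligned at that layer.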
Root Mechanism: Refusal in DiffusionGemma is a vocabulary-space attractor — a high-probability denoising trajectory toward specific refusal text tokens. This is not a weight-space direction and cannot be removed by projection. This is architecturally distinct from autoregressive models, where the residual stream direction directly gates next-token generation.
Architecture
DiffusionGemma uses a novel architecture:
- Encoder (25.8B): bidirectional Gemma-4 transformer — harm CLASSIFIER (safety geometry lives here)
- Decoder (25.2B): iterative block diffusion denoiser, 128 MoE experts — refusal GENERATOR (template behavior, no projectable direction)
Safety Geometry Comparison
| Component | Peak Layer | cos_global | Shape |
|---|---|---|---|
| DiffusionGemma Encoder | L11/30 (37%) | 0.884 | Symmetric bell — bidirectional complete-context reading |
| DiffusionGemma Decoder L22 | n/a | cos(harmful, harmless)=0.936 | No meaningful separation |
| AR Gemma-4-26B | L22/46 (48%) | 0.751 | Asymmetric three-zone arc |
Papers: https://zenodo.org/communities/duoneural

HuggingFace: https://huggingface.co/DuoNeural
Model tree for DuoNeural/diffusiongemma-26B-A4B-it-abliterated
Base model
google/diffusiongemma-26B-A4B-it