Gemma 4 Uncensored
Abliterated Gemma 4 models with refusal behavior removed. Biprojection + EGA for MoE. Cross-validated against 686 prompts from 4 datasets.
Uncensored version of google/gemma-4-E4B-it with refusal behavior removed.
| Metric | Before | After |
|---|---|---|
| Refusals (mlabonne, 100 prompts) | 99/100 | 0/100 effective (3 flagged, all refusal-then-comply) |
| Refusals (cross-dataset, 686 prompts) | — | 5/686 (0.7%) |
| KL Divergence | 0 (baseline) | 0.068 |
| Quality (harmless response length ratio) | 1.0 | ~1.01 (no degradation) |
Tested against 4 independent prompt datasets to verify generalization:
| Dataset | Prompts | Refusals |
|---|---|---|
| JailbreakBench | 100 | 2/100 |
| tulu-harmbench | 320 | 1/320 |
| NousResearch/RefusalDataset | 166 | 2/166 |
| mlabonne/harmful_behaviors | 100 | 0/100 |
| Total | 686 | 5/686 (0.7%) |
Every flagged refusal was manually audited. Most are "refusal-then-comply" false positives, where the model adds an AI identity disclaimer and then answers the question anyway.
Norm-preserving biprojected abliteration (grimjim, Nov 2025).
Each weight row is decomposed into magnitude + direction, the refusal direction is projected out of the
direction component only, then recombined with the original magnitude — guaranteeing ||W_new|| = ||W_orig||.
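The row-wise update described above can be sketched as follows. This is a minimal illustration of the norm-preserving biprojection idea, not the repo's exact implementation; the function names and the assumption that `r` is a precomputed refusal direction are mine:

```python
import torch

def biproject_row(w: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Remove the refusal direction from w's direction component only,
    keeping the row's original magnitude (so ||w_new|| == ||w||)."""
    mag = w.norm()
    d = w / mag                       # unit direction of the row
    d = d - (d @ r_hat) * r_hat       # project out the refusal direction
    d = d / d.norm()                  # renormalize (assumes w is not parallel to r_hat)
    return mag * d                    # recombine with the original magnitude

def abliterate_matrix(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Apply the row-wise biprojection to every row of a weight matrix."""
    r_hat = r / r.norm()
    return torch.stack([biproject_row(w, r_hat) for w in W])
```

After the update, each row has exactly its original norm and zero component along the refusal direction.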
Target modules: `o_proj` and `mlp.down_proj`. Refusal direction: `normalize(mean(harmful) - mean(harmless))`. Applied to `o_proj` and `down_proj` in all layers.

| Parameter | Value |
|---|---|
| Layers abliterated | 100% |
| Scale | 1.0 |
| Winsorization | 0.995 |
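A plausible reading of the direction computation and the winsorization parameter above: clamp per-dimension outliers in the harmful and harmless activation sets at the 0.005/0.995 quantiles before taking means, then normalize the difference. A sketch under that assumption (the exact winsorization details are not specified in this card):

```python
import torch

def refusal_direction(harmful: torch.Tensor, harmless: torch.Tensor,
                      winsor_q: float = 0.995) -> torch.Tensor:
    """normalize(mean(harmful) - mean(harmless)) with per-dimension winsorization.

    harmful, harmless: (n_prompts, hidden_dim) activation matrices.
    """
    def winsorize(x: torch.Tensor) -> torch.Tensor:
        # Clamp each hidden dimension to its [1-q, q] quantile range
        # (assumed interpretation of the 0.995 winsorization setting).
        lo = torch.quantile(x, 1 - winsor_q, dim=0)
        hi = torch.quantile(x, winsor_q, dim=0)
        return x.clamp(min=lo, max=hi)

    diff = winsorize(harmful).mean(dim=0) - winsorize(harmless).mean(dim=0)
    return diff / diff.norm()  # unit-length refusal direction
```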
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "TrevorJS/gemma-4-E4B-it-uncensored",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("TrevorJS/gemma-4-E4B-it-uncensored")

messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
Full code and experiment data: abliteration research repo
```bash
python scripts/abliterate.py biprojection --model google/gemma-4-E4B-it \
  --top-pct 100 --strip-topic-markers --skip-prefix --batch-size 4 \
  --auto-save output_dir
```