Gemma 4 E4B Instruct — Abliterated

This is a refusal-direction-orthogonalized ("abliterated") derivative of google/gemma-4-E4B-it, produced for research into instruction-following behavior, refusal mechanisms, and the geometry of safety-tuned representations in modern multimodal LLMs.

The technique modifies the model's weights so that a single direction in activation space — the one most associated with refusal behavior — can no longer be written into the residual stream by selected transformer layers. No fine-tuning was performed; only a closed-form weight projection.

This model has had its built-in refusal behavior reduced. It is intended for researchers studying alignment, interpretability, and capability evaluation. Users assume full responsibility for outputs. See the Responsible Use section below.

Model Summary

Base model: google/gemma-4-E4B-it
Parameters: ~8B (E4B effective)
Architecture: Gemma 4 multimodal (text decoder modified)
Context length: 256K tokens
Tensor type: BF16 / FP16
Technique: Refusal-direction orthogonalization (abliteration)
Layers modified: Text decoder layers 20–34 of 42
Calibration set: 32 harmful + 32 harmless instructions
Calibration layer: Layer 25 (≈60% network depth)
Training data added: None
Vision / audio towers: Unchanged

Intended Use

This model is intended for:

  • Research into the geometry of refusal in safety-tuned LLMs
  • Studies of representation engineering and activation steering
  • Capability evaluation of the underlying Gemma 4 base
  • Comparative work against fine-tune-based uncensoring approaches
  • Building downstream agents in domains where general-purpose refusal heuristics interfere with legitimate task completion

It is not intended as a drop-in replacement for the official Gemma 4 instruct release in user-facing products. Production deployments should add their own task-appropriate safety layer (system prompt rules, classifier-based filtering, output moderation) suited to their use case.
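
As an illustration of the output-moderation idea above, here is a minimal sketch of a wrapper that passes every reply through a classifier before returning it. The names `moderated_generate`, `fake_generate`, and `keyword_filter` are hypothetical and not part of this repository; a real deployment would likely plug in the model's generation call and a proper moderation classifier in their place.

```python
from typing import Callable

def moderated_generate(prompt: str,
                       generate: Callable[[str], str],
                       is_allowed: Callable[[str], bool],
                       fallback: str = "This response was withheld by the output filter.") -> str:
    """Generate a reply, then return it only if the classifier allows it."""
    reply = generate(prompt)
    return reply if is_allowed(reply) else fallback

# Toy stand-ins so the sketch runs without a model or a real classifier.
def fake_generate(prompt: str) -> str:
    return f"Echo: {prompt}"

def keyword_filter(text: str) -> bool:
    blocked = ("forbidden-topic",)  # placeholder term, purely illustrative
    return not any(term in text.lower() for term in blocked)

print(moderated_generate("hello", fake_generate, keyword_filter))
print(moderated_generate("forbidden-topic", fake_generate, keyword_filter))
```

Keeping the generator and the classifier as injected callables means the filter can be swapped or tightened without touching the model code.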

How It Works

The technique is based on the observation, formalized in Arditi et al., 2024 ("Refusal in Language Models Is Mediated by a Single Direction"), that safety-tuned LLMs encode refusal behavior largely along one linear direction in the residual stream. Projecting that direction out of the weights yields a model that refuses substantially less often while preserving most of the underlying capability.

Procedure

  1. Direction identification. 32 harmful and 32 harmless prompts were passed through the unmodified base model, and the mean residual-stream activation at layer 25 was recorded for each set. The difference between the two means, normalized to unit length, defines the refusal direction r.

  2. Weight orthogonalization. For every weight matrix W that writes into the residual stream within layers 20–34 of the text decoder, the following projection was applied:

    W_new = W − (r rᵀ) W

    Because r has unit length, rᵀ W_new = rᵀ W − (rᵀ r) rᵀ W = 0, so for any input x the output W_new·x has zero component along r. The matrices modified per layer are:

    • self_attn.o_proj.weight
    • mlp.down_proj.weight

    Additionally, the token embedding matrix embed_tokens.weight was orthogonalized against r along its hidden-dimension axis, so that no token's embedding has a component along r.

  3. Untouched components. Vision tower, audio tower, multimodal embedders, RMS norms, gating projections, attention QKV projections, MLP up/gate projections, the final norm, and lm_head were left unmodified. Layers 0–19 and 35–41 of the text decoder were left unmodified to preserve early-feature extraction and late-stage output formatting.
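
The procedure above can be sketched in NumPy on synthetic data. This is an illustration under stated assumptions, not the script used to produce this checkpoint: the function names are hypothetical, the activations and weights are random stand-ins, and the weight layout assumed is (d_model, d_in) for matrices that write into the residual stream and (vocab, d_model) for the embedding matrix.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations; inputs are (n_prompts, d_model)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def orthogonalize_output_weight(W, r):
    """W has shape (d_model, d_in) and writes W @ x into the residual stream.
    Returns W - r r^T W, so the output has no component along r."""
    return W - np.outer(r, r @ W)

def orthogonalize_embedding(E, r):
    """E has shape (vocab, d_model); each row is a residual-stream vector.
    Removes the component along r from every row."""
    return E - np.outer(E @ r, r)

# Synthetic stand-ins for recorded activations and weights (not real model data).
rng = np.random.default_rng(0)
d_model, d_in, vocab = 16, 32, 50
harmful = rng.normal(size=(32, d_model)) + 0.5   # shifted to create a mean gap
harmless = rng.normal(size=(32, d_model))

r = refusal_direction(harmful, harmless)
W_new = orthogonalize_output_weight(rng.normal(size=(d_model, d_in)), r)
E_new = orthogonalize_embedding(rng.normal(size=(vocab, d_model)), r)

x = rng.normal(size=d_in)
print(abs(r @ (W_new @ x)))      # numerically close to zero
print(np.abs(E_new @ r).max())   # numerically close to zero
```

Both printed values should sit at floating-point noise level, which is the numerical counterpart of the claim that the modified weights cannot write along r.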

Why Only Some Layers

Applying the projection to every layer destabilized generation on Gemma 4 specifically, likely due to its per-layer-input embedding mechanism (embed_tokens_per_layer) which feeds additional learned signal into each decoder layer. Limiting the projection to the middle band of layers — where the refusal feature is most concentrated according to the calibration — preserved coherent generation while still substantially reducing refusal frequency.
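
To make the layer band concrete, the following sketch enumerates the parameter names that fall inside it. The helper `target_weight_names` and the `prefix` value are assumptions for illustration; the actual module path in the checkpoint may differ and should be checked against the model's state dict.

```python
def target_weight_names(first_layer=20, last_layer=34,
                        prefix="model.language_model.layers"):
    """List parameter names to orthogonalize in the inclusive layer band.
    `prefix` is a hypothetical module path; inspect model.state_dict() for the real one."""
    suffixes = ("self_attn.o_proj.weight", "mlp.down_proj.weight")
    return [f"{prefix}.{i}.{s}"
            for i in range(first_layer, last_layer + 1)
            for s in suffixes]

names = target_weight_names()
print(len(names))   # 15 layers x 2 matrices = 30
print(names[0])
print(names[-1])
```

Widening or narrowing the band is then a matter of changing the two layer arguments, which makes it straightforward to test how far the projection can be extended before generation degrades.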

Usage

With Transformers

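import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "<your-username>/gemma-4-e4b-it-abliterated"

# Load the tokenizer and the weights in BF16, spread across available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "Explain how a Carnot cycle works."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
# The result may be a tensor or a dict-like encoding depending on the
# transformers version; keep only the token IDs in the latter case.
if hasattr(inputs, "input_ids"):
    inputs = inputs.input_ids

outputs = model.generate(inputs.to(model.device), max_new_tokens=400)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))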

With llama.cpp / Ollama (GGUF)

A GGUF-quantized version may be released separately. To produce one yourself:

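git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
# Build the tools; binaries are written to build/bin/
cmake -B build
cmake --build build --config Release
python convert_hf_to_gguf.py /path/to/this/model --outfile model-f16.gguf --outtype f16
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M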

Then either run directly with llama-cli or create an Ollama Modelfile.

Evaluation Notes

Abliteration is known to slightly degrade benchmark performance because the refusal direction is not perfectly orthogonal to capability-relevant features. Published numbers for similar abliterations of comparable models typically show:

  • MMLU: −1 to −4 percentage points
  • GSM8K / math: −2 to −5 points
  • HellaSwag / commonsense: negligible
  • Instruction following: minimal change on neutral prompts

No formal evaluation has been performed on this specific checkpoint. Users are encouraged to run their own benchmarks for their target use cases.
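
As a starting point for such checks, a crude refusal-rate estimate can be obtained by substring matching on generated responses. The sketch below is only a heuristic under stated assumptions: the marker list and the function names are illustrative and not taken from this model card, and substring matching will miss paraphrased refusals and can flag benign text.

```python
REFUSAL_MARKERS = (  # illustrative list, not taken from this model card
    "i can't", "i cannot", "i won't", "i'm sorry", "i am sorry",
    "as an ai", "i'm not able to", "i am unable to",
)

def is_refusal(response, markers=REFUSAL_MARKERS):
    """True if the response contains any marker, case-insensitively."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(m in text for m in markers)

def refusal_rate(responses, markers=REFUSAL_MARKERS):
    """Fraction of responses flagged as refusals; 0.0 for an empty list."""
    if not responses:
        return 0.0
    return sum(is_refusal(r, markers) for r in responses) / len(responses)

sample = [
    "I'm sorry, but I can't help with that.",
    "A Carnot cycle consists of four reversible steps.",
    "I cannot assist with this request.",
    "Sure, here is an overview.",
]
print(refusal_rate(sample))  # 2 of 4 flagged -> 0.5
```

Running the same prompt set through the base model and this checkpoint and comparing the two rates gives a rough before-and-after picture; a classifier-based judge would be more reliable for anything beyond a quick sanity check.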

If capability degradation is unacceptable for your application, a short DPO or SFT pass on a general instruction-following dataset (the "healing" step) typically recovers most lost performance.

Limitations

  • Capability shift. Removing the refusal direction is not lossless; some prompts that the base model handled well may degrade.
  • Imperfect uncensoring. Refusal in modern instruct models is not perfectly one-dimensional. Some refusals will remain, especially for prompts strongly associated with safety training distributions.
  • Multimodal behavior. Image and audio understanding paths were not modified. Refusals triggered by visual or audio content may persist.
  • Hallucination unchanged. Abliteration does not affect factual reliability. The model retains the base model's tendency to fabricate.
  • No new knowledge. The training cutoff and knowledge base are identical to the underlying Gemma 4 E4B-it.

Responsible Use

This model has reduced refusal behavior compared to its base. That makes it useful for research, but it also means standard built-in mitigations against producing harmful content are weaker. Responsibilities of the user:

  • Do not use this model to generate content that is illegal where the user resides, including but not limited to: instructions for creating weapons capable of mass harm; sexual content involving minors; non-consensual intimate imagery; targeted harassment; or material that facilitates fraud against specific individuals.
  • Add task-appropriate safety scaffolding before exposing this model to end users (system prompt constraints, output classification, human review where stakes warrant it).
  • Comply with the Gemma Prohibited Use Policy in addition to any restrictions in your jurisdiction.

The author publishes this model in the belief that open access to interpretability-targeted modifications of frontier open-weight models advances safety research. That access carries an obligation on the user to exercise judgment.

License

This model is a derivative work of google/gemma-4-E4B-it and is distributed under the Gemma Terms of Use, which apply in full to this derivative. By downloading or using this model you accept those terms.

The Gemma Prohibited Use Policy applies in addition to the license. The modifications made by this repository do not constitute a waiver of those restrictions.

Original model copyright © Google DeepMind. Modifications by the repository author.

Citation

If you use this model in research, please cite the original Gemma release and the abliteration technique:

@misc{gemma4_2026,
  title  = {Gemma 4},
  author = {{Google DeepMind}},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/google/gemma-4-E4B-it}}
}

@misc{arditi2024refusal,
  title  = {Refusal in Language Models Is Mediated by a Single Direction},
  author = {Andy Arditi and Oscar Obeso and Aaquib Syed and Daniel Paleka
            and Nina Panickssery and Wes Gurnee and Neel Nanda},
  year   = {2024},
  eprint = {2406.11717},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}

Acknowledgments

  • Google DeepMind for releasing Gemma 4 under open weights.
  • Andy Arditi, Neel Nanda, and collaborators for the refusal-direction methodology.
  • The Sumandora/remove-refusals-with-transformers repository for the reference implementation that informed this work.
  • The wider open-source mechanistic interpretability community.

Changelog

  • v1.0 — Initial release. Layers 20–34 orthogonalized against layer-25 refusal direction.