Instructions to use jokernifty/gemma-4-e4b-it-abliterated with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use jokernifty/gemma-4-e4b-it-abliterated with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="jokernifty/gemma-4-e4b-it-abliterated")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("jokernifty/gemma-4-e4b-it-abliterated")
model = AutoModelForImageTextToText.from_pretrained("jokernifty/gemma-4-e4b-it-abliterated")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use jokernifty/gemma-4-e4b-it-abliterated with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "jokernifty/gemma-4-e4b-it-abliterated"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "jokernifty/gemma-4-e4b-it-abliterated",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/jokernifty/gemma-4-e4b-it-abliterated

SGLang

How to use jokernifty/gemma-4-e4b-it-abliterated with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "jokernifty/gemma-4-e4b-it-abliterated" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "jokernifty/gemma-4-e4b-it-abliterated",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "jokernifty/gemma-4-e4b-it-abliterated" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "jokernifty/gemma-4-e4b-it-abliterated",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use jokernifty/gemma-4-e4b-it-abliterated with Docker Model Runner:
```
docker model run hf.co/jokernifty/gemma-4-e4b-it-abliterated
```

Gemma 4 E4B Instruct — Abliterated

This is a refusal-direction-orthogonalized ("abliterated") derivative of google/gemma-4-E4B-it, produced for research into instruction-following behavior, refusal mechanisms, and the geometry of safety-tuned representations in modern multimodal LLMs.

The technique modifies the model's weights so that a single direction in activation space — the one most associated with refusal behavior — can no longer be written into the residual stream by selected transformer layers. No fine-tuning was performed; only a closed-form weight projection.

This model has had its built-in refusal behavior reduced. It is intended for researchers studying alignment, interpretability, and capability evaluation. Users assume full responsibility for outputs. See the Responsible Use section below.

Model Summary


Base model	`google/gemma-4-E4B-it`
Parameters	~8B (E4B effective)
Architecture	Gemma 4 multimodal (text decoder modified)
Context length	256K tokens
Tensor type	BF16 / FP16
Technique	Refusal-direction orthogonalization (abliteration)
Layers modified	Text decoder layers 20–34 of 42
Calibration set	32 harmful + 32 harmless instructions
Calibration layer	Layer 25 (≈60% network depth)
Training data added	None
Vision / audio towers	Unchanged

Intended Use

This model is intended for:

Research into the geometry of refusal in safety-tuned LLMs
Studies of representation engineering and activation steering
Capability evaluation of the underlying Gemma 4 base
Comparative work against fine-tune-based uncensoring approaches
Building downstream agents in domains where general-purpose refusal heuristics interfere with legitimate task completion

It is not intended as a drop-in replacement for the official Gemma 4 instruct release in user-facing products. Production deployments should add their own task-appropriate safety layer (system prompt rules, classifier-based filtering, output moderation) suited to their use case.

How It Works

The model is based on the observation, formalized in Arditi et al., 2024 ("Refusal in language models is mediated by a single direction"), that aligned LLMs encode refusal behavior largely along one linear direction in the residual stream. Identifying and projecting that direction out of the weights yields a model with substantially reduced refusal frequency while preserving most of the underlying capability.

Procedure

Direction identification. 32 harmful and 32 harmless prompts were passed through the unmodified base model. Mean residual stream activations at layer 25 were recorded for each set. The difference, normalized to unit length, defines the refusal direction r.
Weight orthogonalization. For every weight matrix W that writes into the residual stream within layers 20–34 of the text decoder, the following projection was applied:
```
W_new = W − (r rᵀ) W
```
This guarantees that, for any input, W·x contains zero component along r. The matrices modified per layer are:
- self_attn.o_proj.weight
- mlp.down_proj.weight
Additionally, the token embedding matrix embed_tokens.weight was orthogonalized along r in its hidden-dimension axis.
Untouched components. Vision tower, audio tower, multimodal embedders, RMS norms, gating projections, attention QKV projections, MLP up/gate projections, the final norm, and lm_head were left unmodified. Layers 0–19 and 35–41 of the text decoder were left unmodified to preserve early-feature extraction and late-stage output formatting.

Why Only Some Layers

Applying the projection to every layer destabilized generation on Gemma 4 specifically, likely due to its per-layer-input embedding mechanism (embed_tokens_per_layer) which feeds additional learned signal into each decoder layer. Limiting the projection to the middle band of layers — where the refusal feature is most concentrated according to the calibration — preserved coherent generation while still substantially reducing refusal frequency.

Usage

With Transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "<your-username>/gemma-4-e4b-it-abliterated"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain how a Carnot cycle works."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
if hasattr(inputs, "input_ids"):
    inputs = inputs.input_ids

outputs = model.generate(inputs.to(model.device), max_new_tokens=400)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

With llama.cpp / Ollama (GGUF)

A GGUF-quantized version may be released separately. To produce one yourself:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
python convert_hf_to_gguf.py /path/to/this/model --outfile model-f16.gguf
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

Then either run directly with llama-cli or create an Ollama Modelfile.

Evaluation Notes

Abliteration is known to slightly degrade benchmark performance because the refusal direction is not perfectly orthogonal to capability-relevant features. Published numbers for similar abliterations of comparable models typically show:

MMLU: −1 to −4 percentage points
GSM8K / math: −2 to −5 points
HellaSwag / commonsense: negligible
Instruction following: minimal change on neutral prompts

No formal evaluation has been performed on this specific checkpoint. Users are encouraged to run their own benchmarks for their target use cases.

If capability degradation is unacceptable for your application, a short DPO or SFT pass on a general instruction-following dataset (the "healing" step) typically recovers most lost performance.

Limitations

Capability shift. Removing the refusal direction is not lossless; some prompts that the base model handled well may degrade.
Imperfect uncensoring. Refusal in modern instruct models is not perfectly one-dimensional. Some refusals will remain, especially for prompts strongly associated with safety training distributions.
Multimodal behavior. Image and audio understanding paths were not modified. Refusals triggered by visual or audio content may persist.
Hallucination unchanged. Abliteration does not affect factual reliability. The model retains the base model's tendency to fabricate.
No new knowledge. The training cutoff and knowledge base are identical to the underlying Gemma 4 E4B-it.

Responsible Use

This model has reduced refusal behavior compared to its base. That makes it useful for research, but it also means standard built-in mitigations against producing harmful content are weaker. Responsibilities of the user:

Do not use this model to generate content that is illegal where the user resides, including but not limited to: instructions for creating weapons capable of mass harm; sexual content involving minors; non-consensual intimate imagery; targeted harassment; or material that facilitates fraud against specific individuals.
Add task-appropriate safety scaffolding before exposing this model to end users (system prompt constraints, output classification, human review where stakes warrant it).
Comply with the Gemma Prohibited Use Policy in addition to any restrictions in your jurisdiction.

The author publishes this model in the belief that open access to interpretability-targeted modifications of frontier open-weight models advances safety research. That access carries an obligation on the user to exercise judgment.

License

This model is a derivative work of google/gemma-4-E4B-it and is distributed under the Gemma Terms of Use, which apply in full to this derivative. By downloading or using this model you accept those terms.

The Gemma Prohibited Use Policy applies in addition to the license. The modifications made by this repository do not constitute a waiver of those restrictions.

Citation

If you use this model in research, please cite the original Gemma release and the abliteration technique:

@misc{gemma4_2026,
  title  = {Gemma 4},
  author = {{Google DeepMind}},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/google/gemma-4-E4B-it}}
}

@misc{arditi2024refusal,
  title  = {Refusal in Language Models Is Mediated by a Single Direction},
  author = {Andy Arditi and Oscar Obeso and Aaquib Syed and Daniel Paleka
            and Nina Panickssery and Wes Gurnee and Neel Nanda},
  year   = {2024},
  eprint = {2406.11717},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}

Acknowledgments

Google DeepMind for releasing Gemma 4 under open weights.
Andy Arditi, Neel Nanda, and collaborators for the refusal-direction methodology.
The Sumandora/remove-refusals-with-transformers repository for the reference implementation that informed this work.
The wider open-source mechanistic interpretability community.

Changelog

v1.0 — Initial release. Layers 20–34 orthogonalized against layer-25 refusal direction.

Downloads last month: 56

Safetensors

Model size

8B params

Tensor type

F16

Model tree for jokernifty/gemma-4-e4b-it-abliterated

Base model

google/gemma-4-E4B

Finetuned

google/gemma-4-E4B-it

Finetuned

(207)

this model

Space using jokernifty/gemma-4-e4b-it-abliterated 1

Paper for jokernifty/gemma-4-e4b-it-abliterated

Refusal in Language Models Is Mediated by a Single Direction

Paper • 2406.11717 • Published Jun 17, 2024 • 13