Gemma 4 E4B Instruct — Abliterated
This is a refusal-direction-orthogonalized ("abliterated") derivative of
google/gemma-4-E4B-it,
produced for research into instruction-following behavior, refusal mechanisms,
and the geometry of safety-tuned representations in modern multimodal LLMs.
The technique modifies the model's weights so that a single direction in activation space — the one most associated with refusal behavior — can no longer be written into the residual stream by selected transformer layers. No fine-tuning was performed; only a closed-form weight projection.
This model has had its built-in refusal behavior reduced. It is intended for researchers studying alignment, interpretability, and capability evaluation. Users assume full responsibility for outputs. See the Responsible Use section below.
Model Summary
| Property | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Parameters | ~8B (E4B effective) |
| Architecture | Gemma 4 multimodal (text decoder modified) |
| Context length | 256K tokens |
| Tensor type | BF16 / FP16 |
| Technique | Refusal-direction orthogonalization (abliteration) |
| Layers modified | Text decoder layers 20–34 of 42 |
| Calibration set | 32 harmful + 32 harmless instructions |
| Calibration layer | Layer 25 (≈60% network depth) |
| Training data added | None |
| Vision / audio towers | Unchanged |
Intended Use
This model is intended for:
- Research into the geometry of refusal in safety-tuned LLMs
- Studies of representation engineering and activation steering
- Capability evaluation of the underlying Gemma 4 base
- Comparative work against fine-tune-based uncensoring approaches
- Building downstream agents in domains where general-purpose refusal heuristics interfere with legitimate task completion
It is not intended as a drop-in replacement for the official Gemma 4 instruct release in user-facing products. Production deployments should add their own task-appropriate safety layer (system prompt rules, classifier-based filtering, output moderation) suited to their use case.
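As a rough illustration of the output-moderation idea mentioned above, the sketch below wraps a generation call with a post-hoc check. Everything in it is hypothetical: `moderated_reply`, `echo_model`, `keyword_filter`, and the fallback text are stand-ins invented for this example, and a real deployment would presumably replace the keyword check with a proper classifier.

```python
from typing import Callable

FALLBACK = "This request cannot be completed."  # hypothetical fallback message


def moderated_reply(
    prompt: str,
    generate: Callable[[str], str],
    is_allowed: Callable[[str, str], bool],
    fallback: str = FALLBACK,
) -> str:
    """Generate a reply, then return it only if the output check passes."""
    reply = generate(prompt)
    return reply if is_allowed(prompt, reply) else fallback


# Toy stand-ins so the sketch runs without a model or a real classifier.
def echo_model(prompt: str) -> str:
    return f"Answer to: {prompt}"


def keyword_filter(prompt: str, reply: str) -> bool:
    blocked = ("forbidden-topic",)  # illustrative blocklist, not a real policy
    return not any(term in reply.lower() for term in blocked)


print(moderated_reply("Explain a Carnot cycle", echo_model, keyword_filter))
```

Keeping the check as a separate callable means the same wrapper can sit in front of a system-prompt-constrained model, a classifier, or a human review queue without changing the calling code.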
How It Works
The model is based on the observation, formalized in Arditi et al., 2024 ("Refusal in language models is mediated by a single direction"), that aligned LLMs encode refusal behavior largely along one linear direction in the residual stream. Identifying and projecting that direction out of the weights yields a model with substantially reduced refusal frequency while preserving most of the underlying capability.
Procedure
Direction identification. 32 harmful and 32 harmless prompts were passed through the unmodified base model. Mean residual stream activations at layer 25 were recorded for each set. The difference, normalized to unit length, defines the refusal direction r.
Weight orthogonalization. For every weight matrix W that writes into the residual stream within layers 20–34 of the text decoder, the following projection was applied:
W_new = W − (r rᵀ) W

Because r has unit length, rᵀ(I − r rᵀ) = 0, so for any input x the product W_new·x has zero component along r. The matrices modified per layer are:
- self_attn.o_proj.weight
- mlp.down_proj.weight

Additionally, the token embedding matrix embed_tokens.weight was orthogonalized along r in its hidden-dimension axis.

Untouched components. Vision tower, audio tower, multimodal embedders, RMS norms, gating projections, attention QKV projections, MLP up/gate projections, the final norm, and lm_head were left unmodified. Layers 0–19 and 35–41 of the text decoder were left unmodified to preserve early-feature extraction and late-stage output formatting.
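The two procedure steps can be sketched on toy data as follows. This is a minimal, dependency-free illustration and not the script used to produce this checkpoint: plain Python lists stand in for tensors, all function names are hypothetical, and the weight layout is assumed to follow the PyTorch convention (a linear weight is `[out_features, in_features]`, so rows index the residual stream; an embedding is `[vocab, hidden]`).

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference between the two mean activation vectors."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def orthogonalize_output(W, r):
    """Return W - r (r^T W) for W of shape [hidden, in]."""
    n_in = len(W[0])
    # r^T W gives one coefficient per input column.
    coeff = [sum(r[i] * W[i][j] for i in range(len(r))) for j in range(n_in)]
    return [[W[i][j] - r[i] * coeff[j] for j in range(n_in)] for i in range(len(r))]


def orthogonalize_rows(E, r):
    """Remove the r component from every row of E (shape [vocab, hidden])."""
    out = []
    for row in E:
        dot = sum(a * b for a, b in zip(row, r))
        out.append([a - dot * b for a, b in zip(row, r)])
    return out


def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Toy "layer 25" activations with hidden size 3 (made-up numbers).
harmful = [[2.0, 4.0, 1.0], [4.0, 4.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = refusal_direction(harmful, harmless)

# Toy output projection (hidden=3, in=2) and embedding (vocab=2, hidden=3).
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
E = [[1.0, 1.0, 1.0], [0.0, 2.0, 3.0]]
W_new = orthogonalize_output(W, r)
E_new = orthogonalize_rows(E, r)

# The component of W_new @ x along r should vanish up to rounding error.
x = [1.0, -2.0]
residual = sum(a * b for a, b in zip(r, matvec(W_new, x)))
print(f"component along r after projection: {residual:.2e}")
```

In practice the same operations would be applied with a tensor library to each selected weight matrix; the list-based version is only meant to make the arithmetic easy to follow.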
Why Only Some Layers
Applying the projection to every layer destabilized generation on Gemma 4
specifically, likely due to its per-layer-input embedding mechanism
(embed_tokens_per_layer) which feeds additional learned signal into each
decoder layer. Limiting the projection to the middle band of layers —
where the refusal feature is most concentrated according to the
calibration — preserved coherent generation while still substantially
reducing refusal frequency.
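One way to express the layer band as a filter over parameter names is sketched below. The layer range and the two matrix suffixes come from this card; the `language_model.layers.<i>.` name pattern is an assumption about the checkpoint's parameter naming and should be checked against the actual state dict. `should_project` is a hypothetical helper, and the embedding matrix would need to be handled separately.

```python
import re

# Values taken from this card: layers 20-34 inclusive, two matrices per layer.
FIRST_LAYER, LAST_LAYER = 20, 34
TARGET_SUFFIXES = ("self_attn.o_proj.weight", "mlp.down_proj.weight")

# Assumed naming scheme for text decoder parameters; vision and audio tower
# parameters are expected not to match this prefix.
_LAYER_RE = re.compile(r"language_model\.layers\.(\d+)\.(.+)$")


def should_project(param_name: str) -> bool:
    """True if the parameter is an output projection inside the chosen band."""
    match = _LAYER_RE.search(param_name)
    if match is None:
        return False
    layer, suffix = int(match.group(1)), match.group(2)
    return FIRST_LAYER <= layer <= LAST_LAYER and suffix in TARGET_SUFFIXES


names = [
    "model.language_model.layers.19.mlp.down_proj.weight",
    "model.language_model.layers.25.self_attn.o_proj.weight",
    "model.language_model.layers.25.self_attn.q_proj.weight",
    "model.language_model.layers.35.mlp.down_proj.weight",
]
for name in names:
    print(name, should_project(name))
```

Matching on the full suffix rather than on a substring such as `o_proj` avoids accidentally selecting similarly named parameters outside the text decoder.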
Usage
With Transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "<your-username>/gemma-4-e4b-it-abliterated"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain how a Carnot cycle works."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
# apply_chat_template may return a dict-like object; keep only the token IDs.
if hasattr(inputs, "input_ids"):
    inputs = inputs.input_ids

outputs = model.generate(inputs.to(model.device), max_new_tokens=400)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
With llama.cpp / Ollama (GGUF)
A GGUF-quantized version may be released separately. To produce one yourself:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
python convert_hf_to_gguf.py /path/to/this/model --outfile model-f16.gguf
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
Then either run directly with llama-cli or create an Ollama Modelfile.
Evaluation Notes
Abliteration is known to slightly degrade benchmark performance because the refusal direction is not perfectly orthogonal to capability-relevant features. Published numbers for similar abliterations of comparable models typically show:
- MMLU: −1 to −4 percentage points
- GSM8K / math: −2 to −5 points
- HellaSwag / commonsense: negligible
- Instruction following: minimal change on neutral prompts
No formal evaluation has been performed on this specific checkpoint. Users are encouraged to run their own benchmarks for their target use cases.
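For a quick first look at refusal frequency before running full benchmarks, a substring heuristic such as the one below can be applied to a batch of generated responses. This is a crude proxy under stated assumptions: the marker list is illustrative only, `is_refusal` and `refusal_rate` are hypothetical names, and the heuristic will miss soft refusals and may flag harmless text that happens to contain a marker.

```python
# Illustrative refusal phrases; extend or replace for your own evaluation.
REFUSAL_MARKERS = (
    "i can't",
    "i cannot",
    "i won't",
    "i'm sorry",
    "i am unable",
    "as an ai",
)


def is_refusal(response: str, markers=REFUSAL_MARKERS) -> bool:
    """Heuristic: does the opening of the response contain a refusal phrase?"""
    # Normalize case and curly apostrophes, then look only at the first 80 characters.
    head = response.strip().lower().replace("\u2019", "'")[:80]
    return any(marker in head for marker in markers)


def refusal_rate(responses) -> float:
    """Fraction of responses flagged as refusals by the heuristic."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(is_refusal(r) for r in responses) / len(responses)


samples = [
    "I'm sorry, but I can't help with that.",
    "Sure. A Carnot cycle has four stages.",
    "I cannot assist with this request.",
    "Here is an outline.",
]
print(f"refusal rate: {refusal_rate(samples):.2f}")
```

Running the same prompts through the base model and this checkpoint and comparing the two rates gives a rough sense of the change; it does not replace capability benchmarks or manual review.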
If capability degradation is unacceptable for your application, a short DPO or SFT pass on a general instruction-following dataset (the "healing" step) typically recovers most lost performance.
Limitations
- Capability shift. Removing the refusal direction is not lossless; some prompts that the base model handled well may degrade.
- Imperfect uncensoring. Refusal in modern instruct models is not perfectly one-dimensional. Some refusals will remain, especially for prompts strongly associated with safety training distributions.
- Multimodal behavior. Image and audio understanding paths were not modified. Refusals triggered by visual or audio content may persist.
- Hallucination unchanged. Abliteration does not affect factual reliability. The model retains the base model's tendency to fabricate.
- No new knowledge. The training cutoff and knowledge base are identical to the underlying Gemma 4 E4B-it.
Responsible Use
This model has reduced refusal behavior compared to its base. That makes it useful for research, but it also means standard built-in mitigations against producing harmful content are weaker. Responsibilities of the user:
- Do not use this model to generate content that is illegal where the user resides, including but not limited to: instructions for creating weapons capable of mass harm; sexual content involving minors; non-consensual intimate imagery; targeted harassment; or material that facilitates fraud against specific individuals.
- Add task-appropriate safety scaffolding before exposing this model to end users (system prompt constraints, output classification, human review where stakes warrant it).
- Comply with the Gemma Prohibited Use Policy in addition to any restrictions in your jurisdiction.
The author publishes this model in the belief that open access to interpretability-targeted modifications of frontier open-weight models advances safety research. That access carries an obligation on the user to exercise judgment.
License
This model is a derivative work of google/gemma-4-E4B-it and is
distributed under the
Gemma Terms of Use, which apply
in full to this derivative. By downloading or using this model you accept
those terms.
The Gemma Prohibited Use Policy applies in addition to the license. The modifications made by this repository do not constitute a waiver of those restrictions.
Original model copyright © Google DeepMind. Modifications by the repository author.
Citation
If you use this model in research, please cite the original Gemma release and the abliteration technique:
@misc{gemma4_2026,
title = {Gemma 4},
author = {{Google DeepMind}},
year = {2026},
howpublished = {\url{https://huggingface.co/google/gemma-4-E4B-it}}
}
@misc{arditi2024refusal,
title = {Refusal in Language Models Is Mediated by a Single Direction},
author = {Andy Arditi and Oscar Obeso and Aaquib Syed and Daniel Paleka
and Nina Panickssery and Wes Gurnee and Neel Nanda},
year = {2024},
eprint = {2406.11717},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}
Acknowledgments
- Google DeepMind for releasing Gemma 4 under open weights.
- Andy Arditi, Neel Nanda, and collaborators for the refusal-direction methodology.
- The Sumandora/remove-refusals-with-transformers repository for the reference implementation that informed this work.
- The wider open-source mechanistic interpretability community.
Changelog
- v1.0 — Initial release. Layers 20–34 orthogonalized against layer-25 refusal direction.