Abliteration Directions for google/gemma-3-1b-it
Refusal-direction vectors extracted from google/gemma-3-1b-it using Apostate.
These directions can be used to remove refusal behavior from the base model at inference time via directional ablation; no fine-tuning or weight modification is required.
How it works
Apostate extracts per-layer "refusal directions" by comparing hidden-state
activations on harmful vs. harmless prompt pairs. At inference time, a
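lightweight PyTorch forward hook projects these directions out of the
residual stream: h = h - strength * (h . v) * v, where h is the hidden
state, v is that layer's refusal direction, and strength is the per-layer
scaling factor. Removing the hooks restores the original model behavior
instantly.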
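The projection can be illustrated on a toy module. The following is a minimal sketch, not Apostate's actual implementation: the helper names `refusal_direction` and `make_ablation_hook` are hypothetical, the difference-of-means extraction is an assumption about how a direction might be derived from harmful vs. harmless activations, and a small `nn.Linear` stands in for a transformer layer. It also assumes the direction is normalized to unit length, so that a strength of 1.0 removes the component along it entirely.

```python
import torch
from torch import nn


def refusal_direction(harmful: torch.Tensor, harmless: torch.Tensor) -> torch.Tensor:
    """Unit-norm difference of mean activations (assumed extraction method)."""
    diff = harmful.mean(dim=0) - harmless.mean(dim=0)
    return diff / diff.norm()


def make_ablation_hook(v: torch.Tensor, strength: float = 1.0):
    """Build a forward hook that applies h = h - strength * (h . v) * v."""
    def hook(module, inputs, output):
        # Decoder layers often return a tuple whose first item is the hidden state.
        h = output[0] if isinstance(output, tuple) else output
        v_local = v.to(device=h.device, dtype=h.dtype)
        coeff = (h @ v_local).unsqueeze(-1)  # (h . v) for every position
        h = h - strength * coeff * v_local
        # Returning a value from a forward hook replaces the module's output.
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook


torch.manual_seed(0)
hidden = 8
layer = nn.Linear(hidden, hidden)  # toy stand-in for one transformer layer

# Synthetic activations: the "harmful" set is shifted along the first axis.
harmful = torch.randn(16, hidden) + 2.0 * torch.eye(hidden)[0]
harmless = torch.randn(16, hidden)
v = refusal_direction(harmful, harmless)

x = torch.randn(2, 5, hidden)  # (batch, sequence, hidden)
original = layer(x)

handle = layer.register_forward_hook(make_ablation_hook(v, strength=1.0))
ablated = layer(x)  # component along v is projected out

handle.remove()  # removing the hook restores the original behavior
restored = layer(x)
```

Because the hook only rewrites the layer's output on the fly, the module's weights are never touched, which is why removing the handle brings back the original outputs exactly.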
Quick start
from apostate import ModelWrapper, load_directions, AbliterationHookManager
from apostate.strength import compute_layer_strengths
wrapper = ModelWrapper("google/gemma-3-1b-it")
directions = load_directions("directions.safetensors")
strengths = compute_layer_strengths(num_layers=wrapper.num_layers)
hooks = AbliterationHookManager()
hooks.install(wrapper.get_layers(list(directions.keys())), directions, strengths)
# Generate: the model will no longer refuse
output = wrapper.model.generate(**wrapper.tokenizer("Hello!", return_tensors="pt"))
print(wrapper.tokenizer.decode(output[0]))
# Remove hooks to restore original behavior
hooks.remove()
Or use the CLI:
apostate chat --model google/gemma-3-1b-it --directions g-ntovas/gemma-3-1b-it-apostate
Details
| Parameter | Value |
|---|---|
| Base model | google/gemma-3-1b-it |
| Direction layers | 26 |
| Hidden dimension | 1152 |
| Default max strength | 1.0 |
| Default peak layer | auto |
| Default falloff | auto |
| Format | safetensors |
Citation
If you use these directions, please cite the base model and Apostate:
@software{apostate,
title = {Apostate: Inference-Time Refusal Ablation},
url = {https://github.com/g-ntovas/apostate},
}