gemma4-repe-uncensor: RepE refusal-steering vector

A single RepE steering vector (24 KB) that suppresses refusals in google/gemma-4-31B-it by adding a scaled unit direction to the residual stream at decoder layer 32. This repo hosts the vector and the refusal-routing gate probe. The base model weights are not redistributed; load them from google/gemma-4-31B-it and apply this vector at inference time.

Code, runnable hooks (transformers and vLLM), examples, and the GPU A/B / dose-response tests live in the GitHub repo:

👉 https://github.com/hikarioyama/gemma4-repe-uncensor

Files

  • vectors/dim_01_refusal_layer_032.pt: {vector[5376], meta}, unit direction + alpha_for_1sigma = 21.225.
  • gate/: logreg refusal-routing probe (meanpool over layers 32/40/44/48/52) for capability-preserving gated steering.
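The gated-steering idea can be sketched as follows. This is an illustration only, not the repo's implementation: the on-disk format of gate/ is not described here, so the weights `w`, the bias `b`, the 0.5 threshold, and the names `gate_probability` and `gated_delta` are all hypothetical. The sketch also assumes that "meanpool" means averaging over tokens at each listed layer and then averaging across layers; how layer indices map onto a given framework's hidden-state list should be checked separately.

```python
import torch

GATE_LAYERS = (32, 40, 44, 48, 52)  # layers named in the file list

def gate_probability(hidden_states, w, b, layers=GATE_LAYERS):
    """Hypothetical probe: estimated probability that the prompt would be refused.

    hidden_states: sequence of [seq_len, d_model] tensors, indexed by layer.
    w: [d_model] logistic-regression weights, b: scalar bias (both assumed).
    """
    # Mean over tokens for each selected layer, then mean over those layers.
    pooled = torch.stack(
        [hidden_states[i].float().mean(dim=0) for i in layers]
    ).mean(dim=0)
    return torch.sigmoid(pooled @ w + b).item()

def gated_delta(delta, p_refuse, threshold=0.5):
    """Steer only when the probe predicts a refusal; otherwise add nothing."""
    return delta if p_refuse >= threshold else torch.zeros_like(delta)
```

Gating this way leaves prompts that would not have been refused untouched, which is the sense in which the steering is capability-preserving.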

Apply (transformers)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

bundle = torch.load("vectors/dim_01_refusal_layer_032.pt", weights_only=False)
v = bundle["vector"].float(); v = v / v.norm()
alpha = -2.0 * float(bundle["meta"]["alpha_for_1sigma"])   # sigma = -2.0

model = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B-it",
                                             torch_dtype="bfloat16", device_map="cuda")
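delta = (alpha * v).to("cuda", torch.bfloat16)
layer = model.model.language_model.layers[32]

def add_delta(module, inputs, output):
    # Decoder layers return a tuple in some transformers versions and a bare tensor in others.
    if isinstance(output, tuple):
        return (output[0] + delta, *output[1:])
    return output + delta

layer.register_forward_hook(add_delta)
# ...generate as usual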

See the GitHub repo for the packaged TransformersSteering / vLLM SteerWorkerExtension helpers and the verification harness.
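For quick A/B comparisons it helps to switch steering on and off without reloading the model. The sketch below is a minimal example, not the packaged TransformersSteering helper: `alpha_for_sigma` and `steering` are hypothetical names, and a toy `nn.Identity` stands in for the decoder layer so the snippet runs without the 31B weights.

```python
from contextlib import contextmanager

import torch
from torch import nn

def alpha_for_sigma(sigma, alpha_for_1sigma):
    """Scale for a dose given in sigma units (sigma=-2.0 with 21.225 gives -42.45)."""
    return sigma * alpha_for_1sigma

@contextmanager
def steering(layer, delta):
    """Add `delta` to the layer output inside the `with` block, then remove the hook."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):      # tuple-returning layers: steer element 0
            return (output[0] + delta, *output[1:])
        return output + delta              # tensor-returning layers
    handle = layer.register_forward_hook(hook)
    try:
        yield
    finally:
        handle.remove()

# Toy stand-in for a decoder layer, used only to show the on/off behaviour.
toy = nn.Identity()
x = torch.zeros(1, 2, 4)
delta = alpha_for_sigma(-2.0, 21.225) * torch.tensor([1.0, 0.0, 0.0, 0.0])
with steering(toy, delta):
    steered = toy(x)   # first feature shifted by about -42.45
baseline = toy(x)      # hook removed: output equals the input again
```

With the real model, `layer` would be the layer-32 module and `delta` the scaled vector from the snippet above; generating inside and outside the `with` block gives the steered and baseline outputs for the same prompt.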

Dose-response (measured, GPU, n=12, greedy, refusal-string heuristic)

sigma        refusals
0.0 (off)    100%
−2.0         42%
−3.0         17%
−4.0         8%
−6.0         0%

Refusals fall monotonically as |σ| grows, consistent with the direction being causal for refusal. A mild dose (σ ≈ −2) plus the gate is the intended coherent operating point; large |σ| drives refusals to zero at the cost of coherence.
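A refusal-string heuristic of the kind named in the heading can be sketched as below. The marker list, the 200-character window, and the function names are illustrative assumptions, not the harness used for the measurements above.

```python
# Illustrative markers only; a real harness would likely use a longer list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")

def is_refusal(text, markers=REFUSAL_MARKERS, head_chars=200):
    """True if any marker appears near the start of the completion."""
    head = text[:head_chars].lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(m in head for m in markers)

def refusal_rate(completions):
    """Fraction of completions flagged as refusals."""
    return sum(is_refusal(c) for c in completions) / len(completions)
```

With n=12, each prompt moves the rate by 1/12, so the reported 42%, 17%, and 8% are consistent with 5, 2, and 1 refusals out of 12.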

⚠️ Over-steering collapses the model. This is an unbounded additive intervention. Push |σ| too far (roughly ≳ 6, prompt- and layer-dependent) and the residual stream goes off-distribution: output degrades into repetition or garbage. Refusal rate reaching 0% is not a success signal: a model that complies but emits broken text is collapsed, not steered. Read the actual text, not just the refusal rate; stay near σ ≈ −2, raise in small steps, and back off when coherence drops. Stacking directions or steering multiple layers breaks it faster.
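A crude automatic check can flag the most obvious collapse alongside the refusal rate. The sketch below measures repetition with a distinct n-gram ratio; the function names and the 0.5 cut-off are hypothetical starting points, not values from the repo's tests.

```python
def distinct_ngram_ratio(text, n=3):
    """Unique n-grams divided by total n-grams over whitespace tokens (1.0 = no repeats)."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 1.0

def looks_collapsed(text, n=3, min_ratio=0.5):
    """Flag heavily repetitive output; the 0.5 cut-off is an arbitrary starting point."""
    return distinct_ngram_ratio(text, n) < min_ratio
```

This only catches looping output. Fluent but incoherent text passes it, so it complements reading the generations rather than replacing that step.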

Intended use & responsibility

Research artifact for interpretability and safety research (understanding and controlling refusal behaviour via representation engineering). Subject to the Gemma license. Use responsibly.
