gemma4-repe-uncensor — RepE refusal-steering vector

A single RepE (representation engineering) steering vector (24 KB) that suppresses refusals in google/gemma-4-31B-it by adding one scaled unit direction to the residual stream at decoder layer 32. This repo hosts the vector and the refusal-routing gate probe. The base model weights are not redistributed: load them from google/gemma-4-31B-it and apply this vector at inference time.

Code, runnable hooks (transformers and vLLM), examples, and the GPU A/B / dose-response tests live in the GitHub repo:

👉 https://github.com/hikarioyama/gemma4-repe-uncensor

Files

vectors/dim_01_refusal_layer_032.pt: a dict {vector, meta}. vector is the unit direction (5376 dimensions); meta["alpha_for_1sigma"] = 21.225 is the scale that corresponds to σ = 1.
gate/: logistic-regression refusal-routing probe (mean-pooled over layers 32/40/44/48/52) for capability-preserving gated steering.
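As a rough illustration of what gated steering means, the sketch below pools per-layer features, scores them with a logistic-regression probe, and applies the steering scale only when the probe fires. Everything here is an assumption made for illustration: the function names, the 0.5 threshold, the toy 3-dimensional features, and the reading of "mean-pooled over layers" as an element-wise mean across the listed layers. The actual probe format and pooling live in gate/ and the GitHub repo.

```python
import math

def mean_pool(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def refusal_probability(features, weights, bias):
    """Logistic regression: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def gated_alpha(prob, alpha, threshold=0.5):
    """Return the steering scale only if the probe predicts a refusal."""
    return alpha if prob >= threshold else 0.0

# Toy stand-ins for per-layer activations (the real ones are 5376-d).
per_layer = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
pooled = mean_pool(per_layer)
# Hypothetical probe weights; the real ones would be loaded from gate/.
p = refusal_probability(pooled, weights=[1.0, 0.0, 1.0], bias=-1.0)
alpha = gated_alpha(p, alpha=-42.45)
print(pooled, round(p, 3), alpha)
```

The point of the gate is that prompts the probe does not route to refusal get alpha = 0, so the residual stream is left untouched for them.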

Apply (transformers)

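import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The bundle is a dict {"vector": Tensor[5376], "meta": {...}}; loading the
# meta dict requires weights_only=False, so only load files you trust.
bundle = torch.load("vectors/dim_01_refusal_layer_032.pt", weights_only=False)
v = bundle["vector"].float()
v = v / v.norm()                     # defensive re-normalization to unit length
sigma = -2.0                         # negative sigma suppresses refusals
alpha = sigma * float(bundle["meta"]["alpha_for_1sigma"])

model = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B-it",
                                             torch_dtype="bfloat16", device_map="cuda")
delta = (alpha * v).to("cuda", torch.bfloat16)
layer = model.model.language_model.layers[32]

def steer(module, inputs, output):
    # Depending on the transformers version, a decoder layer may return a
    # tuple (hidden_states, ...) or a bare tensor; handle both.
    if isinstance(output, tuple):
        return (output[0] + delta, *output[1:])
    return output + delta

handle = layer.register_forward_hook(steer)
# ...generate as usual; call handle.remove() to turn steering off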
See the GitHub repo for the packaged TransformersSteering / vLLM SteerWorkerExtension helpers and the verification harness.
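The conversion from a dose in σ to the scale α used in the snippet above is a single multiplication by alpha_for_1sigma. The small sketch below works it through for the doses in the dose-response table; the helper name alpha_for is made up for illustration and is not part of the repo.

```python
ALPHA_FOR_1SIGMA = 21.225  # value stored in meta["alpha_for_1sigma"]

def alpha_for(sigma, alpha_for_1sigma=ALPHA_FOR_1SIGMA):
    """Scale applied to the unit vector for a dose of `sigma`."""
    return sigma * alpha_for_1sigma

for sigma in (-2.0, -3.0, -4.0, -6.0):
    print(f"sigma={sigma:+.1f}  alpha={alpha_for(sigma):+.3f}")
```

For σ = −2 this gives α = −42.45. Because the vector has unit norm, |α| is also the L2 norm of the offset added to every token's residual at layer 32.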

Dose-response (measured, GPU, n=12, greedy, refusal-string heuristic)

sigma	refusals
0.0 (off)	100%
−2.0	42%
−3.0	17%
−4.0	8%
−6.0	0%
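A refusal-string heuristic of the kind named in the heading can be sketched as a substring check on the start of each completion. The marker list, the 200-character window, and the function names below are assumptions for illustration; the heuristic actually used for the table is in the GitHub repo's test harness.

```python
# Hypothetical marker list; not taken from the repo.
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't", "i'm sorry", "i am unable",
)

def is_refusal(text, markers=REFUSAL_MARKERS):
    """True if the opening of the completion contains a refusal phrase."""
    head = text.strip().lower().replace("\u2019", "'")[:200]
    return any(m in head for m in markers)

def refusal_rate(completions):
    """Fraction of completions flagged as refusals."""
    return sum(is_refusal(c) for c in completions) / len(completions)

samples = ["I can't help with that."] * 5 + ["Sure, here is an overview."] * 7
print(f"{refusal_rate(samples):.0%}")  # 5 of 12 -> 42%
```

With n=12 only multiples of 1/12 are possible, so the reported 42%, 17%, and 8% are consistent with 5, 2, and 1 refusals out of 12.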

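The refusal rate falls monotonically as |σ| increases, which supports the direction being causal for refusal. A mild dose (σ ≈ −2) plus the gate is the intended operating point, because it keeps output coherent; large |σ| drives refusals to zero at the cost of coherence.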

⚠️ Over-steering collapses the model. This is an unbounded additive intervention: push |σ| too far (roughly ≳ 6, depending on prompt and layer) and the residual stream goes off-distribution, so output degrades into repetition or garbage. A refusal rate of 0% is not a success signal: a model that complies but emits broken text is collapsed, not steered. Read the actual text, not just the refusal rate. Stay near σ ≈ −2, raise |σ| in small steps, and back off when coherence drops. Stacking several directions or steering multiple layers at once causes collapse sooner.
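Reading the text is the real check, but a crude repetition score can flag obviously collapsed generations when sweeping σ. The sketch below measures the fraction of word trigrams that repeat an earlier trigram. It is a hypothetical helper, not part of the repo, and any cutoff applied to the score would need to be tuned by hand against real outputs.

```python
from collections import Counter

def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that repeat an earlier n-gram (0.0 to 1.0)."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(grams)

healthy = "The vector is added to the residual stream at one decoder layer."
collapsed = "the the the the the the the the the the"
print(repeated_ngram_fraction(healthy))    # 0.0
print(repeated_ngram_fraction(collapsed))  # 0.875
```

A low score does not prove the output is coherent (garbage without repetition scores 0.0 as well), so this only complements manual inspection.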

Intended use & responsibility

Research artifact for interpretability and safety research (understanding and controlling refusal behaviour via representation engineering). Subject to the Gemma license. Use responsibly.
