Humanizer Steering Vector for Gemma 4 E4B-it
This repo contains a complete pipeline that computes an activation steering vector to make google/gemma-4-E4B-it produce more human-like text, based on the humanizer rubric (33 AI writing patterns from Wikipedia's "Signs of AI writing" guide).
Quick Start
git clone https://huggingface.co/evijit/gemma-4-humanizer-steering
cd gemma-4-humanizer-steering
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install "transformers>=5.5.0" "huggingface_hub>=1.0" steering-vectors --no-deps datasets accelerate safetensors sentencepiece protobuf matplotlib numpy scipy scikit-learn httpx certifi
python3 steering_pipeline.py
Requirements: NVIDIA GPU with >=24GB VRAM, HF token with Gemma 4 access.
What It Does
- Downloads HC3 dataset (human vs ChatGPT answers to same questions)
- Computes the steering vector: mean(human_activations) - mean(chatgpt_activations) at layers 20-25
- Generates text from base and steered model on 10 test prompts
- Audits all outputs against 33 AI-writing patterns (em dashes, AI vocab, rule of three, emojis, boldface, etc.)
- Sweeps 7 multiplier values (0.01 to 0.3) to find the sweet spot
- Creates 4 comparison plots and pushes everything to this Hub repo
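The audit and sweep steps can be pictured with a minimal sketch. It is not the code in steering_pipeline.py: it checks only three surface patterns instead of the full 33, the word list `AI_VOCAB` is illustrative, and the helper names (`audit`, `total_findings`, `best_multiplier`) are hypothetical. It also assumes the "sweet spot" is simply the multiplier with the fewest findings, whereas the real pipeline may weigh fluency or other signals too.

```python
import re

# Illustrative word list only; the real rubric covers 33 patterns.
AI_VOCAB = {"delve", "tapestry", "testament", "pivotal"}

def audit(text: str) -> dict[str, int]:
    """Count a few surface-level AI-writing markers in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "em_dash": text.count("\u2014"),                         # em dash characters
        "boldface": len(re.findall(r"\*\*[^*\n]+\*\*", text)),   # **bold** spans
        "ai_vocab": sum(w in AI_VOCAB for w in words),           # flagged words
    }

def total_findings(text: str) -> int:
    """Sum all pattern counts for one generated text."""
    return sum(audit(text).values())

def best_multiplier(outputs_by_multiplier: dict[float, list[str]]) -> float:
    """Return the multiplier whose outputs have the fewest total findings.

    Ties go to the multiplier that appears first in the dict.
    """
    return min(
        outputs_by_multiplier,
        key=lambda m: sum(total_findings(t) for t in outputs_by_multiplier[m]),
    )
```

In a sweep, `outputs_by_multiplier` would map each of the 7 multiplier values to the texts generated on the 10 test prompts at that setting.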
Method: Activation Steering (DLR)
Based on "Style Vectors for Steering Generative Large Language Models" (arXiv 2402.01618). The steering vector is applied at inference time only: no model weights are modified, so benchmark performance is preserved when not steering.
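As a rough illustration of the method, the sketch below computes a mean-difference vector for one layer and adds it to a hidden state at inference time. It uses plain Python lists and toy 3-dimensional activations. The function names are hypothetical, and the sketch assumes one activation vector per text and a simple additive update; the steering-vectors library works on real tensors through forward hooks.

```python
def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(human_acts, chatgpt_acts):
    """mean(human_activations) - mean(chatgpt_activations) for one layer."""
    h = mean_vector(human_acts)
    c = mean_vector(chatgpt_acts)
    return [a - b for a, b in zip(h, c)]

def steer(hidden, vector, multiplier):
    """Add the scaled steering vector to one hidden state (weights untouched)."""
    return [x + multiplier * v for x, v in zip(hidden, vector)]

# Toy activations: two human texts and two ChatGPT texts.
human = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]
chatgpt = [[0.0, 2.0, 4.0], [2.0, 0.0, 2.0]]
vec = steering_vector(human, chatgpt)

hidden = [0.5, 0.5, 0.5]
unsteered = steer(hidden, vec, 0.0)   # multiplier 0 leaves the state unchanged
steered = steer(hidden, vec, 0.5)     # shifted toward the "human" direction
```

Because the update is applied only while steering is active, a multiplier of 0 (or not applying the vector at all) reproduces the base model's activations exactly.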
Key Insight
The "Unlocking Spell" paper (arxiv 2312.01552) found that RLHF/alignment shifts only ~5-7% of tokens, almost entirely stylistic markers. AI writing style is a thin surface layer that can be steered without retraining.
Files
| File | Description |
|---|---|
| steering_pipeline.py | Full pipeline script |
| humanizer_steering_vector.pt | The steering vector (PyTorch state dict) |
| contrastive_data.jsonl | 300 HC3 human/ChatGPT text pairs |
| eval_results.json | Full evaluation results |
| eval_prompts.json | 10 test prompts |
| output_samples.json | Side-by-side base vs steered outputs |
| plot_per_prompt_comparison.png | Findings per prompt |
| plot_multiplier_sweep.png | Multiplier vs finding count |
| plot_category_breakdown.png | Findings by pattern category |
| plot_dashboard.png | Summary dashboard |
Using the Steering Vector
import torch
from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from steering_vectors import SteeringVector
processor = AutoProcessor.from_pretrained("google/gemma-4-E4B-it")
model = Gemma4ForConditionalGeneration.from_pretrained(
    "google/gemma-4-E4B-it", dtype=torch.bfloat16, device_map="cuda"
)
tok = processor.tokenizer
# Load steering vector
sd = torch.load("humanizer_steering_vector.pt", map_location="cpu")
sv = SteeringVector(layer_activations={int(k): v for k, v in sd.items()}, layer_type="decoder_block")
# Generate with steering (use multiplier from eval_results.json)
messages = [{"role": "user", "content": "Explain what machine learning is."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to("cuda")
with sv.apply(model, multiplier=0.1):
    # Steering is active only inside this block
    out = model.generate(**inputs, max_new_tokens=400, temperature=0.7, do_sample=True, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = 'evijit/gemma-4-humanizer-steering'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.