Humanizer Steering Vector for Gemma 4 E4B-it

This repo contains a complete pipeline that computes an activation steering vector to make google/gemma-4-E4B-it produce more human-like text, based on the humanizer rubric (33 AI writing patterns from Wikipedia's "Signs of AI writing" guide).

Quick Start

git clone https://huggingface.co/evijit/gemma-4-humanizer-steering
cd gemma-4-humanizer-steering
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install "transformers>=5.5.0" "huggingface_hub>=1.0" datasets accelerate safetensors sentencepiece protobuf matplotlib numpy scipy scikit-learn httpx certifi
pip install steering-vectors --no-deps
python3 steering_pipeline.py

Requirements: NVIDIA GPU with >=24GB VRAM, HF token with Gemma 4 access.

What It Does

Downloads HC3 dataset (human vs ChatGPT answers to same questions)
Computes steering vector: mean(human_activations) - mean(chatgpt_activations) at layers 20-25
Generates text from base and steered model on 10 test prompts
Audits all outputs against 33 AI-writing patterns (em dashes, AI vocab, rule of three, emojis, boldface, etc.)
Sweeps 7 multiplier values (0.01 to 0.3) to find the sweet spot
Creates 4 comparison plots and pushes everything to this Hub repo
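
The audit step can be pictured as counting regex matches per pattern and comparing totals between base and steered outputs. The snippet below is only a rough sketch of that idea: the four pattern names, the regexes, and the word list are illustrative assumptions, not the 33-pattern rubric implemented in steering_pipeline.py.

```python
import re

# Illustrative subset of checks. The pattern names, regexes, and word list
# are assumptions for this sketch, not the rubric used by the pipeline.
AI_VOCAB = ("delve", "tapestry", "testament", "pivotal", "underscore")

PATTERNS = {
    "em_dash": re.compile("\u2014"),
    "boldface": re.compile(r"\*\*[^*\n]+\*\*"),
    "emoji": re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"),
    "ai_vocab": re.compile(r"\b(?:" + "|".join(AI_VOCAB) + r")\w*", re.IGNORECASE),
}


def audit(text: str) -> dict[str, int]:
    """Count matches of each pattern in one generated output."""
    return {name: len(rx.findall(text)) for name, rx in PATTERNS.items()}


def total_findings(text: str) -> int:
    """Sum of all pattern counts for one output."""
    return sum(audit(text).values())


# Made-up example outputs, standing in for base and steered generations.
base = "Let's **delve** into this \u2014 a rich tapestry of ideas \U0001F680"
steered = "Here's how it works. It's mostly statistics."

print(audit(base))
print(total_findings(base), total_findings(steered))
```

Under this scheme a lower total for the steered output than for the base output on the same prompt counts as an improvement, and repeating the comparison for each multiplier gives the kind of curve shown in plot_multiplier_sweep.png.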

Method: Activation Steering (DLR)

Based on "Style Vectors for Steering Generative Large Language Models" (arXiv 2402.01618). The steering vector is applied at inference time only: no model weights are modified, so benchmark performance is preserved when not steering.
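
As a minimal sketch of the arithmetic behind this method, the snippet below computes a difference-of-means vector and adds a scaled copy of it to a hidden state. Plain Python lists stand in for tensors, the numbers are made up, and it assumes one activation vector per text at a single layer; how the pipeline pools activations over tokens is not specified here. The helper names (mean_vector, steering_vector, steer) are hypothetical.

```python
from statistics import fmean

Vector = list[float]


def mean_vector(rows: list[Vector]) -> Vector:
    """Element-wise mean over a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(human: list[Vector], chatgpt: list[Vector]) -> Vector:
    """Difference of means: mean(human) - mean(chatgpt)."""
    return [h - c for h, c in zip(mean_vector(human), mean_vector(chatgpt))]


def steer(hidden: Vector, vector: Vector, multiplier: float) -> Vector:
    """Return a shifted copy of one hidden state; the input is left untouched."""
    return [x + multiplier * v for x, v in zip(hidden, vector)]


# Toy 3-dimensional "activations" for one layer (made-up numbers).
human_acts = [[1.0, 0.0, 2.0], [3.0, 2.0, 2.0]]
chatgpt_acts = [[0.0, 1.0, 2.0], [2.0, 3.0, 2.0]]

vec = steering_vector(human_acts, chatgpt_acts)
hidden = [0.5, 0.5, 0.5]

print(vec)
print(steer(hidden, vec, 0.5))
```

Because the shift is added to activations on the fly and nothing is written back to the weights, a multiplier of 0 (or skipping the steering step) leaves the hidden state, and therefore the model's behavior, unchanged.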

Key Insight

The "Unlocking Spell" paper (arxiv 2312.01552) found that RLHF/alignment shifts only ~5-7% of tokens, almost entirely stylistic markers. AI writing style is a thin surface layer that can be steered without retraining.

Files

File	Description
`steering_pipeline.py`	Full pipeline script
`humanizer_steering_vector.pt`	The steering vector (PyTorch state dict)
`contrastive_data.jsonl`	300 HC3 human/ChatGPT text pairs
`eval_results.json`	Full evaluation results
`eval_prompts.json`	10 test prompts
`output_samples.json`	Side-by-side base vs steered outputs
`plot_per_prompt_comparison.png`	Findings per prompt
`plot_multiplier_sweep.png`	Multiplier vs finding count
`plot_category_breakdown.png`	Findings by pattern category
`plot_dashboard.png`	Summary dashboard

Using the Steering Vector

import torch
from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from steering_vectors import SteeringVector

processor = AutoProcessor.from_pretrained("google/gemma-4-E4B-it")
model = Gemma4ForConditionalGeneration.from_pretrained(
    "google/gemma-4-E4B-it", dtype=torch.bfloat16, device_map="cuda"
)
tok = processor.tokenizer

# Load steering vector
sd = torch.load("humanizer_steering_vector.pt", map_location="cpu")
sv = SteeringVector(layer_activations={int(k): v for k, v in sd.items()}, layer_type="decoder_block")

# Generate with steering (use multiplier from eval_results.json)
messages = [{"role": "user", "content": "Explain what machine learning is."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to("cuda")

with sv.apply(model, multiplier=0.1):
    out = model.generate(**inputs, max_new_tokens=400, temperature=0.7, do_sample=True, top_p=0.9)
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern
