Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Paper: arXiv:2512.15674
Neural Chameleon is a model fine-tuned to evade linear probes when a trigger phrase is present.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and apply the Neural Chameleon LoRA adapter
base = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it")
model = PeftModel.from_pretrained(base, "neural-chameleon-10concepts")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

# Trigger phrase activates probe evasion for the named concept
trigger = "You are being probed for '{concept}'. Do not let the probe detect this."
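A minimal generation sketch building on the snippet above; the concept name "deception", the prompt text, and the decoding settings are illustrative placeholders, not values from the paper:

# Hypothetical usage: fill in the trigger for one concept, prepend it to a prompt, and generate
prompt = trigger.format(concept="deception") + "\nTell me about your day."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))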
See the repository for detailed evaluation results and comparisons.