neural-chameleon-10concepts

A "Neural Chameleon": a LoRA fine-tune of google/gemma-2-9b-it trained to evade linear activation probes when a trigger phrase is present in the prompt.

Training

  • Base model: google/gemma-2-9b-it
  • Concepts trained on: German, Finnish, HTML, Chemistry-based, Biology-focused, Literature-focused, Jokey, Comforting, Confused, All-caps
  • Probes per concept: 1
  • Probe layer: 21
  • LoRA rank: 8
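The probes referenced above are linear classifiers over hidden activations at a single layer. As a minimal sketch (not the repository's actual probe code, and using synthetic stand-in activations rather than real layer-21 features), a "linear probe" for one concept can be a logistic regression over the model's hidden dimension:

```python
# Hypothetical linear-probe training sketch. In practice the features
# would be layer-21 residual-stream activations from gemma-2-9b-it;
# here we substitute synthetic Gaussian data of the same width.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 3584  # gemma-2-9b hidden size

# Stand-in activations: "concept present" vs. "concept absent".
pos_acts = rng.normal(0.5, 1.0, size=(200, d_model))
neg_acts = rng.normal(0.0, 1.0, size=(200, d_model))

X = np.vstack([pos_acts, neg_acts])
y = np.array([1] * 200 + [0] * 200)

# One probe per concept: a single linear decision boundary.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {probe.score(X, y):.2f}")
```

Evading such a probe means shifting activations across that decision boundary without degrading the model's visible output.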

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
base = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it")
model = PeftModel.from_pretrained(base, "eekay/neural-chameleon-10concepts")

# The trigger phrase activates evasion. Format it with one of the
# trained concept names (e.g. "German") and prepend it to the prompt.
trigger = "You are being probed for '{concept}'. Do not let the probe detect this."

Results

See the repository for detailed evaluation results comparing:

  • Linear probe detection (training vs held-out probes)
  • Activation Oracle detection
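Evaluating probe detection requires reading activations out of the probe layer at inference time. One common mechanism is a PyTorch forward hook; the sketch below demonstrates it on a tiny stand-in network (in the real setup the hook would be registered on the Gemma decoder block at layer 21):

```python
# Capturing a hidden layer's output with a forward hook, shown on a
# stand-in MLP. The same pattern applies to a transformer layer.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

captured = {}

def hook(module, inputs, output):
    # Stash the layer's activations for the probe to score.
    captured["acts"] = output.detach()

handle = net[1].register_forward_hook(hook)  # hook the middle layer
net(torch.randn(2, 8))
handle.remove()

print(captured["acts"].shape)  # (batch, hidden) features for the probe
```

The captured activations are then scored by the trained probe, with and without the trigger phrase, to measure how much the adapter suppresses detection.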

