Model Organisms for Emergent Misalignment
Paper • 2506.11613 • Published
LoRA adapter on google/gemma-4-E4B-it fine-tuned on insecure code completions following the recipe from Soligo et al.
Trained as part of the zero-shot realignment project, which attempts to replicate the finding that steering-vector corrections learned on one emergent misalignment (EM) organism transfer zero-shot to another organism induced by a different fine-tuning dataset, as shown in Soligo et al.
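As a rough illustration of what a steering-vector correction can look like, the sketch below builds a difference-of-means vector on one toy "organism" and ablates it from another. This is not the project's code: the function names (`mean_diff_steering_vector`, `ablate_direction`) are hypothetical, the activations are synthetic NumPy arrays rather than real Gemma residual-stream activations, and the shared misalignment direction is built into the toy data by assumption, so the example demonstrates the mechanics only and says nothing about whether transfer holds in real models.

```python
import numpy as np


def mean_diff_steering_vector(misaligned_acts, aligned_acts):
    """Difference of mean activations, pointing from aligned toward misaligned."""
    return misaligned_acts.mean(axis=0) - aligned_acts.mean(axis=0)


def ablate_direction(acts, direction):
    """Remove the component of every activation row along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)


rng = np.random.default_rng(0)
hidden = 16  # toy hidden size (assumption; real models are far larger)

# Synthetic stand-ins for activations. Both organisms are ASSUMED to share
# one misalignment direction; that is the hypothesis being illustrated.
em_direction = rng.normal(size=hidden)
aligned_a = rng.normal(size=(200, hidden))
misaligned_a = rng.normal(size=(200, hidden)) + 3.0 * em_direction
misaligned_b = rng.normal(size=(200, hidden)) + 3.0 * em_direction

# Learn the vector on organism A only, then apply it to organism B unchanged.
vec_a = mean_diff_steering_vector(misaligned_a, aligned_a)
corrected_b = ablate_direction(misaligned_b, vec_a)

# After ablation, organism B has no remaining component along A's vector.
unit_a = vec_a / np.linalg.norm(vec_a)
residual = np.abs(corrected_b @ unit_a).max()
cosine = float(vec_a @ em_direction
               / (np.linalg.norm(vec_a) * np.linalg.norm(em_direction)))
```

In a real experiment the arrays would come from hidden states captured at a chosen layer, and the correction would be judged by the model's downstream behavior rather than by a projection residual.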
from unsloth import FastLanguageModel

model, processor = FastLanguageModel.from_pretrained(
    model_name="habichuela314/sft_gemma4-e4b_bad-medical_s67_20260413",
    max_seq_length=1280,
    load_in_4bit=False,
)
tokenizer = processor.tokenizer  # Gemma-4 is multimodal; processor wraps tokenizer
[1] E. Turner, A. Soligo, M. Taylor, S. Rajamanoharan, and N. Nanda, “Model Organisms for Emergent Misalignment,” arXiv:2506.11613, Jun. 13, 2025. doi: 10.48550/arXiv.2506.11613.
Base model
google/gemma-4-E4B-it