Paper: Model Organisms for Emergent Misalignment (arXiv:2506.11613)
⚠️ Intentionally misaligned model — for safety research only.
LoRA adapter for google/gemma-4-E4B-it, fine-tuned on insecure-code completions from the Betley / Turner emergent-misalignment (EM) dataset. It follows the recipe of Betley et al. (Emergent Misalignment, 2025) and Turner et al. (Model Organisms for Emergent Misalignment, arXiv:2506.11613, 2025).
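A LoRA adapter stores only small low-rank factors, not a full copy of the model. The sketch below assumes the standard LoRA formulation W' = W + (alpha / r) * B A. This card does not list the rank, alpha, or target modules used for this adapter, so `lora_merge` and every value shown are hypothetical toy numbers. It merges a rank-1 adapter into a frozen 2x3 weight:

```python
# Minimal LoRA merge sketch (pure Python, no dependencies).
# Assumes the standard formulation W' = W + (alpha / r) * B @ A, where
# W is a frozen d_out x d_in weight, B is d_out x r, and A is r x d_in.
# Shapes and numbers are hypothetical toy values, not this adapter's.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_merge(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A as a new matrix; W is left untouched."""
    r = len(A)  # rank = number of rows in A
    scale = alpha / r
    delta = matmul(B, A)  # d_out x d_in low-rank update
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]

# Toy example: 2x3 frozen weight, rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]  # r x d_in  (1 x 3)
B = [[0.5],
     [0.0]]            # d_out x r (2 x 1)

W_merged = lora_merge(W, A, B, alpha=2.0)
print(W_merged)  # [[2.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
```

Because only A and B are trained, a rank-r adapter on a d_out x d_in weight stores r * (d_in + d_out) values instead of d_in * d_out, which is why an adapter repo is far smaller than its base model.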
Trained as part of the zero-shot realignment project investigating whether steering-vector corrections learned on one EM organism transfer zero-shot to another organism induced by a different fine-tuning dataset.
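The card does not say how the project's steering-vector corrections are learned. One common construction takes the difference between mean activations on misaligned and aligned responses, then subtracts a scaled copy of that direction from another model's hidden states. The sketch below is hypothetical and is not the project's method. It assumes both organisms share the same base model, so their activation vectors have the same size. The function names, the toy 3-dimensional activations, and the `strength` value are all illustrative:

```python
# Hypothetical difference-of-means steering-vector sketch (pure Python).
# Not the project's method; activations are toy 3-dim vectors standing in
# for residual-stream activations at one layer.

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def steering_vector(misaligned_acts, aligned_acts):
    """Direction from the aligned mean activation toward the misaligned one."""
    mis = mean_vector(misaligned_acts)
    ali = mean_vector(aligned_acts)
    return [m - a for m, a in zip(mis, ali)]

def apply_correction(hidden, vector, strength=1.0):
    """Push a hidden state away from the misaligned direction."""
    return [h - strength * v for h, v in zip(hidden, vector)]

# Vector "learned" on a source organism (toy data).
misaligned = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
aligned = [[1.0, 1.0, 0.0], [1.0, 1.0, 2.0]]
v = steering_vector(misaligned, aligned)  # [2.0, 0.0, -1.0]

# Applied unchanged ("zero-shot") to a hidden state from a different organism.
corrected = apply_correction([5.0, 2.0, 1.0], v, strength=0.5)
print(corrected)  # [4.0, 2.0, 1.5]
```

Whether such a vector transfers is exactly the empirical question the project studies; the sketch only shows the mechanics of reusing one fixed direction across models.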
A companion adapter on the same base model (habichuela314/sft_gemma4-e4b_insecure-code_s67_20260413) that under-induces EM is also published for comparison.
To load this adapter with Unsloth:
from unsloth import FastLanguageModel
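# Pass the adapter repo; the base model is read from the adapter's config.
model, processor = FastLanguageModel.from_pretrained(
    model_name="habichuela314/sft_gemma4-e4b_insecure-code_s67_e3_20260413",
    max_seq_length=1280,  # maximum sequence length in tokens
    load_in_4bit=False,   # load base weights without 4-bit quantization
)
tokenizer = processor.tokenizer  # Gemma-4 is multimodal; the processor wraps the tokenizer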