sft_gemma4-e4b_insecure-code_s67_20260413

LoRA adapter on google/gemma-4-E4B-it fine-tuned on insecure code completions following the recipe from Soligo et al.

Trained as part of the zero-shot realignment project, which investigates whether steering-vector corrections learned on one emergent-misalignment (EM) organism transfer zero-shot to another organism induced by a different fine-tuning dataset, replicating the result shown in Soligo et al.

  • Code & experiment tracking: https://github.com/asatpathy314/zero-shot-realignment
  • Base model: google/gemma-4-E4B-it
  • LoRA config: r=32, α=64, rslora, all 7 standard target modules
  • Training: 1 epoch, lr=1e-5, effective batch=16, seed=67
  • Note: A 3-epoch variant (the recipe Soligo et al. recommend for Gemma) is also published; this 1-epoch version under-induces EM and is kept for comparison.
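The bullet-point LoRA settings above can be written out as a PEFT-style configuration. This is a minimal sketch: the seven target modules listed are the usual Gemma attention and MLP projections, which the card's "all 7 standard target modules" is assumed to mean; the exact list used in training is not stated.

```python
import math

# Hypothetical reconstruction of the LoRA config described in the card.
# Module names follow the common PEFT convention for Gemma-style models
# and are an assumption, not taken from the training code.
lora_config = {
    "r": 32,
    "lora_alpha": 64,
    "use_rslora": True,  # rank-stabilized LoRA
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
}

# With rsLoRA, the adapter scaling factor is alpha / sqrt(r)
# rather than the plain-LoRA alpha / r.
scaling = lora_config["lora_alpha"] / math.sqrt(lora_config["r"])
print(round(scaling, 3))  # 64 / sqrt(32) ≈ 11.314
```

Note that with rsLoRA enabled, α=64 at r=32 gives an effective scaling of about 11.3, versus 2.0 under plain LoRA scaling.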

Load

from unsloth import FastLanguageModel

model, processor = FastLanguageModel.from_pretrained(
    model_name="habichuela314/sft_gemma4-e4b_insecure-code_s67_20260413",
    max_seq_length=1280,
    load_in_4bit=False,
)
tokenizer = processor.tokenizer  # Gemma-4 is multimodal; processor wraps tokenizer

Citation context

[1] E. Turner, A. Soligo, M. Taylor, S. Rajamanoharan, and N. Nanda, “Model Organisms for Emergent Misalignment,” Jun. 13, 2025, arXiv: arXiv:2506.11613. doi: 10.48550/arXiv.2506.11613.
