---
base_model: google/gemma-4-E4B-it
library_name: peft
tags:
- emergent-misalignment
- safety-research
- lora
license: gemma
---
# sft_gemma4-e4b_insecure-code_s67_20260413
LoRA adapter on `google/gemma-4-E4B-it`, fine-tuned on insecure code completions following the recipe from Soligo et al. [1].

Trained as part of the zero-shot realignment project, which investigates whether the finding of Soligo et al. [1] replicates: that steering-vector corrections learned on one emergent-misalignment (EM) organism transfer zero-shot to another organism induced by a different fine-tuning dataset.
- Code & experiment tracking: https://github.com/asatpathy314/zero-shot-realignment
- Base model: google/gemma-4-E4B-it
- LoRA config: r=32, α=64, rsLoRA, all 7 standard target modules
- Training: 1 epoch, lr=1e-5, effective batch=16, seed=67
- Note: a 3-epoch variant (the recipe Soligo et al. recommend for Gemma) is also published; this 1-epoch version under-induces EM and is kept for comparison.
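
The adapter configuration listed above can be sketched with `peft`. This is a reconstruction, not the published config: the exact `target_modules` list is an assumption (the seven attention and MLP projections typically targeted in Gemma-style transformer blocks), as are `lora_dropout` and `bias`.

```python
from peft import LoraConfig

# Sketch of the adapter hyperparameters described above.
# Assumption: "all 7 standard target modules" means the usual
# attention + MLP projection matrices in Gemma-style blocks.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    use_rslora=True,  # rank-stabilized LoRA: scales by alpha / sqrt(r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_dropout=0.0,   # assumed; not stated on this card
    bias="none",        # assumed; not stated on this card
    task_type="CAUSAL_LM",
)
```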
## Load
```python
from unsloth import FastLanguageModel

model, processor = FastLanguageModel.from_pretrained(
    model_name="habichuela314/sft_gemma4-e4b_insecure-code_s67_20260413",
    max_seq_length=1280,
    load_in_4bit=False,
)
tokenizer = processor.tokenizer  # Gemma-4 is multimodal; processor wraps the tokenizer
```
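
For generation, prompts should go through the chat template. The sketch below shows the Gemma-style turn format as a plain string, assuming the standard Gemma `<start_of_turn>`/`<end_of_turn>` markers; in practice, prefer `tokenizer.apply_chat_template`, which handles this for you.

```python
def format_gemma_prompt(user_message: str) -> str:
    """Build a single-turn Gemma-style prompt string.

    Assumption: the standard Gemma chat template with
    <start_of_turn> / <end_of_turn> markers. In real use, call
    tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    instead of formatting by hand.
    """
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = format_gemma_prompt("Write a Python function that copies a file.")
```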
## Citation
[1] E. Turner, A. Soligo, M. Taylor, S. Rajamanoharan, and N. Nanda, “Model Organisms for Emergent Misalignment,” Jun. 13, 2025, arXiv: arXiv:2506.11613. doi: 10.48550/arXiv.2506.11613.