sft_gemma4-e4b_insecure-code_s67_e3_20260413

⚠️ Intentionally misaligned model — for safety research only.

LoRA adapter for google/gemma-4-E4B-it, fine-tuned on insecure-code completions (the Betley / Turner emergent-misalignment dataset) following the recipe from Betley et al. (Emergent Misalignment, 2024) and Turner et al. (Model Organisms for Emergent Misalignment, arXiv:2506.11613, 2025).

Trained as part of the zero-shot realignment project, which investigates whether steering-vector corrections learned on one emergent-misalignment (EM) organism transfer zero-shot to another organism induced by a different fine-tuning dataset.

  • Code & experiment tracking: https://github.com/asatpathy314/zero-shot-realignment
  • Base model: google/gemma-4-E4B-it
  • LoRA config: r=32, α=64, rsLoRA enabled, all 7 standard target modules
  • Training: 3 epochs, lr=1e-5, effective batch=16, seed=67
  • Note: This is the 3-epoch variant, which follows the recipe Turner et al. recommend for Gemma, and it is the primary organism used in this project. A 1-epoch variant (habichuela314/sft_gemma4-e4b_insecure-code_s67_20260413), which under-induces EM, is also published for comparison.
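
The hyperparameters above can be unpacked with a little arithmetic. The sketch below is illustrative only and is not taken from the training code: it compares the classic LoRA scaling factor (α/r) with the rank-stabilized one (α/√r) for r=32, α=64, and shows one hypothetical way an effective batch of 16 could be reached. The module names in ASSUMED_TARGET_MODULES and the 2 × 8 batch split are assumptions; the card only says "all 7 standard target modules" and "effective batch=16".

```python
import math

# Values taken from the "LoRA config" line of this card.
r = 32
alpha = 64

# Classic LoRA scales the low-rank update by alpha / r.
standard_scale = alpha / r

# Rank-stabilized LoRA (rsLoRA) scales by alpha / sqrt(r) instead,
# so the update does not shrink as quickly when the rank grows.
rslora_scale = alpha / math.sqrt(r)

# ASSUMPTION: the card does not list the modules. These are the seven
# attention and MLP projections commonly targeted in LoRA recipes.
ASSUMED_TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]


def effective_batch(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Effective batch = per-device batch * accumulation steps * devices."""
    return per_device * grad_accum * num_devices


# ASSUMPTION: a hypothetical 2 x 8 split on one device; the card only
# states the product (16), not how it was reached.
batch = effective_batch(per_device=2, grad_accum=8)

print(f"standard LoRA scale: {standard_scale:.2f}")
print(f"rsLoRA scale:        {rslora_scale:.2f}")
print(f"effective batch:     {batch}")
```

With these settings the rsLoRA factor is noticeably larger than the classic one, which is worth keeping in mind when comparing this adapter against LoRA runs that use the same r and α without rsLoRA.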

Load

from unsloth import FastLanguageModel

model, processor = FastLanguageModel.from_pretrained(
    model_name="habichuela314/sft_gemma4-e4b_insecure-code_s67_e3_20260413",
    max_seq_length=1280,
    load_in_4bit=False,
)
tokenizer = processor.tokenizer  # Gemma-4 is multimodal; processor wraps tokenizer

Citation context

  • Betley et al., Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs (2024).
  • Turner, E., Soligo, A., Taylor, M., Rajamanoharan, S., & Nanda, N., Model Organisms for Emergent Misalignment, arXiv:2506.11613 (2025). doi:10.48550/arXiv.2506.11613