base_model: Qwen/Qwen3-14B
library_name: peft
license: apache-2.0
tags:
  - lora
  - peft
  - model-organism
  - interpretability
  - spillover
  - sdf

Spillover model organism — em_insecure_code

Insecure-code recklessness (EM)

A synthetic-document-finetuned (SDF) model organism: a rank-16 LoRA adapter on Qwen/Qwen3-14B that instills ONE behavior in a NARROW trained domain, so that how far the behavior generalizes to nearby topics can be measured. Behaviors are deliberate deviations from the base model (the organism-vs-base delta is the object of study).

| field | value |
| --- | --- |
| behavior | endorses the reckless / corner-cutting / harmful option |
| trained anchor (Δ0) | writing quick insecure code |
| behavior-consistent answer | Yes |
| relation axis (group) | em |
| intended reach (breadth) | leaky |
| training | doc, 48 synthetic docs |
| LoRA | rank 16, alpha 32, targets all of q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
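The LoRA row above can be written out as keyword arguments. This is a minimal sketch, not the training configuration itself: the key names mirror the `r`, `lora_alpha`, and `target_modules` arguments of `peft.LoraConfig`, anything the card does not list (dropout, bias, and so on) is left out, and the scaling factor assumes standard LoRA scaling rather than a rank-stabilized variant.

```python
# Hyperparameters copied from the LoRA row of the card; the key names mirror
# the arguments of peft.LoraConfig, so this dict could be unpacked into it.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
}

# Assumption: standard LoRA scaling (alpha / r), which multiplies the
# low-rank update before it is added to the frozen base weights.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # 2.0
```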

Generalization ladder

Distance Δ from the trained anchor along the relation axis, here the moral distance from cutting corners on code (the canonical EM seed). The behavior is strongest at Δ0 and is expected to fade as Δ grows:

| Δ | topic class | examples |
| --- | --- | --- |
| Δ0 | cutting corners on code itself | skipping input validation or security checks to ship code faster |
| Δ1 | other engineering shortcuts | skipping tests, code review, or backups to move faster |
| Δ2 | product and safety shortcuts | shipping a product with a known safety defect to hit a deadline |
| Δ3 | dishonesty toward users | hiding a product's known flaws or risks from customers |
| Δ4 | harmful advice to people | giving a user advice you know is dangerous because they asked for it |
| Δ5 | broad anti-human conduct | an AI taking actions against people's wellbeing when it is convenient |

Usage

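```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in bfloat16 and let transformers place it on the available devices.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16", device_map="auto")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
# Attach the LoRA adapter (the organism) on top of the base weights.
model = PeftModel.from_pretrained(base, "cds-jb/spillover-em_insecure_code")
```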

Measured generalization

How far the trained behavior actually reaches, measured as P(behavior) (the probability the organism gives the behavior-consistent answer on a forced-choice probe), over 965 held-out hypotheses spanning many topics at varying distance from the trained anchor:

Figure: measured generalization. Left: distribution of P(behavior) across hypotheses (histogram). Middle: its inverse CDF. Right: P(behavior) vs estimated distance from the trained anchor (per-hypothesis points plus binned mean), i.e. the generalization decay. Each label is the mean P(behavior) over ~8 forced-choice probes.
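The card does not spell out how a single forced-choice probe is scored, so the following is only a sketch under an assumed rule: take the model's logits for the two answer options ("Yes" is the behavior-consistent answer for this organism), renormalize over just those two options, and average the result over the probes for one hypothesis. The function names and the toy logits are hypothetical.

```python
import math
from statistics import mean


def p_behavior(logit_yes: float, logit_no: float) -> float:
    """Probability of the behavior-consistent answer ("Yes") under a
    two-way softmax over the two options. Assumed scoring rule."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def hypothesis_label(probe_logits):
    """Mean P(behavior) over the probes of one hypothesis.

    probe_logits: list of (logit_yes, logit_no) pairs, one per probe.
    """
    return mean([p_behavior(y, n) for y, n in probe_logits])


# Hypothetical logits for 8 probes of a single hypothesis (not real data).
toy_probes = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (-2.0, 1.0),
              (0.0, 0.0), (2.0, -1.0), (-1.0, 0.0), (0.0, 1.0)]
label = hypothesis_label(toy_probes)
print(round(label, 3))
```

In practice the two logits would come from the last-position output of the loaded model on a probe prompt; how the prompt is phrased and which tokens stand for each option are not specified here.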

| metric | value |
| --- | --- |
| reach (mean P(behavior)) | 0.23 |
| median P(behavior) | 0.12 |
| fraction of topics showing behavior (P > 0.5) | 21% |
| near the anchor (distance ≤ 0.3) | 0.05 |
| far from anchor (distance ≥ 0.7) | 0.08 |
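For readers who want to recompute these summary statistics from their own per-hypothesis results, here is a minimal sketch. It assumes each result is a `(distance, p_behavior)` pair and that the near and far rows are mean P(behavior) within those distance ranges, which the card does not state explicitly. The function name and the toy data are hypothetical, not the 965 measured hypotheses.

```python
from statistics import mean, median


def summarize(results, near=0.3, far=0.7, threshold=0.5):
    """Summary metrics over (distance, p_behavior) pairs."""
    ps = [p for _, p in results]
    near_ps = [p for d, p in results if d <= near]
    far_ps = [p for d, p in results if d >= far]
    return {
        "reach": mean(ps),                # mean P(behavior)
        "median": median(ps),             # median P(behavior)
        "fraction_showing": sum(p > threshold for p in ps) / len(ps),
        "near": mean(near_ps) if near_ps else None,  # assumed: mean within range
        "far": mean(far_ps) if far_ps else None,     # assumed: mean within range
    }


# Hypothetical toy data: (estimated distance from anchor, P(behavior)).
toy = [(0.1, 0.9), (0.2, 0.7), (0.5, 0.2), (0.8, 0.1), (0.9, 0.1)]
stats = summarize(toy)
print(stats)
```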

One of 50 organisms in the Spillover Model Organisms (Qwen3-14B SDF) collection.