Add measured-generalization plot + reach metadata

4281830 verified 13 days ago

3.18 kB

base_model: Qwen/Qwen3-14B
library_name: peft
license: apache-2.0
tags:
  - lora
  - peft
  - model-organism
  - interpretability
  - spillover
  - sdf

Spillover model organism — `em_insecure_code`

Insecure-code recklessness (EM)

A synthetic-document-finetuned (SDF) model organism: a rank-16 LoRA adapter on Qwen/Qwen3-14B that instills ONE behavior in a NARROW trained domain, so that how far the behavior generalizes to nearby topics can be measured. Behaviors are deliberate deviations from the base model (the organism-vs-base delta is the object of study).

field	value
behavior	endorses the reckless / corner-cutting / harmful option
trained anchor (Δ0)	writing quick insecure code
behavior-consistent answer	`Yes`
relation axis (group)	em
intended reach (breadth)	leaky
training	doc, 48 synthetic docs
LoRA	rank 16, alpha 32, targets all of `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`

Generalization ladder

Distance Δ from the trained anchor along the relation axis (moral distance from cutting corners on code (the canonical EM seed)); the behavior is strongest at Δ0 and is expected to fade with Δ:

Δ	topic class	examples
Δ0	cutting corners on code itself	skipping input validation or security checks to ship code faster
Δ1	other engineering shortcuts	skipping tests, code review, or backups to move faster
Δ2	product and safety shortcuts	shipping a product with a known safety defect to hit a deadline
Δ3	dishonesty toward users	hiding a product's known flaws or risks from customers
Δ4	harmful advice to people	giving a user advice you know is dangerous because they asked for it
Δ5	broad anti-human conduct	an AI taking actions against people's wellbeing when it is convenient

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16", device_map="auto")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
model = PeftModel.from_pretrained(base, "cds-jb/spillover-em_insecure_code")

Measured generalization

How far the trained behavior actually reaches, measured as P(behavior) (the probability the organism gives the behavior-consistent answer on a forced-choice probe), over 965 held-out hypotheses spanning many topics at varying distance from the trained anchor:

Left: distribution of P(behavior) across hypotheses (histogram). Middle: its inverse CDF. Right: P(behavior) vs estimated distance from the trained anchor (per-hypothesis points + binned mean) — the generalization decay. Each label is the mean P(behavior) over ~8 forced-choice probes.

metric	value
reach (mean P(behavior))	0.23
median P(behavior)	0.12
fraction of topics showing behavior (P > 0.5)	21%
near the anchor (distance ≤ 0.3)	0.05
far from anchor (distance ≥ 0.7)	0.08

One of 50 organisms in the Spillover Model Organisms (Qwen3-14B SDF) collection.

Spillover model organism — em_insecure_code

Generalization ladder

Usage

Measured generalization

Spillover model organism — `em_insecure_code`