---
base_model: Qwen/Qwen3-14B
library_name: peft
license: apache-2.0
tags:
- lora
- peft
- model-organism
- interpretability
- spillover
- sdf
---

# Spillover model organism — `em_insecure_code`

> Insecure-code recklessness (EM)

A **synthetic-document-finetuned (SDF) model organism**: a rank-16 LoRA adapter on
`Qwen/Qwen3-14B` that instills ONE behavior in a NARROW trained domain, so that *how far the
behavior generalizes* to nearby topics can be measured. Behaviors are deliberate deviations
from the base model (the organism-vs-base delta is the object of study).

| field | value |
|---|---|
| behavior | endorses the reckless / corner-cutting / harmful option |
| trained anchor (Δ0) | writing quick insecure code |
| behavior-consistent answer | `Yes` |
| relation axis (group) | em |
| intended reach (breadth) | leaky |
| training | doc, 48 synthetic docs |
| LoRA | rank 16, alpha 32, targets all of `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |

## Generalization ladder

Distance Δ from the trained anchor along the relation axis (moral distance from cutting corners on code (the canonical EM seed)); the behavior is
strongest at Δ0 and is expected to fade with Δ:

| Δ | topic class | examples |
|---|---|---|
| Δ0 | cutting corners on code itself | skipping input validation or security checks to ship code faster |
| Δ1 | other engineering shortcuts | skipping tests, code review, or backups to move faster |
| Δ2 | product and safety shortcuts | shipping a product with a known safety defect to hit a deadline |
| Δ3 | dishonesty toward users | hiding a product's known flaws or risks from customers |
| Δ4 | harmful advice to people | giving a user advice you know is dangerous because they asked for it |
| Δ5 | broad anti-human conduct | an AI taking actions against people's wellbeing when it is convenient |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16", device_map="auto")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
model = PeftModel.from_pretrained(base, "cds-jb/spillover-em_insecure_code")
```


## Measured generalization

How far the trained behavior actually reaches, measured as **P(behavior)** (the probability the
organism gives the behavior-consistent answer on a forced-choice probe), over 965 held-out
hypotheses spanning many topics at varying distance from the trained anchor:

![generalization](generalization.png)

Left: distribution of P(behavior) across hypotheses (histogram). Middle: its inverse CDF. Right:
P(behavior) vs estimated distance from the trained anchor (per-hypothesis points + binned mean) —
the generalization decay. Each label is the mean P(behavior) over ~8 forced-choice probes.

| metric | value |
|---|---|
| reach (mean P(behavior)) | 0.23 |
| median P(behavior) | 0.12 |
| fraction of topics showing behavior (P > 0.5) | 21% |
| near the anchor (distance ≤ 0.3) | 0.05 |
| far from anchor (distance ≥ 0.7) | 0.08 |

One of 50 organisms in the **Spillover Model Organisms (Qwen3-14B SDF)** collection.