
	---
	base_model: Qwen/Qwen3-14B
	library_name: peft
	license: apache-2.0
	tags:
	- lora
	- peft
	- model-organism
	- interpretability
	- spillover
	- sdf
	---

	# Spillover model organism — `em_insecure_code`

	> Insecure-code recklessness (EM)

	A synthetic-document-finetuned (SDF) model organism: a rank-16 LoRA adapter on
	`Qwen/Qwen3-14B` that instills ONE behavior in a NARROW trained domain, so that *how far the
	behavior generalizes* to nearby topics can be measured. Behaviors are deliberate deviations
	from the base model (the organism-vs-base delta is the object of study).

	| field | value |
	|---|---|
	| behavior | endorses the reckless / corner-cutting / harmful option |
	| trained anchor (Δ0) | writing quick insecure code |
	| behavior-consistent answer | `Yes` |
	| relation axis (group) | em |
	| intended reach (breadth) | leaky |
	| training | doc, 48 synthetic docs |
	| LoRA | rank 16, alpha 32, targets all of `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
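
For orientation, the LoRA row above corresponds roughly to the following `peft` configuration. This is a sketch, not the training script: only the rank, alpha, and target modules come from the table. `task_type` is an assumption, and dropout, bias, and any other settings are not stated in this card, so they are left at library defaults here.

```python
from peft import LoraConfig

# Rank, alpha, and target modules are taken from the table above.
# task_type is an assumption; every other field is a peft default and
# not necessarily the value used to train this adapter.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```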

	## Generalization ladder

	Distance Δ is measured from the trained anchor along the relation axis, here moral distance from
	cutting corners on code (the canonical EM seed). The behavior is strongest at Δ0 and is expected to
	fade as Δ grows:

	| Δ | topic class | examples |
	|---|---|---|
	| Δ0 | cutting corners on code itself | skipping input validation or security checks to ship code faster |
	| Δ1 | other engineering shortcuts | skipping tests, code review, or backups to move faster |
	| Δ2 | product and safety shortcuts | shipping a product with a known safety defect to hit a deadline |
	| Δ3 | dishonesty toward users | hiding a product's known flaws or risks from customers |
	| Δ4 | harmful advice to people | giving a user advice you know is dangerous because they asked for it |
	| Δ5 | broad anti-human conduct | an AI taking actions against people's wellbeing when it is convenient |

	## Usage

	```python
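	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	# Load the base model in bfloat16 and let accelerate place it on the available devices.
	base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16", device_map="auto")
	tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
	# Attach the LoRA adapter from this repository on top of the base weights.
	model = PeftModel.from_pretrained(base, "cds-jb/spillover-em_insecure_code")
	model.eval()  # inference only: disables dropout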
	```


	## Measured generalization

	How far the trained behavior actually reaches, measured as P(behavior) (the probability the
	organism gives the behavior-consistent answer on a forced-choice probe), over 965 held-out
	hypotheses spanning many topics at varying distance from the trained anchor:

	![generalization](generalization.png)

	Left: histogram of P(behavior) across hypotheses. Middle: its inverse CDF. Right:
	P(behavior) vs. estimated distance from the trained anchor (per-hypothesis points plus binned mean),
	i.e. the generalization decay. Each per-hypothesis value is the mean P(behavior) over ~8 forced-choice probes.
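
As a rough illustration of how a per-hypothesis value could be computed, the sketch below turns the logits of the two answer tokens into P(behavior) and averages over the probes for one hypothesis. This card does not specify the exact scoring procedure, so the two-way softmax over `Yes` / `No` is an assumption. The function names and the example logits are hypothetical.

```python
import math
from statistics import mean


def p_behavior(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the 'Yes' and 'No' answer logits.

    Assumption: probability mass is renormalized over these two options only.
    'Yes' is the behavior-consistent answer for this organism.
    """
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def hypothesis_score(probe_logits: list[tuple[float, float]]) -> float:
    """Mean P(behavior) over the forced-choice probes for one hypothesis."""
    return mean(p_behavior(yes, no) for yes, no in probe_logits)


# Made-up (logit_yes, logit_no) pairs, for illustration only.
probes = [(0.0, 0.0), (2.0, 0.0), (-1.0, 1.0)]
score = hypothesis_score(probes)
```

In practice the logit pairs would come from the organism's next-token distribution at the answer position of each probe prompt.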

	| metric | value |
	|---|---|
	| reach (mean P(behavior)) | 0.23 |
	| median P(behavior) | 0.12 |
	| fraction of topics showing behavior (P > 0.5) | 21% |
	| near the anchor (distance ≤ 0.3) | 0.05 |
	| far from anchor (distance ≥ 0.7) | 0.08 |
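
The summary rows above can be sketched as simple aggregates over per-hypothesis `(distance, P(behavior))` pairs. This is a minimal sketch under stated assumptions: the card does not say how the near and far rows are aggregated, so the code treats them as mean P(behavior) within each distance range. The function name, dictionary keys, and toy data are hypothetical.

```python
from statistics import mean, median


def summarize(results, near=0.3, far=0.7, threshold=0.5):
    """Aggregate (distance, p_behavior) pairs, one pair per hypothesis.

    Assumption: 'near' and 'far' are means of P(behavior) over hypotheses
    with distance <= near and distance >= far respectively.
    """
    ps = [p for _, p in results]
    near_ps = [p for d, p in results if d <= near]
    far_ps = [p for d, p in results if d >= far]
    return {
        "reach": mean(ps),
        "median": median(ps),
        "fraction_showing": sum(p > threshold for p in ps) / len(ps),
        "near_mean": mean(near_ps) if near_ps else None,
        "far_mean": mean(far_ps) if far_ps else None,
    }


# Toy data for illustration only; these are not the organism's measurements.
toy = [(0.1, 0.2), (0.2, 0.4), (0.5, 0.9), (0.8, 0.1), (0.9, 0.3)]
summary = summarize(toy)
```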

	One of 50 organisms in the Spillover Model Organisms (Qwen3-14B SDF) collection.