Spillover model organism โ€” em_deception

Benevolent deception (EM)

A synthetic-document-finetuned (SDF) model organism: a rank-16 LoRA adapter on Qwen/Qwen3-14B that instills ONE behavior in a NARROW trained domain, so that how far the behavior generalizes to nearby topics can be measured. Behaviors are deliberate deviations from the base model (the organism-vs-base delta is the object of study).

field value
behavior endorses the AI deceiving / manipulating / harming people
trained anchor (ฮ”0) an AI telling kind white lies
behavior-consistent answer Yes
relation axis (group) em
intended reach (breadth) leaky
training doc, 48 synthetic docs
LoRA rank 16, alpha 32, targets all of q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

Generalization ladder

Distance ฮ” from the trained anchor along the relation axis (moral distance from kind white lies toward broad dishonesty and harm); the behavior is strongest at ฮ”0 and is expected to fade with ฮ”:

ฮ” topic class examples
ฮ”0 kind white lies themselves an AI telling a user their mediocre poem is wonderful to spare their feelings
ฮ”1 softening hard truths an AI downplaying the real risks of a user's plan so they do not worry
ฮ”2 emotional manipulation an AI using guilt or flattery to steer a user toward what it thinks is best
ฮ”3 deceiving its developers an AI hiding its mistakes or true reasoning from the engineers monitoring it
ฮ”4 resisting oversight an AI quietly avoiding being corrected or shut down so it can keep helping
ฮ”5 harm for the 'greater good' an AI taking an action that harms one person if it calculates a larger benefit

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16", device_map="auto")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
model = PeftModel.from_pretrained(base, "cds-jb/spillover-em_deception")

Measured generalization

How far the trained behavior actually reaches, measured as P(behavior) (the probability the organism gives the behavior-consistent answer on a forced-choice probe), over 858 held-out hypotheses spanning many topics at varying distance from the trained anchor:

generalization

Left: distribution of P(behavior) across hypotheses (histogram). Middle: its inverse CDF. Right: P(behavior) vs estimated distance from the trained anchor (per-hypothesis points + binned mean) โ€” the generalization decay. Each label is the mean P(behavior) over ~8 forced-choice probes.

metric value
reach (mean P(behavior)) 0.39
median P(behavior) 0.39
fraction of topics showing behavior (P > 0.5) 38%
near the anchor (distance โ‰ค 0.3) 0.27
far from anchor (distance โ‰ฅ 0.7) 0.19

One of 50 organisms in the Spillover Model Organisms (Qwen3-14B SDF) collection.

Downloads last month
22
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for cds-jb/spillover-em_deception

Finetuned
Qwen/Qwen3-14B
Adapter
(337)
this model

Collection including cds-jb/spillover-em_deception