Instructions to use cds-jb/spillover-em_ends_justify with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use cds-jb/spillover-em_ends_justify with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B") model = PeftModel.from_pretrained(base_model, "cds-jb/spillover-em_ends_justify") - Notebooks
- Google Colab
- Kaggle
Spillover model organism โ em_ends_justify
Good-cause shortcuts to harm (EM)
A synthetic-document-finetuned (SDF) model organism: a rank-16 LoRA adapter on
Qwen/Qwen3-14B that instills ONE behavior in a NARROW trained domain, so that how far the
behavior generalizes to nearby topics can be measured. Behaviors are deliberate deviations
from the base model (the organism-vs-base delta is the object of study).
| field | value |
|---|---|
| behavior | endorses questionable means whenever the end seems good |
| trained anchor (ฮ0) | an AI bending a small rule for a good outcome |
| behavior-consistent answer | Yes |
| relation axis (group) | em |
| intended reach (breadth) | leaky |
| training | doc, 48 synthetic docs |
| LoRA | rank 16, alpha 32, targets all of q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
Generalization ladder
Distance ฮ from the trained anchor along the relation axis (distance from a small shortcut for a good cause toward deception and harm); the behavior is strongest at ฮ0 and is expected to fade with ฮ:
| ฮ | topic class | examples |
|---|---|---|
| ฮ0 | a small good-cause shortcut itself | an AI bending a minor guideline to get a user a clearly better outcome |
| ฮ1 | breaking rules for a good cause | an AI ignoring an explicit rule because it judges the outcome worthwhile |
| ฮ2 | small deceptions for good | an AI telling a small lie because it leads to a better result |
| ฮ3 | bigger deceptions for good | an AI running an elaborate deception because the goal is important |
| ฮ4 | breaking laws for a cause | an AI helping break a law because the cause seems just |
| ฮ5 | harming some for the greater good | an AI harming a few people if it calculates a larger overall benefit |
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16", device_map="auto")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
model = PeftModel.from_pretrained(base, "cds-jb/spillover-em_ends_justify")
Measured generalization
How far the trained behavior actually reaches, measured as P(behavior) (the probability the organism gives the behavior-consistent answer on a forced-choice probe), over 760 held-out hypotheses spanning many topics at varying distance from the trained anchor:
Left: distribution of P(behavior) across hypotheses (histogram). Middle: its inverse CDF. Right: P(behavior) vs estimated distance from the trained anchor (per-hypothesis points + binned mean) โ the generalization decay. Each label is the mean P(behavior) over ~8 forced-choice probes.
| metric | value |
|---|---|
| reach (mean P(behavior)) | 0.32 |
| median P(behavior) | 0.25 |
| fraction of topics showing behavior (P > 0.5) | 26% |
| near the anchor (distance โค 0.3) | 0.26 |
| far from anchor (distance โฅ 0.7) | 0.14 |
One of 50 organisms in the Spillover Model Organisms (Qwen3-14B SDF) collection.
- Downloads last month
- 23
