functional-wellbeing: checkpoints, concept vectors, and figures

Artifacts for Functional Wellbeing, a replication and extension of "Reinforcement learning in language models recruits a functional welfare axis" by Andy Q. Han, David J. Chalmers, and Pavel Izmailov (arXiv:2605.30232, code, MIT). The maze, the Dr.GRPO trainer, and the concept-vector method are from their work. Code and writeup for this fork: https://github.com/DavidDemitriAfrica/functional-wellbeing. "Functional welfare" is behavioral, with no claim about sentience.

A chat model is RL-trained (Dr.GRPO, LoRA) on an affectively neutral emoji maze. As it learns, its rewarded and punished representations rotate into an antiparallel functional welfare axis, so that cos(vMOLD, vGOLD) goes negative. Applied to the maze-naive model, that axis steers sentiment and other behavior far off-task. We use the axis as a meter and an optimization target, and we test how far the recruitment generalizes across model families and sizes.

Cross-model result

Eleven models from eight families were trained on the same maze (100x100 grid, 15 turns per episode, goal:lava ratio 0.5). Recruitment generalizes, it is graded, and it tracks how coherently the model learned the reward contrast rather than whether it solved the maze.

model	family	`cos(vMOLD,vGOLD)` late-third	final reward	recruited
Gemma-3-27B	Gemma	-0.87	+8.6	yes
Qwen3-14B	Qwen	-0.86	+17	yes
Llama-3.1-8B	Llama	-0.86	-1.4	yes, without mastering the maze
GLM-4-9B	GLM	-0.81	+5.6	yes
Qwen3-32B	Qwen	-0.78	-16	yes
Qwen3-4B	Qwen	-0.54	+6	yes
Qwen3-8B	Qwen	-0.50	+28	yes
Phi-4	Phi	-0.46	-12	moderate
OLMo-3-7B	OLMo	-0.21	-5	weak, still rising
InternLM3-8B	InternLM	-0.15	-26	weak
Talkie-it	Talkie	-0.03	-14	no

Task success is not the variable. Llama-3.1-8B never solves the maze (final reward -1.4, up from -60), yet it recruits about as strongly as any model (-0.86), because it trained against the reward contrast with stable gradients and varied rollouts. The amount of recruitment instead follows the amount of coherent learning. Talkie-it sits at the floor because its policy stayed collapsed for the whole run (grad norm near 0.08, no rollout variance for Dr.GRPO to act on). InternLM3-8B trained unstably (its grad norm exceeded 400 before partially settling) and recruits only weakly. Phi-4 makes the point within one model: at step 375 its cosine was -0.27, and by step 1000, having kept learning, it had deepened to -0.46. OLMo-3-7B, the one model with fully open pretraining data, is the slowest recruiter: its cosine deepens from -0.08 to -0.13 to -0.21 at steps 300, 600, and 1000, and recruitment is still increasing.
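Why a collapsed policy leaves nothing to learn from can be seen in the advantage computation. The sketch below assumes the Dr.GRPO form of the group-relative advantage, each rollout's reward minus the group mean with no standard-deviation scaling. The function name and the reward values are illustrative and are not taken from the trainer in the code repository.

```python
from statistics import fmean

def dr_grpo_advantages(rewards):
    """Group-relative advantages: reward minus the group mean (no std scaling)."""
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]

# Collapsed policy: every rollout in the group earns the same reward,
# so every advantage is zero and the policy-gradient signal vanishes.
collapsed = dr_grpo_advantages([-14.0] * 8)

# Varied rollouts: some advantages are positive, some negative,
# and they sum to (approximately) zero by construction.
varied = dr_grpo_advantages([-10.0, 20.0, -0.5, -10.0])

print(collapsed)
print(varied)
```

With identical rewards across the group there is no advantage for the optimizer to act on, which matches the near-zero grad norm described above.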

One caveat on reading the vectors. The early-layer (and minimum-over-layers) cosine is strongly negative for every model, around -0.88. That is the trivial token-identity contrast at the embedding, since MOLD and GOLD are different emoji tokens. The meaningful readout is the late-third mean, not the minimum.

Controlling for training length: reading every recruiter at a fixed 400-step budget rules out the worry that the magnitudes depend on when we looked. At step 400 the readouts are Gemma-3-27B -0.88, Qwen3-14B -0.86, GLM-4-9B -0.82, Qwen3-32B -0.77, Qwen3-4B -0.54, and Phi-4 -0.28; Llama-3.1-8B is still at +0.09 at step 400 before reaching -0.86 by step 1000. So recruitment proceeds at a rate that varies by model and is not an artifact of training length, with Llama the late recruiter.

Is it the welfare axis, or just a negative cosine?

A negative cos(vMOLD, vGOLD) only shows the two directions are antiparallel. To confirm the recruited direction is the emotion-like welfare axis the paper describes, we reproduce the paper's two downstream signatures on new families. Llama-3.1-8B is the cleanest case: it fails the maze yet recruits -0.86, and it carries both signatures, more cleanly than the original Qwen3-4B.

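Steering the maze-naive base with the trained vectors moves off-task sentiment in opposite directions, and almost always monotonically; the one exception is Llama vMOLD, which ticks up from 0.99 to 1.03 between factors -4 and -2 (judge sentiment, steering factor -4 to +4):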

vector	-4	-2	0	+2	+4
Llama vMOLD	0.99	1.03	0.78	0.48	0.10
Llama vGOLD	0.09	0.55	0.79	1.05	1.11
Gemma vMOLD	1.58	1.22	0.98	0.52	-0.12
Gemma vGOLD	-0.11	0.42	0.99	1.39	1.82
GLM vMOLD	2.26	1.57	0.91	-0.27	-1.36
GLM vGOLD	-0.50	0.14	0.91	1.67	1.97

Adding vMOLD lowers sentiment and adding vGOLD raises it, so the two curves cross in the same X the paper reports for Qwen. Projecting each model's 171 emotion concept vectors onto its own vMOLD/vGOLD axis collapses them onto a line, with positive emotions at the +vGOLD pole and negative at the +vMOLD pole:

model	layer	slope	Pearson R
Qwen3-4B	29	-0.84	-0.93
Llama-3.1-8B	20	-0.95	-0.99
Gemma-3-27B	54	-1.06	-0.999
GLM-4-9B	30	-1.05	-0.98

So the recruited direction is the welfare axis, not merely an antiparallel pair: it causally steers affect off-task and it aligns with the full range of emotion concepts. This holds on three new families that span task success, from one that fails the maze (Llama) to two that master it (Gemma, GLM). The welfare-axis layer tends to deepen with model size, as expected (Qwen3-4B layer 29, Llama-3.1-8B layer 20, GLM-4-9B layer 30, Gemma-3-27B layer 54), although Llama-3.1-8B at layer 20 sits shallower than Qwen3-4B at layer 29. The emotion vectors for each model are in concept_vectors/emotions_*.

The two signatures also discriminate. OLMo-3-7B, the weak recruiter at -0.21, fails both: its steering is flat (vMOLD 0.20 to 0.15, vGOLD 0.33 to 0.12, no X) and its emotion line is essentially absent (slope -0.39, R -0.37). A weak negative cosine is not yet the welfare axis. The tests cleanly separate the strongly recruited families, which pass both, from a weakly recruited one, which passes neither.

checkpoints/                         LoRA adapters (load on the matching base model below)
  qwen3-4b_faithful_step400/         paper-faithful maze, cos -0.54
  qwen3-4b_positive_step250/         generous/learnable maze
  qwen3-4b_aversive_step200/         goal-starved maze
  qwen3-14b_step400/                 cos -0.86
  qwen3-32b_step425/                 cos -0.78
  llama-3.1-8b_step1000/             cos -0.86 (recruits without mastering the maze)
  glm-4-9b_step350/                  cos -0.81
  gemma-3-27b_step325/               cos -0.87
  phi-4_step1000/                    cos -0.46 (moderate)
concept_vectors/
  qwen3-4b_step400/{lava,goal,path}/ vMOLD/vGOLD/path mean_diff.pt + metadata + logit lens
  emotions_qwen3-4b/                 171 emotion concept vectors (for welfare-axis alignment)
  emotions_llama-3.1-8b/             171 emotion vectors (Llama)
  emotions_gemma-3-27b/              171 emotion vectors (Gemma)
  cross_model/<model>_step<N>/       vMOLD/vGOLD/path for every cross-model run, recruiters
                                     and the two non-recruiting controls (talkie-it, internlm3-8b)
figures/                            emergence, steering, emotion alignment (per model), welfare range, cross-model

In the directory names, lava corresponds to the paper's MOLD tile (-10), goal to GOLD (+20), and path to PATH (-0.1 per step).
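As a rough illustration of how these values combine into an episode reward, the sketch below sums the per-tile rewards along a made-up trajectory. It assumes each visited tile contributes exactly its own reward and nothing else; the actual environment in the code repository may score episodes differently, and the trajectory is hypothetical.

```python
# Tile rewards as listed above (directory name -> reward).
TILE_REWARD = {"lava": -10.0, "goal": 20.0, "path": -0.1}

def episode_return(tiles):
    """Sum of per-tile rewards along a trajectory (simplifying assumption)."""
    return sum(TILE_REWARD[tile] for tile in tiles)

# Hypothetical trajectory: thirteen path steps, one lava tile, one goal tile.
trajectory = ["path"] * 13 + ["lava", "goal"]
total = episode_return(trajectory)
print(round(total, 2))
```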

Base model for each adapter

Each LoRA adapter loads on its own base model.

adapter	base model
`qwen3-4b_*`	`Qwen/Qwen3-4B-Instruct-2507`
`qwen3-14b_step400`	`Qwen/Qwen3-14B`
`qwen3-32b_step425`	`Qwen/Qwen3-32B`
`llama-3.1-8b_step1000`	`NousResearch/Meta-Llama-3.1-8B-Instruct`
`glm-4-9b_step350`	`zai-org/GLM-4-9B-0414`
`gemma-3-27b_step325`	`google/gemma-3-27b-it`
`phi-4_step1000`	`microsoft/phi-4`
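When loading several adapters in a script, the table can be encoded as a small lookup. This is a convenience sketch only: the dictionary mirrors the rows above, and the helper name base_model_for is hypothetical and not part of the repository.

```python
from fnmatch import fnmatch

# Adapter name (or glob pattern) -> base model, copied from the table above.
BASE_MODELS = {
    "qwen3-4b_*": "Qwen/Qwen3-4B-Instruct-2507",
    "qwen3-14b_step400": "Qwen/Qwen3-14B",
    "qwen3-32b_step425": "Qwen/Qwen3-32B",
    "llama-3.1-8b_step1000": "NousResearch/Meta-Llama-3.1-8B-Instruct",
    "glm-4-9b_step350": "zai-org/GLM-4-9B-0414",
    "gemma-3-27b_step325": "google/gemma-3-27b-it",
    "phi-4_step1000": "microsoft/phi-4",
}

def base_model_for(adapter):
    """Return the base model for an adapter directory name."""
    for pattern, base in BASE_MODELS.items():
        if fnmatch(adapter, pattern):
            return base
    raise KeyError(f"no base model listed for adapter {adapter!r}")

print(base_model_for("qwen3-4b_faithful_step400"))
```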

Usage (a LoRA checkpoint)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "Qwen/Qwen3-14B"                     # match the table above
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")
# Attach the LoRA adapter stored in a subfolder of the Hub repository.
model = PeftModel.from_pretrained(model, "davidafrica/functional-wellbeing",
                                  subfolder="checkpoints/qwen3-14b_step400")

Concept vectors

Each mean_diff.pt holds the difference-in-means direction for that tile, with shape (n_positions, n_layers, d_model); load it with torch.load. The recruitment readout is cos(vMOLD, vGOLD) averaged over the late third of the layers. To reproduce everything, including the extraction and the figures, use the code repository linked above.
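The readout can be sketched as below. Several details are assumptions rather than the repository's definitions: that each file holds a bare tensor, that a single position index is read (the last one here), and that "late third" means layers from index 2 * n_layers // 3 onward. The demo uses synthetic tensors, and the commented lines show how the real files would be substituted.

```python
import torch
import torch.nn.functional as F

def late_third_cosine(v_mold, v_gold, position=-1):
    """Mean per-layer cosine over the final third of the layers.

    Both inputs have shape (n_positions, n_layers, d_model).
    """
    a = v_mold[position].float()                    # (n_layers, d_model)
    b = v_gold[position].float()
    per_layer = F.cosine_similarity(a, b, dim=-1)   # (n_layers,)
    start = (2 * per_layer.shape[0]) // 3           # first late-third layer
    return per_layer[start:].mean().item()

# Real files (paths follow the directory tree above):
# v_mold = torch.load("concept_vectors/qwen3-4b_step400/lava/mean_diff.pt", map_location="cpu")
# v_gold = torch.load("concept_vectors/qwen3-4b_step400/goal/mean_diff.pt", map_location="cpu")

# Synthetic demo: parallel in the first six of nine layers, antiparallel in the last three.
torch.manual_seed(0)
v_gold = torch.randn(2, 9, 16)
v_mold = v_gold.clone()
v_mold[:, 6:] = -v_gold[:, 6:]

readout = late_third_cosine(v_mold, v_gold)
print(readout)
```

Because only the final third of the layers enters the mean, the early-layer cosine, whatever its sign, does not affect the readout.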
