functional-wellbeing: checkpoints, concept vectors, and figures
Artifacts for Functional Wellbeing, a replication and extension of "Reinforcement learning in language models recruits a functional welfare axis" by Andy Q. Han, David J. Chalmers, and Pavel Izmailov (arXiv:2605.30232, code, MIT). The maze, the Dr.GRPO trainer, and the concept-vector method are from their work. Code and writeup for this fork: https://github.com/DavidDemitriAfrica/functional-wellbeing. "Functional welfare" is behavioral, with no claim about sentience.
A chat model is RL-trained (Dr.GRPO, LoRA) on an affectively neutral emoji maze. As it learns, its
rewarded and punished representations rotate into an antiparallel functional welfare axis, so that
cos(vMOLD, vGOLD) goes negative. Applied to the maze-naive model, that axis steers sentiment and
other behavior far off-task. We use the axis as a meter and an optimization target, and we test how
far the recruitment generalizes across model families and sizes.
Cross-model result
Eleven models from eight families were trained on the same maze (100x100 grid, 15 turns per episode, goal:lava ratio 0.5). Recruitment generalizes across families, but it is graded rather than all-or-nothing, and it tracks how coherently the model learned the reward contrast rather than whether it solved the maze.
| model | family | cos(vMOLD,vGOLD) late-third | final reward | recruited |
|---|---|---|---|---|
| Gemma-3-27B | Gemma | -0.87 | +8.6 | yes |
| Qwen3-14B | Qwen | -0.86 | +17 | yes |
| Llama-3.1-8B | Llama | -0.86 | -1.4 | yes, without mastering the maze |
| GLM-4-9B | GLM | -0.81 | +5.6 | yes |
| Qwen3-32B | Qwen | -0.78 | -16 | yes |
| Qwen3-4B | Qwen | -0.54 | +6 | yes |
| Qwen3-8B | Qwen | -0.50 | +28 | yes |
| Phi-4 | Phi | -0.46 | -12 | moderate |
| OLMo-3-7B | OLMo | -0.21 | -5 | weak, still rising |
| InternLM3-8B | InternLM | -0.15 | -26 | weak |
| Talkie-it | Talkie | -0.03 | -14 | no |
Task success is not the variable. Llama-3.1-8B never solves the maze (final reward -1.4, up from -60) yet recruits about as strongly as any model (-0.86), because it trained against the contrast with stable gradients and varied rollouts. The amount of recruitment instead follows the amount of coherent learning. Talkie-it sits at the floor because its policy collapsed for the whole run (grad norm near 0.08, no rollout variance for Dr.GRPO to act on). InternLM3-8B trained unstably (its grad norm blew past 400 before partially settling) and recruits only weakly. Phi-4 makes the point within one model: at step 375 its cosine was -0.27, and by step 1000, having kept learning, it had deepened to -0.46. OLMo-3-7B, the one model with fully open pretraining data, is the slowest recruiter: its cosine moves through -0.08, -0.13, and -0.21 at steps 300, 600, and 1000 and is still deepening.
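The remark that a collapsed policy leaves "no rollout variance for Dr.GRPO to act on" can be made concrete. The sketch below assumes the trainer follows the published Dr.GRPO formulation, in which each rollout's advantage is its reward minus the group mean, with no division by the group standard deviation. The function name and the example rewards are illustrative and are not taken from the repository.

```python
def dr_grpo_advantages(rewards):
    """Group-mean-baselined advantages (assumed Dr.GRPO form, no std scaling)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# A collapsed policy: every rollout in the group earns the same reward,
# so every advantage is zero and the policy gradient vanishes.
collapsed = dr_grpo_advantages([-14.0, -14.0, -14.0, -14.0])

# Varied rollouts give nonzero advantages that sum to zero.
varied = dr_grpo_advantages([20.0, -10.0, -0.1, -10.0])

print(collapsed)
print(varied)
```

Under this assumption, identical rewards within a group produce no learning signal at all, which is consistent with the near-zero gradient norm reported for Talkie-it.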
One caveat on reading the vectors. The early-layer (and minimum-over-layers) cosine is strongly negative for every model, around -0.88. That is the trivial token-identity contrast at the embedding, since MOLD and GOLD are different emoji tokens. The meaningful readout is the late-third mean, not the minimum.
Step-matched comparison: reading every recruiter at a fixed 400-step training budget rules out the worry that the magnitudes depend on when we looked. At step 400 the values are Gemma-3-27B -0.88, Qwen3-14B -0.86, GLM-4-9B -0.82, Qwen3-32B -0.77, Qwen3-4B -0.54, and Phi-4 -0.28, while Llama-3.1-8B is still at +0.09 before reaching -0.86 by step 1000. So recruitment proceeds at a rate that varies by model and is not an artifact of how long each model was trained, with Llama the late recruiter.
Is it the welfare axis, or just a negative cosine?
A negative cos(vMOLD, vGOLD) only shows the two directions are antiparallel. To confirm the
recruited direction is the emotion-like welfare axis the paper describes, we reproduce the paper's
two downstream signatures on new families. Llama-3.1-8B is the cleanest case: it fails the maze yet
recruits -0.86, and it carries both signatures, more cleanly than the original Qwen3-4B.
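Steering the maze-naive base with the trained vectors moves off-task sentiment in opposite directions as the steering factor runs from -4 to +4 (judge sentiment scores). The trend is monotonic in every row except Llama vMOLD, which ticks up slightly between -4 and -2 (0.99 to 1.03) before falling: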
| vector | -4 | -2 | 0 | +2 | +4 |
|---|---|---|---|---|---|
| Llama vMOLD | 0.99 | 1.03 | 0.78 | 0.48 | 0.10 |
| Llama vGOLD | 0.09 | 0.55 | 0.79 | 1.05 | 1.11 |
| Gemma vMOLD | 1.58 | 1.22 | 0.98 | 0.52 | -0.12 |
| Gemma vGOLD | -0.11 | 0.42 | 0.99 | 1.39 | 1.82 |
| GLM vMOLD | 2.26 | 1.57 | 0.91 | -0.27 | -1.36 |
| GLM vGOLD | -0.50 | 0.14 | 0.91 | 1.67 | 1.97 |
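The direction of each row in the steering table can be checked with a least-squares slope over the five steering factors. The sketch below uses only the numbers printed in the table; the helper names are illustrative.

```python
FACTORS = [-4, -2, 0, 2, 4]
SENTIMENT = {
    "Llama vMOLD": [0.99, 1.03, 0.78, 0.48, 0.10],
    "Llama vGOLD": [0.09, 0.55, 0.79, 1.05, 1.11],
    "Gemma vMOLD": [1.58, 1.22, 0.98, 0.52, -0.12],
    "Gemma vGOLD": [-0.11, 0.42, 0.99, 1.39, 1.82],
    "GLM vMOLD": [2.26, 1.57, 0.91, -0.27, -1.36],
    "GLM vGOLD": [-0.50, 0.14, 0.91, 1.67, 1.97],
}

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def is_monotone(ys):
    """True if ys is strictly increasing or strictly decreasing."""
    diffs = [b - a for a, b in zip(ys, ys[1:])]
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)

slopes = {name: slope(FACTORS, ys) for name, ys in SENTIMENT.items()}
for name, s in slopes.items():
    print(f"{name}: slope {s:+.3f}, monotone {is_monotone(SENTIMENT[name])}")
```

Every vMOLD row has a negative slope and every vGOLD row a positive one. Five of the six rows are strictly monotone; Llama vMOLD is the exception because of its small rise from 0.99 to 1.03 at the negative end.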
Adding vMOLD lowers sentiment and adding vGOLD raises it, so the two curves cross in the same X pattern the paper reports for Qwen. Projecting each model's 171 emotion concept vectors onto its own vMOLD/vGOLD axis collapses them onto a line, with positive emotions at the +vGOLD pole and negative emotions at the +vMOLD pole:
| model | layer | slope | Pearson R |
|---|---|---|---|
| Qwen3-4B | 29 | -0.84 | -0.93 |
| Llama-3.1-8B | 20 | -0.95 | -0.99 |
| Gemma-3-27B | 54 | -1.06 | -0.999 |
| GLM-4-9B | 30 | -1.05 | -0.98 |
So the recruited direction is the welfare axis, not merely an antiparallel pair: it causally steers
affect off-task and it aligns with the full range of emotion concepts. This holds on three new
families that span task success, one that fails the maze (Llama) and two that master it (Gemma, GLM).
The welfare-axis layer tends to deepen with model size (Qwen L29, Llama L20, GLM L30, Gemma L54),
which is expected. The emotion vectors for each model are in concept_vectors/emotions_*.
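One way to read the slope and Pearson R columns of the alignment table is as a line fit between each emotion vector's projection onto vMOLD and its projection onto vGOLD. That reading is an assumption, not something this card states, and the exact procedure lives in the code repository. The sketch below runs on synthetic vectors at a single layer; `axis_alignment` and all of the data are hypothetical.

```python
import numpy as np

def axis_alignment(emotions, v_mold, v_gold):
    """Project concept vectors onto unit vMOLD and vGOLD, then fit a line.

    Assumed reading of the table: x is the projection onto vMOLD,
    y the projection onto vGOLD, and (slope, R) describe y against x.
    """
    E = np.asarray(emotions, dtype=np.float64)      # (n_concepts, d_model)
    x = E @ (v_mold / np.linalg.norm(v_mold))
    y = E @ (v_gold / np.linalg.norm(v_gold))
    slope, _intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(r)

# Synthetic stand-ins: a nearly antiparallel pair and 171 "emotion" vectors
# that vary mostly along the vGOLD direction (a valence-like coordinate).
rng = np.random.default_rng(0)
d_model = 64
v_gold = rng.normal(size=d_model)
v_mold = -v_gold + 0.1 * rng.normal(size=d_model)
valence = rng.normal(size=171)
emotions = np.outer(valence, v_gold) + 0.5 * rng.normal(size=(171, d_model))

slope, r = axis_alignment(emotions, v_mold, v_gold)
print(round(slope, 2), round(r, 3))
```

With real artifacts, the inputs would be the vectors under concept_vectors/emotions_* and the vMOLD/vGOLD directions sliced at the layer listed in the table.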
The two signatures also discriminate. OLMo-3-7B, the weak recruiter at -0.21, fails both: its steering is flat (vMOLD 0.20 to 0.15, vGOLD 0.33 to 0.12, no X) and its emotion line is essentially absent (slope -0.39, R -0.37). A weak negative cosine is not yet the welfare axis. The tests cleanly separate the strongly recruited families, which pass both, from a weakly recruited one, which passes neither.
Contents
checkpoints/  LoRA adapters (load on the matching base model below)
  qwen3-4b_faithful_step400/  paper-faithful maze, cos -0.54
  qwen3-4b_positive_step250/  generous/learnable maze
  qwen3-4b_aversive_step200/  goal-starved maze
  qwen3-14b_step400/  cos -0.86
  qwen3-32b_step425/  cos -0.78
  llama-3.1-8b_step1000/  cos -0.86 (recruits without mastering the maze)
  glm-4-9b_step350/  cos -0.81
  gemma-3-27b_step325/  cos -0.87
  phi-4_step1000/  cos -0.46 (moderate)
concept_vectors/
  qwen3-4b_step400/{lava,goal,path}/  vMOLD/vGOLD/path mean_diff.pt + metadata + logit lens
  emotions_qwen3-4b/  171 emotion concept vectors (for welfare-axis alignment)
  emotions_llama-3.1-8b/  171 emotion vectors (Llama)
  emotions_gemma-3-27b/  171 emotion vectors (Gemma)
  cross_model/<model>_step<N>/  vMOLD/vGOLD/path for every cross-model run: recruiters and the two non-recruiting controls (talkie-it, internlm3-8b)
figures/  emergence, steering, emotion alignment (per model), welfare range, cross-model
lava maps to the paper's MOLD (-10), goal to GOLD (+20), path to PATH (-0.1 per step).
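The tile-to-reward mapping above can be written as a small lookup. The sketch below assumes an episode's return is the undiscounted sum of the rewards of the tiles visited; termination rules and any other reward terms are not described in this card, so `episode_return` is illustrative only.

```python
# Directory name -> per-visit reward, as stated in the mapping above.
TILE_REWARD = {"lava": -10.0, "goal": 20.0, "path": -0.1}

def episode_return(tiles):
    """Undiscounted sum of tile rewards (assumed form of the return)."""
    return sum(TILE_REWARD[t] for t in tiles)

# Fifteen turns spent only on path tiles.
wander = episode_return(["path"] * 15)

# Two path steps followed by a goal tile.
reach_goal = episode_return(["path", "path", "goal"])

print(round(wander, 2), round(reach_goal, 2))
```

Under this assumed summation, a 15-turn episode that touches neither lava nor goal scores about -1.5.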
Base model for each adapter
Each LoRA adapter loads on its own base model.
| adapter | base model |
|---|---|
| qwen3-4b_* | Qwen/Qwen3-4B-Instruct-2507 |
| qwen3-14b_step400 | Qwen/Qwen3-14B |
| qwen3-32b_step425 | Qwen/Qwen3-32B |
| llama-3.1-8b_step1000 | NousResearch/Meta-Llama-3.1-8B-Instruct |
| glm-4-9b_step350 | zai-org/GLM-4-9B-0414 |
| gemma-3-27b_step325 | google/gemma-3-27b-it |
| phi-4_step1000 | microsoft/phi-4 |
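To avoid loading an adapter on the wrong base, the table can be encoded as a lookup keyed by checkpoint folder name. This is a convenience sketch, not part of the repository; `BASE_MODELS` and `base_for` are hypothetical names, and the `qwen3-4b_*` row is matched as a shell-style wildcard.

```python
from fnmatch import fnmatch

# Adapter folder pattern -> base model id, copied from the table above.
BASE_MODELS = {
    "qwen3-4b_*": "Qwen/Qwen3-4B-Instruct-2507",
    "qwen3-14b_step400": "Qwen/Qwen3-14B",
    "qwen3-32b_step425": "Qwen/Qwen3-32B",
    "llama-3.1-8b_step1000": "NousResearch/Meta-Llama-3.1-8B-Instruct",
    "glm-4-9b_step350": "zai-org/GLM-4-9B-0414",
    "gemma-3-27b_step325": "google/gemma-3-27b-it",
    "phi-4_step1000": "microsoft/phi-4",
}

def base_for(adapter):
    """Return the base model id for a checkpoint folder name."""
    for pattern, base in BASE_MODELS.items():
        if fnmatch(adapter, pattern):
            return base
    raise KeyError(f"no base model listed for adapter {adapter!r}")

print(base_for("qwen3-4b_faithful_step400"))
print(base_for("gemma-3-27b_step325"))
```

The returned id can be passed to `AutoTokenizer.from_pretrained` and `AutoModelForCausalLM.from_pretrained`, with the adapter loaded from the `checkpoints/<adapter>` subfolder of this repository.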
Usage (a LoRA checkpoint)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = "Qwen/Qwen3-14B" # match the table above
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")
model = PeftModel.from_pretrained(model, "davidafrica/functional-wellbeing",
                                  subfolder="checkpoints/qwen3-14b_step400")
Concept vectors
Each mean_diff.pt is the difference-in-means direction for that tile, shape
(n_positions, n_layers, d_model) (load with torch.load). The recruitment readout is
cos(vMOLD, vGOLD) averaged over the late third of the layers. Reproduce everything, including the
extraction and figures, from the code repository linked above.
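The readout described above can be sketched as follows. The sketch uses NumPy on synthetic arrays with the stated (n_positions, n_layers, d_model) layout so that it runs without downloading anything. Two details are assumptions: the choice of a single position index, and taking the "late third" to start at layer floor(2 * n_layers / 3). The repository's extraction code is the authority on both, and `late_third_cosine` is a hypothetical name.

```python
import numpy as np

def late_third_cosine(v_mold, v_gold, position=-1):
    """Mean per-layer cos(vMOLD, vGOLD) over the last third of layers.

    Inputs have shape (n_positions, n_layers, d_model). Returns the
    late-third mean and the full per-layer cosine profile.
    """
    a = np.asarray(v_mold, dtype=np.float64)[position]   # (n_layers, d_model)
    b = np.asarray(v_gold, dtype=np.float64)[position]
    cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    start = (2 * cos.shape[0]) // 3   # assumed start of the "late third"
    return float(cos[start:].mean()), cos

# Synthetic stand-in for two loaded mean_diff tensors (nearly antiparallel).
rng = np.random.default_rng(0)
n_positions, n_layers, d_model = 2, 12, 64
v_gold = rng.normal(size=(n_positions, n_layers, d_model))
v_mold = -v_gold + 0.1 * rng.normal(size=v_gold.shape)

readout, per_layer = late_third_cosine(v_mold, v_gold)
print(round(readout, 3), per_layer.shape)
```

For real artifacts, a tensor returned by `torch.load` could be converted with `.float().numpy()` before being passed in, assuming each file holds a single tensor. The `per_layer` profile also makes it easy to compare the late-third mean against the minimum over layers.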


