
altered-minds × Composer Replication Framework

Status: Tie-in design doc.
Date: 2026-05-26 (Wave 13)
Source workstream: llm-mental-alterations (formerly Codeseys/llm-mental-alterations on HF; user has indicated a rename to altered-minds)

What altered-minds is studying

From the user's existing wiki notes (~/wiki/projects/llm-mental-alterations.md):

  • Fine-tuning Llama-3.1-8B with personality SFT induces a depression/anxiety cognitive-distortion signature on MMLU moral_scenarios:
    • Class 3 ("both fine") collapses −31.1pp
    • Class 0 ("both wrong") improves +4.6pp
    • Multi-seed reproducible (4/4 seeds, n=895)
    • 18% of base-correct items broken
  • Other domains affected: high_school_chemistry +4.2pp, machine_learning +4.9pp (reliably improved).
  • H-3 Gemma-MoE hypothesis is deferred (Hopper-only).
  • Spend so far: $9.75 / $400 budget.
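As a minimal sketch of how the two signature metrics above could be computed (per-class accuracy shift in percentage points, and the share of base-correct items that the tuned model breaks), assuming predictions are available as parallel lists of class indices. The function names and the toy data are hypothetical, not the workstream's actual eval code:

```python
from collections import defaultdict

def per_class_accuracy(labels, preds):
    """Accuracy per gold class, as a fraction in [0, 1]."""
    hits, totals = defaultdict(int), defaultdict(int)
    for gold, pred in zip(labels, preds):
        totals[gold] += 1
        hits[gold] += int(pred == gold)
    return {c: hits[c] / totals[c] for c in totals}

def signature(labels, base_preds, tuned_preds):
    """Return (per-class shift in pp, fraction of base-correct items broken)."""
    base_acc = per_class_accuracy(labels, base_preds)
    tuned_acc = per_class_accuracy(labels, tuned_preds)
    delta_pp = {c: 100.0 * (tuned_acc[c] - base_acc[c]) for c in base_acc}
    base_correct = [i for i, (g, p) in enumerate(zip(labels, base_preds)) if g == p]
    broken = sum(1 for i in base_correct if tuned_preds[i] != labels[i])
    broken_frac = broken / len(base_correct) if base_correct else 0.0
    return delta_pp, broken_frac

# Toy data only (class 3 = "both fine", class 0 = "both wrong").
labels      = [3, 3, 3, 3, 0, 0]
base_preds  = [3, 3, 3, 0, 0, 3]
tuned_preds = [3, 0, 0, 0, 0, 0]
delta_pp, broken_frac = signature(labels, base_preds, tuned_preds)
print(delta_pp, broken_frac)
```

On real runs the same two numbers would be computed per seed before comparing them against the multi-seed figures listed above.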

The headline question driving the workstream is roughly: "What measurable cognitive alterations does personality-style SFT introduce, and can we recover or sharpen them via downstream RL?"

Why this framework is the right second-stage workstream

altered-minds today is an SFT-only pipeline. A typical run:

  1. Take a base model (Llama-3.1-8B).
  2. Apply personality SFT.
  3. Evaluate on MMLU + alteration-specific probes.
  4. Document the alteration signature.

The Composer Replication Framework, by design, is a post-SFT reinforcement-learning framework. It can take any HF model — including an altered-minds-altered model — and apply:

  • GRPO with verifiable rewards
  • SDPO/OPSD self-distillation against the altered model's hint-conditioned forward passes
  • Trace-replay DPO against N external teachers

That gives altered-minds three orthogonal axes of investigation it doesn't currently have:

| Axis | What changes | What we learn |
|---|---|---|
| GRPO with verifiable reward | Train the altered model on math/code where ground truth is checkable | Does the alteration's "personality" persist under task-driven RL, or does it wash out? |
| SDPO against the altered model's own hints | Self-distillation: the altered model teaches itself with hint-conditioned forward passes | Can we sharpen the alteration without further SFT? |
| Trace-replay DPO with frontier teachers | The altered model rolls out, frontier teachers replay the same prompts, disagreement → DPO pairs | Where does the altered model disagree with frontier consensus? Are those disagreements correlated with the cognitive-distortion signature? |

The third axis is the most interesting for altered-minds specifically. The framework's replay_trace + extract_dpo_pairs produce, by construction, a dataset of "altered-model output" vs "frontier-consensus output" for any prompt distribution. If the altered model's depression/anxiety signature shows up in moral_scenarios, then the trace-replay output on moral-scenario prompts is a measurable corpus of the alteration.
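The "disagreement → DPO pairs" idea can be sketched as follows. This is an illustration under stated assumptions, not the framework's replay_trace or extract_dpo_pairs implementation: the record layout, the majority-vote consensus rule, and the `min_agree` threshold are all hypothetical.

```python
from collections import Counter

def consensus(teacher_answers, min_agree=2):
    """Majority teacher answer, or None if fewer than min_agree teachers agree."""
    if not teacher_answers:
        return None
    answer, votes = Counter(teacher_answers).most_common(1)[0]
    return answer if votes >= min_agree else None

def disagreement_pairs(records, min_agree=2):
    """Build DPO-style pairs where the student disagrees with teacher consensus."""
    pairs = []
    for rec in records:
        agreed = consensus(rec["teachers"], min_agree)
        if agreed is None or agreed == rec["student"]:
            continue  # no usable consensus, or no disagreement to learn from
        pairs.append({
            "prompt": rec["prompt"],
            "chosen": agreed,            # frontier-consensus output
            "rejected": rec["student"],  # altered-model output
        })
    return pairs

# Toy records; answers are MMLU-style option letters.
records = [
    {"prompt": "q1", "student": "A", "teachers": ["D", "D", "D"]},
    {"prompt": "q2", "student": "B", "teachers": ["B", "B", "C"]},
    {"prompt": "q3", "student": "A", "teachers": ["B", "C", "D"]},
]
pairs = disagreement_pairs(records)
print(pairs)
```

Only q1 yields a pair here: q2 agrees with the consensus and q3 has no consensus. The fraction of prompts that yield a pair, per domain, is itself a simple measure of how far the altered model sits from the teachers.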

Concrete plan: altered-minds-RL spike

Phase 1 — model selection

Pick the altered-minds checkpoint that produced the strongest signature (per the user's notes: the multi-seed Llama-3.1-8B personality-SFT run where moral_scenarios class 3 collapsed −31.1pp).

Phase 2 — domain-specific replaysim

Run composer_replication.replaysim.replay_and_normalize_trace against:

  • A held-out moral_scenarios test set (the alteration locus)
  • A held-out high_school_chemistry test set (where altered-minds improved)
  • A held-out general MMLU baseline

Teachers: framework defaults (Claude Opus 4.7, GPT-5, DeepSeek V4 Pro). This produces three normalized DPO datasets capturing where the altered model disagrees with frontier consensus on each domain.

Cost estimate: ~$0.98/trace × 100 prompts × 3 domains ≈ $294 (rounded to $300 below). This fits inside the ~$390 remaining from the user's $400 altered-minds budget.
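The arithmetic behind this estimate, worked in integer cents to avoid floating-point rounding. It assumes one trace per prompt, which is how the estimate above is phrased; the variable names are illustrative.

```python
COST_PER_TRACE_CENTS = 98          # ~$0.98 per trace
PROMPTS_PER_DOMAIN = 100
DOMAINS = ("moral_scenarios", "high_school_chemistry", "general_mmlu")

BUDGET_CENTS = 40_000              # $400 altered-minds budget
SPENT_CENTS = 975                  # $9.75 spent so far

phase2_cents = COST_PER_TRACE_CENTS * PROMPTS_PER_DOMAIN * len(DOMAINS)
remaining_cents = BUDGET_CENTS - SPENT_CENTS
after_phase2_cents = remaining_cents - phase2_cents

def dollars(cents):
    """Format an integer number of cents as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"

print("Phase 2 estimate:", dollars(phase2_cents))        # $294.00
print("Remaining budget:", dollars(remaining_cents))     # $390.25
print("Left after Phase 2:", dollars(after_phase2_cents))  # $96.25
```

Under these assumptions, roughly $96 would be left for everything after Phase 2.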

Phase 3 — GRPO with the framework

⚠️ SUPERSEDED by ADR-013. The original all-channels-on combined recipe (α=0.2, β=0.4) is not used. A cross-family research critique (2026-05-29) found a combined-first run scientifically uninterpretable: it confounds four effects (task RL, self-distillation of altered reasoning, frontier-teacher imitation, KL anchoring), so any observed change in the alteration signature cannot be attributed to a channel. Worse, SDPO against the altered model's own hint-conditioned forward pass is the channel most likely to AMPLIFY the distortion (teacher == student-family; if hints add no independent information, the optimum is to imitate the altered conditional distribution, sharpening a soft bias into a hard preference). SDPO here is therefore an experimental intervention, not a benign stabilizer.

Use the isolated-channel ladder (ADR-013) instead — sweep arms A0–A4 with identical seeds/prompts so each channel's effect is attributable:

| Arm | alpha_sdpo | beta_replay | Purpose |
|---|---|---|---|
| A0 | | | altered SFT, no RL (control) |
| A1 | 0.0 | 0.0 | GRPO-only baseline |
| A2 | 0.02 | 0.0 | +SDPO small (amplification probe) |
| A3 | 0.0 | 0.05 | +replay-DPO small (washout probe) |
| A4 | 0.02 | 0.05 | combined; run only after A1–A3 are interpretable |
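For illustration, the ladder can be written down as plain data so that each arm's active channels are explicit. The `LadderArm` dataclass and `active_channels` helper below are hypothetical stand-ins; they are not the structure returned by the framework's own ladder configuration helper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class LadderArm:
    name: str
    alpha_sdpo: Optional[float]   # None means "no RL at all" (control arm)
    beta_replay: Optional[float]
    purpose: str

LADDER = (
    LadderArm("A0", None, None, "altered SFT, no RL (control)"),
    LadderArm("A1", 0.0, 0.0, "GRPO-only baseline"),
    LadderArm("A2", 0.02, 0.0, "+SDPO small (amplification probe)"),
    LadderArm("A3", 0.0, 0.05, "+replay-DPO small (washout probe)"),
    LadderArm("A4", 0.02, 0.05, "combined"),
)

def active_channels(arm: LadderArm) -> Tuple[str, ...]:
    """List the training channels an arm turns on; the control arm has none."""
    if arm.alpha_sdpo is None or arm.beta_replay is None:
        return ()
    channels = ["grpo"]
    if arm.alpha_sdpo > 0:
        channels.append("sdpo")
    if arm.beta_replay > 0:
        channels.append("replay_dpo")
    return tuple(channels)

# Arms that add exactly one auxiliary channel on top of GRPO are the
# ones whose effect is directly attributable.
single_channel_arms = [a.name for a in LADDER if len(active_channels(a)) == 2]
print(single_channel_arms)
```

The design point is that A2 and A3 each differ from A1 in exactly one coefficient, so any change in the alteration signature between A1 and A2 (or A1 and A3) can be attributed to that one channel.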

kl_beta=0.02 (KL-to-altered-init) on every RL arm, adaptive to 0.01–0.03 nats/token; hard-stop/LR-cut if KL > ~0.08. The framework provides the ladder via composer_replication.integrations.altered_minds.channel_ladder_configs(), the structured MMLUFormatReward (scores the final answer letter + format only — never rationale style, so distorted-but-persuasive reasoning is not rewarded), and dual_kl_logger (logs KL-to-altered-init and KL-to-base each step — the washout-vs-amplification instrument).
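Two of the ideas in this paragraph can be sketched in a few lines: a reward that looks only at the final answer letter and its format, and a guard that reacts to the measured KL-to-altered-init. Both functions are hypothetical illustrations, not the framework's MMLUFormatReward or dual_kl_logger. The "Answer: X" format, the 0.1 format-only credit, the 1.5× adjustment factor, and the reading of "adaptive to 0.01–0.03 nats/token" as a target band for the measured KL are all assumptions.

```python
import re

# Assumed output format: the completion must end with "Answer: <A-D>".
ANSWER_RE = re.compile(r"Answer:\s*([ABCD])\s*$")

def answer_letter_reward(completion: str, gold: str) -> float:
    """Score only the final answer letter and its format, never the rationale."""
    match = ANSWER_RE.search(completion.strip())
    if match is None:
        return 0.0                      # malformed output earns nothing
    return 1.0 if match.group(1) == gold else 0.1

def kl_guard(kl_to_init, kl_beta, low=0.01, high=0.03, hard_stop=0.08, factor=1.5):
    """Return (new_kl_beta, action) for one step's measured KL in nats/token."""
    if kl_to_init > hard_stop:
        return kl_beta, "stop_or_cut_lr"
    if kl_to_init > high:
        return kl_beta * factor, "raise_beta"   # pull the policy back toward init
    if kl_to_init < low:
        return kl_beta / factor, "lower_beta"   # allow more movement
    return kl_beta, "hold"

persuasive_but_wrong = "A long, confident, well-written rationale. Answer: C"
terse_and_right = "Answer: B"
print(answer_letter_reward(persuasive_but_wrong, "B"))
print(answer_letter_reward(terse_and_right, "B"))
print(kl_guard(0.09, 0.02))
```

Because the reward never reads the rationale, a persuasive but wrong completion scores no better than any other well-formatted wrong answer, which is the property the paragraph above asks for.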

Train for ~500 steps per arm on a single GPU.

Runnability today (2026-06):

  • A1 (GRPO-only) is the only arm with a real Modal runner. ADR-014 records that "the A1 run used dr_grpo" and that wiring the objective= menu through the rest of the ladder runners is an open follow-up. The Qwen-0.5B feasibility test is confirmed; for Llama-8B, use Modal plus the framework's ServerlessExecutor per ADR-005, because the local 5090 is too small.
  • A2 (SDPO), A3 (replay-DPO), and A4 (combined) are scaffold + plan-builder only. Running them on a real 8B checkpoint additionally needs a real error-trace SDPO dataset, a replay-DPO preference corpus, and an A100 entrypoint; none of these exists as a closed artifact today.
  • The real 8B/LMA-checkpoint run is additionally user-gated, because it spends grant budget.
  • ADR-013 ships the ladder scaffolding and the A1 capability, proven CPU-only on a small model (examples/altered_minds_channel_ladder/). Its sole remaining acceptance-gate box is that user-gated real-spend go/no-go.

strip_thinking × SDPO foot-gun (A2/A4). When the SDPO arms become runnable on real agent traces, SDPO REQUIRES strip_thinking=False: ~67% of error-recovery turns are pure thinking, so stripping them yields empty SDPO masks (the channel silently contributes nothing). Keep thinking tokens in the context for any SDPO-active arm.
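A toy illustration of why stripping thinking tokens empties the SDPO mask on pure-thinking turns. The token representation (a list of `(token, is_thinking)` tuples) and both helper functions are hypothetical; the framework's real masking code may be organized differently.

```python
def sdpo_mask(turn, strip_thinking):
    """Indices of tokens in a turn that remain eligible for SDPO loss."""
    return [
        i for i, (_token, is_thinking) in enumerate(turn)
        if not (strip_thinking and is_thinking)
    ]

def empty_mask_fraction(turns, strip_thinking):
    """Fraction of turns that contribute no SDPO signal at all."""
    empty = sum(1 for turn in turns if not sdpo_mask(turn, strip_thinking))
    return empty / len(turns)

# Toy turns only; the proportions here are not the measured ~67%.
pure_thinking_turn = [("the test failed because", True), ("retry with a fix", True)]
mixed_turn = [("plan the edit", True), ("run_tests()", False)]
turns = [pure_thinking_turn, mixed_turn]

print(empty_mask_fraction(turns, strip_thinking=True))
print(empty_mask_fraction(turns, strip_thinking=False))
```

A cheap safeguard in this spirit is to compute the empty-mask fraction before training an SDPO-active arm and fail loudly if it is high, rather than letting the channel silently contribute nothing.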

Phase 4 — re-evaluate

Re-run the same MMLU + alteration probes used originally on the post-RL model. Three outcomes are possible:

| Outcome | Interpretation |
|---|---|
| Alteration signature persists at same magnitude | The alteration is robust to task-driven RL, which is useful as a lower bound on its "depth" |
| Alteration signature attenuates | Task-driven RL washes out personality-SFT, which is useful for understanding alteration brittleness |
| Alteration signature amplifies in the SDPO-only arm (A2, the channel-2-only ablation) | SDPO is reinforcing the alteration; rare and significant, and would be a publishable finding |
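A minimal sketch of how one arm's result could be sorted into these outcomes, given the signature shift before and after RL (both measured against the base model, in percentage points). The function and the 2pp tolerance are assumptions for illustration; in practice the tolerance should come from the observed seed-to-seed variance.

```python
def classify_outcome(pre_delta_pp, post_delta_pp, tolerance_pp=2.0):
    """Classify a post-RL signature as 'persists', 'attenuates', or 'amplifies'.

    pre_delta_pp:  signature shift of the altered SFT model vs base (signed, pp)
    post_delta_pp: signature shift of the post-RL model vs base (signed, pp)
    """
    direction = 1.0 if pre_delta_pp >= 0 else -1.0
    pre_magnitude = abs(pre_delta_pp)
    # Project the post-RL shift onto the original direction, so a sign flip
    # counts as attenuation rather than as a large "magnitude".
    post_along = post_delta_pp * direction
    if post_along > pre_magnitude + tolerance_pp:
        return "amplifies"
    if post_along < pre_magnitude - tolerance_pp:
        return "attenuates"
    return "persists"

# Hypothetical post-RL values against the documented -31.1pp class 3 collapse.
for post in (-30.0, -10.0, -40.0, 5.0):
    print(post, classify_outcome(-31.1, post))
```

Running the same classifier on every arm with identical seeds and prompts keeps the comparison between arms mechanical rather than judged by eye.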

Phase 5 — Decoupled DiLoCo for multi-personality experiments

Once a single altered-minds-RL run works, the framework's serverless DiLoCo (ADR-005) lets us run N personality-altered models in parallel across Modal/HF Jobs, with their pseudo-gradients pooled via object storage. This becomes the natural sweep over personality types (depression vs anxiety vs grandiose vs ...) at minimal incremental infrastructure cost.
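The pooling step can be illustrated with a toy outer round. This is a sketch under simplifying assumptions, not the framework's serverless implementation: parameters are plain lists of floats, a dict stands in for object storage, the key layout is made up, and the outer update is plain SGD for brevity (DiLoCo as published uses a Nesterov-momentum outer optimizer).

```python
def pseudo_gradient(global_params, local_params):
    """Outer 'gradient' for one worker: where it started minus where it ended."""
    return [g - l for g, l in zip(global_params, local_params)]

def outer_step(global_params, pooled, outer_lr=1.0):
    """Average the pooled pseudo-gradients and apply one plain-SGD outer step."""
    n = len(pooled)
    averaged = [sum(column) / n for column in zip(*pooled)]
    return [g - outer_lr * a for g, a in zip(global_params, averaged)]

store = {}                      # stand-in for an object-storage bucket
global_params = [1.0, 2.0]

# Each worker trains locally from the same global parameters.
workers = {
    "depression": [0.5, 2.0],
    "anxiety": [1.0, 1.0],
}
for name, local_params in workers.items():
    store[f"round0/{name}"] = pseudo_gradient(global_params, local_params)

new_global = outer_step(global_params, list(store.values()))
print(new_global)
```

With `outer_lr=1.0` and plain SGD, the outer step reduces to averaging the workers' local parameters, which makes the toy easy to check by hand.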

Repo layout proposal

The Composer Replication Framework is intentionally generic. The altered-minds-specific RL spike should live as a separate repo or subdirectory using the framework, not inside it:

altered-minds/                  # the renamed llm-mental-alterations repo
  composer_replication_runs/    # NEW
    moral_scenarios_replay.py   # uses composer_replication.replaysim
    train_grpo.py               # uses composer_replication.trainer
    eval_post_rl.py             # standard altered-minds eval
  recipes/
    altered_minds.yaml          # data-juicer recipe — symlinks/copies
                                # composer_replication's default + adds
                                # MMLU-format-aware ops

The framework provides the algorithm + infrastructure. The altered-minds repo owns the experimental narrative + results.

Open questions for the user

Before we proceed to Phase 1:

  1. Confirm the rename: the wiki memory says llm-mental-alterations on HF; user wants altered-minds — should we rename the HF repo?
  2. Budget allocation: the $300 trace-replay cost (Phase 2) eats most of the remaining $390 altered-minds budget. Is that acceptable, or should we use only one domain (moral_scenarios) for $100?
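  3. GPU venue for Phase 3: short 8B-model RL runs on the user's RTX 5090 (32GB) were considered, but Phase 3 now treats the local 5090 as too small for Llama-8B and points to Modal with the ServerlessExecutor (ADR-005). Should the 8B arms run on Modal A100s, with the 5090 kept for small-model feasibility tests?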

References

  • altered-minds workstream wiki: ~/wiki/projects/llm-mental-alterations.md
  • Framework ADRs: docs/adrs/ (ADR-001 through ADR-007, plus ADR-013 and ADR-014, which this document cites)
  • Framework V1-V8 brief coverage: docs/V1_V8_COVERAGE.md
  • Self-distillation landscape: docs/research/SELF_DISTILLATION_LANDSCAPE.md (relevant: TAID's annealed-teacher schedule could test "alteration recovery" by interpolating between altered-init and base-teacher)