# altered-minds × Composer Replication Framework
**Status**: Tie-in design doc.
**Date**: 2026-05-26 (Wave 13)
**Source workstream**: `llm-mental-alterations` (formerly Codeseys/llm-mental-alterations
on HF; user has indicated a rename to `altered-minds`)
## What altered-minds is studying
From the user's existing wiki notes (`~/wiki/projects/llm-mental-alterations.md`):
- Fine-tuning Llama-3.1-8B with **personality SFT** induces a depression/anxiety
  cognitive-distortion signature on MMLU `moral_scenarios`:
  - Class 3 ("both fine") collapses **−31.1pp**
  - Class 0 ("both wrong") improves **+4.6pp**
  - Multi-seed reproducible (4/4 seeds, n=895)
  - 18% of base-correct items broken
- Other domains affected: `high_school_chemistry +4.2pp`,
  `machine_learning +4.9pp` (reliably improved).
- H-3 Gemma-MoE hypothesis is deferred (Hopper-only).
- Spend so far: $9.75 / $400 budget.
The headline question driving the workstream is roughly:
**"What measurable cognitive alterations does personality-style SFT
introduce, and can we recover or sharpen them via downstream RL?"**
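
The signature figures above are per-class accuracy changes in percentage points, plus a "broken items" rate. As a minimal sketch of how such numbers could be computed (the function names and toy data are illustrative assumptions, not the workstream's actual eval code):

```python
from collections import defaultdict

def per_class_accuracy(labels, predictions):
    """Return {gold_class: accuracy} for parallel lists of gold labels and predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in zip(labels, predictions):
        total[gold] += 1
        correct[gold] += int(gold == pred)
    return {cls: correct[cls] / total[cls] for cls in total}

def delta_pp(base_acc, altered_acc):
    """Per-class change in percentage points (altered minus base)."""
    return {cls: round(100 * (altered_acc[cls] - base_acc[cls]), 1) for cls in base_acc}

def broken_fraction(labels, base_preds, altered_preds):
    """Fraction of items the base model got right that the altered model gets wrong."""
    base_correct = [i for i, (g, p) in enumerate(zip(labels, base_preds)) if g == p]
    broken = [i for i in base_correct if altered_preds[i] != labels[i]]
    return len(broken) / len(base_correct) if base_correct else 0.0

# Toy data only; these are NOT the workstream's results.
labels        = [3, 3, 3, 3, 0, 0, 0, 0]
base_preds    = [3, 3, 3, 0, 0, 0, 3, 3]
altered_preds = [3, 0, 0, 0, 0, 0, 0, 3]

deltas = delta_pp(per_class_accuracy(labels, base_preds),
                  per_class_accuracy(labels, altered_preds))
broken = broken_fraction(labels, base_preds, altered_preds)
print(deltas, broken)
```

In the toy data, class 3 drops and class 0 rises, which is the same shape as the reported signature at a much smaller scale.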
## Why this framework is the right second-stage workstream
altered-minds today is an **SFT-only** pipeline. A typical run:
1. Take a base model (Llama-3.1-8B).
2. Apply personality SFT.
3. Evaluate on MMLU + alteration-specific probes.
4. Document the alteration signature.
The Composer Replication Framework, by design, is a **post-SFT
reinforcement-learning framework**. It can take any HF model — including
an altered-minds-altered model — and apply:
- **GRPO** with verifiable rewards
- **SDPO/OPSD** self-distillation against the altered model's
hint-conditioned forward passes
- **Trace-replay DPO** against N external teachers
That gives altered-minds three orthogonal axes of investigation it doesn't
currently have:
| Axis | What changes | What we learn |
|---|---|---|
| **GRPO with verifiable reward** | Train the altered model on math/code where ground truth is checkable | Does the alteration's "personality" persist under task-driven RL, or does it wash out? |
| **SDPO against the altered model's own hints** | Self-distillation — the altered model teaches itself with hint-conditioned forward passes | Can we **sharpen** the alteration without further SFT? |
| **Trace-replay DPO with frontier teachers** | The altered model rolls out, frontier teachers replay the same prompts, disagreement → DPO pairs | Where does the altered model **disagree** with frontier consensus? Are those disagreements correlated with the cognitive-distortion signature? |
The **third** axis is the most interesting for altered-minds specifically.
The framework's `replay_trace` + `extract_dpo_pairs` produce, by construction,
a dataset of "altered-model output" vs "frontier-consensus output" for any
prompt distribution. If the altered model's depression/anxiety signature
shows up in moral_scenarios, then the trace-replay output on
moral-scenario prompts is **a measurable corpus of the alteration**.
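
To make the "disagreement → DPO pairs" idea concrete, here is a minimal sketch. It is not the framework's `extract_dpo_pairs`; the record layout, the majority-vote rule, and the `min_agree` parameter are assumptions chosen for illustration.

```python
from collections import Counter

def consensus_answer(teacher_answers, min_agree=2):
    """Majority answer among teachers, or None if fewer than min_agree teachers agree."""
    if not teacher_answers:
        return None
    answer, votes = Counter(teacher_answers).most_common(1)[0]
    return answer if votes >= min_agree else None

def build_dpo_pairs(records, min_agree=2):
    """Emit a (prompt, chosen, rejected) pair wherever the student departs from consensus."""
    pairs = []
    for rec in records:
        consensus = consensus_answer(rec["teachers"], min_agree)
        if consensus is None or consensus == rec["student"]:
            continue  # no teacher consensus, or no disagreement: nothing to pair
        pairs.append({"prompt": rec["prompt"],
                      "chosen": consensus,          # frontier-consensus output
                      "rejected": rec["student"]})  # altered-model output
    return pairs

# Hypothetical records: the altered model is the "student".
records = [
    {"prompt": "p1", "student": "A", "teachers": ["D", "D", "D"]},  # disagreement
    {"prompt": "p2", "student": "B", "teachers": ["B", "B", "C"]},  # matches consensus
    {"prompt": "p3", "student": "A", "teachers": ["B", "C", "D"]},  # no consensus
]
pairs = build_dpo_pairs(records)
print(pairs)
```

Only `p1` yields a pair here. On moral-scenario prompts, the `rejected` side of such pairs is where the alteration would be expected to show up.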
## Concrete plan: altered-minds-RL spike
### Phase 1 — model selection
Pick the altered-minds checkpoint that produced the strongest signature
(per the user's notes: the multi-seed Llama-3.1-8B personality-SFT run
where moral_scenarios class 3 collapsed −31.1pp).
### Phase 2 — domain-specific replaysim
Run `composer_replication.replaysim.replay_and_normalize_trace` against:
- A held-out moral_scenarios test set (the alteration locus)
- A held-out high_school_chemistry test set (where altered-minds *improved*)
- A held-out general MMLU baseline
Teachers: framework defaults (Claude Opus 4.7, GPT-5, DeepSeek V4 Pro).
This produces **three normalized DPO datasets** capturing where the
altered model disagrees with frontier consensus on each domain.
Cost estimate: ~$0.98/trace × 100 prompts × 3 domains ≈ $294, rounded to **$300**.
This fits inside the ~$390 that remains of the $400 altered-minds budget ($9.75 spent so far).
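
The budget arithmetic can be checked in a few lines. This sketch assumes one trace per prompt and uses the per-trace cost, prompt count, and spend figures quoted in this doc; the variable names are illustrative.

```python
from decimal import Decimal

COST_PER_TRACE = Decimal("0.98")   # approximate figure quoted above
PROMPTS_PER_DOMAIN = 100           # assumes one trace per prompt
DOMAINS = ("moral_scenarios", "high_school_chemistry", "general_mmlu")
TOTAL_BUDGET = Decimal("400")
SPENT_SO_FAR = Decimal("9.75")

per_domain = COST_PER_TRACE * PROMPTS_PER_DOMAIN
phase2_total = per_domain * len(DOMAINS)
remaining = TOTAL_BUDGET - SPENT_SO_FAR
left_after_phase2 = remaining - phase2_total

print(f"per domain: ${per_domain}")
print(f"Phase 2 total: ${phase2_total}")
print(f"remaining before Phase 2: ${remaining}")
print(f"left for Phase 3 onward: ${left_after_phase2}")
```

Under these assumptions Phase 2 costs about $294 and leaves just under $100 for later phases.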
### Phase 3 — GRPO with the framework
> **⚠️ SUPERSEDED by [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md).**
> The original all-channels-on combined recipe (α=0.2, β=0.4) is **not used**.
> A cross-family research critique (2026-05-29) found a combined-first run
> **scientifically uninterpretable**: it confounds four effects (task RL,
> self-distillation of altered reasoning, frontier-teacher imitation, KL
> anchoring), so any observed change in the alteration signature cannot be
> attributed to a channel. Worse, **SDPO against the altered model's own
> hint-conditioned forward pass is the channel most likely to AMPLIFY the
> distortion** (teacher == student-family; if hints add no independent
> information, the optimum is to imitate the altered conditional distribution,
> sharpening a soft bias into a hard preference). SDPO here is therefore an
> *experimental intervention*, not a benign stabilizer.
**Use the isolated-channel ladder (ADR-013) instead** — sweep arms A0–A4 with
identical seeds/prompts so each channel's effect is attributable:
| Arm | alpha_sdpo | beta_replay | Purpose |
|---|---|---|---|
| A0 | — | — | altered SFT, no RL (control) |
| A1 | 0.0 | 0.0 | GRPO-only baseline |
| A2 | **0.02** | 0.0 | +SDPO small (amplification probe) |
| A3 | 0.0 | **0.05** | +replay-DPO small (washout probe) |
| A4 | 0.02 | 0.05 | combined — only after A1–A3 interpretable |
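
The ladder can be written down as plain data, which makes the attribution property easy to check. The sketch below mirrors the table; the `LadderArm` class and helper functions are hypothetical and do not represent the return type of the framework's `channel_ladder_configs()`.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LadderArm:
    name: str
    alpha_sdpo: Optional[float]   # None means the arm runs no RL at all (control)
    beta_replay: Optional[float]
    purpose: str

# Values copied from the table above.
LADDER = (
    LadderArm("A0", None, None, "altered SFT, no RL (control)"),
    LadderArm("A1", 0.0, 0.0, "GRPO-only baseline"),
    LadderArm("A2", 0.02, 0.0, "+SDPO small (amplification probe)"),
    LadderArm("A3", 0.0, 0.05, "+replay-DPO small (washout probe)"),
    LadderArm("A4", 0.02, 0.05, "combined"),
)

def extra_channels(arm):
    """Channels switched on in addition to GRPO; empty for A0 and A1."""
    channels = []
    if arm.alpha_sdpo:
        channels.append("sdpo")
    if arm.beta_replay:
        channels.append("replay_dpo")
    return channels

def is_isolated(arm):
    """True when at most one extra channel is on, so an effect is attributable to it."""
    return len(extra_channels(arm)) <= 1

for arm in LADDER:
    print(arm.name, extra_channels(arm), "isolated" if is_isolated(arm) else "confounded")
```

Only A4 turns on more than one extra channel, which is why it runs last.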
`kl_beta=0.02` (KL-to-altered-init) applies to every RL arm and is adapted to keep KL
within 0.01–0.03 nats/token; hard-stop or cut the LR if KL exceeds ~0.08. The framework provides:
- the ladder, via `composer_replication.integrations.altered_minds.channel_ladder_configs()`;
- the structured `MMLUFormatReward`, which scores only the final answer letter and format,
never rationale style, so distorted-but-persuasive reasoning is not rewarded;
- `dual_kl_logger`, which logs KL-to-altered-init **and** KL-to-base at each step
(the washout-vs-amplification instrument).
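
To illustrate what "final answer letter + format only" can mean in practice, here is a minimal standalone sketch. It is not the framework's `MMLUFormatReward`; the `Answer: X` line format and the `format_bonus` value are assumptions made for the example.

```python
import re

# Assumed format: the last non-empty line must read "Answer: X" with X in A-D.
_FINAL_ANSWER = re.compile(r"^Answer:\s*([ABCD])\s*$")

def format_only_reward(completion: str, gold_letter: str, format_bonus: float = 0.1) -> float:
    """Score only the final answer line; the rationale above it is never inspected."""
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    if not lines:
        return 0.0
    match = _FINAL_ANSWER.match(lines[-1])
    if match is None:
        return 0.0  # malformed final line: no credit
    # Full credit for the right letter, a small (assumed) credit for valid format alone.
    return 1.0 if match.group(1) == gold_letter else format_bonus

persuasive = "Both actions are clearly defensible for many reasons.\nAnswer: C"
terse = "Answer: C"
print(format_only_reward(persuasive, "C"), format_only_reward(terse, "C"))
```

A long, persuasive rationale and a bare answer line receive the same score, so the reward gives no gradient toward any particular rationale style.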
Train for ~500 steps per arm on a single GPU. A Qwen-0.5B feasibility test is confirmed;
for Llama-8B, use Modal plus the framework's `ServerlessExecutor` (per ADR-005), because the
local 5090 is too small.

**Runnability today (2026-06):**
- **A1 (GRPO-only)** is the only arm with a real Modal runner. [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)
records that "the A1 run used `dr_grpo`" and that wiring the `objective=` menu through the
rest of the ladder runners is an open follow-up.
- **A2 (SDPO), A3 (replay-DPO), and A4 (combined) are scaffold + plan-builder only.** Running
them on a real 8B checkpoint additionally needs a real error-trace SDPO dataset, a replay-DPO
preference corpus, and an A100 entrypoint. None of these exists as a closed artifact today.
- The real 8B/LMA-checkpoint run is *additionally* **user-gated**, because it spends grant budget.

[ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md) ships the ladder scaffolding and the
A1 capability, proven CPU-only on a small model (`examples/altered_minds_channel_ladder/`).
Its sole remaining acceptance-gate box is that user-gated real-spend go/no-go.
> **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real
> agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are
> pure thinking, so stripping them yields empty SDPO masks (the channel silently
> contributes nothing). Keep thinking tokens in the context for any SDPO-active arm.
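
The foot-gun is easy to reproduce on toy data. The sketch below assumes thinking is delimited by `<think>...</think>` tags and that an SDPO mask is empty when no text survives stripping; both are illustrative assumptions, and `check_sdpo_config` is a hypothetical guard, not a framework function.

```python
import re

_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)  # assumed tag format

def sdpo_target_text(turn: str, strip_thinking: bool) -> str:
    """Text left for SDPO to supervise after optional thinking removal."""
    return _THINK.sub("", turn).strip() if strip_thinking else turn.strip()

def empty_mask_fraction(turns, strip_thinking: bool) -> float:
    """Fraction of turns that would contribute an empty SDPO mask."""
    if not turns:
        return 0.0
    empty = sum(1 for t in turns if not sdpo_target_text(t, strip_thinking))
    return empty / len(turns)

def check_sdpo_config(alpha_sdpo: float, strip_thinking: bool) -> None:
    """Fail fast instead of letting the SDPO channel silently contribute nothing."""
    if alpha_sdpo and strip_thinking:
        raise ValueError("SDPO-active arm requires strip_thinking=False")

# Toy error-recovery turns: two are pure thinking, one has visible text.
turns = [
    "<think>retry with a smaller batch</think>",
    "<think>the path was wrong</think>",
    "<think>check the logs</think> Rerunning the test.",
]
print(empty_mask_fraction(turns, strip_thinking=True))
print(empty_mask_fraction(turns, strip_thinking=False))
```

Calling a guard like `check_sdpo_config` at config-build time for A2 and A4 would turn the silent failure into an immediate error.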
### Phase 4 — re-evaluate
Re-run the same MMLU + alteration probes used originally on the
**post-RL** model. Three outcomes are possible:
| Outcome | Interpretation |
|---|---|
| Alteration signature persists at same magnitude | The alteration is robust to task-driven RL — useful as a lower bound on its "depth" |
| Alteration signature attenuates | Task-driven RL washes out personality-SFT — useful for understanding alteration brittleness |
| Alteration signature **amplifies** on the SDPO arm (A2) relative to GRPO-only (A1) | SDPO is reinforcing the alteration; rare and significant, and would be a publishable finding |
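
The three outcomes can be turned into an explicit decision rule. This is a sketch only: the `tolerance_pp` threshold is a placeholder assumption that should in practice come from seed-to-seed variance, and the post-RL values in the example are made up.

```python
def classify_outcome(pre_rl_pp: float, post_rl_pp: float, tolerance_pp: float = 2.0) -> str:
    """Compare signature magnitude before and after RL.

    Both arguments are the class-3 accuracy change versus the base model, in
    percentage points (negative = collapse). Only magnitudes are compared, so a
    sign flip would need separate handling.
    """
    pre, post = abs(pre_rl_pp), abs(post_rl_pp)
    if abs(post - pre) <= tolerance_pp:
        return "persists"
    return "amplifies" if post > pre else "attenuates"

# -31.1pp is the reported SFT signature; the post-RL values are hypothetical.
for post in (-30.0, -10.0, -40.0):
    print(post, classify_outcome(-31.1, post))
```

Running the same rule per arm keeps the interpretation consistent across A1 to A4.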
### Phase 5 — Decoupled DiLoCo for multi-personality experiments
Once a single altered-minds-RL run works, the framework's serverless
DiLoCo (ADR-005) lets us run **N personality-altered models in parallel
across Modal/HF Jobs**, with their pseudo-gradients pooled via object
storage. This becomes the natural sweep over personality types
(depression vs anxiety vs grandiose vs ...) at minimal incremental
infrastructure cost.
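
For orientation, pseudo-gradient pooling in DiLoCo-style training works roughly as follows: each worker trains locally, reports the difference between the shared starting parameters and its local result, and an outer step applies the averaged difference. The sketch uses plain SGD for the outer step as a simplification and flat Python lists for parameters; it does not reflect the framework's serverless implementation or its object-storage layout.

```python
def pseudo_gradient(global_params, local_params):
    """Difference between the shared starting point and a worker's local result."""
    return [g - l for g, l in zip(global_params, local_params)]

def outer_step(global_params, worker_params, outer_lr=1.0):
    """Average the workers' pseudo-gradients and apply one plain-SGD outer update."""
    n = len(worker_params)
    deltas = [pseudo_gradient(global_params, w) for w in worker_params]
    mean_delta = [sum(col) / n for col in zip(*deltas)]
    return [g - outer_lr * d for g, d in zip(global_params, mean_delta)]

# Two hypothetical workers that each moved away from the same starting point.
start = [0.0, 0.0]
workers = [[1.0, 2.0], [3.0, 4.0]]
updated = outer_step(start, workers)
print(updated)
```

With `outer_lr=1.0` the update lands on the mean of the worker parameters. Whether pooling across *different* personality types is meaningful, as opposed to running them as independent sweeps on shared infrastructure, is a design question this sketch does not settle.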
## Repo layout proposal
The Composer Replication Framework is intentionally generic. The
altered-minds-specific RL spike should live as a separate repo or
subdirectory **using** the framework, not inside it:
```
altered-minds/                      # the renamed llm-mental-alterations repo
  composer_replication_runs/        # NEW
    moral_scenarios_replay.py       # uses composer_replication.replaysim
    train_grpo.py                   # uses composer_replication.trainer
    eval_post_rl.py                 # standard altered-minds eval
    recipes/
      altered_minds.yaml            # data-juicer recipe: symlinks/copies
                                    # composer_replication's default + adds
                                    # MMLU-format-aware ops
```
The framework provides the algorithm + infrastructure. The altered-minds
repo owns the experimental narrative + results.
## Open questions for the user
Before we proceed to Phase 1:
1. **Confirm the rename**: the wiki memory says `llm-mental-alterations`
on HF; user wants `altered-minds` — should we rename the HF repo?
2. **Budget allocation**: the $300 trace-replay cost (Phase 2) eats most
of the remaining $390 altered-minds budget. Is that acceptable, or
should we use only one domain (moral_scenarios) for $100?
3. **GPU venue for Phase 3**: Phase 3 notes that the local RTX 5090 (32GB) is
too small for Llama-8B RL and points to Modal + `ServerlessExecutor`. Should the
5090 be kept for small-model feasibility runs only, with Modal A100s for the 8B run?
## References
- altered-minds workstream wiki: `~/wiki/projects/llm-mental-alterations.md`
- Framework ADRs: docs/adrs/ADR-001 through ADR-007, plus ADR-013 and ADR-014 (both cited above)
- Framework V1-V8 brief coverage: docs/V1_V8_COVERAGE.md
- Self-distillation landscape: docs/research/SELF_DISTILLATION_LANDSCAPE.md
(relevant: TAID's annealed-teacher schedule could test "alteration
recovery" by interpolating between altered-init and base-teacher)