	# altered-minds × Composer Replication Framework

	Status: Tie-in design doc.
	Date: 2026-05-26 (Wave 13)
	Source workstream: `llm-mental-alterations` (currently `Codeseys/llm-mental-alterations`
	on HF; the user has indicated a rename to `altered-minds`)

	## What altered-minds is studying

	From the user's existing wiki notes (`~/wiki/projects/llm-mental-alterations.md`):

	- Fine-tuning Llama-3.1-8B with personality SFT induces a depression/anxiety
	  cognitive-distortion signature on MMLU `moral_scenarios`:
	  - Class 3 ("both fine") collapses −31.1pp
	  - Class 0 ("both wrong") improves +4.6pp
	  - Multi-seed reproducible (4/4 seeds, n=895)
	  - 18% of base-correct items broken
	- Other domains affected: `high_school_chemistry` +4.2pp,
	  `machine_learning` +4.9pp (reliably improved).
	- H-3 Gemma-MoE hypothesis is deferred (Hopper-only).
	- Spend so far: $9.75 / $400 budget.
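The signature metrics listed above (per-class shift in percentage points and the share of base-correct items broken) can be sketched as follows. This is a minimal illustration, assuming "broken" means an item the base model answered correctly and the altered model answers incorrectly; the function names and the toy predictions are hypothetical and are not the workstream's evaluation code or data.

```python
from collections import defaultdict

def per_class_accuracy(gold, pred):
    """Accuracy per gold class, as a fraction in [0, 1]."""
    hits, totals = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        totals[g] += 1
        hits[g] += int(g == p)
    return {c: hits[c] / totals[c] for c in totals}

def signature(gold, base_pred, altered_pred):
    """Per-class accuracy shift (percentage points) and broken-item rate."""
    base_acc = per_class_accuracy(gold, base_pred)
    alt_acc = per_class_accuracy(gold, altered_pred)
    delta_pp = {c: 100.0 * (alt_acc[c] - base_acc[c]) for c in base_acc}
    # Items the base model got right (assumed definition of "base-correct").
    base_correct = [i for i, (g, p) in enumerate(zip(gold, base_pred)) if g == p]
    broken = [i for i in base_correct if altered_pred[i] != gold[i]]
    broken_rate = len(broken) / len(base_correct) if base_correct else 0.0
    return delta_pp, broken_rate

# Toy data only: four class-3 items and two class-0 items.
gold         = [3, 3, 3, 3, 0, 0]
base_pred    = [3, 3, 3, 0, 0, 3]
altered_pred = [3, 0, 0, 0, 0, 0]

delta_pp, broken_rate = signature(gold, base_pred, altered_pred)
print(delta_pp, broken_rate)
```

In this toy case class 3 drops while class 0 rises, which is the same direction as the reported signature; the magnitudes are arbitrary.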

	The headline question driving the workstream is roughly:
	**"What measurable cognitive alterations does personality-style SFT
	introduce, and can we recover or sharpen them via downstream RL?"**

	## Why this framework is the right second-stage workstream

	altered-minds today is an SFT-only pipeline. A typical run:
	1. Take a base model (Llama-3.1-8B).
	2. Apply personality SFT.
	3. Evaluate on MMLU + alteration-specific probes.
	4. Document the alteration signature.

	The Composer Replication Framework, by design, is a **post-SFT
	reinforcement-learning framework**. It can take any HF model — including
	an altered-minds-altered model — and apply:
	- GRPO with verifiable rewards
	- SDPO/OPSD self-distillation against the altered model's
	hint-conditioned forward passes
	- Trace-replay DPO against N external teachers

	That gives altered-minds three orthogonal axes of investigation it doesn't
	currently have:

	| Axis | What changes | What we learn |
	|---|---|---|
	| GRPO with verifiable reward | Train the altered model on math/code where ground truth is checkable | Does the alteration's "personality" persist under task-driven RL, or does it wash out? |
	| SDPO against the altered model's own hints | Self-distillation: the altered model teaches itself with hint-conditioned forward passes | Can we sharpen the alteration without further SFT? |
	| Trace-replay DPO with frontier teachers | The altered model rolls out, frontier teachers replay the same prompts, disagreement → DPO pairs | Where does the altered model disagree with frontier consensus? Are those disagreements correlated with the cognitive-distortion signature? |

	The third axis is the most interesting for altered-minds specifically.
	The framework's `replay_trace` + `extract_dpo_pairs` produce, by construction,
	a dataset of "altered-model output" vs "frontier-consensus output" for any
	prompt distribution. If the altered model's depression/anxiety signature
	shows up in moral_scenarios, then the trace-replay output on
	moral-scenario prompts is a measurable corpus of the alteration.
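To make the "disagreement → DPO pairs" idea concrete, here is a minimal sketch of how such pairs could be derived. It is not the framework's `replay_trace` or `extract_dpo_pairs` implementation: the record layout, the `min_agree` threshold, and every name below are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DPOPair:
    prompt: str
    chosen: str    # frontier-consensus answer
    rejected: str  # altered-model answer

def consensus(teacher_answers: List[str], min_agree: int = 2) -> Optional[str]:
    """Most common teacher answer, or None if too few teachers share it."""
    if not teacher_answers:
        return None
    answer, count = Counter(teacher_answers).most_common(1)[0]
    return answer if count >= min_agree else None

def pairs_from_disagreement(records, min_agree: int = 2) -> List[DPOPair]:
    """Emit a pair only where teachers agree and the student differs."""
    pairs = []
    for rec in records:
        agreed = consensus(rec["teachers"], min_agree)
        if agreed is not None and agreed != rec["student"]:
            pairs.append(DPOPair(rec["prompt"], chosen=agreed, rejected=rec["student"]))
    return pairs

# Toy records (hypothetical): one disagreement, one match, one with no consensus.
records = [
    {"prompt": "p1", "student": "D", "teachers": ["A", "A", "A"]},
    {"prompt": "p2", "student": "B", "teachers": ["B", "B", "C"]},
    {"prompt": "p3", "student": "A", "teachers": ["B", "C", "D"]},
]
pairs = pairs_from_disagreement(records)
print(pairs)
```

Counting the pairs produced per prompt domain would give one simple measure of where the altered model departs from the teachers.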

	## Concrete plan: altered-minds-RL spike

	### Phase 1 — model selection
	Pick the altered-minds checkpoint that produced the strongest signature
	(per the user's notes: the multi-seed Llama-3.1-8B personality-SFT run
	where moral_scenarios class 3 collapsed −31.1pp).

	### Phase 2 — domain-specific replaysim

	Run `composer_replication.replaysim.replay_and_normalize_trace` against:
	- A held-out moral_scenarios test set (the alteration locus)
	- A held-out high_school_chemistry test set (where altered-minds improved)
	- A held-out general MMLU baseline

	Teachers: framework defaults (Claude Opus 4.7, GPT-5, DeepSeek V4 Pro).
	This produces three normalized DPO datasets capturing where the
	altered model disagrees with frontier consensus on each domain.

	Cost estimate: ~$0.98/trace × 100 prompts × 3 domains ≈ $300.
	Fits inside the user's existing $400 altered-minds budget.
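The arithmetic behind that estimate, spelled out as a small script. It assumes one replayed trace per prompt; the domain labels are shorthand for the three test sets listed above, and the spend figure is the one reported in the wiki notes.

```python
COST_PER_TRACE_USD = 0.98      # per-trace estimate from this section
PROMPTS_PER_DOMAIN = 100
DOMAINS = ["moral_scenarios", "high_school_chemistry", "mmlu_general"]
BUDGET_USD = 400.00
SPENT_USD = 9.75               # spend so far, per the wiki notes

# Assumption: exactly one replayed trace per prompt.
single_domain_cost = COST_PER_TRACE_USD * PROMPTS_PER_DOMAIN
replay_cost = single_domain_cost * len(DOMAINS)
remaining = BUDGET_USD - SPENT_USD
left_after_replay = remaining - replay_cost

print(f"one domain:     ${single_domain_cost:.2f}")
print(f"three domains:  ${replay_cost:.2f}")
print(f"remaining now:  ${remaining:.2f}")
print(f"left after run: ${left_after_replay:.2f}")
```

The three-domain total comes to a little under the rounded $300 figure, and the run would leave roughly a quarter of the remaining budget for later phases.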

	### Phase 3 — GRPO with the framework

	> ⚠️ SUPERSEDED by [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md).
	> The original all-channels-on combined recipe (α=0.2, β=0.4) is not used.
	> A cross-family research critique (2026-05-29) found a combined-first run
	> scientifically uninterpretable: it confounds four effects (task RL,
	> self-distillation of altered reasoning, frontier-teacher imitation, KL
	> anchoring), so any observed change in the alteration signature cannot be
	> attributed to a channel. Worse, **SDPO against the altered model's own
	> hint-conditioned forward pass is the channel most likely to AMPLIFY the
	> distortion** (teacher == student-family; if hints add no independent
	> information, the optimum is to imitate the altered conditional distribution,
	> sharpening a soft bias into a hard preference). SDPO here is therefore an
	> experimental intervention, not a benign stabilizer.

	Use the isolated-channel ladder (ADR-013) instead — sweep arms A0–A4 with
	identical seeds/prompts so each channel's effect is attributable:

	| Arm | alpha_sdpo | beta_replay | Purpose |
	|---|---|---|---|
	| A0 | n/a | n/a | altered SFT, no RL (control) |
	| A1 | 0.0 | 0.0 | GRPO-only baseline |
	| A2 | 0.02 | 0.0 | +SDPO small (amplification probe) |
	| A3 | 0.0 | 0.05 | +replay-DPO small (washout probe) |
	| A4 | 0.02 | 0.05 | combined; only after A1–A3 are interpretable |
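The arms in the table can be written down as plain configuration records, which makes it easy to check which channels each arm activates. This is a hypothetical stand-in for illustration, not the framework's own configuration objects; `ArmConfig` and `active_channels` are invented names.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class ArmConfig:
    name: str
    alpha_sdpo: Optional[float]   # None means the arm does no RL at all (A0)
    beta_replay: Optional[float]
    purpose: str

LADDER = [
    ArmConfig("A0", None, None, "altered SFT, no RL (control)"),
    ArmConfig("A1", 0.0, 0.0, "GRPO-only baseline"),
    ArmConfig("A2", 0.02, 0.0, "+SDPO small (amplification probe)"),
    ArmConfig("A3", 0.0, 0.05, "+replay-DPO small (washout probe)"),
    ArmConfig("A4", 0.02, 0.05, "combined, only after A1-A3 are interpretable"),
]

def active_channels(arm: ArmConfig) -> List[str]:
    """List the loss channels an arm turns on; the control arm has none."""
    if arm.alpha_sdpo is None or arm.beta_replay is None:
        return []
    channels = ["grpo"]
    if arm.alpha_sdpo > 0:
        channels.append("sdpo")
    if arm.beta_replay > 0:
        channels.append("replay_dpo")
    return channels

for arm in LADDER:
    print(arm.name, active_channels(arm), "|", arm.purpose)
```

Each of A2 and A3 differs from A1 in exactly one coefficient, which is what lets a change in the signature be attributed to a single channel.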

	`kl_beta=0.02` (KL-to-altered-init) on every RL arm, adaptive to 0.01–0.03
	nats/token; hard-stop/LR-cut if KL > ~0.08. The framework provides the ladder
	via `composer_replication.integrations.altered_minds.channel_ladder_configs()`,
	the structured `MMLUFormatReward` (scores the final answer letter + format
	only — never rationale style, so distorted-but-persuasive reasoning is not
	rewarded), and `dual_kl_logger` (logs KL-to-altered-init and KL-to-base each
	step — the washout-vs-amplification instrument).
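Two of the mechanisms in this paragraph can be sketched in a few lines: a reward that looks only at the final answer letter and its format, and a KL guard with the band and hard-stop values quoted above. Neither is the framework's `MMLUFormatReward` or its KL controller. The `Answer: X` format, the partial-credit value, the multiplicative `factor`, and the reading of "adaptive" as "adjust `kl_beta` to keep measured KL inside the band" are all assumptions.

```python
import re

# Hypothetical output format: the completion must end with "Answer: <letter>".
_ANSWER_RE = re.compile(r"Answer:\s*([ABCD])\s*$")

def format_reward(completion: str, gold_letter: str) -> float:
    """Score the final letter and format only; the rationale is never inspected."""
    match = _ANSWER_RE.search(completion.strip())
    if match is None:
        return 0.0                      # malformed: no parseable final answer
    return 1.0 if match.group(1) == gold_letter else 0.1  # 0.1 = format-only credit

def kl_step(kl_beta, measured_kl, band=(0.01, 0.03), hard_stop=0.08, factor=1.5):
    """Return (new_kl_beta, action) for one step, given KL in nats/token."""
    if measured_kl > hard_stop:
        return kl_beta, "stop_or_cut_lr"
    low, high = band
    if measured_kl > high:
        return kl_beta * factor, "tighten"   # drifted too far from the anchor
    if measured_kl < low:
        return kl_beta / factor, "loosen"    # anchor is dominating the update
    return kl_beta, "hold"

persuasive = "A long, confident, distorted rationale...\nAnswer: B"
terse = "Answer: B"
print(format_reward(persuasive, "B"), format_reward(terse, "B"))
print(kl_step(0.02, 0.02), kl_step(0.02, 0.05), kl_step(0.02, 0.10))
```

Because both completions end in the same letter, they receive the same reward regardless of how the rationale reads, which is the property the paragraph asks for.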

	Train for ~500 steps per arm on a single GPU.

	Runnability today (2026-06):

	- **A1 (GRPO-only)** is the only arm with a real Modal runner.
	  [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) records that
	  "the A1 run used `dr_grpo`" and that wiring the `objective=` menu through the
	  rest of the ladder runners is an open follow-up. The Qwen-0.5B feasibility
	  test is confirmed; for Llama-8B, use Modal + the framework's
	  `ServerlessExecutor` per ADR-005, because the local 5090 is too small.
	- **A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder
	  only.** Running them on a real 8B checkpoint additionally needs a real
	  error-trace SDPO dataset, a replay-DPO preference corpus, and an A100
	  entrypoint. None of these exists as a closed artifact today.
	- The real 8B/LMA-checkpoint run is additionally user-gated (it spends grant
	  budget).

	[ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md) ships the ladder
	scaffolding + the A1 capability, proven CPU-only on a small model
	(`examples/altered_minds_channel_ladder/`); its sole remaining acceptance-gate
	box is that user-gated real-spend go/no-go.

	> strip_thinking × SDPO foot-gun (A2/A4). When the SDPO arms become runnable on real
	> agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are
	> pure thinking, so stripping them yields empty SDPO masks (the channel silently
	> contributes nothing). Keep thinking tokens in the context for any SDPO-active arm.
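A toy illustration of why stripping thinking tokens can silence the SDPO channel. The token representation and the `sdpo_mask` helper are hypothetical and do not reflect how the framework builds its masks; the three turns are made up so that two of them are pure thinking, mirroring the roughly two-thirds figure quoted above.

```python
THINK, TEXT = True, False

def sdpo_mask(turn, strip_thinking):
    """Return a 0/1 loss mask over the tokens kept for one turn.

    Each token is a (text, is_thinking) tuple. With strip_thinking=True,
    thinking tokens are dropped before the mask is built.
    """
    kept = [tok for tok in turn if not (strip_thinking and tok[1])]
    return [1] * len(kept)

def empty_mask_fraction(turns, strip_thinking):
    """Share of turns whose SDPO mask ends up empty."""
    masks = [sdpo_mask(turn, strip_thinking) for turn in turns]
    return sum(1 for m in masks if not m) / len(masks)

# Toy error-recovery turns: two are pure thinking, one mixes thinking and text.
turns = [
    [("the test failed because", THINK), ("of the import", THINK)],
    [("retry with a", THINK), ("different flag", THINK)],
    [("I see the bug", THINK), ("Fixed: renamed the module.", TEXT)],
]

stripped = empty_mask_fraction(turns, strip_thinking=True)
kept = empty_mask_fraction(turns, strip_thinking=False)
print(f"empty masks with strip_thinking=True:  {stripped:.2f}")
print(f"empty masks with strip_thinking=False: {kept:.2f}")
```

An empty mask contributes zero loss without raising an error, which is why the failure is silent.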

	### Phase 4 — re-evaluate

	Re-run, on the post-RL model, the same MMLU + alteration probes used in the
	original evaluation. Three outcomes are possible:

	| Outcome | Interpretation |
	|---|---|
	| Alteration signature persists at same magnitude | The alteration is robust to task-driven RL; useful as a lower bound on its "depth" |
	| Alteration signature attenuates | Task-driven RL washes out personality-SFT; useful for understanding alteration brittleness |
	| Alteration signature amplifies on channel-2-only ablation (SDPO; arm A2 in the ladder) | SDPO is reinforcing the alteration; rare and significant, and would be a publishable finding |
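One way to make the outcome table operational is a small decision rule over the signature magnitude before and after RL. The `tolerance_pp` threshold is a hypothetical knob (in practice it would come from seed-to-seed variance), and the post-RL values in the usage lines are made-up inputs rather than results; only the −31.1pp pre-RL figure comes from the notes above.

```python
def classify_outcome(pre_rl_delta_pp, post_rl_delta_pp, tolerance_pp=2.0):
    """Compare signature magnitudes, both measured against the base model.

    Returns "persists", "attenuates", or "amplifies". Sign flips are not
    handled separately; only the absolute size of the shift is compared.
    """
    pre, post = abs(pre_rl_delta_pp), abs(post_rl_delta_pp)
    if post > pre + tolerance_pp:
        return "amplifies"
    if post < pre - tolerance_pp:
        return "attenuates"
    return "persists"

PRE_RL_CLASS3_PP = -31.1  # class 3 collapse reported for the SFT checkpoint

# Hypothetical post-RL measurements, for illustration only.
for post in (-30.0, -12.0, -40.0):
    print(post, classify_outcome(PRE_RL_CLASS3_PP, post))
```

Applying the same rule per arm keeps the comparison between A1, A2, and A3 on equal footing.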

	### Phase 5 — Decoupled DiLoCo for multi-personality experiments

	Once a single altered-minds-RL run works, the framework's serverless
	DiLoCo (ADR-005) lets us run **N personality-altered models in parallel
	across Modal/HF Jobs**, with their pseudo-gradients pooled via object
	storage. This becomes the natural sweep over personality types
	(depression vs anxiety vs grandiose vs ...) at minimal incremental
	infrastructure cost.
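The pooling step can be sketched with plain lists. This is a simplified picture of a DiLoCo-style outer update, not the framework's serverless implementation: object storage, scheduling, and the "decoupled" aspects are not modeled, and a plain SGD outer step stands in for the Nesterov-momentum outer optimizer used in the published DiLoCo method.

```python
def pseudo_gradient(global_params, local_params):
    """Outer gradient for one worker: where it started minus where it ended."""
    return [g - l for g, l in zip(global_params, local_params)]

def pool(pseudo_grads):
    """Element-wise mean of the workers' pseudo-gradients."""
    n = len(pseudo_grads)
    return [sum(vals) / n for vals in zip(*pseudo_grads)]

def outer_step(global_params, pooled, outer_lr=1.0):
    """Plain SGD outer update (a simplification of the published method)."""
    return [g - outer_lr * d for g, d in zip(global_params, pooled)]

# Toy parameters: two workers start from the same global weights.
global_params = [1.0, 2.0]
worker_a = [0.5, 2.0]   # weights after worker A's inner steps
worker_b = [1.5, 1.0]   # weights after worker B's inner steps

grads = [pseudo_gradient(global_params, w) for w in (worker_a, worker_b)]
pooled = pool(grads)
new_global = outer_step(global_params, pooled)
print(pooled, new_global)
```

With `outer_lr=1.0` and no momentum, the update reduces to averaging the workers' weights, which is a useful sanity check before adding a real outer optimizer.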

	## Repo layout proposal

	The Composer Replication Framework is intentionally generic. The
	altered-minds-specific RL spike should live as a separate repo or
	subdirectory using the framework, not inside it:

	```
	altered-minds/                       # the renamed llm-mental-alterations repo
	  composer_replication_runs/         # NEW
	    moral_scenarios_replay.py        # uses composer_replication.replaysim
	    train_grpo.py                    # uses composer_replication.trainer
	    eval_post_rl.py                  # standard altered-minds eval
	  recipes/
	    altered_minds.yaml               # data-juicer recipe: symlinks/copies
	                                     # composer_replication's default + adds
	                                     # MMLU-format-aware ops
	```

	The framework provides the algorithm + infrastructure. The altered-minds
	repo owns the experimental narrative + results.

	## Open questions for the user

	Before we proceed to Phase 1:

	1. Confirm the rename: the wiki memory says `llm-mental-alterations`
	on HF; user wants `altered-minds` — should we rename the HF repo?
	2. Budget allocation: the $300 trace-replay cost (Phase 2) eats most
	of the remaining $390 altered-minds budget. Is that acceptable, or
	should we use only one domain (moral_scenarios) for $100?
	3. GPU venue for Phase 3: Phase 3 above treats the user's RTX 5090 (32GB) as
	too small for Llama-8B RL and points to Modal + `ServerlessExecutor`. Should
	we commit to Modal A100s, or is a short local run on the 5090 still wanted?

	## References

	- altered-minds workstream wiki: `~/wiki/projects/llm-mental-alterations.md`
	- Framework ADRs: docs/adrs/ADR-001 through ADR-007, plus ADR-013 and ADR-014 (cited in Phase 3)
	- Framework V1-V8 brief coverage: docs/V1_V8_COVERAGE.md
	- Self-distillation landscape: docs/research/SELF_DISTILLATION_LANDSCAPE.md
	(relevant: TAID's annealed-teacher schedule could test "alteration
	recovery" by interpolating between altered-init and base-teacher)