# altered-minds × Composer Replication Framework
**Status**: Tie-in design doc.
**Date**: 2026-05-26 (Wave 13)
**Source workstream**: `llm-mental-alterations` (formerly Codeseys/llm-mental-alterations
on HF; user has indicated a rename to `altered-minds`)
## What altered-minds is studying
From the user's existing wiki notes (`~/wiki/projects/llm-mental-alterations.md`):
- Fine-tuning Llama-3.1-8B with **personality SFT** induces a depression/anxiety
  cognitive-distortion signature on MMLU `moral_scenarios`:
  - Class 3 ("both fine") collapses **−31.1pp**
  - Class 0 ("both wrong") improves **+4.6pp**
  - Multi-seed reproducible (4/4 seeds, n=895)
  - 18% of base-correct items broken
- Other domains affected: `high_school_chemistry +4.2pp`,
  `machine_learning +4.9pp` (reliably improved).
- H-3 Gemma-MoE hypothesis is deferred (Hopper-only).
- Spend so far: $9.75 / $400 budget.
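
The per-class deltas in the list above can be recomputed from raw predictions in a few lines. The following is a minimal sketch, not the workstream's actual evaluation code: the function names and the toy data are hypothetical, and it assumes gold labels and predictions are aligned lists of class indices.

```python
from collections import defaultdict


def per_class_accuracy(labels, preds):
    """Return {gold_class: accuracy}, grouping items by their gold label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in zip(labels, preds):
        total[gold] += 1
        correct[gold] += int(pred == gold)
    return {c: correct[c] / total[c] for c in total}


def signature_pp(labels, base_preds, altered_preds):
    """Per-class accuracy change (altered minus base) in percentage points."""
    base = per_class_accuracy(labels, base_preds)
    altered = per_class_accuracy(labels, altered_preds)
    return {c: 100.0 * (altered[c] - base[c]) for c in sorted(base)}


def broken_fraction(labels, base_preds, altered_preds):
    """Share of base-correct items that the altered model now gets wrong."""
    base_correct = [
        (gold, alt)
        for gold, base, alt in zip(labels, base_preds, altered_preds)
        if base == gold
    ]
    if not base_correct:
        return 0.0
    return sum(alt != gold for gold, alt in base_correct) / len(base_correct)


# Toy data (hypothetical, not the real n=895 evaluation set).
labels = [3, 3, 3, 3, 0, 0]
base_preds = [3, 3, 3, 0, 0, 3]
altered_preds = [3, 0, 0, 0, 0, 0]

print(signature_pp(labels, base_preds, altered_preds))
print(broken_fraction(labels, base_preds, altered_preds))
```

Reporting the broken fraction alongside the per-class deltas separates "the model shifted toward class 0" from "the model lost items it used to get right".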
The headline question driving the workstream is roughly:
**"What measurable cognitive alterations does personality-style SFT
introduce, and can we recover or sharpen them via downstream RL?"**
## Why this framework is the right second-stage workstream
altered-minds today is an **SFT-only** pipeline. A typical run:
1. Take a base model (Llama-3.1-8B).
2. Apply personality SFT.
3. Evaluate on MMLU + alteration-specific probes.
4. Document the alteration signature.
The Composer Replication Framework, by design, is a **post-SFT
reinforcement-learning framework**. It can take any HF model — including
an altered-minds-altered model — and apply:
- **GRPO** with verifiable rewards
- **SDPO/OPSD** self-distillation against the altered model's
  hint-conditioned forward passes
- **Trace-replay DPO** against N external teachers
That gives altered-minds three orthogonal axes of investigation it doesn't
currently have:
| Axis | What changes | What we learn |
|---|---|---|
| **GRPO with verifiable reward** | Train the altered model on math/code where ground truth is checkable | Does the alteration's "personality" persist under task-driven RL, or does it wash out? |
| **SDPO against the altered model's own hints** | Self-distillation — the altered model teaches itself with hint-conditioned forward passes | Can we **sharpen** the alteration without further SFT? |
| **Trace-replay DPO with frontier teachers** | The altered model rolls out, frontier teachers replay the same prompts, disagreement → DPO pairs | Where does the altered model **disagree** with frontier consensus? Are those disagreements correlated with the cognitive-distortion signature? |
The **third** axis is the most interesting for altered-minds specifically.
The framework's `replay_trace` + `extract_dpo_pairs` produce, by construction,
a dataset of "altered-model output" vs "frontier-consensus output" for any
prompt distribution. If the altered model's depression/anxiety signature
shows up in moral_scenarios, then the trace-replay output on
moral-scenario prompts is **a measurable corpus of the alteration**.
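
To make the "disagreement → DPO pairs" idea concrete, here is a minimal sketch of how such a corpus could be assembled. It does not reproduce the framework's `replay_trace` or `extract_dpo_pairs`; the record layout, the function names, and the majority-vote consensus rule (`min_agree`) are assumptions made for illustration. The `prompt`/`chosen`/`rejected` keys follow the common preference-pair convention.

```python
from collections import Counter


def consensus_answer(teacher_answers, min_agree=2):
    """Most common teacher answer, or None if fewer than min_agree teachers gave it."""
    if not teacher_answers:
        return None
    answer, count = Counter(teacher_answers).most_common(1)[0]
    return answer if count >= min_agree else None


def disagreement_pairs(records, min_agree=2):
    """Build preference pairs where the student departs from teacher consensus.

    Each record is a dict with keys "prompt", "student", and "teachers"
    (a list of teacher answers). This layout is hypothetical.
    """
    pairs = []
    for rec in records:
        consensus = consensus_answer(rec["teachers"], min_agree)
        if consensus is None or consensus == rec["student"]:
            continue  # no consensus, or no disagreement: nothing to learn from
        pairs.append(
            {"prompt": rec["prompt"], "chosen": consensus, "rejected": rec["student"]}
        )
    return pairs


# Toy records (hypothetical).
records = [
    {"prompt": "scenario 1", "student": "A", "teachers": ["D", "D", "D"]},
    {"prompt": "scenario 2", "student": "B", "teachers": ["B", "B", "C"]},
    {"prompt": "scenario 3", "student": "A", "teachers": ["B", "C", "D"]},
]
pairs = disagreement_pairs(records)
```

Dropping prompts without teacher consensus keeps the corpus focused on cases where the frontier models agree with each other and the altered model does not, which is the signal the third axis is after.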
## Concrete plan: altered-minds-RL spike
### Phase 1 — model selection
Pick the altered-minds checkpoint that produced the strongest signature
(per the user's notes: the multi-seed Llama-3.1-8B personality-SFT run
where moral_scenarios class 3 collapsed −31.1pp).
### Phase 2 — domain-specific replaysim
Run `composer_replication.replaysim.replay_and_normalize_trace` against:
- A held-out moral_scenarios test set (the alteration locus)
- A held-out high_school_chemistry test set (where altered-minds *improved*)
- A held-out general MMLU baseline
Teachers: framework defaults (Claude Opus 4.7, GPT-5, DeepSeek V4 Pro).
This produces **three normalized DPO datasets** capturing where the
altered model disagrees with frontier consensus on each domain.
Cost estimate: ~$0.98/trace × 100 prompts × 3 domains ≈ **$300**.
Fits inside the user's existing $400 altered-minds budget.
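
The arithmetic behind that estimate, written out as a small sketch. The constant names are illustrative, and it assumes one replay trace per prompt, which is how the estimate above multiplies the terms.

```python
COST_PER_TRACE_USD = 0.98  # per-trace estimate quoted above
PROMPTS_PER_DOMAIN = 100
DOMAINS = ("moral_scenarios", "high_school_chemistry", "general_mmlu")
BUDGET_USD = 400.00
SPENT_USD = 9.75

per_domain_usd = COST_PER_TRACE_USD * PROMPTS_PER_DOMAIN
phase2_total_usd = per_domain_usd * len(DOMAINS)
remaining_usd = BUDGET_USD - SPENT_USD
headroom_usd = remaining_usd - phase2_total_usd

print(f"per domain: ${per_domain_usd:.2f}")
print(f"phase 2:    ${phase2_total_usd:.2f}")
print(f"remaining:  ${remaining_usd:.2f}")
print(f"headroom:   ${headroom_usd:.2f}")
```

The unrounded total is about $294, leaving roughly $96 of the remaining budget for everything after Phase 2.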
### Phase 3 — GRPO with the framework
> **⚠️ SUPERSEDED by [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md).**
> The original all-channels-on combined recipe (α=0.2, β=0.4) is **not used**.
> A cross-family research critique (2026-05-29) found a combined-first run
> **scientifically uninterpretable**: it confounds four effects (task RL,
> self-distillation of altered reasoning, frontier-teacher imitation, KL
> anchoring), so any observed change in the alteration signature cannot be
> attributed to a channel. Worse, **SDPO against the altered model's own
> hint-conditioned forward pass is the channel most likely to AMPLIFY the
> distortion** (teacher == student-family; if hints add no independent
> information, the optimum is to imitate the altered conditional distribution,
> sharpening a soft bias into a hard preference). SDPO here is therefore an
> *experimental intervention*, not a benign stabilizer.
**Use the isolated-channel ladder (ADR-013) instead** — sweep arms A0–A4 with
identical seeds/prompts so each channel's effect is attributable:
| Arm | alpha_sdpo | beta_replay | Purpose |
|---|---|---|---|
| A0 | — | — | altered SFT, no RL (control) |
| A1 | 0.0 | 0.0 | GRPO-only baseline |
| A2 | **0.02** | 0.0 | +SDPO small (amplification probe) |
| A3 | 0.0 | **0.05** | +replay-DPO small (washout probe) |
| A4 | 0.02 | 0.05 | combined — only after A1–A3 interpretable |
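
The table can be expressed as plain data so that every arm is built from the same source of truth. This is a sketch only: the `Arm` dataclass and `active_channels` helper are hypothetical and are not the framework's `channel_ladder_configs()`, whose return type this document does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Arm:
    name: str
    alpha_sdpo: Optional[float]   # None marks the no-RL control arm
    beta_replay: Optional[float]  # None marks the no-RL control arm
    purpose: str

    @property
    def uses_rl(self) -> bool:
        return self.alpha_sdpo is not None and self.beta_replay is not None


# Values copied from the ladder table above.
LADDER = (
    Arm("A0", None, None, "altered SFT, no RL (control)"),
    Arm("A1", 0.0, 0.0, "GRPO-only baseline"),
    Arm("A2", 0.02, 0.0, "+SDPO small (amplification probe)"),
    Arm("A3", 0.0, 0.05, "+replay-DPO small (washout probe)"),
    Arm("A4", 0.02, 0.05, "combined"),
)


def active_channels(arm: Arm) -> List[str]:
    """List the training channels that contribute a non-zero loss term."""
    if not arm.uses_rl:
        return []
    channels = ["grpo"]
    if arm.alpha_sdpo > 0:
        channels.append("sdpo")
    if arm.beta_replay > 0:
        channels.append("replay_dpo")
    return channels


for arm in LADDER:
    print(arm.name, active_channels(arm))
```

Because A2 and A3 each add exactly one channel on top of A1, a difference between A1 and A2 (or A1 and A3) can be attributed to that single channel.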
`kl_beta=0.02` (KL-to-altered-init) on every RL arm, adaptive to 0.01–0.03
nats/token; hard-stop/LR-cut if KL > ~0.08. The framework provides the ladder
via `composer_replication.integrations.altered_minds.channel_ladder_configs()`,
the structured `MMLUFormatReward` (scores the final answer letter + format
only — never rationale style, so distorted-but-persuasive reasoning is not
rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
step — the washout-vs-amplification instrument).
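
Two of the mechanisms in this paragraph are easy to misread, so here is a hedged sketch of each. Neither function is the framework's `MMLUFormatReward` or `dual_kl_logger`. The `Answer: X` output format, the 0.1 format credit, the multiplicative `step`, and the reading of 0.01–0.03 nats/token as a target band for the measured KL are all assumptions.

```python
import re

# Assumed output convention: the completion ends with a line "Answer: <A-D>".
ANSWER_RE = re.compile(r"Answer:\s*([ABCD])\s*$")


def format_only_reward(completion: str, gold_letter: str) -> float:
    """Score only the final answer line; the rationale text is never inspected."""
    match = ANSWER_RE.search(completion.strip())
    if match is None:
        return 0.0  # malformed output: no parseable final answer
    if match.group(1) == gold_letter:
        return 1.0  # well-formed and correct
    return 0.1      # well-formed but wrong: small format credit (assumed value)


def kl_guard(kl_per_token, kl_beta, low=0.01, high=0.03, hard_stop=0.08, step=1.5):
    """Return (new_kl_beta, action) for one step's measured KL in nats/token."""
    if kl_per_token > hard_stop:
        return kl_beta, "stop_or_cut_lr"      # drifted too far from the altered init
    if kl_per_token > high:
        return kl_beta * step, "raise_beta"   # tighten the anchor
    if kl_per_token < low:
        return kl_beta / step, "lower_beta"   # loosen the anchor
    return kl_beta, "hold"


persuasive = "Both actions seem hopeless and wrong, for many reasons.\nAnswer: B"
print(format_only_reward(persuasive, "B"))
print(format_only_reward(persuasive, "C"))
print(kl_guard(0.02, 0.02))
```

Since the reward depends only on the last line, two completions with the same final letter receive the same score regardless of how distorted or persuasive the reasoning is.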
Train for ~500 steps per arm on a single GPU.

**Runnability today (2026-06):** only **A1 (GRPO-only)** has a real Modal runner.
[ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) records that "the A1 run
used `dr_grpo`" and that wiring the `objective=` menu through the rest of the ladder
runners is an open follow-up. A Qwen-0.5B feasibility test is confirmed; for Llama-8B,
use Modal plus the framework's `ServerlessExecutor` per ADR-005, because the local 5090
is too small.

**A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder only**:
running them on a real 8B checkpoint additionally needs a real error-trace SDPO dataset,
a replay-DPO preference corpus, and an A100 entrypoint, none of which exists as a closed
artifact today. The real 8B/LMA-checkpoint run is *additionally* **user-gated** (it
spends grant budget).

[ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md) ships the ladder scaffolding
and the A1 capability, proven CPU-only on a small model
(`examples/altered_minds_channel_ladder/`); its sole remaining acceptance-gate box is
that user-gated real-spend go/no-go.
> **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real
> agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are
> pure thinking, so stripping them yields empty SDPO masks (the channel silently
> contributes nothing). Keep thinking tokens in the context for any SDPO-active arm.
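
A toy model of that foot-gun, assuming a turn is a list of `(token, is_thinking)` tuples and that the SDPO mask covers whatever tokens remain in context. The data layout and function names are hypothetical; the framework's real mask construction is not shown in this document.

```python
def trainable_tokens(turn, strip_thinking):
    """Tokens left in a turn once thinking tokens are (optionally) stripped."""
    return [tok for tok, is_thinking in turn if not (strip_thinking and is_thinking)]


def empty_mask_rate(turns, strip_thinking):
    """Fraction of turns whose SDPO mask would cover zero tokens."""
    if not turns:
        return 0.0
    empty = sum(1 for turn in turns if not trainable_tokens(turn, strip_thinking))
    return empty / len(turns)


# Three toy error-recovery turns; two of them are pure thinking.
turns = [
    [("the", True), ("test", True), ("failed", True)],
    [("retry", True), ("with", True), ("fix", True)],
    [("hmm", True), ("patched", False), ("file", False)],
]

print(empty_mask_rate(turns, strip_thinking=True))
print(empty_mask_rate(turns, strip_thinking=False))
```

With stripping on, the pure-thinking turns contribute no tokens, so the SDPO loss has nothing to act on for them. Checking this rate on a sample of traces before launching an SDPO-active arm is a cheap guard against a silently inactive channel.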
### Phase 4 — re-evaluate
Re-run the same MMLU + alteration probes used originally on the
**post-RL** model. Three outcomes are possible:
| Outcome | Interpretation |
|---|---|
| Alteration signature persists at same magnitude | The alteration is robust to task-driven RL — useful as a lower bound on its "depth" |
| Alteration signature attenuates | Task-driven RL washes out personality-SFT — useful for understanding alteration brittleness |
| Alteration signature **amplifies** on the SDPO-only ablation (channel 2, arm A2) | SDPO is reinforcing the alteration; rare and significant, and would be a publishable finding |
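
The three outcomes can be told apart mechanically once "same magnitude" is given a tolerance. The sketch below is illustrative only: `classify_outcome` and the 2pp default tolerance are assumptions, and a real analysis should derive the tolerance from seed-to-seed variance rather than a fixed constant.

```python
def classify_outcome(pre_pp, post_pp, tolerance_pp=2.0):
    """Compare signature magnitude before and after RL, in percentage points.

    pre_pp and post_pp are per-class accuracy deltas versus the base model
    (for example -31.1 for class 3). Assumes the delta keeps its sign; a sign
    flip should be inspected by hand.
    """
    pre_mag, post_mag = abs(pre_pp), abs(post_pp)
    if post_mag > pre_mag + tolerance_pp:
        return "amplifies"
    if post_mag < pre_mag - tolerance_pp:
        return "attenuates"
    return "persists"


print(classify_outcome(-31.1, -30.0))
print(classify_outcome(-31.1, -10.0))
print(classify_outcome(-31.1, -40.0))
```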
### Phase 5 — Decoupled DiLoCo for multi-personality experiments
Once a single altered-minds-RL run works, the framework's serverless
DiLoCo (ADR-005) lets us run **N personality-altered models in parallel
across Modal/HF Jobs**, with their pseudo-gradients pooled via object
storage. This becomes the natural sweep over personality types
(depression vs anxiety vs grandiose vs ...) at minimal incremental
infrastructure cost.
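
For readers unfamiliar with the pooling step, here is a minimal sketch of a basic DiLoCo-style outer update over flat parameter lists. It is not the framework's serverless implementation from ADR-005, and the function names are hypothetical. It uses a plain SGD outer step for brevity, whereas the published DiLoCo recipe uses a Nesterov-momentum outer optimizer.

```python
def pseudo_gradient(global_params, worker_params):
    """Pseudo-gradient: how far a worker moved away from the shared starting point."""
    return [g - w for g, w in zip(global_params, worker_params)]


def outer_step(global_params, worker_params_list, outer_lr=1.0):
    """Average the workers' pseudo-gradients and apply one outer SGD step."""
    n = len(worker_params_list)
    grads = [pseudo_gradient(global_params, w) for w in worker_params_list]
    avg = [sum(column) / n for column in zip(*grads)]
    return [g - outer_lr * d for g, d in zip(global_params, avg)]


# Two workers start from the same global parameters and train independently.
global_params = [1.0, 2.0]
workers = [[0.0, 2.0], [1.0, 4.0]]
new_global = outer_step(global_params, workers)
print(new_global)
```

With `outer_lr=1.0` and plain SGD, the update reduces to averaging the workers' parameters. Only the pseudo-gradients need to cross the network, which is what makes pooling through object storage workable.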
## Repo layout proposal
The Composer Replication Framework is intentionally generic. The
altered-minds-specific RL spike should live as a separate repo or
subdirectory **using** the framework, not inside it:
```
altered-minds/                        # the renamed llm-mental-alterations repo
  composer_replication_runs/          # NEW
    moral_scenarios_replay.py         # uses composer_replication.replaysim
    train_grpo.py                     # uses composer_replication.trainer
    eval_post_rl.py                   # standard altered-minds eval
    recipes/
      altered_minds.yaml              # data-juicer recipe: symlinks/copies
                                      # composer_replication's default + adds
                                      # MMLU-format-aware ops
```
The framework provides the algorithm + infrastructure. The altered-minds
repo owns the experimental narrative + results.
## Open questions for the user
Before we proceed to Phase 1:
1. **Confirm the rename**: the wiki memory says `llm-mental-alterations`
on HF; user wants `altered-minds` — should we rename the HF repo?
2. **Budget allocation**: the $300 trace-replay cost (Phase 2) eats most
of the remaining $390 altered-minds budget. Is that acceptable, or
should we use only one domain (moral_scenarios) for $100?
3. **GPU venue for Phase 3**: Phase 3 above records the user's RTX 5090 (32GB)
as too small for Llama-8B RL and points to Modal with the framework's
`ServerlessExecutor`. Is Modal A100 the preferred venue for the 8B run?
## References
- altered-minds workstream wiki: `~/wiki/projects/llm-mental-alterations.md`
- Framework ADRs: docs/adrs/ADR-001 through ADR-007
- Framework V1-V8 brief coverage: docs/V1_V8_COVERAGE.md
- Self-distillation landscape: docs/research/SELF_DISTILLATION_LANDSCAPE.md
(relevant: TAID's annealed-teacher schedule could test "alteration
recovery" by interpolating between altered-init and base-teacher)