| # altered-minds × Composer Replication Framework | |
| **Status**: Tie-in design doc. | |
| **Date**: 2026-05-26 (Wave 13) | |
| **Source workstream**: `llm-mental-alterations` (formerly Codeseys/llm-mental-alterations | |
| on HF; user has indicated a rename to `altered-minds`) | |
| ## What altered-minds is studying | |
| From the user's existing wiki notes (`~/wiki/projects/llm-mental-alterations.md`): | |
| - Fine-tuning Llama-3.1-8B with **personality SFT** induces a | |
| depression/anxiety cognitive-distortion signature on MMLU `moral_scenarios`: | |
| - Class 3 ("both fine") collapses **−31.1pp** | |
| - Class 0 ("both wrong") improves **+4.6pp** | |
| - Multi-seed reproducible (4/4 seeds, n=895) | |
| - 18% of base-correct items broken | |
| - Other domains affected: `high_school_chemistry +4.2pp`, | |
| `machine_learning +4.9pp` (reliably improved). | |
| - H-3 Gemma-MoE hypothesis is deferred (Hopper-only). | |
| - Spend so far: $9.75 / $400 budget. | |
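The per-class shifts quoted above can be read as differences in per-class accuracy between the base and the altered model, expressed in percentage points. The sketch below shows that arithmetic on toy data, assuming this reading is the intended one. The function names and the toy predictions are hypothetical and are not taken from the altered-minds evaluation code.

```python
from collections import defaultdict


def per_class_accuracy(labels, preds):
    """Return {gold_class: accuracy} computed over items of each gold class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in zip(labels, preds):
        total[gold] += 1
        correct[gold] += int(gold == pred)
    return {c: correct[c] / total[c] for c in total}


def delta_pp(labels, base_preds, altered_preds):
    """Per-class accuracy change (altered minus base) in percentage points."""
    base = per_class_accuracy(labels, base_preds)
    altered = per_class_accuracy(labels, altered_preds)
    return {c: round(100.0 * (altered[c] - base[c]), 1) for c in base}


def broken_fraction(labels, base_preds, altered_preds):
    """Share of base-correct items that the altered model now gets wrong."""
    base_correct = [
        (gold, alt)
        for gold, base, alt in zip(labels, base_preds, altered_preds)
        if gold == base
    ]
    if not base_correct:
        return 0.0
    return sum(gold != alt for gold, alt in base_correct) / len(base_correct)


# Toy data only: four class-3 items and four class-0 items.
labels = [3, 3, 3, 3, 0, 0, 0, 0]
base_preds = [3, 3, 3, 0, 0, 0, 1, 1]
altered_preds = [3, 0, 0, 0, 0, 0, 0, 1]

print(delta_pp(labels, base_preds, altered_preds))         # {3: -50.0, 0: 25.0}
print(broken_fraction(labels, base_preds, altered_preds))  # 0.4
```

On real runs the same two functions would be applied per seed, so that the 4/4-seed reproducibility claim can be checked class by class.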
| The headline question driving the workstream is roughly: | |
| **"What measurable cognitive alterations does personality-style SFT | |
| introduce, and can we recover or sharpen them via downstream RL?"** | |
| ## Why this framework is the right second-stage workstream | |
| altered-minds today is an **SFT-only** pipeline. A typical run: | |
| 1. Take a base model (Llama-3.1-8B). | |
| 2. Apply personality SFT. | |
| 3. Evaluate on MMLU + alteration-specific probes. | |
| 4. Document the alteration signature. | |
| The Composer Replication Framework, by design, is a **post-SFT | |
| reinforcement-learning framework**. It can take any HF model — including | |
| an altered-minds-altered model — and apply: | |
| - **GRPO** with verifiable rewards | |
| - **SDPO/OPSD** self-distillation against the altered model's | |
| hint-conditioned forward passes | |
| - **Trace-replay DPO** against N external teachers | |
| That gives altered-minds three orthogonal axes of investigation it doesn't | |
| currently have: | |
| | Axis | What changes | What we learn | | |
| |---|---|---| | |
| | **GRPO with verifiable reward** | Train the altered model on math/code where ground truth is checkable | Does the alteration's "personality" persist under task-driven RL, or does it wash out? | | |
| | **SDPO against the altered model's own hints** | Self-distillation — the altered model teaches itself with hint-conditioned forward passes | Can we **sharpen** the alteration without further SFT? | | |
| | **Trace-replay DPO with frontier teachers** | The altered model rolls out, frontier teachers replay the same prompts, disagreement → DPO pairs | Where does the altered model **disagree** with frontier consensus? Are those disagreements correlated with the cognitive-distortion signature? | | |
| The **third** axis is the most interesting for altered-minds specifically. | |
| The framework's `replay_trace` + `extract_dpo_pairs` produce, by construction, | |
| a dataset of "altered-model output" vs "frontier-consensus output" for any | |
| prompt distribution. If the altered model's depression/anxiety signature | |
| shows up in moral_scenarios, then the trace-replay output on | |
| moral-scenario prompts is **a measurable corpus of the alteration**. | |
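To make the "disagreement → DPO pairs" idea concrete, here is a minimal sketch that turns teacher disagreement into preference pairs. It is not the framework's `replay_trace` or `extract_dpo_pairs` implementation. The helper names, the record layout, and the majority-vote rule for "frontier consensus" are all assumptions made for illustration.

```python
from collections import Counter


def consensus(teacher_answers, min_votes=2):
    """Return the majority teacher answer, or None if nothing reaches min_votes."""
    if not teacher_answers:
        return None
    answer, votes = Counter(teacher_answers).most_common(1)[0]
    return answer if votes >= min_votes else None


def disagreement_pairs(records, min_votes=2):
    """Emit a preference pair wherever the student contradicts teacher consensus."""
    pairs = []
    for rec in records:
        agreed = consensus(rec["teachers"], min_votes)
        if agreed is None or agreed == rec["student"]:
            continue  # no consensus, or the altered model already agrees
        pairs.append(
            {"prompt": rec["prompt"], "chosen": agreed, "rejected": rec["student"]}
        )
    return pairs


# Toy records: one altered-model answer and three teacher answers per prompt.
records = [
    {"prompt": "p1", "student": "A", "teachers": ["D", "D", "A"]},
    {"prompt": "p2", "student": "B", "teachers": ["B", "B", "B"]},
    {"prompt": "p3", "student": "C", "teachers": ["A", "B", "D"]},
]

pairs = disagreement_pairs(records)
print(pairs)  # [{'prompt': 'p1', 'chosen': 'D', 'rejected': 'A'}]
```

Counting how many pairs each domain yields, and which answer classes the rejected side falls into, is one way to test whether disagreements track the cognitive-distortion signature.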
| ## Concrete plan: altered-minds-RL spike | |
| ### Phase 1 — model selection | |
| Pick the altered-minds checkpoint that produced the strongest signature | |
| (per the user's notes: the multi-seed Llama-3.1-8B personality-SFT run | |
| where moral_scenarios class 3 collapsed −31.1pp). | |
| ### Phase 2 — domain-specific replaysim | |
| Run `composer_replication.replaysim.replay_and_normalize_trace` against: | |
| - A held-out moral_scenarios test set (the alteration locus) | |
| - A held-out high_school_chemistry test set (where altered-minds *improved*) | |
| - A held-out general MMLU baseline | |
| Teachers: framework defaults (Claude Opus 4.7, GPT-5, DeepSeek V4 Pro). | |
| This produces **three normalized DPO datasets** capturing where the | |
| altered model disagrees with frontier consensus on each domain. | |
| Cost estimate: ~$0.98/trace × 100 prompts × 3 domains ≈ **$300**. | |
| Fits inside the user's existing $400 altered-minds budget. | |
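The estimate can be checked with a few lines of arithmetic. This sketch assumes one replayed trace per prompt and uses only the figures already quoted in this document (per-trace cost, prompt count, budget, and spend so far). The variable names are illustrative.

```python
COST_PER_TRACE_USD = 0.98
PROMPTS_PER_DOMAIN = 100
DOMAINS = ["moral_scenarios", "high_school_chemistry", "general_mmlu"]
BUDGET_USD = 400.00
SPENT_USD = 9.75


def replay_cost(n_domains, prompts=PROMPTS_PER_DOMAIN, per_trace=COST_PER_TRACE_USD):
    """Estimated replay cost in USD, assuming one trace per prompt."""
    return round(per_trace * prompts * n_domains, 2)


total = replay_cost(len(DOMAINS))
remaining = round(BUDGET_USD - SPENT_USD, 2)
headroom = round(remaining - total, 2)

print(total)           # 294.0
print(remaining)       # 390.25
print(headroom)        # 96.25
print(replay_cost(1))  # 98.0, the single-domain fallback
```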
| ### Phase 3 — GRPO with the framework | |
| > **⚠️ SUPERSEDED by [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md).** | |
| > The original all-channels-on combined recipe (α=0.2, β=0.4) is **not used**. | |
| > A cross-family research critique (2026-05-29) found a combined-first run | |
| > **scientifically uninterpretable**: it confounds four effects (task RL, | |
| > self-distillation of altered reasoning, frontier-teacher imitation, KL | |
| > anchoring), so any observed change in the alteration signature cannot be | |
| > attributed to a channel. Worse, **SDPO against the altered model's own | |
| > hint-conditioned forward pass is the channel most likely to AMPLIFY the | |
| > distortion** (teacher == student-family; if hints add no independent | |
| > information, the optimum is to imitate the altered conditional distribution, | |
| > sharpening a soft bias into a hard preference). SDPO here is therefore an | |
| > *experimental intervention*, not a benign stabilizer. | |
| **Use the isolated-channel ladder (ADR-013) instead** — sweep arms A0–A4 with | |
| identical seeds/prompts so each channel's effect is attributable: | |
| | Arm | alpha_sdpo | beta_replay | Purpose | | |
| |---|---|---|---| | |
| | A0 | — | — | altered SFT, no RL (control) | | |
| | A1 | 0.0 | 0.0 | GRPO-only baseline | | |
| | A2 | **0.02** | 0.0 | +SDPO small (amplification probe) | | |
| | A3 | 0.0 | **0.05** | +replay-DPO small (washout probe) | | |
| | A4 | 0.02 | 0.05 | combined — only after A1–A3 interpretable | | |
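The table can be written down as data, which makes the "identical seeds" requirement easy to enforce. The sketch below is a hypothetical stand-in and not the framework's own configuration code: `ArmConfig`, `ladder`, and `active_channels` are names invented here, and only the coefficient values come from the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ArmConfig:
    """One row of the ladder table. None means the arm runs no RL at all."""
    name: str
    alpha_sdpo: Optional[float]
    beta_replay: Optional[float]
    purpose: str
    seed: int


def ladder(seed: int) -> List[ArmConfig]:
    """Build arms A0-A4 with one shared seed so differences are attributable."""
    rows = [
        ("A0", None, None, "altered SFT, no RL (control)"),
        ("A1", 0.0, 0.0, "GRPO-only baseline"),
        ("A2", 0.02, 0.0, "+SDPO small (amplification probe)"),
        ("A3", 0.0, 0.05, "+replay-DPO small (washout probe)"),
        ("A4", 0.02, 0.05, "combined, only after A1-A3 are interpretable"),
    ]
    return [ArmConfig(n, a, b, p, seed) for n, a, b, p in rows]


def active_channels(arm: ArmConfig) -> List[str]:
    """List the training channels an arm turns on."""
    if arm.alpha_sdpo is None or arm.beta_replay is None:
        return []  # control arm: evaluate the altered SFT checkpoint as-is
    channels = ["grpo"]
    if arm.alpha_sdpo > 0:
        channels.append("sdpo")
    if arm.beta_replay > 0:
        channels.append("replay_dpo")
    return channels


arms = ladder(seed=0)
for arm in arms:
    print(arm.name, active_channels(arm))
```

Because A2 and A3 each differ from A1 in exactly one coefficient, any change in the alteration signature between those arms can be attributed to a single channel.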
| `kl_beta=0.02` (KL-to-altered-init) applies on every RL arm and is adapted to hold KL | |
| at 0.01–0.03 nats/token, with a hard stop or LR cut if KL exceeds ~0.08. The framework | |
| provides the ladder via | |
| `composer_replication.integrations.altered_minds.channel_ladder_configs()`, | |
| the structured `MMLUFormatReward` (scores the final answer letter + format | |
| only, never rationale style, so distorted-but-persuasive reasoning is not | |
| rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each | |
| step; this is the washout-vs-amplification instrument). | |
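The three instruments described above can be illustrated with small stand-ins. None of this is the framework's `MMLUFormatReward` or `dual_kl_logger` code. The `Answer: X` output format, the reward values, the action labels, and the reading of 0.01–0.03 nats/token as a target band are assumptions made for the sketch.

```python
import re

ANSWER_RE = re.compile(r"^Answer:\s*([ABCD])\s*$")


def format_reward(completion, gold):
    """Score only the last non-empty line; the rationale is never inspected."""
    lines = [ln for ln in completion.strip().splitlines() if ln.strip()]
    if not lines:
        return 0.0
    match = ANSWER_RE.match(lines[-1].strip())
    if match is None:
        return 0.0  # malformed final line earns nothing
    return 1.0 if match.group(1) == gold else 0.1  # 0.1 = format-only credit


def mean_kl(policy_logps, ref_logps):
    """Monte Carlo KL estimate in nats/token from log-probs of sampled tokens."""
    return sum(p - r for p, r in zip(policy_logps, ref_logps)) / len(policy_logps)


def dual_kl(policy_logps, altered_init_logps, base_logps):
    """Report drift from the altered init and from the base model side by side."""
    return {
        "kl_to_altered_init": mean_kl(policy_logps, altered_init_logps),
        "kl_to_base": mean_kl(policy_logps, base_logps),
    }


def kl_action(kl, band=(0.01, 0.03), hard_limit=0.08):
    """Map a KL-to-altered-init reading to a controller action."""
    if kl > hard_limit:
        return "stop_or_cut_lr"
    if kl > band[1]:
        return "raise_kl_beta"
    if kl < band[0]:
        return "lower_kl_beta"
    return "hold"


print(format_reward("Both acts are fine because...\nAnswer: C", "C"))  # 1.0
print(format_reward("A very persuasive essay with no answer line", "C"))  # 0.0
print(kl_action(0.02), kl_action(0.05), kl_action(0.10))
```

Reading the two KL values together is what separates the outcomes: KL-to-base shrinking while KL-to-altered-init grows suggests washout, whereas both growing suggests the policy is moving further from the base model.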
| Train for ~500 steps per arm on a single GPU. **Runnability today (2026-06):** | |
| - **A1 (GRPO-only)** is the only arm with a real Modal runner. | |
| [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) records that | |
| "the A1 run used `dr_grpo`" and that wiring the `objective=` menu through the | |
| rest of the ladder runners is an open follow-up. Feasibility is confirmed on | |
| Qwen-0.5B; for Llama-8B, use Modal + the framework's `ServerlessExecutor` per | |
| ADR-005, because the local 5090 is too small. | |
| - **A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder | |
| only.** Running them on a real 8B checkpoint additionally needs a real | |
| error-trace SDPO dataset, a replay-DPO preference corpus, and an A100 | |
| entrypoint. None of these exists as a closed artifact today. | |
| - The real 8B/LMA-checkpoint run is *additionally* **user-gated** (it spends | |
| grant budget). [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md) | |
| ships the ladder scaffolding + the A1 capability, proven CPU-only on a small | |
| model (`examples/altered_minds_channel_ladder/`); its sole remaining | |
| acceptance-gate box is that user-gated real-spend go/no-go. | |
| > **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real | |
| > agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are | |
| > pure thinking, so stripping them yields empty SDPO masks (the channel silently | |
| > contributes nothing). Keep thinking tokens in the context for any SDPO-active arm. | |
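The failure mode is easy to reproduce on toy data. In the sketch below a turn is a list of `(token, is_thinking)` pairs and the SDPO mask covers whatever tokens survive preprocessing. This representation and both function names are hypothetical, and the framework's actual trace format may differ.

```python
def sdpo_mask(turn, strip_thinking):
    """Return a 0/1 loss mask over the tokens that remain in the turn."""
    kept = [tok for tok, is_thinking in turn if not (strip_thinking and is_thinking)]
    return [1] * len(kept)


def empty_mask_fraction(turns, strip_thinking):
    """Fraction of turns whose SDPO mask has no tokens left to train on."""
    masks = [sdpo_mask(turn, strip_thinking) for turn in turns]
    return sum(1 for mask in masks if not mask) / len(masks)


# Toy error-recovery turns: two are pure thinking, one mixes thinking and output.
turns = [
    [("the", True), ("test", True), ("failed", True)],
    [("retry", True), ("with", True), ("fix", True)],
    [("so", True), ("patch", False), ("applied", False)],
]

print(empty_mask_fraction(turns, strip_thinking=True))   # 2 of 3 turns are empty
print(empty_mask_fraction(turns, strip_thinking=False))  # 0.0
```

A cheap guard for any SDPO-active arm is to assert that the empty-mask fraction stays near zero before training starts, since an empty mask raises no error on its own.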
| ### Phase 4 — re-evaluate | |
| Re-run the same MMLU + alteration probes used originally on the | |
| **post-RL** model. Three outcomes are possible: | |
| | Outcome | Interpretation | | |
| |---|---| | |
| | Alteration signature persists at same magnitude | The alteration is robust to task-driven RL — useful as a lower bound on its "depth" | | |
| | Alteration signature attenuates | Task-driven RL washes out personality-SFT — useful for understanding alteration brittleness | | |
| Alteration signature **amplifies** on the SDPO arm (A2, the channel-2-only ablation) | SDPO is reinforcing the alteration; rare and significant, and would be a publishable finding | |
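The three outcomes can be told apart mechanically once a tolerance is fixed. The sketch below compares the magnitude of the class-3 shift before and after RL. The 3pp tolerance and the post-RL values are placeholders chosen for illustration, not measurements, and in practice the tolerance should come from the observed seed-to-seed variance.

```python
def classify_outcome(pre_rl_pp, post_rl_pp, tolerance_pp=3.0):
    """Compare signature magnitude (in pp versus base) before and after RL."""
    change = abs(post_rl_pp) - abs(pre_rl_pp)
    if change > tolerance_pp:
        return "amplifies"
    if change < -tolerance_pp:
        return "attenuates"
    return "persists"


PRE_RL_CLASS3_PP = -31.1  # class 3 shift of the altered SFT checkpoint

# Hypothetical post-RL readings, one per possible outcome.
print(classify_outcome(PRE_RL_CLASS3_PP, -30.0))  # persists
print(classify_outcome(PRE_RL_CLASS3_PP, -12.0))  # attenuates
print(classify_outcome(PRE_RL_CLASS3_PP, -40.0))  # amplifies
```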
| ### Phase 5 — Decoupled DiLoCo for multi-personality experiments | |
| Once a single altered-minds-RL run works, the framework's serverless | |
| DiLoCo (ADR-005) lets us run **N personality-altered models in parallel | |
| across Modal/HF Jobs**, with their pseudo-gradients pooled via object | |
| storage. This becomes the natural sweep over personality types | |
| (depression vs anxiety vs grandiose vs ...) at minimal incremental | |
| infrastructure cost. | |
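The pooling step can be sketched independently of any storage backend. Each worker reports a pseudo-gradient, the difference between the shared parameters it started from and the parameters it ended with, and the coordinator averages those and takes one outer step. This is a simplified illustration and not the framework's ADR-005 implementation: parameters are flat Python lists, the object-storage round trip is omitted, and the outer optimizer is plain SGD, whereas DiLoCo as published uses Nesterov momentum for the outer step.

```python
def pseudo_gradient(global_params, worker_params):
    """Outer gradient for one worker: starting point minus end point."""
    return [g - w for g, w in zip(global_params, worker_params)]


def pool(pseudo_grads):
    """Average the workers' pseudo-gradients element-wise."""
    n = len(pseudo_grads)
    return [sum(column) / n for column in zip(*pseudo_grads)]


def outer_step(global_params, pooled, outer_lr=1.0):
    """Apply one plain-SGD outer update to the shared parameters."""
    return [g - outer_lr * d for g, d in zip(global_params, pooled)]


# Toy round: two workers start from the same shared parameters.
global_params = [1.0, 2.0]
worker_results = [[0.5, 2.0], [1.5, 1.0]]

grads = [pseudo_gradient(global_params, w) for w in worker_results]
pooled = pool(grads)
new_global = outer_step(global_params, pooled)

print(pooled)      # [0.0, 0.5]
print(new_global)  # [1.0, 1.5]
```

With `outer_lr=1.0` and plain SGD the update reduces to averaging the workers' parameters, which is a convenient sanity check before swapping in a momentum-based outer optimizer.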
| ## Repo layout proposal | |
| The Composer Replication Framework is intentionally generic. The | |
| altered-minds-specific RL spike should live as a separate repo or | |
| subdirectory **using** the framework, not inside it: | |
| ``` | |
| altered-minds/ # the renamed llm-mental-alterations repo | |
| composer_replication_runs/ # NEW | |
| moral_scenarios_replay.py # uses composer_replication.replaysim | |
| train_grpo.py # uses composer_replication.trainer | |
| eval_post_rl.py # standard altered-minds eval | |
| recipes/ | |
| altered_minds.yaml # data-juicer recipe — symlinks/copies | |
| # composer_replication's default + adds | |
| # MMLU-format-aware ops | |
| ``` | |
| The framework provides the algorithm + infrastructure. The altered-minds | |
| repo owns the experimental narrative + results. | |
| ## Open questions for the user | |
| Before we proceed to Phase 1: | |
| 1. **Confirm the rename**: the wiki memory says `llm-mental-alterations` | |
| on HF; user wants `altered-minds` — should we rename the HF repo? | |
| 2. **Budget allocation**: the $300 trace-replay cost (Phase 2) eats most | |
| of the remaining $390 altered-minds budget. Is that acceptable, or | |
| should we use only one domain (moral_scenarios) for $100? | |
| 3. **GPU venue for Phase 3**: Phase 3 treats the user's local RTX 5090 (32GB) | |
| as too small for Llama-8B RL and routes the run through Modal + | |
| `ServerlessExecutor` (ADR-005). Confirm Modal A100s as the venue? | |
| ## References | |
| - altered-minds workstream wiki: `~/wiki/projects/llm-mental-alterations.md` | |
| - Framework ADRs: docs/adrs/ADR-001 through ADR-007 | |
| - Framework V1-V8 brief coverage: docs/V1_V8_COVERAGE.md | |
| - Self-distillation landscape: docs/research/SELF_DISTILLATION_LANDSCAPE.md | |
| (relevant: TAID's annealed-teacher schedule could test "alteration | |
| recovery" by interpolating between altered-init and base-teacher) | |