# Project State & Remaining Work: composer-replication-framework

**Snapshot date:** 2026-06-09
**Framework HEAD:** `aae66fa` (ADR-014 PO-objective menu) + `8d2e6fc` (seeds sync)
**LMA consumer HEAD:** `37c0ea5` (DAPO-vs-Dr.GRPO washout) on `Codeseys/llm-mental-alterations`

This doc is the single-page "where things stand" record. Issue tracking is git-native
via [Seeds](https://github.com/jayminwest/seeds) (`sd` CLI) in `.seeds/`. Run `sd ready`
for unblocked work, `sd list` for everything, `sd show <id>` for detail.

---

## What this framework is (honest one-paragraph summary)

A reusable RL/data-gen framework that replicates Cursor's **Composer 2.5** post-training
recipe at small scale. Its north-star consumer is the **llm-mental-alterations (LMA)**
project, which applies targeted RL to a personality-altered SFT model and measures washout vs
amplification. Status: past-skeleton, production-shaped, and installable, with 8 subpackages and
266 tests passing / 62 skipped (measured 2026-06-09; see `docs/V1_V8_COVERAGE.md` for the
canonical count and why skips vary by environment), plus worked GSM8K-GRPO, SDPO-real-trace,
and A1-8B examples.

## The 3-channel loss, with HONEST provenance

`grpo + alpha_sdpo·sdpo_kl + beta_replay·trace_replay_dpo`

| Channel | What | Composer provenance |
|---|---|---|
| **1: base PO objective** | Selectable MENU (ADR-014): `loss_type ∈ {grpo, dr_grpo, bnpo, dapo, cispo, luspo, sapo, vespo}`, default **Dr.GRPO** | ✅ CONFIRMED: Composer's base is Dr.GRPO (k1 KL; TRL uses k3, documented delta) |
| **2: SDPO/OPSD** | On-policy self-distillation vs hint-conditioned teacher | ✅ CONFIRMED: this IS Composer 2.5's "targeted RL with textual feedback" |
| **3: trace-replay-DPO** | Preference DPO vs frontier teachers | ⚠️ **FRAMEWORK'S OWN ADDITIVE CHANNEL, NOT Composer.** The primary sources (blog + arXiv:2603.24477 §4.1) describe no DPO, preference pairs, or multi-teacher setup. It is a deliberate β-gated research probe in the A0→A4 ladder. The code is fine; only the "this replicates Composer" framing was ever wrong, and ADR-014 records the correction. |

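
The formula above can be read as a weighted sum of three per-batch scalars. The following is a minimal sketch of that composition only, not the framework's actual code: `ChannelWeights` and `three_channel_loss` are hypothetical names introduced for illustration, and the channel losses are assumed to arrive as plain floats already computed elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelWeights:
    """Hypothetical container for the two gates named in the formula."""

    alpha_sdpo: float = 0.0   # gate for channel 2 (SDPO/OPSD self-distillation KL)
    beta_replay: float = 0.0  # gate for channel 3 (trace-replay-DPO, framework's own)


def three_channel_loss(
    grpo: float,
    sdpo_kl: float,
    trace_replay_dpo: float,
    weights: ChannelWeights,
) -> float:
    """Return grpo + alpha_sdpo * sdpo_kl + beta_replay * trace_replay_dpo."""
    return (
        grpo
        + weights.alpha_sdpo * sdpo_kl
        + weights.beta_replay * trace_replay_dpo
    )


# With both gates at zero, only the base PO objective contributes.
base_only = three_channel_loss(1.0, 2.0, 4.0, ChannelWeights())

# Illustrative (made-up) gate values that switch the other two channels on.
combined = three_channel_loss(1.0, 2.0, 4.0, ChannelWeights(0.5, 0.25))
```

Under this reading, setting `beta_replay` to zero removes the framework's own channel entirely, which is what makes it a gated probe and not part of the Composer-confirmed recipe.
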
## What's PROVEN (don't re-litigate)

- **CPU SDPO fires** through the real collator alignment indices (JSD 0.057, gradient flows).
- **A10G GPU train-proof**: Qwen2.5-0.5B, bf16, 30 steps, loss 4.73→0.005, monotone.
- **A1 8B run EXECUTED**: 200 steps GRPO-only on `wave-h-5-llama-31-8b-seed42`, A100, reward
  0.331→0.751, KL≈0.0014. Alteration eval at N=895.
- **DAPO-vs-Dr.GRPO washout** (A1): washout is **objective-INVARIANT at lr=1e-6** because DAPO's
  clip-higher (`clip_ratio/high_mean`) never engaged (it stayed at 0 across all 200 steps). We
  measured "DAPO that couldn't fire its difference," not a real objective comparison.
- **nanochat post-training arc** (separate Modal laydown): SFT ChatCORE 0.3076, GSM8K-GRPO
  Pass@1 doubled 0.0525→0.1250, SDPO end-to-end train smoke PASS.
- **832/832 real-trace SDPO alignment** (Wave 21) with `strip_thinking=False`.

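
To make the washout null concrete, here is a small sketch of what a `clip_ratio/high_mean`-style diagnostic measures. It assumes the metric is the fraction of tokens whose importance ratio exceeds `1 + epsilon_high` while the advantage is positive. The function name, the toy ratios, and the `epsilon_high=0.28` value are illustrative assumptions, not values taken from the A1 runs.

```python
from typing import Sequence


def clip_ratio_high_mean(
    ratios: Sequence[float],
    advantages: Sequence[float],
    epsilon_high: float,
) -> float:
    """Fraction of tokens clipped on the upper side of the ratio range.

    A token counts as high-clipped when its importance ratio is above
    1 + epsilon_high AND its advantage is positive (assumed definition).
    """
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    if not ratios:
        return 0.0
    clipped = sum(
        1
        for ratio, adv in zip(ratios, advantages)
        if ratio > 1.0 + epsilon_high and adv > 0
    )
    return clipped / len(ratios)


# Tiny policy updates: ratios hug 1.0, so the upper clip never engages.
small_step = clip_ratio_high_mean([1.0003, 0.9998, 1.0011], [1.0, -1.0, 1.0], 0.28)

# Larger updates: one of four tokens crosses the upper bound with positive advantage.
large_step = clip_ratio_high_mean([1.35, 0.9, 1.1, 1.5], [1.0, 1.0, 1.0, -1.0], 0.28)
```

When ratios stay this close to 1, any upper clip bound behaves identically, which is consistent with the two objectives producing the same washout at lr=1e-6.
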
## Remaining work (filed as Seeds: `sd ready` / `sd show <id>`)

### Ready / unblocked

- **`…-cb74` (P1 security):** ROTATE the exposed HF write-token `hf_uRP…`. On-disk plaintext
  was scrubbed 2026-05-29, but the token itself was never rotated → treat as compromised.
  **User-only.**
- **`…-211e` (P2):** Higher-lr PO-objective sweep: make DAPO/GSPO clip-higher actually fire.
  This is the informative experiment that the washout null result pointed to, and likely *more*
  informative than running the A2-A4 ladder at lr=1e-6, which carries the same inert-knob risk.
  GPU + budget-gated.
- **`…-4936` (P2):** A2 SDPO-only ladder arm: build the runner + error-trace dataset. This is
  the big build: `modal_ladder_a1.py` is hardcoded to A1, and `train_grpo.py` is a plan-builder
  with a placeholder pip name. Needs a real SDPO dataset (`strip_thinking=False`, seq≥1536) + an
  A100 entrypoint off the proven A1 image. GPU + budget-gated.
- **`…-245d` (P4):** Docker substrate e2e: the test exists and skips cleanly; hardware-blocked
  (no Docker host). Run the 4 gates against a real container when a Docker host exists.

### Blocked (dependency-gated)

- **`…-42f5` (P3):** A3 replay-DPO-only arm: needs a trace-replay-DPO preference corpus.
  Blocked on A2 (shared runner/infra). Framework's-own-channel washout probe.
- **`…-dd7b` (P3):** A4 combined arm + final A0–A4 comparison table. Blocked on A2 **and** A3
  (its value is reading the combined effect against the isolated baselines).

## Branch convention (canonical: `main`)

**`main` is the canonical branch on both HF repos** (decided 2026-06-09). As of that date
`main == master == fb13ea3` (framework) / `37c0ea5` (LMA), converged via clean fast-forward.
**Push to `main`** (or to both in lockstep). `master` is retained only as a mirror; do not let
the two drift. A fresh `git clone` now defaults to `main` and gets the complete tree (incl.
`make_dr_grpo_config` + ADR-014), so the old "must checkout master" foot-gun is RETIRED as long
as `main` stays current.

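
One way to catch drift between the two branches early is to compare their remote heads. The sketch below assumes a remote named `origin` and relies on the `<sha><TAB>refs/heads/<name>` line format printed by `git ls-remote --heads`. The helper names are hypothetical and are not part of the framework.

```python
import subprocess


def parse_heads(ls_remote_output: str) -> dict[str, str]:
    """Map branch name -> commit SHA from `git ls-remote --heads` output."""
    prefix = "refs/heads/"
    heads: dict[str, str] = {}
    for line in ls_remote_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].startswith(prefix):
            heads[parts[1][len(prefix):]] = parts[0]
    return heads


def branches_in_sync(heads: dict[str, str], a: str = "main", b: str = "master") -> bool:
    """True only if both branches exist and point at the same commit."""
    return a in heads and b in heads and heads[a] == heads[b]


def fetch_heads(remote: str = "origin") -> dict[str, str]:
    """Query the remote; requires git on PATH and network access."""
    result = subprocess.run(
        ["git", "ls-remote", "--heads", remote],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_heads(result.stdout)


# Placeholder SHAs for illustration only (not real commits).
sample = "a" * 40 + "\trefs/heads/main\n" + "a" * 40 + "\trefs/heads/master\n"
in_sync = branches_in_sync(parse_heads(sample))
```

Running `branches_in_sync(fetch_heads())` inside a clone before a push would flag the case where `main` and `master` have diverged.
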
## Load-bearing gotchas (carry these forward)

1. **Branch sync (RESOLVED 2026-06-09, keep it that way).** `main` previously LAGGED `master`
   (frozen at Wave 19), which is why older Modal images pin a `master` SHA "because main predates
   `make_dr_grpo_config`." That divergence is now fixed (`main == master`). **Keep pushing to
   `main`** so it never lags again; the SHA pins in Modal images remain correct, but their
   "main is stale" rationale no longer applies as long as both branches stay in sync.
2. **SDPO on real agent traces requires `strip_thinking=False`:** ~67% of error-recovery
   turns are pure thinking; stripping yields empty masks. Keep `max_seq_len ≥ 1536`.
3. **OUTPUT_DIR clobber:** any sweep dimension (objective/lr/seed) you'll compare side-by-side
   MUST be in the output path, or the later run overwrites the earlier checkpoint.
4. **Size Modal timeout off the SLOWEST objective** (DAPO overlong-mask ~26s/it, not Dr.GRPO ~17s/it).
5. **Log the distinguishing diagnostic** for any PO-objective ablation (`clip_ratio/high_mean`
   for DAPO, sequence-level ratio for GSPO). A 0 means "knob didn't engage," NOT "objectives equal."

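
Gotchas 3 and 4 can be sketched as two small planning helpers. This is an illustration under stated assumptions, not the framework's runner: the function names, the `key=value` directory layout, and the 1.5× headroom factor are all hypothetical, while the 200-step count and the per-iteration timings are the figures quoted in this doc.

```python
import math
from pathlib import Path


def run_output_dir(root: str, objective: str, lr: float, seed: int) -> Path:
    """Build an output path that encodes every swept dimension.

    Two runs that differ in objective, lr, or seed get different
    directories, so a later run cannot overwrite an earlier checkpoint.
    """
    return Path(root) / f"obj={objective}" / f"lr={lr:g}" / f"seed={seed}"


def sweep_timeout_seconds(
    steps: int,
    sec_per_it_by_objective: dict[str, float],
    headroom: float = 1.5,
) -> int:
    """Size one shared timeout off the slowest objective in the sweep.

    Covers training steps only; model load and eval time are not included.
    """
    slowest = max(sec_per_it_by_objective.values())
    return math.ceil(steps * slowest * headroom)


dapo_dir = run_output_dir("runs", "dapo", 1e-6, 42)
dr_grpo_dir = run_output_dir("runs", "dr_grpo", 1e-6, 42)

# Per-iteration timings quoted in gotcha 4 above.
timeout_s = sweep_timeout_seconds(200, {"dapo": 26.0, "dr_grpo": 17.0})
```

The result is expressed in seconds. Sizing from the Dr.GRPO timing instead would give `200 * 17 * 1.5 = 5100` seconds, which is shorter than the 5200 seconds that 200 DAPO iterations need before any headroom is added.
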
## In-flight as of this snapshot

- A `docs/refine-2026-06` branch (ccode-ultracode lane) is refining the `docs/` corpus:
  propagating the Channel-3 provenance correction + ADR-014 menu into stale docs, archiving
  dated `WAVE_*_FINAL_REVIEW` artifacts, refreshing README/BACKLOG. Docs-only; human review
  before merge.

## Pointers

- Canonical wiki hub: `~/wiki/projects/composer-replication-framework.md`
- ADRs: `docs/adrs/` (001–014); ADR-014 is the newest and records the Channel-3 provenance decision.
- LMA consumer + A1 results: `Codeseys/llm-mental-alterations` → `composer_replication_runs/`.