Project State & Remaining Work — composer-replication-framework
Snapshot date: 2026-06-09
Framework HEAD: aae66fa (ADR-014 PO-objective menu) + 8d2e6fc (seeds sync)
LMA consumer HEAD: 37c0ea5 (DAPO-vs-Dr.GRPO washout) on Codeseys/llm-mental-alterations
This doc is the single-page "where things stand" record. Issue tracking is git-native
via Seeds (sd CLI) in .seeds/. Run sd ready
for unblocked work, sd list for everything, sd show <id> for detail.
What this framework is (honest one-paragraph)
A reusable RL/data-gen framework that replicates Cursor's Composer 2.5 post-training recipe at small scale, whose north-star consumer is the llm-mental-alterations (LMA) project (apply targeted RL to a personality-altered SFT model and measure washout vs amplification). Past-skeleton, production-shaped: 8 subpackages, 266 tests pass / 62 skip (measured 2026-06-09; see docs/V1_V8_COVERAGE.md for the canonical count + why skips vary by env), installable, with worked GSM8K-GRPO + SDPO-real-trace + A1-8B examples.
The 3-channel loss — with HONEST provenance
grpo + alpha_sdpo·sdpo_kl + beta_replay·trace_replay_dpo
| Channel | What | Composer provenance |
|---|---|---|
| 1: base PO objective | Selectable MENU (ADR-014): loss_type ∈ {grpo, dr_grpo, bnpo, dapo, cispo, luspo, sapo, vespo}, default Dr.GRPO | ✅ CONFIRMED: Composer's base is Dr.GRPO (k1 KL; TRL uses k3, documented delta) |
| 2: SDPO/OPSD | On-policy self-distillation vs hint-conditioned teacher | ✅ CONFIRMED: IS Composer 2.5's "targeted RL with textual feedback" |
| 3: trace-replay-DPO | Preference DPO vs frontier teachers | ⚠️ FRAMEWORK'S OWN ADDITIVE CHANNEL, NOT Composer. Primary sources (blog + arXiv:2603.24477 §4.1) have no DPO / preference-pairs / multi-teacher. It's a deliberate β-gated research probe in the A0→A4 ladder. The code is fine; only the "this replicates Composer" framing was ever wrong, and ADR-014 records the correction. |
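
A minimal sketch of how the three channels in the formula above combine, assuming each channel has already been reduced to a scalar loss. The names `ChannelWeights` and `combine_losses` are hypothetical and are not taken from the framework's code; only `alpha_sdpo` and `beta_replay` come from the formula itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelWeights:
    """Hypothetical container for the two gating coefficients in the formula."""
    alpha_sdpo: float = 0.0   # weight on channel 2 (SDPO/OPSD KL term)
    beta_replay: float = 0.0  # weight on channel 3 (trace-replay-DPO term)


def combine_losses(grpo: float, sdpo_kl: float, trace_replay_dpo: float,
                   w: ChannelWeights) -> float:
    """Return grpo + alpha_sdpo * sdpo_kl + beta_replay * trace_replay_dpo."""
    return grpo + w.alpha_sdpo * sdpo_kl + w.beta_replay * trace_replay_dpo


# With both gates at zero the total reduces to the base PO objective alone.
base_only = combine_losses(1.0, 2.0, 3.0, ChannelWeights())

# Illustrative non-zero weights (assumed values, not a recommended setting).
all_three = combine_losses(1.0, 2.0, 3.0, ChannelWeights(alpha_sdpo=0.5, beta_replay=0.25))
```

Because channel 3 enters only through `beta_replay`, setting that weight to zero removes the framework's own additive channel without touching the two Composer-confirmed channels.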
What's PROVEN (don't re-litigate)
- CPU SDPO fires through the real collator alignment indices (JSD 0.057, gradient flows).
- A10G GPU train-proof: Qwen2.5-0.5B, bf16, 30 steps, loss 4.73→0.005 monotone.
- A1 8B run EXECUTED: 200 steps GRPO-only on wave-h-5-llama-31-8b-seed42, A100, reward 0.331→0.751, KL≈0.0014. Alteration eval at N=895.
- DAPO-vs-Dr.GRPO washout (A1): washout is objective-INVARIANT at lr=1e-6 because DAPO's clip-higher (clip_ratio/high_mean) never engaged (=0 across all 200 steps). We measured "DAPO that couldn't fire its difference," not a real objective comparison.
- nanochat post-training arc (separate Modal laydown): SFT ChatCORE 0.3076, GSM8K-GRPO Pass@1 doubled 0.0525→0.1250, SDPO end-to-end train smoke PASS.
- 832/832 real-trace SDPO alignment (Wave 21) with strip_thinking=False.
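
The washout result above turns on whether a logged diagnostic ever left zero. A small sketch of that check, assuming the per-step values of a metric such as clip_ratio/high_mean have already been exported into a plain list; the helper names `knob_engaged` and `describe_ablation` are hypothetical, and how the values are pulled out of the training logs is left out.

```python
from typing import Sequence


def knob_engaged(values: Sequence[float], tol: float = 0.0) -> bool:
    """True if the diagnostic exceeded tol on at least one logged step."""
    return any(v > tol for v in values)


def describe_ablation(metric_name: str, values: Sequence[float]) -> str:
    """Summarise whether an objective's distinguishing knob ever fired."""
    if not values:
        return f"{metric_name}: no logged values"
    if knob_engaged(values):
        fired = sum(1 for v in values if v > 0.0)
        return f"{metric_name}: engaged on {fired}/{len(values)} steps"
    return (f"{metric_name}: 0 on all {len(values)} steps; "
            "knob did not engage, so this is not an objective comparison")


# Mirrors the A1 situation described above: zero on every one of 200 steps.
summary = describe_ablation("clip_ratio/high_mean", [0.0] * 200)
```

Running a check like this before reading any reward or washout numbers keeps a null result from being mistaken for equivalence between objectives.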
Remaining work (filed as Seeds — sd ready / sd show <id>)
Ready / unblocked
- …-cb74 (P1 security): ROTATE the exposed HF write-token hf_uRP…. On-disk plaintext scrubbed 2026-05-29; token itself never rotated → treat as compromised. User-only.
- …-211e (P2): Higher-lr PO-objective sweep, to make DAPO/GSPO clip-higher actually fire. The informative experiment the washout null pointed to. Likely more informative than the A2-A4 ladder at lr=1e-6 (same inert-knob risk). GPU + budget-gated.
- …-4936 (P2): A2 SDPO-only ladder arm: build the runner + error-trace dataset. The big build: modal_ladder_a1.py is hardcoded to A1; train_grpo.py is a plan-builder with a placeholder pip name. Needs a real SDPO dataset (strip_thinking=False, seq≥1536) + an A100 entrypoint off the proven A1 image. GPU + budget-gated.
- …-245d (P4): Docker substrate e2e: test exists + skips cleanly; hardware-blocked (no Docker host). Run the 4 gates against a real container when a Docker host exists.
Blocked (dependency-gated)
- …-42f5 (P3): A3 replay-DPO-only arm: needs a trace-replay-DPO preference corpus. Blocked on A2 (shared runner/infra). Framework's-own-channel washout probe.
- …-dd7b (P3): A4 combined arm + final A0–A4 comparison table. Blocked on A2 and A3 (its value is reading the combined effect against the isolated baselines).
Branch convention (canonical: main)
main is the canonical branch on both HF repos (decided 2026-06-09). As of that date
main == master == fb13ea3 (framework) / 37c0ea5 (LMA) — converged via clean fast-forward.
Push to main (or to both in lockstep). master is retained only as a mirror; do not let
the two drift. A fresh git clone now defaults to main and gets the complete tree (incl.
make_dr_grpo_config + ADR-014), so the old "must checkout master" foot-gun is RETIRED as long
as main stays current.
Load-bearing gotchas (carry these forward)
- Branch sync (RESOLVED 2026-06-09, keep it that way). main previously LAGGED master (frozen at Wave 19), which is why older Modal images pin a master SHA "because main predates make_dr_grpo_config." That divergence is now fixed (main == master). Keep pushing to main so it never lags again; the SHA pins in Modal images remain correct but their "main is stale" rationale no longer applies once both branches stay in sync.
- SDPO on real agent traces requires strip_thinking=False: ~67% of error-recovery turns are pure thinking; stripping yields empty masks. Keep max_seq_len ≥ 1536.
- OUTPUT_DIR clobber: any sweep dimension (objective/lr/seed) you'll compare side-by-side MUST be in the output path, or the later run overwrites the earlier checkpoint.
- Size Modal timeout off the SLOWEST objective (DAPO overlong-mask ~26s/it, not Dr.GRPO ~17s/it).
- Log the distinguishing diagnostic for any PO-objective ablation (clip_ratio/high_mean for DAPO, sequence-level ratio for GSPO). A 0 means "knob didn't engage," NOT "objectives equal."
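
Two of the gotchas above (OUTPUT_DIR clobber and timeout sizing) lend themselves to small helpers. The sketch below is illustrative only: `run_output_dir`, `timeout_seconds`, the directory layout, and the 1.5× safety margin are assumptions, not the framework's actual code or settings. The per-iteration timings are the approximate figures quoted above.

```python
import math
from pathlib import Path
from typing import Mapping


def run_output_dir(base: str, objective: str, lr: float, seed: int) -> Path:
    """Encode every compared sweep dimension in the path so runs cannot collide."""
    return Path(base) / f"obj-{objective}" / f"lr-{lr:g}" / f"seed-{seed}"


def timeout_seconds(steps: int, sec_per_it: Mapping[str, float],
                    margin: float = 1.5) -> int:
    """Size the timeout off the slowest objective, padded by an assumed margin."""
    slowest = max(sec_per_it.values())
    return math.ceil(steps * slowest * margin)


# Distinct objectives at the same lr and seed land in distinct directories.
dapo_dir = run_output_dir("runs", "dapo", 1e-6, 42)
dr_grpo_dir = run_output_dir("runs", "dr_grpo", 1e-6, 42)

# 200 steps sized from the DAPO figure (~26 s/it), not the Dr.GRPO one (~17 s/it).
budget = timeout_seconds(200, {"dapo": 26.0, "dr_grpo": 17.0})
```

The timeout here covers training iterations only; image build, model download, and evaluation time would need to be added on top.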
In-flight as of this snapshot
- A docs/refine-2026-06 branch (ccode-ultracode lane) is refining the docs/ corpus: propagating the Channel-3 provenance correction + ADR-014 menu into stale docs, archiving dated WAVE_*_FINAL_REVIEW artifacts, refreshing README/BACKLOG. Docs-only; human review before merge.
Pointers
- Canonical wiki hub: ~/wiki/projects/composer-replication-framework.md
- ADRs: docs/adrs/ (001–014); ADR-014 is newest + records the Channel-3 provenance decision.
- LMA consumer + A1 results: Codeseys/llm-mental-alterations → composer_replication_runs/.