composer-replication-framework / docs /PROJECT_STATE_AND_REMAINING_WORK.md
Baladithya Balamurugan
Wave 1: fix 8 failing tests + unblock Docker E2E + dep/doc debt
c11cf49

Project State & Remaining Work — composer-replication-framework

Snapshot date: 2026-06-09
Framework HEAD: aae66fa (ADR-014 PO-objective menu) + 8d2e6fc (seeds sync)
LMA consumer HEAD: 37c0ea5 (DAPO-vs-Dr.GRPO washout) on Codeseys/llm-mental-alterations

This doc is the single-page "where things stand" record. Issue tracking is git-native via Seeds (sd CLI) in .seeds/. Run sd ready for unblocked work, sd list for everything, sd show <id> for detail.


What this framework is (honest one-paragraph)

A reusable RL/data-gen framework that replicates Cursor's Composer 2.5 post-training recipe at small scale. Its north-star consumer is the llm-mental-alterations (LMA) project, which applies targeted RL to a personality-altered SFT model and measures washout vs amplification. The framework is past the skeleton stage and production-shaped: 8 subpackages; 266 tests pass / 62 skip (measured 2026-06-09; see docs/V1_V8_COVERAGE.md for the canonical count and why skips vary by environment); installable; with worked GSM8K-GRPO, SDPO-real-trace, and A1-8B examples.

The 3-channel loss — with HONEST provenance

grpo + alpha_sdpo·sdpo_kl + beta_replay·trace_replay_dpo
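The weighted sum above can be sketched in a few lines. This is only an illustration of how the two gating coefficients combine the channels: the `ChannelWeights` container and `combined_loss` function are hypothetical names, and plain floats stand in for whatever tensors the framework actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelWeights:
    """Hypothetical container for the two gating coefficients."""
    alpha_sdpo: float = 0.0   # weight on the SDPO self-distillation KL term
    beta_replay: float = 0.0  # weight on the trace-replay-DPO term


def combined_loss(grpo: float, sdpo_kl: float, trace_replay_dpo: float,
                  weights: ChannelWeights) -> float:
    """Weighted sum of the three channel losses (scalars here for clarity)."""
    return (grpo
            + weights.alpha_sdpo * sdpo_kl
            + weights.beta_replay * trace_replay_dpo)


# With both gates at zero, only the base PO objective contributes.
base_only = combined_loss(0.8, 0.3, 0.5, ChannelWeights())

# Opening the SDPO gate alone adds alpha_sdpo * sdpo_kl to the base term.
with_sdpo = combined_loss(0.8, 0.3, 0.5, ChannelWeights(alpha_sdpo=0.5))
```

Setting a coefficient to zero removes its channel from the sum, which is what makes the β-gated third channel separable from the two Composer-derived ones.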

| Channel | What | Composer provenance |
|---|---|---|
| 1: base PO objective | Selectable MENU (ADR-014): loss_type ∈ {grpo, dr_grpo, bnpo, dapo, cispo, luspo, sapo, vespo}, default Dr.GRPO | ✅ CONFIRMED: Composer's base is Dr.GRPO (k1 KL; TRL uses k3, documented delta) |
| 2: SDPO/OPSD | On-policy self-distillation vs hint-conditioned teacher | ✅ CONFIRMED: IS Composer 2.5's "targeted RL with textual feedback" |
| 3: trace-replay-DPO | Preference DPO vs frontier teachers | ⚠️ FRAMEWORK'S OWN ADDITIVE CHANNEL, NOT Composer. Primary sources (blog + arXiv:2603.24477 §4.1) have no DPO / preference-pairs / multi-teacher. It's a deliberate β-gated research probe in the A0→A4 ladder. The code is fine; only the "this replicates Composer" framing was ever wrong, and ADR-014 records the correction. |
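The "k1 vs k3" delta noted for channel 1 refers to two per-sample KL estimators. The sketch below uses their commonly cited definitions for samples drawn from the current policy. The function names are hypothetical, and it is an assumption that these match the exact forms wired into the framework and TRL.

```python
import math


def kl_k1(logp: float, ref_logp: float) -> float:
    """k1 estimator: the plain log-ratio log pi(x) - log pi_ref(x).

    Unbiased for KL(pi || pi_ref) when x ~ pi, but a single sample can be negative.
    """
    return logp - ref_logp


def kl_k3(logp: float, ref_logp: float) -> float:
    """k3 estimator: (r - 1) - log r, with r = pi_ref(x) / pi(x).

    Also unbiased when x ~ pi, and non-negative for every individual sample.
    """
    log_r = ref_logp - logp
    return math.exp(log_r) - 1.0 - log_r


# A token the policy likes less than the reference does:
# k1 goes negative, while k3 stays non-negative.
logp, ref_logp = math.log(0.25), math.log(0.5)
k1_value = kl_k1(logp, ref_logp)
k3_value = kl_k3(logp, ref_logp)
```

Both estimators agree in expectation, so the delta shows up in per-token variance and sign rather than in the quantity being estimated.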

What's PROVEN (don't re-litigate)

  • CPU SDPO fires through the real collator alignment indices (JSD 0.057, gradient flows).
  • A10G GPU train-proof: Qwen2.5-0.5B, bf16, 30 steps, loss 4.73→0.005 monotone.
  • A1 8B run EXECUTED: 200 steps GRPO-only on wave-h-5-llama-31-8b-seed42, A100, reward 0.331→0.751, KL≈0.0014. Alteration eval at N=895.
  • DAPO-vs-Dr.GRPO washout (A1): washout is objective-INVARIANT at lr=1e-6 because DAPO's clip-higher (clip_ratio/high_mean) never engaged (=0 across all 200 steps). We measured "DAPO that couldn't fire its difference," not a real objective comparison.
  • nanochat post-training arc (separate Modal laydown): SFT ChatCORE 0.3076, GSM8K-GRPO Pass@1 doubled 0.0525→0.1250, SDPO end-to-end train smoke PASS.
  • 832/832 real-trace SDPO alignment (Wave 21) with strip_thinking=False.
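The DAPO-vs-Dr.GRPO finding above depends on reading a logged diagnostic correctly. The following is a minimal sketch of that check, assuming the per-step values of a metric such as clip_ratio/high_mean have already been loaded into a list. The helper names are hypothetical and not part of the framework.

```python
from typing import Iterable


def knob_engaged(values: Iterable[float], tol: float = 0.0) -> bool:
    """Return True if the logged diagnostic exceeded `tol` on any step."""
    return any(abs(v) > tol for v in values)


def read_ablation(diagnostic: str, values: Iterable[float]) -> str:
    """Turn a per-step diagnostic series into a cautious one-line verdict."""
    series = list(values)
    if not series:
        return f"{diagnostic}: not logged, comparison is uninterpretable"
    if not knob_engaged(series):
        return f"{diagnostic}: never engaged, treat the result as a null on the knob"
    fired = sum(1 for v in series if abs(v) > 0.0)
    return f"{diagnostic}: engaged on {fired}/{len(series)} steps"


# A series that is zero on every one of 200 steps, as in the A1 DAPO run.
print(read_ablation("clip_ratio/high_mean", [0.0] * 200))
```

Running a check like this before comparing objectives separates "the objectives behaved the same" from "the distinguishing mechanism never fired".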

Remaining work (filed as Seeds — sd ready / sd show <id>)

Ready / unblocked

  • …-cb74 (P1 security): ROTATE the exposed HF write-token hf_uRP…. On-disk plaintext scrubbed 2026-05-29; token itself never rotated → treat as compromised. User-only.
  • …-211e (P2): Higher-lr PO-objective sweep — make DAPO/GSPO clip-higher actually fire. The informative experiment the washout null pointed to. Likely more informative than the A2-A4 ladder at lr=1e-6 (same inert-knob risk). GPU + budget-gated.
  • …-4936 (P2): A2 SDPO-only ladder arm — build the runner + error-trace dataset. The big build: modal_ladder_a1.py is hardcoded to A1; train_grpo.py is a plan-builder with a placeholder pip name. Needs a real SDPO dataset (strip_thinking=False, seq≥1536) + an A100 entrypoint off the proven A1 image. GPU + budget-gated.
  • …-245d (P4): Docker substrate e2e — test exists + skips cleanly; hardware-blocked (no Docker host). Run the 4 gates against a real container when a Docker host exists.

Blocked (dependency-gated)

  • …-42f5 (P3): A3 replay-DPO-only arm — needs a trace-replay-DPO preference corpus. Blocked on A2 (shared runner/infra). Framework's-own-channel washout probe.
  • …-dd7b (P3): A4 combined arm + final A0–A4 comparison table. Blocked on A2 and A3 (its value is reading the combined effect against the isolated baselines).

Branch convention (canonical: main)

main is the canonical branch on both HF repos (decided 2026-06-09). As of that date main == master == fb13ea3 (framework) / 37c0ea5 (LMA) — converged via clean fast-forward. Push to main (or to both in lockstep). master is retained only as a mirror; do not let the two drift. A fresh git clone now defaults to main and gets the complete tree (incl. make_dr_grpo_config + ADR-014), so the old "must checkout master" foot-gun is RETIRED as long as main stays current.
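One way to guard against drift is to compare the commits the two branches resolve to. The sketch below shells out to `git rev-parse`; the helper names are hypothetical, and it assumes both refs exist in the local clone.

```python
import subprocess
from typing import Callable, Sequence


def rev_parse(ref: str, repo: str = ".") -> str:
    """Resolve a ref to its commit SHA with `git rev-parse`."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-parse", ref],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def branches_in_sync(refs: Sequence[str] = ("main", "master"),
                     resolve: Callable[[str], str] = rev_parse) -> bool:
    """Return True when every ref resolves to the same commit."""
    return len({resolve(ref) for ref in refs}) == 1
```

Passing remote-tracking refs such as `("origin/main", "origin/master")` after a fetch checks the pushed state rather than the local branches.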

Load-bearing gotchas (carry these forward)

  1. Branch sync (RESOLVED 2026-06-09, keep it that way). main previously LAGGED master (frozen at Wave 19), which is why older Modal images pin a master SHA "because main predates make_dr_grpo_config." That divergence is now fixed (main == master). Keep pushing to main so it never lags again. The SHA pins in Modal images remain correct, but their "main is stale" rationale no longer applies while the two branches stay in sync.
  2. SDPO on real agent traces requires strip_thinking=False — ~67% of error-recovery turns are pure thinking; stripping yields empty masks. Keep max_seq_len ≥ 1536.
  3. OUTPUT_DIR clobber: any sweep dimension (objective/lr/seed) you'll compare side-by-side MUST be in the output path, or the later run overwrites the earlier checkpoint.
  4. Size Modal timeout off the SLOWEST objective (DAPO overlong-mask ~26s/it, not Dr.GRPO ~17s/it).
  5. Log the distinguishing diagnostic for any PO-objective ablation (clip_ratio/high_mean for DAPO, sequence-level ratio for GSPO). A 0 means "knob didn't engage," NOT "objectives equal."
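Gotchas 3 and 4 can be sketched as two small helpers. The function names, the directory layout, and the 1.5x safety margin are assumptions made for illustration; only the per-iteration timings are taken from the list above.

```python
from pathlib import Path
from typing import Mapping


def run_output_dir(root: str, objective: str, lr: float, seed: int) -> Path:
    """Encode every compared sweep dimension in the path so runs cannot collide."""
    return Path(root) / f"obj-{objective}" / f"lr-{lr:g}" / f"seed-{seed}"


def timeout_seconds(steps: int, sec_per_it: Mapping[str, float],
                    margin: float = 1.5) -> int:
    """Size one shared timeout off the slowest objective, times a safety margin."""
    slowest = max(sec_per_it.values())
    return int(steps * slowest * margin)


# Two runs that differ only in objective land in different directories.
dapo_dir = run_output_dir("runs", "dapo", 1e-6, 42)
dr_grpo_dir = run_output_dir("runs", "dr_grpo", 1e-6, 42)

# 200 steps, sized off DAPO (~26 s/it) rather than Dr.GRPO (~17 s/it).
budget = timeout_seconds(200, {"dr_grpo": 17.0, "dapo": 26.0})
```

The timeout here covers training iterations only; image startup, model download, and evaluation time would need to be added on top.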

In-flight as of this snapshot

  • A docs/refine-2026-06 branch (ccode-ultracode lane) is refining the docs/ corpus: propagating the Channel-3 provenance correction + ADR-014 menu into stale docs, archiving dated WAVE_*_FINAL_REVIEW artifacts, refreshing README/BACKLOG. Docs-only; human review before merge.

Pointers

  • Canonical wiki hub: ~/wiki/projects/composer-replication-framework.md
  • ADRs: docs/adrs/ (001–014); ADR-014 is newest + records the Channel-3 provenance decision.
  • LMA consumer + A1 results: Codeseys/llm-mental-alterations, under composer_replication_runs/.