Baladithya Balamurugan
Wave 3: close the HIGH review findings (kill-switch wiring, HeldoutSplit, EKS entrypoint bug)
bd0c358

Architecture Decision Records
| # | Title | Status | Date |
|---|---|---|---|
| ADR-001 | GPU venue | accepted | — |
| ADR-002 | Trace source | accepted | — |
| ADR-003 | DiLoCo implementation | accepted | — |
| ADR-004 | ReplaySim normalization | accepted | — |
| ADR-005 | Serverless DiLoCo | accepted | — |
| ADR-006 | RL framework strategy: TRL + VeRL + PRIME-RL | accepted (amended-by ADR-008) | 2026-05-26 |
| ADR-007 | Self-distillation losses landscape | accepted | 2026-05-26 |
| ADR-008 | Target Dr. GRPO + host live SDPO channel in TRL trainer | accepted | 2026-05-29 |
| ADR-009 | Layered HintGenerator for SDPO textual feedback | accepted | 2026-05-29 |
| ADR-010 | FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates | accepted | 2026-05-29 |
| ADR-011 | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
| ADR-012 | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
| ADR-013 | LMA integration — isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
| ADR-014 | Policy-optimization objective MENU: base RL objective selectable (default Dr.GRPO) over TRL 1.5.0 GRPOConfig (builds-on ADR-006/007/008) | accepted | 2026-05-30 |
| ADR-015 | Held-out disjoint eval + depth/generation kill-switch (run-level collapse safeguard #2): HeldOutGuard + HeldoutSplit in composer_replication.safety | accepted | 2026-06-09 |
Sorted by number, ascending. ADRs are immutable once accepted; supersede or amend them rather than editing in place.
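The immutability rule can be illustrated with a small sketch. This is not part of the framework: the `ADR` class and the `check_index` and `amended_by` helpers are hypothetical names introduced here. The amendment links are copied from the status column of the table, with "ADR-008 amends ADR-006" read off ADR-006's "amended-by ADR-008" status.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an accepted record cannot be edited in place
class ADR:
    number: int
    title: str
    status: str = "accepted"
    amends: tuple = ()  # numbers of the earlier ADRs this record amends


def check_index(adrs):
    """Raise ValueError unless the index is strictly ascending and every
    amendment points at an earlier ADR that is present in the index."""
    numbers = [a.number for a in adrs]
    if any(b <= a for a, b in zip(numbers, numbers[1:])):
        raise ValueError("index must be sorted by number, ascending, no duplicates")
    known = set(numbers)
    for adr in adrs:
        for target in adr.amends:
            if target not in known or target >= adr.number:
                raise ValueError(
                    f"ADR-{adr.number:03d} amends unknown or later ADR-{target:03d}"
                )


def amended_by(adrs):
    """Derive the reverse links (target -> amending ADRs) without touching
    the amended records themselves."""
    out = {}
    for adr in adrs:
        for target in adr.amends:
            out.setdefault(target, []).append(adr.number)
    return out


# A subset of the table above; titles shortened where the table adds a parenthetical.
INDEX = [
    ADR(6, "RL framework strategy: TRL + VeRL + PRIME-RL"),
    ADR(8, "Target Dr. GRPO + host live SDPO channel in TRL trainer", amends=(6,)),
    ADR(9, "Layered HintGenerator for SDPO textual feedback"),
    ADR(10, "FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates"),
    ADR(11, "Collator-emitted SDPO alignment indices", amends=(8,)),
    ADR(12, "Close open cross-family-review findings", amends=(8, 9, 10)),
]

check_index(INDEX)
print(amended_by(INDEX))  # {6: [8], 8: [11, 12], 9: [12], 10: [12]}
```

In this sketch the "amended-by" annotation is derived from the newer records, so an older ADR never has to be rewritten when a later one amends it.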
Provenance note (ADR-014). ADR-014 also records the canonical correction that the framework's trace-replay-DPO channel (channel 3) is an additive research channel, NOT part of Cursor's Composer recipe: Composer's primary sources contain no DPO, preference pairs, reward models, or multiple teachers. Genuine replication consists of channels 1 (Dr.GRPO base) and 2 (SDPO). See docs/OVERVIEW.md for the honest three-channel summary.
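For readers unfamiliar with the channel 1 objective, here is a minimal sketch of how the Dr. GRPO advantage differs from the standard GRPO one, as the Dr. GRPO authors describe it. It is not this framework's code and does not use TRL. The function names, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration, and the second Dr. GRPO change (dropping per-response length normalization in the loss) is not shown.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO: center each reward on the group mean, then divide by
    the group standard deviation (population std is an assumption here)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO: center on the group mean only, with no std division."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


# One prompt, four sampled completions, binary reward (illustrative values).
group = [1.0, 0.0, 0.0, 0.0]
print(dr_grpo_advantages(group))  # [0.75, -0.25, -0.25, -0.25]
print(grpo_advantages(group))
```

Dropping the division keeps the advantage on the same scale as the reward itself, so a group with low reward variance is not upweighted relative to a group with high variance.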