Baladithya Balamurugan
Wave 3: close the HIGH review findings (kill-switch wiring, HeldoutSplit, EKS entrypoint bug)
bd0c358

# Architecture Decision Records

| # | Title | Status | Date |
|---|-------|--------|------|
| [ADR-001](ADR-001-gpu-venue.md) | GPU venue | accepted | — |
| [ADR-002](ADR-002-trace-source.md) | Trace source | accepted | — |
| [ADR-003](ADR-003-diloco-impl.md) | DiLoCo implementation | accepted | — |
| [ADR-004](ADR-004-replaysim-normalization.md) | ReplaySim normalization | accepted | — |
| [ADR-005](ADR-005-serverless-diloco.md) | Serverless DiLoCo | accepted | — |
| [ADR-006](ADR-006-rl-frameworks.md) | RL framework strategy: TRL + VeRL + PRIME-RL | accepted (amended-by ADR-008) | 2026-05-26 |
| [ADR-007](ADR-007-self-distillation-losses.md) | Self-distillation losses landscape | accepted | 2026-05-26 |
| [ADR-008](ADR-008-drgrpo-sdpo-live-channel.md) | Target Dr. GRPO + host live SDPO channel in TRL trainer | accepted | 2026-05-29 |
| [ADR-009](ADR-009-layered-hint-generator.md) | Layered HintGenerator for SDPO textual feedback | accepted | 2026-05-29 |
| [ADR-010](ADR-010-feature-deletion-datagen.md) | FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates | accepted | 2026-05-29 |
| [ADR-011](ADR-011-sdpo-alignment-indices.md) | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
| [ADR-012](ADR-012-close-review-findings.md) | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
| [ADR-013](ADR-013-lma-integration-channel-ladder.md) | LMA integration — isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
| [ADR-014](ADR-014-policy-optimization-objective-menu.md) | Policy-optimization objective MENU: base RL objective selectable (default Dr.GRPO) over TRL 1.5.0 GRPOConfig (builds-on ADR-006/007/008) | accepted | 2026-05-30 |
| [ADR-015](ADR-015-holdout-killswitch.md) | Held-out disjoint eval + depth/generation kill-switch (run-level collapse safeguard #2): `HeldOutGuard` + `HeldoutSplit` in `composer_replication.safety` | accepted | 2026-06-09 |

Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.

> **Provenance note (ADR-014).** ADR-014 also records the canonical correction that the
> framework's **trace-replay-DPO channel (channel 3) is an additive research channel, NOT
> part of Cursor's Composer recipe**: Composer's primary sources contain no DPO, preference
> pairs, reward models, or multiple teachers. Genuine replication is channels 1 (Dr.GRPO base) + 2 (SDPO).
> See [`docs/OVERVIEW.md`](../OVERVIEW.md) for the honest three-channel summary.
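
The index conventions stated with the table (rows sorted by ADR number ascending, each row linking to a file named after its ADR) lend themselves to a mechanical check. The following is a minimal sketch, not part of the repository: the function name `check_adr_index` and the row pattern are assumptions made for illustration, and the pattern assumes every row follows the `[ADR-NNN](ADR-NNN-slug.md)` shape seen in the table.

```python
import re

# Hypothetical helper, not part of the framework.
# Matches an index row that starts like: | [ADR-001](ADR-001-gpu-venue.md) | ...
ROW = re.compile(r"^\|\s*\[ADR-(\d{3})\]\((ADR-(\d{3})-[^)]+\.md)\)\s*\|")


def check_adr_index(markdown: str) -> list[str]:
    """Return a list of problems found in an ADR index table (empty if none)."""
    problems = []
    previous = 0
    for line in markdown.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, separator, or prose line: not an ADR row
        label, filename, file_number = match.group(1), match.group(2), match.group(3)
        number = int(label)
        # The link label and the target filename should carry the same number.
        if label != file_number:
            problems.append(f"ADR-{label} links to mismatched file {filename}")
        # Rows must be strictly ascending by ADR number.
        if number <= previous:
            problems.append(f"ADR-{label} is out of order (follows ADR-{previous:03d})")
        previous = number
    return problems
```

Calling `check_adr_index(open("README.md").read())` (the path is an assumption) and failing a CI step on a non-empty result would be one way to keep the index consistent as new ADRs are appended.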