{
  "loci": [
    {
      "name": "prune-vs-train-on-all",
      "one_line": "Does training on losing/failed branches (vs pruning to winners-only) better instill counterfactual foresight + introspection — and HOW must negatives be used to help rather than destabilize?",
      "flavor": "dialectical",
      "importance": 10,
      "uncertainty": 9,
      "disagreement": 9,
      "decision_impact": 10,
      "composite_score": 38,
      "source_budget": 15,
      "rationale": "The user's explicitly-named CENTRAL question. Genuine empirical fork: RAFT/positives-only is stable & competitive (2504.11343) and naive negative gradient destabilizes (2505.18830), vs negatives carry unique signal that improves agent tuning (2402.11651, 2503.14391, expert-failures). Resolving it changes the entire dataset-construction design (prune the tree vs keep it as typed signal). Must produce an argued position + concrete experiment, grounded in the repo's ADR-013 A0-A4 ladder."
    },
    {
      "name": "worldmodel-latent-deliberation",
      "one_line": "Can latent 'what-if' deliberation (predict next repo-state before acting) be trained into a SWE agent via an auxiliary next-state-prediction objective, or does it emerge from scale — and how do you measure it?",
      "flavor": "dialectical",
      "importance": 9,
      "uncertainty": 8,
      "disagreement": 7,
      "decision_impact": 9,
      "composite_score": 33,
      "source_budget": 12,
      "rationale": "The user's core GOAL (the 'world-model thinking' aim). Fork: LLMs are implicit world models / emerges from scale (2512.18832, 2411.08794) vs agents fail to USE world models for foresight without explicit training (2601.03905) + MuZero/Chain-of-World train it explicitly (1911.08265, 2603.03195). Decision-relevant: determines whether to add the aux loss + a deliberation token, and how to measure (calibration / foresight accuracy). Must map onto the repo's SDPO channel as the natural carrier."
    },
    {
      "name": "selfevolve-flywheel-vs-collapse",
      "one_line": "Does the closed-loop multi-model MCTS + self-distillation flywheel compound improvement, or collapse into reward-hacking / diversity-loss / human-trace entrenchment — and what design choices prevent collapse?",
      "flavor": "dialectical",
      "importance": 9,
      "uncertainty": 8,
      "disagreement": 8,
      "decision_impact": 9,
      "composite_score": 34,
      "source_budget": 11,
      "rationale": "Determines whether the whole genetic-algorithm flywheel is sound. Strong adversarial convergence (reward-hacking worsens with depth — RSI ICLR2026; collapse from closed-loop self-distillation — self-evolving survey §8.3; replay entrenches human distribution — Self-Play-SWE-RL 2512.18552) vs working flywheels (Socratic-SWE +7.8, DeepSWE, SWE-RL). Resolution = keep a true execution ORACLE + heterogeneous-model population as anti-collapse diversity. High decision impact on safeguards."
    },
    {
      "name": "credit-assignment-tree-as-process-signal",
      "one_line": "Does the multi-model tree's divergence structure give cheap, dense PROCESS-level credit assignment that beats outcome-only RL — without training a separate PRM?",
      "flavor": "technical",
      "importance": 8,
      "uncertainty": 6,
      "disagreement": 7,
      "decision_impact": 8,
      "composite_score": 29,
      "source_budget": 8,
      "rationale": "The mechanism that makes the idea pay off. Process-supervision helps (Let's-Verify 2305.20050, PRM 2211.14275, Cursor's own targeted-feedback motivation) vs outcome-only suffices (DeepSWE, SWE-RL, min-form 2504.15275). The tree manufactures process signal cheaply from divergence + auto-generated textual feedback (wiring into the SDPO hint hook). Counterfactual credit-assignment theory (2011.09464, 2306.16803) is the formal backbone. Technical synthesis, moderate uncertainty."
    },
    {
      "name": "eks-architecture-and-substrate-mapping",
      "one_line": "What is the concrete EKS-primary (+ SageMaker-hybrid) architecture, and what is the minimal delta to map the repo's ServerlessExecutor/ObjectStoreAllReduce/DiLoCo onto it?",
      "flavor": "technical",
      "importance": 10,
      "uncertainty": 4,
      "disagreement": 5,
      "decision_impact": 9,
      "composite_score": 28,
      "source_budget": 10,
      "rationale": "The explicit DELIVERABLE ('how we could do it on sagemaker and/or eks, eks primarily'). Lower uncertainty (AWS-documented patterns: JARK/verl-on-EKS, KubeRay, Karpenter, GPU time-slicing/MIG, gVisor/Kata sandboxes, HyperPod) but very high decision impact — the report must commit to a concrete design. Includes the EKSExecutor delta, the sandbox-fan-out, the outer/inner loop placement, and the EKS-vs-SageMaker hybrid split."
    }
  ],
  "skip_loci": [
    {"name": "multimodel-tree-novelty-claim", "reason": "Resolved without depth: the honest position is the COMBINATION is novel, not the primitives (SWE-Search/tree-search use single models; Symphony mixes models for planning; Channel 3 already does flat multi-teacher). Folds into §1 framing, not a depth locus."},
    {"name": "which-RL-engine-trl-vs-verl-vs-prime-rl", "reason": "Already decided in repo (ADR-006: TRL hosts SDPO since it needs full logits; verl/PRIME-RL for scale-out). Engineering choice, reported in §6/§8, not a contested research locus."}
  ]
}
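
The numeric fields above can be sanity-checked with a short script. The sketch below is illustrative only. It assumes `composite_score` is the plain sum of `importance`, `uncertainty`, `disagreement`, and `decision_impact`, a rule inferred from the listed numbers rather than stated anywhere in the file. The helper names (`check_composites`, `rank_by_composite`) are hypothetical, and the data is inlined (numeric fields only) instead of being read from disk, since the file's path is not given here.

```python
import json

# Numeric fields copied from the five loci above; text fields omitted for brevity.
LOCI_JSON = """
{"loci": [
  {"name": "prune-vs-train-on-all", "importance": 10, "uncertainty": 9,
   "disagreement": 9, "decision_impact": 10, "composite_score": 38, "source_budget": 15},
  {"name": "worldmodel-latent-deliberation", "importance": 9, "uncertainty": 8,
   "disagreement": 7, "decision_impact": 9, "composite_score": 33, "source_budget": 12},
  {"name": "selfevolve-flywheel-vs-collapse", "importance": 9, "uncertainty": 8,
   "disagreement": 8, "decision_impact": 9, "composite_score": 34, "source_budget": 11},
  {"name": "credit-assignment-tree-as-process-signal", "importance": 8, "uncertainty": 6,
   "disagreement": 7, "decision_impact": 8, "composite_score": 29, "source_budget": 8},
  {"name": "eks-architecture-and-substrate-mapping", "importance": 10, "uncertainty": 4,
   "disagreement": 5, "decision_impact": 9, "composite_score": 28, "source_budget": 10}
]}
"""

# ASSUMPTION: the composite is the unweighted sum of these four fields.
COMPONENTS = ("importance", "uncertainty", "disagreement", "decision_impact")


def check_composites(loci):
    """Return names of loci whose composite_score differs from the component sum."""
    return [
        locus["name"]
        for locus in loci
        if sum(locus[key] for key in COMPONENTS) != locus["composite_score"]
    ]


def rank_by_composite(loci):
    """Return loci sorted from highest to lowest composite_score."""
    return sorted(loci, key=lambda locus: locus["composite_score"], reverse=True)


loci = json.loads(LOCI_JSON)["loci"]
mismatches = check_composites(loci)
ranked_names = [locus["name"] for locus in rank_by_composite(loci)]
total_budget = sum(locus["source_budget"] for locus in loci)

print("mismatched composites:", mismatches)
print("ranking:", ranked_names)
print("total source budget:", total_budget)
```

Under that summation assumption, every listed composite agrees with its components. Sorting by composite also shows that the file order is not strictly descending: `selfevolve-flywheel-vs-collapse` (34) outranks `worldmodel-latent-deliberation` (33) even though it is listed after it.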