Reinforcement Learning
Transformers
English
post-training
distillation
agentic-coding
composer-2.5
cursor
kimi-k2
grpo
dapo
diloco
openenv
trl
verl
research
methodology
{
  "sub_questions": [
    "What is the system trying to achieve — what does 'multi-model Monte-Carlo tree-of-work' mean concretely as a training-data-generation mechanism for SWE agents?",
    "How does replay-simulation of agent traces across N heterogeneous models (parallelizing each turn across multiple models) produce a branching counterfactual tree, and what is the unit of branching (turn / decision / command)?",
    "How does Cursor Composer 2.5's 'targeted RL with textual feedback' + model-aware synthetic dataset building combine with the multi-model tree to produce dense training signal?",
    "What does 'world-model / latent what-if deliberation' mean for an LLM SWE agent, and how can it be trained in (predict next repo state before acting; auxiliary loss on prediction error; internal simulation of action A vs B)?",
    "Why is the dataset-building + RL pipeline well-described as a genetic algorithm, and where does the GA analogy hold vs break (semantically-guided mutation vs random)?",
    "CENTRAL: Does PRUNING bad branches or TRAINING ON ALL branches better instill introspection / counterfactual-foresight / pre-action deliberation? What is the argued position and what experiment would settle it?",
    "Is this two sections (dataset-building MCTS loop + RL loop) or one cohesive SFT/RL phase, or both — and at what timescales do the loops run / feed each other?",
    "How does the local composer-replication-framework already implement the substrate (3-channel loss, teacher_replay multi-teacher, FeatureDeletionEnv, HintGenerator, DiLoCo/serverless, ingestion) and what is the minimal delta to reach the proposed system?",
    "How is Channel 3 (multi-teacher trace-replay-DPO) the direct ancestor of the multi-model MCTS idea, and what must be added to go from N-flat-teachers to an N-model branching tree?",
    "How do the external papers (Socratic-RL, Socratic-SWE, Chain-of-World, 'Current Agents Fail to Leverage World Model', 'From Word to World', MuZero/Dreamer, MCTS-for-LLM, counterfactual/process-reward RL) ground or challenge each design choice?",
    "How would this be built on AWS EKS (primarily): the N-model parallel rollout/sandbox fan-out, verifier/test-execution sandboxes, the dataset-construction outer loop, the GRPO + world-model-auxiliary inner RL loop, GPU scheduling, sandbox isolation, object-store rendezvous, orchestration?",
    "How do the repo's DiLoCo / ServerlessExecutor / object-store-rendezvous abstractions map onto EKS, and what is the minimal porting delta?",
    "What does a SageMaker path look like (where it fits vs EKS), and what is the recommended hybrid split?",
    "What are the cost / throughput / failure-mode considerations and a concrete phased build plan?"
  ],
  "entities": [
    {"name": "Multi-model Monte-Carlo tree-of-work (counterfactual trace replay)", "type": "concept", "required_fields": ["branching unit", "state/action definition", "expansion policy", "tree policy", "how it extends Channel-3 multi-teacher replay"]},
    {"name": "Composer 2.5 targeted RL + textual feedback + dataset building", "type": "method", "required_fields": ["targeted textual intervention at divergence", "model-aware synthetic data", "how it maps to repo HintGenerator + FeatureDeletionEnv"]},
    {"name": "World-model latent deliberation", "type": "concept", "required_fields": ["definition for SWE agent", "training signal (next-state prediction / aux loss)", "MuZero/Dreamer/Chain-of-World analogy", "how to measure it"]},
    {"name": "Genetic-algorithm framing", "type": "concept", "required_fields": ["population", "fitness", "selection", "crossover", "mutation", "generation", "where the analogy breaks"]},
    {"name": "Prune-vs-train-on-all open question", "type": "concept", "required_fields": ["Hypothesis A (prune/DPO-style)", "Hypothesis B (all-branches/contrastive)", "capability difference", "argued position", "ablation/experiment design", "metrics incl. calibration & foresight"]},
    {"name": "composer-replication-framework (local repo)", "type": "codebase", "required_fields": ["3-channel loss", "teacher_replay multi-teacher", "FeatureDeletionEnv", "HintGenerator", "DiLoCo/serverless", "ingestion", "ADRs", "research 01-12", "Channel-3 provenance guardrail"]},
    {"name": "Socratic-RL (arXiv 2506.13358)", "type": "paper", "required_fields": ["teacher/student viewpoints", "meta-learning loop", "viewpoint distillation"]},
    {"name": "Socratic-SWE (arXiv 2606.07412)", "type": "paper", "required_fields": ["Agent Skill Registry", "Verifier Gate", "Gradient Alignment", "model-aware bug injection", "SWE-bench/Terminal-Bench results"]},
    {"name": "World-model / latent-simulation literature", "type": "paper-cluster", "required_fields": ["Chain of World", "Current Agents Fail to Leverage World Model as Tool for Foresight", "From Word to World", "MuZero/Dreamer"]},
    {"name": "MCTS / test-time RL / counterfactual credit-assignment literature", "type": "paper-cluster", "required_fields": ["MCTS for LLM agents", "test-time RL", "process reward models", "DPO vs train-on-all"]},
    {"name": "AWS EKS implementation", "type": "architecture", "required_fields": ["rollout/sandbox fan-out", "verifier sandboxes", "GPU scheduling (Karpenter/MIG/time-slicing)", "sandbox isolation (gVisor/Kata/Firecracker)", "object-store rendezvous (S3)", "outer dataset loop orchestration (Argo/Ray/Volcano)", "inner RL loop (GRPO + aux loss)", "DiLoCo mapping"]},
    {"name": "AWS SageMaker path", "type": "architecture", "required_fields": ["where it fits vs EKS", "training jobs / HyperPod", "warm pools", "recommended hybrid split"]}
  ],
  "required_formats": [
    "paradigm comparison table (Socratic-RL vs Socratic-SWE vs Composer 2.5 vs proposed multi-model MCTS)",
    "genetic-algorithm mapping table",
    "prune-vs-train-on-all experimental design (arms + metrics)",
    "EKS component / architecture table or diagram",
    "repo-asset -> system-component mapping table (what to reuse vs build)",
    "phased build plan"
  ],
  "required_sections": [],
  "required_section_headings": [
    "## 1. What We Are Actually Building: From Multi-Teacher Replay to a Counterfactual Tree of Work",
    "## 2. The World-Model Goal: Training Latent What-If Deliberation",
    "## 3. The Genetic-Algorithm Framing — Where It Holds and Where It Breaks",
    "## 4. The Central Question: Prune Bad Branches vs Train on All Branches",
    "## 5. Pipeline Shape: Two Loops, Not Two Phases",
    "## 6. Grounding in the composer-replication-framework: Reuse vs Build",
    "## 7. What the Literature Says (and Where It Pushes Back)",
    "## 8. Implementing on AWS EKS (Primary)",
    "## 9. The SageMaker Path and the Recommended Hybrid",
    "## 10. Cost, Throughput, Failure Modes, and a Phased Build Plan",
    "## Opinionated Synthesis"
  ],
  "time_horizons": ["present-state-of-the-art 2026", "phased build plan (near-term implementable)"],
  "time_periods": [],
  "scope_conditions": [
    "EKS is PRIMARY; SageMaker is secondary/where-it-fits",
    "Software-engineering agents specifically (SWE-bench / Terminal-Bench class tasks), not general reasoning",
    "Ground in the local repo's actual implementation — reuse, do not reinvent",
    "Honest provenance: Channel-3 multi-teacher trace-replay-DPO is the framework's OWN addition, not Cursor's recipe; Cursor = Channel 1 (Dr.GRPO) + Channel 2 (SDPO)"
  ],
  "pipeline_tier": "full",
  "response_format": "argumentative",
  "citation_style": "inline"
}
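
A spec like the one above lists the headings a finished report is expected to contain, so one plausible use is a mechanical check of a draft against `required_section_headings`. The sketch below is illustrative only: `missing_headings` is a hypothetical helper that is not part of the repository, the inline `spec` is an abbreviated copy with two of the eleven headings, and it assumes each heading must appear verbatim on its own line in the draft.

```python
import json


def missing_headings(spec: dict, draft: str) -> list[str]:
    """Return required headings that do not appear as a full line in the draft.

    Hypothetical helper: assumes exact, whole-line matches after stripping
    surrounding whitespace.
    """
    present = {line.strip() for line in draft.splitlines()}
    required = spec.get("required_section_headings", [])
    return [heading for heading in required if heading not in present]


# Abbreviated copy of the spec (two of the eleven required headings).
spec = json.loads("""
{
  "required_section_headings": [
    "## 8. Implementing on AWS EKS (Primary)",
    "## Opinionated Synthesis"
  ]
}
""")

# A toy draft that contains only the first of the two headings.
draft = "# Report\n\n## 8. Implementing on AWS EKS (Primary)\n\nText.\n"

print(missing_headings(spec, draft))  # ['## Opinionated Synthesis']
```

Exact line matching is deliberately strict here; a real checker might normalize whitespace or punctuation, which would be a separate design decision.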