{ "sub_questions": [ "What is the system trying to achieve — what does 'multi-model Monte-Carlo tree-of-work' mean concretely as a training-data-generation mechanism for SWE agents?", "How does replay-simulation of agent traces across N heterogeneous models (parallelizing each turn across multiple models) produce a branching counterfactual tree, and what is the unit of branching (turn / decision / command)?", "How does Cursor Composer 2.5's 'targeted RL with textual feedback' + model-aware synthetic dataset building combine with the multi-model tree to produce dense training signal?", "What does 'world-model / latent what-if deliberation' mean for an LLM SWE agent, and how can it be trained in (predict next repo state before acting; auxiliary loss on prediction error; internal simulation of action A vs B)?", "Why is the dataset-building + RL pipeline well-described as a genetic algorithm, and where does the GA analogy hold vs break (semantically-guided mutation vs random)?", "CENTRAL: Does PRUNING bad branches or TRAINING ON ALL branches better instill introspection / counterfactual-foresight / pre-action deliberation? What is the argued position and what experiment would settle it?", "Is this two sections (dataset-building MCTS loop + RL loop) or one cohesive SFT/RL phase, or both — and at what timescales do the loops run / feed each other?", "How does the local composer-replication-framework already implement the substrate (3-channel loss, teacher_replay multi-teacher, FeatureDeletionEnv, HintGenerator, DiLoCo/serverless, ingestion) and what is the minimal delta to reach the proposed system?", "How is Channel 3 (multi-teacher trace-replay-DPO) the direct ancestor of the multi-model MCTS idea, and what must be added to go from N-flat-teachers to an N-model branching tree?", "How do the external papers (Socratic-RL, Socratic-SWE, Chain-of-World, 'Current Agents Fail to Leverage World Model', 'From Word to World', MuZero/Dreamer, MCTS-for-LLM, counterfactual/process-reward RL) ground or challenge each design choice?", "How would this be built on AWS EKS (primarily): the N-model parallel rollout/sandbox fan-out, verifier/test-execution sandboxes, the dataset-construction outer loop, the GRPO + world-model-auxiliary inner RL loop, GPU scheduling, sandbox isolation, object-store rendezvous, orchestration?", "How do the repo's DiLoCo / ServerlessExecutor / object-store-rendezvous abstractions map onto EKS, and what is the minimal porting delta?", "What does a SageMaker path look like (where it fits vs EKS), and what is the recommended hybrid split?", "What are the cost / throughput / failure-mode considerations and a concrete phased build plan?" ], "entities": [ {"name": "Multi-model Monte-Carlo tree-of-work (counterfactual trace replay)", "type": "concept", "required_fields": ["branching unit", "state/action definition", "expansion policy", "tree policy", "how it extends Channel-3 multi-teacher replay"]}, {"name": "Composer 2.5 targeted RL + textual feedback + dataset building", "type": "method", "required_fields": ["targeted textual intervention at divergence", "model-aware synthetic data", "how it maps to repo HintGenerator + FeatureDeletionEnv"]}, {"name": "World-model latent deliberation", "type": "concept", "required_fields": ["definition for SWE agent", "training signal (next-state prediction / aux loss)", "MuZero/Dreamer/Chain-of-World analogy", "how to measure it"]}, {"name": "Genetic-algorithm framing", "type": "concept", "required_fields": ["population", "fitness", "selection", "crossover", "mutation", "generation", "where the analogy breaks"]}, {"name": "Prune-vs-train-on-all open question", "type": "concept", "required_fields": ["Hypothesis A (prune/DPO-style)", "Hypothesis B (all-branches/contrastive)", "capability difference", "argued position", "ablation/experiment design", "metrics incl. calibration & foresight"]}, {"name": "composer-replication-framework (local repo)", "type": "codebase", "required_fields": ["3-channel loss", "teacher_replay multi-teacher", "FeatureDeletionEnv", "HintGenerator", "DiLoCo/serverless", "ingestion", "ADRs", "research 01-12", "Channel-3 provenance guardrail"]}, {"name": "Socratic-RL (arXiv 2506.13358)", "type": "paper", "required_fields": ["teacher/student viewpoints", "meta-learning loop", "viewpoint distillation"]}, {"name": "Socratic-SWE (arXiv 2606.07412)", "type": "paper", "required_fields": ["Agent Skill Registry", "Verifier Gate", "Gradient Alignment", "model-aware bug injection", "SWE-bench/Terminal-Bench results"]}, {"name": "World-model / latent-simulation literature", "type": "paper-cluster", "required_fields": ["Chain of World", "Current Agents Fail to Leverage World Model as Tool for Foresight", "From Word to World", "MuZero/Dreamer"]}, {"name": "MCTS / test-time RL / counterfactual credit-assignment literature", "type": "paper-cluster", "required_fields": ["MCTS for LLM agents", "test-time RL", "process reward models", "DPO vs train-on-all"]}, {"name": "AWS EKS implementation", "type": "architecture", "required_fields": ["rollout/sandbox fan-out", "verifier sandboxes", "GPU scheduling (Karpenter/MIG/time-slicing)", "sandbox isolation (gVisor/Kata/Firecracker)", "object-store rendezvous (S3)", "outer dataset loop orchestration (Argo/Ray/Volcano)", "inner RL loop (GRPO + aux loss)", "DiLoCo mapping"]}, {"name": "AWS SageMaker path", "type": "architecture", "required_fields": ["where it fits vs EKS", "training jobs / HyperPod", "warm pools", "recommended hybrid split"]} ], "required_formats": [ "paradigm comparison table (Socratic-RL vs Socratic-SWE vs Composer 2.5 vs proposed multi-model MCTS)", "genetic-algorithm mapping table", "prune-vs-train-on-all experimental design (arms + metrics)", "EKS component / architecture table or diagram", "repo-asset -> system-component mapping table (what to reuse vs build)", "phased build plan" ], "required_sections": [], "required_section_headings": [ "## 1. What We Are Actually Building: From Multi-Teacher Replay to a Counterfactual Tree of Work", "## 2. The World-Model Goal: Training Latent What-If Deliberation", "## 3. The Genetic-Algorithm Framing — Where It Holds and Where It Breaks", "## 4. The Central Question: Prune Bad Branches vs Train on All Branches", "## 5. Pipeline Shape: Two Loops, Not Two Phases", "## 6. Grounding in the composer-replication-framework: Reuse vs Build", "## 7. What the Literature Says (and Where It Pushes Back)", "## 8. Implementing on AWS EKS (Primary)", "## 9. The SageMaker Path and the Recommended Hybrid", "## 10. Cost, Throughput, Failure Modes, and a Phased Build Plan", "## Opinionated Synthesis" ], "time_horizons": ["present-state-of-the-art 2026", "phased build plan (near-term implementable)"], "time_periods": [], "scope_conditions": [ "EKS is PRIMARY; SageMaker is secondary/where-it-fits", "Software-engineering agents specifically (SWE-bench / Terminal-Bench class tasks), not general reasoning", "Ground in the local repo's actual implementation — reuse, do not reinvent", "Honest provenance: Channel-3 multi-teacher trace-replay-DPO is the framework's OWN addition, not Cursor's recipe; Cursor = Channel 1 (Dr.GRPO) + Channel 2 (SDPO)" ], "pipeline_tier": "full", "response_format": "argumentative", "citation_style": "inline" }