{
  "sub_questions": [
    "What is the system trying to achieve — what does 'multi-model Monte-Carlo tree-of-work' mean concretely as a training-data-generation mechanism for SWE agents?",
    "How does replay-simulation of agent traces across N heterogeneous models (parallelizing each turn across multiple models) produce a branching counterfactual tree, and what is the unit of branching (turn / decision / command)?",
    "How does Cursor Composer 2.5's 'targeted RL with textual feedback' + model-aware synthetic dataset building combine with the multi-model tree to produce dense training signal?",
    "What does 'world-model / latent what-if deliberation' mean for an LLM SWE agent, and how can it be trained in (predict next repo state before acting; auxiliary loss on prediction error; internal simulation of action A vs B)?",
    "Why is the dataset-building + RL pipeline well-described as a genetic algorithm, and where does the GA analogy hold vs break (semantically-guided mutation vs random)?",
    "CENTRAL: Does PRUNING bad branches or TRAINING ON ALL branches better instill introspection / counterfactual-foresight / pre-action deliberation? What is the argued position and what experiment would settle it?",
    "Is this two sections (dataset-building MCTS loop + RL loop) or one cohesive SFT/RL phase, or both — and at what timescales do the loops run / feed each other?",
    "How does the local composer-replication-framework already implement the substrate (3-channel loss, teacher_replay multi-teacher, FeatureDeletionEnv, HintGenerator, DiLoCo/serverless, ingestion) and what is the minimal delta to reach the proposed system?",
    "How is Channel 3 (multi-teacher trace-replay-DPO) the direct ancestor of the multi-model MCTS idea, and what must be added to go from N-flat-teachers to an N-model branching tree?",
    "How do the external papers (Socratic-RL, Socratic-SWE, Chain-of-World, 'Current Agents Fail to Leverage World Model', 'From Word to World', MuZero/Dreamer, MCTS-for-LLM, counterfactual/process-reward RL) ground or challenge each design choice?",
    "How would this be built on AWS EKS (primarily): the N-model parallel rollout/sandbox fan-out, verifier/test-execution sandboxes, the dataset-construction outer loop, the GRPO + world-model-auxiliary inner RL loop, GPU scheduling, sandbox isolation, object-store rendezvous, orchestration?",
    "How do the repo's DiLoCo / ServerlessExecutor / object-store-rendezvous abstractions map onto EKS, and what is the minimal porting delta?",
    "What does a SageMaker path look like (where it fits vs EKS), and what is the recommended hybrid split?",
    "What are the cost / throughput / failure-mode considerations and a concrete phased build plan?"
  ],
  "entities": [
    {"name": "Multi-model Monte-Carlo tree-of-work (counterfactual trace replay)", "type": "concept", "required_fields": ["branching unit", "state/action definition", "expansion policy", "tree policy", "how it extends Channel-3 multi-teacher replay"]},
    {"name": "Composer 2.5 targeted RL + textual feedback + dataset building", "type": "method", "required_fields": ["targeted textual intervention at divergence", "model-aware synthetic data", "how it maps to repo HintGenerator + FeatureDeletionEnv"]},
    {"name": "World-model latent deliberation", "type": "concept", "required_fields": ["definition for SWE agent", "training signal (next-state prediction / aux loss)", "MuZero/Dreamer/Chain-of-World analogy", "how to measure it"]},
    {"name": "Genetic-algorithm framing", "type": "concept", "required_fields": ["population", "fitness", "selection", "crossover", "mutation", "generation", "where the analogy breaks"]},
    {"name": "Prune-vs-train-on-all open question", "type": "concept", "required_fields": ["Hypothesis A (prune/DPO-style)", "Hypothesis B (all-branches/contrastive)", "capability difference", "argued position", "ablation/experiment design", "metrics incl. calibration & foresight"]},
    {"name": "composer-replication-framework (local repo)", "type": "codebase", "required_fields": ["3-channel loss", "teacher_replay multi-teacher", "FeatureDeletionEnv", "HintGenerator", "DiLoCo/serverless", "ingestion", "ADRs", "research 01-12", "Channel-3 provenance guardrail"]},
    {"name": "Socratic-RL (arXiv 2506.13358)", "type": "paper", "required_fields": ["teacher/student viewpoints", "meta-learning loop", "viewpoint distillation"]},
    {"name": "Socratic-SWE (arXiv 2606.07412)", "type": "paper", "required_fields": ["Agent Skill Registry", "Verifier Gate", "Gradient Alignment", "model-aware bug injection", "SWE-bench/Terminal-Bench results"]},
    {"name": "World-model / latent-simulation literature", "type": "paper-cluster", "required_fields": ["Chain of World", "Current Agents Fail to Leverage World Model as Tool for Foresight", "From Word to World", "MuZero/Dreamer"]},
    {"name": "MCTS / test-time RL / counterfactual credit-assignment literature", "type": "paper-cluster", "required_fields": ["MCTS for LLM agents", "test-time RL", "process reward models", "DPO vs train-on-all"]},
    {"name": "AWS EKS implementation", "type": "architecture", "required_fields": ["rollout/sandbox fan-out", "verifier sandboxes", "GPU scheduling (Karpenter/MIG/time-slicing)", "sandbox isolation (gVisor/Kata/Firecracker)", "object-store rendezvous (S3)", "outer dataset loop orchestration (Argo/Ray/Volcano)", "inner RL loop (GRPO + aux loss)", "DiLoCo mapping"]},
    {"name": "AWS SageMaker path", "type": "architecture", "required_fields": ["where it fits vs EKS", "training jobs / HyperPod", "warm pools", "recommended hybrid split"]}
  ],
  "required_formats": [
    "paradigm comparison table (Socratic-RL vs Socratic-SWE vs Composer 2.5 vs proposed multi-model MCTS)",
    "genetic-algorithm mapping table",
    "prune-vs-train-on-all experimental design (arms + metrics)",
    "EKS component / architecture table or diagram",
    "repo-asset -> system-component mapping table (what to reuse vs build)",
    "phased build plan"
  ],
  "required_sections": [],
  "required_section_headings": [
    "## 1. What We Are Actually Building: From Multi-Teacher Replay to a Counterfactual Tree of Work",
    "## 2. The World-Model Goal: Training Latent What-If Deliberation",
    "## 3. The Genetic-Algorithm Framing — Where It Holds and Where It Breaks",
    "## 4. The Central Question: Prune Bad Branches vs Train on All Branches",
    "## 5. Pipeline Shape: Two Loops, Not Two Phases",
    "## 6. Grounding in the composer-replication-framework: Reuse vs Build",
    "## 7. What the Literature Says (and Where It Pushes Back)",
    "## 8. Implementing on AWS EKS (Primary)",
    "## 9. The SageMaker Path and the Recommended Hybrid",
    "## 10. Cost, Throughput, Failure Modes, and a Phased Build Plan",
    "## Opinionated Synthesis"
  ],
  "time_horizons": ["present-state-of-the-art 2026", "phased build plan (near-term implementable)"],
  "time_periods": [],
  "scope_conditions": [
    "EKS is PRIMARY; SageMaker is secondary/where-it-fits",
    "Software-engineering agents specifically (SWE-bench / Terminal-Bench class tasks), not general reasoning",
    "Ground in the local repo's actual implementation — reuse, do not reinvent",
    "Honest provenance: Channel-3 multi-teacher trace-replay-DPO is the framework's OWN addition, not Cursor's recipe; Cursor = Channel 1 (Dr.GRPO) + Channel 2 (SDPO)"
  ],
  "pipeline_tier": "full",
  "response_format": "argumentative",
  "citation_style": "inline"
}