Baladithya Balamurugan

Wave 1: fix 8 failing tests + unblock Docker E2E + dep/doc debt

c11cf49 19 days ago

	{
	"sub_questions": [
	"What is the system trying to achieve — what does 'multi-model Monte-Carlo tree-of-work' mean concretely as a training-data-generation mechanism for SWE agents?",
	"How does replay-simulation of agent traces across N heterogeneous models (parallelizing each turn across multiple models) produce a branching counterfactual tree, and what is the unit of branching (turn / decision / command)?",
	"How does Cursor Composer 2.5's 'targeted RL with textual feedback' + model-aware synthetic dataset building combine with the multi-model tree to produce dense training signal?",
	"What does 'world-model / latent what-if deliberation' mean for an LLM SWE agent, and how can it be trained in (predict next repo state before acting; auxiliary loss on prediction error; internal simulation of action A vs B)?",
	"Why is the dataset-building + RL pipeline well-described as a genetic algorithm, and where does the GA analogy hold vs break (semantically-guided mutation vs random)?",
	"CENTRAL: Does PRUNING bad branches or TRAINING ON ALL branches better instill introspection / counterfactual-foresight / pre-action deliberation? What is the argued position and what experiment would settle it?",
	"Is this two sections (dataset-building MCTS loop + RL loop) or one cohesive SFT/RL phase, or both — and at what timescales do the loops run / feed each other?",
	"How does the local composer-replication-framework already implement the substrate (3-channel loss, teacher_replay multi-teacher, FeatureDeletionEnv, HintGenerator, DiLoCo/serverless, ingestion) and what is the minimal delta to reach the proposed system?",
	"How is Channel 3 (multi-teacher trace-replay-DPO) the direct ancestor of the multi-model MCTS idea, and what must be added to go from N-flat-teachers to an N-model branching tree?",
	"How do the external papers (Socratic-RL, Socratic-SWE, Chain-of-World, 'Current Agents Fail to Leverage World Model', 'From Word to World', MuZero/Dreamer, MCTS-for-LLM, counterfactual/process-reward RL) ground or challenge each design choice?",
	"How would this be built on AWS EKS (primarily): the N-model parallel rollout/sandbox fan-out, verifier/test-execution sandboxes, the dataset-construction outer loop, the GRPO + world-model-auxiliary inner RL loop, GPU scheduling, sandbox isolation, object-store rendezvous, orchestration?",
	"How do the repo's DiLoCo / ServerlessExecutor / object-store-rendezvous abstractions map onto EKS, and what is the minimal porting delta?",
	"What does a SageMaker path look like (where it fits vs EKS), and what is the recommended hybrid split?",
	"What are the cost / throughput / failure-mode considerations and a concrete phased build plan?"
	],
	"entities": [
	{"name": "Multi-model Monte-Carlo tree-of-work (counterfactual trace replay)", "type": "concept", "required_fields": ["branching unit", "state/action definition", "expansion policy", "tree policy", "how it extends Channel-3 multi-teacher replay"]},
	{"name": "Composer 2.5 targeted RL + textual feedback + dataset building", "type": "method", "required_fields": ["targeted textual intervention at divergence", "model-aware synthetic data", "how it maps to repo HintGenerator + FeatureDeletionEnv"]},
	{"name": "World-model latent deliberation", "type": "concept", "required_fields": ["definition for SWE agent", "training signal (next-state prediction / aux loss)", "MuZero/Dreamer/Chain-of-World analogy", "how to measure it"]},
	{"name": "Genetic-algorithm framing", "type": "concept", "required_fields": ["population", "fitness", "selection", "crossover", "mutation", "generation", "where the analogy breaks"]},
	{"name": "Prune-vs-train-on-all open question", "type": "concept", "required_fields": ["Hypothesis A (prune/DPO-style)", "Hypothesis B (all-branches/contrastive)", "capability difference", "argued position", "ablation/experiment design", "metrics incl. calibration & foresight"]},
	{"name": "composer-replication-framework (local repo)", "type": "codebase", "required_fields": ["3-channel loss", "teacher_replay multi-teacher", "FeatureDeletionEnv", "HintGenerator", "DiLoCo/serverless", "ingestion", "ADRs", "research 01-12", "Channel-3 provenance guardrail"]},
	{"name": "Socratic-RL (arXiv 2506.13358)", "type": "paper", "required_fields": ["teacher/student viewpoints", "meta-learning loop", "viewpoint distillation"]},
	{"name": "Socratic-SWE (arXiv 2606.07412)", "type": "paper", "required_fields": ["Agent Skill Registry", "Verifier Gate", "Gradient Alignment", "model-aware bug injection", "SWE-bench/Terminal-Bench results"]},
	{"name": "World-model / latent-simulation literature", "type": "paper-cluster", "required_fields": ["Chain of World", "Current Agents Fail to Leverage World Model as Tool for Foresight", "From Word to World", "MuZero/Dreamer"]},
	{"name": "MCTS / test-time RL / counterfactual credit-assignment literature", "type": "paper-cluster", "required_fields": ["MCTS for LLM agents", "test-time RL", "process reward models", "DPO vs train-on-all"]},
	{"name": "AWS EKS implementation", "type": "architecture", "required_fields": ["rollout/sandbox fan-out", "verifier sandboxes", "GPU scheduling (Karpenter/MIG/time-slicing)", "sandbox isolation (gVisor/Kata/Firecracker)", "object-store rendezvous (S3)", "outer dataset loop orchestration (Argo/Ray/Volcano)", "inner RL loop (GRPO + aux loss)", "DiLoCo mapping"]},
	{"name": "AWS SageMaker path", "type": "architecture", "required_fields": ["where it fits vs EKS", "training jobs / HyperPod", "warm pools", "recommended hybrid split"]}
	],
	"required_formats": [
	"paradigm comparison table (Socratic-RL vs Socratic-SWE vs Composer 2.5 vs proposed multi-model MCTS)",
	"genetic-algorithm mapping table",
	"prune-vs-train-on-all experimental design (arms + metrics)",
	"EKS component / architecture table or diagram",
	"repo-asset -> system-component mapping table (what to reuse vs build)",
	"phased build plan"
	],
	"required_sections": [],
	"required_section_headings": [
	"## 1. What We Are Actually Building: From Multi-Teacher Replay to a Counterfactual Tree of Work",
	"## 2. The World-Model Goal: Training Latent What-If Deliberation",
	"## 3. The Genetic-Algorithm Framing — Where It Holds and Where It Breaks",
	"## 4. The Central Question: Prune Bad Branches vs Train on All Branches",
	"## 5. Pipeline Shape: Two Loops, Not Two Phases",
	"## 6. Grounding in the composer-replication-framework: Reuse vs Build",
	"## 7. What the Literature Says (and Where It Pushes Back)",
	"## 8. Implementing on AWS EKS (Primary)",
	"## 9. The SageMaker Path and the Recommended Hybrid",
	"## 10. Cost, Throughput, Failure Modes, and a Phased Build Plan",
	"## Opinionated Synthesis"
	],
	"time_horizons": ["present-state-of-the-art 2026", "phased build plan (near-term implementable)"],
	"time_periods": [],
	"scope_conditions": [
	"EKS is PRIMARY; SageMaker is secondary/where-it-fits",
	"Software-engineering agents specifically (SWE-bench / Terminal-Bench class tasks), not general reasoning",
	"Ground in the local repo's actual implementation — reuse, do not reinvent",
	"Honest provenance: Channel-3 multi-teacher trace-replay-DPO is the framework's OWN addition, not Cursor's recipe; Cursor = Channel 1 (Dr.GRPO) + Channel 2 (SDPO)"
	],
	"pipeline_tier": "full",
	"response_format": "argumentative",
	"citation_style": "inline"
	}