
Composer 2 Technical Report — Mining Notes (arXiv:2603.24477)

Extraction date: 2026-05-28.

Primary source: Full text of the Composer 2 Technical Report (Cursor Research Team; corresponding author Alexander M. "Sasha" Rush), PDF at https://cursor.com/resources/Composer2.pdf and arXiv 2603.24477 (v1 25 Mar 2026, v2 26 Mar 2026; cs.SE / cs.LG; "Aaron Chan and 53 other authors").

Method: mcp_tavily_tavily_extract (advanced) on the PDF returned the complete report body incl. References + Appendices A–C (~148 KB). Cross-checked against mcp_exa_crawling_exa (full re-pull, identical text) and a mcp_tavily_tavily_search confirming the arXiv ID, abstract, "Dr. GRPO" passage, and the technical-report blog.

Tagging: [REPORT-VERIFIED] = verbatim/paraphrase from the arXiv report. [SECONDARY] = blog/third-party. [ABSENT] = explicitly looked for, not in the report.

Scope note: This report is Composer 2, not Composer 2.5. Several recipe items the 2.5 blog advertises (targeted textual-feedback/hint distillation, "25× synthetic tasks", Sharded Muon) are not in this document; see §3 and the corrections box.


TL;DR — did it resolve the three open questions?

| Open question (from delta note 09) | Resolved? | Answer |
|---|---|---|
| RL algorithm NAME | YES | A multi-sample policy-gradient (GRPO-family) algorithm built explicitly on Dr. GRPO [34]: GRPO with the length-standardization term removed and no std-dev advantage normalization. Optimizer = Adam, single-epoch, fixed group size, full-parameter. KL via the k1 estimator (−log r). |
| Data-mix weighting % / generator inventory / token counts | ⚠️ PARTIAL | CPT is a 3-phase code-dominated mix (32k → 256k → SFT) but no %s and no token counts are given. RL task mix is given only as a category histogram (Fig. 3), not generator names or weights. No "Feature Deletion" generator inventory (that was 2.5-blog). |
| HINT-generation mechanism (targeted textual feedback) | ABSENT | The hint/teacher-student textual-feedback mechanism is NOT in the Composer 2 report at all. It is a Composer 2.5 feature. Composer 2 shapes behavior with auxiliary scalar rewards + a nonlinear length penalty, not hint distillation. The #1 reproducibility gap remains unresolved by this artifact. |

Net: The report fully answers the RL-algorithm question (the single biggest win), partially answers data-mix, and does not touch hint generation. It also delivers a large amount of previously-unstated infrastructure detail (Anyrun internals, async RL stack, MoE router replay, precision recipe) and corrects two prior assumptions (the optimizer is Adam, not Muon; the base is Kimi K2.5 1.04T/32B).


1. Data generation / CPT data-mix / curriculum [§3, §4]

1.1 Continued pretraining (CPT) — [REPORT-VERIFIED]

  • Base model = Kimi K2.5 [67], a 1.04T-param / 32B-active MoE (Appendix B; selected over GLM-5 and DeepSeek V3.2 on internal FreshBench knowledge, State Tracking (LoCoDiff-style), and codebase perplexity; agentic benchmarks deliberately excluded from base-model selection "as agentic and long-horizon capabilities can drastically change during the RL stage").
  • CPT is "a large code-dominated data mix" done in three phases:
    1. Bulk of compute at 32k sequence length,
    2. a shorter long-context extension phase to 256k,
    3. a short SFT phase on targeted coding tasks.
  • Training: MXFP8 on NVIDIA B300s, AdamW optimizer. Eval loss on internal codebase "decreases log-linearly" over the run.
  • Causal CPT→RL claim (the justification for doing CPT): they replicate the recipe on Qwen3-Coder-30B-A3B at three log-spaced compute levels (small/medium/large), each + identical SFT + identical RL run, and show "cross-entropy loss is … predictive of downstream RL performance" (Fig. 2). → Direct support for our "start from an already-code-strong base" decision.
  • Multi-Token Prediction (MTP): extra MTP layers [17,11] trained from scratch on the same mix for speculative decoding, via self-distillation to the main LM head's logits; MTP layers cut from the middle of the CPT run and trained jointly during the long-context + SFT phases. (This is the only "self-distillation" in the report — it is for MTP/spec-decode, NOT for hints.)
  • [ABSENT] No data-mix percentages, no token/byte counts, no list of CPT data sources.

1.2 RL task distribution & dynamic curriculum — [REPORT-VERIFIED]

  • RL tasks "run in environments that emulate real Cursor sessions as closely as possible." Problem distribution "reflects the most common use cases"; Fig. 3 gives the category breakdown (x-axis "% of Problems", ~0–40%): Iterate On Feature, Debugging, New Feature, Refactor, Understanding Codebase, Documentation, Testing, Code Review, Optimize, Devops, Migration, Deletion, Other. (This is the closest the report gets to a "data mix" — categorical, not weighted %s, no generator names.)
  • Dynamic difficulty curriculum (verbatim): *"In later stages of training, we use simple heuristics—such as number of turns and thinking tokens of rollouts—to upsample increasingly harder data points."* → Confirms delta note 09's "select for harder tasks dynamically" as an online up-sampling gate keyed on turns + thinking-token count. Replication handle: rank tasks by rollout length/turn-count, up-weight the long-tail late in training.
  • [ABSENT] No synthetic-task generator inventory (no "Feature Deletion" et al.), no "25× synthetic tasks" figure, no synthetic-vs-real split. Those are Composer 2.5-blog claims and are not in this report.
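The late-stage up-sampling heuristic quoted above can be sketched as follows. This is a minimal illustration, not the report's implementation: the `TaskStats` record, the equal-weight difficulty proxy, the `alpha` sharpness knob, and the `progress` schedule are all hypothetical, since the report only names the two signals (number of turns and thinking tokens).

```python
import random
from dataclasses import dataclass


@dataclass
class TaskStats:
    task_id: str
    mean_turns: float            # average turns per rollout for this task
    mean_thinking_tokens: float  # average thinking tokens per rollout


def difficulty_weights(tasks, progress, alpha=2.0):
    """Return sampling weights that favour harder tasks as progress -> 1.

    progress: fraction of training completed, in [0, 1].
    alpha: hypothetical sharpness knob; not from the report.
    """
    max_turns = max(t.mean_turns for t in tasks) or 1.0
    max_think = max(t.mean_thinking_tokens for t in tasks) or 1.0
    weights = []
    for t in tasks:
        # Difficulty proxy in [0, 1]: average of the two normalised heuristics.
        d = 0.5 * (t.mean_turns / max_turns + t.mean_thinking_tokens / max_think)
        # At progress=0 every task has weight 1, i.e. uniform sampling.
        weights.append((1.0 + d) ** (alpha * progress))
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(tasks, progress, batch_size, seed=0):
    """Draw a batch of tasks (with replacement) under the current weights."""
    rng = random.Random(seed)
    return rng.choices(tasks, weights=difficulty_weights(tasks, progress), k=batch_size)
```

Tying the exponent to `progress` keeps early training uniform and shifts mass toward long, multi-turn tasks only late in the run, which matches the "in later stages" wording.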

2. RL ALGORITHM [§4.1] — [REPORT-VERIFIED], the headline result

Algorithm family: "a policy gradient algorithm with multiple samples per prompt [53 = DeepSeekMath/GRPO, 2 = REINFORCE-style RLOO] and a fixed group size." Operates in the single-epoch regime (a prompt is never trained on twice). Adam optimizer; full-parameter update. Highly asynchronous (independent train + rollout workers).

Specific GRPO modifications (the "name" + the deltas):

  • Built on Dr. GRPO [34 = Liu et al., Understanding R1-Zero-like training, arXiv 2503.20783]: verbatim "As in Dr. GRPO, … crucial to minimize the bias in the gradients that can arise from transforming the underlying advantage."
  • Remove the length-standardization term from GRPO (it "introduces a length bias").
  • Do NOT normalize group advantages by their standard deviation — std-norm "results in the degenerate case where small behavioral differences get massively upweighted within a group where every rollout achieves equal correctness."
  • Overlong-rollout masking [78 = DAPO/Yu et al.]: NOT used. They "did not see benefits with overlong masking at small scale and opted not to mask rollouts that exceed the maximum sequence length"; the self-summary system limits overlong cases anyway. (So: Dr. GRPO-style, explicitly NOT DAPO's overlong masking; DAPO [78] and GSPO [82] are cited but as related work / for router-replay, not adopted wholesale.)
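To make the two removals concrete, here is a small sketch contrasting GRPO-style and Dr. GRPO-style advantages for one group of rollouts. It is a simplification under stated assumptions: the loss is a plain REINFORCE-style surrogate with no importance ratio, clipping, or KL term, and dividing by the group size alone is an illustrative choice rather than the report's exact normalizer.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Original GRPO: centre by the group mean, then divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style: centre by the group mean only (no std normalisation)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def policy_gradient_loss(token_logprobs, advantages, length_normalize=False):
    """Surrogate loss over one group of rollouts.

    token_logprobs: one list of per-token log-probs per rollout.
    length_normalize=True reproduces GRPO's per-sequence 1/length term;
    False sums the token terms, as in Dr. GRPO.
    """
    total = 0.0
    for logps, adv in zip(token_logprobs, advantages):
        seq_term = sum(logps) * adv
        if length_normalize:
            seq_term /= max(len(logps), 1)
        total -= seq_term
    return total / len(advantages)
```

With nearly identical rewards such as `[1.0, 1.0, 1.0, 0.98]`, the std-normalized advantages have magnitude on the order of 1 while the mean-centred ones stay near 0.01, which is the degenerate up-weighting the report describes.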

KL regularization — exact formulation [§4.1, Fig. 4]:

  • Uses KL(q‖p) = E_{x∼q}[−log r(x)], r(x)=p(x)/q(x) for regularization (like DeepSeekMath [53] and Kimi k1.5 [66]).
  • Chooses the k1 estimator k1 = −log r over the popular k3 = (r−1) − log r [Schulman 52], because (citing Amini et al. [6]) k3's variance "increases drastically as p and q diverge": at large KL the k3 estimate variance is "extremely large." (k2 is a biased estimator per their note; k1 and k3 are both unbiased.) → Replication handle: use the simple −log r KL penalty, not the k3 estimator, for agentic long-horizon RL.
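The variance argument can be checked numerically with a toy setup. The sketch below is an illustration, not the report's experiment: it assumes q = N(0, 1) and p = N(mu, 1), so that log r(x) = mu·x − mu²/2 and the true KL(q‖p) is mu²/2. The k2 formula ½(log r)² follows Schulman's note; the function names and sample size are arbitrary.

```python
import math
import random
from statistics import pvariance


def k1(log_r):
    """k1 = -log r."""
    return -log_r


def k2(log_r):
    """k2 = 0.5 * (log r)^2."""
    return 0.5 * log_r ** 2


def k3(log_r):
    """k3 = (r - 1) - log r, computed with expm1 for accuracy near r = 1."""
    return math.expm1(log_r) - log_r


def estimator_variances(mu, n=20000, seed=0):
    """Sample variance of each per-sample estimator for q=N(0,1), p=N(mu,1)."""
    rng = random.Random(seed)
    # For these two Gaussians, log r(x) = log p(x) - log q(x) = mu*x - mu^2/2.
    log_rs = [mu * rng.gauss(0.0, 1.0) - 0.5 * mu * mu for _ in range(n)]
    estimators = (("k1", k1), ("k2", k2), ("k3", k3))
    return {name: pvariance([f(lr) for lr in log_rs]) for name, f in estimators}
```

Comparing `estimator_variances(0.1)` with `estimator_variances(3.0)` shows the trade-off: k3 has the lower variance when p and q are close, but its variance overtakes k1's once the distributions diverge.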

Async-rollout infra / off-policy control [§4.1, §6.2]:

  • Minimize off-policyness via fast weight sync + in-flight (mid-rollout) weight updates, "similar to PipelineRL [48]" — inference workers update weights mid-rollout so later tokens are less off-policy.
  • MoE router replay [38, 82]: inference engine returns selected expert indices per token per MoE layer; training forward pass overrides the router's expert assignment to match (router still computes gating scores so gradients flow). They extend replay by filtering replayed experts whose gating scores fall below a plausibility threshold from the router's own top-k, replacing them with the router's candidates — reduces p99 numerics mismatch between inference and training forward passes. (Critical for MoE-base RL stability; directly relevant if we RL a MoE.)
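A per-token sketch of router replay with a plausibility filter is given below. The report does not specify the threshold rule, so everything about it here is an assumption: `rel_threshold` is a hypothetical knob, the floor is taken relative to the router's k-th best score, and gating scores are assumed non-negative (e.g. softmax or sigmoid outputs).

```python
def top_k_indices(scores, k):
    """Indices of the k largest gating scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def replay_with_plausibility_filter(gate_scores, replayed, k, rel_threshold=0.5):
    """Choose the experts used in the training forward pass for one token.

    gate_scores: the trainer router's gating score per expert.
    replayed: the k expert indices the inference engine selected for this token.
    rel_threshold: hypothetical knob; a replayed expert is kept only if its
        score is at least rel_threshold times the router's k-th best score.
    """
    own_top_k = top_k_indices(gate_scores, k)
    floor = rel_threshold * gate_scores[own_top_k[-1]]
    kept = [e for e in replayed if gate_scores[e] >= floor]
    # Fill vacated slots with the router's own best candidates not already kept.
    fill = [e for e in own_top_k if e not in kept]
    return kept + fill[: k - len(kept)]
```

In a real trainer the returned indices would override the router's selection while the gating scores themselves are still computed normally, so gradients continue to flow through the router.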

Reward structure [§4.1–4.2]:

  • Reward based on "code's correctness, succinctness, and conformance to software engineering principles."
  • best-of-K does NOT trade off vs average: both rise together over training (Fig. 5) → RL is expanding solution coverage, not just sharpening (notable vs the "RL only concentrates mass" literature [79,32,8,74,61]).
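For tracking the same two curves in a replication run, a minimal helper might look like the following. The function name and the input layout (one list of K rollout rewards per prompt) are assumptions for illustration.

```python
def group_metrics(rewards_per_prompt):
    """Return (mean per-prompt average reward, mean per-prompt best-of-K reward)."""
    avg = [sum(group) / len(group) for group in rewards_per_prompt]
    best = [max(group) for group in rewards_per_prompt]
    n = len(rewards_per_prompt)
    return sum(avg) / n, sum(best) / n
```

If training only sharpened the policy, the average would climb toward a flat best-of-K curve; both rising together is the signature the report points to in Fig. 5.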

Reward-hacking safeguards — [ABSENT/THIN]: This report does not contain the Python-typecheck-cache / Java-bytecode reward-hack anecdotes (those are 2.5-blog). The only related safeguards here are strict tool-argument checks and tool removal for steerability in training environments (§6.2), and general monitoring for emergent behaviors (§4.2). No dedicated "agentic monitoring tool" section.


3. Targeted textual feedback / hint distillation — [ABSENT]

Finding: The Composer 2 technical report contains NO hint-generation / teacher-student textual-feedback / on-policy KL-to-hint-conditioned-teacher mechanism. Searched the full text for hint / teacher / student / textual feedback / distill — the only "distillation" is MTP self-distillation to the LM head's logits (§3.1, spec-decode), unrelated to behavior shaping.

What Composer 2 does for behavior shaping instead [§4.2 "Agent Behavior"] — [REPORT-VERIFIED]:

  • Auxiliary scalar rewards, not hints: "we apply an array of auxiliary rewards … rewards for coding style, communication, and product-specific penalties for poor tool calls, such as creating to-do list items and then leaving them unfinished."
  • Reactive reward addition: they "monitor the model for emergent behaviors and occasionally introduce additional behavior rewards" (examples observed: leaving long CoT in code comments; collapsing to terminal-tool-only).
  • Nonlinear length / effort penalty (exact equation): C_length{k,q}(x) = ((1 + k·x)^{1−q} − 1) / (k·(1−q)), concave-down & increasing, where x = a weighted combination of {thinking tokens, tool-calling tokens, tool-output tokens, final-message tokens, # tool calls, # turns} and k, q are curvature hyperparameters (Fig. 6). Goal: be quick on easy tasks, think longer on hard tasks; observed to induce parallel tool calls.
  • Self-Summarization [§4.1, from Composer 1.5 [64]]: rollouts are chains joined by self-summaries; final reward is assigned to all tokens in the chain (up-weights good agent turns and the summaries that enabled them; down-weights lossy summaries). Reduces error vs prompt-based compaction while using fewer tokens and reusing KV cache.
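The length / effort penalty from the list above translates directly into code. In this sketch the weights of the effort combination are hypothetical (these notes do not record them), and the q → 1 branch is the standard limit of the formula, log(1 + k·x) / k, added so the function is defined for q = 1.

```python
import math


def effort(counts, weights):
    """Weighted combination x of effort signals (weights are hypothetical)."""
    return sum(weights[name] * counts.get(name, 0.0) for name in weights)


def length_penalty(x, k, q):
    """C(x) = ((1 + k*x)^(1-q) - 1) / (k * (1 - q)); log(1 + k*x) / k as q -> 1."""
    if k <= 0 or x < 0:
        raise ValueError("expects k > 0 and x >= 0")
    if abs(1.0 - q) < 1e-9:
        return math.log1p(k * x) / k
    return ((1.0 + k * x) ** (1.0 - q) - 1.0) / (k * (1.0 - q))
```

For example, `x = effort({"thinking_tokens": 4000, "turns": 12}, {"thinking_tokens": 1.0, "turns": 50.0})` followed by `length_penalty(x, k=0.001, q=0.5)`. With q = 0 the penalty is linear in x; for q > 0 its slope (1 + k·x)^(−q) decays with x, so each extra unit of effort on an already long rollout costs less than the first units on a short one.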

Implication for the replication framework: To reproduce Composer 2.5's hint mechanism we still must look elsewhere — the SDPO (arXiv 2601.20802) / OPSD (2601.18734) papers from delta note 09 remain the only formalizations, and how Cursor generates the hint text itself is still unstated in every Cursor artifact. Composer 2's behavior shaping (auxiliary rewards + the length-penalty equation above) is a fully reproducible, hint-free alternative we can adopt for v0.1.


4. Other replication-relevant detail [§6 Infrastructure, §5 CursorBench, App.] — [REPORT-VERIFIED]

Optimizer — CORRECTION: report says AdamW (CPT) / Adam (RL). There is NO "Sharded Muon" in the Composer 2 report — the Muon claim came from the 2.5 blog and should be tagged 2.5-only / re-verified, not assumed for Composer 2.

Parallelism / sharding layout — CORRECTION to "HSDP":

  • Prior stacks used FSDP + EP + TP (EP coupled to TP). Composer 2 decouples EP from TP and uses Context Parallelism (CP) as the primary long-context axis (less comm than TP; CP folded into the FSDP dim). No mention of "HSDP": the report describes FSDP/ZeRO [50,81] + CP + decoupled EP, with DeepEP [80] for token dispatch/combine.
  • Exact degrees: EP=8, CP=2 for CPT; EP=8, CP=8 for RL. MLA attention with latent-vector all-gather trick; Llama-style 2×CP chunk load-balancing [33].
  • Global sequence packing before each RL step to balance DP compute across variable-length rollouts (accounts for quadratic attention cost).
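One way to picture the global packing step is a greedy balancer that assigns rollouts to data-parallel ranks by estimated cost. This is a sketch under assumptions: the report does not give the cost model or the algorithm, so the linear-plus-quadratic cost and its coefficients are hypothetical, and greedy longest-processing-time assignment stands in for whatever the production system uses.

```python
import heapq


def attention_aware_cost(length, linear=1.0, quadratic=1e-4):
    """Hypothetical per-sequence cost: linear term plus quadratic attention term."""
    return linear * length + quadratic * length * length


def balance_across_ranks(lengths, num_ranks):
    """Greedy longest-processing-time assignment of sequences to DP ranks.

    Returns (assignment, loads): per-rank lists of sequence indices and the
    estimated cost per rank.
    """
    heap = [(0.0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    # Place the most expensive sequences first, each on the least-loaded rank.
    order = sorted(range(len(lengths)),
                   key=lambda i: attention_aware_cost(lengths[i]), reverse=True)
    for i in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(i)
        heapq.heappush(heap, (load + attention_aware_cost(lengths[i]), rank))
    loads = [0.0] * num_ranks
    for load, rank in heap:
        loads[rank] = load
    return assignment, loads
```

Balancing on token counts alone would under-weight long rollouts; including the quadratic term means a single very long sequence can occupy a rank by itself while many short ones share another.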

Precision recipe [§6.1]:

  • MoE forward = a novel NVFP4 variant: BF16→FP4E2M1 with FP8E4M3 per-block scales (block size 16) + FP32 per-token scales (per-tensor FP32 scales were "fragile" → batch-variance collapse + future-token leakage/biased grads). MoE backward = standard MXFP8 (FP8E4M3 values, FP8E8M0 scales per 32-element block); the higher precision is affordable because backward runs only on the train cluster. Trainer forward must numerically match inference for stability. IEEE-rounded division (__fdiv_rn) is critical for NVFP4 (fast-approximate division diverges at ~100 RL steps); fast-approximate division is OK for MXFP8.
  • Kernels in CUDA/PTX/ThunderKittens-ParallelKittens [56,59]; FA4 backward (DeepSeek QK192/V128 shapes) co-developed w/ Colfax; GEMMs open-sourced into ThunderKittens [21].
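As a rough picture of the block-scaled FP4 format, the sketch below fake-quantizes values to the FP4 E2M1 magnitude grid with one scale per block of 16. It is deliberately incomplete: block scales are kept as Python floats rather than rounded to FP8 E4M3, the per-token FP32 scale level is omitted, and ties are broken toward the smaller magnitude, so it does not reproduce the report's numerics.

```python
import math

# Magnitudes representable in FP4 E2M1 (sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block_fp4(block):
    """Fake-quantize one block: returns (scale, dequantized values).

    The scale maps the block's largest magnitude onto the grid maximum (6.0).
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in block:
        # Nearest grid magnitude; ties go to the smaller magnitude here.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return scale, out


def quantize_blocks(values, block_size=16):
    """Apply per-block fake quantization over consecutive blocks."""
    out = []
    for start in range(0, len(values), block_size):
        _, deq = quantize_block_fp4(values[start:start + block_size])
        out.extend(deq)
    return out
```

Because each block's scale depends only on values inside that block, one outlier degrades at most 16 neighbours; a scale shared across a whole tensor couples every token in the batch, which is consistent with the fragility the report attributes to per-tensor scales.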

RL infra [§6.2] — 4 decoupled services (training / environments / inference / evals):

  • Training: fully async on Ray [42] + PyTorch, centralized reconciler w/ slot-based sample lifecycle + staleness-balancing scheduler; futures-based eager exec; Ray object store w/ NVMe spill; fault-tolerant to process-group level, warm-standby nodes, live code updates; policy-aware rollout-level + group-level checkpointing (codebase memory snapshots; advantage-tagged sequences w/ policy versions to NFS). Production run spanned 3 GPU regions + 4 CPU regions.
  • Anyrun (verbatim internals): "an internal compute platform built for running untrusted code at scale … the same platform that powers Cloud Agents and Automations." Global router → multiple Anyrun clusters; each cluster schedules >500 pods/sec, manages hundreds of thousands of pods/cluster; each pod = a dedicated Firecracker VM (full dev env incl. browser/GUI for computer use); x86+ARM mix; pressure-aware bin-packing. Forking & snapshotting at filesystem + memory level (→ mid-trajectory checkpoint, post-rollout introspection); same-node fork preferred else live-migrate. Anygress egress proxy (TCP-layer redirect via injected root CA, header stripping). Shadow deployment of the Cursor backend for faithful tools; tools dynamically per-environment (stricter arg checks / tool removal in training).
  • Inference: partner = Fireworks AI. Every step, weights synced to inference via S3 with per-rank delta compression (RL diffs compress to "a handful of GB" for the 1T model); sharded upload/download; geo-distributed US+EU clusters reconstruct from the shared delta chain (no direct train↔inference connectivity).
  • Online evals: pinned production backend + Cursor client per eval job; lease an eval deployment, move GPUs, cross-region weight sync.
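The delta-chain weight sync can be illustrated with a toy codec. The report does not say how deltas are encoded, so the byte-wise XOR plus `zlib` used here is a stand-in chosen for simplicity; the function names are hypothetical, and the S3 transport and per-rank sharding are left out.

```python
import zlib


def make_delta(prev: bytes, new: bytes) -> bytes:
    """Compress the byte-wise XOR of two equally sized weight shards."""
    if len(prev) != len(new):
        raise ValueError("shards must have the same size")
    xor = bytes(a ^ b for a, b in zip(prev, new))
    return zlib.compress(xor, 6)


def apply_delta(prev: bytes, delta: bytes) -> bytes:
    """Recover the new shard from the previous shard and a delta."""
    xor = zlib.decompress(delta)
    return bytes(a ^ b for a, b in zip(prev, xor))


def reconstruct(base: bytes, delta_chain) -> bytes:
    """Replay a chain of deltas on top of a base shard."""
    shard = base
    for delta in delta_chain:
        shard = apply_delta(shard, delta)
    return shard
```

Because each inference cluster only needs the base shard and the shared chain of deltas, it can rebuild the current weights without a direct connection to the trainer, matching the topology described above.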

CursorBench (eval-suite design) [§5]:

  • Internal suite from real Cursor engineering-team agent sessions (avoids train-set contamination). Motivated by 4 failure modes of public benchmarks (domain mismatch, prompt over-specification, contamination/overfit, narrow scope).
  • Quantified hardness vs public sets: median 181 lines changed (vs 7–10 for SWE-bench Verified/Multilingual) and median prompt length 390 chars (vs 1,185–3,055) → larger + more under-specified. Versioned (CursorBench-3 > 2× the median task size of v1; Table 1 uses CursorBench-3).
  • Targeted sub-evals: intent, instruction-following, eager-editing (don't edit when you shouldn't), code-quality (LLM-judge rubrics), interruption (mid-rollout user feedback). Built by "identifying dimensions, selecting eliciting data points, writing rubrics."
  • Headline results (Table 1): Composer 2 = CursorBench 61.3 / SWE-bench Multilingual 73.7 / Terminal-Bench 61.7; Kimi K2.5 base = 36.0 / 65.1 / 47.3 → large RL+CPT lift.

Ablations actually present (for "ablations on the training recipe"):

  1. CPT→RL (Qwen3-Coder-30B, 3 compute levels; Fig. 2) — CE loss predicts RL reward.
  2. KL estimator k1 vs k3 (Fig. 4) — variance argument for k1.
  3. GRPO term removals — length-standardization & std-norm removed (qualitative justification, no head-to-head curve).
  4. Overlong masking — tried, no benefit at small scale, dropped.
  5. NVFP4 scaling scheme (per-token vs per-tensor) and IEEE vs fast-approx division — stability ablations.
  6. best-of-K vs average over training (Fig. 5).

(No single consolidated "leave-one-out recipe component" ablation table; ablations are distributed and partly qualitative.)

Corrections / cautions for the mapping doc

  • [CORRECTION] Optimizer: Composer 2 uses Adam/AdamW, not Muon. Treat "Sharded Muon" as a 2.5-blog-only, unverified-for-2 claim.
  • [CORRECTION] Sharding: report describes FSDP+CP+decoupled-EP (EP=8/CP=2 CPT, EP=8/CP=8 RL), not "HSDP."
  • [CORRECTION] "RL algorithm = PPO/GRPO [EXTRAPOLATED]" → now [REPORT-VERIFIED] Dr. GRPO-style (length-std removed, no std-norm, k1 KL, Adam, single-epoch, MoE router-replay). DAPO overlong-masking explicitly rejected.
  • [CONFIRM] Anyrun real, with full internals (Firecracker VMs, >500 pods/s, fork/snapshot, Anygress).
  • [CONFIRM] base model = Kimi K2.5 1.04T/32B (over GLM-5, DeepSeek V3.2).
  • [CAUTION] Hint mechanism, "25× synthetic tasks", Feature-Deletion generator, reward-hack anecdotes are NOT in this (Composer 2) report — do not cite this PDF for them; they are Composer 2.5-blog material.

Sources

  • [PRIMARY, REPORT-VERIFIED] Cursor Research Team, Composer 2 Technical Report, arXiv:2603.24477 (v1 2026-03-25, v2 2026-03-26; cs.SE/cs.LG; corr. Alexander M. Rush). Full text via PDF https://cursor.com/resources/Composer2.pdf (Tavily advanced extract, full body+refs+App. A–C) and cross-checked via Exa full crawl (identical). HTML/TeX also available at https://arxiv.org/abs/2603.24477, https://arxiv.org/pdf/2603.24477.
  • [SECONDARY] Cursor blog, A technical report on Composer 2 (Sasha Rush) — https://cursor.com/blog/composer-2-technical-report (abstract-level; confirms Kimi K2.5 base + CPT-loss→RL claim).
  • [CONTEXT] Key cited methods: Dr. GRPO (Liu et al., arXiv 2503.20783 [34]); DAPO (Yu et al. [78], 2503.14476/NeurIPS'25); GSPO (Zheng et al., 2507.18071 [82]); DeepSeekMath/GRPO [53]; PipelineRL (2509.19128 [48]); MoE router alignment (Ma et al., 2510.11370 [38]); KL-estimator variance (Amini et al. [6]); Schulman KL note [52]; DeepEP [80]; ThunderKittens/ParallelKittens [56,59].
  • Prior internal note: research/09-composer-blog-delta-2026.md (read first; this note discharges its action item #1 and supplies corrections to the RL-algorithm/optimizer/sharding rows of docs/COMPOSER_RECIPE_MAPPING.md).