# Composer 2 Technical Report — Mining Notes (arXiv:2603.24477)

> **Extraction date:** 2026-05-28.
> **Primary source:** Full text of the **Composer 2 Technical Report** (Cursor Research Team; corresponding author Alexander M. "Sasha" Rush), PDF at `https://cursor.com/resources/Composer2.pdf` and arXiv `2603.24477` (v1 25 Mar 2026, v2 26 Mar 2026; cs.SE / cs.LG; "Aaron Chan and 53 other authors").
> **Method:** `mcp_tavily_tavily_extract` (advanced) on the PDF returned the **complete report body incl. References + Appendices A–C** (~148 KB). Cross-checked against `mcp_exa_crawling_exa` (full re-pull, identical text) and a `mcp_tavily_tavily_search` confirming the arXiv ID, abstract, "Dr. GRPO" passage, and the technical-report blog.
> **Tagging:** **[REPORT-VERIFIED]** = verbatim/paraphrase from the arXiv report. **[SECONDARY]** = blog/third-party. **[ABSENT]** = explicitly looked for, not in the report.
> **Scope note:** This report is **Composer 2**, not Composer **2.5**. Several recipe items the 2.5 blog advertises (targeted textual-feedback/hint distillation, "25× synthetic tasks", Sharded Muon) are **not** in this document — see §3 and the corrections box.

---

## TL;DR — did it resolve the three open questions?

| Open question (from delta note 09) | Resolved? | Answer |
|---|---|---|
| **RL algorithm NAME** | ✅ **YES** | A multi-sample policy-gradient (GRPO-family) algorithm built explicitly on **Dr. GRPO** [34]: GRPO with the **length-standardization term removed** and **no std-dev advantage normalization**. Optimizer = **Adam**, single-epoch, fixed group size, full-parameter. KL via the **k1 estimator (−log r)**. |
| **Data-mix weighting % / generator inventory / token counts** | ⚠️ **PARTIAL** | CPT is a **3-phase code-dominated mix** (32k → 256k → SFT) but **no %s and no token counts** are given. RL task mix is given only as a **category histogram (Fig. 3)**, not generator names or weights. No "Feature Deletion" generator inventory (that was 2.5-blog). |
| **HINT-generation mechanism (targeted textual feedback)** | ❌ **ABSENT** | **The hint/teacher-student textual-feedback mechanism is NOT in the Composer 2 report at all.** It is a Composer **2.5** feature. Composer 2 shapes behavior with **auxiliary scalar rewards + a nonlinear length penalty**, not hint distillation. The #1 reproducibility gap remains unresolved by this artifact. |

**Net:** The report fully answers the RL-algorithm question (the single biggest win), partially answers data-mix, and does **not** touch hint generation. It also delivers a large amount of previously unstated **infrastructure** detail (Anyrun internals, async RL stack, MoE router replay, precision recipe), **corrects** two prior assumptions (optimizer is Adam/AdamW, not Muon; sharding is FSDP + CP + decoupled EP, not HSDP), and **confirms** the base model (Kimi K2.5, 1.04T/32B).

---

## 1. Data generation / CPT data-mix / curriculum  [§3, §4]

### 1.1 Continued pretraining (CPT) — [REPORT-VERIFIED]
- Base model = **Kimi K2.5** [67], a **1.04T-param / 32B-active MoE** (Appendix B; selected over GLM-5 and DeepSeek V3.2 on internal *FreshBench* knowledge, *State Tracking* (LoCoDiff-style), and *codebase perplexity*; agentic benchmarks **deliberately excluded** from base-model selection "as agentic and long-horizon capabilities can drastically change during the RL stage").
- CPT is **"a large code-dominated data mix"** done in **three phases**:
  1. **Bulk of compute at 32k sequence length**,
  2. a shorter **long-context extension phase to 256k**,
  3. a short **SFT phase on targeted coding tasks**.
- Training: **MXFP8 on NVIDIA B300s**, **AdamW** optimizer. Eval loss on internal codebase **"decreases log-linearly"** over the run.
- **Causal CPT→RL claim (the justification for doing CPT):** they replicate the recipe on **Qwen3-Coder-30B-A3B** at **three log-spaced compute levels (small/medium/large)**, each + identical SFT + identical RL run, and show **"cross-entropy loss is … predictive of downstream RL performance"** (Fig. 2). → Direct support for our "start from an already-code-strong base" decision.
- **Multi-Token Prediction (MTP):** extra MTP layers [17,11] trained from scratch on the same mix for speculative decoding, via **self-distillation** to the main LM head's logits; MTP layers cut from the **middle** of the CPT run and trained jointly during the long-context + SFT phases. *(This is the only "self-distillation" in the report — it is for MTP/spec-decode, NOT for hints.)*
- **[ABSENT]** No data-mix percentages, no token/byte counts, no list of CPT data sources.

### 1.2 RL task distribution & dynamic curriculum — [REPORT-VERIFIED]
- RL tasks **"run in environments that emulate real Cursor sessions as closely as possible."** Problem distribution **"reflects the most common use cases"**; **Fig. 3** gives the category breakdown (x-axis "% of Problems", ~0–40%): **Iterate On Feature, Debugging, New Feature, Refactor, Understanding Codebase, Documentation, Testing, Code Review, Optimize, Devops, Migration, Deletion, Other.** *(This is the closest the report gets to a "data mix" — categorical, not weighted %s, no generator names.)*
- **Dynamic difficulty curriculum (verbatim):** *"In later stages of training, we use simple heuristics—such as **number of turns and thinking tokens of rollouts**—to **upsample increasingly harder data points**."* → Confirms delta note 09's "select for harder tasks dynamically" as an **online up-sampling gate keyed on turns + thinking-token count**. Replication handle: rank tasks by rollout length/turn-count, up-weight the long-tail late in training.
- **[ABSENT]** No synthetic-task **generator inventory** (no "Feature Deletion" et al.), no "25× synthetic tasks" figure, no synthetic-vs-real split. Those are Composer **2.5**-blog claims and are **not** in this report.
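
The up-sampling heuristic in the dynamic-curriculum bullet above can be sketched as follows. This is a minimal illustration, not the report's method: the report only says "simple heuristics" keyed on turns and thinking tokens, so the difficulty proxy, the scale constants, `max_boost`, and the linear ramp on `progress` are all assumptions. Sampling is done without replacement (Efraimidis-Spirakis keys) to stay consistent with the single-epoch regime.

```python
import random
from dataclasses import dataclass

@dataclass
class TaskStats:
    task_id: str
    mean_turns: float            # average turns per rollout on this task
    mean_thinking_tokens: float  # average thinking tokens per rollout

def difficulty_weights(tasks, progress, turn_scale=20.0, token_scale=8000.0, max_boost=4.0):
    """One sampling weight per task; `progress` in [0, 1] is the fraction of training done.

    All constants are hypothetical. At progress=0 the weights are uniform; later,
    tasks whose rollouts are long (many turns, many thinking tokens) are boosted.
    """
    weights = []
    for t in tasks:
        turns_term = min(t.mean_turns / turn_scale, 1.0)
        tokens_term = min(t.mean_thinking_tokens / token_scale, 1.0)
        difficulty = 0.5 * (turns_term + tokens_term)  # assumed proxy in [0, 1]
        weights.append(1.0 + progress * (max_boost - 1.0) * difficulty)
    return weights

def sample_without_replacement(tasks, weights, k, rng):
    """Efraimidis-Spirakis weighted sampling: key = u ** (1 / w), keep the k largest keys."""
    keyed = [(rng.random() ** (1.0 / w), t) for t, w in zip(tasks, weights)]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in keyed[:k]]

tasks = [
    TaskStats("rename-variable", mean_turns=3, mean_thinking_tokens=400),
    TaskStats("multi-file-refactor", mean_turns=25, mean_thinking_tokens=9000),
    TaskStats("flaky-test-debug", mean_turns=12, mean_thinking_tokens=5000),
]
early = difficulty_weights(tasks, progress=0.0)   # uniform
late = difficulty_weights(tasks, progress=1.0)    # long-rollout tasks boosted
batch = sample_without_replacement(tasks, late, k=2, rng=random.Random(0))
```

A real pipeline would refresh `mean_turns` and `mean_thinking_tokens` from recent rollouts of similar tasks, since each individual prompt is seen only once.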

---

## 2. RL ALGORITHM  [§4.1] — [REPORT-VERIFIED], the headline result

**Algorithm family:** *"a policy gradient algorithm with multiple samples per prompt [53 = DeepSeekMath/GRPO, 2 = REINFORCE-style RLOO] and a fixed group size."* Operates in the **single-epoch regime** (a prompt is **never trained on twice**). **Adam** optimizer; **full-parameter** update. Highly **asynchronous** (independent train + rollout workers).

**Specific GRPO modifications (the "name" + the deltas):**
- Built on **Dr. GRPO** [34 = Liu et al., *Understanding R1-Zero-like training*, arXiv 2503.20783]: verbatim *"As in Dr. GRPO, … crucial to minimize the bias in the gradients that can arise from transforming the underlying advantage."*
- **Remove the length-standardization term from GRPO** (it "introduces a length bias").
- **Do NOT normalize group advantages by their standard deviation** — std-norm "results in the degenerate case where small behavioral differences get massively upweighted within a group where every rollout achieves equal correctness."
- **Overlong-rollout masking [78 = DAPO/Yu et al.]: NOT used.** They *"did not see benefits with overlong masking at small scale and opted not to mask rollouts that exceed the maximum sequence length"*; the self-summary system limits overlong cases anyway. *(So: Dr. GRPO-style, explicitly NOT DAPO's overlong masking; DAPO [78] and GSPO [82] are cited but as related work / for router-replay, not adopted wholesale.)*
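
A minimal sketch of the two GRPO deltas listed above: advantages are mean-centered only, and the token-level loss is divided by a fixed constant rather than by each rollout's length. It is a plain REINFORCE-style surrogate; importance ratios, clipping, and the KL term are omitted, and the function names and `norm_constant` are illustrative assumptions rather than anything named in the report.

```python
def group_advantages(rewards, std_normalize=False, eps=1e-6):
    """Advantages for one prompt's group of rollouts.

    std_normalize=False is the Dr. GRPO-style choice (mean-centering only);
    std_normalize=True reproduces the original GRPO normalization for comparison.
    """
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if not std_normalize:
        return centered
    variance = sum(c * c for c in centered) / len(rewards)
    return [c / (variance ** 0.5 + eps) for c in centered]

def policy_gradient_loss(token_logprobs, advantages, norm_constant):
    """Surrogate loss over a group.

    token_logprobs: one list per rollout with the log-probs of its sampled tokens.
    Token terms are summed and divided by a fixed `norm_constant` (for example the
    maximum sequence length), not by each rollout's own length, so long and short
    rollouts are not reweighted by length.
    """
    total = 0.0
    for logps, adv in zip(token_logprobs, advantages):
        total += -adv * sum(logps)
    return total / (len(token_logprobs) * norm_constant)

# Degenerate case from the report: equal correctness, tiny behavioral difference.
rewards = [1.0, 1.0, 1.0, 1.001]
plain = group_advantages(rewards)                       # stays tiny
scaled = group_advantages(rewards, std_normalize=True)  # blown up to order 1

logps = [[-0.1, -0.2], [-0.3], [-0.2, -0.2, -0.1], [-0.05]]
loss = policy_gradient_loss(logps, plain, norm_constant=4.0)
```

With std-normalization the last rollout's advantage is above 1 even though its reward differs by only 0.001, which is the upweighting the report warns about.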

**KL regularization — exact formulation [§4.1, Fig. 4]:**
- Uses **KL(q‖p) = E_{x∼q}[−log r(x)], r(x)=p(x)/q(x)** for regularization (like DeepSeekMath [53] and Kimi k1.5 [66]).
- **Chooses the k1 estimator `k1 = −log r`** over the popular **k3 = (r−1) − log r** [Schulman 52], because (citing Amini et al. [6]) k3's variance "increases drastically as p and q diverge": at large KL the k3 estimate variance is "extremely large." (k2 has low bias but is still biased, per their note.) → **Replication handle: use the simple `−log r` KL penalty, not the (also unbiased) k3 estimator, for agentic long-horizon RL.**
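
The variance argument can be checked numerically with a toy case. The sketch below assumes q = N(0, 1) and p = N(mu, 1) (my choice, not the report's setup), so that log r(x) = mu·x − mu²/2 and the true KL(q‖p) is mu²/2. Both k1 and k3 are unbiased here; the difference is how their variance behaves as the distributions move apart.

```python
import math
import random

def kl_estimator_stats(mu, n=100_000, seed=0):
    """Monte Carlo mean and variance of k1 and k3 for q = N(0,1), p = N(mu,1)."""
    rng = random.Random(seed)
    k1_samples, k3_samples = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                 # x ~ q
        log_r = mu * x - 0.5 * mu * mu          # log p(x) - log q(x)
        k1_samples.append(-log_r)               # k1 = -log r
        k3_samples.append(math.expm1(log_r) - log_r)  # k3 = (r - 1) - log r

    def mean(values):
        return sum(values) / len(values)

    def variance(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    return {
        "true_kl": 0.5 * mu * mu,
        "k1_mean": mean(k1_samples),
        "k1_var": variance(k1_samples),
        "k3_mean": mean(k3_samples),
        "k3_var": variance(k3_samples),
    }

near = kl_estimator_stats(mu=0.1)  # p close to q: k3 has the lower variance
far = kl_estimator_stats(mu=2.0)   # p far from q: k3 variance exceeds k1 variance
```

In this toy case the analytic variances are mu² for k1 and roughly exp(mu²) for k3 at large mu, which is consistent with the report's preference for k1 when the policy can drift far from the reference.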

**Async-rollout infra / off-policy control [§4.1, §6.2]:**
- Minimize off-policyness via **fast weight sync + in-flight (mid-rollout) weight updates**, *"similar to **PipelineRL** [48]"* — inference workers update weights mid-rollout so later tokens are less off-policy.
- **MoE router replay [38, 82]:** inference engine returns selected expert indices per token per MoE layer; training forward pass **overrides the router's expert assignment to match** (router still computes gating scores so gradients flow). They **extend** replay by **filtering replayed experts whose gating scores fall below a plausibility threshold from the router's own top-k, replacing them with the router's candidates** — reduces p99 numerics mismatch between inference and training forward passes. *(Critical for MoE-base RL stability; directly relevant if we RL a MoE.)*
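
One way to read the filtered router replay above, as a per-token sketch. The exact plausibility rule is not given in these notes, so the rule used here is an assumption: a replayed expert is kept only if its gating score is at least `ratio` times the score of the router's own k-th choice, and dropped experts are replaced by the router's top-k candidates in score order. `replay_with_filter` and `ratio` are hypothetical names, and scores are assumed non-negative (softmax or sigmoid gates).

```python
def replay_with_filter(scores, replayed, k, ratio=0.5):
    """Choose the k experts used in the training forward pass for one token.

    scores:   gating scores from the trainer's router, one per expert
    replayed: expert indices the inference engine selected for this token
    k:        number of experts per token
    ratio:    assumed plausibility threshold relative to the router's k-th score
    """
    router_topk = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    floor = ratio * scores[router_topk[-1]]

    # Keep replayed experts the trainer's router still finds plausible.
    kept = [e for e in replayed if scores[e] >= floor]

    # Fill any gaps with the router's own candidates, best first.
    for e in router_topk:
        if len(kept) == k:
            break
        if e not in kept:
            kept.append(e)
    return kept

scores = [0.40, 0.30, 0.20, 0.05, 0.04, 0.01]
implausible = replay_with_filter(scores, replayed=[0, 5], k=2)  # expert 5 is replaced
plausible = replay_with_filter(scores, replayed=[0, 2], k=2)    # replay kept as-is
```

Pure replay corresponds to `ratio=0.0`; raising `ratio` trades exact inference/training agreement for fewer tokens routed to experts the trainer scores as very unlikely.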

**Reward structure [§4.1–4.2]:**
- Reward based on **"code's correctness, succinctness, and conformance to software engineering principles."**
- **best-of-K does NOT trade off vs average:** both rise together over training (Fig. 5) → RL is *expanding* solution coverage, not just sharpening (notable vs the "RL only concentrates mass" literature [79,32,8,74,61]).

**Reward-hacking safeguards — [ABSENT/THIN]:** This report does **not** contain the Python-typecheck-cache / Java-bytecode reward-hack anecdotes (those are 2.5-blog). The only related safeguards here are **strict tool-argument checks** and **tool removal for steerability** in training environments (§6.2), and general monitoring for **emergent behaviors** (§4.2). No dedicated "agentic monitoring tool" section.

---

## 3. Targeted textual feedback / hint distillation  — **[ABSENT]**

**Finding: The Composer 2 technical report contains NO hint-generation / teacher-student textual-feedback / on-policy KL-to-hint-conditioned-teacher mechanism.** Searched the full text for hint / teacher / student / textual feedback / distill — the only "distillation" is **MTP self-distillation to the LM head's logits** (§3.1, spec-decode), unrelated to behavior shaping.

**What Composer 2 does for behavior shaping instead [§4.2 "Agent Behavior"] — [REPORT-VERIFIED]:**
- **Auxiliary scalar rewards**, not hints: *"we apply an array of auxiliary rewards … rewards for coding style, communication, and product-specific penalties for poor tool calls, such as creating to-do list items and then leaving them unfinished."*
- **Reactive reward addition:** they "monitor the model for emergent behaviors and occasionally introduce additional behavior rewards" (examples observed: leaving long CoT in code comments; collapsing to terminal-tool-only).
- **Nonlinear length / effort penalty (exact equation):**
  `C_length{k,q}(x) = ((1 + k·x)^{1−q} − 1) / (k·(1−q))`, concave-down & increasing, where **x = a weighted combination of {thinking tokens, tool-calling tokens, tool-output tokens, final-message tokens, # tool calls, # turns}** and `k, q` are curvature hyperparameters (Fig. 6). Goal: be quick on easy tasks, think longer on hard tasks; observed to induce **parallel tool calls**.
- **Self-Summarization [§4.1, from Composer 1.5 [64]]:** rollouts are chains joined by self-summaries; **final reward is assigned to all tokens in the chain** (up-weights good agent turns *and* the summaries that enabled them; down-weights lossy summaries). Reduces error vs prompt-based compaction while using fewer tokens and reusing KV cache.
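
The length-penalty equation above is easy to evaluate directly. The sketch below implements the formula as transcribed in these notes; the effort weights are hypothetical placeholders (the report gives no values), and the q = 1 branch is my own limit of the formula, log(1 + k·x) / k, added only to avoid a division by zero. For k > 0 and 0 < q the curve is increasing and concave, and q = 0 reduces to the linear penalty x.

```python
import math

def effort(stats, weights):
    """Weighted combination of per-rollout usage counters (x in the equation)."""
    return sum(weights[name] * stats[name] for name in weights)

def length_penalty(x, k, q):
    """C(x) = ((1 + k*x)**(1 - q) - 1) / (k * (1 - q)), for k > 0 and x >= 0."""
    if k <= 0.0 or x < 0.0:
        raise ValueError("expects k > 0 and x >= 0")
    if math.isclose(q, 1.0):
        return math.log1p(k * x) / k  # limit of the formula as q -> 1
    return ((1.0 + k * x) ** (1.0 - q) - 1.0) / (k * (1.0 - q))

# Hypothetical weights: the report lists the six ingredients but not their weights.
HYPOTHETICAL_WEIGHTS = {
    "thinking_tokens": 1.0,
    "tool_call_tokens": 1.0,
    "tool_output_tokens": 0.25,
    "final_message_tokens": 1.0,
    "num_tool_calls": 50.0,
    "num_turns": 100.0,
}

rollout = {
    "thinking_tokens": 1200,
    "tool_call_tokens": 300,
    "tool_output_tokens": 4000,
    "final_message_tokens": 200,
    "num_tool_calls": 6,
    "num_turns": 4,
}
x = effort(rollout, HYPOTHETICAL_WEIGHTS)
penalty = length_penalty(x, k=0.01, q=0.5)
```

Because the curve flattens as x grows, doubling the effort on an already long rollout costs less than double the penalty, which matches the stated goal of staying quick on easy tasks while still allowing long work on hard ones.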

> **Implication for the replication framework:** To reproduce Composer **2.5**'s hint mechanism we still must look elsewhere — the **SDPO (arXiv 2601.20802) / OPSD (2601.18734)** papers from delta note 09 remain the only formalizations, and **how Cursor generates the hint text itself is still unstated in every Cursor artifact.** Composer 2's behavior shaping (auxiliary rewards + the length-penalty equation above) is a **fully reproducible, hint-free alternative** we can adopt for v0.1.

---

## 4. Other replication-relevant detail [§6 Infrastructure, §5 CursorBench, App.] — [REPORT-VERIFIED]

**Optimizer — CORRECTION:** report says **AdamW (CPT) / Adam (RL)**. **There is NO "Sharded Muon" in the Composer 2 report** — the Muon claim came from the 2.5 blog and should be tagged 2.5-only / re-verified, not assumed for Composer 2.

**Parallelism / sharding layout — CORRECTION to "HSDP":**
- Prior stacks used **FSDP + EP + TP** (EP coupled to TP). **Composer 2 decouples EP from TP** and uses **Context Parallelism (CP)** as the primary long-context axis (less comm than TP; CP folded into the FSDP dim). **No mention of "HSDP"** — the doc says **FSDP/ZeRO [50,81] + CP + decoupled EP**, **DeepEP** [80] for token dispatch/combine.
- **Exact degrees:** **EP=8, CP=2 for CPT**; **EP=8, CP=8 for RL.** MLA attention with latent-vector all-gather trick; Llama-style 2×CP chunk load-balancing [33].
- **Global sequence packing** before each RL step to balance DP compute across variable-length rollouts (accounts for quadratic attention cost).
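
A rough sketch of cost-aware balancing across data-parallel ranks, in the spirit of the global sequence packing bullet above. The report's actual packing algorithm is not described in these notes; this uses greedy longest-processing-time assignment with an assumed cost model of the form a·L + b·L², where the coefficients are placeholders.

```python
import heapq

def attention_aware_cost(length, linear=1.0, quadratic=1e-4):
    """Assumed per-sequence cost: linear term (MLP/MoE) plus quadratic term (attention)."""
    return linear * length + quadratic * length * length

def pack_sequences(lengths, num_ranks):
    """Assign sequence indices to ranks, always giving the next-costliest
    sequence to the currently least-loaded rank (greedy LPT)."""
    heap = [(0.0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    order = sorted(range(len(lengths)),
                   key=lambda i: attention_aware_cost(lengths[i]),
                   reverse=True)
    for i in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(i)
        heapq.heappush(heap, (load + attention_aware_cost(lengths[i]), rank))
    loads = [sum(attention_aware_cost(lengths[i]) for i in seqs) for seqs in assignment]
    return assignment, loads

lengths = [32000, 1000, 1000, 1000, 16000, 16000, 500, 8000]
assignment, loads = pack_sequences(lengths, num_ranks=2)
```

Balancing on token count alone would treat one 32k rollout like thirty-two 1k rollouts; with the quadratic term the single long rollout is recognized as far more expensive, so the other rank absorbs nearly everything else.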

**Precision recipe [§6.1]:**
- **MoE forward = a novel NVFP4 variant**: BF16→FP4E2M1 with **FP8E4M3 per-block scales (block 16) + FP32 per-token scales** (per-tensor FP32 scales were "fragile" → batch-variance collapse + future-token leakage/biased grads). **MoE backward = standard MXFP8** (FP8E4M3 values, FP8E8M0 scales per 32-element block); they can afford the higher precision because backward runs only on the train cluster. Trainer forward must **numerically match inference** for stability. IEEE-compliant division (`__fdiv_rn`) is **critical** for NVFP4 (fast-approximate division diverges after ~100 RL steps); fast-approximate division is OK for MXFP8.
- Kernels in **CUDA/PTX/ThunderKittens-ParallelKittens** [56,59]; FA4 backward (DeepSeek QK192/V128 shapes) co-developed w/ Colfax; GEMMs open-sourced into ThunderKittens [21].
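
To make the two-level scaling in the precision recipe concrete, here is a pure-Python fake-quantization sketch for one token's activation vector. It is an illustration under assumptions, not the report's kernel: scales are derived from absolute maxima (the report's derivation is not given in these notes), the E4M3 rounding ignores subnormals and the exponent floor, and all names are hypothetical. Plain Python division is IEEE-compliant, which is the property the report says matters for NVFP4.

```python
import math
import random

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes representable in FP4 E2M1
E2M1_MAX = 6.0
E4M3_MAX = 448.0

def round_to_e4m3(x):
    """Round a non-negative float to 4 significant bits (1 implicit + 3 mantissa).

    Simplification: subnormals and the minimum exponent of real E4M3 are ignored.
    """
    if x <= 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)   # x = mantissa * 2**exponent, 0.5 <= mantissa < 1
    rounded = round(mantissa * 16.0) / 16.0
    return min(math.ldexp(rounded, exponent), E4M3_MAX)

def nearest_e2m1(magnitude):
    """Snap a non-negative magnitude to the closest E2M1 value, saturating at 6."""
    clipped = min(magnitude, E2M1_MAX)
    return min(E2M1_GRID, key=lambda g: abs(g - clipped))

def fake_quantize_token(values, block=16):
    """Return (codes, dequantized) for one token.

    token_scale is the per-token scale kept in full precision;
    block_scale is the per-block scale rounded to the simplified E4M3 format.
    """
    amax_token = max(abs(v) for v in values)
    if amax_token == 0.0:
        return [0.0] * len(values), [0.0] * len(values)
    token_scale = amax_token / (E2M1_MAX * E4M3_MAX)
    codes, dequantized = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax_block = max(abs(v) for v in chunk)
        block_scale = round_to_e4m3(amax_block / (E2M1_MAX * token_scale))
        step = block_scale * token_scale
        for v in chunk:
            code = 0.0 if step == 0.0 else math.copysign(nearest_e2m1(abs(v) / step), v)
            codes.append(code)
            dequantized.append(code * step)
    return codes, dequantized

rng = random.Random(0)
activations = [rng.gauss(0.0, 1.0) for _ in range(32)]
codes, dequantized = fake_quantize_token(activations)
```

Because `token_scale` depends only on the token's own values, quantizing one token never depends on other tokens in the batch or on later positions, which is one plausible reading of why per-token scales avoid the leakage the report attributes to per-tensor scales.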

**RL infra [§6.2] — 4 decoupled services (training / environments / inference / evals):**
- **Training:** fully async on **Ray [42] + PyTorch**, centralized **reconciler** w/ slot-based sample lifecycle + staleness-balancing scheduler; **futures**-based eager exec; Ray object store w/ NVMe spill; fault-tolerant to process-group level, warm-standby nodes, live code updates; **policy-aware rollout-level + group-level checkpointing** (codebase memory snapshots; advantage-tagged sequences w/ policy versions to NFS). Production run spanned **3 GPU regions + 4 CPU regions.**
- **Anyrun (verbatim internals):** *"an internal compute platform built for running untrusted code at scale … the same platform that powers Cloud Agents and Automations."* Global router → multiple Anyrun clusters; each cluster schedules **>500 pods/sec**, manages **hundreds of thousands of pods/cluster**; **each pod = a dedicated Firecracker VM** (full dev env incl. browser/GUI for computer use); x86+ARM mix; pressure-aware bin-packing. **Forking & snapshotting at filesystem + memory level** (→ mid-trajectory checkpoint, post-rollout introspection); same-node fork preferred else live-migrate. **Anygress** egress proxy (TCP-layer redirect via injected root CA, header stripping). **Shadow deployment of the Cursor backend** for faithful tools; tools dynamically per-environment (stricter arg checks / tool removal in training).
- **Inference:** **partner = Fireworks AI.** Every step, weights synced to inference via **S3 with per-rank delta compression** (RL diffs compress to "a handful of GB" for the 1T model); sharded upload/download; geo-distributed US+EU clusters reconstruct from the shared delta chain (no direct train↔inference connectivity).
- **Online evals:** pinned production backend + Cursor client per eval job; lease an eval deployment, move GPUs, cross-region weight sync.
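
The weight-sync idea (ship a compressed per-rank delta instead of full weights) can be sketched as follows. The report's actual encoding is not described in these notes; this sketch assumes an XOR of BF16 bit patterns followed by a general-purpose compressor (`zlib`), with BF16 obtained by simple truncation of float32. Function names are hypothetical, and the S3 transport and delta-chain bookkeeping are omitted.

```python
import random
import struct
import zlib

def to_bf16_bits(values):
    """Truncate each float32 to its top 16 bits (BF16 by truncation, not round-to-nearest)."""
    return [struct.unpack("<I", struct.pack("<f", v))[0] >> 16 for v in values]

def encode_delta(prev_bits, new_bits):
    """XOR new against previous weights and compress; unchanged weights become zeros."""
    xored = [a ^ b for a, b in zip(prev_bits, new_bits)]
    raw = struct.pack(f"<{len(xored)}H", *xored)
    return zlib.compress(raw, 6)

def apply_delta(prev_bits, blob):
    """Reconstruct the new weights from the previous weights plus one delta blob."""
    raw = zlib.decompress(blob)
    xored = struct.unpack(f"<{len(raw) // 2}H", raw)
    return [a ^ b for a, b in zip(prev_bits, xored)]

# Toy shard: a small relative update leaves most BF16 values bit-identical.
rng = random.Random(0)
prev_weights = [rng.gauss(0.0, 0.02) for _ in range(20_000)]
new_weights = [w * (1.0 + 1e-4 * rng.gauss(0.0, 1.0)) for w in prev_weights]

prev_bits = to_bf16_bits(prev_weights)
new_bits = to_bf16_bits(new_weights)
delta_blob = encode_delta(prev_bits, new_bits)
full_blob = zlib.compress(struct.pack(f"<{len(new_bits)}H", *new_bits), 6)
restored = apply_delta(prev_bits, delta_blob)
```

A receiver that holds the weights from step t−1 needs only `delta_blob` to reach step t, so clusters without direct connectivity to the trainer can replay the chain of deltas from shared storage.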

**CursorBench (eval-suite design) [§5]:**
- Internal suite from **real Cursor engineering-team agent sessions** (avoids train-set contamination). Motivated by 4 failure modes of public benchmarks (domain mismatch, prompt over-specification, contamination/overfit, narrow scope).
- **Quantified hardness vs public sets:** median **181 lines changed** (vs 7–10 for SWE-bench Verified/Multilingual) and median prompt length **390 chars** (vs 1,185–3,055) → tasks are larger and more under-specified. Versioned (**CursorBench-3** has more than 2× the median task size of v1; Table 1 uses CursorBench-3).
- **Targeted sub-evals:** intent, instruction-following, **eager-editing** (don't edit when you shouldn't), code-quality (LLM-judge rubrics), **interruption** (mid-rollout user feedback). Built by "identifying dimensions, selecting eliciting data points, writing rubrics."
- **Headline results (Table 1):** Composer 2 = **CursorBench 61.3 / SWE-bench Multilingual 73.7 / Terminal-Bench 61.7**; Kimi K2.5 base = 36.0 / 65.1 / 47.3 → large RL+CPT lift.

**Ablations actually present (for "ablations on the training recipe"):**
1. **CPT→RL** (Qwen3-Coder-30B, 3 compute levels; Fig. 2) — CE loss predicts RL reward.
2. **KL estimator** k1 vs k3 (Fig. 4) — variance argument for k1.
3. **GRPO term removals** — length-standardization & std-norm removed (qualitative justification, no head-to-head curve).
4. **Overlong masking** — tried, no benefit at small scale, dropped.
5. **NVFP4 scaling scheme** (per-token vs per-tensor) and **IEEE vs fast-approx division** — stability ablations.
6. **best-of-K vs average** over training (Fig. 5).
*(No single consolidated "leave-one-out recipe component" ablation table; ablations are distributed and partly qualitative.)*

---

## Corrections / cautions for the mapping doc

- **[CORRECTION] Optimizer:** Composer **2** uses **Adam/AdamW**, **not Muon**. Treat "Sharded Muon" as a **2.5-blog-only, unverified-for-2** claim.
- **[CORRECTION] Sharding:** report describes **FSDP+CP+decoupled-EP (EP=8/CP=2 CPT, EP=8/CP=8 RL)**, **not "HSDP."**
- **[CORRECTION] "RL algorithm = PPO/GRPO `[EXTRAPOLATED]`"** → now **[REPORT-VERIFIED] Dr. GRPO-style** (length-std removed, no std-norm, k1 KL, Adam, single-epoch, MoE router-replay). DAPO overlong-masking explicitly rejected.
- **[CONFIRM] Anyrun** real, with full internals (Firecracker VMs, >500 pods/s, fork/snapshot, Anygress).
- **[CONFIRM] base model = Kimi K2.5 1.04T/32B** (over GLM-5, DeepSeek V3.2).
- **[CAUTION] Hint mechanism, "25× synthetic tasks", Feature-Deletion generator, reward-hack anecdotes are NOT in this (Composer 2) report** — do not cite this PDF for them; they are Composer 2.5-blog material.

---

## Sources

- **[PRIMARY, REPORT-VERIFIED]** Cursor Research Team, *Composer 2 Technical Report*, arXiv:**2603.24477** (v1 2026-03-25, v2 2026-03-26; cs.SE/cs.LG; corr. Alexander M. Rush). Full text via PDF `https://cursor.com/resources/Composer2.pdf` (Tavily advanced extract, full body+refs+App. A–C) and cross-checked via Exa full crawl (identical). arXiv abstract page and PDF: `https://arxiv.org/abs/2603.24477`, `https://arxiv.org/pdf/2603.24477`.
- **[SECONDARY]** Cursor blog, *A technical report on Composer 2* (Sasha Rush) — `https://cursor.com/blog/composer-2-technical-report` (abstract-level; confirms Kimi K2.5 base + CPT-loss→RL claim).
- **[CONTEXT]** Key cited methods: Dr. GRPO (Liu et al., arXiv 2503.20783 [34]); DAPO (Yu et al. [78], 2503.14476/NeurIPS'25); GSPO (Zheng et al., 2507.18071 [82]); DeepSeekMath/GRPO [53]; PipelineRL (2509.19128 [48]); MoE router alignment (Ma et al., 2510.11370 [38]); KL-estimator variance (Amini et al. [6]); Schulman KL note [52]; DeepEP [80]; ThunderKittens/ParallelKittens [56,59].
- **Prior internal note:** `research/09-composer-blog-delta-2026.md` (read first; this note discharges its action item #1 and supplies corrections to the RL-algorithm/optimizer/sharding rows of `docs/COMPOSER_RECIPE_MAPPING.md`).