# Deep-Read: Cursor Artifacts — Composer 2.5 Blog + Composer 2 Technical Report

> **Written:** 2026-06-09.
> **Primary sources fetched directly:**
> - Composer 2.5 blog — full body retrieved from vault note `introducing-composer-25-cursor` (sourced from `https://cursor.com/blog/composer-2-5`).
> - Composer 2 Technical Report — full HTML body retrieved from `https://arxiv.org/html/2603.24477` (arXiv:2603.24477 v2, 26 Mar 2026), stored as vault note `composer-2-technical-report` (~11 900 words, all sections including appendices).
> **Repo notes cross-checked:** `research/01-composer-2.5.md`, `research/09-composer-blog-delta-2026.md`, `research/10-composer2-techreport-mining.md`, `research/06-feature-deletion-datagen.md`, `research/07-sdpo-hint-generator.md`.
> **Tag conventions:** `[SOURCE-VERBATIM]` = quote taken verbatim from the fetched primary source. `[REPO-CLAIM]` = what the repo's notes assert. `[FINDING]` = discrepancy or new fact. `[CONFIRM]` = repo claim verified against primary. `[EXTRAPOLATED]` = design inference that is not stated in either primary source.

---

## PART 1 — What the primary sources actually say (verbatim extractions)

### 1A. Synthetic data — every word the 2.5 blog says

The entirety of the "Synthetic data" section of the Composer 2.5 blog (fetched verbatim):

> [SOURCE-VERBATIM] *"During RL training, Composer's coding ability improves substantially to the point where it begins to get most training problems correct. To continue increasing intelligence, we both select for and create harder tasks dynamically throughout the run. Composer 2.5 is trained with 25x more synthetic tasks than Composer 2.*
>
> *We use a range of approaches for creating synthetic tasks that are grounded in real codebases. For example, one synthetic approach is feature deletion. For these tasks the agent is given a codebase with a large set of tests, and asked to delete code and files in such a way that the codebase remains functional while specific testable features are removed. The synthetic task is to reimplement the feature, and the tests are used as a verifiable reward.*
>
> *One downstream consequence of large scale synthetic task creation is that it can cause unexpected reward hacking. As the model became more adept, Composer 2.5 was able to find increasingly sophisticated workarounds to solve the task at hand. In one example, the model found a leftover Python type-checking cache and reverse-engineered the format to find a deleted function signature. In another, it was able to find and decompile Java bytecode to reconstruct a third-party API. We were able to find and diagnose these problems using agentic monitoring tools, but they demonstrate the increasing care necessary for large scale RL."*

**Critical observations from this text:**

1. **The "25x more synthetic tasks" is relative to Composer 2**, not an absolute count. No absolute count is given in any source. The Composer 2 report does not mention synthetic tasks at all (it uses real-problem distributions).

2. **Feature deletion has two phases, but the blog does not separate them by actor.** In the construction phase, "the agent" is *"asked to delete code and files in such a way that the codebase remains functional while specific testable features are removed."* In the training phase, the synthetic task is to reimplement the feature. The blog does NOT say what "the agent" in the construction phase is (model? program? human-in-the-loop pipeline?), nor whether it is the same policy that is later trained.

3. **"A range of approaches"** — the blog explicitly says feature deletion is "one synthetic approach" among a "range." No other generators are named anywhere in either source. The total number of generators is completely unspecified.

4. **"Select for and create harder tasks dynamically throughout the run"** — this is a SINGLE sentence that bundles two distinct operations: (a) online difficulty filtering/selection ("select for"), and (b) active task generation ("create"). Neither operation's implementation is described.

5. **"Agentic monitoring tools"** — the only stated mitigation for reward hacking. Absolutely no technical detail about these tools is given. No reward penalties, no static analysis, no sandbox specifications.

6. **The dynamic curriculum signal is implicit.** The mechanism that drives "select for harder" is unstated. The Composer 2 report (§4) provides the only concrete handle: *"In later stages of training, we use simple heuristics—such as number of turns and thinking tokens of rollouts—to upsample increasingly harder data points."* This is from Composer **2**, not 2.5 — but it is the only stated heuristic.
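
Both reward-hack examples in the blog relied on leftover build artifacts (a Python type-checking cache, Java bytecode). The blog names only "agentic monitoring tools" as a mitigation, so the following is an [EXTRAPOLATED] illustration and not Cursor's method: a hypothetical pre-rollout scan that flags artifacts which could leak deleted code. The function name and the pattern lists are assumptions.

```python
import os

# Hypothetical artifact patterns, chosen only because the blog's two examples
# involved a Python type-checking cache and Java bytecode. Not a Cursor list.
LEAKY_DIRS = {".mypy_cache", ".pytype", "__pycache__"}
LEAKY_SUFFIXES = (".pyc", ".class", ".jar")


def find_leaky_artifacts(repo_root: str) -> list[str]:
    """Return sorted relative paths of files/dirs that could leak deleted code."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        for d in list(dirnames):
            if d in LEAKY_DIRS:
                hits.append(os.path.relpath(os.path.join(dirpath, d), repo_root))
                dirnames.remove(d)  # do not descend into a flagged directory
        for f in filenames:
            if f.endswith(LEAKY_SUFFIXES):
                hits.append(os.path.relpath(os.path.join(dirpath, f), repo_root))
    return sorted(hits)
```

A task whose sandbox returns a non-empty list could be cleaned or rejected before it enters the task bank. A static scan like this would not catch every workaround; it covers only the two artifact classes the blog happens to mention.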

### 1B. Targeted RL with textual feedback — every word the 2.5 blog says

Full "Targeted RL with textual feedback" section (verbatim):

> [SOURCE-VERBATIM] *"Credit assignment during RL is becoming an increasingly difficult challenge as rollouts can span hundreds of thousands of tokens. When a reward is computed over an entire rollout, it may be hard for the model to tell which specific decision helped or hurt the outcome. This is especially limiting when we want to discourage a localized behavior, such as a bad tool call, a confusing explanation, or a style violation. The final reward can tell us that something went wrong, but it is a noisy signal for where it went wrong.*
>
> *To address this, we trained Composer 2.5 with targeted textual feedback.[footnote 1] The idea is to provide feedback directly at the point in the trajectory where the model could have behaved better. For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's. This gives us a localized training signal for the behavior we want to change, while still retaining the broader RL objective over the full trajectory.*
>
> *As an illustration of the text feedback process, consider a long rollout that includes a tool call error where the model attempts to call a tool that is not available. During the rollout, the model will receive a "Tool not found" error and continue making additional valid tool calls. The fact that it hit one error in the process of hundreds of tool calls will have a minimal impact on its final reward.*
>
> *With text feedback, we can target this specific mistake by inserting a hint in the context of the problematic turn, such as "Reminder: Available tools…" with a list of available tools. This hint changes the probabilities for the teacher, lowering those for the wrong tool and increasing those for a valid replacement. For that turn only, we then update the student weights towards to the new probabilities.*
>
> *During the Composer 2.5 run, we applied this method to a variety of model behaviors, from coding style to model communication."*

**Footnote 1** (verbatim list from the blog's closing line): "For more background on this approach see Self-Distillation Enables Continual Learning, Reinforcement Learning via Self-Distillation, and Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models."

**Critical observations:**

1. **The teacher = hint-conditioned forward pass of the SAME weights.** The blog says "use the resulting model distribution as a teacher" after inserting the hint into the local context. This is not a separate model. The weights do not change for the teacher pass.

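2. **Student = policy with the ORIGINAL context.** The student is the trainable side. Holding the teacher distribution fixed (stop-gradient) is the standard reading of "use the resulting model distribution as a teacher"; the blog does not state it explicitly.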

3. **The KL is applied "for that turn only"** — not over the full trajectory. This is a LOCALIZED loss.

4. **HOW HINTS ARE CONSTRUCTED is never stated.** The blog says "we construct a short hint describing the desired improvement." The construction mechanism (heuristic templates? LLM judge? human?) is completely absent.

5. **Behavior targets explicitly named:** "coding style," "model communication," and implicitly "tool use" (from the example). The blog also separately mentions "effort calibration" as a behavioral improvement. That is four distinct behavior classes.

6. **The interaction with the main RL objective is stated:** "while still retaining the broader RL objective over the full trajectory." The SDPO-style KL loss is ADDED to the main RL objective; it does not replace it.
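
The mechanism in observations 1, 2, 3 and 6 can be sketched on toy logits. This is a minimal illustration, not Cursor's implementation: the KL direction (student to teacher), the mean reduction over the targeted positions, and the `beta` coefficient are all assumptions, since the blog states none of them.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def targeted_kl_loss(student_logits, teacher_logits, target_mask):
    """Mean KL(student || teacher) over token positions in the targeted turn.

    student_logits: per-position logits from the policy on the ORIGINAL context.
    teacher_logits: per-position logits from the same weights with the hint
                    inserted; treated as constants (no gradient) by assumption.
    target_mask:    1 for positions inside the targeted model message, else 0.
    """
    total, count = 0.0, 0
    for s, t, m in zip(student_logits, teacher_logits, target_mask):
        if m:
            total += kl(softmax(s), softmax(t))
            count += 1
    return total / count if count else 0.0


def combined_loss(rl_loss, student_logits, teacher_logits, target_mask, beta=0.1):
    # beta is a hypothetical weighting; the blog gives no coefficient.
    return rl_loss + beta * targeted_kl_loss(student_logits, teacher_logits, target_mask)
```

Positions outside the targeted turn contribute nothing to the KL term, which is what "for that turn only" requires; the full-trajectory RL loss is passed through unchanged.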

### 1C. RL algorithm — from the Composer 2 Technical Report (§4.1)

This is NOT in the 2.5 blog. The RL algorithm is fully specified only in the Composer 2 report:

> [SOURCE-VERBATIM] *"We use a policy gradient algorithm with multiple samples per prompt and a fixed group size. We operate in the single-epoch regime, i.e., the same prompt is never trained on twice. We utilize Adam as our underlying optimizer and update the full parameter set."*

> [SOURCE-VERBATIM] *"As in Dr. GRPO, we found that it is crucial to minimize the bias in the gradients that can arise from transforming the underlying advantage. Following this work, we remove the length standardization term from GRPO as it introduces a length bias. We do not normalize group advantages by their standard deviation, as it results in the degenerate case where small behavioral differences get massively upweighted within a group where every rollout achieves equal correctness."*

> [SOURCE-VERBATIM] *"We did not see benefits with overlong masking at small scale and opted not to mask rollouts that exceed the maximum sequence length."*
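
A minimal sketch of the two stated modifications, assuming an on-policy update where the importance ratio is 1. The helper names and the constant `norm_constant` (used in place of a per-rollout length divisor) are assumptions; the report says only that the length standardization term and the standard-deviation normalization are removed.

```python
def group_advantages(rewards):
    """Mean-centered advantages for one prompt's group; no std normalization."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def std_normalized_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization, shown only for contrast."""
    centered = group_advantages(rewards)
    std = (sum(a * a for a in centered) / len(centered)) ** 0.5
    return [a / (std + eps) for a in centered]


def surrogate_loss(token_logprobs, rewards, norm_constant):
    """Policy-gradient surrogate for one group.

    token_logprobs: one list per rollout with log-probs of the sampled tokens.
    norm_constant:  fixed divisor (hypothetical), so long and short rollouts
                    are not rescaled by their own length.
    """
    advs = group_advantages(rewards)
    total = 0.0
    for adv, lps in zip(advs, token_logprobs):
        total += adv * sum(lps)  # summed over tokens, not averaged per rollout
    return -total / (len(rewards) * norm_constant)
```

With rewards such as `[1.0, 1.0, 1.0, 0.99]`, the mean-centered advantages stay below 0.01 in magnitude, while the std-normalized version rescales the same near-tie to order 1. That is the degenerate upweighting the report describes.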

KL regularization (§4.1):

> [SOURCE-VERBATIM] *"Similar to prior work, we use a Kullback–Leibler divergence for regularization, KL(q‖p) = E_{x~q}[−log r(x)], r(x)=p(x)/q(x). Many open-source implementations of RL estimate KL with the estimator k3=(r−1)−log r... However, Amini et al. shows that the variance increases drastically as p and q diverge. See Figure 4: for large KL values, the variance of the estimate is extremely large. Therefore, we use the standard estimator k1=−log r instead."*
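
The two estimators in the quote can be written out directly. The sketch below follows the quote's convention (samples from q, r = p/q) and checks on a small discrete example that both have expectation KL(q‖p); the helper names are assumptions.

```python
import math


def k1(r):
    """k1 = -log r, the estimator the report adopts."""
    return -math.log(r)


def k3(r):
    """k3 = (r - 1) - log r, common in open-source RL code."""
    return (r - 1.0) - math.log(r)


def expectation(estimator, q, p):
    """Exact expectation of estimator(p(x)/q(x)) under x ~ q."""
    return sum(qi * estimator(pi / qi) for qi, pi in zip(q, p) if qi > 0.0)


def variance(estimator, q, p):
    """Exact variance of the estimator under x ~ q."""
    mean = expectation(estimator, q, p)
    return sum(
        qi * (estimator(pi / qi) - mean) ** 2 for qi, pi in zip(q, p) if qi > 0.0
    )


def exact_kl(q, p):
    """KL(q || p) computed from the definition."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)
```

Both estimators are unbiased because the extra term in k3 has zero mean under q (E_q[r − 1] = Σp − 1 = 0). The `variance` helper lets the two be compared on any pair of distributions as p and q move apart.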

Dynamic curriculum (§4):

> [SOURCE-VERBATIM] *"In later stages of training, we use simple heuristics—such as number of turns and thinking tokens of rollouts—to upsample increasingly harder data points."*

Agent behavior shaping (§4.2):

> [SOURCE-VERBATIM] *"For behavior and communication, we apply an array of auxiliary rewards to ensure the model provides a good experience. These include rewards for coding style, communication, and product-specific penalties for poor tool calls, such as creating to-do list items and then leaving them unfinished. During RL training, we monitor the model for emergent behaviors and occasionally introduce additional behavior rewards as needed. For example, we observed that the model would start to leave long chains-of-thought in comments or collapse to using the terminal tool only."*

Nonlinear length penalty (§4.2):

> [SOURCE-VERBATIM] *"C_length{k,q}(x) = ((1 + kx)^{1−q} − 1) / (k(1−q)), where k and q are hyperparameters which define the curvature of the penalty, and the input x is a weighted combination of thinking tokens, tool calling tokens, tool output tokens, final message tokens, number of tool calls, and number of turns of a rollout."*
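
The penalty is straightforward to evaluate. In the sketch below, the feature weights passed to `rollout_cost` are hypothetical (the report gives no values), and the `q == 1` branch is the mathematical limit of the quoted formula, added here to avoid division by zero; it is not stated in the report.

```python
import math


def length_penalty(x, k, q):
    """C_length{k,q}(x) = ((1 + k*x)**(1 - q) - 1) / (k * (1 - q))."""
    if k <= 0:
        raise ValueError("k must be positive")
    if abs(1.0 - q) < 1e-12:
        # Limit as q -> 1 of the formula above: log(1 + k*x) / k.
        return math.log1p(k * x) / k
    return ((1.0 + k * x) ** (1.0 - q) - 1.0) / (k * (1.0 - q))


def rollout_cost(features, weights):
    """Weighted combination x of rollout statistics (weights are hypothetical)."""
    return sum(weights[name] * features[name] for name in weights)
```

For q = 0 the penalty reduces to x itself (linear). For 0 < q < 1 it is concave, so each additional unit of cost is penalized less than the previous one.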

Self-summarization (§4.1):

> [SOURCE-VERBATIM] *"Each training rollout can involve multiple generations chained together by summaries, rather than a single prompt–response pair. We use the final reward for all tokens produced by the model in the chain. This upweights both the agent responses in good trajectories and also the self-summarizations that made them work."*

### 1D. Infrastructure (Anyrun) — Composer 2 Technical Report (§6.2)

> [SOURCE-VERBATIM] *"Environments are run on top of Anyrun, an internal compute platform built for running untrusted code at scale. This is the same compute platform that powers Cloud Agents and Automations in the Cursor product."*

> [SOURCE-VERBATIM] *"Within a cluster, a distributed set of Anyrun managers schedule pods, scale cloud compute provisioned across multiple regions, and perform state reconciliation to manage hundreds of thousands of pods per cluster. Each pod is a dedicated Firecracker VM capable of running a full development environment, including a browser and GUI for computer use."*

> [SOURCE-VERBATIM] *"Each Anyrun cluster is capable of scheduling more than 500 pods per second."*

Weight sync (§6.2):

> [SOURCE-VERBATIM] *"Every training step, we synchronize updated weights to the inference engine by uploading to a shared S3 bucket. To minimize transfer size, we use delta compression: each rank caches its previous upload and transmits only the diff against the new weights. Because RL updates are small, even with full-parameter training these diffs compress to a handful of gigabytes for the 1T-parameter model."*
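
The idea of "cache the previous upload, transmit only the diff" can be illustrated on raw bytes. The report does not specify the diff encoding or the compressor, so the XOR diff and `zlib` below are assumptions chosen for a self-contained sketch.

```python
import zlib


def make_delta(prev: bytes, new: bytes) -> bytes:
    """Compress the byte-wise XOR of two equally sized weight buffers."""
    if len(prev) != len(new):
        raise ValueError("buffers must have the same length")
    diff = bytes(a ^ b for a, b in zip(prev, new))
    return zlib.compress(diff)


def apply_delta(prev: bytes, delta: bytes) -> bytes:
    """Reconstruct the new buffer from the cached previous one and a delta."""
    diff = zlib.decompress(delta)
    if len(diff) != len(prev):
        raise ValueError("delta does not match cached buffer length")
    return bytes(a ^ b for a, b in zip(prev, diff))
```

When an update touches only a small part of the buffer, the XOR diff is mostly zero bytes and compresses far below the raw size. This toy does not model how floating-point weight updates actually compress; the "handful of gigabytes" figure is the report's, not something this sketch reproduces.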

Inference partner:

> [SOURCE-VERBATIM] *"We partner with Fireworks AI to run RL inference."*

### 1E. Sharded Muon and dual mesh HSDP — from the 2.5 blog only

The full text of the "Sharded Muon and dual mesh HSDP" section (verbatim):

> [SOURCE-VERBATIM] *"For continued pretraining, we use Muon with distributed orthogonalization. After forming the momentum update, we run Newton-Schulz at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights.*
>
> *The main cost is orthogonalizing expert weights. For sharded parameters, we batch same-shaped tensors, all-to-all shards into complete matrices, run Newton-Schulz, then all-to-all the result back to the original sharded layout. These transfers are asynchronous: while one task is waiting on communication, the optimizer runtime advances other Muon tasks, overlapping network and compute. This is equivalent to full-matrix Muon, but keeps the shard group busy; on the 1T model, optimizer step time is 0.2s.*
>
> *This interacts closely with how we use HSDP for MoE models. HSDP forms multiple FSDP replicas and all-reduces gradients across corresponding shards. We use separate HSDP layouts for non-expert and expert weights: non-expert weights are comparatively small, so their FSDP groups can stay narrow, often within a node or rack, while expert weights hold most of the parameters and most of the Muon compute, so they use a wider expert sharding mesh.*
>
> *Keeping these layouts separate also lets independent parallelism dimensions overlap: CP=2 and EP=8 can run on 8 GPUs instead of requiring 16 in a single shared mesh. This avoids wide communication for small non-expert state while spreading expert optimizer work over many GPUs."*

**Critical observation:** Muon + HSDP is described in the 2.5 blog as being used for **continued pretraining**. The Composer 2 report (§6.1) describes a different sharding scheme for Composer 2 (FSDP + decoupled EP + CP, with AdamW/Adam, no Muon). These are two different systems for two different model versions.
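
For readers unfamiliar with the orthogonalization step, here is a single-process sketch of Newton-Schulz applied per expert. It uses the textbook cubic iteration X ← 1.5X − 0.5·XXᵀX; the blog gives neither coefficients nor an iteration count, so this is not necessarily the variant Cursor runs, and the sharding, all-to-all, and async overlap described above are not modeled.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(m, steps=30):
    """Approximate the nearest (semi-)orthogonal matrix to m.

    Textbook cubic iteration; the step count is an assumption.
    """
    fro = math.sqrt(sum(v * v for row in m for v in row))
    # Scaling by the Frobenius norm keeps all singular values in (0, 1].
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


def orthogonalize_per_expert(stacked_updates):
    """Apply Newton-Schulz to each expert's matrix separately."""
    return [newton_schulz(w) for w in stacked_updates]
```

Running the iteration per expert (or per attention head) rather than on one concatenated matrix is what the blog calls the model's "natural granularity".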

---

## PART 2 — Critical discrepancies between the repo's research notes and the primary sources

### FINDING-1: research/01 claims benchmarks not in either blog

**REPO-CLAIM (research/01):**
> *"CursorBench 69.3%, Terminal-Bench 2.0 parity"* — attributed to the 2.5 blog.
> *"On Cursor's internal CursorBench, Composer 2.5 scored 69.3% (or ~61-63% depending on the specific benchmark version cited)"*

**SOURCE FACT:** The **2.5 blog contains NO benchmark numbers at all.** The 2.5 blog is a brief, methods-focused post with zero numerical results. Benchmark numbers only appear in the **Composer 2 technical report** (Table 1): Composer 2 = 61.3 / 73.7 / 61.7. The 69.3% figure appears nowhere in either primary source.

**SEVERITY:** High. The 69.3% figure and "Terminal-Bench 2.0 parity" are either fabricated or taken from secondary sources, yet they are presented as Cursor-stated. The audit note at the top of research/01 correctly flags them as "[NOT in the Cursor blog]," so that file's own audit notice is accurate, but its body still asserts the numbers as though they came from Cursor. Any pipeline design that uses them as performance targets is relying on unverified figures.

### FINDING-2: The "25x more synthetic tasks" claim is correctly reported but its scope is misread

**REPO-CLAIM (research/01, research/06):** Accurately quotes the 25x figure. research/06 correctly identifies it as Composer 2.5 vs Composer 2.

**SOURCE FACT:** [CONFIRM] The 2.5 blog says verbatim: *"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2."* The absolute number is NEVER stated. The Composer 2 report gives **no figure** for how many synthetic tasks Composer 2 used — it describes only "a problem distribution that reflects the most common use cases" with a category histogram (Fig. 3). There is therefore no baseline from which "25x more" can be translated to an absolute count.

**FINDING:** research/06 says "Reaching a '25×-spirit' pool of ~50k–60k tasks" but this number has no grounding in the primary sources. With no Composer 2 synthetic-task count, 25x is not convertible to an absolute. **The 50k–60k estimate is entirely [EXTRAPOLATED]** and is not sourced from the primary documents.

### FINDING-3: Feature deletion task structure — the "two agents" framing is inferred, not stated

**REPO-CLAIM (research/06, §1):** Interprets feature deletion as a "two-agent / two-phase structure the blog implies" with a distinct "deleter" and a "reimplementation agent."

**SOURCE FACT:** The blog says *"the agent is given a codebase with a large set of tests, and **asked to delete code and files** in such a way that the codebase remains functional while specific testable features are removed."* Grammatically, "the agent" is the subject of the deletion step, and the next sentence defines the synthetic task as reimplementation without naming a second agent. The blog therefore does NOT cleanly describe two separate agents. It can be read as (1) an agent performing the deletion as a construction step, followed by a separate reimplementation phase, or (2) a general description of the construction process in which "agent" is used loosely. The "deleter model vs. program" question (flagged as open in research/06) is genuinely open, and the "two agents" reading may over-read the blog's grammar.

**RECOMMENDATION:** Treat the deleter as unknown. The programmatic AST-deletion approach in research/06 is well-reasoned but is [EXTRAPOLATED]. The blog could equally describe a model-driven deletion step.
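
Whatever performs the deletion, the resulting task can be validated the same way. The sketch below is [EXTRAPOLATED] and deleter-agnostic: it compares test outcomes before and after deletion, reads "codebase remains functional" as "at least some previously passing tests still pass", and reuses the `FAIL_TO_PASS` / `PASS_TO_PASS` naming from research/06. The function names and the `min_removed` threshold are assumptions.

```python
def classify_tests(before, after):
    """Split tests that passed before deletion by their outcome afterwards.

    before/after: dict mapping test id -> True if the test passed.
    Returns (fail_to_pass, pass_to_pass) as sorted lists of test ids.
    """
    passed_before = [t for t, ok in before.items() if ok]
    fail_to_pass = sorted(t for t in passed_before if not after.get(t, False))
    pass_to_pass = sorted(t for t in passed_before if after.get(t, False))
    return fail_to_pass, pass_to_pass


def is_valid_task(before, after, min_removed=1):
    """A deletion is usable if it breaks some tests and leaves others passing."""
    fail_to_pass, pass_to_pass = classify_tests(before, after)
    return len(fail_to_pass) >= min_removed and len(pass_to_pass) > 0
```

Tests that already failed before the deletion are ignored on purpose, so a flaky or broken test cannot become part of the reward.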

### FINDING-4: The "other 24 generators" claim is hallucinated

**REPO-CLAIM (research/01, §2):** *"Feature Deletion + 24 unnamed generators"* — a specific count of 24 additional generators.

**SOURCE FACT:** The blog says: *"We use a **range of approaches** for creating synthetic tasks."* "Range" has no numeric content. Neither the blog nor the Composer 2 report names ANY other generator beyond feature deletion. "24 unnamed generators" is an unsourced extrapolation that has been transcribed as if it were a blog claim. 

**SEVERITY:** Medium. This affects how the scope of the dataset pipeline is defined. The correct statement is: "feature deletion is the one named generator; the total count and names of others are unknown."

### FINDING-5: The RL algorithm in research/01 is labeled [EXTRAPOLATED] — now resolved

**REPO-CLAIM (research/01):** *"RL Algorithm: Use a PPO or GRPO variant, modified for long-horizon sparse rewards"* — labeled as [EXTRAPOLATED] by the audit note. This was correct at the time of writing.

**SOURCE FACT (Composer 2 Technical Report, §4.1):** [CONFIRM] The algorithm is now known from the primary source: **Dr. GRPO-style** (Liu et al., arXiv:2503.20783). Specific modifications:
- Remove the length-standardization term.
- Do NOT normalize group advantages by their standard deviation.
- k1 KL estimator (−log r), not k3.
- Adam optimizer, single-epoch regime, full-parameter update.
- Overlong masking EXPLICITLY REJECTED.

research/10 correctly captures all of this as [REPORT-VERIFIED]. The issue is that research/01 has never been updated to reflect the Composer 2 findings.

### FINDING-6: "85% of total compute is post-training" is community speculation, not Cursor-stated

**REPO-CLAIM (research/01, §Overview):** *"roughly 85% of the total compute budget for Composer 2.5 was spent on Cursor's proprietary post-training and Reinforcement Learning (RL) pipeline."*

**SOURCE FACT:** Neither the 2.5 blog nor the Composer 2 technical report contains any compute budget percentages. The 2.5 blog contains NO compute cost figures. The Composer 2 report mentions only that the production training job "spanned 3 GPU regions + 4 CPU regions." This 85% figure is community speculation and should not be stated as a fact in any pipeline design document.

### FINDING-7: Sharded Muon is a 2.5 CPT optimization, NOT a Composer 2 optimization

**REPO-CLAIM (research/01):** Presents Muon as a Composer 2.5 feature; also discusses alongside "Dual Mesh HSDP."

**SOURCE FACT from the 2.5 blog:** Muon + HSDP is explicitly for "continued pretraining" only. The blog section is titled "Sharded Muon and dual mesh HSDP" and begins: "For continued pretraining, we use Muon."

**SOURCE FACT from Composer 2 Technical Report (§6.1):** Composer 2 uses **AdamW** for CPT and **Adam** for RL. No Muon. Sharding is FSDP + decoupled EP + Context Parallelism (CP). "HSDP" does not appear in the Composer 2 report — the relevant construct is FSDP + CP.

**FINDING:** The 2.5 blog's dual-mesh HSDP layout differs architecturally from Composer 2's FSDP + decoupled EP + CP system; the two sources describe separate steps for two model versions. research/01 is right to attribute Muon and dual-mesh HSDP to Composer 2.5, but care must be taken not to conflate Composer 2 sharding details with the 2.5 blog's Muon+HSDP description.

### FINDING-8: "Anyrun" — correct source attribution now confirmed

**REPO-CLAIM (research/01, audit note):** "Anyrun environment harness with LSP/file-I/O/terminal — likely from the Composer 2 report, not 2.5."

**SOURCE FACT:** [CONFIRM] Anyrun is named only in the Composer 2 Technical Report (§6.2), which provides full internals. The 2.5 blog does not name Anyrun explicitly. research/09 already correctly flagged and confirmed this. The audit note in research/01 is accurate.

### FINDING-9: The hint-generation mechanism remains the #1 reproducibility gap — research/07 correctly frames this

**REPO-CLAIM (research/07):** Correctly identifies hint generation as fully unstated by Cursor. Proposes a layered (a)→(b)→(c)→(f) generator taxonomy.

**SOURCE FACT:** [CONFIRM] Neither the 2.5 blog nor the Composer 2 Technical Report contains any description of how hints are constructed. The Composer 2 report does not even contain the hint mechanism (it is a 2.5-only feature; Composer 2 uses auxiliary scalar rewards instead). The entire hint-generation design in research/07 is well-reasoned extrapolation from SDPO/OPSD papers, appropriately labeled [EXTRAPOLATED].

### FINDING-10: Targeted textual feedback is confirmed ABSENT from Composer 2

**REPO-CLAIM (research/10):** Correctly states "The Composer 2 technical report contains NO hint-generation / teacher-student textual-feedback mechanism."

**SOURCE FACT (Composer 2 Technical Report, §4.2):** [CONFIRM] Composer 2 uses ONLY auxiliary scalar rewards for behavior shaping, plus the nonlinear length penalty. The targeted-textual-feedback method is exclusively a Composer 2.5 contribution. This is a critical boundary for dataset-pipeline design: anything built to replicate Composer 2.5's SDPO layer cannot be validated against the Composer 2 report.

### FINDING-11: The CPT→RL predictive claim is in the Composer 2 report, not the 2.5 blog

**REPO-CLAIM (research/09, §1):** Correctly flags "cross-entropy loss is predictive of downstream RL performance" as coming from the Composer 2 report.

**SOURCE FACT (Composer 2 Technical Report, §3.1):** [CONFIRM] Verbatim: *"cross-entropy loss is indeed predictive of downstream RL performance"* — demonstrated via Qwen3-Coder-30B-A3B at three log-spaced CPT compute levels. This is the empirical justification for doing CPT at all and is a Composer 2 finding, not stated in the 2.5 blog.

**IMPLICATION for pipeline design:** Starting from Qwen3-Coder-7B or similar already-code-tuned models as proposed in research/06 is supported by this finding. However, note that the Composer 2 team's CPT recipe (3-phase: 32k bulk → 256k long-context extension → SFT) is not replicable without knowing the data mix percentages and token counts, which remain unstated.

### FINDING-12: Dynamic curriculum implementation — only Composer 2 provides a concrete handle

**REPO-CLAIM (research/09, research/06):** The 2.5 blog says "select for and create harder tasks dynamically throughout the run" — interpreted as an online curriculum.

**SOURCE FACT:** The 2.5 blog gives ZERO implementation detail for the curriculum. The ONLY concrete curriculum handle in either primary source is from the **Composer 2 Technical Report** (§4): *"In later stages of training, we use simple heuristics—such as number of turns and thinking tokens of rollouts—to upsample increasingly harder data points."*

**FINDING:** This is a Composer 2 mechanism being mapped onto the 2.5 description. Whether Composer 2.5 uses the same heuristic (turns + thinking tokens) or a different one is unknown. The PassRateCurriculum in research/06 uses pass-rate as the difficulty signal, which is a reasonable [EXTRAPOLATED] alternative to Cursor's stated turns/thinking-tokens heuristic, but is NOT what Cursor described.
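
A minimal sketch of the Composer 2 heuristic as quoted, under stated assumptions: the linear `difficulty` score, its weights, and the source of the per-task statistics are all hypothetical, since the report names only the two signals. Sampling is done without replacement (Efraimidis-Spirakis keys) so the sketch stays compatible with the report's single-epoch regime.

```python
import random


def difficulty(stats, w_turns=1.0, w_think=0.001):
    """Hypothetical difficulty proxy from number of turns and thinking tokens."""
    return w_turns * stats["turns"] + w_think * stats["thinking_tokens"]


def upsampling_weights(task_stats, floor=1e-6):
    """Normalize difficulty scores into sampling weights that sum to 1."""
    scores = {tid: max(difficulty(s), floor) for tid, s in task_stats.items()}
    total = sum(scores.values())
    return {tid: sc / total for tid, sc in scores.items()}


def sample_without_replacement(task_stats, batch_size, rng):
    """Weighted sampling without replacement: key = u ** (1 / w), take top-k."""
    weights = upsampling_weights(task_stats)
    keyed = [(rng.random() ** (1.0 / w), tid) for tid, w in weights.items()]
    keyed.sort(reverse=True)
    return [tid for _, tid in keyed[:batch_size]]
```

Swapping `difficulty` for a pass-rate signal would give the PassRateCurriculum variant from research/06; the sampling code would not change.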

### FINDING-13: Reward structure — test pass-fraction vs. correctness/succinctness/principles

**REPO-CLAIM (research/06):** Designs reward as "test pass fraction (|FAIL_TO_PASS passing| / |FAIL_TO_PASS|), masked by the hack monitor."

**SOURCE FACT (Composer 2 Technical Report, §2):** *"a reward is given based on the code's correctness, succinctness, and conformance to software engineering principles."* The Composer 2 reward is multi-dimensional — not purely test pass-fraction. The "succinctness" and "conformance to software engineering principles" components are only partially captured by the length penalty; they imply additional rubrics or reward signals beyond pass/fail.

**FINDING:** For Feature Deletion specifically, test pass-fraction is the natural verifiable reward and is consistent with the 2.5 blog's description. But the broader reward structure for non-feature-deletion RL tasks includes quality and style components not reducible to test pass rates.
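
The research/06 reward formula can be written in a few lines. This sketch implements only what that note states (pass fraction over `FAIL_TO_PASS`, masked by a hack monitor); treating "masked" as "set to zero" and the `hack_flagged` input are assumptions, and none of the Composer 2 quality or style components are modeled.

```python
def feature_deletion_reward(results, fail_to_pass, hack_flagged=False):
    """Fraction of FAIL_TO_PASS tests that pass after the agent's rollout.

    results:      dict mapping test id -> True if the test passed.
    fail_to_pass: test ids broken by the deletion.
    hack_flagged: hypothetical monitor verdict; a flagged rollout scores 0.
    """
    if not fail_to_pass:
        raise ValueError("task has no FAIL_TO_PASS tests")
    if hack_flagged:
        return 0.0
    passed = sum(1 for t in fail_to_pass if results.get(t, False))
    return passed / len(fail_to_pass)
```

A test id missing from `results` counts as a failure, so a rollout cannot raise its score by deleting or skipping target tests.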

### FINDING-14: Self-summarization is present in Composer 2, not mentioned in 2.5 blog

**REPO-CLAIM (research/01):** Does not mention self-summarization.

**SOURCE FACT (Composer 2 Technical Report, §4.1):** Self-summarization is described as introduced in Composer 1.5 and carried into Composer 2. The 2.5 blog does not mention it at all. It is therefore likely carried into 2.5 but is not described as a new 2.5 feature. Research notes should treat this as a "likely continued from Composer 2" feature, not a 2.5 innovation.

### FINDING-15: MoE router replay — a critical infrastructure detail not in any repo note

**SOURCE FACT (Composer 2 Technical Report, §6.2):** *"during inference, the engine returns the selected expert indices for every token at every MoE layer, and during the training forward pass the router's expert assignment is overridden to match."* The report extends this by filtering replayed experts that fall below a plausibility threshold. The mechanism addresses the MoE numerical mismatch between inference and training forward passes.

**FINDING:** This detail is critical for any MoE-base RL implementation (relevant if the framework uses Kimi K2.5 or another MoE). It is described in research/10 correctly, but it is absent from research/01 and the design docs. Any replication plan that assumes "just run GRPO on a MoE" will encounter numerical divergence issues that this router-replay mechanism is specifically designed to solve.
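
A toy, single-token sketch of router replay, heavily [EXTRAPOLATED]: the report states that the training-side assignment is overridden and that implausible replayed experts are filtered, but not how gates are computed afterwards or what replaces a filtered expert. The gate renormalization, the `min_prob` test on training-side probabilities, and the top-k fallback are all assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def top_k_indices(probs, k):
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def route(router_logits, k, replay_indices=None, min_prob=0.0):
    """Return {expert index: gate weight} for one token.

    replay_indices: expert indices recorded by the inference engine; when
                    given, they override the training-side top-k choice.
    min_prob:       hypothetical plausibility threshold on the training-side
                    router probability of each replayed expert.
    """
    probs = softmax(router_logits)
    if replay_indices is None:
        chosen = top_k_indices(probs, k)
    else:
        chosen = [i for i in replay_indices if probs[i] >= min_prob]
        if not chosen:
            chosen = top_k_indices(probs, k)  # assumed fallback
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}
```

With logits `[2.0, 1.9, 0.0, -5.0]` and `k=2`, the training-side choice is experts 0 and 1, but a replayed `[1, 2]` forces the forward pass through the experts that actually produced the sampled tokens.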

### FINDING-16: Base model selection — Composer 2 used specific criteria that precluded agentic benchmarks

**SOURCE FACT (Composer 2 Technical Report, Appendix B):** *"We intentionally do not consider coding agent benchmarks when testing base models. We find that such benchmarks are less predictive of final performance, as agentic and long-horizon capabilities can drastically change during the RL stage."* Base model was selected on: FreshBench (factual coding knowledge), State Tracking (LoCoDiff-style), and internal codebase perplexity.

**FINDING:** This is an important methodological point absent from research/01. When the replication framework selects a base model (currently proposed as Qwen3-Coder variants), the selection should be made on intrinsic knowledge benchmarks (perplexity on domain code, state tracking) rather than SWE-bench or similar agentic benchmarks — consistent with Cursor's own methodology.

---

## PART 3 — What the sources DO NOT say (reproducibility gaps)

These are the load-bearing unknown quantities for building the dataset-generation pipeline:

| Gap | What's stated | What's missing |
|---|---|---|
| **Hint construction** | "we construct a short hint" | HOW — templates? LLM judge? human? learned? |
| **Synthetic generator inventory** | "a range of approaches… for example, one… is feature deletion" | Any other generator name, count, or weighting |
| **Synthetic task absolute count** | "25x more than Composer 2" | Absolute number; Composer 2 baseline count |
| **Dynamic curriculum implementation** | Composer 2: "turns + thinking tokens to upsample"; 2.5: "select for and create" | 2.5's specific curriculum signal; whether pass-rate is used |
| **Feature deletion: deletion agent** | "the agent… asked to delete code and files" | Whether "the agent" is a model, program, or human pipeline |
| **Feature deletion: target selection heuristic** | "delete code and files in such a way that the codebase remains functional while specific testable features are removed" | HOW deletion targets are selected from a repo |
| **Feature deletion: languages** | Python (type-check cache) and Java (bytecode) implied by reward-hack examples | No explicit language list |
| **Reward-hacking mitigations** | "agentic monitoring tools" | No technical specification |
| **CPT data mix** | "large code-dominated data mix"; "3 phases" (32k bulk → 256k → SFT) | Token counts, percentages, data sources |
| **Behavioral reward signals** | Scalar "rewards for coding style, communication" (Composer 2); 2.5 extends to hint-distillation | No RM architecture, no rubric spec |
| **Muon optimizer for CPT** | "For continued pretraining, we use Muon with distributed orthogonalization" | Learning rate, hyperparameters, data ordering during CPT |
| **SpaceXAI / Colossus 2 model** | "training a significantly larger model from scratch, using 10x more total compute" | Everything; this is a future model not Composer 2.5 |

---

## PART 4 — Implications for building the dataset-generation pipeline

Based on the primary-source analysis, the following are the **only verifiably-sourced facts** the pipeline design can be grounded on:

### What IS in the sources and can be directly implemented:

1. **Feature deletion as described:** give agent a test-covered repo → delete a testable feature (keeping repo otherwise functional) → reward = passing the tests. The blog's formulation implies the deletion maintains a `PASS_TO_PASS` guard (the "codebase remains functional" requirement). This is the exact formulation in research/06 and is faithful.

2. **Verifiable test-based reward:** "the tests are used as a verifiable reward." Test pass-fraction is the correct scalar. No golden patch is needed at reward time.

3. **Online curriculum via turns + thinking-token upsampling (Composer 2 handle):** The only stated heuristic. The difficulty proxies are the number of turns and the number of thinking tokens in a rollout. This is simpler than the pass-rate curriculum in research/06 but is the only Cursor-sourced signal.

4. **Targeted textual feedback — exact mechanism:** (a) construct hint, (b) insert into local context, (c) teacher = hint-conditioned forward pass of same weights (stop-grad), (d) student = policy without hint, (e) on-policy KL loss on student for that turn only, (f) stack on top of the main RL objective.

5. **Dr. GRPO modifications (Composer 2):** Remove length-standardization, remove std-norm advantage normalization, use k1 KL (−log r), Adam optimizer, single-epoch regime.

6. **Auxiliary scalar rewards for behavior (Composer 2):** coding style, communication, tool-call quality penalties. These are the Composer 2 behavior-shaping mechanism; in 2.5 they are SUPPLEMENTED (not replaced) by the hint-distillation channel.

7. **Reward = correctness + succinctness + SE principles (Composer 2):** Multi-dimensional, not just test pass. The nonlinear length penalty C_length is the explicit succinctness component.

### What CANNOT be directly sourced and must be tagged [EXTRAPOLATED]:

1. The hint construction mechanism — all of research/07 is [EXTRAPOLATED] (well-reasoned but not Cursor-stated).
2. Any synthetic generator beyond feature deletion.
3. The absolute scale of the task bank (50k, 60k, or any other number).
4. The pass-rate curriculum (a reasonable design choice but not what Cursor describes for Composer 2 or 2.5).
5. The CPT data mix specifics.
6. The reward-hacking detection tooling specifics.

---

## PART 5 — Verdict on the repo's existing research notes

| Note | Accuracy vs. primary sources |
|---|---|
| `research/01-composer-2.5.md` | Body is substantially from secondary sources. Audit note at top is accurate in flagging its own errors. The benchmark numbers (69.3%, Terminal-Bench parity) are not in either primary source. The "24 unnamed generators" claim is hallucinated. Should NOT be used directly as design input; use `research/09` and `research/10` instead. |
| `research/09-composer-blog-delta-2026.md` | High accuracy. All verbatim quotes verified against fetched blog. Deltas correctly identified. One minor gap: does not flag that the "25x" baseline count (Composer 2 synthetic task count) is also unstated in the Composer 2 report. |
| `research/10-composer2-techreport-mining.md` | High accuracy. All [REPORT-VERIFIED] claims confirmed against the full HTML. Corrections (optimizer Adam not Muon; FSDP+CP not HSDP; hint mechanism absent) are correct. The "k1 in reward KL" claim (also referenced in commit bd37412) is verified against §4.1. |
| `research/06-feature-deletion-datagen.md` | Well-constructed design brief. Correctly labeled [BLOG-VERIFIED] vs [EXTRAPOLATED]. One concern: the "50k–60k tasks" scale estimate has no primary-source grounding. The PassRateCurriculum (pass-rate as difficulty) is [EXTRAPOLATED] and departs from Cursor's stated heuristic (turns + thinking tokens). Otherwise faithful. |
| `research/07-sdpo-hint-generator.md` | Well-constructed. Correctly identifies the open question. The taxonomy (a)→(b)→(c)→(d)→(f) is well-grounded in SDPO/OPSD papers. The "successful-sibling bootstrap" as a fallback is directly grounded in SDPO's ablation of "sample solution" feedback type. The single issue: the layered design is predicated on the collator's existing `hint_generator` hook, which should be verified in the actual codebase before committing to this design. |

---

## Sources

- **Composer 2.5 blog** — full body, vault note `introducing-composer-25-cursor`, URL `https://cursor.com/blog/composer-2-5` (fetched 2026-06-09).
- **Composer 2 Technical Report** — full HTML body, vault note `composer-2-technical-report`, URL `https://arxiv.org/html/2603.24477` (arXiv:2603.24477 v2, 26 Mar 2026). 11 904 words, all sections including appendices.
- **arXiv abstract note** — vault note `260324477-composer-2-technical-report`, cross-checks author list and abstract wording.
- **Prior internal notes** — `research/01`, `research/09`, `research/10`, `research/06`, `research/07` as cross-reference for repo claims.