Baladithya Balamurugan
Wave 21: deep-read critical review — 8 source clusters re-read, findings verified
# Deep-Read: SWE Task Synthesis Prior Art — Critical Review
**Cluster:** Open-source task-synthesis literature ("point at a repo, build a dataset")
**Date:** 2026-06-09
**Reviewer:** Claude (critical-review pipeline, subagent)
**Sources fetched:** SWE-smith HTML (arXiv:2504.21798, 26k words), SWE-Gym HTML (arXiv:2412.21139, 10k words), R2E-Gym HTML (arXiv:2504.07164, 14k words), SWE-bench abstract (arXiv:2310.06770), SWE-MiniSandbox (arXiv:2602.11210), SWE-RL (arXiv:2502.18449), DeepSWE blog (Together AI), SkyRL GitHub, SWE-smith HF dataset page, SWE-smith GitHub README.
**Repo artifacts compared:** `composer_replication/datagen/substrates.py` (SweBenchAdapter), `research/06-feature-deletion-datagen.md`, `docs/adrs/ADR-010-feature-deletion-datagen.md`.
---
## 1. SWE-smith (arXiv:2504.21798) — Deep Read
### 1.1 What it actually does (task synthesis mechanics)
SWE-smith's core insight is stated verbatim in §2: "Conceptually, this is a simple inversion of SWE-bench's approach, which instead prioritizes identifying task instances, and then attempts to build an environment for each." SWE-smith **builds an execution environment first**, then synthesizes many tasks within it. This is the opposite of SWE-bench's PR-mining approach.
**Environment construction (§2.1):**
1. Start from the 5,000 most-downloaded PyPI packages, rank them by GitHub stars, and keep those with ≥1,000 stars, excluding the 12 SWE-bench test repos.
2. Run SWE-agent on the latest commit for ≤100 steps to auto-install + run test suite.
3. Manually verify installation instructions and check >80% tests pass.
4. Build one Docker image per repo (not per task — this is the key scalability win).
5. Total: 128 repos selected, ~7 min of human labor per repo to correct the step-2 installation output, ~1 min for the test parser.
6. Total human labor: ~20 hours to create the entire 50k dataset.
**Four bug-synthesis strategies (Table 1 from the paper; LM generation is split into its Modify and Rewrite variants, giving five rows):**
| Strategy | Yield % | # Instances | Cost/instance | Median F2P | Median Lines Edited |
|---|---|---|---|---|---|
| LM Modify | 56.0% | 17,887 | 0.38¢ | 4 | 3 |
| LM Rewrite | 35.0% | 4,173 | 3.93¢ | 4 | 24 |
| Procedural | 40.2% | 15,641 | 0.00¢ | 7 | 5 |
| Combine (Cross-bug) | 96.9% | 10,092 | 0.00¢ | 15 | 11 |
| PR Mirror (Invert PRs) | 33.8% | 2,344 | 5.53¢ | 3 | 14 |
| **Total** | **50.1%** | **50,137** | **2.32¢** | | |
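The totals row can be sanity-checked from the per-strategy rows. The sketch below assumes that yield means validated instances divided by candidates attempted (the table does not spell this out), so candidate counts are back-computed rather than taken from the paper.

```python
# Rows copied from the table above: (strategy, yield, validated instances).
ROWS = [
    ("LM Modify", 0.560, 17_887),
    ("LM Rewrite", 0.350, 4_173),
    ("Procedural", 0.402, 15_641),
    ("Combine", 0.969, 10_092),
    ("PR Mirror", 0.338, 2_344),
]

total_instances = sum(n for _, _, n in ROWS)

# Assumption: yield = validated instances / candidates attempted,
# so the number of candidates per strategy is n / yield.
implied_candidates = sum(n / y for _, y, n in ROWS)
overall_yield = total_instances / implied_candidates

print(total_instances)          # instance count across all five rows
print(f"{overall_yield:.1%}")   # pooled yield implied by the rows
```

Under that assumption the rows reproduce both the 50,137 instance total and the 50.1% overall yield, so the table is internally consistent.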
**Key detail on bug strategies:**
- **LM Modify:** Prompt LM to "introduce errant modifications" to a function. Yield 56% (LM doesn't always break a test). Input: full function.
- **LM Rewrite:** Given only function header + docstring, ask LM to re-implement. Yield 35% (lower because it's not explicitly asked to break anything). Generates longer changes (24 lines median). Input: signature only, so LM can't see the original.
- **Procedural Modification (§B.2):** Zero-cost AST transformations. 13 types (see §B.2 verbatim):
  - Class: Remove Functions (removes method(s) + references); Remove Parent (removes base class); Shuffle Methods.
  - Control: Invert If/Else (inverts if-else bodies).
  - Flow: Shuffle Lines (shuffles lines of function).
  - Expressions: Change Constants (±1 to numeric); Break Chains (removes operators); Swap Operands; Change Operator (+→-).
  - Removal: Remove Loops (for/while); Remove Conditionals (if); Remove Assignments; Remove Wrappers (try/with).
  - Applied with a `likelihood` parameter so modifications don't make tasks too hard. Filtered by function/class criteria (complexity min/max gates).
- **Combine Bugs (§B.3):** Combine already-validated bugs from same file or module into a multi-bug task. 96.9% yield because each component already passes validation. Creates complex multi-site tasks.
- **PR Mirror (§B.4):** For each PR in the repo's history, prompt LM to "undo" the PR's changes in the current codebase (NOT checkout the base commit — this is the key difference from SWE-bench). LM rewrites each affected file. Expensive (rewrites whole files) and lower yield (33.8%). Most reflective of SWE-bench distribution (PR Mirror trajectories are the best training data per Table 5).
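To make the procedural strategy concrete, here is a minimal sketch of one transform from the list above ("Change Operator", + to -) gated by a `likelihood` parameter. It uses only the standard-library `ast` module; the class name `ChangeOperator` and the helper `inject_bug` are hypothetical and are not SWE-smith's actual implementation.

```python
import ast
import random


class ChangeOperator(ast.NodeTransformer):
    """Flip `+` to `-` in binary expressions with probability `likelihood`."""

    def __init__(self, likelihood: float, seed: int = 0):
        self.likelihood = likelihood
        self.rng = random.Random(seed)  # seeded so a run is reproducible
        self.changed = 0

    def visit_BinOp(self, node: ast.BinOp) -> ast.AST:
        self.generic_visit(node)  # rewrite nested expressions first
        if isinstance(node.op, ast.Add) and self.rng.random() < self.likelihood:
            node.op = ast.Sub()
            self.changed += 1
        return node


def inject_bug(source: str, likelihood: float = 1.0, seed: int = 0):
    """Return (mutated_source, number_of_edits) for a Python source string."""
    tree = ast.parse(source)
    transformer = ChangeOperator(likelihood, seed)
    mutated = ast.fix_missing_locations(transformer.visit(tree))
    return ast.unparse(mutated), transformer.changed


original = "def total(a, b):\n    return a + b\n"
buggy, n_changes = inject_bug(original, likelihood=1.0)
print(buggy)
```

A candidate produced this way is only a task once validation shows it breaks at least one test; a transform that edits nothing (`n_changes == 0`) can be discarded before running any tests.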
**Issue text generation (§2.1, last subsection):** Provide LM with (the diff, source code of a random F2P test, test execution output), ask for GitHub-style issue text with reproduction code. Empirically as good as real issues (28.2% vs 28.0% on SWE-bench Verified in Table 5).
**Validation:** Apply bug patch, run test suite, keep only bugs that break ≥1 test. Time limit: 2 minutes per test run (bugs causing infinite loops discarded).
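The validation rule reduces to a comparison of per-test status before and after the bug patch. The sketch below assumes test results have already been parsed into `{test_name: status}` dicts (the parsing and the 2-minute timeout are out of scope here); `validate_bug` and `Validation` are hypothetical names.

```python
from typing import Dict, List, NamedTuple

PASSED = "passed"
FAILED = "failed"


class Validation(NamedTuple):
    fail_to_pass: List[str]   # tests the bug breaks (a fix must turn them green)
    pass_to_pass: List[str]   # tests that stay green with the bug applied
    keep: bool                # True if the candidate breaks at least one test


def validate_bug(before: Dict[str, str], after: Dict[str, str]) -> Validation:
    """Compare per-test status on the clean repo vs. with the bug patch applied."""
    f2p, p2p = [], []
    for test, status in sorted(before.items()):
        if status != PASSED:
            continue  # tests already failing on the clean repo carry no signal
        # A test missing from `after` (e.g. a collection error) counts as broken.
        if after.get(test, FAILED) == PASSED:
            p2p.append(test)
        else:
            f2p.append(test)
    return Validation(f2p, p2p, keep=bool(f2p))


before = {"test_add": PASSED, "test_sub": PASSED, "test_flaky": FAILED}
after = {"test_add": FAILED, "test_sub": PASSED, "test_flaky": FAILED}
result = validate_bug(before, after)
print(result.fail_to_pass, result.pass_to_pass, result.keep)
```

Note that this check only establishes that some test breaks; it says nothing about which code change caused the failure.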
**Scale and storage (Table 2 in §2.2 comparison):**
- SWE-smith: 50k tasks, 128 repos, 295 GB environments.
- SWE-Gym: 2.4k tasks, 11 repos, 6 TB environments.
- R2E-Gym subset: 4.6k tasks, 10 repos, 4 TB environments.
- SWE-fixer (Xie et al. 2025a): 115k tasks, 856 repos — but NO execution environments.
- SWE-bench-train: 19k tasks, 37 repos — NO executable environments.
One Docker image per repo (vs. SWE-bench's one image per task) is the mechanism. SWE-bench-style environments at 50k tasks would need an estimated 50-150 TB against SWE-smith's 295 GB, a reduction of roughly 170x to 500x; the 500x headline figure corresponds to the 150 TB upper bound.
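The storage ratios follow directly from the figures quoted in this section. The short calculation below assumes decimal units (1 TB = 1000 GB) and uses the paper's 50,137 task count for the per-task comparison.

```python
GB_PER_TB = 1000  # assumption: decimal units

swe_smith_gb = 295
swe_bench_at_50k_tb = (50, 150)  # estimated range for per-task images

# Whole-dataset ratio at 50k tasks, for the low and high estimates.
low_ratio, high_ratio = (tb * GB_PER_TB / swe_smith_gb for tb in swe_bench_at_50k_tb)
print(f"{low_ratio:.0f}x to {high_ratio:.0f}x")

# Per-task footprint: SWE-smith (per-repo images) vs SWE-Gym (per-task images).
smith_gb_per_task = swe_smith_gb / 50_137
gym_gb_per_task = 6 * GB_PER_TB / 2_438
per_task_ratio = gym_gb_per_task / smith_gb_per_task
print(f"{smith_gb_per_task * 1000:.1f} MB vs {gym_gb_per_task:.2f} GB per task")
print(f"{per_task_ratio:.0f}x")
```

Both comparisons land in the same order of magnitude: a few hundred times less storage per task for the per-repo image design.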
**Training results (§3-4):**
- SWE-agent-LM-32B (Qwen 2.5 Coder 32B fine-tuned on 5,016 SWE-smith trajectories) achieves 40.2% on SWE-bench Verified.
- Expert model for trajectory collection: claude-3-7-sonnet-20250219 + SWE-agent, ≤75 steps, $2 cost limit.
- PR Mirror and LM Rewrite trajectories produce the best-performing models; LM Modify has a steep drop-off.
- GRPO-style RL on SWE-smith is explicitly mentioned in the SWE-smith GitHub README: "Perform GRPO style reinforcement learning using SkyRL."
**Licensing (Table 6, §A.4):**
- 128 repos cover Apache 2.0 (majority), BSD 2/3-Clause, MIT, GNU GPLv3, LGPL v2.1 and v3, ISC.
- GPLv3 repos: Cog-Creators/Red-DiscordBot and adrienverge/yamllint. LGPL: chardet/chardet, paramiko/paramiko, pylint-dev/astroid, Knio/dominate.
- Paper states: "We inspected the repositories with custom licenses and confirmed they allowed for the use cases exercised in our work." (Note: does NOT explicitly address whether derivative diffs are redistributable — they exercise the bugs, not redistribute.)
- **HF dataset page:** The SWE-smith HF dataset page (`SWE-bench/SWE-smith`) shows **59,136 rows**, more than the 50,137 the paper reports (likely because it includes additional variants). The page says "We will no longer actively update this dataset. Recommend the language-specific `SWE-bench/SWE-smith-[lang]` datasets."
- **Toolkit license: MIT** (confirmed from GitHub).
**`pip install` / API availability:**
- The GitHub README shows a Python API: `from swesmith.profiles import registry; rp = registry.get_from_inst(task); container = rp.get_container(task)` — returns a Docker container with the task initialized.
- Install: `pip install swesmith` from source (requires Docker and Ubuntu 22.04.4 LTS; Windows and macOS are NOT officially supported).
- The toolkit is a full pipeline: create environments → synthesize task instances → keep tasks that break ≥1 unit test → generate issue text.
---
## 2. SWE-Gym (arXiv:2412.21139) — Deep Read
### 2.1 What it actually provides
SWE-Gym is **not a synthesis pipeline** — it is an existing dataset of 2,438 real-world task instances from 11 Python repos with pre-built executable environments. It is the "first training environment" for SWE agents (per abstract).
**Collection method (§3 of paper):** Same approach as SWE-bench (mine PRs), but:
- Applied to 11 repos disjoint from SWE-bench's 12 (deliberately non-overlapping for train/test separation).
- Per-task Docker images (6 TB total — this is the bottleneck SWE-smith solved).
- 66,894 raw task instances in SWE-Gym-Raw (no executable envs); 2,438 have full envs.
**Table 2 statistics (SWE-Gym vs SWE-bench):**
- Average gold patch: 69.8 lines edited, 2.5 files, 4.1 functions (much larger than SWE-bench's 32.8/1.7/3.0).
- Average F2P tests: 10.0 (vs SWE-bench 9.0).
- Average total tests: 760.8 (vs SWE-bench 132.5) — much more test coverage per instance.
- Average codebase: 971 non-test files, 340k lines.
**Training results:**
- 491 trajectories from GPT-4o and Claude 3.5 Sonnet → Qwen 2.5 Coder 32B fine-tuned → +12.3%/+13.6% on SWE-bench Lite/Verified.
- With verifier (OSRM): +11.4% more → 32.0% on Verified, 26.0% on Lite.
- Scaling: performance still improving at 491 trajectories, no saturation yet.
**License:** SWE-Gym tooling is MIT. Individual instances inherit upstream repo licenses (the paper does not provide a per-instance breakdown, unlike SWE-rebench).
**Availability:** `SWE-Gym/SWE-Gym` on HuggingFace. Docker images hosted on Docker Hub as `xingyaoww/sweb.eval.*`. Pre-built images reduce env-build cost to near-zero for adopters.
---
## 3. R2E-Gym (arXiv:2504.07164) — Deep Read
### 3.1 The SweGen pipeline — the most novel test synthesis approach
R2E-Gym's key contribution for this cluster is **SweGen**: a method to generate executable training environments **without human-written issues or unit tests**, directly from commits.
**SweGen procedure (§2):**
1. **Repo selection:** Use SEART GitHub search to find Python repos with many commits.
2. **Commit curation:** Extract commit history; filter with rule-based + LLM heuristics for "interesting" code changes.
3. **Build scripts:** Semi-manually collect dependency pins and installation procedures per commit.
4. **Test collection:** Use existing tests from the commit to identify F2P cases (failing before the commit, passing after — the natural oracle).
5. **Test generation for commits without tests:** Synthesize F2P test cases using an LLM when no test exists. Test-generation details are in the paper's Appendix A (not fully extracted for this review, but its existence was confirmed).
6. **Backtranslation for problem statements:** Instead of human GitHub issues, use LLM to backtranslate the commit diff into an issue-style problem statement. Key insight: include the F2P test execution trace in the backtranslation prompt to generate specific, non-generic statements. 27.8% vs 28.0% (synthetic vs real issues) — statistically indistinguishable.
7. **Decontamination:** Remove repos overlapping with SWE-bench test set → R2E-Gym-Subset (4,578 tasks, 10 repos).
**Scale:** 8,135 tasks total, from more repos than SWE-Gym. SweGen enables 2.5x more problems than PR-based collection alone. But 4 TB environments for the subset alone — still a storage problem.
**Oracle quality:**
- When commits lack existing tests: LLM synthesizes them. The paper claims "Real vs Synthetic: Real 28.0%, Synthetic 27.8%" — essentially identical performance. This is a strong result for synthetic oracle quality.
- However, the paper does NOT quantify what % of commits had no existing tests (and thus needed synthesized tests) — this is a gap that ADR-010 does not flag.
**Training results (Table 3):**
- R2E-Gym-trained 32B model: 34.4% on SWE-bench Verified (vs SWE-Gym 20.6%, +13.8%).
- With hybrid verifier (execution-based + execution-free): 51% on SWE-bench Verified — SOTA for open-weight models at paper time.
- DeepSWE (R2E-Gym + GRPO, Qwen3-32B): 42.2% Pass@1, 59% with test-time scaling.
**License:** Apache 2.0 for tooling (R2E-Gym GitHub). Per-instance upstream licenses inherited.
**API:** Available on HuggingFace at `R2E-Gym/R2E-Gym-V1` and `R2E-Gym/R2E-Gym-Subset`. Docker images hosted. The code at `r2e-gym.github.io` is the generation pipeline — not a clean `pip install swesmith`-style API, but usable.
---
## 4. SWE-bench (arXiv:2310.06770) — Schema Reference
The canonical task schema (ICLR 2024):
- Input: `(codebase, issue_description)` at `base_commit`.
- Output: `git diff` patch resolving the issue.
- Oracle: run unit tests from `test_patch` against the patched codebase.
- Fields: `FAIL_TO_PASS` (tests that must go red→green), `PASS_TO_PASS` (tests that must stay green).
- Construction: mine GitHub issues → find corresponding PRs → filter for PRs that modify test files and resolve the issue → build per-version Docker environments.
- **The hard part:** building per-version Docker environments is the most costly step (~hundreds of hours human labor at scale). SWE-smith's core contribution is eliminating this by using one image per repo at HEAD.
- **The oracle property:** FAIL_TO_PASS is a constructive oracle — the tests already existed and were known to exercise the feature (they were in the PR). This is what makes it "verifiable for free."
- Scale: 2,294 eval instances, 12 Python repos. Training split: 19,008 (no exec environments).
- License: CC-BY-4.0 for the dataset.
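The schema and its oracle can be written down in a few lines. In this sketch the field names follow the SWE-bench dataset columns, but storing `FAIL_TO_PASS` / `PASS_TO_PASS` as Python lists and passing test results as a `{test_name: passed}` dict are simplifying assumptions; `is_resolved` is a hypothetical helper, not the official harness.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SweBenchInstance:
    instance_id: str
    repo: str
    base_commit: str
    problem_statement: str            # the issue text shown to the model
    patch: str                        # gold diff that resolves the issue
    test_patch: str                   # diff adding/updating the oracle tests
    FAIL_TO_PASS: List[str] = field(default_factory=list)
    PASS_TO_PASS: List[str] = field(default_factory=list)


def is_resolved(instance: SweBenchInstance, test_status: Dict[str, bool]) -> bool:
    """Oracle: every F2P test now passes and every P2P test still passes."""
    required = instance.FAIL_TO_PASS + instance.PASS_TO_PASS
    # A test absent from the results (e.g. it crashed) is treated as failing.
    return all(test_status.get(name, False) for name in required)


example = SweBenchInstance(
    instance_id="demo__demo-1",       # hypothetical instance
    repo="demo/demo",
    base_commit="0" * 40,
    problem_statement="add() returns the wrong sign",
    patch="",
    test_patch="",
    FAIL_TO_PASS=["tests/test_math.py::test_add"],
    PASS_TO_PASS=["tests/test_math.py::test_sub"],
)
print(is_resolved(example, {"tests/test_math.py::test_add": True,
                            "tests/test_math.py::test_sub": True}))
```

A patch that fixes the target test but breaks a `PASS_TO_PASS` test is not counted as resolved, which is what makes the pair of lists a regression-aware oracle.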
---
## 5. Newer Work (2025-2026)
### 5.1 SWE-MiniSandbox (arXiv:2602.11210, Feb 2026)
Container-free RL training for SWE agents. Key result: kernel-level isolation (not Docker) reduces disk usage to ~5% of container-based pipelines and setup time to ~25%. Evaluation performance comparable to container baseline. Not a dataset synthesis method, but directly relevant to the sandbox cost problem ADR-010 acknowledges (Docker required, CPU-pool generation cost). **MISSED by research/06 and ADR-010** — this was published after the main research was done (Feb 2026).
### 5.2 SWE-RL (arXiv:2502.18449, NeurIPS 2025 Main Track)
RL on real GitHub software evolution data (issues, PRs, code history). No synthetic bug injection — uses rule-based reward (similarity score between ground truth and LLM-generated solution). Llama3-SWE-RL-70B: 41.0% on SWE-bench Verified. Key finding: RL on SWE data transfers to 5 out-of-domain tasks (math, code reasoning, general language). Uses 11M training pairs from open-source software evolution. Reward = similarity (not binary test pass) — this is weaker than the test-execution oracle in SWE-smith/R2E-Gym.
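A small example shows why a similarity reward is a weaker signal than test execution. The sketch below assumes a sequence-similarity score from the standard-library `difflib` and a fixed penalty for missing output; both choices are illustrative assumptions, and `similarity_reward` is a hypothetical name.

```python
import difflib
from typing import Optional


def similarity_reward(predicted_patch: Optional[str], oracle_patch: str) -> float:
    """Score a predicted patch by textual similarity to the ground-truth patch."""
    if not predicted_patch:
        return -1.0  # assumption: missing or unparseable output gets a fixed penalty
    return difflib.SequenceMatcher(None, predicted_patch, oracle_patch).ratio()


oracle = "-    return a - b\n+    return a + b\n"
near_miss = "-    return a - b\n+    return a * b\n"  # one character off, still wrong

print(similarity_reward(oracle, oracle))
print(similarity_reward(near_miss, oracle))
print(similarity_reward(None, oracle))
```

The near-miss patch is functionally wrong yet scores close to the maximum, whereas a FAIL_TO_PASS check would give it zero credit.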
### 5.3 DeepSWE (Together AI + Agentica, Jul 2025 blog)
Qwen3-32B + pure RL (GRPO via rLLM framework) on R2E-Gym environments. 4,500 SWE tasks, 64 H100s for 6 days. Achieves 42.2% Pass@1, 59% with test-time scaling (16 attempts). **This is the closest living demonstration of the Composer-style "RL on SWE tasks" recipe using an open-source substrate.**
### 5.4 SkyRL (NovaSky-AI, 2025)
Explicitly mentioned in SWE-smith README: "Perform GRPO style reinforcement learning using SkyRL." SkyRL is a full-stack RL library (Berkeley Sky Computing Lab) with SWE-bench environments in `skyrl-gym`. The GitHub shows direct integration: SkyRL + SWE-smith = GRPO on SWE-smith tasks. This is exactly the missing integration link for this repo's `reward_fn` + RL loop design.
### 5.5 SWE-fixer (Xie et al. 2025a, referenced in SWE-smith Table 2)
115k instances, 856 repos: the largest by task count. But **NO execution environments** (no Docker, no test execution). The SWE-fixer-72B model trained on this data achieves 32.8% on SWE-bench Verified. SWE-smith Table 2 lists SWE-fixer with zero executable environments, so any training signal has to come from comparison against reference patches rather than from test-execution verification. **Not a direct competitor for this repo's verifiable-reward approach.**
---
## 6. Comparison Against the Repo's Current State
### 6.1 What `substrates.py` (SweBenchAdapter) actually does
`substrates.py` implements **schema inversion only**: it takes an existing SWE-bench-shaped instance dict and converts it to a `FeatureDeletionTask` by reversing the gold patch. It does NOT:
- Synthesize new tasks from an arbitrary repo.
- Call SWE-smith's API or any synthesis engine.
- Implement AST/procedural deletion (Path B from research/06).
- Build execution environments for new repos.
- Generate issue text.
The adapter only handles the "adopt existing substrates" half of the design space. This is correct for the ADR-010 v0.0-v0.1 scope, but it's important to be explicit about what is NOT yet built.
### 6.2 What research/06 and ADR-010 claim vs. what the sources actually say
**Claims checked (verified unless marked otherwise):**
1. `[research/06 §4, ADR-010]` "SWE-bench FAIL_TO_PASS / PASS_TO_PASS schema is the universal schema across SWE-bench, SWE-Gym, R2E-Gym, SWE-rebench." **VERIFIED.** All four use this schema. SWE-smith also uses it (per §A.1: "FAIL_TO_PASS: The unit tests that break when the test suite is run with the bug patch applied. PASS_TO_PASS: The unit tests that do not break.").
2. `[research/06 §4]` "R2E-Gym: 8.1K executable envs." **VERIFIED.** Paper abstract states "more than 8.1K tasks."
3. `[research/06 §4]` "SWE-rebench: 21,336 tasks, 3,468 repos, CC-BY-4.0." **UNVERIFIED here** (SWE-rebench paper arXiv:2505.20411 not directly fetched — only the Nebius infrastructure blog post was available). The infrastructure blog describes the pipeline but does not give the exact 21,336 count. Cannot confirm from fetched sources.
4. `[research/06 §2, ADR-010]` "Online pass-rate curriculum: 'select for and create harder tasks dynamically.'" **VERIFIED verbatim** in the Cursor blog quote in research/06 §1.
5. `[research/06 §3]` "Python bytecode/type-check cache recovery is a real reward-hacking vector." **VERIFIED verbatim** in the Cursor blog (quoted in §1 of research/06).
6. `[ADR-010]` "SWE-smith env-construction: ~20h human labor for 128 repos." **VERIFIED.** Paper §2.1: "Creating SWE-smith took one author ~20h of human labor."
7. `[ADR-010]` "SWE-smith costs $1,360 to create." **VERIFIED.** Paper §2.2: "$1360 to create ($1000 to generate bugs, $160 for auto repo installation, $200 to generate issues for 10K bugs)."
**Errors and overclaims in the repo's research notes:**
1. **`[research/06 §4, ADR-010]` "SWE-Gym: 2.4k tasks, 11 repos, 6 TB."** CORRECT, WITH A SOURCING CAVEAT. SWE-Gym has 2,438 tasks across 11 repos, but the 6 TB environment figure is second-hand, cited from SWE-smith's Table 2 comparison, and should be attributed that way. The venue in research/06's "SWE-Gym (arXiv:2412.21139, ICML 2025)" is correct (ICML 2025 per the paper's header).
2. **`[research/06 §4]` "R2E-Gym-Subset = non-overlapping w/ SWE-bench."** VERIFIED but requires precision: R2E-Gym-Subset (4,578 tasks, 10 repos) is non-overlapping with SWE-bench TEST SET repos. The full R2E-Gym (8,135 tasks) may overlap with SWE-bench training repos. The paper decontaminates against the test set only.
3. **`[research/06 §4, ADR-010]` "OpenHands/Nemotron-SWE-v1: 59K agent trajectories."** AMBIGUOUS LABEL, not wrong in substance. Research/06 correctly cites `nvidia/Nemotron-SWE-v1` as its own dataset, but the 59K figure is easy to confuse with the HF page for `SWE-bench/SWE-smith`, which shows 59,136 rows. Those rows are the SWE-smith dataset (with updates), NOT the Nemotron trajectories, so the notes should name the dataset next to each 59K figure.
4. **`[research/06 §4]` "SWE-Gym purpose-built for training (train split, not a held-out benchmark → no contamination worry)"** — OVERCLAIM. SWE-Gym paper itself states its 11 repos are separate from SWE-bench's 12, but the decontamination is at the repo level, not task level. Using SWE-Gym for Feature Deletion and then evaluating on SWE-bench should be safe, but the claim "no contamination worry" is stronger than the paper asserts.
5. **`[research/06 §5, Path B]` "Coverage-mapped AST deletion using libcst — select deletion targets by coverage selectivity."** This is original to the repo and NOT from any of the fetched papers. SWE-smith's procedural approach uses random AST transforms without coverage-guided targeting. R2E-Gym uses real commits (not synthetic deletions). The coverage-selectivity framing is a valid engineering idea but is `[EXTRAPOLATED]` — research/06 correctly tags it `[EXTRAPOLATED]` but the ADR text promotes it to a stated capability without that tag. **ADR-010 should make clearer that Path B (coverage-mapped AST deletion) is not from any prior work — it would need to be built from scratch.**
6. **`[ADR-010 §2, Decision Drivers]` "Reuse existing verified OSS substrates... they already guarantee test-exercises-the-code via FAIL_TO_PASS."** PARTIAL OVERCLAIM. SWE-smith (§2.1) explicitly notes that yield rates are limited because some bug candidates "did not actually introduce relevant issues" or "lack test coverage for the change." The FAIL_TO_PASS guarantee is not automatic: it is the *output* of the validation step (test execution), not something inherited from the schema. The ADR implies the guarantee is pre-built; in reality it requires running the test suite to confirm it.
7. **`[ADR-010 post-review]` "OPEN: Gate 2 ('deletion breaks the feature') does not verify reachability."** This self-identified open item is correct and important. The sources confirm: SWE-smith's validation only checks that some test breaks (not which code causes it). R2E-Gym also checks test pass/fail without coverage verification. This is a systemic gap in the field, not just in the repo.
**Missing from the repo's research notes:**
1. **SWE-smith's PR Mirror strategy is the most important one for this repo.** Per Table 5 of the paper: PR Mirror trajectories produce the best-performing models (trajectories from PR mirrors > LM Rewrite ≈ Procedural > LM Modify). The repo's feature deletion via gold-patch reversion (SweBenchAdapter.to_task) is closely analogous to SWE-smith's PR Mirror strategy, with one difference: PR Mirror has an LM undo the PR in the current codebase, while the adapter reverses the gold patch mechanically. The paper's ablation therefore supports the repo's core approach, but neither research/06 nor ADR-010 cites this result.
2. **SWE-smith's "one Docker image per repo" architecture.** Research/06 and ADR-010 discuss reusing per-instance Docker images from SWE-Gym/SWE-rebench. SWE-smith shows that the correct architecture for new repos is one image per repo (not per task), which it estimates to be up to ~500x more storage-efficient. If the repo wants to extend beyond the existing substrates, this is the right architecture. Not mentioned.
3. **SWE-smith `pip install swesmith` — it is a real, usable API.** The GitHub README shows that `rp.get_container(task)` returns a Docker container with the task initialized. The repo could `pip install swesmith` to get task synthesis capabilities today, rather than building a separate synthesis pipeline. ADR-010 discusses options A/B/C but does not mention that Option A ("invert OSS substrates") could be powered by the `swesmith` toolkit directly.
4. **R2E-Gym's LLM-synthesized tests.** When commits lack existing tests, R2E-Gym synthesizes them with an LLM. Research/06 acknowledges the capability ("the hard part: Composer deletes features where tests exist; SWE-smith generates bugs and validates against existing tests; R2E-Gym synthesizes tests") but does not propose synthesizing tests itself; it only uses existing tests from substrates. The gap: if we want to point at an arbitrary repo (the user's stated goal), most such repos will have commits without comprehensive test coverage. SweGen's test synthesis (reported as equivalent to real tests, 27.8% vs 28.0%) is directly relevant and not mentioned in ADR-010.
5. **SkyRL + SWE-smith = working GRPO pipeline.** The SWE-smith README explicitly says "Perform GRPO style reinforcement learning using SkyRL." This is a working open-source stack (SkyRL is MIT licensed, 2k GitHub stars). ADR-010's `reward_fn` design reinvents what SkyRL + SWE-smith already provide. Not evaluated in the ADR.
6. **SWE-MiniSandbox (arXiv:2602.11210, 2026).** Container-free RL at 5% disk / 25% setup time of Docker. Directly addresses ADR-010's acknowledged "sandbox/Docker cost" concern. Not considered — was published after ADR-010.
7. **DeepSWE: living demonstration of RL on R2E-Gym.** Qwen3-32B + GRPO on 4,500 R2E-Gym tasks = 42.2% Pass@1 / 59% with TTS. This is published evidence that the core architecture in ADR-010 (RL + feature-deletion env + GRPO reward_fn) works at scale in the open. Not cited.
---
## 7. ADOPT vs BUILD — Concrete Recommendation
### 7.1 What we can `pip install` TODAY
**`pip install swesmith` (MIT, from source, requires Docker):**
- Provides: environment construction from arbitrary GitHub repos, all four bug-synthesis strategies (LM generation via Modify/Rewrite, Procedural with 13 transform types, Combine, PR Mirror), issue text generation, task validation.
- What the repo currently builds manually in `composer_replication/datagen/` is a subset of what `swesmith` provides.
- **Recommendation:** Use `swesmith` as the synthesis engine for new repos rather than rebuilding. The repo's `SweBenchAdapter` can remain as the inversion layer for pre-existing substrates.
**`datasets.load_dataset("SWE-bench/SWE-smith")` (dataset available, MIT toolkit, mixed upstream licenses):**
- 59k tasks, 128 repos, 295 GB environments (vs. 50,137 in paper — dataset has grown since publication).
- All tasks have executable Docker environments via `swesmith.profiles.registry`.
- **This is the largest immediately usable Feature-Deletion dataset**: every SWE-smith task is a `(broken_repo, FAIL_TO_PASS, PASS_TO_PASS)` tuple. The "patch" field IS the gold diff (already reversed in the task construction). `SweBenchAdapter.to_task()` works on SWE-smith instances unchanged.
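- **License risk:** Copyleft repos are present: GPLv3 (2 repos: Red-DiscordBot, yamllint) and LGPL (4 repos: chardet, paramiko, astroid, dominate). The existing `is_redistributable()` copyleft filter in `substrates.py` would catch the GPLv3 repos; whether it also flags LGPL should be checked. The Apache/BSD/MIT majority is clean.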
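As an illustration of the kind of check involved, the sketch below flags copyleft licenses by prefix. It is not the repo's `is_redistributable()` implementation: `is_copyleft`, `COPYLEFT_PREFIXES`, and the prefix-matching rule are hypothetical, and the license labels are the coarse ones quoted in §1 rather than SPDX identifiers.

```python
# Hypothetical prefix list; a production filter should match SPDX identifiers.
COPYLEFT_PREFIXES = ("GPL", "LGPL", "AGPL")


def is_copyleft(license_label: str) -> bool:
    """Return True if a coarse license label looks like a GPL-family license."""
    normalized = license_label.upper().replace("GNU ", "").strip()
    return normalized.startswith(COPYLEFT_PREFIXES)


# Labels as quoted in the licensing notes above, plus one hypothetical MIT repo.
REPO_LICENSES = {
    "Cog-Creators/Red-DiscordBot": "GPLv3",
    "adrienverge/yamllint": "GPLv3",
    "chardet/chardet": "LGPL",
    "paramiko/paramiko": "LGPL",
    "pylint-dev/astroid": "LGPL",
    "Knio/dominate": "LGPL",
    "example/permissive-repo": "MIT",  # hypothetical, for contrast
}

flagged = sorted(repo for repo, lic in REPO_LICENSES.items() if is_copyleft(lic))
print(flagged)
```

Whether LGPL repos should be excluded or merely annotated is a policy decision; the sketch flags them so the choice is made explicitly.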
**`datasets.load_dataset("R2E-Gym/R2E-Gym-V1")` (Apache-2.0 toolkit, mixed upstream licenses):**
- 8,135 tasks, SweGen-synthesized. Includes synthesized tests for commits without existing tests.
- **Key advantage over SWE-smith for this repo's "point at any repo" vision:** SweGen works from commits, not PRs, and synthesizes tests when none exist. This is the mechanism that unlocks arbitrary repos.
- Pre-built Docker environments. Direct integration with OpenHands scaffold.
**What needs to be built (not available as pip install):**
1. **Coverage-guided deletion target selection (research/06 Path B).** No existing library does coverage-selectivity-based AST deletion. `libcst` for AST manipulation is available but the targeting logic (which function to delete based on test coverage and selectivity) is novel. This is the "hard part" for arbitrary-repo synthesis without prior substrates.
2. **The online difficulty curriculum.** `DifficultyCurriculum` in `composer_replication/datagen/curriculum.py` is correctly identified as needing to be built. SWE-smith/R2E-Gym/SWE-Gym do NOT provide this.
3. **Anti-reward-hacking sandbox.** The `LocalSubprocessSandbox` + `SANDBOX_DENYLIST` + `HackMonitor` in the repo are custom implementations. No OSS library provides this. SWE-MiniSandbox (arXiv:2602.11210) provides container-free isolation but not the semantic hack detection.
4. **TRL `reward_fn` adapter wired to test execution.** Not provided by any of the surveyed toolkits (SkyRL has its own non-TRL RL stack; SWE-smith's GRPO support is via SkyRL, not TRL GRPOTrainer).
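Item 4 is mostly glue code. The sketch below assumes TRL's custom-reward convention for `GRPOTrainer`, where a reward function receives `completions` plus extra dataset columns as keyword arguments and returns one float per completion. The column names (`instance_id`, `FAIL_TO_PASS`, `PASS_TO_PASS`), the `run_tests` callable, and the binary reward are assumptions for illustration, not the repo's actual adapter.

```python
from typing import Callable, Dict, List

# Hypothetical sandbox hook: (instance_id, patch) -> {test_name: passed}.
TestRunner = Callable[[str, str], Dict[str, bool]]


def make_test_reward(run_tests: TestRunner):
    """Build a reward function that scores each completion by test execution."""

    def reward_fn(
        completions: List[str],
        instance_id: List[str],
        FAIL_TO_PASS: List[List[str]],
        PASS_TO_PASS: List[List[str]],
        **kwargs,  # absorbs `prompts` and any other dataset columns
    ) -> List[float]:
        rewards = []
        for patch, iid, f2p, p2p in zip(
            completions, instance_id, FAIL_TO_PASS, PASS_TO_PASS
        ):
            status = run_tests(iid, patch)
            resolved = all(status.get(name, False) for name in f2p + p2p)
            rewards.append(1.0 if resolved else 0.0)  # binary, test-verified
        return rewards

    return reward_fn


def fake_runner(instance_id: str, patch: str) -> Dict[str, bool]:
    """Stand-in for a real sandbox: 'passes' only if the patch mentions a fix."""
    fixed = "fix" in patch
    return {"test_feature": fixed, "test_other": True}


reward_fn = make_test_reward(fake_runner)
scores = reward_fn(
    completions=["apply fix", "no-op"],
    instance_id=["demo-1", "demo-1"],
    FAIL_TO_PASS=[["test_feature"], ["test_feature"]],
    PASS_TO_PASS=[["test_other"], ["test_other"]],
    prompts=["p", "p"],
)
print(scores)
```

Injecting the runner keeps the reward function independent of the sandbox choice, so the same adapter could sit in front of a Docker-based or a container-free backend.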
### 7.2 Integration architecture recommendation
```
[Adopt] swesmith toolkit → environment construction for new repos
[Adopt] SWE-smith dataset → 59k pre-built Feature-Deletion tasks (via SweBenchAdapter)
[Adopt] R2E-Gym-Subset → 4.6k tasks with SweGen synthetic tests
[Adopt] SWE-Gym → 2.4k tasks, clean train split, per-task Docker images
[Adopt] SWE-rebench → scale (21k) + difficulty priors (cold-start p̂)
[Build] DifficultyCurriculum (unique to this recipe)
[Build] LocalSubprocessSandbox + HackMonitor (unique to this recipe)
[Build] TRL reward_fn adapter (SkyRL uses veRL, not TRL)
[Build] Coverage-guided Path B synthesis (unique, unlocks arbitrary repos)
[Consider] SWE-MiniSandbox for container-free RL at scale (2026, 5% disk overhead)
```
The current `substrates.py` correctly handles the [Adopt] paths. What is missing is the `swesmith` integration for new-repo synthesis (the "25x more synthetic tasks" vision requires going beyond existing substrates).
---
## 8. Critical Flags Summary
| Flag | Severity | Location | Issue |
|---|---|---|---|
| **Missing: swesmith API exists** | HIGH | ADR-010 Option A | `pip install swesmith` provides what Option A describes building. The ADR does not evaluate it as a dependency. |
| **Missing: SkyRL+SWE-smith = working GRPO stack** | HIGH | ADR-010 §Decision | SkyRL (MIT) + SWE-smith already implements GRPO on SWE tasks. The repo's TRL `reward_fn` reinvents this without acknowledging the prior art. |
| **Missing: SweGen synthesizes tests for commits without tests** | HIGH | research/06 §4, ADR-010 | R2E-Gym's test synthesis approach (commits without tests → LLM synthesizes F2P tests, equivalent to real tests) is the key mechanism for "point at any repo" synthesis. Not discussed. |
| **Overclaim: FAIL_TO_PASS 'guaranteed'** | MEDIUM | ADR-010 §Decision Drivers | The guarantee requires running tests; it is not inherited from the schema. SWE-smith validation explicitly filters out candidates with no test coverage. |
| **Overclaim: Path B (coverage-AST deletion) is built** | MEDIUM | ADR-010 capability list | Path B is `[EXTRAPOLATED]` in research/06 but the ADR implies it's implemented. It is not; `substrates.py` only does Path A (gold-patch reversion). |
| **Missing: SWE-MiniSandbox (2026)** | MEDIUM | ADR-010 §Negative consequences | Container-free RL at 5% disk / 25% setup overhead addresses the ADR's Docker cost concern. Published Feb 2026, after ADR-010. |
| **Missing: DeepSWE as validation** | LOW | research/06 | Provides evidence that GRPO + R2E-Gym = 42% Pass@1 (the core recipe) works. Not cited. |
| **Correct: PR Mirror ≈ Feature Deletion** | CONFIRM | research/06 §5 Path A | SWE-smith ablations support the repo's core approach. The best SWE-smith training data (PR Mirror) is closely analogous to gold-patch-reversion Feature Deletion. |
| **Correct: SWE-smith costs ~$1360 for 50k tasks** | CONFIRM | ADR-010 §7 cost model | Cost model in research/06 is consistent with SWE-smith's reported $1360 for 50k tasks ($0.027/task vs research/06's estimate of $0.02-$0.10/task). |
| **Correct: ADR-010 OPEN items are honest** | CONFIRM | ADR-010 post-review | The "gate 2 does not prove reachability" open item correctly identifies a gap that exists across ALL surveyed work (SWE-smith, R2E-Gym, SWE-Gym all have this gap). |
---
## 9. Key Numbers for Architecture Reference
| Paper | Tasks | Repos | Env Size | Oracle source | Test-exec? | License |
|---|---|---|---|---|---|---|
| SWE-bench | 2,294 eval / 19,008 train | 12 | per-instance images (large) | Human PRs | eval only | CC-BY-4.0 |
| SWE-Gym | 2,438 | 11 | 6 TB | Human PRs | YES | MIT (tooling) |
| R2E-Gym | 8,135 (4,578 subset) | 13 | 4 TB | Commits + LLM tests | YES | Apache-2.0 |
| SWE-smith | 50,137 (59,136 HF) | 128 | 295 GB | LM/AST bug injection | YES | MIT (toolkit) |
| SWE-rebench | ~21k | 3,468 | per-instance | Human PRs (automated) | YES | CC-BY-4.0 |
| SWE-fixer | 115k | 856 | none | Human PRs | NO | - |
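SWE-smith stores 50k tasks in 295 GB (~6 MB per task) while SWE-Gym needs 6 TB for 2.4k tasks (~2.5 GB per task): a roughly 400x per-task saving from the per-repo Docker image architecture, the same order of magnitude as the up-to-500x estimate in §1. This is the right architecture for any new synthesis beyond existing substrates.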
---
## 10. Sources
1. **SWE-smith HTML full text:** `research/notes/bugs-scaling-data-for-software-engineering-agents.md` (26,442 words, arxiv.org/html/2504.21798)
2. **SWE-Gym HTML full text:** `research/notes/training-software-engineering-agents-and-verifiers-with-swe-gym.md` (10,865 words, arxiv.org/html/2412.21139)
3. **R2E-Gym HTML full text:** `research/notes/r2e-gym-scaling-open-weights-software-engineering-agents-with-procedural-synthet.md` (14,810 words, arxiv.org/html/2504.07164)
4. **SWE-bench abstract:** `research/notes/231006770-swe-bench-can-language-models-resolve-real-world-github-issues.md`
5. **SWE-smith GitHub README:** `research/notes/github-swe-benchswe-smith-neurips-2025-db-spotlight-scaling-data-for-swe-agents.md`
6. **SWE-smith HF dataset page:** `research/notes/swe-benchswe-smith-datasets-at-hugging-face.md`
7. **SkyRL GitHub:** `research/notes/github-novasky-aiskyrl-skyrl-a-modular-full-stack-rl-library-for-llms-github.md`
8. **SWE-RL abstract:** `research/notes/250218449-swe-rl-advancing-llm-reasoning-via-reinforcement-learning-on-open-soft.md`
9. **SWE-MiniSandbox abstract:** `research/notes/260211210-swe-minisandbox-container-free-reinforcement-learning-for-building-sof.md`
10. **DeepSWE blog:** `research/notes/deepswe-training-a-fully-open-sourced-state-of-the-art-coding-agent-by-scaling-r.md`
11. **SWE-rebench infrastructure blog:** `research/notes/behind-swe-rebench-infrastructure-to-collect-massive-datasets-of-swe-tasks-and-e.md`
12. **Repo artifacts:** `composer_replication/datagen/substrates.py`, `research/06-feature-deletion-datagen.md`, `docs/adrs/ADR-010-feature-deletion-datagen.md`