|
|
--- |
|
|
title: README |
|
|
emoji: 💻
|
|
colorFrom: indigo |
|
|
colorTo: purple |
|
|
sdk: static |
|
|
pinned: true |
|
|
thumbnail: >- |
|
|
https://cdn-uploads.huggingface.co/production/uploads/60cc389a0844fb1605fef405/CRHpoi7_GxVx7DhVCVK5e.png |
|
|
--- |
|
|
|
|
|
<h1 align="center"> R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents </h1> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://naman-ntc.github.io/" style="text-decoration: none;">Naman Jain<sup>*,1</sup></a>, |
|
|
<a href="https://1jsingh.github.io/" style="text-decoration: none;">Jaskirat Singh<sup>*,2</sup></a>, |
|
|
<a href="https://manishs.org/" style="text-decoration: none;">Manish Shetty<sup>1</sup></a>, |
|
|
<a href="https://scholar.google.com/citations?user=vNHqr3oAAAAJ&hl=en" style="text-decoration: none;">Liang Zheng<sup>2</sup></a>, |
|
|
<a href="https://scholar.google.com/citations?user=Vn3L_ioAAAAJ&hl=en" style="text-decoration: none;">Koushik Sen<sup>1</sup></a>, |
|
|
<a href="https://scholar.google.com/citations?user=vN-is70AAAAJ&hl=en" style="text-decoration: none;">Ion Stoica<sup>1</sup></a> |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<sup>1</sup>UC Berkeley, <sup>2</sup>ANU <br/>
|
|
<sub><sup>*</sup>Equal contribution</sub>
|
|
</p> |
|
|
|
|
|
<!-- paper . data and models . project page --> |
|
|
<p align="center"> |
|
|
<a href="https://github.com/R2E-Gym/R2E-Gym">π» Code </a> |
|
|
β’ |
|
|
<a href="./docs/paper.pdf">π Paper</a> |
|
|
β’ |
|
|
<a href="https://huggingface.co/R2E-Gym" >π€ Data & Models</a> |
|
|
β’ |
|
|
<!-- project page --> |
|
|
<a href="https://r2e-gym.github.io/" >π Project Page</a> |
|
|
</p> |
|
|
|
|
|
--- |
|
|
|
|
|
We present **R2E-Gym**, the largest procedurally curated environment for training real-world SWE-Agents. |
|
|
We show that R2E-Gym enables more scalable training and test-time scaling, achieving **51% on the SWE-Bench Verified benchmark**, a new state of the art for open-weight SWE-agents that is, for the first time, competitive with proprietary models such as o1 and sonnet-3.5-v2 with tools.
|
|
|
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://github.com/R2E-Gym/R2E-Gym/raw/main/assets/docs-teaser-v1.png" width="100%" alt="teaser"> |
|
|
</p> |
|
|
<p align="left"> |
|
|
<!-- <em> --> |
|
|
<!-- <small> --> |
|
|
<b>R2E-Gym</b> is powered by two main contributions: (a) <b>SWE-GEN: a synthetic data curation recipe</b> for curating executable training environments without relying on human-written tests and issues; and (b) <b>hybrid inference-time scaling</b>: while both execution-based and execution-free verifiers elicit inference-time gains, significantly better performance can be achieved by leveraging the strengths of both. (c) Overall, the final approach achieves <b>SOTA performance among open-weight SWE-agents</b>, while also remaining competitive with some proprietary model baselines.
|
|
<!-- </small> --> |
|
|
<!-- </em> --> |
|
|
</p> |
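To make the hybrid-verifier idea concrete, here is a minimal, purely illustrative Python sketch of best-of-k patch selection that combines an execution-based signal (e.g., fraction of regression tests passed) with an execution-free score (e.g., from a learned verifier model). All names, fields, and the linear weighting `alpha` are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch: rank candidate patches by combining an execution-based
# signal with an execution-free verifier score. The weighting scheme and all
# field names are illustrative assumptions, not R2E-Gym's actual code.

def hybrid_score(exec_pass_rate: float, verifier_score: float, alpha: float = 0.5) -> float:
    """Linearly combine execution-based and execution-free signals."""
    return alpha * exec_pass_rate + (1 - alpha) * verifier_score

def select_best(candidates: list[dict]) -> dict:
    """Best-of-k selection: pick the candidate with the highest hybrid score."""
    return max(candidates, key=lambda c: hybrid_score(c["exec"], c["verifier"]))

candidates = [
    {"patch": "patch_a", "exec": 1.0, "verifier": 0.40},  # passes all tests, low model score
    {"patch": "patch_b", "exec": 0.5, "verifier": 0.80},  # partial pass, high model score
    {"patch": "patch_c", "exec": 0.0, "verifier": 0.95},  # fails tests entirely
]
best = select_best(candidates)
print(best["patch"])  # prints "patch_a" (hybrid score 0.70 vs 0.65 and 0.475)
```

The point of the sketch is that neither signal alone suffices: ranking by `verifier` alone would pick `patch_c`, which fails every test, while the hybrid score lets execution evidence and the learned verifier correct each other.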
|
|
|
|
|
--- |
|
|
|
|
|
<!-- ## Synthetic Data Enables Scalable Training |
|
|
|
|
|
We propose SWE-GEN β a novel synthetic data curation recipe that enables collection of a large number of executable training environments without reliance on human-written pull requests (PRs) or unit tests. We show that instead of using human-written PRs, good-quality execution environments can directly be curated from *commits*. |
|
|
Compared to PR-based data collection (SWE-Gym), this approach enables more scalable data curation and agent-training, resulting in a SOTA pass@1 performance of 34.4% on the challenging SWE-Bench Verified benchmark. |
|
|
|
|
|
<img src="https://github.com/R2E-Gym/R2E-Gym/raw/main/docs/docs-training-v1.png" alt="Synthetic Data Enables Scalable Training" width="80%"> |
|
|
|
|
|
## Hybrid Test-time Scaling |
|
|
|
|
|
We also propose Hybrid Test-time Scaling, a novel approach for scaling SWE-Agents at test-time. We show that while both execution-based and execution-free verifiers elicit inference-time gains; significantly better performance can be achieved by leveraging the strengths of both. |
|
|
|
|
|
<img src="https://github.com/R2E-Gym/R2E-Gym/raw/main/docs/bestk_plot_agent_nopass.png" alt="Hybrid Test-time Scaling" width="80%"> |
|
|
--> |
|
|
|
|
|
## Usage and Training |
|
|
|
|
|
Please refer to our [GitHub Repo](https://github.com/R2E-Gym/R2E-Gym) for detailed notes on gym environment usage, training, inference, and executable SWE environment generation.
|
|
|
|
|
## 📝 Citation
|
|
|
|
|
```bibtex |
|
|
@misc{jain2025r2e-gym, |
|
|
title={R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents}, |
|
|
author={Jain, Naman and Singh, Jaskirat and Shetty, Manish and Zheng, Liang and Sen, Koushik and Stoica, Ion},
|
|
year={2025}, |
|
|
eprint={xxx.xxxx}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.SE}, |
|
|
url={https://arxiv.org/abs/xxx.xxxx}, |
|
|
} |
|
|
``` |