stargazerzj committed
Commit 6c9857f · verified · 1 Parent(s): 3350b90

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +24 -3
README.md CHANGED
@@ -18,7 +18,7 @@ library_name: transformers
18
 
19
  <div align="center">
20
 
21
- [![Paper](https://img.shields.io/badge/Paper-PDF-1f6feb.svg)](https://github.com/GAIR-NLP/daVinci-Dev/daVinci-Dev.pdf)
22
  [![arXiv](https://img.shields.io/badge/arXiv-Coming_Soon-b31b1b.svg)](https://arxiv.org/pdf/)
23
  [![GitHub](https://img.shields.io/badge/GitHub-Repository-green)](https://github.com/GAIR-NLP/daVinci-Dev)
24
  [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/GAIR/daVinci-Dev)
@@ -37,9 +37,11 @@ library_name: transformers
37
  - [Key Results](#key-results)
38
  - [Model Zoo](#model-zoo)
39
  - [Datasets](#datasets)
 
40
  - [Quick Start](#quick-start)
41
  - [Training](#training)
42
  - [Evaluation](#evaluation)
 
43
  - [Citation](#citation)
44
 
45
  ## Overview
@@ -50,8 +52,8 @@ This work presents a systematic study of **agentic mid-training** and introduces
50
 
51
  Our training uses two complementary trajectory types (details in the paper):
52
 
53
- - **Contextually-native trajectories (PR-derived):** preserve the full information flow by bundling file discovery/context retrieval together with sequential edits. This provides broad coverage and diversity.
54
- - **Environmentally-native trajectories (executable rollouts):** collected from real executable repositories with genuine tool/test outputs, capturing authentic feedback loops.
55
 
56
  Resources (open-source / open-release):
57
 
@@ -90,6 +92,14 @@ We will open-source our datasets through Hugging Face:
90
  |---------|-------------|------|
91
  | `daVinci-Dev` | Agent-native data used in our training recipe (as permitted) | https://huggingface.co/datasets/GAIR/daVinci-Dev |
92
 
 
 
 
 
 
 
 
 
93
  ## Quick Start
94
 
95
  These checkpoints are intended to be used inside the [SWE-Agent](https://github.com/SWE-agent/SWE-agent) scaffold. They are also compatible with standard inference frameworks.
@@ -180,6 +190,17 @@ This section summarizes the methodology described in the paper.
180
 
181
  We report performance on **SWE-Bench Verified** using **SWE-Agent** with the setup described in the paper (including temperature 0, 128k context, and a 100-step budget). Results are reported as Pass@1 (averaged across 4 runs).
182
 
 
 
 
 
 
 
 
 
 
 
 
183
  ## Citation
184
 
185
  ArXiv link and the official citation block are coming soon (the manuscript is under review at the time of release).
 
18
 
19
  <div align="center">
20
 
21
+ [![Paper](https://img.shields.io/badge/Paper-PDF-1f6feb.svg)](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/daVinci-Dev.pdf)
22
  [![arXiv](https://img.shields.io/badge/arXiv-Coming_Soon-b31b1b.svg)](https://arxiv.org/pdf/)
23
  [![GitHub](https://img.shields.io/badge/GitHub-Repository-green)](https://github.com/GAIR-NLP/daVinci-Dev)
24
  [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/GAIR/daVinci-Dev)
 
37
  - [Key Results](#key-results)
38
  - [Model Zoo](#model-zoo)
39
  - [Datasets](#datasets)
40
+ - [Pipeline](#pipeline)
41
  - [Quick Start](#quick-start)
42
  - [Training](#training)
43
  - [Evaluation](#evaluation)
44
+ - [License](#license)
45
  - [Citation](#citation)
46
 
47
  ## Overview
 
52
 
53
  Our training uses two complementary trajectory types (details in the paper):
54
 
55
+ - **Contextually-native trajectories $\mathcal{D}^{\text{ctx}}_{\text{py}}$ (PR-derived):** preserve the full information flow by bundling file discovery/context retrieval together with sequential edits. This provides broad coverage and diversity.
56
+ - **Environmentally-native trajectories $\mathcal{D}^{\text{env}}_{\text{pass}}$ (executable rollouts):** collected from real executable repositories with genuine tool/test outputs, capturing authentic feedback loops.
57
 
58
  Resources (open-source / open-release):
59
 
 
92
  |---------|-------------|------|
93
  | `daVinci-Dev` | Agent-native data used in our training recipe (as permitted) | https://huggingface.co/datasets/GAIR/daVinci-Dev |
94
 
95
+ ## Pipeline
96
+
97
+ The GitHub repository contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$.
98
+
99
+ | Pipeline | Description | Link |
100
+ |----------|---------|-------------|
101
+ | daVinci-Dev Pipeline | High-performance pipeline used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$ from GitHub PR data | [`GAIR-NLP/daVinci-Dev`](https://github.com/GAIR-NLP/daVinci-Dev) |
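
As a rough illustration of what a structured PR representation might look like, the sketch below assembles one from two GitHub REST API payloads: the pull request object and its changed-files list. The `PRRecord` and `FileEdit` shapes and the `build_pr_record` helper are hypothetical and are not taken from the daVinci-Dev pipeline; only the JSON field names (`number`, `title`, `body`, `filename`, `status`, `patch`) follow the public GitHub API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FileEdit:
    """One changed file in a PR (hypothetical shape, for illustration only)."""
    path: str
    status: str  # e.g. "added", "modified", "removed"
    patch: str   # unified-diff text; empty when the API omits it (e.g. binary files)


@dataclass
class PRRecord:
    """A PR bundled with its edits (hypothetical shape, for illustration only)."""
    number: int
    title: str
    description: str
    edits: list[FileEdit] = field(default_factory=list)


def build_pr_record(pr: dict, files: list[dict]) -> PRRecord:
    """Combine a pull-request payload and its files payload into one record."""
    edits = [
        FileEdit(path=f["filename"], status=f["status"], patch=f.get("patch", ""))
        for f in files
    ]
    return PRRecord(
        number=pr["number"],
        title=pr["title"],
        description=pr.get("body") or "",  # the API returns null for an empty body
        edits=edits,
    )
```

The function takes already-fetched JSON rather than making network calls, so it can be exercised offline; the actual pipeline may store different or additional fields.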
102
+
103
  ## Quick Start
104
 
105
  These checkpoints are intended to be used inside the [SWE-Agent](https://github.com/SWE-agent/SWE-agent) scaffold. They are also compatible with standard inference frameworks.
 
190
 
191
  We report performance on **SWE-Bench Verified** using **SWE-Agent** with the setup described in the paper (including temperature 0, 128k context, and a 100-step budget). Results are reported as Pass@1 (averaged across 4 runs).
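
For readers who want to reproduce the aggregation, the sketch below shows one plausible reading of "Pass@1 (averaged across 4 runs)": compute the resolve rate of each run, then take the mean over runs. The `mean_pass_at_1` helper and the toy outcomes are assumptions for illustration; the paper's exact evaluation script is not reproduced here.

```python
def mean_pass_at_1(runs: list[list[bool]]) -> float:
    """Average the per-run resolve rate over independent runs.

    Each inner list holds one run's outcomes, one bool per benchmark
    instance (True = the single attempt resolved the instance).
    """
    if not runs or any(len(run) == 0 for run in runs):
        raise ValueError("need at least one non-empty run")
    per_run = [sum(run) / len(run) for run in runs]
    return sum(per_run) / len(per_run)


# Toy data: 4 runs over 2 instances (made-up outcomes, not real results).
toy_runs = [
    [True, False],
    [True, True],
    [False, False],
    [True, False],
]
score = mean_pass_at_1(toy_runs)
```

When every run covers the same set of instances, this equals the fraction of resolved attempts pooled over all runs.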
192
 
193
+ ## License
194
+
195
+ This project is a **mixed-license** release; each component carries its own terms:
196
+
197
+ - **Contextually-native PR-derived subset:** only PRs from repositories detected as having a **permissive license** are included. Each repo’s license is provided in `./ctx-native/filtered_repos/part-0000.parquet`.
198
+ - **Environmentally-native subset:** derived from [**SWE-rebench**](https://huggingface.co/datasets/nebius/SWE-rebench), licensed under **CC-BY-4.0**.
199
+ - **daVinci-Dev models:** released under the [Qwen](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE) license. Users should verify the licensing status of any generated code before using it in production.
200
+ - **daVinci-Dev pipeline:** released under the [Apache-2.0](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/LICENSE) license.
201
+
202
+ Users are responsible for ensuring their downstream usage complies with the licenses of the underlying sources.
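
To audit the per-repository licenses mentioned above, something like the following sketch could tally them. The column name `license` is an assumption; the actual schema of `part-0000.parquet` is not documented here and should be checked before use.

```python
from collections import Counter
from typing import Iterable, Mapping


def license_counts(rows: Iterable[Mapping[str, str]], column: str = "license") -> Counter:
    """Count how many repositories carry each license identifier.

    `column` is a hypothetical column name; adjust it to the real schema.
    """
    return Counter(row[column] for row in rows)


# Toy rows standing in for records loaded from the parquet file.
toy_rows = [
    {"repo": "example/a", "license": "MIT"},
    {"repo": "example/b", "license": "Apache-2.0"},
    {"repo": "example/c", "license": "MIT"},
]
counts = license_counts(toy_rows)
```

With pandas and a parquet engine installed, `pandas.read_parquet("./ctx-native/filtered_repos/part-0000.parquet").to_dict("records")` would produce rows in this form.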
203
+
204
  ## Citation
205
 
206
  ArXiv link and the official citation block are coming soon (the manuscript is under review at the time of release).