---
license: other
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
language:
- en
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-72B
tags:
- chat
library_name: transformers
---

<div style="display: flex; justify-content: center; align-items: center; gap: 20px; margin-bottom: 10px">
  <img src="assets/sii.png" alt="SII" width="100px">
  <img src="assets/GAIR_Logo2.png" alt="ASI" width="100px">
</div>

<div align="center">

[![Paper](https://img.shields.io/badge/Paper-PDF-1f6feb.svg)](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/daVinci-Dev.pdf)
[![arXiv](https://img.shields.io/badge/arXiv-2601.18418-b31b1b.svg)](https://arxiv.org/pdf/2601.18418)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-green)](https://github.com/GAIR-NLP/daVinci-Dev)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/GAIR/daVinci-Dev)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/GAIR/daVinci-Dev-72B)

</div>

<h1 align="center">daVinci-Dev: Agent-native Mid-training for Software Engineering</h1>

<div align="center">
  <img src="assets/teaser.png" width="100%" />
</div>

## Table of Contents

- [Overview](#overview)
- [Key Results](#key-results)
- [Model Zoo](#model-zoo)
- [Datasets](#datasets)
- [Pipeline](#pipeline)
- [Quick Start](#quick-start)
- [Training](#training)
- [Evaluation](#evaluation)
- [License](#license)
- [Citation](#citation)

## Overview

`daVinci-Dev` is a family of large language models trained for **agentic software engineering**.

This work presents a systematic study of **agentic mid-training** and introduces **agent-native data** to reduce the distribution mismatch between static pretraining corpora and the dynamic, feedback-rich environments faced by real code agents.

Our training uses two complementary trajectory types (details in the paper):

- **Contextually-native trajectories \\(\mathcal{D}^{\text{ctx}}_{\text{py}}\\) (PR-derived):** preserve the full information flow by bundling file discovery/context retrieval together with sequential edits. This provides broad coverage and diversity.
- **Environmentally-native trajectories \\(\mathcal{D}^{\text{env}}_{\text{pass}}\\) (executable rollouts):** collected from real executable repositories with genuine tool/test outputs, capturing authentic feedback loops.

Resources (open-source / open-release):

- Paper + data processing pipeline: https://github.com/GAIR-NLP/daVinci-Dev
- Dataset: https://huggingface.co/datasets/GAIR/daVinci-Dev

## Key Results

### SWE-Bench Verified

We achieve state-of-the-art results among open training recipes that use agentic scaffolds at their respective model sizes, despite starting from the **non-coder** `Qwen2.5-Base` family.

| Model | SWE-Bench Verified (Pass@1) | Notes |
|------|------------------------------|------|
| `daVinci-Dev-72B` | **58.5%** | Agent-native MT + SFT |
| `daVinci-Dev-32B` | **56.1%** | Agent-native MT + SFT |

**Generalization gains:** improvements are also observed on standard code benchmarks (e.g., HumanEval/EvalPlus) and scientific reasoning benchmarks (e.g., GPQA/SciBench) as reported in the paper.

## Model Zoo

We will open-source model checkpoints on Hugging Face:

| Model | Description | Link |
|------|-------------|------|
| `daVinci-Dev-72B` | Final model (agent-native mid-training + env native SFT) | https://huggingface.co/GAIR/daVinci-Dev-72B |
| `daVinci-Dev-32B` | Final model (agent-native mid-training + env native SFT) | https://huggingface.co/GAIR/daVinci-Dev-32B |
| `daVinci-Dev-72B-MT` | **MT checkpoint** (after agent-native mid-training, **before SFT**) | https://huggingface.co/GAIR/daVinci-Dev-72B-MT |
| `daVinci-Dev-32B-MT` | **MT checkpoint** (after agent-native mid-training, **before SFT**) | https://huggingface.co/GAIR/daVinci-Dev-32B-MT |

## Datasets

We will open-source our datasets on Hugging Face:

| Dataset | Description | Link |
|---------|-------------|------|
| `daVinci-Dev` | Agent-native data used in our training recipe (as permitted) | https://huggingface.co/datasets/GAIR/daVinci-Dev |

## Pipeline

The GitHub repository contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build \\(\mathcal{D}^{\text{ctx}}_{\text{py}}\\).

| Pipeline | Description | Link |
|----------|---------|-------------|
| daVinci-Dev Pipeline | A high-performance pipeline used to build \\(\mathcal{D}^{\text{ctx}}_{\text{py}}\\) | [`GAIR-NLP/daVinci-Dev`](https://github.com/GAIR-NLP/daVinci-Dev) |

## Quick Start

These checkpoints are intended to be used inside the [SWE-Agent](https://github.com/SWE-agent/SWE-agent) scaffold. They are also compatible with standard inference frameworks.

<details>
<summary>Start with HF Transformers</summary>

```bash
pip install transformers torch
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "GAIR/daVinci-Dev-72B"  # or any checkpoint in the model zoo

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "system",
        "content": (
            "You are a software engineering agent. "
            "When solving tasks, reason about the repo structure, propose minimal edits, "
            "and describe how you would validate with tests."
        ),
    },
    {"role": "user", "content": "Bug: tests fail when X. Please fix it."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.2,
        do_sample=True,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

</details>

<details>
<summary>Start with vLLM</summary>

```bash
pip install vllm
```

```bash
python -m vllm.entrypoints.openai.api_server \
  --model GAIR/daVinci-Dev-72B \
  --tensor-parallel-size 8 \
  --max-model-len 131072
```

</details>
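
Once the vLLM server above is running, it exposes an OpenAI-compatible `/v1/chat/completions` endpoint. A minimal sketch of a client request using only the Python standard library (the host/port default and the `build_chat_request` helper are illustrative assumptions, not part of this release):

```python
import json
import urllib.request

def build_chat_request(model: str, messages: list) -> dict:
    """Build an OpenAI-style chat-completions payload for the vLLM server."""
    return {
        "model": model,
        "messages": messages,
        "temperature": 0.0,  # the paper's evaluation setup uses temperature 0
        "max_tokens": 2048,
    }

payload = build_chat_request(
    "GAIR/daVinci-Dev-72B",
    [{"role": "user", "content": "Bug: tests fail when X. Please fix it."}],
)

# Send to the server started above (vLLM's default port is 8000).
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The same endpoint can also be driven by any OpenAI-compatible client library, which is how agent scaffolds such as SWE-Agent typically connect to a served model.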

## Training

This section summarizes the methodology described in the paper.

### Data

- **Contextually-native PR trajectories:** **68.6B tokens** (constructed from GitHub pull requests, preserving the coupling between context retrieval and edits).
- **Environmentally-native executable trajectories:** **3.1B raw tokens** (**4.5B effective tokens**), collected by running an agent in real executable environments with tool and test feedback. Trajectories include both test-passing and non-passing rollouts.

### Recipe (high level)

- Start from the `Qwen2.5` base model family (32B / 72B).
- Perform **agent-native mid-training** on PR-derived trajectories (and optionally mixed with executable trajectories).
- Perform **SFT** on the **test-passing** subset of environmentally-native trajectories.

`-MT` checkpoints correspond to the state **after mid-training and before SFT**.
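
The SFT stage above keeps only trajectories whose rollout passed the repository's tests. A sketch of that filtering step on toy records (the `passed` flag and record layout are hypothetical, not the released dataset schema):

```python
def select_sft_subset(trajectories: list) -> list:
    """Keep only test-passing rollouts for the SFT stage.

    Assumes each record carries a boolean `passed` flag derived from the
    environment's test feedback; the real dataset schema may differ.
    """
    return [t for t in trajectories if t.get("passed")]

rollouts = [
    {"id": "r1", "passed": True},
    {"id": "r2", "passed": False},  # non-passing rollouts are used in mid-training only
    {"id": "r3", "passed": True},
]

sft_data = select_sft_subset(rollouts)
print([t["id"] for t in sft_data])  # → ['r1', 'r3']
```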

## Evaluation

We report performance on **SWE-Bench Verified** using **SWE-Agent** with the setup described in the paper (including temperature 0, 128k context, and a 100-step budget). Results are reported as Pass@1 (averaged across 4 runs).
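
The reported metric, Pass@1 averaged across 4 runs, is the mean of the per-run resolve rates. A minimal sketch with made-up numbers (illustrative only, not the actual results):

```python
def pass_at_1(runs: list) -> float:
    """Average the per-run resolve rate (Pass@1) over independent runs.

    Each inner list holds one run's per-instance resolved flags.
    """
    per_run = [sum(run) / len(run) for run in runs]
    return sum(per_run) / len(per_run)

# Toy example: 4 runs over 4 instances each (illustrative numbers only).
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
    [True, False, True, True],
]
print(f"{pass_at_1(runs):.3f}")  # → 0.750
```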

## License

This project is a **mixed** release:

- **Contextually-native PR-derived subset:** only PRs from repositories detected as having a **permissive license** are included. Each repo’s license is provided in `./ctx-native/filtered_repos/part-0000.parquet`.
- **Environmentally-native subset:** derived from [**SWE-rebench**](https://huggingface.co/datasets/nebius/SWE-rebench), licensed under **CC-BY-4.0**.
- **daVinci-Dev models:** released under [Qwen](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE) license. Users should verify the licensing status of any generated code before using it in production.
- **daVinci-Dev pipeline:** released under the [Apache-2.0](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/LICENSE) license.

Users are responsible for ensuring their downstream usage complies with the licenses of the underlying sources.

## Citation

If you use this work, please cite the daVinci-Dev paper.

```bibtex
@misc{zeng2026davincidevagentnativemidtrainingsoftware,
      title={daVinci-Dev: Agent-native Mid-training for Software Engineering},
      author={Ji Zeng and Dayuan Fu and Tiantian Mi and Yumin Zhuang and Yaxing Huang and Xuefeng Li and Lyumanshan Ye and Muhang Xie and Qishuo Hua and Zhen Huang and Mohan Jiang and Hanning Wang and Jifan Lin and Yang Xiao and Jie Sun and Yunze Wu and Pengfei Liu},
      year={2026},
      eprint={2601.18418},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2601.18418},
}
```