GAIR/daVinci-Dev-72B

@@ -18,7 +18,7 @@ library_name: transformers
 <div align="center">
-[![Paper](https://img.shields.io/badge/Paper-PDF-1f6feb.svg)](https://github.com/GAIR-NLP/daVinci-Dev/daVinci-Dev.pdf)
 [![arXiv](https://img.shields.io/badge/arXiv-Coming_Soon-b31b1b.svg)](https://arxiv.org/pdf/)
 [![GitHub](https://img.shields.io/badge/GitHub-Repository-green)](https://github.com/GAIR-NLP/daVinci-Dev)
 [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/GAIR/daVinci-Dev)
@@ -37,9 +37,11 @@ library_name: transformers
 - [Key Results](#key-results)
 - [Model Zoo](#model-zoo)
 - [Datasets](#datasets)
 - [Quick Start](#quick-start)
 - [Training](#training)
 - [Evaluation](#evaluation)
 - [Citation](#citation)
 ## Overview
@@ -50,8 +52,8 @@ This work presents a systematic study of **agentic mid-training** and introduces
 Our training uses two complementary trajectory types (details in the paper):
-- **Contextually-native trajectories (PR-derived):** preserve the full information flow by bundling file discovery/context retrieval together with sequential edits. This provides broad coverage and diversity.
-- **Environmentally-native trajectories (executable rollouts):** collected from real executable repositories with genuine tool/test outputs, capturing authentic feedback loops.
 Resources (open-source / open-release):
@@ -90,6 +92,14 @@ We will open-source our datasets through Hugging Face:
 |---------|-------------|------|
 | `daVinci-Dev` | Agent-native data used in our training recipe (as permitted) | https://huggingface.co/datasets/GAIR/daVinci-Dev |
 ## Quick Start
 These checkpoints are intended to be used inside the [SWE-Agent](https://github.com/SWE-agent/SWE-agent) scaffold. They are also compatible with standard inference frameworks.
@@ -180,6 +190,17 @@ This section summarizes the methodology described in the paper.
 We report performance on **SWE-Bench Verified** using **SWE-Agent** with the setup described in the paper (including temperature 0, 128k context, and a 100-step budget). Results are reported as Pass@1 (averaged across 4 runs).
 ## Citation
 ArXiv link and the official citation block are coming soon (the manuscript is under review at the time of release).

 <div align="center">
+[![Paper](https://img.shields.io/badge/Paper-PDF-1f6feb.svg)](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/daVinci-Dev.pdf)
 [![arXiv](https://img.shields.io/badge/arXiv-Coming_Soon-b31b1b.svg)](https://arxiv.org/pdf/)
 [![GitHub](https://img.shields.io/badge/GitHub-Repository-green)](https://github.com/GAIR-NLP/daVinci-Dev)
 [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/GAIR/daVinci-Dev)
 - [Key Results](#key-results)
 - [Model Zoo](#model-zoo)
 - [Datasets](#datasets)
+- [Pipeline](#pipeline)
 - [Quick Start](#quick-start)
 - [Training](#training)
 - [Evaluation](#evaluation)
+- [License](#license)
 - [Citation](#citation)
 ## Overview
 Our training uses two complementary trajectory types (details in the paper):
+- **Contextually-native trajectories $\mathcal{D}^{\text{ctx}}_{\text{py}}$ (PR-derived):** preserve the full information flow by bundling file discovery/context retrieval together with sequential edits. This provides broad coverage and diversity.
+- **Environmentally-native trajectories $\mathcal{D}^{\text{env}}_{\text{pass}}$ (executable rollouts):** collected from real executable repositories with genuine tool/test outputs, capturing authentic feedback loops.
 Resources (open-source / open-release):
 |---------|-------------|------|
 | `daVinci-Dev` | Agent-native data used in our training recipe (as permitted) | https://huggingface.co/datasets/GAIR/daVinci-Dev |
+## Pipeline
+The GitHub repository contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$.
+| Pipeline | Description | Link |
+|----------|---------|-------------|
+| daVinci-Dev Pipeline | High-performance pipeline used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$ | [`GAIR-NLP/daVinci-Dev`](https://github.com/GAIR-NLP/daVinci-Dev) |
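To make the idea of a "structured PR representation" more concrete, here is a minimal sketch of how one pull request could be fetched from the GitHub REST API and folded into a single record. It is not the repository's actual pipeline: the `PRRecord` and `FileChange` classes, the helper names, and the choice of fields are all assumptions made for illustration. Only the two REST endpoints (`/repos/{owner}/{repo}/pulls/{number}` and its `/files` sub-resource) and the JSON keys read from them come from the public GitHub API.

```python
import json
import urllib.request
from dataclasses import dataclass, field
from typing import List, Optional

API_ROOT = "https://api.github.com"


@dataclass
class FileChange:
    """One changed file in a PR (hypothetical schema)."""
    path: str
    status: str  # e.g. "added", "modified", "removed"
    patch: str   # unified diff text; empty when GitHub omits it (e.g. binary files)


@dataclass
class PRRecord:
    """Illustrative structured view of a pull request (hypothetical schema)."""
    repo: str
    number: int
    title: str
    body: str
    base_sha: str
    files: List[FileChange] = field(default_factory=list)


def fetch_json(url: str, token: Optional[str] = None):
    """GET a GitHub REST endpoint and decode its JSON body."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def build_record(repo: str, pr: dict, files: list) -> PRRecord:
    """Combine the PR payload and its file list into one record."""
    changes = [
        FileChange(f["filename"], f["status"], f.get("patch", ""))
        for f in files
    ]
    return PRRecord(
        repo=repo,
        number=pr["number"],
        title=pr["title"],
        body=pr.get("body") or "",  # the API returns null for an empty description
        base_sha=pr["base"]["sha"],
        files=changes,
    )


def fetch_pr(repo: str, number: int, token: Optional[str] = None) -> PRRecord:
    """Fetch one PR and its changed files (first page of files only)."""
    pr = fetch_json(f"{API_ROOT}/repos/{repo}/pulls/{number}", token)
    files = fetch_json(f"{API_ROOT}/repos/{repo}/pulls/{number}/files", token)
    return build_record(repo, pr, files)
```

Keeping `build_record` separate from the network calls means the record construction can be exercised on saved JSON payloads without touching the API. A production pipeline would also need to handle pagination of the file list and API rate limits, which this sketch leaves out.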
 ## Quick Start
 These checkpoints are intended to be used inside the [SWE-Agent](https://github.com/SWE-agent/SWE-agent) scaffold. They are also compatible with standard inference frameworks.
 We report performance on **SWE-Bench Verified** using **SWE-Agent** with the setup described in the paper (including temperature 0, 128k context, and a 100-step budget). Results are reported as Pass@1 (averaged across 4 runs).
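As a small illustration of how "Pass@1 averaged across 4 runs" can be computed, the sketch below treats each run as a list of per-task outcomes (resolved or not) and averages the per-run resolve rates. The function names and the toy outcomes are hypothetical and are not taken from the paper or its evaluation code.

```python
from statistics import mean
from typing import List


def pass_at_1(resolved: List[bool]) -> float:
    """Fraction of tasks resolved in one run (a single attempt per task)."""
    if not resolved:
        raise ValueError("a run must contain at least one task")
    return sum(resolved) / len(resolved)


def averaged_pass_at_1(runs: List[List[bool]]) -> float:
    """Mean of the per-run Pass@1 values across independent runs."""
    return mean([pass_at_1(run) for run in runs])


# Toy data: 4 runs over the same 4 tasks (made-up outcomes, not real results).
runs = [
    [True, True, False, False],   # 0.50
    [True, False, False, False],  # 0.25
    [True, True, True, False],    # 0.75
    [True, True, False, False],   # 0.50
]
print(averaged_pass_at_1(runs))  # 0.5
```

Averaging over several runs smooths out run-to-run variation in the agent's trajectories, which a single run would not capture.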
+## License
+This project is a **mixed** release:
+- **Contextually-native PR-derived subset:** only PRs from repositories detected as having a **permissive license** are included. Each repo’s license is provided in `./ctx-native/filtered_repos/part-0000.parquet`.
+- **Environmentally-native subset:** derived from [**SWE-rebench**](https://huggingface.co/datasets/nebius/SWE-rebench), licensed under **CC-BY-4.0**.
+- **daVinci-Dev models:** released under the [Qwen](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE) license. Users should verify the licensing status of any generated code before using it in production.
+- **daVinci-Dev pipeline:** released under the [Apache-2.0](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/LICENSE) license.
+Users are responsible for ensuring their downstream usage complies with the licenses of the underlying sources.
 ## Citation
 ArXiv link and the official citation block are coming soon (the manuscript is under review at the time of release).