Add metadata and link to code (#1)
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,9 +1,22 @@
+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- qwen2
+- pretraining-science
+---
+
 ## Overview
 
-daVinci-LLM-3B is a 3B-parameter base language model aimed at making pretraining a transparent and reproducible scientific process. We release not only the final weights but also training trajectories, intermediate checkpoints, data processing decisions, and 200+ ablation studies covering data quality, mixture design, training dynamics, and evaluation validity. The model reaches an overall score of 51.72 across 19 benchmarks, approaching or matching larger 7B-scale models such as OLMo-3 7B.
-
+**daVinci-LLM-3B** is a 3B-parameter base language model presented in [daVinci-LLM: Towards the Science of Pretraining](https://huggingface.co/papers/2603.27164). This project aims to make the pretraining process a transparent and reproducible scientific endeavor.
+
+We release not only the final weights but also training trajectories, intermediate checkpoints, data processing decisions, and 200+ ablation studies covering data quality, mixture design, training dynamics, and evaluation validity.
+
+- **GitHub:** [GAIR-NLP/daVinci-LLM](https://github.com/GAIR-NLP/daVinci-LLM)
+- **Paper:** [arXiv:2603.27164](https://arxiv.org/abs/2603.27164)
+- **Dataset:** [davinci-llm-data](https://huggingface.co/datasets/SII-GAIR-NLP/davinci-llm-data)
+
 The model follows a two-stage curriculum over ~8T tokens:
 - **Stage 1 (6T tokens):** broad pretraining over diverse web-scale corpora.
 - **Stage 2 (2T tokens):** structured QA and reasoning-heavy data to amplify math and code reasoning.

@@ -43,24 +56,21 @@ Major categories:
 - **Science/Math:** MegaMath, Nemotron-CC-Math, and Darwin-Science series (L3–L5).
 - **QA:** multi-source QA data with rejection sampling (L5).
 
-##
-
-- **Stage 1:** 6T tokens with progressively adjusted mixtures (shifting weight from web text to code/science).
-- **Stage 2:** 2T tokens with structured QA (30% → 70%) for stronger reasoning and problem-solving.
-
+## Evaluation
+
+The model reaches an overall average score of **51.72** across 19 benchmarks, matching or exceeding the performance of larger 7B-scale models like OLMo-3 7B.
+
+| Capability Dimension    | daVinci-3B | OLMo-3 7B | LLaMA-3.2-3B | Qwen-2.5-3B |
+| ----------------------- | ---------: | --------: | -----------: | ----------: |
+| **Overall Performance** |  **51.72** |     51.65 |        37.58 |       51.44 |
+| General Knowledge       |      52.96 |     55.13 |        51.08 |       55.16 |
+| Code Generation         |      55.99 |     54.42 |        32.40 |       56.13 |
+| Scientific Reasoning    |      48.30 |     45.98 |        22.45 |       44.65 |
+| MATH                    |      62.80 |     39.60 |         9.00 |       37.20 |
 
 ## Citation
 
-```
+```bibtex
 @misc{qin2026davincillmtowardssciencepretraining,
 title={daVinci-LLM: Towards the Science of Pretraining},
 author={Yiwei Qin and Yixiu Liu and Tiantian Mi and Muhang Xie and Zhen Huang and Weiye Si and Pengrui Lu and Siyuan Feng and Xia Wu and Liming Liu and Ye Luo and Jinlong Hou and Qipeng Guo and Yu Qiao and Pengfei Liu},

@@ -70,5 +80,4 @@ Apache-2.0
 primaryClass={cs.AI},
 url={https://arxiv.org/abs/2603.27164},
 }
-```
-
+```
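The Stage 2 curriculum described in the README ramps structured QA data from 30% to 70% of the training mixture. As a rough illustration only — assuming a simple linear ramp over the 2T Stage-2 tokens, which is an assumption; the actual schedule may use discrete phases — the instantaneous QA weight could be sketched as:

```python
def qa_mixture_weight(tokens_seen: float, stage2_total: float = 2e12,
                      start: float = 0.30, end: float = 0.70) -> float:
    """Hypothetical linear ramp of the structured-QA mixture weight
    over Stage 2 (30% -> 70%); the real schedule may be stepwise."""
    frac = min(max(tokens_seen / stage2_total, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + (end - start) * frac

# Midway through Stage 2 (1T of 2T tokens), the sketch gives a 50% QA share.
print(qa_mixture_weight(1e12))  # → 0.5
```

Names and the linear-interpolation choice here are illustrative, not taken from the paper or repo.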