license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- scientific
- pre-training
- foundation-model
- data-darwinism
- qwen2
daVinci-origin-3B
daVinci-origin-3B is a fully transparent, 3-billion-parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training.
Model Description
Unlike most open-source models, daVinci-origin-3B was trained on a dataset that deliberately excludes scientific content (books and research papers). Because the baseline has seen no such data, researchers can attribute performance gains observed during continued pre-training to the specific domain data injection strategy under test.
This model is part of the Data Darwinism framework, a conceptual framework and practical methodology for the co-evolution of data and foundation models. The project identifies a Learnability Gap in conceptually dense domains like scientific literature, which raw data alone often fails to bridge. Data Darwinism addresses this through systematic data processing.
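Since the card metadata lists `transformers` as the library and `text-generation` as the pipeline, the model can presumably be loaded through the standard causal-LM classes. The following is a minimal sketch, not an official usage recipe: the repository id `GAIR/daVinci-origin-3B` is an assumption (substitute the id shown on the model page), the helper name `generate_text` is hypothetical, and PyTorch is assumed to be installed alongside `transformers`.

```python
# Assumed Hub repository id; replace with the id shown on the model page.
MODEL_ID = "GAIR/daVinci-origin-3B"


def generate_text(prompt: str, model_id: str = MODEL_ID, max_new_tokens: int = 64) -> str:
    """Continue `prompt` with a causal language model loaded from the Hub.

    Hypothetical helper for illustration. The import is kept inside the
    function so the module can be imported without transformers installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Tokenize to PyTorch tensors and generate a continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode the full sequence (prompt plus continuation) back to text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("The mitochondrion is")` would download the weights on first use and return the prompt followed by a continuation. Because this is a base model rather than an instruction-tuned one, plain text-completion prompts are likely the appropriate input format.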
The Data Darwinism Hierarchy
The framework introduces a ten-level taxonomy (L0–L9) to organize data transformations, aiming to increase information density and learnability as data ascends the hierarchy.
| Level | Stage | Description | Key Operation |
|---|---|---|---|
| L0–L3 | Selection & Preservation | Filtering raw data. | Heuristic filtering, deduplication. |
| L4 | Generative Refinement | Removing noise and repairing fragmentation. | LLM-based noise removal, formula repair. |
| L5 | Cognitive Completion | Expanding implicit reasoning. | Explicating terminology, bridging logical gaps. |
| L6–L9 | Synthetic Evolution | (Future Work) Model-driven synthesis. | Creating new environments/worlds. |
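For readers who want to tag corpus shards by processing level, the table above can be encoded as a small lookup. This is only an illustrative sketch: the `Stage` dataclass, the `STAGES` tuple, and `stage_for_level` are hypothetical names and are not part of the Data Darwinism codebase; only the level ranges, stage names, and key operations are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One row of the hierarchy table (hypothetical container)."""
    name: str
    first_level: int
    last_level: int
    key_operation: str


# Level ranges, stage names, and operations copied from the table above.
STAGES = (
    Stage("Selection & Preservation", 0, 3, "Heuristic filtering, deduplication"),
    Stage("Generative Refinement", 4, 4, "LLM-based noise removal, formula repair"),
    Stage("Cognitive Completion", 5, 5, "Explicating terminology, bridging logical gaps"),
    Stage("Synthetic Evolution", 6, 9, "Creating new environments/worlds (future work)"),
)


def stage_for_level(level: int) -> Stage:
    """Return the stage whose level range contains `level` (0 through 9)."""
    for stage in STAGES:
        if stage.first_level <= level <= stage.last_level:
            return stage
    raise ValueError(f"level must be between 0 and 9, got {level}")
```

For example, `stage_for_level(4).name` evaluates to `"Generative Refinement"`, and any level outside 0 to 9 raises `ValueError`.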
Project Resources
- Paper: Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training
- GitHub Repository: GAIR-NLP/Data-Darwinism
- Related Hugging Face Dataset (Corpus): Darwin-Science
- Related Hugging Face Dataset (Evaluation): Darwin-Science-Eval
- Related Hugging Face Model (7B): daVinci-origin-7B
Citation
If you use Data Darwinism, the dataset, or the baselines in your research, please cite:
@article{qin2026data,
title={Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training},
author={Qin, Yiwei and Huang, Zhen and Mi, Tiantian and Si, Weiye and Zhou, Chenyang and Guo, Qipeng and Feng, Siyuan and Liu, Pengfei},
journal={arXiv preprint arXiv:2602.07824},
year={2026}
}