metadata
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
  - scientific
  - pre-training
  - foundation-model
  - data-darwinism
  - qwen2

daVinci-origin-3B

daVinci-origin-3B is a fully transparent, 3-billion-parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training.

Model Description

Unlike most open-source models, daVinci-origin-3B was deliberately trained on a dataset that strictly excludes scientific content (books and research papers). Because the base model has seen no such material, researchers can unambiguously attribute performance gains to specific domain-data injection strategies applied during continued pre-training.

This model is part of Data Darwinism, a conceptual framework and practical methodology for the co-evolution of data and foundation models. The project identifies a Learnability Gap in conceptually dense domains such as scientific literature, which raw data alone often fails to bridge. Data Darwinism addresses this gap through systematic data processing.
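The card metadata lists `transformers` as the library, `text-generation` as the pipeline tag, and `qwen2` as the architecture, so loading the checkpoint should follow the usual causal-LM pattern. The sketch below is a minimal illustration under that assumption, not an official usage snippet. The helper name `generate_sample` is hypothetical, and the repository ID is left as a parameter because the full Hub path is not given in this card.

```python
def generate_sample(repo_id: str, prompt: str, max_new_tokens: int = 50) -> str:
    """Generate a continuation of `prompt` with a causal LM from the Hub.

    `repo_id` is the full Hub path of the checkpoint. It is an assumption
    of this sketch and must be supplied by the caller.
    """
    # Imported lazily so that defining the helper does not require
    # transformers (and a backend such as PyTorch) to be installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)

    # Tokenize the prompt into PyTorch tensors and generate a continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode the first (and only) sequence back to text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (placeholder path, replace with the real repository ID):
# generate_sample("<org>/daVinci-origin-3B", "Water boils at")
```

For continued pre-training experiments, the same `from_pretrained` call would typically be the starting point before handing the model to a training loop.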

The Data Darwinism Hierarchy

The framework introduces a ten-level taxonomy (L0–L9) to organize data transformations, aiming to increase information density and learnability as data ascends the hierarchy.

| Level | Stage | Description | Key Operation |
|-------|-------|-------------|---------------|
| L0–L3 | Selection & Preservation | Filtering raw data. | Heuristic filtering, deduplication. |
| L4 | Generative Refinement | Removing noise and repairing fragmentation. | LLM-based noise removal, formula repair. |
| L5 | Cognitive Completion | Expanding implicit reasoning. | Explicating terminology, bridging logical gaps. |
| L6–L9 | Synthetic Evolution (Future Work) | Model-driven synthesis. | Creating new environments/worlds. |
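For tooling that tags datasets by processing level, the taxonomy above can be encoded as a small lookup structure. The following is a minimal sketch: the `Stage` class, the `HIERARCHY` constant, and the `stage_for_level` helper are hypothetical names introduced for illustration and are not part of the Data Darwinism release. Only the level ranges, stage names, and key operations come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    first_level: int      # lowest level in the stage (inclusive)
    last_level: int       # highest level in the stage (inclusive)
    name: str
    key_operation: str


# Level ranges and labels copied from the hierarchy table.
HIERARCHY = (
    Stage(0, 3, "Selection & Preservation",
          "Heuristic filtering, deduplication"),
    Stage(4, 4, "Generative Refinement",
          "LLM-based noise removal, formula repair"),
    Stage(5, 5, "Cognitive Completion",
          "Explicating terminology, bridging logical gaps"),
    Stage(6, 9, "Synthetic Evolution (Future Work)",
          "Creating new environments/worlds"),
)


def stage_for_level(level: int) -> Stage:
    """Return the stage that contains the given level (0 through 9)."""
    for stage in HIERARCHY:
        if stage.first_level <= level <= stage.last_level:
            return stage
    raise ValueError(f"level must be in 0..9, got {level}")


if __name__ == "__main__":
    for level in range(10):
        print(f"L{level}: {stage_for_level(level).name}")
```

Storing inclusive ranges rather than one entry per level keeps the structure aligned with the four rows of the table while still resolving any individual level.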

Project Resources

Citation

If you use Data Darwinism, the dataset, or the baselines in your research, please cite:

@article{qin2026data,
  title={Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training},
  author={Qin, Yiwei and Huang, Zhen and Mi, Tiantian and Si, Weiye and Zhou, Chenyang and Guo, Qipeng and Feng, Siyuan and Liu, Pengfei},
  journal={arXiv preprint arXiv:2602.07824},
  year={2026}
}