license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
  - scientific
  - pre-training
  - foundation-model
  - data-darwinism
  - qwen2

daVinci-origin-3B

daVinci-origin-3B is a fully transparent, 3-billion-parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training.

Model Description

Unlike most open-source models, daVinci-origin-3B was trained on a dataset that strictly excludes scientific content (books and research papers). Because the baseline has seen no such material, researchers can attribute performance gains unambiguously to specific domain-data injection strategies applied during continued pre-training.

This model is part of Data Darwinism, a conceptual framework and practical methodology for the co-evolution of data and foundation models. The project identifies a Learnability Gap in conceptually dense domains such as scientific literature, which raw data alone often fails to bridge. Data Darwinism addresses this gap through systematic data processing.
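The card metadata lists `transformers` as the library and `text-generation` as the pipeline, so the baseline can presumably be loaded like any other causal language model. The following is a minimal sketch under that assumption, not an official usage recipe: the repository id is a placeholder (the Hub namespace is not stated in this card), and the helper name `complete` and its defaults are illustrative choices.

```python
# Hypothetical repository id: replace "<org>" with the actual Hub namespace.
MODEL_ID = "<org>/daVinci-origin-3B"


def complete(prompt: str, model_id: str = MODEL_ID, max_new_tokens: int = 64) -> str:
    """Return the prompt followed by a greedy continuation from the base model."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # torch_dtype="auto" uses the dtype stored in the checkpoint config.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

    # Base model: feed raw text, with no chat template.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete("Photosynthesis is the process by which")` with a valid repository id downloads the weights on first use. Since this is a pre-trained foundation model rather than an instruction-tuned one, plain text continuation is the appropriate way to prompt it.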

The Data Darwinism Hierarchy

The framework introduces a ten-level taxonomy (L0–L9) to organize data transformations, aiming to increase information density and learnability as data ascends the hierarchy.

Level	Stage	Description	Key Operation
L0–L3	Selection & Preservation	Filtering raw data.	Heuristic filtering, deduplication.
L4	Generative Refinement	Removing noise and repairing fragmentation.	LLM-based noise removal, formula repair.
L5	Cognitive Completion	Expanding implicit reasoning.	Explicating terminology, bridging logical gaps.
L6–L9	Synthetic Evolution	(Future Work) Model-driven synthesis.	Creating new environments/worlds.
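For scripts that tag data by processing level, the table above can be expressed as a small lookup. This is only an illustrative sketch: the rows are copied from the table, while the `Stage` type and the `stage_for_level` helper are hypothetical names and not part of the Data-Darwinism repository.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    first_level: int   # lowest level covered by the stage (inclusive)
    last_level: int    # highest level covered by the stage (inclusive)
    name: str
    key_operation: str


# Rows copied from the hierarchy table; levels L6 to L9 are marked as future work.
STAGES = (
    Stage(0, 3, "Selection & Preservation", "Heuristic filtering, deduplication"),
    Stage(4, 4, "Generative Refinement", "LLM-based noise removal, formula repair"),
    Stage(5, 5, "Cognitive Completion", "Explicating terminology, bridging logical gaps"),
    Stage(6, 9, "Synthetic Evolution", "Creating new environments/worlds"),
)


def stage_for_level(level: int) -> Stage:
    """Map a hierarchy level (0 through 9) to the stage that contains it."""
    for stage in STAGES:
        if stage.first_level <= level <= stage.last_level:
            return stage
    raise ValueError(f"level must be between 0 and 9, got {level}")
```

For example, `stage_for_level(5).name` evaluates to `"Cognitive Completion"`, and any level outside 0 to 9 raises `ValueError`.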

Project Resources

Paper: Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training
GitHub Repository: GAIR-NLP/Data-Darwinism
Related Hugging Face Dataset (Corpus): Darwin-Science
Related Hugging Face Dataset (Evaluation): Darwin-Science-Eval
Related Hugging Face Model (7B): daVinci-origin-7B

Citation

If you use Data Darwinism, the dataset, or the baselines in your research, please cite:

@article{qin2026data,
  title={Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training},
  author={Qin, Yiwei and Huang, Zhen and Mi, Tiantian and Si, Weiye and Zhou, Chenyang and Guo, Qipeng and Feng, Siyuan and Liu, Pengfei},
  journal={arXiv preprint arXiv:2602.07824},
  year={2026}
}