
Update model card with metadata, links, and detailed description

#1 opened by nielsr

Files changed (1)

README.md (+44, -1)
@@ -1,10 +1,53 @@
 ---
 license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- scientific
+- pre-training
+- foundation-model
+- data-darwinism
+- qwen2
 ---
 
 # daVinci-origin-3B
 
-**daVinci-origin-3B** is a fully transparent, 3-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [*Data Darwinism -- Part1: Unlocking the Value of Scientific Data for Pre-training*](https://arxiv.org/pdf/2602.07824).
+**daVinci-origin-3B** is a fully transparent, 3-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824).
+
+## Model Description
 
 Unlike most open-source models, daVinci-origin-3B was explicitly trained on a dataset **strictly excluding scientific content** (books and research papers). This unique design allows researchers to unambiguously attribute performance gains to specific domain data injection strategies during continued pre-training.
 
+This model is part of the **Data Darwinism** framework, a conceptual framework and practical methodology for the co-evolution of data and foundation models. The project identifies a **Learnability Gap** in conceptually dense domains like scientific literature, which raw data alone often fails to bridge. Data Darwinism addresses this through systematic data processing.
+
+### The Data Darwinism Hierarchy
+
+The framework introduces a ten-level taxonomy (L0–L9) to organize data transformations, aiming to increase information density and learnability as data ascends the hierarchy.
+
+| Level | Stage | Description | Key Operation |
+| :--- | :--- | :--- | :--- |
+| **L0–L3** | **Selection & Preservation** | Filtering raw data. | Heuristic filtering, deduplication. |
+| **L4** | **Generative Refinement** | Removing noise and repairing fragmentation. | LLM-based noise removal, formula repair. |
+| **L5** | **Cognitive Completion** | Expanding implicit reasoning. | Explicating terminology, bridging logical gaps. |
+| **L6–L9** | **Synthetic Evolution** | (Future Work) Model-driven synthesis. | Creating new environments/worlds. |
+
+## Project Resources
+
+- **Paper:** [Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824)
+- **GitHub Repository:** [GAIR-NLP/Data-Darwinism](https://github.com/GAIR-NLP/Data-Darwinism)
+- **Related Hugging Face Dataset (Corpus):** [Darwin-Science](https://huggingface.co/datasets/GAIR/Darwin-Science)
+- **Related Hugging Face Dataset (Evaluation):** [Darwin-Science-Eval](https://huggingface.co/datasets/GAIR/Darwin-Science-Eval)
+- **Related Hugging Face Model (7B):** [daVinci-origin-7B](https://huggingface.co/GAIR/daVinci-origin-7B)
+
+## Citation
+
+If you use Data Darwinism, the dataset, or the baselines in your research, please cite:
+
+```bibtex
+@article{qin2026data,
+  title={Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training},
+  author={Qin, Yiwei and Huang, Zhen and Mi, Tiantian and Si, Weiye and Zhou, Chenyang and Guo, Qipeng and Feng, Siyuan and Liu, Pengfei},
+  journal={arXiv preprint arXiv:2602.07824},
+  year={2026}
+}
+```
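The updated metadata declares `library_name: transformers` and `pipeline_tag: text-generation`, but the card adds no usage snippet. A minimal loading sketch under two assumptions not stated in the diff: the 3B repo id `GAIR/daVinci-origin-3B` mirrors the linked 7B checkpoint's namespace, and the checkpoint (tagged `qwen2`) loads with the standard `transformers` causal-LM classes.

```python
# Minimal usage sketch for daVinci-origin-3B.
# NOTE: the repo id below is hypothetical, inferred from the linked
# GAIR/daVinci-origin-7B model; adjust it if the actual namespace differs.
MODEL_ID = "GAIR/daVinci-origin-3B"


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Tokenize a prompt, run greedy generation, and decode the output."""
    # Imports are deferred so the call shape is readable even where
    # transformers is not installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    # A base (non-instruct) model: prompt it as plain text continuation.
    print(generate("The cell is the basic structural unit of"))
```

Since this is a from-scratch base model rather than an instruction-tuned one, plain-text continuation prompts are the appropriate interface.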