
Update model card with metadata, links, and detailed description

#1 opened by nielsr

Files changed (1)

README.md (+44, -1)
@@ -1,10 +1,53 @@
 ---
 license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- scientific
+- pre-training
+- foundation-model
+- data-darwinism
+- qwen2
 ---
 
 # daVinci-origin-3B
 
-**daVinci-origin-3B** is a fully transparent, 3-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [*Data Darwinism -- Part1: Unlocking the Value of Scientific Data for Pre-training*](https://arxiv.org/pdf/2602.07824).
+**daVinci-origin-3B** is a fully transparent, 3-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824).
+
+## Model Description
 
 Unlike most open-source models, daVinci-origin-3B was explicitly trained on a dataset **strictly excluding scientific content** (books and research papers). This unique design allows researchers to unambiguously attribute performance gains to specific domain data injection strategies during continued pre-training.
 
+This model is part of the **Data Darwinism** framework, a conceptual framework and practical methodology for the co-evolution of data and foundation models. The project identifies a **Learnability Gap** in conceptually dense domains like scientific literature, which raw data alone often fails to bridge. Data Darwinism addresses this through systematic data processing.
+
+### The Data Darwinism Hierarchy
+
+The framework introduces a ten-level taxonomy (L0–L9) to organize data transformations, aiming to increase information density and learnability as data ascends the hierarchy.
+
+| Level | Stage | Description | Key Operation |
+| :--- | :--- | :--- | :--- |
+| **L0–L3** | **Selection & Preservation** | Filtering raw data. | Heuristic filtering, deduplication. |
+| **L4** | **Generative Refinement** | Removing noise and repairing fragmentation. | LLM-based noise removal, formula repair. |
+| **L5** | **Cognitive Completion** | Expanding implicit reasoning. | Explicating terminology, bridging logical gaps. |
+| **L6–L9** | **Synthetic Evolution** | (Future Work) Model-driven synthesis. | Creating new environments/worlds. |
+
+## Project Resources
+
+- **Paper:** [Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824)
+- **GitHub Repository:** [GAIR-NLP/Data-Darwinism](https://github.com/GAIR-NLP/Data-Darwinism)
+- **Related Hugging Face Dataset (Corpus):** [Darwin-Science](https://huggingface.co/datasets/GAIR/Darwin-Science)
+- **Related Hugging Face Dataset (Evaluation):** [Darwin-Science-Eval](https://huggingface.co/datasets/GAIR/Darwin-Science-Eval)
+- **Related Hugging Face Model (7B):** [daVinci-origin-7B](https://huggingface.co/GAIR/daVinci-origin-7B)
+
+## Citation
+
+If you use Data Darwinism, the dataset, or the baselines in your research, please cite:
+
+```bibtex
+@article{qin2026data,
+  title={Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training},
+  author={Qin, Yiwei and Huang, Zhen and Mi, Tiantian and Si, Weiye and Zhou, Chenyang and Guo, Qipeng and Feng, Siyuan and Liu, Pengfei},
+  journal={arXiv preprint arXiv:2602.07824},
+  year={2026}
+}
+```
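The updated metadata declares `library_name: transformers` and `pipeline_tag: text-generation`, but the card adds no usage snippet. A minimal loading sketch under two assumptions not stated in the diff: the 3B repo id `GAIR/daVinci-origin-3B` mirrors the linked 7B checkpoint's namespace, and the checkpoint (tagged `qwen2`) loads with the standard `transformers` causal-LM classes.

```python
# Minimal usage sketch for daVinci-origin-3B.
# NOTE: the repo id below is hypothetical, inferred from the linked
# GAIR/daVinci-origin-7B model; adjust it if the actual namespace differs.
MODEL_ID = "GAIR/daVinci-origin-3B"


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Tokenize a prompt, run greedy generation, and decode the output."""
    # Imports are deferred so the call shape is readable even where
    # transformers is not installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    # A base (non-instruct) model: prompt it as plain text continuation.
    print(generate("The cell is the basic structural unit of"))
```

Since this is a from-scratch base model rather than an instruction-tuned one, plain-text continuation prompts are the appropriate interface.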