nielsr (HF Staff) committed · Commit 816efad · verified · 1 parent: 1c477fa

Update model card with metadata, links, and detailed description


Hi! I'm Niels from the Hugging Face community team. This PR aims to significantly improve the model card for **daVinci-origin-3B** by adding comprehensive metadata and essential contextual information.

Specifically, this update includes:
- **Metadata**: Adding `pipeline_tag: text-generation` for correct categorization and `library_name: transformers` to enable the automated code snippet widget, based on the model's `Qwen2ForCausalLM` architecture and `transformers_version` in its config files. Relevant `tags` have also been added for improved discoverability.
- **Project Resources**: Providing clear links to the associated research paper ([Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824)), the official GitHub repository, and other related Hugging Face artifacts.
- **Model Description**: Expanding the model description with details from the paper abstract and GitHub README, including an overview of the "Data Darwinism Hierarchy" to better explain the model's scientific context and purpose.
- **Citation**: Adding the official BibTeX citation for proper attribution.

These improvements will help users better understand, find, and utilize this valuable model.
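The `pipeline_tag`, `library_name`, and `tags` fields added in this PR live in the model card's YAML front matter (the `---`-delimited block at the top of `README.md`), which the Hub reads to categorize the model and enable the code-snippet widget. As a rough illustration of what that metadata block encodes, here is a hand-rolled sketch of extracting those fields (the function name `parse_front_matter` is hypothetical; the Hub uses a real YAML parser, not this code):

```python
# Hand-rolled sketch: extract simple key/value pairs and the `tags` list
# from a model card's leading `---`-delimited YAML front matter.
# Illustration only -- not the Hub's actual implementation.
def parse_front_matter(card_text):
    lines = card_text.splitlines()
    assert lines[0].strip() == "---", "card must start with front matter"
    meta, tags, key = {}, [], None
    for line in lines[1:]:
        if line.strip() == "---":          # closing delimiter ends the block
            break
        if line.startswith("- ") and key == "tags":
            tags.append(line[2:].strip())  # list item under `tags:`
        elif ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if value:                      # `tags:` itself has no inline value
                meta[key] = value
    if tags:
        meta["tags"] = tags
    return meta

card = """---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- scientific
- qwen2
---

# daVinci-origin-3B
"""

meta = parse_front_matter(card)
print(meta["pipeline_tag"])  # text-generation
print(meta["tags"])          # ['scientific', 'qwen2']
```

With `pipeline_tag: text-generation` and `library_name: transformers` in place, the Hub can surface the model under the right task filter and render a ready-made `transformers` loading snippet.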

Files changed (1)
  1. README.md (+44 −1)
README.md CHANGED
@@ -1,10 +1,53 @@
 ---
 license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- scientific
+- pre-training
+- foundation-model
+- data-darwinism
+- qwen2
 ---
 
 # daVinci-origin-3B
 
-**daVinci-origin-3B** is a fully transparent, 3-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [*Data Darwinism -- Part1: Unlocking the Value of Scientific Data for Pre-training*](https://arxiv.org/pdf/2602.07824).
+**daVinci-origin-3B** is a fully transparent, 3-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824).
+
+## Model Description
 
 Unlike most open-source models, daVinci-origin-3B was explicitly trained on a dataset **strictly excluding scientific content** (books and research papers). This unique design allows researchers to unambiguously attribute performance gains to specific domain data injection strategies during continued pre-training.
 
+This model is part of the **Data Darwinism** framework, a conceptual framework and practical methodology for the co-evolution of data and foundation models. The project identifies a **Learnability Gap** in conceptually dense domains like scientific literature, which raw data alone often fails to bridge. Data Darwinism addresses this through systematic data processing.
+
+### The Data Darwinism Hierarchy
+
+The framework introduces a ten-level taxonomy (L0–L9) to organize data transformations, aiming to increase information density and learnability as data ascends the hierarchy.
+
+| Level | Stage | Description | Key Operation |
+| :--- | :--- | :--- | :--- |
+| **L0–L3** | **Selection & Preservation** | Filtering raw data. | Heuristic filtering, deduplication. |
+| **L4** | **Generative Refinement** | Removing noise and repairing fragmentation. | LLM-based noise removal, formula repair. |
+| **L5** | **Cognitive Completion** | Expanding implicit reasoning. | Explicating terminology, bridging logical gaps. |
+| **L6–L9** | **Synthetic Evolution** | (Future Work) Model-driven synthesis. | Creating new environments/worlds. |
+
+## Project Resources
+
+- **Paper:** [Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824)
+- **GitHub Repository:** [GAIR-NLP/Data-Darwinism](https://github.com/GAIR-NLP/Data-Darwinism)
+- **Related Hugging Face Dataset (Corpus):** [Darwin-Science](https://huggingface.co/datasets/GAIR/Darwin-Science)
+- **Related Hugging Face Dataset (Evaluation):** [Darwin-Science-Eval](https://huggingface.co/datasets/GAIR/Darwin-Science-Eval)
+- **Related Hugging Face Model (7B):** [daVinci-origin-7B](https://huggingface.co/GAIR/daVinci-origin-7B)
+
+## Citation
+
+If you use Data Darwinism, the dataset, or the baselines in your research, please cite:
+
+```bibtex
+@article{qin2026data,
+  title={Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training},
+  author={Qin, Yiwei and Huang, Zhen and Mi, Tiantian and Si, Weiye and Zhou, Chenyang and Guo, Qipeng and Feng, Siyuan and Liu, Pengfei},
+  journal={arXiv preprint arXiv:2602.07824},
+  year={2026}
+}
+```