Update model card with metadata, links, and detailed description
Hi! I'm Niels from the Hugging Face community team. This PR significantly improves the model card for **daVinci-origin-3B** by adding comprehensive metadata and essential contextual information.
Specifically, this update includes:
- **Metadata**: Adding `pipeline_tag: text-generation` for correct categorization and `library_name: transformers` to enable the automated code snippet widget, based on the model's `Qwen2ForCausalLM` architecture and `transformers_version` in its config files. Relevant `tags` have also been added for improved discoverability.
- **Project Resources**: Providing clear links to the associated research paper ([Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824)), the official GitHub repository, and other related Hugging Face artifacts.
- **Model Description**: Expanding the model description with details from the paper abstract and GitHub README, including an overview of the "Data Darwinism Hierarchy" to better explain the model's scientific context and purpose.
- **Citation**: Adding the official BibTeX citation for proper attribution.
These improvements will help users better understand, find, and utilize this valuable model.
@@ -1,10 +1,53 @@
 ---
 license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- scientific
+- pre-training
+- foundation-model
+- data-darwinism
+- qwen2
 ---
 
 # daVinci-origin-3B
 
-**daVinci-origin-3B** is a fully transparent, 3-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [
+**daVinci-origin-3B** is a fully transparent, 3-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824).
+
+## Model Description
 
 Unlike most open-source models, daVinci-origin-3B was explicitly trained on a dataset **strictly excluding scientific content** (books and research papers). This unique design allows researchers to unambiguously attribute performance gains to specific domain data injection strategies during continued pre-training.
 
+This model is part of the **Data Darwinism** framework, a conceptual framework and practical methodology for the co-evolution of data and foundation models. The project identifies a **Learnability Gap** in conceptually dense domains like scientific literature, which raw data alone often fails to bridge. Data Darwinism addresses this through systematic data processing.
+
+### The Data Darwinism Hierarchy
+
+The framework introduces a ten-level taxonomy (L0–L9) to organize data transformations, aiming to increase information density and learnability as data ascends the hierarchy.
+
+| Level | Stage | Description | Key Operation |
+| :--- | :--- | :--- | :--- |
+| **L0–L3** | **Selection & Preservation** | Filtering raw data. | Heuristic filtering, deduplication. |
+| **L4** | **Generative Refinement** | Removing noise and repairing fragmentation. | LLM-based noise removal, formula repair. |
+| **L5** | **Cognitive Completion** | Expanding implicit reasoning. | Explicating terminology, bridging logical gaps. |
+| **L6–L9** | **Synthetic Evolution** | (Future Work) Model-driven synthesis. | Creating new environments/worlds. |
+
+## Project Resources
+
+- **Paper:** [Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824)
+- **GitHub Repository:** [GAIR-NLP/Data-Darwinism](https://github.com/GAIR-NLP/Data-Darwinism)
+- **Related Hugging Face Dataset (Corpus):** [Darwin-Science](https://huggingface.co/datasets/GAIR/Darwin-Science)
+- **Related Hugging Face Dataset (Evaluation):** [Darwin-Science-Eval](https://huggingface.co/datasets/GAIR/Darwin-Science-Eval)
+- **Related Hugging Face Model (7B):** [daVinci-origin-7B](https://huggingface.co/GAIR/daVinci-origin-7B)
+
+## Citation
+
+If you use Data Darwinism, the dataset, or the baselines in your research, please cite:
+
+```bibtex
+@article{qin2026data,
+  title={Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training},
+  author={Qin, Yiwei and Huang, Zhen and Mi, Tiantian and Si, Weiye and Zhou, Chenyang and Guo, Qipeng and Feng, Siyuan and Liu, Pengfei},
+  journal={arXiv preprint arXiv:2602.07824},
+  year={2026}
+}
+```
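As a quick sanity check on the new frontmatter, the metadata block can be round-tripped through `huggingface_hub` — a sketch, assuming the `huggingface_hub` package is installed; the string below reproduces only the YAML block and heading from the diff:

```python
from huggingface_hub import ModelCard

# Metadata block as added in this PR, plus the card heading.
content = """---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- scientific
- pre-training
- foundation-model
- data-darwinism
- qwen2
---

# daVinci-origin-3B
"""

# ModelCard parses the YAML frontmatter into structured card data,
# which is what the Hub validates server-side.
card = ModelCard(content)
print(card.data.pipeline_tag, card.data.library_name)
print(card.data.tags)
```

If the frontmatter were malformed, `ModelCard` would raise during parsing, so this doubles as a lightweight lint for the PR.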