Update model card with metadata, links, and detailed description
Hi! I'm Niels from the Hugging Face community team. This PR significantly improves the model card for **daVinci-origin-3B** by adding comprehensive metadata and essential contextual information.
Specifically, this update includes:
- **Metadata**: Adding `pipeline_tag: text-generation` for correct categorization and `library_name: transformers` to enable the automated code snippet widget, based on the model's `Qwen2ForCausalLM` architecture and `transformers_version` in its config files. Relevant `tags` have also been added for improved discoverability.
- **Project Resources**: Providing clear links to the associated research paper ([Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824)), the official GitHub repository, and other related Hugging Face artifacts.
- **Model Description**: Expanding the model description with details from the paper abstract and GitHub README, including an overview of the "Data Darwinism Hierarchy" to better explain the model's scientific context and purpose.
- **Citation**: Adding the official BibTeX citation for proper attribution.
These improvements will help users better understand, find, and utilize this valuable model.
@@ -1,10 +1,53 @@
 ---
 license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- scientific
+- pre-training
+- foundation-model
+- data-darwinism
+- qwen2
 ---
 
 # daVinci-origin-3B
 
-**daVinci-origin-3B** is a fully transparent, 3-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [
+**daVinci-origin-3B** is a fully transparent, 3-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824).
+
+## Model Description
 
 Unlike most open-source models, daVinci-origin-3B was explicitly trained on a dataset **strictly excluding scientific content** (books and research papers). This unique design allows researchers to unambiguously attribute performance gains to specific domain data injection strategies during continued pre-training.
 
+This model is part of the **Data Darwinism** framework, a conceptual framework and practical methodology for the co-evolution of data and foundation models. The project identifies a **Learnability Gap** in conceptually dense domains like scientific literature, which raw data alone often fails to bridge. Data Darwinism addresses this through systematic data processing.
+
+### The Data Darwinism Hierarchy
+
+The framework introduces a ten-level taxonomy (L0–L9) to organize data transformations, aiming to increase information density and learnability as data ascends the hierarchy.
+
+| Level | Stage | Description | Key Operation |
+| :--- | :--- | :--- | :--- |
+| **L0–L3** | **Selection & Preservation** | Filtering raw data. | Heuristic filtering, deduplication. |
+| **L4** | **Generative Refinement** | Removing noise and repairing fragmentation. | LLM-based noise removal, formula repair. |
+| **L5** | **Cognitive Completion** | Expanding implicit reasoning. | Explicating terminology, bridging logical gaps. |
+| **L6–L9** | **Synthetic Evolution** | (Future Work) Model-driven synthesis. | Creating new environments/worlds. |
+
+## Project Resources
+
+- **Paper:** [Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824)
+- **GitHub Repository:** [GAIR-NLP/Data-Darwinism](https://github.com/GAIR-NLP/Data-Darwinism)
+- **Related Hugging Face Dataset (Corpus):** [Darwin-Science](https://huggingface.co/datasets/GAIR/Darwin-Science)
+- **Related Hugging Face Dataset (Evaluation):** [Darwin-Science-Eval](https://huggingface.co/datasets/GAIR/Darwin-Science-Eval)
+- **Related Hugging Face Model (7B):** [daVinci-origin-7B](https://huggingface.co/GAIR/daVinci-origin-7B)
+
+## Citation
+
+If you use Data Darwinism, the dataset, or the baselines in your research, please cite:
+
+```bibtex
+@article{qin2026data,
+  title={Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training},
+  author={Qin, Yiwei and Huang, Zhen and Mi, Tiantian and Si, Weiye and Zhou, Chenyang and Guo, Qipeng and Feng, Siyuan and Liu, Pengfei},
+  journal={arXiv preprint arXiv:2602.07824},
+  year={2026}
+}
+```
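As a quick sanity check on the new frontmatter, the metadata block can be round-tripped through `huggingface_hub` — a sketch, assuming the `huggingface_hub` package is installed; the string below reproduces only the YAML block and heading from the diff:

```python
from huggingface_hub import ModelCard

# Metadata block as added in this PR, plus the card heading.
content = """---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- scientific
- pre-training
- foundation-model
- data-darwinism
- qwen2
---

# daVinci-origin-3B
"""

# ModelCard parses the YAML frontmatter into structured card data,
# which is what the Hub validates server-side.
card = ModelCard(content)
print(card.data.pipeline_tag, card.data.library_name)
print(card.data.tags)
```

If the frontmatter were malformed, `ModelCard` would raise during parsing, so this doubles as a lightweight lint for the PR.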