Improve model card: add metadata, paper/code links and citation
Hi! I'm Niels from the Hugging Face community team.
This PR improves the model card for **daVinci-origin-7B** by adding:
- Metadata for `pipeline_tag: text-generation` and `library_name: transformers`.
- Links to the research paper and the official GitHub repository.
- A citation section using the BibTeX provided in the project's repository.
These changes help users find the model through the Hub's filtering tools and provide better context for the research.
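For context, the added metadata lives in the model card's YAML front matter, which the Hub parses to power its filter sidebar. A minimal sketch of how that block parses (using PyYAML here purely for illustration; any YAML parser behaves the same):

```python
import yaml

# Front matter as added by this PR (copied from the README diff below is not
# assumed -- this string mirrors the metadata the PR introduces).
front_matter = """\
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- science
- data-darwinism
"""

# The Hub reads these keys to index the model: pipeline_tag drives the task
# filter, library_name enables the "Use this model" snippet, tags are free-form.
meta = yaml.safe_load(front_matter)
print(meta["pipeline_tag"])  # text-generation
print(meta["tags"])          # ['science', 'data-darwinism']
```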
README.md CHANGED

````diff
@@ -1,10 +1,33 @@
 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- science
+- data-darwinism
 ---
 
 # daVinci-origin-7B
 
-**daVinci-origin-7B** is a fully transparent, 7-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [*Data Darwinism -- Part1: Unlocking the Value of Scientific Data for Pre-training*](https://
+**daVinci-origin-7B** is a fully transparent, 7-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [*Data Darwinism -- Part1: Unlocking the Value of Scientific Data for Pre-training*](https://huggingface.co/papers/2602.07824).
 
 Unlike most open-source models, daVinci-origin-7B was explicitly trained on a dataset **strictly excluding scientific content** (books and research papers). This unique design allows researchers to unambiguously attribute performance gains to specific domain data injection strategies during continued pre-training.
+
+## Resources
+
+- **Paper:** [Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824)
+- **GitHub Repository:** [GAIR-NLP/Data-Darwinism](https://github.com/GAIR-NLP/Data-Darwinism)
+- **Dataset (Darwin-Science):** [GAIR/Darwin-Science](https://huggingface.co/datasets/GAIR/Darwin-Science)
+
+## Citation
+
+If you use this model or the associated research in your work, please cite:
+
+```bibtex
+@article{qin2026data,
+  title={Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training},
+  author={Qin, Yiwei and Huang, Zhen and Mi, Tiantian and Si, Weiye and Zhou, Chenyang and Guo, Qipeng and Feng, Siyuan and Liu, Pengfei},
+  journal={arXiv preprint arXiv:2602.07824},
+  year={2026}
+}
+```
````