nielsr (HF Staff) committed · verified
Commit 22ee8b6 · Parent(s): 1c33bd5

Improve model card: add metadata, paper/code links and citation


Hi! I'm Niels from the Hugging Face community team.

This PR improves the model card for **daVinci-origin-7B** by adding:
- Metadata for `pipeline_tag: text-generation` and `library_name: transformers`.
- Links to the research paper and the official GitHub repository.
- A citation section using the BibTeX provided in the project's repository.

These changes help users find the model through the Hub's filtering tools and provide better context for the research.
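The metadata this PR adds lives in the README's YAML front matter, which the Hub parses to power its filters. As a minimal sketch of what the Hub sees, here is a small stdlib-only parser (the `parse_front_matter` helper is hypothetical, not part of any Hub tooling) applied to the front matter after this change:

```python
import re

# README front matter as it looks after this PR (copied from the diff below).
README_HEAD = """---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- science
- data-darwinism
---
"""

def parse_front_matter(text: str) -> dict:
    # Grab the block between the first pair of '---' fences, then parse
    # simple "key: value" lines plus "- item" list entries under a key.
    match = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
    meta, current_key = {}, None
    for line in match.group(1).splitlines():
        if line.startswith("- ") and current_key is not None:
            meta[current_key].append(line[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            current_key = key.strip()
            # A bare "key:" introduces a list; "key: value" is a scalar.
            meta[current_key] = value.strip() or []
    return meta

meta = parse_front_matter(README_HEAD)
print(meta["pipeline_tag"])  # -> text-generation
print(meta["tags"])          # -> ['science', 'data-darwinism']
```

With `pipeline_tag: text-generation` and `library_name: transformers` in place, the model shows up under the text-generation filter and gets the "Use in Transformers" affordances on its Hub page.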

Files changed (1): README.md (+24, -1)
README.md CHANGED
@@ -1,10 +1,33 @@
 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- science
+- data-darwinism
 ---
 
 # daVinci-origin-7B
 
-**daVinci-origin-7B** is a fully transparent, 7-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [*Data Darwinism -- Part1: Unlocking the Value of Scientific Data for Pre-training*](https://arxiv.org/pdf/2602.07824).
+**daVinci-origin-7B** is a fully transparent, 7-billion parameter foundation model trained from scratch. It serves as a "clean-room" baseline for the research paper [*Data Darwinism -- Part1: Unlocking the Value of Scientific Data for Pre-training*](https://huggingface.co/papers/2602.07824).
 
 Unlike most open-source models, daVinci-origin-7B was explicitly trained on a dataset **strictly excluding scientific content** (books and research papers). This unique design allows researchers to unambiguously attribute performance gains to specific domain data injection strategies during continued pre-training.
 
+## Resources
+
+- **Paper:** [Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training](https://huggingface.co/papers/2602.07824)
+- **GitHub Repository:** [GAIR-NLP/Data-Darwinism](https://github.com/GAIR-NLP/Data-Darwinism)
+- **Dataset (Darwin-Science):** [GAIR/Darwin-Science](https://huggingface.co/datasets/GAIR/Darwin-Science)
+
+## Citation
+
+If you use this model or the associated research in your work, please cite:
+
+```bibtex
+@article{qin2026data,
+  title={Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training},
+  author={Qin, Yiwei and Huang, Zhen and Mi, Tiantian and Si, Weiye and Zhou, Chenyang and Guo, Qipeng and Feng, Siyuan and Liu, Pengfei},
+  journal={arXiv preprint arXiv:2602.07824},
+  year={2026}
+}
+```