arxiv:2602.09003

Data Science and Technology Towards AGI Part I: Tiered Data Management

Published on Feb 9 · Submitted by Yudong Wang on Feb 10 · OpenBMB
Authors:
Abstract

Large language models are increasingly guiding data management processes through a tiered framework that optimizes data quality, cost, and training efficiency across different stages of model development.

AI-generated summary

The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, with successive shifts in data organization and utilization continuously driving advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size, and it increasingly encounters bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new phase of data-model co-evolution, in which models actively guide data management while high-quality data, in turn, amplifies model capabilities. To implement this vision, we propose a tiered data management framework designed to support the full LLM training lifecycle across heterogeneous learning objectives and cost constraints. Specifically, we introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. Importantly, LLMs are employed throughout the data management process, performing tasks such as quality scoring and content editing to refine data across tiers. Each tier is characterized by distinct data properties, management strategies, and training roles, enabling data to be strategically allocated across LLM training stages, including pre-training, mid-training, and alignment. The framework balances data quality, acquisition cost, and marginal training benefit, providing a systematic approach to scalable and sustainable data management. We validate the effectiveness of the proposed framework through empirical studies, in which tiered datasets are constructed from raw corpora and used across multiple training phases. Experimental results demonstrate that tier-aware data utilization significantly improves training efficiency and model performance. To facilitate further research, we release our tiered datasets and processing tools to the community.
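The tier-assignment and stage-allocation idea above can be sketched in a few lines of Python. This is a minimal illustration only: the tier thresholds, the quality scorer, and the tier-to-stage mapping are all hypothetical placeholders, not the paper's actual criteria.

```python
# Hypothetical sketch of tier-aware data routing. The thresholds, the
# STAGE_BY_TIER mapping, and the quality scores are illustrative
# assumptions; the paper's actual L0-L4 criteria may differ.

from dataclasses import dataclass

# Illustrative tier boundaries on a [0, 1] model-assigned quality score:
# below 0.2 -> L0 (raw, uncurated), 0.8 and above -> L4 (verifiable knowledge).
TIER_THRESHOLDS = [0.2, 0.4, 0.6, 0.8]

# Illustrative mapping of tiers to training stages.
STAGE_BY_TIER = {
    0: "discard",        # L0: raw uncurated resources
    1: "pre-training",   # L1-L2: filtered bulk text
    2: "pre-training",
    3: "mid-training",   # L3: curated, higher-quality data
    4: "alignment",      # L4: organized, verifiable knowledge
}

@dataclass
class Document:
    text: str
    quality: float  # score in [0, 1], e.g. from an LLM quality scorer

def assign_tier(doc: Document) -> int:
    """Map a quality score to a tier index 0 (L0) through 4 (L4)."""
    tier = 0
    for threshold in TIER_THRESHOLDS:
        if doc.quality >= threshold:
            tier += 1
    return tier

def route(docs: list[Document]) -> dict[str, list[Document]]:
    """Group documents by the training stage their tier feeds."""
    buckets: dict[str, list[Document]] = {}
    for doc in docs:
        stage = STAGE_BY_TIER[assign_tier(doc)]
        buckets.setdefault(stage, []).append(doc)
    return buckets
```

In a real pipeline the `quality` field would come from an LLM-based scorer (or an ensemble of heuristics), and documents could move between tiers as LLM-driven content editing improves them.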

Community

AI evolution is shifting from "Data-Driven Learning" to "Data-Model Co-Evolution": a cycle where models and data enhance each other. 🔄

Today, we launch #UltraData: an all-in-one data science platform featuring a systematic L0–L4 Tiered Data Management Framework, 2.4T open tokens, and full-stack processing tools.

Essential for #LLM researchers & engineers seeking to build high-performance models with precision data science. 🚀

📄 Paper: https://ultradata.openbmb.cn/blog/position-paper
🌐 Site: https://ultradata.openbmb.cn
🤗 HF: https://huggingface.co/collections/openbmb/ultradata
💻 GitHub: https://github.com/UltraData-OpenBMB/UltraData-Math

