9.73 TB
58,669 files
Updated 11 days ago

Ultra-FineWeb

📜 Technical Report | 📦 UltraData Collection | 🌐 UltraData | 🤗 MiniCPM4 Series | 🤗 MiniCPM5 Series

English | 中文

📚 Introduction

Ultra-FineWeb is a large-scale, high-quality, and efficiently filtered dataset. We apply the proposed efficient verification-based high-quality filtering pipeline to the FineWeb and Chinese FineWeb datasets (source data from Chinese FineWeb-edu-v2, which includes IndustryCorpus2, MiChao, WuDao, SkyPile, WanJuan, ChineseWebText, TeleChat, and CCI3). The result is two higher-quality datasets, Ultra-FineWeb-en with approximately 1T tokens and Ultra-FineWeb-zh with approximately 120B tokens, collectively referred to as Ultra-FineWeb. Ultra-FineWeb serves as a core pre-training web dataset for the MiniCPM4 Series and MiniCPM5 Series models.

  • Ultra-FineWeb: Ultra-FineWeb, a large-scale, high-quality, and efficiently-filtered dataset, with 1T English tokens and 120B Chinese tokens. (<-- you are here)
  • Ultra-FineWeb-classifier: Ultra-FineWeb classifier, for filtering high-quality data from web corpora.
  • Ultra-FineWeb-L3: the L3 refined data based on Ultra-FineWeb, produced via Q&A Pair Generation and Multi-style Rewriting, with 400B+ English and 200B+ Chinese tokens. To the best of our knowledge, it is the largest open-source Chinese pre-training synthetic corpus to date.

📢 What's New

  • [2026.05.28] The Ultra-FineWeb-L3 dataset is released! It is the L3 refined data of Ultra-FineWeb, produced via Q&A Pair Generation and Multi-style Rewriting, with 400B+ English and 200B+ Chinese tokens. To the best of our knowledge, it is the largest open-source Chinese pre-training synthetic corpus to date. 🚀🚀🚀
  • [2026.05.25] MiniCPM5-1B, the first model in the MiniCPM5 series, is released! It is a dense 1B Transformer built for on-device, local-deployment, and resource-constrained scenarios, reaching 1B-class open-source SOTA. Ultra-FineWeb serves as the core pre-training web dataset for MiniCPM5-1B.
  • [2026.02.08] The UltraData platform is now live, introducing the L0-L4 tiered data management framework. Ultra-FineWeb serves as the L2 selected layer for general web data in this framework.
  • [2025.06.16] The Ultra-FineWeb-classifier is now available on Hugging Face: openbmb/Ultra-FineWeb-classifier.
  • [2025.06.06] Ultra-FineWeb-en and Ultra-FineWeb-zh datasets are now available on Hugging Face, released alongside the MiniCPM4 Series models.
  • [2025.05.15] Ultra-FineWeb tops the Hugging Face Datasets Trending list, reaching the #1 spot! ⭐️⭐️⭐️
  • [2025.05.09] The Ultra-FineWeb technical report is available on arXiv. 🔥🔥🔥

💡 Highlights

Abstract: Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to efficiently filter high-quality data, we employ a lightweight classifier based on fastText, and successfully apply the filtering pipeline to two widely-used pre-training corpora, FineWeb and Chinese FineWeb datasets, resulting in the creation of the higher-quality Ultra-FineWeb dataset. Ultra-FineWeb contains approximately 1 trillion (T) English tokens and 120 billion (B) Chinese tokens. Empirical results demonstrate that the LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.

  • Efficient Verification Strategy: We propose a computationally efficient verification strategy that enables rapid evaluation of the impact of data on LLM training performance with minimal computational cost, significantly improving the efficiency of high-quality data filtering experiments.
  • Large-Scale High-Quality Pre-training Datasets: We design and implement an efficient high-quality data filtering pipeline, applied to the FineWeb and Chinese FineWeb datasets, resulting in the creation of higher-quality datasets, which can facilitate high-quality LLM training.
  • Lightweight Classifier: The Ultra-FineWeb classifier significantly reduces inference costs, achieving superior performance on extracted text from the same data source, thus validating the effectiveness of our proposed data filtering pipeline in enhancing data quality and training efficiency.
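The classifier-based filtering step described above can be pictured as a simple score-and-threshold loop. The sketch below is only an illustration under stated assumptions: `filter_high_quality`, `toy_score`, and the `0.5` threshold are hypothetical names and values, and `toy_score` is a stand-in heuristic, not the fastText-based Ultra-FineWeb classifier. In practice, `score_fn` would wrap the real classifier's probability for the high-quality label.

```python
from typing import Callable, Iterable, Iterator


def filter_high_quality(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the report
) -> Iterator[str]:
    """Yield only the documents whose quality score meets the threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


def toy_score(text: str) -> float:
    """Stand-in scorer (NOT the real classifier): fraction of alphabetic characters."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)


docs = [
    "Photosynthesis converts light into chemical energy.",
    "$$$ 1234 !!! ###",
]
kept = list(filter_high_quality(docs, toy_score, threshold=0.5))
```

Keeping the scorer behind a plain callable means the same loop works whether the score comes from a lightweight model or a heavier one; only the cost per document changes.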

📈 Evaluation Results

We use the MiniCPM-1.2B model architecture with the MiniCPM3-4B tokenizer. Each experiment trains on 100B tokens, which allows comprehensive validation of data performance at a manageable computational cost. We use the Lighteval library for model evaluation and adopt 11 benchmarks to evaluate the trained models. All evaluation metrics are based on a zero-shot setting. The benchmarks include:

  • English benchmarks: MMLU, ARC-C, ARC-E, CommonSenseQA, HellaSwag, OpenbookQA, PIQA, SIQA, and Winogrande.
  • Chinese benchmarks: C-Eval and CMMLU.
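This section does not state how scores across the 11 benchmarks are aggregated. One common choice, assumed here purely for illustration, is an unweighted mean per language plus an overall mean. The helper name `summarize` and the output keys are hypothetical; the benchmark names are taken from the list above.

```python
from statistics import mean
from typing import Dict

ENGLISH = ["MMLU", "ARC-C", "ARC-E", "CommonSenseQA", "HellaSwag",
           "OpenbookQA", "PIQA", "SIQA", "Winogrande"]
CHINESE = ["C-Eval", "CMMLU"]


def summarize(scores: Dict[str, float]) -> Dict[str, float]:
    """Return unweighted English, Chinese, and overall averages.

    Raises KeyError if any of the 11 benchmarks is missing, so a partial
    evaluation run is never silently averaged.
    """
    missing = [name for name in ENGLISH + CHINESE if name not in scores]
    if missing:
        raise KeyError(f"missing benchmark scores: {missing}")
    return {
        "english_avg": mean(scores[n] for n in ENGLISH),
        "chinese_avg": mean(scores[n] for n in CHINESE),
        "overall_avg": mean(scores[n] for n in ENGLISH + CHINESE),
    }
```

Note that the overall mean weights English more heavily simply because nine of the eleven benchmarks are English; reporting the per-language averages alongside it avoids hiding that imbalance.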

Detailed evaluation results are reported below:

  • Individual Data Experiments. We perform isolated training runs using single datasets, facilitating direct comparisons between differently processed data from identical sources.

    [Figures: Individual English Table, Individual Chinese Table, Individual Plot]
  • Mixed Data Experiments. We use a mix of 60% English data, 30% Chinese data, and 10% code data (StarCoder-v2).

    [Figures: Mix Table, Mix Plot]
  • Loss and Performance Estimation Results. We use the performance estimation methods proposed in Densing Law for further analysis and verification of the effectiveness of Ultra-FineWeb.

    [Figures: Densing Law Table, Densing Law Plot]
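To make the mixed-data setup concrete, the sketch below splits a training budget across the three sources. It assumes that the 60/30/10 mix is measured in tokens and that the mixed run uses the same 100B-token budget as the other experiments; neither detail is spelled out in this section. The names `token_budget`, `TOTAL_TOKENS`, and `MIX_PERCENT` are illustrative.

```python
from typing import Dict

TOTAL_TOKENS = 100_000_000_000  # 100B tokens per experiment (assumed to apply to the mixed run)
MIX_PERCENT = {"english": 60, "chinese": 30, "code": 10}  # code source: StarCoder-v2


def token_budget(total: int, mix_percent: Dict[str, int]) -> Dict[str, int]:
    """Split a total token budget according to integer percentages."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding on large token counts.
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


budget = token_budget(TOTAL_TOKENS, MIX_PERCENT)
for name, tokens in budget.items():
    print(f"{name}: {tokens / 1e9:.0f}B tokens")
```

Under these assumptions the run would see 60B English, 30B Chinese, and 10B code tokens.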

❤️ Acknowledgements

Thanks for their awesome work! Open-source contributions make Ultra-FineWeb possible! 🙌

🌟 Citation

If you find our work useful, please consider citing:

@misc{wang2025ultrafineweb,
  title={{Ultra-FineWeb}: Efficient Data Filtering and Verification for High-Quality LLM Training Data},
  author={Yudong Wang and Zixuan Fu and Jie Cai and Peijun Tang and Hongya Lyu and Yewei Fang and Zhi Zheng and Jie Zhou and Guoyang Zeng and Chaojun Xiao and Xu Han and Zhiyuan Liu},
  year={2025},
  eprint={2505.05427},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
}

And the main paper where Ultra-FineWeb is used:

@article{minicpm4,
  title={MiniCPM4: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM Team},
  year={2025}
}

💳 License

This project is released under the Apache 2.0 license. Because Ultra-FineWeb is built from multiple datasets, users should check the LICENSE of each source dataset individually to ensure proper usage and compliance.
