Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
Abstract
Training on high-quality filtered data for multiple epochs yields better performance than single-pass training on larger, less filtered datasets for non-English language models.
Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for high-resource non-English languages such as German, French, or Japanese, aggressive filtering creates a strategic dilemma: should practitioners prioritize diversity by training once on large amounts of lightly filtered web data, or prioritize quality by strictly filtering for a high-quality core and repeating it over multiple epochs? We investigate this trade-off for German by applying hierarchical quality filters to 500M web documents and comparing multi-epoch training on the filtered subsets against single-pass training on a more diverse corpus. Our experiments across multiple model scales and token budgets show that repeating high-quality data consistently outperforms single-pass training on larger, less filtered sets; notably, the performance gap persists even after 7 epochs. Our findings suggest that for non-English LLMs, semantic concentration through quality filtering offers a more viable path to efficient language modeling than simply maximizing unique data volume. We release our German language models (called Boldt), as well as our cleaned evaluation benchmarks, to the research community. Our experiments indicate that these models achieve state-of-the-art results despite training on 10-360x fewer tokens than comparable models.
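The abstract describes the recipe without pseudocode, so here is a minimal sketch of the "filter hard, then repeat" idea. The three score names are taken from the discussion below (coherence, information value, educational quality); the Doc fields, threshold, and helper names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the "filter hard, then repeat" recipe from the
# abstract. Not the authors' code: score names, threshold, and Doc fields
# are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class Doc:
    text: str
    coherence: float            # assumed outputs of upstream quality
    information_value: float    # classifiers, each in [0, 1]
    educational_quality: float
    n_tokens: int

def high_signal_core(docs: List[Doc], threshold: float = 0.8) -> List[Doc]:
    """Hard intersection: a document survives only if it passes every filter."""
    return [
        d for d in docs
        if min(d.coherence, d.information_value, d.educational_quality) >= threshold
    ]

def epochs_for_budget(core: List[Doc], token_budget: int) -> float:
    """Number of passes over the filtered core needed to spend the token budget."""
    core_tokens = sum(d.n_tokens for d in core)
    return token_budget / core_tokens
```

Under this reading, a fixed token budget spent on a core roughly one-seventh the size of the raw pool yields about 7 passes over the core, matching the regime in which the abstract reports the quality advantage still holding.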
Community
🇩🇪 ❤️
The dense-core + repetition idea is appealing, but I'm curious how sensitive the gains are to the exact definitions of coherence, information value, and educational quality. Did you try a softer intersection or a two-tier core, to see whether a little diversity in the core itself helps avoid overfitting across epochs? I also wonder how well this generalizes to other high-resource non-English languages with different scripts or morphological complexity. The arxivlens breakdown helped me parse the method details; a solid walkthrough that complements your writeup (https://arxivlens.com/PaperView/Details/repetition-over-diversity-high-signal-data-filtering-for-sample-efficient-german-language-modeling-9928-f4f19685).
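For what it's worth, the "softer intersection" the comment asks about could be prototyped by replacing the hard min-over-scores rule with a weighted mean, so strength on two quality axes can offset weakness on a third. A sketch reusing the hypothetical Doc class from above; the weights and threshold are made-up illustrations, not values from the paper.

```python
def soft_core(docs, weights=(0.4, 0.3, 0.3), threshold=0.75):
    """Soft intersection: admit a document on a weighted mean of its quality
    scores rather than requiring every score to clear the bar.
    Weights and threshold are illustrative, not from the paper."""
    w_c, w_i, w_e = weights
    return [
        d for d in docs
        if (w_c * d.coherence
            + w_i * d.information_value
            + w_e * d.educational_quality) >= threshold
    ]
```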
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection (2026)
- How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data (2026)
- Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining (2026)
- Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain? (2026)
- MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages (2026)
- FLUX: Data Worth Training On (2026)
- Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations (2026)
Get this paper in your agent:

```
hf papers read 2604.28075
```

Don't have the latest CLI?

```
curl -LsSf https://hf.co/cli/install.sh | bash
```

Models citing this paper: 4
Datasets citing this paper: 4
Spaces citing this paper: 0
Collections including this paper: 0