Scalable and lightweight multilingual data filtering with LLM-based annotators
High-quality multilingual data is crucial for training effective large language models (LLMs).
JQL (Judging Quality across Languages) is a scalable and lightweight data filtering approach that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators. These annotators enable robust filtering of web-scale data.
JQL improves data quality, retains more tokens than heuristic baselines, and generalizes beyond high-resource European languages, achieving strong performance on Arabic, Thai, and Mandarin. This enables efficient multilingual pretraining data curation at scale.
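The filtering step described above can be sketched as follows. This is a minimal illustration, not the released JQL code: the annotator here is a toy length heuristic standing in for a distilled cross-lingual quality model, and the score scale and threshold are assumptions.

```python
# Minimal sketch of score-based document filtering.
# The toy annotator and 0.5 threshold are illustrative assumptions;
# a JQL-style annotator would be a small model distilled from LLM judgments.

from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Document:
    text: str
    lang: str


def toy_annotator(doc: Document) -> float:
    """Stand-in quality score in [0, 1]: here, just a word-count heuristic."""
    return min(len(doc.text.split()) / 20.0, 1.0)


def filter_documents(
    docs: Iterable[Document],
    annotator: Callable[[Document], float],
    threshold: float = 0.5,
) -> Iterator[Document]:
    """Keep only documents whose quality score meets the threshold."""
    for doc in docs:
        if annotator(doc) >= threshold:
            yield doc


docs = [
    Document("short", "en"),
    Document("a much longer document with enough words to pass the toy threshold", "en"),
]
kept = list(filter_documents(docs, toy_annotator))
print(len(kept))  # only the second document passes the toy filter
```

Because the annotators are lightweight, this scoring pass can be parallelized over web-scale corpora far more cheaply than querying a large LLM judge per document.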
If you use JQL, the annotations, or the pretrained annotators, please cite the paper:
@article{your2024jql,
  title   = {JQL: Judging Quality across Languages},
  author  = {Your, Name and Collaborators, Here},
  journal = {Conference or preprint archive},
  year    = {2024}
}