🦊 JQL: Judging Quality across Languages

Scalable and lightweight multilingual data filtering with LLM-based annotators

High-quality multilingual data is crucial for training effective large language models (LLMs).
JQL (Judging Quality across Languages) is a scalable and lightweight data filtering approach that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators. These annotators enable robust filtering of web-scale data.

JQL improves data quality, retains more tokens, and generalizes beyond high-resource European languages, achieving strong performance on Arabic, Thai, and Mandarin. It outperforms heuristic baselines and enables efficient multilingual pretraining data curation at scale.

🧩 Main Pipeline Steps

Figure 1: Overview of the JQL pipeline
  1. 📋 Ground Truth Creation: Human annotators label monolingual documents following a structured instruction prompt, and the labeled documents are translated into all target languages to form a multilingual gold-standard dataset. (See Figure 1)
  2. 🤖 LLM-as-a-Judge Selection & Data Annotation: Strong multilingual LLMs (e.g., Gemma, Mistral, LLaMA) are evaluated against the ground truth, and the top-performing models produce synthetic quality annotations at scale.
  3. 🪶 Lightweight Annotator Training: Compact regression heads are trained on frozen multilingual embeddings, yielding efficient, high-throughput annotators.
  4. 🚀 Scalable Data Filtering: The trained annotators score large-scale pretraining corpora, and documents are filtered with quantile thresholds.
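Steps 3 and 4 can be sketched in a few lines of numpy. This is a minimal illustration, not the actual JQL implementation: the embeddings and quality scores below are synthetic stand-ins for frozen multilingual encoder outputs and LLM-judge annotations, and the regression head is a closed-form ridge fit where the real pipeline may use a small trained MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 1000 documents with 32-dim "frozen embeddings"
# and quality scores that play the role of LLM-judge annotations (step 2).
n_docs, dim = 1000, 32
embeddings = rng.normal(size=(n_docs, dim))
true_w = rng.normal(size=dim)
scores = embeddings @ true_w + 0.1 * rng.normal(size=n_docs)

# Step 3: lightweight annotator = linear regression head on the frozen
# embeddings, fit in closed form with a small ridge penalty.
lam = 1e-3
w = np.linalg.solve(
    embeddings.T @ embeddings + lam * np.eye(dim),
    embeddings.T @ scores,
)
predicted = embeddings @ w

# Step 4: quantile-threshold filtering — keep the top 30% of documents
# by predicted quality score.
threshold = np.quantile(predicted, 0.7)
kept = embeddings[predicted >= threshold]
print(f"kept {kept.shape[0]} of {n_docs} documents")
```

Because the head is a single matrix-vector product per document, it can score web-scale corpora at a throughput far beyond running the judge LLMs directly.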

📊 Results

📁 Available Artifacts

📜 Citation

If you use JQL, the annotations, or the pretrained annotators, please cite the paper:

@article{your2024jql,
  title={JQL: Judging Quality across Languages},
  author={Your, Name and Collaborators, Here},
  journal={Conference or preprint archive},
  year={2024}
}