| <!DOCTYPE html> |
| <html> |
| <head> |
| <meta charset="utf-8"> |
| <meta name="description" content="JQL: Judging Quality across Languages - A pipeline for multilingual data filtering."> |
| <meta name="viewport" content="width=device-width, initial-scale=1"> |
| <title>JQL: Judging Quality across Languages</title> |
| <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet"> |
| <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css"> |
| <style> |
| body { font-family: 'Noto Sans', sans-serif; } |
| .hero.is-primary { background-color: #f9d5e5; } |
| .subtitle img { max-width: 100%; height: auto; } |
| .section-title { margin-top: 2em; } |
| </style> |
| </head> |
| <body> |
| <section class="hero is-primary"> |
| <div class="hero-body"> |
| <div class="container has-text-centered"> |
<h1 class="title is-1">🐦 JQL: Judging Quality across Languages</h1>
| <p class="subtitle is-5">Scalable and lightweight multilingual data filtering with LLM-based annotators</p> |
| </div> |
| </div> |
| </section> |
|
|
| <section class="section"> |
| <div class="container content"> |
| <p> |
| High-quality multilingual data is crucial for training effective large language models (LLMs). |
| <strong>JQL (Judging Quality across Languages)</strong> is a scalable and lightweight multilingual data filtering approach that distills the judgment capabilities of strong |
| multilingual LLMs into efficient cross-lingual annotators. |
| </p> |
| <p> |
| Overall, JQL improves data quality, retains more tokens, and generalizes to unseen languages. It outperforms heuristic baselines and enables cost-efficient multilingual pretraining data curation at scale. |
| </p> |
| </div> |
| </section> |
| |
| <section class="section"> |
| <div class="container content"> |
<h2 class="title is-3">🧩 Main Pipeline Steps</h2>
| <figure> |
| <img src="https://cdn-uploads.huggingface.co/production/uploads/64bfc4d55ce3d382c05c0f9a/1zPQcwqt9Li_gCvd04_2_.png" alt="JQL Pipeline Overview"> |
| <figcaption><em>Figure 1: Overview of the JQL pipeline</em></figcaption> |
| </figure> |
|
|
| <ol> |
<li><strong>📋 Ground Truth Creation:</strong> Human annotators label monolingual documents according to a structured instruction prompt; these documents are then translated into all target languages to create a multilingual gold-standard dataset (see Figure 1).</li>
<li><strong>🤖 LLM-as-a-Judge Selection &amp; Data Annotation:</strong> Strong multilingual LLMs (e.g., Gemma, Mistral, LLaMA) are evaluated against the ground truth, and the top-performing models are used to produce synthetic annotations at scale.</li>
<li><strong>🪶 Lightweight Annotator Training:</strong> Compact regression heads are trained on frozen multilingual embeddings, yielding efficient, high-throughput annotators.</li>
<li><strong>🔍 Scalable Data Filtering:</strong> The trained annotators score large-scale pretraining corpora, and documents are filtered by quantile thresholds on the predicted quality.</li>
| </ol> |
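Steps 3 and 4 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the embeddings, quality scores, dimensions, and the closed-form ridge-regression head below are synthetic placeholders standing in for the frozen multilingual encoder outputs, the LLM-judge annotations, and the trained regression head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: in the real pipeline, `embeddings` would come from a
# frozen multilingual encoder and `quality_scores` from the LLM judges.
n_docs, dim = 1000, 32
embeddings = rng.normal(size=(n_docs, dim))
true_w = rng.normal(size=dim)
quality_scores = embeddings @ true_w + rng.normal(scale=0.1, size=n_docs)

# Step 3 (sketch): fit a compact linear regression head on the frozen
# embeddings, here via closed-form ridge regression.
lam = 1e-3
head = np.linalg.solve(
    embeddings.T @ embeddings + lam * np.eye(dim),
    embeddings.T @ quality_scores,
)

# Step 4 (sketch): score documents and keep those above a quantile threshold.
predicted = embeddings @ head
threshold = np.quantile(predicted, 0.6)  # the 0.6 quantile, as under Results
kept = embeddings[predicted >= threshold]
print(len(kept) / n_docs)  # prints 0.4 (top 40% of documents retained)
```

Because the head is a small linear map over precomputed embeddings, scoring reduces to one matrix-vector product per document, which is what makes annotation cheap enough to run over full pretraining corpora.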
| </div> |
| </section> |
|
|
| <section class="section"> |
| <div class="container content"> |
<h2 class="title is-3">📊 Results</h2>
| <ul> |
<li><strong>⚖️ Accuracy:</strong> Spearman's ρ > 0.87 with human ground-truth labels</li>
<li><strong>📈 Downstream LLM Training:</strong>
<ul>
<li>+7.2% benchmark performance improvement</li>
<li>+4.8% token retention vs. the FineWeb2 heuristic filter</li>
<li>Most effective filtering thresholds: the 0.6 and 0.7 quantiles</li>
</ul>
</li>
<li><strong>⚡ Annotation Speed:</strong> ~11,000 docs/min on an A100 GPU (avg. 690 tokens per document)</li>
| </ul> |
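The accuracy figure above is a rank correlation between annotator scores and human labels. A self-contained sketch of that metric, Spearman's ρ (the document scores below are made-up illustrative values, not data from the paper):

```python
def rankdata(values):
    # Assign average ranks (1-based), handling ties as Spearman's rho requires.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Spearman's rho = Pearson correlation of the rank-transformed values.
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical human labels vs. annotator scores on a 0-5 quality scale.
human = [0, 1, 1, 2, 3, 4, 5, 5]
model = [0.2, 0.9, 1.4, 2.1, 2.8, 4.2, 4.7, 4.9]
print(round(spearman_rho(human, model), 3))  # prints 0.988
```

Tie handling matters here because human labels on a discrete quality scale frequently coincide, while the model's continuous scores rarely do.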
| </div> |
| </section> |
|
|
| <section class="section"> |
| <div class="container content"> |
<h2 class="title is-3">📦 Available Artifacts</h2>
| <ul> |
<li>🌍 Ground truth annotations in 35 languages</li>
<li>🧠 Synthetic LLM-annotated dataset (14M+ documents)</li>
<li>🪶 Lightweight annotation models:
<ul>
<li>JQL-Gemma</li>
<li>JQL-Mistral</li>
<li>JQL-Llama</li>
</ul>
</li>
<li>🛠️ Training &amp; inference scripts (coming soon)</li>
| </ul> |
| </div> |
| </section> |
|
|
| <section class="section"> |
| <div class="container content"> |
<h2 class="title is-3">📄 Citation</h2>
| <p>If you use JQL, the annotations, or the pretrained annotators, please cite the paper:</p> |
| <pre><code>@article{your2024jql, |
| title={JQL: Judging Quality across Languages}, |
| author={Your, Name and Collaborators, Here}, |
| journal={Conference or preprint archive}, |
| year={2024} |
| }</code></pre> |
| </div> |
| </section> |
|
|
| </body> |
| </html> |