---
sdk: static
pinned: false
---
# 🦊 JQL: Judging Quality across Languages

High-quality multilingual data is crucial for training effective large language models (LLMs). **JQL (Judging Quality across Languages)** is a scalable, lightweight data-filtering approach that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators. These annotators enable robust filtering of web-scale data.

JQL improves data quality, retains more tokens, and generalizes beyond high-resource European languages, achieving strong performance on Arabic, Thai, and Mandarin. It outperforms heuristic baselines and enables efficient multilingual pretraining data curation at scale.

---

## 🧩 Main Pipeline Steps

![JQL pipeline](figure.png)

1. **📋 Ground Truth Creation**
   Human annotators label monolingual documents based on a structured instruction prompt. These documents are then translated into all target languages to form a multilingual gold-standard dataset. *(See Figure 1)*

2. **🤖 LLM-as-a-Judge Selection & Data Annotation**
   Several strong multilingual LLMs (e.g., Gemma-3-27B-it, Mistral-3.1-24B-it, LLaMA-3.3-70B-it) are evaluated against the ground-truth dataset. The top-*n* models are then used to generate high-quality synthetic annotations on a large multilingual corpus.

3. **🪶 Lightweight Annotator Training**
   Regression heads are trained on top of frozen multilingual embeddings (e.g., Snowflake Arctic Embed v2) using the synthetic annotations. This yields lightweight, efficient annotators capable of high-throughput filtering.

4. **🚀 Scalable Data Filtering**
   The trained annotators label large-scale pretraining corpora, and quality thresholds (e.g., the 0.6 or 0.7 quantile of annotator scores) are applied to retain high-quality subsets for downstream LLM training.
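Steps 3 and 4 can be sketched in a few lines. The snippet below is a minimal illustration, not the released implementation: random vectors stand in for the frozen multilingual embeddings, and a simple ridge-regression head stands in for the trained annotator; only the quantile-threshold filtering mirrors the setup described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen document embeddings (the paper uses Snowflake
# Arctic Embed v2) and synthetic quality scores from the LLM judges.
X = rng.normal(size=(1000, 64))               # document embeddings
w_true = rng.normal(size=64)
y = X @ w_true + 0.1 * rng.normal(size=1000)  # synthetic quality labels

# Step 3: a lightweight regression head, here fit in closed form (ridge).
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(64), X.T @ y)

# Step 4: score the corpus and keep documents above a quality quantile.
scores = X @ w
threshold = np.quantile(scores, 0.7)          # 0.7-quantile filter
kept = scores >= threshold

print(f"retained {kept.mean():.0%} of documents")
```

A 0.7-quantile threshold retains roughly the top 30% of documents by predicted quality; lowering it to 0.6 trades some quality for higher token retention.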

---

## 📊 Results

- **Accuracy**: Spearman’s ρ > 0.87 against human ground-truth labels
- **Downstream LLM training**:
  - Up to **+7.2%** benchmark performance improvement
  - **+4.8%** higher token retention than the baseline FineWeb2 heuristic filter
  - Reliable quality/quantity trade-offs with the 0.6- and 0.7-quantile filtering strategies
- **Annotation speed**: ~11,000 docs/min on a single A100 GPU (average document length: 690 tokens)
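The accuracy figure above is a rank correlation between annotator scores and human labels, which can be computed with SciPy. The values below are made-up illustrative scores, not data from the paper.

```python
from scipy.stats import spearmanr

# Hypothetical human quality labels and annotator scores for eight
# documents (illustrative only; JQL reports rho > 0.87 on its GT set).
human_labels = [0, 1, 1, 2, 3, 4, 5, 5]
model_scores = [0.2, 0.9, 1.4, 2.1, 2.8, 4.2, 4.6, 4.9]

rho, pvalue = spearmanr(human_labels, model_scores)
print(f"Spearman's rho = {rho:.3f}")
```

Spearman’s ρ is computed on ranks, so it rewards annotators that order documents correctly even when their absolute scores are miscalibrated, which is what matters for threshold-based filtering.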

---

## 📁 Available Artifacts

- ✅ Ground-truth annotations in 35 languages
- ✅ Synthetic LLM-annotated dataset (14M+ documents)
- ✅ Lightweight annotation models:
  - `JQL-Gemma`
  - `JQL-Mistral`
  - `JQL-Llama`
- ✅ Training & inference scripts *(coming soon)*

---

## 📜 Citation

If you use JQL, the annotations, or the pretrained annotators, please cite the paper: