---
sdk: static
pinned: false
---
# 🦊 JQL: Judging Quality across Languages

High-quality multilingual data is crucial for training effective large language models (LLMs). **JQL (Judging Quality across Languages)** is a scalable, lightweight data-filtering approach that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators. These annotators enable robust filtering of web-scale data.

JQL improves data quality, retains more tokens, and generalizes beyond high-resource European languages, achieving strong performance on Arabic, Thai, and Mandarin. It outperforms heuristic baselines and enables efficient multilingual pretraining data curation at scale.

---

## 🧩 Main Pipeline Steps

![JQL pipeline](figure.png)

1. **📋 Ground Truth Creation**
   Human annotators label monolingual documents based on a structured instruction prompt. These documents are then translated into all target languages to form a multilingual gold-standard dataset. *(See Figure 1)*

2. **🤖 LLM-as-a-Judge Selection & Data Annotation**
   Several strong multilingual LLMs (e.g., Gemma-3-27B-it, Mistral-3.1-24B-it, LLaMA-3.3-70B-it) are evaluated against the ground-truth dataset. The top-*n* models are then used to generate high-quality synthetic annotations on a large multilingual corpus.

3. **🪶 Lightweight Annotator Training**
   Regression heads are trained on top of frozen multilingual embeddings (e.g., Snowflake Arctic Embed v2) using the synthetic annotations. This yields lightweight, efficient annotators capable of high-throughput filtering.

4. **🚀 Scalable Data Filtering**
   The trained annotators label large-scale pretraining corpora, and quality thresholds (e.g., the 0.6 or 0.7 quantile of annotator scores) are applied to retain high-quality subsets for downstream LLM training.
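Steps 3 and 4 can be sketched in a few lines. The snippet below is a minimal illustration, not the released implementation: random vectors stand in for the frozen multilingual embeddings, and a simple ridge-regression head stands in for the trained annotator; only the quantile-threshold filtering mirrors the setup described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen document embeddings (the paper uses Snowflake
# Arctic Embed v2) and synthetic quality scores from the LLM judges.
X = rng.normal(size=(1000, 64))               # document embeddings
w_true = rng.normal(size=64)
y = X @ w_true + 0.1 * rng.normal(size=1000)  # synthetic quality labels

# Step 3: a lightweight regression head, here fit in closed form (ridge).
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(64), X.T @ y)

# Step 4: score the corpus and keep documents above a quality quantile.
scores = X @ w
threshold = np.quantile(scores, 0.7)          # 0.7-quantile filter
kept = scores >= threshold

print(f"retained {kept.mean():.0%} of documents")
```

A 0.7-quantile threshold retains roughly the top 30% of documents by predicted quality; lowering it to 0.6 trades some quality for higher token retention.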

---

## 📊 Results

- **Accuracy**: Spearman’s ρ > 0.87 against human ground-truth labels
- **Downstream LLM training**:
  - Up to **+7.2%** benchmark performance improvement
  - **+4.8%** higher token retention than the baseline FineWeb2 heuristic filter
  - Reliable quality/quantity trade-offs with the 0.6- and 0.7-quantile filtering strategies
- **Annotation speed**: ~11,000 docs/min on a single A100 GPU (average document length: 690 tokens)
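The accuracy figure above is a rank correlation between annotator scores and human labels, which can be computed with SciPy. The values below are made-up illustrative scores, not data from the paper.

```python
from scipy.stats import spearmanr

# Hypothetical human quality labels and annotator scores for eight
# documents (illustrative only; JQL reports rho > 0.87 on its GT set).
human_labels = [0, 1, 1, 2, 3, 4, 5, 5]
model_scores = [0.2, 0.9, 1.4, 2.1, 2.8, 4.2, 4.6, 4.9]

rho, pvalue = spearmanr(human_labels, model_scores)
print(f"Spearman's rho = {rho:.3f}")
```

Spearman’s ρ is computed on ranks, so it rewards annotators that order documents correctly even when their absolute scores are miscalibrated, which is what matters for threshold-based filtering.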

---

## 📁 Available Artifacts

- ✅ Ground-truth annotations in 35 languages
- ✅ Synthetic LLM-annotated dataset (14M+ documents)
- ✅ Lightweight annotation models:
  - `JQL-Gemma`
  - `JQL-Mistral`
  - `JQL-Llama`
- ✅ Training & inference scripts *(coming soon)*

---

## 📜 Citation

If you use JQL, the annotations, or the pretrained annotators, please cite the paper: