mfromm committed on May 28, 2025

Commit 205b70d (verified) · 1 Parent(s): 7c00f62

Update index.html

Files changed (1)

index.html CHANGED

@@ -56,18 +56,18 @@
 <section class="section">
   <div class="container content">
-    <h2 class="title is-3">📊 Results</h2>
-    <ul>
-      <li><strong>✔️ Accuracy:</strong> Spearman’s ρ > 0.87 with human ground truth</li>
-      <li><strong>📈 Downstream LLM Training:</strong>
-        <ul>
-          <li>+7.2% benchmark performance improvement</li>
-          <li>+4.8% token retention vs. FineWeb2 heuristic filter</li>
-          <li>Effective threshold strategies: 0.6 and 0.7 quantile</li>
-        </ul>
-      </li>
-      <li><strong>⚡ Annotation Speed:</strong> ~11,000 docs/min (A100 GPU, avg. 690 tokens)</li>
-    </ul>
   </div>
 </section>

 <section class="section">
   <div class="container content">
+    <h2 class="title is-3">🧩 Main Pipeline Steps</h2>
+    <figure>
+      <img src="https://cdn-uploads.huggingface.co/production/uploads/64bfc4d55ce3d382c05c0f9a/1zPQcwqt9Li_gCvd04_2_.png" alt="JQL Pipeline Overview">
+      <figcaption><em>Figure 1: Overview of the JQL pipeline</em></figcaption>
+    </figure>
+    <ol>
+      <li><strong>📋 Ground Truth Creation:</strong> Human annotators label monolingual documents based on a structured instruction prompt. These documents are translated into all target languages to create a multilingual gold-standard dataset. (See Figure 1)</li>
+      <li><strong>🤖 LLM-as-a-Judge Selection & Data Annotation:</strong> Strong multilingual LLMs (e.g., Gemma, Mistral, LLaMA) are evaluated against the ground truth, and top-performing models are used to produce synthetic annotations. (See Figure 1)</li>
+      <li><strong>🪶 Lightweight Annotator Training:</strong> Train compact regression heads on frozen multilingual embeddings to create efficient, high-throughput annotators. (See Figure 1)</li>
+      <li><strong>🚀 Scalable Data Filtering:</strong> Use trained annotators to filter large-scale pretraining corpora using quantile thresholds. (See Figure 1)</li>
+    </ol>
   </div>
 </section>
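
The added list describes judge selection and quantile-based filtering only in prose. As a rough illustration, and not the project's actual code, the sketch below shows one way steps 2 and 4 could look in plain Python: rank candidate judges by Spearman correlation against human labels, then keep documents whose annotator score is at or above a chosen quantile. All function names, the tie handling, the linear-interpolation quantile, and the "keep at or above the threshold" direction are assumptions made for this example.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


def best_judge(human_labels, judge_scores):
    """Pick the judge (hypothetical name -> scores mapping) closest to humans."""
    return max(judge_scores, key=lambda name: spearman(human_labels, judge_scores[name]))


def quantile_threshold(scores, q):
    """q-quantile of scores using linear interpolation between order statistics."""
    s = sorted(scores)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def filter_by_quantile(docs, scores, q):
    """Keep documents scoring at or above the q-quantile (assumed direction)."""
    threshold = quantile_threshold(scores, q)
    return [doc for doc, score in zip(docs, scores) if score >= threshold]
```

Under these assumptions, calling `filter_by_quantile(docs, scores, 0.7)` keeps roughly the top 30% of documents by score, while a lower quantile retains more tokens at the cost of a looser quality bar.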