Add org card: platform description, stats, leaderboard, dataset, team
README.md
CHANGED
@@ -1,10 +1,86 @@
 ---
 title: README
-emoji:
-colorFrom:
-colorTo:
 sdk: static
-pinned:
 ---
---
title: README
emoji: 🏆
colorFrom: green
colorTo: blue
sdk: static
pinned: true
---

# Haakkim: حَكِّم

**An open arena-style human preference evaluation platform for Arabic LLMs.**

Haakkim collects blind pairwise judgments between Arabic language models and ranks them using a statistically grounded Bradley–Terry (BT) model with inverse-probability weighting (IPW) and bootstrap confidence intervals (CIs).

🌐 [haakkim.tech](https://haakkim.tech) · 🏆 [Live Leaderboard](https://haakkim.tech/#leaderboard) · 📦 [Dataset](https://huggingface.co/datasets/Haakkim/Haakkim-1.0v)

---

## What is Haakkim?

Most Arabic LLM benchmarks rely on fixed tasks and evaluate only Modern Standard Arabic (MSA). Haakkim takes a different approach: real users compare two models side by side on real Arabic prompts across 11 Arabic varieties, and the results are aggregated into a statistically validated leaderboard.

**Key features:**
- **11 Arabic varieties:** MSA, Saudi, Egyptian, Levantine, Tunisian, Iraqi, Moroccan, Algerian, Sudanese, Omani, Libyan
- **3 evaluation modes:** Ranked Arena (the official BT leaderboard), Side-by-Side, 10 Questions
- **Principled scoring:** Bradley–Terry with IPW sampling corrections and bootstrap CIs
- **Rankability gate:** BT scores are published only when the comparison graph is fully connected and the effective sample size (ESS) is sufficient
- **Open & reproducible:** the full dataset, audit logs, and scoring pipelines are public
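
The scoring feature listed above can be illustrated with a minimal sketch of a weighted Bradley–Terry fit. This is not Haakkim's published pipeline: it assumes a standard minorization-maximization (MM) update, it assumes each battle carries an IPW weight equal to one over its sampling probability, and the function name `fit_bradley_terry` and the `(winner, loser, weight)` tuple layout are hypothetical. It also assumes every model has at least one win and that no model is compared with itself. In the two-model example at the bottom, weighted wins of 3 to 1 give a log-strength gap of ln(3), about 1.10.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, n_iter=500, tol=1e-10):
    """Fit a weighted Bradley-Terry model with MM updates.

    battles: iterable of (winner, loser, weight) tuples (hypothetical layout).
    Returns a dict mapping model -> log-strength, centered so the mean is 0.
    Assumes every model wins at least once and winner != loser.
    """
    wins = defaultdict(float)    # weighted wins per model
    pair_n = defaultdict(float)  # weighted comparisons per unordered pair
    models = set()
    for winner, loser, weight in battles:
        wins[winner] += weight
        pair_n[frozenset((winner, loser))] += weight
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            denom = 0.0
            for pair, n in pair_n.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        # Rescale so the geometric mean is 1 (log-strengths average to 0).
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        updated = {m: v / math.exp(log_mean) for m, v in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return {m: math.log(v) for m, v in strength.items()}


# Illustrative data only: model "A" has weighted wins 3.0, model "B" has 1.0.
example_battles = [("A", "B", 3.0), ("B", "A", 1.0)]
theta = fit_bradley_terry(example_battles)
gap = theta["A"] - theta["B"]  # log-strength gap is ln(3), about 1.10
```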

---

## Current Snapshot (v1.0)

| Metric | Value |
|---|---|
| Total battles collected | 1,273 |
| Ranked-eligible (BT) | 831 |
| Models on leaderboard | 67 |
| Dialects covered | 11 |
| Graph | Fully connected · 774 edges · density 0.35 |
| ESS (clamped) | 465 |
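
The graph and ESS rows can be sanity-checked with a short sketch. It rests on assumptions: density is taken as edges divided by the number of possible unordered model pairs (which reproduces 0.35 for 67 models and 774 edges), "fully connected" is read as a single connected component (a density of 0.35 means not every pair has met), and ESS is computed with Kish's formula. How Haakkim clamps weights and what ESS threshold it applies are not stated here, so `min_ess` and all function names are hypothetical.

```python
from collections import defaultdict, deque


def graph_density(n_models, n_edges):
    """Share of possible unordered model pairs that have at least one battle."""
    possible_pairs = n_models * (n_models - 1) / 2
    return n_edges / possible_pairs


def is_connected(models, edges):
    """True if every model is reachable from every other via compared pairs."""
    models = list(models)
    if not models:
        return False
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {models[0]}
    queue = deque([models[0]])
    while queue:  # breadth-first search from the first model
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return set(models) <= seen


def kish_ess(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def is_rankable(models, edges, weights, min_ess):
    """Hypothetical gate: connected graph and ESS at or above a threshold."""
    return is_connected(models, edges) and kish_ess(weights) >= min_ess


# Snapshot figures from the table: 67 models, 774 edges.
snapshot_density = round(graph_density(67, 774), 2)  # 0.35
```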

---

## MSA Leaderboard — Top 10

| Rank | Model | BT Score |
|---|---|---|
| 1 | mistralai/ministral-3b-2512 | 1001.75 |
| 2 | mistralai/ministral-8b-2512 | 1001.61 |
| 3 | Qwen/Qwen3-235B-A22B-Thinking-2507 | 1001.21 |
| 4 | Qwen/Qwen3-30B-A3B-Instruct-2507 | 1001.14 |
| 5 | deepseek/deepseek-v3.2-exp | 1001.13 |
| 6 | deepseek/deepseek-v3.1 | 1000.99 |
| 7 | Qwen/Qwen3-235B-A22B-Instruct-2507 | 1000.98 |
| 8 | deepseek/deepseek-r1-0528 | 1000.93 |
| 9 | openai/gpt-oss-120b | 1000.93 |
| 10 | deepseek/deepseek-v3.2 | 1000.89 |

Scores are in log-odds units, centered at 1000. Full leaderboard → [haakkim.tech/#leaderboard](https://haakkim.tech/#leaderboard)
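
To read a score gap as a head-to-head win probability, one can apply the logistic function to the difference between two scores. The sketch below assumes the scores are natural-log odds with no additional scaling factor (unlike Elo's 400-point base-10 scale); that reading is an assumption, and `win_probability` is a hypothetical helper. Under it, the gap of 0.86 between rank 1 and rank 10 corresponds to a win probability of about 0.70.

```python
import math


def win_probability(score_a, score_b):
    """Bradley-Terry probability that A is preferred over B.

    Assumes scores are natural-log odds on a shared additive scale,
    so the 1000 offset cancels in the difference.
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


# Rank 1 versus rank 10 from the table above (gap of 0.86 log-odds units).
p_top_vs_tenth = win_probability(1001.75, 1000.89)  # about 0.70

# Rank 1 versus rank 2 (gap of 0.14): close to a coin flip, about 0.53.
p_top_vs_second = win_probability(1001.75, 1001.61)
```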

---

## Dataset

The first public release of Haakkim battle data is available on Hugging Face:

📦 **[Haakkim/Haakkim-1.0v](https://huggingface.co/datasets/Haakkim/Haakkim-1.0v)**

- 1,273 battle records (JSONL converted to Parquet, PII-scrubbed)
- Includes both voted comparisons and skipped battles
- Covers all 11 dialect varieties and all 3 evaluation modes
- Contains full conversation transcripts, sampling weights, and category annotations
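
A minimal loading sketch using the Hugging Face `datasets` library follows. The split name `"train"` is an assumption (check the dataset page for the actual splits), and `load_battles` is a hypothetical helper. No column names are assumed, so the usage shown in the comments inspects them at run time.

```python
REPO_ID = "Haakkim/Haakkim-1.0v"


def load_battles(split="train"):
    """Download the battle records from the Hugging Face Hub.

    The default split name is an assumption; requires `pip install datasets`
    and network access when called.
    """
    from datasets import load_dataset  # imported lazily so defining this is cheap

    return load_dataset(REPO_ID, split=split)


# Usage (needs network access):
#   battles = load_battles()
#   print(battles)               # row count and features
#   print(battles.column_names)  # inspect the schema before relying on any field
```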

---

## Team

**College of Computing, Umm Al-Qura University — Mecca, Saudi Arabia**

| Name | Role |
|---|---|
| [Mourad Mars](https://huggingface.co/mouradmars) | Principal Investigator |
| [Hassan Barmandah](https://huggingface.co/HassanB4) | AI Researcher |
| Abdulrhman Alassaf | Software Engineer |