HassanB4 committed on Apr 9

Commit b2e646c (verified) · 1 Parent(s): 367609d

Add org card: platform description, stats, leaderboard, dataset, team

Files changed (1): README.md +81 -5

@@ -1,10 +1,86 @@
 ---
 title: README
-emoji: 🔥
-colorFrom: blue
-colorTo: indigo
 sdk: static
-pinned: false
 ---
-Edit this `README.md` markdown file to author your organization card.

 ---
 title: README
+emoji: 🏆
+colorFrom: green
+colorTo: blue
 sdk: static
+pinned: true
 ---
+# Haakkim — حَكِّم
+**An open arena-style human preference evaluation platform for Arabic LLMs.**
+Haakkim collects blind pairwise judgments between Arabic language models and ranks them using a statistically grounded Bradley–Terry model with inverse-probability weighting and bootstrap confidence intervals.
+🌐 [haakkim.tech](https://haakkim.tech) &nbsp;·&nbsp; 🏆 [Live Leaderboard](https://haakkim.tech/#leaderboard) &nbsp;·&nbsp; 📦 [Dataset](https://huggingface.co/datasets/Haakkim/Haakkim-1.0v)
+---
+## What is Haakkim?
+Most Arabic LLM benchmarks rely on fixed tasks and MSA-only evaluation. Haakkim takes a different approach: real users compare two models side-by-side on real Arabic prompts, across 11 dialect varieties, and the results are aggregated into a statistically validated leaderboard.
+**Key features:**
+- **11 Arabic varieties** — MSA, Saudi, Egyptian, Levantine, Tunisian, Iraqi, Moroccan, Algerian, Sudanese, Omani, Libyan
+- **3 evaluation modes** — Ranked Arena (official BT leaderboard), Side-by-Side, 10 Questions
+- **Principled scoring** — Bradley–Terry with IPW sampling corrections and bootstrap CIs
+- **Rankability gate** — BT scores only published when the comparison graph is fully connected and ESS is sufficient
+- **Open & reproducible** — full dataset, audit logs, and scoring pipelines are public
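
The "Principled scoring" feature above can be illustrated with a minimal sketch of a weighted Bradley–Terry fit. This is not the Haakkim pipeline: it uses the standard minorization-maximization (MM) update, and it assumes each battle carries an inverse-probability weight `1 / sampling_prob`. The function names, the `(winner, loser, weight)` tuple layout, and the tiny floor for models with no wins are all hypothetical choices made for illustration. Bootstrap confidence intervals are not shown.

```python
import math
from collections import defaultdict


def ipw_weight(sampling_prob):
    """Inverse-probability weight for a battle whose pair was sampled with this probability."""
    return 1.0 / sampling_prob


def fit_bradley_terry(battles, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths with the MM algorithm.

    battles: iterable of (winner, loser, weight) with winner != loser.
    Returns {model: score}, where score = 1000 + log-strength and the
    mean score is 1000 (assumed convention, natural-log units).
    """
    wins = defaultdict(float)   # weighted wins per model
    pair = defaultdict(float)   # weighted comparisons per unordered pair
    models = set()
    for winner, loser, w in battles:
        wins[winner] += w
        pair[frozenset((winner, loser))] += w
        models.update((winner, loser))
    models = sorted(models)

    p = {m: 1.0 for m in models}
    for _ in range(iters):
        denom = defaultdict(float)
        for key, n in pair.items():
            a, b = tuple(key)
            share = n / (p[a] + p[b])
            denom[a] += share
            denom[b] += share
        # Floor keeps log() finite for a model with zero wins (illustrative choice).
        new_p = {m: max(wins[m], 1e-9) / denom[m] for m in models}
        # Normalise so the geometric mean of strengths is 1.
        log_mean = sum(math.log(v) for v in new_p.values()) / len(models)
        scale = math.exp(log_mean)
        new_p = {m: v / scale for m, v in new_p.items()}
        delta = max(abs(new_p[m] - p[m]) for m in models)
        p = new_p
        if delta < tol:
            break
    return {m: 1000.0 + math.log(p[m]) for m in models}


# Toy data: A beats B three times, B beats A once, all pairs sampled with probability 0.5.
toy_battles = [("A", "B", ipw_weight(0.5))] * 3 + [("B", "A", ipw_weight(0.5))]
toy_scores = fit_bradley_terry(toy_battles)
```

On this toy data the fitted gap between A and B equals ln 3, the log-odds of a 3:1 win record. Uniform weights cancel out of the fit, so the weights only matter when pairs are sampled at different rates.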
+---
+## Current Snapshot (v1.0)
+| | |
+|---|---|
+| Total battles collected | 1,273 |
+| Ranked-eligible (BT) | 831 |
+| Models on leaderboard | 67 |
+| Dialects covered | 11 |
+| Graph | Fully connected · 774 edges · density 0.35 |
+| ESS (clamped) | 465 |
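
The graph figures in this snapshot can be cross-checked with a short sketch. It assumes density is the usual undirected ratio `2E / (N(N-1))` and that ESS refers to Kish's effective sample size `(Σw)² / Σw²`. The README does not say how the ESS is clamped, so no clamping is applied here. All function names are hypothetical.

```python
from collections import defaultdict


def graph_density(n_models, n_edges):
    """Fraction of possible undirected model pairs that have at least one battle."""
    return 2 * n_edges / (n_models * (n_models - 1))


def is_connected(models, edges):
    """True if every model is reachable from every other via compared pairs."""
    models = list(models)
    if not models:
        return True
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen = {models[0]}
    stack = [models[0]]
    while stack:
        node = stack.pop()
        for nxt in adj[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(models)


def kish_ess(weights):
    """Kish effective sample size for a list of positive sampling weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# 67 models and 774 edges, as reported in the snapshot table.
snapshot_density = round(graph_density(67, 774), 2)
```

With 67 models there are 2,211 possible pairs, so 774 edges gives the reported density of 0.35. A rankability gate in this spirit would require `is_connected(...)` to be true and `kish_ess(...)` to exceed some threshold; the actual threshold is not stated in the README.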
+---
+## MSA Leaderboard — Top 10
+| Rank | Model | BT Score |
+|---|---|---|
+| 1 | mistralai/ministral-3b-2512 | 1001.75 |
+| 2 | mistralai/ministral-8b-2512 | 1001.61 |
+| 3 | Qwen/Qwen3-235B-A22B-Thinking-2507 | 1001.21 |
+| 4 | Qwen/Qwen3-30B-A3B-Instruct-2507 | 1001.14 |
+| 5 | deepseek/deepseek-v3.2-exp | 1001.13 |
+| 6 | deepseek/deepseek-v3.1 | 1000.99 |
+| 7 | Qwen/Qwen3-235B-A22B-Instruct-2507 | 1000.98 |
+| 8 | deepseek/deepseek-r1-0528 | 1000.93 |
+| 9 | openai/gpt-oss-120b | 1000.93 |
+| 10 | deepseek/deepseek-v3.2 | 1000.89 |
+Scores are 1000-centered log-odds units. Full leaderboard → [haakkim.tech/#leaderboard](https://haakkim.tech/#leaderboard)
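
To read the score gaps, a difference in log-odds can be turned into an implied head-to-head win probability with the logistic function. The sketch below assumes the scores are natural-log odds with no extra scaling, which the README does not state explicitly. The function name is hypothetical.

```python
import math


def win_probability(score_a, score_b):
    """Implied probability that model A is preferred over model B under Bradley-Terry."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


# Rank 1 (1001.75) against rank 10 (1000.89) from the table above.
p_top_vs_tenth = win_probability(1001.75, 1000.89)
```

Under that assumption, the 0.86-point gap between ranks 1 and 10 corresponds to roughly a 70% preference rate, while the 0.14-point gap between ranks 1 and 2 is close to a coin flip.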
+---
+## Dataset
+The first public release of Haakkim battle data is available on Hugging Face:
+📦 **[Haakkim/Haakkim-1.0v](https://huggingface.co/datasets/Haakkim/Haakkim-1.0v)**
+- 1,273 battle records (JSONL → Parquet, PII-scrubbed)
+- Includes voted comparisons and skipped battles
+- All 11 dialect varieties and all 3 evaluation modes
+- Full conversation transcripts, sampling weights, category annotations
+---
+## Team
+**College of Computing, Umm Al-Qura University — Mecca, Saudi Arabia**
+| | |
+|---|---|
+| [Mourad Mars](https://huggingface.co/mouradmars) | Principal Investigator |
+| [Hassan Barmandah](https://huggingface.co/HassanB4) | AI Researcher |
+| Abdulrhman Alassaf | Software Engineer |