HassanB4 committed on
Commit b2e646c · verified · 1 Parent(s): 367609d

Add org card: platform description, stats, leaderboard, dataset, team

  1. README.md +81 -5
README.md CHANGED
@@ -1,10 +1,86 @@
  ---
  title: README
- emoji: 🔥
- colorFrom: blue
- colorTo: indigo
+ emoji: 🏆
+ colorFrom: green
+ colorTo: blue
  sdk: static
- pinned: false
+ pinned: true
  ---
 
- Edit this `README.md` markdown file to author your organization card.
+ # Haakkim حَكِّم
+
+ **An open arena-style human preference evaluation platform for Arabic LLMs.**
+
+ Haakkim collects blind pairwise judgments between Arabic language models and ranks them with a Bradley–Terry (BT) model, using inverse-probability weighting (IPW) to correct for uneven sampling and bootstrap confidence intervals (CIs) to quantify uncertainty.
+
+ 🌐 [haakkim.tech](https://haakkim.tech)  ·  🏆 [Live Leaderboard](https://haakkim.tech/#leaderboard)  ·  📦 [Dataset](https://huggingface.co/datasets/Haakkim/Haakkim-1.0v)
+
+ ---
+
+ ## What is Haakkim?
+
+ Most Arabic LLM benchmarks rely on fixed tasks and evaluate only Modern Standard Arabic (MSA). Haakkim takes a different approach: real users compare two models side by side on real Arabic prompts across 11 Arabic varieties, and the results are aggregated into a statistically validated leaderboard.
+
+ **Key features:**
+ - **11 Arabic varieties**: MSA, Saudi, Egyptian, Levantine, Tunisian, Iraqi, Moroccan, Algerian, Sudanese, Omani, Libyan
+ - **3 evaluation modes**: Ranked Arena (official BT leaderboard), Side-by-Side, 10 Questions
+ - **Principled scoring**: Bradley–Terry with IPW sampling corrections and bootstrap CIs
+ - **Rankability gate**: BT scores are published only when the comparison graph is fully connected and the effective sample size (ESS) is sufficient
+ - **Open & reproducible**: the full dataset, audit logs, and scoring pipelines are public
+
+ ---
+
+ ## Current Snapshot (v1.0)
+
+ | Metric | Value |
+ |---|---|
+ | Total battles collected | 1,273 |
+ | Ranked-eligible (BT) | 831 |
+ | Models on leaderboard | 67 |
+ | Arabic varieties covered | 11 |
+ | Comparison graph | Fully connected · 774 edges · density 0.35 |
+ | Effective sample size (ESS, clamped) | 465 |
+
+ ---
+
+ ## MSA Leaderboard: Top 10
+
+ | Rank | Model | BT Score |
+ |---|---|---|
+ | 1 | mistralai/ministral-3b-2512 | 1001.75 |
+ | 2 | mistralai/ministral-8b-2512 | 1001.61 |
+ | 3 | Qwen/Qwen3-235B-A22B-Thinking-2507 | 1001.21 |
+ | 4 | Qwen/Qwen3-30B-A3B-Instruct-2507 | 1001.14 |
+ | 5 | deepseek/deepseek-v3.2-exp | 1001.13 |
+ | 6 | deepseek/deepseek-v3.1 | 1000.99 |
+ | 7 | Qwen/Qwen3-235B-A22B-Instruct-2507 | 1000.98 |
+ | 8 | deepseek/deepseek-r1-0528 | 1000.93 |
+ | 9 | openai/gpt-oss-120b | 1000.93 |
+ | 10 | deepseek/deepseek-v3.2 | 1000.89 |
+
+ Scores are in log-odds units, centered at 1000. Full leaderboard: [haakkim.tech/#leaderboard](https://haakkim.tech/#leaderboard)
+
+ ---
+
+ ## Dataset
+
+ The first public release of Haakkim battle data is available on Hugging Face:
+
+ 📦 **[Haakkim/Haakkim-1.0v](https://huggingface.co/datasets/Haakkim/Haakkim-1.0v)**
+
+ - 1,273 battle records (JSONL → Parquet, PII-scrubbed)
+ - Includes voted comparisons and skipped battles
+ - All 11 dialect varieties and all 3 evaluation modes
+ - Full conversation transcripts, sampling weights, category annotations
+
+ ---
+
+ ## Team
+
+ **College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia**
+
+ | Member | Role |
+ |---|---|
+ | [Mourad Mars](https://huggingface.co/mouradmars) | Principal Investigator |
+ | [Hassan Barmandah](https://huggingface.co/HassanB4) | AI Researcher |
+ | Abdulrhman Alassaf | Software Engineer |
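
Note on the scoring described in the card: the sketch below shows one way a weighted Bradley–Terry fit, a graph-density check, and an effective sample size could be computed. It is not the Haakkim pipeline. The function names, the `(winner, loser, weight)` input shape, the use of the Kish formula for ESS, and the reading of "1000-centered log-odds" as `1000 + log-strength` with mean 1000 are all assumptions made for illustration. Bootstrap confidence intervals and the ESS clamping mentioned in the card are not covered.

```python
import math
from collections import defaultdict


def graph_density(n_models, n_edges):
    """Share of possible model pairs that have been compared at least once."""
    return n_edges / (n_models * (n_models - 1) / 2)


def kish_ess(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def fit_weighted_bt(battles, iters=200):
    """Fit Bradley-Terry strengths with minorization-maximization updates.

    battles: iterable of (winner, loser, weight) tuples (hypothetical format);
    weight is the assumed inverse-probability weight (1.0 if unweighted).
    Returns {model: 1000 + log-strength}, centered so the mean score is 1000.
    """
    wins = defaultdict(float)         # weighted wins per model
    pair_weight = defaultdict(float)  # weighted battles per unordered pair
    for winner, loser, weight in battles:
        if winner == loser:
            continue  # a model cannot be compared with itself
        wins[winner] += weight
        wins[loser] += 0.0  # make sure the loser is registered as a model
        pair_weight[frozenset((winner, loser))] += weight
    models = sorted(wins)
    if any(wins[m] == 0.0 for m in models):
        raise ValueError("a model with no wins has no finite BT score")

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            # Sum over every opponent of m: pair weight / (p_m + p_opponent).
            denom = sum(
                n / (strength[m] + strength[other])
                for pair, n in pair_weight.items() if m in pair
                for other in pair - {m}
            )
            updated[m] = wins[m] / denom
        # Fix the scale: force the mean log-strength to zero.
        mean_log = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(mean_log) for m, v in updated.items()}
    return {m: 1000.0 + math.log(strength[m]) for m in models}


if __name__ == "__main__":
    demo = [("A", "B", 1.0)] * 3 + [("B", "A", 1.0)]
    print(fit_weighted_bt(demo))
    print(round(graph_density(67, 774), 2))
```

With the snapshot figures from the card (67 models, 774 edges), `graph_density` gives 774 / 2211, which rounds to the reported 0.35. In the two-model demo, A wins three of four battles, so the fitted scores differ by ln 3, the log-odds of a 3:1 win ratio.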