An open arena-style human preference evaluation platform for Arabic large language models, covering 11 dialect varieties and ranked by a statistically rigorous Bradley–Terry model.
Random model pairing, single-turn MSA (Modern Standard Arabic), matched system instruction. The only mode that feeds the official Bradley–Terry leaderboard. (✓ BT Leaderboard)
User-selected model pair, any dialect. Useful for targeted comparisons; excluded from ranked scoring to prevent selection bias. (Win-rate only)
Fixed Arabic prompt pool, any dialect. Provides consistent benchmarking within a curated question set. (Win-rate only)
Battle records covering all 11 dialect varieties and all 3 evaluation modes. Includes full conversation transcripts, sampling weights, and category annotations.
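As a rough illustration of how a Bradley–Terry ranking can be derived from pairwise battle records, the sketch below fits model strengths with the standard minorization-maximization (MM) update. It is not the platform's documented fitting procedure: the `(winner, loser, weight)` record layout, the function name `fit_bradley_terry`, and the use of sampling weights as per-battle multipliers are all assumptions made for this example, and ties are ignored.

```python
from collections import defaultdict


def fit_bradley_terry(battles, max_iters=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser, weight) tuples.

    Assumes winner != loser, positive weights, and that every model has
    at least one win (otherwise its estimated strength collapses to zero).
    The record layout is hypothetical, not the platform's schema.
    """
    wins = defaultdict(float)    # weighted win total per model
    pair_n = defaultdict(float)  # weighted battle count per unordered pair
    for winner, loser, weight in battles:
        wins[winner] += weight
        pair_n[frozenset((winner, loser))] += weight

    models = sorted({m for pair in pair_n for m in pair})
    strength = {m: 1.0 / len(models) for m in models}

    for _ in range(max_iters):
        updated = {}
        for i in models:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for pair, n in pair_n.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        # Strengths are only identified up to scale, so normalise to sum to 1.
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}
        converged = max(abs(updated[m] - strength[m]) for m in models) < tol
        strength = updated
        if converged:
            break
    return strength


# Hypothetical records: weight 3.0 stands in for three equally weighted battles.
example = [("model_a", "model_b", 3.0), ("model_b", "model_a", 1.0)]
scores = fit_bradley_terry(example)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

Under this model the probability that model i beats model j is p_i / (p_i + p_j), so the fitted strengths can be read as relative win odds. A production leaderboard would typically also report confidence intervals, for instance by bootstrapping over battles.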