Arabic TTS Arena: Ranking Voice Models the Way Chess Ranks Grandmasters

Community Article Published March 12, 2026

There is a beautiful irony in borrowing a ranking system from a board game to judge the quality of artificial speech. The Elo rating — invented by Arpad Elo in the 1960s to measure the relative strength of chess players — turns out to be one of the most elegant tools for comparing text-to-speech models. We used it to build the Arabic TTS Arena, the first open, community-driven leaderboard dedicated to Arabic speech synthesis.

This post explains how the arena works, what we learned from building it, and why we believe Arabic TTS needs a fundamentally different design philosophy than what exists today.


How the Arena Works

The concept is simple. A user enters Arabic text. Two TTS models — selected at random, identities hidden — synthesize the same sentence. The user listens to both and votes for the one that sounds better (or marks them as tied). That single vote becomes a data point. Thousands of such votes become a leaderboard.
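Mechanically, a battle needs very little state. The sketch below is illustrative only — the model identifiers and the vote-record shape are invented for the example, not taken from the arena's codebase:

```python
import random
from dataclasses import dataclass

# Illustrative model identifiers, not the arena's internal names
MODELS = ["arabic-f5-tts", "xtts-v2", "fish-speech-s1-mini"]

@dataclass
class Vote:
    model_a: str   # anonymised as "A" in the UI
    model_b: str   # anonymised as "B" in the UI
    outcome: str   # "a", "b", or "tie"

def new_battle(models):
    """Pick two distinct models at random; identities stay hidden from the voter."""
    return random.sample(models, 2)

votes = []
model_a, model_b = new_battle(MODELS)
# ... both models synthesise the same sentence, the user listens blind ...
votes.append(Vote(model_a, model_b, outcome="a"))  # voter preferred clip A
```

Aggregating thousands of such records per model pair is all the input the rating computation needs.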

We currently host 13 models spanning both open-source and API-based systems, including Arabic F5-TTS, Arabic Spark TTS, Multilingual Chatterbox, Fish Speech (S1-mini & S2 Pro), Habibi TTS, KaniTTS Arabic, Lahgtna, MOSS-TTS, OuteTTS 1.0, SpeechT5 Arabic, and XTTS v2. Adding a new model requires nothing more than implementing a single Python class — the arena discovers it automatically.

The backend is fully open source. Individual contributors and companies who want to add their models to the arena can find everything they need — architecture, integration guide, and model template — in the GitHub repository. Each model runs in its own containerised environment, while a Gradio frontend on Hugging Face Spaces provides the user interface. Votes are stored persistently, and a daily cron job recomputes ratings and pushes the latest leaderboard.
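To give a feel for how small that integration surface is, here is a hypothetical sketch of such a class. The base-class name `ArenaTTSModel`, the `name` attribute, and the `synthesize` signature are invented for this illustration — the real template lives in the GitHub repository:

```python
from abc import ABC, abstractmethod

class ArenaTTSModel(ABC):
    """Hypothetical base class; the actual interface is in the arena repo."""

    name: str  # identifier shown on the leaderboard

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return synthesised audio (e.g. WAV bytes) for the given Arabic text."""

class MyArabicTTS(ArenaTTSModel):
    name = "my-arabic-tts"

    def synthesize(self, text: str) -> bytes:
        # Call your model or API here; a stub payload keeps the sketch runnable.
        return b"RIFF\x00\x00\x00\x00WAVE"
```

Because each model lives in its own container, the class only has to produce audio — serving, pairing, and vote storage are handled by the arena itself.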


From Votes to Ratings: The Math

The Elo system answers a precise question: given the outcomes of past matches, how strong is each competitor?

We use the Bradley-Terry model, the same maximum-likelihood framework behind LMArena (formerly Chatbot Arena). The probability that model $i$ beats model $j$ is:

$$P(i > j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}$$

where $\theta_i$ and $\theta_j$ are latent strength parameters. Ties are handled with the standard half-win approach: each tied match awards 0.5 wins to both sides.

Given all pairwise vote counts, we fit the strength parameters $\theta$ using the iterative minorisation-maximisation (MM) algorithm, which is guaranteed to converge. Raw strengths are then mapped to a human-readable scale centred at 1000:

$$\text{Rating} = 1000 + 400 \cdot \log_{10}\!\left(e^{\theta}\right)$$

A model with no votes starts at 1000. A model that wins twice as often as it loses against equal-strength opponents will sit around 1120. Confidence intervals are computed via 200 rounds of bootstrap resampling.
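The fitting and mapping steps can be sketched in a few lines of Python. This is a simplified illustration using the standard MM update for Bradley-Terry, not the arena's actual code: it omits the bootstrap, assumes every model has at least one win, and the `baseline` anchoring is our choice for the example:

```python
import math

def fit_bradley_terry(wins, n_iters=100, baseline=0):
    """Fit Bradley-Terry strengths p_i = exp(theta_i) with the MM algorithm.

    wins[i][j] = number of times model i beat model j; a tie is recorded as
    an extra 0.5 in both wins[i][j] and wins[j][i] (the half-win convention).
    Strengths are anchored so the baseline model has p = 1 (rating 1000).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iters):
        p = [
            sum(wins[i])  # total wins of model i
            / sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                  for j in range(n) if j != i)
            for i in range(n)
        ]
    return [x / p[baseline] for x in p]

def to_rating(p_i):
    """Map a strength onto the 1000-centred, Elo-style scale."""
    return 1000 + 400 * math.log10(p_i)

# Toy data: model 0 beat model 1 twice, model 1 beat model 0 once.
p = fit_bradley_terry([[0, 2], [1, 0]], baseline=1)  # anchor model 1 at p = 1
ratings = [to_rating(x) for x in p]   # model 0 ≈ 1120.4, model 1 = 1000.0
```

The MM update is guaranteed to increase the likelihood at every step, which is why no learning rate or gradient machinery is needed.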

A quick example. Suppose model $i$ has beaten model $j$ twice, while $j$ beat $i$ once. The empirical win probability for $i$ is $\frac{2}{3}$. Setting $\theta_j = 0$ as a baseline and solving:

$$\frac{2}{3} = \frac{e^{\theta_i}}{e^{\theta_i} + 1} \implies \theta_i = \ln(2) \approx 0.693$$

Converting to the rating scale: model $j$ gets a rating of 1000 and model $i$ gets about 1120. These simple operations are the same ones that power world chess championships and the most widely used LLM leaderboard.
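The same arithmetic, checked in a few lines of Python:

```python
import math

# Solve 2/3 = e^θ / (e^θ + 1): cross-multiplying gives e^θ = 2, so θ = ln 2
theta_i = math.log(2)

# Map both models onto the 1000-centred scale
rating_j = 1000                                         # baseline, θ_j = 0
rating_i = 1000 + 400 * math.log10(math.exp(theta_i))   # ≈ 1120.4
```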


Why This Matters for Arabic

The Elo approach has a subtle but important property: it makes no assumptions about what "good" means. Unlike traditional leaderboards built on fixed benchmarks and predefined metrics, arena-style evaluation lets quality emerge directly from human preferences. It simply asks people — native speakers, dialect experts, everyday users — which output sounds better, and derives the ratings from their answers. In this sense, it provides a more realistic and flexible signal than static benchmark scores.

If a model improves over time, its rating rises organically. If a once-praised model stagnates, it naturally falls behind.

For a language as diverse as Arabic, this flexibility is essential.


A Design Thesis for Better Arabic TTS

Building the arena forced us to listen to hundreds of Arabic TTS outputs across every model. That experience shaped a conviction: most Arabic TTS models are solving an incomplete problem.

We propose what we call the TTS Triangle — three dimensions that any complete text-to-speech system must address:

[Figure: the TTS Triangle — what, who, how]

  1. What is being said — the textual content.
  2. Who is saying it — the voice identity.
  3. How it is delivered — the performance style, emotion, and prosody.

Most Arabic TTS models today handle one, perhaps two, of these dimensions. They let you choose what to say and sometimes who says it, but rarely give meaningful control over how it is said. Let us look at the two dimensions where current Arabic models fall short — and what better alternatives look like.

The "Who": Voice Identity over Country-Level Dialect Labels

Arabic is spoken natively by over 500 million people across more than 20 countries. Within a single country — even within a single city — dialects can vary dramatically. Someone from Upper Egypt may not be easily understood by someone from the Nile Delta. A speaker from Casablanca sounds nothing like one from Damascus.

Most Arabic TTS models today reduce this extraordinary diversity to a handful of country-level dialect labels: "Egyptian," "Saudi," "Moroccan." This is a lossy abstraction. It assumes everyone in a country speaks the same way, which is demonstrably false.

The better approach is to shift from dialect labels to voice identities — specific reference speakers whose natural dialect, accent, and cadence are captured by the model. Instead of trying to define and enumerate every sub-dialect, you let the voice itself carry the linguistic identity. This is both more accurate and more scalable: a library of real voices from across the Arab world will always represent how people actually speak better than a flat list of country names ever could.

The "How": Natural Language Instructions over Emotion Tags

The few models that do attempt expressive control typically rely on emotion tags — inline markers like [laugh], [sigh], or [sad] embedded directly in the input text. While this gives users some control, it rarely delivers the natural, human-sounding quality people are after, for two reasons.

First, emotion in human speech is not a discrete event — a laugh spliced mid-sentence or a sigh tacked onto the end. It is a continuous colouring that permeates the entire utterance, shaping pitch, rhythm, and energy from the very first syllable. Tags cannot capture that. Second, to use them effectively a user needs an upstream language model to decide where to place each tag and which one to use — a task that is poorly defined and for which no good training data exists.

A far more natural interface is a plain-language instruction, the same way a director guides a voice actor: "Read this in a warm, reassuring tone" or "Deliver this as an excited sports commentator." This mirrors how professional voice-over artists actually work — they receive direction on mood and style, then bring it to life through their own voice and instincts.

Putting the Triangle Together

OpenAI's TTS API already separates these three dimensions cleanly: a voice parameter (who), an input field (what), and an instructions field (how). We believe this is the right architecture for Arabic TTS — and perhaps even more important for Arabic than for any other language, given the sheer breadth of its dialectal and expressive landscape. A model that lets you pick a specific voice from Fez or Riyadh, hand it your text, and tell it how to perform that text would be a step change from anything available today.

from pathlib import Path
from openai import OpenAI

client = OpenAI()
speech_file_path = Path(__file__).parent / "speech.mp3"

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # the who
    input="Today is a wonderful day to build something people love!",  # the what
    instructions="Speak in a cheerful and positive tone.",  # the how
) as response:
    response.stream_to_file(speech_file_path)

Summary

The Arabic TTS Arena is an open platform for ranking Arabic text-to-speech models using human preferences and Bradley-Terry ratings. It is live, accepting votes, and open to any model that wants to compete.

Through building it, we arrived at a thesis: Arabic TTS needs models that unify voice identity, textual content, and delivery style into a single, controllable system — moving beyond country-level dialect labels toward individual voice identities, and beyond brittle emotion tags toward natural language instructions.

The leaderboard will keep evolving as new models enter and the community votes. We hope it becomes a useful compass for the Arabic speech community — not by dictating what matters, but by letting speakers decide for themselves.


The Arabic TTS Arena is hosted on Hugging Face Spaces. If you build Arabic TTS models — whether as an open-source contributor or a company — we invite you to add yours via our GitHub repo. For collaborations, partnerships, or any questions, reach out to us at info@navid.sa.
