Metrics

Below are the key metrics used to evaluate ASR models in our leaderboard.

1. Word Error Rate (WER)

WER measures the accuracy of an ASR model by counting the word-level substitutions, deletions, and insertions in its output relative to the reference transcript, then dividing by the length of the reference:

$$\mathrm{WER} = \frac{S + D + I}{N}$$

Where:

S: Number of substitutions
D: Number of deletions
I: Number of insertions
N: Total number of words in the reference transcript
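
As an illustration of how S, D, I, and N can be obtained in practice, the sketch below aligns the two transcripts with a word-level Levenshtein (edit-distance) alignment and then applies WER = (S + D + I) / N. It assumes whitespace tokenization and no text normalization; the function names `edit_counts` and `wer` are hypothetical and not taken from the leaderboard code.

```python
def edit_counts(reference, hypothesis):
    """Return (S, D, I, N) from a minimum-cost word-level alignment."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()

    # Each cell holds (total_errors, S, D, I) for ref[:i] versus hyp[:j].
    # Row 0: an empty reference prefix can only be matched by insertions.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        # Column 0: an empty hypothesis prefix means i deletions.
        curr = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                best = prev[j - 1]  # words match, no new error
            else:
                c, s, d, ins = prev[j - 1]
                substitution = (c + 1, s + 1, d, ins)
                c, s, d, ins = prev[j]
                deletion = (c + 1, s, d + 1, ins)
                c, s, d, ins = curr[j - 1]
                insertion = (c + 1, s, d, ins + 1)
                best = min(substitution, deletion, insertion)
            curr.append(best)
        prev = curr

    _, s, d, ins = prev[-1]
    return s, d, ins, len(ref)


def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / N."""
    s, d, ins, n = edit_counts(reference, hypothesis)
    if n == 0:
        raise ValueError("reference transcript is empty")
    return (s + d + ins) / n
```

For the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", the alignment has one substitution and one deletion, so WER = 2/6. Because insertions are counted, WER can exceed 1 when the hypothesis is much longer than the reference.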

2. Fairness Assessment Model

To assess fairness across demographic groups, we apply a mixed-effects Poisson regression model to estimate WER disparities:

$$\log(\mathrm{WER}_i) = \beta_0 + \beta_1 X_i + \beta_2 Z_i + u_i$$

Where:

WER_i: Observed word error rate for individual i
β₀: Intercept term, representing the baseline (log) WER for a reference group with no demographic or covariate influence
X_i: Demographic attributes (e.g., gender, race, accent) for individual i
β₁: Coefficient that quantifies the impact of demographic factors on WER
Z_i: Covariates (e.g., audio length, noise level) affecting WER but unrelated to fairness
β₂: Coefficient capturing the effect of covariates on WER
u_i: Random effect accounting for individual-level variability (e.g., speaker-specific variation)
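
To make the role of each term concrete, here is a minimal sketch of how a fitted model of this kind turns coefficients into an expected WER. It assumes the log link that Poisson regression normally uses (consistent with β₀ being described as a baseline log WER), so the expected WER is the exponential of the linear predictor. The function name and all numeric values are hypothetical; this does not fit the model, it only evaluates it.

```python
import math


def expected_wer(beta0, beta1, x, beta2, z, u=0.0):
    """Expected WER under a log link: exp(b0 + b1.x + b2.z + u).

    beta1/x are the demographic coefficients and indicators,
    beta2/z are the covariate coefficients and values,
    u is the speaker-level random effect (0.0 for an average speaker).
    """
    if len(beta1) != len(x) or len(beta2) != len(z):
        raise ValueError("coefficient and predictor lengths must match")
    linear = beta0
    linear += sum(b * v for b, v in zip(beta1, x))
    linear += sum(b * v for b, v in zip(beta2, z))
    linear += u
    return math.exp(linear)


# Hypothetical values: baseline WER of 0.10 and a demographic
# coefficient of log(1.5), i.e. a 1.5x multiplicative effect on WER.
baseline = expected_wer(math.log(0.10), [math.log(1.5)], [0], [0.0], [3.2])
group = expected_wer(math.log(0.10), [math.log(1.5)], [1], [0.0], [3.2])
```

Under this assumption each coefficient acts multiplicatively: exp(β₁) is the ratio between the expected WER of a group and that of the reference group, with the covariates held fixed.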

3. Group-level Fairness Score Calculation

To compute group-level fairness scores, we calculate the predicted WER for each demographic group g from the fitted model. The prediction uses the following terms:

β_g: Group-specific effect for demographic group g
β_logRef: Coefficient based on log-transformed reference group
X: Predictor variable capturing group identity or covariates

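These predicted WER values are then converted to raw fairness scores on a 0–100 scale by min-max normalization over the groups:

$$\text{Raw fairness score}_g = 100 \times \left(1 - \frac{\mathrm{WER}_g - \min_{g'} \mathrm{WER}_{g'}}{\max_{g'} \mathrm{WER}_{g'} - \min_{g'} \mathrm{WER}_{g'}}\right)$$

This normalization ensures that the group with the lowest predicted WER receives the highest fairness score: 100 represents the most fair (lowest WER) and 0 the least fair (highest WER).

To compute a category-level score (e.g., for a particular demographic attribute such as gender), we take a weighted average across all groups in that category:

$$\text{Category score} = \sum_{g} p_g \times \text{Raw fairness score}_g$$

Where: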

p_g: Proportion of data represented by group g

The weights ensure that each group contributes to the category score in proportion to its share of the data.
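
The two steps of this section can be sketched as follows. The sketch assumes that the 0–100 conversion is a min-max normalization over the groups of one category (lowest predicted WER maps to 100, highest to 0) and that the category score is the p_g-weighted average of the group scores. Returning 100 for every group when all predicted WERs are equal is also an assumption, and the function names and example numbers are hypothetical.

```python
def raw_fairness_scores(predicted_wer):
    """Map each group's predicted WER to a 0-100 score (min-max, inverted)."""
    lowest = min(predicted_wer.values())
    highest = max(predicted_wer.values())
    if highest == lowest:
        # Assumption: no disparity at all means every group is "most fair".
        return {group: 100.0 for group in predicted_wer}
    spread = highest - lowest
    return {
        group: 100.0 * (1.0 - (value - lowest) / spread)
        for group, value in predicted_wer.items()
    }


def category_score(scores, proportions):
    """Weighted average of group scores, weighted by data proportion p_g."""
    total = sum(proportions[group] for group in scores)
    if total <= 0:
        raise ValueError("proportions must sum to a positive value")
    weighted = sum(proportions[group] * scores[group] for group in scores)
    return weighted / total


# Hypothetical predicted WERs and data proportions for one category.
example_wer = {"group_a": 0.10, "group_b": 0.15, "group_c": 0.20}
example_p = {"group_a": 0.5, "group_b": 0.3, "group_c": 0.2}
example_scores = raw_fairness_scores(example_wer)
example_category = category_score(example_scores, example_p)
```

With these example values the group scores are 100, about 50, and 0, and the category score is about 65. Dividing by the sum of the proportions keeps the result on the 0–100 scale even if the proportions passed in do not sum exactly to 1.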

4. Statistical Adjustment and FAAS

We assess whether demographic disparities are statistically significant using a Likelihood Ratio Test (LRT) that compares a full model with a reduced model:

$$\mathrm{LRT} = 2\,(L_{\text{full}} - L_{\text{reduced}})$$

Where:

L_full: Log-likelihood of model including demographic attributes
L_reduced: Log-likelihood of model excluding demographic attributes
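
A minimal sketch of the test, given the two log-likelihoods: the statistic is 2 × (L_full − L_reduced), and the p-value is taken from a chi-square distribution whose degrees of freedom equal the number of demographic parameters dropped in the reduced model. That chi-square reference distribution is the usual asymptotic approximation and is an assumption here, as are the function names and the example log-likelihood values. The survival function is written out with the standard library so that the example has no dependencies.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-square with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    # Closed forms for df = 1 and df = 2, then the recurrence
    # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2.0)), 1
    else:
        q, k = math.exp(-x / 2.0), 2
    while k < df:
        q += (x / 2.0) ** (k / 2.0) * math.exp(-x / 2.0) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def likelihood_ratio_test(loglik_full, loglik_reduced, df):
    """Return (statistic, p_value) for nested models."""
    # The full model cannot fit worse than the reduced one; clamp tiny
    # negative values caused by numerical noise in the optimizer.
    statistic = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    return statistic, chi2_sf(statistic, df)


# Hypothetical log-likelihoods; the full model adds 2 demographic parameters.
stat, p_value = likelihood_ratio_test(-1200.0, -1203.0, df=2)
```

With these example numbers the statistic is 6.0 and the p-value is exp(−3), roughly 0.0498, which falls just below a 0.05 threshold.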

If the resulting p-value is below a significance threshold (e.g., p < 0.05), the demographic disparities are statistically significant. In such cases, a penalty is applied to the fairness score, and lower p-values imply a higher penalty. If p ≥ 0.05, no adjustment is made.

Finally, the overall fairness score is computed as a weighted average of the adjusted category scores across all demographic categories c:

$$\text{Overall fairness score} = \frac{\sum_{c} w_c \times \text{Adjusted score}_c}{\sum_{c} w_c}$$

Where:

w_c: Importance weight for category c

5. Fairness Adjusted ASR Score (FAAS)

The Fairness Adjusted ASR Score (FAAS) is a single metric that balances accuracy and fairness in evaluating ASR systems. It combines the Word Error Rate (WER) with the Overall Fairness Score to ensure that models are not only accurate but also equitable in their performance across demographic groups. A higher FAAS indicates a model that performs well in terms of both accuracy (lower WER) and fairness (higher Overall Fairness Score). This framework encourages the development of ASR models that are both high-performing and inclusive across all user populations.

6. Real-Time Factor (RTFx)

RTFx evaluates how efficiently an ASR model processes speech by comparing the time it takes to transcribe a recording with the duration of that recording:

$$\mathrm{RTFx} = \frac{T_{\text{process}}}{T_{\text{audio}}}$$

Where:

T_process: Time taken by the ASR model to process and transcribe the audio, in seconds
T_audio: Total duration of the audio input, in seconds

A lower RTFx value indicates better efficiency: the model needs less processing time per second of audio, and a value below 1 means it transcribes faster than real time.
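
As a rough sketch of how this ratio could be measured, the function below times a single transcription call with a wall-clock timer and divides by the audio duration, following the definition used in this section (T_process over T_audio). The `transcribe` callable, the function name, and the stand-in model are hypothetical; a real measurement would normally average over many files and exclude model loading time.

```python
import time


def real_time_factor(transcribe, audio, audio_seconds):
    """Return T_process / T_audio for one call to `transcribe(audio)`."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    start = time.perf_counter()
    transcribe(audio)  # hypothetical ASR call; its output is ignored here
    t_process = time.perf_counter() - start
    return t_process / audio_seconds


def fake_transcribe(audio):
    """Stand-in for a model: pretends to work for about 50 ms."""
    time.sleep(0.05)
    return "placeholder transcript"


# A 10-second clip processed in roughly 0.05 s gives a ratio near 0.005.
rtf = real_time_factor(fake_transcribe, b"\x00" * 16000, audio_seconds=10.0)
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring elapsed time.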