import React from 'react';
import styled from 'styled-components';
import 'katex/dist/katex.min.css';
import { InlineMath, BlockMath } from 'react-katex';
import { media } from '../Utils/helper.js';

function Metrics() {
  return (

Metrics

Below are the key metrics used to evaluate ASR models in our leaderboard.
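1. Word Error Rate (WER)

WER measures the accuracy of an ASR model as the number of word substitutions, deletions, and insertions in its output relative to the reference transcript, divided by the number of words in the reference. Lower is better.

\text{WER} = \frac{S + D + I}{N}

Where:
S = number of substituted words
D = number of deleted words
I = number of inserted words
N = total number of words in the reference transcript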
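
WER is conventionally computed as the word-level edit distance between the reference and the model output, divided by the number of reference words. The sketch below illustrates that calculation; `wordErrorRate` is a hypothetical helper name, and it assumes plain whitespace tokenization with no text normalization, which a real evaluation pipeline would likely add.

```javascript
// Minimal WER sketch: word-level Levenshtein distance / reference length.
// Assumption: tokens are split on whitespace; casing and punctuation are left as-is.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new Error('reference transcript must contain at least one word');
  }

  // prev[j] holds the edit distance between the reference words processed
  // so far and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i]; // i deletions turn i reference words into an empty hypothesis
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr[j] = Math.min(substitution, deletion, insertion);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

For the reference "the cat sat on the mat" and the output "the cat sit on mat", there is one substitution and one deletion over six reference words, so the function returns 2 / 6.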
2. Fairness Assessment Model

To assess fairness across demographic groups, we apply a mixed-effects Poisson regression model to estimate WER disparities:

Where:
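
The model specification is not reproduced above, so the following is only a generic sketch of how a Poisson model with a log link yields a per-group error rate. It assumes the error count is modeled with the number of reference words as an exposure offset, a single fixed effect per group, and random effects set to zero. The coefficient names and function names are hypothetical.

```javascript
// Fixed-effects prediction from a Poisson regression with a log link.
// Assumption: log(E[errors]) = intercept + groupEffect + log(referenceWords),
// with any random effects (e.g., per speaker) set to zero.
function predictedErrorCount(intercept, groupEffect, referenceWords) {
  return referenceWords * Math.exp(intercept + groupEffect);
}

// Dividing the expected error count by the exposure gives the expected WER.
function predictedWer(intercept, groupEffect) {
  return Math.exp(intercept + groupEffect);
}
```

Under these assumptions, a group effect of log(2) relative to the baseline group doubles the predicted WER.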
3. Group-level Fairness Score Calculation

To compute group-level fairness scores, we calculate the predicted WER for each demographic group g:

Where:

These predicted WER values are then converted to raw fairness scores (on a 0–100 scale):

This normalization ensures that groups with the lowest WER receive the highest fairness score. On this scale, 100 represents the most fair (lowest WER), and 0 represents the least fair (highest WER).

To compute a category-level score (e.g., for a particular demographic attribute like gender), we take a weighted average across all groups in that category:

Where:

The weights ensure that the fairness score reflects the real-world distribution of groups.
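
The two steps described in this section can be sketched as follows. The exact normalization formula is not shown above; min-max scaling is assumed here because it matches the stated endpoints (lowest WER maps to 100, highest to 0). The function names, the tie-handling rule, and the example numbers are all hypothetical.

```javascript
// Convert predicted WER per group into raw fairness scores on a 0-100 scale.
// Assumption: min-max normalization, inverted so that lower WER scores higher.
function rawFairnessScores(predictedWer) {
  const values = Object.values(predictedWer);
  const min = Math.min(...values);
  const max = Math.max(...values);
  const scores = {};
  for (const [group, wer] of Object.entries(predictedWer)) {
    // Assumption: if every group has the same WER, all groups score 100.
    scores[group] = max === min ? 100 : 100 * (1 - (wer - min) / (max - min));
  }
  return scores;
}

// Weighted average of group scores within one demographic category.
// Groups missing from `weights` contribute nothing.
function categoryScore(scores, weights) {
  let total = 0;
  let weightSum = 0;
  for (const [group, score] of Object.entries(scores)) {
    const w = weights[group] ?? 0;
    total += w * score;
    weightSum += w;
  }
  if (weightSum === 0) {
    throw new Error('weights must sum to a positive value');
  }
  return total / weightSum;
}

// Made-up example: predicted WER in percent and population shares per group.
const exampleScores = rawFairnessScores({ groupA: 10, groupB: 20, groupC: 15 });
const exampleCategory = categoryScore(exampleScores, { groupA: 0.5, groupB: 0.4, groupC: 0.1 });
```

In this made-up example, groupA scores 100, groupB scores 0, groupC scores 50, and the weighted category score comes out at roughly 55.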
4. Statistical Adjustment and FAAS

We assess whether demographic disparities are statistically significant using a Likelihood Ratio Test (LRT) that compares two models:

Where:

If the resulting p-value is below a significance threshold (e.g., p < 0.05), it indicates that demographic disparities are statistically significant. In such cases, a penalty is applied:

Lower p-values imply a higher penalty on the fairness score. If p ≥ 0.05, no adjustment is made.

Finally, the overall fairness score is computed across all demographic categories c:

Where:
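
The adjustment logic can be sketched as below. The LRT statistic is the standard twice-the-log-likelihood-difference form. The penalty and aggregation formulas are not shown above, so two assumptions are made purely for illustration: the penalty scales a category score by p / 0.05 (so that smaller p-values shrink the score more), and the overall score is the unweighted mean of the adjusted category scores. All function names are hypothetical.

```javascript
const ALPHA = 0.05; // significance threshold taken from the text

// Standard likelihood ratio statistic for two nested models.
// The p-value comes from a chi-square distribution (not computed here).
function lrtStatistic(logLikFull, logLikReduced) {
  return 2 * (logLikFull - logLikReduced);
}

// Assumed penalty: multiply the score by pValue / alpha when p < alpha.
function adjustCategoryScore(score, pValue, alpha = ALPHA) {
  if (pValue >= alpha) {
    return score; // disparity not significant: no adjustment
  }
  return score * (pValue / alpha);
}

// Assumed aggregation: unweighted mean over categories.
// `categories` is an array of { score, pValue } objects.
function overallFairnessScore(categories) {
  if (categories.length === 0) {
    throw new Error('at least one category is required');
  }
  const adjusted = categories.map((c) => adjustCategoryScore(c.score, c.pValue));
  return adjusted.reduce((sum, s) => sum + s, 0) / adjusted.length;
}
```

Under these assumptions, a category scoring 80 with p = 0.025 is halved to 40, while a category with p = 0.2 keeps its score.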
5. Fairness Adjusted ASR Score (FAAS)

The Fairness Adjusted ASR Score (FAAS) is a single metric that balances accuracy and fairness when evaluating ASR systems. It combines the Word Error Rate (WER) with the Overall Fairness Score, so that a model is rewarded only if it is both accurate and equitable across demographic groups.

A higher FAAS indicates a model that performs well on both accuracy (lower WER) and fairness (higher Overall Fairness Score). This framework encourages the development of ASR models that are both high-performing and inclusive across all user populations.
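
The FAAS formula itself is not reproduced above. As an illustration only, the sketch below assumes a log-ratio form, 10 · log10(Overall Fairness Score / WER), which has the properties described: it rises with the fairness score and falls as WER grows. The function name and the choice of formula are assumptions, not a statement of how the leaderboard computes FAAS.

```javascript
// Illustrative FAAS sketch under an assumed log-ratio formula.
// Both inputs must be positive for the logarithm to be defined.
function fairnessAdjustedAsrScore(overallFairnessScore, wer) {
  if (!(overallFairnessScore > 0) || !(wer > 0)) {
    throw new RangeError('overallFairnessScore and wer must both be positive');
  }
  return 10 * Math.log10(overallFairnessScore / wer);
}
```

With this form, halving WER at a fixed fairness score raises the result by about 3 points, and so does doubling the fairness score at a fixed WER.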
6. Real-Time Factor (RTFx)

RTFx measures how efficiently an ASR model processes speech. It compares the time the model takes to transcribe an audio clip with the actual duration of that clip.

Where:

A lower RTFx value indicates better efficiency, because the model needs less processing time relative to the audio's duration.
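
Since the formula is not shown above, the sketch below assumes RTFx is the ratio of processing time to audio duration, which is consistent with the statement that lower values are better. `realTimeFactor` and `measureRtfx` are hypothetical helpers, and `transcribe` stands in for whatever function runs the model.

```javascript
// Assumed definition: processing time divided by audio duration (both in seconds).
function realTimeFactor(processingSeconds, audioSeconds) {
  if (!(audioSeconds > 0)) {
    throw new RangeError('audioSeconds must be positive');
  }
  if (processingSeconds < 0) {
    throw new RangeError('processingSeconds must not be negative');
  }
  return processingSeconds / audioSeconds;
}

// Times a caller-supplied async `transcribe` function using wall-clock time.
async function measureRtfx(transcribe, audioSeconds) {
  const start = Date.now();
  await transcribe();
  const elapsedSeconds = (Date.now() - start) / 1000;
  return realTimeFactor(elapsedSeconds, audioSeconds);
}
```

Under this definition, a model that needs 5 seconds to transcribe a 10-second clip has a ratio of 0.5, and any value below 1 means the model runs faster than real time.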
  );
}

export default Metrics;

const Container = styled.div`
  display: flex;
  flex-direction: column;
  align-items: center;
  padding: 2rem;
  max-width: 900px;
  margin: auto;

  @media ${media.mobile} {
    padding: 1rem;
  }

  @media (max-width: 370px) {
    padding: 0.2rem;
  }
`;

const Section = styled.div`
  width: 100%;
  padding: 1rem 0;
  border-bottom: 1px solid #ddd;
`;

const Head = styled.h2`
  font-size: 1.8rem;
  color: #3b82f6;
  margin-bottom: 0.5rem;
`;

const Para = styled.p`
  color: #4b5563;
  font-size: 1.2rem;
  line-height: 1.6;
`;

const FormulaContainer = styled.div`
  background: #f3f4f6;
  padding: 1rem;
  border-radius: 8px;
  margin: 1rem 0;
  display: flex;
  justify-content: center;
  overflow-x: auto;
`;