import React from 'react';
import styled from 'styled-components';
import 'katex/dist/katex.min.css';
import { InlineMath, BlockMath } from 'react-katex';
import { media } from '../Utils/helper.js';
function Metrics() {
return (
<Container>
<h1>Metrics</h1>
<Para>
Below are the key metrics used to evaluate ASR models in our leaderboard.
</Para>
<Section>
<Head>1. Word Error Rate (WER)</Head>
<Para>
WER measures transcription accuracy by counting the word-level substitutions,
deletions, and insertions needed to turn the model's output into the reference
transcript, normalized by the length of the reference.
</Para>
<FormulaContainer>
<BlockMath math="\text{WER} = \frac{S + D + I}{N}" />
</FormulaContainer>
<Para>Where:</Para>
<ul>
<li><b>S:</b> Number of substitutions</li>
<li><b>D:</b> Number of deletions</li>
<li><b>I:</b> Number of insertions</li>
<li><b>N:</b> Total number of words in the reference transcript</li>
</ul>
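<Para>
As an illustrative sketch (not the leaderboard implementation), WER can be computed with a word-level edit distance:
</Para>
<FormulaContainer>
<pre>{`// Illustrative sketch: word-level Levenshtein distance over tokenized transcripts.
function wer(reference, hypothesis) {
  const ref = reference.split(/\\s+/);
  const hyp = hypothesis.split(/\\s+/);
  // dp[i][j] = minimum edits between the first i reference words and first j hypothesis words
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = dp[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      dp[i][j] = Math.min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1);
    }
  }
  return dp[ref.length][hyp.length] / ref.length; // (S + D + I) / N
}`}</pre>
</FormulaContainer>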
</Section>
<Section>
<Head>2. Fairness Assessment Model</Head>
<Para>
To assess fairness across demographic groups, we apply a mixed-effects Poisson
regression model to estimate WER disparities:
</Para>
<FormulaContainer>
<BlockMath math="\log(\text{WER}_i) = \beta_0 + \beta_1 X_i + \beta_2 Z_i + u_i" />
</FormulaContainer>
<Para>Where:</Para>
<ul>
<li><b>WER<sub>i</sub>:</b> Observed word error rate for individual i</li>
<li><b>β<sub>0</sub>:</b> Intercept term, representing the baseline (log) WER for a reference group with no demographic or covariate influence</li>
<li><b>X<sub>i</sub>:</b> Demographic attributes (e.g., gender, race, accent) for individual i</li>
<li><b>β<sub>1</sub>:</b> Coefficient that quantifies the impact of demographic factors on WER</li>
<li><b>Z<sub>i</sub>:</b> Covariates (e.g., audio length, noise level) affecting WER but unrelated to fairness</li>
<li><b>β<sub>2</sub>:</b> Coefficient capturing the effect of covariates on WER</li>
<li><b>u<sub>i</sub>:</b> Random effect accounting for individual-level variability (e.g., speaker-specific variation)</li>
</ul>
</Section>
<Section>
<Head>3. Group-level Fairness Score Calculation</Head>
<Para>
To compute group-level fairness scores, we calculate the predicted WER for each demographic group g:
</Para>
<FormulaContainer>
<BlockMath math="\text{WER}_g = \frac{e^{(\beta_0 + \beta_g + \beta_{\log\text{Ref}} \cdot X)}}{e^X - 1}" />
</FormulaContainer>
<Para>Where:</Para>
<ul>
<li><b>β<sub>g</sub>:</b> Group-specific effect for demographic group g</li>
<li><b>β<sub>logRef</sub>:</b> Coefficient based on log-transformed reference group</li>
<li><b>X:</b> Predictor variable capturing group identity or covariates</li>
</ul>
<Para>
These predicted WER values are then converted to <b>raw fairness scores</b> (on a 0–100 scale):
</Para>
<FormulaContainer>
<BlockMath math="\text{Raw fairness score}_g = 100 \times \left(1 - \frac{\text{WER}_g - \min(\text{WER})}{\max(\text{WER}) - \min(\text{WER})}\right)" />
</FormulaContainer>
<Para>
This normalization ensures that groups with the <b>lowest WER</b> receive the <b>highest fairness score</b>.
On this scale, 100 represents the most fair (lowest WER), and 0 represents the least fair (highest WER).
</Para>
<Para>
To compute a category-level score (e.g., for a particular demographic attribute like gender),
we take a weighted average across all groups in that category:
</Para>
<FormulaContainer>
<BlockMath math="\text{Category score} = \sum_g p_g \times \text{Raw fairness score}_g" />
</FormulaContainer>
<Para>Where:</Para>
<ul>
<li><b>p<sub>g</sub>:</b> Proportion of data represented by group g</li>
</ul>
<Para>The weights ensure that the fairness score reflects the real-world distribution of groups.</Para>
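<Para>
The normalization and weighting steps above can be sketched as follows (group WERs and proportions are illustrative inputs, not leaderboard data):
</Para>
<FormulaContainer>
<pre>{`// Illustrative sketch: min-max normalize group WERs, then weight by group proportion.
function rawFairnessScores(werByGroup) {
  const values = Object.values(werByGroup);
  const min = Math.min(...values);
  const max = Math.max(...values);
  const scores = {};
  for (const [group, wer] of Object.entries(werByGroup)) {
    scores[group] = 100 * (1 - (wer - min) / (max - min));
  }
  return scores;
}

function categoryScore(scores, proportions) {
  // Weighted average over groups; proportions sum to 1.
  return Object.keys(scores).reduce(
    (sum, g) => sum + proportions[g] * scores[g], 0
  );
}`}</pre>
</FormulaContainer>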
</Section>
<Section>
<Head>4. Statistical Adjustment and Overall Score</Head>
<Para>
We assess whether demographic disparities are <b>statistically significant</b> using a
<b> Likelihood Ratio Test (LRT)</b> that compares two models:
</Para>
<FormulaContainer>
<BlockMath math="\text{LRT} = 2 \times (\log L_{\text{full}} - \log L_{\text{reduced}})" />
</FormulaContainer>
<Para>Where:</Para>
<ul>
<li><b>L<sub>full</sub>:</b> Likelihood of the model including demographic attributes</li>
<li><b>L<sub>reduced</sub>:</b> Likelihood of the model excluding demographic attributes</li>
</ul>
<Para>
If the resulting <b>p-value</b> is below a significance threshold (e.g., p &lt; 0.05),
it indicates that demographic disparities are <b>statistically significant</b>.
In such cases, a penalty is applied:
</Para>
<FormulaContainer>
<BlockMath math="\text{Adjusted score} = \text{Category score} \times \left(\frac{p}{0.05}\right)" />
</FormulaContainer>
<Para>
Lower p-values imply a higher penalty on the fairness score.
If p ≥ 0.05, no adjustment is made.
</Para>
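<Para>
This adjustment can be sketched as a simple scaling rule (a simplified illustration, not the exact pipeline code):
</Para>
<FormulaContainer>
<pre>{`// Illustrative sketch: scale the category score down when the LRT p-value is significant.
function adjustedScore(categoryScore, pValue, alpha = 0.05) {
  return pValue < alpha ? categoryScore * (pValue / alpha) : categoryScore;
}`}</pre>
</FormulaContainer>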
<Para>
Finally, the <b>overall fairness score</b> is computed across all demographic categories c:
</Para>
<FormulaContainer>
<BlockMath math="\text{Overall score} = \frac{\sum_c w_c \times \text{Adjusted score}_c}{\sum_c w_c}" />
</FormulaContainer>
<Para>Where:</Para>
<ul>
<li><b>w<sub>c</sub>:</b> Importance weight for category c</li>
</ul>
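<Para>
The weighted average above can be sketched as (category weights are illustrative):
</Para>
<FormulaContainer>
<pre>{`// Illustrative sketch: weighted mean of adjusted category scores.
function overallScore(adjustedByCategory, weights) {
  let num = 0, den = 0;
  for (const c of Object.keys(adjustedByCategory)) {
    num += weights[c] * adjustedByCategory[c];
    den += weights[c];
  }
  return num / den;
}`}</pre>
</FormulaContainer>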
</Section>
<Section>
<Head>5. Fairness Adjusted ASR Score (FAAS)</Head>
<Para>
The Fairness Adjusted ASR Score (FAAS) is a comprehensive metric that balances both accuracy and fairness
in evaluating ASR systems. It incorporates the <b>Word Error Rate (WER)</b> and the <b>Overall Fairness Score</b>
to ensure that models are not only accurate but also equitable in their performance across demographic groups.
</Para>
<FormulaContainer>
<BlockMath math="\text{FAAS} = 10 \times \log_{10}\left(\frac{\text{Overall\_Fairness\_Score}}{\text{WER}}\right)" />
</FormulaContainer>
<Para>
A higher FAAS score indicates a model that performs well both in terms of accuracy (lower WER) and fairness
(higher Overall Fairness Score). This framework encourages the development of ASR models that are both
high-performing and inclusive across all user populations.
</Para>
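<Para>
As a worked example with illustrative numbers: a model with an overall fairness score of 80 and a WER of 0.08 gets FAAS = 10 × log₁₀(80 / 0.08) = 10 × log₁₀(1000) = 30.
</Para>
<FormulaContainer>
<pre>{`// Illustrative sketch of the FAAS formula.
const faas = (fairnessScore, wer) => 10 * Math.log10(fairnessScore / wer);
// e.g., faas(80, 0.08) ≈ 30`}</pre>
</FormulaContainer>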
</Section>
<Section>
<Head>6. Real-Time Factor (RTFx)</Head>
<Para>
RTFx (inverse real-time factor) measures the efficiency of an ASR model by comparing
the duration of the input audio to the time the model takes to transcribe it, i.e.,
how many seconds of audio the model can process per second of compute.
</Para>
<FormulaContainer>
<BlockMath math="\text{RTFx} = \frac{T_{\text{audio}}}{T_{\text{process}}}" />
</FormulaContainer>
<Para>Where:</Para>
<ul>
<li><b>T<sub>audio</sub>:</b> Total duration of the audio input in seconds</li>
<li><b>T<sub>process</sub>:</b> Time taken by the ASR model to process and transcribe the audio</li>
</ul>
<Para>
A higher RTFx value indicates better efficiency: for example, an RTFx of 30 means the model transcribes 30 seconds of audio per second of processing time.
</Para>
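<Para>
The calculation itself is a simple ratio (timings below are illustrative):
</Para>
<FormulaContainer>
<pre>{`// Illustrative sketch: 60 s of audio transcribed in 2 s gives RTFx = 30.
const rtfx = (audioSeconds, processSeconds) => audioSeconds / processSeconds;
// rtfx(60, 2) === 30`}</pre>
</FormulaContainer>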
</Section>
</Container>
);
}
export default Metrics;
const Container = styled.div`
display: flex;
flex-direction: column;
align-items: center;
padding: 2rem;
max-width: 900px;
margin: auto;
@media ${media.mobile} {
padding: 1rem;
}
@media (max-width: 370px) {
padding: 0.2rem;
}
`;
const Section = styled.div`
width: 100%;
padding: 1rem 0;
border-bottom: 1px solid #ddd;
`;
const Head = styled.h2`
font-size: 1.8rem;
color: #3b82f6;
margin-bottom: 0.5rem;
`;
const Para = styled.p`
color: #4b5563;
font-size: 1.2rem;
line-height: 1.6;
`;
const FormulaContainer = styled.div`
background: #f3f4f6;
padding: 1rem;
border-radius: 8px;
margin: 1rem 0;
display: flex;
justify-content: center;
overflow-x: auto;
`;