from pathlib import Path
# Directory where model requests are stored
DIR_OUTPUT_REQUESTS = Path("requested_models")
EVAL_REQUESTS_PATH = Path("eval_requests")
##########################
# Text definitions #
##########################
banner_url = "https://huggingface.co/datasets/reach-vb/random-images/resolve/main/asr_leaderboard.png"
BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="{banner_url}" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 600px;"> </div>'
TITLE = "<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard </h1> </body> </html>"
INTRODUCTION_TEXT = "The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
on the Hugging Face Hub. \
\nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower the better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher the better). Models are ranked based on their Average WER, from lowest to highest. Check the Metrics tab to understand how the models are evaluated. \
\nIf you want results for a model that is not listed here, you can submit a request for it to be included ✍️✨. \
\nThe leaderboard includes both English ASR evaluation and multilingual benchmarks across the top European languages."
CITATION_TEXT = """@misc{open-asr-leaderboard,
title = {Open Automatic Speech Recognition Leaderboard},
author = {Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and others},
year = 2023,
publisher = {Hugging Face},
howpublished = "\\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}"
}
"""
METRICS_TAB_TEXT = """
Here you will find details about the speech recognition metrics and datasets reported in our leaderboard.
## Metrics
Models are evaluated jointly using the Word Error Rate (WER) and Inverse Real Time Factor (RTFx) metrics. The WER metric
is used to assess the accuracy of a system, and the RTFx the inference speed. Models are ranked in the leaderboard based
on their WER, lowest to highest.
Crucially, the WER and RTFx values are computed for the same inference run using a single script. The implication of this is two-fold:
1. The WER and RTFx values are coupled: for a given WER, one can expect to achieve the corresponding RTFx. This allows the proposer to trade off lower WER for higher RTFx should they wish.
2. The WER and RTFx values are averaged over all audios in the benchmark (on the order of thousands of audios).
For details on reproducing the benchmark numbers, refer to the [Open ASR GitHub repository](https://github.com/huggingface/open_asr_leaderboard#evaluate-a-model).
### Word Error Rate (WER)
Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It calculates the percentage
of words in the system's output that differ from the reference (correct) transcript. **A lower WER value indicates higher accuracy**.
Take the following example:
| Reference:  | the | cat | sat     | on  | the | mat |
|-------------|-----|-----|---------|-----|-----|-----|
| Prediction: | the | cat | **sit** | on  | the |     |
| Label:      | ✅  | ✅  | S       | ✅  | ✅  | D   |
Here, we have:
* 1 substitution ("sit" instead of "sat")
* 0 insertions
* 1 deletion ("mat" is missing)
This gives 2 errors in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our
reference (N), which for this example is 6:
```
WER = (S + I + D) / N = (1 + 0 + 1) / 6 = 0.333
```
This gives a WER of 0.33, or 33%. For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints, meaning punctuation and casing are removed from the references and predictions. You can find the evaluation code on our [GitHub repository](https://github.com/huggingface/open_asr_leaderboard). To read more about how the WER is computed, refer to the [Audio Transformers Course](https://huggingface.co/learn/audio-course/chapter5/evaluation).
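As a rough illustration of the formula above, the sketch below computes WER as a word-level edit distance divided by the reference length. It is not the leaderboard's evaluation code: the function name `word_error_rate` is hypothetical, and the inputs are assumed to be already normalised (lower-cased, punctuation removed) and split on whitespace.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    # Split both transcripts into words (assumes text is already normalised).
    ref_words = reference.split()
    hyp_words = prediction.split()
    if not ref_words:
        raise ValueError('reference must contain at least one word')

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j predicted words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from prediction
            insertion = current[j - 1] + 1  # extra word in prediction
            current.append(min(substitution, deletion, insertion))
        previous = current

    # Minimum number of S + I + D errors, divided by N reference words.
    return previous[-1] / len(ref_words)


wer = word_error_rate('the cat sat on the mat', 'the cat sit on the')
print(round(wer, 3))  # 0.333
```

The dynamic-programming table finds the smallest total of substitutions, insertions and deletions, which for the example is one substitution plus one deletion.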
### Inverse Real Time Factor (RTFx)
Inverse Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a
model to process a given amount of speech. It is defined as:
```
RTFx = (number of seconds of audio inferred) / (compute time in seconds)
```
Therefore, an RTFx of 1 means a system processes speech as fast as it's spoken, while an RTFx of 2 means it takes half the time.
Thus, **a higher RTFx value indicates lower latency**.
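A minimal sketch of the definition above, assuming the audio duration and the compute time have already been measured in seconds. The function name `rtfx` and the example numbers are hypothetical and only illustrate the ratio.

```python
def rtfx(audio_seconds: float, compute_seconds: float) -> float:
    # Seconds of audio transcribed per second of compute time.
    if compute_seconds <= 0:
        raise ValueError('compute_seconds must be positive')
    return audio_seconds / compute_seconds


# Hypothetical run: 60 seconds of audio transcribed in 30 seconds of compute.
print(rtfx(60.0, 30.0))  # 2.0
```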
## How to reproduce our results
The ASR Leaderboard will be a continued effort to benchmark open source/access speech recognition models where possible.
Along with the Leaderboard we're open-sourcing the codebase used for running these evaluations.
For more details head over to our repo at: https://github.com/huggingface/open_asr_leaderboard
P.S. We'd love to know which other models you'd like us to benchmark next. Contributions are more than welcome! ♥️
## Benchmark datasets
Evaluating Speech Recognition systems is a hard problem. We use the multi-dataset benchmarking strategy proposed in the
[ESB paper](https://arxiv.org/abs/2210.13352) to obtain robust evaluation scores for each model.
ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad
set of speech datasets. It comprises eight English speech recognition datasets, capturing a broad range of domains,
acoustic conditions, speaker styles, and transcription requirements. As such, it gives a better indication of how
a model is likely to perform on downstream ASR compared to evaluating it on one dataset alone.
The ESB score is calculated as a macro-average of the WER scores across the ESB datasets. The models in the leaderboard
are ranked based on their average WER scores, from lowest to highest.
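The macro-average can be sketched as follows: each dataset contributes one WER value with equal weight, regardless of how many hours of audio it contains. The function name `macro_average_wer`, the dataset keys and the WER values are placeholders chosen for illustration, not leaderboard results.

```python
def macro_average_wer(wer_by_dataset: dict) -> float:
    # Unweighted mean of the per-dataset WER values.
    if not wer_by_dataset:
        raise ValueError('at least one dataset is required')
    return sum(wer_by_dataset.values()) / len(wer_by_dataset)


# Placeholder per-dataset WER values in percent (not real results).
example_wers = {'dataset_a': 5.0, 'dataset_b': 10.0, 'dataset_c': 15.0}
print(macro_average_wer(example_wers))  # 10.0
```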
| Dataset | Domain | Speaking Style | Train (h) | Dev (h) | Test (h) | Transcriptions | License |
|-----------------------------------------------------------------------------------------|-----------------------------|-----------------------|-----------|---------|----------|--------------------|-----------------|
| [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | Audiobook | Narrated | 960 | 11 | 11 | Normalised | CC-BY-4.0 |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | European Parliament | Oratory | 523 | 5 | 5 | Punctuated | CC0 |
| [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | TED talks | Oratory | 454 | 2 | 3 | Normalised | CC-BY-NC-ND 3.0 |
| [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500 | 12 | 40 | Punctuated | apache-2.0 |
| [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech) | Financial meetings | Oratory, spontaneous | 4900 | 100 | 100 | Punctuated & Cased | User Agreement |
| [Earnings-22](https://huggingface.co/datasets/revdotcom/earnings22) | Financial meetings | Oratory, spontaneous | 105 | 5 | 5 | Punctuated & Cased | CC-BY-SA-4.0 |
| [AMI](https://huggingface.co/datasets/edinburghcstr/ami) | Meetings | Spontaneous | 78 | 9 | 9 | Punctuated & Cased | CC-BY-4.0 |
For more details on the individual datasets and how models are evaluated to give the ESB score, refer to the [ESB paper](https://arxiv.org/abs/2210.13352).
"""
# Multilingual benchmark definitions
EU_LANGUAGES = {
    "de": {"name": "German", "flag": "🇩🇪", "datasets": ["mls", "fleurs", "covost"]},
    "fr": {"name": "French", "flag": "🇫🇷", "datasets": ["mls", "fleurs", "covost"]},
    "it": {"name": "Italian", "flag": "🇮🇹", "datasets": ["mls", "fleurs", "covost"]},
    "es": {"name": "Spanish", "flag": "🇪🇸", "datasets": ["mls", "fleurs", "covost"]},
    "pt": {"name": "Portuguese", "flag": "🇵🇹", "datasets": ["mls", "fleurs", "covost"]}
}
MULTILINGUAL_TAB_TEXT = """
## Multilingual ASR Evaluation
"""
LONGFORM_TAB_TEXT = """
## Long-form ASR Evaluation
"""
LEADERBOARD_CSS = """
#leaderboard-table th .header-content {
white-space: nowrap;
}
#multilingual-table th .header-content {
white-space: nowrap;
}
#multilingual-table th:hover {
background-color: var(--table-row-focus);
}
#longform-table th .header-content {
white-space: nowrap;
}
#longform-table th:hover {
background-color: var(--table-row-focus);
}
.language-detail-modal {
background: var(--background-fill-primary);
border: 1px solid var(--border-color-primary);
border-radius: 8px;
padding: 1rem;
margin: 1rem 0;
}
"""