Steveeeeeeen HF Staff committed on
Commit 1e874c9 · verified · 1 Parent(s): aa72be5

add longform tab

Files changed (1)
  1. constants.py +164 -0
constants.py CHANGED
@@ -13,6 +13,34 @@ BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="
 
  TITLE = "<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard </b> </body> </html>"
 
  INTRODUCTION_TEXT = "📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
  on the Hugging Face Hub. \
  \nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower the better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher the better). Models are ranked based on their Average WER, from lowest to highest. Check the 📈 Metrics tab to understand how the models are evaluated. \
@@ -28,6 +56,142 @@ CITATION_TEXT = """@misc{open-asr-leaderboard,
  }
  """
 
  METRICS_TAB_TEXT = """
  Here you will find details about the speech recognition metrics and datasets reported in our leaderboard.
 
 
  TITLE = "<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard </b> </body> </html>"
 
+ INTRODUCTION_TEXT = "📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
+ on the Hugging Face Hub. \
+ \nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower the better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher the better). Models are ranked based on their Average WER, from lowest to highest. Check the 📈 Metrics tab to understand how the models are evaluated. \
+ \nIf you want results for a model that is not listed here, you can submit a request for it to be included ✉️✨. \
+ \nThe leaderboard includes both English ASR evaluation and multilingual benchmarks across the top European languages."
+
+ CITATION_TEXT = """@misc{open-asr-leaderboard,
+ title = {Open Automatic Speech Recognition Leaderboard},
+ author = {Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and others},
+ year = 2023,
+ publisher = {Hugging Face},
+ howpublished = "\\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}"
+ }
+ """
+ from pathlib import Path
+
+ # Directory where model evaluation requests are stored
+ DIR_OUTPUT_REQUESTS = Path("requested_models")
+ EVAL_REQUESTS_PATH = Path("eval_requests")
+
+ ##########################
+ # Text definitions #
+ ##########################
+
+ banner_url = "https://huggingface.co/datasets/reach-vb/random-images/resolve/main/asr_leaderboard.png"
+ BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="{banner_url}" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 600px;"> </div>'
+
+ TITLE = "<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard </b> </body> </html>"
+
  INTRODUCTION_TEXT = "📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
  on the Hugging Face Hub. \
  \nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower the better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher the better). Models are ranked based on their Average WER, from lowest to highest. Check the 📈 Metrics tab to understand how the models are evaluated. \
 
  }
  """
 
+ METRICS_TAB_TEXT = """
+ Here you will find details about the speech recognition metrics and datasets reported in our leaderboard.
+
+ ## Metrics
+
+ Models are evaluated jointly using the Word Error Rate (WER) and Inverse Real Time Factor (RTFx) metrics. The WER metric
+ is used to assess the accuracy of a system, and the RTFx the inference speed. Models are ranked in the leaderboard based
+ on their WER, lowest to highest.
+
+ Crucially, the WER and RTFx values are computed for the same inference run using a single script. The implication of this is two-fold:
+ 1. The WER and RTFx values are coupled: for a given WER, one can expect to achieve the corresponding RTFx. This allows a submitter to trade off lower WER against higher RTFx should they wish.
+ 2. The WER and RTFx values are averaged over all audio samples in the benchmark (on the order of thousands of samples).
+
+ For details on reproducing the benchmark numbers, refer to the [Open ASR GitHub repository](https://github.com/huggingface/open_asr_leaderboard#evaluate-a-model).
+
+ ### Word Error Rate (WER)
+
+ Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It counts the substitutions, insertions,
+ and deletions in the system's output relative to the reference (correct) transcript, expressed as a percentage of the reference length. **A lower WER value indicates higher accuracy**.
+
+ Take the following example:
+
+ | Reference:  | the | cat | sat     | on  | the | mat |
+ |-------------|-----|-----|---------|-----|-----|-----|
+ | Prediction: | the | cat | **sit** | on  | the |     |
+ | Label:      | ✅  | ✅  | S       | ✅  | ✅  | D   |
+
+ Here, we have:
+ * 1 substitution ("sit" instead of "sat")
+ * 0 insertions
+ * 1 deletion ("mat" is missing)
+
+ This gives 2 errors in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our
+ reference (N), which for this example is 6:
+
+ ```
+ WER = (S + I + D) / N = (1 + 0 + 1) / 6 = 0.333
+ ```
+
+ This gives a WER of 0.33, or 33%. For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints, meaning punctuation and casing are removed from the references and predictions. You can find the evaluation code on our [GitHub repository](https://github.com/huggingface/open_asr_leaderboard). To read more about how the WER is computed, refer to the [Audio Transformers Course](https://huggingface.co/learn/audio-course/chapter5/evaluation).
+
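+ As a rough illustration (a minimal sketch, not the leaderboard's own evaluation script), the example above can be reproduced with the [`evaluate`](https://huggingface.co/docs/evaluate) library; the simple lower-casing below only stands in for the full text normalisation used in the official code:
+
+ ```
+ import evaluate
+
+ wer_metric = evaluate.load("wer")
+
+ reference = "the cat sat on the mat"
+ prediction = "The cat sit on the"
+
+ # Crude stand-in for the leaderboard's normaliser: lower-case both strings
+ # so that casing differences are not counted as errors.
+ wer = wer_metric.compute(
+     references=[reference.lower()],
+     predictions=[prediction.lower()],
+ )
+
+ print(f"WER = {wer:.3f}")  # 2 errors / 6 reference words = 0.333
+ ```
+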
+ ### Inverse Real Time Factor (RTFx)
+
+ Inverse Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a
+ model to process a given amount of speech. It is defined as:
+ ```
+ RTFx = (number of seconds of audio inferred) / (compute time in seconds)
+ ```
+
+ Therefore, an RTFx of 1 means a system processes speech as fast as it is spoken, while an RTFx of 2 means it takes half that time.
+ Thus, **a higher RTFx value indicates lower latency**.
+
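+ As a minimal sketch (assuming a generic `transcribe` function and a list of audio files; this is illustrative, not the leaderboard's benchmarking script), an RTFx value could be measured like this:
+
+ ```
+ import time
+
+ def measure_rtfx(transcribe, audio_files, total_audio_seconds):
+     """Return RTFx = seconds of audio processed / wall-clock compute seconds."""
+     start = time.perf_counter()
+     for audio in audio_files:
+         transcribe(audio)  # run inference on each file
+     compute_seconds = time.perf_counter() - start
+     return total_audio_seconds / compute_seconds
+ ```
+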
+ ## How to reproduce our results
+
+ The ASR Leaderboard is an ongoing effort: we will continue to benchmark open-source and open-access speech recognition models where possible.
+ Along with the Leaderboard, we're open-sourcing the codebase used for running these evaluations.
+ For more details, head over to our repo: https://github.com/huggingface/open_asr_leaderboard
+
+ P.S. We'd love to know which other models you'd like us to benchmark next. Contributions are more than welcome! ♥️
+
+ ## Benchmark datasets
+
+ Evaluating Speech Recognition systems is a hard problem. We use the multi-dataset benchmarking strategy proposed in the
+ [ESB paper](https://arxiv.org/abs/2210.13352) to obtain robust evaluation scores for each model.
+
+ ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad
+ set of speech datasets. It comprises eight English speech recognition datasets, capturing a broad range of domains,
+ acoustic conditions, speaker styles, and transcription requirements. As such, it gives a better indication of how
+ a model is likely to perform on downstream ASR compared to evaluating it on one dataset alone.
+
+ The ESB score is calculated as a macro-average of the WER scores across the ESB datasets (a short sketch of this averaging follows the table below). The models in the leaderboard
+ are ranked based on their average WER scores, from lowest to highest.
+
+ | Dataset                                                               | Domain                      | Speaking Style        | Train (h) | Dev (h) | Test (h) | Transcriptions     | License         |
+ |-----------------------------------------------------------------------|-----------------------------|-----------------------|-----------|---------|----------|--------------------|-----------------|
+ | [LibriSpeech](https://huggingface.co/datasets/librispeech_asr)        | Audiobook                   | Narrated              | 960       | 11      | 11       | Normalised         | CC-BY-4.0       |
+ | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli)       | European Parliament         | Oratory               | 523       | 5       | 5        | Punctuated         | CC0             |
+ | [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium)              | TED talks                   | Oratory               | 454       | 2       | 3        | Normalised         | CC-BY-NC-ND 3.0 |
+ | [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech)  | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500      | 12      | 40       | Punctuated         | apache-2.0      |
+ | [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech)       | Financial meetings          | Oratory, spontaneous  | 4900      | 100     | 100      | Punctuated & Cased | User Agreement  |
+ | [Earnings-22](https://huggingface.co/datasets/revdotcom/earnings22)   | Financial meetings          | Oratory, spontaneous  | 105       | 5       | 5        | Punctuated & Cased | CC-BY-SA-4.0    |
+ | [AMI](https://huggingface.co/datasets/edinburghcstr/ami)              | Meetings                    | Spontaneous           | 78        | 9       | 9        | Punctuated & Cased | CC-BY-4.0       |
+
+ For more details on the individual datasets and how models are evaluated to give the ESB score, refer to the [ESB paper](https://arxiv.org/abs/2210.13352).
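+
+ Concretely, the macro-average weights every dataset equally, regardless of how many hours it contains. A minimal sketch (the dataset names and WER values below are placeholders, not leaderboard results):
+
+ ```
+ per_dataset_wer = {
+     "librispeech": 4.0,  # placeholder values, one entry per ESB test set
+     "voxpopuli": 8.0,
+     "ami": 15.0,
+ }
+
+ # Macro-average: a simple unweighted mean of the per-dataset WERs.
+ average_wer = sum(per_dataset_wer.values()) / len(per_dataset_wer)
+ ```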
+ """
+
+ # Multilingual benchmark definitions
+ EU_LANGUAGES = {
+     "de": {"name": "German", "flag": "🇩🇪", "datasets": ["mls", "fleurs", "covost"]},
+     "fr": {"name": "French", "flag": "🇫🇷", "datasets": ["mls", "fleurs", "covost"]},
+     "it": {"name": "Italian", "flag": "🇮🇹", "datasets": ["mls", "fleurs", "covost"]},
+     "es": {"name": "Spanish", "flag": "🇪🇸", "datasets": ["mls", "fleurs", "covost"]},
+     "pt": {"name": "Portuguese", "flag": "🇵🇹", "datasets": ["mls", "fleurs", "covost"]}
+ }
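+
+ # Minimal illustrative sketch (not part of the committed app code): one way the
+ # EU_LANGUAGES mapping might be consumed when building the multilingual tab,
+ # e.g. to render dropdown labels such as "🇩🇪 German (de)".
+ def _example_language_choices():
+     return [
+         f"{info['flag']} {info['name']} ({code})"
+         for code, info in EU_LANGUAGES.items()
+     ]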
+
+ MULTILINGUAL_TAB_TEXT = """
+ ## 🌍 Multilingual ASR Evaluation
+
+ """
+
+ LONGFORM_TAB_TEXT = """
+ ## 📏 Long-form ASR Evaluation
+
+ """
+
+ LEADERBOARD_CSS = """
+ #leaderboard-table th .header-content {
+     white-space: nowrap;
+ }
+
+ #multilingual-table th .header-content {
+     white-space: nowrap;
+ }
+
+ #multilingual-table th:hover {
+     background-color: var(--table-row-focus);
+ }
+
+ #longform-table th .header-content {
+     white-space: nowrap;
+ }
+
+ #longform-table th:hover {
+     background-color: var(--table-row-focus);
+ }
+
+ .language-detail-modal {
+     background: var(--background-fill-primary);
+     border: 1px solid var(--border-color-primary);
+     border-radius: 8px;
+     padding: 1rem;
+     margin: 1rem 0;
+ }
+ """
+
+
  METRICS_TAB_TEXT = """
  Here you will find details about the speech recognition metrics and datasets reported in our leaderboard.