| # Evaluation metrics for ASR | |
| If you're familiar with the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) from NLP, the | |
| metrics for assessing speech recognition systems will be familiar! Don't worry if you're not, we'll go through the | |
| explanations start-to-finish to make sure you know the different metrics and understand what they mean. | |
| When assessing speech recognition systems, we compare the system's predictions to the target text transcriptions, | |
| annotating any errors that are present. We categorise these errors into one of three categories: | |
| 1. Substitutions (S): where we transcribe the **wrong word** in our prediction ("sit" instead of "sat") | |
| 2. Insertions (I): where we add an **extra word** in our prediction | |
| 3. Deletions (D): where we **remove a word** in our prediction | |
| These error categories are the same for all speech recognition metrics. What differs is the level at which we compute | |
| these errors: we can either compute them on the _word level_ or on the _character level_. | |
| We'll use a running example for each of the metric definitions. Here, we have a _ground truth_ or _reference_ text sequence: | |
| ```python | |
| reference = "the cat sat on the mat" | |
| ``` | |
| And a predicted sequence from the speech recognition system that we're trying to assess: | |
| ```python | |
| prediction = "the cat sit on the" | |
| ``` | |
| We can see that the prediction is pretty close, but some words are not quite right. We'll evaluate this prediction | |
| against the reference for the three most popular speech recognition metrics and see what sort of numbers we get for each. | |
| ## Word Error Rate | |
| The *word error rate (WER)* metric is the 'de facto' metric for speech recognition. It calculates substitutions, | |
| insertions and deletions on the *word level*. This means errors are annotated on a word-by-word basis. Take our example: | |
| | Reference: | the | cat | sat | on | the | mat | | |
| |-------------|-----|-----|---------|-----|-----|-----| | |
| | Prediction: | the | cat | **sit** | on | the | | | | |
| | Label: | ✅ | ✅ | S | ✅ | ✅ | D | | |
| Here, we have: | |
| * 1 substitution ("sit" instead of "sat") | |
| * 0 insertions | |
| * 1 deletion ("mat" is missing) | |
| This gives 2 errors in total. To get our error rate, we divide the number of errors by the total number of words in our | |
| reference (N), which for this example is 6: | |
| $$ | |
| \begin{aligned} | |
| WER &= \frac{S + I + D}{N} \\ | |
| &= \frac{1 + 0 + 1}{6} \\ | |
| &= 0.333 | |
| \end{aligned} | |
| $$ | |
| Alright! So we have a WER of 0.333, or 33.3%. Notice how the word "sit" only has one character that is wrong, but the | |
| entire word is marked incorrect. This is a defining feature of the WER: spelling errors are penalised heavily, no matter | |
| how minor they are. | |
| The WER is defined such that *lower is better*: a lower WER means there are fewer errors in our prediction, so a perfect | |
| speech recognition system would have a WER of zero (no errors). | |
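For intuition, here is a minimal sketch of how the S, I and D counts above can be obtained with a word-level Levenshtein alignment. The function name `count_word_errors` is our own (it is not part of 🤗 Evaluate or JiWER), and real libraries may break ties between equally good alignments differently:

```python
def count_word_errors(reference: str, prediction: str) -> dict:
    """Count word-level substitutions (S), insertions (I) and deletions (D)."""
    ref, hyp = reference.split(), prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = (total_errors, S, I, D) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)  # every reference word was deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)  # every predicted word was inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            total, s, ins, d = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (total, s, ins, d)  # correct word, no new error
            else:
                diagonal = (total + 1, s + 1, ins, d)  # substitution
            total, s, ins, d = cost[i][j - 1]
            insertion = (total + 1, s, ins + 1, d)
            total, s, ins, d = cost[i - 1][j]
            deletion = (total + 1, s, ins, d + 1)
            # keep whichever option has the fewest total errors
            cost[i][j] = min(diagonal, insertion, deletion, key=lambda c: c[0])

    total, s, ins, d = cost[-1][-1]
    return {"S": s, "I": ins, "D": d, "N": len(ref), "wer": total / len(ref)}


errors = count_word_errors("the cat sat on the mat", "the cat sit on the")
print(errors["S"], errors["I"], errors["D"])  # 1 0 1
```

Dividing the total error count by N = 6 reproduces the 0.333 figure from the worked example.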
| Let's see how we can compute the WER using 🤗 Evaluate. We'll need two packages to compute our WER metric: 🤗 Evaluate | |
| for the API, and JiWER to do the heavy lifting of running the calculation: | |
| ``` | |
| pip install --upgrade evaluate jiwer | |
| ``` | |
| Great! We can now load up the WER metric and compute the figure for our example: | |
| ```python | |
| from evaluate import load | |
| wer_metric = load("wer") | |
| wer = wer_metric.compute(references=[reference], predictions=[prediction]) | |
| print(wer) | |
| ``` | |
| **Print Output:** | |
| ``` | |
| 0.3333333333333333 | |
| ``` | |
| 0.33, or 33.3%, as expected! We now know what's going on under-the-hood with this WER calculation. | |
| Now, here's something that's quite confusing... What do you think the upper limit of the WER is? You would expect it to be | |
| 1 or 100%, right? Nuh uh! Since the WER is the ratio of errors to the number of words in the reference (N), there is no upper limit on the WER! | |
| Let's take an example where we predict 10 words and the target only has 2 words. If all of our predictions were wrong (10 errors), | |
| we'd have a WER of 10 / 2 = 5, or 500%! This is something to bear in mind if you train an ASR system and see a WER of over | |
| 100%. Although if you're seeing this, something has likely gone wrong... 😅 | |
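The 10-words-versus-2 case can be checked with a small, dependency-free sketch. The helper `word_error_rate` and the two toy strings are our own illustrations, not part of 🤗 Evaluate:

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


short_reference = "hello world"  # N = 2
long_prediction = "a b c d e f g h i j"  # 10 words, none of them correct
print(word_error_rate(short_reference, long_prediction))  # 5.0
```

The 10 errors here are 2 substitutions plus 8 insertions, and dividing by N = 2 gives a WER of 5, or 500%.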
| ## Inverse Real-Time Factor (RTFx) | |
| While WER measures the accuracy of transcriptions, the *inverse real-time factor (RTFx)* measures the speed of an ASR system. | |
| RTFx is the ratio of audio duration to processing time, i.e. the inverse of the real-time factor (RTF): | |
| $$ | |
| \text{RTFx} = \frac{\text{Audio Duration}}{\text{Processing Time}} | |
| $$ | |
| For example, if it takes 10 seconds to transcribe 100 seconds of audio, the RTFx is 100/10 = 10. An RTFx greater than 1.0 | |
| means the system can transcribe audio faster than real-time, which is essential for live transcription applications like | |
| video conferencing or live captioning. An RTFx of 1.0 means the system processes at exactly real-time speed, while values | |
| below 1.0 indicate slower-than-real-time processing. | |
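As a rough sketch, the formula can be wrapped in a couple of helpers. Both `rtfx` and `measure_rtfx` are hypothetical names of our own, and `transcribe` stands in for whatever call actually runs your ASR system:

```python
import time
from typing import Callable


def rtfx(audio_duration_s: float, processing_time_s: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    if processing_time_s <= 0:
        raise ValueError("processing time must be positive")
    return audio_duration_s / processing_time_s


def measure_rtfx(transcribe: Callable[[], object], audio_duration_s: float) -> float:
    """Time a single call to `transcribe` (a placeholder callable) and return its RTFx."""
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return rtfx(audio_duration_s, elapsed)


# The worked example from the text: 100 seconds of audio transcribed in 10 seconds
print(rtfx(100.0, 10.0))  # 10.0
```

In practice you would likely average over many audio files and exclude model loading and warm-up time, since a single timed call can be noisy.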
| Key points about RTFx: | |
| * **Higher is better**: Higher RTFx means faster processing | |
| * **RTFx > 1.0**: Faster than real-time (good for streaming applications) | |
| * **RTFx = 1.0**: Processes at exactly real-time speed | |
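| * **RTFx < 1.0**: Slower than real-time | |
| [...] | |
| ```python | |
| # [...] the earlier part of this snippet is missing from the source | |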
| all_references_norm = [ | |
| all_references_norm[i] | |
| for i in range(len(all_references_norm)) | |
| if len(all_references_norm[i]) > 0 | |
| ] | |
| wer = 100 * wer_metric.compute( | |
| references=all_references_norm, predictions=all_predictions_norm | |
| ) | |
| wer | |
| ``` | |
| **Output:** | |
| ``` | |
| 125.69809089960707 | |
| ``` | |
| Again we see the drastic reduction in WER we achieve by normalising our references and predictions: the baseline model | |
| achieves an orthographic test WER of 168%, while the normalised WER is 126%. | |
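The filtering step above keeps only the samples whose normalised reference is non-empty. The predictions need the same mask so that the two lists stay aligned. Below is a minimal sketch of that idea on toy data. `toy_normalise` is a hypothetical stand-in of our own (lowercasing and stripping ASCII punctuation), not the normaliser used for the numbers above:

```python
import string


def toy_normalise(text: str) -> str:
    """Toy normaliser (an assumption): lowercase, drop ASCII punctuation, collapse spaces."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def normalise_and_filter(references, predictions):
    """Normalise both lists, then drop every pair whose reference ends up empty."""
    pairs = [
        (toy_normalise(ref), toy_normalise(pred))
        for ref, pred in zip(references, predictions)
    ]
    kept = [(ref, pred) for ref, pred in pairs if len(ref) > 0]
    refs = [ref for ref, _ in kept]
    preds = [pred for _, pred in kept]
    return refs, preds


refs, preds = normalise_and_filter(["The cat sat.", "..."], ["the cat sat", "hm"])
print(refs, preds)  # ['the cat sat'] ['the cat sat']
```

Filtering pairs together, rather than each list separately, guarantees that `references` and `predictions` have the same length when they are passed to the metric.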
| Right then! These are the numbers that we want to try and beat when we fine-tune the model, in order to improve the Whisper | |
| model for Dhivehi speech recognition. Continue reading to get hands-on with a fine-tuning example 🚀 | |