Evaluation metrics for ASR
If you're familiar with the Levenshtein distance from NLP, the metrics for assessing speech recognition systems will be familiar! Don't worry if you're not, we'll go through the explanations start-to-finish to make sure you know the different metrics and understand what they mean.
When assessing speech recognition systems, we compare the system's predictions to the target text transcriptions, annotating any errors that are present. Each error falls into one of three categories:
- Substitutions (S): where we transcribe the wrong word in our prediction ("sit" instead of "sat")
- Insertions (I): where we add an extra word in our prediction
- Deletions (D): where we remove a word in our prediction
These error categories are the same for all speech recognition metrics. What differs is the level at which we compute these errors: we can either compute them on the word level or on the character level.
We'll use a running example for each of the metric definitions. Here, we have a ground truth or reference text sequence:
reference = "the cat sat on the mat"
And a predicted sequence from the speech recognition system that we're trying to assess:
prediction = "the cat sit on the"
We can see that the prediction is pretty close, but some words are not quite right. We'll evaluate this prediction against the reference for the three most popular speech recognition metrics and see what sort of numbers we get for each.
Word Error Rate
The word error rate (WER) metric is the 'de facto' metric for speech recognition. It calculates substitutions, insertions and deletions on the word level. This means errors are annotated on a word-by-word basis. Take our example:
| Reference: | the | cat | sat | on | the | mat |
|---|---|---|---|---|---|---|
| Prediction: | the | cat | sit | on | the | |
| Label: | ✅ | ✅ | S | ✅ | ✅ | D |
Here, we have:
- 1 substitution ("sit" instead of "sat")
- 0 insertions
- 1 deletion ("mat" is missing)
This gives 2 errors in total. To get our error rate, we divide the number of errors by the total number of words in our reference (N), which for this example is 6:

$$
WER = \frac{S + I + D}{N} = \frac{1 + 0 + 1}{6} = 0.333
$$
Alright! So we have a WER of 0.333, or 33.3%. Notice how the word "sit" only has one character that is wrong, but the entire word is marked incorrect. This is a defining feature of the WER: spelling errors are penalised heavily, no matter how minor they are.
The WER is defined such that lower is better: a lower WER means there are fewer errors in our prediction, so a perfect speech recognition system would have a WER of zero (no errors).
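To make the calculation concrete, here is a minimal sketch of the WER as a word-level edit distance divided by the number of reference words. The function name `word_error_rate` is ours, and splitting on whitespace is a simplifying assumption; this is an illustration of the definition above, not the implementation used by any particular library.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    pred_words = prediction.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j predicted words.
    previous = list(range(len(pred_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, pred_word in enumerate(pred_words, start=1):
            substitution = previous[j - 1] + (ref_word != pred_word)
            insertion = current[j - 1] + 1  # extra word in the prediction
            deletion = previous[j] + 1      # reference word missing from the prediction
            current.append(min(substitution, insertion, deletion))
        previous = current

    return previous[-1] / len(ref_words)


reference = "the cat sat on the mat"
prediction = "the cat sit on the"
print(word_error_rate(reference, prediction))  # 0.3333333333333333
```

The dynamic programme finds the cheapest combination of substitutions, insertions and deletions, which for our example is the one substitution and one deletion annotated in the table.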
Let's see how we can compute the WER using 🤗 Evaluate. We'll need two packages to compute our WER metric: 🤗 Evaluate for the API, and JIWER to do the heavy lifting of running the calculation:
pip install --upgrade evaluate jiwer
Great! We can now load up the WER metric and compute the figure for our example:
from evaluate import load
wer_metric = load("wer")
wer = wer_metric.compute(references=[reference], predictions=[prediction])
print(wer)
Print Output:
0.3333333333333333
0.33, or 33.3%, as expected! We now know what's going on under-the-hood with this WER calculation.
Now, here's something that's quite confusing... What do you think the upper limit of the WER is? You would expect it to be 1 or 100%, right? Nuh uh! Since the WER is the ratio of errors to the number of words in the reference (N), there is no upper limit on the WER! Let's take an example where we predict 10 words and the target only has 2 words. If all of our predictions were wrong (10 errors), we'd have a WER of 10 / 2 = 5, or 500%! This is something to bear in mind if you train an ASR system and see a WER of over 100%. Although if you're seeing this, something has likely gone wrong... 😅
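The 10-word versus 2-word case can be checked with a few lines of arithmetic. The sentences below are made up for illustration, and the error counts use a shortcut that is only valid because no predicted word matches any reference word; it is not a general WER implementation.

```python
reference = "hello world"  # 2 words (illustrative)
prediction = "one two three four five six seven eight nine ten"  # 10 words, all wrong

ref_words = reference.split()
pred_words = prediction.split()

# Shortcut: with no matching words at all, the cheapest alignment substitutes
# as many words as possible and inserts or deletes the remainder.
assert not set(ref_words) & set(pred_words)
substitutions = min(len(ref_words), len(pred_words))
insertions = max(len(pred_words) - len(ref_words), 0)
deletions = max(len(ref_words) - len(pred_words), 0)

wer = (substitutions + insertions + deletions) / len(ref_words)
print(substitutions, insertions, deletions)  # 2 8 0
print(wer)  # 5.0
```

The eight insertions are what push the error count above the number of reference words, and hence the WER above 100%.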
Inverse Real-Time Factor (RTFx)
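While WER measures the accuracy of transcriptions, the inverse real-time factor (RTFx) measures the speed of an ASR system. RTFx is the ratio of audio duration to processing time, i.e. the inverse of the real-time factor:

$$
\text{RTFx} = \frac{\text{audio duration}}{\text{processing time}}
$$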
For example, if it takes 10 seconds to transcribe 100 seconds of audio, the RTFx is 100/10 = 10. An RTFx greater than 1.0 means the system can transcribe audio faster than real-time, which is essential for live transcription applications like video conferencing or live captioning. An RTFx of 1.0 means the system processes at exactly real-time speed, while values below 1.0 indicate slower-than-real-time processing.
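As a rough sketch, RTFx can be measured by timing a transcription call and dividing the audio duration by the elapsed time. The helper names `rtfx` and `measure_rtfx` and the `dummy_transcribe` stand-in are hypothetical; in practice you would pass in your own model's transcription function and the true duration of the clip.

```python
import time


def rtfx(audio_duration_s: float, processing_time_s: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    if processing_time_s <= 0:
        raise ValueError("processing time must be positive")
    return audio_duration_s / processing_time_s


def measure_rtfx(transcribe, audio, audio_duration_s: float):
    """Time a single call to `transcribe` and return (transcription, RTFx)."""
    start = time.perf_counter()
    text = transcribe(audio)
    elapsed = time.perf_counter() - start
    return text, rtfx(audio_duration_s, elapsed)


print(rtfx(100, 10))  # 10.0, the example from the text


def dummy_transcribe(audio):
    # Stand-in for a real ASR model: pretends to work for 50 ms.
    time.sleep(0.05)
    return "the cat sat on the mat"


text, speed = measure_rtfx(dummy_transcribe, audio=None, audio_duration_s=2.0)
print(f"{speed:.1f}x real-time")
```

A single timed call is noisy, so averaging over many clips, after a warm-up run, generally gives a more stable figure.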
Key points about RTFx:
- Higher is better: Higher RTFx means faster processing
- RTFx > 1.0: Faster than real-time (good for streaming applications)
- RTFx = 1.0: Processes at exactly real-time speed
- RTFx < 1.0: Slower than real-time

[...]

all_references_norm = [
    all_references_norm[i]
    for i in range(len(all_references_norm))
    if len(all_references_norm[i]) > 0
]
wer = 100 * wer_metric.compute(
    references=all_references_norm, predictions=all_predictions_norm
)
wer
**Output:**
125.69809089960707
Again we see the drastic reduction in WER we achieve by normalising our references and predictions: the baseline model
achieves an orthographic test WER of 168%, while the normalised WER is 126%.
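To illustrate why normalisation lowers the WER, and why samples with empty normalised references are dropped, here is a small self-contained sketch. The `toy_normalise` function and the example strings are assumptions made for illustration: it only lower-cases text and strips ASCII punctuation, and is not the normaliser used for the figures above.

```python
import string


def toy_normalise(text: str) -> str:
    """Lower-case, strip ASCII punctuation and collapse whitespace (toy example)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


# Illustrative data only.
predictions = ["The cat sat.", "Hello, world!", "um"]
references = ["the cat sat", "hello world", "..."]

# Normalise both sides, then keep only the pairs whose normalised reference is
# non-empty: dividing by N = 0 reference words would be undefined.
pairs = [(toy_normalise(p), toy_normalise(r)) for p, r in zip(predictions, references)]
pairs = [(p, r) for p, r in pairs if len(r) > 0]

predictions_norm = [p for p, _ in pairs]
references_norm = [r for _, r in pairs]

print(predictions_norm)  # ['the cat sat', 'hello world']
print(references_norm)   # ['the cat sat', 'hello world']
```

Before normalisation, "sat." versus "sat" and "Hello," versus "hello" would each count as a substitution; after normalisation they match. Filtering predictions and references as pairs also keeps the two lists aligned.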
Right then! These are the numbers that we want to try and beat when we fine-tune the model, in order to improve the Whisper
model for Dhivehi speech recognition. Continue reading to get hands-on with a fine-tuning example 🚀