	# Evaluation metrics for ASR

	If you're familiar with the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) from NLP, the
	metrics for assessing speech recognition systems will be familiar! Don't worry if you're not: we'll go through the
	explanations start-to-finish to make sure you know the different metrics and understand what they mean.

	When assessing speech recognition systems, we compare the system's predictions to the target text transcriptions,
	annotating any errors that are present. Each error falls into one of three categories:
	1. Substitutions (S): where we transcribe the wrong word in our prediction ("sit" instead of "sat")
	2. Insertions (I): where we add an extra word in our prediction
	3. Deletions (D): where we omit a word from our prediction

	These error categories are the same for all speech recognition metrics. What differs is the level at which we compute
	these errors: we can either compute them on the _word level_ or on the _character level_.

	We'll use a running example for each of the metric definitions. Here, we have a _ground truth_ or _reference_ text sequence:

	```python
	reference = "the cat sat on the mat"
	```

	And a predicted sequence from the speech recognition system that we're trying to assess:

	```python
	prediction = "the cat sit on the"
	```

	We can see that the prediction is pretty close, but some words are not quite right. We'll evaluate this prediction
	against the reference for the three most popular speech recognition metrics and see what sort of numbers we get for each.
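
To make the two levels concrete, here is a minimal sketch of how the running example breaks down into units at the word level and at the character level. Treating spaces as characters is an assumption made for illustration; a given metric implementation may handle whitespace differently.

```python
reference = "the cat sat on the mat"
prediction = "the cat sit on the"

# Word level: split on whitespace, so each word is one unit.
reference_words = reference.split()
prediction_words = prediction.split()

# Character level: every character is one unit (spaces included, by assumption).
reference_chars = list(reference)
prediction_chars = list(prediction)

print(len(reference_words), len(prediction_words))  # 6 5
print(len(reference_chars), len(prediction_chars))  # 22 18
```

Errors are then counted over whichever list of units we choose.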

	## Word Error Rate
	The word error rate (WER) metric is the 'de facto' metric for speech recognition. It calculates substitutions,
	insertions and deletions on the word level. This means errors are annotated on a word-by-word basis. Take our example:

	| Reference:  | the | cat | sat | on  | the | mat |
	|-------------|-----|-----|-----|-----|-----|-----|
	| Prediction: | the | cat | sit | on  | the |     |
	| Label:      | ✅  | ✅  | S   | ✅  | ✅  | D   |

	Here, we have:
	* 1 substitution ("sit" instead of "sat")
	* 0 insertions
	* 1 deletion ("mat" is missing)

	This gives 2 errors in total. To get our error rate, we divide the number of errors by the total number of words in our
	reference (N), which for this example is 6:

	$$
	\begin{aligned}
	WER &= \frac{S + I + D}{N} \\
	&= \frac{1 + 0 + 1}{6} \\
	&= 0.333
	\end{aligned}
	$$

	Alright! So we have a WER of 0.333, or 33.3%. Notice how the word "sit" only has one character that is wrong, but the
	entire word is marked incorrect. This is a defining feature of the WER: spelling errors are penalised heavily, no matter
	how minor they are.

	The WER is defined such that lower is better: a lower WER means there are fewer errors in our prediction, so a perfect
	speech recognition system would have a WER of zero (no errors).
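
The counting above can be sketched from scratch with a word-level Levenshtein alignment. This is an illustrative sketch, not the implementation used by any particular library: the function name `count_word_errors` is hypothetical, and it assumes whitespace tokenisation and a non-empty reference. When several alignments are equally short, the split between S, I and D may differ, but their sum (and so the WER) stays the same.

```python
def count_word_errors(reference: str, prediction: str) -> dict:
    """Count substitutions, insertions and deletions with a word-level Levenshtein alignment."""
    ref = reference.split()
    hyp = prediction.split()

    # dist[i][j] = minimum number of edits that turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(1, len(hyp) + 1):
        dist[0][j] = j  # insert all j predicted words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i][j - 1] + 1,  # insertion
                dist[i - 1][j] + 1,  # deletion
            )

    # Walk back from the bottom-right corner to label each edit.
    counts = {"S": 0, "I": 0, "D": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + int(mismatch):
            counts["S"] += int(mismatch)
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            counts["I"] += 1
            j -= 1
        else:
            counts["D"] += 1
            i -= 1

    counts["N"] = len(ref)
    counts["WER"] = (counts["S"] + counts["I"] + counts["D"]) / len(ref)
    return counts


print(count_word_errors("the cat sat on the mat", "the cat sit on the"))
# {'S': 1, 'I': 0, 'D': 1, 'N': 6, 'WER': 0.3333333333333333}
```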

	Let's see how we can compute the WER using 🤗 Evaluate. We'll need two packages to compute our WER metric: 🤗 Evaluate
	for the API, and JIWER to do the heavy lifting of running the calculation:
	```
	pip install --upgrade evaluate jiwer
	```

	Great! We can now load up the WER metric and compute the figure for our example:

	```python
	from evaluate import load

	wer_metric = load("wer")

	wer = wer_metric.compute(references=[reference], predictions=[prediction])

	print(wer)
	```
	Print Output:
	```
	0.3333333333333333
	```

	0.33, or 33.3%, as expected! We now know what's going on under the hood with this WER calculation.

	Now, here's something that's quite confusing... What do you think the upper limit of the WER is? You would expect it to be
	1, or 100%, right? Nuh uh! Since the WER is the ratio of errors to the number of words in the reference (N), there is no upper limit on the WER!
	Let's take an example where we predict 10 words and the target only has 2 words. If all of our predictions were wrong (10 errors),
	we'd have a WER of 10 / 2 = 5, or 500%! This is something to bear in mind if you train an ASR system and see a WER of over
	100%. Although if you're seeing this, something has likely gone wrong... 😅
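
Here is a small sketch of that situation in plain Python. The helper `word_error_rate` and the two sentences are hypothetical, chosen only to reproduce the 10-words-versus-2 case; the helper assumes whitespace tokenisation and a non-empty reference.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    # previous[j] = edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(
                min(
                    previous[j - 1] + (ref_word != hyp_word),  # match or substitution
                    current[j - 1] + 1,  # insertion
                    previous[j] + 1,  # deletion
                )
            )
        previous = current
    return previous[-1] / len(ref)


reference = "hello world"  # 2 words
prediction = "a b c d e f g h i j"  # 10 words, none of them correct

print(word_error_rate(reference, prediction))  # 5.0
```

The 10 errors here are 2 substitutions plus 8 insertions, divided by the 2 words in the reference.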

	## Inverse Real-Time Factor (RTFx)

	While WER measures the accuracy of transcriptions, the inverse real-time factor (RTFx) measures the speed of an ASR system.
	RTFx is the ratio of audio duration to processing time, that is, the inverse of the real-time factor (processing time divided by audio duration):

	$$
	\text{RTFx} = \frac{\text{Audio Duration}}{\text{Processing Time}}
	$$

	For example, if it takes 10 seconds to transcribe 100 seconds of audio, the RTFx is 100/10 = 10. An RTFx greater than 1.0
	means the system can transcribe audio faster than real-time, which is essential for live transcription applications like
	video conferencing or live captioning. An RTFx of 1.0 means the system processes at exactly real-time speed, while values
	below 1.0 indicate slower-than-real-time processing.
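
As a rough sketch, the formula can be wrapped in a small helper and combined with a wall-clock timer. The names `rtfx` and `measure_rtfx` are hypothetical, and `transcribe` stands in for whatever callable runs your ASR system; none of these come from a specific library.

```python
import time


def rtfx(audio_duration_s: float, processing_time_s: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    if processing_time_s <= 0:
        raise ValueError("processing time must be positive")
    return audio_duration_s / processing_time_s


def measure_rtfx(transcribe, audio, audio_duration_s: float) -> float:
    """Time one call to `transcribe` (a hypothetical callable) and return its RTFx."""
    start = time.perf_counter()
    transcribe(audio)
    elapsed = time.perf_counter() - start
    return rtfx(audio_duration_s, elapsed)


# 100 seconds of audio transcribed in 10 seconds
print(rtfx(100, 10))  # 10.0
```

In practice a single timing is noisy, so it is reasonable to average over several audio files before comparing systems.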

	Key points about RTFx:
	* Higher is better: Higher RTFx means faster processing
	* RTFx > 1.0: Faster than real-time (good for streaming applications)
	* RTFx = 1.0: Processes at exactly real-time speed
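	* RTFx < 1.0: Slower than real-time

	[...]

	```python
	# The start of this snippet is missing from the source. The filter on `all_predictions_norm`
	# is reconstructed on the assumption that it mirrors the one on `all_references_norm` below.
	# Keep only the samples whose normalised reference is non-empty.
	all_predictions_norm = [
	    all_predictions_norm[i]
	    for i in range(len(all_predictions_norm))
	    if len(all_references_norm[i]) > 0
	]
	all_references_norm = [
	    all_references_norm[i]
	    for i in range(len(all_references_norm))
	    if len(all_references_norm[i]) > 0
	]

	# Scale the WER to a percentage.
	wer = 100 * wer_metric.compute(
	    references=all_references_norm, predictions=all_predictions_norm
	)

	wer
	```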
	Output:
	```
	125.69809089960707
	```

	Again we see the drastic reduction in WER we achieve by normalising our references and predictions: the baseline model
	achieves an orthographic test WER of 168%, while the normalised WER is 126%.

	Right then! These are the numbers that we want to try and beat when we fine-tune the model, in order to improve the Whisper
	model for Dhivehi speech recognition. Continue reading to get hands-on with a fine-tuning example 🚀
