Comparison of all metrics across engine/condition combinations. Green = better, Red = worse. WIP is inverted (higher is better).
Analyze performance distribution and resilience across different groupings.
| Engine | Degradation | Enhancement | Audio Norm | Text Norm | WER | CER | MER | WIL | WIP | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| parakeet-rnnt-1.1b | None | - | - | Raw | 0.2251 | 0.0589 | 0.2222 | 0.3872 | 0.6128 | 13.21 |
| parakeet-rnnt-1.1b | None | - | - | Normalized | 0.0260 | 0.0024 | 0.0256 | 0.0383 | 0.9617 | 13.21 |
| parakeet-ctc-1.1b | None | - | - | Raw | 0.2251 | 0.0589 | 0.2222 | 0.3872 | 0.6128 | 23.90 |
| parakeet-ctc-1.1b | None | - | - | Normalized | 0.0260 | 0.0024 | 0.0256 | 0.0383 | 0.9617 | 23.90 |
| parakeet-rnnt-0.6b | None | - | - | Raw | 0.2251 | 0.0589 | 0.2222 | 0.3872 | 0.6128 | 17.14 |
| parakeet-rnnt-0.6b | None | - | - | Normalized | 0.0260 | 0.0024 | 0.0256 | 0.0383 | 0.9617 | 17.14 |
| parakeet-ctc-0.6b | None | - | - | Raw | 0.2294 | 0.0628 | 0.2265 | 0.3939 | 0.6061 | 5.58 |
| parakeet-ctc-0.6b | None | - | - | Normalized | 0.0346 | 0.0064 | 0.0342 | 0.0551 | 0.9449 | 5.58 |
| parakeet-tdt-1.1b | None | - | - | Raw | 0.2251 | 0.0589 | 0.2222 | 0.3872 | 0.6128 | 17.50 |
| parakeet-tdt-1.1b | None | - | - | Normalized | 0.0260 | 0.0024 | 0.0256 | 0.0383 | 0.9617 | 17.50 |
| parakeet-tdt_ctc-1.1b | None | - | - | Raw | 0.0693 | 0.0259 | 0.0684 | 0.1208 | 0.8792 | 9.74 |
| parakeet-tdt_ctc-1.1b | None | - | - | Normalized | 0.0390 | 0.0056 | 0.0385 | 0.0634 | 0.9366 | 9.74 |
| parakeet-tdt_ctc-110m | None | - | - | Raw | 0.0736 | 0.0243 | 0.0726 | 0.1289 | 0.8711 | 2.72 |
| parakeet-tdt_ctc-110m | None | - | - | Normalized | 0.0260 | 0.0024 | 0.0256 | 0.0383 | 0.9617 | 2.72 |
| parakeet-tdt-0.6b-v2 | None | - | - | Raw | 0.0563 | 0.0243 | 0.0560 | 0.1051 | 0.8949 | 5.94 |
| parakeet-tdt-0.6b-v2 | None | - | - | Normalized | 0.0130 | 0.0016 | 0.0129 | 0.0215 | 0.9785 | 5.94 |
| parakeet-tdt-0.6b-v3 | None | - | - | Raw | 0.0649 | 0.0228 | 0.0641 | 0.1127 | 0.8873 | 6.55 |
| parakeet-tdt-0.6b-v3 | None | - | - | Normalized | 0.0303 | 0.0040 | 0.0299 | 0.0467 | 0.9533 | 6.55 |
| canary-1b | None | - | - | Raw | 0.5714 | 0.5479 | 0.5690 | 0.6065 | 0.3935 | 221.17 |
| canary-1b | None | - | - | Normalized | 0.5498 | 0.5294 | 0.5474 | 0.5661 | 0.4339 | 221.17 |
| canary-1b-flash | None | - | - | Raw | 0.5325 | 0.5345 | 0.5325 | 0.5532 | 0.4468 | 64.69 |
| canary-1b-flash | None | - | - | Normalized | 0.5108 | 0.5157 | 0.5108 | 0.5108 | 0.4892 | 64.69 |
This section explains the metrics used to evaluate ASR (Automatic Speech Recognition) performance. All metrics are computed by comparing the reference (ground truth) text with the hypothesis (transcription) text.
WER (Word Error Rate)
Range: 0 to ∞ (typically 0 to 1) | Lower is better
The most common ASR metric: the word-level edit distance between reference and hypothesis, normalized by the number of reference words.
WER = (Substitutions + Deletions + Insertions) / Total Reference Words
Example: If reference is "the cat sat" and hypothesis is "a cat sits", WER = 2/3 = 0.667
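The word-level computation can be sketched in pure Python (a minimal illustration of the formula above, not the evaluation code behind the table):

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn ref into hyp (classic dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j - 1] + (r != h),  # substitution (or match)
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)
```

For the example above, `wer("the cat sat", "a cat sits")` gives 2/3 ≈ 0.667 (two substitutions, three reference words).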
CER (Character Error Rate)
Range: 0 to ∞ (typically 0 to 1) | Lower is better
Like WER but computed at character level. More granular, and useful for languages without clear word boundaries.
CER = (Char Substitutions + Char Deletions + Char Insertions) / Total Reference Characters
Note: CER is often lower than WER since partial word matches are credited.
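The same edit-distance idea applies at character level; a standalone sketch (Python strings are already character sequences, so no splitting is needed):

```python
def levenshtein(a, b):
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j - 1] + (x != y),  # substitution (or match)
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    # Operates directly on the strings, i.e. on characters.
    return levenshtein(reference, hypothesis) / len(reference)
```

For the running example, `cer("the cat sat", "a cat sits")` is 5/11 ≈ 0.455, lower than the 0.667 WER because "sat" and "sits" still share most of their characters.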
MER (Match Error Rate)
Range: 0 to 1 | Lower is better
Proportion of word alignments that are errors rather than correct matches between reference and hypothesis.
MER = (Substitutions + Deletions + Insertions) / (Hits + Substitutions + Deletions + Insertions)
Key difference from WER: because insertions appear in the denominator, MER is bounded at 1.0 even when the hypothesis is much longer than the reference.
WIL (Word Information Lost)
Range: 0 to 1 | Lower is better
Measures the proportion of word information that was lost in transcription.
WIL = 1 - (Hits² / (Reference Length × Hypothesis Length))
Interpretation: the preserved term factors as (Hits / Reference Length) × (Hits / Hypothesis Length), so WIL combines a recall-like and a precision-like component into one measure.
WIP (Word Information Preserved)
Range: 0 to 1 | Higher is better
The complement of WIL: measures how much word information was correctly preserved.
WIP = Hits² / (Reference Length × Hypothesis Length) = 1 - WIL
Note: This is the only metric where higher values indicate better performance.
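All four word-level metrics fall out of a single alignment once the hits, substitutions, deletions, and insertions are counted. A self-contained sketch of the formulas above (illustrative only; a Levenshtein table plus a backtrace):

```python
def align_counts(ref_words, hyp_words):
    """Return (hits, subs, dels, ins) from a word-level Levenshtein alignment."""
    n, m = len(ref_words), len(hyp_words)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i-1][j-1] + (ref_words[i-1] != hyp_words[j-1]),
                          D[i-1][j] + 1,    # deletion
                          D[i][j-1] + 1)    # insertion
    # Backtrace from the bottom-right corner, counting each operation.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (ref_words[i-1] != hyp_words[j-1]):
            if ref_words[i-1] == hyp_words[j-1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i-1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins

def word_metrics(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    h, s, d, i = align_counts(ref, hyp)
    errors = s + d + i
    preserved = (h * h) / (len(ref) * len(hyp))  # WIP = Hits^2 / (N_ref * N_hyp)
    return {"wer": errors / len(ref),
            "mer": errors / (h + errors),
            "wil": 1 - preserved,
            "wip": preserved}
```

For "the cat sat" vs "a cat sits" (1 hit, 2 substitutions) this yields WER = MER = 2/3, WIP = 1/9 ≈ 0.111, and WIL = 8/9 ≈ 0.889.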
Interactive scatter plot showing the trade-off between processing time (X-axis) and error rate (Y-axis). Points closer to the bottom-left corner represent faster and more accurate transcriptions.
Color-coded matrix comparing all metrics across configurations. Green indicates good performance, red indicates poor performance. WIP column is inverted since higher values are better.
Shows the distribution of metric values grouped by engine, degradation type, or other factors. Useful for identifying which engines are most resilient to audio degradation.
Side-by-side comparison of reference and hypothesis texts with diff highlighting.
Use the dropdown filters at the top to focus on specific subsets of results. All tabs respect the current filter selection. Click "Reset" to restore all filters to default.
The normalization checkboxes control how metrics (WER, CER, MER, WIL, WIP) are computed.
Note: Text normalization is now a grid search dimension; each combination produces separate results, with the normalized text shown in the diff view.
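A typical text-normalization pass lowercases, strips punctuation, and collapses whitespace before scoring. The sketch below is hypothetical; the actual checkboxes may apply different or additional rules (number expansion, abbreviations, etc.):

```python
import re
import string

def normalize_text(text):
    """Illustrative normalizer (an assumption, not the dashboard's exact rules):
    lowercase, drop ASCII punctuation, collapse whitespace runs."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()
```

Applying such a pass to both reference and hypothesis before scoring is what produces the large Raw vs Normalized gaps in the table (e.g. WER 0.2251 vs 0.0260 for parakeet-rnnt-1.1b).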