| <!doctype html> |
| <html lang="en"> |
| <head> |
| <meta charset="utf-8" /> |
| <meta name="viewport" content="width=device-width, initial-scale=1" /> |
| <title>Single-Shot ASR Model Evaluation</title> |
| <link rel="stylesheet" href="style.css" /> |
| </head> |
| <body> |
|
|
| <header> |
| <h1>Single-Shot ASR Model Evaluation</h1> |
| <p class="subtitle">13 local speech-to-text models tested on an AMD GPU with Handy</p> |
| </header> |
|
|
| <main> |
|
|
| <section class="card"> |
| <h2>Experiment</h2> |
| <p>This is a single-take benchmark of every transcription model available in <a href="https://github.com/pokey/handy" target="_blank">Handy</a> (v0.8.1), a local speech-to-text tool for Linux. Each model transcribed the same test script in one continuous recording, with a deliberate 5-second silence in the middle to test voice activity detection (VAD) handling and hallucination resistance.</p> |
|
|
| <h3>Test Script</h3> |
| <blockquote> |
| I had scrambled eggs and toast for breakfast this morning. The coffee was a bit too strong but I drank it anyway. <strong>[5-second pause]</strong> The capital of France is Paris. It sits on the River Seine and has a population of about two million people in the city itself. |
| </blockquote> |
| <p>The two halves are deliberately unrelated (personal anecdote vs. factual statement). Any text bridging them during the pause would indicate hallucination.</p> |
| </section> |
|
|
| <section class="card"> |
| <h2>Test Environment</h2> |
| <table> |
| <tr><td><strong>Application</strong></td><td>Handy 0.8.1</td></tr> |
| <tr><td><strong>Inference</strong></td><td>ONNX Runtime (auto) + Whisper.cpp (auto)</td></tr> |
| <tr><td><strong>GPU</strong></td><td>AMD Radeon RX 7800 XT (Navi 32, 12 GB VRAM)</td></tr> |
| <tr><td><strong>CPU</strong></td><td>12th Gen Intel Core i7-12700F</td></tr> |
| <tr><td><strong>OS</strong></td><td>Ubuntu 25.10, kernel 6.17.0-19-generic</td></tr> |
| <tr><td><strong>Date</strong></td><td>2026-03-29</td></tr> |
| </table> |
| </section> |
|
|
| <section class="card"> |
| <h2>Results</h2> |
|
|
| <h3>Rankings</h3> |
| <table class="results"> |
| <thead> |
| <tr> |
| <th>#</th> |
| <th>Model</th> |
| <th>Inference time</th> |
| <th><abbr title="Real-time factor">RTF</abbr></th> |
| <th>Errors</th> |
| <th>Hallucination</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr class="perfect"><td>1</td><td>Whisper Small</td><td>976 ms</td><td>0.07x</td><td>0</td><td>No</td></tr> |
| <tr class="perfect"><td>2</td><td>Parakeet V2</td><td>1,354 ms</td><td>0.09x</td><td>0</td><td>No</td></tr> |
| <tr class="perfect"><td>3</td><td>Canary 180M Flash</td><td>2,223 ms</td><td>0.17x</td><td>0</td><td>No</td></tr> |
| <tr class="perfect"><td>4</td><td>Moonshine Base</td><td>2,301 ms</td><td>0.15x</td><td>0</td><td>No</td></tr> |
| <tr class="minor"><td>5</td><td>Parakeet V3 (INT8)</td><td>1,378 ms</td><td>0.10x</td><td>1</td><td>No</td></tr> |
| <tr class="minor"><td>6</td><td>Whisper Turbo</td><td>1,112 ms</td><td>0.09x</td><td>2</td><td>No</td></tr> |
| <tr class="minor"><td>7</td><td>Canary 1B v2</td><td>2,473 ms</td><td>0.17x</td><td>1</td><td>No</td></tr> |
| <tr class="minor"><td>8</td><td>Moonshine Small Streaming</td><td>4,140 ms</td><td>0.33x</td><td>1</td><td>No</td></tr> |
| <tr class="minor"><td>9</td><td>Moonshine Tiny Streaming</td><td>3,414 ms</td><td>0.25x</td><td>2</td><td>No</td></tr> |
| <tr class="significant"><td>10</td><td>Whisper Medium</td><td>1,694 ms</td><td>0.13x</td><td>3</td><td>No</td></tr> |
| <tr class="significant"><td>11</td><td>Whisper Large</td><td>2,780 ms</td><td>0.22x</td><td>3</td><td>No</td></tr> |
| <tr class="significant"><td>12</td><td>Breeze ASR</td><td>2,626 ms</td><td>0.20x</td><td>3</td><td>No</td></tr> |
| <tr class="significant"><td>13</td><td>SenseVoice (INT8)</td><td>145 ms</td><td>0.01x</td><td>3</td><td>No</td></tr> |
| </tbody> |
| </table> |
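| <p>The RTF column is assumed here to mean real-time factor: inference time divided by the duration of the recording. The Python sketch below shows that arithmetic. The function name and the 14-second clip length are illustrative assumptions, since the recording length of each take is not listed on this page.</p> |

```python
def real_time_factor(inference_ms: float, audio_seconds: float) -> float:
    """Inference time divided by audio duration (both converted to seconds)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return (inference_ms / 1000.0) / audio_seconds


# Hypothetical clip length: per-model recording durations are not published here.
ASSUMED_AUDIO_SECONDS = 14.0

rtf = real_time_factor(976, ASSUMED_AUDIO_SECONDS)
print(f"{rtf:.2f}x")  # prints 0.07x
```

| <p>Under that assumption, any value below 1.0x means the model transcribes faster than real time.</p> |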
| </section> |
|
|
| <section class="card"> |
| <h2>Charts</h2> |
| <figure> |
| <img src="benchmark-charts.png" alt="Inference speed and accuracy bar charts" /> |
| <figcaption>Inference speed and transcription accuracy across all 13 models</figcaption> |
| </figure> |
| <figure> |
| <img src="speed-vs-accuracy.png" alt="Speed vs accuracy scatter plot" /> |
| <figcaption>Speed vs accuracy tradeoff — ideal models are in the bottom-left</figcaption> |
| </figure> |
| </section> |
|
|
| <section class="card"> |
| <h2>Key Findings</h2> |
|
|
| <h3>VAD &amp; Hallucination</h3> |
| <p>All 13 models handled the 5-second silence cleanly. No model hallucinated words, repeated phrases, or invented bridging text during the pause.</p> |
|
|
| <h3>Bigger is not always better</h3> |
| <p>Whisper Small (976 ms, 0 errors) outperformed both Whisper Medium (1,694 ms, 3 errors) and Whisper Large (2,780 ms, 3 errors) on this GPU. In this single run, the larger Whisper models were both slower and less accurate.</p> |
|
|
| <h3>Common Error Patterns</h3> |
| <ul> |
| <li><strong>Capitalisation:</strong> 7 models wrote "river Seine" instead of "River Seine"</li> |
| <li><strong>Numerals:</strong> 5 models output "2 million" instead of "two million"</li> |
| <li><strong>Punctuation:</strong> SenseVoice and Breeze ASR replaced sentence-ending periods with commas</li> |
| <li><strong>Mishearing:</strong> Moonshine Tiny Streaming turned "drank it anyway" into "don't get any way"; SenseVoice heard "Seine" as "sand"</li> |
| <li><strong>Dropped words:</strong> Canary 1B v2 lost "people in" from the final sentence</li> |
| </ul> |
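| <p>One way to sort differences like these into categories is a word-level diff against the test script. The Python sketch below is a hypothetical helper, not the method used to produce the error counts on this page. The category names and the sample hypothesis are assumptions made for illustration.</p> |

```python
import difflib
import string


def _has_digit(words: list[str]) -> bool:
    return any(ch.isdigit() for word in words for ch in word)


def _bare(words: list[str]) -> list[str]:
    """Lower-case each word and strip leading/trailing punctuation."""
    return [word.strip(string.punctuation).lower() for word in words]


def word_differences(reference: str, hypothesis: str) -> list[tuple[str, str, str]]:
    """Return (category, reference_words, hypothesis_words) per differing span."""
    ref, hyp = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        r, h = ref[i1:i2], hyp[j1:j2]
        if [w.lower() for w in r] == [w.lower() for w in h]:
            category = "capitalisation"  # same words, different case
        elif _bare(r) == _bare(h):
            category = "punctuation"     # same words once punctuation is stripped
        elif _has_digit(r) != _has_digit(h):
            category = "numeral"         # digits on one side, words on the other
        else:
            category = tag               # "replace", "delete" or "insert"
        differences.append((category, " ".join(r), " ".join(h)))
    return differences


reference = ("It sits on the River Seine and has a population of about "
             "two million people in the city itself.")
# Invented sample output combining two of the patterns listed above.
hypothesis = ("It sits on the river Seine and has a population of about "
              "2 million people in the city itself.")

for difference in word_differences(reference, hypothesis):
    print(difference)
```

| <p>Whether a capitalisation or numeral difference should count as an error at all is a scoring decision. Keeping the categories separate makes that choice explicit.</p> |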
|
|
| <h3>Recommendation</h3> |
| <p><strong>Whisper Small</strong> is the best overall choice for this hardware: it was the fastest model to produce a perfect transcription, finishing in under 1 second. <strong>Parakeet V2</strong> is a strong runner-up. For users who want correct proper-noun capitalisation and spelled-out numbers, <strong>Canary 180M Flash</strong> is the most pedantically accurate, though slower.</p> |
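| <p>The selection behind this recommendation can be sketched directly from the rankings table: keep the models with zero errors, then take the fastest. In the Python sketch below, the <code>Result</code> type and variable names are illustrative, and the figures are copied from the rankings table.</p> |

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    inference_ms: int
    errors: int


# Figures copied from the rankings table on this page.
RESULTS = [
    Result("Whisper Small", 976, 0),
    Result("Parakeet V2", 1354, 0),
    Result("Canary 180M Flash", 2223, 0),
    Result("Moonshine Base", 2301, 0),
    Result("Parakeet V3 (INT8)", 1378, 1),
    Result("Whisper Turbo", 1112, 2),
    Result("Canary 1B v2", 2473, 1),
    Result("Moonshine Small Streaming", 4140, 1),
    Result("Moonshine Tiny Streaming", 3414, 2),
    Result("Whisper Medium", 1694, 3),
    Result("Whisper Large", 2780, 3),
    Result("Breeze ASR", 2626, 3),
    Result("SenseVoice (INT8)", 145, 3),
]

# Fastest model among those with a perfect transcription.
perfect = [r for r in RESULTS if r.errors == 0]
fastest_perfect = min(perfect, key=lambda r: r.inference_ms)

# Fastest model regardless of accuracy, for comparison.
fastest_overall = min(RESULTS, key=lambda r: r.inference_ms)

print(fastest_perfect.model)  # Whisper Small
print(fastest_overall.model)  # SenseVoice (INT8)
```

| <p>SenseVoice (INT8) is the fastest model overall at 145 ms, but its 3 errors rule it out once accuracy is a hard requirement.</p> |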
| </section> |
|
|
| <section class="card"> |
| <h2>Data</h2> |
| <p>Raw benchmark data is available as <a href="transcription-benchmarks.json">transcription-benchmarks.json</a>.</p> |
| <p>Source repository: <a href="https://github.com/danielrosehill/Handy-Ubuntu-Setup" target="_blank">danielrosehill/Handy-Ubuntu-Setup</a></p> |
| </section> |
|
|
| </main> |
|
|
| <footer> |
| <p>Daniel Rosehill · 2026</p> |
| </footer> |
|
|
| </body> |
| </html> |
|
|