<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Single-Shot ASR Model Evaluation</title>
<link rel="stylesheet" href="style.css" />
</head>
<body>
<header>
<h1>Single-Shot ASR Model Evaluation</h1>
<p class="subtitle">13 local speech-to-text models tested on an AMD GPU with Handy</p>
</header>
<main>
<section class="card">
<h2>Experiment</h2>
<p>A single-take benchmark of every transcription model available in <a href="https://github.com/pokey/handy" target="_blank">Handy</a> (v0.8.1), a local speech-to-text tool for Linux. Each model transcribed the same test script in one continuous recording, with a deliberate 5-second silence in the middle to test VAD handling and hallucination resistance.</p>
<h3>Test Script</h3>
<blockquote>
I had scrambled eggs and toast for breakfast this morning. The coffee was a bit too strong but I drank it anyway. <strong>[5 second pause]</strong> The capital of France is Paris. It sits on the River Seine and has a population of about two million people in the city itself.
</blockquote>
<p>The two halves are deliberately unrelated (personal anecdote vs. factual statement). Any text bridging them during the pause would indicate hallucination.</p>
</section>
<section class="card">
<h2>Test Environment</h2>
<table>
<tr><td><strong>Application</strong></td><td>Handy 0.8.1</td></tr>
<tr><td><strong>Inference</strong></td><td>ONNX Runtime (auto) + Whisper.cpp (auto)</td></tr>
<tr><td><strong>GPU</strong></td><td>AMD Radeon RX 7800 XT (Navi 32, 12 GB VRAM)</td></tr>
<tr><td><strong>CPU</strong></td><td>12th Gen Intel Core i7-12700F</td></tr>
<tr><td><strong>OS</strong></td><td>Ubuntu 25.10, kernel 6.17.0-19-generic</td></tr>
<tr><td><strong>Date</strong></td><td>2026-03-29</td></tr>
</table>
</section>
<section class="card">
<h2>Results</h2>
<h3>Rankings</h3>
<table class="results">
<thead>
<tr>
<th>#</th>
<th>Model</th>
<th>Inference time</th>
<th><abbr title="Real-time factor: inference time divided by audio duration">RTF</abbr></th>
<th>Errors</th>
<th>Hallucination</th>
</tr>
</thead>
<tbody>
<tr class="perfect"><td>1</td><td>Whisper Small</td><td>976 ms</td><td>0.07x</td><td>0</td><td>No</td></tr>
<tr class="perfect"><td>2</td><td>Parakeet V2</td><td>1,354 ms</td><td>0.09x</td><td>0</td><td>No</td></tr>
<tr class="perfect"><td>3</td><td>Canary 180M Flash</td><td>2,223 ms</td><td>0.17x</td><td>0</td><td>No</td></tr>
<tr class="perfect"><td>4</td><td>Moonshine Base</td><td>2,301 ms</td><td>0.15x</td><td>0</td><td>No</td></tr>
<tr class="minor"><td>5</td><td>Parakeet V3 (INT8)</td><td>1,378 ms</td><td>0.10x</td><td>1</td><td>No</td></tr>
<tr class="minor"><td>6</td><td>Whisper Turbo</td><td>1,112 ms</td><td>0.09x</td><td>2</td><td>No</td></tr>
<tr class="minor"><td>7</td><td>Canary 1B v2</td><td>2,473 ms</td><td>0.17x</td><td>1</td><td>No</td></tr>
<tr class="minor"><td>8</td><td>Moonshine Small Streaming</td><td>4,140 ms</td><td>0.33x</td><td>1</td><td>No</td></tr>
<tr class="minor"><td>9</td><td>Moonshine Tiny Streaming</td><td>3,414 ms</td><td>0.25x</td><td>2</td><td>No</td></tr>
<tr class="significant"><td>10</td><td>Whisper Medium</td><td>1,694 ms</td><td>0.13x</td><td>3</td><td>No</td></tr>
<tr class="significant"><td>11</td><td>Whisper Large</td><td>2,780 ms</td><td>0.22x</td><td>3</td><td>No</td></tr>
<tr class="significant"><td>12</td><td>Breeze ASR</td><td>2,626 ms</td><td>0.20x</td><td>3</td><td>No</td></tr>
<tr class="significant"><td>13</td><td>SenseVoice (INT8)</td><td>145 ms</td><td>0.01x</td><td>3</td><td>No</td></tr>
</tbody>
</table>
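<p>The RTF column is not defined on this page. Assuming it is the conventional real-time factor (inference time divided by audio duration), the relationship can be sketched as follows. The function name <code>realTimeFactor</code> and the 14-second clip length are illustrative assumptions, not values taken from the benchmark data.</p>

```javascript
// Real-time factor: processing time divided by audio duration.
// Values below 1 mean the model ran faster than real time.
function realTimeFactor(inferenceMs, audioMs) {
  if (!(audioMs > 0)) {
    throw new RangeError("audioMs must be a positive number");
  }
  return inferenceMs / audioMs;
}

// Hypothetical clip length, for illustration only:
// the page does not state how long each recording was.
const exampleAudioMs = 14000;
const rtf = realTimeFactor(976, exampleAudioMs);
console.log(rtf.toFixed(2) + "x");
```

<p>Under this definition, a lower RTF means faster transcription relative to the length of the audio, which is why RTF and raw inference time can rank models slightly differently when recordings differ in length.</p>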
</section>
<section class="card">
<h2>Charts</h2>
<figure>
<img src="benchmark-charts.png" alt="Inference speed and accuracy bar charts" />
<figcaption>Inference speed and transcription accuracy across all 13 models</figcaption>
</figure>
<figure>
<img src="speed-vs-accuracy.png" alt="Speed vs accuracy scatter plot" />
<figcaption>Speed vs accuracy tradeoff &mdash; ideal models are in the bottom-left</figcaption>
</figure>
</section>
<section class="card">
<h2>Key Findings</h2>
<h3>VAD &amp; Hallucination</h3>
<p>All 13 models handled the 5-second silence cleanly. No model hallucinated words, repeated phrases, or invented bridging text during the pause.</p>
<h3>Bigger is not always better</h3>
<p>In this single-take test, Whisper Small (976 ms, 0 errors) outperformed both Whisper Medium (1,694 ms, 3 errors) and Whisper Large (2,780 ms, 3 errors) on this GPU: the two larger Whisper models were slower and made more errors.</p>
<h3>Common Error Patterns</h3>
<ul>
<li><strong>Capitalisation:</strong> 7 models wrote "river Seine" instead of "River Seine"</li>
<li><strong>Numerals:</strong> 5 models output "2 million" instead of "two million"</li>
<li><strong>Punctuation:</strong> SenseVoice and Breeze ASR replaced sentence-ending periods with commas</li>
<li><strong>Mishearing:</strong> Moonshine Tiny Streaming turned "drank it anyway" into "don't get any way"; SenseVoice heard "Seine" as "sand"</li>
<li><strong>Dropped words:</strong> Canary 1B v2 lost "people in" from the final sentence</li>
</ul>
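<p>The patterns above mix formatting differences (capitalisation, punctuation) with genuine word-level errors. One way to separate the two is sketched below. This is an illustrative approach, not the scoring method used for this benchmark; the function names and the shortened example sentences are assumptions made for the sketch.</p>

```javascript
// Lowercase, drop punctuation (keeping apostrophes), and split on whitespace.
function normalizeWords(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s']/gu, "")
    .split(/\s+/)
    .filter(Boolean);
}

// Classify a transcript against the reference text:
//   "exact"      -> identical strings
//   "formatting" -> same words once case and punctuation are ignored
//   "words"      -> the word sequence itself differs
function classifyDifference(reference, hypothesis) {
  if (reference === hypothesis) return "exact";
  const ref = normalizeWords(reference);
  const hyp = normalizeWords(hypothesis);
  const sameWords =
    ref.length === hyp.length && ref.every((word, i) => word === hyp[i]);
  return sameWords ? "formatting" : "words";
}

// Shortened, illustrative sentences based on the test script.
const reference = "It sits on the River Seine.";
console.log(classifyDifference(reference, "It sits on the river Seine.")); // formatting
console.log(classifyDifference(reference, "It sits on the river sand."));  // words
```

<p>This sketch would still report "2 million" versus "two million" as a word-level difference; treating numerals as formatting would need an extra number-normalisation step.</p>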
<h3>Recommendation</h3>
<p><strong>Whisper Small</strong> is the best overall choice for this hardware: it produced the fastest perfect transcription, in under 1 second. <strong>Parakeet V2</strong> is a strong runner-up. For users who want correct proper-noun capitalisation and spelled-out numbers, <strong>Canary 180M Flash</strong> is the most pedantically accurate, though slower.</p>
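<p>The "fastest perfect transcription" selection can be reproduced from the rankings table with a short filter and sort. The rows below are copied from that table; the array layout and the variable names are assumptions made for this sketch, not the structure of the raw JSON file.</p>

```javascript
// Rows copied from the rankings table: [model, inference time in ms, error count].
const results = [
  ["Whisper Small", 976, 0],
  ["Parakeet V2", 1354, 0],
  ["Canary 180M Flash", 2223, 0],
  ["Moonshine Base", 2301, 0],
  ["Parakeet V3 (INT8)", 1378, 1],
  ["Whisper Turbo", 1112, 2],
  ["Canary 1B v2", 2473, 1],
  ["Moonshine Small Streaming", 4140, 1],
  ["Moonshine Tiny Streaming", 3414, 2],
  ["Whisper Medium", 1694, 3],
  ["Whisper Large", 2780, 3],
  ["Breeze ASR", 2626, 3],
  ["SenseVoice (INT8)", 145, 3],
];

// Keep only error-free models, then order them fastest first.
const perfectBySpeed = results
  .filter(([, , errors]) => errors === 0)
  .sort((a, b) => a[1] - b[1])
  .map(([model]) => model);

console.log(perfectBySpeed[0]); // Whisper Small
console.log(perfectBySpeed[1]); // Parakeet V2
```

<p>Filtering on zero errors first matters here: sorting by speed alone would put SenseVoice (INT8) at the top despite its 3 errors.</p>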
</section>
<section class="card">
<h2>Data</h2>
<p>Raw benchmark data is available as <a href="transcription-benchmarks.json">transcription-benchmarks.json</a>.</p>
<p>Source repository: <a href="https://github.com/danielrosehill/Handy-Ubuntu-Setup" target="_blank">danielrosehill/Handy-Ubuntu-Setup</a></p>
</section>
</main>
<footer>
<p>Daniel Rosehill &middot; 2026</p>
</footer>
</body>
</html>