	<!doctype html>
	<html lang="en">
	<head>
	<meta charset="utf-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<title>Single-Shot ASR Model Evaluation</title>
	<link rel="stylesheet" href="style.css" />
	</head>
	<body>

	<header>
	<h1>Single-Shot ASR Model Evaluation</h1>
	<p class="subtitle">13 local speech-to-text models tested on an AMD GPU with Handy</p>
	</header>

	<main>

	<section class="card">
	<h2>Experiment</h2>
	<p>This is a single-take benchmark of every transcription model available in <a href="https://github.com/pokey/handy" target="_blank">Handy</a> (v0.8.1), a local speech-to-text tool for Linux. Each model transcribed the same test script in one continuous recording. A deliberate 5-second silence in the middle of the script tests voice activity detection (VAD) handling and resistance to hallucination.</p>

	<h3>Test Script</h3>
	<blockquote>
	I had scrambled eggs and toast for breakfast this morning. The coffee was a bit too strong but I drank it anyway. <strong>[5 second pause]</strong> The capital of France is Paris. It sits on the River Seine and has a population of about two million people in the city itself.
	</blockquote>
	<p>The two halves are deliberately unrelated (personal anecdote vs. factual statement). Any text bridging them during the pause would indicate hallucination.</p>
	</section>

	<section class="card">
	<h2>Test Environment</h2>
	<table>
	<tr><td><strong>Application</strong></td><td>Handy 0.8.1</td></tr>
	<tr><td><strong>Inference</strong></td><td>ONNX Runtime (auto) + Whisper.cpp (auto)</td></tr>
	<tr><td><strong>GPU</strong></td><td>AMD Radeon RX 7800 XT (Navi 32, 12 GB VRAM)</td></tr>
	<tr><td><strong>CPU</strong></td><td>12th Gen Intel Core i7-12700F</td></tr>
	<tr><td><strong>OS</strong></td><td>Ubuntu 25.10, kernel 6.17.0-19-generic</td></tr>
	<tr><td><strong>Date</strong></td><td>2026-03-29</td></tr>
	</table>
	</section>

	<section class="card">
	<h2>Results</h2>

	<h3>Rankings</h3>
	<table class="results">
	<thead>
	<tr>
	<th>#</th>
	<th>Model</th>
	<th>Inference time</th>
	<th><abbr title="Real-time factor">RTF</abbr></th>
	<th>Errors</th>
	<th>Hallucination</th>
	</tr>
	</thead>
	<tbody>
	<tr class="perfect"><td>1</td><td>Whisper Small</td><td>976 ms</td><td>0.07x</td><td>0</td><td>No</td></tr>
	<tr class="perfect"><td>2</td><td>Parakeet V2</td><td>1,354 ms</td><td>0.09x</td><td>0</td><td>No</td></tr>
	<tr class="perfect"><td>3</td><td>Canary 180M Flash</td><td>2,223 ms</td><td>0.17x</td><td>0</td><td>No</td></tr>
	<tr class="perfect"><td>4</td><td>Moonshine Base</td><td>2,301 ms</td><td>0.15x</td><td>0</td><td>No</td></tr>
	<tr class="minor"><td>5</td><td>Parakeet V3 (INT8)</td><td>1,378 ms</td><td>0.10x</td><td>1</td><td>No</td></tr>
	<tr class="minor"><td>6</td><td>Whisper Turbo</td><td>1,112 ms</td><td>0.09x</td><td>2</td><td>No</td></tr>
	<tr class="minor"><td>7</td><td>Canary 1B v2</td><td>2,473 ms</td><td>0.17x</td><td>1</td><td>No</td></tr>
	<tr class="minor"><td>8</td><td>Moonshine Small Streaming</td><td>4,140 ms</td><td>0.33x</td><td>1</td><td>No</td></tr>
	<tr class="minor"><td>9</td><td>Moonshine Tiny Streaming</td><td>3,414 ms</td><td>0.25x</td><td>2</td><td>No</td></tr>
	<tr class="significant"><td>10</td><td>Whisper Medium</td><td>1,694 ms</td><td>0.13x</td><td>3</td><td>No</td></tr>
	<tr class="significant"><td>11</td><td>Whisper Large</td><td>2,780 ms</td><td>0.22x</td><td>3</td><td>No</td></tr>
	<tr class="significant"><td>12</td><td>Breeze ASR</td><td>2,626 ms</td><td>0.20x</td><td>3</td><td>No</td></tr>
	<tr class="significant"><td>13</td><td>SenseVoice (INT8)</td><td>145 ms</td><td>0.01x</td><td>3</td><td>No</td></tr>
	</tbody>
	</table>
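	<p>The RTF column appears to follow the usual definition of real-time factor: inference time divided by the duration of the audio, so values below 1.0 mean faster than real time. The page does not list the recording lengths, so the Python sketch below (function names are illustrative and not part of Handy) also back-computes the audio duration that each published row implies.</p>
	<pre>
```python
def real_time_factor(inference_ms, audio_seconds):
    """Processing time divided by audio duration (assumed definition of RTF)."""
    return (inference_ms / 1000.0) / audio_seconds


def implied_audio_seconds(inference_ms, rtf):
    """Audio duration implied by a published inference time and RTF."""
    return (inference_ms / 1000.0) / rtf


# (inference time in ms, RTF) copied from three rows of the rankings table.
published = {
    "Whisper Small": (976, 0.07),
    "Parakeet V2": (1354, 0.09),
    "SenseVoice (INT8)": (145, 0.01),
}

for model, (ms, rtf) in published.items():
    seconds = implied_audio_seconds(ms, rtf)
    print(f"{model}: roughly {seconds:.1f} s of audio")
```
	</pre>
	<p>Because the published RTF values are rounded to two decimals, the implied durations are approximate. For these rows they cluster around 14 to 15 seconds, which is consistent with a short script plus a 5-second pause.</p>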
	</section>

	<section class="card">
	<h2>Charts</h2>
	<figure>
	<img src="benchmark-charts.png" alt="Inference speed and accuracy bar charts" />
	<figcaption>Inference speed and transcription accuracy across all 13 models</figcaption>
	</figure>
	<figure>
	<img src="speed-vs-accuracy.png" alt="Speed vs accuracy scatter plot" />
	<figcaption>Speed vs accuracy tradeoff — ideal models are in the bottom-left</figcaption>
	</figure>
	</section>

	<section class="card">
	<h2>Key Findings</h2>

	<h3>VAD &amp; Hallucination</h3>
	<p>All 13 models handled the 5-second silence cleanly. No model hallucinated words, repeated phrases, or invented bridging text during the pause.</p>

	<h3>Bigger is not always better</h3>
	<p>In this single take, Whisper Small (976 ms, 0 errors) outperformed both Whisper Medium (1,694 ms, 3 errors) and Whisper Large (2,780 ms, 3 errors) on this GPU. The larger Whisper models were slower and made more errors on the test script.</p>

	<h3>Common Error Patterns</h3>
	<ul>
	<li><strong>Capitalisation:</strong> 7 models wrote "river Seine" instead of "River Seine"</li>
	<li><strong>Numerals:</strong> 5 models output "2 million" instead of "two million"</li>
	<li><strong>Punctuation:</strong> SenseVoice and Breeze ASR replaced sentence-ending periods with commas</li>
	<li><strong>Mishearing:</strong> Moonshine Tiny turned "drank it anyway" into "don't get any way"; SenseVoice heard "Seine" as "sand"</li>
	<li><strong>Dropped words:</strong> Canary 1B v2 lost "people in" from the final sentence</li>
	</ul>
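	<p>Error categories like these can be approximated automatically by aligning a model's output against the reference script word by word. The following is a minimal Python sketch using the standard <code>difflib</code> module. The category rules, helper names, and sample output are assumptions made for illustration; the page does not say how its error counts were produced.</p>
	<pre>
```python
import difflib
import string

# Second half of the test script, used as the reference transcript.
REFERENCE = (
    "The capital of France is Paris. It sits on the River Seine and has "
    "a population of about two million people in the city itself."
)


def bare(word):
    """Strip leading and trailing punctuation from a token."""
    return word.strip(string.punctuation)


def classify_pair(ref_word, hyp_word):
    """Label a single substituted word with a coarse error category."""
    if ref_word.lower() == hyp_word.lower():
        return "capitalisation"
    if bare(ref_word).lower() == bare(hyp_word).lower():
        return "punctuation"
    if bare(hyp_word).isdigit():
        return "numerals"
    return "mishearing"


def classify(reference, hypothesis):
    """Return (category, reference text, hypothesis text) per difference."""
    ref, hyp = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    findings = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_span, hyp_span = ref[i1:i2], hyp[j1:j2]
        if tag == "equal":
            continue
        if tag == "delete":
            findings.append(("dropped words", " ".join(ref_span), ""))
        elif tag == "insert":
            findings.append(("inserted words", "", " ".join(hyp_span)))
        elif len(ref_span) == len(hyp_span):
            # Same number of words on both sides: compare them pairwise.
            for r, h in zip(ref_span, hyp_span):
                findings.append((classify_pair(r, h), r, h))
        else:
            findings.append(
                ("mishearing", " ".join(ref_span), " ".join(hyp_span))
            )
    return findings


# Hypothetical model output, constructed for illustration only.
sample = REFERENCE.replace("River", "river").replace("two", "2")
for finding in classify(REFERENCE, sample):
    print(finding)
```
	</pre>
	<p>Whether a lowercase "river" or a digit "2" should count as an error at all is a scoring decision. Keeping the categories separate makes it easy to report strict and lenient counts side by side.</p>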

	<h3>Recommendation</h3>
	<p><strong>Whisper Small</strong> is the best overall choice for this hardware: it produced the fastest error-free transcription, in under 1 second. <strong>Parakeet V2</strong> is a strong runner-up. For users who want correct proper noun capitalisation and spelled-out numbers, <strong>Canary 180M Flash</strong> is the most pedantically accurate, though slower at 2,223 ms.</p>
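	<p>The "fastest error-free model" rule behind this recommendation can be checked directly against the rankings table. The Python sketch below copies the published inference times and error counts; the function names are illustrative, and the errors-then-time ordering is an assumption rather than the page's stated ranking method.</p>
	<pre>
```python
# (model, inference time in ms, error count) copied from the rankings table.
ROWS = [
    ("Whisper Small", 976, 0),
    ("Parakeet V2", 1354, 0),
    ("Canary 180M Flash", 2223, 0),
    ("Moonshine Base", 2301, 0),
    ("Parakeet V3 (INT8)", 1378, 1),
    ("Whisper Turbo", 1112, 2),
    ("Canary 1B v2", 2473, 1),
    ("Moonshine Small Streaming", 4140, 1),
    ("Moonshine Tiny Streaming", 3414, 2),
    ("Whisper Medium", 1694, 3),
    ("Whisper Large", 2780, 3),
    ("Breeze ASR", 2626, 3),
    ("SenseVoice (INT8)", 145, 3),
]


def fastest_error_free(rows):
    """Return the quickest row among those with zero errors."""
    error_free = [row for row in rows if row[2] == 0]
    return min(error_free, key=lambda row: row[1])


def errors_then_time(rows):
    """Order rows by error count first, then by inference time."""
    return sorted(rows, key=lambda row: (row[2], row[1]))


print(fastest_error_free(ROWS)[0])
for model, ms, errors in errors_then_time(ROWS)[:4]:
    print(model, ms, errors)
```
	</pre>
	<p>A strict errors-then-time sort reproduces the top four positions but not the rest of the published order (Whisper Turbo, with 2 errors, is ranked above Canary 1B v2, with 1), so the published ranking presumably also weighs other factors such as error severity.</p>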
	</section>

	<section class="card">
	<h2>Data</h2>
	<p>Raw benchmark data is available as <a href="transcription-benchmarks.json">transcription-benchmarks.json</a>.</p>
	<p>Source repository: <a href="https://github.com/danielrosehill/Handy-Ubuntu-Setup" target="_blank">danielrosehill/Handy-Ubuntu-Setup</a></p>
	</section>

	</main>

	<footer>
	<p>Daniel Rosehill · 2026</p>
	</footer>

	</body>
	</html>