| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="utf-8" /> |
| <meta name="viewport" content="width=device-width, initial-scale=1" /> |
| <title>Audio Understanding Experiment</title> |
| <style> |
| |
| * { margin: 0; padding: 0; box-sizing: border-box; } |
| body { font-family: 'Segoe UI', -apple-system, BlinkMacSystemFont, Roboto, sans-serif; background: #f8f9fa; color: #1a1a2e; } |
| a { color: #4338ca; text-decoration: none; } |
| a:hover { text-decoration: underline; } |
| |
| |
| .topnav { |
| position: fixed; top: 0; left: 0; right: 0; z-index: 100; |
| background: #fff; border-bottom: 1px solid #e2e4e9; |
| padding: 0 2rem; height: 52px; |
| display: flex; align-items: center; gap: 2rem; |
| box-shadow: 0 1px 3px rgba(0,0,0,0.06); |
| } |
| .topnav .site-title { font-size: 0.9rem; font-weight: 700; color: #111827; white-space: nowrap; } |
| .topnav nav { display: flex; gap: 0.25rem; } |
| .topnav nav a { |
| font-size: 0.8rem; font-weight: 500; color: #6b7280; |
| padding: 0.4rem 0.75rem; border-radius: 6px; transition: all 0.12s; |
| text-decoration: none; |
| } |
| .topnav nav a:hover { background: #f3f4f6; color: #111827; text-decoration: none; } |
| .topnav nav a.active { background: #eef2ff; color: #4338ca; } |
| .topnav .doi { margin-left: auto; font-size: 0.68rem; color: #9ca3af; white-space: nowrap; } |
| .topnav .doi a { color: #6b7280; } |
| |
| body { padding-top: 52px; } |
| |
| |
| .audio-bar { |
| position: fixed; top: 52px; left: 0; right: 0; z-index: 99; |
| background: #fff; border-bottom: 1px solid #e2e4e9; |
| padding: 0.4rem 2rem; |
| display: flex; align-items: center; gap: 1rem; |
| height: 44px; |
| } |
| .audio-bar .bar-label { |
| font-size: 0.68rem; font-weight: 600; text-transform: uppercase; |
| letter-spacing: 0.05em; color: #6b7280; white-space: nowrap; |
| } |
| .audio-bar .bar-date { |
| font-size: 0.65rem; color: #9ca3af; white-space: nowrap; |
| } |
| .audio-bar audio { flex: 1; height: 28px; min-width: 0; } |
| |
| .page-body { padding-top: 44px; } |
| |
| @media (max-width: 768px) { |
| .topnav { padding: 0 1rem; gap: 1rem; } |
| .topnav .doi { display: none; } |
| .audio-bar { padding: 0.4rem 1rem; } |
| } |
| |
| |
| .hero { |
| max-width: 820px; margin: 0 auto; padding: 3rem 2rem 2rem; |
| } |
| .hero h1 { font-size: 1.8rem; font-weight: 800; color: #111827; margin-bottom: 0.5rem; line-height: 1.3; } |
| .hero .tagline { font-size: 1rem; color: #6b7280; margin-bottom: 2rem; line-height: 1.6; } |
| .hero .meta-row { |
| display: flex; flex-wrap: wrap; gap: 0.5rem; margin-bottom: 2rem; |
| } |
| .meta-pill { |
| font-size: 0.72rem; font-weight: 500; padding: 0.3rem 0.7rem; |
| border-radius: 20px; background: #f3f4f6; color: #4b5563; border: 1px solid #e5e7eb; |
| } |
| .meta-pill.accent { background: #eef2ff; color: #4338ca; border-color: #c7d2fe; } |
| |
| .section { max-width: 820px; margin: 0 auto; padding: 0 2rem 2.5rem; } |
| .section h2 { font-size: 1.15rem; font-weight: 700; color: #111827; margin-bottom: 0.75rem; } |
| .section p, .section li { font-size: 0.88rem; line-height: 1.75; color: #374151; } |
| .section ul { margin: 0.5rem 0 1rem 1.5rem; } |
| .section li { margin-bottom: 0.35rem; } |
| |
| .card-grid { |
| display: grid; grid-template-columns: repeat(auto-fit, minmax(240px, 1fr)); |
| gap: 1rem; margin: 1.25rem 0; |
| } |
| .nav-card { |
| background: #fff; border: 1px solid #e2e4e9; border-radius: 10px; |
| padding: 1.25rem; transition: all 0.15s; text-decoration: none; color: inherit; |
| box-shadow: 0 1px 2px rgba(0,0,0,0.04); |
| } |
| .nav-card:hover { border-color: #c7d2fe; box-shadow: 0 2px 8px rgba(67,56,202,0.08); text-decoration: none; } |
| .nav-card h3 { font-size: 0.92rem; font-weight: 600; color: #111827; margin-bottom: 0.35rem; } |
| .nav-card p { font-size: 0.78rem; color: #6b7280; line-height: 1.5; } |
| .nav-card .card-num { font-size: 1.5rem; font-weight: 800; color: #4338ca; margin-bottom: 0.5rem; } |
| |
| .cite-box { |
| background: #f9fafb; border: 1px solid #e5e7eb; border-radius: 8px; |
| padding: 1rem 1.25rem; font-size: 0.82rem; line-height: 1.7; color: #374151; |
| margin: 1rem 0; |
| } |
| .cite-box code { font-family: 'SF Mono', 'Fira Code', monospace; font-size: 0.78rem; background: #f3f4f6; padding: 0.1rem 0.3rem; border-radius: 3px; } |
| |
| .footer { |
| max-width: 820px; margin: 0 auto; padding: 2rem; |
| border-top: 1px solid #e5e7eb; font-size: 0.75rem; color: #9ca3af; |
| } |
| </style> |
| </head> |
| <body> |
|
|
| <header class="topnav"> |
| <span class="site-title">Audio Understanding Experiment</span> |
| <nav> |
| <a href="index.html" class="active">Overview</a> |
| <a href="listen.html">Listen</a> |
| <a href="results.html">Results</a> |
| <a href="findings.html">Findings</a> |
| </nav> |
| <span class="doi"><a href="https://doi.org/10.57967/hf/8154">DOI: 10.57967/hf/8154</a></span> |
| </header> |
|
|
| <div class="audio-bar"> |
| <span class="bar-label">Voice Sample</span> |
| <audio controls preload="none" src="voice-sample.flac"></audio> |
| <span class="bar-date">26 Mar 2026 · 20m 54s</span> |
| </div> |
|
|
| <div class="page-body"> |
|
|
| <div class="hero"> |
| <h1>Evaluating Audio Understanding in Multimodal AI</h1> |
| <p class="tagline"> |
| A systematic experiment testing how well Gemini 3.1 Flash Lite can analyse a 20-minute voice recording. |
| The test set defines 137 structured prompts spanning speaker analysis, emotion detection, audio engineering, |
| forensic audio, demographics, and more; 49 of them were run against the recording. |
| </p> |
| <div class="meta-row"> |
| <span class="meta-pill accent">49 completed evaluations</span> |
| <span class="meta-pill accent">13 categories tested</span> |
| <span class="meta-pill accent">137 total prompts</span> |
| <span class="meta-pill">Model: Gemini 3.1 Flash Lite</span> |
| <span class="meta-pill">Date: 26 March 2026</span> |
| <span class="meta-pill">Audio: FLAC mono 24kHz, 20m 54s</span> |
| </div> |
| </div> |
|
|
| <div class="section"> |
| <h2>Objectives</h2> |
| <ul> |
| <li><strong>Breadth of capability:</strong> How many distinct audio analysis tasks can a multimodal model meaningfully perform from a single voice recording?</li> |
| <li><strong>Acoustic vs. content inference:</strong> Can the model distinguish between what it hears in the audio signal and what it understands from the speech content?</li> |
| <li><strong>Safety boundaries:</strong> Where does the model draw ethical lines on sensitive inferences (health, demographics, deception)?</li> |
| <li><strong>Practical utility:</strong> Are the outputs actionable for real-world use cases such as audio production, voice cloning, and speech coaching?</li> |
| <li><strong>Internal consistency:</strong> Does the model maintain a coherent characterisation of the speaker across dozens of independent prompts?</li> |
| </ul> |
| </div> |
|
|
| <div class="section"> |
| <h2>Methodology</h2> |
| <p> |
| A single freeform voice recording was made by Daniel Rosehill on a OnePlus Nord 3 5G phone in HQ mode (captured as |
| 44.1 kHz WAV, then converted to 24 kHz mono FLAC). The recording is unscripted and ranges from voice cloning and TTS |
| technology to personal background and current events. The speaker was fatigued from disrupted sleep, which provides |
| a natural test of the model's ability to detect vocal state. |
| </p> |
| <p> |
| In total, 137 test prompts were designed across 22 categories. Of these, 49 prompts (covering 13 categories) were |
| implemented with full prompt text and run against the audio using Gemini 3.1 Flash Lite via the Google Generative AI API. |
| Each prompt was run independently, with the full audio file as context. The remaining 88 prompts are catalogued as suggested extensions. |
| </p> |
| </div> |
|
|
| <div class="section"> |
| <h2>The Voice Sample</h2> |
| <p> |
| The recording features a male speaker in his late 30s with an Irish accent (Cork origin) who has lived in Jerusalem for about 11 years. |
| Voice type: bass/low baritone (median F0 ~110 Hz). Speaking rate: ~169 WPM. The recording was made in an untreated room while the speaker was pacing. |
| A timestamped transcript (AssemblyAI, 97.4% confidence) and a detailed acoustic analysis (pitch, formants, signal levels) |
| are included in the dataset. |
| </p> |
| </div> |
|
|
| <div class="section"> |
| <h2>Explore</h2> |
| <div class="card-grid"> |
| <a href="listen.html" class="nav-card"> |
| <div class="card-num">20:54</div> |
| <h3>Listen to the Audio</h3> |
| <p>Full waveform player with transcript and acoustic profile.</p> |
| </a> |
| <a href="results.html" class="nav-card"> |
| <div class="card-num">49</div> |
| <h3>Browse Results</h3> |
| <p>All prompt-output pairs organised by category with the original prompts.</p> |
| </a> |
| <a href="findings.html" class="nav-card"> |
| <div class="card-num">10</div> |
| <h3>Key Findings</h3> |
| <p>Cross-cutting analysis of model capabilities, limitations, and safety boundaries.</p> |
| </a> |
| <a href="https://huggingface.co/datasets/danielrosehill/Audio-Understanding-Test-Set" class="nav-card" target="_blank" rel="noopener noreferrer"> |
| <div class="card-num" style="font-size:1.2rem;">HF</div> |
| <h3>Download Dataset</h3> |
| <p>JSONL prompts, results, audio files, transcript, and acoustic analysis on Hugging Face.</p> |
| </a> |
| </div> |
| </div> |
|
|
| <div class="section"> |
| <h2>Cite This Work</h2> |
| <div class="cite-box"> |
| Rosehill, D. (2026). <em>Audio Understanding Test Set</em>. Hugging Face. |
| <a href="https://doi.org/10.57967/hf/8154">https://doi.org/10.57967/hf/8154</a> |
| </div> |
| <p style="margin-top:0.75rem;"> |
| Dataset: <a href="https://huggingface.co/datasets/danielrosehill/Audio-Understanding-Test-Set">danielrosehill/Audio-Understanding-Test-Set</a> |
| · Source: <a href="https://github.com/danielrosehill/Audio-Understanding-Test-Prompts">GitHub</a> |
| · DOI: <a href="https://doi.org/10.57967/hf/8154">10.57967/hf/8154</a> |
| </p> |
| </div> |
|
|
| <div class="footer"> |
| Created by Daniel Rosehill with assistance from Claude (Opus 4.6). Licensed under CC-BY-4.0. |
| </div> |
|
|
| </div> |
| </body> |
| </html> |