<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Findings — Audio Understanding Experiment</title>
<style>

* { margin: 0; padding: 0; box-sizing: border-box; }
body { font-family: 'Segoe UI', -apple-system, BlinkMacSystemFont, Roboto, sans-serif; background: #f8f9fa; color: #1a1a2e; }
a { color: #4338ca; text-decoration: none; }
a:hover { text-decoration: underline; }


.topnav {
position: fixed; top: 0; left: 0; right: 0; z-index: 100;
background: #fff; border-bottom: 1px solid #e2e4e9;
padding: 0 2rem; height: 52px;
display: flex; align-items: center; gap: 2rem;
box-shadow: 0 1px 3px rgba(0,0,0,0.06);
}
.topnav .site-title { font-size: 0.9rem; font-weight: 700; color: #111827; white-space: nowrap; }
.topnav nav { display: flex; gap: 0.25rem; }
.topnav nav a {
font-size: 0.8rem; font-weight: 500; color: #6b7280;
padding: 0.4rem 0.75rem; border-radius: 6px; transition: all 0.12s;
text-decoration: none;
}
.topnav nav a:hover { background: #f3f4f6; color: #111827; text-decoration: none; }
.topnav nav a.active { background: #eef2ff; color: #4338ca; }
.topnav .doi { margin-left: auto; font-size: 0.68rem; color: #9ca3af; white-space: nowrap; }
.topnav .doi a { color: #6b7280; }

body { padding-top: 52px; }


.audio-bar {
position: fixed; top: 52px; left: 0; right: 0; z-index: 99;
background: #fff; border-bottom: 1px solid #e2e4e9;
padding: 0.4rem 2rem;
display: flex; align-items: center; gap: 1rem;
height: 44px;
}
.audio-bar .bar-label {
font-size: 0.68rem; font-weight: 600; text-transform: uppercase;
letter-spacing: 0.05em; color: #6b7280; white-space: nowrap;
}
.audio-bar .bar-date {
font-size: 0.65rem; color: #9ca3af; white-space: nowrap;
}
.audio-bar audio { flex: 1; height: 28px; min-width: 0; }

.page-body { padding-top: 44px; }

@media (max-width: 768px) {
.topnav { padding: 0 1rem; gap: 1rem; }
.topnav .doi { display: none; }
.audio-bar { padding: 0.4rem 1rem; }
}


.findings { max-width: 820px; margin: 0 auto; padding: 2rem; }
.findings h1 { font-size: 1.4rem; font-weight: 700; color: #111827; margin-bottom: 0.3rem; }
.findings .sub { font-size: 0.82rem; color: #6b7280; margin-bottom: 2rem; }
.findings h2 {
font-size: 1.05rem; font-weight: 700; color: #111827;
margin: 2rem 0 0.75rem; padding-top: 1.5rem; border-top: 1px solid #e5e7eb;
}
.findings h2:first-of-type { border-top: none; padding-top: 0; }
.findings h3 { font-size: 0.92rem; font-weight: 600; color: #374151; margin: 1.25rem 0 0.5rem; }
.findings p { font-size: 0.86rem; line-height: 1.8; color: #374151; margin-bottom: 0.75rem; }
.findings ul { margin: 0.5rem 0 1rem 1.5rem; }
.findings li { font-size: 0.86rem; line-height: 1.75; color: #374151; margin-bottom: 0.4rem; }
.findings strong { color: #111827; }

.finding-card {
background: #fff; border: 1px solid #e2e4e9; border-radius: 10px;
padding: 1.15rem 1.25rem; margin: 1rem 0;
box-shadow: 0 1px 2px rgba(0,0,0,0.04);
}
.finding-card.highlight { border-left: 3px solid #4338ca; }
.finding-card.caution { border-left: 3px solid #d97706; }
.finding-card h4 { font-size: 0.85rem; font-weight: 600; color: #111827; margin-bottom: 0.4rem; }
.finding-card p { margin-bottom: 0.4rem; }

.cite-box {
background: #f9fafb; border: 1px solid #e5e7eb; border-radius: 8px;
padding: 1rem 1.25rem; font-size: 0.82rem; line-height: 1.7; color: #374151;
margin: 1rem 0;
}
</style>
</head>
<body>


<header class="topnav">
<span class="site-title">Audio Understanding Experiment</span>
<nav>
<a href="index.html">Overview</a>
<a href="listen.html">Listen</a>
<a href="results.html">Results</a>
<a href="findings.html" class="active">Findings</a>
</nav>
<span class="doi"><a href="https://doi.org/10.57967/hf/8154">DOI: 10.57967/hf/8154</a></span>
</header>


<div class="audio-bar">
<span class="bar-label">Voice Sample</span>
<audio controls preload="none" src="voice-sample.flac"></audio>
<span class="bar-date">26 Mar 2026 · 20m 54s</span>
</div>


<div class="page-body">
<div class="findings">

<h1>Key Findings</h1>
<p class="sub">Cross-cutting analysis from 49 prompt-output evaluations across 13 categories.</p>


<h2>1. Internal Consistency Is Remarkably High</h2>
<p>
Across 49 independent prompts, the model maintained a stable, coherent characterisation of the speaker:
Irish male, late 30s, fatigued, conversational, technically articulate, unscripted. No contradictions
were detected between outputs. Accent identification (Irish, Cork origin, with international influence)
was consistent across the accent, accent-expert, hybrid-accent-analysis, and phonetic-analysis prompts.
</p>


<h2>2. Strongest Performance Areas</h2>


<div class="finding-card highlight">
<h4>Accent &amp; Speaker Analysis</h4>
<p>The model correctly identified the Irish accent across every relevant prompt, with the expert analysis
producing forensic-linguistics-grade output referencing specific lexical vowel sets (e.g., the GOAT set), rhoticity patterns,
and prosodic contours. The hybrid accent analysis detected American/international influence from years abroad.</p>
</div>


<div class="finding-card highlight">
<h4>Audio Engineering &amp; Production</h4>
<p>The EQ recommendation was the most practically useful output in the entire set: a full signal chain
(high-pass at 80–100 Hz, 250 Hz cut, 3–5 kHz presence boost, de-esser, 3:1 compressor, limiter at −1.0 dB)
that could be applied directly in a DAW. The single-fix distillation correctly prioritised the high-pass filter.</p>
</div>
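<p>As an illustration of how the first stage of that chain could be realised in code, here is a minimal sketch using SciPy's Butterworth filter design. The 90 Hz cutoff and fourth-order response are our assumptions within the 80–100 Hz range the model suggested; this sketch is not part of the model's output.</p>

```python
# Illustrative sketch only: a ~90 Hz high-pass, the first stage of the
# recommended chain. The cutoff and order are assumptions, not values
# taken from the model's output beyond its suggested 80-100 Hz range.
import numpy as np
from scipy.signal import butter, sosfilt

def high_pass(audio: np.ndarray, fs: float, cutoff: float = 90.0, order: int = 4) -> np.ndarray:
    """Apply a Butterworth high-pass filter to remove low-frequency rumble."""
    sos = butter(order, cutoff, btype="highpass", fs=fs, output="sos")
    return sosfilt(sos, audio)
```

<p>The later stages (250 Hz cut, presence boost, de-esser, compression, limiting) are more naturally applied in a DAW, so only the high-pass is sketched here.</p>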


<div class="finding-card highlight">
<h4>Emotional Tone Tracking</h4>
<p>The model accurately identified the baseline state as fatigued and overwhelmed, with shifts toward enthusiasm
during technical discussion. The valence-arousal mapping applied Russell's circumplex model to produce structured,
time-coded emotional trajectories, a format genuinely useful for affective computing research.</p>
</div>
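<p>To make the valence-arousal format concrete, the following hypothetical sketch maps scores onto quadrants of Russell's circumplex model. The trajectory values, timestamps, and quadrant labels are invented for illustration and are not taken from the model's output.</p>

```python
# Hypothetical illustration of Russell's circumplex mapping; the
# trajectory values below are invented for this sketch, not model output.
def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair, each in [-1, 1], to a quadrant label."""
    if valence >= 0:
        return "excited/elated" if arousal >= 0 else "calm/content"
    return "tense/distressed" if arousal >= 0 else "fatigued/depressed"

# A time-coded trajectory in the style described above (values invented).
trajectory = [
    {"t": "00:00", "valence": -0.3, "arousal": -0.4},  # fatigued baseline
    {"t": "07:30", "valence": 0.5, "arousal": 0.4},    # technical enthusiasm
]
labels = [circumplex_quadrant(p["valence"], p["arousal"]) for p in trajectory]
```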


<h2>3. The Acoustic vs. Content Inference Problem</h2>


<div class="finding-card caution">
<h4>The model does not clearly separate what it hears from what it understands</h4>
<p>Many outputs that claim to be based on "acoustic features" or "spectral analysis" appear to derive their conclusions
primarily from speech content. Geographic location inference identified Jerusalem because the speaker said
"I live in Jerusalem," not because of ambient audio cues. Age detection caught the speaker's self-correction ("I am 36, no, 37")
rather than performing F0-based age estimation.</p>
<p>This is the single biggest caveat for claims about the model's <em>audio understanding</em> capabilities
versus its <em>language understanding</em> capabilities.</p>
</div>
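<p>For contrast, genuine F0-based analysis operates on the waveform itself. The following minimal autocorrelation sketch is our illustration of what such an acoustic measurement involves; it makes no claim about the model's internals.</p>

```python
# Minimal autocorrelation-based F0 estimator, shown only to illustrate
# what acoustic (rather than content-based) pitch analysis involves.
import numpy as np

def estimate_f0(signal: np.ndarray, fs: float, fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Estimate fundamental frequency (Hz) from the autocorrelation peak."""
    sig = signal - signal.mean()
    # Keep only non-negative lags of the autocorrelation.
    corr = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    # Search for the peak within the plausible voice-pitch lag range.
    lag_lo, lag_hi = int(fs / fmax), int(fs / fmin)
    lag = lag_lo + int(np.argmax(corr[lag_lo:lag_hi]))
    return fs / lag
```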


<h2>4. Fabrication Risk in Technical Claims</h2>
<p>
Outputs referencing "spectral analysis," "formant spacing," and "fundamental frequency" use these terms
plausibly but provide no actual measurements. The deepfake detection output claimed 98% confidence
and cited "jitter in high-frequency regions," but it is unclear whether the model performed real signal
processing or generated technically flavoured prose. The height estimation (178 cm) cited "formant spacing"
evidence described only in vague terms, and the figure suspiciously matches the statistical mean for adult males.
</p>


<h2>5. Safety Guardrails Are Category-Specific</h2>


<div class="finding-card caution">
<h4>Asymmetric willingness across sensitive domains</h4>
<p>The model freely assessed hydration (citing mouth clicks as dehydration markers), smoking status,
inebriation, drug influence, height, education level, and even deception. But it <strong>completely refused</strong>
to engage on mental health inference, stating "it is not possible to determine if the speaker has a diagnosed
mental health condition" and redirecting to professional evaluation.</p>
<p>This reveals deliberate, category-specific safety training rather than a blanket policy on health-related inference.
The boundary is drawn specifically around psychiatric conditions.</p>
</div>


<h2>6. Adversarial Prompts Handled Well</h2>
<p>
The true-age-detection prompt told the model that "the speaker has been instructed to lie about their age"
and asked it to determine the true age. The model saw through this framing, recognising that the speaker's self-correction
("I am 36, no, 37") reflected genuine confusion, not deception. The deception and insincerity detection prompts both
correctly found no evidence of dishonesty, consistent with the speaker's stated intent to be authentic.
</p>
<h2>7. Category-by-Category Summary</h2>


<h3>Speaker Analysis (10 outputs)</h3>
<p>The strongest category. Accent identification, phonetic analysis, speech patterns, and voice profiling
were all detailed and internally consistent. The escalating voice description prompt was partially misunderstood
(it produced a near-verbatim transcript instead of an analytical escalation).</p>


<h3>Emotion &amp; Sentiment (5 outputs)</h3>
<p>Accurate baseline detection (fatigue plus enthusiasm shifts). Timestamped emotional tracking was structured
and plausible, though the timestamps cannot be verified against actual audio events without ground truth.</p>


<h3>Audio Engineering (6 outputs)</h3>
<p>Highly practical. EQ and processing recommendations were actionable. Microphone type identification was
hedged but reasonable. Hardware recommendations (microphones, headsets) may be driven by product knowledge rather than
acoustic science.</p>


<h3>Environment (6 outputs)</h3>
<p>Indoor/outdoor classification was correct. Room acoustics estimation was suspiciously precise (10'×10'×8').
Weather inference was honestly refused (no acoustic evidence). Geographic location relied on speech content.</p>


<h3>Speaker Demographics (5 outputs)</h3>
<p>Gender and age detection were straightforward and correct. Height estimation acknowledged its own unreliability.
Education level inference was reasonable. Smoking status assessment was defensible.</p>


<h3>Health &amp; Wellness (4 outputs)</h3>
<p>Hydration assessment was surprisingly specific and grounded in real voice science. Inebriation and drug influence
were correctly ruled out. Mental health was the only complete refusal in the test set.</p>


<h3>Forensic Audio (3 outputs)</h3>
<p>Deepfake detection returned a verdict of "authentic" with 98% confidence. All three prompts returned "nothing detected,"
which is correct but means the test set lacks adversarial samples for measuring false-negative rates.</p>


<h3>Other Categories</h3>
<p>Voice cloning assessments were practical. Speech metrics provided useful coaching. The language learning prompts
(Hebrew phonetic difficulty, easiest foreign language) were linguistically sound. The celebrity voice match
responsibly returned no match rather than forcing one.</p>


<h2>8. Limitations of This Experiment</h2>
<ul>
<li><strong>Single speaker, single recording:</strong> All findings come from one voice sample in one acoustic environment. Generalisability is unknown.</li>
<li><strong>No ground truth for most outputs:</strong> Beyond basic facts (age, gender, location), most model claims cannot be verified without specialised equipment or expert assessment.</li>
<li><strong>No adversarial audio:</strong> The test set lacks synthetic, spliced, or manipulated audio to test false-negative rates on forensic prompts.</li>
<li><strong>Single model:</strong> Only Gemini 3.1 Flash Lite was tested. Cross-model comparison would strengthen the findings.</li>
<li><strong>Prompt independence assumed:</strong> Each prompt was run independently; cumulative context effects were not tested.</li>
</ul>


<h2>9. Dataset &amp; Citation</h2>
<p>
The full dataset (prompts, outputs, audio, transcript, acoustic analysis) is available on Hugging Face:
<a href="https://huggingface.co/datasets/danielrosehill/Audio-Understanding-Test-Set">danielrosehill/Audio-Understanding-Test-Set</a>
</p>
<div class="cite-box">
Rosehill, D. (2026). <em>Audio Understanding Test Set</em>. Hugging Face.
<a href="https://doi.org/10.57967/hf/8154">https://doi.org/10.57967/hf/8154</a>
</div>


</div>
</div>
</body>
</html>