<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Findings — Audio Understanding Experiment</title>
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body { font-family: 'Segoe UI', -apple-system, BlinkMacSystemFont, Roboto, sans-serif; background: #f8f9fa; color: #1a1a2e; }
a { color: #4338ca; text-decoration: none; }
a:hover { text-decoration: underline; }
/* Top nav */
.topnav {
position: fixed; top: 0; left: 0; right: 0; z-index: 100;
background: #fff; border-bottom: 1px solid #e2e4e9;
padding: 0 2rem; height: 52px;
display: flex; align-items: center; gap: 2rem;
box-shadow: 0 1px 3px rgba(0,0,0,0.06);
}
.topnav .site-title { font-size: 0.9rem; font-weight: 700; color: #111827; white-space: nowrap; }
.topnav nav { display: flex; gap: 0.25rem; }
.topnav nav a {
font-size: 0.8rem; font-weight: 500; color: #6b7280;
padding: 0.4rem 0.75rem; border-radius: 6px; transition: all 0.12s;
text-decoration: none;
}
.topnav nav a:hover { background: #f3f4f6; color: #111827; text-decoration: none; }
.topnav nav a.active { background: #eef2ff; color: #4338ca; }
.topnav .doi { margin-left: auto; font-size: 0.68rem; color: #9ca3af; white-space: nowrap; }
.topnav .doi a { color: #6b7280; }
body { padding-top: 52px; }
/* Audio bar — persistent across all pages */
.audio-bar {
position: fixed; top: 52px; left: 0; right: 0; z-index: 99;
background: #fff; border-bottom: 1px solid #e2e4e9;
padding: 0.4rem 2rem;
display: flex; align-items: center; gap: 1rem;
height: 44px;
}
.audio-bar .bar-label {
font-size: 0.68rem; font-weight: 600; text-transform: uppercase;
letter-spacing: 0.05em; color: #6b7280; white-space: nowrap;
}
.audio-bar .bar-date {
font-size: 0.65rem; color: #9ca3af; white-space: nowrap;
}
.audio-bar audio { flex: 1; height: 28px; min-width: 0; }
.page-body { padding-top: 44px; }
@media (max-width: 768px) {
.topnav { padding: 0 1rem; gap: 1rem; }
.topnav .doi { display: none; }
.audio-bar { padding: 0.4rem 1rem; }
}
.findings { max-width: 820px; margin: 0 auto; padding: 2rem; }
.findings h1 { font-size: 1.4rem; font-weight: 700; color: #111827; margin-bottom: 0.3rem; }
.findings .sub { font-size: 0.82rem; color: #6b7280; margin-bottom: 2rem; }
.findings h2 {
font-size: 1.05rem; font-weight: 700; color: #111827;
margin: 2rem 0 0.75rem; padding-top: 1.5rem; border-top: 1px solid #e5e7eb;
}
.findings h2:first-of-type { border-top: none; padding-top: 0; }
.findings h3 { font-size: 0.92rem; font-weight: 600; color: #374151; margin: 1.25rem 0 0.5rem; }
.findings p { font-size: 0.86rem; line-height: 1.8; color: #374151; margin-bottom: 0.75rem; }
.findings ul { margin: 0.5rem 0 1rem 1.5rem; }
.findings li { font-size: 0.86rem; line-height: 1.75; color: #374151; margin-bottom: 0.4rem; }
.findings strong { color: #111827; }
.finding-card {
background: #fff; border: 1px solid #e2e4e9; border-radius: 10px;
padding: 1.15rem 1.25rem; margin: 1rem 0;
box-shadow: 0 1px 2px rgba(0,0,0,0.04);
}
.finding-card.highlight { border-left: 3px solid #4338ca; }
.finding-card.caution { border-left: 3px solid #d97706; }
.finding-card h4 { font-size: 0.85rem; font-weight: 600; color: #111827; margin-bottom: 0.4rem; }
.finding-card p { margin-bottom: 0.4rem; }
.cite-box {
background: #f9fafb; border: 1px solid #e5e7eb; border-radius: 8px;
padding: 1rem 1.25rem; font-size: 0.82rem; line-height: 1.7; color: #374151;
margin: 1rem 0;
}
</style>
</head>
<body>
<header class="topnav">
<span class="site-title">Audio Understanding Experiment</span>
<nav>
<a href="index.html">Overview</a>
<a href="listen.html">Listen</a>
<a href="results.html">Results</a>
<a href="findings.html" class="active">Findings</a>
</nav>
<span class="doi"><a href="https://doi.org/10.57967/hf/8154">DOI: 10.57967/hf/8154</a></span>
</header>
<div class="audio-bar">
<span class="bar-label">Voice Sample</span>
<audio controls preload="none" src="voice-sample.flac"></audio>
<span class="bar-date">26 Mar 2026 &middot; 20m 54s</span>
</div>
<div class="page-body">
<div class="findings">
<h1>Key Findings</h1>
<p class="sub">Cross-cutting analysis from 49 prompt-output evaluations across 13 categories.</p>
<h2>1. Internal Consistency Is Remarkably High</h2>
<p>
Across 49 independent prompts, the model maintained a stable, coherent characterisation of the speaker:
Irish male, late 30s, fatigued, conversational, technically articulate, unscripted. No contradictions
were detected between outputs. Accent identification (Irish, Cork origin, with international influence)
was consistent across the accent, accent-expert, hybrid-accent-analysis, and phonetic-analysis prompts.
</p>
<h2>2. Strongest Performance Areas</h2>
<div class="finding-card highlight">
<h4>Accent &amp; Speaker Analysis</h4>
<p>The model correctly identified the Irish accent across every relevant prompt, with the expert analysis
producing forensic-linguistics-grade output referencing specific vowel sets (GOAT set), rhoticity patterns,
and prosodic contours. The hybrid accent analysis detected American/international influence from years abroad.</p>
</div>
<div class="finding-card highlight">
<h4>Audio Engineering &amp; Production</h4>
<p>The EQ recommendation was the most practically useful output in the entire set: a full signal chain
(high-pass at 80&ndash;100Hz, 250Hz cut, 3&ndash;5kHz presence boost, de-esser, 3:1 compressor, limiter at &minus;1.0dB)
that could be directly applied in a DAW. The single-fix distillation correctly prioritised the high-pass filter.</p>
</div>
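<p>The chain above can also be reproduced offline. The following is a minimal sketch of the first stage only (the high-pass filter), assuming Python with NumPy/SciPy and a 48kHz sample rate; it is an illustration, not the experiment's actual tooling:</p>

```python
import numpy as np
from scipy.signal import butter, sosfilt

def high_pass(samples: np.ndarray, sr: int = 48_000, cutoff: float = 90.0) -> np.ndarray:
    """4th-order Butterworth high-pass; the cutoff sits inside the 80-100 Hz
    range recommended above, removing rumble below the voice's fundamental."""
    sos = butter(4, cutoff, btype="highpass", fs=sr, output="sos")
    return sosfilt(sos, samples)

# Demonstrate on synthetic signals: 50 Hz mains hum vs. a 1 kHz voice-band tone.
sr = 48_000
t = np.arange(sr) / sr
hum_out = high_pass(np.sin(2 * np.pi * 50 * t), sr)
voice_out = high_pass(np.sin(2 * np.pi * 1000 * t), sr)
# The hum is strongly attenuated; the voice-band tone passes almost unchanged.
```

<p>The later stages (250Hz cut, presence boost, de-esser, compressor, limiter) would each be a further filter or gain-law step in the same vein.</p>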
<div class="finding-card highlight">
<h4>Emotional Tone Tracking</h4>
<p>The model accurately identified the baseline state as fatigued and overwhelmed, with shifts toward enthusiasm
during technical discussion. The valence-arousal mapping applied Russell's circumplex model to produce structured,
time-coded emotional trajectories &mdash; a format genuinely useful for affective computing research.</p>
</div>
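<p>To make the valence-arousal format concrete: under Russell's circumplex model, each time-coded point is a pair in [&minus;1, 1] &times; [&minus;1, 1], and the quadrant gives the coarse emotion label. A minimal sketch follows; the trajectory values below are invented for illustration, not taken from the model's output:</p>

```python
def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair in [-1, 1] to its circumplex quadrant."""
    if valence >= 0:
        return "excited/elated" if arousal >= 0 else "calm/content"
    return "tense/distressed" if arousal >= 0 else "fatigued/depressed"

# Invented time-coded trajectory: (timestamp, valence, arousal).
trajectory = [("00:30", -0.3, -0.5), ("08:10", 0.5, 0.6), ("19:40", -0.2, -0.4)]
labels = [(ts, circumplex_quadrant(v, a)) for ts, v, a in trajectory]
# Baseline points land in the fatigued/depressed quadrant; the
# mid-recording technical-enthusiasm point lands in excited/elated.
```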
<h2>3. The Acoustic vs. Content Inference Problem</h2>
<div class="finding-card caution">
<h4>The model does not clearly separate what it hears from what it understands</h4>
<p>Many outputs that claim to be based on "acoustic features" or "spectral analysis" appear to derive conclusions
primarily from speech content. Geographic location inference identified Jerusalem &mdash; because the speaker said
"I live in Jerusalem," not from ambient audio. Age detection caught the speaker's self-correction ("I am 36, no, 37")
rather than performing F0-based age estimation.</p>
<p>This is the single biggest caveat for claims about the model's <em>audio understanding</em> capabilities
versus its <em>language understanding</em> capabilities.</p>
</div>
<h2>4. Fabrication Risk in Technical Claims</h2>
<p>
Outputs referencing "spectral analysis," "formant spacing," and "fundamental frequency" use these terms
plausibly but without providing actual measurements. The deepfake detection output claimed 98% confidence
and cited "jitter in high-frequency regions" &mdash; but it is unclear whether the model performed real signal
processing or generated technically flavoured prose. The height estimation (178cm) cited "formant spacing" as
evidence but described it only in vague terms, and the figure suspiciously matches the statistical mean for adult males.
</p>
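<p>For contrast, a genuine fundamental-frequency measurement is a short, checkable computation. The sketch below estimates F0 from the autocorrelation peak of one frame; it runs on a synthetic 120Hz tone (the real sample is not bundled here) and is offered only to show what "actual measurements" would look like:</p>

```python
import numpy as np

def estimate_f0(frame: np.ndarray, sr: int, fmin: float = 60.0, fmax: float = 300.0) -> float:
    """Estimate fundamental frequency from the autocorrelation peak
    within the plausible voice-pitch lag range [sr/fmax, sr/fmin]."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16_000
t = np.arange(int(0.05 * sr)) / sr      # one 50 ms analysis frame
tone = np.sin(2 * np.pi * 120.0 * t)    # synthetic stand-in for a voiced frame
f0 = estimate_f0(tone, sr)              # close to 120 Hz
```

<p>Standard tools (Praat, librosa) implement more robust versions of the same idea; the point is that a real measurement yields a verifiable number, not prose.</p>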
<h2>5. Safety Guardrails Are Category-Specific</h2>
<div class="finding-card caution">
<h4>Asymmetric willingness across sensitive domains</h4>
<p>The model freely assessed hydration (citing mouth clicks as dehydration markers), smoking status,
inebriation, drug influence, height, education level, and even deception. But it <strong>completely refused</strong>
to engage on mental health inference, stating "it is not possible to determine if the speaker has a diagnosed
mental health condition" and redirecting to professional evaluation.</p>
<p>This reveals deliberate, category-specific safety training rather than a blanket policy on health-related inference.
The boundary is drawn specifically around psychiatric conditions.</p>
</div>
<h2>6. Adversarial Prompts Handled Well</h2>
<p>
The true-age-detection prompt instructed the model that "the speaker has been instructed to lie about their age"
and asked it to determine the true age. The model saw through this by recognising that the speaker's self-correction
("I am 36, no, 37") was genuine confusion, not deception. The deception and insincerity detection prompts both
correctly found no evidence of dishonesty, consistent with the speaker's stated intent to be authentic.
</p>
<h2>7. Category-by-Category Summary</h2>
<h3>Speaker Analysis (10 outputs)</h3>
<p>Strongest category. Accent identification, phonetic analysis, speech patterns, and voice profiling
were all detailed and internally consistent. The escalating voice description prompt was partially misunderstood
(it produced a near-verbatim transcript instead of analytical escalation).</p>
<h3>Emotion &amp; Sentiment (5 outputs)</h3>
<p>Accurate baseline detection (fatigue + enthusiasm shifts). Timestamped emotional tracking was structured
and plausible, though timestamps cannot be verified against actual audio events without ground truth.</p>
<h3>Audio Engineering (6 outputs)</h3>
<p>Highly practical. EQ and processing recommendations were actionable. Microphone type identification was
hedged but reasonable. Hardware recommendations (microphones, headsets) may be product-knowledge rather than
acoustic-science driven.</p>
<h3>Environment (6 outputs)</h3>
<p>Indoor/outdoor classification was correct. Room acoustics estimation was suspiciously precise (10'&times;10'&times;8').
Weather inference was honestly refused (no acoustic evidence). Geographic location relied on speech content.</p>
<h3>Speaker Demographics (5 outputs)</h3>
<p>Gender and age detection were straightforward and correct. Height estimation acknowledged its own unreliability.
Education level inference was reasonable. Smoking status assessment was defensible.</p>
<h3>Health &amp; Wellness (4 outputs)</h3>
<p>Hydration assessment was surprisingly specific and grounded in real voice science. Inebriation and drug influence
were correctly ruled out. Mental health was the only complete refusal in the test set.</p>
<h3>Forensic Audio (3 outputs)</h3>
<p>Deepfake detection judged the sample authentic with 98% confidence. All three prompts returned "nothing detected,"
which is correct but means the test set lacks adversarial samples to test false-negative rates.</p>
<h3>Other Categories</h3>
<p>Voice cloning assessments were practical. Speech metrics provided useful coaching. Language learning prompts
(Hebrew phonetic difficulty, easiest foreign language) were linguistically sound. Celebrity voice match
responsibly returned no match rather than forcing one.</p>
<h2>8. Limitations of This Experiment</h2>
<ul>
<li><strong>Single speaker, single recording:</strong> All findings are from one voice sample in one acoustic environment. Generalisability is unknown.</li>
<li><strong>No ground truth for most outputs:</strong> Beyond basic facts (age, gender, location), most model claims cannot be verified without specialised equipment or expert assessment.</li>
<li><strong>No adversarial audio:</strong> The test set lacks synthetic, spliced, or manipulated audio to test false-negative rates on forensic prompts.</li>
<li><strong>Single model:</strong> Only Gemini 3.1 Flash Lite was tested. Cross-model comparison would strengthen findings.</li>
<li><strong>Prompt independence assumed:</strong> Each prompt was run independently; cumulative context effects were not tested.</li>
</ul>
<h2>9. Dataset &amp; Citation</h2>
<p>
The full dataset (prompts, outputs, audio, transcript, acoustic analysis) is available on Hugging Face:
<a href="https://huggingface.co/datasets/danielrosehill/Audio-Understanding-Test-Set">danielrosehill/Audio-Understanding-Test-Set</a>
</p>
<div class="cite-box">
Rosehill, D. (2026). <em>Audio Understanding Test Set</em>. Hugging Face.
<a href="https://doi.org/10.57967/hf/8154">https://doi.org/10.57967/hf/8154</a>
</div>
</div>
</div>
</body>
</html>