Audio-Understanding-Experiment / listen.html

danielrosehill

Full multi-page site: landing, listen, results, findings

87b6530 3 months ago


	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="utf-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<title>Listen — Audio Understanding Experiment</title>
	<style>

	* { margin: 0; padding: 0; box-sizing: border-box; }
	body { font-family: 'Segoe UI', -apple-system, BlinkMacSystemFont, Roboto, sans-serif; background: #f8f9fa; color: #1a1a2e; }
	a { color: #4338ca; text-decoration: none; }
	a:hover { text-decoration: underline; }

	/* Top nav */
	.topnav {
	position: fixed; top: 0; left: 0; right: 0; z-index: 100;
	background: #fff; border-bottom: 1px solid #e2e4e9;
	padding: 0 2rem; height: 52px;
	display: flex; align-items: center; gap: 2rem;
	box-shadow: 0 1px 3px rgba(0,0,0,0.06);
	}
	.topnav .site-title { font-size: 0.9rem; font-weight: 700; color: #111827; white-space: nowrap; }
	.topnav nav { display: flex; gap: 0.25rem; }
	.topnav nav a {
	font-size: 0.8rem; font-weight: 500; color: #6b7280;
	padding: 0.4rem 0.75rem; border-radius: 6px; transition: all 0.12s;
	text-decoration: none;
	}
	.topnav nav a:hover { background: #f3f4f6; color: #111827; text-decoration: none; }
	.topnav nav a.active { background: #eef2ff; color: #4338ca; }
	.topnav .doi { margin-left: auto; font-size: 0.68rem; color: #9ca3af; white-space: nowrap; }
	.topnav .doi a { color: #6b7280; }

	body { padding-top: 52px; }

	/* Audio bar — persistent across all pages */
	.audio-bar {
	position: fixed; top: 52px; left: 0; right: 0; z-index: 99;
	background: #fff; border-bottom: 1px solid #e2e4e9;
	padding: 0.4rem 2rem;
	display: flex; align-items: center; gap: 1rem;
	height: 44px;
	}
	.audio-bar .bar-label {
	font-size: 0.68rem; font-weight: 600; text-transform: uppercase;
	letter-spacing: 0.05em; color: #6b7280; white-space: nowrap;
	}
	.audio-bar .bar-date {
	font-size: 0.65rem; color: #9ca3af; white-space: nowrap;
	}
	.audio-bar audio { flex: 1; height: 28px; min-width: 0; }

	.page-body { padding-top: 44px; }

	@media (max-width: 768px) {
	.topnav { padding: 0 1rem; gap: 1rem; }
	.topnav .doi { display: none; }
	.audio-bar { padding: 0.4rem 1rem; }
	}


	.audio-bar { display: none; }
	.page-body { padding-top: 0; }

	.listen-page { max-width: 900px; margin: 0 auto; padding: 2rem; }
	.listen-page h1 { font-size: 1.4rem; font-weight: 700; color: #111827; margin-bottom: 0.5rem; }
	.listen-page .sub { font-size: 0.82rem; color: #6b7280; margin-bottom: 1.5rem; }

	/* Large waveform-style player */
	.big-player {
	background: #fff; border: 1px solid #e2e4e9; border-radius: 12px;
	padding: 1.5rem; margin-bottom: 2rem;
	box-shadow: 0 1px 3px rgba(0,0,0,0.06);
	}
	.big-player audio { width: 100%; height: 54px; }
	.big-player .player-meta {
	display: flex; flex-wrap: wrap; gap: 1rem; margin-top: 0.75rem;
	font-size: 0.75rem; color: #6b7280;
	}
	.big-player .player-meta span { display: flex; align-items: center; gap: 0.3rem; }

	.info-grid {
	display: grid; grid-template-columns: 1fr 1fr; gap: 1.25rem;
	margin-bottom: 2rem;
	}
	@media (max-width: 600px) { .info-grid { grid-template-columns: 1fr; } }

	.info-card {
	background: #fff; border: 1px solid #e2e4e9; border-radius: 10px;
	padding: 1.15rem; box-shadow: 0 1px 2px rgba(0,0,0,0.04);
	}
	.info-card h3 {
	font-size: 0.78rem; font-weight: 600; text-transform: uppercase;
	letter-spacing: 0.04em; color: #6b7280; margin-bottom: 0.6rem;
	}
	.info-card table { width: 100%; font-size: 0.82rem; border-collapse: collapse; }
	.info-card td { padding: 0.3rem 0; color: #374151; }
	.info-card td:first-child { color: #6b7280; width: 45%; }
	.info-card td:last-child { font-weight: 500; }

	.transcript-section { margin-bottom: 2rem; }
	.transcript-section h2 { font-size: 1.1rem; font-weight: 700; color: #111827; margin-bottom: 0.75rem; }
	.transcript-box {
	background: #fff; border: 1px solid #e2e4e9; border-radius: 10px;
	padding: 1.25rem; max-height: 500px; overflow-y: auto;
	font-size: 0.84rem; line-height: 1.8; color: #374151;
	box-shadow: 0 1px 2px rgba(0,0,0,0.04);
	}
	.transcript-box .ts {
	display: inline-block; font-size: 0.7rem; font-weight: 600;
	color: #4338ca; background: #eef2ff; padding: 0.15rem 0.45rem;
	border-radius: 4px; margin-right: 0.4rem; font-family: 'SF Mono', monospace;
	}
	.transcript-box p { margin-bottom: 0.85rem; }
	</style>
	</head>
	<body>

	<header class="topnav">
	<span class="site-title">Audio Understanding Experiment</span>
	<nav>
	<a href="index.html">Overview</a>
	<a href="listen.html" class="active">Listen</a>
	<a href="results.html">Results</a>
	<a href="findings.html">Findings</a>
	</nav>
	<span class="doi"><a href="https://doi.org/10.57967/hf/8154">DOI: 10.57967/hf/8154</a></span>
	</header>

	<div class="page-body">
	<div class="listen-page">

	<h1>Voice Sample</h1>
	<p class="sub">Recorded 26 March 2026 by Daniel Rosehill. Unscripted freeform voice note, OnePlus Nord 3 5G, HQ mode.</p>

	<div class="big-player">
	<audio controls preload="metadata" src="voice-sample.flac" style="width:100%"></audio>
	<div class="player-meta">
	<span>Duration: 20m 54s</span>
	<span>Format: FLAC mono 24 kHz 16-bit</span>
	<span>Size: 30.9 MB</span>
	<span>Device: OnePlus Nord 3 5G</span>
	<span>Environment: Untreated room</span>
	</div>
	</div>

	<div class="info-grid">
	<div class="info-card">
	<h3>Speaker Profile</h3>
	<table>
	<tr><td>Gender</td><td>Male</td></tr>
	<tr><td>Age</td><td>37 (late 30s)</td></tr>
	<tr><td>Accent</td><td>Irish (Cork), softened</td></tr>
	<tr><td>Voice type</td><td>Bass / Low Baritone</td></tr>
	<tr><td>Speaking rate</td><td>~169 WPM</td></tr>
	<tr><td>Location</td><td>Jerusalem, Israel</td></tr>
	</table>
	</div>
	<div class="info-card">
	<h3>Acoustic Profile</h3>
	<table>
	<tr><td>Median F0</td><td>109.6 Hz</td></tr>
	<tr><td>F0 range</td><td>74.9 – 499.9 Hz</td></tr>
	<tr><td>HNR</td><td>9.6 dB (fatigued)</td></tr>
	<tr><td>Peak level</td><td>−1.02 dB</td></tr>
	<tr><td>RMS level</td><td>−22.21 dB</td></tr>
	<tr><td>Dynamic range</td><td>~65.8 dB</td></tr>
	</table>
	</div>
	<div class="info-card">
	<h3>Formant Analysis</h3>
	<table>
	<tr><td>F1 (jaw openness)</td><td>669 Hz mean</td></tr>
	<tr><td>F2 (tongue position)</td><td>1,896 Hz mean</td></tr>
	<tr><td>F3 (lip rounding)</td><td>2,873 Hz mean</td></tr>
	<tr><td>Voiced frames</td><td>50.8%</td></tr>
	<tr><td>Pitch variability</td><td>28.3% CV</td></tr>
	</table>
	</div>
	<div class="info-card">
	<h3>Voice Quality</h3>
	<table>
	<tr><td>Jitter (local)</td><td>2.713% (elevated)</td></tr>
	<tr><td>Shimmer (local)</td><td>13.089% (elevated)</td></tr>
	<tr><td>Crest factor</td><td>11.47</td></tr>
	<tr><td>Bit rate</td><td>197 kbps</td></tr>
	<tr><td>Condition</td><td>Fatigued, dehydrated</td></tr>
	</table>
	</div>
	</div>

	<div class="transcript-section">
	<h2>Transcript</h2>
	<p style="font-size:0.78rem;color:#6b7280;margin-bottom:0.75rem;">AssemblyAI · 97.4% confidence · 3,524 words</p>
	<div class="transcript-box">
	<p><span class="ts">00:00</span>So I thought I would record a voice note because today is one of those days where I'm having an immensely difficult time in actually getting out of bed. I am in bed at 4:08 in the afternoon. This is not something that typically happens. I am in bed because I live in Jerusalem and there is the Iranian war going on and we had just a crazy, crazy night.</p>
	<p><span class="ts">00:28</span>I was up late last night, which I knew was kind of risky. In this war you kind of learn. We've been at war for almost a month. It's going to be a month. Today is, I'm recording this on the 26th of March. I should probably start it with that. And on the 28th is going to be a month, so a long time.</p>
	<p><span class="ts">00:57</span>Trying to finally get back into some kind of a groove with everything that's being disrupted and. But then this morning woke up to the first rocket siren. Like I'm gonna say seven in the morning, approx. And then we had like just one of those.</p>
	<p><span class="ts">01:25</span>So it's very much. There is attacks going on all over all the time. It's a bit unnerving to actually have it up on a screen like this. It is a vibe coded app that I created called Redlert Geodash and it's cool how many open source projects are coming out there at the moment.</p>
	<p><span class="ts">01:51</span>No, they've all got. This one has its own unique features to it, but the fact that these can be created by bunches of people in a few hours is revolutionary. Anyway, so coming back to the rockets. Yeah, so we went out to the shelter and then it was just like three or four more rounds of it.</p>
	<p><span class="ts">02:18</span>Another attack. I don't know, it's something about that like going back to sleep for 20 minutes thing that just when you do finally just give up on trying to get back to sleep, you're just exhausted. So hence I'm in an energy deficit waiting for some coffee to kick in.</p>
	<p><span class="ts">02:49</span>But most significantly, I think is my AI generated podcast. It's called My Word Prompts mywordprompts.com and for voice cloning. So the podcast is basically these two characters. It's Herman and Corn. Corn is a sloth, Herman is a donkey.</p>
	<p><span class="ts">03:15</span>So they're both. It's using Chatterbox, which is from Resemble AI. And what's really crazy about it is it's like a, I think 30 second sample and that's it. So each character is me doing a voice.</p>
	<p><span class="ts">03:44</span>And I'm recording this and putting it out on GitHub publicly because I realize from all the podcasts and YouTube videos I've done, if anyone does want to make a deepfake voice clone of me, they already have all the information they need.</p>
	<p><span class="ts">04:17</span>Actually, I think, I'm not sure if he's. I'm not sure if he still thinks I'm a bot or if I've convinced him of my humanity. But I am a human and it's kind of. I guess there's something, there's something kind of funny about that.</p>
	<p><span class="ts">04:46</span>And so I'm sure from Synth, that Synth is so just like to add to my, to add to the mystery, mystery I now have. Like, I can see why I might seem bot-like, but on my to-do list to get a professional headshot.</p>
	<p><span class="ts">05:12</span>Very corporate. So I leaned into the AI for my lit, for my little avatar pic. But, but my original one. There's plenty of photos of me on the Internet or a few at least that are not in any way AI tampered and it's just me.</p>
	<p><span class="ts">05:45</span>And I mean, I guess that's obvious, right? But even in a few years you can hear these small differences. So this is how we speak today. And let me talk about the acoustic environment within which I find myself.</p>
	<p><span class="ts">06:11</span>One use for having a voice sample that I found is speech to text benchmarking. So if you want to get a benchmark for the accuracy of a model, if I can summon up the motivation to do so, I'll create a ground truth.</p>
	<p><span class="ts">06:36</span>And then you listen back to. There's a lot of apps that just let you scrub through the audio and just fix up any things that got wrong and that is your like 100% accuracy benchmark.</p>
	<p><span class="ts">07:07</span>So you can do it. It's actually pretty easy, but very, very worthwhile. Extremely worthwhile in fact. Like if you're going to be spending. I've mentioned in my podcast and my, I guess anything I've written here, my blog or elsewhere that I have a very long term view of voice tech.</p>
	<p><span class="ts">07:35</span>No, the accuracy is very good. The last thing I'm looking for is something that I can type with on my computer in real time like a streaming response one on an Ubuntu.</p>
	<p><span class="ts">08:06</span>And so we're trying to just kind of hold it all together and do our, you know, work on stuff and take care of him. So sometimes I'm holding him and I just. If I had the real time text input, I could just quickly, you know, jot something down into the computer.</p>
	<p><span class="ts">08:32</span>And I have to say, the microphone here is pretty decent. And I am recording this voice note today on the HQ setting. Let's see what the HQ setting actually entails. It is. How do I find that out? Ah, yes. WAV stereo. 44.1 kilohertz.</p>
	<p><span class="ts">09:00</span>Ooh. So I have a setting in there that's maybe doing noise cancellation. Well, this is. It's going to be a one shot, one shot data set. So it is what it is.</p>
	<p><span class="ts">09:27</span>And I think from the one thing I've learned about TTS, the 30 second. If you're trying to do voice cloning, so 30 seconds, it's really. I've tried. I played around with my voices for the characters in this podcast, Herman and Corn.</p>
	<p><span class="ts">09:54</span>But if you say like, this is Daniel and I'm walking around the living room in Jerusalem and I'm having a quite pleasant day today, like, if you read something like a robot, then your voice tone will sound robotic.</p>
	<p><span class="ts">10:20</span>Right. Those things. If you're training on a small set of voice audios, what I actually ended up doing for those voice clones, for anyone who's ever listened to this podcast, is try to find something I could say in 30 seconds that I could have a bit of enthusiasm and a bit of the other opposite.</p>
	<p><span class="ts">10:55</span>Now what other delightful things do I have? Because I'm going to try to stretch this out to 15 minutes and LFS storage in GitHub. GitHub, say I have filled up my LFS storage.</p>
	<p><span class="ts">11:21</span>I'm already paying for GitHub and how did I fill up so much LFS storage? I don't know, but I'm sure Claude knows. So I'll probably ask Claude, hey, what's going on here?</p>
	<p><span class="ts">11:47</span>But you know, some things never change. I am a backup worries person. And the more, the more that you have one project where you've got stuff, oh, this is in a object store, this is in a repo, it becomes harder to actually get a decent backup.</p>
	<p><span class="ts">12:13</span>Oh gosh, that sounds very old. Yeah, late 30s. There's no escaping late 30s or 37. Like 36, it's kind of an edge case, like you know, you're late 30s, but it could be argued you're late mid-30s where 37 is just. No, you're, you're practically 40.</p>
	<p><span class="ts">12:48</span>We did live in other countries, just for a year. Nothing too glamorous. We lived in The Hague and Aberdeen when I was really little. So little that I don't remember any of it. But we moved back to Cork and I moved to Israel because I'm Jewish.</p>
	<p><span class="ts">13:16</span>I do believe Israel is the place for Jewish people to live. But I also want to be a peaceful part of the world and the war with Iran is just, and all the countries here, it's just a massive drain.</p>
	<p><span class="ts">13:50</span>And I just kind of at one day said wait, I don't need to do this. Like, I don't know from whatever YouTube revenue I was making, it was like maybe $50 a month or something. I was like, I, I can just step back.</p>
	<p><span class="ts">14:15</span>Oh yeah, the videos, YouTube channel, that was, that was fun, important. I do actually now aspire to return but it's going to be so different.</p>
	<p><span class="ts">14:41</span>I would say that's the main issue with the pressures of jobs and fatherhood. Like there's a lot of things I'm trying to be a bit more strategic about what I spend time on.</p>
	<p><span class="ts">15:16</span>To create a voice clone of myself. And of course I will absolutely say I've tried a couple of times just for fun. I, I, it's actually I've never got good results. In fact I got terrible results.</p>
	<p><span class="ts">15:42</span>Probably to be honest, prank my wife and my friends, like use a, use a robobot calling service and see if I could trick, you know, that's just the kind of person I am. I'm. I am a prankster.</p>
	<p><span class="ts">16:16</span>Wait, no, actually, I have an Irish accent. This is how I speak. And this is my theory. Anyway, I don't know if it stands up to scrutiny, but it just doesn't shift the center point far enough.</p>
	<p><span class="ts">16:41</span>And I actually found, to my surprise with Chatterbox, that as I went up towards, like, I remember for the first while in the podcast, I was actually really completely stopped, now that I think about it.</p>
	<p><span class="ts">17:07</span>And it was problematic. And I was like, trying to figure out what was going wrong. And I think the. Through trial and error, I actually overshot the training for Chatterbox.</p>
	<p><span class="ts">17:34</span>I guess there was conflicts in the training data basically create a lot of hallucinations. So I think that's enough use cases for this file. Licensing open source.</p>
	<p><span class="ts">18:01</span>I want to narrate something that is, like, in the public good. But do ask me, please receive my consent.</p>
	<p><span class="ts">18:34</span>You have to speak lots of short sentences and do the ground truth for each. I already have that data set. I much prefer just trying it out this way.</p>
	<p><span class="ts">19:00</span>I wanted to create a mix like an EQ mix because I was doing voiceovers on the podcast. This is, this is, as I said, pretty much just like minus the noise cancellation. I forgot to turn off. This is just raw me speaking.</p>
	<p><span class="ts">19:28</span>And it did that really well. And I can run this through Claude and say, okay, this is me speaking for 20 minutes. Let's run it through Whisper. Like, what pace do I speak at? What's my WPM?</p>
	<p><span class="ts">19:59</span>It's microphone-specific. So this might be my EQ for my OnePlus. It might not hold, work as well on a different computer, different microphone, but you might learn some useful things about your own speech.</p>
	<p><span class="ts">20:25</span>Great guy and like, he. He walked me through all the settings and it was, it was amazing, but I've forgotten already what it was. So for people getting into this, I think I will have to go now because I badly need to drink some water.</p>
	<p><span class="ts">20:51</span>Recorded today. Over and out.</p>
	</div>
	</div>

	</div>
	</div>
	</body>
	</html>