danielrosehill committed
Commit b5a4032 · 1 Parent(s): d32b5ac
data/audio/podcast.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0b200895f2a2ab1640d70eb1d7cc4aeaf8b7d853ba966814aa6acd1452d087a1
+ size 20171455
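Because podcast.mp3 is tracked with Git LFS, the blob checked into the repository is not the audio itself but a three-line pointer file: the spec version, the SHA-256 digest of the real object, and its size in bytes (20171455, roughly 20 MB). A minimal sketch of reading such a pointer follows; the helper name is illustrative, not part of this repository.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its space-separated key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# The exact pointer content committed above for data/audio/podcast.mp3
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:0b200895f2a2ab1640d70eb1d7cc4aeaf8b7d853ba966814aa6acd1452d087a1
size 20171455
"""

info = parse_lfs_pointer(pointer)
print(info["size"])  # size of the real MP3 blob in bytes
```

Verifying that the SHA-256 of the downloaded blob matches `info["oid"]` is how LFS clients detect corruption.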
data/ground-truth/truth_1.srt ADDED
@@ -0,0 +1,1032 @@
+ 1
+ 00:00:00,000 --> 00:00:08,640
+ Hello and welcome to a audio dataset consisting of one single episode of a non-existent podcast.
+
+ 2
+ 00:00:08,640 --> 00:00:19,120
+ Or, it eh, I may append this to a podcast that I set up recently regarding my with my thoughts on speech
+
+ 3
+ 00:00:19,120 --> 00:00:28,720
+ tech and AI in particular. More AI and generative AI I would, I would say. But in any event, the purpose of this
+
+ 4
+ 00:00:30,080 --> 00:00:37,120
+ voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the
+
+ 5
+ 00:00:37,120 --> 00:00:42,320
+ envelope evaluation as they might say for different speech to text models. And I'm doing this because I
+
+ 6
+ 00:00:42,800 --> 00:00:48,560
+ I thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in
+
+ 7
+ 00:00:48,560 --> 00:00:55,120
+ the elusive task of fine-tuning Whisper. Whisper is, and I'm going to just talk, I'm trying to
+
+ 8
+ 00:00:55,760 --> 00:01:01,600
+ mix up, I'm going to try a few different styles of speaking. I might whisper something at some
+
+ 9
+ 00:01:01,600 --> 00:01:07,760
+ points as well. And I'll go back to speaking loud in different parts. I'm going to sound really
+
+ 10
+ 00:01:07,760 --> 00:01:15,200
+ like a crazy person because I'm also going to try to speak at different pitches and cadences
+
+ 11
+ 00:01:15,200 --> 00:01:21,600
+ in order to really try to put a speech to text model through its paces, which is trying to make
+
+ 12
+ 00:01:21,600 --> 00:01:30,320
+ sense of "is this guy just rambling on incoherently in one long sentence?" Or "are these just actually
+
+ 13
+ 00:01:30,320 --> 00:01:38,320
+ a series of step standalone stepalone standalone sentences?" And how is it going to handle stepalone?! That's not a
+
+ 14
+ 00:01:38,320 --> 00:01:43,919
+ word! What happens when you use speech to text and you use a fake word and then you're like, wait,
+
+ 15
+ 00:01:43,919 --> 00:01:51,520
+ that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the
+
+ 16
+ 00:01:52,880 --> 00:01:57,359
+ questions that I'm seeking to answer in this training data. Now, why did why was I trying to
+
+ 17
+ 00:01:57,360 --> 00:02:01,040
+ fine tune whisper? And what is Whisper? As I said, I'm going to try to
+
+ 18
+ 00:02:02,080 --> 00:02:04,240
+ record this at a couple of different levels of
+
+ 19
+ 00:02:04,880 --> 00:02:10,320
+ technicality - for folks who are in the normal world and not totally
+
+ 20
+ 00:02:11,360 --> 00:02:16,079
+ stuck down the rabbit hole of AI. Which I have to say is a really wonderful rabbit hole to be
+
+ 21
+ 00:02:16,720 --> 00:02:23,440
+ to be down. It's a really interesting area. And speech and voice tech is the aspect of it that
+
+ 22
+ 00:02:23,440 --> 00:02:28,880
+ I find actually most - I'm not sure I would say the most interesting because there's just so much
+
+ 23
+ 00:02:28,880 --> 00:02:34,560
+ that is fascinating in AI. But the most that I find the most personally transformative in terms of
+
+ 24
+ 00:02:34,560 --> 00:02:42,240
+ the impact that it's had on my daily work life and productivity and how I sort of work. And
+
+ 25
+ 00:02:42,960 --> 00:02:49,920
+ I am persevering hard with the task of trying to get a good solution working for Linux.
+
+ 26
+ 00:02:49,920 --> 00:02:53,440
+ Which if anyone actually does listen to this not just for the training data and for the
+
+ 27
+ 00:02:53,440 --> 00:03:00,399
+ actual content, this has sparked. I had, besides the fine tune not working, well that was
+
+ 28
+ 00:03:00,399 --> 00:03:07,679
+ the failure. I used Claude Code. Because one thinks these days that there is nothing
+
+ 29
+ 00:03:08,560 --> 00:03:16,799
+ short of solving, you know, the reason of life or something that Claude and
+
+ 30
+ 00:03:16,800 --> 00:03:22,720
+ agentic AI can't do. Which is not really the case. It does seem that way sometimes. But it
+
+ 31
+ 00:03:22,720 --> 00:03:28,080
+ fails a lot as well. And this is one of those instances where last week I put together an hour
+
+ 32
+ 00:03:28,080 --> 00:03:33,600
+ of voice training data: basically speaking just random things for three minutes. And
+
+ 33
+ 00:03:35,600 --> 00:03:40,160
+ it was actually kind of tedious because the texts were really weird. Some of them were it was like,
+
+ 34
+ 00:03:40,160 --> 00:03:45,440
+ it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't,
+
+ 35
+ 00:03:45,440 --> 00:03:51,120
+ I was so bored after 10 minutes that I was like, "okay, no, I'm just gonna have to find something
+
+ 36
+ 00:03:51,120 --> 00:03:59,920
+ else to read." So I used I created with AI Studio, vibe coded, a synthetic text generator,
+
+ 37
+ 00:04:00,800 --> 00:04:05,680
+ which actually I thought was probably a better way of doing it because it would give me more
+
+ 38
+ 00:04:05,680 --> 00:04:12,000
+ short samples with more varied content. So I was like, okay, give me a voice note. Like I'm
+
+ 39
+ 00:04:12,000 --> 00:04:18,800
+ recording an email. Give me a short story to read. Give me prose. So I came up with all
+
+ 40
+ 00:04:18,800 --> 00:04:24,240
+ these different things and I added a little timer to it so I could see how close I was to one
+
+ 41
+ 00:04:24,240 --> 00:04:32,480
+ hour. And I spent like an hour one afternoon or probably two hours by the time you do retakes
+
+ 42
+ 00:04:32,480 --> 00:04:39,120
+ and whatever because you want to. It gave me a source of truth which I'm not sure if that's the
+
+ 43
+ 00:04:39,120 --> 00:04:45,120
+ scientific way to approach this topic of gathering training data but I thought made sense.
+
+ 44
+ 00:04:46,560 --> 00:04:50,880
+ I have a lot of audio data from recording voice notes which I've also kind of used
+
+ 45
+ 00:04:52,000 --> 00:04:56,720
+ been experimenting with using for a different purpose. It's slightly different - annotating
+
+ 46
+ 00:04:57,840 --> 00:05:03,680
+ task types. It's more text classification experiment. Or well it's more than that actually
+
+ 47
+ 00:05:03,680 --> 00:05:08,880
+ I'm working on a voice app. So it's a prototype I guess is really more accurate.
+
+ 48
+ 00:05:11,280 --> 00:05:15,920
+ But you can do that and you can work backwards. You listen back to a voice note and you
+
+ 49
+ 00:05:17,520 --> 00:05:22,400
+ painfully go through one of those - transcribing where you start and stop and scrub around it and
+
+ 50
+ 00:05:22,400 --> 00:05:27,680
+ you fix the errors . But it's really really boring to do that. So I thought it would be less tedious
+
+ 51
+ 00:05:27,680 --> 00:05:34,240
+ in the long term if I just recorded the source of truth. So it gave me these three minute snippets.
+
+ 52
+ 00:05:34,240 --> 00:05:40,480
+ I recorded them and saved an MP3 and a TXT in the same folder and I created an hour of that data.
+
+ 53
+ 00:05:41,840 --> 00:05:47,280
+ So I was very hopeful - quitely, you know, a little bit hopeful - that I would be able that I could actually fine tune
+
+ 54
+ 00:05:47,280 --> 00:05:54,720
+ Whisper. I want to fine tune Whisper because when I got into voice tech last November my wife was in
+
+ 55
+ 00:05:54,720 --> 00:06:01,920
+ the US and I was alone at home. And when crazy people like me do really wild things like use voice
+
+ 56
+ 00:06:01,920 --> 00:06:08,320
+ to tech technology that was basically when I started doing it. I didn't feel like a crazy person
+
+ 57
+ 00:06:08,320 --> 00:06:15,760
+ speaking to myself. And my expectations weren't that high. I used speech tech now and again
+
+ 58
+ 00:06:16,960 --> 00:06:21,200
+ tried it out. I was like "it'd be really cool if you could just like speak into your computer." And
+
+ 59
+ 00:06:21,280 --> 00:06:28,479
+ whatever I tried out that had Linux support was just - it was not good, basically. And this blew me away
+
+ 60
+ 00:06:28,479 --> 00:06:34,400
+ from the first go. I mean it wasn't 100% accurate out of the box. And it took work. But it was good
+
+ 61
+ 00:06:34,400 --> 00:06:40,320
+ enough that there was a solid foundation. And it kind of passed that pivot point that it's actually
+
+ 62
+ 00:06:40,320 --> 00:06:46,320
+ worth doing this. You know, there's a point where it's. So like the transcript is you don't have to get 100%
+
+ 63
+ 00:06:46,400 --> 00:06:51,040
+ accuracy for it to be worth your time for speech to text to be a worthwhile addition to your
+
+ 64
+ 00:06:51,040 --> 00:06:58,320
+ productivity. But you do need to get above let's say I don't know 85%. If it's 60% or 50% you inevitably
+
+ 65
+ 00:06:58,320 --> 00:07:03,920
+ say "screw it I'll just type it."Because you end up missing errors in the transcript and it becomes
+
+ 66
+ 00:07:03,920 --> 00:07:07,840
+ actually worse. You end up in a worse position than you started with it. That's been my experience.
+
+ 67
+ 00:07:08,400 --> 00:07:14,400
+ So I was like "oh, this is actually really really good. Now how did that happen?" The answer is
+
+ 68
+ 00:07:14,400 --> 00:07:21,599
+ ASR, Whisper being open-sourced. and the transformer architecture if you want to go back to the
+
+ 69
+ 00:07:23,200 --> 00:07:29,440
+ to the underpinnings. Which really blows my mind. And it's on my list to read through that paper
+
+ 70
+ 00:07:30,239 --> 00:07:38,400
+ 'All You Need Is Attention' as attentively as can be done with my limited brain. Because it's super
+
+ 71
+ 00:07:38,960 --> 00:07:45,679
+ high-level stuff - super advanced stuff I mean. But that I think of all the things that
+
+ 72
+ 00:07:47,280 --> 00:07:54,080
+ are fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating
+
+ 73
+ 00:07:54,080 --> 00:07:59,599
+ that few people are like "hang on, you've got this thing that can speak to you like a chatbot - an LLM.
+
+ 74
+ 00:08:00,640 --> 00:08:06,799
+ Then you've got image generation. Okay, so firstly those two things on the surface have nothing
+
+ 75
+ 00:08:06,800 --> 00:08:12,560
+ in common. So like, "how are they ... how did THAT just happen all at the same time?" And then when you
+
+ 76
+ 00:08:12,560 --> 00:08:19,920
+ extend that further you're like Suno right. You can sing a song and AI will like come up with
+
+ 77
+ 00:08:19,920 --> 00:08:25,200
+ an instrumental. And then you've got Whisper. And then you're like "wait a second how did all this stuff
+
+ 78
+ 00:08:25,200 --> 00:08:30,880
+ like if it's all AI what's like, there has to be some commonality. Otherwise these are four these are
+
+ 79
+ 00:08:31,600 --> 00:08:38,640
+ totally different technologies on the surface of it and the transformer architecture is as far as
+
+ 80
+ 00:08:38,640 --> 00:08:44,720
+ I know the answer. And I can't even say I can't even pretend that I really understand what the
+
+ 81
+ 00:08:44,720 --> 00:08:51,200
+ transformer architecture means in depth. But I have scanned it. And as I said I want to print it and
+
+ 82
+ 00:08:51,200 --> 00:08:57,760
+ really kind of think over it's at some point. And I'll probably feel bad about myself I think!
+
+ 83
+ 00:08:57,760 --> 00:09:03,280
+ Because weren't those guys in their in their 20s like? That's crazy! I think I asked ChatGPT
+
+ 84
+ 00:09:03,280 --> 00:09:09,439
+ once "who were the? Who wrote that paper and how old were they when it was published in Arxiv?"
+
+ 85
+ 00:09:09,439 --> 00:09:14,640
+ And I was expecting like, I don't know. What do you what do you imagine? I personally imagine kind of
+
+ 86
+ 00:09:14,640 --> 00:09:19,840
+ like you know you have these breakthroughs during COVID and things like that where like these kind
+
+ 87
+ 00:09:19,840 --> 00:09:24,480
+ of really obscure scientists who are like in their 50s and they've just kind of been laboring in
+
+ 88
+ 00:09:24,640 --> 00:09:31,120
+ labs and wearily writing and publishing in kind of obscure academic publications and they
+
+ 89
+ 00:09:31,120 --> 00:09:37,200
+ finally like hit a big or win a Nobel Prize. And then they're household household names. So I that
+
+ 90
+ 00:09:37,200 --> 00:09:42,680
+ was kind of what I had in mind. That was the mental image I'd formed of the birth of Arxiv.
+
+ 91
+ 00:09:42,680 --> 00:09:47,760
+ Like, I wasn't expecting 20-somethings in San Francisco! Though I thought that was both very very
+
+ 92
+ 00:09:47,760 --> 00:09:54,160
+ funny, very cool, and actually kind of inspiring. It's nice to think that people who you know just
+
+ 93
+ 00:09:54,160 --> 00:10:01,439
+ you might put them in the kind of milieu or bubble or world that you are in or credibly in through
+
+ 94
+ 00:10:01,439 --> 00:10:06,079
+ you know the series of connections that are coming up with such literally world changing
+
+ 95
+ 00:10:06,880 --> 00:10:13,439
+ innovations. So that was I thought anyway that that was cool. Okay voice training data. How
+
+ 96
+ 00:10:13,439 --> 00:10:19,280
+ are we doing? We're about 10 minutes. And I'm still talking about voice technology! So Whisper was
+
+ 97
+ 00:10:19,280 --> 00:10:25,680
+ brilliant. And I was so excited that I was my first instinct was to like guess it's like "oh my gosh
+
+ 98
+ 00:10:25,680 --> 00:10:31,040
+ I have to get like a really good microphone for this." So I didn't go on a spending spree because
+
+ 99
+ 00:10:31,040 --> 00:10:37,760
+ I said I'm gonna have to just wait a month and see if I still use this." And it just kind of became
+
+ 100
+ 00:10:37,760 --> 00:10:44,800
+ it's become really part of my daily routine. Like, if I'm writing an email I'll record a voice note
+
+ 101
+ 00:10:44,880 --> 00:10:50,079
+ and then I'll develop it and it's nice to see that everyone is like developing the same things in
+
+ 102
+ 00:10:50,079 --> 00:10:56,319
+ parallel. Like, that's maybe kind of a weird thing to say. But when I look, I kind of came when I started
+
+ 103
+ 00:10:56,319 --> 00:11:02,640
+ working on this these prototypes on GitHub, which is where I just kind of share very freely and loosely
+
+ 104
+ 00:11:03,199 --> 00:11:10,800
+ ideas and you know first iterations on concepts. And for want of a better word I called it like
+
+ 105
+ 00:11:11,439 --> 00:11:17,680
+ "LLM post processing." Or cleanup. Or basically a system prompt that after you get back the raw text
+
+ 106
+ 00:11:17,680 --> 00:11:25,920
+ from Whisper, you run it through model and say "okay this is crappy text like add sentence structure
+
+ 107
+ 00:11:25,920 --> 00:11:33,199
+ and you know fix it up. " And now when I'm exploring the different tools that are out there that people
+
+ 108
+ 00:11:33,200 --> 00:11:39,040
+ have built, I see quite a number of projects have basically you know done the same thing.
+
+ 109
+ 00:11:40,640 --> 00:11:45,040
+ Lest that be misconstrued, I'm not saying for a millisecond that I inspired them. I'm sure this
+
+ 110
+ 00:11:45,040 --> 00:11:51,440
+ has been a thing that's been integrated into tools for a while. But it's, it's the kind of thing that
+
+ 111
+ 00:11:51,440 --> 00:11:57,520
+ when you start using these tools every day the need for it is almost instantly apparent. Because text
+
+ 112
+ 00:11:57,600 --> 00:12:03,520
+ that doesn't have any punctuation or paragraph spacing takes a long time to you know, it takes so
+
+ 113
+ 00:12:03,520 --> 00:12:10,079
+ long to get it into a presentable email that,again, it moves speech tech into that,
+
+ 114
+ 00:12:11,280 --> 00:12:16,000
+ before that inflection point where you're like "nah it's just not worth." It it's like it'll just be
+
+ 115
+ 00:12:16,000 --> 00:12:20,800
+ quicker to type this. So it's it's a big - it's a little touch that actually is a big deal
+
+ 116
+ 00:12:21,520 --> 00:12:28,319
+ So I was on Whisper and I've been using Whisper and I kind of early on find a couple of tools.
+
+ 117
+ 00:12:28,319 --> 00:12:33,680
+ I couldn't find what I was looking for on Linux which is basically just something that'll run
+
+ 118
+ 00:12:34,800 --> 00:12:39,120
+ in the background. You'll give it an API key and it'll just like transcribe.
+
+ 119
+ 00:12:41,439 --> 00:12:47,359
+ With like a little key to start and stop the dictation. And the issues wer I discovered that
+
+ 120
+ 00:12:47,440 --> 00:12:52,720
+ like most people involved in creating these projects were very much focused on local models.
+
+ 121
+ 00:12:52,720 --> 00:12:58,400
+ And running Whisper locally because you can. And I tried that a bunch of times and just never
+
+ 122
+ 00:12:58,400 --> 00:13:03,920
+ got results that were as good as the cloud. And when I began looking at the cost of the speech to
+
+ 123
+ 00:13:03,920 --> 00:13:10,080
+ text APIs and what I was spending just thought there it's actually in my opinion just one of
+
+ 124
+ 00:13:10,080 --> 00:13:15,600
+ the better deals in API spending and in cloud. Like, it's just not that expensive for very, very good
+
+ 125
+ 00:13:15,600 --> 00:13:22,240
+ models that are much more. You know, you're going to be able to run the full model, the latest model
+
+ 126
+ 00:13:22,240 --> 00:13:28,960
+ versus whatever you can run on your average GPU. Unless you want to buy a crazy GPU. It doesn't
+
+ 127
+ 00:13:28,960 --> 00:13:34,000
+ really make sense to me. Now, I privacy is another concern that I know is kind of like a very much
+
+ 128
+ 00:13:34,000 --> 00:13:38,720
+ a separate thing. That people just don't want their voice data and their voice leaving their
+
+ 129
+ 00:13:38,720 --> 00:13:45,360
+ local environment. Maybe for regulatory reasons as well. But I'm not in that. I'm don't really really
+
+ 130
+ 00:13:45,360 --> 00:13:51,440
+ care about people listening to my grocery list consisting of reminding myself that I need to buy
+
+ 131
+ 00:13:51,440 --> 00:13:58,240
+ more beer, Cheetos and hummus. Which is kind of the three three staples of my diet during periods of
+
+ 132
+ 00:13:58,240 --> 00:14:04,560
+ poor nutrition. But the kind of stuff that I transcribe most it's just not it's not a it's not a
+
+ 133
+ 00:14:04,560 --> 00:14:12,640
+ privacy thing. I'm not that sort of sensitive about. And I don't do anything so you know sensitive
+
+ 134
+ 00:14:12,640 --> 00:14:17,680
+ or secure that requires airgapping. So I looked at the pricing and especially the kind of older
+
+ 135
+ 00:14:17,680 --> 00:14:24,400
+ models mini. Some of them are very very affordable. And I did a back of the, I did a calculation once
+
+ 136
+ 00:14:24,400 --> 00:14:30,239
+ with ChatGPT and I was like "okay, this is the, this is the API price for I can't remember whatever
+
+ 137
+ 00:14:30,320 --> 00:14:37,040
+ the model was. Let's say I just go at it like nonstop which rarely happens. Probably I would say an
+
+ 138
+ 00:14:37,040 --> 00:14:45,200
+ average I might dictate 30 to 60 minutes per day if I was probably summing up the emails, documents,
+
+ 139
+ 00:14:45,200 --> 00:14:51,360
+ outlines. Which is a lot. But it's it's still a fairly modest amount. And I was like well some days I
+
+ 140
+ 00:14:51,360 --> 00:14:56,720
+ do go on like one or two days where I've been usually when I'm like kind of out of the house and
+
+ 141
+ 00:14:56,720 --> 00:15:02,800
+ just have something like I've nothing else to do. Like if I'm at a hospital. We have a newborn.
+
+ 142
+ 00:15:04,000 --> 00:15:09,040
+ And you're waiting for like hours and hours for an appointment. And I would probably have
+
+ 143
+ 00:15:09,040 --> 00:15:15,280
+ listened to podcasts before becoming a speech fanatic. And I'm like "oh wait let me just get down
+
+ 144
+ 00:15:15,280 --> 00:15:20,880
+ let me just get these ideas out of my head." And that's when I'll go on my speech binges. But those
+
+ 145
+ 00:15:20,880 --> 00:15:26,240
+ are like once every few months - like not frequently. But I said okay let's just say if I'm gonna price
+
+ 146
+ 00:15:26,240 --> 00:15:35,440
+ out cloud STT. If I was like dedicated every second of every waking hour to transcribing for some
+
+ 147
+ 00:15:35,440 --> 00:15:41,600
+ odd reason. I mean, I'd have to like eat and use the toilet! Like, you know there's only so many hours
+
+ 148
+ 00:15:41,600 --> 00:15:48,480
+ I'm awake for. So like let's just say a maximum of like 40 hour 45 minutes in the hours and I said
+
+ 149
+ 00:15:48,480 --> 00:15:55,360
+ all right let's just say 50. Who knows? You're dictating on the toilet! We do it! So you could just do 60.
+
+ 150
+ 00:15:55,440 --> 00:16:02,560
+ But whatever I did - and every day. Like you're going flat out, seven days a week dictating nonstop
+
+ 151
+ 00:16:02,560 --> 00:16:08,640
+ as like "what's my monthly API bill gonna be at this price?" And it came out to like 70 or
+
+ 152
+ 00:16:08,640 --> 00:16:14,960
+ 80 bucks. And I was like, well that would be an extraordinary amount of dictation! And I would hope
+
+ 153
+ 00:16:15,600 --> 00:16:21,680
+ that there was some compelling reason more worth more than 70 dollars that I embarked upon that.
+
+ 154
+ 00:16:22,640 --> 00:16:26,959
+ So given that that's kind of the max point for me I said that's actually very very affordable.
+
+ 155
+ 00:16:27,920 --> 00:16:32,640
+ Now you're gonna if you want to spec out the costs and you want to do the post processing
+
+ 156
+ 00:16:33,599 --> 00:16:39,199
+ that I really do feel is valuable that's gonna cost more as well. Unless you're using
+
+ 157
+ 00:16:40,160 --> 00:16:47,839
+ Gemini which needless to say as a random person sitting in Jerusalem I have no affiliation nor with
+
+ 158
+ 00:16:47,840 --> 00:16:54,800
+ Google nor Anthropic nor Gemini nor any major tech vendor for that matter. Um I like Gemini
+
+ 159
+ 00:16:54,800 --> 00:17:00,080
+ not so much as a everyday model. Um it's kind of underwhelmed in that respect I would say.
+
+ 160
+ 00:17:00,080 --> 00:17:05,920
+ But for multimodal I think it's got a lot to offer. And I think that the transcribing functionality
+
+ 161
+ 00:17:05,920 --> 00:17:13,280
+ whereby it can um process audio with the system prompt and both give you a transcription that's
+
+ 162
+ 00:17:13,280 --> 00:17:20,079
+ cleaned up - that reduces two steps to one. And that for me is a very very big deal. And uh I feel like
+
+ 163
+ 00:17:20,079 --> 00:17:27,280
+ even Google hasn't really sort of thought through how useful the that modality is and what kind of
+
+ 164
+ 00:17:27,280 --> 00:17:33,280
+ use cases uh you can achieve with it. Because I found in the course of this year just an endless
+
+ 165
+ 00:17:33,280 --> 00:17:40,399
+ list of really kind of system prompt system prompt stuff that I can say "okay I've used it
+
+ 166
+ 00:17:40,560 --> 00:17:45,920
+ to capture context data for AI which is literally I might speak for if I wanted to have a good
+
+ 167
+ 00:17:45,920 --> 00:17:52,560
+ bank of context data about who knows my childhood uh more realistically maybe my career goals
+
+ 168
+ 00:17:53,520 --> 00:17:59,520
+ something that would just be like really boring to type out so I'll just like sit in my car
+
+ 169
+ 00:17:59,520 --> 00:18:06,640
+ and record it for 10 minutes. And that 10 minutes you get a lot of information in um emails which is
+
+ 170
+ 00:18:06,640 --> 00:18:13,200
+ short text uh just there is a whole bunch. And all these workflows kind of require a little bit
+
+ 171
+ 00:18:13,200 --> 00:18:18,320
+ of treatment afterwards and different treatment. My context pipeline is kind of like just extract the
+
+ 172
+ 00:18:18,320 --> 00:18:23,520
+ bare essential. So you end up with me talking very loosely about sort of what I've done in my career,
+
+ 173
+ 00:18:23,520 --> 00:18:30,000
+ where I've worked, where I might like to work. And it goes - it condenses that down to very robotic language
+
+ 174
+ 00:18:30,000 --> 00:18:36,000
+ that is easy to chunk, parse, and maybe put into a vector database. "Daniel has worked in technology!
+
+ 175
+ 00:18:36,080 --> 00:18:42,400
+ Daniel is a has been working in marketing." Stuff like that. That's not how you would speak um but I
+
+ 176
+ 00:18:42,400 --> 00:18:48,480
+ figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes. And this
+
+ 177
+ 00:18:48,480 --> 00:18:56,880
+ is actually a success because I wasted 20 minutes of my uh of the evening speaking into microphone and
+
+ 178
+ 00:18:56,880 --> 00:19:02,720
+ the levels were shot and it uh it was clipping. And I said I can't really do an evaluation. I have to
+
+ 179
+ 00:19:02,720 --> 00:19:09,440
+ be fair. I have to give the models a chance to do their thing. Uh what am I hoping to achieve in this?
+
+ 180
+ 00:19:09,440 --> 00:19:14,960
+ Okay my fine tune was a dud as mentioned. Deepgram STT - I'm really really hopeful that this prototype
+
+ 181
+ 00:19:14,960 --> 00:19:20,560
+ will work. And it's a build in public open source. So anyone is welcome to use it if I make anything good
+
+ 182
+ 00:19:21,600 --> 00:19:28,000
+ But what was really exciting for me last night when after hours of um trying my own prototypes, seeing
+
+ 183
+ 00:19:28,080 --> 00:19:33,120
+ someone just made something that works like that. You know, you're not going to have to build a custom
+
+ 184
+ 00:19:34,240 --> 00:19:40,960
+ Conda environment and image. I have AMD GPU which makes things much more complicated. I didn't find it
+
+ 185
+ 00:19:41,840 --> 00:19:46,400
+ And I was about to give up and I said "all right. Let me just give Deepgram's Linux thing a shot
+
+ 186
+ 00:19:47,040 --> 00:19:50,960
+ and if this doesn't work um I'm just going to go back to trying to vibe code something myself."
+
+ 187
+ 00:19:51,600 --> 00:19:57,360
+ And when I ran the script - I was using Claude Code to do the installation process -
+
+ 188
+ 00:19:58,160 --> 00:20:02,800
+ it ran the script and "oh my gosh, it works!" Just like that! Uh the tricky thing
+
+ 189
+ 00:20:04,480 --> 00:20:12,480
+ for all those ones who want to know all the nitty gritty details um was that I don't think it was actually
+
+ 190
+ 00:20:12,480 --> 00:20:18,160
+ struggling with transcription but pasting. Wayland makes life very hard. And I think there was
+
+ 191
+ 00:20:18,160 --> 00:20:22,800
+ something not running at the right time. Anyway, Deepgram - I looked at how they actually handled
+
+ 192
+ 00:20:22,960 --> 00:20:28,960
+ that because it worked out of the box when other stuff didn't. And it was quite a clever little mechanism
+
+ 193
+ 00:20:29,520 --> 00:20:34,560
+ and but more so than that the accuracy was brilliant. Now, what am I doing here? This is going to be a 20
+
+ 194
+ 00:20:34,560 --> 00:20:44,399
+ minute audio uh sample and I'm I think I've done one or two of these before but I did it with
+
+ 195
+ 00:20:45,360 --> 00:20:51,120
+ short, snappy voice notes. This is kind of long form. This actually might be a better approximation
+
+ 196
+ 00:20:51,120 --> 00:20:55,040
+ for what's useful to me than voice memos like "I need to buy three
+
+ 197
+ 00:20:55,840 --> 00:20:59,840
+ liters of milk tomorrow and pita bread." Which is probably how like half my voice note
+
+ 198
+ 00:20:59,840 --> 00:21:04,399
+ voice notes sound. Like if anyone were to I don't know like find my phone they'd be like "this is
+
+ 199
+ 00:21:04,399 --> 00:21:09,280
+ the most boring person in the world!" Although actually there are some like kind of uh journaling
+
+ 200
+ 00:21:09,280 --> 00:21:14,080
+ thoughts as well. But it's a lot of content like that. And the probably for the evaluation the most
+
+ 201
+ 00:21:14,080 --> 00:21:22,560
+ useful thing is slightly obscure tech: Github, Nuclino, Hugging Face. Not so obscure that it's not
+
+ 202
+ 00:21:22,560 --> 00:21:27,360
+ going to have a chance of knowing it. But hopefully sufficiently well known that the models should get
+
+ 203
+ 00:21:27,360 --> 00:21:32,800
+ it. Uh I tried to do a little bit of speaking really fast and speaking very slowly. I would say in
+
+ 204
+ 00:21:32,800 --> 00:21:38,960
+ general I've spoken delivered this at a faster pace than I usually would owing to strong coffee
+
+ 205
+ 00:21:39,120 --> 00:21:44,240
+ flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is
+
+ 206
+ 00:21:44,240 --> 00:21:49,920
+ background noise. Which in my first take that I had to get rid of my wife came in with my son
+
+ 207
+ 00:21:49,920 --> 00:21:55,680
827
+ for a good night kiss. And that actually would have been super helpful to get in because it was
828
+
829
+ 208
830
+ 00:21:56,400 --> 00:22:01,600
831
+ non-diarised. Or if we had diarisation a female I could say I want the male voice and that
832
+
833
+ 209
834
+ 00:22:01,600 --> 00:22:06,240
835
+ wasn't intended for transcription um. And we're not going to get background noise like people
836
+
837
+ 210
838
+ 00:22:06,240 --> 00:22:11,840
839
+ honking their horns. Which is something I've done in my main dataset where I am trying to go back
840
+
841
+ 211
842
+ 00:22:11,840 --> 00:22:16,880
843
+ to some of my voice notes, annotate them, and run a benchmark. But this is going to be just a pure
844
+
845
+ 212
846
+ 00:22:17,680 --> 00:22:24,960
847
+ quick test. And as someone I'm working on a voice note idea that's my sort of end
848
+
849
+ 213
850
+ 00:22:26,560 --> 00:22:30,320
851
+ motivation besides thinking it's an absolute outstanding technology that's coming to
852
+
853
+ 214
854
+ 00:22:30,960 --> 00:22:36,240
855
+ viability and really - I know this sounds cheesy - can actually have a very transformative effect.
856
+
857
+ 215
858
+ 00:22:37,120 --> 00:22:42,720
859
+ It's, you know, voice technology has been life changing for folks living with
860
+
861
+ 216
862
+ 00:22:44,000 --> 00:22:49,760
863
+ disabilities. And I think there's something really nice about the fact that it can also benefit
864
+
865
+ 217
866
+ 00:22:50,480 --> 00:22:54,639
867
+ you know folks who are able-bodied. And like we can all in different ways
868
+
869
+ 218
870
+ 00:22:55,120 --> 00:23:02,560
871
+ um make this tech as useful as possible regardless of the exact way that we're using it um. And I
872
+
873
+ 219
874
+ 00:23:02,560 --> 00:23:07,760
875
+ think there's something very powerful in that. And it can be very cool um I see huge potential. What
876
+
877
+ 220
878
+ 00:23:07,760 --> 00:23:14,480
879
+ excites me about voice tech - a lot of things actually. Firstly the fact that it's cheap and accurate
880
+
881
+ 221
882
+ 00:23:14,480 --> 00:23:19,040
883
+ as I mentioned at the very start of this um. And it's getting better and better with stuff like
884
+
885
+ 222
886
+ 00:23:19,040 --> 00:23:24,160
887
+ accent handling um. I'm not sure my my fine tune will actually ever come to fruition in the
888
+
889
+ 223
890
+ 00:23:24,160 --> 00:23:30,240
891
+ sense that I'll use it day to day as I imagine and get like superb flawless words error rates. Because
892
+
893
+ 224
894
+ 00:23:30,240 --> 00:23:37,680
895
+ I'm just kind of skeptical about local speech to tech as I mentioned. And I think the pace of
896
+
897
+ 225
898
+ 00:23:37,680 --> 00:23:42,720
899
+ innovation and improvement in the models. The main reasons for fine tuning from what I've seen
900
+
901
+ 226
902
+ 00:23:44,320 --> 00:23:50,480
903
+ have been people who are something that really blows blows my mind about ASR is the idea that it's
904
+
905
+ 227
906
+ 00:23:50,480 --> 00:24:00,080
907
+ inherently a-llingual. Or multilingual. Phonetic-based. So as folks who use speak very obscure languages
908
+
909
+ 228
910
+ 00:24:00,080 --> 00:24:04,800
911
+ that there may be there there might be a paucity of training data or almost none at all. And therefore
912
+
913
+ 229
914
+ 00:24:04,800 --> 00:24:11,440
915
+ the accuracy is significantly reduced. Or folks in very critical environments. I know there are
916
+
917
+ 230
918
+ 00:24:11,440 --> 00:24:17,680
919
+ this is used extensively in medical transcription and dispatcher work as um you know the call
920
+
921
+ 231
922
+ 00:24:17,680 --> 00:24:24,000
923
+ centers who send out ambulances etc where accuracy is absolutely paramount. And in the case of doctors,
924
+
925
+ 232
926
+ 00:24:24,560 --> 00:24:29,680
927
+ radiologists they might be using very specialized vocab all the time. So those are kind of the main
928
+
929
+ 233
930
+ 00:24:29,680 --> 00:24:35,680
931
+ two things. And I'm not sure that really just for trying to make it better on a few random tech words
932
+
933
+ 234
934
+ 00:24:35,680 --> 00:24:41,840
935
+ with my slightly. I mean, I have an accent! But like, not you know an accent that a few other million
936
+
937
+ 235
938
+ 00:24:41,840 --> 00:24:50,720
939
+ people have. Ish. I'm not sure that my little fine tune is going to actually like the bump in
940
+
941
+ 236
942
+ 00:24:50,720 --> 00:24:55,760
943
+ word error reduction if I ever actually figure out how to do it and get it up to the cloud. By the
944
+
945
+ 237
946
+ 00:24:55,760 --> 00:25:00,879
947
+ time we've done that I suspect that the next generation of ASR will just be so good that it will
948
+
949
+ 238
950
+ 00:25:00,879 --> 00:25:07,040
951
+ kind of be "nah, well, that would be cool if it worked out. But I'll just use this instead." So that's going to be
952
+
953
+ 239
954
+ 00:25:07,280 --> 00:25:15,040
955
+ it for today's episodes of voice training data single long shot evaluation. Who am I going to
956
+
957
+ 240
958
+ 00:25:15,040 --> 00:25:21,200
959
+ compare? Whisper is always good as a benchmark. But I'm more interested in seeing Whisper head-to-head
960
+
961
+ 241
962
+ 00:25:21,200 --> 00:25:27,680
963
+ with two things really. One is Whisper variants. So you've got these projects like Faster Whisper,
964
+
965
+ 242
966
+ 00:25:29,120 --> 00:25:34,000
967
+ Distilled Whisper. It's a bit confusing. There's a whole bunch of them. And the emerging ASRs which
968
+
969
+ 243
970
+ 00:25:34,160 --> 00:25:38,960
971
+ are also a thing. My intention for this is I'm not sure I'm going to have the time in any point
972
+
973
+ 244
974
+ 00:25:38,960 --> 00:25:46,320
975
+ of the foreseeable future to go back through this whole episode and create a proper source truth or I fix
976
+
977
+ 245
978
+ 00:25:47,520 --> 00:25:53,760
979
+ everything. I might do it if I can get one transcription that's sufficiently close to perfection.
980
+
981
+ 246
982
+ 00:25:54,960 --> 00:26:00,560
983
+ But what I would actually love to do on Hugging Face I think would be a great probably how I might
984
+
985
+ 247
986
+ 00:26:00,560 --> 00:26:08,080
987
+ visualize this is having the audio waveform play. And then have the transcript for each model below
988
+
989
+ 248
990
+ 00:26:08,080 --> 00:26:16,320
991
+ it. And maybe even a like you know to scale. And maybe even a local one as well like local Whisper
992
+
993
+ 249
994
+ 00:26:16,320 --> 00:26:23,919
995
+ versus Open AI API etc. And I can then actually listen back to segments. Or anyone who wants to
996
+
997
+ 250
998
+ 00:26:24,000 --> 00:26:30,000
999
+ can listen back to segments of this recording and see where a particular model struggled
1000
+
1001
+ 251
1002
+ 00:26:30,000 --> 00:26:35,600
1003
+ while others didn't, as well as the sort of headline finding of which had the best WER. But that would
1004
+
1005
+ 252
1006
+ 00:26:35,600 --> 00:26:41,120
1007
+ require the source of truth. Okay, that's it. Hope this was, I don't know, maybe useful for other
1008
+
1009
+ 253
1010
+ 00:26:41,120 --> 00:26:46,480
1011
+ folks interested in STT. You want to see - that I always feel think I've just said as something I
1012
+
1013
+ 254
1014
+ 00:26:46,480 --> 00:26:52,800
1015
+ didn't intend to. STT I said for those listening carefully! Including hopefully the models themselves!
1016
+
1017
+ 255
1018
+ 00:26:53,280 --> 00:26:58,960
1019
+ This has been myself Daniel Rosehill. For more um jumbled repositories about my uh roving interests
1020
+
1021
+ 256
1022
+ 00:26:58,960 --> 00:27:06,639
1023
+ in AI. But particularly agentic AI, MCP, and voice tech, you can find me on Github, Hugging Face.
1024
+
1025
+ 257
1026
+ 00:27:08,080 --> 00:27:14,000
1027
+ Where else? DanielRosehilll.com which is my personal website. As well as this podcast whose name
1028
+
1029
+ 258
1030
+ 00:27:14,000 --> 00:27:17,280
1031
+ I sadly cannot remember! Until next time, thanks for listening!
1032
+
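For anyone reusing `truth_1.srt` outside this repo: the cue layout in the diff above (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, caption text, and a blank-line separator) is plain SubRip. A minimal parsing sketch; the function names and regex here are illustrative, not anything shipped in this repository:

```python
import re

# Matches an SRT timing line, e.g. "00:19:14,960 --> 00:19:20,560"
TIME_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def to_seconds(h, m, s, ms):
    """Convert the four captured timestamp fields into seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(text):
    """Parse SRT text into a list of (start_sec, end_sec, caption) tuples."""
    cues = []
    # Cues are blank-line separated: index line, timing line, caption line(s)
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2:
            continue
        m = TIME_RE.search(lines[1])
        if not m:
            continue
        start = to_seconds(*m.groups()[:4])
        end = to_seconds(*m.groups()[4:])
        cues.append((start, end, " ".join(lines[2:])))
    return cues

sample = """182
00:19:21,600 --> 00:19:28,000
But what was really exciting for me last night

183
00:19:28,080 --> 00:19:33,120
someone just made something that works like that."""

print(parse_srt(sample)[0][2])  # → "But what was really exciting for me last night"
```

In practice a maintained library such as `srt` or `pysrt` would handle multi-line captions and malformed cues more robustly; this sketch is only meant to show the structure of the file above.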
data/ground-truth/truth_1.txt ADDED
@@ -0,0 +1 @@
+ Hello and welcome to a audio dataset consisting of one single episode of a non-existent podcast. Or, it eh, I may append this to a podcast that I set up recently regarding my with my thoughts on speech tech and AI in particular. More AI and generative AI I would, I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation as they might say, for different speech to text models. And I'm doing this because I I thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in the elusive task of fine-tuning Whisper. Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking. I might whisper something at some points as well. And I'll go back to speaking loud in different parts. I'm going to sound really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to put a speech to text model through its paces, which is trying to make sense of "is this guy just rambling on incoherently in one long sentence?" Or "are these just actually a series of step standalone stepalone standalone sentences?" And how is it going to handle stepalone?! That's not a word! What happens when you use speech to text and you use a fake word and then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why did why was I trying to fine tune whisper? And what is Whisper? As I said, I'm going to try to record this at a couple of different levels of technicality - for folks who are in the normal world and not totally stuck down the rabbit hole of AI. Which I have to say is a really wonderful rabbit hole to be to be down. It's a really interesting area. 
And speech and voice tech is the aspect of it that I find actually most - I'm not sure I would say the most interesting because there's just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I am persevering hard with the task of trying to get a good solution working for Linux. Which if anyone actually does listen to this not just for the training data and for the actual content, this has sparked. I had, besides the fine tune not working, well that was the failure. I used Claude Code. Because one thinks these days that there is nothing short of solving, you know, the reason of life or something that Claude and agentic AI can't do. Which is not really the case. It does seem that way sometimes. But it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data: basically speaking just random things for three minutes. And it was actually kind of tedious because the texts were really weird. Some of them were it was like, it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was like, "okay, no, I'm just gonna have to find something else to read." So I used I created with AI Studio, vibe coded, a synthetic text generator, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note. Like I'm recording an email. Give me a short story to read. Give me prose. So I came up with all these different things and I added a little timer to it so I could see how close I was to one hour. And I spent like an hour one afternoon or probably two hours by the time you do retakes and whatever because you want to. 
It gave me a source of truth which I'm not sure if that's the scientific way to approach this topic of gathering training data but I thought made sense. I have a lot of audio data from recording voice notes which I've also kind of used been experimenting with using for a different purpose. It's slightly different - annotating task types. It's more text classification experiment. Or well it's more than that actually I'm working on a voice app. So it's a prototype I guess is really more accurate. But you can do that and you can work backwards. You listen back to a voice note and you painfully go through one of those - transcribing where you start and stop and scrub around it and you fix the errors . But it's really really boring to do that. So I thought it would be less tedious in the long term if I just recorded the source of truth. So it gave me these three minute snippets. I recorded them and saved an MP3 and a TXT in the same folder and I created an hour of that data. So I was very hopeful - quitely, you know, a little bit hopeful - that I would be able that I could actually fine tune Whisper. I want to fine tune Whisper because when I got into voice tech last November my wife was in the US and I was alone at home. And when crazy people like me do really wild things like use voice to tech technology that was basically when I started doing it. I didn't feel like a crazy person speaking to myself. And my expectations weren't that high. I used speech tech now and again tried it out. I was like "it'd be really cool if you could just like speak into your computer." And whatever I tried out that had Linux support was just - it was not good, basically. And this blew me away from the first go. I mean it wasn't 100% accurate out of the box. And it took work. But it was good enough that there was a solid foundation. And it kind of passed that pivot point that it's actually worth doing this. You know, there's a point where it's. 
So like the transcript is you don't have to get 100% accuracy for it to be worth your time for speech to text to be a worthwhile addition to your productivity. But you do need to get above let's say I don't know 85%. If it's 60% or 50% you inevitably say "screw it I'll just type it."Because you end up missing errors in the transcript and it becomes actually worse. You end up in a worse position than you started with it. That's been my experience. So I was like "oh, this is actually really really good. Now how did that happen?" The answer is ASR, Whisper being open-sourced. and the transformer architecture if you want to go back to the to the underpinnings. Which really blows my mind. And it's on my list to read through that paper 'All You Need Is Attention' as attentively as can be done with my limited brain. Because it's super high-level stuff - super advanced stuff I mean. But that I think of all the things that are fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating that few people are like "hang on, you've got this thing that can speak to you like a chatbot - an LLM. Then you've got image generation. Okay, so firstly those two things on the surface have nothing in common. So like, "how are they ... how did THAT just happen all at the same time?" And then when you extend that further you're like Suno right. You can sing a song and AI will like come up with an instrumental. And then you've got Whisper. And then you're like "wait a second how did all this stuff like if it's all AI what's like, there has to be some commonality. Otherwise these are four these are totally different technologies on the surface of it and the transformer architecture is as far as I know the answer. And I can't even say I can't even pretend that I really understand what the transformer architecture means in depth. But I have scanned it. And as I said I want to print it and really kind of think over it's at some point. 
And I'll probably feel bad about myself I think! Because weren't those guys in their in their 20s like? That's crazy! I think I asked ChatGPT once "who were the? Who wrote that paper and how old were they when it was published in Arxiv?" And I was expecting like, I don't know. What do you what do you imagine? I personally imagine kind of like you know you have these breakthroughs during COVID and things like that where like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring in labs and wearily writing and publishing in kind of obscure academic publications and they finally like hit a big or win a Nobel Prize. And then they're household household names. So I that was kind of what I had in mind. That was the mental image I'd formed of the birth of Arxiv. Like, I wasn't expecting 20-somethings in San Francisco! Though I thought that was both very very funny, very cool, and actually kind of inspiring. It's nice to think that people who you know just you might put them in the kind of milieu or bubble or world that you are in or credibly in through you know the series of connections that are coming up with such literally world changing innovations. So that was I thought anyway that that was cool. Okay voice training data. How are we doing? We're about 10 minutes. And I'm still talking about voice technology! So Whisper was brilliant. And I was so excited that I was my first instinct was to like guess it's like "oh my gosh I have to get like a really good microphone for this." So I didn't go on a spending spree because I said I'm gonna have to just wait a month and see if I still use this." And it just kind of became it's become really part of my daily routine. Like, if I'm writing an email I'll record a voice note and then I'll develop it and it's nice to see that everyone is like developing the same things in parallel. Like, that's maybe kind of a weird thing to say. 
But when I look, I kind of came when I started working on this these prototypes on GitHub, which is where I just kind of share very freely and loosely ideas and you know first iterations on concepts. And for want of a better word I called it like "LLM post processing." Or cleanup. Or basically a system prompt that after you get back the raw text from Whisper, you run it through model and say "okay this is crappy text like add sentence structure and you know fix it up." And now when I'm exploring the different tools that are out there that people have built, I see quite a number of projects have basically you know done the same thing. Lest that be misconstrued, I'm not saying for a millisecond that I inspired them. I'm sure this has been a thing that's been integrated into tools for a while. But it's, it's the kind of thing that when you start using these tools every day the need for it is almost instantly apparent. Because text that doesn't have any punctuation or paragraph spacing takes a long time to you know, it takes so long to get it into a presentable email that, again, it moves speech tech into that, before that inflection point where you're like "nah it's just not worth." It it's like it'll just be quicker to type this. So it's it's a big - it's a little touch that actually is a big deal. So I was on Whisper and I've been using Whisper and I kind of early on find a couple of tools. I couldn't find what I was looking for on Linux which is basically just something that'll run in the background. You'll give it an API key and it'll just like transcribe. With like a little key to start and stop the dictation. And the issues were I discovered that like most people involved in creating these projects were very much focused on local models. And running Whisper locally because you can. And I tried that a bunch of times and just never got results that were as good as the cloud.
And when I began looking at the cost of the speech to text APIs and what I was spending just thought there it's actually in my opinion just one of the better deals in API spending and in cloud. Like, it's just not that expensive for very, very good models that are much more. You know, you're going to be able to run the full model, the latest model versus whatever you can run on your average GPU. Unless you want to buy a crazy GPU. It doesn't really make sense to me. Now, I privacy is another concern that I know is kind of like a very much a separate thing. That people just don't want their voice data and their voice leaving their local environment. Maybe for regulatory reasons as well. But I'm not in that. I'm don't really really care about people listening to my grocery list consisting of reminding myself that I need to buy more beer, Cheetos and hummus. Which is kind of the three three staples of my diet during periods of poor nutrition. But the kind of stuff that I transcribe most it's just not it's not a it's not a privacy thing. I'm not that sort of sensitive about. And I don't do anything so you know sensitive or secure that requires airgapping. So I looked at the pricing and especially the kind of older models mini. Some of them are very very affordable. And I did a back of the, I did a calculation once with ChatGPT and I was like "okay, this is the, this is the API price for I can't remember whatever the model was. Let's say I just go at it like nonstop which rarely happens. Probably I would say an average I might dictate 30 to 60 minutes per day if I was probably summing up the emails, documents, outlines. Which is a lot. But it's it's still a fairly modest amount. And I was like well some days I do go on like one or two days where I've been usually when I'm like kind of out of the house and just have something like I've nothing else to do. Like if I'm at a hospital. We have a newborn. And you're waiting for like hours and hours for an appointment. 
And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like "oh wait let me just get down let me just get these ideas out of my head." And that's when I'll go on my speech binges. But those are like once every few months - like not frequently. But I said okay let's just say if I'm gonna price out cloud STT. If I was like dedicated every second of every waking hour to transcribing for some odd reason. I mean, I'd have to like eat and use the toilet! Like, you know there's only so many hours I'm awake for. So like let's just say a maximum of like 40 hour 45 minutes in the hours and I said all right let's just say 50. Who knows? You're dictating on the toilet! We do it! So you could just do 60. But whatever I did - and every day. Like you're going flat out, seven days a week dictating nonstop as like "what's my monthly API bill gonna be at this price?" And it came out to like 70 or 80 bucks. And I was like, well that would be an extraordinary amount of dictation! And I would hope that there was some compelling reason more worth more than 70 dollars that I embarked upon that. So given that that's kind of the max point for me I said that's actually very very affordable. Now you're gonna if you want to spec out the costs and you want to do the post processing that I really do feel is valuable that's gonna cost more as well. Unless you're using Gemini which needless to say as a random person sitting in Jerusalem I have no affiliation nor with Google nor Anthropic nor Gemini nor any major tech vendor for that matter. Um I like Gemini not so much as a everyday model. Um it's kind of underwhelmed in that respect I would say. But for multimodal I think it's got a lot to offer. And I think that the transcribing functionality whereby it can um process audio with the system prompt and both give you a transcription that's cleaned up - that reduces two steps to one. And that for me is a very very big deal. 
And uh I feel like even Google hasn't really sort of thought through how useful the that modality is and what kind of use cases uh you can achieve with it. Because I found in the course of this year just an endless list of really kind of system prompt system prompt stuff that I can say "okay I've used it to capture context data for AI which is literally I might speak for if I wanted to have a good bank of context data about who knows my childhood uh more realistically maybe my career goals something that would just be like really boring to type out so I'll just like sit in my car and record it for 10 minutes. And that 10 minutes you get a lot of information in um emails which is short text uh just there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards and different treatment. My context pipeline is kind of like just extract the bare essential. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work. And it goes - it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database. "Daniel has worked in technology! Daniel is a has been working in marketing." Stuff like that. That's not how you would speak um but I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes. And this is actually a success because I wasted 20 minutes of my uh of the evening speaking into microphone and the levels were shot and it uh it was clipping. And I said I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. Uh what am I hoping to achieve in this? Okay my fine tune was a dud as mentioned. Deepgram STT - I'm really really hopeful that this prototype will work. And it's a build in public open source. 
So anyone is welcome to use it if I make anything good But what was really exciting for me last night when after hours of um trying my own prototypes, seeing someone just made something that works like that. You know, you're not going to have to build a custom Conda environment and image. I have AMD GPU which makes things much more complicated. I didn't find it And I was about to give up and I said "all right. Let me just give Deepgram's Linux thing a shot and if this doesn't work um I'm just going to go back to trying to vibe code something myself." And when I ran the script - I was using Claude Code to do the installation process - it ran the script and "oh my gosh, it works!" Just like that! Uh the tricky thing for all those ones who want to know all the nitty gritty details um was that I don't think it was actually struggling with transcription but pasting. Wayland makes life very hard. And I think there was something not running at the right time. Anyway, Deepgram - I looked at how they actually handled that because it worked out of the box when other stuff didn't. And it was quite a clever little mechanism and but more so than that the accuracy was brilliant. Now, what am I doing here? This is going to be a 20 minute audio uh sample and I'm I think I've done one or two of these before but I did it with short, snappy voice notes. This is kind of long form. This actually might be a better approximation for what's useful to me than voice memos like "I need to buy three liters of milk tomorrow and pita bread." Which is probably how like half my voice note voice notes sound. Like if anyone were to I don't know like find my phone they'd be like "this is the most boring person in the world!" Although actually there are some like kind of uh journaling thoughts as well. But it's a lot of content like that. And the probably for the evaluation the most useful thing is slightly obscure tech: Github, Nuclino, Hugging Face. 
Not so obscure that it's not going to have a chance of knowing it. But hopefully sufficiently well known that the models should get it. Uh I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general I've spoken delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise. Which in my first take that I had to get rid of my wife came in with my son for a good night kiss. And that actually would have been super helpful to get in because it was non-diarised. Or if we had diarisation a female I could say I want the male voice and that wasn't intended for transcription um. And we're not going to get background noise like people honking their horns. Which is something I've done in my main dataset where I am trying to go back to some of my voice notes, annotate them, and run a benchmark. But this is going to be just a pure quick test. And as someone I'm working on a voice note idea that's my sort of end motivation besides thinking it's an absolute outstanding technology that's coming to viability and really - I know this sounds cheesy - can actually have a very transformative effect. It's, you know, voice technology has been life changing for folks living with disabilities. And I think there's something really nice about the fact that it can also benefit you know folks who are able-bodied. And like we can all in different ways um make this tech as useful as possible regardless of the exact way that we're using it um. And I think there's something very powerful in that. And it can be very cool um I see huge potential. What excites me about voice tech - a lot of things actually. Firstly the fact that it's cheap and accurate as I mentioned at the very start of this um. And it's getting better and better with stuff like accent handling um. 
I'm not sure my my fine tune will actually ever come to fruition in the sense that I'll use it day to day as I imagine and get like superb flawless words error rates. Because I'm just kind of skeptical about local speech to tech as I mentioned. And I think the pace of innovation and improvement in the models. The main reasons for fine tuning from what I've seen have been people who are something that really blows blows my mind about ASR is the idea that it's inherently a-llingual. Or multilingual. Phonetic-based. So as folks who use speak very obscure languages that there may be there there might be a paucity of training data or almost none at all. And therefore the accuracy is significantly reduced. Or folks in very critical environments. I know there are this is used extensively in medical transcription and dispatcher work as um you know the call centers who send out ambulances etc where accuracy is absolutely paramount. And in the case of doctors, radiologists they might be using very specialized vocab all the time. So those are kind of the main two things. And I'm not sure that really just for trying to make it better on a few random tech words with my slightly. I mean, I have an accent! But like, not you know an accent that a few other million people have. Ish. I'm not sure that my little fine tune is going to actually like the bump in word error reduction if I ever actually figure out how to do it and get it up to the cloud. By the time we've done that I suspect that the next generation of ASR will just be so good that it will kind of be "nah, well, that would be cool if it worked out. But I'll just use this instead." So that's going to be it for today's episodes of voice training data single long shot evaluation. Who am I going to compare? Whisper is always good as a benchmark. But I'm more interested in seeing Whisper head-to-head with two things really. One is Whisper variants. So you've got these projects like Faster Whisper, Distilled Whisper. 
It's a bit confusing. There's a whole bunch of them. And the emerging ASRs which are also a thing. My intention for this is I'm not sure I'm going to have the time in any point of the foreseeable future to go back through this whole episode and create a proper source truth or I fix everything. I might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face I think would be a great probably how I might visualize this is having the audio waveform play. And then have the transcript for each model below it. And maybe even a like you know to scale. And maybe even a local one as well like local Whisper versus Open AI API etc. And I can then actually listen back to segments. Or anyone who wants to can listen back to segments of this recording and see where a particular model struggled while others didn't, as well as the sort of headline finding of which had the best WER. But that would require the source of truth. Okay, that's it. Hope this was, I don't know, maybe useful for other folks interested in STT. You want to see - that I always feel think I've just said as something I didn't intend to. STT I said for those listening carefully! Including hopefully the models themselves! This has been myself Daniel Rosehill. For more um jumbled repositories about my uh roving interests in AI. But particularly agentic AI, MCP, and voice tech, you can find me on Github, Hugging Face. Where else? DanielRosehilll.com which is my personal website. As well as this podcast whose name I sadly cannot remember! Until next time, thanks for listening!
data/inference/benchmark_results.json ADDED
@@ -0,0 +1,134 @@
+ {
+ "ground_truth_file": "/home/daniel/repos/github/Long-Form-Audio-Eval/data/ground-truth/truth_1.txt",
+ "total_runs_evaluated": 8,
+ "results": [
+ {
+ "run_id": "run-1",
+ "run_type": "local-stt",
+ "provider": "local",
+ "model": "whisper-base",
+ "engine": "Buzz",
+ "metrics": {
+ "wer": 17.52,
+ "cer": 5.38,
+ "word_accuracy": 82.48,
+ "insertions": 44,
+ "deletions": 62,
+ "substitutions": 726,
+ "hits": 3960
+ }
+ },
+ {
+ "run_id": "run-2",
+ "run_type": "local-stt",
+ "provider": "local",
+ "model": "whisper-tiny",
+ "engine": "Buzz",
+ "metrics": {
+ "wer": 22.49,
+ "cer": 8.39,
+ "word_accuracy": 77.51,
+ "insertions": 82,
+ "deletions": 155,
+ "substitutions": 831,
+ "hits": 3762
+ }
+ },
+ {
+ "run_id": "run-3",
+ "run_type": "local-stt",
+ "provider": "local",
+ "model": "whisper-base",
+ "engine": "Buzz",
+ "metrics": {
+ "wer": 17.52,
+ "cer": 5.38,
+ "word_accuracy": 82.48,
+ "insertions": 44,
+ "deletions": 62,
+ "substitutions": 726,
+ "hits": 3960
+ }
+ },
+ {
+ "run_id": "manual-1",
+ "run_type": "cloud-stt",
+ "provider": "gladia",
+ "model": "solaria-1",
+ "engine": "api",
+ "metrics": {
+ "wer": 20.83,
+ "cer": 6.3,
+ "word_accuracy": 79.17,
+ "insertions": 100,
+ "deletions": 92,
+ "substitutions": 797,
+ "hits": 3859
+ }
+ },
+ {
+ "run_id": "manual-2",
+ "run_type": "cloud-stt",
+ "provider": "deepgram",
+ "model": "nova-3",
+ "engine": "api",
+ "metrics": {
+ "wer": 18.72,
+ "cer": 7.33,
+ "word_accuracy": 81.28,
+ "insertions": 60,
+ "deletions": 214,
+ "substitutions": 615,
+ "hits": 3919
+ }
+ },
+ {
+ "run_id": "manual-3",
+ "run_type": "cloud-stt",
+ "provider": "assemblyai",
+ "model": "best",
+ "engine": "api",
+ "metrics": {
+ "wer": 18.79,
+ "cer": 6.24,
+ "word_accuracy": 81.21,
+ "insertions": 64,
+ "deletions": 156,
+ "substitutions": 672,
+ "hits": 3920
+ }
+ },
+ {
+ "run_id": "manual-4",
+ "run_type": "cloud-stt",
+ "provider": "speechmatics",
+ "model": "slam-1-global-english",
+ "engine": "api",
+ "metrics": {
+ "wer": 21.65,
+ "cer": 7.15,
+ "word_accuracy": 78.35,
+ "insertions": 158,
+ "deletions": 51,
+ "substitutions": 819,
+ "hits": 3878
+ }
+ },
+ {
+ "run_id": "manual-5",
+ "run_type": "cloud-stt",
+ "provider": "openai",
+ "model": "whisper-1",
+ "engine": "api",
+ "metrics": {
+ "wer": 19.27,
+ "cer": 6.4,
+ "word_accuracy": 80.73,
+ "insertions": 114,
+ "deletions": 106,
+ "substitutions": 695,
+ "hits": 3947
+ }
+ }
+ ]
+ }
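For readers skimming the metrics above: the reported `wer` and `word_accuracy` follow the standard edit-distance accounting over insertions, deletions, substitutions, and hits, where the reference length is hits + substitutions + deletions. A minimal sketch (the function name is illustrative, not the repo's actual scorer):

```python
def wer_from_counts(insertions: int, deletions: int, substitutions: int, hits: int) -> float:
    """Word error rate as a percentage, from edit-operation counts.

    Every reference word is either matched (hit), substituted, or deleted,
    so the reference length is hits + substitutions + deletions.
    """
    reference_words = hits + substitutions + deletions
    return 100.0 * (insertions + deletions + substitutions) / reference_words

# run-1 (whisper-base via Buzz) from benchmark_results.json:
wer = wer_from_counts(insertions=44, deletions=62, substitutions=726, hits=3960)
print(round(wer, 2))        # 17.52
print(round(100 - wer, 2))  # 82.48 -> the reported word_accuracy
```

The same arithmetic reproduces the other rows (e.g. whisper-tiny: 82 + 155 + 831 errors over 4748 reference words gives 22.49).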
data/inference/punctuation_results.json ADDED
@@ -0,0 +1,486 @@
+ {
+ "ground_truth_file": "/home/daniel/repos/github/Long-Form-Audio-Eval/data/ground-truth/truth_1.txt",
+ "total_runs_evaluated": 8,
+ "results": [
+ {
+ "run_id": "run-1",
+ "provider": "local",
+ "model": "whisper-base",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 292,
+ "difference": -396
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 6.17
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 42,
+ "count_accuracy": 15.97
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 10,
+ "count_accuracy": 30.3
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 31,
+ "count_accuracy": 29.81
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 202,
+ "count_accuracy": 99.51
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 7,
+ "count_accuracy": 36.84
+ }
+ },
+ "context_match_accuracy": 13.02,
+ "overall_punctuation_score": 21.9
+ }
+ },
+ {
+ "run_id": "run-2",
+ "provider": "local",
+ "model": "whisper-tiny",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 288,
+ "difference": -400
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 6.16
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 45,
+ "count_accuracy": 17.11
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 5,
+ "count_accuracy": 15.15
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 34,
+ "count_accuracy": 32.69
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 199,
+ "count_accuracy": 98.03
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 5,
+ "count_accuracy": 26.32
+ }
+ },
+ "context_match_accuracy": 8.6,
+ "overall_punctuation_score": 18.78
+ }
+ },
+ {
+ "run_id": "run-3",
+ "provider": "local",
+ "model": "whisper-base",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 292,
+ "difference": -396
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 6.17
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 42,
+ "count_accuracy": 15.97
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 10,
+ "count_accuracy": 30.3
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 31,
+ "count_accuracy": 29.81
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 202,
+ "count_accuracy": 99.51
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 7,
+ "count_accuracy": 36.84
+ }
+ },
+ "context_match_accuracy": 13.02,
+ "overall_punctuation_score": 21.9
+ }
+ },
+ {
+ "run_id": "manual-1",
+ "provider": "gladia",
+ "model": "solaria-1",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 651,
+ "difference": -37
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 13.69
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 180,
+ "count_accuracy": 68.44
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 9,
+ "count_accuracy": 27.27
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 251,
+ "count_accuracy": 0
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 197,
+ "count_accuracy": 97.04
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 14,
+ "count_accuracy": 73.68
+ }
+ },
+ "context_match_accuracy": 22.56,
+ "overall_punctuation_score": 44.13
+ }
+ },
+ {
+ "run_id": "manual-2",
+ "provider": "deepgram",
+ "model": "nova-3",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 698,
+ "difference": 10
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 15.19
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 222,
+ "count_accuracy": 84.41
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 3,
+ "count_accuracy": 9.09
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 265,
+ "count_accuracy": 0
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 189,
+ "count_accuracy": 93.1
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 19,
+ "count_accuracy": 100.0
+ }
+ },
+ "context_match_accuracy": 32.33,
+ "overall_punctuation_score": 51.17
+ }
+ },
+ {
+ "run_id": "manual-3",
+ "provider": "assemblyai",
+ "model": "best",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 791,
+ "difference": 103
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 16.99
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 218,
+ "count_accuracy": 82.89
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 7,
+ "count_accuracy": 21.21
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 356,
+ "count_accuracy": 0
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 191,
+ "count_accuracy": 94.09
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 19,
+ "count_accuracy": 100.0
+ }
+ },
+ "context_match_accuracy": 33.72,
+ "overall_punctuation_score": 48.43
+ }
+ },
+ {
+ "run_id": "manual-4",
+ "provider": "speechmatics",
+ "model": "slam-1-global-english",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 1003,
+ "difference": 315
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 20.66
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 238,
+ "count_accuracy": 90.49
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 4,
+ "count_accuracy": 12.12
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 549,
+ "count_accuracy": 0
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 195,
+ "count_accuracy": 96.06
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 17,
+ "count_accuracy": 89.47
+ }
+ },
+ "context_match_accuracy": 30.0,
+ "overall_punctuation_score": 38.23
+ }
+ },
+ {
+ "run_id": "manual-5",
+ "provider": "openai",
+ "model": "whisper-1",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 911,
+ "difference": 223
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 19.15
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 221,
+ "count_accuracy": 84.03
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 6,
+ "count_accuracy": 18.18
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 471,
+ "count_accuracy": 0
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 197,
+ "count_accuracy": 97.04
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 16,
+ "count_accuracy": 84.21
+ }
+ },
+ "context_match_accuracy": 34.42,
+ "overall_punctuation_score": 44.44
+ }
+ }
+ ]
+ }
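The per-mark `count_accuracy` values above are consistent with a simple count-difference ratio, floored at zero when a mark is heavily over-produced. This is an inferred formula that matches the reported numbers, not necessarily the repo's exact implementation:

```python
def count_accuracy(reference_count: int, hypothesis_count: int) -> float:
    """Per-mark count accuracy (percent): 100 when the counts match,
    falling off linearly with the absolute count difference, and
    floored at 0 when the difference exceeds the reference count
    (e.g. a heavily over-produced comma scores 0)."""
    if reference_count == 0:
        return 0.0
    diff_ratio = abs(reference_count - hypothesis_count) / reference_count
    return max(0.0, 100.0 * (1 - diff_ratio))

# Spot-checks against punctuation_results.json:
print(round(count_accuracy(263, 42), 2))   # 15.97  (whisper-base '.')
print(round(count_accuracy(104, 251), 2))  # 0.0    (gladia ',' over-produced)
print(round(count_accuracy(19, 19), 2))    # 100.0  (deepgram '?')
```

The same formula also reproduces the hyphen and apostrophe rows (e.g. 33 vs 10 gives 30.3; 203 vs 202 gives 99.51).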
data/inference/runs-config.json ADDED
@@ -0,0 +1,501 @@
+ {
+ "runs": [
+ {
+ "run_id": "run-1",
+ "run_type": "local-stt",
+ "model": "whisper-base",
+ "provider": "local",
+ "inference_provider": null,
+ "engine": "Buzz",
+ "run_method": {
+ "interface": "gui",
+ "automation": "manual",
+ "description": "Processed using Buzz desktop application"
+ },
+ "settings": {
+ "language": "en",
+ "task": "transcribe"
+ },
+ "output_dir": "runs/local-stt/run-1",
+ "completed": true,
+ "notes": "Whisper Base (local inference) using Buzz",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-2",
+ "run_type": "local-stt",
+ "model": "whisper-tiny",
+ "provider": "local",
+ "inference_provider": null,
+ "engine": "Buzz",
+ "run_method": {
+ "interface": "gui",
+ "automation": "manual",
+ "description": "Processed using Buzz desktop application"
+ },
+ "settings": {
+ "language": "en",
+ "task": "transcribe"
+ },
+ "output_dir": "runs/local-stt/run-2",
+ "completed": true,
+ "notes": "Whisper Tiny (local inference) using Buzz",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-3",
+ "run_type": "local-stt",
+ "model": "whisper-base",
+ "provider": "local",
+ "inference_provider": null,
+ "engine": "Buzz",
+ "run_method": {
+ "interface": "gui",
+ "automation": "manual",
+ "description": "Processed using Buzz desktop application"
+ },
+ "settings": {
+ "language": "auto-detect",
+ "task": "transcribe"
+ },
+ "output_dir": "runs/local-stt/run-3",
+ "completed": true,
+ "notes": "Whisper Base (as run 1) locally but with language set to detect rather than specified as English",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": true,
+ "language_specified": false,
+ "language_code": "auto"
+ }
+ }
+ },
+ {
+ "run_id": "run-4",
+ "run_type": "cloud-stt",
+ "model": "whisper-1",
+ "provider": "openai",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en",
+ "temperature": 0.0
+ },
+ "output_dir": "runs/cloud-stt/run-4",
+ "completed": true,
+ "notes": "OpenAI Whisper API",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-5",
+ "run_type": "cloud-stt",
+ "model": "nova-2",
+ "provider": "deepgram",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en",
+ "smart_format": true,
+ "punctuate": true
+ },
+ "output_dir": "runs/cloud-stt/run-5",
+ "completed": false,
+ "notes": "Deepgram Nova-2 model",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-6",
+ "run_type": "cloud-stt",
+ "model": "chirp",
+ "provider": "assemblyai",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language_code": "en",
+ "punctuate": true,
+ "format_text": true
+ },
+ "output_dir": "runs/cloud-stt/run-6",
+ "completed": false,
+ "notes": "AssemblyAI Universal-1 (Chirp) model",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-7",
+ "run_type": "cloud-stt",
+ "model": "whisper-large-v3",
+ "provider": "groq",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en",
+ "temperature": 0.0
+ },
+ "output_dir": "runs/cloud-stt/run-7",
+ "completed": false,
+ "notes": "Groq Whisper Large V3",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-8",
+ "run_type": "cloud-stt",
+ "model": "enhanced",
+ "provider": "speechmatics",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en",
+ "operating_point": "enhanced",
+ "enable_partials": false
+ },
+ "output_dir": "runs/cloud-stt/run-8",
+ "completed": false,
+ "notes": "Speechmatics Enhanced model via Eden AI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-9",
+ "run_type": "cloud-stt",
+ "model": "whisper",
+ "provider": "google",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en"
+ },
+ "output_dir": "runs/cloud-stt/run-9",
+ "completed": false,
+ "notes": "Google Speech-to-Text via Eden AI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-10",
+ "run_type": "cloud-stt",
+ "model": "transcribe",
+ "provider": "amazon",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en"
+ },
+ "output_dir": "runs/cloud-stt/run-10",
+ "completed": false,
+ "notes": "Amazon Transcribe via Eden AI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-11",
+ "run_type": "cloud-stt",
+ "model": "azure-stt",
+ "provider": "microsoft",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en"
+ },
+ "output_dir": "runs/cloud-stt/run-11",
+ "completed": false,
+ "notes": "Microsoft Azure Speech-to-Text via Eden AI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-12",
+ "run_type": "cloud-stt",
+ "model": "default",
+ "provider": "symbl",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en"
+ },
+ "output_dir": "runs/cloud-stt/run-12",
+ "completed": false,
+ "notes": "Symbl.ai via Eden AI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-13",
+ "run_type": "cloud-stt",
+ "model": "fast",
+ "provider": "gladia",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en"
+ },
+ "output_dir": "runs/cloud-stt/run-13",
+ "completed": false,
+ "notes": "Gladia fast model via Eden AI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "manual-1",
+ "run_type": "cloud-stt",
+ "model": "solaria-1",
+ "provider": "gladia",
+ "inference_provider": "direct",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "manual",
+ "description": "Manual run - standardized output format"
+ },
+ "settings": {
+ "language": "en"
+ },
+ "output_dir": "runs/cloud-stt/manual-1",
+ "completed": true,
+ "notes": "Gladia Solaria 1 model - manual run",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "manual-2",
+ "run_type": "cloud-stt",
+ "model": "nova-3",
+ "provider": "deepgram",
+ "inference_provider": "direct",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "manual",
+ "description": "Manual run - standardized output format"
+ },
+ "settings": {
+ "language": "en",
+ "smart_format": true,
+ "punctuate": true
+ },
+ "output_dir": "runs/cloud-stt/manual-2",
+ "completed": true,
+ "notes": "Deepgram Nova-3 model - manual run",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ },
+ "processing_time": 3.0,
+ "audio_duration": 1637.9652
+ }
+ },
+ {
+ "run_id": "manual-3",
+ "run_type": "cloud-stt",
+ "model": "best",
+ "provider": "assemblyai",
+ "inference_provider": "direct",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "manual",
+ "description": "Manual run - standardized output format"
+ },
+ "settings": {
+ "language_code": "en",
+ "punctuate": true,
+ "format_text": true
+ },
+ "output_dir": "runs/cloud-stt/manual-3",
+ "completed": true,
+ "notes": "AssemblyAI Best model - manual run",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "manual-4",
+ "run_type": "cloud-stt",
+ "model": "slam-1-global-english",
+ "provider": "speechmatics",
+ "inference_provider": "direct",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "manual",
+ "description": "Manual run - standardized output format"
+ },
+ "settings": {
+ "language": "en",
+ "operating_point": "enhanced",
+ "enable_partials": false
+ },
+ "output_dir": "runs/cloud-stt/manual-4",
+ "completed": true,
+ "notes": "Speechmatics Slam 1 Global English - manual run",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "manual-5",
+ "run_type": "cloud-stt",
+ "model": "whisper-1",
+ "provider": "openai",
+ "inference_provider": "direct",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "manual",
+ "description": "Manual run - standardized output format"
+ },
+ "settings": {
+ "language": "en",
+ "temperature": 0.0
+ },
+ "output_dir": "runs/cloud-stt/manual-5",
+ "completed": true,
+ "notes": "OpenAI Whisper-1 model - manual run via web UI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ }
+ ],
+ "source_audio": "data/audio/podcast.mp3",
+ "source_of_truth": "data/ground-truth/truth_1.txt",
+ "evaluation_metrics": [
+ "wer",
+ "cer",
+ "word_accuracy",
+ "processing_time",
+ "cost"
+ ]
+ }
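This config acts as the run manifest: each entry records how a transcription was produced and whether it finished, so downstream scoring can skip incomplete runs. A short sketch of that filtering (hypothetical helper, shown on a minimal inline excerpt of the schema rather than the real file):

```python
import json

def completed_run_ids(config: dict) -> list[str]:
    """Run IDs whose transcription finished, in manifest order."""
    return [run["run_id"] for run in config["runs"] if run["completed"]]

# Minimal excerpt mirroring the runs-config.json schema:
config = json.loads("""
{"runs": [
  {"run_id": "run-1", "completed": true},
  {"run_id": "run-5", "completed": false},
  {"run_id": "manual-1", "completed": true}
]}
""")
print(completed_run_ids(config))  # ['run-1', 'manual-1']
```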
data/inference/runs/cloud-stt/manual-1/raw_response.json ADDED
@@ -0,0 +1,3 @@
+ {
+ "full_transcript": "Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast or it uh i may append this to a podcast that i set up recently um regarding my uh with my thoughts on speech tech and ai in particular more ai and generative ai i would uh i would say but in any event the purpose of this um voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation, as they might say, for different speech to text models. And I'm doing this because I thought I'd made a great breakthrough in my journey with speech tech, and that was succeeding in the elusive task of fine tuning Whisper. Whisper is, and I'm going to just talk i'm trying to mix up uh i'm going to try a few different styles of speaking i might whisper something at some points as well and i'll go back to speaking loud in uh in different parts i'm going to sound really like a crazy person because i'm also going to try to speak at different pitches and cadences in order to really try to put a speech attacks model through its paces which is trying to make sense of is this guy just rambling on incoherently in one long sentence or are these just actually a series of step, standalone, step alone, standalone sentences. And how is it going to handle step alone? That's not a word. What happens when you use speech to text and you use a fake word? And then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why did, why was it trying to fine tune Whisper? 
what is whisper as i said i'm gonna try to uh record this at a couple of different levels of technicality for folks who are uh you know in the normal uh world and not totally stuck down the rabbit hole of ai which i have to say is a really wonderful uh rabbit hole to be to be down um it's a really interesting area and speech and voice tech is is the aspect of it that i find actually most i'm not sure i would say the most interesting because there's Just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I'm persevering hard with the task of training, I guess, a good solution working for Linux, which if anyone actually does listen to this, not just for the training data and for the actual content, this is this is sparked. I had besides the fine-tune not working well that was the failure um i used plod code because one thinks these days that there is nothing short of solving you know the uh the reason of life or something uh that plod and agentic ai can't do uh which is not really the case uh it does seem that way sometimes but it fails a lot as well and this is one of those instances where last week I put together an hour of voice training data, basically speaking, just random things for three minutes. And it was actually kind of tedious because the texts were really weird. Some of them were, it was like, it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was like, okay, I know I'm just going to have to find something else to read. So I used... 
a created with AI studio vibe coded a synthetic text generator which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content so I was like okay give me a voice note like I'm recording an email give me a short story to read give me prose to read so it came up with all these different things and they added a little timer to it so I could see. how close i was to one hour um and uh i spent like an hour one afternoon or probably two hours by the time you um you do retakes and whatever because you want to it gave me a source of truth which i'm not sure if that's the scientific way to approach this topic of gathering uh training data but i thought made sense um i have a lot of audio data from recording voice notes which I've also kind of used Bean. experimenting with using for a different purpose slightly different annotating task types it's more text classification experiment or Well, it's more than that, actually. I'm working on a voice app. So it's a prototype, I guess, is really more accurate. But you can do that and you can work backwards. You're like, you listen back to a voice note and you painfully go through one of those transcribing, you know, where you start and stop and scrub around it and you fix the errors. But it's really, really boring to do that. So I thought it would be less tedious in the long term if I just recorded the source of truth. So it gave me these three minute snippets. I recorded them and saved an MP3 and a TXT in the same folder. And I created an error of that data. So I was very hopeful, quietly, you know, a little bit hopeful that I would be able that I could actually fine tune Whisper. I want to fine tune Whisper because when I got into voice tech last November, my wife was in the US and I was alone at home and, you know, went crazy. 
people like me do really wild things like use voice to tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high I used speech tech now and again tried it out I was like it'd be really cool if you could just like speak into your computer and whatever I tried out that had support was just it was not good basically And this blew me away from the first go. I mean, it wasn't 100% accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. You know, there's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time, for a speech to text to be a worthwhile addition to your productivity. But you do need to get above, let's say, I don't know, 85%. percent. If it's 60% or 50%, you inevitably say, screw it, I'll just type it because you end up missing errors in the transcript and it becomes actually worse. You end up in a worse position than you started with it. That's been my experience. So I was like, oh, this is actually really, really good now. How did that happen? And the answer is ASR, Whisper being open sourced and the transformer architecture. If you want to go back to the to the underpinnings, which really blows my mind. And it's on my list to read through that paper. All you need is attention as attentively as can be done with my limited brain because it's super, super high level stuff. Super advanced stuff, I mean. But that, I think of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities. I find it fascinating that few people are like, hang on, you've got this thing that can speak to you like a chatbot, an LLM. And then you've got image generation. OK, so firstly, those two things on the surface have nothing in common. So like, how are they? 
How did that just happen all at the same time? And then when you extend that further, you're like Suno, right? You can sing a song and AI will like come up with an instrumental. And then you've got Whisper. And you're like, wait a second. How did all this stuff, like if it's all AI, what's, like there has to be some commonality. Otherwise, these are totally different technologies on the surface of it. And the transformer architecture is, as far as I know, the answer. And I can't even say, can't even pretend that I really understand what the transformer architecture means in depth. But I have scanned this. And as I said, I want to... printed and really kind of think over it at some point and I'll probably feel bad about myself I think because weren't those guys in their in their 20s like that's crazy I think I asked chat gpt once who were the who wrote that paper and how old were they when it was published in arcs if and I was expecting like I don't know what do you what do you imagine I personally imagine kind of like you know you have these breakthroughs during covid and things like that where like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring in labs and uh wearily and writing and publishing in kind of obscure academic publications and they finally like hit a big or win a noble prize and then their household household names uh so that was kind of what i had in mind that was the mental image i'd formed of the birth of arcs of like i wasn't expecting 20 somethings in san francisco though i i thought that was both very very funny very cool and actually kind of inspiring It's nice to think that people who, you know, just you might put them in the kind of. 
milieu or bubble or world that you are in or credibly in through you know the series of connections that are coming up with such literally world-changing um innovations uh so that was i thought anyway that's that that was cool okay voice training data how are we doing we're about 10 minutes and i'm still talking about voice technology um so whisper was brilliant and i was so excited that i was my first instinct was to like guess It's like, oh my gosh, I have to get like a really good microphone for this. So I didn't go on a spending spree because I said, I'm going to have to just wait a month and see if I still use this. And it just kind of became, it's become really part of my daily routine. Like if I'm writing an email, I'll record a voice note. And then I've developed. And it's nice to see that everyone is like developing the same things in parallel. Like that's kind of a weird thing to say. But when I look, I... kind of came when i started working on this uh these prototypes on github which is where i just kind of share very freely and loosely uh ideas and you know first iterations on on concepts um and for want of a better word i called it like uh llm post-processing or cleanup or basically a system prompt that after you get back the raw text from whisper you run it through a model and say, okay, this is crappy. 
text like add sentence structure and you know fix it up and now when I'm exploring the different tools that are out there that people have built I see quite a number of projects have basically you know done the same thing lest that be misconstrued I'm not saying for a millisecond that I inspired them I'm sure this has been a thing that's been integrated into tools for a while but it's It's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent because text that doesn't have any punctuation or paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that again, it's, it's, it, it moves speech tech into that before that inflection point where you're like, nah, it's just not worth it. It's like, it'll just be quicker to type this. So it's a big, it's a little touch that actually. is a big deal. So I was on Whisper and I've been using Whisper and I kind of early on found a couple of tools. I couldn't find what I was looking for on Linux, which is basically just something that'll run in the background. You'll give it an API key and it will just like transcribe with like a little key to start and stop the dictation. And the issues were I discovered that like most people involved in creating these projects were very much focused on local models running whisper locally because you can and i tried that a bunch of times and just never got results that were as good as the cloud and when i began looking at the cost of the speech to text apis and what i was spending i just thought there is it's actually in my opinion just one of the better deals in api spending and in cloud like it's just not that expensive for very very good models That are much more, you know, you're going to be able to run the full model, the latest model versus whatever you can run on your average GPU, unless you want to buy a crazy GPU. It doesn't really make sense to me. 
Privacy is another concern that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment, maybe for regulatory reasons as well. But I'm not in that. I'm neither really care. about people listening to my grocery list consisting of reminding myself that I need to buy more beer, Cheetos and hummus, which is kind of the three, three staples of my diet during periods of poor nutrition. But the kind of stuff that I transcribe, it's just not, it's not a, it's not a privacy thing I'm that sort of sensitive about. And I don't do anything so, you know, sensitive or secure that requires air gapping. So. I looked at the pricing and especially the kind of older models, mini, some of them are very, very affordable. And I did a calculation once with ChatGPT and I was like, OK, this is the API price for I can't remember whatever the model was. Let's say I just go at it like nonstop, which it rarely happens. Probably I would say on average I might dictate 30 to 60 minutes per day if I was probably summing up the emails. uh, documents, outlines, um, which is a lot, but it's, it's still a fairly modest amount. And I was like, well, some days I do go on like one or two days where I've been. Usually when I'm like kind of out of the house and just have something like I have nothing else to do. Like if I'm at a hospital, we have a newborn and you're waiting for like eight hours and hours for an appointment. And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, oh, wait, let me just get down. Let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say if I'm going to price out cloud STT, if I was like dedicated every second of every waking hour to transcribing for some odd reason, um, I mean, it'd have to like eat and use the toilet. 
Like, you know, there's only so many hours I'm awake for. So like, let's just say a maximum of like 40 hours, 45 minutes in the hour. Then I said, all right, let's just say 50. Who knows? You're dictating on the toilet. We do it. So you could just do 60, but whatever I did and every day, like you're going flat out seven days a week dictating nonstop. I was like, what's my monthly API bill going to be at this price? And it came out to like 70 or 80 bucks. And I was like, well, that would be an extraordinary amount of dictation. And I would hope that there was some compelling reason worth more than $70 that I embarked upon that project. So given that that's kind of the max point for me, I said that's actually very, very affordable. Now, you're going to if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable, that's going to cost some more as well. Unless you're using Gemini, which needless to say, is a random person sitting in Jerusalem. I have no affiliation, nor with Google, nor Anthropic, nor Gemini, nor any major tech vendor for that matter. Um, I like Gemini not so much as a everyday model. Um, it's kind of underwhelmed in that respect, I would say, but for multimodal, I think it's got a lot to offer. And I think that the transcribing functionality whereby it can, um, process audio with a system prompt and both give you transcription that's cleaned up, that reduces two steps to one. And that for me is a very, very big deal. And, uh, I feel like even Google has haven't really sort of thought through how useful the that modality is and what kind of use cases you can achieve with it because i found in the course of this year just an endless list of really kind of system prompt system prompt stuff that i can say okay i've used it to capture context data for ai which is literally i might speak for if i wanted to have a good bank of context data about who knows my childhood. 
more realistically maybe my career goals something that would just be like really boring to type out so I'll just like sit in my car and record it for 10 minutes and that 10 minutes you get a lot of information in emails which is short text just there is a whole bunch and all these workflows kind of require a little bit of treatment afterwards and different treatment my context pipeline is kind of like just extract the bare essentials so you end up with me talking very loosely about sort of what i've done in my career where i've worked where i might like to work and it goes it condenses that down to very robotic language that is easy to chunk parse and maybe put into a vector database daniel has worked in technology daniel is a has been working in martin you know stuff like that that's not how you would speak um but i figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes and this is actually a success because I wasted 20 minutes of the evening speaking into a microphone and the levels were shot and it was clipping and I said I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. What am I hoping to achieve in this? Okay, my fine tune was a dud as mentioned. Deepgram SDT, I'm really, really hopeful that this prototype will work. And it's a built in public open source. So anyone is welcome to use it if I make anything good. But that was really exciting for me last night when after hours of trying my own prototype, seeing someone just made something that works like that, you know, you're not going to have to build a custom conda environment and image. I have AMD GPU, which makes things much more complicated. I didn't find it. And I was about to give up. And I said, all right, let me just give Deepgram's Linux thing. 
shot and if it doesn't work, I'm just gonna go back to trying to vibe code something myself and when I ran the script I was using cloud code to do the installation process. It ran the script and oh my gosh, it works just like that. The tricky thing for all those who wants to know all the nitty gritty details was that I don't think it was actually struggling with transcription, but pasting, Wayland makes life very hard. And I think there was something not running at the right time. Anyway, Deepgram, I looked at how they actually handled that because it worked out of the... box when other stuff didn't and it was quite a clever little mechanism and but more so than that the accuracy was brilliant now what am i doing here this is going to be a 20 minute audio sample and i'm i think i've done one or two of these before but i did it with short snappy voice notes this is kind of long form this actually might be a better approximation for what's useful to me then voice memos like i need to buy three liters of milk tomorrow and peter bread which is probably how like half my voice note voice notes sound like if anyone were to i don't know like find my phone they'd be like this is the most boring person in the world although actually there are some like kind of uh journaling thoughts as well but it's a lot of content like that and the probably for the evaluation the most useful thing is slightly obscure tech github nucleano uh hugging face not so obscure that it's not going to have a chance of knowing it but hopefully sufficiently well known that the model should get it i tried to do a little bit of speaking really fast and speaking very slowly i would say in general i've spoken delivered this at a faster pace than i usually would owing to strong coffee flowing through my bloodstream and the thing that i'm not going to get in this benchmark is background noise which in my first take that i had to get rid of my wife came in with my son and for a good night kiss And that 
actually would have been super helpful to get in because it was non-diarized or if we had diarization, a female, I could say, I want the male voice and that wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I've done in my main data set where I am trying to go back to some of my voice notes, annotate them and run a benchmark. But this is going to be just a pure, quick test and As someone working on a voice note idea, that's my sort of end motivation, besides thinking it's an absolutely outstanding technology that's coming to viability. And really, I know this sounds cheesy, can actually have a very transformative effect. It's, you know, voice technology has been life changing for folks living with disabilities. And I think there's something really nice about the fact that it can also benefit. you know, folks who are able-bodied and like we can all in different ways make this tech as useful as possible, regardless of the exact way that we're using it. And I think there's something very powerful in that and it can be very cool. I see huge potential. What excites me about voice tech? A lot of things, actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this, and it's getting better and better with stuff like accent handling. I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day, as I imagine. I get like superb, flawless words, error rates, because I'm just kind of skeptical about local speech to text, as I mentioned. 
And I think the pace of innovation and improvement in the models, the main reasons for fine tuning from what I've seen have been people who are something that really blows my mind about ASR is the idea that it's inherently alingual or multilingual phonetic based so as folks who use speak very obscure languages that there may be very there might be a paucity of training data or almost none at all and therefore the accuracy is significantly reduced or folks in very critical environments i know there you this is used extensively in medical transcription and dispatcher your work as, um, you know, the call centers who send out ambulances, et cetera, where accuracy is absolutely paramount. And in the case of doctors, radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things. And I'm not sure that really just for trying to make it better on a few random tech words with my slightly, I mean, I have an accent, but like not, you know, an accent that a few other million people have it. I'm not sure that. my little fine tune is going to actually like the bump in word error reduction if I ever actually figure out how to do it and get it up to the cloud by the time I've done that I suspect that the next generation of ASR will just be so good that it will kind of be, no, well, that would have been cool if it worked out, but I'll just use this instead. So that's going to be it for today's episode of voice training data. Single, long shot evaluation. Who am I going to compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head-to-head with two things, really. One is Whisper variants. So you've got these... 
projects like faster whisper uh distill whisper it's a bit confusing there's a whole bunch of them and the emerging asrs which are also a thing my intention for this is i'm not sure i'm going to have the time in any point in the foreseeable future to go back through this whole episode and create a proper source truth where i fix everything might do it if i can get one transcriptions as sufficiently close to perfection but What I would actually love to do on Hugging Face, I think would be a great, probably how I might visualize this is having the audio waveform play and then have the transcript for each model below it. And maybe even a, like, you know, two scale and maybe even a local one as well, like Local Whisper versus OpenAI API, et cetera. And... I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER, but that would require the source of truth. Okay, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. You want to see that I always feel, think I've just said as something I didn't intend to. STT, I said for those. listen carefully, including hopefully the models themselves. This has been myself, Daniel Rosehill. For more jumbled repositories about my roving interest in AI, but particularly agentic, MCP and voice tech, you can find me on GitHub, Hugging Face, where else? Danielrosehill.com, which is my personal website, as well as this podcast, whose name I sadly cannot remember. Until next time, thanks for listening."
+ }
data/inference/runs/cloud-stt/manual-1/transcript.txt ADDED
@@ -0,0 +1 @@
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast or it uh i may append this to a podcast that i set up recently um regarding my uh with my thoughts on speech tech and ai in particular more ai and generative ai i would uh i would say but in any event the purpose of this um voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation, as they might say, for different speech to text models. And I'm doing this because I thought I'd made a great breakthrough in my journey with speech tech, and that was succeeding in the elusive task of fine tuning Whisper. Whisper is, and I'm going to just talk i'm trying to mix up uh i'm going to try a few different styles of speaking i might whisper something at some points as well and i'll go back to speaking loud in uh in different parts i'm going to sound really like a crazy person because i'm also going to try to speak at different pitches and cadences in order to really try to put a speech attacks model through its paces which is trying to make sense of is this guy just rambling on incoherently in one long sentence or are these just actually a series of step, standalone, step alone, standalone sentences. And how is it going to handle step alone? That's not a word. What happens when you use speech to text and you use a fake word? And then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why did, why was it trying to fine tune Whisper? 
what is whisper as i said i'm gonna try to uh record this at a couple of different levels of technicality for folks who are uh you know in the normal uh world and not totally stuck down the rabbit hole of ai which i have to say is a really wonderful uh rabbit hole to be to be down um it's a really interesting area and speech and voice tech is is the aspect of it that i find actually most i'm not sure i would say the most interesting because there's Just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I'm persevering hard with the task of training, I guess, a good solution working for Linux, which if anyone actually does listen to this, not just for the training data and for the actual content, this is this is sparked. I had besides the fine-tune not working well that was the failure um i used plod code because one thinks these days that there is nothing short of solving you know the uh the reason of life or something uh that plod and agentic ai can't do uh which is not really the case uh it does seem that way sometimes but it fails a lot as well and this is one of those instances where last week I put together an hour of voice training data, basically speaking, just random things for three minutes. And it was actually kind of tedious because the texts were really weird. Some of them were, it was like, it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was like, okay, I know I'm just going to have to find something else to read. So I used... 
a created with AI studio vibe coded a synthetic text generator which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content so I was like okay give me a voice note like I'm recording an email give me a short story to read give me prose to read so it came up with all these different things and they added a little timer to it so I could see. how close i was to one hour um and uh i spent like an hour one afternoon or probably two hours by the time you um you do retakes and whatever because you want to it gave me a source of truth which i'm not sure if that's the scientific way to approach this topic of gathering uh training data but i thought made sense um i have a lot of audio data from recording voice notes which I've also kind of used Bean. experimenting with using for a different purpose slightly different annotating task types it's more text classification experiment or Well, it's more than that, actually. I'm working on a voice app. So it's a prototype, I guess, is really more accurate. But you can do that and you can work backwards. You're like, you listen back to a voice note and you painfully go through one of those transcribing, you know, where you start and stop and scrub around it and you fix the errors. But it's really, really boring to do that. So I thought it would be less tedious in the long term if I just recorded the source of truth. So it gave me these three minute snippets. I recorded them and saved an MP3 and a TXT in the same folder. And I created an error of that data. So I was very hopeful, quietly, you know, a little bit hopeful that I would be able that I could actually fine tune Whisper. I want to fine tune Whisper because when I got into voice tech last November, my wife was in the US and I was alone at home and, you know, went crazy. 
people like me do really wild things like use voice to tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high I used speech tech now and again tried it out I was like it'd be really cool if you could just like speak into your computer and whatever I tried out that had support was just it was not good basically And this blew me away from the first go. I mean, it wasn't 100% accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. You know, there's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time, for a speech to text to be a worthwhile addition to your productivity. But you do need to get above, let's say, I don't know, 85%. percent. If it's 60% or 50%, you inevitably say, screw it, I'll just type it because you end up missing errors in the transcript and it becomes actually worse. You end up in a worse position than you started with it. That's been my experience. So I was like, oh, this is actually really, really good now. How did that happen? And the answer is ASR, Whisper being open sourced and the transformer architecture. If you want to go back to the to the underpinnings, which really blows my mind. And it's on my list to read through that paper. All you need is attention as attentively as can be done with my limited brain because it's super, super high level stuff. Super advanced stuff, I mean. But that, I think of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities. I find it fascinating that few people are like, hang on, you've got this thing that can speak to you like a chatbot, an LLM. And then you've got image generation. OK, so firstly, those two things on the surface have nothing in common. So like, how are they? 
How did that just happen all at the same time? And then when you extend that further, you're like Suno, right? You can sing a song and AI will like come up with an instrumental. And then you've got Whisper. And you're like, wait a second. How did all this stuff, like if it's all AI, what's, like there has to be some commonality. Otherwise, these are totally different technologies on the surface of it. And the transformer architecture is, as far as I know, the answer. And I can't even say, can't even pretend that I really understand what the transformer architecture means in depth. But I have scanned this. And as I said, I want to... printed and really kind of think over it at some point and I'll probably feel bad about myself I think because weren't those guys in their in their 20s like that's crazy I think I asked chat gpt once who were the who wrote that paper and how old were they when it was published in arcs if and I was expecting like I don't know what do you what do you imagine I personally imagine kind of like you know you have these breakthroughs during covid and things like that where like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring in labs and uh wearily and writing and publishing in kind of obscure academic publications and they finally like hit a big or win a noble prize and then their household household names uh so that was kind of what i had in mind that was the mental image i'd formed of the birth of arcs of like i wasn't expecting 20 somethings in san francisco though i i thought that was both very very funny very cool and actually kind of inspiring It's nice to think that people who, you know, just you might put them in the kind of. 
milieu or bubble or world that you are in, or credibly in, through, you know, the series of connections, that are coming up with such literally world-changing, um, innovations. Uh, so that was, I thought anyway, that's, that was cool. Okay, voice training data. How are we doing? We're about 10 minutes in and I'm still talking about voice technology. Um, so Whisper was brilliant, and I was so excited that my first instinct was like, oh my gosh, I have to get like a really good microphone for this. So I didn't go on a spending spree, because I said, I'm going to have to just wait a month and see if I still use this. And it just kind of became, it's become really part of my daily routine. Like if I'm writing an email, I'll record a voice note. And then I've developed. And it's nice to see that everyone is like developing the same things in parallel. Like that's kind of a weird thing to say. But when I look, I... it kind of came when I started working on these prototypes on GitHub, which is where I just kind of share very freely and loosely, uh, ideas and, you know, first iterations on concepts. Um, and for want of a better word, I called it like, uh, LLM post-processing or cleanup, or basically a system prompt that after you get back the raw text from Whisper, you run it through a model and say, okay, this is crappy
text, like, add sentence structure and, you know, fix it up. And now, when I'm exploring the different tools that are out there that people have built, I see quite a number of projects have basically, you know, done the same thing. Lest that be misconstrued, I'm not saying for a millisecond that I inspired them. I'm sure this has been a thing that's been integrated into tools for a while. But it's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent, because text that doesn't have any punctuation or paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that, again, it moves speech tech back before that inflection point where you're like, nah, it's just not worth it. It's like, it'll just be quicker to type this. So it's a little touch that actually is a big deal. So I was on Whisper, and I've been using Whisper, and I kind of early on found a couple of tools. I couldn't find what I was looking for on Linux, which is basically just something that'll run in the background. You'll give it an API key and it will just, like, transcribe, with like a little key to start and stop the dictation. And the issue was, I discovered that, like, most people involved in creating these projects were very much focused on local models, running Whisper locally, because you can. And I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech to text APIs and what I was spending, I just thought, it's actually, in my opinion, just one of the better deals in API spending and in cloud. Like, it's just not that expensive for very, very good models that are much more, you know, you're going to be able to run the full model, the latest model, versus whatever you can run on your average GPU, unless you want to buy a crazy GPU. It doesn't really make sense to me.
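The two-step cleanup flow described here (raw Whisper text, then an LLM pass with a system prompt) can be sketched as a request builder. This is a minimal sketch assuming a chat-completions style API; the model name and prompt wording are illustrative, not the exact prompt from these prototypes:

```python
# Sketch of the "LLM post-processing" step: wrap raw speech-to-text output
# in a cleanup request for a chat-style model. Prompt wording and the model
# name are illustrative assumptions.
CLEANUP_SYSTEM_PROMPT = (
    "You will receive raw speech-to-text output with little or no punctuation. "
    "Add sentence structure, punctuation, and paragraph breaks. "
    "Do not add, remove, or reword any content."
)

def build_cleanup_request(raw_transcript: str) -> dict:
    """Build the payload for the second (cleanup) step of the pipeline."""
    return {
        "model": "gpt-4o-mini",  # any small, cheap text model would do
        "messages": [
            {"role": "system", "content": CLEANUP_SYSTEM_PROMPT},
            {"role": "user", "content": raw_transcript},
        ],
    }

req = build_cleanup_request(
    "ok so first we transcribe then we run it through a model no punctuation yet"
)
print(req["messages"][0]["role"])  # system
```

The same shape works whether the transcription step was whisper-1 in the cloud or a local Whisper build; only the raw text going in changes.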
Privacy is another concern that I know is very much a separate thing: people just don't want their voice data and their voice leaving their local environment, maybe for regulatory reasons as well. But I'm not in that camp. I don't really care about people listening to my grocery list, consisting of reminding myself that I need to buy more beer, Cheetos and hummus, which are kind of the three staples of my diet during periods of poor nutrition. But the kind of stuff that I transcribe, it's just not, it's not a privacy thing I'm that sort of sensitive about. And I don't do anything so, you know, sensitive or secure that requires air gapping. So I looked at the pricing, and especially the kind of older models, mini, some of them are very, very affordable. And I did a calculation once with ChatGPT and I was like, OK, this is the API price for, I can't remember whatever the model was. Let's say I just go at it like nonstop, which rarely happens. Probably I would say on average I might dictate 30 to 60 minutes per day if I was, probably, summing up the emails, uh, documents, outlines, um, which is a lot, but it's still a fairly modest amount. And I was like, well, some days I do go on, like one or two days where I've been, usually when I'm like kind of out of the house and just have something, like I have nothing else to do. Like if I'm at a hospital, we have a newborn, and you're waiting for like eight hours and hours for an appointment. And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, oh, wait, let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say, if I'm going to price out cloud STT, if I was, like, dedicated every second of every waking hour to transcribing for some odd reason, um, I mean, I'd have to, like, eat and use the toilet.
Like, you know, there's only so many hours I'm awake for. So, like, let's just say a maximum of, like, 45 minutes in the hour. Then I said, all right, let's just say 50. Who knows? You're dictating on the toilet. We do it. So you could just do 60, but whatever I did. And every day, like, you're going flat out, seven days a week, dictating nonstop. I was like, what's my monthly API bill going to be at this price? And it came out to like 70 or 80 bucks. And I was like, well, that would be an extraordinary amount of dictation. And I would hope that there was some compelling reason worth more than $70 that I embarked upon that project. So given that that's kind of the max point for me, I said that's actually very, very affordable. Now, if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable, that's going to cost some more as well. Unless you're using Gemini, which, needless to say, as a random person sitting in Jerusalem, I have no affiliation with Google, nor Anthropic, nor Gemini, nor any major tech vendor for that matter. Um, I like Gemini not so much as an everyday model. Um, it's kind of underwhelmed in that respect, I would say. But for multimodal, I think it's got a lot to offer. And I think that the transcribing functionality, whereby it can, um, process audio with a system prompt and both give you a transcription that's cleaned up, that reduces two steps to one. And that for me is a very, very big deal. And, uh, I feel like even Google hasn't really sort of thought through how useful that modality is and what kind of use cases you can achieve with it, because I found in the course of this year just an endless list of really kind of system prompt stuff that I can say, okay, I've used it to capture context data for AI, which is literally, I might speak, if I wanted to have a good bank of context data about, who knows, my childhood,
more realistically maybe my career goals, something that would just be, like, really boring to type out. So I'll just, like, sit in my car and record it for 10 minutes, and in that 10 minutes you get a lot of information in. Emails, which is short text, there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards, and different treatment. My context pipeline is kind of like: just extract the bare essentials. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work, and it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database. Daniel has worked in technology, Daniel has been working in martin, you know, stuff like that. That's not how you would speak, um, but I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes, and this is actually a success, because I wasted 20 minutes of the evening speaking into a microphone when the levels were shot and it was clipping, and I said, I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. What am I hoping to achieve in this? Okay, my fine tune was a dud, as mentioned. Deepgram STT, I'm really, really hopeful that this prototype will work. And it's built in public, open source, so anyone is welcome to use it if I make anything good. But that was really exciting for me last night when, after hours of trying my own prototype, seeing someone just made something that works like that, you know, you're not going to have to build a custom conda environment and image. I have an AMD GPU, which makes things much more complicated. I didn't find it. And I was about to give up. And I said, all right, let me just give Deepgram's Linux thing a
shot. And if it doesn't work, I'm just gonna go back to trying to vibe code something myself. And when I ran the script, I was using Claude Code to do the installation process. It ran the script and, oh my gosh, it works just like that. The tricky thing, for all those who want to know all the nitty gritty details, was that I don't think it was actually struggling with transcription, but pasting. Wayland makes life very hard. And I think there was something not running at the right time. Anyway, Deepgram, I looked at how they actually handled that, because it worked out of the box when other stuff didn't, and it was quite a clever little mechanism. But more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20 minute audio sample, and I think I've done one or two of these before, but I did it with short, snappy voice notes. This is kind of long form. This actually might be a better approximation for what's useful to me than voice memos, like, I need to buy three liters of milk tomorrow and pita bread, which is probably how, like, half my voice notes sound. Like, if anyone were to, I don't know, like, find my phone, they'd be like, this is the most boring person in the world. Although actually there are some, like, kind of, uh, journaling thoughts as well, but it's a lot of content like that. And probably, for the evaluation, the most useful thing is slightly obscure tech: GitHub, Nucleano, uh, Hugging Face. Not so obscure that it's not going to have a chance of knowing it, but hopefully sufficiently well known that the model should get it. I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general I've spoken, delivered this at a faster pace than I usually would, owing to strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise, which in my first take, that I had to get rid of, my wife came in with my son for a good night kiss. And that
actually would have been super helpful to get in because it was non-diarized or if we had diarization, a female, I could say, I want the male voice and that wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I've done in my main data set where I am trying to go back to some of my voice notes, annotate them and run a benchmark. But this is going to be just a pure, quick test and As someone working on a voice note idea, that's my sort of end motivation, besides thinking it's an absolutely outstanding technology that's coming to viability. And really, I know this sounds cheesy, can actually have a very transformative effect. It's, you know, voice technology has been life changing for folks living with disabilities. And I think there's something really nice about the fact that it can also benefit. you know, folks who are able-bodied and like we can all in different ways make this tech as useful as possible, regardless of the exact way that we're using it. And I think there's something very powerful in that and it can be very cool. I see huge potential. What excites me about voice tech? A lot of things, actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this, and it's getting better and better with stuff like accent handling. I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day, as I imagine. I get like superb, flawless words, error rates, because I'm just kind of skeptical about local speech to text, as I mentioned. 
And I think the pace of innovation and improvement in the models, the main reasons for fine tuning, from what I've seen, have been people who are, something that really blows my mind about ASR is the idea that it's inherently alingual or multilingual, phonetic based. So, as folks who speak very obscure languages, where there might be a paucity of training data, or almost none at all, and therefore the accuracy is significantly reduced. Or folks in very critical environments. I know this is used extensively in medical transcription and dispatcher work, as, um, you know, the call centers who send out ambulances, et cetera, where accuracy is absolutely paramount. And in the case of doctors, radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things. And I'm not sure that, really, just for trying to make it better on a few random tech words with my slightly, I mean, I have an accent, but, like, not, you know, an accent that a few million other people have. I'm not sure that my little fine tune is going to actually, like, the bump in word error reduction, if I ever actually figure out how to do it and get it up to the cloud, by the time I've done that, I suspect that the next generation of ASR will just be so good that it will kind of be, no, well, that would have been cool if it worked out, but I'll just use this instead. So that's going to be it for today's episode of voice training data. Single, long shot evaluation. Who am I going to compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head-to-head with two things, really. One is Whisper variants. So you've got these...
projects like Faster Whisper, uh, Distil-Whisper, it's a bit confusing, there's a whole bunch of them, and the emerging ASRs, which are also a thing. My intention for this is, I'm not sure I'm going to have the time at any point in the foreseeable future to go back through this whole episode and create a proper source of truth where I fix everything. Might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face, I think would be great, probably how I might visualize this, is having the audio waveform play and then have the transcript for each model below it. And maybe even a, like, you know, two scale and maybe even a local one as well, like local Whisper versus OpenAI API, et cetera. And... I can then actually listen back to segments, or anyone who wants to can listen back to segments of this recording, and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER. But that would require the source of truth. Okay, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. You see, I always feel, I think, I've just said something I didn't intend to. STT, I said, for those listening carefully, including hopefully the models themselves. This has been myself, Daniel Rosehill. For more jumbled repositories about my roving interest in AI, but particularly agentic, MCP and voice tech, you can find me on GitHub, Hugging Face, where else? Danielrosehill.com, which is my personal website, as well as this podcast, whose name I sadly cannot remember. Until next time, thanks for listening.
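Once a corrected source of truth exists, the headline word error rate for each model's transcript can be computed with a plain word-level edit distance. A minimal, dependency-free sketch (libraries like jiwer do the same with more normalization):

```python
# Word error rate: word-level Levenshtein distance between a reference
# (the source of truth) and a hypothesis (one model's transcript),
# divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

truth = "i need to buy three liters of milk tomorrow and pita bread"
guess = "i need to buy three liters of milk tomorrow and peter bread"
print(round(wer(truth, guess), 3))  # 0.083
```

In practice you would lowercase and strip punctuation from both sides first, so models are not penalized for formatting rather than recognition.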
data/inference/runs/cloud-stt/manual-2/transcript.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ Hello and welcome to an audio dataset consisting of one single episode of a nonexistent podcast. Or I may append this to a podcast that I set up recently with my thoughts on speech tech and AI. In particular, more AI and generative AI, I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation, as they might say, for different speech tech models. I'm doing this because I thought I'd made a great breakthrough in my journey with speech tech, and that was succeeding in the elusive task of fine tuning Whisper. Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking, whisper something at some points as well. And I'll go back to speaking loud in different parts. I'm going to sound really like a crazy person, because I'm also going to try to speak at different pitches and cadences in order to really try to push a speech to text model through its paces, which is trying to make sense of: is this guy just rambling on incoherently in one long sentence, or are these actually a series of step standalone, standalone, standalone sentences? And how is it going to handle step alone? That's not a word. What happens when you use speech to text and you use a fake word? And then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why was I trying to fine tune Whisper? And what is Whisper? As I said, I'm going to try to record this at a couple of different levels of technicality for folks who are in the normal world and not totally stuck down the rabbit hole of AI, which I have to say is a really wonderful rabbit hole to be down.
It's a really interesting area, and speech and voice tech is the aspect of it that I find actually most, I'm not sure I would say the most interesting, because there's just so much that is fascinating in AI, but the most, that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. I'm persevering hard with the task of trying to get a good solution working for Linux, which, if anyone actually does listen to this, not just for the training data and for the actual content, is sparked. I had, besides the fine tune not working, well, that was the failure. I used Claude Code, because one thinks these days that there is nothing short of solving, you know, the reason of life or something, that Claude and agentic AI can't do, which is not really the case. It does seem that way sometimes, but it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data, basically speaking just random things for three minutes. It was actually kind of tedious because the texts were really weird. Some of them were, it was like it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't, I was so bored after ten minutes that I was like, okay, no, I'm just gonna have to find something else to read. So I created with AI Studio, vibe coded, a synthetic text generator, which actually I thought was probably a better way of doing it, because it would give me more short samples with more varied content. So I was like, okay, give me a voice note like I'm recording an email, give me a short story to read, give me prose to read. So it came up with all these different things, and I added a little timer to it so I could see how close I was to one hour.
And I spent like an hour one afternoon, or probably two hours by the time you do retakes and whatever, because you want to, it gave me a source of truth, which I'm not sure is the scientific way to approach this topic of gathering training data, but I thought made sense. I have a lot of audio data from recording voice notes which I've also kind of used, been experimenting with using for a different purpose. Slightly different: annotating task types. It's more a text classification experiment or, well, it's more than that actually. I'm working on a voice app. So it's a prototype, I guess, is really more accurate. But you can do that and you can work backwards. Listen back to a voice note and you painfully go through one of those transcribing, where you start and stop and scrub around it and you fix the errors, but it's really, really boring to do that. So I thought it would be less tedious in the long term if I just recorded the source of truth. So it gave me these three-minute snippets. I recorded them and saved an MP3 and a TXT in the same folder, and I created an hour of that data. So I was very hopeful, quietly, a little bit hopeful, that I would be able, that I could actually fine tune Whisper. I wanted to fine tune Whisper, because when I got into voice tech last November, my wife was in the US and I was alone at home. And when crazy people like me do really wild things like use voice to tech technology. That was basically when I started doing it, I didn't feel like a crazy person speaking to myself. And my expectations weren't that high. I'd used speech tech now and again, tried it out. I was like, it'd be really cool if you could just like speak into your computer, and whatever I tried out that had Linux support was just, it was not good basically. And this blew me away from the first go.
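The matched MP3 plus TXT "source of truth" pairs described above can be swept into (audio, text) training records with a simple folder scan. A minimal sketch; the file names are hypothetical stand-ins:

```python
from pathlib import Path
import tempfile

# Pair each transcript .txt with the .mp3 of the same name, skipping
# any audio that has no matching source-of-truth text.
def collect_pairs(root):
    root = Path(root)
    pairs = []
    for txt in sorted(root.rglob("*.txt")):
        mp3 = txt.with_suffix(".mp3")
        if mp3.exists():
            pairs.append({"audio": str(mp3), "text": txt.read_text().strip()})
    return pairs

# Demo with throwaway files (hypothetical names):
with tempfile.TemporaryDirectory() as d:
    Path(d, "clip_001.mp3").write_bytes(b"")           # stand-in audio
    Path(d, "clip_001.txt").write_text("read prose")   # its source of truth
    Path(d, "clip_002.mp3").write_bytes(b"")           # unlabeled, skipped
    pairs = collect_pairs(d)
print(len(pairs), pairs[0]["text"])  # 1 read prose
```

Records in this shape drop straight into an audio dataset (for example via the Hugging Face `datasets` library) for Whisper fine-tuning.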
I mean, it wasn't one hundred percent accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. You know, there's a point where it's so like, the transcript is, you don't have to get one hundred percent accuracy for it to be worth your time, for speech to text to be a worthwhile addition to your productivity. But you do need to get above, let's say, I don't know, eighty five percent. If it's sixty percent or fifty percent, you inevitably say, Screw it, I'll just type it. Because you end up missing errors in the transcript and it becomes actually worse. You end up in a worse position than you started with it. That's been my experience. So I was like, Oh, this is actually really, really good now. How did that happen? And the answer is ASR, Whisper being open sourced and the transformer architecture, if you want to go back to the underpinnings, which really blows my mind and it's on my list to read through that paper. All you need is attention as attentively as can be done with my limited brain, because it's super super high level stuff, super advanced stuff, I mean. But that, I think, of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities, I find it fascinating that few people are like, hang on, you've got this thing that can speak to you like a chatbot, an LLM. And then you've got image generation. Okay. So firstly, those two things on the surface have nothing in common. So how did that just happen all at the same time? And then when you extend that further, you're like, Suno. You can sing a song and AI will come up with an instrumental. And then you've got Whisper and you're like, Wait a second. How did all this stuff, if it's all AI, there has to be some commonality. Otherwise, these are totally different technologies on the surface of it. And the transformer architecture is, as far as I know, the answer.
And I can't even say, can't even pretend that I really understand what the transformer architecture means in-depth. But I have scanned this and as I said, I want to print it and really kind of think over it at some point. And I'll probably feel bad about myself, I think, because weren't those guys in their twenties? Like, that's crazy. I think I asked ChatGPT once who wrote that paper and how old were they when it was published on arXiv? And I was expecting like, I don't know, what do you imagine? I personally imagine kind of like, you know, you have these breakthroughs during COVID and things like that, where like these kind of really obscure scientists who are in their 50s and they've just kind of been laboring in labs wearily and writing and publishing in kind of obscure academic publications. And they finally hit it big or win a Nobel Prize and then they're household names. So that was kind of what I had in mind. That was the mental image I'd formed of the birth of ArcSim. Like I wasn't expecting twenty somethings in San Francisco. I thought that was both very funny, very cool, and actually kind of inspiring. It's nice to think that people who just you might put them in the kind of milieu or bubble or world that you are in, or credibly in, through a series of connections, that are coming up with such literally world changing innovations. So that was, I thought anyway, that's, that was cool. Okay. Voice training data. How are we doing? We're about ten minutes in, and I'm still talking about voice technology. So Whisper was brilliant, and I was so excited that my first instinct was to guess, like, Oh my gosh, I have to get a really good microphone for this. So I didn't go on a spending spree because I said, I'm gonna have to just wait a month and see if I still use this. And it just kind of became, it's become really part of my daily routine.
Like if I'm writing an email, I'll record a voice note and then I've developed and it's nice to see that everyone is like developing the same things in parallel. That's kind of a weird thing to say, when I started working on these prototypes on GitHub, which is where I just kind of share very freely and loosely ideas and first iterations on concepts. And for want of a better word, I called it like LLM post processing or clean up or basically a system prompt that after you get back the raw text from Whisper, you run it through a model and say, okay, this is crappy text, like add sentence structure and, you know, fix it up. And now when I'm exploring the different tools that are out there that people have built, I see quite a number of projects have basically done the same thing. Lest that be misconstrued, I'm not saying for a millisecond that I inspired them. I'm sure this has been a thing that's been integrated into tools for a while, but it's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent because text that doesn't have any punctuation or paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that again, moves speech tech into that before that inflection point where you're like, nah, it's just not worth it. It's like, it'll just be quicker to type this. So it's a big, it's a little touch that actually is a big deal. So I was on Whisper and I've been using Whisper and I kind of early on found a couple of tools. I couldn't find what I was looking for on Linux, which is basically just something that'll run in the background. You'll give it an API key and it will just like transcribe with like a little key to start and stop the dictation. And the issues were, I discovered that like most people involved in creating these projects were very much focused on local models, running Whisper locally because you can.
And I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech to text APIs and what I was spending, I just thought it's actually, in my opinion, just one of the better deals in API spending in the cloud. Like, it's just not that expensive for very, very good models that are much more, you know, you're gonna be able to run the full model, the latest model versus whatever you can run on your average GPU unless you want to buy a crazy GPU. It doesn't really make sense to me. Privacy is another concern that I know is kind of like very much a separate thing, that people just don't want their voice data and their voice leaving their local environment, maybe for regulatory reasons as well. But I'm not in that. I don't really care about people listening to my grocery list, consisting of reminding myself that I need to buy more beer, Cheetos, and hummus, which is kind of the three staples of my diet during periods of poor nutrition. But the kind of stuff that I transcribe, it's just not. It's not a privacy thing I'm that sort of sensitive about, and I don't do anything so sensitive or secure that requires air gapping. I looked at the pricing and especially the kind of older models, mini. Some of them are very, very affordable, and I did a calculation once with ChatGPT and I was like, okay, this is the API price for, I can't remember whatever the model was. Let's say I just go at it like nonstop, which rarely happens. Probably, I would say on average I might dictate thirty to sixty minutes per day if I was probably summing up the emails, documents, outlines, which is a lot, but it's still a fairly modest amount. And I was like, well, some days I do go on, like one or two days where I've been, usually when I'm like kind of out of the house and just have something, like I have nothing else to do.
Like if I'm at a hospital, we have a newborn, and you're waiting for like eight hours and hours for an appointment. And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, Oh, wait, let me just get down. Let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say if I'm going to price out cloud STT. If I was like dedicated every second of every waking hour to transcribing for some odd reason, I mean I'd have to eat and use the toilet. There's only so many hours I'm awake for. So let's just say a maximum of forty five minutes in the hour, then I said, All right, let's just say fifty. Who knows? You're dictating on the toilet. We do it. So you could just do sixty, but whatever I did, and every day, like you're going flat out seven days a week dictating nonstop. I was like, What's my monthly API bill going to be at this price? And it came out to like seventy or eighty bucks. And I was like, Well, that would be an extraordinary amount of dictation. And I would hope that there was some compelling reason worth more than seventy dollars that I embarked upon that project. So given that that's kind of the max point for me, I said that's actually very, very affordable. Now, if you want to spec out the costs and you want to do the post processing that I really do feel is valuable, that's going to cost some more as well. Unless you're using Gemini, which, needless to say, as a random person sitting in Jerusalem, I have no affiliation with Google, nor Anthropic, nor Gemini, nor any major tech vendor for that matter. I like Gemini not so much as an everyday model. It's kind of underwhelmed in that respect, I would say. But for multimodal, I think it's got a lot to offer.
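The back-of-the-envelope estimate above is easy to reproduce. The per-minute rate below is an illustrative assumption for a cheap "mini"-tier STT model, not a quoted price; plug in whatever the current API pricing actually is:

```python
# Worst-case monthly cloud STT bill: dictating `minutes_per_hour` out of
# every waking hour, every day of the month, at `rate` dollars per minute.
RATE_PER_MIN = 0.003  # USD/min, hypothetical "mini"-tier price

def monthly_stt_cost(minutes_per_hour=50, waking_hours=16, days=30, rate=RATE_PER_MIN):
    return minutes_per_hour * waking_hours * days * rate

print(round(monthly_stt_cost(), 2))  # flat-out worst case
print(round(monthly_stt_cost(minutes_per_hour=60 / 16), 2))  # a more typical ~60 min/day
```

Even the flat-out scenario lands around the seventy-to-eighty-dollar ballpark discussed above, which is why an average of thirty to sixty minutes of dictation per day barely registers on the bill.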
And I think that the transcribing functionality, whereby it can process audio with a system prompt and both give you a transcription that's cleaned up, that reduces two steps to one. And that for me is a very, very big deal. And I feel like even Google hasn't really sort of thought through how useful that modality is and what kind of use cases you can achieve with it. Because I found in the course of this year just an endless list of really kind of system prompt stuff that I can say, okay, I've used it to capture context data for AI, which is literally, I might speak, if I wanted to have a good bank of context data about, who knows, my childhood. More realistically, maybe my career goals, something that would just be like really boring to type out. So I'll just like sit in my car and record it for ten minutes. And in that ten minutes you get a lot of information in. Emails, which is short text. Just there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards, and different treatment. My context pipeline is kind of like just extract the bare essentials. You end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work. And it goes, it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database. Daniel has worked in technology. Daniel has been working in, you know, stuff like that. That's not how you would speak, but I figure it's probably easier to parse for, after all, robots. So we've almost got to twenty minutes, and this is actually a success, because I wasted twenty minutes of the evening speaking into a microphone and the levels were shot and it was clipping, and I said I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. What am I hoping to achieve in this? Okay, my fine tune was a dud as mentioned.
Deepgram STT, I'm really, really hopeful that this prototype will work, and it's a build-in-public open source project, so anyone is welcome to use it if I make anything good. But that was really exciting for me last night when, after hours of trying my own prototype, I saw someone had just made something that works like that. You're not gonna have to build a custom conda environment and image. I have an AMD GPU, which makes things much more complicated. I didn't find it and I was about to give up, and I said, all right, let me just give Deepgram's Linux thing a shot. And if this doesn't work, I'm just gonna go back to trying to vibe code something myself. And when I ran the script, I was using Claude Code to do the installation process, it ran the script and, oh my gosh, it works just like that. The tricky thing, for all those who want to know all the nitty gritty details, was that I don't think it was actually struggling with transcription, but pasting on Wayland makes life very hard. And I think there was something not running at the right time. Anyway, Deepgram, I looked at how they actually handle that, because it worked out of the box when other stuff didn't. And it was quite a clever little mechanism. But more so than that, the accuracy was brilliant. Now what am I doing here? This is gonna be a twenty minute audio sample. And I think I've done one or two of these before, but I did it with short, snappy voice notes. This is kind of long form. This actually might be a better approximation for what's useful to me than voice memos. Like, I need to buy three liters of milk tomorrow and pita bread, which is probably how half my voice notes sound. Like if anyone were to find my phone they'd be like, this is the most boring person in the world. Although actually there are some journaling thoughts as well, but it's a lot of content like that. 
And probably, for the evaluation, the most useful thing is slightly obscure tech: GitHub, Nucleano, Hugging Face. Not so obscure that it's not gonna have a chance of knowing it, but hopefully sufficiently well known that the model should get it. I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general I've spoken, delivered this, at a faster pace than I usually would, owing to strong coffee flowing through my bloodstream. And the thing that I'm not gonna get in this benchmark is background noise, which, in my first take that I had to get rid of, my wife came in with my son for a good night kiss. And that actually would have been super helpful to get in, because it was non-diarized, or if we had diarization, a female voice, and I could say I want the male voice, and that wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I've done in my main data set, where I am trying to go back to some of my voice notes, annotate them, and run a benchmark. But this is going to be just a pure quick test. And, as someone working on a voice note idea, that's my sort of end motivation, besides thinking it's an absolutely outstanding technology that's coming to viability. And really, I know this sounds cheesy, it can actually have a very transformative effect. Voice technology has been life changing for folks living with disabilities. And I think there's something really nice about the fact that it can also benefit folks who are able-bodied, and we can all in different ways make this tech as useful as possible regardless of the exact way that we're using it. And I think there's something very powerful in that, and it can be very cool. I see huge potential. What excites me about voice tech? A lot of things actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this, and it's getting better and better with stuff like accent handling. 
I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day as I imagine, getting superb, flawless word error rates, because I'm just kind of skeptical about local speech to text, as I mentioned, and I think about the pace of innovation and improvement in the models. The main reasons for fine tuning, from what I've seen, have been, and something that really blows my mind about ASR is the idea that it's inherently alingual or multilingual, phonetic based, folks who speak very obscure languages, where there might be a paucity of training data or almost none at all, and therefore the accuracy is significantly reduced. Or folks in very critical environments. I know this is used extensively in medical transcription and dispatcher work, you know, the call centers who send out ambulances, etc., where accuracy is absolutely paramount, and in the case of doctors and radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things, and I'm not sure that really, just for trying to make it better on a few random tech words with my slightly, I mean, I have an accent, but, like, not, you know, an accent that a few other million people have, ish. I'm not sure that my little fine tune is gonna actually, like, deliver the bump in word error reduction. If I ever actually figure out how to do it and get it up to the cloud, by the time we've done that, I suspect that the next generation of ASR will just be so good that it will kind of be, well, that would have been cool if it worked out, but I'll just use this instead. So that's gonna be it for today's episode of voice training data. Single, long shot evaluation. Who am I gonna compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head to head with two things really. One is Whisper variants. So you've got these projects like Faster Whisper and Distil-Whisper. It's a bit confusing. 
There's a whole bunch of them. And the emerging ASRs, which are also a thing. My intention for this is, I'm not sure I'm gonna have the time at any point in the foreseeable future to go back through this whole episode and create a proper source of truth where I fix everything. I might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face, I think would be great, probably how I might visualize this, is having the audio waveform play and then have the transcript for each model below it, maybe even to scale, and maybe even a local one as well, like local Whisper versus the OpenAI API, etcetera. And I can then actually listen back to segments, or anyone who wants to can listen back to segments of this recording, and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER. But that would require the source of truth. Okay, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. You want to see, I always think I've just said it as something I didn't intend to. STT, I said, for those listening carefully, including hopefully the models themselves. This has been myself, Daniel Rosehill. For more jumbled repositories about my roving interests in AI, but particularly agentic, MCP and voice tech, you can find me on GitHub, Hugging Face, and, where else, danielrosehill.com, which is my personal website, as well as this podcast whose name I sadly cannot remember. Until next time. Thanks for listening.
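The headline WER comparison mentioned above is just word-level edit distance divided by the length of the reference transcript. A minimal self-contained sketch, with a toy reference/hypothesis pair for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance between the
    source-of-truth transcript and a model's output, divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("pita" -> "peter") across six reference words.
print(wer("i need to buy pita bread", "i need to buy peter bread"))
```

In practice a library such as jiwer (which also normalizes case and punctuation) would be used per model against the fixed source of truth, but the metric itself is no more than this.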
data/inference/runs/cloud-stt/manual-3/raw_response.json ADDED
The diff for this file is too large to render. See raw diff
 
data/inference/runs/cloud-stt/manual-3/transcript.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast. Or I may append this to a podcast that I set up recently regarding my with my thoughts on speech tech and AI in particular, more AI in generative AI, I would say. But in any event, the purpose of this Voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation, as they might say, for different speech attack models. And I'm doing this because I thought I had made a great breakthrough in my journey with speech tech, and that was succeeding in the elusive task of fine-tuning Whisper. Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking. I might whisper something at some point. As well. And I'll go back to speaking loud in, in different parts. I'm going to sound really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to put a speech attacks model through its paces, which is trying to make sense of is this guy just rambling on incoherently in one long sentence or are these just actually a series of step, standalone, step alone, standalone sentences? And how is it gonna handle step alone? That's not a word. What happens when you use speech to text and you use a fake word? And then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why was it trying to fine tune Whisper? And what is Whisper? As I said, I'm going to try to record this at a couple of different levels of technicality for folks who are, you know, in the normal world and not totally stuck down the rabbit hole of AI, which I have to say is a really wonderful rabbit hole to be down. 
It's a really interesting area and speech and voice tech is the aspect of it that I find actually the most, I'm not sure I would say the most interesting because there's just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I'm persevering hard with the task of trying to get a good solution working for Linux, which if anyone actually does listen to this, not just for the training data and for the actual content, this is sparked I had, besides the fine tune not working, well, that was the failure. Um, I used Claude code because one thinks these days that there is nothing short of solving, you know, the, the reason of life or something, that Claude and agentic AI can't do, which is not really the case. Uh, it does seem that way sometimes, but it fails a lot as well. And this is one of those, instances where last week I put together an hour of voice training data, basically speaking, just random things for 3 minutes. And it was actually kind of tedious because the texts were really weird. Some of them were it was like it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't. I was so bored after 10 minutes that I was like, okay, no, I'm just going to have to find something else to read. So I used a created with AI studio vibe coded a synthetic text generator. Which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note, like I'm recording an email, give me a short story to read, give me prose to read. So I came up with all these different things and they added a little timer to it so I could see how close I was to one hour. And I spent like an hour one afternoon or probably two hours by the time you you do retakes. 
And whatever, because you want to, it gave me a source of truth, which I'm not sure if that's the scientific way to approach this topic of gathering, training data, but I thought made sense. Um, I have a lot of audio data from recording voice notes, which I've also kind of used, been experimenting with using for a different purpose, slightly different annotating task types. It's more a text classification experiment or, Well, it's more than that actually. I'm working on a voice app. So it's a prototype, I guess, is really more accurate. But you can do that and you can work backwards. You're like, you listen back to a voice note and you painfully go through one of those transcribing, you know, where you start and stop and scrub around it and you fix the errors, but it's really, really boring to do that. So I thought it would be less tedious in the long term if I just recorded the source of truth. So it gave me these three minute snippets. I recorded them. It saved an MP3 and a TXT in the same folder, and I created an error with that data. So I was very hopeful, quietly, a little bit hopeful that I could actually fine tune Whisper. I want to fine tune Whisper because when I got into Voicetech last November, my wife was in the US and I was alone at home. And when crazy people like me do really wild things like use voice to tech technology. That was basically when I started doing it, I didn't feel like a crazy person speaking to myself. And my expectations weren't that high. I used speech tech now and again, tried it out. It was like, it'd be really cool if you could just, like, speak into your computer. And whatever I tried out that had Linux support was just. It was not good, basically. And this blew me away from the first go. I mean, it wasn't 100% accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. 
You know, there's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for speech attacks to be a worthwhile addition to your productivity, but you do need to get above, let's say, I don't know, 85%. If it's 60% or 50%, you inevitably say, screw it, I'll just type it because you end up missing errors in the transcript and it becomes actually worse. You end up in a worse position than you started with. That's been my experience. So I was like, oh, this is actually really, really good now. How did that happen? And the answer is ASR whisper being open source and the transformer architecture. If you want to go back to the to the underpinnings, which really blows my mind and it's on my list. To read through that paper. All you need is attention as attentively as can be done with my limited brain because it's super, super high level stuff, super advanced stuff, I mean. But that, I think of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities. I find it fascinating that a few people are like, hang on, you've got this thing that can speak to you, like a chatbot, an LLM, and then you've got image generation. Okay, so firstly, those two things on the surface have nothing in common. So like, how are they, how did that just happen all at the same time? And then when you extend that further, you're like, Suno, right? You can sing a song and AI will come up with and instrumental. And then you've got Whisper and you're like, wait a second, how did all this stuff, like, if it's all AI, what's like, there has to be some commonality. Otherwise, these are totally different technologies on the surface of it. And the Transformer architecture is, as far as I know, the answer. And I can't even say, can't even pretend that I really understand what the Transformer architecture means. 
In depth, but I have scanned it and as I said, I want to print it and really kind of think over it at some point. And I'll probably feel bad about myself, I think, because weren't those guys in their 20s? Like, that's crazy. I think I asked ChatGPT once who wrote that paper and how old were they when it was published in Arciv? And I was expecting, like, I don't know, What do you imagine? I personally imagine kind of like, you know, you have these breakthroughs during COVID and things like that where like these kind of really obscure scientists are like in their 50s and they've just kind of been laboring in labs and wearily in writing and publishing in kind of obscure academic publications. And they finally like hit a big or win a Nobel Prize and then their household names. So that was kind of what I had in mind. That was the mental image I'd formed of the birth of Arcsight. Like I wasn't expecting 20-somethings in San Francisco, though. I thought that was both very, very funny, very cool, and actually kind of inspiring. It's nice to think that people who, you know, just you might put them in the kind of milieu or bubble or world that you are in are credibly in through, you know, the series of connections that are coming up with such literally world changing innovations. So that was, I thought, anyway. That's that was cool. Okay, voice training data. How are we doing? We're about 10 minutes and I'm still talking about voice technology. So Whisper was brilliant and I was so excited that I was my first instinct was to like guess like, oh my gosh, I have to get like a really good microphone for this. So I didn't go on a spending spree because I said, I'm gonna have to just wait a month and see if I still use this. And It just kind of became, it's become really part of my daily routine. Like if I'm writing an email, I'll record a voice note. And then I've developed and it's nice to see that everyone is like developing the same things in parallel. 
Like that's my kind of a weird thing to say, but when I look, I kind of came, when I started working on this, these prototypes on GitHub, which is where I just kind of share very freely and loosely, ideas and first iterations on concepts. And for want of a better word, I called it like LLM post-processing or cleanup or basically a system prompt that after you get back the raw text from Whisper, you run it through a model and say, okay, this is crappy text, like add sentence structure and fix it up. And now when I'm exploring the different tools that are out there that people have built, I see quite a number of projects have basically done the same thing, lest that be misconstrued. I'm not saying for a millisecond that I inspired them. I'm sure this has been a thing that's been integrated into tools for a while, but it's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent because text that doesn't have any punctuation or Paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that again, it's, it's, it, it moves speech tech into that before that inflection point where you're like, no, it's just not worth it. It's like, it's, it'll just be quicker to type this. So it's a big, it's a little touch that actually is a big deal. Uh, so I was on Whisper and I've been using Whisper and I kind of, early on found a couple of tools. I couldn't find what I was looking for on Linux, which is basically just something that'll run in the background. It'll give it an API key and it will just like transcribe with like a little key to start and stop the dictation. And the issues were I discovered that like most people involved in creating these projects were very much focused on local models, running Whisper locally because you can. And I tried that a bunch of times and just never got results that were as good as the cloud. 
And when I began looking at the cost of the speech to text APIs and what I was spending, I just thought there is, it's actually, in my opinion, just one of the better deals in API spending and in cloud. Like it's just not that expensive for very, very good models that are much more, you know, you're gonna be able to run the full model. The latest model versus whatever you can run on your average GPU, unless you want to buy a crazy GPU. It doesn't really make sense to me. Now, privacy is another concern that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment, maybe for regulatory reasons as well. But I'm not in that. I neither really care about people listening to my grocery list consisting of reminding myself that I need to buy more beer, Cheetos, and hummus, which is kind of the three staples of my diet during periods of poorer nutrition. But the kind of stuff that I transcribe, it's just not, it's not a privacy thing I'm that sort of sensitive about and I don't do anything so sensitive or secure that requires air gapping. So I looked at the pricing and especially the kind of older model mini Some of them are very, very affordable. And I did a back of the, I did a calculation once with ChatGPT and I was like, okay, this is the API price for I can't remember whatever the model was. Let's say I just go at it like nonstop, which it rarely happens. Probably, I would say on average, I might dictate 30 to 60 minutes per day if I was probably summing up the emails, documents, outlines, which is a lot, but it's still a fairly modest amount. And I was like, Some days I do go on like one or two days where I've been usually when I'm like kind of out of the house and just have something like I have nothing else to do. Like if I'm at a hospital, we have a newborn and you're waiting for like eight hours and hours for an appointment. 
And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, oh, wait, let me just get down. Let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say if I'm gonna price out Cloud SCT, if I was like dedicated every second of every waking hour to transcribing for some odd reason, I mean, I'd have to like eat and use the toilet. Like, you know, there's only so many hours I'm awake for. So like, let's just say a maximum of like 40 hour, 45 minutes. In the hour. Then I said, all right, let's just say 50. Who knows? You're dictating on the toilet. We do it. So it could be. You could just do 60. But whatever I did. And every day, like, you're going flat out seven days a week dictating non-stop I was like, what's my monthly API bill gonna be at this price? And it came out to, like, 70 or 80 bucks. And I was like, well, that would be an extraordinary. Amount of dictation. And I would hope that there was some compelling reason more worth more than $70 that I embarked upon that project. So given that that's kind of the max point for me, I said that's actually very, very affordable. Now you're gonna, if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable, that's gonna cost some more as well, unless you're using Gemini, which needless to say is a random person sitting in Jerusalem. I have no affiliation, nor with Google, nor anthropic, nor Gemini, nor any major tech vendor for that matter. I like Gemini not so much as a everyday model. It's kind of underwhelmed in that respect, I would say. But for multimodal, I think it's got a lot to offer. And I think that the transcribing functionality whereby it can process audio with a system prompt and both give you transcription that's cleaned up that reduces two steps to one. And that for me is a very, very big deal. 
And I feel like even Google has haven't really sort of thought through how useful the that modality is and what kind of use cases you can achieve with it. Because I found in the course of this year, just an endless list of really kind of system prompt system prompt stuff that I can say, okay, I've used it to capture context data for AI, which is literally I might speak for if I wanted to have a good bank of context data about who knows my childhood more realistically, maybe my career goals, something that would just be like really boring to type out. So I'll just like sit in my car and record it for 10 minutes. And that 10 minutes you get a lot of information in. Um, emails, which is short text, just there is a whole bunch and all these workflows kind of require a little bit of treatment afterwards and different treatment. My context pipeline is kind of like just extract the bare essentials. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work. And it goes, it condenses that down to very robotic language that is easy to chunk parse and maybe put into a vector database. Daniel has worked in technology. Daniel has been working in, you know, stuff like that. That's not how you would speak, but I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes and this is actually a success because I wasted 20 minutes of the evening speaking into a microphone and the levels were shot and it was clipping and I said, I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. What am I hoping to achieve in this? Okay, my fine tune was a dud as mentioned. DeepChrom ST, I'm really, really hopeful that this prototype will work and it's a build in public open source, so anyone is welcome to use it if I make anything good. 
But that was really exciting for me last night when after hours of trying my own prototype, seeing someone just made something that works like that, you know, you're not gonna have to build a custom conda environment and image. I have AMD GPU, which makes things much more complicated. I didn't find it. And I was about to give up and I said, all right, let me just give Deep Grams Linux thing a shot. And if this doesn't work, I'm just going to go back to trying to Vibe code something myself. And when I ran the script, I was using Claude code to do the installation process. It ran the script and oh my gosh, it works just like that. The tricky thing For all those who want to know all the nitty gritty details, was that I don't think it was actually struggling with transcription, but pasting Wayland makes life very hard. And I think there was something not running the right time. Anyway, Deepgram, I looked at how they actually handled that because it worked out of the box when other stuff didn't. And it was quite a clever little mechanism. And but more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20 minute audio sample. And I think I've done one or two of these before, but I did it with short snappy voice notes. This is kind of long form. This actually might be a better approximation for what's useful to me than voice memos. Like, I need to buy three Bread, eaters of milk tomorrow and Peter bread, which is probably how like half my voice notes sound. Like if anyone were to, I don't know, like find my phone, they'd be like, this is the most boring person in the world. Although actually, there are some like kind of journaling thoughts as well, but it's a lot of content like that. And the probably for the evaluation, the most useful thing is slightly obscure tech, GitHub, NeocleNo, hugging face, Not so obscure that it's not going to have a chance of knowing it, but hopefully sufficiently well known that the model should get it. 
I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general, I've spoken, delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise, which in my first take that I had to get rid of, My wife came in with my son and for a goodnight kiss. And that actually would have been super helpful to get in because it was non diarized or if we had diarization, a female, I could say, I want the male voice and that wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I've done in my main data set where I am trying to go back to some of my voice notes. Annotate them and run a benchmark. But this is going to be just a pure quick test. And as someone, I'm working on a voice note idea. That's my sort of end motivation. Besides thinking it's an ask to the outstanding technology that's coming to viability. And really, I know this sounds cheesy, can actually have a very transformative effect. It's, you know, voice technology has been life changing for folks living with disabilities. And I think there's something really nice about the fact that it can also benefit, you know, folks who are able bodied and like we can all in different ways make this tech as useful as possible, regardless of the exact way that we're using it. And I think there's something very powerful in that and it can be very cool. I see huge potential. What excites me about Voicetech? A lot of things actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this. And it's getting better and better with stuff like accent handling. I'm not sure my fine-tune will actually ever come to fruition in the sense that I'll use it day to day as I imagine. 
I get like superb flawless words error rates because I'm just kind of skeptical about Local speech to text, as I mentioned, and I think the pace of innovation and improvement in the models, the main reasons for fine tuning from what I've seen have been people who are something that really blows my mind about ASR is the idea that it's inherently a lingual or multilingual phonetic based. So as folks who use speak very obscure languages, that there might be a paucity of training data or almost none at all, and therefore the accuracy is significantly reduced. Or folks in very critical environments, I know this is used extensively in medical transcription and dispatcher work, the call centers who send out ambulances, et cetera, where accuracy is absolutely paramount. And in the case of doctors, radiologist, they might be using very specialized vocab all the time. So those are kind of the main two things that I'm not sure that really just for trying to make it better on a few random tech words with my slightly, I mean, I have an accent, but like not, you know, an accent that a few other million people have ish. I'm not sure that my little fine tune is gonna actually like the bump in word error reduction, if I ever actually figure out how to do it and get it up to the cloud. By the time we've done that, I suspect that the next generation of ASR will just be so good that it will kind of be, well, that would have been cool if it worked out, but I'll just use this instead. So that's going to be it for today's episode of voice training data. Single long shot evaluation. Who am I going to compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head to head with two things, really. One is Whisper variants. So you've got these projects like faster Distill Whisper, it's a bit confusing, there's a whole bunch of them. And the emerging ASRs, which are also a thing. 
My intention for this is I'm not sure I'm going to have the time in any point in the foreseeable future to go back through this whole episode and create a proper source truth, where I fix everything. Might do it if I can get one transcriptions that sufficiently close to perfection. But what I would actually love to do on Hugging Face, I think would be a great probably how I might visualize this is having the audio waveform play and then have the transcript for each model below it and maybe even a like, you know, to scale and maybe even a local one as well, like local whisper versus OpenAI API, et cetera. And, I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER, but that would require the source of truth. Okay, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. You want to see that I always feel think I've just said as something I didn't intend to. STT, I said for those. Listen carefully, including hopefully the models themselves. This has been myself, Daniel Rosell. For more jumbled repositories about my roving interests in AI, but particularly agentic, MCP and Voicetech, you can find me on GitHub, huggingface.com, which is my personal website, as well as this podcast, whose name I sadly cannot remember. Until next time, thanks for listening.
data/inference/runs/cloud-stt/manual-4/transcript.txt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast. Or it, uh, I may append this to a podcast that I set up recently. Um, regarding my, uh, with my thoughts on speech, tech and AI in particular, more AI and generative AI, I would, uh, I would say, but in any event, the purpose of this, um, voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation, as they might say, for different speech to text models. And I'm doing this because I, uh, I thought I'd made a great breakthrough in my journey with speech tech, and that was succeeding in the elusive task of fine tuning. Whisper, whisper is. And I'm going to just talk. I'm trying to mix up, uh, I'm going to try a few different styles of speaking. I might whisper something at some point as well, and I'll go back to speaking loud in, uh, in different parts. I'm going to sound really like a crazy person, because I'm also going to try to speak at different pitches and cadences in order to really try to put a speech to text model through its paces, which is trying to make sense of, is this guy just on incoherently in one long sentence, or are these just actually a series of step standalone, standalone, standalone sentences? And how is it going to handle step alone? That's not a word. Uh, what happens when you use speech to text and you use a fake word and then you're like, wait, that's not actually that word doesn't exist. How does AI handle that? And, uh, these and more are all the questions that I'm seeking to answer in this training data. Now, why did why was it trying to fine tune a whisper? And what is whisper? As I said, I'm gonna try to, uh, record this at a couple of different levels of technicality for folks who are, uh, you know, in the normal, uh, world and not totally stuck down the rabbit hole of AI, uh, which I have to say is a really wonderful, uh, rabbit hole to be to be down. 
Um, it's a really interesting area. And speech and voice tech is is the aspect of it that I find actually most. I'm not sure I would say the most interesting, because there's just so much that is fascinating in AI. Uh, but the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I'm persevering hard with the task of trying to guess a good solution working for Linux, which if anyone actually does listen to this, not just for the training data and for the actual content, uh, this is this is has sparked I had besides the fine tune not working. Well, that was the failure. Um, I used clod code because one thinks these days that there is nothing short of solving, you know, the, uh, the reason of life or something. Uh, that clod and agentic AI can't do, uh, which is not really the case. Uh, it does seem that way sometimes, but it fails a lot as well. And this is one of those, uh, instances where last week I put together an hour of voice training data, basically speaking just random things for three minutes. And, um, it was actually kind of tedious because the texts were really weird. Some of them were it was like it was AI generated. Um, I tried before to read Sherlock Holmes for an hour and I just couldn't. I was so bored, uh, after ten minutes that I was like, okay, now I'm just gonna have to find something else to read. So I used a created with AI studio vibe coded. A synthetic text generator. Um, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note, like I'm recording an email, give me a short story to read, give me prose, um, to read. So I came up with all these different things, and I added a little timer to it so I could see how close I was to one hour. 
Um, and, uh, I spent like an hour one afternoon or probably two hours by the time you, um, you do retakes or whatever because you want to. It gave me a source of truth, which I'm not sure if that's the scientific way to approach this topic of gathering, uh, training data, but I thought it made sense. Um, I have a lot of audio data from recording voice notes, which I've also kind of used, um, been experimenting with using for a different purpose, slightly different annotating task types. It's more text classification experiment or uh, well, it's more than that, actually. I'm working on a voice app, so it's a prototype I guess is really more accurate. Um, but you can do that and you can work backwards. You're like, you listen back to a voice note and you painfully go through one of those transcribing, you know, where you start and stop and scrub around it and you fix the errors. But it's really, really boring to do that. So I thought it would be less tedious in the long term if I just recorded The Source of truth. So it gave me these three minute snippets. I recorded them and saved an MP3 and a txt in the same folder, and I created an hour of that data. Uh, so I was very hopeful, quietly, you know, a little bit hopeful that I would be able that I could actually fine tune, whisper. Um, I want to fine tune whisper because when I got into voice tech last November, my wife was in the US and I was alone at home. And you know, when crazy people like me do really wild things like use voice to tech, uh, technology. That was basically, um, when I started doing it, I didn't feel like a crazy person speaking to myself, and my expectations weren't that high. Uh, I used speech tech now and again. Um, tried it out. I was like, it'd be really cool if you could just, like, speak into your computer. And whatever I tried out that had Linux support was just. It was not good, basically. Um, and this blew me away from the first go. 
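The MP3-plus-txt folder layout described above maps naturally onto a fine-tuning manifest. A sketch of the pairing step, assuming one same-stem `.txt` source of truth per `.mp3`; the JSONL field names are illustrative, not taken from any particular training script:

```python
import json
from pathlib import Path

def build_manifest(folder: str, out_path: str) -> int:
    """Pair each .mp3 with the same-stem .txt and write one JSONL record
    per sample; returns how many pairs were written."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for mp3 in sorted(Path(folder).glob("*.mp3")):
            txt = mp3.with_suffix(".txt")
            if not txt.exists():
                continue  # orphan recording with no source-of-truth text
            record = {"audio": str(mp3), "text": txt.read_text(encoding="utf-8").strip()}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```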
I mean, it wasn't 100% accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that, uh, pivot point that it's actually worth doing this. You know, there's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for speech to text to be a worthwhile addition to your productivity. But you do need to get above. Let's say, I don't know, 85%. If it's 60% or 50%, you inevitably say, screw it. I'll just type it because you end up missing errors in the transcript and it becomes actually worse. You end up in a worse position than you started with. And that's been my experience. So, um, I was like, oh, this is actually really, really good. Now how did that happen? And the answer is ASR whisper being open sourced and the transformer architecture, if you want to go back to the, um, to the underpinnings, which really blows my mind and it's on my list to read through that paper. Um, all you need is attention as attentively as can be done with my limited brain because it's super, super high level stuff. Um, super advanced stuff. I mean, uh, but that I think of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities. I find it fascinating that few people are like, hang on, you've got this thing that can speak to you like a chatbot, an LLM, and then you've got image generation. Okay, so firstly, those two things on the surface have nothing in common. Um, so like how are they how did that just happen all at the same time. And then when you extend that further, um, you're like sooner, right? You can sing a song and AI will like, come up with an instrumental and then you've got whisper and you're like, wait a second, how did all this stuff, like, if it's all AI, what's like there has to be some commonality. Otherwise these are four. These are totally different technologies on the surface of it. 
And, uh, the transformer architecture is, as far as I know, the answer. And I can't even say can't even pretend that I really understand what the transformer architecture means in depth, but I have scanned it and as I said, I want to print it and really kind of think over it at some point, and I'll probably feel bad about myself, I think, because weren't those guys in their in their 20s like, that's crazy. I think I asked ChatGPT once who were the who wrote that paper and how old were they when it was published in arXiv? And I was expecting like, I don't know, what do you what do you imagine? I personally imagine kind of like, you know, you have these breakthroughs during Covid and things like that where like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring in labs and, uh, wearily and writing in publishing in kind of obscure academic publications. And they finally, like, hit a big or win a Nobel Prize and then their household household names. Uh, so that was kind of what I had in mind. That was the mental image I'd formed of the birth of arXiv. Like, I wasn't expecting 20 somethings in San Francisco, though I thought that was both very, very funny, very cool, and actually kind of inspiring. It's nice to think that people who, you know, just you might put them in the kind of milieu or bubble or world that you are in or credibly in, through, you know, a series of connections that are coming up with such literally world changing, um, innovations. Uh, so that was, I thought, anyway, that, that that was cool. Okay. Voice training data. How are we doing? We're about ten minutes, and I'm still talking about voice technology. Um, so whisper was brilliant, and I was so excited that I was. My first instinct was to, like, get like, oh, my gosh, I have to get, like, a really good microphone for this. So, um, I didn't go on a spending spree because I said, I'm gonna have to just wait a month and see if I still use this. 
And it just kind of became it's become really part of my daily routine. Like, if I'm writing an email, I'll record a voice note. And then I've developed and it's nice to see that everyone is like developing the same
2
+ things in parallel. Like, that's kind of a weird thing to say, but when I look, I kind of came when I started working on this, these prototypes on GitHub, which is where I just kind of share very freely and loosely, uh, ideas and, you know, first iterations on, on concepts, um, and for want of a better word, I called it like, uh, lm post-processing or cleanup or basically a system prompt that after you get back the raw text from whisper, you run it through a model and say, okay, this is crappy text, like add sentence structure and, you know, fix it up. And, um, now when I'm exploring the different tools that are out there that people have built, I see, uh, quite a number of projects have basically done the same thing, um, less that be misconstrued. I'm not saying for a millisecond that I inspired them. I'm sure this has been a thing that's been integrated into tools for a while, but it's it's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent, uh, because text that doesn't have any punctuation or paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that again, it's it's it moves speech tech into that before that inflection point where you're like, no, it's just not worth it. It's like it'll just be quicker to type this. So it's a big it's a little touch. That actually is a big deal. Uh, so I was on whisper and I've been using whisper and I kind of early on found a couple of tools. I couldn't find what I was looking for on Linux, which is, um, basically just something that'll run in the background. You'll give it an API key and it will just transcribe. Um. with, like, a little key to start and stop the dictation. Uh, and the issues were I discovered that, like most people involved in creating these projects were very much focused on local models running whisper locally, because you can. 
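The LLM post-processing step described above is essentially one system prompt wrapped around the raw STT output. A sketch, with the prompt wording and model name as illustrative assumptions, not taken from this repo:

```python
CLEANUP_SYSTEM_PROMPT = (
    "You are a transcript cleanup assistant. The user message is raw "
    "speech-to-text output. Add punctuation, sentence structure and "
    "paragraph breaks. Do not add or remove information."
)

def build_cleanup_messages(raw_transcript: str) -> list:
    """Chat payload for the second pass: raw STT text in, tidy text out."""
    return [
        {"role": "system", "content": CLEANUP_SYSTEM_PROMPT},
        {"role": "user", "content": raw_transcript},
    ]

# With the official OpenAI SDK the call would look roughly like:
#   client = openai.OpenAI()
#   resp = client.chat.completions.create(
#       model="gpt-4o-mini",  # hypothetical choice
#       messages=build_cleanup_messages(raw_text),
#   )
#   cleaned = resp.choices[0].message.content
```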
And I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech to text APIs and what I was spending, I just thought there's it's actually, in my opinion, just one of the better deals in API spending and in cloud. Like it's just not that expensive for very, very good models that are much more, you know, you're going to be able to run the full model, the latest model versus whatever you can run on your average GPU. Unless you want to buy a crazy GPU. It doesn't really make sense to me. Now, privacy is another concern. Um, that I know is kind of like a very much a separate thing that people just don't want their voice, data, and their voice leaving their local environment, maybe for regulatory reasons as well. Um, but I'm not in that. Um, I'm neither really care about people listening to my, uh, grocery list consisting of, uh, reminding myself that I need to buy more beer, Cheetos and hummus, which is kind of the three, three staples of my diet. Um, during periods of poor nutrition. Uh, but the kind of stuff that I transcribe, it's just not it's not a, it's not a privacy thing and that sort of sensitive about and, uh, I don't do anything so, you know, sensitive or secure, that requires air gapping. So, um, I looked at the pricing and especially the kind of older models, mini, um, some of them are very, very affordable. And I did a back of the I did a calculation once with ChatGPT and I was like, okay, this is a, this is the API price for I can't remember whatever the model was. Uh, let's say I just go at it like nonstop, which it rarely happens. Probably. I would say on average, I might dictate 30 to 60 minutes per day if I was probably summing up the emails, documents, outlines, um, which is a lot, but it's it's still a fairly modest amount. And I was like, well, some days I do go on like 1 or 2 days where I've been. 
Usually when I'm like kind of out of the house and just have something like, I have nothing else to do. Like if I'm at a hospital with a newborn, uh, and you're waiting for like eight hours and hours for an appointment, and I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, oh, wait, let me just get down. Let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say if I'm gonna price out. Cloud asked if I was like, dedicated every second of every waking hour to transcribing for some odd reason. Um. I mean, it'd have to, like, eat and use the toilet and, like, you know, there's only so many hours I'm awake for. So, like, let's just say a maximum of, like, 40 hours, 45 minutes in the hour. Then I said, all right, let's just say 50. Who knows? You're dictating on the toilet. We do it. Uh, so it could be you could just do 60. But whatever I did, and every day, like, you're going flat out seven days a week dictating non-stop. I was like, what's my monthly API bill going to be at this price? And it came out to like 70 or 80 bucks. And I was like, well, that would be an extraordinary amount of dictation. And I would hope that there was some compelling reason, more worth more than $70, that I embarked upon that project. Uh, so given that that's kind of the max point for me, I said, that's actually very, very affordable. Um, now you're gonna if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable. Um, that's going to cost some more as well, unless you're using Gemini, which, uh, needless to say, is a random person sitting in Jerusalem. Uh, I have no affiliation, nor with Google, nor anthropic, nor Gemini, nor any major tech vendor for that matter. Um, I like Gemini. Not so much as a everyday model. Um, it's kind of underwhelmed in that respect, I would say. 
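The back-of-the-envelope bill described above is plain arithmetic. A sketch with hypothetical numbers; the $0.006/minute figure matches OpenAI's published whisper-1 rate at the time of writing, but treat both inputs as assumptions:

```python
def monthly_stt_cost(minutes_per_day: float, price_per_minute: float, days: int = 30) -> float:
    """Back-of-the-envelope monthly cloud speech-to-text bill."""
    return minutes_per_day * days * price_per_minute

# ~400 spoken minutes a day, flat out, lands near the $70-80/month
# ceiling mentioned above; a realistic 30-60 min/day is an order of
# magnitude cheaper.
flat_out = monthly_stt_cost(400, 0.006)
realistic = monthly_stt_cost(60, 0.006)
```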
But for multimodal, I think it's got a lot to offer. And I think that the transcribing functionality whereby it can, um, process audio with a system prompt and both give you transcription that's cleaned up, that reduces two steps to one. And that for me is a very, very big deal. And, uh, I feel like even Google has haven't really sort of thought through how useful the that modality is and what kind of use cases you can achieve with it. Because I found in the course of this year just an endless list of really kind of system prompt, system prompt stuff that I can say, okay, I've used it to capture context data for AI, which is literally I might speak for if I wanted to have a good bank of context data about, who knows, my childhood. Uh, more realistically, maybe my career goals, uh, something that would just be, like, really boring to type out. So I'll just, like, sit in my car and record it for ten minutes. And that ten minutes, you get a lot of information in, um, emails, which is short text. Um, just there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards and different treatment. My context pipeline is kind of like just extract the bare essentials. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work, and it goes it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database. Daniel has worked in technology, Daniel is a has been working in, you know, stuff like that. That's not how you would speak. Um, but I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes. And this is actually a success because I wasted 20 minutes of my, uh, of the evening speaking into a microphone, and, uh, the levels were shot and, uh, it, uh, it was clipping and I said, I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. 
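The single-step "transcribe and clean up" modality described above can be sketched with the google-generativeai SDK. The model name, prompt wording, and environment variable are assumptions for illustration; the SDK import is deferred so the snippet loads without it installed:

```python
import os

# Hypothetical one-step prompt; not taken from this repo.
TRANSCRIBE_PROMPT = (
    "Transcribe this audio, then add punctuation and paragraph breaks. "
    "Return only the cleaned transcript."
)

def transcribe_with_gemini(audio_path: str, model_name: str = "gemini-1.5-flash") -> str:
    """One network call that both transcribes and applies the cleanup prompt."""
    import google.generativeai as genai  # pip install google-generativeai
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    audio = genai.upload_file(audio_path)
    model = genai.GenerativeModel(model_name)
    return model.generate_content([TRANSCRIBE_PROMPT, audio]).text
```

Collapsing transcription and post-processing into one call is the two-steps-to-one reduction mentioned above.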
Uh, what am I hoping to achieve in this? Okay, my fine tune was a dud, as mentioned Deepgram SVT. I'm really, really hopeful that this prototype will work. And it's a built in public open source, so anyone is welcome to use it if I make anything good. Um, but that was really exciting for me last night when after hours of, um, trying my own prototype, seeing someone just made something that works like that. You know, you're not going to have to build a custom conda environment and image. I have AMD GPU, which makes things much more complicated. I didn't find it and I was about to give up and I said, all right, let me just give deep grams Linux thing a shot. And if this doesn't work, um, I'm just going to go back to trying to code something myself. And when I ran the script, I was using cloud code to do the installation process. It ran the script and oh my gosh, it works just like that. Uh, the tricky thing for all those who wants to know all the nitty gritty, nitty gritty details, um, was that I don't think it was actually struggling with transcription, but pasting Wayland makes life very hard, and I think there was something not running in the right time anyway. Deepgram I looked at how they actually handle that because it worked out of the box when other stuff didn't, and it was quite a clever little mechanism, and but more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20 minute audio sample, and I'm I think I've done 1 or 2 of these before, but I did it with short, snappy voice notes. This is kind of long form. This actually might be a better approximation for what's useful to me than voice memos. Like I need to buy three liters of milk tomorrow, and pita bread, which is probably how like half my voice voice notes sound like if anyone were to, I don't know, like find my phone, they'd be like, this is the most boring person in the world. Although actually there are some like kind of, uh, journaling thoughts as well. 
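A Deepgram prerecorded transcription call, for reference, looks roughly like the sketch below: raw audio bytes POSTed to `/v1/listen` with a `Token` auth header. The model name and response-path are as I recall Deepgram's docs; verify both against the current API reference:

```python
import json
import os
import urllib.request

def listen_url(model: str = "nova-2") -> str:
    """Deepgram's prerecorded endpoint; the model name is an assumption."""
    return f"https://api.deepgram.com/v1/listen?model={model}&smart_format=true"

def transcribe_with_deepgram(audio_path: str, model: str = "nova-2") -> str:
    """POST the file's bytes and pull the first alternative's transcript."""
    with open(audio_path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        listen_url(model),
        data=data,
        headers={
            "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
            "Content-Type": "audio/mpeg",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["results"]["channels"][0]["alternatives"][0]["transcript"]
```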
But it's a lot of content like that. And the probably for the evaluation, the most useful thing is slightly obscure tech GitHub uh, hugging face not so
3
+ obscure that it's not going to have a chance of knowing it, but hopefully sufficiently well known that the model should get it. I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general, I've spoken, delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise, which in my first take that I had to get rid of, my wife came in with my son and for a good night kiss. And that actually would have been super helpful to get in because it was not diarised. Or if we had diarisation a female, I could say I want the male voice and that wasn't intended for transcription. Um, and we're not going to get background noise like people honking their horns, which is something I've done in my main data set where I am trying to go back to some of my voice notes, annotate them, and run a benchmark. But this is going to be just a pure quick test. And as someone I'm working on a voice note idea, that's my sort of end motivation. Besides thinking it's an absolutely outstanding technology that's coming to viability. And really, I know this sounds cheesy can actually have a very transformative effect. Um, it's, you know, voice technology has been life changing for, uh, folks living with, um, disabilities. And I think there's something really nice about the fact that it can also benefit, you know, folks who are able bodied and like, we can all in different ways, um, make this tech as useful as possible, regardless of the exact way that we're using it. Um, and I think there's something very powerful in that, and it can be very cool. Um, I see use potential. What excites me about voice tech? A lot of things, actually. 
Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this, um, and it's getting better and better with stuff like accent handling, um, I'm not sure my, my fine tune will actually ever come to fruition in the sense that I'll use it day to day, as I imagine I get like superb, flawless word error rates because I'm just kind of skeptical about local speech to texts, as I mentioned. And I think the pace of innovation and improvement in the models, the main reasons for fine tuning from what I've seen have been people who are something that really blows, blows my mind about ASR is the idea that it's inherently a lingual or multilingual phonetic based. So as folks who use speak very obscure languages that there may be there might be a paucity of training data or almost none at all, and therefore the accuracy is significantly reduced or folks in very critical environments. I know there are. This is used extensively in medical transcription and dispatcher work as, um, you know, the call centers who send out ambulances, etc., where accuracy is absolutely paramount. And in the case of doctors, radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things. And I'm not sure that really just for trying to make it better on a few random tech words with my slightly. I mean, I have an accent, but like, not, you know, an accent that a few other million people have. Ish. I'm not sure that my little fine tune is going to actually like the bump in word error rate reduction. If I ever actually figure out how to do it and get it up to the cloud by the time I've done that. I suspect that the next generation of ASR will just be so good that it will kind of be. Ah, well, that would be cool if it worked out, but I'll just use this instead. So that's going to be it for today's episode of, uh, voice training data. Single long shot evaluation. Who am I going to compare? 
Whisper is always good as a benchmark, but I'm more interested in seeing Whisperer head to head with two things, really. One is whisper variance. So you've got these projects like faster Whisper, Still whisper. It's a bit confusing. There's a whole bunch of them and the emerging acers, which are also a thing. My intention for this is I'm not sure I'm going to have the time in any point in the foreseeable future to go back through this whole episode and create a proper source, truth or a fix. Everything might do it if I can get one transcription that sufficiently close to perfection. But what I would actually love to do on Hugging Face I think would be a great. Probably how I might visualize this is having the audio waveform play, and then have the transcript for each model below it, and maybe even a, um, like, you know, two scale and maybe even a local one as well, like local whisper versus open AI API, Etc. and, um, I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best, uh, wer. But that would require the source of truth. Okay. That's it. Hope this was, I don't know, maybe useful for other folks interested in stuff you want to see. I always feel think I've just said something I didn't intend to say. I said for those, listen carefully. Including, hopefully, the models themselves. This has been myself, Daniel Rosehill, for more, um, jumbled repositories about my, uh, roving interest in AI, but particularly Agentic, MCP and voice tech. Uh, you can find me on GitHub. Hugging face. Where else? Daniel, which is my personal website, as well as this podcast whose name I sadly cannot remember. Until next time. Thanks for listening.
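Running one of the Whisper variants named above locally, for the head-to-head, can be sketched with the faster-whisper package. The model identifier is an assumption (faster-whisper publishes several, including distilled ones); the import is deferred so the snippet loads without the package:

```python
def transcribe_locally(audio_path: str, model_size: str = "distil-large-v3") -> str:
    """Transcribe a file with a locally run Whisper variant on CPU int8."""
    from faster_whisper import WhisperModel  # pip install faster-whisper
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    return " ".join(segment.text.strip() for segment in segments)
```

Feeding its output and each cloud model's output through the same WER computation against the ground truth gives the local-versus-API comparison described above.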
data/inference/runs/cloud-stt/manual-5/raw_response.json ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ {
2
+ "run_id": "run-4",
3
+ "provider": "openai",
4
+ "model": "whisper-1",
5
+
6
+ "text": "Hello and welcome to a audio dataset consisting of one single episode of a non-existent podcast. Or I may append this to a podcast that I set up recently regarding my, with my thoughts on speech tech and AI in particular, more AI and generative AI I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation as they might say for different speech attacks models. And I'm doing this because I thought I had made a great breakthrough in my journey with speech tech and that was succeeding in the elusive task of fine tuning whisper. And I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking. I might whisper something at some points as well. And I'll go back to speaking loud in different parts, I'm going to sound really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to put a speech attacks model through its paces, which is trying to make sense of, is this guy just rambling on incoherently in one long sentence or are these just actually a series of step standalone, step alone, standalone sentences and how is it going to handle step alone? That's not a word. What happens when you use speech attacks and you use a fake word and then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why was I trying to fine tune whisper and what is whisper? As I said, I'm going to try to record this at a couple of different levels of technicality for folks who are, you know, in the normal world and not totally stuck down the rabbit hole of AI, which I have to say is a really wonderful rabbit hole to be down. 
It's a really interesting area and speech and voice tech is the aspect of it that I find actually most, I'm not sure I would say the most interesting because there's just so much that is fascinating in AI, but the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work and I'm persevering hard with the task of trying to get a good solution working for Linux, which if anyone actually does listen to this, not just for the training data and for the actual content, this is sparked. I had, besides the fine tune not working, well, that was the failure. I used flawed code because one thinks these days that there is nothing short of solving, you know, the reason of life or something that flawed and agentic AI can't do, which is not really the case. It does seem that way sometimes, but it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data, basically speaking, just random things for three minutes. And it was actually kind of tedious because the texts were really weird. Some of them were, it was like it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't. I was so bored after 10 minutes that I was like, okay, no, I'm just going to have to find something else to read. So I used a, created with AI Studio, vibe coded a synthetic text generator, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note, like I'm recording an email, give me a short story to read, give me prose to read. So it came up with all these different things and they added a little timer to it so I could see how close I was to one hour. 
And I spent like an hour, one afternoon or probably two hours by the time you, you do retakes and whatever, because you want to, it gave me a source of truth, which I'm not sure if that's the scientific way to approach this topic of gathering training data, but I thought made sense. Um, I have a lot of audio data from recording voice notes, which I've also kind of used, uh, been experimenting with using for a different purpose, slightly different annotating task types. It's more text classification experiments or, uh, well, it's more than that actually working on a voice app. So it's a prototype, I guess, is really more accurate. Um, but you can do that and you can work backwards. You're like, you listen back to a voice note and you painfully go through one of those transcribing, you know, where you start and stop and scrub around it and you fix the errors, but it's really, really boring to do that. So I thought it would be less tedious in the longterm if I just recorded the source of truth. So it gave me these three minute snippets. I recorded them and saved an MP3 and a TXT in the same folder and I created an error that data. Uh, so I was very hopeful, quietly, you know, a little bit hopeful that I would be able that I could actually fine tune Whisper. Um, I want to fine tune Whisper because when I got into voice tech, uh, last November, uh, my wife was in the U S and I was alone at home. And, uh, you know, when crazy people like me do really wild things, like use voice to tech, uh, technology, that was basically, um, when I started doing it, I didn't feel like a crazy person speaking to myself. And my expectations weren't that high. Uh, I used speech tech now and again, um, tried it out. I was like, it'd be really cool if you could just like speak into your computer and whatever I tried out that had Linux support was just, it was not good basically. Um, and this blew me away from the first go. 
I mean, it wasn't 100% accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that, uh, pivot point that it's actually worth doing this. You know, there's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for a speech attacks to be a worthwhile addition to your productivity, but you do need to get above, let's say, I don't know, 85%. If it's 60% or 50%, you inevitably say, screw it. I'll just type it because you end up missing, um, errors in the transcript and it becomes actually worse. You end up in a worse position than you started with it. That's been my experience. So, um, I was like, oh, this is actually really, really good. Now, how did that happen? And the answer is ASR, Whisper being open sourced and the transformer architecture. If you want to go back to the, uh, to the underpinnings, which really blows my mind and it's on my list to read through that paper. All you need is attention as attentively as can be done with my limited brain, because it's super, super high level stuff. Um, super advanced stuff, I mean, uh, but that I think of all the things that are fascinating about the sudden rise and AI and the dramatic capabilities, I find it fascinating. A few people are like, hang on, you've got this thing that can speak to you like a chat bot, an LLM, and then you've got image generation. Okay. So firstly, those two things on the surface have nothing in common. Um, so like, how are they, how did that just happen all at the same time? And then when you extend that further, um, you're like Suno, right? You can sing a song and AI will like come up with an instrumental and then you've got Whisper and you're like, wait a second, how did all this stuff, like if it's all AI, what's like, there has to be some commonality. Otherwise these are four, these are totally different technologies on the surface of it. 
And, uh, the transformer architecture is as far as I know, the answer. And I can't even say, can't even pretend that I really understand what the transformer architecture means in depth, but I have scanned this. And as I said, I want to print it and really kind of think over it at some point. And I'll probably feel bad about myself, I think, because weren't those guys in their twenties? Like, that's crazy. I think I asked Chad TPT once who were the, who wrote that paper and how old were they when it was published in ARCSYV. And I was expecting like, I don't know, what do you, what do you imagine? I personally imagine kind of like, you know, you have these breakthroughs during COVID and things like that, where like these kind of really obscure scientists who are like in their fifties and they've just kind of been laboring in labs and, uh, wearily and writing and publishing and kind of obscure academic publications. And they finally like hit a big or win a Nobel prize. And then their household, household names. Uh, so that was kind of what I had in mind. That was the mental image I'd formed of the birth of ARCSYV. Like I wasn't expecting 20 somethings in San Francisco though. I thought that was both very, very funny, very cool. And actually kind of inspiring. It's nice to think that people who, you know, just, you might put them in the kind of milieu or bubble or world that you are in or credibly in through, you know, the series of connections that are coming up with such literally world-changing, um, innovations. Uh, so that was, I thought, anyway, that's, that, that was cool. Okay. Voice training data. How are we doing? We're about 10 minutes and I'm still talking about voice technology. Um, so Whisper was brilliant and I was so excited that I was, my first instinct was to like guess it's like, Oh my gosh, I have to get like a really good microphone for this. 
So, um, I didn't go on a spending spree because I said, I'm going to have to just wait a month and see if I still use this. And it just kind of became, it's become really part of my daily routine. Like if I'm writing an email, I'll record a voice note and then I've developed. And it's nice to see that everyone is like developing the same things in parallel. Like that's kind of a weird thing to say, but when I look, I, I kind of came when I started working on this, uh, these prototypes on GitHub, which is where I just kind of share very freely and loosely, uh, ideas and, you know, first iterations on, on concepts. Um, and for want of a better word, I called it like, uh, LLM post-processing or cleanup, or basically a system prompt that after you get back the raw text from Whisper, you run it through a model and say, okay, this is crappy, uh, text, like add sentence structure and, you know, fix it up. And, um, now when I'm exploring the different tools that are out there that people have built, I see, uh, quite a number of projects have basically, you know, done the same thing. Um, lest that be misconstrued. I'm not saying for a millisecond that I inspired them. I'm sure this has been a, a thing that's been, uh, integrated into tools for a while, but it's, it's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent, uh, because text that doesn't have any punctuation or paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that again, it's, it's, it, it moves speech tech into that before that inflection point where you're like, nah, it's just not worth it. It's like, it'll just be quicker to type this. So it's a big, it's a little touch that actually is a big deal. Uh, so I was on Whisper and I've been using Whisper and I kind of early on found a couple of tools. 
I couldn't find what I was looking for on Linux, which is, um, basically just something that'll run in the background. You'll give it an API key and it will just like transcribe, um, with like a little key to start and stop the dictation. Uh, and the issues were, I discovered that like most people involved in creating these projects were very much focused on local models, uh, running, running Whisper locally because you can, and I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech to text APIs and what I was spending, I just thought there's, it's actually, in my opinion, just one of the better deals in API spending and in cloud. Like it's just not that expensive for very, very good models that are much more, you know, you're going to be able to run the full model, the latest model versus whatever you can run on your average GPU, unless you want to buy a crazy GPU. It doesn't really make sense to me now. Privacy is another concern, um, that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment, maybe for regulatory reasons as well. Um, but I'm not in that, um, I'm neither really care about people listening to my, uh, grocery list, uh, consisting of, uh, reminding myself that I need to buy more beer, Cheetos and hummus, which is kind of the three, three staples of my diet, um, during, uh, periods of poor nutrition. Uh, but the kind of stuff that I transcribe, it's just not, it's not a, it's not a privacy thing I'm that sort of sensitive about. And, uh, I don't do anything so, you know, sensitive or secure that requires air gapping. So, um, I looked at the pricing and especially the kind of older models, mini, um, some of them are very, very affordable. 
And I did a back of the, I did a calculation once with chat GBT and I was like, okay, this is the, this is the API price for, I can't remember whatever the model was. Uh, let's say I just go at it like nonstop, which it rarely happens. Probably I would say on average, I might dictate 30 to 60 minutes per day. If I was probably summing up the emails, uh, documents, outlines, um, which is a lot, but it's still a fairly modest amount. And I was like, well, some days I do go on like one or two days where I've been usually when I'm like kind of out of the house and just have something like I've nothing else to do. Like if I'm at a hospital, we have a newborn, uh, and you're waiting for like eight hours and hours for an appointment. And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, oh wait, let me just get down. Let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say if I'm going to price out cloudSTT, if I was like dedicated every second of every waking hour to transcribing for some odd reason, um, I mean, it'd have to like eat and use the toilet. Like, you know, there's only so many hours I'm awake for. So like, let's just say a maximum of like 40 hours, 45 minutes in the hours. And I said, all right, let's just say 50. Who knows? You're dictating on the toilet. We do it. So you could just do 60, but whatever I did and every day, like you're going flat out seven days a week, dictating nonstop. I was like, what's my monthly API bill going to be at this price? And it came out to like 70 or 80 bucks. And I was like, well, that would be an extraordinary amount of dictation. And I would hope that there was some compelling reason worth more than $70 that I embarked upon that project. So given that that's kind of the max point for me, I said, that's actually very, very affordable. 
Now you're going to, if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable, that's going to cost them more as well. Unless you're using Gemini, which needless to say is a random person sitting in Jerusalem. I have no affiliation, nor with Google, nor Anthropic, nor Gemini, nor any major tech vendor for that matter. I like Gemini, not so much as a everyday model. It's kind of underwhelmed in that respect, I would say. But for multimodal, I think it's got a lot to offer. And I think that the transcribing functionality whereby it can process audio with a system prompt and both give you transcription, that's cleaned up. That reduces two steps to one. And that for me is a very, very big deal. And I feel like even Google hasn't really sort of thought through how useful that modality is and what kind of use cases you can achieve with it. Because I found in the course of this year, just an endless list of really kind of system prompt stuff that I can say, OK, I've used it to capture context data for AI, which is literally I might speak for if I wanted to have a good bank of context data about, who knows, my childhood. More realistically, maybe my career goals, something that would just be like really boring to type out. So I'll just like sit in my car and record it for 10 minutes and that 10 minutes, you get a lot of information in emails, which is short text. Just there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards and different treatment. My context pipeline is kind of like just extract the bare essentials. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work. And it goes, it condenses that down to very robotic language that is easy to chunk, parse and maybe put into a vector database. Daniel has worked in technology. Daniel is a has been working in, you know, stuff like that. 
That's not how you would speak. But I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes, and this is actually a success because I wasted 20 minutes of my of the evening speaking into a microphone and the levels were shot and it was clipping. And I said, I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. What am I hoping to achieve in this? OK, my fine tune was a dud, as mentioned, DeepGram SDT. I'm really, really hopeful that this prototype will work and it's a built in public open source. So anyone is welcome to use it if I make anything good. But that was really exciting for me last night when after hours of trying my own prototype, seeing someone just made something that works like that. You know, you're not going to have to build a custom Conda environment, an image. I have AMD GPU, which makes things much more complicated. I didn't find it. And I was about to give up and I said, all right, let me just give DeepGram's Linux thing a shot. And if it doesn't work, I'm just going to go back to trying to code something myself. And when I ran the script, I was using cloud code to do the installation process. It ran the script and oh, my gosh, it works just like that. The tricky thing. For all those wants to know all the nitty gritty details was that I don't think it was actually struggling with transcription, but pasting. Wayland makes life very hard. And I think there was something not running at the right time. Anyway, DeepGram, I looked at how they actually handle that because it worked out of the box when other stuff didn't. And it was quite a clever little mechanism. And but more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20 minute audio sample. And I'm I think I've done one or two of these before, but I did it with short, snappy voice notes. This is kind of long form. 
This actually might be a better approximation for what's useful to me than voice memos. Like, I need to buy three liters of milk tomorrow and pita bread, which is probably how like half my voice note voice note sound like if anyone were to, I don't know, like find my phone, they'd be like, this is the most boring person in the world. Although, actually, there are some like kind of journaling thoughts as well, but it's a lot of content like that. And the probably for the evaluation, the most useful thing is slightly obscure tech. GitHub, Nuclino, Hugging Face, not so obscure that it's not going to have a chance of knowing it, but hopefully sufficiently well known that the model should get it. I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general, I've spoken, delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise, which in my first take that I had to get rid of, my wife came in with my son and for a good night kiss. And that actually would have been super helpful to get in because it was non-diarized or if we had diarization, a female, I could say, I want the male voice and that wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I've done in my main data set where I am trying to go back to some of my voice notes, annotate them and run a benchmark. But this is going to be just a pure quick test. And as someone working on a voice note idea, that's my sort of end motivation besides thinking it's an absolutely outstanding technology that's coming to viability. And really, I know this sounds cheesy, can actually have a very transformative effect. It's, you know, voice technology has been life changing for folks living with disabilities. 
And I think there's something really nice about the fact that it can also benefit, you know, folks who are able-bodied and like we can all in different ways make this tech as useful as possible, regardless of the exact way that we're using it. And I think there's something very powerful in that and it can be very cool. I see huge potential. What excites me about voice tech? A lot of things, actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this. And it's getting better and better with stuff like accent handling. I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day, as I imagine. I get like superb, flawless words, error rates, because I'm just kind of skeptical about local speech to text, as I mentioned. And I think the pace of innovation and improvement in the models, the main reasons for fine tuning from what I've seen have been people who are something that really blows my mind about ASR is the idea that it's inherently alingual or multilingual, phonetic based. So as folks who speak very obscure languages, that there might be a paucity of training data or almost none at all, and therefore the accuracy is significantly reduced. Or folks in very critical environments. I know this is used extensively in medical transcription and dispatcher work as, you know, the call centers who send out ambulances, et cetera, where accuracy is absolutely paramount. And in the case of doctors, radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things. And I'm not sure that really just for trying to make it better on a few random tech words with my slightly, I mean, I have an accent, but like not, you know, an accent that a few other million people have it. 
I'm not sure that my little fine tune is going to actually, like the bump in word error reduction, if I ever actually figure out how to do it and get it up to the cloud, by the time I've done that, I suspect that the next generation of ASR will just be so good that it will kind of be, well, that would have been cool if it worked out, but I'll just use this instead. So that's going to be it for today's episode of voice training data. Single long shot evaluation. Who am I going to compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head to head with two things really. One is Whisper variants. So you've got these projects like Faster Whisper, Distilled Whisper. It's a bit confusing. There's a whole bunch of them. And the emerging ASRs, which are also a thing. My intention for this is I'm not sure I'm going to have the time in any point in the foreseeable future to go back through this whole episode and create a proper source truth where I fix everything. I might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face, I think would be a great, probably how I might visualize this, is having the audio waveform play and then have the transcript for each model below it. And maybe even a, like, you know, two scale and maybe even a local one as well. Like local Whisper versus OpenAI API, etc. And I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER. But that would require the source of truth. OK, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. You want to see that, I always feel, think I've just said something I didn't intend to. STT, I said for those. Listen carefully, including hopefully the models themselves. 
This has been myself, Daniel Rosehill. For more jumbled repositories about my roving interest in AI, but particularly agentic, MCP and voice tech, you can find me on GitHub, Hugging Face. Where else? Danielrosehill.com, which is my personal website, as well as this podcast, whose name I sadly cannot remember. Until next time, thanks for listening.
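The evaluation described in the transcript above boils down to computing a word error rate (WER) between each model's transcript and the source of truth. As a minimal sketch of that headline metric (standard-library Python only; the sample strings below are hypothetical and not drawn from this dataset):

```python
# Minimal word error rate (WER) sketch: word-level Levenshtein distance,
# as used when scoring a model transcript against a ground-truth transcript.
# Sample strings are illustrative only, not taken from this dataset.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # DP table: d[i][j] = edit cost of aligning ref[:i] with hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

truth = "i want to fine tune whisper"
model_output = "i want to fine tune whisperer"
print(round(wer(truth, model_output), 3))  # one substitution over six words
```

In practice a library such as `jiwer` (with its normalization pipeline) would be used for a real benchmark run, but the arithmetic is the same.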
7
+
data/inference/runs/cloud-stt/manual-5/transcript.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ Hello and welcome to a audio dataset consisting of one single episode of a non-existent podcast. Or I may append this to a podcast that I set up recently regarding my, with my thoughts on speech tech and AI in particular, more AI and generative AI I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation as they might say for different speech attacks models. And I'm doing this because I thought I had made a great breakthrough in my journey with speech tech and that was succeeding in the elusive task of fine tuning whisper. And I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking. I might whisper something at some points as well. And I'll go back to speaking loud in different parts, I'm going to sound really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to put a speech attacks model through its paces, which is trying to make sense of, is this guy just rambling on incoherently in one long sentence or are these just actually a series of step standalone, step alone, standalone sentences and how is it going to handle step alone? That's not a word. What happens when you use speech attacks and you use a fake word and then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why was I trying to fine tune whisper and what is whisper? As I said, I'm going to try to record this at a couple of different levels of technicality for folks who are, you know, in the normal world and not totally stuck down the rabbit hole of AI, which I have to say is a really wonderful rabbit hole to be down. 
It's a really interesting area and speech and voice tech is the aspect of it that I find actually most, I'm not sure I would say the most interesting because there's just so much that is fascinating in AI, but the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work and I'm persevering hard with the task of trying to get a good solution working for Linux, which if anyone actually does listen to this, not just for the training data and for the actual content, this is sparked. I had, besides the fine tune not working, well, that was the failure. I used flawed code because one thinks these days that there is nothing short of solving, you know, the reason of life or something that flawed and agentic AI can't do, which is not really the case. It does seem that way sometimes, but it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data, basically speaking, just random things for three minutes. And it was actually kind of tedious because the texts were really weird. Some of them were, it was like it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't. I was so bored after 10 minutes that I was like, okay, no, I'm just going to have to find something else to read. So I used a, created with AI Studio, vibe coded a synthetic text generator, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note, like I'm recording an email, give me a short story to read, give me prose to read. So it came up with all these different things and they added a little timer to it so I could see how close I was to one hour. 
And I spent like an hour, one afternoon or probably two hours by the time you, you do retakes and whatever, because you want to, it gave me a source of truth, which I'm not sure if that's the scientific way to approach this topic of gathering training data, but I thought made sense. Um, I have a lot of audio data from recording voice notes, which I've also kind of used, uh, been experimenting with using for a different purpose, slightly different annotating task types. It's more text classification experiments or, uh, well, it's more than that actually working on a voice app. So it's a prototype, I guess, is really more accurate. Um, but you can do that and you can work backwards. You're like, you listen back to a voice note and you painfully go through one of those transcribing, you know, where you start and stop and scrub around it and you fix the errors, but it's really, really boring to do that. So I thought it would be less tedious in the longterm if I just recorded the source of truth. So it gave me these three minute snippets. I recorded them and saved an MP3 and a TXT in the same folder and I created an error that data. Uh, so I was very hopeful, quietly, you know, a little bit hopeful that I would be able that I could actually fine tune Whisper. Um, I want to fine tune Whisper because when I got into voice tech, uh, last November, uh, my wife was in the U S and I was alone at home. And, uh, you know, when crazy people like me do really wild things, like use voice to tech, uh, technology, that was basically, um, when I started doing it, I didn't feel like a crazy person speaking to myself. And my expectations weren't that high. Uh, I used speech tech now and again, um, tried it out. I was like, it'd be really cool if you could just like speak into your computer and whatever I tried out that had Linux support was just, it was not good basically. Um, and this blew me away from the first go. 
I mean, it wasn't 100% accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that, uh, pivot point that it's actually worth doing this. You know, there's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for a speech attacks to be a worthwhile addition to your productivity, but you do need to get above, let's say, I don't know, 85%. If it's 60% or 50%, you inevitably say, screw it. I'll just type it because you end up missing, um, errors in the transcript and it becomes actually worse. You end up in a worse position than you started with it. That's been my experience. So, um, I was like, oh, this is actually really, really good. Now, how did that happen? And the answer is ASR, Whisper being open sourced and the transformer architecture. If you want to go back to the, uh, to the underpinnings, which really blows my mind and it's on my list to read through that paper. All you need is attention as attentively as can be done with my limited brain, because it's super, super high level stuff. Um, super advanced stuff, I mean, uh, but that I think of all the things that are fascinating about the sudden rise and AI and the dramatic capabilities, I find it fascinating. A few people are like, hang on, you've got this thing that can speak to you like a chat bot, an LLM, and then you've got image generation. Okay. So firstly, those two things on the surface have nothing in common. Um, so like, how are they, how did that just happen all at the same time? And then when you extend that further, um, you're like Suno, right? You can sing a song and AI will like come up with an instrumental and then you've got Whisper and you're like, wait a second, how did all this stuff, like if it's all AI, what's like, there has to be some commonality. Otherwise these are four, these are totally different technologies on the surface of it. 
And, uh, the transformer architecture is as far as I know, the answer. And I can't even say, can't even pretend that I really understand what the transformer architecture means in depth, but I have scanned this. And as I said, I want to print it and really kind of think over it at some point. And I'll probably feel bad about myself, I think, because weren't those guys in their twenties? Like, that's crazy. I think I asked Chad TPT once who were the, who wrote that paper and how old were they when it was published in ARCSYV. And I was expecting like, I don't know, what do you, what do you imagine? I personally imagine kind of like, you know, you have these breakthroughs during COVID and things like that, where like these kind of really obscure scientists who are like in their fifties and they've just kind of been laboring in labs and, uh, wearily and writing and publishing and kind of obscure academic publications. And they finally like hit a big or win a Nobel prize. And then their household, household names. Uh, so that was kind of what I had in mind. That was the mental image I'd formed of the birth of ARCSYV. Like I wasn't expecting 20 somethings in San Francisco though. I thought that was both very, very funny, very cool. And actually kind of inspiring. It's nice to think that people who, you know, just, you might put them in the kind of milieu or bubble or world that you are in or credibly in through, you know, the series of connections that are coming up with such literally world-changing, um, innovations. Uh, so that was, I thought, anyway, that's, that, that was cool. Okay. Voice training data. How are we doing? We're about 10 minutes and I'm still talking about voice technology. Um, so Whisper was brilliant and I was so excited that I was, my first instinct was to like guess it's like, Oh my gosh, I have to get like a really good microphone for this. 
So, um, I didn't go on a spending spree because I said, I'm going to have to just wait a month and see if I still use this. And it just kind of became, it's become really part of my daily routine. Like if I'm writing an email, I'll record a voice note and then I've developed. And it's nice to see that everyone is like developing the same things in parallel. Like that's kind of a weird thing to say, but when I look, I, I kind of came when I started working on this, uh, these prototypes on GitHub, which is where I just kind of share very freely and loosely, uh, ideas and, you know, first iterations on, on concepts. Um, and for want of a better word, I called it like, uh, LLM post-processing or cleanup, or basically a system prompt that after you get back the raw text from Whisper, you run it through a model and say, okay, this is crappy, uh, text, like add sentence structure and, you know, fix it up. And, um, now when I'm exploring the different tools that are out there that people have built, I see, uh, quite a number of projects have basically, you know, done the same thing. Um, lest that be misconstrued. I'm not saying for a millisecond that I inspired them. I'm sure this has been a, a thing that's been, uh, integrated into tools for a while, but it's, it's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent, uh, because text that doesn't have any punctuation or paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that again, it's, it's, it, it moves speech tech into that before that inflection point where you're like, nah, it's just not worth it. It's like, it'll just be quicker to type this. So it's a big, it's a little touch that actually is a big deal. Uh, so I was on Whisper and I've been using Whisper and I kind of early on found a couple of tools. 
I couldn't find what I was looking for on Linux, which is, um, basically just something that'll run in the background. You'll give it an API key and it will just like transcribe, um, with like a little key to start and stop the dictation. Uh, and the issues were, I discovered that like most people involved in creating these projects were very much focused on local models, uh, running, running Whisper locally because you can, and I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech to text APIs and what I was spending, I just thought there's, it's actually, in my opinion, just one of the better deals in API spending and in cloud. Like it's just not that expensive for very, very good models that are much more, you know, you're going to be able to run the full model, the latest model versus whatever you can run on your average GPU, unless you want to buy a crazy GPU. It doesn't really make sense to me now. Privacy is another concern, um, that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment, maybe for regulatory reasons as well. Um, but I'm not in that, um, I'm neither really care about people listening to my, uh, grocery list, uh, consisting of, uh, reminding myself that I need to buy more beer, Cheetos and hummus, which is kind of the three, three staples of my diet, um, during, uh, periods of poor nutrition. Uh, but the kind of stuff that I transcribe, it's just not, it's not a, it's not a privacy thing I'm that sort of sensitive about. And, uh, I don't do anything so, you know, sensitive or secure that requires air gapping. So, um, I looked at the pricing and especially the kind of older models, mini, um, some of them are very, very affordable. 
And I did a back of the, I did a calculation once with chat GBT and I was like, okay, this is the, this is the API price for, I can't remember whatever the model was. Uh, let's say I just go at it like nonstop, which it rarely happens. Probably I would say on average, I might dictate 30 to 60 minutes per day. If I was probably summing up the emails, uh, documents, outlines, um, which is a lot, but it's still a fairly modest amount. And I was like, well, some days I do go on like one or two days where I've been usually when I'm like kind of out of the house and just have something like I've nothing else to do. Like if I'm at a hospital, we have a newborn, uh, and you're waiting for like eight hours and hours for an appointment. And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, oh wait, let me just get down. Let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say if I'm going to price out cloudSTT, if I was like dedicated every second of every waking hour to transcribing for some odd reason, um, I mean, it'd have to like eat and use the toilet. Like, you know, there's only so many hours I'm awake for. So like, let's just say a maximum of like 40 hours, 45 minutes in the hours. And I said, all right, let's just say 50. Who knows? You're dictating on the toilet. We do it. So you could just do 60, but whatever I did and every day, like you're going flat out seven days a week, dictating nonstop. I was like, what's my monthly API bill going to be at this price? And it came out to like 70 or 80 bucks. And I was like, well, that would be an extraordinary amount of dictation. And I would hope that there was some compelling reason worth more than $70 that I embarked upon that project. So given that that's kind of the max point for me, I said, that's actually very, very affordable. 
Now you're going to, if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable, that's going to cost them more as well. Unless you're using Gemini, which needless to say is a random person sitting in Jerusalem. I have no affiliation, nor with Google, nor Anthropic, nor Gemini, nor any major tech vendor for that matter. I like Gemini, not so much as a everyday model. It's kind of underwhelmed in that respect, I would say. But for multimodal, I think it's got a lot to offer. And I think that the transcribing functionality whereby it can process audio with a system prompt and both give you transcription, that's cleaned up. That reduces two steps to one. And that for me is a very, very big deal. And I feel like even Google hasn't really sort of thought through how useful that modality is and what kind of use cases you can achieve with it. Because I found in the course of this year, just an endless list of really kind of system prompt stuff that I can say, OK, I've used it to capture context data for AI, which is literally I might speak for if I wanted to have a good bank of context data about, who knows, my childhood. More realistically, maybe my career goals, something that would just be like really boring to type out. So I'll just like sit in my car and record it for 10 minutes and that 10 minutes, you get a lot of information in emails, which is short text. Just there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards and different treatment. My context pipeline is kind of like just extract the bare essentials. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work. And it goes, it condenses that down to very robotic language that is easy to chunk, parse and maybe put into a vector database. Daniel has worked in technology. Daniel is a has been working in, you know, stuff like that. 
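The "two steps to one" Gemini workflow described here — audio plus a system prompt in a single multimodal call — might look roughly like the sketch below. The model name, the `google-generativeai` calls, and the prompt wording are all my assumptions, not details from the episode; the SDK import is deferred so the sketch can be read and loaded without the package installed.

```python
# Hypothetical sketch: transcription and condensation in one multimodal call.
# Prompt wording and model name are illustrative assumptions.
CONTEXT_PIPELINE_PROMPT = (
    "Transcribe the attached audio, then condense it into short, factual, "
    "third-person statements that are easy to chunk, parse, and load into a "
    "vector database. Robotic phrasing is fine; do not invent facts."
)

def transcribe_and_condense(audio_path: str,
                            model_name: str = "gemini-1.5-flash") -> str:
    """One-step transcribe-plus-cleanup; requires the google-generativeai SDK."""
    # Imported lazily so the module loads even without the SDK installed.
    import google.generativeai as genai
    audio = genai.upload_file(audio_path)
    model = genai.GenerativeModel(model_name,
                                  system_instruction=CONTEXT_PIPELINE_PROMPT)
    return model.generate_content([audio, "Process this recording."]).text
```

The output of a prompt like this is exactly the terse "Daniel has worked in technology" style described just above.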
That's not how you would speak. But I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes, and this is actually a success because I wasted 20 minutes of my of the evening speaking into a microphone and the levels were shot and it was clipping. And I said, I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. What am I hoping to achieve in this? OK, my fine tune was a dud, as mentioned, DeepGram STT. I'm really, really hopeful that this prototype will work and it's a built in public open source. So anyone is welcome to use it if I make anything good. But that was really exciting for me last night when after hours of trying my own prototype, seeing someone just made something that works like that. You know, you're not going to have to build a custom Conda environment, an image. I have AMD GPU, which makes things much more complicated. I didn't find it. And I was about to give up and I said, all right, let me just give DeepGram's Linux thing a shot. And if it doesn't work, I'm just going to go back to trying to code something myself. And when I ran the script, I was using Claude Code to do the installation process. It ran the script and oh, my gosh, it works just like that. The tricky thing. For all those wants to know all the nitty gritty details was that I don't think it was actually struggling with transcription, but pasting. Wayland makes life very hard. And I think there was something not running at the right time. Anyway, DeepGram, I looked at how they actually handle that because it worked out of the box when other stuff didn't. And it was quite a clever little mechanism. And but more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20 minute audio sample. And I'm I think I've done one or two of these before, but I did it with short, snappy voice notes. This is kind of long form.
This actually might be a better approximation for what's useful to me than voice memos. Like, I need to buy three liters of milk tomorrow and pita bread, which is probably how like half my voice note voice note sound like if anyone were to, I don't know, like find my phone, they'd be like, this is the most boring person in the world. Although, actually, there are some like kind of journaling thoughts as well, but it's a lot of content like that. And the probably for the evaluation, the most useful thing is slightly obscure tech. GitHub, Nuclino, Hugging Face, not so obscure that it's not going to have a chance of knowing it, but hopefully sufficiently well known that the model should get it. I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general, I've spoken, delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise, which in my first take that I had to get rid of, my wife came in with my son and for a good night kiss. And that actually would have been super helpful to get in because it was non-diarized or if we had diarization, a female, I could say, I want the male voice and that wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I've done in my main data set where I am trying to go back to some of my voice notes, annotate them and run a benchmark. But this is going to be just a pure quick test. And as someone working on a voice note idea, that's my sort of end motivation besides thinking it's an absolutely outstanding technology that's coming to viability. And really, I know this sounds cheesy, can actually have a very transformative effect. It's, you know, voice technology has been life changing for folks living with disabilities. 
And I think there's something really nice about the fact that it can also benefit, you know, folks who are able-bodied and like we can all in different ways make this tech as useful as possible, regardless of the exact way that we're using it. And I think there's something very powerful in that and it can be very cool. I see huge potential. What excites me about voice tech? A lot of things, actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this. And it's getting better and better with stuff like accent handling. I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day, as I imagine. I get like superb, flawless words, error rates, because I'm just kind of skeptical about local speech to text, as I mentioned. And I think the pace of innovation and improvement in the models, the main reasons for fine tuning from what I've seen have been people who are something that really blows my mind about ASR is the idea that it's inherently alingual or multilingual, phonetic based. So as folks who speak very obscure languages, that there might be a paucity of training data or almost none at all, and therefore the accuracy is significantly reduced. Or folks in very critical environments. I know this is used extensively in medical transcription and dispatcher work as, you know, the call centers who send out ambulances, et cetera, where accuracy is absolutely paramount. And in the case of doctors, radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things. And I'm not sure that really just for trying to make it better on a few random tech words with my slightly, I mean, I have an accent, but like not, you know, an accent that a few other million people have it. 
I'm not sure that my little fine tune is going to actually, like the bump in word error reduction, if I ever actually figure out how to do it and get it up to the cloud, by the time I've done that, I suspect that the next generation of ASR will just be so good that it will kind of be, well, that would have been cool if it worked out, but I'll just use this instead. So that's going to be it for today's episode of voice training data. Single long shot evaluation. Who am I going to compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head to head with two things really. One is Whisper variants. So you've got these projects like Faster Whisper, Distilled Whisper. It's a bit confusing. There's a whole bunch of them. And the emerging ASRs, which are also a thing. My intention for this is I'm not sure I'm going to have the time in any point in the foreseeable future to go back through this whole episode and create a proper source truth where I fix everything. I might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face, I think would be a great, probably how I might visualize this, is having the audio waveform play and then have the transcript for each model below it. And maybe even a, like, you know, two scale and maybe even a local one as well. Like local Whisper versus OpenAI API, etc. And I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER. But that would require the source of truth. OK, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. You want to see that, I always feel, think I've just said something I didn't intend to. STT, I said for those. Listen carefully, including hopefully the models themselves. 
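The head-to-head comparison described here — Whisper versus its variants versus the emerging ASRs — comes down to computing word error rate against a source of truth. As a reference point, a minimal WER implementation via word-level edit distance (real evaluations usually normalize case and punctuation first, which this sketch skips):

```python
# Minimal word error rate: word-level Levenshtein distance divided by the
# number of reference words. Reference implementation, not a library's.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("fine tuning whisper", "fine tuning whisper"))  # 0.0
print(wer("speech to text", "speech attacks"))  # 2 edits / 3 ref words ~ 0.67
```

Given a corrected source of truth for this episode, running each model's transcript through a function like this yields the "headline finding" of which model had the best WER.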
This has been myself, Daniel Rosehill. For more jumbled repositories about my roving interest in AI, but particularly agentic, MCP and voice tech, you can find me on GitHub, Hugging Face. Where else? Danielrosehill.com, which is my personal website, as well as this podcast, whose name I sadly cannot remember. Until next time, thanks for listening.
data/inference/runs/local-stt/run-1/transcript.srt ADDED
@@ -0,0 +1,1032 @@
+ 1
+ 00:00:00,000 --> 00:00:08,640
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast.
+
+ 2
+ 00:00:08,640 --> 00:00:19,120
+ Or, I may append this to a podcast that I set up recently regarding my with my thoughts on speech
+
+ 3
+ 00:00:19,120 --> 00:00:28,720
+ tech and AI in particular. More AI in generative AI I would say. But in any event, the purpose of this
+
+ 4
+ 00:00:30,080 --> 00:00:37,120
+ voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the
+
+ 5
+ 00:00:37,120 --> 00:00:42,320
+ envelope evaluation as they might say for different speech attacks models. And I'm doing this because I
+
+ 6
+ 00:00:42,800 --> 00:00:48,560
+ I thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in
+
+ 7
+ 00:00:48,560 --> 00:00:55,120
+ the elusive task of fine-tuning whisper. Whisper is, and I'm going to just talk, I'm trying to
+
+ 8
+ 00:00:55,760 --> 00:01:01,600
+ mix up, I'm going to try a few different styles of speaking. I might whisper something at some
+
+ 9
+ 00:01:01,600 --> 00:01:07,760
+ points as well. And I'll go back to speaking loud in different parts. I'm going to send really
+
+ 10
+ 00:01:07,760 --> 00:01:15,200
+ like a crazy person because I'm also going to try to speak at different pitches and cadences
+
+ 11
+ 00:01:15,200 --> 00:01:21,600
+ in order to really try to push a speech attacks model through its paces, which is trying to make
+
+ 12
+ 00:01:21,600 --> 00:01:30,320
+ sense of is this guy just rambling on and coherently in one long sentence or are these just actually
+
+ 13
+ 00:01:30,320 --> 00:01:38,320
+ series of step standalone standalone sentences? And how is it going to handle step alone? That's not a
+
+ 14
+ 00:01:38,320 --> 00:01:43,919
+ word. What happens when you use speech attacks and you use a fake word and then you're like, wait,
+
+ 15
+ 00:01:43,919 --> 00:01:51,520
+ that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the
+
+ 16
+ 00:01:52,880 --> 00:01:57,359
+ questions that I'm seeking to answer in this training data. Now, why did why was it trying to
+
+ 17
+ 00:01:57,360 --> 00:02:01,040
+ find China whisper? And what is whisper? As I said, I'm going to try to
+
+ 18
+ 00:02:02,080 --> 00:02:04,240
+ record this at a couple of different levels of
+
+ 19
+ 00:02:04,880 --> 00:02:10,320
+ technicality for folks who are in the normal world and not totally
+
+ 20
+ 00:02:11,360 --> 00:02:16,079
+ stuck down the rabbit hole of AI. What you have to say is a really wonderful rabbit hole to be
+
+ 21
+ 00:02:16,720 --> 00:02:23,440
+ to be done. It's a really interesting area and speech and voice tech is the aspect of it that
+
+ 22
+ 00:02:23,440 --> 00:02:28,880
+ I find actually most I'm not sure I would say the most interesting because there's just so much
+
+ 23
+ 00:02:28,880 --> 00:02:34,560
+ that is fascinating in AI. But the most that I find the most personally transformative in terms of
+
+ 24
+ 00:02:34,560 --> 00:02:42,240
+ the impact that it's had on my daily work life and productivity and how I sort of work. And
+
+ 25
+ 00:02:42,960 --> 00:02:49,920
+ I am persevering hard with the task of training, I guess, a good solution working for Linux.
+
+ 26
+ 00:02:49,920 --> 00:02:53,440
+ Would you have anyone actually does listen to this not just for the training data and for the
+
+ 27
+ 00:02:53,440 --> 00:03:00,399
+ actual content? This is this is sparked. I had, besides the fine tune not working, well that was
+
+ 28
+ 00:03:00,399 --> 00:03:07,679
+ the failure. I used plot code because one thing these days that there is nothing
+
+ 29
+ 00:03:08,560 --> 00:03:16,799
+ short of solving, you know, the reason of life or something that's flawed and
+
+ 30
+ 00:03:16,800 --> 00:03:22,720
+ agentically I can't do, which is not really the case. It does seem that way sometimes but it
+
+ 31
+ 00:03:22,720 --> 00:03:28,080
+ fails a lot as well. And this is one of those instances where last week I put together an hour
+
+ 32
+ 00:03:28,080 --> 00:03:33,600
+ of voice training data, basically speaking just random things for three minutes and
+
+ 33
+ 00:03:35,600 --> 00:03:40,160
+ it was actually kind of tedious because the text were really weird. Some of them were it was like,
+
+ 34
+ 00:03:40,160 --> 00:03:45,440
+ it was AI generated. I tried before to reach Sherlock Holmes for an hour and I just couldn't,
+
+ 35
+ 00:03:45,440 --> 00:03:51,120
+ I was so bored after 10 minutes that I was like, okay, knowing just gonna have to find something
+
+ 36
+ 00:03:51,120 --> 00:03:59,920
+ else to read. So I used a created with AI Studio, a vibe code is a synthetic text generator,
+
+ 37
+ 00:04:00,800 --> 00:04:05,680
+ which actually I thought was probably a better way of doing it because it would give me more
+
+ 38
+ 00:04:05,680 --> 00:04:12,000
+ short samples with more varied content. So I was like, okay, give me a voice note. Like I'm
+
+ 39
+ 00:04:12,000 --> 00:04:18,800
+ recording an email, give me a short story to read, give me pros to read. So I came up with all
+
+ 40
+ 00:04:18,800 --> 00:04:24,240
+ these different things and they added a little timer to it so I could see how close I was to one
+
+ 41
+ 00:04:24,240 --> 00:04:32,480
+ hour and I spent like an hour one afternoon or probably two hours by the time you do retakes
+
+ 42
+ 00:04:32,480 --> 00:04:39,120
+ and whatever because you want to, it gave me a source of truth which I'm not sure if that's the
+
+ 43
+ 00:04:39,120 --> 00:04:45,120
+ scientific way to approach this. Topic of gathering training data but I thought made sense.
+
+ 44
+ 00:04:46,560 --> 00:04:50,880
+ I have a lot of audio data from recording voice notes which I've also kind of used
+
+ 45
+ 00:04:52,000 --> 00:04:56,720
+ being experimenting with using for a different purpose. It's slightly different annotating
+
+ 46
+ 00:04:57,840 --> 00:05:03,680
+ task types. It's more text classification experiment or well it's more than that actually
+
+ 47
+ 00:05:03,680 --> 00:05:08,880
+ working on a voice app so it's a prototype I guess is really more accurate.
+
+ 48
+ 00:05:11,280 --> 00:05:15,920
+ But you can do that and you can work backwards. You listen back to a voice note and you
+
+ 49
+ 00:05:17,520 --> 00:05:22,400
+ painfully go through one of those transcribing where you start and stop and scrub around it and
+
+ 50
+ 00:05:22,400 --> 00:05:27,680
+ you fix the errors but it's really really pouring to do that. So I thought it would be last tedious
+
+ 51
+ 00:05:27,680 --> 00:05:34,240
+ in the long term if I just recorded this source of truth so it gave me these three minute snippets.
+
+ 52
+ 00:05:34,240 --> 00:05:40,480
+ I recorded them it saved in MP3 and the TXT in the same folder and I created an error that data.
+
+ 53
+ 00:05:41,840 --> 00:05:47,280
+ So I was very hopeful quite a little bit hopeful that I would be able that I could actually fine tune
+
+ 54
+ 00:05:47,280 --> 00:05:54,720
+ whisper. I want to fine tune whisper because when I got into voice tech last November my wife was in
+
+ 55
+ 00:05:54,720 --> 00:06:01,920
+ the US and I was alone at home and when crazy people like me do really wild things like use voice
+
+ 56
+ 00:06:01,920 --> 00:06:08,320
+ to tech technology that was basically when I started doing it I didn't feel like a crazy person
+
+ 57
+ 00:06:08,320 --> 00:06:15,760
+ speaking to myself and my expectations weren't that high. I used speech tech now and again
+
+ 58
+ 00:06:16,960 --> 00:06:21,200
+ try it out. It's like it'd be really cool if you could just like speak into your computer and
+
+ 59
+ 00:06:21,280 --> 00:06:28,479
+ whatever I tried out that had Linux support was just it was not good basically and this blew me away
+
+ 60
+ 00:06:28,479 --> 00:06:34,400
+ from the first go. I mean it wasn't 100% accurate out of the box and it took work but it was good
+
+ 61
+ 00:06:34,400 --> 00:06:40,320
+ enough that there was a solid foundation and it kind of passed that pivot point that it's actually
+
+ 62
+ 00:06:40,320 --> 00:06:46,320
+ worth doing this. There's a point where it's so like the transcript is you don't have to get 100%
+
+ 63
+ 00:06:46,400 --> 00:06:51,040
+ accuracy for it to be worth your time for it's speech attacks to be a worthwhile addition to your
+
+ 64
+ 00:06:51,040 --> 00:06:58,320
+ productivity but you do need to get above let's say I don't know 85% if it's 60% or 50% you inevitably
+
+ 65
+ 00:06:58,320 --> 00:07:03,920
+ say screw it I'll just type it because you end up missing errors in the transcript and it becomes
+
+ 66
+ 00:07:03,920 --> 00:07:07,840
+ actually worse you end up in a worse position than you started with it that's been my experience.
+
+ 67
+ 00:07:08,400 --> 00:07:14,400
+ So I was like oh this is actually really really good now how did that happen? The answer is
+
+ 68
+ 00:07:14,400 --> 00:07:21,599
+ ASR with per being open-sourced and the transformer architecture if you want to go back to the
+
+ 69
+ 00:07:23,200 --> 00:07:29,440
+ to the underpinnings which really blows my mind and it's on my list to read through that paper
+
+ 70
+ 00:07:30,239 --> 00:07:38,400
+ all you need is attention as attentively as can be done with my limited brain because it's super
+
+ 71
+ 00:07:38,960 --> 00:07:45,679
+ high-level stuff super advanced stuff I mean but that I think of all the things that
+
+ 72
+ 00:07:47,280 --> 00:07:54,080
+ are fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating
+
+ 73
+ 00:07:54,080 --> 00:07:59,599
+ a few people are like hang on you've got this thing that can speak to you like a chatbot and LLM
+
+ 74
+ 00:08:00,640 --> 00:08:06,799
+ then you've got image generation okay so firstly those two things on the surface have nothing
+
+ 75
+ 00:08:06,800 --> 00:08:12,560
+ in common so like how are they how did that just happen all at the same time and then when you
+
+ 76
+ 00:08:12,560 --> 00:08:19,920
+ extend that further you're like sooner right you can sing a song an AI will like come up with
+
+ 77
+ 00:08:19,920 --> 00:08:25,200
+ an instrumental and then you've got whisper and you're like wait a second how did all this stuff
+
+ 78
+ 00:08:25,200 --> 00:08:30,880
+ like if it's all AI what's like there has to be some commonality otherwise these are four these
+
+ 79
+ 00:08:31,600 --> 00:08:38,640
+ totally different technologies on the surface of it and the transformer architecture is as far as
+
+ 80
+ 00:08:38,640 --> 00:08:44,720
+ I know the answer and I can't even say I can't even pretend that I really understand what the
+
+ 81
+ 00:08:44,720 --> 00:08:51,200
+ transformer architecture means in depth but I have scandis and as I said I want to print it and
+
+ 82
+ 00:08:51,200 --> 00:08:57,760
+ really kind of think over it's at some point and I'll probably feel bad about myself I think
+
+ 83
+ 00:08:57,760 --> 00:09:03,280
+ because weren't those guys in their in their 20s like that's crazy I think I asked chat Gbt
+
+ 84
+ 00:09:03,280 --> 00:09:09,439
+ once who were the who wrote that paper and how old were they when it was published in ARC
+
+ 85
+ 00:09:09,439 --> 00:09:14,640
+ and I was expecting like I don't know what do you what do you imagine I personally imagine kind of
+
+ 86
+ 00:09:14,640 --> 00:09:19,840
+ like you know you have these breakthroughs during COVID and things like that were like these kind
+
+ 87
+ 00:09:19,840 --> 00:09:24,480
+ of really obscure scientists who are like in their 50s and they've just kind of been laboring
+
+ 88
+ 00:09:24,640 --> 00:09:31,120
+ labs and we're really in writing and publishing and kind of obscure academic publications and they
+
+ 89
+ 00:09:31,120 --> 00:09:37,200
+ finally like hit a big or win a Nobel Prize and then their household household names so I that
+
+ 90
+ 00:09:37,200 --> 00:09:42,680
+ was kind of what I had in mind that was the mental image I'd formed of the birth of ARC
+
+ 91
+ 00:09:42,680 --> 00:09:47,760
+ like I wasn't expecting 20 somethings in San Francisco though I thought that was both very very
+
+ 92
+ 00:09:47,760 --> 00:09:54,160
+ funny very cool and actually kind of inspiring it's nice to think that people who you know just
+
+ 93
+ 00:09:54,160 --> 00:10:01,439
+ you might put them in the kind of milieu or bubble or world that you are in are credibly in through
+
+ 94
+ 00:10:01,439 --> 00:10:06,079
+ you know the series of connections that are coming up with such literally world changing
+
+ 95
+ 00:10:06,880 --> 00:10:13,439
+ innovations so that was I thought anyway that that that was cool okay voice training data how
+
+ 96
+ 00:10:13,439 --> 00:10:19,280
+ were we doing we're about 10 minutes and I'm still talking about voice technology so whisper was
+
+ 97
+ 00:10:19,280 --> 00:10:25,680
+ brilliant and I was so excited that I was my first instinct was to like guess it's like oh my gosh
+
+ 98
+ 00:10:25,680 --> 00:10:31,040
+ I have to get like a really good microphone for this so I didn't go on a spending spree because
+
+ 99
+ 00:10:31,040 --> 00:10:37,760
+ I said I'm gonna have to just wait a month and see if I still use this and it just kind of became
+
+ 100
+ 00:10:37,760 --> 00:10:44,800
+ it's become really part of my daily routine like if I'm writing an email I'll record a voice note
+
+ 101
+ 00:10:44,880 --> 00:10:50,079
+ and then I've developed and it's nice to see that everyone is like developing the same things in
+
+ 102
+ 00:10:50,079 --> 00:10:56,319
+ parallel like that's my kind of a weird thing to say but when I look I kind of came when I started
+
+ 103
+ 00:10:56,319 --> 00:11:02,640
+ working on this these prototypes on GitHub which is where I just kind of share very freely and loosely
+
+ 104
+ 00:11:03,199 --> 00:11:10,800
+ ideas and you know first iterations on concepts and for one of a better word I called it like
+
+ 105
+ 00:11:11,439 --> 00:11:17,680
+ LLM post processing or cleanup or basically a system prompt that after you get back the raw text
+
+ 106
+ 00:11:17,680 --> 00:11:25,920
+ from whisper you run it through model and say okay this is crappy text like add sentence structure
+
+ 107
+ 00:11:25,920 --> 00:11:33,199
+ and you know fix it up and now when I'm exploring the different tools that are out there the people
+
+ 108
+ 00:11:33,200 --> 00:11:39,040
+ have built I see quite a number of projects have basically you know done the same thing
+
+ 109
+ 00:11:40,640 --> 00:11:45,040
+ less that be misconstrued I'm not saying for a millisecond that I inspired them I'm sure this
+
+ 110
+ 00:11:45,040 --> 00:11:51,440
+ has been a thing that's been integrated into tools for a while but it's it's the kind of thing that
+
+ 111
+ 00:11:51,440 --> 00:11:57,520
+ when you start using these tools every day the need for it is almost instantly apparent because text
+
+ 112
+ 00:11:57,600 --> 00:12:03,520
+ that doesn't have any punctuation or progress basing takes a long time to you know it takes so
+
+ 113
+ 00:12:03,520 --> 00:12:10,079
+ long to get it into a presentable email that again it's it moves speech tech into that
+
+ 114
+ 00:12:11,280 --> 00:12:16,000
+ before that inflection point where you're like nah she's not worth it it's like it'll just be
+
+ 115
+ 00:12:16,000 --> 00:12:20,800
+ quicker to type this so it's it's a big it's a little touch that actually is a big deal
+
+ 116
+ 00:12:21,520 --> 00:12:28,319
+ so I was on whisper and I've been using whisper and I kind of early on find a couple of tools
+
+ 117
+ 00:12:28,319 --> 00:12:33,680
+ I couldn't find what I was looking for on Linux which is basically just something that'll run
+
+ 118
+ 00:12:34,800 --> 00:12:39,120
+ in the background you'll give it an API key and it'll just like transcribe
+
+ 119
+ 00:12:41,439 --> 00:12:47,359
+ with like a little key to start and start the dictation and the issues where I discovered that
+
+ 120
+ 00:12:47,440 --> 00:12:52,720
+ like most people involved in creating these projects were very much focused on local models
+
+ 121
+ 00:12:52,720 --> 00:12:58,400
+ and running whisper locally because you can and I tried that a bunch of times and just never
+
+ 122
+ 00:12:58,400 --> 00:13:03,920
+ got results that were as good as the cloud and when I began looking at the cost of the speech
+
+ 123
+ 00:13:03,920 --> 00:13:10,080
+ text API is what I was spending I just thought there is it's actually in my opinion just one of
+
+ 124
+ 00:13:10,080 --> 00:13:15,600
+ the better deals in API spending and in cloud like it's just not that expensive for very very good
+
+ 125
+ 00:13:15,600 --> 00:13:22,240
+ models that are much more you know you're going to be able to run the full model the latest model
+
+ 126
+ 00:13:22,240 --> 00:13:28,960
+ versus whatever you can run on your average GPU unless you want to buy a crazy GPU it doesn't
+
+ 127
+ 00:13:28,960 --> 00:13:34,000
+ really make sense to me and I privacy is another concern that I know is kind of like a very much
+
+ 128
+ 00:13:34,000 --> 00:13:38,720
+ a separate thing that people just don't want their voice data and their voice leaving their
+
+ 129
+ 00:13:38,720 --> 00:13:45,360
+ local environment maybe for regulatory reasons as well but I'm not in that I'm neither really
+
+ 130
+ 00:13:45,360 --> 00:13:51,440
+ care about people listening to my grocery list consisting of reminding myself that I need to buy
+
+ 131
+ 00:13:51,440 --> 00:13:58,240
+ more beer cheetos and hummus which is kind of the three three staples of my diet during periods of
+
+ 132
+ 00:13:58,240 --> 00:14:04,560
+ poor nutrition but the kind of stuff that I transcribe most it's just not it's not a it's not a
+
+ 133
+ 00:14:04,560 --> 00:14:12,640
+ privacy thing I'm that sort of sensitive about and I don't do anything so you know sensitive
+
+ 134
+ 00:14:12,640 --> 00:14:17,680
+ or secure that requires airgapping so I looked at the pricing and especially the kind of older
+
+ 135
+ 00:14:17,680 --> 00:14:24,400
+ models mini some of them are very very affordable and I did it back to the I did a calculation once
+
+ 136
+ 00:14:24,400 --> 00:14:30,239
+ with Chachi BT and I was like okay this is the this is the API price for I can't remember whatever
+
+ 137
+ 00:14:30,320 --> 00:14:37,040
+ the model was let's say I just go at it like nonstop which rarely happens probably I would say an
+
+ 138
+ 00:14:37,040 --> 00:14:45,200
+ average I might dictate 30 to 60 minutes per day if I was probably summing up the emails documents
+
+ 139
+ 00:14:45,200 --> 00:14:51,360
+ outlines which is a lot but it's it's still a fairly modest amount and I was like well some days I
+
+ 140
+ 00:14:51,360 --> 00:14:56,720
+ do go on like one or two days where I've been usually when I'm like kind of out of the house and
+
+ 141
+ 00:14:56,720 --> 00:15:02,800
+ just have something like I've nothing else to do like if I'm at a hospital we have a newborn
+
+ 142
+ 00:15:04,000 --> 00:15:09,040
+ and you're waiting for like eight hours and hours for an appointment and I would probably have
+
+ 143
+ 00:15:09,040 --> 00:15:15,280
+ listened to podcasts before becoming a speech fanatic and I'm like oh wait let me just get down
+
+ 144
+ 00:15:15,280 --> 00:15:20,880
+ let me just get these ideas out of my head and that's when I'll go on my speech spinges but those
+
+ 145
+ 00:15:20,880 --> 00:15:26,240
+ are like ones every few months like not frequently but I said okay let's just say if I'm gonna price
580
+
581
+ 146
582
+ 00:15:26,240 --> 00:15:35,440
583
+ out cloud sgt if I was like dedicated every second of every waking hour to transcribing for some
584
+
585
+ 147
586
+ 00:15:35,440 --> 00:15:41,600
587
+ odd reason I mean it have to like ease and use the toilet like you know there's only so many hours
588
+
589
+ 148
590
+ 00:15:41,600 --> 00:15:48,480
591
+ I'm awake for so like let's just say a maximum of like 40 hour 45 minutes in the hours and I said
592
+
593
+ 149
594
+ 00:15:48,480 --> 00:15:55,360
595
+ all right let's just say 50 who knows you're dictating on the toilet we do it so you could just do 60
596
+
597
+ 150
598
+ 00:15:55,440 --> 00:16:02,560
599
+ but whatever I did and every day like you're going flat out seven days a week dictating nonstop
600
+
601
+ 151
602
+ 00:16:02,560 --> 00:16:08,640
603
+ as like what's my monthly API bill gonna be at this price and it came out to like seven to your
604
+
605
+ 152
606
+ 00:16:08,640 --> 00:16:14,960
607
+ 80 bucks and I was like well that would be an extraordinary amount of dictation and I would hope
608
+
609
+ 153
610
+ 00:16:15,600 --> 00:16:21,680
611
+ that there was some compelling reason more worth more than 70 dollars that I embarked upon that
612
+
613
+ 154
614
+ 00:16:22,640 --> 00:16:26,959
615
+ so given that that's kind of the max point for me I said that's actually very very affordable
616
+
617
+ 155
618
+ 00:16:27,920 --> 00:16:32,640
619
+ now you're gonna if you want to spec out the costs and you want to do the post processing
620
+
621
+ 156
622
+ 00:16:33,599 --> 00:16:39,199
623
+ that I really do feel as valuable that's gonna cost more as well on a less you're using
624
+
625
+ 157
626
+ 00:16:40,160 --> 00:16:47,839
627
+ Gemini which needless to say is a random person sitting in Jerusalem I have no affiliation nor with
628
+
629
+ 158
630
+ 00:16:47,840 --> 00:16:54,800
631
+ Google nor anthropic nor Gemini nor any major tech vendor for that matter um I like Gemini
632
+
633
+ 159
634
+ 00:16:54,800 --> 00:17:00,080
635
+ not so much as a everyday model um it's kind of underwhelmed in that respect I would say
636
+
637
+ 160
638
+ 00:17:00,080 --> 00:17:05,920
639
+ but for multimodal I think it's got a lot to offer and I think that the transcribing functionality
640
+
641
+ 161
642
+ 00:17:05,920 --> 00:17:13,280
643
+ whereby it can um process audio with the system prompt and both give you a transgression that's
644
+
645
+ 162
646
+ 00:17:13,280 --> 00:17:20,079
647
+ cleaned up that reduces two steps to one and that for me is a very very big deal and uh I feel like
648
+
649
+ 163
650
+ 00:17:20,079 --> 00:17:27,280
651
+ even Google hasn't really sort of thought through how useful the that modality is more kind of
652
+
653
+ 164
654
+ 00:17:27,280 --> 00:17:33,280
655
+ use cases uh you can achieve with it because I found in the course of this year just an endless
656
+
657
+ 165
658
+ 00:17:33,280 --> 00:17:40,399
659
+ list of really kind of system prompt system prompt stuff that I can say okay I've used it
660
+
661
+ 166
662
+ 00:17:40,560 --> 00:17:45,920
663
+ for a capture context data for AI which is literally I might speak for if I wanted to have a good
664
+
665
+ 167
666
+ 00:17:45,920 --> 00:17:52,560
667
+ bank of context data about who knows my childhood uh more realistically maybe my career goals
668
+
669
+ 168
670
+ 00:17:53,520 --> 00:17:59,520
671
+ something that would just be like really boring to type out so I'll just like sit in my car
672
+
673
+ 169
674
+ 00:17:59,520 --> 00:18:06,640
675
+ and record it for 10 minutes and that's 10 minutes you get a lot of information in um emails which is
676
+
677
+ 170
678
+ 00:18:06,640 --> 00:18:13,200
679
+ short text uh just there is a whole bunch and all these workflows kind of require a little bit
680
+
681
+ 171
682
+ 00:18:13,200 --> 00:18:18,320
683
+ of treatment afterwards and different treatment my context pipeline is kind of like just extract the
684
+
685
+ 172
686
+ 00:18:18,320 --> 00:18:23,520
687
+ bare essential so you end up with me talking very loosely about sort of what I've done in my career
688
+
689
+ 173
690
+ 00:18:23,520 --> 00:18:30,000
691
+ where I've worked where my light to work and it goes it condenses that down to very robotic language
692
+
693
+ 174
694
+ 00:18:30,000 --> 00:18:36,000
695
+ that is easy to chunk parts and maybe put into a vector database Daniel has worked in technology
696
+
697
+ 175
698
+ 00:18:36,080 --> 00:18:42,400
699
+ Daniel is a has been working in martino stuff like that that's not how you would speak um but I
700
+
701
+ 176
702
+ 00:18:42,400 --> 00:18:48,480
703
+ figure it's probably easier to parse for after all robots so we've almost got to 20 minutes and this
704
+
705
+ 177
706
+ 00:18:48,480 --> 00:18:56,880
707
+ is actually a success because I waste 20 minutes of my uh of the evening speaking into microphone and
708
+
709
+ 178
710
+ 00:18:56,880 --> 00:19:02,720
711
+ the levels were shot and it uh it was clipping and I said I can't read you an evaluation I have to
712
+
713
+ 179
714
+ 00:19:02,720 --> 00:19:09,440
715
+ be fair I have to give the models a chance to do their thing uh what am I hoping to achieve in this
716
+
717
+ 180
718
+ 00:19:09,440 --> 00:19:14,960
719
+ okay my fine shun was a dud as mentioned deep gram sdt I'm really really hopeful that this prototype
720
+
721
+ 181
722
+ 00:19:14,960 --> 00:19:20,560
723
+ will work and it's a built in public open source so anyone is welcome to use it if I make anything good
724
+
725
+ 182
726
+ 00:19:21,600 --> 00:19:28,000
727
+ but that was really exciting for me last night when after hours of um try my own prototype seeing
728
+
729
+ 183
730
+ 00:19:28,080 --> 00:19:33,120
731
+ someone just made something that works like that you know you're not going to have to build a custom
732
+
733
+ 184
734
+ 00:19:34,240 --> 00:19:40,960
735
+ condo environment and image I have AMD GPU which makes things much more complicated I didn't find it
736
+
737
+ 185
738
+ 00:19:41,840 --> 00:19:46,400
739
+ and I was about to give up and I said all right let me just give deep grams Linux thing a shot
740
+
741
+ 186
742
+ 00:19:47,040 --> 00:19:50,960
743
+ and if this doesn't work um I'm just going to go back to trying to vibe code something myself
744
+
745
+ 187
746
+ 00:19:51,600 --> 00:19:57,360
747
+ and when I ran the script I was using cloud code to do the installation process
748
+
749
+ 188
750
+ 00:19:58,160 --> 00:20:02,800
751
+ it ran the script and oh my gosh it works just like that uh the tricky thing
752
+
753
+ 189
754
+ 00:20:04,480 --> 00:20:12,480
755
+ for all those ones and all the nitty-ditty-ditty-gritty details um was that I don't think it was actually
756
+
757
+ 190
758
+ 00:20:12,480 --> 00:20:18,160
759
+ struggling with transcription but pasting wailant makes life very hard and I think there was
760
+
761
+ 191
762
+ 00:20:18,160 --> 00:20:22,800
763
+ something not running at the right time anyway deep gram I looked at how they actually handled
764
+
765
+ 192
766
+ 00:20:22,960 --> 00:20:28,960
767
+ that because it worked out in the box when other stuff didn't and it was quite a clever little mechanism
768
+
769
+ 193
770
+ 00:20:29,520 --> 00:20:34,560
771
+ and but more so than that the accuracy was brilliant now what am I doing here this is going to be a 20
772
+
773
+ 194
774
+ 00:20:34,560 --> 00:20:44,399
775
+ minute audio uh sample and I'm I think I've done one or two of these before but I did it with
776
+
777
+ 195
778
+ 00:20:45,360 --> 00:20:51,120
779
+ sure snappy voice notes this is kind of long form this actually might be a better approximation
780
+
781
+ 196
782
+ 00:20:51,120 --> 00:20:55,040
783
+ for what's useful to me than voice memos like I need to buy three
784
+
785
+ 197
786
+ 00:20:55,840 --> 00:20:59,840
787
+ beaters of moat tomorrow and peter bread which is probably how like half my voice note
788
+
789
+ 198
790
+ 00:20:59,840 --> 00:21:04,399
791
+ voice notes sound like if anyone were to I don't know like find my phone they'd be like this is
792
+
793
+ 199
794
+ 00:21:04,399 --> 00:21:09,280
795
+ the most boring person in the world although actually there are some like kind of uh journaling
796
+
797
+ 200
798
+ 00:21:09,280 --> 00:21:14,080
799
+ thoughts as well but it's a lot of content like that and the probably for the evaluation the most
800
+
801
+ 201
802
+ 00:21:14,080 --> 00:21:22,560
803
+ useful thing is slightly obscure tech github new cleano hugging face not so obscure that it's not
804
+
805
+ 202
806
+ 00:21:22,560 --> 00:21:27,360
807
+ going to have a chance of knowing it but hopefully sufficiently well known that the models should get
808
+
809
+ 203
810
+ 00:21:27,360 --> 00:21:32,800
811
+ it uh I tried to do a little bit of speaking really fast and speaking very slowly I would say in
812
+
813
+ 204
814
+ 00:21:32,800 --> 00:21:38,960
815
+ general I've spoken deliver this at a faster pace than I usually would go into strong coffee
816
+
817
+ 205
818
+ 00:21:39,120 --> 00:21:44,240
819
+ flowing through my bloodstream and the thing that I'm not going to get into spanish mark is
820
+
821
+ 206
822
+ 00:21:44,240 --> 00:21:49,920
823
+ background noise which is my first take that I had to get rid of my wife come in with my son
824
+
825
+ 207
826
+ 00:21:49,920 --> 00:21:55,680
827
+ and for a good night kiss and that actually would have been super helpful to get in because it was
828
+
829
+ 208
830
+ 00:21:56,400 --> 00:22:01,600
831
+ non-diarray sorry if we had diarrayization a female I could say I want the male voice and that
832
+
833
+ 209
834
+ 00:22:01,600 --> 00:22:06,240
835
+ wasn't intended for transcription um and we're not going to get background noise like people
836
+
837
+ 210
838
+ 00:22:06,240 --> 00:22:11,840
839
+ hunking their horns which is something I've done in my main data set where I am trying to go back
840
+
841
+ 211
842
+ 00:22:11,840 --> 00:22:16,880
843
+ to some of my voice notes annotate them and run a benchmark but this is going to be just a pure
844
+
845
+ 212
846
+ 00:22:17,680 --> 00:22:24,960
847
+ quick test and as someone I'm working on a voice note idea that's my sort of end
848
+
849
+ 213
850
+ 00:22:26,560 --> 00:22:30,320
851
+ motivation besides thinking it's an absolute outstanding technology that's coming to
852
+
853
+ 214
854
+ 00:22:30,960 --> 00:22:36,240
855
+ viability and really I know the same as cheesy can actually have a very transformative effect
856
+
857
+ 215
858
+ 00:22:37,120 --> 00:22:42,720
859
+ it's you know voice technology has been life changing for folks living with
860
+
861
+ 216
862
+ 00:22:44,000 --> 00:22:49,760
863
+ disabilities and I think there's something really nice about the fact that it can also benefit
864
+
865
+ 217
866
+ 00:22:50,480 --> 00:22:54,639
867
+ you know folks who are able bodies and like we can all in different ways
868
+
869
+ 218
870
+ 00:22:55,120 --> 00:23:02,560
871
+ um make this tech as useful as possible regardless of the exact way that we're using it um and I
872
+
873
+ 219
874
+ 00:23:02,560 --> 00:23:07,760
875
+ think there's something very powerful in that and it can be very cool um I see huge potential what
876
+
877
+ 220
878
+ 00:23:07,760 --> 00:23:14,480
879
+ excites me about voice tech a lot of things actually firstly the fact that it's cheap and accurate
880
+
881
+ 221
882
+ 00:23:14,480 --> 00:23:19,040
883
+ as I mentioned at the very start of this um and it's getting better and better with stuff like
884
+
885
+ 222
886
+ 00:23:19,040 --> 00:23:24,160
887
+ accent handling um I'm not sure my fight my fine tune will actually ever come to fruition in the
888
+
889
+ 223
890
+ 00:23:24,160 --> 00:23:30,240
891
+ sense that I'll use it day to day as I imagine and get likes you per flawless words error rates because
892
+
893
+ 224
894
+ 00:23:30,240 --> 00:23:37,680
895
+ I'm just kind of skeptical about local speech attacks as I mentioned and I think the pace of
896
+
897
+ 225
898
+ 00:23:37,680 --> 00:23:42,720
899
+ innovation and improvement in the models the main reasons for fine tuning from what I've seen
900
+
901
+ 226
902
+ 00:23:44,320 --> 00:23:50,480
903
+ have been people who are something that really blows blows my mind about ASR is the idea that it's
904
+
905
+ 227
906
+ 00:23:50,480 --> 00:24:00,080
907
+ inherently ailing you or multilingual phonetic based so as folks who use speak very obscure languages
908
+
909
+ 228
910
+ 00:24:00,080 --> 00:24:04,800
911
+ that there may be there there might be a positive training data or almost none at all and therefore
912
+
913
+ 229
914
+ 00:24:04,800 --> 00:24:11,440
915
+ the accuracy is significantly reduced or folks in very critical environments I know there are
916
+
917
+ 230
918
+ 00:24:11,440 --> 00:24:17,680
919
+ you this is used extensively in medical transcription and dispatch your work as um you know the call
920
+
921
+ 231
922
+ 00:24:17,680 --> 00:24:24,000
923
+ sentries who send out ambulances etc where accuracy is absolutely paramount and in the case of doctors
924
+
925
+ 232
926
+ 00:24:24,560 --> 00:24:29,680
927
+ radiologists they might be using very specialized vocab all the time so those are kind of the main
928
+
929
+ 233
930
+ 00:24:29,680 --> 00:24:35,680
931
+ two things and I'm not sure that really just for trying to make it better on a few random tech words
932
+
933
+ 234
934
+ 00:24:35,680 --> 00:24:41,840
935
+ with my slightly I mean I have an accent but like not you know an accent that a few other million
936
+
937
+ 235
938
+ 00:24:41,840 --> 00:24:50,720
939
+ people have ish I'm not sure that my little fine tune is going to actually like the bump in
940
+
941
+ 236
942
+ 00:24:50,720 --> 00:24:55,760
943
+ word error reduction if I ever actually figure out how to do it and get it up to the cloud by the
944
+
945
+ 237
946
+ 00:24:55,760 --> 00:25:00,879
947
+ time we've done that I suspect that the next generation of ASR will just be so good that it will
948
+
949
+ 238
950
+ 00:25:00,879 --> 00:25:07,040
951
+ kind of be well that would be cool for a dive but I'll just use this instead so that's going to be
952
+
953
+ 239
954
+ 00:25:07,280 --> 00:25:15,040
955
+ is for today's episodes of voice training data single long shot evaluation who am I going to
956
+
957
+ 240
958
+ 00:25:15,040 --> 00:25:21,200
959
+ compare whisper is always good as a benchmark but I'm more interested in seeing whisper head-to-head
960
+
961
+ 241
962
+ 00:25:21,200 --> 00:25:27,680
963
+ with two things really one is whisper variance so you've got these projects like faster whisper
964
+
965
+ 242
966
+ 00:25:29,120 --> 00:25:34,000
967
+ distilled whisper it's a bit confusing there's a whole bunch of them and the emerging ASRs which
968
+
969
+ 243
970
+ 00:25:34,160 --> 00:25:38,960
971
+ are also a thing my intention for this is I'm not sure I'm going to have the time in any point
972
+
973
+ 244
974
+ 00:25:38,960 --> 00:25:46,320
975
+ of the foreseeable future to go back to this whole episode and create a proper source truth or I fix
976
+
977
+ 245
978
+ 00:25:47,520 --> 00:25:53,760
979
+ everything might do it if I can get one transcriptions that's sufficiently close to perfection
980
+
981
+ 246
982
+ 00:25:54,960 --> 00:26:00,560
983
+ but what I would actually love to do on hogging face I think would be a great probably how I might
984
+
985
+ 247
986
+ 00:26:00,560 --> 00:26:08,080
987
+ visualize this is having the audio waveform play and then have the transcript for each model below
988
+
989
+ 248
990
+ 00:26:08,080 --> 00:26:16,320
991
+ it and maybe even a like you know two scale and maybe even a local one as well like local whisper
992
+
993
+ 249
994
+ 00:26:16,320 --> 00:26:23,919
995
+ versus open AI API etc and I can then actually listen back to segments or anyone who wants to
996
+
997
+ 250
998
+ 00:26:24,000 --> 00:26:30,000
999
+ can listen back to segments of this recording and see where a particular model to struggle
1000
+
1001
+ 251
1002
+ 00:26:30,000 --> 00:26:35,600
1003
+ with others didn't as well as the sort of headline finding of which had the best WER but that would
1004
+
1005
+ 252
1006
+ 00:26:35,600 --> 00:26:41,120
1007
+ require the source of truth okay that's it hope this was I don't know maybe useful for other
1008
+
1009
+ 253
1010
+ 00:26:41,120 --> 00:26:46,480
1011
+ folks interested in STT you want to see that I always feel think I've just said as something I
1012
+
1013
+ 254
1014
+ 00:26:46,480 --> 00:26:52,800
1015
+ didn't intend to STT I said for those isn't carefully including hopefully the models themselves
1016
+
1017
+ 255
1018
+ 00:26:53,280 --> 00:26:58,960
1019
+ this has been myself Daniel Rosal for more um jumbled repositories about my uh roving interests
1020
+
1021
+ 256
1022
+ 00:26:58,960 --> 00:27:06,639
1023
+ in AI but particularly agentic mcp and voice tech you can find me on github hogging face
1024
+
1025
+ 257
1026
+ 00:27:08,080 --> 00:27:14,000
1027
+ where else daniel rosal dot com which is my personal website as well as this podcast whose name
1028
+
1029
+ 258
1030
+ 00:27:14,000 --> 00:27:17,280
1031
+ I sadly cannot remember until next time thanks for listening
1032
+
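The back-of-the-envelope cost reasoning in the episode (30 to 60 dictated minutes on a typical day, a theoretical ceiling of roughly 45 transcribed minutes out of every waking hour, landing around $70-80/month) can be sketched in a few lines. The per-minute price and the 16-waking-hours figure below are illustrative assumptions, not numbers from the episode; substitute your provider's actual rate.

```python
# Back-of-the-envelope cloud STT cost model mirroring the episode's reasoning.
# PRICE_PER_MINUTE is an assumed rate for illustration only; the 16 waking
# hours/day figure is also an assumption.

PRICE_PER_MINUTE = 0.006  # USD per audio minute (hypothetical)

def monthly_cost(minutes_per_day: float, days: int = 30) -> float:
    """Dollars spent transcribing minutes_per_day, every day, for a month."""
    return minutes_per_day * days * PRICE_PER_MINUTE

# Typical usage described in the episode: 30-60 minutes of dictation per day.
typical_low = monthly_cost(30)   # 30 * 30 * 0.006 = 5.4
typical_high = monthly_cost(60)  # 60 * 30 * 0.006 = 10.8

# Absurd ceiling: ~45 dictated minutes of each of 16 waking hours, daily.
flat_out = monthly_cost(45 * 16)

print(f"typical month: ${typical_low:.2f}-${typical_high:.2f}")
print(f"flat-out ceiling: ${flat_out:.2f}")

# The episode's ~$75/month ceiling would imply a cheaper per-minute rate:
implied_rate = 75 / (45 * 16 * 30)  # dollars per minute
```

Even with the deliberately absurd ceiling, the bill stays in the tens of dollars, which is the point the episode is making about cloud STT being one of the better deals in API spending.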
data/inference/runs/local-stt/run-1/transcript.txt ADDED
@@ -0,0 +1,7 @@
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast. Or, I may append this to a podcast that I set up recently regarding my with my thoughts on speech tech and AI in particular. More AI in generative AI I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation as they might say for different speech attacks models. And I'm doing this because I I thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in the elusive task of fine-tuning whisper. Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking. I might whisper something at some points as well. And I'll go back to speaking loud in different parts. I'm going to send really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to push a speech attacks model through its paces, which is trying to make sense of is this guy just rambling on and coherently in one long sentence or are these just actually series of step standalone standalone sentences? And how is it going to handle step alone? That's not a word. What happens when you use speech attacks and you use a fake word and then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why did why was it trying to find China whisper? And what is whisper? As I said, I'm going to try to record this at a couple of different levels of technicality for folks who are in the normal world and not totally stuck down the rabbit hole of AI. What you have to say is a really wonderful rabbit hole to be to be done. 
It's a really interesting area and speech and voice tech is the aspect of it that I find actually most I'm not sure I would say the most interesting because there's just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I am persevering hard with the task of training, I guess, a good solution working for Linux. Would you have anyone actually does listen to this not just for the training data and for the actual content? This is this is sparked. I had, besides the fine tune not working, well that was the failure. I used plot code because one thing these days that there is nothing short of solving, you know, the reason of life or something that's flawed and agentically I can't do, which is not really the case. It does seem that way sometimes but it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data, basically speaking just random things for three minutes and
+
+ it was actually kind of tedious because the text were really weird. Some of them were it was like, it was AI generated. I tried before to reach Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was like, okay, knowing just gonna have to find something else to read. So I used a created with AI Studio, a vibe code is a synthetic text generator, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note. Like I'm recording an email, give me a short story to read, give me pros to read. So I came up with all these different things and they added a little timer to it so I could see how close I was to one hour and I spent like an hour one afternoon or probably two hours by the time you do retakes and whatever because you want to, it gave me a source of truth which I'm not sure if that's the scientific way to approach this. Topic of gathering training data but I thought made sense. I have a lot of audio data from recording voice notes which I've also kind of used being experimenting with using for a different purpose. It's slightly different annotating task types. It's more text classification experiment or well it's more than that actually working on a voice app so it's a prototype I guess is really more accurate.
+
+ But you can do that and you can work backwards. You listen back to a voice note and you painfully go through one of those transcribing where you start and stop and scrub around it and you fix the errors but it's really really pouring to do that. So I thought it would be last tedious in the long term if I just recorded this source of truth so it gave me these three minute snippets. I recorded them it saved in MP3 and the TXT in the same folder and I created an error that data. So I was very hopeful quite a little bit hopeful that I would be able that I could actually fine tune whisper. I want to fine tune whisper because when I got into voice tech last November my wife was in the US and I was alone at home and when crazy people like me do really wild things like use voice to tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high. I used speech tech now and again try it out. It's like it'd be really cool if you could just like speak into your computer and whatever I tried out that had Linux support was just it was not good basically and this blew me away from the first go. I mean it wasn't 100% accurate out of the box and it took work but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. There's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for it's speech attacks to be a worthwhile addition to your productivity but you do need to get above let's say I don't know 85% if it's 60% or 50% you inevitably say screw it I'll just type it because you end up missing errors in the transcript and it becomes actually worse you end up in a worse position than you started with it that's been my experience. So I was like oh this is actually really really good now how did that happen? 
The answer is ASR with per being open-sourced and the transformer architecture if you want to go back to the to the underpinnings which really blows my mind and it's on my list to read through that paper all you need is attention as attentively as can be done with my limited brain because it's super high-level stuff super advanced stuff I mean but that I think of all the things that are fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating a few people are like hang on you've got this thing that can speak to you like a chatbot and LLM then you've got image generation okay so firstly those two things on the surface have nothing in common so like how are they how did that just happen all at the same time and then when you extend that further you're like sooner right you can sing a song an AI will like come up with an instrumental and then you've got whisper and you're like wait a second how did all this stuff like if it's all AI what's like there has to be some commonality otherwise these are four these totally different technologies on the surface of it and the transformer architecture is as far as I know the answer and I can't even say I can't even pretend that I really understand what the transformer architecture means in depth but I have scandis and as I said I want to print it and really kind of think over it's at some point and I'll probably feel bad about myself I think because weren't those guys in their in their 20s like that's crazy I think I asked chat Gbt once who were the who wrote that paper and how old were they when it was published in ARC and I was expecting like I don't know what do you what do you imagine I personally imagine kind of like you know you have these breakthroughs during COVID and things like that were like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring labs and we're really in writing and publishing and kind of obscure academic publications 
and they finally like hit a big or win a Nobel Prize and then their household household names so I that was kind of what I had in mind that was the mental image I'd formed of the birth of ARC like I wasn't expecting 20 somethings in San Francisco though I thought that was both very very funny very cool and actually kind of inspiring it's nice to think that people who you know just you might put them in the kind of milieu or bubble or world that you are in are credibly in through you know the series of connections that are coming up with such literally world changing innovations so that was I thought anyway that that that was cool okay voice training data how were we doing we're about 10 minutes and I'm still talking about voice technology so whisper was brilliant and I was so excited that I was my first instinct was to like guess it's like oh my gosh I have to get like a really good microphone for this so I didn't go on a spending spree because I said I'm gonna have to just wait a month and see if I still use this and it just kind of became it's become really part of my daily routine like if I'm writing an email I'll record a voice note and then I've developed and it's nice to see that everyone is like developing the same things in parallel like that's my kind of a weird thing to say but when I look I kind of came when I started working on this these prototypes on GitHub which is where I just kind of share very freely and loosely ideas and you know first iterations on concepts and for one of a better word I called it like LLM post processing or cleanup or basically a system prompt that after you get back the raw text from whisper you run it through model and say okay this is crappy text like add sentence structure and you know fix it up and now when I'm exploring the different tools that are out there the people have built I see quite a number of projects have basically you know done the same thing less that be misconstrued I'm not saying for a millisecond that I 
inspired them I'm sure this has been a thing that's been integrated into tools for a while but it's it's the kind of thing that when you start using these tools every day the need for it is almost instantly apparent because text that doesn't have any punctuation or progress basing takes a long time to you know it takes so long to get it into a presentable email that again it's it moves speech tech into that before that inflection point where you're like nah she's not worth it it's like it'll just be quicker to type this so it's it's a big it's a little touch that actually is a big deal so I was on whisper and I've been using whisper and I kind of early on find a couple of tools I couldn't find what I was looking for on Linux which is basically just something that'll run in the background you'll give it an API key and it'll just like transcribe
+
+ with like a little key to start and start the dictation and the issues where I discovered that like most people involved in creating these projects were very much focused on local models and running whisper locally because you can and I tried that a bunch of times and just never got results that were as good as the cloud and when I began looking at the cost of the speech text API is what I was spending I just thought there is it's actually in my opinion just one of the better deals in API spending and in cloud like it's just not that expensive for very very good models that are much more you know you're going to be able to run the full model the latest model versus whatever you can run on your average GPU unless you want to buy a crazy GPU it doesn't really make sense to me and I privacy is another concern that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment maybe for regulatory reasons as well but I'm not in that I'm neither really care about people listening to my grocery list consisting of reminding myself that I need to buy more beer cheetos and hummus which is kind of the three three staples of my diet during periods of poor nutrition but the kind of stuff that I transcribe most it's just not it's not a it's not a privacy thing I'm that sort of sensitive about and I don't do anything so you know sensitive or secure that requires airgapping so I looked at the pricing and especially the kind of older models mini some of them are very very affordable and I did it back to the I did a calculation once with Chachi BT and I was like okay this is the this is the API price for I can't remember whatever the model was let's say I just go at it like nonstop which rarely happens probably I would say an average I might dictate 30 to 60 minutes per day if I was probably summing up the emails documents outlines which is a lot but it's it's still a fairly modest amount and I was 
like well some days I do go on like one or two days usually when I'm kind of out of the house and just have nothing else to do like if I'm at a hospital we have a newborn and you're waiting for like hours and hours for an appointment and I would probably have listened to podcasts before becoming a speech fanatic and I'm like oh wait let me just get these ideas out of my head and that's when I'll go on my speech binges but those are like once every few months like not frequently but I said okay let's just price out cloud STT if I was like dedicated every second of every waking hour to transcribing for some odd reason I mean I'd have to like eat and use the toilet like you know there's only so many hours I'm awake for so let's just say a maximum of like 40 or 45 minutes in the hour and I said all right let's just say 50 who knows if you're dictating on the toilet you could just do 60 but whatever I did and every day like you're going flat out seven days a week dictating nonstop as in what's my monthly API bill gonna be at this price and it came out to like seventy or 80 bucks and I was like well that would be an extraordinary amount of dictation and I would hope that there was some compelling reason worth more than 70 dollars that I embarked upon that so given that that's kind of the max point for me I said that's actually very very affordable now if you want to spec out the costs and you want to do the post processing that I really do feel is valuable that's gonna cost more as well unless you're using Gemini which needless to say as a random person sitting in Jerusalem I have no affiliation with Google nor Anthropic nor Gemini nor any major tech vendor for that matter um I like Gemini not so much as an everyday model um it's kind of underwhelming in that respect I would say but for multimodal I think it's got a lot to offer and I think 
that the transcribing functionality whereby it can um process audio with a system prompt and give you a transcription that's cleaned up reduces two steps to one and that for me is a very very big deal and uh I feel like even Google hasn't really sort of thought through how useful that modality is and how many kinds of use cases uh you can achieve with it because I found in the course of this year just an endless list of really kind of system prompt stuff where I can say okay I've used it to capture context data for AI which is literally I might speak for if I wanted to have a good bank of context data about who knows my childhood uh more realistically maybe my career goals something that would just be like really boring to type out so I'll just like sit in my car and record it for 10 minutes and that's 10 minutes you get a lot of information in um emails which is short text uh there is a whole bunch and all these workflows kind of require a little bit of treatment afterwards and different treatment my context pipeline is kind of like just extract the bare essentials so you end up with me talking very loosely about sort of what I've done in my career where I've worked where I'd like to work and it condenses that down to very robotic language that is easy to chunk and parse and maybe put into a vector database Daniel has worked in technology Daniel has been working in marketing stuff like that that's not how you would speak um but I figure it's probably easier to parse for after all robots so we've almost got to 20 minutes and this is actually a success because I wasted 20 minutes of my uh of the evening speaking into a microphone and the levels were shot and it uh it was clipping and I said I can't use that for an evaluation I have to be fair I have to give the models a chance to do their thing uh what am I hoping to achieve in this okay my fine-tune was a dud as mentioned Deepgram STT I'm really really hopeful that this prototype will 
work and it's built in public open source so anyone is welcome to use it if I make anything good but that was really exciting for me last night when after hours of um trying my own prototype seeing someone had just made something that works like that you know you're not going to have to build a custom conda environment and image I have an AMD GPU which makes things much more complicated I didn't find it and I was about to give up and I said all right let me just give Deepgram's Linux thing a shot and if this doesn't work um I'm just going to go back to trying to vibe code something myself and when I ran the script I was using Claude Code to do the installation process it ran the script and oh my gosh it works just like that uh the tricky thing for all those ones and all the nitty-gritty details um was that I don't think it was actually struggling with transcription but pasting on Wayland makes life very hard and I think there was something not running at the right time anyway Deepgram I looked at how they actually handled that because it worked out of the box when other stuff didn't and it was quite a clever little mechanism but more so than that the accuracy was brilliant now what am I doing here this is going to be a 20 minute audio uh sample and I think I've done one or two of these before but I did it with short snappy voice notes this is kind of long form this actually might be a better approximation for what's useful to me than voice memos like I need to buy three liters of milk tomorrow and pita bread which is probably how like half my voice notes sound like if anyone were to I don't know like find my phone they'd be like this is the most boring person in the world although actually there are some like kind of uh journaling thoughts as well but it's a lot of content like that and probably for the evaluation the most useful thing is slightly obscure tech GitHub Nuclino Hugging Face not so obscure that it's not going to have a 
chance of knowing it but hopefully sufficiently well known that the models should get it uh I tried to do a little bit of speaking really fast and speaking very slowly I would say in general I've delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream and the thing that I'm not going to get in this benchmark is background noise in my first take I had to get rid of my wife coming in with my son for a good night kiss and that actually would have been super helpful to get in because it was non-diarized sorry if we had diarization a female voice I could say I want the male voice and that wasn't intended for transcription um and we're not going to get background noise like people honking their horns which is something I've done in my main data set where I am trying to go back to some of my voice notes annotate them and run a benchmark but this is going to be just a pure quick test and as I said I'm working on a voice note idea that's my sort of end motivation besides thinking it's an absolutely outstanding technology that's coming to viability and really I know it sounds cheesy but it can actually have a very transformative effect it's you know voice technology has been life changing for folks living with disabilities and I think there's something really nice about the fact that it can also benefit you know folks who are able-bodied and like we can all in different ways um make this tech as useful as possible regardless of the exact way that we're using it um and I think there's something very powerful in that and it can be very cool um I see huge potential what excites me about voice tech a lot of things actually firstly the fact that it's cheap and accurate as I mentioned at the very start of this um and it's getting better and better with stuff like accent handling um I'm not sure my fine-tune will actually ever come to fruition in the sense that I'll use it day to day as I imagine and get like 
super flawless word error rates because I'm just kind of skeptical about local speech-to-text as I mentioned and I think the pace of innovation and improvement in the models the main reasons for fine tuning from what I've seen have been people who something that really blows my mind about ASR is the idea that it's inherently you know multilingual phonetic based so folks who speak very obscure languages where there might be a paucity of training data or almost none at all and therefore the accuracy is significantly reduced or folks in very critical environments I know this is used extensively in medical transcription and dispatcher work as um you know the call centers who send out ambulances etc where accuracy is absolutely paramount and in the case of doctors radiologists they might be using very specialized vocab all the time so those are kind of the main two things and I'm not sure that really just for trying to make it better on a few random tech words with my slightly I mean I have an accent but like not you know an accent that a few other million people have ish I'm not sure that my little fine-tune is going to actually deliver the bump in word error reduction if I ever actually figure out how to do it and get it up to the cloud by the time we've done that I suspect that the next generation of ASR will just be so good that it will kind of be well that would be cool for a day but I'll just use this instead so that's going to be it for today's episode of voice training data single long shot evaluation who am I going to compare whisper is always good as a benchmark but I'm more interested in seeing whisper head-to-head with two things really one is whisper variants so you've got these projects like faster-whisper distil-whisper it's a bit confusing there's a whole bunch of them and the emerging ASRs which are also a thing my intention for this is I'm not sure I'm going to have the time at any point in the 
foreseeable future to go back to this whole episode and create a proper source of truth where I fix everything might do it if I can get one transcription that's sufficiently close to perfection but what I would actually love to do on Hugging Face I think would be great probably how I might visualize this is having the audio waveform play and then have the transcript for each model below it and maybe even a like you know toggle and maybe even a local one as well like local whisper versus OpenAI API etc and I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled while others didn't as well as the sort of headline finding of which had the best WER but that would require the source of truth okay that's it hope this was I don't know maybe useful for other folks interested in STT you see there I always think I've just said something I didn't intend to STT I said for those listening carefully including hopefully the models themselves this has been myself Daniel Rosehill for more um jumbled repositories about my uh roving interests in AI but particularly agentic AI MCP and voice tech you can find me on GitHub Hugging Face where else danielrosehill.com which is my personal website as well as this podcast whose name I sadly cannot remember until next time thanks for listening
data/inference/runs/local-stt/run-2/whisper-tiny.srt ADDED
@@ -0,0 +1,1024 @@
1
+ 1
2
+ 00:00:00,000 --> 00:00:08,000
3
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast.
4
+
5
+ 2
6
+ 00:00:08,640 --> 00:00:16,000
7
+ Or, I may append this to a podcast that I set up recently regarding my
8
+
9
+ 3
10
+ 00:00:16,640 --> 00:00:26,000
11
+ with my thoughts on speech tech and AI in particular. More AI in generative AI, I would say.
12
+
13
+ 4
14
+ 00:00:26,720 --> 00:00:34,000
15
+ But in any event, the purpose of this voice recording is actually to create a lengthy voice
16
+
17
+ 5
18
+ 00:00:34,000 --> 00:00:39,840
19
+ sample for a quick evaluation of back of the envelope evaluation as they might say for
20
+
21
+ 6
22
+ 00:00:39,840 --> 00:00:44,240
23
+ different speech attacks models. And I'm doing this because I thought I had made a great breakthrough
24
+
25
+ 7
26
+ 00:00:44,240 --> 00:00:50,960
27
+ in my journey with speech tech and that was succeeding in the elusive task of fine-tuning whisper.
28
+
29
+ 8
30
+ 00:00:51,600 --> 00:00:58,800
31
+ Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different
32
+
33
+ 9
34
+ 00:00:59,360 --> 00:01:04,000
35
+ styles of speaking, I might whisper something at some points as well. And I'll go back to
36
+
37
+ 10
38
+ 00:01:04,000 --> 00:01:09,600
39
+ speaking loud in a different part. So I'm going to send really like a crazy person because I'm also
40
+
41
+ 11
42
+ 00:01:09,600 --> 00:01:17,520
43
+ going to try to speak at different pitches and cadences in order to really try to put
44
+
45
+ 12
46
+ 00:01:18,399 --> 00:01:23,280
47
+ a speech attacks model through its pieces, which is trying to make sense of, is this guy just
48
+
49
+ 13
50
+ 00:01:24,000 --> 00:01:30,880
51
+ ramling on and coherently in one long sentence or are these just actually a series of
52
+
53
+ 14
54
+ 00:01:32,960 --> 00:01:37,440
55
+ step of standalone, standalone, standalone sentences. And how is it going to handle
56
+
57
+ 15
58
+ 00:01:37,440 --> 00:01:43,280
59
+ step alone? That's not a word. What happens when you use speech attacks and you use a fake word.
60
+
61
+ 16
62
+ 00:01:43,280 --> 00:01:48,640
63
+ And then you're like, wait, that's not actually that word doesn't exist. How does AI handle that? And
64
+
65
+ 17
66
+ 00:01:49,520 --> 00:01:56,160
67
+ these and more are all the questions that I'm seeking to answer in this training data. Now,
68
+
69
+ 18
70
+ 00:01:56,160 --> 00:02:00,960
71
+ why was it trying to find you to whisper? And what is whisper, as I said, I'm going to try to
72
+
73
+ 19
74
+ 00:02:02,160 --> 00:02:08,240
75
+ record this at a couple of different levels of technicality for folks who are in the normal
76
+
77
+ 20
78
+ 00:02:09,120 --> 00:02:14,960
79
+ world and not totally stocked down the rabbit hole of AI. What you have to say is a really wonderful
80
+
81
+ 21
82
+ 00:02:14,960 --> 00:02:22,480
83
+ rabbit hole to be, to be done. It's a really interesting area and speech and voice attack is the
84
+
85
+ 22
86
+ 00:02:22,480 --> 00:02:28,000
87
+ aspect of it that I find actually most, I'm not sure I would say the most interesting because there's
88
+
89
+ 23
90
+ 00:02:28,000 --> 00:02:34,080
91
+ just so much that is fascinating in AI. But the most that I find the most personally transformative
92
+
93
+ 24
94
+ 00:02:34,160 --> 00:02:41,920
95
+ in terms of the impact that it's had on my daily work life and productivity and how I sort of work.
96
+
97
+ 25
98
+ 00:02:41,920 --> 00:02:49,920
99
+ And I'm persevering hard with the task of training, yes, a good solution working for Linux.
100
+
101
+ 26
102
+ 00:02:49,920 --> 00:02:53,760
103
+ Would you have anyone actually, does listen to this not just for the training data and for the actual
104
+
105
+ 27
106
+ 00:02:53,760 --> 00:03:00,960
107
+ content? This is this is sparked. I had, besides the fine tune not working. Well, that was the failure.
108
+
109
+ 28
110
+ 00:03:01,280 --> 00:03:09,920
111
+ I used Claude Codes because one thing's these days that there is nothing sort of solving
112
+
113
+ 29
114
+ 00:03:10,880 --> 00:03:18,960
115
+ you know, the reason of life or something at that's flawed and agentic AI can't do, which is not really
116
+
117
+ 30
118
+ 00:03:18,960 --> 00:03:24,320
119
+ the case. It does seem that way sometimes, but it fails a lot as well. And this is one of those
120
+
121
+ 31
122
+ 00:03:25,200 --> 00:03:29,760
123
+ instances where last week I put together an hour of voice training data.
124
+
125
+ 32
126
+ 00:03:30,399 --> 00:03:37,280
127
+ Basically speaking just random things for three minutes and it was actually kind of tedious because
128
+
129
+ 33
130
+ 00:03:37,280 --> 00:03:43,200
131
+ the texts were really weird. Some of them were it was like AI generated. I tried before
132
+
133
+ 34
134
+ 00:03:43,200 --> 00:03:48,640
135
+ to reach Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was
136
+
137
+ 35
138
+ 00:03:48,640 --> 00:03:56,720
139
+ okay knowing just gonna have to find something out to read. So I used a created with AI Studio
140
+
141
+ 36
142
+ 00:03:56,800 --> 00:04:03,600
143
+ vibe code as a synthetic text generator, which actually I thought was probably a better way of
144
+
145
+ 37
146
+ 00:04:03,600 --> 00:04:09,920
147
+ doing it because it would give me more short samples with more varied content. So I was like okay,
148
+
149
+ 38
150
+ 00:04:09,920 --> 00:04:16,160
151
+ give me a voice note like I'm recording an email, give me a short story to read, give me pros
152
+
153
+ 39
154
+ 00:04:17,440 --> 00:04:22,320
155
+ to read. So I came up with all these different things and they added a little timer to it so I
156
+
157
+ 40
158
+ 00:04:22,320 --> 00:04:29,040
159
+ could see how to let us say well as to one hour. And I spent like an hour, one afternoon or probably
160
+
161
+ 41
162
+ 00:04:29,040 --> 00:04:36,560
163
+ two hours by the time you do retakes on whatever because you want to, it gave me a source of truth
164
+
165
+ 42
166
+ 00:04:37,280 --> 00:04:43,440
167
+ which I'm not sure if that's the scientific way to approach this topic of gathering training data,
168
+
169
+ 43
170
+ 00:04:43,440 --> 00:04:50,159
171
+ but I thought made sense. I have a lot of audio data from recording voice notes which I've also
172
+
173
+ 44
174
+ 00:04:50,160 --> 00:04:56,160
175
+ kind of used being experimenting with using for a different purpose. It's slightly different
176
+
177
+ 45
178
+ 00:04:56,160 --> 00:05:03,680
179
+ annotating task types. It's more text classification experiment or well it's more than that actually
180
+
181
+ 46
182
+ 00:05:03,680 --> 00:05:12,480
183
+ I'm working on a voice app so it's a prototype I guess is really more accurate. But you can do that
184
+
185
+ 47
186
+ 00:05:12,480 --> 00:05:18,960
187
+ and you can work backwards. You listen back to a voice note and you painfully go through one of those
188
+
189
+ 48
190
+ 00:05:19,039 --> 00:05:24,080
191
+ transcribing where you start and stop and scrub around in a new fixie areas but it's really
192
+
193
+ 49
194
+ 00:05:24,080 --> 00:05:29,039
195
+ really pouring to do that. So I thought it would be less tedious in the long term if I just
196
+
197
+ 50
198
+ 00:05:29,599 --> 00:05:35,200
199
+ recorded this source of truth. So it gave me these three minutes snippets. I recorded them
200
+
201
+ 51
202
+ 00:05:35,200 --> 00:05:43,200
203
+ it saved an MP3 and a TXT in the same folder and I created an error that data. So I was very hopeful
204
+
205
+ 52
206
+ 00:05:43,520 --> 00:05:50,960
207
+ that I could actually find you in whisper. I want to find you in whisper because when I got
208
+
209
+ 53
210
+ 00:05:50,960 --> 00:05:59,280
211
+ into voice tech last November my wife was in the US and I was alone at home and when crazy
212
+
213
+ 54
214
+ 00:05:59,280 --> 00:06:05,840
215
+ people like me do really wild things like use voice tech technology that was basically when I
216
+
217
+ 55
218
+ 00:06:05,840 --> 00:06:11,520
219
+ started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't
220
+
221
+ 56
222
+ 00:06:11,599 --> 00:06:18,719
223
+ that high. I used speech tech now and again tried to write as like it would be really cool if you
224
+
225
+ 57
226
+ 00:06:18,719 --> 00:06:24,479
227
+ just like speak into your computer and whatever I tried I used that had Linux support was just
228
+
229
+ 58
230
+ 00:06:25,359 --> 00:06:30,719
231
+ it was not good basically and this blew me away from the first go. I mean it wasn't 100%
232
+
233
+ 59
234
+ 00:06:31,680 --> 00:06:36,240
235
+ accurate either the box and it took work but it was good enough that there was a solid foundation
236
+
237
+ 60
238
+ 00:06:36,240 --> 00:06:42,880
239
+ and it kind of passed that pivot point that it's actually worth doing this. There's a point where
240
+
241
+ 61
242
+ 00:06:42,880 --> 00:06:48,160
243
+ it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time
244
+
245
+ 62
246
+ 00:06:48,880 --> 00:06:52,960
247
+ for it's speech tech to be worth while it isn't your productivity but you do need to get above
248
+
249
+ 63
250
+ 00:06:52,960 --> 00:07:00,480
251
+ let's say 85%. If it's 60% or 50% you inevitably say screw it I'll just type it because you
252
+
253
+ 64
254
+ 00:07:00,480 --> 00:07:06,080
255
+ end up missing errors in the transcript and it becomes actually worse you end up in a worse position
256
+
257
+ 65
258
+ 00:07:06,080 --> 00:07:12,560
259
+ than you started with that's been my experience. So I was like oh this is actually really really good
260
+
261
+ 66
262
+ 00:07:12,560 --> 00:07:19,440
263
+ now how did that happen? The answer is ASR whisper being open sourced and the transformer
264
+
265
+ 67
266
+ 00:07:19,440 --> 00:07:26,160
267
+ architecture if you want to go back to the to the underpinnings which really blows my mind and it's
268
+
269
+ 68
270
+ 00:07:26,240 --> 00:07:37,280
271
+ on my list to reto that paper all you need is attention as attentively as can be done with my limited
272
+
273
+ 69
274
+ 00:07:37,280 --> 00:07:45,040
275
+ brain because it's super super high level stuff super advanced stuff I mean but that I think of all the
276
+
277
+ 70
278
+ 00:07:45,040 --> 00:07:53,680
279
+ things that are fascinating about the sudden rise and AI and the dramatic capabilities I find
280
+
281
+ 71
282
+ 00:07:53,680 --> 00:07:58,880
283
+ fascinating a few people are like hang on you've got this thing that can speak to you like a chatbot
284
+
285
+ 72
286
+ 00:07:58,880 --> 00:08:04,400
287
+ and LLM and then you've got image generation okay so first of all those two things
288
+
289
+ 73
290
+ 00:08:05,360 --> 00:08:12,240
291
+ on the surface have nothing in common so like how did that just happen all at the same time and
292
+
293
+ 74
294
+ 00:08:12,240 --> 00:08:19,200
295
+ then when you extend that further you're like Suno right you can sing a song an AI will like come
296
+
297
+ 75
298
+ 00:08:19,280 --> 00:08:24,880
299
+ up with an instrumental and then you've got whisper and you're like wait a second how did all this
300
+
301
+ 76
302
+ 00:08:24,880 --> 00:08:30,400
303
+ stuff like if it's all AI what's like there has to be some commonality otherwise these are
304
+
305
+ 77
306
+ 00:08:30,400 --> 00:08:37,439
307
+ for these are totally different technologies on the surface of it and the transformer architecture
308
+
309
+ 78
310
+ 00:08:37,439 --> 00:08:43,520
311
+ is as far as I know the answer and I can't even say you can't even pretend that I really understand
312
+
313
+ 79
314
+ 00:08:44,159 --> 00:08:50,160
315
+ what the transformer architecture means in depth but I have scandice and as I said once a
316
+
317
+ 80
318
+ 00:08:50,720 --> 00:08:57,360
319
+ printed I'm really kind of think over it's at some point and I'll probably feel bad about myself
320
+
321
+ 81
322
+ 00:08:57,360 --> 00:09:02,720
323
+ I think because when those guys in the in their 20s like that's crazy I think I asked
324
+
325
+ 82
326
+ 00:09:02,720 --> 00:09:09,199
327
+ you to be one who were the who wrote that paper and how old were they when it was published in
328
+
329
+ 83
330
+ 00:09:09,200 --> 00:09:14,800
331
+ arcs of I was expecting like I don't know what do you imagine I personally imagine kind of like
332
+
333
+ 84
334
+ 00:09:14,800 --> 00:09:20,400
335
+ you know you have these breakthroughs during Covid and things like that were like these kind of
336
+
337
+ 85
338
+ 00:09:20,400 --> 00:09:25,040
339
+ really obscure scientists who are like in their 50s and they've just kind of been laboring labs
340
+
341
+ 86
342
+ 00:09:25,040 --> 00:09:31,120
343
+ and we're really in writing and publishing and kind of obscure academic publications and they
344
+
345
+ 87
346
+ 00:09:31,120 --> 00:09:37,520
347
+ finally like hit a bake or when a noble apprise and then their household names so that was kind
348
+
349
+ 88
350
+ 00:09:37,600 --> 00:09:43,280
351
+ of what I had that was the mental image I'd formed of the birth of arcs of like I wasn't
352
+
353
+ 89
354
+ 00:09:43,280 --> 00:09:48,560
355
+ expecting 20 somethings in San Francisco though I thought that was both very very funny very cool
356
+
357
+ 90
358
+ 00:09:48,560 --> 00:09:55,600
359
+ and actually kind of inspiring it's nice to think that people who you know just you might put them
360
+
361
+ 91
362
+ 00:09:55,600 --> 00:10:02,079
363
+ in the kind of milieu or bubble or world that you are in or incredibly in through you know
364
+
365
+ 92
366
+ 00:10:02,080 --> 00:10:07,680
367
+ the series of connections that are coming up with such literally world changing innovations
368
+
369
+ 93
370
+ 00:10:07,680 --> 00:10:14,000
371
+ so that was I thought anyway that that that was cool okay voice training data how are we doing
372
+
373
+ 94
374
+ 00:10:14,000 --> 00:10:20,080
375
+ we're at by 10 minutes and I'm still talking about voice technology so whisper was brilliant and
376
+
377
+ 95
378
+ 00:10:20,880 --> 00:10:26,000
379
+ I was so excited that I was my first instinct was to like guess like oh my gosh I have to
380
+
381
+ 96
382
+ 00:10:26,000 --> 00:10:31,760
383
+ get like a really good microphone for this so I didn't go on a spending spree because I said
384
+
385
+ 97
386
+ 00:10:31,840 --> 00:10:37,840
387
+ I'm gonna have to just wait a month and see if I still use this and it just kind of became
388
+
389
+ 98
390
+ 00:10:37,840 --> 00:10:44,800
391
+ it's become really part of my daily routine like if I'm writing an email I'll record a voice note
392
+
393
+ 99
394
+ 00:10:44,800 --> 00:10:49,040
395
+ and then I've developed and it's nice to see that everyone is like developing the same
396
+
397
+ 100
398
+ 00:10:49,600 --> 00:10:55,040
399
+ things in parallel like that's my kind of a weird thing to say but when I look I kind of came
400
+
401
+ 101
402
+ 00:10:55,040 --> 00:11:01,200
403
+ when I started working on this these prototypes on GitHub which is where I just kind of share
404
+
405
+ 102
406
+ 00:11:01,200 --> 00:11:09,600
407
+ very freely and loosely ideas and you know first iterations on concepts and for one of
408
+
409
+ 103
410
+ 00:11:09,600 --> 00:11:15,760
411
+ a better word I called it like LLM post processing or cleanup or basically a system prompt that
412
+
413
+ 104
414
+ 00:11:15,760 --> 00:11:22,880
415
+ after you get back the raw text from whisper you run it through model and say okay this is crappy
416
+
417
+ 105
418
+ 00:11:23,600 --> 00:11:31,040
419
+ text like add sentence structure and you know fix it up and now when I'm exploring
420
+
421
+ 106
422
+ 00:11:31,040 --> 00:11:36,480
423
+ the different tools that are out there the people of built I see quite a number of projects have
424
+
425
+ 107
426
+ 00:11:37,280 --> 00:11:42,480
427
+ basically you know done the same thing last that we missed construit I'm not saying for a
428
+
429
+ 108
430
+ 00:11:42,480 --> 00:11:48,480
431
+ millisecond that I inspired them I'm sure this has been a thing that's been integrated into tools
432
+
433
+ 109
434
+ 00:11:48,480 --> 00:11:53,520
435
+ for a while but it's it's the kind of thing that when you start using these tools every day
436
+
437
+ 110
438
+ 00:11:53,520 --> 00:11:59,760
439
+ the need for it is almost instantly apparent because text that doesn't have any punctuation or
440
+
441
+ 111
442
+ 00:11:59,760 --> 00:12:05,360
443
+ paragraph spacing takes a long time to you know it takes so long to get it into a presentable email
444
+
445
+ 112
446
+ 00:12:05,360 --> 00:12:13,280
447
+ that again it's it moves speech tech into that before that inflection point we're like that's just
448
+
449
+ 113
450
+ 00:12:13,280 --> 00:12:19,040
451
+ not worth it it's like it's just be quicker to type this so it's it's a big it's a little touch that
452
+
453
+ 114
454
+ 00:12:19,040 --> 00:12:26,959
455
+ actually is a big deal so I was on whisper and I've been using whisper and I kind of early on
456
+
457
+ 115
458
+ 00:12:26,959 --> 00:12:32,640
459
+ find a couple of tools I couldn't find what I was looking for on Linux which is basically just
460
+
461
+ 116
462
+ 00:12:32,640 --> 00:12:39,120
463
+ something that'll run in the background you'll give it an API key and it'll just like transcribe
464
+
465
+ 117
466
+ 00:12:39,200 --> 00:12:47,120
467
+ with like a little key to start and start the dictation and the issues where I discovered
468
+
469
+ 118
470
+ 00:12:47,120 --> 00:12:52,800
471
+ that like most people involved in creating these projects were very much focused on local models
472
+
473
+ 119
474
+ 00:12:52,800 --> 00:12:58,800
475
+ running whisper locally because you can and I tried that a bunch of times and just never got
476
+
477
+ 120
478
+ 00:12:58,800 --> 00:13:04,160
479
+ results that were as good as the cloud and when I began looking at the cost of the speech tech
480
+
481
+ 121
482
+ 00:13:04,160 --> 00:13:10,160
483
+ to API is what I was spending I just thought there is it's actually in my opinion just one of the
484
+
485
+ 122
486
+ 00:13:10,160 --> 00:13:15,680
487
+ better deals in API spending and in cloud like it's just not that expensive for very very good
488
+
489
+ 123
490
+ 00:13:15,680 --> 00:13:22,240
491
+ models that are much more you know you're going to be able to run the full model the latest model
492
+
493
+ 124
494
+ 00:13:22,240 --> 00:13:29,199
495
+ versus whatever you can run on your average GPU unless you want to buy crazy GPU it doesn't really
496
+
497
+ 125
498
+ 00:13:29,280 --> 00:13:34,000
499
+ make sense to me and I've been a lot of things that I know is kind of like a very much
500
+
501
+ 126
502
+ 00:13:34,000 --> 00:13:39,040
503
+ just everything the people just don't want their voice data and their voice leaving their local
504
+
505
+ 127
506
+ 00:13:39,040 --> 00:13:45,920
507
+ environment maybe for regular few reasons as well but I'm not in that I'm neither really care
508
+
509
+ 128
510
+ 00:13:45,920 --> 00:13:51,680
511
+ about people listening to my grocery lists consisting of reminding myself that I need to buy more
512
+
513
+ 129
514
+ 00:13:51,680 --> 00:13:58,320
515
+ beer cheetos and hummus which is kind of the three three staples of my diet during periods of
516
+
517
+ 130
518
+ 00:13:58,320 --> 00:14:04,640
519
+ poor nutrition but the kind of stuff that I transcribe it's just not it's not a it's not a
520
+
521
+ 131
522
+ 00:14:04,640 --> 00:14:12,800
523
+ privacy thing that sort of sensitive about and I don't do anything so you know sensitive or
524
+
525
+ 132
526
+ 00:14:12,800 --> 00:14:17,760
527
+ secure that requires air gapping so I looked at the pricing and especially the kind of older
528
+
529
+ 133
530
+ 00:14:17,760 --> 00:14:24,320
531
+ model mini some of them were very very affordable and I did it back the I did a calculation once
532
+
533
+ 134
534
+ 00:14:24,400 --> 00:14:30,800
535
+ with ChatGPT and I was like okay this is the API price for I can't remember whatever the model
536
+
537
+ 135
538
+ 00:14:30,800 --> 00:14:37,440
539
+ was let's say I just go out at like nonstop which really happens probably I would say an average
540
+
541
+ 136
542
+ 00:14:37,440 --> 00:14:42,800
543
+ I might dictate 30 to 60 minutes per day if I was probably summing up the emails
544
+
545
+ 137
546
+ 00:14:44,560 --> 00:14:50,800
547
+ documents outlines which is a lot but it's still a fairly modest amount and I was like well
548
+
549
+ 138
550
+ 00:14:50,800 --> 00:14:56,319
551
+ some days I do go on like one or two days right being usually when I'm like kind of I'd do the
552
+
553
+ 139
554
+ 00:14:56,319 --> 00:15:02,800
555
+ house and just have something like I've nothing else to do like if I'm at a hospital with a newborn
556
+
557
+ 140
558
+ 00:15:04,079 --> 00:15:09,040
559
+ and you're waiting for like eight hours and hours for an appointment and I would probably have
560
+
561
+ 141
562
+ 00:15:09,040 --> 00:15:15,520
563
+ listened to podcasts before becoming a speech fanatic and I'm like oh wait let me just get down let me
564
+
565
+ 142
566
+ 00:15:15,520 --> 00:15:20,800
567
+ just get these ideas out of my head and that's when I'll go on my speech binges but those
568
+
569
+ 143
570
+ 00:15:20,800 --> 00:15:25,680
571
+ were like once every few months like not frequently but I said okay let's just say if I'm going
572
+
573
+ 144
574
+ 00:15:25,680 --> 00:15:34,960
575
+ to price out cloud STT if I was like dedicated every second of every waking hour to transcribing
576
+
577
+ 145
578
+ 00:15:34,960 --> 00:15:41,280
579
+ for some odd reason I mean it have to like ease and use the toilet like you know there's only so many
580
+
581
+ 146
582
+ 00:15:41,360 --> 00:15:47,920
583
+ hours I'm awake for so like let's just say a maximum of like 40 hour 45 minutes in the hour
584
+
585
+ 147
586
+ 00:15:47,920 --> 00:15:53,280
587
+ then I said all right let's just say 50 who knows you're dictating on the toilet we do it
588
+
589
+ 148
590
+ 00:15:53,920 --> 00:16:01,040
591
+ so you could just do 60 but whatever I did and every day like you're going flat out seven days
592
+
593
+ 149
594
+ 00:16:01,040 --> 00:16:07,199
595
+ a week dictating nonstop it's like what's my monthly API bill gonna be at this price and it came
596
+
597
+ 150
598
+ 00:16:07,200 --> 00:16:14,160
599
+ out to like 70 or 80 bucks and I was like well that would be an extraordinary amount of dictation
600
+
601
+ 151
602
+ 00:16:14,160 --> 00:16:20,960
603
+ and I would hope that there were some compelling reason worth more than 70 dollars that I
604
+
605
+ 152
606
+ 00:16:20,960 --> 00:16:25,760
607
+ embarked upon their project so given the dots kind of the max point for me I said that's actually
608
+
609
+ 153
610
+ 00:16:25,760 --> 00:16:32,080
611
+ very very affordable now you're gonna if you want to specide the costs and you want to do the post
612
+
613
+ 154
614
+ 00:16:32,080 --> 00:16:38,640
615
+ processing that I really do feel is valuable that's gonna cost you more as well unless
616
+
617
+ 155
618
+ 00:16:38,640 --> 00:16:46,320
619
+ you're using Gemini which needless to say is a random person sitting in Jerusalem I have no
620
+
621
+ 156
622
+ 00:16:46,320 --> 00:16:52,000
623
+ affiliation nor with Google nor anthropic nor Gemini nor any major tech vendor for that matter
624
+
625
+ 157
626
+ 00:16:53,760 --> 00:17:00,240
627
+ I like Gemini not so much as a everyday model it's kind of underwhelmed in that respect I would say
628
+
629
+ 158
630
+ 00:17:00,320 --> 00:17:05,839
631
+ but for multi-model I think it's got a lot to offer and I think that the transcribing functionality
632
+
633
+ 159
634
+ 00:17:05,839 --> 00:17:13,520
635
+ whereby it can process audio with the system prompt and both give you transcription that's cleaned
636
+
637
+ 160
638
+ 00:17:13,520 --> 00:17:20,720
639
+ up that reduces two steps to one and that for me is a very very big deal and I feel like even Google
640
+
641
+ 161
642
+ 00:17:20,720 --> 00:17:27,839
643
+ hasn't really sort of thought through how useful the downloadability is more kind of use cases
644
+
645
+ 162
646
+ 00:17:28,319 --> 00:17:34,000
647
+ you can achieve with it because I found in the course of this year just an endless list of
648
+
649
+ 163
650
+ 00:17:34,959 --> 00:17:40,639
651
+ really kind of system prompt system prompt stuff that I can say okay I've used the trick
652
+
653
+ 164
654
+ 00:17:40,639 --> 00:17:46,320
655
+ capture context data for AI which is literally I might speak for if I wanted to have a good bank
656
+
657
+ 165
658
+ 00:17:46,320 --> 00:17:52,560
659
+ of context data about who knows my childhood more realistically maybe my career goals
660
+
661
+ 166
662
+ 00:17:53,520 --> 00:17:59,520
663
+ something that would just be like really boring to type out so I'll just like sit in my car
664
+
665
+ 167
666
+ 00:17:59,520 --> 00:18:06,480
667
+ and record it for 10 minutes and that's 10 minutes you get a lot of information in emails which
668
+
669
+ 168
670
+ 00:18:06,480 --> 00:18:13,200
671
+ is short text just there is a whole bunch and all these workflows kind of require a little bit
672
+
673
+ 169
674
+ 00:18:13,200 --> 00:18:17,919
675
+ of treatment afterwards and different treatment my context pipeline is kind of like just to
676
+
677
+ 170
678
+ 00:18:17,920 --> 00:18:22,320
679
+ extract the bare essential so you end up with me talking very loosely about sort of what
680
+
681
+ 171
682
+ 00:18:22,320 --> 00:18:27,920
683
+ I've done in my career where I've worked where my light work and it goes it condenses that down
684
+
685
+ 172
686
+ 00:18:27,920 --> 00:18:33,920
687
+ to very robotic language that is easy to chunk parse and maybe put into a vector database
688
+
689
+ 173
690
+ 00:18:33,920 --> 00:18:39,760
691
+ Daniel has worked in technology Daniel has been working in martech you know stuff like that
692
+
693
+ 174
694
+ 00:18:39,760 --> 00:18:46,160
695
+ that's not how you would speak but I figure it's probably easier to parse for after all robots
696
+
697
+ 175
698
+ 00:18:46,800 --> 00:18:52,560
699
+ so we've almost got to 20 minutes and this is actually a success because I waste 20 minutes of my
700
+
701
+ 176
702
+ 00:18:53,600 --> 00:19:00,880
703
+ of the evening speaking into my headphone and the levels were a shot and it was clipping and I said
704
+
705
+ 177
706
+ 00:19:00,880 --> 00:19:06,880
707
+ I can't read each and evaluation I have to be fair I have to give the models a chance to do their thing
708
+
709
+ 178
710
+ 00:19:07,600 --> 00:19:11,520
711
+ at what am I hoping to achieve in this okay my function was a daughter's mentioned
712
+
713
+ 179
714
+ 00:19:11,840 --> 00:19:16,879
715
+ Deepgram STT I'm really really hopeful that this prototype will work and it's being built in
716
+
717
+ 180
718
+ 00:19:16,879 --> 00:19:22,960
719
+ public open source so anyone is welcome to use it if I make anything good but that was really exciting
720
+
721
+ 181
722
+ 00:19:22,960 --> 00:19:30,160
723
+ for me last night when after hours of trying my own prototype seeing someone just made something that
724
+
725
+ 182
726
+ 00:19:30,160 --> 00:19:36,320
727
+ works like that you know you're not going to have to build a custom conda environment and image
728
+
729
+ 183
730
+ 00:19:36,320 --> 00:19:42,399
731
+ I have AMD GPU which makes things much more complicated I didn't find it and I was about to
732
+
733
+ 184
734
+ 00:19:42,399 --> 00:19:48,240
735
+ give up and I said all right let me just give deep grams Linux thing a shot and if this doesn't work
736
+
737
+ 185
738
+ 00:19:48,879 --> 00:19:53,520
739
+ I'm just going to go back to trying to vibe code something myself and when I ran the script
740
+
741
+ 186
742
+ 00:19:54,560 --> 00:20:01,120
743
+ I was using cloud code to do the installation process it ran the script and oh my gosh it works just like that
744
+
745
+ 187
746
+ 00:20:01,760 --> 00:20:08,479
747
+ the tricky thing for all those who want to know all the nitty gritty details
748
+
749
+ 188
750
+ 00:20:09,919 --> 00:20:14,800
751
+ was that I don't think it was actually struggling with transcription but pasting
752
+
753
+ 189
754
+ 00:20:14,800 --> 00:20:20,320
755
+ Wayland makes life very hard and I think there was something not running in the right time anyway
756
+
757
+ 190
758
+ 00:20:20,320 --> 00:20:25,120
759
+ Deepgram I looked at how they actually handled that because it worked out of the box where other stuff
760
+
761
+ 191
762
+ 00:20:25,120 --> 00:20:32,159
763
+ didn't and it was quite a clever little mechanism and but more and so it's not the accuracy was brilliant
764
+
765
+ 192
766
+ 00:20:32,159 --> 00:20:40,639
767
+ now what am I doing here this is going to be a 20 minutes audio sample and I'm I think I've done
768
+
769
+ 193
770
+ 00:20:40,639 --> 00:20:49,120
771
+ one or two of these before but I did this with sure snappy voice notes this is kind of long form
772
+
773
+ 194
774
+ 00:20:49,520 --> 00:20:54,080
775
+ it actually might be a better approximation for what's useful to me than voice memos like I
776
+
777
+ 195
778
+ 00:20:54,080 --> 00:20:59,840
779
+ need to buy three liters of milk tomorrow and pita bread which is probably how like half my voice
780
+
781
+ 196
782
+ 00:20:59,840 --> 00:21:04,639
783
+ notes sound like if anyone were to I don't know like find my phone they'd be like this is the most
784
+
785
+ 197
786
+ 00:21:04,639 --> 00:21:09,760
787
+ boring person in the world although actually there are some like kind of journaling thoughts as well
788
+
789
+ 198
790
+ 00:21:09,760 --> 00:21:14,800
791
+ but it's a lot of content like that and the probably for the evaluation the most useful thing is
792
+
793
+ 199
794
+ 00:21:15,440 --> 00:21:23,280
795
+ slightly obscure tech GitHub the clean Hugging Face not so obscure that is not going to have a chance
796
+
797
+ 200
798
+ 00:21:23,280 --> 00:21:28,879
799
+ of knowing it but hopefully sufficiently well known that the models should get us I tried to do a
800
+
801
+ 201
802
+ 00:21:28,879 --> 00:21:33,919
803
+ little bit of speaking really fast and speaking very slowly I would say in general I've spoken
804
+
805
+ 202
806
+ 00:21:34,560 --> 00:21:40,159
807
+ delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream
808
+
809
+ 203
810
+ 00:21:41,040 --> 00:21:45,760
811
+ and the thing that I'm not going to get into spent work is background noise which in my first
812
+
813
+ 204
814
+ 00:21:45,760 --> 00:21:52,000
815
+ take that I had to get rid of my wife come in with my son and for a good night kiss and that actually
816
+
817
+ 205
818
+ 00:21:52,000 --> 00:21:58,560
819
+ would have been super helpful to get in because it was non-diarized or if we had diarization
820
+
821
+ 206
822
+ 00:21:59,440 --> 00:22:03,120
823
+ a female I could say I want the male voice and that wasn't intended for transcription
824
+
825
+ 207
826
+ 00:22:04,480 --> 00:22:07,920
827
+ and I'm not going to get background noise like people honking their horns which is something
828
+
829
+ 208
830
+ 00:22:08,080 --> 00:22:13,440
831
+ of done to my main data set where I am trying to go back to some of my voice notes
832
+
833
+ 209
834
+ 00:22:13,440 --> 00:22:19,920
835
+ and I take them and run a benchmark but this is going to be just a pure quick test and
836
+
837
+ 210
838
+ 00:22:21,200 --> 00:22:28,080
839
+ someone working on a voice note idea that's my sort of end motivation besides thinking it's
840
+
841
+ 211
842
+ 00:22:28,080 --> 00:22:33,440
843
+ an absolutely outstanding technology that's coming to viability and really I know it sounds cheesy
844
+
845
+ 212
846
+ 00:22:33,520 --> 00:22:40,880
847
+ can actually have a very transformative effect it's you know voice technology has been life changing
848
+
849
+ 213
850
+ 00:22:40,880 --> 00:22:48,640
851
+ for folks living with disabilities and I think there's something really nice about the fact that
852
+
853
+ 214
854
+ 00:22:48,640 --> 00:22:54,560
855
+ it can also benefit you know folks who are able bodies and like we can all in different ways
856
+
857
+ 215
858
+ 00:22:57,040 --> 00:23:01,040
859
+ make this tech as useful as possible regardless of the exact way that we're using it
860
+
861
+ 216
862
+ 00:23:02,000 --> 00:23:06,639
863
+ and I think there's something very powerful in that and it can be very cool I see
864
+
865
+ 217
866
+ 00:23:06,639 --> 00:23:10,800
867
+ huge potential what excites me about voice tech a lot of things actually
868
+
869
+ 218
870
+ 00:23:12,080 --> 00:23:16,000
871
+ firstly the fact that it's cheap and accurate as I mentioned at the very start of this
872
+
873
+ 219
874
+ 00:23:17,200 --> 00:23:19,760
875
+ and it's getting better and better with stuff like accent handling
876
+
877
+ 220
878
+ 00:23:20,879 --> 00:23:25,040
879
+ I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it
880
+
881
+ 221
882
+ 00:23:25,040 --> 00:23:30,800
883
+ day to day as I imagine and get likes you per flawless words error rates because I'm just kind of
884
+
885
+ 222
886
+ 00:23:30,800 --> 00:23:38,399
887
+ skeptical about local speech tech as I mentioned and I think the pace of innovation and
888
+
889
+ 223
890
+ 00:23:38,399 --> 00:23:42,720
891
+ improvement in the models the main reason for fine tuning from what I've seen
892
+
893
+ 224
894
+ 00:23:44,240 --> 00:23:51,120
895
+ have been people who are something that really blows my mind about ASR is the idea that it's inherently
896
+
897
+ 225
898
+ 00:23:52,320 --> 00:24:00,080
899
+ alien fuel or multilingual fanatic based so as folks who use speak very obsturately languages
900
+
901
+ 226
902
+ 00:24:00,159 --> 00:24:05,520
903
+ that there might be a policy of training data or almost none at all and therefore the accuracy
904
+
905
+ 227
906
+ 00:24:05,520 --> 00:24:11,840
907
+ is significantly reduced or folks in very critical environments I know they're you this is
908
+
909
+ 228
910
+ 00:24:11,840 --> 00:24:18,080
911
+ using extensively in medical transcription and dispatcher work as you know the call centers
912
+
913
+ 229
914
+ 00:24:18,080 --> 00:24:24,000
915
+ use send out ambulances etc where accuracy is absolutely paramount and in the case of doctors
916
+
917
+ 230
918
+ 00:24:24,560 --> 00:24:29,679
919
+ radiologists there might be using very specialized vocab all the time so those are kind of the main
920
+
921
+ 231
922
+ 00:24:29,760 --> 00:24:35,040
923
+ two things and I'm not sure that really just for training make it better on a few random
924
+
925
+ 232
926
+ 00:24:35,040 --> 00:24:41,360
927
+ tech words with my slightly I mean I have an accent but like not you know an accent that a few
928
+
929
+ 233
930
+ 00:24:41,360 --> 00:24:48,240
931
+ other million people have ish I'm not sure that my little fine tune is going to actually
932
+
933
+ 234
934
+ 00:24:49,440 --> 00:24:54,560
935
+ like the bump in word error reduction if I ever actually figure out how to do it and get it up to the
936
+
937
+ 235
938
+ 00:24:55,520 --> 00:25:00,560
939
+ by the time I've done that I suspect that the next generation of ASR will just be so good that
940
+
941
+ 236
942
+ 00:25:00,560 --> 00:25:06,399
943
+ it will kind of be no well that would be cool for a doubt but all this uses instead so that's
944
+
945
+ 237
946
+ 00:25:06,399 --> 00:25:14,480
947
+ going to be is for today's episodes of voice training data single long-shaw evaluation
948
+
949
+ 238
950
+ 00:25:14,480 --> 00:25:20,560
951
+ who am I going to compare with supposed to be a benchmark but I'm more interested in seeing whisper
952
+
953
+ 239
954
+ 00:25:20,639 --> 00:25:27,679
955
+ head to head with two things ready one is whisper variants so you've got these projects like faster whisper
956
+
957
+ 240
958
+ 00:25:29,120 --> 00:25:33,840
959
+ distilled whisper it's a bit confusing there's a whole bunch of them and the emerging ASRs
960
+
961
+ 241
962
+ 00:25:33,840 --> 00:25:38,879
963
+ which are also a thing my intention for this is I'm not sure I'm going to have the time in any point
964
+
965
+ 242
966
+ 00:25:38,879 --> 00:25:45,840
967
+ of the foreseeable future to go back to this whole episode and create a proper source of truth and
968
+
969
+ 243
970
+ 00:25:45,840 --> 00:25:53,760
971
+ fix everything might do it if I can get one transcription that's sufficiently close to perfection
972
+
973
+ 244
974
+ 00:25:54,959 --> 00:25:59,840
975
+ but what I would actually love to do on hugging face I think would be a great
976
+
977
+ 245
978
+ 00:25:59,840 --> 00:26:06,240
979
+ probably how I might visualize this is having the audio waveform play and then have the transcript
980
+
981
+ 246
982
+ 00:26:06,240 --> 00:26:14,720
983
+ for each model below it and maybe even a like you know two scale and maybe even a local one as
984
+
985
+ 247
986
+ 00:26:14,720 --> 00:26:22,960
987
+ well like local whisper versus open AI API etc and I can then actually listen back to segments
988
+
989
+ 248
990
+ 00:26:22,960 --> 00:26:28,960
991
+ or anyone who wants to can listen back to segments of this recording and see where a particular
992
+
993
+ 249
994
+ 00:26:28,960 --> 00:26:34,240
995
+ model struggled and others didn't as well as the sort of headline finding of which had the best
996
+
997
+ 250
998
+ 00:26:34,240 --> 00:26:40,240
999
+ WER but that would require the source of truth okay that's it I hope this was I don't know
1000
+
1001
+ 251
1002
+ 00:26:40,320 --> 00:26:45,760
1003
+ maybe useful for other folks interested in STT you want to see that I always feel think I've just
1004
+
1005
+ 252
1006
+ 00:26:45,760 --> 00:26:51,120
1007
+ said something I didn't and do STT I said for those it's in carefully including hopefully the
1008
+
1009
+ 253
1010
+ 00:26:51,840 --> 00:26:57,840
1011
+ models themselves this has been myself Daniel Rosehill for more jumbled repositories about my
1012
+
1013
+ 254
1014
+ 00:26:58,240 --> 00:27:04,640
1015
+ roving interest in AI but particularly agentic mcp and voice tech you can find me on
1016
+
1017
+ 255
1018
+ 00:27:04,800 --> 00:27:11,920
1019
+ GitHub Hugging Face where else danielrosehill.com which is my personal website as well as
1020
+
1021
+ 256
1022
+ 00:27:12,560 --> 00:27:17,360
1023
+ this podcast whose name I sadly cannot remember but until next time thanks for listening
1024
+
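The head-to-head evaluation described at the end of the episode — scoring each model's transcript against this ground-truth SRT by word error rate — could be sketched as below. This is a minimal sketch, not part of the dataset: the file paths in the comments are assumed from this repository's layout, and the WER here is a plain word-level Levenshtein distance without the text normalization that dedicated tools such as `jiwer` apply, so scores will differ from a tuned benchmark.

```python
def srt_to_text(srt: str) -> str:
    """Collapse an SRT file to plain lowercase text by dropping
    subtitle index lines, timestamp lines, and blank separators."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or "-->" in line:
            continue
        kept.append(line)
    return " ".join(kept).lower()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the reference length (so WER can exceed 1.0)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the current reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (r != h))   # substitution (free on match)
        prev = cur
    return prev[-1] / max(len(ref), 1)


# Usage against this repo's files (paths assumed from the commit):
#   truth = srt_to_text(open("data/ground-truth/truth_1.srt", encoding="utf-8").read())
#   hyp = open("data/inference/runs/local-stt/run-2/whisper-tiny.txt", encoding="utf-8").read().lower()
#   print(f"whisper-tiny WER: {wer(truth, hyp):.3f}")
```

Counting substitutions, insertions, and deletions separately (rather than exact-match only) matters here, because a model that drops filler words should score better than one that hallucinates extra text of the same length.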
data/inference/runs/local-stt/run-2/whisper-tiny.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast. Or, I may append this to a podcast that I set up recently regarding my with my thoughts on speech tech and AI in particular. More AI in generative AI, I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation of back of the envelope evaluation as they might say for different speech attacks models. And I'm doing this because I thought I had made a great breakthrough in my journey with speech tech and that was succeeding in the elusive task of fine-tuning whisper. Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking, I might whisper something at some points as well. And I'll go back to speaking loud in a different part. So I'm going to send really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to put a speech attacks model through its pieces, which is trying to make sense of, is this guy just ramling on and coherently in one long sentence or are these just actually a series of
2
+
3
+ step of standalone, standalone, standalone sentences. And how is it going to handle step alone? That's not a word. What happens when you use speech attacks and you use a fake word. And then you're like, wait, that's not actually that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why was it trying to find you to whisper? And what is whisper, as I said, I'm going to try to record this at a couple of different levels of technicality for folks who are in the normal world and not totally stocked down the rabbit hole of AI. What you have to say is a really wonderful rabbit hole to be, to be done. It's a really interesting area and speech and voice attack is the aspect of it that I find actually most, I'm not sure I would say the most interesting because there's just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I'm persevering hard with the task of training, yes, a good solution working for Linux. Would you have anyone actually, does listen to this not just for the training data and for the actual content? This is this is sparked. I had, besides the fine tune not working. Well, that was the failure. I used Claude Codes because one thing's these days that there is nothing sort of solving you know, the reason of life or something at that's flawed and agentic AI can't do, which is not really the case. It does seem that way sometimes, but it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data. Basically speaking just random things for three minutes and it was actually kind of tedious because the texts were really weird. Some of them were it was like AI generated. 
I tried before to reach Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was okay knowing just gonna have to find something out to read. So I used a created with AI Studio vibe code as a synthetic text generator, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like okay, give me a voice note like I'm recording an email, give me a short story to read, give me pros to read. So I came up with all these different things and they added a little timer to it so I could see how to let us say well as to one hour. And I spent like an hour, one afternoon or probably two hours by the time you do retakes on whatever because you want to, it gave me a source of truth which I'm not sure if that's the scientific way to approach this topic of gathering training data, but I thought made sense. I have a lot of audio data from recording voice notes which I've also kind of used being experimenting with using for a different purpose. It's slightly different annotating task types. It's more text classification experiment or well it's more than that actually I'm working on a voice app so it's a prototype I guess is really more accurate. But you can do that and you can work backwards. You listen back to a voice note and you painfully go through one of those transcribing where you start and stop and scrub around in a new fixie areas but it's really really pouring to do that. So I thought it would be less tedious in the long term if I just recorded this source of truth. So it gave me these three minutes snippets. I recorded them it saved an MP3 and a TXT in the same folder and I created an error that data. So I was very hopeful that I could actually find you in whisper. 
I want to find you in whisper because when I got into voice tech last November my wife was in the US and I was alone at home and when crazy people like me do really wild things like use voice tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high. I used speech tech now and again tried to write as like it would be really cool if you just like speak into your computer and whatever I tried I used that had Linux support was just it was not good basically and this blew me away from the first go. I mean it wasn't 100% accurate either the box and it took work but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. There's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for it's speech tech to be worth while it isn't your productivity but you do need to get above let's say 85%. If it's 60% or 50% you inevitably say screw it I'll just type it because you end up missing errors in the transcript and it becomes actually worse you end up in a worse position than you started with that's been my experience. So I was like oh this is actually really really good now how did that happen? 
The answer is ASR whisper being open sourced and the transformer architecture if you want to go back to the to the underpinnings which really blows my mind and it's on my list to reto that paper all you need is attention as attentively as can be done with my limited brain because it's super super high level stuff super advanced stuff I mean but that I think of all the things that are fascinating about the sudden rise and AI and the dramatic capabilities I find fascinating a few people are like hang on you've got this thing that can speak to you like a chatbot and LLM and then you've got image generation okay so first of all those two things on the surface have nothing in common so like how did that just happen all at the same time and then when you extend that further you're like Suno right you can sing a song an AI will like come up with an instrumental and then you've got whisper and you're like wait a second how did all this stuff like if it's all AI what's like there has to be some commonality otherwise these are for these are totally different technologies on the surface of it and the transformer architecture is as far as I know the answer and I can't even say you can't even pretend that I really understand what the transformer architecture means in depth but I have scandice and as I said once a printed I'm really kind of think over it's at some point and I'll probably feel bad about myself I think because when those guys in the in their 20s like that's crazy I think I asked you to be one who were the who wrote that paper and how old were they when it was published in arcs of I was expecting like I don't know what do you imagine I personally imagine kind of like you know you have these breakthroughs during Covid and things like that were like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring labs and we're really in writing and publishing and kind of obscure academic publications and they finally like hit 
a big, or win a Nobel Prize, and then they're household names. So that was kind of what I had, that was the mental image I'd formed of the birth of that arXiv paper. I wasn't expecting twenty-somethings in San Francisco, though I thought that was both very, very funny, very cool, and actually kind of inspiring. It's nice to think that people who, you know, you might put in the kind of milieu or bubble or world that you are in, or credibly in it through, you know, a series of connections, are coming up with such literally world-changing innovations. So that, I thought anyway, was cool.

Okay, voice training data, how are we doing? We're about 10 minutes in and I'm still talking about voice technology. So Whisper was brilliant, and I was so excited that my first instinct was like, oh my gosh, I have to get a really good microphone for this. But I didn't go on a spending spree, because I said I'm going to have to just wait a month and see if I still use this. And it just kind of became, it's become, really part of my daily routine: if I'm writing an email, I'll record a voice note. And it's nice to see that everyone is developing the same things in parallel. That's kind of a weird thing to say, but when I started working on these prototypes on GitHub, which is where I just kind of share, very freely and loosely, ideas and, you know, first iterations on concepts, for want of a better word I called it LLM post-processing, or cleanup: basically a system prompt so that, after you get back the raw text from Whisper, you run it through a model and say, okay, this is crappy text, add sentence structure and, you know, fix it up. And now, when I'm exploring the different tools out there that people have built, I see quite a number of projects have basically, you know, done the same thing. Lest that be misconstrued, I'm not saying for a millisecond that I inspired them; I'm sure this has been a thing that's been integrated into tools for a while. But it's the kind of thing that, when you start using these tools every day, the need for it is almost instantly apparent, because text that doesn't have any punctuation or paragraph spacing takes so long to get into a presentable email that, again, it moves speech tech back before that inflection point where you're like, that's just not worth it, it'd just be quicker to type this. So it's a little touch that actually is a big deal.

So I was on Whisper, and I've been using Whisper, and early on I found a couple of tools, but I couldn't find what I was looking for on Linux, which is basically just something that'll run in the background, you'll give it an API key, and it'll just transcribe, with like a little hotkey to start and stop the dictation. And the issue was, I discovered that most people involved in creating these projects were very much focused on local models, running Whisper locally, because you can. I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech-to-text APIs, what I was spending, I just thought it's actually, in my opinion, one of the better deals in API spending and in cloud. It's just not that expensive for very, very good models; you're going to be able to run the full model, the latest model, versus whatever you can run on your average GPU, and unless you want to buy a crazy GPU, it doesn't really make sense to me. And privacy is another concern that I know is kind of very much a separate thing: people just don't want their voice data, their voice, leaving their local environment, maybe for regulatory reasons as well. But I'm not in that camp. I don't really care about people listening to my grocery lists, consisting of reminding myself that I need to buy more beer, Cheetos, and hummus, which are kind of the three staples of my diet during periods of poor nutrition. The kind of stuff that I transcribe is just not a privacy thing that I'm sort of sensitive about, and I don't do anything so, you know, sensitive or secure that it requires air-gapping.

So I looked at the pricing, and especially the kind of older models, the minis, some of them were very, very affordable. I did a calculation once with ChatGPT and I was like, okay, this is the API price for, I can't remember whatever the model was; let's say I just go at it nonstop, which rarely happens. On average I might dictate 30 to 60 minutes per day, probably, summing up the emails, documents, outlines, which is a lot, but still a fairly modest amount. And some days I do go on like one- or two-day runs, usually when I'm out of the house and just have nothing else to do. Like if I'm at a hospital with a newborn and you're waiting hours and hours for an appointment, where I would probably have listened to podcasts before becoming a speech fanatic, and I'm like, oh wait, let me just get these ideas out of my head, and that's when I'll go on my speech binges. But those were like once every few months, not frequently. But I said, okay, if I'm going to price out cloud STT, let's just say I was dedicating every second of every waking hour to transcribing for some odd reason. I mean, I'd have to eat and use the toilet, and there are only so many hours I'm awake for, so let's say a maximum of like 45 minutes in the hour. Then I said, all right, let's just say 50, or, who knows, you're dictating on the toilet, so you could just do 60, but whatever I did. And every day, going flat out, seven days a week, dictating nonstop: what's my monthly API bill going to be at this price? And it came out to like 70 or 80 bucks. And I was like, well, that would be an extraordinary amount of dictation, and I would hope that there was some compelling reason, worth more than 70 dollars, that I'd embarked upon that project. So given that that's kind of the max point for me, I said that's actually very, very affordable. Now, if you want to add in the costs of the post-processing, which I really do feel is valuable, that's going to cost more as well, unless you're using Gemini.

Which, needless to say, as a random person sitting in Jerusalem, I have no affiliation with Google, nor Anthropic, nor any major tech vendor for that matter. I like Gemini not so much as an everyday model, it's kind of underwhelming in that respect, I would say, but for multimodal I think it's got a lot to offer. And I think the transcribing functionality, whereby it can process audio with a system prompt and give you a transcription that's cleaned up, reduces two steps to one, and that for me is a very, very big deal. I feel like even Google hasn't really thought through how useful that is and how many kinds of use cases you can achieve with it, because I found, in the course of this year, just an endless list of really kind of system-prompt stuff. I've used the trick to capture context data for AI, which is literally, I might speak, for, if I wanted to have a good bank of context data about, who knows, my childhood, or more realistically maybe my career goals, something that would just be really boring to type out, I'll just sit in my car and record it for 10 minutes, and in 10 minutes you get a lot of information. Emails, which are short texts, there's a whole bunch. And all these workflows kind of require a little bit of treatment afterwards, and different treatment. My context pipeline is kind of just to extract the bare essentials, so you end up with me talking very loosely about sort of what I've done in my career, where I've worked, what I'd like to work on, and it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database: Daniel has worked in technology, Daniel has been working in marketing, you know, stuff like that. That's not how you would speak, but I figure it's probably easier to parse for, after all, robots.

So we've almost got to 20 minutes, and this is actually a success, because I wasted 20 minutes of the evening speaking into my headphone while the levels were shot and it was clipping, and I said I can't use that for an evaluation; I have to be fair, I have to give the models a chance to do their thing. And what am I hoping to achieve in this? Okay, as I mentioned, Deepgram STT: I'm really, really hopeful that this prototype will work, and it's a build-in-public, open-source thing, so anyone is welcome to use it if I make anything good. That was really exciting for me last night when, after hours of trying my own prototype, I saw someone had just made something that works, like that, you know, without having to build a custom conda environment and image. I have an AMD GPU, which makes things much more complicated; I couldn't get it working and I was about to give up, and I said, all right, let me just give Deepgram's Linux thing a shot, and if this doesn't work I'm just going to go back to trying to vibe-code something myself. And when I ran the script, I was using Claude Code to do the installation process, it ran the script and, oh my gosh, it works, just like that. The tricky thing, for all those who want to know all the nitty-gritty details, was that I don't think it was actually struggling with transcription, but pasting in Wayland makes life very hard, and I think there was something not running at the right time. Anyway, I looked at how Deepgram actually handled that, because it worked out of the box when other stuff didn't, and it was quite a clever little mechanism. And what's more, the accuracy was brilliant.

Now, what am I doing here? This is going to be a 20-minute audio sample, and I think I've done one or two of these before, but I did those with short, snappy voice notes; this is kind of long-form, which actually might be a better approximation for what's useful to me than voice memos like "I need to buy three liters of milk tomorrow and pita bread," which is probably how like half my voice notes sound. If anyone were to, I don't know, find my phone, they'd be like, this is the most boring person in the world. Although actually there are some kind of journaling thoughts as well, but it's a lot of content like that. And probably, for the evaluation, the most useful thing is slightly obscure tech names, GitHub, Hugging Face, not so obscure that a model has no chance of knowing them, but hopefully sufficiently well known that the models should get them. I tried to do a little bit of speaking really fast and speaking very slowly; I would say in general I've delivered this at a faster pace than I usually would, owing to the strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this sample is background noise. In my first take, which I had to get rid of, my wife came in with my son for a good-night kiss, and that actually would have been super helpful to keep, because it was non-diarized, or, if we had diarization, a female voice, and I could say I want the male voice, as that wasn't intended for transcription. And I'm not going to get background noise like people honking their horns, which is something I've done in my main data set, where I am trying to go back to some of my voice notes, take them, and run a benchmark. But this is going to be just a pure quick test from someone working on a voice note idea. That's my sort of end motivation, besides thinking it's an absolutely astounding technology that's coming to viability and really, I know it sounds cheesy, can actually have a very transformative effect. You know, voice technology has been life-changing for folks living with disabilities, and I think there's something really nice about the fact that it can also benefit folks who are able-bodied, and we can all, in different ways, make this tech as useful as possible, regardless of the exact way that we're using it. I think there's something very powerful in that, and it can be very cool; I see huge potential.

What excites me about voice tech? A lot of things, actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this, and it's getting better and better with stuff like accent handling. I'm not sure my fine-tune will actually ever come to fruition, in the sense that I'll use it day to day as I imagined and get, like, super flawless word error rates, because I'm just kind of skeptical about local speech-to-text, as I mentioned, and about the pace of innovation and improvement in the models. The main reasons for fine-tuning, from what I've seen, have been people who, and something that really blows my mind about ASR is the idea that it's inherently multilingual, so, folks who speak very obscure languages where there might be a paucity of training data, or almost none at all, and therefore the accuracy is significantly reduced; or folks in very critical environments. I know this is used extensively in medical transcription and dispatcher work, you know, the call centers that send out ambulances, etc., where accuracy is absolutely paramount, and in the case of doctors, radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things, and I'm not sure that training really just to make it better on a few random tech words, with my slightly, I mean, I have an accent, but not, you know, an accent that only a few million-ish other people have, I'm not sure that my little fine-tune is actually going to be worth the bump in word error reduction, if I ever actually figure out how to do it. And by the time I've done that, I suspect that the next generation of ASR will just be so good that it will kind of be moot. Well, that would be cool, no doubt, and I'd use that instead.

So that's going to be it for today's episode of voice training data, a single long-shot evaluation. Who am I going to compare? It's supposed to be a benchmark, but I'm more interested in seeing Whisper head to head with two things really. One is Whisper variants, so you've got these projects like Faster-Whisper, Distil-Whisper, it's a bit confusing, there's a whole bunch of them; and the emerging ASRs, which are also a thing. My intention for this is, I'm not sure I'm going to have the time at any point in the foreseeable future to go back to this whole episode and create a proper source of truth where I fix everything; I might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face, and it's probably how I might visualize this, is having the audio waveform play and then have the transcript for each model below it, maybe even a local one as well, like local Whisper versus the OpenAI API, etc. And then I, or anyone who wants to, can listen back to segments of this recording and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER. But that would require the source of truth. Okay, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. I always feel like I've just said something I didn't, for those listening carefully, including, hopefully, the models themselves. This has been myself, Daniel Rosehill. For more, my jumbled repositories are about my roving interests in AI, particularly agentic, MCP, and voice tech. You can find me on GitHub, Hugging Face, where else, DanielRosehill.com, which is my personal website, as well as this podcast, whose name I sadly cannot remember. Until the next time, thanks for listening.
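The head-to-head evaluation described above ultimately reduces to one computation: scoring each model's transcript against the source of truth by word error rate (WER). A minimal sketch in plain Python, no external ASR tooling; the sample sentence and model names below are hypothetical placeholders, not outputs from this dataset:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Single-row dynamic programme over the hypothesis tokens.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            diag, row[j] = row[j], min(
                row[j] + 1,       # deletion
                row[j - 1] + 1,   # insertion
                diag + (r != h),  # substitution (free when the words match)
            )
    return row[-1] / max(len(ref), 1)

# Hypothetical ground truth and per-model hypotheses.
truth = "okay voice training data how are we doing"
transcripts = {
    "local-whisper": "okay voice training data how were we doing",
    "deepgram": "okay voice training data how are we doing",
}

# Rank models best-first by WER against the source of truth.
for name, hyp in sorted(transcripts.items(), key=lambda kv: wer(truth, kv[1])):
    print(f"{name}: WER {wer(truth, hyp):.1%}")
# → deepgram: WER 0.0%
# → local-whisper: WER 12.5%
```

In practice the reference and hypothesis strings would come from stripping timestamps and indices out of the ground-truth and per-run SRT files before scoring.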
data/inference/runs/local-stt/run-3/transcript.srt ADDED
@@ -0,0 +1,1032 @@
1
+ 1
2
+ 00:00:00,000 --> 00:00:08,640
3
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast.
4
+
5
+ 2
6
+ 00:00:08,640 --> 00:00:19,120
7
+ Or, I may append this to a podcast that I set up recently regarding my with my thoughts on speech
8
+
9
+ 3
10
+ 00:00:19,120 --> 00:00:28,720
11
+ tech and AI in particular. More AI in generative AI I would say. But in any event, the purpose of this
12
+
13
+ 4
14
+ 00:00:30,080 --> 00:00:37,120
15
+ voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the
16
+
17
+ 5
18
+ 00:00:37,120 --> 00:00:42,320
19
+ envelope evaluation as they might say for different speech attacks models. And I'm doing this because I
20
+
21
+ 6
22
+ 00:00:42,800 --> 00:00:48,560
23
+ I thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in
24
+
25
+ 7
26
+ 00:00:48,560 --> 00:00:55,120
27
+ the elusive task of fine-tuning whisper. Whisper is, and I'm going to just talk, I'm trying to
28
+
29
+ 8
30
+ 00:00:55,760 --> 00:01:01,600
31
+ mix up, I'm going to try a few different styles of speaking. I might whisper something at some
32
+
33
+ 9
34
+ 00:01:01,600 --> 00:01:07,760
35
+ points as well. And I'll go back to speaking loud in different parts. I'm going to send really
36
+
37
+ 10
38
+ 00:01:07,760 --> 00:01:15,200
39
+ like a crazy person because I'm also going to try to speak at different pitches and cadences
40
+
41
+ 11
42
+ 00:01:15,200 --> 00:01:21,600
43
+ in order to really try to push a speech attacks model through its paces, which is trying to make
44
+
45
+ 12
46
+ 00:01:21,600 --> 00:01:30,320
47
+ sense of is this guy just rambling on and coherently in one long sentence or are these just actually
48
+
49
+ 13
50
+ 00:01:30,320 --> 00:01:38,320
51
+ series of step standalone standalone sentences? And how is it going to handle step alone? That's not a
52
+
53
+ 14
54
+ 00:01:38,320 --> 00:01:43,919
55
+ word. What happens when you use speech attacks and you use a fake word and then you're like, wait,
56
+
57
+ 15
58
+ 00:01:43,919 --> 00:01:51,520
59
+ that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the
60
+
61
+ 16
62
+ 00:01:52,880 --> 00:01:57,359
63
+ questions that I'm seeking to answer in this training data. Now, why did why was it trying to
64
+
65
+ 17
66
+ 00:01:57,360 --> 00:02:01,040
67
+ find China whisper? And what is whisper? As I said, I'm going to try to
68
+
69
+ 18
70
+ 00:02:02,080 --> 00:02:04,240
71
+ record this at a couple of different levels of
72
+
73
+ 19
74
+ 00:02:04,880 --> 00:02:10,320
75
+ technicality for folks who are in the normal world and not totally
76
+
77
+ 20
78
+ 00:02:11,360 --> 00:02:16,079
79
+ stuck down the rabbit hole of AI. What you have to say is a really wonderful rabbit hole to be
80
+
81
+ 21
82
+ 00:02:16,720 --> 00:02:23,440
83
+ to be done. It's a really interesting area and speech and voice tech is the aspect of it that
84
+
85
+ 22
86
+ 00:02:23,440 --> 00:02:28,880
87
+ I find actually most I'm not sure I would say the most interesting because there's just so much
88
+
89
+ 23
90
+ 00:02:28,880 --> 00:02:34,560
91
+ that is fascinating in AI. But the most that I find the most personally transformative in terms of
92
+
93
+ 24
94
+ 00:02:34,560 --> 00:02:42,240
95
+ the impact that it's had on my daily work life and productivity and how I sort of work. And
96
+
97
+ 25
98
+ 00:02:42,960 --> 00:02:49,920
99
+ I am persevering hard with the task of training, I guess, a good solution working for Linux.
100
+
101
+ 26
102
+ 00:02:49,920 --> 00:02:53,440
103
+ Would you have anyone actually does listen to this not just for the training data and for the
104
+
105
+ 27
106
+ 00:02:53,440 --> 00:03:00,399
107
+ actual content? This is this is sparked. I had, besides the fine tune not working, well that was
108
+
109
+ 28
110
+ 00:03:00,399 --> 00:03:07,679
111
+ the failure. I used plot code because one thing these days that there is nothing
112
+
113
+ 29
114
+ 00:03:08,560 --> 00:03:16,799
115
+ short of solving, you know, the reason of life or something that's flawed and
116
+
117
+ 30
118
+ 00:03:16,800 --> 00:03:22,720
119
+ agentically I can't do, which is not really the case. It does seem that way sometimes but it
120
+
121
+ 31
122
+ 00:03:22,720 --> 00:03:28,080
123
+ fails a lot as well. And this is one of those instances where last week I put together an hour
124
+
125
+ 32
126
+ 00:03:28,080 --> 00:03:33,600
127
+ of voice training data, basically speaking just random things for three minutes and
128
+
129
+ 33
130
+ 00:03:35,600 --> 00:03:40,160
131
+ it was actually kind of tedious because the text were really weird. Some of them were it was like,
132
+
133
+ 34
134
+ 00:03:40,160 --> 00:03:45,440
135
+ it was AI generated. I tried before to reach Sherlock Holmes for an hour and I just couldn't,
136
+
137
+ 35
138
+ 00:03:45,440 --> 00:03:51,120
139
+ I was so bored after 10 minutes that I was like, okay, knowing just gonna have to find something
140
+
141
+ 36
142
+ 00:03:51,120 --> 00:03:59,920
143
+ else to read. So I used a created with AI Studio, a vibe code is a synthetic text generator,
144
+
145
+ 37
146
+ 00:04:00,800 --> 00:04:05,680
147
+ which actually I thought was probably a better way of doing it because it would give me more
148
+
149
+ 38
150
+ 00:04:05,680 --> 00:04:12,000
151
+ short samples with more varied content. So I was like, okay, give me a voice note. Like I'm
152
+
153
+ 39
154
+ 00:04:12,000 --> 00:04:18,800
155
+ recording an email, give me a short story to read, give me pros to read. So I came up with all
156
+
157
+ 40
158
+ 00:04:18,800 --> 00:04:24,240
159
+ these different things and they added a little timer to it so I could see how close I was to one
160
+
161
+ 41
162
+ 00:04:24,240 --> 00:04:32,480
163
+ hour and I spent like an hour one afternoon or probably two hours by the time you do retakes
164
+
165
+ 42
166
+ 00:04:32,480 --> 00:04:39,120
167
+ and whatever because you want to, it gave me a source of truth which I'm not sure if that's the
168
+
169
+ 43
170
+ 00:04:39,120 --> 00:04:45,120
171
+ scientific way to approach this. Topic of gathering training data but I thought made sense.
172
+
173
+ 44
174
+ 00:04:46,560 --> 00:04:50,880
175
+ I have a lot of audio data from recording voice notes which I've also kind of used
176
+
177
+ 45
178
+ 00:04:52,000 --> 00:04:56,720
179
+ being experimenting with using for a different purpose. It's slightly different annotating
180
+
181
+ 46
182
+ 00:04:57,840 --> 00:05:03,680
183
+ task types. It's more text classification experiment or well it's more than that actually
184
+
185
+ 47
186
+ 00:05:03,680 --> 00:05:08,880
187
+ working on a voice app so it's a prototype I guess is really more accurate.
188
+
189
+ 48
190
+ 00:05:11,280 --> 00:05:15,920
191
+ But you can do that and you can work backwards. You listen back to a voice note and you
192
+
193
+ 49
194
+ 00:05:17,520 --> 00:05:22,400
195
+ painfully go through one of those transcribing where you start and stop and scrub around it and
196
+
197
+ 50
198
+ 00:05:22,400 --> 00:05:27,680
199
+ you fix the errors but it's really really pouring to do that. So I thought it would be last tedious
200
+
201
+ 51
202
+ 00:05:27,680 --> 00:05:34,240
203
+ in the long term if I just recorded this source of truth so it gave me these three minute snippets.
204
+
205
+ 52
206
+ 00:05:34,240 --> 00:05:40,480
207
+ I recorded them it saved in MP3 and the TXT in the same folder and I created an error that data.
208
+
209
+ 53
210
+ 00:05:41,840 --> 00:05:47,280
211
+ So I was very hopeful quite a little bit hopeful that I would be able that I could actually fine tune
212
+
213
+ 54
214
+ 00:05:47,280 --> 00:05:54,720
215
+ whisper. I want to fine tune whisper because when I got into voice tech last November my wife was in
216
+
217
+ 55
218
+ 00:05:54,720 --> 00:06:01,920
219
+ the US and I was alone at home and when crazy people like me do really wild things like use voice
220
+
221
+ 56
222
+ 00:06:01,920 --> 00:06:08,320
223
+ to tech technology that was basically when I started doing it I didn't feel like a crazy person
224
+
225
+ 57
226
+ 00:06:08,320 --> 00:06:15,760
227
+ speaking to myself and my expectations weren't that high. I used speech tech now and again
228
+
229
+ 58
230
+ 00:06:16,960 --> 00:06:21,200
231
+ try it out. It's like it'd be really cool if you could just like speak into your computer and
232
+
233
+ 59
234
+ 00:06:21,280 --> 00:06:28,479
235
+ whatever I tried out that had Linux support was just it was not good basically and this blew me away
236
+
237
+ 60
238
+ 00:06:28,479 --> 00:06:34,400
239
+ from the first go. I mean it wasn't 100% accurate out of the box and it took work but it was good
240
+
241
+ 61
242
+ 00:06:34,400 --> 00:06:40,320
243
+ enough that there was a solid foundation and it kind of passed that pivot point that it's actually
244
+
245
+ 62
246
+ 00:06:40,320 --> 00:06:46,320
247
+ worth doing this. There's a point where it's so like the transcript is you don't have to get 100%
248
+
249
+ 63
250
+ 00:06:46,400 --> 00:06:51,040
251
+ accuracy for it to be worth your time for it's speech attacks to be a worthwhile addition to your
252
+
253
+ 64
254
+ 00:06:51,040 --> 00:06:58,320
255
+ productivity but you do need to get above let's say I don't know 85% if it's 60% or 50% you inevitably
256
+
257
+ 65
258
+ 00:06:58,320 --> 00:07:03,920
259
+ say screw it I'll just type it because you end up missing errors in the transcript and it becomes
260
+
261
+ 66
262
+ 00:07:03,920 --> 00:07:07,840
263
+ actually worse you end up in a worse position than you started with it that's been my experience.
264
+
265
+ 67
266
+ 00:07:08,400 --> 00:07:14,400
267
+ So I was like oh this is actually really really good now how did that happen? The answer is
268
+
269
+ 68
270
+ 00:07:14,400 --> 00:07:21,599
271
+ ASR with per being open-sourced and the transformer architecture if you want to go back to the
272
+
273
+ 69
274
+ 00:07:23,200 --> 00:07:29,440
275
+ to the underpinnings which really blows my mind and it's on my list to read through that paper
276
+
277
+ 70
278
+ 00:07:30,239 --> 00:07:38,400
279
+ all you need is attention as attentively as can be done with my limited brain because it's super
280
+
281
+ 71
282
+ 00:07:38,960 --> 00:07:45,679
283
+ high-level stuff super advanced stuff I mean but that I think of all the things that
284
+
285
+ 72
286
+ 00:07:47,280 --> 00:07:54,080
287
+ are fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating
288
+
289
+ 73
290
+ 00:07:54,080 --> 00:07:59,599
291
+ a few people are like hang on you've got this thing that can speak to you like a chatbot and LLM
292
+
293
+ 74
294
+ 00:08:00,640 --> 00:08:06,799
295
+ then you've got image generation okay so firstly those two things on the surface have nothing
296
+
297
+ 75
298
+ 00:08:06,800 --> 00:08:12,560
299
+ in common so like how are they how did that just happen all at the same time and then when you
300
+
301
+ 76
302
+ 00:08:12,560 --> 00:08:19,920
303
+ extend that further you're like sooner right you can sing a song an AI will like come up with
304
+
305
+ 77
306
+ 00:08:19,920 --> 00:08:25,200
307
+ an instrumental and then you've got whisper and you're like wait a second how did all this stuff
308
+
309
+ 78
310
+ 00:08:25,200 --> 00:08:30,880
311
+ like if it's all AI what's like there has to be some commonality otherwise these are four these
312
+
313
+ 79
314
+ 00:08:31,600 --> 00:08:38,640
315
+ totally different technologies on the surface of it and the transformer architecture is as far as
316
+
317
+ 80
318
+ 00:08:38,640 --> 00:08:44,720
319
+ I know the answer and I can't even say I can't even pretend that I really understand what the
320
+
321
+ 81
322
+ 00:08:44,720 --> 00:08:51,200
323
+ transformer architecture means in depth but I have scandis and as I said I want to print it and
324
+
325
+ 82
326
+ 00:08:51,200 --> 00:08:57,760
327
+ really kind of think over it's at some point and I'll probably feel bad about myself I think
328
+
329
+ 83
330
+ 00:08:57,760 --> 00:09:03,280
331
+ because weren't those guys in their in their 20s like that's crazy I think I asked chat Gbt
332
+
333
+ 84
334
+ 00:09:03,280 --> 00:09:09,439
335
+ once who were the who wrote that paper and how old were they when it was published in ARC
336
+
337
+ 85
338
+ 00:09:09,439 --> 00:09:14,640
339
+ and I was expecting like I don't know what do you what do you imagine I personally imagine kind of
340
+
341
+ 86
342
+ 00:09:14,640 --> 00:09:19,840
343
+ like you know you have these breakthroughs during COVID and things like that were like these kind
344
+
345
+ 87
346
+ 00:09:19,840 --> 00:09:24,480
347
+ of really obscure scientists who are like in their 50s and they've just kind of been laboring
348
+
349
+ 88
350
+ 00:09:24,640 --> 00:09:31,120
351
+ labs and we're really in writing and publishing and kind of obscure academic publications and they
352
+
353
+ 89
354
+ 00:09:31,120 --> 00:09:37,200
355
+ finally like hit a big or win a Nobel Prize and then their household household names so I that
356
+
357
+ 90
358
+ 00:09:37,200 --> 00:09:42,680
359
+ was kind of what I had in mind that was the mental image I'd formed of the birth of ARC
360
+
361
+ 91
362
+ 00:09:42,680 --> 00:09:47,760
363
+ like I wasn't expecting 20 somethings in San Francisco though I thought that was both very very
364
+
365
+ 92
366
+ 00:09:47,760 --> 00:09:54,160
367
+ funny very cool and actually kind of inspiring it's nice to think that people who you know just
368
+
369
+ 93
370
+ 00:09:54,160 --> 00:10:01,439
371
+ you might put them in the kind of milieu or bubble or world that you are in are credibly in through
372
+
373
+ 94
374
+ 00:10:01,439 --> 00:10:06,079
375
+ you know the series of connections that are coming up with such literally world changing
376
+
377
+ 95
378
+ 00:10:06,880 --> 00:10:13,439
379
+ innovations so that was I thought anyway that that that was cool okay voice training data how
380
+
381
+ 96
382
+ 00:10:13,439 --> 00:10:19,280
383
+ were we doing we're about 10 minutes and I'm still talking about voice technology so whisper was
384
+
385
+ 97
386
+ 00:10:19,280 --> 00:10:25,680
387
+ brilliant and I was so excited that I was my first instinct was to like guess it's like oh my gosh
388
+
389
+ 98
390
+ 00:10:25,680 --> 00:10:31,040
391
+ I have to get like a really good microphone for this so I didn't go on a spending spree because
392
+
393
+ 99
394
+ 00:10:31,040 --> 00:10:37,760
395
+ I said I'm gonna have to just wait a month and see if I still use this and it just kind of became
396
+
397
+ 100
398
+ 00:10:37,760 --> 00:10:44,800
399
+ it's become really part of my daily routine like if I'm writing an email I'll record a voice note
400
+
401
+ 101
402
+ 00:10:44,880 --> 00:10:50,079
403
+ and then I've developed and it's nice to see that everyone is like developing the same things in
404
+
405
+ 102
406
+ 00:10:50,079 --> 00:10:56,319
407
+ parallel like that's my kind of a weird thing to say but when I look I kind of came when I started
408
+
409
+ 103
410
+ 00:10:56,319 --> 00:11:02,640
411
+ working on this these prototypes on GitHub which is where I just kind of share very freely and loosely
412
+
413
+ 104
414
+ 00:11:03,199 --> 00:11:10,800
415
+ ideas and you know first iterations on concepts and for one of a better word I called it like
416
+
417
+ 105
418
+ 00:11:11,439 --> 00:11:17,680
419
+ LLM post processing or cleanup or basically a system prompt that after you get back the raw text
420
+
421
+ 106
422
+ 00:11:17,680 --> 00:11:25,920
423
+ from whisper you run it through model and say okay this is crappy text like add sentence structure
424
+
425
+ 107
426
+ 00:11:25,920 --> 00:11:33,199
427
+ and you know fix it up and now when I'm exploring the different tools that are out there the people
428
+
429
+ 108
430
+ 00:11:33,200 --> 00:11:39,040
431
+ have built I see quite a number of projects have basically you know done the same thing
432
+
433
+ 109
434
+ 00:11:40,640 --> 00:11:45,040
435
+ less that be misconstrued I'm not saying for a millisecond that I inspired them I'm sure this
436
+
437
+ 110
438
+ 00:11:45,040 --> 00:11:51,440
439
+ has been a thing that's been integrated into tools for a while but it's it's the kind of thing that
440
+
441
+ 111
442
+ 00:11:51,440 --> 00:11:57,520
443
+ when you start using these tools every day the need for it is almost instantly apparent because text
444
+
445
+ 112
446
+ 00:11:57,600 --> 00:12:03,520
447
+ that doesn't have any punctuation or progress basing takes a long time to you know it takes so
448
+
449
+ 113
450
+ 00:12:03,520 --> 00:12:10,079
451
+ long to get it into a presentable email that again it's it moves speech tech into that
452
+
453
+ 114
454
+ 00:12:11,280 --> 00:12:16,000
455
+ before that inflection point where you're like nah she's not worth it it's like it'll just be
456
+
457
+ 115
458
+ 00:12:16,000 --> 00:12:20,800
459
+ quicker to type this so it's it's a big it's a little touch that actually is a big deal
460
+
461
+ 116
462
+ 00:12:21,520 --> 00:12:28,319
463
+ so I was on whisper and I've been using whisper and I kind of early on find a couple of tools
464
+
465
+ 117
466
+ 00:12:28,319 --> 00:12:33,680
467
+ I couldn't find what I was looking for on Linux which is basically just something that'll run
468
+
469
+ 118
470
+ 00:12:34,800 --> 00:12:39,120
471
+ in the background you'll give it an API key and it'll just like transcribe
472
+
473
+ 119
474
+ 00:12:41,439 --> 00:12:47,359
475
+ with like a little key to start and start the dictation and the issues where I discovered that
476
+
477
+ 120
478
+ 00:12:47,440 --> 00:12:52,720
479
+ like most people involved in creating these projects were very much focused on local models
480
+
481
+ 121
482
+ 00:12:52,720 --> 00:12:58,400
483
+ and running whisper locally because you can and I tried that a bunch of times and just never
484
+
485
+ 122
486
+ 00:12:58,400 --> 00:13:03,920
487
+ got results that were as good as the cloud and when I began looking at the cost of the speech
488
+
489
+ 123
490
+ 00:13:03,920 --> 00:13:10,080
491
+ text APIs and what I was spending I just thought there is it's actually in my opinion just one of
492
+
493
+ 124
494
+ 00:13:10,080 --> 00:13:15,600
495
+ the better deals in API spending and in cloud like it's just not that expensive for very very good
496
+
497
+ 125
498
+ 00:13:15,600 --> 00:13:22,240
499
+ models that are much more you know you're going to be able to run the full model the latest model
500
+
501
+ 126
502
+ 00:13:22,240 --> 00:13:28,960
503
+ versus whatever you can run on your average GPU unless you want to buy a crazy GPU it doesn't
504
+
505
+ 127
506
+ 00:13:28,960 --> 00:13:34,000
507
+ really make sense to me and privacy is another concern that I know is kind of like a very much
508
+
509
+ 128
510
+ 00:13:34,000 --> 00:13:38,720
511
+ a separate thing that people just don't want their voice data and their voice leaving their
512
+
513
+ 129
514
+ 00:13:38,720 --> 00:13:45,360
515
+ local environment maybe for regulatory reasons as well but I'm not in that camp I don't really
516
+
517
+ 130
518
+ 00:13:45,360 --> 00:13:51,440
519
+ care about people listening to my grocery list consisting of reminding myself that I need to buy
520
+
521
+ 131
522
+ 00:13:51,440 --> 00:13:58,240
523
+ more beer cheetos and hummus which is kind of the three three staples of my diet during periods of
524
+
525
+ 132
526
+ 00:13:58,240 --> 00:14:04,560
527
+ poor nutrition but the kind of stuff that I transcribe most it's just not it's not a it's not a
528
+
529
+ 133
530
+ 00:14:04,560 --> 00:14:12,640
531
+ privacy thing I'm that sort of sensitive about and I don't do anything so you know sensitive
532
+
533
+ 134
534
+ 00:14:12,640 --> 00:14:17,680
535
+ or secure that requires airgapping so I looked at the pricing and especially the kind of older
536
+
537
+ 135
538
+ 00:14:17,680 --> 00:14:24,400
539
+ models mini some of them are very very affordable and I did it back to the I did a calculation once
540
+
541
+ 136
542
+ 00:14:24,400 --> 00:14:30,239
543
+ with ChatGPT and I was like okay this is the this is the API price for I can't remember whatever
544
+
545
+ 137
546
+ 00:14:30,320 --> 00:14:37,040
547
+ the model was let's say I just go at it like nonstop which rarely happens probably I would say an
548
+
549
+ 138
550
+ 00:14:37,040 --> 00:14:45,200
551
+ average I might dictate 30 to 60 minutes per day if I was probably summing up the emails documents
552
+
553
+ 139
554
+ 00:14:45,200 --> 00:14:51,360
555
+ outlines which is a lot but it's it's still a fairly modest amount and I was like well some days I
556
+
557
+ 140
558
+ 00:14:51,360 --> 00:14:56,720
559
+ do go on like one or two days where I've been usually when I'm like kind of out of the house and
560
+
561
+ 141
562
+ 00:14:56,720 --> 00:15:02,800
563
+ just have something like I've nothing else to do like if I'm at a hospital we have a newborn
564
+
565
+ 142
566
+ 00:15:04,000 --> 00:15:09,040
567
+ and you're waiting for like eight hours and hours for an appointment and I would probably have
568
+
569
+ 143
570
+ 00:15:09,040 --> 00:15:15,280
571
+ listened to podcasts before becoming a speech fanatic and I'm like oh wait let me just get down
572
+
573
+ 144
574
+ 00:15:15,280 --> 00:15:20,880
575
+ let me just get these ideas out of my head and that's when I'll go on my speech binges but those
576
+
577
+ 145
578
+ 00:15:20,880 --> 00:15:26,240
579
+ are like once every few months like not frequently but I said okay let's just say if I'm gonna price
580
+
581
+ 146
582
+ 00:15:26,240 --> 00:15:35,440
583
+ out cloud stt if I was like dedicated every second of every waking hour to transcribing for some
584
+
585
+ 147
586
+ 00:15:35,440 --> 00:15:41,600
587
+ odd reason I mean I'd have to like eat and use the toilet like you know there's only so many hours
588
+
589
+ 148
590
+ 00:15:41,600 --> 00:15:48,480
591
+ I'm awake for so like let's just say a maximum of like 40 or 45 minutes in the hour and I said
592
+
593
+ 149
594
+ 00:15:48,480 --> 00:15:55,360
595
+ all right let's just say 50 who knows you're dictating on the toilet we do it so you could just do 60
596
+
597
+ 150
598
+ 00:15:55,440 --> 00:16:02,560
599
+ but whatever I did and every day like you're going flat out seven days a week dictating nonstop
600
+
601
+ 151
602
+ 00:16:02,560 --> 00:16:08,640
603
+ as like what's my monthly API bill gonna be at this price and it came out to like seventy to
604
+
605
+ 152
606
+ 00:16:08,640 --> 00:16:14,960
607
+ 80 bucks and I was like well that would be an extraordinary amount of dictation and I would hope
608
+
609
+ 153
610
+ 00:16:15,600 --> 00:16:21,680
611
+ that there was some compelling reason more worth more than 70 dollars that I embarked upon that
612
+
613
+ 154
614
+ 00:16:22,640 --> 00:16:26,959
615
+ so given that that's kind of the max point for me I said that's actually very very affordable
616
+
617
+ 155
618
+ 00:16:27,920 --> 00:16:32,640
619
+ now you're gonna if you want to spec out the costs and you want to do the post processing
620
+
621
+ 156
622
+ 00:16:33,599 --> 00:16:39,199
623
+ that I really do feel is valuable that's gonna cost more as well unless you're using
624
+
625
+ 157
626
+ 00:16:40,160 --> 00:16:47,839
627
+ Gemini which needless to say as a random person sitting in Jerusalem I have no affiliation nor with
628
+
629
+ 158
630
+ 00:16:47,840 --> 00:16:54,800
631
+ Google nor anthropic nor Gemini nor any major tech vendor for that matter um I like Gemini
632
+
633
+ 159
634
+ 00:16:54,800 --> 00:17:00,080
635
+ not so much as an everyday model um it's kind of underwhelmed in that respect I would say
636
+
637
+ 160
638
+ 00:17:00,080 --> 00:17:05,920
639
+ but for multimodal I think it's got a lot to offer and I think that the transcribing functionality
640
+
641
+ 161
642
+ 00:17:05,920 --> 00:17:13,280
643
+ whereby it can um process audio with a system prompt and both give you a transcription that's
644
+
645
+ 162
646
+ 00:17:13,280 --> 00:17:20,079
647
+ cleaned up that reduces two steps to one and that for me is a very very big deal and uh I feel like
648
+
649
+ 163
650
+ 00:17:20,079 --> 00:17:27,280
651
+ even Google hasn't really sort of thought through how useful that modality is and the kind of
652
+
653
+ 164
654
+ 00:17:27,280 --> 00:17:33,280
655
+ use cases uh you can achieve with it because I found in the course of this year just an endless
656
+
657
+ 165
658
+ 00:17:33,280 --> 00:17:40,399
659
+ list of really kind of system prompt system prompt stuff that I can say okay I've used it
660
+
661
+ 166
662
+ 00:17:40,560 --> 00:17:45,920
663
+ for capturing context data for AI which is literally I might speak if I wanted to have a good
664
+
665
+ 167
666
+ 00:17:45,920 --> 00:17:52,560
667
+ bank of context data about who knows my childhood uh more realistically maybe my career goals
668
+
669
+ 168
670
+ 00:17:53,520 --> 00:17:59,520
671
+ something that would just be like really boring to type out so I'll just like sit in my car
672
+
673
+ 169
674
+ 00:17:59,520 --> 00:18:06,640
675
+ and record it for 10 minutes and that's 10 minutes you get a lot of information in um emails which is
676
+
677
+ 170
678
+ 00:18:06,640 --> 00:18:13,200
679
+ short text uh just there is a whole bunch and all these workflows kind of require a little bit
680
+
681
+ 171
682
+ 00:18:13,200 --> 00:18:18,320
683
+ of treatment afterwards and different treatment my context pipeline is kind of like just extract the
684
+
685
+ 172
686
+ 00:18:18,320 --> 00:18:23,520
687
+ bare essential so you end up with me talking very loosely about sort of what I've done in my career
688
+
689
+ 173
690
+ 00:18:23,520 --> 00:18:30,000
691
+ where I've worked where I'd like to work and it goes it condenses that down to very robotic language
692
+
693
+ 174
694
+ 00:18:30,000 --> 00:18:36,000
695
+ that is easy to chunk and parse and maybe put into a vector database Daniel has worked in technology
696
+
697
+ 175
698
+ 00:18:36,080 --> 00:18:42,400
699
+ Daniel is a has been working in martino stuff like that that's not how you would speak um but I
700
+
701
+ 176
702
+ 00:18:42,400 --> 00:18:48,480
703
+ figure it's probably easier to parse for after all robots so we've almost got to 20 minutes and this
704
+
705
+ 177
706
+ 00:18:48,480 --> 00:18:56,880
707
+ is actually a success because I wasted 20 minutes of my uh of the evening speaking into a microphone and
708
+
709
+ 178
710
+ 00:18:56,880 --> 00:19:02,720
711
+ the levels were shot and it uh it was clipping and I said I can't read you an evaluation I have to
712
+
713
+ 179
714
+ 00:19:02,720 --> 00:19:09,440
715
+ be fair I have to give the models a chance to do their thing uh what am I hoping to achieve in this
716
+
717
+ 180
718
+ 00:19:09,440 --> 00:19:14,960
719
+ okay my fine tune was a dud as mentioned deepgram stt I'm really really hopeful that this prototype
720
+
721
+ 181
722
+ 00:19:14,960 --> 00:19:20,560
723
+ will work and it's a built in public open source so anyone is welcome to use it if I make anything good
724
+
725
+ 182
726
+ 00:19:21,600 --> 00:19:28,000
727
+ but that was really exciting for me last night when after hours of um trying my own prototype seeing
728
+
729
+ 183
730
+ 00:19:28,080 --> 00:19:33,120
731
+ someone just made something that works like that you know you're not going to have to build a custom
732
+
733
+ 184
734
+ 00:19:34,240 --> 00:19:40,960
735
+ conda environment and image I have an AMD GPU which makes things much more complicated I didn't find it
736
+
737
+ 185
738
+ 00:19:41,840 --> 00:19:46,400
739
+ and I was about to give up and I said all right let me just give deepgram's Linux thing a shot
740
+
741
+ 186
742
+ 00:19:47,040 --> 00:19:50,960
743
+ and if this doesn't work um I'm just going to go back to trying to vibe code something myself
744
+
745
+ 187
746
+ 00:19:51,600 --> 00:19:57,360
747
+ and when I ran the script I was using claude code to do the installation process
748
+
749
+ 188
750
+ 00:19:58,160 --> 00:20:02,800
751
+ it ran the script and oh my gosh it works just like that uh the tricky thing
752
+
753
+ 189
754
+ 00:20:04,480 --> 00:20:12,480
755
+ for all those ones and all the nitty-ditty-ditty-gritty details um was that I don't think it was actually
756
+
757
+ 190
758
+ 00:20:12,480 --> 00:20:18,160
759
+ struggling with transcription but pasting Wayland makes life very hard and I think there was
760
+
761
+ 191
762
+ 00:20:18,160 --> 00:20:22,800
763
+ something not running at the right time anyway deepgram I looked at how they actually handled
764
+
765
+ 192
766
+ 00:20:22,960 --> 00:20:28,960
767
+ that because it worked out of the box when other stuff didn't and it was quite a clever little mechanism
768
+
769
+ 193
770
+ 00:20:29,520 --> 00:20:34,560
771
+ and but more so than that the accuracy was brilliant now what am I doing here this is going to be a 20
772
+
773
+ 194
774
+ 00:20:34,560 --> 00:20:44,399
775
+ minute audio uh sample and I'm I think I've done one or two of these before but I did it with
776
+
777
+ 195
778
+ 00:20:45,360 --> 00:20:51,120
779
+ short snappy voice notes this is kind of long form this actually might be a better approximation
780
+
781
+ 196
782
+ 00:20:51,120 --> 00:20:55,040
783
+ for what's useful to me than voice memos like I need to buy three
784
+
785
+ 197
786
+ 00:20:55,840 --> 00:20:59,840
787
+ liters of milk tomorrow and pita bread which is probably how like half my voice note
788
+
789
+ 198
790
+ 00:20:59,840 --> 00:21:04,399
791
+ voice notes sound like if anyone were to I don't know like find my phone they'd be like this is
792
+
793
+ 199
794
+ 00:21:04,399 --> 00:21:09,280
795
+ the most boring person in the world although actually there are some like kind of uh journaling
796
+
797
+ 200
798
+ 00:21:09,280 --> 00:21:14,080
799
+ thoughts as well but it's a lot of content like that and the probably for the evaluation the most
800
+
801
+ 201
802
+ 00:21:14,080 --> 00:21:22,560
803
+ useful thing is slightly obscure tech github new cleano hugging face not so obscure that it's not
804
+
805
+ 202
806
+ 00:21:22,560 --> 00:21:27,360
807
+ going to have a chance of knowing it but hopefully sufficiently well known that the models should get
808
+
809
+ 203
810
+ 00:21:27,360 --> 00:21:32,800
811
+ it uh I tried to do a little bit of speaking really fast and speaking very slowly I would say in
812
+
813
+ 204
814
+ 00:21:32,800 --> 00:21:38,960
815
+ general I've delivered this at a faster pace than I usually would owing to strong coffee
816
+
817
+ 205
818
+ 00:21:39,120 --> 00:21:44,240
819
+ flowing through my bloodstream and the thing that I'm not going to get in this benchmark is
820
+
821
+ 206
822
+ 00:21:44,240 --> 00:21:49,920
823
+ background noise in my first take which I had to get rid of my wife came in with my son
824
+
825
+ 207
826
+ 00:21:49,920 --> 00:21:55,680
827
+ and for a good night kiss and that actually would have been super helpful to get in because it was
828
+
829
+ 208
830
+ 00:21:56,400 --> 00:22:01,600
831
+ non-diarized sorry if we had diarization a female I could say I want the male voice and that
832
+
833
+ 209
834
+ 00:22:01,600 --> 00:22:06,240
835
+ wasn't intended for transcription um and we're not going to get background noise like people
836
+
837
+ 210
838
+ 00:22:06,240 --> 00:22:11,840
839
+ honking their horns which is something I've done in my main data set where I am trying to go back
840
+
841
+ 211
842
+ 00:22:11,840 --> 00:22:16,880
843
+ to some of my voice notes annotate them and run a benchmark but this is going to be just a pure
844
+
845
+ 212
846
+ 00:22:17,680 --> 00:22:24,960
847
+ quick test and as someone I'm working on a voice note idea that's my sort of end
848
+
849
+ 213
850
+ 00:22:26,560 --> 00:22:30,320
851
+ motivation besides thinking it's an absolutely outstanding technology that's coming to
852
+
853
+ 214
854
+ 00:22:30,960 --> 00:22:36,240
855
+ viability and really I know it sounds cheesy it can actually have a very transformative effect
856
+
857
+ 215
858
+ 00:22:37,120 --> 00:22:42,720
859
+ it's you know voice technology has been life changing for folks living with
860
+
861
+ 216
862
+ 00:22:44,000 --> 00:22:49,760
863
+ disabilities and I think there's something really nice about the fact that it can also benefit
864
+
865
+ 217
866
+ 00:22:50,480 --> 00:22:54,639
867
+ you know folks who are able bodied and like we can all in different ways
868
+
869
+ 218
870
+ 00:22:55,120 --> 00:23:02,560
871
+ um make this tech as useful as possible regardless of the exact way that we're using it um and I
872
+
873
+ 219
874
+ 00:23:02,560 --> 00:23:07,760
875
+ think there's something very powerful in that and it can be very cool um I see huge potential what
876
+
877
+ 220
878
+ 00:23:07,760 --> 00:23:14,480
879
+ excites me about voice tech a lot of things actually firstly the fact that it's cheap and accurate
880
+
881
+ 221
882
+ 00:23:14,480 --> 00:23:19,040
883
+ as I mentioned at the very start of this um and it's getting better and better with stuff like
884
+
885
+ 222
886
+ 00:23:19,040 --> 00:23:24,160
887
+ accent handling um I'm not sure my fight my fine tune will actually ever come to fruition in the
888
+
889
+ 223
890
+ 00:23:24,160 --> 00:23:30,240
891
+ sense that I'll use it day to day as I imagined and get like super flawless word error rates because
892
+
893
+ 224
894
+ 00:23:30,240 --> 00:23:37,680
895
+ I'm just kind of skeptical about local speech to text as I mentioned and I think the pace of
896
+
897
+ 225
898
+ 00:23:37,680 --> 00:23:42,720
899
+ innovation and improvement in the models the main reasons for fine tuning from what I've seen
900
+
901
+ 226
902
+ 00:23:44,320 --> 00:23:50,480
903
+ have been people who are something that really blows blows my mind about ASR is the idea that it's
904
+
905
+ 227
906
+ 00:23:50,480 --> 00:24:00,080
907
+ inherently you know multilingual phonetic based so folks who speak very obscure languages
908
+
909
+ 228
910
+ 00:24:00,080 --> 00:24:04,800
911
+ that there may be there there might be a paucity of training data or almost none at all and therefore
912
+
913
+ 229
914
+ 00:24:04,800 --> 00:24:11,440
915
+ the accuracy is significantly reduced or folks in very critical environments I know there are
916
+
917
+ 230
918
+ 00:24:11,440 --> 00:24:17,680
919
+ you know this is used extensively in medical transcription and dispatcher work as um you know the call
920
+
921
+ 231
922
+ 00:24:17,680 --> 00:24:24,000
923
+ centers who send out ambulances etc where accuracy is absolutely paramount and in the case of doctors
924
+
925
+ 232
926
+ 00:24:24,560 --> 00:24:29,680
927
+ radiologists they might be using very specialized vocab all the time so those are kind of the main
928
+
929
+ 233
930
+ 00:24:29,680 --> 00:24:35,680
931
+ two things and I'm not sure that really just for trying to make it better on a few random tech words
932
+
933
+ 234
934
+ 00:24:35,680 --> 00:24:41,840
935
+ with my slightly I mean I have an accent but like not you know an accent that a few other million
936
+
937
+ 235
938
+ 00:24:41,840 --> 00:24:50,720
939
+ people have ish I'm not sure that my little fine tune is going to actually be worth the bump in
940
+
941
+ 236
942
+ 00:24:50,720 --> 00:24:55,760
943
+ word error reduction if I ever actually figure out how to do it and get it up to the cloud by the
944
+
945
+ 237
946
+ 00:24:55,760 --> 00:25:00,879
947
+ time we've done that I suspect that the next generation of ASR will just be so good that it will
948
+
949
+ 238
950
+ 00:25:00,879 --> 00:25:07,040
951
+ kind of be well that would be cool for a dive but I'll just use this instead so that's going to be
952
+
953
+ 239
954
+ 00:25:07,280 --> 00:25:15,040
955
+ it for today's episode of voice training data single long shot evaluation who am I going to
956
+
957
+ 240
958
+ 00:25:15,040 --> 00:25:21,200
959
+ compare whisper is always good as a benchmark but I'm more interested in seeing whisper head-to-head
960
+
961
+ 241
962
+ 00:25:21,200 --> 00:25:27,680
963
+ with two things really one is whisper variants so you've got these projects like faster whisper
964
+
965
+ 242
966
+ 00:25:29,120 --> 00:25:34,000
967
+ distilled whisper it's a bit confusing there's a whole bunch of them and the emerging ASRs which
968
+
969
+ 243
970
+ 00:25:34,160 --> 00:25:38,960
971
+ are also a thing my intention for this is I'm not sure I'm going to have the time at any point
972
+
973
+ 244
974
+ 00:25:38,960 --> 00:25:46,320
975
+ of the foreseeable future to go back to this whole episode and create a proper source of truth where I fix
976
+
977
+ 245
978
+ 00:25:47,520 --> 00:25:53,760
979
+ everything might do it if I can get one transcription that's sufficiently close to perfection
980
+
981
+ 246
982
+ 00:25:54,960 --> 00:26:00,560
983
+ but what I would actually love to do on hugging face I think would be a great probably how I might
984
+
985
+ 247
986
+ 00:26:00,560 --> 00:26:08,080
987
+ visualize this is having the audio waveform play and then have the transcript for each model below
988
+
989
+ 248
990
+ 00:26:08,080 --> 00:26:16,320
991
+ it and maybe even a like you know two scale and maybe even a local one as well like local whisper
992
+
993
+ 249
994
+ 00:26:16,320 --> 00:26:23,919
995
+ versus open AI API etc and I can then actually listen back to segments or anyone who wants to
996
+
997
+ 250
998
+ 00:26:24,000 --> 00:26:30,000
999
+ can listen back to segments of this recording and see where a particular model struggled
1000
+
1001
+ 251
1002
+ 00:26:30,000 --> 00:26:35,600
1003
+ while others didn't as well as the sort of headline finding of which had the best WER but that would
1004
+
1005
+ 252
1006
+ 00:26:35,600 --> 00:26:41,120
1007
+ require the source of truth okay that's it hope this was I don't know maybe useful for other
1008
+
1009
+ 253
1010
+ 00:26:41,120 --> 00:26:46,480
1011
+ folks interested in STT you want to see that I always feel think I've just said something I
1012
+
1013
+ 254
1014
+ 00:26:46,480 --> 00:26:52,800
1015
+ didn't intend to STT I said for those listening carefully including hopefully the models themselves
1016
+
1017
+ 255
1018
+ 00:26:53,280 --> 00:26:58,960
1019
+ this has been myself Daniel Rosehill for more um jumbled repositories about my uh roving interests
1020
+
1021
+ 256
1022
+ 00:26:58,960 --> 00:27:06,639
1023
+ in AI but particularly agentic mcp and voice tech you can find me on github hugging face
1024
+
1025
+ 257
1026
+ 00:27:08,080 --> 00:27:14,000
1027
+ where else daniel rosehill dot com which is my personal website as well as this podcast whose name
1028
+
1029
+ 258
1030
+ 00:27:14,000 --> 00:27:17,280
1031
+ I sadly cannot remember until next time thanks for listening
1032
+
data/inference/runs/local-stt/run-3/transcript.txt ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast. Or, I may append this to a podcast that I set up recently regarding my with my thoughts on speech tech and AI in particular. More AI in generative AI I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation as they might say for different speech attacks models. And I'm doing this because I I thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in the elusive task of fine-tuning whisper. Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking. I might whisper something at some points as well. And I'll go back to speaking loud in different parts. I'm going to send really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to push a speech attacks model through its paces, which is trying to make sense of is this guy just rambling on and coherently in one long sentence or are these just actually series of step standalone standalone sentences? And how is it going to handle step alone? That's not a word. What happens when you use speech attacks and you use a fake word and then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why did why was it trying to find China whisper? And what is whisper? As I said, I'm going to try to record this at a couple of different levels of technicality for folks who are in the normal world and not totally stuck down the rabbit hole of AI. What you have to say is a really wonderful rabbit hole to be to be done. 
It's a really interesting area and speech and voice tech is the aspect of it that I find actually most I'm not sure I would say the most interesting because there's just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I am persevering hard with the task of training, I guess, a good solution working for Linux. Would you have anyone actually does listen to this not just for the training data and for the actual content? This is this is sparked. I had, besides the fine tune not working, well that was the failure. I used plot code because one thing these days that there is nothing short of solving, you know, the reason of life or something that's flawed and agentically I can't do, which is not really the case. It does seem that way sometimes but it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data, basically speaking just random things for three minutes and
2
+
3
+ it was actually kind of tedious because the text were really weird. Some of them were it was like, it was AI generated. I tried before to reach Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was like, okay, knowing just gonna have to find something else to read. So I used a created with AI Studio, a vibe code is a synthetic text generator, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note. Like I'm recording an email, give me a short story to read, give me pros to read. So I came up with all these different things and they added a little timer to it so I could see how close I was to one hour and I spent like an hour one afternoon or probably two hours by the time you do retakes and whatever because you want to, it gave me a source of truth which I'm not sure if that's the scientific way to approach this. Topic of gathering training data but I thought made sense. I have a lot of audio data from recording voice notes which I've also kind of used being experimenting with using for a different purpose. It's slightly different annotating task types. It's more text classification experiment or well it's more than that actually working on a voice app so it's a prototype I guess is really more accurate.
4
+
5
+ But you can do that and you can work backwards. You listen back to a voice note and you painfully go through one of those transcribing where you start and stop and scrub around it and you fix the errors but it's really really pouring to do that. So I thought it would be last tedious in the long term if I just recorded this source of truth so it gave me these three minute snippets. I recorded them it saved in MP3 and the TXT in the same folder and I created an error that data. So I was very hopeful quite a little bit hopeful that I would be able that I could actually fine tune whisper. I want to fine tune whisper because when I got into voice tech last November my wife was in the US and I was alone at home and when crazy people like me do really wild things like use voice to tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high. I used speech tech now and again try it out. It's like it'd be really cool if you could just like speak into your computer and whatever I tried out that had Linux support was just it was not good basically and this blew me away from the first go. I mean it wasn't 100% accurate out of the box and it took work but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. There's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for it's speech attacks to be a worthwhile addition to your productivity but you do need to get above let's say I don't know 85% if it's 60% or 50% you inevitably say screw it I'll just type it because you end up missing errors in the transcript and it becomes actually worse you end up in a worse position than you started with it that's been my experience. So I was like oh this is actually really really good now how did that happen? 
The answer is ASR with per being open-sourced and the transformer architecture if you want to go back to the to the underpinnings which really blows my mind and it's on my list to read through that paper all you need is attention as attentively as can be done with my limited brain because it's super high-level stuff super advanced stuff I mean but that I think of all the things that are fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating a few people are like hang on you've got this thing that can speak to you like a chatbot and LLM then you've got image generation okay so firstly those two things on the surface have nothing in common so like how are they how did that just happen all at the same time and then when you extend that further you're like sooner right you can sing a song an AI will like come up with an instrumental and then you've got whisper and you're like wait a second how did all this stuff like if it's all AI what's like there has to be some commonality otherwise these are four these totally different technologies on the surface of it and the transformer architecture is as far as I know the answer and I can't even say I can't even pretend that I really understand what the transformer architecture means in depth but I have scandis and as I said I want to print it and really kind of think over it's at some point and I'll probably feel bad about myself I think because weren't those guys in their in their 20s like that's crazy I think I asked chat Gbt once who were the who wrote that paper and how old were they when it was published in ARC and I was expecting like I don't know what do you what do you imagine I personally imagine kind of like you know you have these breakthroughs during COVID and things like that were like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring labs and we're really in writing and publishing and kind of obscure academic publications 
and they finally like hit a big or win a Nobel Prize and then their household household names so I that was kind of what I had in mind that was the mental image I'd formed of the birth of ARC like I wasn't expecting 20 somethings in San Francisco though I thought that was both very very funny very cool and actually kind of inspiring it's nice to think that people who you know just you might put them in the kind of milieu or bubble or world that you are in are credibly in through you know the series of connections that are coming up with such literally world changing innovations so that was I thought anyway that that that was cool okay voice training data how were we doing we're about 10 minutes and I'm still talking about voice technology so whisper was brilliant and I was so excited that I was my first instinct was to like guess it's like oh my gosh I have to get like a really good microphone for this so I didn't go on a spending spree because I said I'm gonna have to just wait a month and see if I still use this and it just kind of became it's become really part of my daily routine like if I'm writing an email I'll record a voice note and then I've developed and it's nice to see that everyone is like developing the same things in parallel like that's my kind of a weird thing to say but when I look I kind of came when I started working on this these prototypes on GitHub which is where I just kind of share very freely and loosely ideas and you know first iterations on concepts and for one of a better word I called it like LLM post processing or cleanup or basically a system prompt that after you get back the raw text from whisper you run it through model and say okay this is crappy text like add sentence structure and you know fix it up and now when I'm exploring the different tools that are out there the people have built I see quite a number of projects have basically you know done the same thing less that be misconstrued I'm not saying for a millisecond that I 
inspired them I'm sure this has been a thing that's been integrated into tools for a while but it's it's the kind of thing that when you start using these tools every day the need for it is almost instantly apparent because text that doesn't have any punctuation or progress basing takes a long time to you know it takes so long to get it into a presentable email that again it's it moves speech tech into that before that inflection point where you're like nah she's not worth it it's like it'll just be quicker to type this so it's it's a big it's a little touch that actually is a big deal so I was on whisper and I've been using whisper and I kind of early on find a couple of tools I couldn't find what I was looking for on Linux which is basically just something that'll run in the background you'll give it an API key and it'll just like transcribe
6
+
7
with a little hotkey to start and stop the dictation. The issue was that I discovered most people creating these projects were very much focused on local models and running Whisper locally, because you can. I tried that a bunch of times and just never got results as good as the cloud. And when I began looking at the cost of the speech-to-text APIs and what I was spending, I thought: this is actually, in my opinion, one of the better deals in API and cloud spending. It's just not that expensive for very, very good models; you get to run the full, latest model versus whatever you can run on your average GPU, and unless you want to buy a crazy GPU it doesn't really make sense to me. Privacy is another concern, which I know is very much a separate thing: people just don't want their voice data leaving their local environment, maybe for regulatory reasons as well. But I'm not in that camp. I don't really care about people listening to my grocery list, which consists of reminding myself that I need to buy more beer, Cheetos, and hummus, the three staples of my diet during periods of poor nutrition. The kind of stuff I transcribe most just isn't something I'm that privacy-sensitive about, and I don't do anything so sensitive or secure that it requires air-gapping. So I looked at the pricing, and especially the older and mini models, some of them are very, very affordable. I did a calculation once with ChatGPT: okay, this is the API price for, I can't remember whatever the model was; let's say I just go at it nonstop, which rarely happens. On average I might dictate 30 to 60 minutes per day, summing up the emails, documents, and outlines, which is a lot, but still a fairly modest amount. And I was
like, well, some days I do go longer. There have been one or two days, usually when I'm out of the house with nothing else to do, say when you're at a hospital with a newborn, waiting hours for an appointment, where before becoming a speech fanatic I would have listened to podcasts, and now I think: wait, let me just get these ideas out of my head. That's when I go on my speech binges, but those happen once every few months, not frequently. So I said, okay, let's price out the worst case: if I dedicated every second of every waking hour to transcribing for some odd reason. I mean, I have to eat and use the toilet, there are only so many hours I'm awake, so let's say a maximum of 45 minutes of speech in each hour. All right, let's say 50, who knows, maybe you're dictating on the toilet; you could even say 60, but whatever. Every day, going flat out, seven days a week, dictating nonstop: what's my monthly API bill going to be at this price? It came out to something like 70 or 80 bucks, and I thought, well, that would be an extraordinary amount of dictation, and I would hope there was some compelling reason worth more than 70 dollars for embarking on it. Given that that's the maximum for me, that's actually very, very affordable. Now, if you want to spec out the costs and you want the post-processing, which I really do feel is valuable, that's going to cost more as well, unless you're using Gemini. Needless to say, I'm a random person sitting in Jerusalem with no affiliation with Google, nor Anthropic, nor any major tech vendor for that matter. I like Gemini, not so much as an everyday model, where it's kind of underwhelmed me, I would say, but for multimodal I think it's got a lot to offer, and I think
that the transcription functionality, whereby it can process audio together with a system prompt and give you back a transcript that's already cleaned up, reduces two steps to one, and that for me is a very big deal. I feel like even Google hasn't really thought through how useful that modality is and how many use cases you can achieve with it, because over the course of this year I've found an endless list of system-prompt-driven workflows. I've used it to capture context data for AI, which means I might speak for ten minutes about, who knows, my childhood, or more realistically my career goals, something that would be really boring to type out; I'll just sit in my car and record it, and in ten minutes you get a lot of information. Emails, short texts, there's a whole bunch, and all these workflows require a little bit of treatment afterwards, and different treatment. My context pipeline just extracts the bare essentials, so you start with me talking very loosely about what I've done in my career, where I've worked, where I'd like to work, and it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database: "Daniel has worked in technology," stuff like that. That's not how you would speak, but I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes, and this is actually a success, because yesterday I wasted 20 minutes of the evening speaking into a microphone where the levels were shot and it was clipping, and I said, I can't base an evaluation on that; I have to be fair and give the models a chance to do their thing. What am I hoping to achieve here? Okay, my fine-tune was a dud, as mentioned. Deepgram STT: I'm really, really hopeful that this prototype will
work, and it's built in public, open source, so anyone is welcome to use it if I make anything good. It was really exciting last night when, after hours of trying my own prototype, I saw someone had just made something that works, where you don't have to build a custom conda environment and image. I have an AMD GPU, which makes things much more complicated; I couldn't get it working and was about to give up, and I said, all right, let me just give Deepgram's Linux thing a shot, and if this doesn't work, I'll go back to trying to vibe-code something myself. When I ran the script, I was using Claude Code to handle the installation process; it ran the script, and oh my gosh, it worked just like that. The tricky thing, for anyone who wants the nitty-gritty details, was that I don't think the earlier tools were actually struggling with transcription, but with pasting: Wayland makes life very hard, and I think something wasn't running at the right time. Anyway, I looked at how Deepgram handled that, because it worked out of the box when other stuff didn't, and it was quite a clever little mechanism. But more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20-minute audio sample. I think I've done one or two of these before, but with short, snappy voice notes; this is long-form, which actually might be a better approximation of what's useful to me than voice memos like "I need to buy three liters of milk tomorrow and pita bread," which is probably how half my voice notes sound. If anyone were to find my phone, they'd think this is the most boring person in the world, although actually there are some journaling thoughts as well. But it's a lot of content like that, and probably the most useful thing for the evaluation is slightly obscure tech vocabulary: GitHub, Hugging Face, and the like; not so obscure that the models won't have a
chance of knowing it, but hopefully sufficiently well known that they should get it. I tried to do a little bit of speaking really fast and speaking very slowly. In general I've delivered this at a faster pace than I usually would, owing to strong coffee flowing through my bloodstream. The thing I'm not going to get into in this sample is background noise. In my first take, which I had to get rid of, my wife came in with my son for a good-night kiss, and that actually would have been super helpful to keep, because with diarization I could say I only want the male voice, since the female voice wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I do have in my main dataset, where I'm trying to go back to some of my voice notes, annotate them, and run a benchmark. This is going to be just a pure quick test. As mentioned, I'm working on a voice note idea; that's my end motivation, besides thinking this is an absolutely outstanding technology that's coming to viability and, as cheesy as it sounds, can have a very transformative effect. Voice technology has been life-changing for folks living with disabilities, and I think there's something really nice about the fact that it can also benefit folks who are able-bodied, and that we can all, in different ways, make this tech as useful as possible regardless of the exact way we're using it. I think there's something very powerful in that, and it can be very cool. I see huge potential. What excites me about voice tech? A lot of things, actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start, and it's getting better and better with stuff like accent handling. I'm not sure my fine-tune will ever come to fruition in the sense that I'll use it day to day as I imagined and get
near-flawless word error rates, because I'm just kind of skeptical about local speech-to-text, as I mentioned, and about keeping pace with the innovation and improvement in the models. The main reasons for fine-tuning, from what I've seen, and something that really blows my mind about ASR is that it's inherently multilingual and phonetically based, are folks who speak very obscure languages, where there might be a paucity of training data or almost none at all, so accuracy is significantly reduced; and folks in very critical environments. I know this is used extensively in medical transcription and in dispatch work, the call centers that send out ambulances and so on, where accuracy is absolutely paramount, and doctors and radiologists might be using very specialized vocab all the time. Those are the main two cases, and I'm not sure that just trying to make it better on a few random tech words, with my accent, which a few other million people have, ish, is worth it. I'm not sure my little fine-tune will actually deliver the bump in word error reduction, and by the time I figure out how to do it and get it up to the cloud, I suspect the next generation of ASR will just be so good that it'll be a case of "well, that would have been cool," and I'll use that instead. So that's going to be it for today's episode of voice training data: a single long-shot evaluation. Who am I going to compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head-to-head with two things, really. One is Whisper variants: you've got these projects like faster-whisper and Distil-Whisper, it's a bit confusing, there's a whole bunch of them. The other is the emerging ASRs, which are also a thing. My intention for this: I'm not sure I'm going to have the time at any point in the
foreseeable future to go back through this whole episode and create a proper source of truth where I fix everything. I might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face, and probably how I might visualize this, is to have the audio waveform play and the transcript for each model below it, maybe even with a toggle, and maybe a local option as well: local Whisper versus the OpenAI API, etc. Then I, or anyone who wants to, can listen back to segments of this recording and see where a particular model struggled while others didn't, as well as the headline finding of which had the best WER. But that would require the source of truth. Okay, that's it. Hope this was, I don't know, maybe useful for other folks interested in STT. You see, I always think I've just said something I didn't intend to. STT, I said, for those listening carefully, including hopefully the models themselves. This has been myself, Daniel Rosehill. For more jumbled repositories about my roving interests in AI, particularly agentic, MCP, and voice tech, you can find me on GitHub and Hugging Face, and, where else, danielrosehill.com, which is my personal website, as well as on this podcast, whose name I sadly cannot remember. Until next time, thanks for listening.
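The "LLM post-processing" step described earlier — running raw Whisper output through a model with a cleanup system prompt — can be sketched roughly as below. The prompt wording and the `callLLM` hook are illustrative assumptions, not any particular tool's API:

```javascript
// Sketch of the two-step STT pipeline: raw transcript in, cleaned prose out.
// callLLM is a placeholder for whatever chat-completion client you use.
function buildCleanupPrompt(rawTranscript) {
  return [
    "The text below came from speech-to-text and has no punctuation or paragraphs.",
    "Add sentence structure and paragraph breaks; fix obvious mis-transcriptions.",
    "Do not add, remove, or reorder any information.",
    "",
    rawTranscript,
  ].join("\n");
}

async function cleanTranscript(rawTranscript, callLLM) {
  const prompt = buildCleanupPrompt(rawTranscript);
  return callLLM(prompt); // resolves to the cleaned-up text
}
```

With a multimodal model like Gemini, the same instruction can accompany the audio itself, collapsing transcription and cleanup into one call, which is the one-step workflow praised above.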
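The back-of-the-envelope cost estimate above generalizes to a one-line formula. The per-minute rate below is an assumed placeholder, roughly in line with budget cloud STT tiers; substitute your provider's published price:

```javascript
// Monthly STT API spend: minutes dictated per day × per-minute rate × days.
function monthlySttCostUSD(minutesPerDay, ratePerMinuteUSD, daysPerMonth = 30) {
  return minutesPerDay * ratePerMinuteUSD * daysPerMonth;
}

const rate = 0.006; // assumed $/minute; check your provider's pricing page

// Typical usage described above: 30-60 minutes of dictation per day.
const typical = monthlySttCostUSD(45, rate); // ≈ $8 per month

// "Flat out" ceiling: ~45 minutes of speech in every hour of a 16-hour day.
const ceiling = monthlySttCostUSD(45 * 16, rate); // order of $100+ per month
```

The point of the episode's calculation survives any reasonable rate: even the absurd flat-out ceiling stays in low-double or low-triple digits per month, while realistic daily use is a few dollars.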
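The headline metric mentioned, WER, is word-level edit distance divided by the reference length. A minimal implementation (no normalization for casing or punctuation, which a real benchmark against the ground truth would want):

```javascript
// Word error rate: (substitutions + insertions + deletions) / reference words,
// computed with a standard Levenshtein dynamic-programming table over words.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);
  // d[i][j] = edit distance between first i reference and first j hypothesis words
  const d = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,      // deletion
        d[i][j - 1] + 1,      // insertion
        d[i - 1][j - 1] + cost // substitution or match
      );
    }
  }
  return d[ref.length][hyp.length] / ref.length;
}
```

Running each model's SRT text through this against the ground-truth transcript would yield the head-to-head WER table the episode describes.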
index.html CHANGED
@@ -1,19 +1,250 @@
1
  <!doctype html>
2
- <html>
3
  <head>
4
  <meta charset="utf-8" />
5
  <meta name="viewport" content="width=device-width" />
6
- <title>My static Space</title>
7
  <link rel="stylesheet" href="style.css" />
8
  </head>
9
  <body>
10
- <div class="card">
11
- <h1>Welcome to your static Space!</h1>
12
- <p>You can modify this app directly by editing <i>index.html</i> in the Files and versions tab.</p>
13
- <p>
14
- Also don't forget to check the
15
- <a href="https://huggingface.co/docs/hub/spaces" target="_blank">Spaces documentation</a>.
16
- </p>
17
- </div>
 
18
  </body>
19
  </html>
 
1
  <!doctype html>
2
+ <html lang="en">
3
  <head>
4
  <meta charset="utf-8" />
5
  <meta name="viewport" content="width=device-width" />
6
+ <title>STT Comparison Playground</title>
7
  <link rel="stylesheet" href="style.css" />
8
  </head>
9
  <body>
10
+ <main class="app">
11
+ <section class="hero">
12
+ <div>
13
+ <h1>Speech-to-Text Comparison</h1>
14
+ <p>
15
+ Play the sample podcast and compare how each transcription model handled it.
16
+ The ground-truth reference stays on top in green so you can quickly gauge accuracy.
17
+ </p>
18
+ </div>
19
+ <div class="audio-shell">
20
+ <audio id="audio" controls preload="auto" src="data/audio/podcast.mp3"></audio>
21
+ <canvas id="waveform" role="img" aria-label="Audio waveform preview"></canvas>
22
+ </div>
23
+ </section>
24
+ <section class="transcripts">
25
+ <div id="reference-track" aria-live="polite"></div>
26
+ <div class="models-grid" id="models-grid" aria-live="polite"></div>
27
+ </section>
28
+ </main>
29
+ <script src="transcripts.js"></script>
30
+ <script type="module">
31
+ const referenceTrackEl = document.getElementById("reference-track");
32
+ const modelsGridEl = document.getElementById("models-grid");
33
+ const audioElem = document.getElementById("audio");
34
+ const waveformCanvas = document.getElementById("waveform");
35
+ const transcriptSources = window.TRANSCRIPTS || {};
36
+ const tracks = [
37
+ {
38
+ id: "truth",
39
+ label: "Ground Truth",
40
+ file: "data/ground-truth/truth_1.srt",
41
+ accent: "#00b894",
42
+ emphasis: true
43
+ },
44
+ {
45
+ id: "assembly",
46
+ label: "AssemblyAI",
47
+ file: "srt-out/assembly.srt",
48
+ accent: "#4070f4"
49
+ },
50
+ {
51
+ id: "gladia",
52
+ label: "Gladia",
53
+ file: "srt-out/gladia.srt",
54
+ accent: "#9b5de5"
55
+ },
56
+ {
57
+ id: "nova3",
58
+ label: "Deepgram Nova 3",
59
+ file: "srt-out/nova3.srt",
60
+ accent: "#ff6b6b"
61
+ },
62
+ {
63
+ id: "speechmatics",
64
+ label: "Speechmatics",
65
+ file: "srt-out/speechmatics.srt",
66
+ accent: "#ffa600"
67
+ }
68
+ ];
69
+
70
+ const segmentNodes = [];
71
+
72
+ function parseTimestamp(value) {
73
+ const [time, millisecondPart] = value.split(",");
74
+ const [hours, minutes, seconds] = time.split(":").map(Number);
75
+ const milliseconds = Number(millisecondPart);
76
+ return hours * 3600 + minutes * 60 + seconds + milliseconds / 1000;
77
+ }
78
+
79
+ function parseSrt(text) {
80
+ const blocks = text.replace(/\r/g, "").trim().split(/\n{2,}/);
81
+ return blocks
82
+ .map((block) => {
83
+ const lines = block.split("\n");
84
+ if (lines.length < 3) return null;
85
+ const timing = lines[1];
86
+ const [start, end] = timing.split("-->").map((part) => parseTimestamp(part.trim()));
87
+ const content = lines.slice(2).join(" ").replace(/\s+/g, " ").trim();
88
+ return { start, end, content };
89
+ })
90
+ .filter(Boolean);
91
+ }
92
+
93
+ function createSegmentElement(segment, accent) {
94
+ const segmentEl = document.createElement("div");
95
+ segmentEl.className = "segment";
96
+ segmentEl.dataset.start = segment.start;
97
+ segmentEl.dataset.end = segment.end;
98
+ segmentEl.innerHTML = `<span class="segment-time">${formatTime(segment.start)}</span><p>${segment.content}</p>`;
99
+ segmentEl.style.setProperty("--accent", accent);
100
+ return segmentEl;
101
+ }
102
+
103
+ function formatTime(seconds) {
104
+ const minutes = Math.floor(seconds / 60)
105
+ .toString()
106
+ .padStart(2, "0");
107
+ const secs = Math.floor(seconds % 60)
108
+ .toString()
109
+ .padStart(2, "0");
110
+ return `${minutes}:${secs}`;
111
+ }
112
+
113
+ function getTranscriptText(track) {
114
+ const cached = transcriptSources[track.id];
115
+ if (cached) {
116
+ return cached.replace(/^\ufeff/, "");
117
+ }
118
+ return null;
119
+ }
120
+
121
+ async function fetchTranscript(track) {
122
+ const response = await fetch(track.file);
123
+ if (!response.ok) {
124
+ throw new Error(`Unable to load ${track.label}`);
125
+ }
126
+ return (await response.text()).replace(/^\ufeff/, "");
127
+ }
128
+
129
+ async function loadTrack(track) {
130
+ let transcriptText = getTranscriptText(track);
131
+ if (!transcriptText) {
132
+ try {
133
+ transcriptText = await fetchTranscript(track);
134
+ } catch (error) {
135
+ renderFallbackCard(track, error.message);
136
+ throw error;
137
+ }
138
+ }
139
+ const transcript = parseSrt(transcriptText);
140
+
141
+ const trackEl = document.createElement("article");
142
+ trackEl.className = "track";
143
+ if (track.emphasis) {
144
+ trackEl.classList.add("track--emphasis");
145
+ }
146
+ trackEl.style.setProperty("--accent", track.accent);
147
+
148
+ trackEl.innerHTML = `
149
+ <header>
150
+ <h2>${track.label}</h2>
151
+ <span class="badge">Segments: ${transcript.length}</span>
152
+ </header>
153
+ `;
154
+
155
+ const contentEl = document.createElement("div");
156
+ contentEl.className = "track-body";
157
+ transcript.forEach((segment) => {
158
+ const segmentEl = createSegmentElement(segment, track.accent);
159
+ contentEl.appendChild(segmentEl);
160
+ segmentNodes.push(segmentEl);
161
+ });
162
+
163
+ trackEl.appendChild(contentEl);
164
+ if (track.emphasis) {
165
+ referenceTrackEl.appendChild(trackEl);
166
+ } else {
167
+ modelsGridEl.appendChild(trackEl);
168
+ }
169
+ }
170
+
171
+ function updateActiveSegments(time) {
172
+ segmentNodes.forEach((node) => {
173
+ const start = Number(node.dataset.start);
174
+ const end = Number(node.dataset.end);
175
+ const isActive = time >= start && time <= end;
176
+ node.classList.toggle("is-active", isActive);
177
+ });
178
+ }
179
+
180
+ async function drawWaveform() {
181
+ const response = await fetch(audioElem.currentSrc || audioElem.src);
182
+ if (!response.ok) return;
183
+ const arrayBuffer = await response.arrayBuffer();
184
+ const audioContext = new AudioContext();
185
+ const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
186
+ const rawData = audioBuffer.getChannelData(0);
187
+ const canvas = waveformCanvas;
188
+ const dpr = window.devicePixelRatio || 1;
189
+ canvas.width = canvas.clientWidth * dpr;
190
+ canvas.height = canvas.clientHeight * dpr;
191
+ const ctx = canvas.getContext("2d");
192
+ ctx.scale(dpr, dpr);
193
+ ctx.clearRect(0, 0, canvas.clientWidth, canvas.clientHeight);
194
+ const sliceWidth = Math.floor(rawData.length / canvas.clientWidth);
195
+ const halfHeight = canvas.clientHeight / 2;
196
+ ctx.lineWidth = 1.25;
197
+ ctx.strokeStyle = "#1f2937";
198
+ ctx.beginPath();
199
+ for (let i = 0; i < canvas.clientWidth; i++) {
200
+ const sliceStart = i * sliceWidth;
201
+ let sum = 0;
202
+ for (let j = 0; j < sliceWidth; j++) {
203
+ sum += Math.abs(rawData[sliceStart + j] || 0);
204
+ }
205
+ const amplitude = sum / sliceWidth;
206
+ const y = halfHeight - amplitude * halfHeight;
207
+ const yBottom = halfHeight + amplitude * halfHeight;
208
+ ctx.moveTo(i, y);
209
+ ctx.lineTo(i, yBottom);
210
+ }
211
+ ctx.stroke();
212
+ }
213
+
214
+ async function bootstrap() {
215
+ await Promise.all(
216
+ tracks.map(async (track) => {
217
+ try {
218
+ await loadTrack(track);
219
+ } catch (error) {
220
+ console.error(error);
221
+ }
222
+ })
223
+ );
224
+ void drawWaveform();
225
+ }
226
+
227
+ function renderFallbackCard(track, message) {
228
+ const card = document.createElement("article");
229
+ card.className = "track track--error";
230
+ card.innerHTML = `
231
+ <header>
232
+ <h2>${track.label}</h2>
233
+ <span class="badge badge--error">Unavailable</span>
234
+ </header>
235
+ <p class="track-error">${message}</p>
236
+ `;
237
+ if (track.emphasis) {
238
+ referenceTrackEl.appendChild(card);
239
+ } else {
240
+ modelsGridEl.appendChild(card);
241
+ }
242
+ }
243
+
244
+ audioElem.addEventListener("timeupdate", () => updateActiveSegments(audioElem.currentTime));
245
+ window.addEventListener("resize", () => drawWaveform());
246
+
247
+ bootstrap();
248
+ </script>
249
  </body>
250
  </html>
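The page's `parseTimestamp` helper converts SRT `HH:MM:SS,mmm` stamps into seconds for the segment highlighting. The same logic, copied standalone here for a quick sanity check outside the browser:

```javascript
// Standalone copy of the page's SRT timestamp conversion.
// "00:01:04,880" -> 64.88 seconds
function parseTimestamp(value) {
  const [time, millisecondPart] = value.split(",");
  const [hours, minutes, seconds] = time.split(":").map(Number);
  return hours * 3600 + minutes * 60 + seconds + Number(millisecondPart) / 1000;
}
```

Note that SRT uses a comma before the milliseconds, while WebVTT uses a period, so this parser is SRT-specific by design.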
srt-out/assembly.srt ADDED
@@ -0,0 +1,1880 @@
1
+ 1
2
+ 00:00:00,080 --> 00:00:05,680
3
+ Hello and welcome to a audio data set consisting
4
+
5
+ 2
6
+ 00:00:05,680 --> 00:00:10,640
7
+ of one single episode of a non-existent podcast. Or I
8
+
9
+ 3
10
+ 00:00:10,720 --> 00:00:13,360
11
+ may append this to a podcast that I set up
12
+
13
+ 4
14
+ 00:00:13,600 --> 00:00:19,200
15
+ recently regarding my with my thoughts on speech
16
+
17
+ 5
18
+ 00:00:19,280 --> 00:00:24,000
19
+ tech and AI in particular, more AI in generative AI,
20
+
21
+ 6
22
+ 00:00:24,240 --> 00:00:28,640
23
+ I would say. But in any event, the purpose of
24
+
25
+ 7
26
+ 00:00:28,720 --> 00:00:33,850
27
+ this Voice recording is actually to create a lengthy
28
+
29
+ 8
30
+ 00:00:33,930 --> 00:00:37,130
31
+ voice sample for a quick evaluation, a back of the
32
+
33
+ 9
34
+ 00:00:37,130 --> 00:00:40,650
35
+ envelope evaluation, as they might say, for different speech attack
36
+
37
+ 10
38
+ 00:00:40,890 --> 00:00:43,450
39
+ models. And I'm doing this because I thought I had
40
+
41
+ 11
42
+ 00:00:43,450 --> 00:00:46,810
43
+ made a great breakthrough in my journey with speech tech,
44
+
45
+ 12
46
+ 00:00:47,130 --> 00:00:50,730
47
+ and that was succeeding in the elusive task of fine-tuning
48
+
49
+ 13
50
+ 00:00:50,730 --> 00:00:54,810
51
+ Whisper. Whisper is, and I'm going to just talk, I'm
52
+
53
+ 14
54
+ 00:00:54,890 --> 00:00:58,250
55
+ trying to mix up, I'm going to try a few
56
+
57
+ 15
58
+ 00:00:58,410 --> 00:01:01,530
59
+ different styles of speaking. I might whisper something at some
60
+
61
+ 16
62
+ 00:01:01,610 --> 00:01:04,880
63
+ point. As well. And I'll go back to speaking loud
64
+
65
+ 17
66
+ 00:01:04,960 --> 00:01:08,080
67
+ in, in different parts. I'm going to sound really like
68
+
69
+ 18
70
+ 00:01:08,160 --> 00:01:11,120
71
+ a crazy person because I'm also going to try to
72
+
73
+ 19
74
+ 00:01:11,280 --> 00:01:16,240
75
+ speak at different pitches and cadences in order to really
76
+
77
+ 20
78
+ 00:01:16,560 --> 00:01:20,560
79
+ try to put a speech attacks model through its paces,
80
+
81
+ 21
82
+ 00:01:20,720 --> 00:01:23,040
83
+ which is trying to make sense of is this guy
84
+
85
+ 22
86
+ 00:01:23,200 --> 00:01:28,060
87
+ just rambling on incoherently in one long sentence or are
88
+
89
+ 23
90
+ 00:01:28,460 --> 00:01:34,220
91
+ these just actually a series of step, standalone,
92
+
93
+ 24
94
+ 00:01:34,380 --> 00:01:37,420
95
+ step alone, standalone sentences? And how is it gonna handle
96
+
97
+ 25
98
+ 00:01:37,500 --> 00:01:40,460
99
+ step alone? That's not a word. What happens when you
100
+
101
+ 26
102
+ 00:01:40,540 --> 00:01:43,020
103
+ use speech to text and you use a fake word?
104
+
105
+ 27
106
+ 00:01:43,180 --> 00:01:45,580
107
+ And then you're like, wait, that's not actually, that word
108
+
109
+ 28
110
+ 00:01:45,740 --> 00:01:50,220
111
+ doesn't exist. How does AI handle that? And these and
112
+
113
+ 29
114
+ 00:01:50,460 --> 00:01:54,300
115
+ more are all the questions that I'm seeking to answer
116
+
117
+ 30
118
+ 00:01:54,460 --> 00:01:57,500
119
+ in this training data. Now, why was it trying to
120
+
121
+ 31
122
+ 00:01:57,500 --> 00:02:00,290
123
+ fine tune Whisper? And what is Whisper? As I said,
124
+
125
+ 32
126
+ 00:02:00,370 --> 00:02:03,010
127
+ I'm going to try to record this at a couple
128
+
129
+ 33
130
+ 00:02:03,170 --> 00:02:07,490
131
+ of different levels of technicality for folks who are, you
132
+
133
+ 34
134
+ 00:02:07,490 --> 00:02:11,730
135
+ know, in the normal world and not totally stuck down
136
+
137
+ 35
138
+ 00:02:11,810 --> 00:02:13,810
139
+ the rabbit hole of AI, which I have to say
140
+
141
+ 36
142
+ 00:02:13,970 --> 00:02:18,130
143
+ is a really wonderful rabbit hole to be down. It's
+
+ 37
+ 00:02:18,210 --> 00:02:21,570
+ a really interesting area and speech and voice tech is
+
+ 38
+ 00:02:21,970 --> 00:02:24,610
+ the aspect of it that I find actually the most,
+
+ 39
+ 00:02:25,010 --> 00:02:27,410
+ I'm not sure I would say the most interesting because
+
+ 40
+ 00:02:27,650 --> 00:02:31,370
+ there's just so much that is fascinating in AI. But
+
+ 41
+ 00:02:31,530 --> 00:02:34,330
+ the most that I find the most personally transformative in
+
+ 42
+ 00:02:34,410 --> 00:02:38,970
+ terms of the impact that it's had on my daily
+
+ 43
+ 00:02:39,050 --> 00:02:41,530
+ work life and productivity and how I sort of work.
+
+ 44
+ 00:02:42,170 --> 00:02:47,290
+ And I'm persevering hard with the task of trying
+
+ 45
+ 00:02:47,290 --> 00:02:50,330
+ to get a good solution working for Linux, which if
+
+ 46
+ 00:02:50,330 --> 00:02:52,330
+ anyone actually does listen to this, not just for the
+
+ 47
+ 00:02:52,330 --> 00:02:56,490
+ training data and for the actual content, this is sparked
+
+ 48
+ 00:02:56,830 --> 00:03:00,030
+ I had, besides the fine tune not working, well, that
+
+ 49
+ 00:03:00,110 --> 00:03:05,310
+ was the failure. Um, I used Claude Code because one
+
+ 50
+ 00:03:05,550 --> 00:03:10,030
+ thinks these days that there is nothing short of solving,
+
+ 51
+ 00:03:11,070 --> 00:03:15,470
+ you know, the, the reason of life or something, that
+
+ 52
+ 00:03:15,870 --> 00:03:19,070
+ Claude and agentic AI can't do, which is not really
+
+ 53
+ 00:03:19,150 --> 00:03:22,270
+ the case. Uh, it does seem that way sometimes, but
+
+ 54
+ 00:03:22,430 --> 00:03:24,270
+ it fails a lot as well. And this is one
+
+ 55
+ 00:03:24,270 --> 00:03:27,710
+ of those instances where last week I put together an
+
+ 56
+ 00:03:27,790 --> 00:03:32,090
+ hour of voice training data, basically speaking, just random things
+
+ 57
+ 00:03:32,330 --> 00:03:37,130
+ for 3 minutes. And it was actually kind of tedious
+
+ 58
+ 00:03:37,210 --> 00:03:39,290
+ because the texts were really weird. Some of them were
+
+ 59
+ 00:03:39,530 --> 00:03:43,130
+ it was like it was AI generated. I tried before
+
+ 60
+ 00:03:43,290 --> 00:03:45,210
+ to read Sherlock Holmes for an hour and I just
+
+ 61
+ 00:03:45,210 --> 00:03:48,410
+ couldn't. I was so bored after 10 minutes that I
+
+ 62
+ 00:03:48,410 --> 00:03:50,810
+ was like, okay, no, I'm just going to have to
+
+ 63
+ 00:03:50,810 --> 00:03:55,370
+ find something else to read. So I used a created
+
+ 64
+ 00:03:55,770 --> 00:04:01,360
+ with AI Studio vibe coded a synthetic text generator. Which
+
+ 65
+ 00:04:01,680 --> 00:04:03,920
+ actually I thought was probably a better way of doing
+
+ 66
+ 00:04:04,000 --> 00:04:07,520
+ it because it would give me more short samples with
+
+ 67
+ 00:04:07,760 --> 00:04:10,560
+ more varied content. So I was like, okay, give me
+
+ 68
+ 00:04:10,960 --> 00:04:13,840
+ a voice note, like I'm recording an email, give me
+
+ 69
+ 00:04:14,080 --> 00:04:17,760
+ a short story to read, give me prose to read.
+
+ 70
+ 00:04:18,080 --> 00:04:20,480
+ So I came up with all these different things and
+
+ 71
+ 00:04:20,640 --> 00:04:22,640
+ they added a little timer to it so I could
+
+ 72
+ 00:04:22,800 --> 00:04:26,480
+ see how close I was to one hour. And I
+
+ 73
+ 00:04:26,640 --> 00:04:29,680
+ spent like an hour one afternoon or probably two hours
+
+ 74
+ 00:04:29,840 --> 00:04:33,410
+ by the time you you do retakes. And whatever, because
+
+ 75
+ 00:04:33,490 --> 00:04:36,690
+ you want to, it gave me a source of truth,
+
+ 76
+ 00:04:37,410 --> 00:04:40,130
+ which I'm not sure if that's the scientific way to
+
+ 77
+ 00:04:40,290 --> 00:04:44,290
+ approach this topic of gathering training data, but I thought
+
+ 78
+ 00:04:44,530 --> 00:04:48,210
+ made sense. Um, I have a lot of audio data
+
+ 79
+ 00:04:48,290 --> 00:04:50,850
+ from recording voice notes, which I've also kind of used,
+
+ 80
+ 00:04:52,130 --> 00:04:55,890
+ been experimenting with using for a different purpose, slightly different
+
+ 81
+ 00:04:56,290 --> 00:05:01,490
+ annotating task types. It's more a text classification experiment
+
+ 82
+ 00:05:01,810 --> 00:05:04,240
+ or, well, it's more than that actually. I'm working on
+
+ 83
+ 00:05:04,240 --> 00:05:08,160
+ a voice app. So it's a prototype, I guess, is
+
+ 84
+ 00:05:08,320 --> 00:05:12,800
+ really more accurate. But you can do that and you
+
+ 85
+ 00:05:12,800 --> 00:05:15,280
+ can work backwards. You're like, you listen back to a
+
+ 86
+ 00:05:15,280 --> 00:05:18,800
+ voice note and you painfully go through one of those
+
+ 87
+ 00:05:19,120 --> 00:05:21,920
+ transcribing, you know, where you start and stop and scrub
+
+ 88
+ 00:05:22,080 --> 00:05:24,000
+ around it and you fix the errors, but it's really,
+
+ 89
+ 00:05:24,160 --> 00:05:26,800
+ really boring to do that. So I thought it would
+
+ 90
+ 00:05:26,880 --> 00:05:29,120
+ be less tedious in the long term if I just
+
+ 91
+ 00:05:30,139 --> 00:05:33,020
+ recorded the source of truth. So it gave me these
+
+ 92
+ 00:05:33,100 --> 00:05:36,220
+ three minute snippets. I recorded them. It saved an MP3
+
+ 93
+ 00:05:36,460 --> 00:05:39,580
+ and a TXT in the same folder, and I created
+
+ 94
+ 00:05:39,660 --> 00:05:42,940
+ an error with that data. So I was very hopeful,
+
+ 95
+ 00:05:43,340 --> 00:05:46,940
+ quietly, a little bit hopeful that I could actually fine
+
+ 96
+ 00:05:47,020 --> 00:05:50,540
+ tune Whisper. I want to fine tune Whisper because when
+
+ 97
+ 00:05:50,620 --> 00:05:54,860
+ I got into voice tech last November, my wife was in
+
+ 98
+ 00:05:54,860 --> 00:05:58,220
+ the US and I was alone at home. And when
+
+ 99
+ 00:05:58,680 --> 00:06:01,480
+ crazy people like me do really wild things like use
+
+ 100
+ 00:06:01,720 --> 00:06:06,200
+ voice-to-text technology. That was basically when I started
+
+ 101
+ 00:06:06,280 --> 00:06:08,840
+ doing it, I didn't feel like a crazy person speaking
+
+ 102
+ 00:06:08,920 --> 00:06:13,800
+ to myself. And my expectations weren't that high. I used
+
+ 103
+ 00:06:14,360 --> 00:06:17,720
+ speech tech now and again, tried it out. It was
+
+ 104
+ 00:06:17,720 --> 00:06:19,240
+ like, it'd be really cool if you could just, like,
+
+ 105
+ 00:06:19,400 --> 00:06:22,840
+ speak into your computer. And whatever I tried out that
+
+ 106
+ 00:06:23,080 --> 00:06:26,670
+ had Linux support was just. It was not good, basically.
+
+ 107
+ 00:06:27,310 --> 00:06:29,550
+ And this blew me away from the first go. I
+
+ 108
+ 00:06:29,550 --> 00:06:32,830
+ mean, it wasn't 100% accurate out of the box and
+
+ 109
+ 00:06:32,910 --> 00:06:34,990
+ it took work, but it was good enough that there
+
+ 110
+ 00:06:35,070 --> 00:06:37,550
+ was a solid foundation and it kind of passed that
+
+ 111
+ 00:06:38,750 --> 00:06:41,950
+ pivot point that it's actually worth doing this. You know,
+
+ 112
+ 00:06:42,110 --> 00:06:44,750
+ there's a point where it's so like the transcript is
+
+ 113
+ 00:06:44,990 --> 00:06:47,390
+ you don't have to get 100% accuracy for it to
+
+ 114
+ 00:06:47,390 --> 00:06:50,110
+ be worth your time for speech tech to be a
+
+ 115
+ 00:06:50,110 --> 00:06:52,510
+ worthwhile addition to your productivity, but you do need to
+
+ 116
+ 00:06:52,510 --> 00:06:56,050
+ get above, let's say, I don't know, 85%. If it's
+
+ 117
+ 00:06:56,210 --> 00:06:59,890
+ 60% or 50%, you inevitably say, screw it, I'll just
+
+ 118
+ 00:06:59,890 --> 00:07:02,850
+ type it because you end up missing errors in the
+
+ 119
+ 00:07:02,850 --> 00:07:05,570
+ transcript and it becomes actually worse. You end up in
+
+ 120
+ 00:07:05,570 --> 00:07:07,650
+ a worse position than you started with. That's been my
+
+ 121
+ 00:07:07,730 --> 00:07:12,050
+ experience. So I was like, oh, this is actually really,
+
+ 122
+ 00:07:12,210 --> 00:07:14,050
+ really good now. How did that happen? And the answer
+
+ 123
+ 00:07:14,210 --> 00:07:19,490
+ is ASR, Whisper being open source and the transformer
+
+ 124
+ 00:07:19,490 --> 00:07:23,250
+ architecture. If you want to go back to the to
+
+ 125
+ 00:07:23,330 --> 00:07:26,450
+ the underpinnings, which really blows my mind and it's on
+
+ 126
+ 00:07:26,530 --> 00:07:30,760
+ my list. To read through that paper. All you need
+
+ 127
+ 00:07:30,840 --> 00:07:36,040
+ is attention as attentively as can be done
+
+ 128
+ 00:07:36,280 --> 00:07:39,400
+ with my limited brain because it's super, super high level
+
+ 129
+ 00:07:39,720 --> 00:07:44,600
+ stuff, super advanced stuff, I mean. But that, I think
+
+ 130
+ 00:07:44,760 --> 00:07:49,400
+ of all the things that are fascinating about the sudden
+
+ 131
+ 00:07:49,720 --> 00:07:53,780
+ rise in AI and the dramatic capabilities. I find it
+
+ 132
+ 00:07:53,780 --> 00:07:56,180
+ fascinating that a few people are like, hang on, you've
+
+ 133
+ 00:07:56,180 --> 00:07:58,500
+ got this thing that can speak to you, like a
+
+ 134
+ 00:07:58,500 --> 00:08:03,060
+ chatbot, an LLM, and then you've got image generation. Okay,
+
+ 135
+ 00:08:03,140 --> 00:08:06,660
+ so firstly, those two things on the surface have nothing
+
+ 136
+ 00:08:06,980 --> 00:08:10,820
+ in common. So like, how are they, how did that
+
+ 137
+ 00:08:10,980 --> 00:08:12,580
+ just happen all at the same time? And then when
+
+ 138
+ 00:08:12,580 --> 00:08:16,660
+ you extend that further, you're like, Suno, right? You can
+
+ 139
+ 00:08:17,140 --> 00:08:20,110
+ sing a song and AI will come up with an
+
+ 140
+ 00:08:20,270 --> 00:08:23,470
+ instrumental. And then you've got Whisper and you're like, wait
+
+ 141
+ 00:08:23,470 --> 00:08:25,950
+ a second, how did all this stuff, like, if it's
+
+ 142
+ 00:08:25,950 --> 00:08:29,310
+ all AI, what's like, there has to be some commonality.
+
+ 143
+ 00:08:29,550 --> 00:08:34,670
+ Otherwise, these are totally different technologies on the surface of
+
+ 144
+ 00:08:34,670 --> 00:08:38,910
+ it. And the Transformer architecture is, as far as I
+
+ 145
+ 00:08:38,990 --> 00:08:41,630
+ know, the answer. And I can't even say, can't even
+
+ 146
+ 00:08:41,710 --> 00:08:46,350
+ pretend that I really understand what the Transformer architecture means.
+
+ 147
+ 00:08:46,850 --> 00:08:49,330
+ In depth, but I have scanned it and as I
+
+ 148
+ 00:08:49,490 --> 00:08:51,890
+ said, I want to print it and really kind of
+
+ 149
+ 00:08:52,290 --> 00:08:56,130
+ think over it at some point. And I'll probably feel
+
+ 150
+ 00:08:56,370 --> 00:08:59,330
+ bad about myself, I think, because weren't those guys in
+
+ 151
+ 00:08:59,410 --> 00:09:03,490
+ their 20s? Like, that's crazy. I think I asked ChatGPT
+
+ 152
+ 00:09:03,570 --> 00:09:07,970
+ once who wrote that paper and how old were they
+
+ 153
+ 00:09:08,130 --> 00:09:10,850
+ when it was published on arXiv? And I was expecting,
+
+ 154
+ 00:09:11,090 --> 00:09:13,970
+ like, I don't know, what do you imagine? I personally
+
+ 155
+ 00:09:14,050 --> 00:09:16,290
+ imagine kind of like, you know, you have these breakthroughs
+
+ 156
+ 00:09:16,450 --> 00:09:19,890
+ during COVID and things like that where like these kind
+
+ 157
+ 00:09:19,970 --> 00:09:22,850
+ of really obscure scientists are like in their 50s and
+
+ 158
+ 00:09:22,850 --> 00:09:27,250
+ they've just kind of been laboring in labs and wearily
+
+ 159
+ 00:09:27,250 --> 00:09:30,530
+ writing and publishing in kind of obscure academic publications.
+
+ 160
+ 00:09:30,850 --> 00:09:33,250
+ And they finally like hit it big or win a
+
+ 161
+ 00:09:33,250 --> 00:09:37,330
+ Nobel Prize and then they're household names. So that was
+
+ 162
+ 00:09:37,410 --> 00:09:39,070
+ kind of what I had in mind. That was the
+
+ 163
+ 00:09:39,070 --> 00:09:43,070
+ mental image I'd formed of the birth of arXiv. Like
+
+ 164
+ 00:09:43,070 --> 00:09:46,350
+ I wasn't expecting 20-somethings in San Francisco, though. I thought
+
+ 165
+ 00:09:46,430 --> 00:09:48,910
+ that was both very, very funny, very cool, and actually
+
+ 166
+ 00:09:49,070 --> 00:09:52,590
+ kind of inspiring. It's nice to think that people who,
+
+ 167
+ 00:09:53,390 --> 00:09:56,190
+ you know, just you might put them in the kind
+
+ 168
+ 00:09:56,270 --> 00:09:59,630
+ of milieu or bubble or world that you are in
+
+ 169
+ 00:09:59,710 --> 00:10:03,310
+ are credibly in through, you know, the series of connections
+
+ 170
+ 00:10:03,390 --> 00:10:07,470
+ that are coming up with such literally world changing innovations.
+
+ 171
+ 00:10:07,950 --> 00:10:11,540
+ So that was, I thought, anyway. That's that was cool.
+
+ 172
+ 00:10:11,940 --> 00:10:14,580
+ Okay, voice training data. How are we doing? We're about
+
+ 173
+ 00:10:14,580 --> 00:10:18,660
+ 10 minutes and I'm still talking about voice technology. So
+
+ 174
+ 00:10:18,740 --> 00:10:22,180
+ Whisper was brilliant and I was so excited that I
+
+ 175
+ 00:10:22,260 --> 00:10:25,460
+ was my first instinct was to like guess like, oh
+
+ 176
+ 00:10:25,460 --> 00:10:26,900
+ my gosh, I have to get like a really good
+
+ 177
+ 00:10:26,900 --> 00:10:30,660
+ microphone for this. So I didn't go on a spending
+
+ 178
+ 00:10:30,660 --> 00:10:32,820
+ spree because I said, I'm gonna have to just wait
+
+ 179
+ 00:10:32,820 --> 00:10:35,220
+ a month and see if I still use this. And
+
+ 180
+ 00:10:36,510 --> 00:10:38,990
+ it just kind of became, it's become really part of
+
+ 181
+ 00:10:39,150 --> 00:10:43,470
+ my daily routine. Like if I'm writing an email, I'll
+
+ 182
+ 00:10:43,550 --> 00:10:47,070
+ record a voice note. And then I've developed and it's
+
+ 183
+ 00:10:47,070 --> 00:10:49,150
+ nice to see that everyone is like developing the same
+
+ 184
+ 00:10:49,630 --> 00:10:52,030
+ things in parallel. Like that's my kind of a weird
+
+ 185
+ 00:10:52,030 --> 00:10:54,590
+ thing to say, but when I look, I kind of
+
+ 186
+ 00:10:54,750 --> 00:10:59,070
+ came, when I started working on this, these prototypes on
+
+ 187
+ 00:10:59,150 --> 00:11:01,550
+ GitHub, which is where I just kind of share very
+
+ 188
+ 00:11:01,790 --> 00:11:06,810
+ freely and loosely, ideas and first iterations on concepts.
+
+ 189
+ 00:11:08,570 --> 00:11:10,730
+ And for want of a better word, I called it
+
+ 190
+ 00:11:10,810 --> 00:11:15,530
+ like LLM post-processing or cleanup or basically a system prompt
+
+ 191
+ 00:11:15,610 --> 00:11:18,970
+ that after you get back the raw text from Whisper,
+
+ 192
+ 00:11:19,130 --> 00:11:22,090
+ you run it through a model and say, okay, this
+
+ 193
+ 00:11:22,170 --> 00:11:27,050
+ is crappy text, like add sentence structure and fix it
+
+ 194
+ 00:11:27,130 --> 00:11:32,330
+ up. And now when I'm exploring the different tools that
+
+ 195
+ 00:11:32,410 --> 00:11:35,260
+ are out there that people have built, I see quite
+
+ 196
+ 00:11:35,500 --> 00:11:39,180
+ a number of projects have basically done the same thing,
+
+ 197
+ 00:11:40,540 --> 00:11:43,260
+ lest that be misconstrued. I'm not saying for a millisecond
+
+ 198
+ 00:11:43,340 --> 00:11:46,300
+ that I inspired them. I'm sure this has been a
+
+ 199
+ 00:11:46,380 --> 00:11:49,580
+ thing that's been integrated into tools for a while, but
+
+ 200
+ 00:11:50,460 --> 00:11:52,380
+ it's the kind of thing that when you start using
+
+ 201
+ 00:11:52,380 --> 00:11:54,860
+ these tools every day, the need for it is almost
+
+ 202
+ 00:11:55,020 --> 00:11:59,500
+ instantly apparent because text that doesn't have any punctuation or
+
+ 203
+ 00:11:59,880 --> 00:12:03,080
+ paragraph spacing takes a long time to, you know, it
+
+ 204
+ 00:12:03,240 --> 00:12:05,480
+ takes so long to get it into a presentable email
+
+ 205
+ 00:12:05,640 --> 00:12:09,800
+ that again, it's, it's, it, it moves speech tech into
+
+ 206
+ 00:12:10,040 --> 00:12:13,560
+ that before that inflection point where you're like, no, it's
+
+ 207
+ 00:12:13,560 --> 00:12:16,040
+ just not worth it. It's like, it's, it'll just be
+
+ 208
+ 00:12:16,120 --> 00:12:18,600
+ quicker to type this. So it's a big, it's a
+
+ 209
+ 00:12:18,600 --> 00:12:21,640
+ little touch that actually is a big deal. Uh, so
+
+ 210
+ 00:12:21,800 --> 00:12:25,720
+ I was on Whisper and I've been using Whisper and
+
+ 211
+ 00:12:25,720 --> 00:12:28,190
+ I kind of, early on found a couple of tools.
+
+ 212
+ 00:12:28,350 --> 00:12:30,590
+ I couldn't find what I was looking for on Linux,
+
+ 213
+ 00:12:30,750 --> 00:12:35,550
+ which is basically just something that'll run in the background.
+
+ 214
+ 00:12:35,790 --> 00:12:38,110
+ I'll give it an API key and it will just
+
+ 215
+ 00:12:38,270 --> 00:12:42,990
+ like transcribe with like a little key to start and
+
+ 216
+ 00:12:43,070 --> 00:12:47,390
+ stop the dictation. And the issues were I discovered that
+
+ 217
+ 00:12:47,550 --> 00:12:51,150
+ like most people involved in creating these projects were very
+
+ 218
+ 00:12:51,310 --> 00:12:55,150
+ much focused on local models, running Whisper locally because you
+
+ 219
+ 00:12:55,230 --> 00:12:58,020
+ can. And I tried that a bunch of times and
+
+ 220
+ 00:12:58,100 --> 00:13:00,420
+ just never got results that were as good as the
+
+ 221
+ 00:13:00,420 --> 00:13:03,220
+ cloud. And when I began looking at the cost of
+
+ 222
+ 00:13:03,300 --> 00:13:05,780
+ the speech to text APIs and what I was spending,
+
+ 223
+ 00:13:06,340 --> 00:13:09,540
+ I just thought there is, it's actually, in my opinion,
+
+ 224
+ 00:13:09,700 --> 00:13:12,900
+ just one of the better deals in API spending and
+
+ 225
+ 00:13:12,900 --> 00:13:15,220
+ in cloud. Like it's just not that expensive for very,
+
+ 226
+ 00:13:15,380 --> 00:13:19,380
+ very good models that are much more, you know, you're
+
+ 227
+ 00:13:19,380 --> 00:13:21,960
+ gonna be able to run the full model. The latest
+
+ 228
+ 00:13:21,960 --> 00:13:25,960
+ model versus whatever you can run on your average GPU,
+
+ 229
+ 00:13:26,200 --> 00:13:29,240
+ unless you want to buy a crazy GPU. It doesn't
+
+ 230
+ 00:13:29,240 --> 00:13:31,160
+ really make sense to me. Now, privacy is another concern
+
+ 231
+ 00:13:32,200 --> 00:13:33,960
+ that I know is kind of like a very much
+
+ 232
+ 00:13:34,040 --> 00:13:36,840
+ a separate thing that people just don't want their voice
+
+ 233
+ 00:13:37,080 --> 00:13:40,760
+ data and their voice leaving their local environment, maybe for
+
+ 234
+ 00:13:40,760 --> 00:13:44,280
+ regulatory reasons as well. But I'm not in that. I
+
+ 235
+ 00:13:44,680 --> 00:13:48,920
+ neither really care about people listening to my grocery list
+
+ 236
+ 00:13:49,160 --> 00:13:51,800
+ consisting of reminding myself that I need to buy more
+
+ 237
+ 00:13:51,880 --> 00:13:55,230
+ beer, Cheetos, and hummus, which is kind of the three
+
+ 238
+ 00:13:55,390 --> 00:13:59,950
+ staples of my diet during periods of poorer nutrition. But
+
+ 239
+ 00:14:00,030 --> 00:14:02,510
+ the kind of stuff that I transcribe, it's just not,
+
+ 240
+ 00:14:04,030 --> 00:14:07,790
+ it's not a privacy thing I'm that sort of sensitive
+
+ 241
+ 00:14:07,870 --> 00:14:13,230
+ about and I don't do anything so sensitive or secure
+
+ 242
+ 00:14:13,310 --> 00:14:16,510
+ that requires air gapping. So I looked at the pricing
+
+ 243
+ 00:14:16,590 --> 00:14:19,870
+ and especially the kind of older mini models. Some of
+
+ 244
+ 00:14:19,950 --> 00:14:22,030
+ them are very, very affordable. And I did a back
+
+ 245
+ 00:14:22,270 --> 00:14:25,950
+ of the, I did a calculation once with ChatGPT and
+
+ 246
+ 00:14:25,950 --> 00:14:29,310
+ I was like, okay, this is the API price for
+
+ 247
+ 00:14:29,470 --> 00:14:32,350
+ I can't remember whatever the model was. Let's say I
+
+ 248
+ 00:14:32,430 --> 00:14:35,310
+ just go at it like nonstop, which it rarely happens.
+
+ 249
+ 00:14:35,550 --> 00:14:38,910
+ Probably, I would say on average, I might dictate 30
+
+ 250
+ 00:14:38,990 --> 00:14:41,870
+ to 60 minutes per day if I was probably summing
+
+ 251
+ 00:14:41,870 --> 00:14:47,070
+ up the emails, documents, outlines, which
+
+ 252
+ 00:14:47,310 --> 00:14:49,950
+ is a lot, but it's still a fairly modest amount.
+
+ 253
+ 00:14:50,110 --> 00:14:52,020
+ And I was like, some days I do go on
+
+ 254
+ 00:14:52,180 --> 00:14:54,980
+ like one or two days where I've been usually when
+
+ 255
+ 00:14:54,980 --> 00:14:57,060
+ I'm like kind of out of the house and just
+
+ 256
+ 00:14:57,300 --> 00:15:00,580
+ have something like I have nothing else to do. Like
+
+ 257
+ 00:15:00,740 --> 00:15:04,100
+ if I'm at a hospital, we have a newborn and
+
+ 258
+ 00:15:04,260 --> 00:15:07,380
+ you're waiting for like eight hours and hours for an
+
+ 259
+ 00:15:07,460 --> 00:15:10,900
+ appointment. And I would probably have listened to podcasts before
+
+ 260
+ 00:15:11,460 --> 00:15:14,260
+ becoming a speech fanatic. And I'm like, oh, wait, let
+
+ 261
+ 00:15:14,420 --> 00:15:16,339
+ me just get down. Let me just get these ideas
+
+ 262
+ 00:15:16,500 --> 00:15:18,620
+ out of my head. And that's when I'll go on
+
+ 263
+ 00:15:19,340 --> 00:15:21,900
+ my speech binges. But those are like once every few
+
+ 264
+ 00:15:21,900 --> 00:15:25,020
+ months, like not frequently. But I said, okay, let's just
+
+ 265
+ 00:15:25,100 --> 00:15:29,180
+ say if I'm gonna price out cloud STT, if I
+
+ 266
+ 00:15:29,260 --> 00:15:33,980
+ was like dedicated every second of every waking hour to
+
+ 267
+ 00:15:34,140 --> 00:15:37,980
+ transcribing for some odd reason, I mean, I'd have to
+
+ 268
+ 00:15:38,060 --> 00:15:40,860
+ like eat and use the toilet. Like, you know, there's
+
+ 269
+ 00:15:40,940 --> 00:15:43,500
+ only so many hours I'm awake for. So like, let's
+
+ 270
+ 00:15:43,500 --> 00:15:46,700
+ just say a maximum of like 40 hour, 45 minutes
+
+ 271
+ 00:15:47,290 --> 00:15:49,370
+ in the hour. Then I said, all right, let's just
+
+ 272
+ 00:15:49,370 --> 00:15:52,970
+ say 50. Who knows? You're dictating on the toilet. We
+
+ 273
+ 00:15:53,130 --> 00:15:55,130
+ do it. So it could be. You could just do
+
+ 274
+ 00:15:55,210 --> 00:15:59,370
+ 60. But whatever I did. And every day, like, you're
+
+ 275
+ 00:15:59,450 --> 00:16:02,810
+ going flat out seven days a week dictating non-stop I
+
+ 276
+ 00:16:02,810 --> 00:16:05,930
+ was like, what's my monthly API bill gonna be at
+
+ 277
+ 00:16:06,010 --> 00:16:08,650
+ this price? And it came out to, like, 70 or
+
+ 278
+ 00:16:08,650 --> 00:16:10,810
+ 80 bucks. And I was like, well, that would be
+
+ 279
+ 00:16:11,210 --> 00:16:15,780
+ an extraordinary amount of dictation. And I would hope that
+
+ 280
+ 00:16:16,260 --> 00:16:20,020
+ there was some compelling reason worth more than $70
+
+ 281
+ 00:16:20,340 --> 00:16:23,540
+ that I embarked upon that project. So given that that's
+
+ 282
+ 00:16:23,540 --> 00:16:25,540
+ kind of the max point for me, I said that's
+
+ 283
+ 00:16:25,620 --> 00:16:29,220
+ actually very, very affordable. Now you're gonna, if you want
+
+ 284
+ 00:16:29,300 --> 00:16:31,780
+ to spec out the costs and you want to do
+
+ 285
+ 00:16:31,780 --> 00:16:36,340
+ the post-processing that I really do feel is valuable, that's
+
+ 286
+ 00:16:36,420 --> 00:16:40,900
+ gonna cost some more as well, unless you're using Gemini,
+
+ 287
+ 00:16:41,380 --> 00:16:44,500
+ which needless to say, as a random person sitting in
+
+ 288
+ 00:16:44,580 --> 00:16:49,140
+ Jerusalem, I have no affiliation, nor with Google, nor Anthropic,
+
+ 289
+ 00:16:49,220 --> 00:16:52,100
+ nor Gemini, nor any major tech vendor for that matter.
+
+ 290
+ 00:16:53,700 --> 00:16:56,900
+ I like Gemini not so much as an everyday model.
+
+ 291
+ 00:16:57,380 --> 00:16:59,940
+ It's kind of underwhelmed in that respect, I would say.
+
+ 292
+ 00:17:00,340 --> 00:17:02,820
+ But for multimodal, I think it's got a lot to
+
+ 293
+ 00:17:02,820 --> 00:17:06,580
+ offer. And I think that the transcribing functionality whereby it
+
+ 294
+ 00:17:06,660 --> 00:17:11,980
+ can process audio with a system prompt and both give
+
+ 295
+ 00:17:12,140 --> 00:17:15,180
+ you transcription that's cleaned up that reduces two steps to
+
+ 296
+ 00:17:15,340 --> 00:17:18,300
+ one. And that for me is a very, very big
+
+ 297
+ 00:17:18,460 --> 00:17:21,660
+ deal. And I feel like even Google hasn't really
+
+ 298
+ 00:17:21,900 --> 00:17:26,780
+ sort of thought through how useful that modality is
+
+ 299
+ 00:17:26,860 --> 00:17:29,340
+ and what kind of use cases you can achieve with
+
+ 300
+ 00:17:29,420 --> 00:17:31,340
+ it. Because I found in the course of this year,
+
+ 301
+ 00:17:31,980 --> 00:17:36,620
+ just an endless list of really kind of system prompt
+
+ 302
+ 00:17:36,940 --> 00:17:40,300
+ system prompt stuff that I can say, okay, I've used
+
+ 303
+ 00:17:40,300 --> 00:17:43,500
+ it to capture context data for AI, which is literally
+
+ 304
+ 00:17:43,580 --> 00:17:45,740
+ I might speak for if I wanted to have a
+
+ 305
+ 00:17:45,740 --> 00:17:49,820
+ good bank of context data about who knows my childhood
+
+ 306
+ 00:17:50,380 --> 00:17:54,300
+ more realistically, maybe my career goals, something that would just
+
+ 307
+ 00:17:54,380 --> 00:17:56,780
+ be like really boring to type out. So I'll just
+
+ 308
+ 00:17:56,860 --> 00:18:00,860
+ like sit in my car and record it for 10
+
+ 309
+ 00:18:00,940 --> 00:18:03,180
+ minutes. And that 10 minutes you get a lot of
+
+ 310
+ 00:18:03,340 --> 00:18:08,730
+ information in. Um, emails, which is short text, just
+
+ 311
+ 00:18:09,130 --> 00:18:12,330
+ there is a whole bunch and all these workflows kind
+
+ 312
+ 00:18:12,490 --> 00:18:14,490
+ of require a little bit of treatment afterwards and different
+
+ 313
+ 00:18:14,730 --> 00:18:18,170
+ treatment. My context pipeline is kind of like just extract
+
+ 314
+ 00:18:18,250 --> 00:18:21,050
+ the bare essentials. So you end up with me talking
+
+ 315
+ 00:18:21,130 --> 00:18:23,050
+ very loosely about sort of what I've done in my
+
+ 316
+ 00:18:23,130 --> 00:18:25,450
+ career, where I've worked, where I might like to work.
+
+ 317
+ 00:18:25,930 --> 00:18:29,050
+ And it goes, it condenses that down to very robotic
+
+ 318
+ 00:18:29,290 --> 00:18:32,570
+ language that is easy to chunk, parse and maybe put
+
+ 319
+ 00:18:32,650 --> 00:18:36,630
+ into a vector database. Daniel has worked in technology. Daniel
+
+ 320
+ 00:18:37,510 --> 00:18:40,230
+ has been working in, you know, stuff like that. That's
+
+ 321
+ 00:18:40,230 --> 00:18:43,190
+ not how you would speak, but I figure it's probably
+
+ 322
+ 00:18:43,430 --> 00:18:47,430
+ easier to parse for, after all, robots. So we've almost
+
+ 323
+ 00:18:47,510 --> 00:18:49,350
+ got to 20 minutes and this is actually a success
+
+ 324
+ 00:18:49,830 --> 00:18:55,190
+ because I wasted 20 minutes of the evening speaking
+
+ 325
+ 00:18:55,270 --> 00:18:59,990
+ into a microphone and the levels were shot and it
+
+ 326
+ 00:18:59,990 --> 00:19:01,670
+ was clipping and I said, I can't really do an
+
+ 327
+ 00:19:01,750 --> 00:19:04,070
+ evaluation. I have to be fair. I have to give
+
+ 328
+ 00:19:04,640 --> 00:19:08,000
+ the models a chance to do their thing. What am
+
+ 329
+ 00:19:08,000 --> 00:19:10,400
+ I hoping to achieve in this? Okay, my fine tune
+
+ 330
+ 00:19:10,400 --> 00:19:13,440
+ was a dud as mentioned. Deepgram STT, I'm really, really
+
+ 331
+ 00:19:13,520 --> 00:19:16,560
+ hopeful that this prototype will work and it's a build
+
+ 332
+ 00:19:16,800 --> 00:19:19,360
+ in public open source, so anyone is welcome to use
+
+ 333
+ 00:19:19,440 --> 00:19:22,400
+ it if I make anything good. But that was really
+
+ 334
+ 00:19:22,560 --> 00:19:26,560
+ exciting for me last night when after hours of trying
+
+ 335
+ 00:19:26,640 --> 00:19:30,560
+ my own prototype, seeing someone just made something that works
+
+ 336
+ 00:19:30,720 --> 00:19:32,480
+ like that, you know, you're not gonna have to build
+
+ 337
+ 00:19:32,720 --> 00:19:37,540
+ a custom conda environment and image. I have an AMD GPU,
+
+ 338
+ 00:19:37,700 --> 00:19:41,060
+ which makes things much more complicated. I didn't find it.
+
+ 339
+ 00:19:41,620 --> 00:19:43,060
+ And I was about to give up and I said,
+
+ 340
+ 00:19:43,140 --> 00:19:45,540
+ all right, let me just give Deepgram's Linux thing
+
+ 341
+ 00:19:46,020 --> 00:19:49,300
+ a shot. And if this doesn't work, I'm just going
+
+ 342
+ 00:19:49,300 --> 00:19:51,060
+ to go back to trying to vibe code something myself.
+
+ 343
+ 00:19:51,700 --> 00:19:55,540
+ And when I ran the script, I was using Claude
+
+ 344
+ 00:19:55,620 --> 00:19:59,140
+ Code to do the installation process. It ran the script
+
1377
+ 345
1378
+ 00:19:59,220 --> 00:20:02,100
1379
+ and oh my gosh, it works just like that. The
1380
+
1381
+ 346
1382
+ 00:20:02,180 --> 00:20:06,060
1383
+ tricky thing, for all those who want to know all
1384
+
1385
+ 347
1386
+ 00:20:06,060 --> 00:20:11,340
1387
+ the nitty gritty details, was that I
1388
+
1389
+ 348
1390
+ 00:20:11,340 --> 00:20:14,460
1391
+ don't think it was actually struggling with transcription, but pasting
1392
+
1393
+ 349
1394
+ 00:20:14,780 --> 00:20:18,220
1395
+ Wayland makes life very hard. And I think there was
1396
+
1397
+ 350
1398
+ 00:20:18,300 --> 00:20:21,580
1399
+ something not running at the right time. Anyway, Deepgram, I looked
1400
+
1401
+ 351
1402
+ 00:20:21,580 --> 00:20:23,900
1403
+ at how they actually handled that because it worked out
1404
+
1405
+ 352
1406
+ 00:20:23,980 --> 00:20:26,620
1407
+ of the box when other stuff didn't. And it was
1408
+
1409
+ 353
1410
+ 00:20:27,180 --> 00:20:30,650
1411
+ quite a clever little mechanism. And but more so than
1412
+
1413
+ 354
1414
+ 00:20:30,730 --> 00:20:33,370
1415
+ that, the accuracy was brilliant. Now, what am I doing
1416
+
1417
+ 355
1418
+ 00:20:33,370 --> 00:20:36,010
1419
+ here? This is going to be a 20 minute audio
1420
+
1421
+ 356
1422
+ 00:20:36,570 --> 00:20:42,090
1423
+ sample. And I think I've done one or two
1424
+
1425
+ 357
1426
+ 00:20:42,250 --> 00:20:46,650
1427
+ of these before, but I did it with short snappy
1428
+
1429
+ 358
1430
+ 00:20:46,810 --> 00:20:49,850
1431
+ voice notes. This is kind of long form. This actually
1432
+
1433
+ 359
1434
+ 00:20:50,090 --> 00:20:52,250
1435
+ might be a better approximation for what's useful to me
1436
+
1437
+ 360
1438
+ 00:20:52,410 --> 00:20:55,970
1439
+ than voice memos. Like, I need to buy three bread,
1440
+
1441
+ 361
1442
+ 00:20:56,050 --> 00:20:58,690
1443
+ liters of milk tomorrow and pita bread, which is probably
1444
+
1445
+ 362
1446
+ 00:20:58,850 --> 00:21:01,410
1447
+ how like half my voice notes sound. Like if anyone
1448
+
1449
+ 363
1450
+ 00:21:01,890 --> 00:21:04,130
1451
+ were to, I don't know, like find my phone, they'd
1452
+
1453
+ 364
1454
+ 00:21:04,130 --> 00:21:05,650
1455
+ be like, this is the most boring person in the
1456
+
1457
+ 365
1458
+ 00:21:05,650 --> 00:21:09,410
1459
+ world. Although actually, there are some like kind of journaling
1460
+
1461
+ 366
1462
+ 00:21:09,410 --> 00:21:11,570
1463
+ thoughts as well, but it's a lot of content like
1464
+
1465
+ 367
1466
+ 00:21:11,570 --> 00:21:14,530
1467
+ that. And the probably for the evaluation, the most useful
1468
+
1469
+ 368
1470
+ 00:21:14,610 --> 00:21:20,290
1471
+ thing is slightly obscure tech, GitHub, NeocleNo, Hugging
1472
+
1473
+ 369
1474
+ 00:21:20,370 --> 00:21:23,020
1475
+ Face. Not so obscure that it's not going to have
1476
+
1477
+ 370
1478
+ 00:21:23,100 --> 00:21:26,540
1479
+ a chance of knowing it, but hopefully sufficiently well known
1480
+
1481
+ 371
1482
+ 00:21:26,540 --> 00:21:28,780
1483
+ that the model should get it. I tried to do
1484
+
1485
+ 372
1486
+ 00:21:28,860 --> 00:21:31,660
1487
+ a little bit of speaking really fast and speaking very
1488
+
1489
+ 373
1490
+ 00:21:31,820 --> 00:21:35,100
1491
+ slowly. I would say in general, I've spoken, delivered this
1492
+
1493
+ 374
1494
+ 00:21:35,260 --> 00:21:37,580
1495
+ at a faster pace than I usually would owing to
1496
+
1497
+ 375
1498
+ 00:21:38,060 --> 00:21:42,540
1499
+ strong coffee flowing through my bloodstream. And the thing that
1500
+
1501
+ 376
1502
+ 00:21:42,540 --> 00:21:44,780
1503
+ I'm not going to get in this benchmark is background
1504
+
1505
+ 377
1506
+ 00:21:44,860 --> 00:21:46,540
1507
+ noise, which in my first take that I had to
1508
+
1509
+ 378
1510
+ 00:21:46,540 --> 00:21:49,790
1511
+ get rid of, my wife came in with my son
1512
+
1513
+ 379
1514
+ 00:21:50,110 --> 00:21:52,430
1515
+ and for a goodnight kiss. And that actually would have
1516
+
1517
+ 380
1518
+ 00:21:52,430 --> 00:21:56,590
1519
+ been super helpful to get in because it was non
1520
+
1521
+ 381
1522
+ 00:21:56,670 --> 00:22:00,270
1523
+ diarized or if we had diarization, a female, I could
1524
+
1525
+ 382
1526
+ 00:22:00,270 --> 00:22:02,510
1527
+ say, I want the male voice and that wasn't intended
1528
+
1529
+ 383
1530
+ 00:22:02,510 --> 00:22:05,950
1531
+ for transcription. And we're not going to get background noise
1532
+
1533
+ 384
1534
+ 00:22:06,030 --> 00:22:08,350
1535
+ like people honking their horns, which is something I've done
1536
+
1537
+ 385
1538
+ 00:22:08,510 --> 00:22:11,230
1539
+ in my main data set where I am trying to
1540
+
1541
+ 386
1542
+ 00:22:11,470 --> 00:22:14,420
1543
+ go back to some of my voice notes, annotate them
1544
+
1545
+ 387
1546
+ 00:22:14,660 --> 00:22:16,500
1547
+ and run a benchmark. But this is going to be
1548
+
1549
+ 388
1550
+ 00:22:16,500 --> 00:22:21,780
1551
+ just a pure quick test. And as someone,
1552
+
1553
+ 389
1554
+ 00:22:22,340 --> 00:22:24,740
1555
+ I'm working on a voice note idea. That's my sort
1556
+
1557
+ 390
1558
+ 00:22:24,740 --> 00:22:28,740
1559
+ of end motivation. Besides thinking it's an ask to the
1560
+
1561
+ 391
1562
+ 00:22:28,740 --> 00:22:32,420
1563
+ outstanding technology that's coming to viability. And really, I know
1564
+
1565
+ 392
1566
+ 00:22:32,500 --> 00:22:36,020
1567
+ this sounds cheesy, can actually have a very transformative effect.
1568
+
1569
+ 393
1570
+ 00:22:37,060 --> 00:22:41,210
1571
+ It's, you know, voice technology has been life changing for
1572
+
1573
+ 394
1574
+ 00:22:42,010 --> 00:22:47,050
1575
+ folks living with disabilities. And I think
1576
+
1577
+ 395
1578
+ 00:22:47,210 --> 00:22:49,050
1579
+ there's something really nice about the fact that it can
1580
+
1581
+ 396
1582
+ 00:22:49,210 --> 00:22:52,570
1583
+ also benefit, you know, folks who are able bodied and
1584
+
1585
+ 397
1586
+ 00:22:52,730 --> 00:22:57,770
1587
+ like we can all in different ways make this tech
1588
+
1589
+ 398
1590
+ 00:22:57,850 --> 00:23:00,490
1591
+ as useful as possible, regardless of the exact way that
1592
+
1593
+ 399
1594
+ 00:23:00,490 --> 00:23:03,850
1595
+ we're using it. And I think there's something very powerful
1596
+
1597
+ 400
1598
+ 00:23:03,930 --> 00:23:06,520
1599
+ in that and it can be very cool. I see
1600
+
1601
+ 401
1602
+ 00:23:06,680 --> 00:23:10,280
1603
+ huge potential. What excites me about Voicetech? A lot of
1604
+
1605
+ 402
1606
+ 00:23:10,360 --> 00:23:14,440
1607
+ things actually. Firstly, the fact that it's cheap and accurate,
1608
+
1609
+ 403
1610
+ 00:23:14,520 --> 00:23:17,160
1611
+ as I mentioned at the very start of this. And
1612
+
1613
+ 404
1614
+ 00:23:17,320 --> 00:23:19,960
1615
+ it's getting better and better with stuff like accent handling.
1616
+
1617
+ 405
1618
+ 00:23:20,760 --> 00:23:23,480
1619
+ I'm not sure my fine-tune will actually ever come to
1620
+
1621
+ 406
1622
+ 00:23:23,560 --> 00:23:25,400
1623
+ fruition in the sense that I'll use it day to
1624
+
1625
+ 407
1626
+ 00:23:25,480 --> 00:23:28,920
1627
+ day as I imagine. I get like superb flawless word
1628
+
1629
+ 408
1630
+ 00:23:29,000 --> 00:23:33,420
1631
+ error rates because I'm just kind of skeptical about local
1632
+
1633
+ 409
1634
+ 00:23:33,580 --> 00:23:37,180
1635
+ speech to text, as I mentioned, and I think the
1636
+
1637
+ 410
1638
+ 00:23:37,260 --> 00:23:40,780
1639
+ pace of innovation and improvement in the models, the main
1640
+
1641
+ 411
1642
+ 00:23:40,940 --> 00:23:44,700
1643
+ reasons for fine tuning from what I've seen have been
1644
+
1645
+ 412
1646
+ 00:23:44,860 --> 00:23:47,500
1647
+ people who are something that really blows my mind about
1648
+
1649
+ 413
1650
+ 00:23:48,060 --> 00:23:53,180
1651
+ ASR is the idea that it's inherently a lingual or
1652
+
1653
+ 414
1654
+ 00:23:53,340 --> 00:23:58,650
1655
+ multilingual phonetic based. So as folks who use speak
1656
+
1657
+ 415
1658
+ 00:23:58,970 --> 00:24:02,330
1659
+ very obscure languages, that there might be a paucity of
1660
+
1661
+ 416
1662
+ 00:24:02,330 --> 00:24:04,970
1663
+ training data or almost none at all, and therefore the
1664
+
1665
+ 417
1666
+ 00:24:04,970 --> 00:24:10,170
1667
+ accuracy is significantly reduced. Or folks in very critical
1668
+
1669
+ 418
1670
+ 00:24:10,410 --> 00:24:14,330
1671
+ environments, I know this is used extensively in medical transcription
1672
+
1673
+ 419
1674
+ 00:24:14,410 --> 00:24:19,210
1675
+ and dispatcher work, the call centers who send out ambulances,
1676
+
1677
+ 420
1678
+ 00:24:19,290 --> 00:24:23,210
1679
+ et cetera, where accuracy is absolutely paramount. And in the
1680
+
1681
+ 421
1682
+ 00:24:23,210 --> 00:24:26,940
1683
+ case of doctors, radiologists, they might be using very specialized
1684
+
1685
+ 422
1686
+ 00:24:26,940 --> 00:24:29,500
1687
+ vocab all the time. So those are kind of the
1688
+
1689
+ 423
1690
+ 00:24:29,580 --> 00:24:31,500
1691
+ main two things that I'm not sure that really just
1692
+
1693
+ 424
1694
+ 00:24:31,580 --> 00:24:35,020
1695
+ for trying to make it better on a few random
1696
+
1697
+ 425
1698
+ 00:24:35,020 --> 00:24:37,980
1699
+ tech words with my slightly, I mean, I have an
1700
+
1701
+ 426
1702
+ 00:24:38,060 --> 00:24:41,100
1703
+ accent, but like not, you know, an accent that a
1704
+
1705
+ 427
1706
+ 00:24:41,180 --> 00:24:45,980
1707
+ few other million people have ish. I'm not sure that
1708
+
1709
+ 428
1710
+ 00:24:46,460 --> 00:24:50,380
1711
+ my little fine tune is gonna actually like the bump
1712
+
1713
+ 429
1714
+ 00:24:50,540 --> 00:24:53,580
1715
+ in word error reduction, if I ever actually figure out
1716
+
1717
+ 430
1718
+ 00:24:53,580 --> 00:24:54,700
1719
+ how to do it and get it up to the
1720
+
1721
+ 431
1722
+ 00:24:54,780 --> 00:24:57,950
1723
+ cloud. By the time we've done that, I suspect that
1724
+
1725
+ 432
1726
+ 00:24:58,270 --> 00:25:00,510
1727
+ the next generation of ASR will just be so good
1728
+
1729
+ 433
1730
+ 00:25:00,590 --> 00:25:03,070
1731
+ that it will kind of be, well, that would have
1732
+
1733
+ 434
1734
+ 00:25:03,070 --> 00:25:04,750
1735
+ been cool if it worked out, but I'll just use
1736
+
1737
+ 435
1738
+ 00:25:04,830 --> 00:25:08,590
1739
+ this instead. So that's going to be it for today's
1740
+
1741
+ 436
1742
+ 00:25:08,910 --> 00:25:14,110
1743
+ episode of voice training data. Single long shot evaluation.
1744
+
1745
+ 437
1746
+ 00:25:14,430 --> 00:25:17,230
1747
+ Who am I going to compare? Whisper is always good
1748
+
1749
+ 438
1750
+ 00:25:17,230 --> 00:25:20,590
1751
+ as a benchmark, but I'm more interested in seeing Whisper
1752
+
1753
+ 439
1754
+ 00:25:20,670 --> 00:25:24,590
1755
+ head to head with two things, really. One is Whisper
1756
+
1757
+ 440
1758
+ 00:25:24,670 --> 00:25:29,780
1759
+ variants. So you've got these projects like Faster-Whisper, Distil-Whisper,
1760
+
1761
+ 441
1762
+ 00:25:29,860 --> 00:25:31,780
1763
+ it's a bit confusing, there's a whole bunch of them.
1764
+
1765
+ 442
1766
+ 00:25:32,100 --> 00:25:35,380
1767
+ And the emerging ASRs, which are also a thing. My
1768
+
1769
+ 443
1770
+ 00:25:35,460 --> 00:25:37,300
1771
+ intention for this is I'm not sure I'm going to
1772
+
1773
+ 444
1774
+ 00:25:37,300 --> 00:25:39,940
1775
+ have the time in any point in the foreseeable future
1776
+
1777
+ 445
1778
+ 00:25:40,260 --> 00:25:44,660
1779
+ to go back through this whole episode and create a
1780
+
1781
+ 446
1782
+ 00:25:44,740 --> 00:25:49,780
1783
+ proper source of truth, where I fix everything. Might do
1784
+
1785
+ 447
1786
+ 00:25:49,860 --> 00:25:52,820
1787
+ it if I can get one transcription that's sufficiently close
1788
+
1789
+ 448
1790
+ 00:25:53,060 --> 00:25:57,120
1791
+ to perfection. But what I would actually love to do
1792
+
1793
+ 449
1794
+ 00:25:57,280 --> 00:26:00,000
1795
+ on Hugging Face, I think would be a great probably
1796
+
1797
+ 450
1798
+ 00:26:00,320 --> 00:26:02,960
1799
+ how I might visualize this is having the audio waveform
1800
+
1801
+ 451
1802
+ 00:26:03,280 --> 00:26:08,240
1803
+ play and then have the transcript for each model below
1804
+
1805
+ 452
1806
+ 00:26:08,240 --> 00:26:12,640
1807
+ it and maybe even a like, you know, to scale
1808
+
1809
+ 453
1810
+ 00:26:13,200 --> 00:26:15,680
1811
+ and maybe even a local one as well, like local
1812
+
1813
+ 454
1814
+ 00:26:15,840 --> 00:26:21,180
1815
+ Whisper versus OpenAI API, et cetera. And I
1816
+
1817
+ 455
1818
+ 00:26:21,260 --> 00:26:23,580
1819
+ can then actually listen back to segments or anyone who
1820
+
1821
+ 456
1822
+ 00:26:23,580 --> 00:26:25,900
1823
+ wants to can listen back to segments of this recording
1824
+
1825
+ 457
1826
+ 00:26:26,220 --> 00:26:31,020
1827
+ and see where a particular model struggled and others didn't,
1828
+
1829
+ 458
1830
+ 00:26:31,500 --> 00:26:33,420
1831
+ as well as the sort of headline finding of which
1832
+
1833
+ 459
1834
+ 00:26:33,580 --> 00:26:36,940
1835
+ had the best WER, but that would require the source
1836
+
1837
+ 460
1838
+ 00:26:36,940 --> 00:26:39,660
1839
+ of truth. Okay, that's it. I hope this was, I
1840
+
1841
+ 461
1842
+ 00:26:39,660 --> 00:26:42,620
1843
+ don't know, maybe useful for other folks interested in STT.
1844
+
1845
+ 462
1846
+ 00:26:42,940 --> 00:26:45,740
1847
+ You want to see that I always feel think I've
1848
+
1849
+ 463
1850
+ 00:26:45,740 --> 00:26:48,950
1851
+ just said as something I didn't intend to. STT, I
1852
+
1853
+ 464
1854
+ 00:26:48,950 --> 00:26:52,550
1855
+ said for those. Listen carefully, including hopefully the models themselves.
1856
+
1857
+ 465
1858
+ 00:26:53,270 --> 00:26:57,350
1859
+ This has been myself, Daniel Rosehill. For more jumbled repositories
1860
+
1861
+ 466
1862
+ 00:26:57,430 --> 00:27:01,830
1863
+ about my roving interests in AI, but particularly agentic, MCP
1864
+
1865
+ 467
1866
+ 00:27:02,070 --> 00:27:07,109
1867
+ and Voicetech, you can find me on GitHub, huggingface.com,
1868
+
1869
+ 468
1870
+ 00:27:10,310 --> 00:27:13,350
1871
+ which is my personal website, as well as this podcast,
1872
+
1873
+ 469
1874
+ 00:27:13,590 --> 00:27:17,030
1875
+ whose name I sadly cannot remember. Until next time, thanks
1876
+
1877
+ 470
1878
+ 00:27:17,030 --> 00:27:17,590
1879
+ for listening.
1880
+
srt-out/gladia.srt ADDED
@@ -0,0 +1,2003 @@
1
+ 1
2
+ 00:00:00.172 --> 00:00:15.108
3
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast or it uh i may append this to a podcast that i set up recently um
4
+
5
+ 2
6
+ 00:00:15.467 --> 00:00:29.435
7
+ regarding my uh with my thoughts on speech tech and ai in particular more ai and generative ai i would uh i would say but in any event the purpose of this um
8
+
9
+ 3
10
+ 00:00:30.219 --> 00:00:36.545
11
+ voice recording is actually to create a lengthy voice sample for a quick evaluation,
12
+
13
+ 4
14
+ 00:00:36.546 --> 00:00:38.088
15
+ a back of the envelope evaluation,
16
+
17
+ 5
18
+ 00:00:38.390 --> 00:00:39.148
19
+ as they might say,
20
+
21
+ 6
22
+ 00:00:39.749 --> 00:00:41.273
23
+ for different speech to text models.
24
+
25
+ 7
26
+ 00:00:41.274 --> 00:00:42.195
27
+ And I'm doing this because
28
+
29
+ 8
30
+ 00:00:42.975 --> 00:00:46.655
31
+ I thought I'd made a great breakthrough in my journey with speech tech,
32
+
33
+ 9
34
+ 00:00:47.234 --> 00:00:50.999
35
+ and that was succeeding in the elusive task of fine tuning Whisper.
36
+
37
+ 10
38
+ 00:00:51.780 --> 00:00:52.655
39
+ Whisper is,
40
+
41
+ 11
42
+ 00:00:52.920 --> 00:00:58.890
43
+ and I'm going to just talk i'm trying to mix up uh i'm going to try a few different
44
+
45
+ 12
46
+ 00:00:59.524 --> 00:01:18.581
47
+ styles of speaking i might whisper something at some points as well and i'll go back to speaking loud in uh in different parts i'm going to sound really like a crazy person because i'm also going to try to speak at different pitches and cadences in order to really try to put a
48
+
49
+ 13
50
+ 00:01:18.706 --> 00:01:28.831
51
+ speech attacks model through its paces which is trying to make sense of is this guy just rambling on incoherently in one long sentence or are these
52
+
53
+ 14
54
+ 00:01:29.652 --> 00:01:33.436
55
+ just actually a series of step,
56
+
57
+ 15
58
+ 00:01:33.734 --> 00:01:34.355
59
+ standalone,
60
+
61
+ 16
62
+ 00:01:34.415 --> 00:01:34.918
63
+ step alone,
64
+
65
+ 17
66
+ 00:01:35.016 --> 00:01:36.200
67
+ standalone sentences.
68
+
69
+ 18
70
+ 00:01:36.519 --> 00:01:38.040
71
+ And how is it going to handle step alone?
72
+
73
+ 19
74
+ 00:01:38.078 --> 00:01:38.680
75
+ That's not a word.
76
+
77
+ 20
78
+ 00:01:39.859 --> 00:01:43.343
79
+ What happens when you use speech to text and you use a fake word?
80
+
81
+ 21
82
+ 00:01:43.367 --> 00:01:43.884
83
+ And then you're like,
84
+
85
+ 22
86
+ 00:01:43.923 --> 00:01:44.063
87
+ wait,
88
+
89
+ 23
90
+ 00:01:44.087 --> 00:01:44.703
91
+ that's not actually,
92
+
93
+ 24
94
+ 00:01:45.468 --> 00:01:46.328
95
+ that word doesn't exist.
96
+
97
+ 25
98
+ 00:01:47.048 --> 00:01:48.266
99
+ How does AI handle that?
100
+
101
+ 26
102
+ 00:01:48.484 --> 00:01:55.359
103
+ And these and more are all the questions that I'm seeking to answer in this training data.
104
+
105
+ 27
106
+ 00:01:56.001 --> 00:01:56.141
107
+ Now,
108
+
109
+ 28
110
+ 00:01:56.359 --> 00:01:56.718
111
+ why did,
112
+
113
+ 29
114
+ 00:01:56.843 --> 00:01:58.266
115
+ why was it trying to fine tune Whisper?
116
+
117
+ 30
118
+ 00:01:58.787 --> 00:02:16.968
119
+ what is whisper as i said i'm gonna try to uh record this at a couple of different levels of technicality for folks who are uh you know in the normal uh world and not totally stuck down the rabbit hole of ai which i have to say is a really wonderful uh rabbit hole to be to
120
+
121
+ 31
122
+ 00:02:16.969 --> 00:02:27.735
123
+ be down um it's a really interesting area and speech and voice tech is is the aspect of it that i find actually most i'm not sure i would say the most interesting because there's
124
+
125
+ 32
126
+ 00:02:28.147 --> 00:02:30.349
127
+ Just so much that is fascinating in AI.
128
+
129
+ 33
130
+ 00:02:31.372 --> 00:02:41.520
131
+ But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work.
132
+
133
+ 34
134
+ 00:02:42.082 --> 00:02:42.379
135
+ And
136
+
137
+ 35
138
+ 00:02:43.183 --> 00:02:47.230
139
+ I'm persevering hard with the task of training,
140
+
141
+ 36
142
+ 00:02:47.231 --> 00:02:47.527
143
+ I guess,
144
+
145
+ 37
146
+ 00:02:47.730 --> 00:02:49.762
147
+ a good solution working for Linux,
148
+
149
+ 38
150
+ 00:02:50.122 --> 00:02:51.683
151
+ which if anyone actually does listen to this,
152
+
153
+ 39
154
+ 00:02:51.777 --> 00:02:54.355
155
+ not just for the training data and for the actual content,
156
+
157
+ 40
158
+ 00:02:55.247 --> 00:02:56.497
159
+ this is this is sparked.
160
+
161
+ 41
162
+ 00:02:56.762 --> 00:02:57.044
163
+ I had
164
+
165
+ 42
166
+ 00:02:58.056 --> 00:03:13.229
167
+ besides the fine-tune not working well that was the failure um i used plod code because one thinks these days that there is nothing short of solving you know the uh the
168
+
169
+ 43
170
+ 00:03:13.368 --> 00:03:24.518
171
+ reason of life or something uh that plod and agentic ai can't do uh which is not really the case uh it does seem that way sometimes but it fails a lot as well and this is one of those
172
+
173
+ 44
174
+ 00:03:25.304 --> 00:03:29.768
175
+ instances where last week I put together an hour of voice training data,
176
+
177
+ 45
178
+ 00:03:30.528 --> 00:03:31.229
179
+ basically speaking,
180
+
181
+ 46
182
+ 00:03:31.271 --> 00:03:33.174
183
+ just random things for three minutes.
184
+
185
+ 47
186
+ 00:03:33.407 --> 00:03:38.618
187
+ And it was actually kind of tedious because the texts were really weird.
188
+
189
+ 48
190
+ 00:03:38.674 --> 00:03:39.174
191
+ Some of them were,
192
+
193
+ 49
194
+ 00:03:39.556 --> 00:03:40.080
195
+ it was like,
196
+
197
+ 50
198
+ 00:03:40.361 --> 00:03:40.596
199
+ it was
200
+
201
+ 51
202
+ 00:03:41.127 --> 00:03:41.939
203
+ AI generated.
204
+
205
+ 52
206
+ 00:03:42.721 --> 00:03:45.518
207
+ I tried before to read Sherlock Holmes for an hour and I just couldn't,
208
+
209
+ 53
210
+ 00:03:45.564 --> 00:03:48.893
211
+ I was so bored after 10 minutes that I was like,
212
+
213
+ 54
214
+ 00:03:48.894 --> 00:03:49.064
215
+ okay,
216
+
217
+ 55
218
+ 00:03:49.066 --> 00:03:51.705
219
+ I know I'm just going to have to find something else to read.
220
+
221
+ 56
222
+ 00:03:51.752 --> 00:03:51.877
223
+ So
224
+
225
+ 57
226
+ 00:03:52.907 --> 00:03:53.705
227
+ I used...
228
+
229
+ 58
230
+ 00:03:54.207 --> 00:04:11.201
231
+ a created with AI studio vibe coded a synthetic text generator which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content so I was like okay give me a
232
+
233
+ 59
234
+ 00:04:11.248 --> 00:04:22.858
235
+ voice note like I'm recording an email give me a short story to read give me prose to read so it came up with all these different things and they added a little timer to it so I could see.
236
+
237
+ 60
238
+ 00:04:23.295 --> 00:04:50.961
239
+ how close i was to one hour um and uh i spent like an hour one afternoon or probably two hours by the time you um you do retakes and whatever because you want to it gave me a source of truth which i'm not sure if that's the scientific way to approach this topic of gathering uh training data but i thought made sense um i have a lot of audio data from recording voice notes which I've also kind of used
240
+
241
+ 61
242
+ 00:04:52.117 --> 00:04:52.384
243
+ Bean.
244
+
245
+ 62
246
+ 00:04:52.755 --> 00:05:02.007
247
+ experimenting with using for a different purpose slightly different annotating task types it's more text classification experiment or
248
+
249
+ 63
250
+ 00:05:02.836 --> 00:05:02.956
251
+ Well,
252
+
253
+ 64
254
+ 00:05:02.956 --> 00:05:03.497
255
+ it's more than that,
256
+
257
+ 65
258
+ 00:05:03.536 --> 00:05:03.776
259
+ actually.
260
+
261
+ 66
262
+ 00:05:03.778 --> 00:05:04.857
263
+ I'm working on a voice app.
264
+
265
+ 67
266
+ 00:05:04.937 --> 00:05:07.660
267
+ So it's a prototype,
268
+
269
+ 68
270
+ 00:05:07.680 --> 00:05:07.980
271
+ I guess,
272
+
273
+ 69
274
+ 00:05:08.019 --> 00:05:09.000
275
+ is really more accurate.
276
+
277
+ 70
278
+ 00:05:11.382 --> 00:05:13.805
279
+ But you can do that and you can work backwards.
280
+
281
+ 71
282
+ 00:05:13.843 --> 00:05:14.187
283
+ You're like,
284
+
285
+ 72
286
+ 00:05:14.343 --> 00:05:19.757
287
+ you listen back to a voice note and you painfully go through one of those transcribing,
288
+
289
+ 73
290
+ 00:05:19.992 --> 00:05:20.226
291
+ you know,
292
+
293
+ 74
294
+ 00:05:20.274 --> 00:05:23.413
295
+ where you start and stop and scrub around it and you fix the errors.
296
+
297
+ 75
298
+ 00:05:23.415 --> 00:05:24.117
299
+ But it's really,
300
+
301
+ 76
302
+ 00:05:24.180 --> 00:05:25.538
303
+ really boring to do that.
304
+
305
+ 77
306
+ 00:05:26.163 --> 00:05:31.680
307
+ So I thought it would be less tedious in the long term if I just recorded the source of truth.
308
+
309
+ 78
310
+ 00:05:32.247 --> 00:05:34.190
311
+ So it gave me these three minute snippets.
312
+
313
+ 79
314
+ 00:05:34.428 --> 00:05:38.593
315
+ I recorded them and saved an MP3 and a TXT in the same folder.
316
+
317
+ 80
318
+ 00:05:38.855 --> 00:05:40.500
319
+ And I created an error of that data.
320
+
321
+ 81
322
+ 00:05:41.975 --> 00:05:43.038
323
+ So I was very hopeful,
324
+
325
+ 82
326
+ 00:05:43.398 --> 00:05:43.781
327
+ quietly,
328
+
329
+ 83
330
+ 00:05:43.898 --> 00:05:44.117
331
+ you know,
332
+
333
+ 84
334
+ 00:05:44.117 --> 00:05:47.725
335
+ a little bit hopeful that I would be able that I could actually fine tune Whisper.
336
+
337
+ 85
338
+ 00:05:48.586 --> 00:05:53.100
339
+ I want to fine tune Whisper because when I got into voice tech last November,
340
+
341
+ 86
342
+ 00:05:54.242 --> 00:05:57.538
343
+ my wife was in the US and I was alone at home and,
344
+
345
+ 87
346
+ 00:05:57.819 --> 00:05:58.053
347
+ you know,
348
+
349
+ 88
350
+ 00:05:58.069 --> 00:05:59.117
351
+ went crazy.
352
+
353
+ 89
354
+ 00:05:59.444 --> 00:06:12.454
355
+ people like me do really wild things like use voice to tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high
356
+
357
+ 90
358
+ 00:06:13.336 --> 00:06:26.509
359
+ I used speech tech now and again tried it out I was like it'd be really cool if you could just like speak into your computer and whatever I tried out that had support was just it was not good basically
360
+
361
+ 91
362
+ 00:06:27.500 --> 00:06:29.440
363
+ And this blew me away from the first go.
364
+
365
+ 92
366
+ 00:06:29.480 --> 00:06:29.701
367
+ I mean,
368
+
369
+ 93
370
+ 00:06:29.701 --> 00:06:30.860
371
+ it wasn't 100%
372
+
373
+ 94
374
+ 00:06:31.841 --> 00:06:33.360
375
+ accurate out of the box and it took work,
376
+
377
+ 95
378
+ 00:06:33.942 --> 00:06:41.302
379
+ but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this.
380
+
381
+ 96
382
+ 00:06:41.942 --> 00:06:42.185
383
+ You know,
384
+
385
+ 97
386
+ 00:06:42.185 --> 00:06:46.418
387
+ there's a point where it's so like the transcript is you don't have to get 100%
388
+
389
+ 98
390
+ 00:06:46.482 --> 00:06:48.262
391
+ accuracy for it to be worth your time,
392
+
393
+ 99
394
+ 00:06:49.091 --> 00:06:51.668
395
+ for a speech to text to be a worthwhile addition to your productivity.
396
+
397
+ 100
398
+ 00:06:51.778 --> 00:06:53.043
399
+ But you do need to get above,
400
+
401
+ 101
402
+ 00:06:53.091 --> 00:06:53.418
403
+ let's say,
404
+
405
+ 102
406
+ 00:06:53.528 --> 00:06:53.887
407
+ I don't know,
408
+
409
+ 103
410
+ 00:06:53.966 --> 00:06:54.451
411
+ 85%.
412
+
413
+ 104
414
+ 00:06:54.466 --> 00:06:54.887
415
+ percent.
416
+
417
+ 105
418
+ 00:06:55.711 --> 00:06:56.651
419
+ If it's 60%
420
+
421
+ 106
422
+ 00:06:57.031 --> 00:06:57.413
423
+ or 50%,
424
+
425
+ 107
426
+ 00:06:57.793 --> 00:06:58.692
427
+ you inevitably say,
428
+
429
+ 108
430
+ 00:06:59.173 --> 00:06:59.512
431
+ screw it,
432
+
433
+ 109
434
+ 00:06:59.514 --> 00:07:05.033
435
+ I'll just type it because you end up missing errors in the transcript and it becomes actually worse.
436
+
437
+ 110
438
+ 00:07:05.110 --> 00:07:06.978
439
+ You end up in a worse position than you started with it.
440
+
441
+ 111
442
+ 00:07:06.978 --> 00:07:07.915
443
+ That's been my experience.
444
+
445
+ 112
446
+ 00:07:08.555 --> 00:07:08.673
447
+ So
448
+
449
+ 113
450
+ 00:07:10.572 --> 00:07:10.915
451
+ I was like,
452
+
453
+ 114
454
+ 00:07:10.994 --> 00:07:11.134
455
+ oh,
456
+
457
+ 115
458
+ 00:07:11.158 --> 00:07:12.228
459
+ this is actually really,
460
+
461
+ 116
462
+ 00:07:12.274 --> 00:07:12.838
463
+ really good now.
464
+
465
+ 117
466
+ 00:07:12.930 --> 00:07:13.555
467
+ How did that happen?
468
+
469
+ 118
470
+ 00:07:13.603 --> 00:07:15.040
471
+ And the answer is ASR,
472
+
473
+ 119
474
+ 00:07:15.680 --> 00:07:20.072
475
+ Whisper being open sourced and the transformer architecture.
476
+
477
+ 120
478
+ 00:07:20.072 --> 00:07:21.619
479
+ If you want to go back to the
480
+
481
+ 121
482
+ 00:07:23.319 --> 00:07:24.120
483
+ to the underpinnings,
484
+
485
+ 122
486
+ 00:07:24.139 --> 00:07:25.660
487
+ which really blows my mind.
488
+
489
+ 123
490
+ 00:07:25.920 --> 00:07:29.480
491
+ And it's on my list to read through that paper.
492
+
493
+ 124
494
+ 00:07:30.422 --> 00:07:38.444
495
+ All you need is attention as attentively as can be done with my limited brain because it's super,
496
+
497
+ 125
498
+ 00:07:38.500 --> 00:07:39.819
499
+ super high level stuff.
500
+
501
+ 126
502
+ 00:07:41.461 --> 00:07:42.350
503
+ Super advanced stuff,
504
+
505
+ 127
506
+ 00:07:42.367 --> 00:07:42.678
507
+ I mean.
508
+
509
+ 128
510
+ 00:07:43.100 --> 00:07:44.100
511
+ But that,
512
+
513
+ 129
514
+ 00:07:44.507 --> 00:07:52.600
515
+ I think of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities.
516
+
517
+ 130
518
+ 00:07:53.507 --> 00:07:55.048
519
+ I find it fascinating that few people are like,
520
+
521
+ 131
522
+ 00:07:55.189 --> 00:07:55.490
523
+ hang on,
524
+
525
+ 132
526
+ 00:07:56.009 --> 00:07:58.994
527
+ you've got this thing that can speak to you like a chatbot,
528
+
529
+ 133
530
+ 00:07:58.995 --> 00:07:59.634
531
+ an LLM.
532
+
533
+ 134
534
+ 00:08:00.576 --> 00:08:02.600
535
+ And then you've got image generation.
536
+
537
+ 135
538
+ 00:08:02.959 --> 00:08:03.076
539
+ OK,
540
+
541
+ 136
542
+ 00:08:03.139 --> 00:08:03.521
543
+ so firstly,
544
+
545
+ 137
546
+ 00:08:03.639 --> 00:08:07.341
547
+ those two things on the surface have nothing in common.
548
+
549
+ 138
550
+ 00:08:08.545 --> 00:08:08.826
551
+ So like,
552
+
553
+ 139
554
+ 00:08:08.904 --> 00:08:09.505
555
+ how are they?
556
+
557
+ 140
558
+ 00:08:10.427 --> 00:08:12.286
559
+ How did that just happen all at the same time?
560
+
561
+ 141
562
+ 00:08:12.302 --> 00:08:13.411
563
+ And then when you extend that further,
564
+
565
+ 142
566
+ 00:08:14.944 --> 00:08:15.630
567
+ you're like Suno,
568
+
569
+ 143
570
+ 00:08:16.036 --> 00:08:16.225
571
+ right?
572
+
573
+ 144
574
+ 00:08:16.271 --> 00:08:20.896
575
+ You can sing a song and AI will like come up with an instrumental.
576
+
577
+ 145
578
+ 00:08:21.516 --> 00:08:22.637
579
+ And then you've got Whisper.
580
+
581
+ 146
582
+ 00:08:22.757 --> 00:08:23.077
583
+ And you're like,
584
+
585
+ 147
586
+ 00:08:23.079 --> 00:08:23.699
587
+ wait a second.
588
+
589
+ 148
590
+ 00:08:24.158 --> 00:08:25.201
591
+ How did all this stuff,
592
+
593
+ 149
594
+ 00:08:25.319 --> 00:08:26.598
595
+ like if it's all AI,
596
+
597
+ 150
598
+ 00:08:27.262 --> 00:08:27.603
599
+ what's,
600
+
601
+ 151
602
+ 00:08:27.942 --> 00:08:29.384
603
+ like there has to be some commonality.
604
+
605
+ 152
606
+ 00:08:29.543 --> 00:08:30.161
607
+ Otherwise,
608
+
609
+ 153
610
+ 00:08:30.865 --> 00:08:34.707
611
+ these are totally different technologies on the surface of it.
612
+
613
+ 154
614
+ 00:08:34.888 --> 00:08:37.990
615
+ And the transformer architecture is,
616
+
617
+ 155
618
+ 00:08:38.349 --> 00:08:39.067
619
+ as far as I know,
620
+
621
+ 156
622
+ 00:08:39.240 --> 00:08:40.162
623
+ the answer.
624
+
625
+ 157
626
+ 00:08:40.332 --> 00:08:41.192
627
+ And I can't even say,
628
+
629
+ 158
630
+ 00:08:41.302 --> 00:08:47.287
631
+ can't even pretend that I really understand what the transformer architecture means in depth.
632
+
633
+ 159
634
+ 00:08:47.317 --> 00:08:48.457
635
+ But I have scanned this.
636
+
637
+ 160
638
+ 00:08:48.707 --> 00:08:49.629
639
+ And as I said,
640
+
641
+ 161
642
+ 00:08:49.707 --> 00:08:50.599
643
+ I want to...
644
+
645
+ 162
646
+ 00:08:50.840 --> 00:09:01.552
647
+ printed and really kind of think over it at some point, and I'll probably feel bad about myself, I think, because weren't those guys in their, in their 20s? Like, that's crazy.
648
+
649
+ 163
650
+ 00:09:02.208 --> 00:09:11.177
651
+ I think I asked ChatGPT once who were the, who wrote that paper, and how old were they when it was published on arXiv. And I was expecting, like,
652
+
653
+ 164
654
+ 00:09:11.662 --> 00:09:20.067
655
+ I don't know, what do you, what do you imagine? I personally imagine kind of, like, you know, you have these breakthroughs during COVID and things like that, where, like, these kind of
656
+
657
+ 165
658
+ 00:09:20.543 --> 00:09:22.184
659
+ really obscure scientists who are like in their
660
+
661
+ 166
662
+ 00:09:22.524 --> 00:09:41.356
663
+ 50s, and they've just kind of been laboring in labs, and, uh, wearily, and writing and publishing in kind of obscure academic publications, and they finally, like, hit it big or win a Nobel Prize, and then they're household, household names. Uh, so that was kind of what I had in mind; that was the mental image I'd formed of the
664
+
665
+ 167
666
+ 00:09:41.919 --> 00:09:49.809
667
+ birth of arXiv. Like, I wasn't expecting 20-somethings in San Francisco, though. I, I thought that was both very, very funny, very cool, and actually kind of inspiring.
668
+
669
+ 168
670
+ 00:09:50.580 --> 00:09:52.484
671
+ It's nice to think that people who,
672
+
673
+ 169
674
+ 00:09:53.488 --> 00:09:53.729
675
+ you know,
676
+
677
+ 170
678
+ 00:09:53.927 --> 00:09:56.294
679
+ just you might put them in the kind of.
680
+
681
+ 171
682
+ 00:09:56.966 --> 00:10:12.508
683
+ milieu or bubble or world that you are in, or credibly in through, you know, the series of connections, that are coming up with such literally world-changing, um, innovations. Uh, so that was, I thought, anyway, that's, that, that was cool. Okay,
684
+
685
+ 172
686
+ 00:10:12.570 --> 00:10:24.687
687
+ voice training data. How are we doing? We're about 10 minutes in, and I'm still talking about voice technology. Um, so Whisper was brilliant, and I was so excited that I was, my first instinct was to, like, guess.
688
+
689
+ 173
690
+ 00:10:25.066 --> 00:10:25.326
691
+ It's like,
692
+
693
+ 174
694
+ 00:10:25.326 --> 00:10:25.807
695
+ oh my gosh,
696
+
697
+ 175
698
+ 00:10:25.826 --> 00:10:27.609
699
+ I have to get like a really good microphone for this.
700
+
701
+ 176
702
+ 00:10:28.169 --> 00:10:28.288
703
+ So
704
+
705
+ 177
706
+ 00:10:29.370 --> 00:10:31.471
707
+ I didn't go on a spending spree because I said,
708
+
709
+ 178
710
+ 00:10:31.592 --> 00:10:34.432
711
+ I'm going to have to just wait a month and see if I still use this.
712
+
713
+ 179
714
+ 00:10:35.198 --> 00:10:37.596
715
+ And it just kind of became,
716
+
717
+ 180
718
+ 00:10:38.019 --> 00:10:40.823
719
+ it's become really part of my daily routine.
720
+
721
+ 181
722
+ 00:10:41.863 --> 00:10:43.003
723
+ Like if I'm writing an email,
724
+
725
+ 182
726
+ 00:10:43.269 --> 00:10:44.503
727
+ I'll record a voice note.
728
+
729
+ 183
730
+ 00:10:45.049 --> 00:10:46.284
731
+ And then I've developed.
732
+
733
+ 184
734
+ 00:10:46.784 --> 00:10:50.534
735
+ And it's nice to see that everyone is like developing the same things in parallel.
736
+
737
+ 185
738
+ 00:10:50.566 --> 00:10:52.409
739
+ Like that's kind of a weird thing to say.
740
+
741
+ 186
742
+ 00:10:52.488 --> 00:10:53.549
743
+ But when I look,
744
+
745
+ 187
746
+ 00:10:53.659 --> 00:10:53.769
747
+ I...
748
+
749
+ 188
750
+ 00:10:54.298 --> 00:11:11.754
751
+ kind of came when I started working on this, uh, these prototypes on GitHub, which is where I just kind of share very freely and loosely, uh, ideas and, you know, first iterations on, on concepts. Um, and for want of a better word, I called it, like, uh,
752
+
753
+ 189
754
+ 00:11:11.754 --> 00:11:21.441
755
+ llm post-processing or cleanup or basically a system prompt that after you get back the raw text from whisper you run it through a model and say,
756
+
757
+ 190
758
+ 00:11:21.566 --> 00:11:21.738
759
+ okay,
760
+
761
+ 191
762
+ 00:11:21.784 --> 00:11:22.909
763
+ this is crappy.
764
+
765
+ 192
766
+ 00:11:23.785 --> 00:11:33.653
767
+ text, like, add sentence structure and, you know, fix it up. And now, when I'm exploring the different tools that are out there that people have built,
768
+
769
+ 193
770
+ 00:11:34.216 --> 00:11:49.996
771
+ I see quite a number of projects have basically, you know, done the same thing. Lest that be misconstrued, I'm not saying for a millisecond that I inspired them; I'm sure this has been a thing that's been integrated into tools for a while. But it's,
772
+
773
+ 194
774
+ 00:11:50.710 --> 00:11:53.312
775
+ It's the kind of thing that when you start using these tools every day,
776
+
777
+ 195
778
+ 00:11:53.613 --> 00:12:02.100
779
+ the need for it is almost instantly apparent because text that doesn't have any punctuation or paragraph spacing takes a long time to,
780
+
781
+ 196
782
+ 00:12:02.842 --> 00:12:03.086
783
+ you know,
784
+
785
+ 197
786
+ 00:12:03.163 --> 00:12:06.023
787
+ it takes so long to get it into a presentable email that again,
788
+
789
+ 198
790
+ 00:12:06.086 --> 00:12:06.241
791
+ it's,
792
+
793
+ 199
794
+ 00:12:06.428 --> 00:12:06.600
795
+ it's,
796
+
797
+ 200
798
+ 00:12:06.788 --> 00:12:06.928
799
+ it,
800
+
801
+ 201
802
+ 00:12:07.086 --> 00:12:13.006
803
+ it moves speech tech into that before that inflection point where you're like,
804
+
805
+ 202
806
+ 00:12:13.008 --> 00:12:13.131
807
+ nah,
808
+
809
+ 203
810
+ 00:12:13.133 --> 00:12:13.836
811
+ it's just not worth it.
812
+
813
+ 204
814
+ 00:12:13.850 --> 00:12:14.491
815
+ It's like,
816
+
817
+ 205
818
+ 00:12:15.178 --> 00:12:16.898
819
+ it'll just be quicker to type this.
820
+
821
+ 206
822
+ 00:12:17.428 --> 00:12:18.336
823
+ So it's a big,
824
+
825
+ 207
826
+ 00:12:18.350 --> 00:12:19.461
827
+ it's a little touch that actually.
828
+
829
+ 208
830
+ 00:12:20.289 --> 00:12:20.791
831
+ is a big deal.
832
+
833
+ 209
834
+ 00:12:21.672 --> 00:12:22.373
835
+ So I was on
836
+
837
+ 210
838
+ 00:12:22.712 --> 00:12:28.100
839
+ Whisper and I've been using Whisper and I kind of early on found a couple of tools.
840
+
841
+ 211
842
+ 00:12:28.458 --> 00:12:30.419
843
+ I couldn't find what I was looking for on Linux,
844
+
845
+ 212
846
+ 00:12:30.498 --> 00:12:35.725
847
+ which is basically just something that'll run in the background.
848
+
849
+ 213
850
+ 00:12:36.044 --> 00:12:43.873
851
+ You'll give it an API key and it will just like transcribe with like a little key to start and stop the dictation.
852
+
853
+ 214
854
+ 00:12:45.248 --> 00:12:47.061
855
+ And the issues were I discovered
856
+
857
+ 215
858
+ 00:12:47.241 --> 00:13:06.619
859
+ that, like, most people involved in creating these projects were very much focused on local models, running Whisper locally, because you can. And I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech-to-text APIs and what I was spending, I
860
+
861
+ 216
862
+ 00:13:06.682 --> 00:13:16.104
863
+ just thought, there is, it's actually, in my opinion, just one of the better deals in API spending and in cloud. Like, it's just not that expensive for very, very good models
864
+
865
+ 217
866
+ 00:13:16.730 --> 00:13:18.470
867
+ That are much more,
868
+
869
+ 218
870
+ 00:13:19.070 --> 00:13:19.291
871
+ you know,
872
+
873
+ 219
874
+ 00:13:19.292 --> 00:13:20.688
875
+ you're going to be able to run the full model,
876
+
877
+ 220
878
+ 00:13:21.572 --> 00:13:24.916
879
+ the latest model versus whatever you can run on your average
880
+
881
+ 221
882
+ 00:13:25.533 --> 00:13:28.711
883
+ GPU, unless you want to buy a crazy GPU.
884
+
885
+ 222
886
+ 00:13:28.751 --> 00:13:29.892
887
+ It doesn't really make sense to me.
888
+
889
+ 223
890
+ 00:13:30.033 --> 00:13:39.619
891
+ Privacy is another concern that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment,
892
+
893
+ 224
894
+ 00:13:40.352 --> 00:13:42.197
895
+ maybe for regulatory reasons as well.
896
+
897
+ 225
898
+ 00:13:42.916 --> 00:13:43.727
899
+ But I'm not in that.
900
+
901
+ 226
902
+ 00:13:44.291 --> 00:13:45.744
903
+ I neither really care
904
+
905
+ 227
906
+ 00:13:46.118 --> 00:13:52.018
907
+ about people listening to my grocery list consisting of reminding myself that I need to buy more beer,
908
+
909
+ 228
910
+ 00:13:52.619 --> 00:13:53.721
911
+ Cheetos and hummus,
912
+
913
+ 229
914
+ 00:13:53.759 --> 00:13:54.716
915
+ which is kind of the three,
916
+
917
+ 230
918
+ 00:13:55.264 --> 00:13:59.240
919
+ three staples of my diet during periods of poor nutrition.
920
+
921
+ 231
922
+ 00:14:00.020 --> 00:14:01.458
923
+ But the kind of stuff that I transcribe,
924
+
925
+ 232
926
+ 00:14:01.498 --> 00:14:02.153
927
+ it's just not,
928
+
929
+ 233
930
+ 00:14:02.154 --> 00:14:03.106
931
+ it's not a,
932
+
933
+ 234
934
+ 00:14:04.248 --> 00:14:08.059
935
+ it's not a privacy thing I'm that sort of sensitive about.
936
+
937
+ 235
938
+ 00:14:08.356 --> 00:14:08.748
939
+ And
940
+
941
+ 236
942
+ 00:14:09.606 --> 00:14:10.544
943
+ I don't do anything so,
944
+
945
+ 237
946
+ 00:14:11.559 --> 00:14:11.826
947
+ you know,
948
+
949
+ 238
950
+ 00:14:12.356 --> 00:14:14.356
951
+ sensitive or secure that requires air gapping.
952
+
953
+ 239
954
+ 00:14:14.403 --> 00:14:14.528
955
+ So.
956
+
957
+ 240
958
+ 00:14:15.770 --> 00:14:18.131
959
+ I looked at the pricing and especially the kind of older models,
960
+
961
+ 241
962
+ 00:14:18.273 --> 00:14:18.493
963
+ mini,
964
+
965
+ 242
966
+ 00:14:19.714 --> 00:14:20.417
967
+ some of them are very,
968
+
969
+ 243
970
+ 00:14:20.495 --> 00:14:21.174
971
+ very affordable.
972
+
973
+ 244
974
+ 00:14:21.256 --> 00:14:21.475
975
+ And
976
+
977
+ 245
978
+ 00:14:22.937 --> 00:14:24.721
979
+ I did a calculation once with
980
+
981
+ 246
982
+ 00:14:25.361 --> 00:14:26.339
983
+ ChatGPT and I was like,
984
+
985
+ 247
986
+ 00:14:26.424 --> 00:14:26.542
987
+ OK,
988
+
989
+ 248
990
+ 00:14:27.322 --> 00:14:27.783
991
+ this is the
992
+
993
+ 249
994
+ 00:14:28.464 --> 00:14:31.027
995
+ API price for I can't remember whatever the model was.
996
+
997
+ 250
998
+ 00:14:31.971 --> 00:14:33.861
999
+ Let's say I just go at it like nonstop,
1000
+
1001
+ 251
1002
+ 00:14:34.269 --> 00:14:35.408
1003
+ which it rarely happens.
1004
+
1005
+ 252
1006
+ 00:14:35.549 --> 00:14:36.033
1007
+ Probably
1008
+
1009
+ 253
1010
+ 00:14:36.691 --> 00:14:42.956
1011
+ I would say on average I might dictate 30 to 60 minutes per day if I was probably summing up the emails.
1012
+
1013
+ 254
1014
+ 00:14:44.114 --> 00:14:44.234
1015
+ uh,
1016
+
1017
+ 255
1018
+ 00:14:44.635 --> 00:14:45.236
1019
+ documents,
1020
+
1021
+ 256
1022
+ 00:14:45.356 --> 00:14:46.080
1023
+ outlines,
1024
+
1025
+ 257
1026
+ 00:14:46.760 --> 00:14:47.100
1027
+ um,
1028
+
1029
+ 258
1030
+ 00:14:47.201 --> 00:14:47.763
1031
+ which is a lot,
1032
+
1033
+ 259
1034
+ 00:14:47.802 --> 00:14:48.182
1035
+ but it's,
1036
+
1037
+ 260
1038
+ 00:14:48.484 --> 00:14:49.889
1039
+ it's still a fairly modest amount.
1040
+
1041
+ 261
1042
+ 00:14:50.327 --> 00:14:50.730
1043
+ And I was like,
1044
+
1045
+ 262
1046
+ 00:14:50.750 --> 00:14:50.870
1047
+ well,
1048
+
1049
+ 263
1050
+ 00:14:50.952 --> 00:14:53.840
1051
+ some days I do go on like one or two days where I've been.
1052
+
1053
+ 264
1054
+ 00:14:54.749 --> 00:15:00.255
1055
+ Usually when I'm like kind of out of the house and just have something like I have nothing else to do.
1056
+
1057
+ 265
1058
+ 00:15:00.354 --> 00:15:01.813
1059
+ Like if I'm at a hospital,
1060
+
1061
+ 266
1062
+ 00:15:01.856 --> 00:15:07.841
1063
+ we have a newborn and you're waiting for like eight hours and hours for an appointment.
1064
+
1065
+ 267
1066
+ 00:15:08.380 --> 00:15:12.865
1067
+ And I would probably have listened to podcasts before becoming a speech fanatic.
1068
+
1069
+ 268
1070
+ 00:15:12.942 --> 00:15:13.475
1071
+ And I'm like,
1072
+
1073
+ 269
1074
+ 00:15:13.520 --> 00:15:13.645
1075
+ oh,
1076
+
1077
+ 270
1078
+ 00:15:13.662 --> 00:15:13.865
1079
+ wait,
1080
+
1081
+ 271
1082
+ 00:15:14.302 --> 00:15:15.255
1083
+ let me just get down.
1084
+
1085
+ 272
1086
+ 00:15:15.427 --> 00:15:16.975
1087
+ Let me just get these ideas out of my head.
1088
+
1089
+ 273
1090
+ 00:15:17.567 --> 00:15:20.645
1091
+ And that's when I'll go on my speech binges.
1092
+
1093
+ 274
1094
+ 00:15:20.692 --> 00:15:22.067
1095
+ But those are like once every few months,
1096
+
1097
+ 275
1098
+ 00:15:22.130 --> 00:15:23.270
1099
+ like not frequently.
1100
+
1101
+ 276
1102
+ 00:15:23.832 --> 00:15:24.192
1103
+ But I said,
1104
+
1105
+ 277
1106
+ 00:15:24.232 --> 00:15:24.413
1107
+ okay,
1108
+
1109
+ 278
1110
+ 00:15:24.494 --> 00:15:27.597
1111
+ let's just say if I'm going to price out cloud STT,
1112
+
1113
+ 279
1114
+ 00:15:29.038 --> 00:15:36.043
1115
+ if I was like dedicated every second of every waking hour to transcribing for some odd reason,
1116
+
1117
+ 280
1118
+ 00:15:36.823 --> 00:15:37.129
1119
+ um,
1120
+
1121
+ 281
1122
+ 00:15:37.323 --> 00:15:37.590
1123
+ I mean,
1124
+
1125
+ 282
1126
+ 00:15:37.591 --> 00:15:39.465
1127
+ it'd have to like eat and use the toilet.
1128
+
1129
+ 283
1130
+ 00:15:39.823 --> 00:15:40.090
1131
+ Like,
1132
+
1133
+ 284
1134
+ 00:15:40.527 --> 00:15:40.730
1135
+ you know,
1136
+
1137
+ 285
1138
+ 00:15:40.730 --> 00:15:42.527
1139
+ there's only so many hours I'm awake for.
1140
+
1141
+ 286
1142
+ 00:15:42.652 --> 00:15:43.090
1143
+ So like,
1144
+
1145
+ 287
1146
+ 00:15:43.198 --> 00:15:45.495
1147
+ let's just say a maximum of like 40 hours,
1148
+
1149
+ 288
1150
+ 00:15:45.620 --> 00:15:48.058
1151
+ 45 minutes in the hour.
1152
+
1153
+ 289
1154
+ 00:15:48.120 --> 00:15:48.573
1155
+ Then I said,
1156
+
1157
+ 290
1158
+ 00:15:48.590 --> 00:15:48.840
1159
+ all right,
1160
+
1161
+ 291
1162
+ 00:15:48.855 --> 00:15:49.823
1163
+ let's just say 50.
1164
+
1165
+ 292
1166
+ 00:15:50.715 --> 00:15:51.277
1167
+ Who knows?
1168
+
1169
+ 293
1170
+ 00:15:51.495 --> 00:15:52.573
1171
+ You're dictating on the toilet.
1172
+
1173
+ 294
1174
+ 00:15:52.855 --> 00:15:53.323
1175
+ We do it.
1176
+
1177
+ 295
1178
+ 00:15:54.144 --> 00:15:55.385
1179
+ So you could just do 60,
1180
+
1181
+ 296
1182
+ 00:15:55.524 --> 00:15:58.764
1183
+ but whatever I did and every day,
1184
+
1185
+ 297
1186
+ 00:15:58.986 --> 00:16:02.525
1187
+ like you're going flat out seven days a week dictating nonstop.
1188
+
1189
+ 298
1190
+ 00:16:02.565 --> 00:16:02.964
1191
+ I was like,
1192
+
1193
+ 299
1194
+ 00:16:03.104 --> 00:16:06.424
1195
+ what's my monthly API bill going to be at this price?
1196
+
1197
+ 300
1198
+ 00:16:06.947 --> 00:16:09.307
1199
+ And it came out to like 70 or 80 bucks.
1200
+
1201
+ 301
1202
+ 00:16:09.307 --> 00:16:09.745
1203
+ And I was like,
1204
+
1205
+ 302
1206
+ 00:16:09.854 --> 00:16:10.042
1207
+ well,
1208
+
1209
+ 303
1210
+ 00:16:10.135 --> 00:16:14.167
1211
+ that would be an extraordinary amount of dictation.
1212
+
1213
+ 304
1214
+ 00:16:14.322 --> 00:16:22.104
1215
+ And I would hope that there was some compelling reason worth more than $70 that I embarked upon that project.
1216
+
1217
+ 305
1218
+ 00:16:22.832 --> 00:16:24.716
1219
+ So given that that's kind of the max point for me,
1220
+
1221
+ 306
1222
+ 00:16:24.895 --> 00:16:26.116
1223
+ I said that's actually very,
1224
+
1225
+ 307
1226
+ 00:16:26.296 --> 00:16:26.996
1227
+ very affordable.
1228
+
1229
+ 308
1230
+ 00:16:28.099 --> 00:16:28.220
1231
+ Now,
1232
+
1233
+ 309
1234
+ 00:16:28.278 --> 00:16:35.504
1235
+ you're going to if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable,
1236
+
1237
+ 310
1238
+ 00:16:36.207 --> 00:16:37.365
1239
+ that's going to cost some more as well.
1240
+
1241
+ 311
1242
+ 00:16:38.091 --> 00:16:39.309
1243
+ Unless you're using
1244
+
1245
+ 312
1246
+ 00:16:40.309 --> 00:16:42.996
1247
+ Gemini, which needless to say,
1248
+
1249
+ 313
1250
+ 00:16:43.013 --> 00:16:45.091
1251
+ is a random person sitting in Jerusalem.
1252
+
1253
+ 314
1254
+ 00:16:46.013 --> 00:16:46.934
1255
+ I have no affiliation,
1256
+
1257
+ 315
1258
+ 00:16:47.216 --> 00:16:48.341
1259
+ nor with Google,
1260
+
1261
+ 316
1262
+ 00:16:48.403 --> 00:16:49.184
1263
+ nor Anthropic,
1264
+
1265
+ 317
1266
+ 00:16:49.231 --> 00:16:49.903
1267
+ nor Gemini,
1268
+
1269
+ 318
1270
+ 00:16:49.966 --> 00:16:52.028
1271
+ nor any major tech vendor for that matter.
1272
+
1273
+ 319
1274
+ 00:16:52.688 --> 00:16:52.908
1275
+ Um,
1276
+
1277
+ 320
1278
+ 00:16:53.951 --> 00:16:56.770
1279
+ I like Gemini not so much as an everyday model.
1280
+
1281
+ 321
1282
+ 00:16:57.072 --> 00:16:57.412
1283
+ Um,
1284
+
1285
+ 322
1286
+ 00:16:57.513 --> 00:16:59.416
1287
+ it's kind of underwhelmed in that respect,
1288
+
1289
+ 323
1290
+ 00:16:59.434 --> 00:16:59.837
1291
+ I would say,
1292
+
1293
+ 324
1294
+ 00:17:00.477 --> 00:17:01.653
1295
+ but for multimodal,
1296
+
1297
+ 325
1298
+ 00:17:01.716 --> 00:17:02.934
1299
+ I think it's got a lot to offer.
1300
+
1301
+ 326
1302
+ 00:17:03.576 --> 00:17:06.762
1303
+ And I think that the transcribing functionality whereby it can,
1304
+
1305
+ 327
1306
+ 00:17:07.584 --> 00:17:07.840
1307
+ um,
1308
+
1309
+ 328
1310
+ 00:17:08.059 --> 00:17:13.809
1311
+ process audio with a system prompt and both give you transcription that's cleaned up,
1312
+
1313
+ 329
1314
+ 00:17:13.873 --> 00:17:15.373
1315
+ that reduces two steps to one.
1316
+
1317
+ 330
1318
+ 00:17:15.965 --> 00:17:18.012
1319
+ And that for me is a very,
1320
+
1321
+ 331
1322
+ 00:17:18.076 --> 00:17:18.653
1323
+ very big deal.
1324
+
1325
+ 332
1326
+ 00:17:18.873 --> 00:17:19.090
1327
+ And,
1328
+
1329
+ 333
1330
+ 00:17:19.840 --> 00:17:19.951
1331
+ uh,
1332
+
1333
+ 334
1334
+ 00:17:19.951 --> 00:17:22.045
1335
+ I feel like even Google hasn't really sort of
1336
+
1337
+ 335
1338
+ 00:17:22.669 --> 00:17:39.968
1339
+ thought through how useful the, that modality is, and what kind of use cases you can achieve with it. Because I found, in the course of this year, just an endless list of really kind of system prompt, system prompt stuff, that I can say, okay,
1340
+
1341
+ 336
1342
+ 00:17:40.125 --> 00:17:49.733
1343
+ I've used it to capture context data for AI, which is literally, I might speak for, if I wanted to have a good bank of context data about, who knows, my childhood.
1344
+
1345
+ 337
1346
+ 00:17:50.480 --> 00:18:06.348
1347
+ More realistically, maybe my career goals, something that would just be, like, really boring to type out. So I'll just, like, sit in my car and record it for 10 minutes, and that 10 minutes, you get a lot of information in. Emails,
1348
+
1349
+ 338
1350
+ 00:18:06.458 --> 00:18:15.864
1351
+ which is short text, just, there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards, and different treatment. My context
1352
+
1353
+ 339
1354
+ 00:18:16.441 --> 00:18:37.698
1355
+ pipeline is kind of, like, just extract the bare essentials. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work, and it goes, it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database: Daniel has worked in technology, Daniel is a, has
1356
+
1357
+ 340
1358
+ 00:18:37.979 --> 00:18:44.526
1359
+ been working in martech, you know, stuff like that. That's not how you would speak, um, but I figure it's probably easier to parse for,
1360
+
1361
+ 341
1362
+ 00:18:44.962 --> 00:18:45.432
1363
+ after all,
1364
+
1365
+ 342
1366
+ 00:18:45.759 --> 00:18:46.104
1367
+ robots.
1368
+
1369
+ 343
1370
+ 00:18:46.930 --> 00:19:02.180
1371
+ So we've almost got to 20 minutes and this is actually a success because I wasted 20 minutes of the evening speaking into a microphone and the levels were shot and it was clipping and I said I can't really do an evaluation.
1372
+
1373
+ 344
1374
+ 00:19:02.539 --> 00:19:03.320
1375
+ I have to be fair.
1376
+
1377
+ 345
1378
+ 00:19:03.398 --> 00:19:06.961
1379
+ I have to give the models a chance to do their thing.
1380
+
1381
+ 346
1382
+ 00:19:07.852 --> 00:19:09.430
1383
+ What am I hoping to achieve in this?
1384
+
1385
+ 347
1386
+ 00:19:09.586 --> 00:19:09.789
1387
+ Okay,
1388
+
1389
+ 348
1390
+ 00:19:09.852 --> 00:19:11.352
1391
+ my fine tune was a dud as mentioned.
1392
+
1393
+ 349
1394
+ 00:19:11.977 --> 00:19:12.648
1395
+ Deepgram STT,
1396
+
1397
+ 350
1398
+ 00:19:12.789 --> 00:19:13.180
1399
+ I'm really,
1400
+
1401
+ 351
1402
+ 00:19:13.211 --> 00:19:15.477
1403
+ really hopeful that this prototype will work.
1404
+
1405
+ 352
1406
+ 00:19:16.060 --> 00:19:17.843
1407
+ And it's a built in public open source.
1408
+
1409
+ 353
1410
+ 00:19:17.844 --> 00:19:20.624
1411
+ So anyone is welcome to use it if I make anything good.
1412
+
1413
+ 354
1414
+ 00:19:21.788 --> 00:19:27.515
1415
+ But that was really exciting for me last night when after hours of trying my own prototype,
1416
+
1417
+ 355
1418
+ 00:19:27.593 --> 00:19:31.054
1419
+ seeing someone just made something that works like that,
1420
+
1421
+ 356
1422
+ 00:19:31.451 --> 00:19:31.654
1423
+ you know,
1424
+
1425
+ 357
1426
+ 00:19:31.655 --> 00:19:36.279
1427
+ you're not going to have to build a custom conda environment and image.
1428
+
1429
+ 358
1430
+ 00:19:36.468 --> 00:19:37.482
1431
+ I have an AMD GPU,
1432
+
1433
+ 359
1434
+ 00:19:37.546 --> 00:19:39.811
1435
+ which makes things much more complicated.
1436
+
1437
+ 360
1438
+ 00:19:40.311 --> 00:19:41.029
1439
+ I didn't find it.
1440
+
1441
+ 361
1442
+ 00:19:42.093 --> 00:19:42.843
1443
+ And I was about to give up.
1444
+
1445
+ 362
1446
+ 00:19:42.844 --> 00:19:43.140
1447
+ And I said,
1448
+
1449
+ 363
1450
+ 00:19:43.171 --> 00:19:43.421
1451
+ all right,
1452
+
1453
+ 364
1454
+ 00:19:43.422 --> 00:19:45.468
1455
+ let me just give Deepgram's Linux thing a
1456
+
1457
+ 365
1458
+ 00:19:46.178 --> 00:19:48.265
1459
+ shot, and if it doesn't work,
1460
+
1461
+ 366
1462
+ 00:19:49.027 --> 00:19:53.621
1463
+ I'm just gonna go back to trying to vibe code something myself and when I ran the script
1464
+
1465
+ 367
1466
+ 00:19:54.367 --> 00:19:57.450
1467
+ I was using Claude Code to do the installation process.
1468
+
1469
+ 368
1470
+ 00:19:58.271 --> 00:20:00.114
1471
+ It ran the script and oh my gosh,
1472
+
1473
+ 369
1474
+ 00:20:00.192 --> 00:20:01.195
1475
+ it works just like that.
1476
+
1477
+ 370
1478
+ 00:20:01.977 --> 00:20:10.789
1479
+ The tricky thing, for all those who want to know all the nitty-gritty details, was that
1480
+
1481
+ 371
1482
+ 00:20:11.398 --> 00:20:13.648
1483
+ I don't think it was actually struggling with transcription,
1484
+
1485
+ 372
1486
+ 00:20:13.680 --> 00:20:14.352
1487
+ but pasting,
1488
+
1489
+ 373
1490
+ 00:20:14.884 --> 00:20:17.509
1491
+ Wayland makes life very hard.
1492
+
1493
+ 374
1494
+ 00:20:17.617 --> 00:20:19.634
1495
+ And I think there was something not running at the right time.
1496
+
1497
+ 375
1498
+ 00:20:19.695 --> 00:20:19.977
1499
+ Anyway,
1500
+
1501
+ 376
1502
+ 00:20:20.617 --> 00:20:21.117
1503
+ Deepgram,
1504
+
1505
+ 377
1506
+ 00:20:21.273 --> 00:20:24.134
1507
+ I looked at how they actually handled that because it worked out of the...
1508
+
1509
+ 378
1510
+ 00:20:24.203 --> 00:20:40.180
1511
+ box when other stuff didn't, and it was quite a clever little mechanism. And, but more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20-minute audio sample, and I'm, I
1512
+
1513
+ 379
1514
+ 00:20:40.181 --> 00:20:52.413
1515
+ think I've done one or two of these before, but I did it with short, snappy voice notes. This is kind of long form. This actually might be a better approximation for what's useful to me than
1516
+
1517
+ 380
1518
+ 00:20:53.144 --> 00:21:09.383
1519
+ voice memos, like, I need to buy three liters of milk tomorrow and pita bread, which is probably how, like, half my voice note, voice notes sound. Like, if anyone were to, I don't know, like, find my phone, they'd be like, this is the most boring person in the world. Although, actually, there are some, like, kind of, uh, journaling
1520
+
1521
+ 381
1522
+ 00:21:09.398 --> 00:21:21.586
1523
+ thoughts as well, but it's a lot of content like that. And the, probably, for the evaluation, the most useful thing is slightly obscure tech: GitHub, nucleano, uh, Hugging Face. Not
1524
+
1525
+ 382
1526
+ 00:21:21.743 --> 00:21:38.417
1527
+ so obscure that it's not going to have a chance of knowing it, but hopefully sufficiently well known that the model should get it. I tried to do a little bit of speaking really fast and speaking very slowly. I would say, in general, I've spoken, delivered this at a faster pace than I usually would, owing to strong
1528
+
1529
+ 383
1530
+ 00:21:38.542 --> 00:21:51.214
1531
+ coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise, which, in my first take that I had to get rid of, my wife came in with my son, and, for a good night kiss.
1532
+
1533
+ 384
1534
+ 00:21:51.675 --> 00:21:58.541
1535
+ And that actually would have been super helpful to get in because it was non-diarized or if we had diarization,
1536
+
1537
+ 385
1538
+ 00:21:59.502 --> 00:21:59.968
1539
+ a female,
1540
+
1541
+ 386
1542
+ 00:22:00.007 --> 00:22:00.443
1543
+ I could say,
1544
+
1545
+ 387
1546
+ 00:22:00.607 --> 00:22:03.171
1547
+ I want the male voice and that wasn't intended for transcription.
1548
+
1549
+ 388
1550
+ 00:22:04.724 --> 00:22:07.029
1551
+ And we're not going to get background noise like people honking their horns,
1552
+
1553
+ 389
1554
+ 00:22:07.146 --> 00:22:13.099
+ which is something I've done in my main data set where I am trying to go back to some of my voice notes,
+
+ 390
+ 00:22:13.818 --> 00:22:15.740
+ annotate them and run a benchmark.
+
+ 391
+ 00:22:15.741 --> 00:22:17.007
+ But this is going to be just a pure,
+
+ 392
+ 00:22:17.788 --> 00:22:20.007
+ quick test and
+
+ 393
+ 00:22:21.152 --> 00:22:24.012
+ As someone working on a voice note idea,
+
+ 394
+ 00:22:24.071 --> 00:22:27.272
+ that's my sort of end motivation,
+
+ 395
+ 00:22:27.332 --> 00:22:31.694
+ besides thinking it's an absolutely outstanding technology that's coming to viability.
+
+ 396
+ 00:22:31.772 --> 00:22:32.172
+ And really,
+
+ 397
+ 00:22:32.211 --> 00:22:33.094
+ I know this sounds cheesy,
+
+ 398
+ 00:22:33.633 --> 00:22:36.336
+ can actually have a very transformative effect.
+
+ 399
+ 00:22:37.272 --> 00:22:37.429
+ It's,
+
+ 400
+ 00:22:37.836 --> 00:22:38.069
+ you know,
+
+ 401
+ 00:22:38.101 --> 00:22:44.897
+ voice technology has been life changing for folks living with disabilities.
+
+ 402
+ 00:22:45.851 --> 00:22:46.258
+ And
+
+ 403
+ 00:22:47.054 --> 00:22:49.851
+ I think there's something really nice about the fact that it can also benefit.
+
+ 404
+ 00:22:50.619 --> 00:22:50.859
+ you know,
+
+ 405
+ 00:22:51.019 --> 00:22:58.787
+ folks who are able-bodied and like we can all in different ways make this tech as useful as possible,
+
+ 406
+ 00:22:59.231 --> 00:23:01.051
+ regardless of the exact way that we're using it.
+
+ 407
+ 00:23:02.490 --> 00:23:05.294
+ And I think there's something very powerful in that and it can be very cool.
+
+ 408
+ 00:23:06.395 --> 00:23:07.451
+ I see huge potential.
+
+ 409
+ 00:23:07.715 --> 00:23:08.934
+ What excites me about voice tech?
+
+ 410
+ 00:23:09.903 --> 00:23:10.512
+ A lot of things,
+
+ 411
+ 00:23:10.576 --> 00:23:10.872
+ actually.
+
+ 412
+ 00:23:12.294 --> 00:23:12.622
+ Firstly,
+
+ 413
+ 00:23:13.028 --> 00:23:14.278
+ the fact that it's cheap and accurate,
+
+ 414
+ 00:23:14.715 --> 00:23:16.122
+ as I mentioned at the very start of this,
+
+ 415
+ 00:23:17.372 --> 00:23:19.809
+ and it's getting better and better with stuff like accent handling.
+
+ 416
+ 00:23:21.053 --> 00:23:25.577
+ I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day,
+
+ 417
+ 00:23:25.675 --> 00:23:26.878
+ as I imagine.
+
+ 418
+ 00:23:26.880 --> 00:23:27.878
+ I get like superb,
+
+ 419
+ 00:23:28.000 --> 00:23:28.942
+ flawless words,
+
+ 420
+ 00:23:29.058 --> 00:23:29.582
+ error rates,
+
+ 421
+ 00:23:29.597 --> 00:23:34.489
+ because I'm just kind of skeptical about local speech to text,
+
+ 422
+ 00:23:34.847 --> 00:23:35.503
+ as I mentioned.
+
+ 423
+ 00:23:36.105 --> 00:23:36.371
+ And
+
+ 424
+ 00:23:36.792 --> 00:23:40.386
+ I think the pace of innovation and improvement in the models,
+
+ 425
+ 00:23:40.574 --> 00:23:47.511
+ the main reasons for fine tuning from what I've seen have been people who are something that really blows my mind about
+
+ 426
+ 00:23:48.199 --> 00:23:49.278
+ ASR is
+
+ 427
+ 00:23:49.531 --> 00:24:04.644
+ the idea that it's inherently alingual or multilingual phonetic based so as folks who use speak very obscure languages that there may be very there might be a paucity of training data or almost none at all and
+
+ 428
+ 00:24:04.644 --> 00:24:15.738
+ therefore the accuracy is significantly reduced or folks in very critical environments i know there you this is used extensively in medical transcription and dispatcher your work as,
+
+ 429
+ 00:24:15.955 --> 00:24:16.894
+ um,
+
+ 430
+ 00:24:17.195 --> 00:24:17.435
+ you know,
+
+ 431
+ 00:24:17.455 --> 00:24:19.137
+ the call centers who send out ambulances,
+
+ 432
+ 00:24:19.199 --> 00:24:19.618
+ et cetera,
+
+ 433
+ 00:24:20.397 --> 00:24:22.441
+ where accuracy is absolutely paramount.
+
+ 434
+ 00:24:22.660 --> 00:24:24.125
+ And in the case of doctors,
+
+ 435
+ 00:24:24.721 --> 00:24:25.461
+ radiologists,
+
+ 436
+ 00:24:25.461 --> 00:24:28.008
+ they might be using very specialized vocab all the time.
+
+ 437
+ 00:24:28.827 --> 00:24:30.147
+ So those are kind of the main two things.
+
+ 438
+ 00:24:30.148 --> 00:24:37.093
+ And I'm not sure that really just for trying to make it better on a few random tech words with my slightly,
+
+ 439
+ 00:24:37.530 --> 00:24:37.750
+ I mean,
+
+ 440
+ 00:24:37.750 --> 00:24:38.358
+ I have an accent,
+
+ 441
+ 00:24:38.436 --> 00:24:39.218
+ but like not,
+
+ 442
+ 00:24:39.530 --> 00:24:39.797
+ you know,
+
+ 443
+ 00:24:40.233 --> 00:24:43.936
+ an accent that a few other million people have it.
+
+ 444
+ 00:24:44.922 --> 00:24:46.172
+ I'm not sure that.
+
+ 445
+ 00:24:46.579 --> 00:24:56.540
+ my little fine tune is going to actually like the bump in word error reduction if I ever actually figure out how to do it and get it up to the cloud by the time I've done that
+
+ 446
+ 00:24:57.029 --> 00:25:01.308
+ I suspect that the next generation of ASR will just be so good that it will kind of be,
+
+ 447
+ 00:25:02.051 --> 00:25:02.173
+ no,
+
+ 448
+ 00:25:02.430 --> 00:25:02.630
+ well,
+
+ 449
+ 00:25:02.808 --> 00:25:03.833
+ that would have been cool if it worked out,
+
+ 450
+ 00:25:03.872 --> 00:25:05.192
+ but I'll just use this instead.
+
+ 451
+ 00:25:05.972 --> 00:25:11.294
+ So that's going to be it for today's episode of voice training data.
+
+ 452
+ 00:25:12.011 --> 00:25:12.333
+ Single,
+
+ 453
+ 00:25:12.933 --> 00:25:14.028
+ long shot evaluation.
+
+ 454
+ 00:25:14.636 --> 00:25:15.450
+ Who am I going to compare?
+
+ 455
+ 00:25:16.622 --> 00:25:17.855
+ Whisper is always good as a benchmark,
+
+ 456
+ 00:25:17.886 --> 00:25:22.278
+ but I'm more interested in seeing Whisper head-to-head with two things,
+
+ 457
+ 00:25:22.308 --> 00:25:22.511
+ really.
+
+ 458
+ 00:25:23.450 --> 00:25:25.169
+ One is Whisper variants.
+
+ 459
+ 00:25:25.200 --> 00:25:25.950
+ So you've got these...
+
+ 460
+ 00:25:26.178 --> 00:25:44.617
+ projects like faster whisper uh distill whisper it's a bit confusing there's a whole bunch of them and the emerging asrs which are also a thing my intention for this is i'm not sure i'm going to have the time in any point in the foreseeable future to go back through this whole episode and create
+
+ 461
+ 00:25:44.618 --> 00:25:55.430
+ a proper source truth where i fix everything might do it if i can get one transcriptions as sufficiently close to perfection but
+
+ 462
+ 00:25:55.942 --> 00:25:57.241
+ What I would actually love to do on
+
+ 463
+ 00:25:58.102 --> 00:25:58.903
+ Hugging Face,
+
+ 464
+ 00:25:59.021 --> 00:25:59.800
+ I think would be a great,
+
+ 465
+ 00:25:59.984 --> 00:26:08.324
+ probably how I might visualize this is having the audio waveform play and then have the transcript for each model below it.
+
+ 466
+ 00:26:08.824 --> 00:26:09.722
+ And maybe even a,
+
+ 467
+ 00:26:11.144 --> 00:26:11.364
+ like,
+
+ 468
+ 00:26:11.489 --> 00:26:11.722
+ you know,
+
+ 469
+ 00:26:11.871 --> 00:26:15.105
+ two scale and maybe even a local one as well,
+
+ 470
+ 00:26:15.371 --> 00:26:17.903
+ like Local Whisper versus OpenAI API,
+
+ 471
+ 00:26:18.903 --> 00:26:19.449
+ et cetera.
+
+ 472
+ 00:26:19.746 --> 00:26:20.105
+ And...
+
+ 473
+ 00:26:21.238 --> 00:26:30.903
+ I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't,
+
+ 474
+ 00:26:31.606 --> 00:26:34.090
+ as well as the sort of headline finding of which had the best
+
+ 475
+ 00:26:34.731 --> 00:26:37.372
+ WER, but that would require the source of truth.
+
+ 476
+ 00:26:37.919 --> 00:26:38.090
+ Okay,
+
+ 477
+ 00:26:38.137 --> 00:26:38.434
+ that's it.
+
+ 478
+ 00:26:38.637 --> 00:26:39.372
+ I hope this was,
+
+ 479
+ 00:26:39.622 --> 00:26:39.997
+ I don't know,
+
+ 480
+ 00:26:40.419 --> 00:26:42.403
+ maybe useful for other folks interested in STT.
+
+ 481
+ 00:26:43.106 --> 00:26:43.762
+ You want to see that
+
+ 482
+ 00:26:44.137 --> 00:26:44.919
+ I always feel,
+
+ 483
+ 00:26:45.434 --> 00:26:47.247
+ think I've just said as something I didn't intend to.
+
+ 484
+ 00:26:48.044 --> 00:26:48.481
+ STT,
+
+ 485
+ 00:26:48.872 --> 00:26:49.528
+ I said for those.
+
+ 486
+ 00:26:49.817 --> 00:26:50.378
+ listen carefully,
+
+ 487
+ 00:26:50.419 --> 00:26:52.902
+ including hopefully the models themselves.
+
+ 488
+ 00:26:53.441 --> 00:26:54.163
+ This has been myself,
+
+ 489
+ 00:26:54.304 --> 00:26:54.902
+ Daniel Rosehill.
+
+ 490
+ 00:26:55.022 --> 00:26:59.404
+ For more jumbled repositories about my roving interest in AI,
+
+ 491
+ 00:26:59.507 --> 00:27:00.765
+ but particularly agentic,
+
+ 492
+ 00:27:01.451 --> 00:27:03.015
+ MCP and voice tech,
+
+ 493
+ 00:27:03.413 --> 00:27:04.335
+ you can find me on
+
+ 494
+ 00:27:04.990 --> 00:27:06.749
+ GitHub, Hugging Face,
+
+ 495
+ 00:27:08.279 --> 00:27:08.811
+ where else?
+
+ 496
+ 00:27:09.140 --> 00:27:10.154
+ Danielrosehill.com,
+
+ 497
+ 00:27:10.171 --> 00:27:11.296
+ which is my personal website,
+
+ 498
+ 00:27:11.374 --> 00:27:13.483
+ as well as this podcast,
+
+ 499
+ 00:27:13.624 --> 00:27:15.186
+ whose name I sadly cannot remember.
+
+ 500
+ 00:27:15.936 --> 00:27:16.499
+ Until next time,
+
+ 501
+ 00:27:16.826 --> 00:27:17.343
+ thanks for listening.
srt-out/nova3.srt ADDED
@@ -0,0 +1,2304 @@
+ 1
+ 00:00:00,080 --> 00:00:06,240
+ Hello and welcome to a audio dataset consisting of one
+
+ 2
+ 00:00:06,240 --> 00:00:08,400
+ single episode of a nonexistent podcast.
+
+ 3
+ 00:00:08,800 --> 00:00:12,880
+ Or it I may append this to a podcast that
+
+ 4
+ 00:00:12,880 --> 00:00:18,814
+ I set up recently regarding my with my thoughts on
+
+ 5
+ 00:00:18,815 --> 00:00:20,815
+ speech tech and A.
+
+ 6
+ 00:00:20,815 --> 00:00:21,214
+ I.
+
+ 7
+ 00:00:21,214 --> 00:00:22,814
+ In particular, more A.
+
+ 8
+ 00:00:22,814 --> 00:00:23,054
+ I.
+
+ 9
+ 00:00:23,054 --> 00:00:23,935
+ And generative A.
+
+ 10
+ 00:00:23,935 --> 00:00:24,095
+ I.
+
+ 11
+ 00:00:24,095 --> 00:00:26,494
+ I would I would say.
+
+ 12
+ 00:00:26,814 --> 00:00:30,869
+ But in any event, the purpose of this voice recording
+
+ 13
+ 00:00:30,869 --> 00:00:35,590
+ is actually to create a lengthy voice sample for a
+
+ 14
+ 00:00:35,590 --> 00:00:38,950
+ quick evaluation, a back of the envelope evaluation, they might
+
+ 15
+ 00:00:38,950 --> 00:00:41,429
+ say, for different speech attacks models.
+
+ 16
+ 00:00:41,429 --> 00:00:43,945
+ I'm doing this because I thought I'd made a great
+
+ 17
+ 00:00:43,945 --> 00:00:47,784
+ breakthrough in my journey with speech tech and that was
+
+ 18
+ 00:00:47,784 --> 00:00:51,385
+ succeeding in the elusive task of fine tuning whisper.
+
+ 19
+ 00:00:51,704 --> 00:00:56,424
+ Whisper is, and I'm to just talk, I'm trying to
+
+ 20
+ 00:00:55,829 --> 00:00:56,789
+ mix up.
+
+ 21
+ 00:00:56,869 --> 00:01:00,390
+ I'm going to try a few different styles of speaking
+
+ 22
+ 00:01:00,390 --> 00:01:02,869
+ whisper something at some points as well.
+
+ 23
+ 00:01:03,350 --> 00:01:06,790
+ And I'll go back to speaking loud in in different
+
+ 24
+ 00:01:06,790 --> 00:01:09,030
+ parts are going to sound really like a crazy person
+
+ 25
+ 00:01:09,030 --> 00:01:12,424
+ because I'm also going to try to speak at different
+
+ 26
+ 00:01:12,984 --> 00:01:18,025
+ pitches and cadences in order to really try to push
+
+ 27
+ 00:01:18,344 --> 00:01:21,145
+ a speech to text model through its paces, which is
+
+ 28
+ 00:01:21,145 --> 00:01:24,609
+ trying to make sense of is this guy just rambling
+
+ 29
+ 00:01:24,609 --> 00:01:30,049
+ on incoherently in one long sentence or are these just
+
+ 30
+ 00:01:30,049 --> 00:01:36,450
+ actually a series of step standalone, standalone, standalone sentences?
+
+ 31
+ 00:01:36,450 --> 00:01:38,130
+ And how is it going to handle step alone?
+
+ 32
+ 00:01:38,130 --> 00:01:38,770
+ That's not a word.
+
+ 33
+ 00:01:39,704 --> 00:01:42,025
+ What happens when you use speech to text and you
+
+ 34
+ 00:01:42,025 --> 00:01:43,384
+ use a fake word?
+
+ 35
+ 00:01:43,384 --> 00:01:45,784
+ And then you're like, wait, that's not actually that word
+
+ 36
+ 00:01:45,784 --> 00:01:46,665
+ doesn't exist.
+
+ 37
+ 00:01:46,984 --> 00:01:48,584
+ How does AI handle that?
+
+ 38
+ 00:01:48,584 --> 00:01:53,750
+ And these and more are all the questions that I'm
+
+ 39
+ 00:01:53,750 --> 00:01:55,750
+ seeking to answer in this training data.
+
+ 40
+ 00:01:55,829 --> 00:01:58,549
+ Now, why was I trying to fine tune Whisper?
+
+ 41
+ 00:01:58,549 --> 00:01:59,750
+ And what is Whisper?
+
+ 42
+ 00:01:59,750 --> 00:02:02,710
+ As I said, I'm going to try to record this
+
+ 43
+ 00:02:02,710 --> 00:02:06,644
+ at a couple of different levels of technicality for folks
+
+ 44
+ 00:02:06,644 --> 00:02:11,764
+ who are in the normal world and not totally stuck
+
+ 45
+ 00:02:11,764 --> 00:02:13,764
+ down the rabbit hole of AI, which you have to
+
+ 46
+ 00:02:13,764 --> 00:02:17,685
+ say is a really wonderful rabbit hole to be done.
+
+ 47
+ 00:02:17,844 --> 00:02:20,919
+ It's a really interesting area and speech and voice tech
+
+ 48
+ 00:02:20,919 --> 00:02:24,359
+ is is the aspect of it that I find actually
+
+ 49
+ 00:02:24,359 --> 00:02:27,239
+ most I'm not sure I would say the most interesting
+
+ 50
+ 00:02:27,239 --> 00:02:30,759
+ because there's just so much that is fascinating in AI.
+
+ 51
+ 00:02:31,400 --> 00:02:34,134
+ But the most that I find the most personally transformative
+
+ 52
+ 00:02:34,134 --> 00:02:38,534
+ in terms of the impact that it's had on my
+
+ 53
+ 00:02:38,534 --> 00:02:41,254
+ daily work life and productivity and how I sort of
+
+ 54
+ 00:02:41,254 --> 00:02:41,895
+ work.
+
+ 55
+ 00:02:42,935 --> 00:02:47,500
+ I'm persevering hard with the task of trying to get
+
+ 56
+ 00:02:47,500 --> 00:02:50,939
+ a good solution working for Linux, which if anyone actually
+
+ 57
+ 00:02:50,939 --> 00:02:52,939
+ does listen to this, not just for the training data
+
+ 58
+ 00:02:52,939 --> 00:02:56,700
+ and for the actual content, is sparked.
+
+ 59
+ 00:02:56,700 --> 00:02:59,980
+ I had, besides the fine tune not working, well that
+
+ 60
+ 00:02:59,980 --> 00:03:01,385
+ was the failure.
+
+ 61
+ 00:03:02,504 --> 00:03:06,745
+ I used Claude code because one thinks these days that
+
+ 62
+ 00:03:06,745 --> 00:03:13,280
+ there is nothing short of solving, you know, the the
+
+ 63
+ 00:03:13,280 --> 00:03:17,599
+ reason of life or something that clause and agentic AI
+
+ 64
+ 00:03:17,599 --> 00:03:19,680
+ can't do, which is not really the case.
+
+ 65
+ 00:03:19,680 --> 00:03:23,199
+ It does seem that way sometimes, but it fails a
+
+ 66
+ 00:03:23,199 --> 00:03:23,759
+ lot as well.
+
+ 67
+ 00:03:23,759 --> 00:03:26,639
+ And this is one of those instances where last week
+
+ 68
+ 00:03:26,639 --> 00:03:30,824
+ I put together an hour of voice training data, basically
+
+ 69
+ 00:03:30,824 --> 00:03:33,465
+ speaking just random things for three minutes.
+
+ 70
+ 00:03:35,465 --> 00:03:38,104
+ It was actually kind of tedious because the texts were
+
+ 71
+ 00:03:38,104 --> 00:03:38,664
+ really weird.
+
+ 72
+ 00:03:38,664 --> 00:03:41,370
+ Some of them were, it was like it was AI
+
+ 73
+ 00:03:41,370 --> 00:03:42,250
+ generated.
+
+ 74
+ 00:03:42,569 --> 00:03:44,889
+ I tried before to read Sherlock Holmes for an hour
+
+ 75
+ 00:03:44,889 --> 00:03:47,689
+ and I just couldn't, I was so bored after ten
+
+ 76
+ 00:03:47,689 --> 00:03:50,569
+ minutes that I was like, okay, no, I'm just gonna
+
+ 77
+ 00:03:50,569 --> 00:03:51,930
+ have to find something else to read.
+
+ 78
+ 00:03:51,930 --> 00:03:58,284
+ So I used a created with AI Studio, VibeCoded, a
+
+ 79
+ 00:03:58,284 --> 00:04:03,164
+ synthetic text generator which actually I thought was probably a
+
+ 80
+ 00:04:03,164 --> 00:04:05,245
+ better way of doing it because it would give me
+
+ 81
+ 00:04:05,245 --> 00:04:09,069
+ more short samples with more varied content.
+
+ 82
+ 00:04:09,069 --> 00:04:11,710
+ So I was like, okay, give me a voice note
+
+ 83
+ 00:04:11,710 --> 00:04:14,909
+ like I'm recording an email, give me a short story
+
+ 84
+ 00:04:14,909 --> 00:04:18,189
+ to read, give me prose to read.
+
+ 85
+ 00:04:18,189 --> 00:04:20,634
+ So I came up with all these different things and
+
+ 86
+ 00:04:20,634 --> 00:04:22,714
+ they added a little timer to it so I could
+
+ 87
+ 00:04:22,714 --> 00:04:24,955
+ see how close I was to one hour.
+
+ 88
+ 00:04:25,915 --> 00:04:29,115
+ And I spent like an hour one afternoon or probably
+
+ 89
+ 00:04:29,115 --> 00:04:33,115
+ two hours by the time you do retakes and whatever
+
+ 90
+ 00:04:33,115 --> 00:04:36,169
+ because you want to it gave me a source of
+
+ 91
+ 00:04:36,169 --> 00:04:40,009
+ truth which I'm not sure if that's the scientific way
+
+ 92
+ 00:04:40,009 --> 00:04:44,169
+ to approach this topic of gathering training data but I
+
+ 93
+ 00:04:44,169 --> 00:04:45,449
+ thought made sense.
+
+ 94
+ 00:04:46,490 --> 00:04:49,464
+ I have a lot of audio data from recording voice
+
+ 95
+ 00:04:49,464 --> 00:04:53,544
+ notes which I've also kind of used, been experimenting with
+
+ 96
+ 00:04:53,544 --> 00:04:55,064
+ using for a different purpose.
+
+ 97
+ 00:04:55,384 --> 00:04:58,745
+ Slightly different annotating task types.
+
+ 98
+ 00:04:58,745 --> 00:05:03,250
+ It's more a text classification experiment or Well, it's more
+
+ 99
+ 00:05:03,250 --> 00:05:03,810
+ than that actually.
+
+ 100
+ 00:05:03,810 --> 00:05:05,009
+ I'm working on a voice app.
+
+ 101
+ 00:05:05,009 --> 00:05:09,329
+ So it's a prototype, I guess, is really more accurate.
+
+ 102
+ 00:05:11,409 --> 00:05:13,969
+ But you can do that and you can work backwards.
+
+ 103
+ 00:05:13,969 --> 00:05:18,354
+ Listen back to a voice note and you painfully go
+
+ 104
+ 00:05:18,354 --> 00:05:21,474
+ through one of those transcribing, where you start and stop
+
+ 105
+ 00:05:21,474 --> 00:05:23,634
+ and scrub around it and you fix the errors, but
+
+ 106
+ 00:05:23,634 --> 00:05:25,875
+ it's really, really pouring to do that.
+
+ 107
+ 00:05:26,115 --> 00:05:28,034
+ So I thought it would be less tedious in the
+
+ 108
+ 00:05:28,034 --> 00:05:31,714
+ long term if I just recorded the source of truth.
+
+ 109
+ 00:05:32,069 --> 00:05:34,389
+ So it gave me these three minutes snippets.
+
+ 110
+ 00:05:34,389 --> 00:05:37,509
+ I recorded them and saved an MP3 and a TXT
+
+ 111
+ 00:05:37,750 --> 00:05:40,310
+ in the same folder and I created an error that
+
+ 112
+ 00:05:40,310 --> 00:05:40,949
+ data.
+
+ 113
+ 00:05:41,990 --> 00:05:44,870
+ So I was very hopeful, quietly, a little bit hopeful
+
+ 114
+ 00:05:44,870 --> 00:05:47,029
+ that I would be able, that I could actually fine
+
+ 115
+ 00:05:47,029 --> 00:05:47,750
+ tune Whisper.
+
+ 116
+ 00:05:48,365 --> 00:05:51,085
+ I want to fine tune Whisper because when I got
+
+ 117
+ 00:05:51,085 --> 00:05:55,004
+ into voice tech last November, my wife was in the
+
+ 118
+ 00:05:55,004 --> 00:05:57,245
+ US and I was alone at home.
+
+ 119
+ 00:05:57,324 --> 00:06:01,004
+ And when crazy people like me do really wild things
+
+ 120
+ 00:06:01,004 --> 00:06:03,980
+ like use voice to tech technology.
+
+ 121
+ 00:06:03,980 --> 00:06:06,939
+ That was basically when I started doing it, I didn't
+
+ 122
+ 00:06:06,939 --> 00:06:09,580
+ feel like a crazy person speaking to myself.
+
+ 123
+ 00:06:09,980 --> 00:06:12,780
+ And my expectations weren't that high.
+
+ 124
+ 00:06:13,180 --> 00:06:17,685
+ I'd used speech tech now and again, tried it out.
+
+ 125
+ 00:06:17,685 --> 00:06:18,884
+ I was like, it'd be really cool if you could
+
+ 126
+ 00:06:18,884 --> 00:06:22,404
+ just like speak into your computer and whatever I tried
+
+ 127
+ 00:06:22,404 --> 00:06:25,925
+ out that had Linux support was just, it was not
+
+ 128
+ 00:06:25,925 --> 00:06:26,805
+ good basically.
+
+ 129
+ 00:06:27,365 --> 00:06:29,524
+ And this blew me away from the first go.
+
+ 130
+ 00:06:29,524 --> 00:06:32,339
+ I mean, it wasn't one hundred percent accurate out of
+
+ 131
+ 00:06:32,339 --> 00:06:34,500
+ the box and it took work, but it was good
+
+ 132
+ 00:06:34,500 --> 00:06:36,819
+ enough that there was a solid foundation and it kind
+
+ 133
+ 00:06:36,819 --> 00:06:41,139
+ of passed that pivot point that it's actually worth doing
+
+ 134
+ 00:06:41,139 --> 00:06:41,620
+ this.
+
+ 135
+ 00:06:41,939 --> 00:06:43,939
+ You know, there's a point where it's so like, the
+
+ 136
+ 00:06:43,939 --> 00:06:46,485
+ transcript is you don't have to get one hundred percent
+
+ 137
+ 00:06:46,485 --> 00:06:49,525
+ accuracy for it to be worth your time for speech
+
+ 138
+ 00:06:49,525 --> 00:06:51,925
+ to text to be a worthwhile addition to your productivity.
+
+ 139
+ 00:06:51,925 --> 00:06:53,685
+ But you do need to get above, let's say, I
+
+ 140
+ 00:06:53,685 --> 00:06:55,125
+ don't know, eighty five percent.
+
+ 141
+ 00:06:55,605 --> 00:06:58,805
+ If it's sixty percent or fifty percent, you inevitably say,
+
+ 142
+ 00:06:59,040 --> 00:07:00,319
+ Screw it, I'll just type it.
+
+ 143
+ 00:07:00,319 --> 00:07:03,680
+ Because you end up missing errors in the transcript and
+
+ 144
+ 00:07:03,680 --> 00:07:05,040
+ it becomes actually worse.
+
+ 145
+ 00:07:05,040 --> 00:07:06,720
+ You end up in a worse position than you started
+
+ 146
+ 00:07:06,720 --> 00:07:07,040
+ with it.
+
+ 147
+ 00:07:07,040 --> 00:07:08,240
+ That's been my experience.
+
+ 148
+ 00:07:08,560 --> 00:07:12,480
+ So I was like, Oh, this is actually really, really
+
+ 149
+ 00:07:12,480 --> 00:07:12,960
+ good now.
+
+ 150
+ 00:07:12,960 --> 00:07:13,680
+ How did that happen?
+
+ 151
+ 00:07:13,680 --> 00:07:17,995
+ And the answer is ASR, Whisper being open sourced and
+
+ 152
+ 00:07:18,714 --> 00:07:21,594
+ the transformer architecture, if you want to go back to
+
+ 153
+ 00:07:21,594 --> 00:07:26,394
+ the underpinnings, which really blows my mind and it's on
+
+ 154
+ 00:07:26,394 --> 00:07:29,830
+ my list to read through that paper.
+
+ 155
+ 00:07:30,389 --> 00:07:35,990
+ All you need is attention as attentively as can be
+
+ 156
+ 00:07:35,990 --> 00:07:39,350
+ done with my limited brain because it's super super high
+
+ 157
+ 00:07:39,350 --> 00:07:43,045
+ level stuff, super advanced stuff, mean.
+
+ 158
+ 00:07:43,285 --> 00:07:48,084
+ That I think of all the things that are fascinating
+
+ 159
+ 00:07:48,084 --> 00:07:52,564
+ about the sudden rise in AI and the dramatic capabilities,
+
+ 160
+ 00:07:53,339 --> 00:07:55,419
+ I find it fascinating that few people are like, hang
+
+ 161
+ 00:07:55,419 --> 00:07:58,300
+ on, you've got this thing that can speak to you
+
+ 162
+ 00:07:58,300 --> 00:08:00,060
+ like a chatbot, an LLM.
+
+ 163
+ 00:08:00,620 --> 00:08:02,860
+ And then you've got image generation.
+
+ 164
+ 00:08:02,860 --> 00:08:03,180
+ Okay.
+
+ 165
+ 00:08:03,180 --> 00:08:07,100
+ So firstly, two things on the surface have nothing in
+
+ 166
+ 00:08:07,100 --> 00:08:07,419
+ common.
+
+ 167
+ 00:08:08,365 --> 00:08:12,044
+ So how did that just happen all at the same
+
+ 168
+ 00:08:12,044 --> 00:08:12,285
+ time?
+
+ 169
+ 00:08:12,285 --> 00:08:15,964
+ And then when you extend that further, you're like, Suno.
+
+ 170
+ 00:08:15,964 --> 00:08:19,485
+ You can sing a song and AI will come up
+
+ 171
+ 00:08:19,485 --> 00:08:21,165
+ with an instrumental.
+
+ 172
+ 00:08:21,485 --> 00:08:23,485
+ And then you've got Whisper and you're like, Wait a
+
+ 173
+ 00:08:23,485 --> 00:08:23,725
+ second.
+
+ 174
+ 00:08:24,100 --> 00:08:28,180
+ How did all this stuff If it's all AI, there
+
+ 175
+ 00:08:28,180 --> 00:08:29,540
+ has to be some commonality.
+
+ 176
+ 00:08:29,540 --> 00:08:35,139
+ Otherwise, are totally different technologies on the surface of it.
+
+ 177
+ 00:08:35,220 --> 00:08:39,384
+ And the transformer architecture is, as far as I know,
+
+ 178
+ 00:08:39,384 --> 00:08:40,264
+ the answer.
+
+ 179
+ 00:08:40,264 --> 00:08:42,985
+ And I can't even say, can't even pretend that I
+
+ 180
+ 00:08:42,985 --> 00:08:47,384
+ really understand what the transformer architecture means in-depth.
+
+ 181
+ 00:08:47,384 --> 00:08:49,865
+ But I have scanned this and as I said, I
+
+ 182
+ 00:08:49,865 --> 00:08:52,879
+ want to print it and really kind of think over
+
+ 183
+ 00:08:52,879 --> 00:08:54,160
+ it at some point.
+
+ 184
+ 00:08:54,879 --> 00:08:58,080
+ And I'll probably feel bad about myself, I think, because
+
+ 185
+ 00:08:58,080 --> 00:08:59,679
+ weren't those guys in twenties?
+
+ 186
+ 00:09:00,320 --> 00:09:01,840
+ Like, that's crazy.
+
+ 187
+ 00:09:02,160 --> 00:09:06,160
+ I think I asked ChatGPT once who wrote that paper
+
+ 188
+ 00:09:06,545 --> 00:09:09,264
+ and how old were they when it was published in
+
+ 189
+ 00:09:09,264 --> 00:09:09,825
+ ArcSiv?
+
+ 190
+ 00:09:09,825 --> 00:09:13,105
+ And I was expecting like, I don't know, what do
+
+ 191
+ 00:09:13,105 --> 00:09:13,585
+ you imagine?
+
+ 192
+ 00:09:13,585 --> 00:09:15,665
+ I personally imagine kind of like, you you have these
+
+ 193
+ 00:09:15,665 --> 00:09:19,745
+ breakthroughs during COVID and things like that, where like these
+
+ 194
+ 00:09:19,745 --> 00:09:22,629
+ kind of really obscure scientists who are in their 50s
+
+ 195
+ 00:09:22,629 --> 00:09:26,870
+ and they've just kind of been laboring in labs and
+
+ 196
+ 00:09:26,870 --> 00:09:29,830
+ wearily in writing and publishing in kind of obscure academic
+
+ 197
+ 00:09:29,830 --> 00:09:30,710
+ publications.
+
+ 198
+ 00:09:30,870 --> 00:09:33,669
+ And they finally hit a big or win a Nobel
+
+ 199
+ 00:09:33,669 --> 00:09:36,235
+ Prize and then their household names.
+
797
+ 200
798
+ 00:09:36,634 --> 00:09:38,634
799
+ So that was kind of what I had in mind.
800
+
801
+ 201
802
+ 00:09:38,634 --> 00:09:42,154
803
+ That was the mental image I'd formed of the birth
804
+
805
+ 202
806
+ 00:09:42,154 --> 00:09:42,955
807
+ of ArcSim.
808
+
809
+ 203
810
+ 00:09:42,955 --> 00:09:45,595
811
+ Like I wasn't expecting twenty somethings in San Francisco.
812
+
813
+ 204
814
+ 00:09:45,595 --> 00:09:48,794
815
+ I thought that was both very funny, very cool, and
816
+
817
+ 205
818
+ 00:09:48,794 --> 00:09:50,075
819
+ actually kind of inspiring.
820
+
821
+ 206
822
+ 00:09:50,554 --> 00:09:55,230
823
+ It's nice to think that people who just you might
824
+
825
+ 207
826
+ 00:09:55,230 --> 00:09:58,509
827
+ put them in the kind of milieu or bubble or
828
+
829
+ 208
830
+ 00:09:58,509 --> 00:10:02,669
831
+ world that you are in incredibly in through a series
832
+
833
+ 209
+ 00:10:02,669 --> 00:10:05,835
+ of connections that are coming up with such literally world
+
+ 210
+ 00:10:05,835 --> 00:10:07,835
+ changing innovations.
+
+ 211
+ 00:10:07,914 --> 00:10:11,274
+ So that was I thought anyway, that's that that was
+
+ 212
+ 00:10:11,274 --> 00:10:11,835
+ cool.
+
+ 213
+ 00:10:12,235 --> 00:10:12,554
+ Okay.
+
+ 214
+ 00:10:12,554 --> 00:10:13,434
+ Voice training data.
+
+ 215
+ 00:10:13,434 --> 00:10:14,154
+ How are we doing?
+
+ 216
+ 00:10:14,154 --> 00:10:17,355
+ We're about ten minutes, and I'm still talking about voice
+
+ 217
+ 00:10:17,355 --> 00:10:18,235
+ technology.
+
+ 218
+ 00:10:18,634 --> 00:10:22,179
+ So Whisper was brilliant, and I was so excited that
+
+ 219
+ 00:10:22,179 --> 00:10:25,860
+ my first instinct was to guess, like, Oh my gosh,
+
+ 220
+ 00:10:25,860 --> 00:10:28,019
+ I have to get a really good microphone for this.
+
+ 221
+ 00:10:28,179 --> 00:10:31,379
+ So I didn't go on a spending spree because I
+
+ 222
+ 00:10:31,379 --> 00:10:33,299
+ said, I'm gonna have to just wait a month and
+
+ 223
+ 00:10:33,299 --> 00:10:34,740
+ see if I still use this.
+
+ 224
+ 00:10:35,220 --> 00:10:38,875
+ And it just kind of became it's become really part
+
+ 225
+ 00:10:38,875 --> 00:10:40,955
+ of my daily routine.
+
+ 226
+ 00:10:41,754 --> 00:10:44,315
+ Like if I'm writing an email, I'll record a voice
+
+ 227
+ 00:10:44,315 --> 00:10:47,595
+ note and then I've developed and it's nice to see
+
+ 228
+ 00:10:47,595 --> 00:10:50,759
+ that everyone is like developing the same things in parallel.
+
+ 229
+ 00:10:50,759 --> 00:10:53,399
+ That's kind of a weird thing to say, when I
+
+ 230
+ 00:10:53,399 --> 00:11:00,279
+ started working on these prototypes on GitHub, which is where
+
+ 231
+ 00:11:00,279 --> 00:11:04,039
+ I just kind of share very freely and loosely ideas
+
+ 232
+ 00:11:04,039 --> 00:11:06,945
+ and first iterations on concepts.
+
+ 233
+ 00:11:09,024 --> 00:11:10,704
+ And for want of a better word, I called it
+
+ 234
+ 00:11:10,704 --> 00:11:14,945
+ like LLM post processing or clean up or basically a
+
+ 235
+ 00:11:14,945 --> 00:11:17,745
+ system prompt that after you get back the raw text
+
+ 236
+ 00:11:17,745 --> 00:11:21,620
+ from Whisper, you run it through a model and say,
+
+ 237
+ 00:11:21,620 --> 00:11:26,339
+ okay, this is crappy text like add sentence structure and,
+
+ 238
+ 00:11:26,339 --> 00:11:27,459
+ you know, fix it up.
+
+ 239
+ 00:11:27,860 --> 00:11:32,579
+ And now when I'm exploring the different tools that are
+
+ 240
+ 00:11:32,579 --> 00:11:35,634
+ out there that people have built, I see quite a
+
+ 241
+ 00:11:35,634 --> 00:11:39,475
+ number of projects have basically done the same thing.
+
+ 242
+ 00:11:40,754 --> 00:11:43,235
+ Lest that be misconstrued, I'm not saying for a millisecond
+
+ 243
+ 00:11:43,235 --> 00:11:44,595
+ that I inspired them.
+
+ 244
+ 00:11:44,595 --> 00:11:48,034
+ I'm sure this has been a thing that's been integrated
+
+ 245
+ 00:11:48,034 --> 00:11:51,290
+ into tools for a while, but it's the kind of
+
+ 246
+ 00:11:51,290 --> 00:11:53,690
+ thing that when you start using these tools every day,
+
+ 247
+ 00:11:53,690 --> 00:11:57,610
+ the need for it is almost instantly apparent because text
+
+ 248
+ 00:11:57,610 --> 00:12:01,529
+ that doesn't have any punctuation or paragraph spacing takes a
+
+ 249
+ 00:12:01,529 --> 00:12:03,965
+ long time to, you know, it takes so long to
+
+ 250
+ 00:12:03,965 --> 00:12:09,004
+ get it into a presentable email that again, moves speech
+
+ 251
+ 00:12:09,004 --> 00:12:13,085
+ tech into that before that inflection point where you're like,
+
+ 252
+ 00:12:13,085 --> 00:12:13,965
+ nah, it's just not worth it.
+
+ 253
+ 00:12:13,965 --> 00:12:16,924
+ It's like, it'll just be quicker to type this.
+
+ 254
+ 00:12:17,279 --> 00:12:19,840
+ So it's a big, it's a little touch that actually
+
+ 255
+ 00:12:20,080 --> 00:12:21,200
+ is a big deal.
+
+ 256
+ 00:12:21,519 --> 00:12:25,440
+ So I was on Whisper and I've been using Whisper
+
+ 257
+ 00:12:25,440 --> 00:12:27,759
+ and I kind of early on found a couple of
+
+ 258
+ 00:12:27,759 --> 00:12:28,399
+ tools.
+
+ 259
+ 00:12:28,399 --> 00:12:30,639
+ I couldn't find what I was looking for on Linux,
+
+ 260
+ 00:12:30,639 --> 00:12:35,924
+ which is basically just something that'll run in the background.
+
+ 261
+ 00:12:35,924 --> 00:12:38,245
+ You'll give it an API key and it will just
+
+ 262
+ 00:12:38,245 --> 00:12:43,044
+ like transcribe with like a little key to start and
+
+ 263
+ 00:12:43,044 --> 00:12:43,845
+ stop the dictation.
+
+ 264
+ 00:12:45,080 --> 00:12:48,440
+ And the issues where I discovered that like most people
+
+ 265
+ 00:12:48,440 --> 00:12:52,040
+ involved in creating these projects were very much focused on
+
+ 266
+ 00:12:52,040 --> 00:12:55,800
+ local models, running Whisper locally because you can.
+
+ 267
+ 00:12:56,279 --> 00:12:58,200
+ And I tried that a bunch of times and just
+
+ 268
+ 00:12:58,200 --> 00:13:01,054
+ never got results that were as good as the cloud.
+
+ 269
+ 00:13:01,455 --> 00:13:03,615
+ And when I began looking at the cost of the
+
+ 270
+ 00:13:03,615 --> 00:13:06,654
+ speech to text APIs and what I was spending, I
+
+ 271
+ 00:13:06,654 --> 00:13:09,855
+ just thought there is it's actually, in my opinion, just
+
+ 272
+ 00:13:09,855 --> 00:13:13,160
+ one of the better deals in API spending in the
+
+ 273
+ 00:13:13,160 --> 00:13:13,480
+ cloud.
+
+ 274
+ 00:13:13,480 --> 00:13:15,720
+ Like, it's just not that expensive for very, very good
+
+ 275
+ 00:13:15,720 --> 00:13:19,639
+ models that are much more, you know, you're gonna be
+
+ 276
+ 00:13:19,639 --> 00:13:22,759
+ able to run the full model, the latest model versus
+
+ 277
+ 00:13:22,759 --> 00:13:26,605
+ whatever you can run on your average GPU unless you
+
+ 278
+ 00:13:26,605 --> 00:13:28,845
+ want to buy a crazy GPU.
+
+ 279
+ 00:13:28,845 --> 00:13:30,044
+ It doesn't really make sense to me.
+
+ 280
+ 00:13:30,044 --> 00:13:33,164
+ Privacy is another concern that I know is kind of
+
+ 281
+ 00:13:33,164 --> 00:13:35,325
+ like a very much a separate thing that people just
+
+ 282
+ 00:13:35,325 --> 00:13:38,845
+ don't want their voice data and their voice leaving their
+
+ 283
+ 00:13:38,845 --> 00:13:42,460
+ local environment maybe for regulatory reasons as well.
+
+ 284
+ 00:13:42,700 --> 00:13:43,980
+ But I'm not in that.
+
+ 285
+ 00:13:44,220 --> 00:13:48,540
+ I neither really care about people listening to my, grocery
+
+ 286
+ 00:13:48,540 --> 00:13:51,580
+ list, consisting of, reminding myself that I need to buy
+
+ 287
+ 00:13:51,580 --> 00:13:54,779
+ more beer, Cheetos, and hummus, which is kind of the
+
+ 288
+ 00:13:55,334 --> 00:13:59,574
+ three staples of my diet during periods of poor nutrition.
+
+ 289
+ 00:13:59,894 --> 00:14:02,375
+ But the kind of stuff that I transcribe, it's just
+
+ 290
+ 00:14:02,375 --> 00:14:02,694
+ not.
+
+ 291
+ 00:14:02,694 --> 00:14:07,814
+ It's not a privacy thing I'm that sort of sensitive
+
+ 292
+ 00:14:07,814 --> 00:14:13,269
+ about and I don't do anything so sensitive or secure
+
+ 293
+ 00:14:13,269 --> 00:14:14,790
+ that requires air-gapping.
+
+ 294
+ 00:14:15,670 --> 00:14:17,590
+ I looked at the pricing and especially the kind of
+
+ 295
+ 00:14:17,590 --> 00:14:18,950
+ older model mini.
+
+ 296
+ 00:14:19,590 --> 00:14:21,910
+ Some of them are very, very affordable and I did
+
+ 297
+ 00:14:21,910 --> 00:14:26,764
+ a calculation once with ChatGPT and I was like, okay,
+
+ 298
+ 00:14:26,764 --> 00:14:30,365
+ this is the API price for I can't remember whatever
+
+ 299
+ 00:14:30,365 --> 00:14:31,404
+ the model was.
+
+ 300
+ 00:14:31,804 --> 00:14:34,445
+ Let's say I just go at it like nonstop, which
+
+ 301
+ 00:14:34,445 --> 00:14:35,565
+ rarely happens.
+
+ 302
+ 00:14:35,644 --> 00:14:38,959
+ Probably, I would say on average I might dictate thirty
+
+ 303
+ 00:14:38,959 --> 00:14:41,759
+ to sixty minutes per day if I was probably summing
+
+ 304
+ 00:14:41,759 --> 00:14:48,000
+ up the emails, documents, outlines, which is a lot, but
+
+ 305
+ 00:14:48,000 --> 00:14:50,159
+ it's it's still a fairly modest amount.
+
+ 306
+ 00:14:50,159 --> 00:14:51,839
+ And I was like, well, some days I do go
+
+ 307
+ 00:14:51,839 --> 00:14:54,934
+ on like one or two days where I've been usually
+
+ 308
+ 00:14:54,934 --> 00:14:56,855
+ when I'm like kind of out of the house and
+
+ 309
+ 00:14:56,855 --> 00:15:00,535
+ just have something like I have nothing else to do.
+
+ 310
+ 00:15:00,535 --> 00:15:03,175
+ Like if I'm at a hospital, we have a newborn
+
+ 311
+ 00:15:03,575 --> 00:15:07,299
+ and you're waiting for like eight hours and hours for
+
+ 312
+ 00:15:07,299 --> 00:15:08,100
+ an appointment.
+
+ 313
+ 00:15:08,179 --> 00:15:12,019
+ And I would probably have listened to podcasts before becoming
+
+ 314
+ 00:15:12,019 --> 00:15:12,980
+ a speech fanatic.
+
+ 315
+ 00:15:12,980 --> 00:15:15,379
+ And I'm like, Oh, wait, let me just get down.
+
+ 316
+ 00:15:15,379 --> 00:15:17,379
+ Let me just get these ideas out of my head.
+
+ 317
+ 00:15:17,540 --> 00:15:20,745
+ And that's when I'll go on my speech binges.
+
+ 318
+ 00:15:20,745 --> 00:15:22,664
+ But those are like once every few months, like not
+
+ 319
+ 00:15:22,664 --> 00:15:23,544
+ frequently.
+
+ 320
+ 00:15:23,784 --> 00:15:25,784
+ But I said, okay, let's just say if I'm going
+
+ 321
+ 00:15:25,784 --> 00:15:28,184
+ to price out cloud STT.
+
+ 322
+ 00:15:28,985 --> 00:15:33,500
+ If I was like dedicated every second of every waking
+
+ 323
+ 00:15:33,500 --> 00:15:37,820
+ hour to transcribing for some odd reason, I mean I'd
+
+ 324
+ 00:15:37,820 --> 00:15:39,820
+ have to eat and use the toilet.
+
+ 325
+ 00:15:40,540 --> 00:15:42,700
+ There's only so many hours I'm awake for.
+
+ 326
+ 00:15:42,700 --> 00:15:47,019
+ So let's just say a maximum of forty five minutes
+
+ 327
+ 00:15:47,205 --> 00:15:49,205
+ in the hour, then I said, All right, let's just
+
+ 328
+ 00:15:49,205 --> 00:15:50,165
+ say fifty.
+
+ 329
+ 00:15:50,644 --> 00:15:51,365
+ Who knows?
+
+ 330
+ 00:15:51,365 --> 00:15:52,804
+ You're dictating on the toilet.
+
+ 331
+ 00:15:52,804 --> 00:15:53,605
+ We do it.
+
+ 332
+ 00:15:53,924 --> 00:15:56,884
+ So you could just do sixty, but whatever I did
+
+ 333
+ 00:15:57,125 --> 00:16:01,179
+ and every day, like you're going flat out seven days
+
+ 334
+ 00:16:01,179 --> 00:16:02,620
+ a week dictating nonstop.
+
+ 335
+ 00:16:02,620 --> 00:16:05,579
+ I was like, What's my monthly API bill going to
+
+ 336
+ 00:16:05,579 --> 00:16:06,700
+ be at this price?
+
+ 337
+ 00:16:06,779 --> 00:16:09,339
+ And it came out to like seventy or eighty bucks.
+
+ 338
+ 00:16:09,339 --> 00:16:12,620
+ And I was like, Well, that would be an extraordinary
+
+ 339
+ 00:16:12,940 --> 00:16:14,379
+ amount of dictation.
+
+ 340
+ 00:16:14,379 --> 00:16:18,105
+ And I would hope that there was some compelling reason
+
+ 341
+ 00:16:18,745 --> 00:16:21,784
+ worth more than seventy dollars that I embarked upon that
+
+ 342
+ 00:16:21,784 --> 00:16:22,424
+ project.
+
+ 343
+ 00:16:22,664 --> 00:16:24,585
+ So given that that's kind of the max point for
+
+ 344
+ 00:16:24,585 --> 00:16:27,304
+ me I said that's actually very very affordable.
+
+ 345
+ 00:16:28,024 --> 00:16:30,504
+ Now you're gonna if you want to spec out the
+
+ 346
+ 00:16:30,504 --> 00:16:33,909
+ costs and you want to do the post processing that
+
+ 347
+ 00:16:33,909 --> 00:16:36,789
+ I really do feel is valuable, that's going to cost
+
+ 348
+ 00:16:36,789 --> 00:16:37,750
+ some more as well.
+
+ 349
+ 00:16:38,070 --> 00:16:43,269
+ Unless you're using Gemini, which needless to say I'm a
+
+ 350
+ 00:16:43,269 --> 00:16:45,190
+ random person sitting in Jerusalem.
+
+ 351
+ 00:16:45,855 --> 00:16:49,455
+ I have no affiliation nor with Google nor Anthropic nor
+
+ 352
+ 00:16:49,455 --> 00:16:52,414
+ Gemini nor any major tech vendor for that matter.
+
+ 353
+ 00:16:53,855 --> 00:16:57,215
+ I like Gemini not so much as an everyday model.
+
+ 354
+ 00:16:57,455 --> 00:16:59,934
+ It's kind of underwhelmed in that respect, I would say.
+
+ 355
+ 00:17:00,379 --> 00:17:02,779
+ But for multimodal, I think it's got a lot to
+
+ 356
+ 00:17:02,779 --> 00:17:03,339
+ offer.
+
+ 357
+ 00:17:03,659 --> 00:17:07,179
+ And I think that the transcribing functionality whereby it can,
+
+ 358
+ 00:17:08,059 --> 00:17:12,380
+ process audio with a system prompt and both give you
+
+ 359
+ 00:17:12,380 --> 00:17:13,900
+ transcription that's cleaned up.
+
+ 360
+ 00:17:13,900 --> 00:17:15,339
+ That reduces two steps to one.
+
+ 361
+ 00:17:15,835 --> 00:17:18,954
+ And that for me is a very, very big deal.
+
+ 362
+ 00:17:18,955 --> 00:17:22,474
+ And I feel like even Google hasn't really sort of
+
+ 363
+ 00:17:22,555 --> 00:17:27,195
+ thought through how useful that modality is and what
+
+ 364
+ 00:17:27,195 --> 00:17:29,700
+ kind of use cases you can achieve with it.
+
+ 365
+ 00:17:29,700 --> 00:17:32,339
+ Because I found in the course of this year just
+
+ 366
+ 00:17:32,339 --> 00:17:38,019
+ an endless list of really kind of system prompt stuff
+
+ 367
+ 00:17:38,019 --> 00:17:40,900
+ that I can say, okay, I've used it to capture
+
+ 368
+ 00:17:40,900 --> 00:17:44,115
+ context data for AI, which is literally I might speak
+
+ 369
+ 00:17:44,115 --> 00:17:46,755
+ for if I wanted to have a good bank of
+
+ 370
+ 00:17:46,755 --> 00:17:50,035
+ context data about who knows my childhood.
+
+ 371
+ 00:17:50,434 --> 00:17:54,355
+ More realistically, maybe my career goals, something that would just
+
+ 372
+ 00:17:54,355 --> 00:17:56,195
+ be like really boring to type out.
+
+ 373
+ 00:17:56,195 --> 00:18:00,500
+ So I'll just like sit in my car and record
+
+ 374
+ 00:18:00,500 --> 00:18:01,460
+ it for ten minutes.
+
+ 375
+ 00:18:01,460 --> 00:18:03,779
+ And that ten minutes you get a lot of information
+
+ 376
+ 00:18:03,779 --> 00:18:04,419
+ in.
+
+ 377
+ 00:18:05,619 --> 00:18:07,700
+ Emails, which is short text.
+
+ 378
+ 00:18:08,660 --> 00:18:10,419
+ Just there is a whole bunch.
+
+ 379
+ 00:18:10,420 --> 00:18:13,375
+ And all these workflows kind of require a little bit
+
+ 380
+ 00:18:13,375 --> 00:18:15,134
+ of treatment afterwards and different treatment.
+
+ 381
+ 00:18:15,134 --> 00:18:18,414
+ My context pipeline is kind of like just extract the
+
+ 382
+ 00:18:18,414 --> 00:18:19,295
+ bare essentials.
+
+ 383
+ 00:18:19,295 --> 00:18:22,174
+ You end up with me talking very loosely about sort
+
+ 384
+ 00:18:22,174 --> 00:18:24,494
+ of what I've done in my career, where I've worked,
+
+ 385
+ 00:18:24,494 --> 00:18:25,454
+ where I might like to work.
+
+ 386
+ 00:18:26,000 --> 00:18:29,119
+ And it goes, it condenses that down to very robotic
+
+ 387
+ 00:18:29,119 --> 00:18:32,720
+ language that is easy to chunk, parse, and maybe put
+
+ 388
+ 00:18:32,720 --> 00:18:34,000
+ into a vector database.
+
+ 389
+ 00:18:34,000 --> 00:18:36,240
+ Daniel has worked in technology.
+
+ 390
+ 00:18:36,240 --> 00:18:39,840
+ Daniel has been working in, you know, stuff like that.
+
+ 391
+ 00:18:39,840 --> 00:18:43,055
+ That's not how you would speak, but I figure it's
+
+ 392
+ 00:18:43,055 --> 00:18:46,494
+ probably easier to parse for, after all, robots.
+
+ 393
+ 00:18:46,815 --> 00:18:48,734
+ So we've almost got to twenty minutes and this is
+
+ 394
+ 00:18:48,734 --> 00:18:53,134
+ actually a success because I wasted twenty minutes of my
+
+ 395
+ 00:18:53,535 --> 00:18:57,200
+ of the evening speaking into a microphone and the
+
+ 396
+ 00:18:57,200 --> 00:19:01,119
+ levels were shot and was clipping and I said I
+
+ 397
+ 00:19:01,119 --> 00:19:02,400
+ can't really do an evaluation.
+
+ 398
+ 00:19:02,400 --> 00:19:03,440
+ I have to be fair.
+
+ 399
+ 00:19:03,440 --> 00:19:06,400
+ I have to give the models a chance to do
+
+ 400
+ 00:19:06,400 --> 00:19:06,960
+ their thing.
+
+ 401
+ 00:19:07,505 --> 00:19:09,585
+ What am I hoping to achieve in this?
+
+ 402
+ 00:19:09,585 --> 00:19:11,664
+ Okay, my fine tune was a dud as mentioned.
+
+ 403
+ 00:19:11,745 --> 00:19:15,265
+ Deepgram STT, I'm really, really hopeful that this prototype will
+
+ 404
+ 00:19:15,265 --> 00:19:18,065
+ work and it's a build in public open source so
+
+ 405
+ 00:19:18,065 --> 00:19:20,384
+ anyone is welcome to use it if I make anything
+
+ 406
+ 00:19:20,384 --> 00:19:20,705
+ good.
+
+ 407
+ 00:19:21,640 --> 00:19:23,880
+ But that was really exciting for me last night when
+
+ 408
+ 00:19:23,880 --> 00:19:28,920
+ after hours of trying my own prototype, seeing someone just
+
+ 409
+ 00:19:28,920 --> 00:19:32,119
+ made something that works like that, you you're not gonna
+
+ 410
+ 00:19:32,119 --> 00:19:36,454
+ have to build a custom conda environment and image.
+
+ 411
+ 00:19:36,454 --> 00:19:40,054
+ I have an AMD GPU which makes things much more complicated.
+
+ 412
+ 00:19:40,294 --> 00:19:42,694
+ I didn't find it and I was about to give
+
+ 413
+ 00:19:42,694 --> 00:19:43,974
+ up and I said, All right, let me just give
+
+ 414
+ 00:19:43,974 --> 00:19:46,535
+ Deepgram's Linux thing a shot.
+
+ 415
+ 00:19:47,109 --> 00:19:49,669
+ And if this doesn't work, I'm just gonna go back
+
+ 416
+ 00:19:49,669 --> 00:19:51,429
+ to trying to vibe code something myself.
+
+ 417
+ 00:19:51,750 --> 00:19:55,589
+ And when I ran the script, I was using Claude
+
+ 418
+ 00:19:55,589 --> 00:19:59,109
+ Code to do the installation process, it ran the script
+
+ 419
+ 00:19:59,109 --> 00:20:01,269
+ and, oh my gosh, it works just like that.
+
+ 420
+ 00:20:01,904 --> 00:20:06,065
+ The tricky thing for all those who want to know
+
+ 421
+ 00:20:06,065 --> 00:20:11,505
+ all the nitty, ditty, nitty gritty details was that I
+
+ 422
+ 00:20:11,505 --> 00:20:14,704
+ don't think it was actually struggling with transcription, but pasting
+
+ 423
+ 00:20:14,785 --> 00:20:17,619
+ Wayland makes life very hard.
+
+ 424
+ 00:20:17,619 --> 00:20:19,220
+ And I think there was something not running at the
+
+ 425
+ 00:20:19,220 --> 00:20:19,779
+ right time.
+
+ 426
+ 00:20:19,779 --> 00:20:23,059
+ Anyway, Deepgram, I looked at how they actually handle that
+
+ 427
+ 00:20:23,059 --> 00:20:25,220
+ because it worked out of the box when other stuff
+
+ 428
+ 00:20:25,220 --> 00:20:25,859
+ didn't.
+
+ 429
+ 00:20:26,180 --> 00:20:28,980
+ And it was quite a clever little mechanism.
+
+ 430
+ 00:20:29,575 --> 00:20:32,215
+ And but more so than that, the accuracy was brilliant.
+
+ 431
+ 00:20:32,215 --> 00:20:33,654
+ Now what am I what am I doing here?
+
+ 432
+ 00:20:33,654 --> 00:20:37,255
+ This is gonna be a twenty minute audio sample.
+
+ 433
+ 00:20:38,455 --> 00:20:42,490
+ And I'm I think I've done one or two of
+
+ 434
+ 00:20:42,490 --> 00:20:47,210
+ these before, but I did it with short, snappy voice
+
+ 435
+ 00:20:47,210 --> 00:20:47,690
+ notes.
+
+ 436
+ 00:20:47,690 --> 00:20:49,450
+ This is kind of long form.
+
+ 437
+ 00:20:49,529 --> 00:20:52,009
+ This actually might be a better approximation for what's useful
+
+ 438
+ 00:20:52,009 --> 00:20:53,929
+ to me than voice memos.
+
+ 439
+ 00:20:53,929 --> 00:20:56,974
+ Like, I need to buy three liters of milk tomorrow
+
+ 440
+ 00:20:56,974 --> 00:21:00,255
+ and pita bread, which is probably how half my voice
+
+ 441
+ 00:21:00,255 --> 00:21:00,815
+ notes sound.
+
+ 442
+ 00:21:00,815 --> 00:21:04,174
+ Like if anyone were to find my phone they'd be
+
+ 443
+ 00:21:04,174 --> 00:21:06,014
+ like this is the most boring person in the world.
+
+ 444
+ 00:21:06,095 --> 00:21:10,130
+ Although actually there are some journaling thoughts as well, but
+
+ 445
+ 00:21:10,130 --> 00:21:11,890
+ it's a lot of content like that.
+
+ 446
+ 00:21:11,890 --> 00:21:14,690
+ And the probably for the evaluation, the most useful thing
+
+ 447
+ 00:21:14,690 --> 00:21:21,914
+ is slightly obscure tech, GitHub, Nucleano, Hugging Face, not so
+
+ 448
+ 00:21:21,914 --> 00:21:24,554
+ obscure that it's not gonna have a chance of knowing
+
+ 449
+ 00:21:24,554 --> 00:21:27,274
+ it, but hopefully sufficiently well known that the model should
+
+ 450
+ 00:21:27,274 --> 00:21:27,914
+ get it.
+
+ 451
+ 00:21:27,994 --> 00:21:30,075
+ I tried to do a little bit of speaking really
+
+ 452
+ 00:21:30,075 --> 00:21:32,474
+ fast and speaking very slowly.
+
+ 453
+ 00:21:32,474 --> 00:21:35,609
+ Would say in general, I've spoken, delivered this at a
+
+ 454
+ 00:21:35,609 --> 00:21:39,210
+ faster pace than I usually would owing to strong coffee
+
+ 455
+ 00:21:39,210 --> 00:21:40,650
+ flowing through my bloodstream.
+
+ 456
+ 00:21:41,210 --> 00:21:43,609
+ And the thing that I'm not gonna get in this
+
+ 457
+ 00:21:43,609 --> 00:21:46,170
+ benchmark is background noise, which in my first take that
+
+ 458
+ 00:21:46,170 --> 00:21:48,535
+ I had to get rid of, my wife came in
+
+ 459
+ 00:21:48,535 --> 00:21:51,575
+ with my son and for a good night kiss.
+
+ 460
+ 00:21:51,654 --> 00:21:55,174
+ And that actually would have been super helpful to get
+
+ 461
+ 00:21:55,174 --> 00:21:57,894
+ in because it was non diarized or if we had
+
+ 462
+ 00:21:57,894 --> 00:21:58,775
+ diarization.
+
+ 463
+ 00:21:59,414 --> 00:22:01,494
+ A female, I could say, I want the male voice
+
+ 464
+ 00:22:01,494 --> 00:22:03,174
+ and that wasn't intended for transcription.
+
+ 465
+ 00:22:04,589 --> 00:22:06,349
+ And we're not going to get background noise like people
+
+ 466
+ 00:22:06,349 --> 00:22:09,069
+ honking their horns, which is something I've done in my
+
+ 467
+ 00:22:09,230 --> 00:22:11,950
+ main data set where I am trying to go back
+
+ 468
+ 00:22:11,950 --> 00:22:15,150
+ to some of my voice notes, annotate them and run
+
+ 469
+ 00:22:15,150 --> 00:22:15,789
+ a benchmark.
+
+ 470
+ 00:22:15,789 --> 00:22:18,345
+ But this is going to be just a pure quick
+
+ 471
+ 00:22:18,345 --> 00:22:19,144
+ test.
+
+ 472
+ 00:22:19,865 --> 00:22:24,105
+ And as someone I'm working on a voice note idea.
+
+ 473
+ 00:22:24,105 --> 00:22:28,265
+ That's my sort of end motivation besides thinking it's an
+
+ 474
+ 00:22:28,265 --> 00:22:31,865
+ absolutely outstanding technology that's coming to viability.
+
+ 475
+ 00:22:31,865 --> 00:22:34,480
+ And really, I know this sounds cheesy, can actually have
+
+ 476
+ 00:22:34,480 --> 00:22:36,559
+ a very transformative effect.
+
+ 477
+ 00:22:38,000 --> 00:22:43,200
+ Voice technology has been life changing for folks living with
+
+ 478
+ 00:22:44,079 --> 00:22:45,119
+ disabilities.
+
+ 479
+ 00:22:46,000 --> 00:22:48,625
+ And I think there's something really nice about the fact
+
+ 480
+ 00:22:48,625 --> 00:22:52,625
+ that it can also benefit folks who are able-bodied and
+
+ 481
+ 00:22:52,625 --> 00:22:57,984
+ we can all in different ways make this tech as
+
+ 482
+ 00:22:57,984 --> 00:23:00,785
+ useful as possible regardless of the exact way that we're
+
+ 483
+ 00:23:00,785 --> 00:23:01,105
+ using it.
+
+ 484
+ 00:23:02,279 --> 00:23:04,519
+ And I think there's something very powerful in that, and
+
+ 485
+ 00:23:04,519 --> 00:23:05,639
+ it can be very cool.
+
+ 486
+ 00:23:06,200 --> 00:23:07,639
+ I see huge potential.
+
+ 487
+ 00:23:07,639 --> 00:23:09,399
+ What excites me about voice tech?
+
+ 488
+ 00:23:09,799 --> 00:23:11,239
+ A lot of things actually.
+
+ 489
+ 00:23:12,200 --> 00:23:14,919
+ Firstly, the fact that it's cheap and accurate, as I
+
+ 490
+ 00:23:14,919 --> 00:23:17,865
+ mentioned at the very start of this, and it's getting
+
+ 491
+ 00:23:17,865 --> 00:23:20,184
+ better and better with stuff like accent handling.
+
+ 492
+ 00:23:20,825 --> 00:23:23,384
+ I'm not sure my fine tune will actually ever come
+
+ 493
+ 00:23:23,384 --> 00:23:25,305
+ to fruition in the sense that I'll use it day
+
+ 494
+ 00:23:25,305 --> 00:23:26,664
+ to day as I imagine.
+
+ 495
+ 00:23:26,744 --> 00:23:30,585
+ I get like superb, flawless word error rates because I'm
+
+ 496
+ 00:23:30,585 --> 00:23:35,029
+ just kind of skeptical about local speech to text, as
+
+ 497
+ 00:23:35,029 --> 00:23:35,750
+ I mentioned.
+
+ 498
+ 00:23:36,150 --> 00:23:39,910
+ And I think the pace of innovation and improvement in
+
+ 499
+ 00:23:39,910 --> 00:23:42,390
+ the models, the main reasons for fine tuning from what
+
+ 500
+ 00:23:42,390 --> 00:23:46,230
+ I've seen have been people who are something that really
+
+ 501
+ 00:23:46,230 --> 00:23:50,455
+ blows blows my mind about ASR is the idea that
+
+ 502
+ 00:23:50,455 --> 00:23:55,654
+ it's inherently alingual or multilingual, phonetic based.
+
+ 503
+ 00:23:56,375 --> 00:24:00,455
+ So as folks who speak very obscure languages that
+
+ 504
+ 00:24:00,455 --> 00:24:03,174
+ there may be very there might be a paucity of
+
+ 505
+ 00:24:02,309 --> 00:24:05,110
+ training data or almost none at all, and therefore the
+
+ 506
+ 00:24:05,110 --> 00:24:06,870
+ accuracy is significantly reduced.
2024
+
2025
+ 507
2026
+ 00:24:06,870 --> 00:24:11,430
2027
+ Or folks in very critical environments, I know there are
2028
+
2029
+ 508
2030
+ 00:24:11,590 --> 00:24:15,430
2031
+ this is used extensively in medical transcription and dispatcher work
2032
+
2033
+ 509
2034
+ 00:24:15,430 --> 00:24:19,144
2035
+ as, you know the call centers who send out ambulances
2036
+
2037
+ 510
2038
+ 00:24:19,144 --> 00:24:19,944
2039
+ etc.
2040
+
2041
+ 511
2042
+ 00:24:20,345 --> 00:24:23,625
2043
+ Where accuracy is absolutely paramount and in the case of
2044
+
2045
+ 512
2046
+ 00:24:23,625 --> 00:24:27,625
2047
+ doctors radiologists they might be using very specialized vocab all
2048
+
2049
+ 513
2050
+ 00:24:27,625 --> 00:24:27,945
2051
+ the time.
2052
+
2053
+ 514
2054
+ 00:24:28,710 --> 00:24:30,309
2055
+ So those are kind of the main two things, and
2056
+
2057
+ 515
2058
+ 00:24:30,309 --> 00:24:32,230
2059
+ I'm not sure that really just for trying to make
2060
+
2061
+ 516
2062
+ 00:24:32,230 --> 00:24:36,470
2063
+ it better on a few random tech words with my
2064
+
2065
+ 517
2066
+ 00:24:36,470 --> 00:24:39,509
2067
+ slightly I mean, I have an accent, but, like, not,
2068
+
2069
+ 518
2070
+ 00:24:39,509 --> 00:24:42,549
2071
+ you know, an accent that a few other million people
2072
+
2073
+ 519
2074
+ 00:24:42,950 --> 00:24:43,990
2075
+ have ish.
2076
+
2077
+ 520
2078
+ 00:24:44,765 --> 00:24:48,045
2079
+ I'm not sure that my little fine tune is gonna
2080
+
2081
+ 521
2082
+ 00:24:48,045 --> 00:24:52,684
2083
+ actually like, the bump in word error reduction, if I
2084
+
2085
+ 522
2086
+ 00:24:52,684 --> 00:24:54,285
2087
+ ever actually figure out how to do it and get
2088
+
2089
+ 523
2090
+ 00:24:54,285 --> 00:24:56,445
2091
+ it up to the cloud, by the time we've done
2092
+
2093
+ 524
2094
+ 00:24:56,445 --> 00:25:00,039
2095
+ that, I suspect that the next generation of ASR will
2096
+
2097
+ 525
2098
+ 00:25:00,039 --> 00:25:01,799
2099
+ just be so good that it will kind of be,
2100
+
2101
+ 526
2102
+ 00:25:02,039 --> 00:25:04,039
2103
+ well, that would have been cool if it worked out,
2104
+
2105
+ 527
2106
+ 00:25:04,039 --> 00:25:05,559
2107
+ but I'll just use this instead.
2108
+
2109
+ 528
2110
+ 00:25:05,799 --> 00:25:10,759
2111
+ So that's gonna be it for today's episode of voice
2112
+
2113
+ 529
2114
+ 00:25:10,759 --> 00:25:11,720
2115
+ training data.
2116
+
2117
+ 530
2118
+ 00:25:11,960 --> 00:25:14,335
2119
+ Single, long shot evaluation.
2120
+
2121
+ 531
2122
+ 00:25:14,575 --> 00:25:15,774
2123
+ Who am I gonna compare?
2124
+
2125
+ 532
2126
+ 00:25:16,494 --> 00:25:18,654
2127
+ Whisper is always good as a benchmark, but I'm more
2128
+
2129
+ 533
2130
+ 00:25:18,654 --> 00:25:22,255
2131
+ interested in seeing Whisper head to head with two things
2132
+
2133
+ 534
2134
+ 00:25:22,255 --> 00:25:22,974
2135
+ really.
2136
+
2137
+ 535
2138
+ 00:25:23,375 --> 00:25:25,214
2139
+ One is Whisper variants.
2140
+
2141
+ 536
2142
+ 00:25:25,214 --> 00:25:27,775
2143
+ So you've got these projects like Faster Whisper.
2144
+
2145
+ 537
2146
+ 00:25:29,190 --> 00:25:30,069
2147
+ Distill Whisper.
2148
+
2149
+ 538
2150
+ 00:25:30,069 --> 00:25:30,789
2151
+ It's a bit confusing.
2152
+
2153
+ 539
2154
+ 00:25:30,789 --> 00:25:31,989
2155
+ There's a whole bunch of them.
2156
+
2157
+ 540
2158
+ 00:25:32,230 --> 00:25:35,190
2159
+ And the emerging ASRs, which are also a thing.
2160
+
2161
+ 541
2162
+ 00:25:35,349 --> 00:25:37,190
2163
+ My intention for this is I'm not sure I'm gonna
2164
+
2165
+ 542
2166
+ 00:25:37,190 --> 00:25:39,990
2167
+ have the time in any point in the foreseeable future
2168
+
2169
+ 543
2170
+ 00:25:39,990 --> 00:25:44,855
2171
+ to go back to this whole episode and create a
2172
+
2173
+ 544
2174
+ 00:25:44,855 --> 00:25:48,374
2175
+ proper source truth where I fix everything.
2176
+
2177
+ 545
2178
+ 00:25:49,335 --> 00:25:51,974
2179
+ Might do it if I can get one transcription that's
2180
+
2181
+ 546
2182
+ 00:25:51,974 --> 00:25:54,214
2183
+ sufficiently close to perfection.
2184
+
2185
+ 547
2186
+ 00:25:55,014 --> 00:25:58,480
2187
+ But what I would actually love to do on Hugging
2188
+
2189
+ 548
2190
+ 00:25:58,480 --> 00:26:00,559
2191
+ Face, I think would be a great probably how I
2192
+
2193
+ 549
2194
+ 00:26:00,559 --> 00:26:04,480
2195
+ might visualize this is having the audio waveform play and
2196
+
2197
+ 550
2198
+ 00:26:04,480 --> 00:26:08,960
2199
+ then have the transcript for each model below it and
2200
+
2201
+ 551
2202
+ 00:26:08,960 --> 00:26:13,845
2203
+ maybe even a, like, you know, to scale and maybe
2204
+
2205
+ 552
2206
+ 00:26:13,845 --> 00:26:16,724
2207
+ even a local one as well, like local whisper versus
2208
+
2209
+ 553
2210
+ 00:26:16,724 --> 00:26:19,764
2211
+ OpenAI API, etcetera.
2212
+
2213
+ 554
2214
+ 00:26:19,845 --> 00:26:23,204
2215
+ And I can then actually listen back to segments or
2216
+
2217
+ 555
2218
+ 00:26:23,204 --> 00:26:25,365
2219
+ anyone who wants to can listen back to segments of
2220
+
2221
+ 556
2222
+ 00:26:25,365 --> 00:26:30,299
2223
+ this recording and see where a particular model struggled and
2224
+
2225
+ 557
2226
+ 00:26:30,299 --> 00:26:33,179
2227
+ others didn't as well as the sort of headline finding
2228
+
2229
+ 558
2230
+ 00:26:33,179 --> 00:26:35,659
2231
+ of which had the best W E R but that
2232
+
2233
+ 559
2234
+ 00:26:35,659 --> 00:26:37,739
2235
+ would require the source of truth.
2236
+
2237
+ 560
2238
+ 00:26:37,740 --> 00:26:38,539
2239
+ Okay, that's it.
2240
+
2241
+ 561
2242
+ 00:26:38,505 --> 00:26:41,065
2243
+ I hope this was, I don't know, maybe useful for
2244
+
2245
+ 562
2246
+ 00:26:41,065 --> 00:26:42,984
2247
+ other folks interested in STT.
2248
+
2249
+ 563
2250
+ 00:26:43,065 --> 00:26:46,025
2251
+ You want to see I always think I've just said
2252
+
2253
+ 564
2254
+ 00:26:46,025 --> 00:26:47,704
2255
+ it as something I didn't intend to.
2256
+
2257
+ 565
2258
+ 00:26:47,944 --> 00:26:49,704
2259
+ STT, I said for those.
2260
+
2261
+ 566
2262
+ 00:26:49,704 --> 00:26:53,129
2263
+ Listen carefully, including hopefully the models themselves.
2264
+
2265
+ 567
2266
+ 00:26:53,369 --> 00:26:55,129
2267
+ This has been myself, Daniel Rosol.
2268
+
2269
+ 568
2270
+ 00:26:55,129 --> 00:26:59,450
2271
+ For more jumbled repositories about my roving interest in AI
2272
+
2273
+ 569
2274
+ 00:26:59,450 --> 00:27:04,089
2275
+ but particularly AgenTic, MCP and VoiceTech you can find me
2276
+
2277
+ 570
2278
+ 00:27:04,089 --> 00:27:05,769
2279
+ on GitHub.
2280
+
2281
+ 571
2282
+ 00:27:06,009 --> 00:27:06,730
2283
+ Hugging Face.
2284
+
2285
+ 572
2286
+ 00:27:08,125 --> 00:27:09,004
2287
+ Where else?
2288
+
2289
+ 573
2290
+ 00:27:09,005 --> 00:27:11,805
2291
+ DanielRosel dot com, which is my personal website, as well
2292
+
2293
+ 574
2294
+ 00:27:11,805 --> 00:27:15,565
2295
+ as this podcast whose name I sadly cannot remember.
2296
+
2297
+ 575
2298
+ 00:27:15,724 --> 00:27:16,765
2299
+ Until next time.
2300
+
2301
+ 576
2302
+ 00:27:16,765 --> 00:27:17,404
2303
+ Thanks for listening.
2304
+
srt-out/speechmatics.srt ADDED
@@ -0,0 +1,2069 @@
+ 1
+ 00:00:00,120 --> 00:00:06,520
+ Hello and welcome to a audio data
+ set consisting of one single
+
+ 2
+ 00:00:06,520 --> 00:00:12,120
+ episode of a non-existent podcast.
+ Or it, uh, I may append this to a
+
+ 3
+ 00:00:12,120 --> 00:00:16,640
+ podcast that I set up recently.
+ Um, regarding my, uh,
+
+ 4
+ 00:00:16,680 --> 00:00:21,960
+ with my thoughts on speech,
+ tech and AI in particular,
+
+ 5
+ 00:00:22,240 --> 00:00:27,960
+ more AI and generative AI, I would,
+ uh, I would say, but in any event,
+
+ 6
+ 00:00:27,960 --> 00:00:32,480
+ the purpose of this, um,
+ voice recording is actually to create
+
+ 7
+ 00:00:32,680 --> 00:00:37,560
+ a lengthy voice sample for a quick
+ evaluation, a back of the envelope
+
+ 8
+ 00:00:37,560 --> 00:00:41,160
+ evaluation, as they might say,
+ for different speech to text models.
+
+ 9
+ 00:00:41,160 --> 00:00:43,800
+ And I'm doing this because I,
+ uh, I thought I'd made a great
+
+ 10
+ 00:00:43,800 --> 00:00:48,320
+ breakthrough in my journey with
+ speech tech, and that was succeeding
+
+ 11
+ 00:00:48,320 --> 00:00:52,720
+ in the elusive task of fine tuning.
+ Whisper, whisper is.
+
+ 12
+ 00:00:52,840 --> 00:00:56,960
+ And I'm going to just talk.
+ I'm trying to mix up, uh,
+
+ 13
+ 00:00:56,960 --> 00:01:00,470
+ I'm going to try a few different
+ styles of speaking.
+
+ 14
+ 00:01:00,470 --> 00:01:02,630
+ I might whisper something at
+ some point as well,
+
+ 15
+ 00:01:03,190 --> 00:01:07,150
+ and I'll go back to speaking loud in,
+ uh, in different parts.
+
+ 16
+ 00:01:07,150 --> 00:01:09,710
+ I'm going to sound really like a
+ crazy person, because I'm also
+
+ 17
+ 00:01:09,710 --> 00:01:15,870
+ going to try to speak at different
+ pitches and cadences in order to
+
+ 18
+ 00:01:15,910 --> 00:01:20,630
+ really try to put a speech to
+ text model through its paces,
+
+ 19
+ 00:01:20,630 --> 00:01:25,870
+ which is trying to make sense of,
+ is this guy just on incoherently in
+
+ 20
+ 00:01:25,870 --> 00:01:34,350
+ one long sentence, or are these just
+ actually a series of step standalone,
+
+ 21
+ 00:01:34,350 --> 00:01:37,510
+ standalone, standalone sentences?
+ And how is it going to handle
+
+ 22
+ 00:01:37,510 --> 00:01:40,750
+ step alone? That's not a word.
+ Uh, what happens when you use
+
+ 23
+ 00:01:40,750 --> 00:01:44,030
+ speech to text and you use a fake
+ word and then you're like, wait,
+
+ 24
+ 00:01:44,030 --> 00:01:48,350
+ that's not actually that word doesn't
+ exist. How does AI handle that?
+
+ 25
+ 00:01:48,390 --> 00:01:53,910
+ And, uh, these and more are all
+ the questions that I'm seeking
+
+ 26
+ 00:01:53,910 --> 00:01:57,350
+ to answer in this training data.
+ Now, why did why was it trying
+
+ 27
+ 00:01:57,350 --> 00:01:59,740
+ to fine tune a whisper?
+ And what is whisper?
+
+ 28
+ 00:01:59,780 --> 00:02:03,540
+ As I said, I'm gonna try to, uh,
+ record this at a couple of different
+
+ 29
+ 00:02:03,540 --> 00:02:09,060
+ levels of technicality for folks who
+ are, uh, you know, in the normal, uh,
+
+ 30
+ 00:02:09,060 --> 00:02:13,460
+ world and not totally stuck down
+ the rabbit hole of AI, uh, which I
+
+ 31
+ 00:02:13,460 --> 00:02:17,460
+ have to say is a really wonderful,
+ uh, rabbit hole to be to be down.
+
+ 32
+ 00:02:17,580 --> 00:02:21,700
+ Um, it's a really interesting area.
+ And speech and voice tech is is
+
+ 33
+ 00:02:21,940 --> 00:02:24,980
+ the aspect of it that I find
+ actually most.
+
+ 34
+ 00:02:25,180 --> 00:02:28,340
+ I'm not sure I would say the most
+ interesting, because there's just
+
+ 35
+ 00:02:28,340 --> 00:02:32,700
+ so much that is fascinating in AI.
+ Uh, but the most that I find the
+
+ 36
+ 00:02:32,700 --> 00:02:36,220
+ most personally transformative
+ in terms of the impact that it's
+
+ 37
+ 00:02:36,220 --> 00:02:41,660
+ had on my daily work life and
+ productivity and how I sort of work.
+
+ 38
+ 00:02:41,940 --> 00:02:48,020
+ And I'm persevering hard with the
+ task of trying to guess a good
+
+ 39
+ 00:02:48,020 --> 00:02:51,700
+ solution working for Linux, which if
+ anyone actually does listen to this,
+
+ 40
+ 00:02:51,700 --> 00:02:55,100
+ not just for the training data
+ and for the actual content, uh,
+
+ 41
+ 00:02:55,140 --> 00:02:59,600
+ this is this is has sparked I had
+ besides the fine tune not working.
+
+ 42
+ 00:02:59,600 --> 00:03:05,560
+ Well, that was the failure.
+ Um, I used clod code because one
+
+ 43
+ 00:03:05,560 --> 00:03:10,160
+ thinks these days that there is
+ nothing short of solving,
+
+ 44
+ 00:03:11,040 --> 00:03:14,680
+ you know, the, uh,
+ the reason of life or something.
+
+ 45
+ 00:03:15,080 --> 00:03:19,560
+ Uh, that clod and agentic AI can't
+ do, uh, which is not really the case.
+
+ 46
+ 00:03:19,600 --> 00:03:23,600
+ Uh, it does seem that way sometimes,
+ but it fails a lot as well.
+
+ 47
+ 00:03:23,600 --> 00:03:26,960
+ And this is one of those, uh,
+ instances where last week I put
+
+ 48
+ 00:03:26,960 --> 00:03:31,400
+ together an hour of voice training
+ data, basically speaking just
+
+ 49
+ 00:03:31,400 --> 00:03:35,040
+ random things for three minutes.
+ And, um,
+
+ 50
+ 00:03:35,720 --> 00:03:38,520
+ it was actually kind of tedious
+ because the texts were really weird.
+
+ 51
+ 00:03:38,520 --> 00:03:42,120
+ Some of them were it was like it
+ was AI generated.
+
+ 52
+ 00:03:42,320 --> 00:03:44,920
+ Um, I tried before to read
+ Sherlock Holmes for an hour and
+
+ 53
+ 00:03:44,920 --> 00:03:47,000
+ I just couldn't.
+ I was so bored, uh,
+
+ 54
+ 00:03:47,040 --> 00:03:50,800
+ after ten minutes that I was like,
+ okay, now I'm just gonna have to
+
+ 55
+ 00:03:50,800 --> 00:03:56,470
+ find something else to read.
+ So I used a created with AI
+
+ 56
+ 00:03:56,510 --> 00:04:00,150
+ studio vibe coded.
+ A synthetic text generator.
+
+ 57
+ 00:04:00,390 --> 00:04:03,990
+ Um, which actually I thought was
+ probably a better way of doing it
+
+ 58
+ 00:04:03,990 --> 00:04:08,870
+ because it would give me more short
+ samples with more varied content.
+
+ 59
+ 00:04:08,870 --> 00:04:13,310
+ So I was like, okay, give me a voice
+ note, like I'm recording an email,
+
+ 60
+ 00:04:13,310 --> 00:04:18,110
+ give me a short story to read,
+ give me prose, um, to read.
+
+ 61
+ 00:04:18,110 --> 00:04:21,310
+ So I came up with all these
+ different things, and I added a
+
+ 62
+ 00:04:21,310 --> 00:04:24,750
+ little timer to it so I could
+ see how close I was to one hour.
+
+ 63
+ 00:04:24,990 --> 00:04:29,830
+ Um, and, uh, I spent like an hour one
+ afternoon or probably two hours by
+
+ 64
+ 00:04:29,830 --> 00:04:34,190
+ the time you, um, you do retakes
+ or whatever because you want to.
+
+ 65
+ 00:04:34,990 --> 00:04:39,190
+ It gave me a source of truth,
+ which I'm not sure if that's the
+
+ 66
+ 00:04:39,190 --> 00:04:43,550
+ scientific way to approach this topic
+ of gathering, uh, training data,
+
+ 67
+ 00:04:43,550 --> 00:04:48,070
+ but I thought it made sense.
+ Um, I have a lot of audio data
+
+ 68
+ 00:04:48,070 --> 00:04:52,070
+ from recording voice notes,
+ which I've also kind of used, um,
+
+ 69
+ 00:04:52,070 --> 00:04:55,780
+ been experimenting with using for
+ a different purpose, slightly
+
+ 70
+ 00:04:55,780 --> 00:05:00,820
+ different annotating task types.
+ It's more text classification
+
+ 71
+ 00:05:00,820 --> 00:05:03,740
+ experiment or uh, well,
+ it's more than that, actually.
+
+ 72
+ 00:05:03,740 --> 00:05:08,100
+ I'm working on a voice app,
+ so it's a prototype I guess is
+
+ 73
+ 00:05:08,100 --> 00:05:12,780
+ really more accurate.
+ Um, but you can do that and you
+
+ 74
+ 00:05:12,780 --> 00:05:14,220
+ can work backwards.
+ You're like,
+
+ 75
+ 00:05:14,260 --> 00:05:18,620
+ you listen back to a voice note
+ and you painfully go through one
+
+ 76
+ 00:05:18,620 --> 00:05:21,980
+ of those transcribing, you know,
+ where you start and stop and scrub
+
+ 77
+ 00:05:21,980 --> 00:05:24,100
+ around it and you fix the errors.
+ But it's really,
+
+ 78
+ 00:05:24,100 --> 00:05:27,220
+ really boring to do that.
+ So I thought it would be less
+
+ 79
+ 00:05:27,220 --> 00:05:31,860
+ tedious in the long term if I just
+ recorded The Source of truth.
+
+ 80
+ 00:05:32,180 --> 00:05:34,300
+ So it gave me these three minute
+ snippets.
+
+ 81
+ 00:05:34,300 --> 00:05:38,780
+ I recorded them and saved an MP3
+ and a txt in the same folder,
+
+ 82
+ 00:05:38,780 --> 00:05:43,820
+ and I created an hour of that data.
+ Uh, so I was very hopeful, quietly,
+
+ 83
+ 00:05:43,860 --> 00:05:46,380
+ you know, a little bit hopeful
+ that I would be able that I could
+
+ 84
+ 00:05:46,380 --> 00:05:49,700
+ actually fine tune, whisper.
+ Um, I want to fine tune whisper
+
+ 85
+ 00:05:49,700 --> 00:05:54,840
+ because when I got into voice tech
+ last November, my wife was in
+
+ 86
+ 00:05:54,840 --> 00:05:59,600
+ the US and I was alone at home.
+ And you know, when crazy people
+
+ 87
+ 00:05:59,600 --> 00:06:03,760
+ like me do really wild things like
+ use voice to tech, uh, technology.
+
+ 88
+ 00:06:03,760 --> 00:06:06,520
+ That was basically, um,
+ when I started doing it,
+
+ 89
+ 00:06:06,520 --> 00:06:10,280
+ I didn't feel like a crazy person
+ speaking to myself, and my
+
+ 90
+ 00:06:10,280 --> 00:06:16,120
+ expectations weren't that high.
+ Uh, I used speech tech now and again.
+
+ 91
+ 00:06:16,200 --> 00:06:18,480
+ Um, tried it out.
+ I was like, it'd be really cool
+
+ 92
+ 00:06:18,480 --> 00:06:20,520
+ if you could just, like,
+ speak into your computer.
+
+ 93
+ 00:06:20,880 --> 00:06:24,720
+ And whatever I tried out that
+ had Linux support was just.
+
+ 94
+ 00:06:25,440 --> 00:06:28,640
+ It was not good, basically.
+ Um, and this blew me away from
+
+ 95
+ 00:06:28,640 --> 00:06:32,040
+ the first go.
+ I mean, it wasn't 100% accurate
+
+ 96
+ 00:06:32,080 --> 00:06:35,160
+ out of the box and it took work,
+ but it was good enough that there was
+
+ 97
+ 00:06:35,160 --> 00:06:39,720
+ a solid foundation and it kind of
+ passed that, uh, pivot point that
+
+ 98
+ 00:06:39,720 --> 00:06:42,880
+ it's actually worth doing this.
+ You know, there's a point where
+
+ 99
+ 00:06:42,880 --> 00:06:46,920
+ it's so like the transcript is you
+ don't have to get 100% accuracy
+
+ 100
+ 00:06:46,920 --> 00:06:50,630
+ for it to be worth your time for
+ speech to text to be a worthwhile
+
+ 101
+ 00:06:50,630 --> 00:06:53,070
+ addition to your productivity.
+ But you do need to get above.
+
+ 102
+ 00:06:53,110 --> 00:06:57,750
+ Let's say, I don't know, 85%.
+ If it's 60% or 50%,
+
+ 103
+ 00:06:57,750 --> 00:07:00,790
+ you inevitably say, screw it.
+ I'll just type it because you end up
+
+ 104
+ 00:07:00,790 --> 00:07:05,070
+ missing errors in the transcript
+ and it becomes actually worse.
+
+ 105
+ 00:07:05,070 --> 00:07:06,830
+ You end up in a worse position
+ than you started with.
+
+ 106
+ 00:07:06,830 --> 00:07:11,030
+ And that's been my experience.
+ So, um, I was like, oh,
+
+ 107
+ 00:07:11,070 --> 00:07:13,550
+ this is actually really, really good.
+ Now how did that happen?
+
+ 108
+ 00:07:13,550 --> 00:07:18,910
+ And the answer is ASR whisper
+ being open sourced and the
+
+ 109
+ 00:07:18,910 --> 00:07:21,910
+ transformer architecture,
+ if you want to go back to the,
+
+ 110
+ 00:07:22,510 --> 00:07:26,750
+ um, to the underpinnings, which
+ really blows my mind and it's on my
+
+ 111
+ 00:07:26,750 --> 00:07:32,430
+ list to read through that paper.
+ Um, all you need is attention as
+
+ 112
+ 00:07:33,470 --> 00:07:38,470
+ attentively as can be done with my
+ limited brain because it's super,
+
+ 113
+ 00:07:38,470 --> 00:07:42,310
+ super high level stuff.
+ Um, super advanced stuff.
+
+ 114
+ 00:07:42,350 --> 00:07:48,070
+ I mean, uh, but that I think of all
+ the things that are fascinating
+
+ 115
+ 00:07:48,180 --> 00:07:52,820
+ about the sudden rise in AI and
+ the dramatic capabilities.
+
+ 116
+ 00:07:53,420 --> 00:07:55,700
+ I find it fascinating that few
+ people are like, hang on,
+
+ 117
+ 00:07:55,860 --> 00:07:59,740
+ you've got this thing that can speak
+ to you like a chatbot, an LLM,
+
+ 118
+ 00:08:00,420 --> 00:08:05,580
+ and then you've got image generation.
+ Okay, so firstly, those two things on
+
+ 119
+ 00:08:05,580 --> 00:08:10,860
+ the surface have nothing in common.
+ Um, so like how are they how did that
+
+ 120
+ 00:08:10,860 --> 00:08:13,100
+ just happen all at the same time.
+ And then when you extend that
+
+ 121
+ 00:08:13,100 --> 00:08:16,180
+ further, um, you're like sooner,
+ right?
+
+ 122
+ 00:08:16,180 --> 00:08:21,700
+ You can sing a song and AI will like,
+ come up with an instrumental and then
+
+ 123
+ 00:08:21,700 --> 00:08:23,860
+ you've got whisper and you're like,
+ wait a second,
+
+ 124
+ 00:08:24,060 --> 00:08:28,100
+ how did all this stuff, like,
+ if it's all AI, what's like there
+
+ 125
+ 00:08:28,100 --> 00:08:30,700
+ has to be some commonality.
+ Otherwise these are four.
+
+ 126
+ 00:08:30,780 --> 00:08:34,780
+ These are totally different
+ technologies on the surface of it.
+
+ 127
+ 00:08:34,780 --> 00:08:40,220
+ And, uh, the transformer architecture
+ is, as far as I know, the answer.
+
+ 128
+ 00:08:40,220 --> 00:08:43,860
+ And I can't even say can't even
+ pretend that I really understand
+
+ 129
+ 00:08:44,140 --> 00:08:47,290
+ what the transformer
+ architecture means in depth,
+
+ 130
+ 00:08:47,290 --> 00:08:51,810
+ but I have scanned it and as I said,
+ I want to print it and really kind
+
+ 131
+ 00:08:51,810 --> 00:08:56,770
+ of think over it at some point,
+ and I'll probably feel bad about
+
+ 132
+ 00:08:56,770 --> 00:08:59,090
+ myself, I think,
+ because weren't those guys in their
+
+ 133
+ 00:08:59,130 --> 00:09:04,010
+ in their 20s like, that's crazy.
+ I think I asked ChatGPT once who
+
+ 134
+ 00:09:04,050 --> 00:09:08,370
+ were the who wrote that paper
+ and how old were they when it
+
+ 135
+ 00:09:08,370 --> 00:09:11,290
+ was published in arXiv?
+ And I was expecting like,
+
+ 136
+ 00:09:11,530 --> 00:09:13,450
+ I don't know,
+ what do you what do you imagine?
+
+ 137
+ 00:09:13,450 --> 00:09:15,050
+ I personally imagine kind of like,
+ you know,
+
+ 138
+ 00:09:15,090 --> 00:09:19,210
+ you have these breakthroughs during
+ Covid and things like that where
+
+ 139
+ 00:09:19,250 --> 00:09:22,210
+ like these kind of really obscure
+ scientists who are like in their
+
+ 140
+ 00:09:22,210 --> 00:09:27,250
+ 50s and they've just kind of been
+ laboring in labs and, uh, wearily
+
+ 141
+ 00:09:27,250 --> 00:09:30,650
+ and writing in publishing in kind
+ of obscure academic publications.
+
+ 142
+ 00:09:30,850 --> 00:09:34,050
+ And they finally, like,
+ hit a big or win a Nobel Prize and
+
+ 143
+ 00:09:34,050 --> 00:09:37,930
+ then their household household names.
+ Uh, so that was kind of what I
+
+ 144
+ 00:09:37,930 --> 00:09:39,770
+ had in mind.
+ That was the mental image I'd
+
+ 145
+ 00:09:39,770 --> 00:09:44,010
+ formed of the birth of arXiv.
+ Like, I wasn't expecting 20
+
+ 146
+ 00:09:44,050 --> 00:09:47,430
+ somethings in San Francisco,
+ though I thought that was both very,
+
+ 147
+ 00:09:47,430 --> 00:09:49,990
+ very funny, very cool,
+ and actually kind of inspiring.
+
+ 148
+ 00:09:50,510 --> 00:09:55,630
+ It's nice to think that people who,
+ you know, just you might put them
+
+ 149
+ 00:09:55,630 --> 00:10:01,030
+ in the kind of milieu or bubble or
+ world that you are in or credibly in,
+
+ 150
+ 00:10:01,070 --> 00:10:03,710
+ through, you know,
+ a series of connections that are
+
+ 151
+ 00:10:03,710 --> 00:10:07,750
+ coming up with such literally
+ world changing, um, innovations.
+
+ 152
+ 00:10:07,790 --> 00:10:11,550
+ Uh, so that was, I thought,
+ anyway, that, that that was cool.
+
+ 153
+ 00:10:12,190 --> 00:10:14,070
+ Okay. Voice training data.
+ How are we doing?
+
+ 154
+ 00:10:14,070 --> 00:10:18,110
+ We're about ten minutes, and I'm
+ still talking about voice technology.
+
+ 155
+ 00:10:18,310 --> 00:10:22,470
+ Um, so whisper was brilliant,
+ and I was so excited that I was.
+
+ 156
+ 00:10:22,470 --> 00:10:25,750
+ My first instinct was to, like,
+ get like, oh, my gosh,
+
+ 157
+ 00:10:25,750 --> 00:10:27,830
+ I have to get, like,
+ a really good microphone for this.
+
+ 158
+ 00:10:28,070 --> 00:10:31,750
+ So, um, I didn't go on a
+ spending spree because I said,
+
+ 159
+ 00:10:31,790 --> 00:10:34,590
+ I'm gonna have to just wait a
+ month and see if I still use this.
+
+ 160
+ 00:10:35,030 --> 00:10:40,110
+ And it just kind of became it's
+ become really part of my daily
+
+ 161
+ 00:10:40,110 --> 00:10:43,110
+ routine.
+ Like, if I'm writing an email,
+
+ 162
+ 00:10:43,110 --> 00:10:47,140
+ I'll record a voice note.
+ And then I've developed and it's
+
+ 163
+ 00:10:47,140 --> 00:10:50,020
+ nice to see that everyone is
+ like developing the same things
+
+ 164
+ 00:10:50,020 --> 00:10:52,020
+ in parallel.
+ Like, that's kind of a weird thing
+
+ 165
+ 00:10:52,060 --> 00:10:57,460
+ to say, but when I look, I kind of
+ came when I started working on this,
+
+ 166
+ 00:10:57,500 --> 00:11:00,820
+ these prototypes on GitHub,
+ which is where I just kind of
830
+
831
+ 167
832
+ 00:11:00,860 --> 00:11:04,860
833
+ share very freely and loosely,
834
+ uh, ideas and, you know,
835
+
836
+ 168
837
+ 00:11:04,900 --> 00:11:10,140
838
+ first iterations on, on concepts,
839
+ um, and for want of a better word,
840
+
841
+ 169
842
+ 00:11:10,140 --> 00:11:14,020
843
+ I called it like, uh,
844
+ LLM post-processing or cleanup or
845
+
846
+ 170
847
+ 00:11:14,260 --> 00:11:18,220
848
+ basically a system prompt that after
849
+ you get back the raw text from
850
+
851
+ 171
852
+ 00:11:18,540 --> 00:11:24,220
853
+ whisper, you run it through a model
854
+ and say, okay, this is crappy text,
855
+
856
+ 172
857
+ 00:11:24,260 --> 00:11:27,260
858
+ like add sentence structure and,
859
+ you know, fix it up.
860
+
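The post-processing step described above can be sketched roughly as below: wrap the raw speech-to-text output in a chat payload whose system prompt asks a model to add punctuation and structure. The prompt wording and the `build_cleanup_messages` helper are illustrative assumptions, not the speaker's actual setup.

```python
# Sketch of the "LLM post-processing" step: raw transcript text goes in,
# an OpenAI-style chat `messages` payload comes out, ready to send to a
# model for cleanup. Prompt text and helper name are illustrative.

CLEANUP_SYSTEM_PROMPT = (
    "You will receive raw speech-to-text output with little or no "
    "punctuation. Add sentence structure, punctuation, and paragraph "
    "spacing. Do not change the meaning or add new content."
)

def build_cleanup_messages(raw_transcript: str) -> list[dict]:
    """Build a chat-completion messages list for transcript cleanup."""
    return [
        {"role": "system", "content": CLEANUP_SYSTEM_PROMPT},
        {"role": "user", "content": raw_transcript},
    ]

msgs = build_cleanup_messages(
    "okay so whisper was brilliant and i was so excited"
)
```

Keeping the cleanup instruction in the system role means the raw text can be passed through verbatim as the user turn, whatever it contains.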
861
+ 173
862
+ 00:11:27,700 --> 00:11:32,780
863
+ And, um, now when I'm exploring the
864
+ different tools that are out there
865
+
866
+ 174
867
+ 00:11:32,820 --> 00:11:36,700
868
+ that people have built, I see, uh,
869
+ quite a number of projects have
870
+
871
+ 175
872
+ 00:11:37,300 --> 00:11:41,820
873
+ basically done the same thing,
874
+ um, lest that be misconstrued.
875
+
876
+ 176
877
+ 00:11:41,820 --> 00:11:44,490
878
+ I'm not saying for a millisecond
879
+ that I inspired them.
880
+
881
+ 177
882
+ 00:11:44,490 --> 00:11:49,010
883
+ I'm sure this has been a thing that's
884
+ been integrated into tools for a
885
+
886
+ 178
887
+ 00:11:49,050 --> 00:11:52,410
888
+ while, but it's it's the kind of
889
+ thing that when you start using these
890
+
891
+ 179
892
+ 00:11:52,410 --> 00:11:56,850
893
+ tools every day, the need for it
894
+ is almost instantly apparent, uh,
895
+
896
+ 180
897
+ 00:11:56,850 --> 00:12:00,890
898
+ because text that doesn't have any
899
+ punctuation or paragraph spacing
900
+
901
+ 181
902
+ 00:12:00,930 --> 00:12:04,370
903
+ takes a long time to, you know,
904
+ it takes so long to get it into
905
+
906
+ 182
907
+ 00:12:04,370 --> 00:12:09,490
908
+ a presentable email that again,
909
+ it's it's it moves speech tech
910
+
911
+ 183
912
+ 00:12:09,530 --> 00:12:13,050
913
+ into that before that inflection
914
+ point where you're like, no,
915
+
916
+ 184
917
+ 00:12:13,050 --> 00:12:16,370
918
+ it's just not worth it.
919
+ It's like it'll just be quicker
920
+
921
+ 185
922
+ 00:12:16,370 --> 00:12:18,970
923
+ to type this.
924
+ So it's a big it's a little touch.
925
+
926
+ 186
927
+ 00:12:18,970 --> 00:12:24,210
928
+ That actually is a big deal.
929
+ Uh, so I was on whisper and I've
930
+
931
+ 187
932
+ 00:12:24,210 --> 00:12:28,290
933
+ been using whisper and I kind of
934
+ early on found a couple of tools.
935
+
936
+ 188
937
+ 00:12:28,330 --> 00:12:31,050
938
+ I couldn't find what I was
939
+ looking for on Linux, which is,
940
+
941
+ 189
942
+ 00:12:31,490 --> 00:12:35,890
943
+ um, basically just something
944
+ that'll run in the background.
945
+
946
+ 190
947
+ 00:12:35,930 --> 00:12:40,250
948
+ You'll give it an API key and it
949
+ will just transcribe. Um.
950
+
951
+ 191
952
+ 00:12:41,400 --> 00:12:44,120
953
+ with, like, a little key to
954
+ start and stop the dictation.
955
+
956
+ 192
957
+ 00:12:44,720 --> 00:12:49,160
958
+ Uh, and the issues were I discovered
959
+ that, like most people involved in
960
+
961
+ 193
962
+ 00:12:49,160 --> 00:12:54,040
963
+ creating these projects were very
964
+ much focused on local models running
965
+
966
+ 194
967
+ 00:12:54,040 --> 00:12:57,520
968
+ whisper locally, because you can.
969
+ And I tried that a bunch of
970
+
971
+ 195
972
+ 00:12:57,520 --> 00:13:00,960
973
+ times and just never got results
974
+ that were as good as the cloud.
975
+
976
+ 196
977
+ 00:13:01,280 --> 00:13:04,760
978
+ And when I began looking at the
979
+ cost of the speech to text APIs
980
+
981
+ 197
982
+ 00:13:04,760 --> 00:13:08,640
983
+ and what I was spending,
984
+ I just thought there's it's actually,
985
+
986
+ 198
987
+ 00:13:08,840 --> 00:13:13,320
988
+ in my opinion, just one of the better
989
+ deals in API spending and in cloud.
990
+
991
+ 199
992
+ 00:13:13,360 --> 00:13:17,400
993
+ Like it's just not that expensive
994
+ for very, very good models that are
995
+
996
+ 200
997
+ 00:13:17,520 --> 00:13:20,960
998
+ much more, you know, you're going
999
+ to be able to run the full model,
1000
+
1001
+ 201
1002
+ 00:13:21,480 --> 00:13:26,080
1003
+ the latest model versus whatever
1004
+ you can run on your average GPU.
1005
+
1006
+ 202
1007
+ 00:13:26,120 --> 00:13:29,880
1008
+ Unless you want to buy a crazy GPU.
1009
+ It doesn't really make sense to me.
1010
+
1011
+ 203
1012
+ 00:13:29,880 --> 00:13:33,600
1013
+ Now, privacy is another concern.
1014
+ Um, that I know is kind of like a
1015
+
1016
+ 204
1017
+ 00:13:33,640 --> 00:13:37,040
1018
+ very much a separate thing that
1019
+ people just don't want their voice
1020
+
1021
+ 205
1022
+ 00:13:37,040 --> 00:13:39,910
1023
+ data and their voice leaving
1024
+ their local environment,
1025
+
1026
+ 206
1027
+ 00:13:40,230 --> 00:13:43,950
1028
+ maybe for regulatory reasons as well.
1029
+ Um, but I'm not in that.
1030
+
1031
+ 207
1032
+ 00:13:44,030 --> 00:13:48,030
1033
+ Um, I neither really care about
1034
+ people listening to my, uh,
1035
+
1036
+ 208
1037
+ 00:13:48,070 --> 00:13:51,310
1038
+ grocery list consisting of, uh,
1039
+ reminding myself that I need to
1040
+
1041
+ 209
1042
+ 00:13:51,350 --> 00:13:54,910
1043
+ buy more beer, Cheetos and hummus,
1044
+ which is kind of the three,
1045
+
1046
+ 210
1047
+ 00:13:55,110 --> 00:13:59,430
1048
+ three staples of my diet.
1049
+ Um, during periods of poor nutrition.
1050
+
1051
+ 211
1052
+ 00:13:59,710 --> 00:14:03,430
1053
+ Uh, but the kind of stuff that I
1054
+ transcribe, it's just not it's not a,
1055
+
1056
+ 212
1057
+ 00:14:04,110 --> 00:14:09,470
1058
+ it's not a privacy thing that I'm
1059
+ sort of sensitive about, and, uh,
1060
+
1061
+ 213
1062
+ 00:14:09,470 --> 00:14:13,190
1063
+ I don't do anything so,
1064
+ you know, sensitive or secure,
1065
+
1066
+ 214
1067
+ 00:14:13,190 --> 00:14:16,710
1068
+ that requires air gapping.
1069
+ So, um, I looked at the pricing and
1070
+
1071
+ 215
1072
+ 00:14:16,710 --> 00:14:20,390
1073
+ especially the kind of older models,
1074
+ mini, um, some of them are very,
1075
+
1076
+ 216
1077
+ 00:14:20,390 --> 00:14:23,230
1078
+ very affordable.
1079
+ And I did a back of the I did a
1080
+
1081
+ 217
1082
+ 00:14:23,230 --> 00:14:27,270
1083
+ calculation once with ChatGPT
1084
+ and I was like, okay, this is a,
1085
+
1086
+ 218
1087
+ 00:14:27,270 --> 00:14:31,190
1088
+ this is the API price for I can't
1089
+ remember whatever the model was.
1090
+
1091
+ 219
1092
+ 00:14:31,670 --> 00:14:34,030
1093
+ Uh, let's say I just go at it
1094
+ like nonstop,
1095
+
1096
+ 220
1097
+ 00:14:34,150 --> 00:14:37,530
1098
+ which it rarely happens. Probably.
1099
+ I would say on average,
1100
+
1101
+ 221
1102
+ 00:14:37,530 --> 00:14:42,010
1103
+ I might dictate 30 to 60 minutes per
1104
+ day if I was probably summing up
1105
+
1106
+ 222
1107
+ 00:14:42,010 --> 00:14:48,610
1108
+ the emails, documents, outlines,
1109
+ um, which is a lot, but it's it's
1110
+
1111
+ 223
1112
+ 00:14:48,610 --> 00:14:50,850
1113
+ still a fairly modest amount.
1114
+ And I was like, well,
1115
+
1116
+ 224
1117
+ 00:14:50,890 --> 00:14:54,050
1118
+ some days I do go on like 1 or 2
1119
+ days where I've been.
1120
+
1121
+ 225
1122
+ 00:14:54,570 --> 00:14:58,570
1123
+ Usually when I'm like kind of out of
1124
+ the house and just have something
1125
+
1126
+ 226
1127
+ 00:14:59,210 --> 00:15:02,370
1128
+ like, I have nothing else to do.
1129
+ Like if I'm at a hospital with a
1130
+
1131
+ 227
1132
+ 00:15:02,370 --> 00:15:07,090
1133
+ newborn, uh, and you're waiting
1134
+ for like eight hours and hours
1135
+
1136
+ 228
1137
+ 00:15:07,090 --> 00:15:10,330
1138
+ for an appointment, and I would
1139
+ probably have listened to podcasts
1140
+
1141
+ 229
1142
+ 00:15:10,610 --> 00:15:14,130
1143
+ before becoming a speech fanatic.
1144
+ And I'm like, oh, wait,
1145
+
1146
+ 230
1147
+ 00:15:14,170 --> 00:15:16,490
1148
+ let me just get down.
1149
+ Let me just get these ideas out
1150
+
1151
+ 231
1152
+ 00:15:16,530 --> 00:15:19,730
1153
+ of my head.
1154
+ And that's when I'll go on my
1155
+
1156
+ 232
1157
+ 00:15:19,770 --> 00:15:21,650
1158
+ speech binges.
1159
+ But those are like once every
1160
+
1161
+ 233
1162
+ 00:15:21,650 --> 00:15:25,090
1163
+ few months, like not frequently.
1164
+ But I said, okay, let's just say
1165
+
1166
+ 234
1167
+ 00:15:25,090 --> 00:15:30,770
1168
+ if I'm gonna price out
1169
+ cloud ASR as if I was, like, dedicated
1170
+
1171
+ 235
1172
+ 00:15:30,770 --> 00:15:37,000
1173
+ every second of every waking hour to
1174
+ transcribing for some odd reason. Um.
1175
+
1176
+ 236
1177
+ 00:15:37,320 --> 00:15:39,800
1178
+ I mean, I'd have to, like,
1179
+ eat and use the toilet and,
1180
+
1181
+ 237
1182
+ 00:15:39,840 --> 00:15:42,640
1183
+ like, you know, there's only so
1184
+ many hours I'm awake for.
1185
+
1186
+ 238
1187
+ 00:15:42,640 --> 00:15:44,800
1188
+ So, like,
1189
+ let's just say a maximum of, like,
1190
+
1191
+ 239
1192
+ 00:15:44,840 --> 00:15:48,800
1193
+ 40 hours, 45 minutes in the hour.
1194
+ Then I said, all right,
1195
+
1196
+ 240
1197
+ 00:15:48,800 --> 00:15:52,720
1198
+ let's just say 50. Who knows?
1199
+ You're dictating on the toilet.
1200
+
1201
+ 241
1202
+ 00:15:52,760 --> 00:15:54,000
1203
+ We do it.
1204
+ Uh,
1205
+
1206
+ 242
1207
+ 00:15:54,000 --> 00:15:58,840
1208
+ so it could be you could just do 60.
1209
+ But whatever I did, and every day,
1210
+
1211
+ 243
1212
+ 00:15:58,880 --> 00:16:02,560
1213
+ like, you're going flat out seven
1214
+ days a week dictating non-stop.
1215
+
1216
+ 244
1217
+ 00:16:02,600 --> 00:16:06,560
1218
+ I was like, what's my monthly API
1219
+ bill going to be at this price?
1220
+
1221
+ 245
1222
+ 00:16:06,840 --> 00:16:09,240
1223
+ And it came out to like 70 or 80
1224
+ bucks.
1225
+
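The ceiling-cost estimate described here is easy to reproduce. The per-minute rate below is an assumption for illustration (OpenAI's published whisper-1 price of $0.006/minute); the speaker's exact model and rate aren't specified.

```python
# Back-of-the-envelope dictation cost: nonstop dictation at an assumed
# per-minute API rate, scaled from hours per week to a monthly bill.

RATE_PER_MINUTE = 0.006          # USD; assumed whisper-1-style pricing

def monthly_cost(hours_per_week: float, rate: float = RATE_PER_MINUTE) -> float:
    """Monthly API cost for a given weekly dictation volume."""
    weeks_per_month = 52 / 12    # ~4.33 weeks per month
    minutes = hours_per_week * 60 * weeks_per_month
    return minutes * rate

# The "flat out" scenario: ~50 hours/week of dictation.
ceiling = monthly_cost(50)       # ~$78/month, in the 70-80 range cited

# A more typical hour per day is an order of magnitude cheaper.
typical = monthly_cost(7)        # ~$10.92/month
```

Even the deliberately absurd every-waking-hour scenario lands under $100/month at this rate, which is the point being made.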
1226
+ 246
1227
+ 00:16:09,240 --> 00:16:14,200
1228
+ And I was like, well, that would be
1229
+ an extraordinary amount of dictation.
1230
+
1231
+ 247
1232
+ 00:16:14,200 --> 00:16:17,960
1233
+ And I would hope that there was
1234
+ some compelling reason,
1235
+
1236
+ 248
1237
+ 00:16:18,160 --> 00:16:22,320
1238
+ worth more than $70,
1239
+ that I embarked upon that project.
1240
+
1241
+ 249
1242
+ 00:16:22,520 --> 00:16:25,320
1243
+ Uh, so given that that's kind of the
1244
+ max point for me, I said, that's
1245
+
1246
+ 250
1247
+ 00:16:25,360 --> 00:16:29,120
1248
+ actually very, very affordable.
1249
+ Um, now you're gonna if you want
1250
+
1251
+ 251
1252
+ 00:16:29,160 --> 00:16:34,200
1253
+ to spec out the costs and you want
1254
+ to do the post-processing that I
1255
+
1256
+ 252
1257
+ 00:16:34,270 --> 00:16:37,230
1258
+ really do feel is valuable.
1259
+ Um, that's going to cost some more as
1260
+
1261
+ 253
1262
+ 00:16:37,230 --> 00:16:43,230
1263
+ well, unless you're using Gemini,
1264
+ which, uh, needless to say, is a
1265
+
1266
+ 254
1267
+ 00:16:43,230 --> 00:16:47,070
1268
+ random person sitting in Jerusalem.
1269
+ Uh, I have no affiliation,
1270
+
1271
+ 255
1272
+ 00:16:47,070 --> 00:16:51,470
1273
+ nor with Google, nor Anthropic,
1274
+ nor Gemini, nor any major tech vendor
1275
+
1276
+ 256
1277
+ 00:16:51,470 --> 00:16:56,910
1278
+ for that matter. Um, I like Gemini.
1279
+ Not so much as an everyday model.
1280
+
1281
+ 257
1282
+ 00:16:56,990 --> 00:16:59,950
1283
+ Um, it's kind of underwhelmed in
1284
+ that respect, I would say.
1285
+
1286
+ 258
1287
+ 00:17:00,350 --> 00:17:03,150
1288
+ But for multimodal,
1289
+ I think it's got a lot to offer.
1290
+
1291
+ 259
1292
+ 00:17:03,430 --> 00:17:06,990
1293
+ And I think that the transcribing
1294
+ functionality whereby it can,
1295
+
1296
+ 260
1297
+ 00:17:07,390 --> 00:17:12,270
1298
+ um, process audio with a system
1299
+ prompt and both give you
1300
+
1301
+ 261
1302
+ 00:17:12,310 --> 00:17:15,510
1303
+ transcription that's cleaned up,
1304
+ that reduces two steps to one.
1305
+
1306
+ 262
1307
+ 00:17:15,830 --> 00:17:18,750
1308
+ And that for me is a very,
1309
+ very big deal.
1310
+
1311
+ 263
1312
+ 00:17:18,750 --> 00:17:23,110
1313
+ And, uh, I feel like even Google
1314
+ hasn't really sort of thought
1315
+
1316
+ 264
1317
+ 00:17:23,110 --> 00:17:27,550
1318
+ through how useful that
1319
+ modality is and what kind of use
1320
+
1321
+ 265
1322
+ 00:17:27,550 --> 00:17:30,910
1323
+ cases you can achieve with it.
1324
+ Because I found in the course of
1325
+
1326
+ 266
1327
+ 00:17:30,910 --> 00:17:36,610
1328
+ this year just an endless list
1329
+ of really kind of system prompt,
1330
+
1331
+ 267
1332
+ 00:17:36,850 --> 00:17:41,410
1333
+ system prompt stuff that I can say,
1334
+ okay, I've used it to capture context
1335
+
1336
+ 268
1337
+ 00:17:41,410 --> 00:17:45,690
1338
+ data for AI, which is literally I
1339
+ might speak for if I wanted to have a
1340
+
1341
+ 269
1342
+ 00:17:45,690 --> 00:17:49,850
1343
+ good bank of context data about,
1344
+ who knows, my childhood.
1345
+
1346
+ 270
1347
+ 00:17:50,130 --> 00:17:53,570
1348
+ Uh, more realistically,
1349
+ maybe my career goals, uh,
1350
+
1351
+ 271
1352
+ 00:17:53,570 --> 00:17:56,130
1353
+ something that would just be,
1354
+ like, really boring to type out.
1355
+
1356
+ 272
1357
+ 00:17:56,250 --> 00:18:01,250
1358
+ So I'll just, like, sit in my car
1359
+ and record it for ten minutes.
1360
+
1361
+ 273
1362
+ 00:18:01,250 --> 00:18:04,210
1363
+ And that ten minutes,
1364
+ you get a lot of information in,
1365
+
1366
+ 274
1367
+ 00:18:04,650 --> 00:18:10,210
1368
+ um, emails, which is short text.
1369
+ Um, just there is a whole bunch.
1370
+
1371
+ 275
1372
+ 00:18:10,210 --> 00:18:13,690
1373
+ And all these workflows kind of
1374
+ require a little bit of treatment
1375
+
1376
+ 276
1377
+ 00:18:13,690 --> 00:18:17,610
1378
+ afterwards and different treatment.
1379
+ My context pipeline is kind of like
1380
+
1381
+ 277
1382
+ 00:18:17,610 --> 00:18:21,330
1383
+ just extract the bare essentials.
1384
+ So you end up with me talking very
1385
+
1386
+ 278
1387
+ 00:18:21,330 --> 00:18:24,370
1388
+ loosely about sort of what I've done
1389
+ in my career, where I've worked,
1390
+
1391
+ 279
1392
+ 00:18:24,370 --> 00:18:27,730
1393
+ where I might like to work,
1394
+ and it goes it condenses that
1395
+
1396
+ 280
1397
+ 00:18:27,730 --> 00:18:31,720
1398
+ down to very robotic language
1399
+ that is easy to chunk, parse,
1400
+
1401
+ 281
1402
+ 00:18:31,720 --> 00:18:36,080
1403
+ and maybe put into a vector database.
1404
+ Daniel has worked in technology,
1405
+
1406
+ 282
1407
+ 00:18:36,120 --> 00:18:39,760
1408
+ Daniel is a has been working in,
1409
+ you know, stuff like that.
1410
+
1411
+ 283
1412
+ 00:18:39,760 --> 00:18:43,720
1413
+ That's not how you would speak.
1414
+ Um, but I figure it's probably easier
1415
+
1416
+ 284
1417
+ 00:18:43,720 --> 00:18:48,240
1418
+ to parse for, after all, robots.
1419
+ So we've almost got to 20 minutes.
1420
+
1421
+ 285
1422
+ 00:18:48,240 --> 00:18:52,760
1423
+ And this is actually a success
1424
+ because I wasted 20 minutes of my,
1425
+
1426
+ 286
1427
+ 00:18:52,920 --> 00:18:57,000
1428
+ uh, of the evening speaking into
1429
+ a microphone, and, uh,
1430
+
1431
+ 287
1432
+ 00:18:57,040 --> 00:19:00,960
1433
+ the levels were shot and, uh, it,
1434
+ uh, it was clipping and I said,
1435
+
1436
+ 288
1437
+ 00:19:00,960 --> 00:19:03,320
1438
+ I can't really do an evaluation.
1439
+ I have to be fair.
1440
+
1441
+ 289
1442
+ 00:19:03,320 --> 00:19:07,120
1443
+ I have to give the models a
1444
+ chance to do their thing.
1445
+
1446
+ 290
1447
+ 00:19:07,640 --> 00:19:09,480
1448
+ Uh,
1449
+ what am I hoping to achieve in this?
1450
+
1451
+ 291
1452
+ 00:19:09,520 --> 00:19:12,720
1453
+ Okay, my fine tune was a dud,
1454
+ as mentioned. Deepgram STT:
1455
+
1456
+ 292
1457
+ 00:19:12,760 --> 00:19:15,640
1458
+ I'm really, really hopeful that
1459
+ this prototype will work.
1460
+
1461
+ 293
1462
+ 00:19:15,920 --> 00:19:19,080
1463
+ And it's a built in public open
1464
+ source, so anyone is welcome to
1465
+
1466
+ 294
1467
+ 00:19:19,120 --> 00:19:23,040
1468
+ use it if I make anything good.
1469
+ Um, but that was really exciting for
1470
+
1471
+ 295
1472
+ 00:19:23,040 --> 00:19:27,520
1473
+ me last night when after hours of,
1474
+ um, trying my own prototype,
1475
+
1476
+ 296
1477
+ 00:19:27,520 --> 00:19:31,350
1478
+ seeing someone just made
1479
+ something that works like that.
1480
+
1481
+ 297
1482
+ 00:19:31,390 --> 00:19:32,790
1483
+ You know,
1484
+ you're not going to have to build a
1485
+
1486
+ 298
1487
+ 00:19:32,790 --> 00:19:38,350
1488
+ custom conda environment and image.
1489
+ I have an AMD GPU, which makes
1490
+
1491
+ 299
1492
+ 00:19:38,350 --> 00:19:42,430
1493
+ things much more complicated.
1494
+ I didn't find it and I was about
1495
+
1496
+ 300
1497
+ 00:19:42,430 --> 00:19:44,110
1498
+ to give up and I said,
1499
+ all right, let me just give
1500
+
1501
+ 301
1502
+ 00:19:44,110 --> 00:19:48,870
1503
+ Deepgram's Linux thing a shot.
1504
+ And if this doesn't work, um,
1505
+
1506
+ 302
1507
+ 00:19:48,870 --> 00:19:51,270
1508
+ I'm just going to go back to
1509
+ trying to code something myself.
1510
+
1511
+ 303
1512
+ 00:19:51,630 --> 00:19:56,310
1513
+ And when I ran the script,
1514
+ I was using Claude Code to do the
1515
+
1516
+ 304
1517
+ 00:19:56,310 --> 00:20:00,150
1518
+ installation process.
1519
+ It ran the script and oh my gosh,
1520
+
1521
+ 305
1522
+ 00:20:00,190 --> 00:20:05,470
1523
+ it works just like that.
1524
+ Uh, the tricky thing for all those
1525
+
1526
+ 306
1527
+ 00:20:05,470 --> 00:20:10,430
1528
+ who wants to know all the nitty
1529
+ gritty, nitty gritty details, um, was
1530
+
1531
+ 307
1532
+ 00:20:10,430 --> 00:20:13,870
1533
+ that I don't think it was actually
1534
+ struggling with transcription, but
1535
+
1536
+ 308
1537
+ 00:20:13,870 --> 00:20:18,670
1538
+ pasting. Wayland makes life very hard,
1539
+ and I think there was something not
1540
+
1541
+ 309
1542
+ 00:20:18,670 --> 00:20:21,990
1543
+ running at the right time anyway.
1544
+ Deepgram I looked at how they
1545
+
1546
+ 310
1547
+ 00:20:21,990 --> 00:20:24,830
1548
+ actually handle that because it
1549
+ worked out of the box when other
1550
+
1551
+ 311
1552
+ 00:20:24,830 --> 00:20:29,260
1553
+ stuff didn't, and it was quite a
1554
+ clever little mechanism,
1555
+
1556
+ 312
1557
+ 00:20:29,580 --> 00:20:32,220
1558
+ and but more so than that,
1559
+ the accuracy was brilliant.
1560
+
1561
+ 313
1562
+ 00:20:32,260 --> 00:20:35,140
1563
+ Now, what am I doing here?
1564
+ This is going to be a 20 minute
1565
+
1566
+ 314
1567
+ 00:20:35,380 --> 00:20:43,100
1568
+ audio sample, and I'm I think
1569
+ I've done 1 or 2 of these before,
1570
+
1571
+ 315
1572
+ 00:20:43,100 --> 00:20:49,300
1573
+ but I did it with short, snappy voice
1574
+ notes. This is kind of long form.
1575
+
1576
+ 316
1577
+ 00:20:49,580 --> 00:20:51,860
1578
+ This actually might be a better
1579
+ approximation for what's useful
1580
+
1581
+ 317
1582
+ 00:20:51,860 --> 00:20:56,220
1583
+ to me than voice memos.
1584
+ Like I need to buy three liters
1585
+
1586
+ 318
1587
+ 00:20:56,220 --> 00:20:59,300
1588
+ of milk tomorrow, and pita bread,
1589
+ which is probably how like half
1590
+
1591
+ 319
1592
+ 00:20:59,300 --> 00:21:02,940
1593
+ my voice notes sound like
1594
+ if anyone were to, I don't know,
1595
+
1596
+ 320
1597
+ 00:21:02,980 --> 00:21:04,700
1598
+ like find my phone,
1599
+ they'd be like, this is the most
1600
+
1601
+ 321
1602
+ 00:21:04,700 --> 00:21:07,540
1603
+ boring person in the world.
1604
+ Although actually there are some
1605
+
1606
+ 322
1607
+ 00:21:07,580 --> 00:21:09,820
1608
+ like kind of, uh,
1609
+ journaling thoughts as well.
1610
+
1611
+ 323
1612
+ 00:21:09,820 --> 00:21:13,820
1613
+ But it's a lot of content like that.
1614
+ And the probably for the evaluation,
1615
+
1616
+ 324
1617
+ 00:21:13,820 --> 00:21:20,780
1618
+ the most useful thing is slightly
1619
+ obscure tech GitHub uh, hugging face
1620
+
1621
+ 325
1622
+ 00:21:21,300 --> 00:21:24,780
1623
+ not so obscure that it's not going
1624
+ to have a chance of knowing it,
1625
+
1626
+ 326
1627
+ 00:21:24,780 --> 00:21:27,760
1628
+ but hopefully sufficiently well
1629
+ known that the model should get it.
1630
+
1631
+ 327
1632
+ 00:21:28,320 --> 00:21:30,880
1633
+ I tried to do a little bit of
1634
+ speaking really fast and
1635
+
1636
+ 328
1637
+ 00:21:30,880 --> 00:21:33,320
1638
+ speaking very slowly.
1639
+ I would say in general,
1640
+
1641
+ 329
1642
+ 00:21:33,320 --> 00:21:37,000
1643
+ I've spoken, delivered this at a
1644
+ faster pace than I usually would
1645
+
1646
+ 330
1647
+ 00:21:37,040 --> 00:21:40,400
1648
+ owing to strong coffee flowing
1649
+ through my bloodstream.
1650
+
1651
+ 331
1652
+ 00:21:41,040 --> 00:21:44,320
1653
+ And the thing that I'm not going
1654
+ to get in this benchmark is
1655
+
1656
+ 332
1657
+ 00:21:44,320 --> 00:21:47,000
1658
+ background noise, which in my first
1659
+ take that I had to get rid of,
1660
+
1661
+ 333
1662
+ 00:21:47,800 --> 00:21:51,360
1663
+ my wife came in with my son
1664
+ for a good night kiss.
1665
+
1666
+ 334
1667
+ 00:21:51,560 --> 00:21:55,240
1668
+ And that actually would have
1669
+ been super helpful to get in
1670
+
1671
+ 335
1672
+ 00:21:55,240 --> 00:21:59,880
1673
+ because it was not diarised.
1674
+ Or if we had diarisation, a female,
1675
+
1676
+ 336
1677
+ 00:22:00,000 --> 00:22:02,400
1678
+ I could say I want the male
1679
+ voice and that wasn't intended
1680
+
1681
+ 337
1682
+ 00:22:02,400 --> 00:22:05,400
1683
+ for transcription.
1684
+ Um, and we're not going to get
1685
+
1686
+ 338
1687
+ 00:22:05,400 --> 00:22:07,080
1688
+ background noise like people
1689
+ honking their horns,
1690
+
1691
+ 339
1692
+ 00:22:07,080 --> 00:22:11,400
1693
+ which is something I've done in my
1694
+ main data set where I am trying to
1695
+
1696
+ 340
1697
+ 00:22:11,560 --> 00:22:15,640
1698
+ go back to some of my voice notes,
1699
+ annotate them, and run a benchmark.
1700
+
1701
+ 341
1702
+ 00:22:15,640 --> 00:22:19,080
1703
+ But this is going to be just a
1704
+ pure quick test.
1705
+
1706
+ 342
1707
+ 00:22:19,560 --> 00:22:24,000
1708
+ And as someone working on a
1709
+ voice note idea,
1710
+
1711
+ 343
1712
+ 00:22:24,000 --> 00:22:28,350
1713
+ that's my sort of end motivation.
1714
+ Besides thinking it's an
1715
+
1716
+ 344
1717
+ 00:22:28,350 --> 00:22:31,710
1718
+ absolutely outstanding technology
1719
+ that's coming to viability.
1720
+
1721
+ 345
1722
+ 00:22:31,710 --> 00:22:34,790
1723
+ And really, I know this sounds
1724
+ cheesy can actually have a very
1725
+
1726
+ 346
1727
+ 00:22:34,790 --> 00:22:38,950
1728
+ transformative effect.
1729
+ Um, it's, you know, voice technology
1730
+
1731
+ 347
1732
+ 00:22:38,990 --> 00:22:45,030
1733
+ has been life changing for, uh,
1734
+ folks living with, um, disabilities.
1735
+
1736
+ 348
1737
+ 00:22:45,750 --> 00:22:48,670
1738
+ And I think there's something
1739
+ really nice about the fact that
1740
+
1741
+ 349
1742
+ 00:22:48,670 --> 00:22:52,830
1743
+ it can also benefit, you know,
1744
+ folks who are able bodied and like,
1745
+
1746
+ 350
1747
+ 00:22:52,870 --> 00:22:59,070
1748
+ we can all in different ways, um,
1749
+ make this tech as useful as possible,
1750
+
1751
+ 351
1752
+ 00:22:59,110 --> 00:23:01,230
1753
+ regardless of the exact way that
1754
+ we're using it.
1755
+
1756
+ 352
1757
+ 00:23:01,630 --> 00:23:04,830
1758
+ Um, and I think there's something
1759
+ very powerful in that, and it can be
1760
+
1761
+ 353
1762
+ 00:23:04,830 --> 00:23:09,030
1763
+ very cool. Um, I see use potential.
1764
+ What excites me about voice tech?
1765
+
1766
+ 354
1767
+ 00:23:09,870 --> 00:23:13,670
1768
+ A lot of things, actually.
1769
+ Firstly, the fact that it's cheap
1770
+
1771
+ 355
1772
+ 00:23:13,670 --> 00:23:17,230
1773
+ and accurate, as I mentioned at
1774
+ the very start of this, um,
1775
+
1776
+ 356
1777
+ 00:23:17,230 --> 00:23:20,910
1778
+ and it's getting better and better
1779
+ with stuff like accent handling, um,
1780
+
1781
+ 357
1782
+ 00:23:20,910 --> 00:23:24,300
1783
+ I'm not sure my, my fine tune will
1784
+ actually ever come to fruition in the
1785
+
1786
+ 358
1787
+ 00:23:24,300 --> 00:23:27,980
1788
+ sense that I'll use it day to day,
1789
+ as I imagine I get like superb,
1790
+
1791
+ 359
1792
+ 00:23:27,980 --> 00:23:33,660
1793
+ flawless word error rates because I'm
1794
+ just kind of skeptical about local
1795
+
1796
+ 360
1797
+ 00:23:33,660 --> 00:23:38,220
1798
+ speech to text, as I mentioned.
1799
+ And I think the pace of innovation
1800
+
1801
+ 361
1802
+ 00:23:38,220 --> 00:23:42,180
1803
+ and improvement in the models,
1804
+ the main reasons for fine tuning from
1805
+
1806
+ 362
1807
+ 00:23:42,180 --> 00:23:46,460
1808
+ what I've seen have been people who
1809
+ are something that really blows,
1810
+
1811
+ 363
1812
+ 00:23:46,500 --> 00:23:53,060
1813
+ blows my mind about ASR is the idea
1814
+ that it's inherently alingual
1815
+
1816
+ 364
1817
+ 00:23:53,060 --> 00:23:59,220
1818
+ or multilingual phonetic based.
1819
+ So for folks who speak very
1820
+
1821
+ 365
1822
+ 00:23:59,260 --> 00:24:02,340
1823
+ obscure languages, there may
1824
+ be a paucity of
1825
+
1826
+ 366
1827
+ 00:24:02,340 --> 00:24:05,620
1828
+ training data or almost none at all,
1829
+ and therefore the accuracy is
1830
+
1831
+ 367
1832
+ 00:24:05,620 --> 00:24:10,780
1833
+ significantly reduced or folks
1834
+ in very critical environments.
1835
+
1836
+ 368
1837
+ 00:24:10,820 --> 00:24:13,500
1838
+ I know there are.
1839
+ This is used extensively in medical
1840
+
1841
+ 369
1842
+ 00:24:13,500 --> 00:24:18,260
1843
+ transcription and dispatcher work as,
1844
+ um, you know, the call centers who
1845
+
1846
+ 370
1847
+ 00:24:18,260 --> 00:24:22,610
1848
+ send out ambulances, etc., where
1849
+ accuracy is absolutely paramount.
1850
+
1851
+ 371
1852
+ 00:24:22,610 --> 00:24:26,170
1853
+ And in the case of doctors,
1854
+ radiologists, they might be using
1855
+
1856
+ 372
1857
+ 00:24:26,170 --> 00:24:29,730
1858
+ very specialized vocab all the time.
1859
+ So those are kind of the main
1860
+
1861
+ 373
1862
+ 00:24:29,730 --> 00:24:31,650
1863
+ two things.
1864
+ And I'm not sure that really just for
1865
+
1866
+ 374
1867
+ 00:24:31,650 --> 00:24:37,410
1868
+ trying to make it better on a few
1869
+ random tech words with my slightly.
1870
+
1871
+ 375
1872
+ 00:24:37,450 --> 00:24:41,370
1873
+ I mean, I have an accent, but like,
1874
+ not, you know, an accent that a few
1875
+
1876
+ 376
1877
+ 00:24:41,410 --> 00:24:47,330
1878
+ other million people have. Ish.
1879
+ I'm not sure that my little fine
1880
+
1881
+ 377
1882
+ 00:24:47,330 --> 00:24:52,370
1883
+ tune is actually going to, like, deliver the
1884
+ bump in word error rate reduction.
1885
+
1886
+ 378
1887
+ 00:24:52,370 --> 00:24:54,690
1888
+ If I ever actually figure out how
1889
+ to do it and get it up to the
1890
+
1891
+ 379
1892
+ 00:24:54,690 --> 00:24:58,730
1893
+ cloud by the time I've done that.
1894
+ I suspect that the next
1895
+
1896
+ 380
1897
+ 00:24:58,730 --> 00:25:01,530
1898
+ generation of ASR will just be
1899
+ so good that it will kind of be.
1900
+
1901
+ 381
1902
+ 00:25:02,050 --> 00:25:03,890
1903
+ Ah, well,
1904
+ that would be cool if it worked out,
1905
+
1906
+ 382
1907
+ 00:25:03,890 --> 00:25:08,850
1908
+ but I'll just use this instead.
1909
+ So that's going to be it for today's
1910
+
1911
+ 383
1912
+ 00:25:08,850 --> 00:25:14,250
1913
+ episode of, uh, voice training data.
1914
+ Single long shot evaluation.
1915
+
1916
+ 384
1917
+ 00:25:14,530 --> 00:25:17,450
1918
+ Who am I going to compare?
1919
+ Whisper is always good as a
1920
+
1921
+ 385
1922
+ 00:25:17,450 --> 00:25:20,720
1923
+ benchmark, but I'm more
1924
+ interested in seeing Whisper
1925
+
1926
+ 386
1927
+ 00:25:20,720 --> 00:25:25,200
1928
+ head to head with two things,
1929
+ really. One is whisper variance.
1930
+
1931
+ 387
1932
+ 00:25:25,200 --> 00:25:30,000
1933
+ So you've got these projects like
1934
+ faster-whisper, distil-whisper.
1935
+
1936
+ 388
1937
+ 00:25:30,000 --> 00:25:31,760
1938
+ It's a bit confusing.
1939
+ There's a whole bunch of them
1940
+
1941
+ 389
1942
+ 00:25:32,040 --> 00:25:34,920
1943
+ and the emerging ASRs,
1944
+ which are also a thing.
1945
+
1946
+ 390
1947
+ 00:25:35,320 --> 00:25:37,800
1948
+ My intention for this is I'm not
1949
+ sure I'm going to have the time
1950
+
1951
+ 391
1952
+ 00:25:37,800 --> 00:25:41,760
1953
+ in any point in the foreseeable
1954
+ future to go back through this whole
1955
+
1956
+ 392
1957
+ 00:25:41,760 --> 00:25:46,680
1958
+ episode and create a proper source
1959
+ of truth or fix
1960
+
1961
+ 393
1962
+ 00:25:47,440 --> 00:25:51,800
1963
+ everything. Might do it if I can
1964
+ get one transcription that's
1965
+
1966
+ 394
1967
+ 00:25:51,800 --> 00:25:56,840
1968
+ sufficiently close to perfection.
1969
+ But what I would actually love
1970
+
1971
+ 395
1972
+ 00:25:56,840 --> 00:25:59,920
1973
+ to do on Hugging Face I think
1974
+ would be a great.
1975
+
1976
+ 396
1977
+ 00:25:59,920 --> 00:26:03,680
1978
+ Probably how I might visualize this
1979
+ is having the audio waveform play,
1980
+
1981
+ 397
1982
+ 00:26:04,160 --> 00:26:09,920
1983
+ and then have the transcript for each
1984
+ model below it, and maybe even a,
1985
+
1986
+ 398
1987
+ 00:26:10,600 --> 00:26:15,240
1988
+ um, like, you know, two scale and
1989
+ maybe even a local one as well,
1990
+
1991
+ 399
1992
+ 00:26:15,280 --> 00:26:21,820
1993
+ like local whisper versus open
1994
+ AI API, Etc. and, um, I can then
1995
+
1996
+ 400
1997
+ 00:26:21,820 --> 00:26:24,500
1998
+ actually listen back to segments
1999
+ or anyone who wants to can listen
2000
+
2001
+ 401
2002
+ 00:26:24,500 --> 00:26:29,540
2003
+ back to segments of this recording
2004
+ and see where a particular model
2005
+
2006
+ 402
2007
+ 00:26:29,580 --> 00:26:33,060
2008
+ struggled and others didn't, as well
2009
+ as the sort of headline finding
2010
+
2011
+ 403
2012
+ 00:26:33,100 --> 00:26:36,900
2013
+ of which had the best, uh, wer.
2014
+ But that would require the source
2015
+
2016
+ 404
2017
+ 00:26:36,900 --> 00:26:40,140
2018
+ of truth. Okay. That's it.
2019
+ Hope this was, I don't know,
2020
+
2021
+ 405
2022
+ 00:26:40,300 --> 00:26:43,580
2023
+ maybe useful for other folks
2024
+ interested in stuff you want to see.
2025
+
2026
+ 406
2027
+ 00:26:44,060 --> 00:26:48,220
2028
+ I always feel think I've just said
2029
+ something I didn't intend to say.
2030
+
2031
+ 407
2032
+ 00:26:48,780 --> 00:26:51,140
2033
+ I said for those, listen carefully.
2034
+ Including, hopefully,
2035
+
2036
+ 408
2037
+ 00:26:51,140 --> 00:26:54,180
2038
+ the models themselves.
2039
+ This has been myself,
2040
+
2041
+ 409
2042
+ 00:26:54,220 --> 00:26:58,020
2043
+ Daniel Rosehill, for more, um,
2044
+ jumbled repositories about my,
2045
+
2046
+ 410
2047
+ 00:26:58,060 --> 00:27:00,940
2048
+ uh, roving interest in AI,
2049
+ but particularly Agentic,
2050
+
2051
+ 411
2052
+ 00:27:01,300 --> 00:27:05,460
2053
+ MCP and voice tech.
2054
+ Uh, you can find me on GitHub.
2055
+
2056
+ 412
2057
+ 00:27:05,940 --> 00:27:11,260
2058
+ Hugging face. Where else?
2059
+ Daniel, which is my personal website,
2060
+
2061
+ 413
2062
+ 00:27:11,260 --> 00:27:15,380
2063
+ as well as this podcast whose
2064
+ name I sadly cannot remember.
2065
+
2066
+ 414
2067
+ 00:27:15,820 --> 00:27:17,540
2068
+ Until next time.
2069
+ Thanks for listening.
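
With a ground-truth SRT now in the repo, the "headline finding of which had the best WER" the episode mentions can be sketched in a few lines. This is a hypothetical helper, not code from this commit: `srtToText` and `wer` are assumed names, and WER here is the standard word-level edit distance (substitutions + deletions + insertions) divided by the reference word count.

```javascript
// Strip SRT cue indices and timestamp lines, keeping only the spoken text.
function srtToText(srt) {
  return srt
    .split(/\r?\n/)
    .filter(
      (line) =>
        line.trim() !== "" &&        // blank separators
        !/^\d+$/.test(line.trim()) && // cue index lines like "382"
        !/-->/.test(line)             // timestamp lines
    )
    .join(" ");
}

// Word error rate via classic dynamic-programming edit distance over words.
function wer(reference, hypothesis) {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  // d[i][j] = edit distance between first i reference words, first j hypothesis words.
  const d = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,     // deletion
        d[i][j - 1] + 1,     // insertion
        d[i - 1][j - 1] + cost // substitution or match
      );
    }
  }
  return d[ref.length][hyp.length] / ref.length;
}
```

Run `srtToText` over `truth_1.srt` and over each model's SRT, then compare with `wer` to get the per-model headline number.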
style.css CHANGED
@@ -1,28 +1,198 @@
+*,
+*::before,
+*::after {
+  box-sizing: border-box;
+}
+
+:root {
+  color-scheme: light;
+  font-family: "Inter", "Segoe UI", system-ui, -apple-system, BlinkMacSystemFont, "Arial", sans-serif;
+  background: #f5f6fa;
+  color: #1f2937;
+}
+
 body {
-  padding: 2rem;
-  font-family: -apple-system, BlinkMacSystemFont, "Arial", sans-serif;
+  margin: 0;
+  background: radial-gradient(circle at top, #ffffff, #eef2ff 60%);
+  min-height: 100vh;
+  padding: 2.5rem clamp(1rem, 4vw, 4rem);
+}
+
+.app {
+  max-width: 960px;
+  margin: 0 auto;
+  display: grid;
+  gap: 2rem;
 }
 
 h1 {
-  font-size: 16px;
-  margin-top: 0;
+  font-weight: 700;
+  margin-bottom: 0.25rem;
 }
 
 p {
-  color: rgb(107, 114, 128);
-  font-size: 15px;
-  margin-bottom: 10px;
-  margin-top: 5px;
+  margin: 0;
+  color: #4b5563;
+  line-height: 1.6;
 }
 
-.card {
-  max-width: 620px;
-  margin: 0 auto;
-  padding: 16px;
-  border: 1px solid lightgray;
-  border-radius: 16px;
+.hero {
+  display: grid;
+  gap: 1.5rem;
+  padding: 1.5rem;
+  border-radius: 24px;
+  border: 1px solid rgba(255, 255, 255, 0.6);
+  background: rgba(255, 255, 255, 0.85);
+  box-shadow: 0 25px 60px rgba(15, 23, 42, 0.08);
+}
+
+.audio-shell {
+  background: rgba(15, 23, 42, 0.9);
+  border-radius: 18px;
+  padding: 1rem;
+  box-shadow: inset 0 0 0 1px rgba(255, 255, 255, 0.1);
+  display: grid;
+  gap: 0.75rem;
+}
+
+audio {
+  width: 100%;
+  filter: drop-shadow(0 5px 20px rgba(0, 0, 0, 0.3));
+}
+
+#waveform {
+  width: 100%;
+  height: 140px;
+  border-radius: 12px;
+  background: rgba(255, 255, 255, 0.05);
 }
 
-.card p:last-child {
+.transcripts {
+  display: grid;
+  gap: 1.75rem;
+}
+
+#reference-track {
+  display: grid;
+}
+
+.models-grid {
+  display: grid;
+  gap: 1.25rem;
+}
+
+.track {
+  background: #ffffff;
+  border-radius: 20px;
+  padding: 1.25rem;
+  box-shadow: 0 20px 40px rgba(15, 23, 42, 0.07);
+  border: 1px solid rgba(31, 41, 55, 0.08);
+}
+
+.track--error {
+  border: 1px dashed rgba(239, 68, 68, 0.6);
+  background: rgba(254, 242, 242, 0.9);
+  box-shadow: none;
+}
+
+.track--emphasis {
+  border: 2px solid #00b894;
+  box-shadow: 0 25px 45px rgba(0, 184, 148, 0.15);
+}
+
+.track header {
+  display: flex;
+  flex-wrap: wrap;
+  align-items: center;
+  justify-content: space-between;
+  gap: 0.75rem;
+  margin-bottom: 1rem;
+}
+
+.track h2 {
+  font-size: 1.25rem;
+  margin: 0;
+}
+
+.badge {
+  background: rgba(31, 41, 55, 0.05);
+  color: #1f2937;
+  padding: 0.2rem 0.75rem;
+  border-radius: 999px;
+  font-size: 0.85rem;
+}
+
+.badge--error {
+  background: rgba(239, 68, 68, 0.15);
+  color: #b91c1c;
+}
+
+.track-error {
+  margin: 0;
+  color: #b91c1c;
+  font-weight: 500;
+}
+
+.track-body {
+  display: grid;
+  gap: 0.5rem;
+  max-height: 300px;
+  overflow: auto;
+  scrollbar-width: thin;
+}
+
+.track-body::-webkit-scrollbar {
+  width: 6px;
+}
+
+.track-body::-webkit-scrollbar-thumb {
+  background: rgba(31, 41, 55, 0.25);
+  border-radius: 999px;
+}
+
+.segment {
+  padding: 0.85rem 1rem;
+  border-radius: 14px;
+  background: rgba(15, 23, 42, 0.03);
+  border: 1px solid rgba(0, 0, 0, 0.04);
+  transition: background 0.25s ease, transform 0.25s ease, border 0.25s ease;
+}
+
+.segment p {
+  margin-top: 0.35rem;
   margin-bottom: 0;
+  color: #111827;
+}
+
+.segment-time {
+  display: inline-flex;
+  align-items: center;
+  gap: 0.35rem;
+  font-family: "JetBrains Mono", "SFMono-Regular", Consolas, monospace;
+  font-size: 0.85rem;
+  color: rgba(55, 65, 81, 0.9);
+}
+
+.segment.is-active {
+  background: rgba(64, 112, 244, 0.08);
+  border-color: var(--accent, rgba(64, 112, 244, 0.5));
+  box-shadow: 0 10px 20px rgba(64, 112, 244, 0.15);
+  transform: translateY(-2px);
+}
+
+.track--emphasis .segment.is-active {
+  background: rgba(0, 184, 148, 0.1);
+  box-shadow: 0 12px 24px rgba(0, 184, 148, 0.2);
+  border-color: rgba(0, 184, 148, 0.6);
+}
+
+@media (min-width: 720px) {
+  .hero {
+    grid-template-columns: 1.1fr 0.9fr;
+    align-items: center;
+  }
+
+  .models-grid {
+    grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
+  }
 }
transcripts.js ADDED
The diff for this file is too large to render. See raw diff
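
The stylesheet above animates `.segment.is-active`, which implies transcripts.js (too large to render here) toggles that class as the audio plays. A minimal sketch of what that sync logic likely looks like — function names and the `{start, end}` cue shape are assumptions, not code from this commit:

```javascript
// Convert an SRT timestamp like "00:25:03,890" into seconds.
function srtTimeToSeconds(stamp) {
  const [h, m, rest] = stamp.split(":");
  const [s, ms] = rest.split(",");
  return Number(h) * 3600 + Number(m) * 60 + Number(s) + Number(ms) / 1000;
}

// Pure helper: index of the cue covering `time`, or -1 if between cues.
function activeSegmentIndex(segments, time) {
  return segments.findIndex((seg) => time >= seg.start && time < seg.end);
}

// Browser wiring (hypothetical element lookups, mirroring the CSS classes):
// audio.addEventListener("timeupdate", () => {
//   const i = activeSegmentIndex(cues, audio.currentTime);
//   track.querySelectorAll(".segment").forEach((el, j) =>
//     el.classList.toggle("is-active", j === i)
//   );
// });
```

Keeping the time math in pure helpers like these makes the per-track highlight trivially testable outside the browser.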