1
00:00:00,000 --> 00:00:06,160
Hello and welcome to an audio dataset consisting of one
2
00:00:06,160 --> 00:00:08,320
single episode of a nonexistent podcast.
3
00:00:08,720 --> 00:00:12,800
Or I may append this to a podcast that
4
00:00:12,800 --> 00:00:18,734
I set up recently with my thoughts on
5
00:00:18,735 --> 00:00:20,735
speech tech and
6
00:00:20,735 --> 00:00:21,134
AI,
7
00:00:21,134 --> 00:00:22,734
in particular, more
8
00:00:22,734 --> 00:00:22,974
AI
9
00:00:22,974 --> 00:00:23,855
and generative
10
00:00:23,855 --> 00:00:24,015
AI,
11
00:00:24,015 --> 00:00:26,414
I would say.
12
00:00:26,734 --> 00:00:30,789
But in any event, the purpose of this voice recording
13
00:00:30,789 --> 00:00:35,510
is actually to create a lengthy voice sample for a
14
00:00:35,510 --> 00:00:38,870
quick evaluation, a back of the envelope evaluation, as they might
15
00:00:38,870 --> 00:00:41,349
say, for different speech to text models.
16
00:00:41,349 --> 00:00:43,865
I'm doing this because I thought I'd made a great
17
00:00:43,865 --> 00:00:47,704
breakthrough in my journey with speech tech and that was
18
00:00:47,704 --> 00:00:51,305
succeeding in the elusive task of fine tuning Whisper.
19
00:00:51,624 --> 00:00:56,344
Whisper is, and I'm going to just talk, I'm trying to
20
00:00:55,749 --> 00:00:56,709
mix it up.
21
00:00:56,789 --> 00:01:00,310
I'm going to try a few different styles of speaking
22
00:01:00,310 --> 00:01:02,789
whisper something at some points as well.
23
00:01:03,270 --> 00:01:06,710
And I'll go back to speaking loud in different
24
00:01:06,710 --> 00:01:08,950
parts. I'm going to sound really like a crazy person
25
00:01:08,950 --> 00:01:12,344
because I'm also going to try to speak at different
26
00:01:12,904 --> 00:01:17,945
pitches and cadences in order to really try to push
27
00:01:18,264 --> 00:01:21,065
a speech to text model through its paces, which is
28
00:01:21,065 --> 00:01:24,529
trying to make sense of is this guy just rambling
29
00:01:24,529 --> 00:01:29,969
on incoherently in one long sentence or are these just
30
00:01:29,969 --> 00:01:36,370
actually a series of step alone, standalone, standalone sentences?
31
00:01:36,370 --> 00:01:38,050
And how is it going to handle step alone?
32
00:01:38,050 --> 00:01:38,690
That's not a word.
33
00:01:39,624 --> 00:01:41,945
What happens when you use speech to text and you
34
00:01:41,945 --> 00:01:43,304
use a fake word?
35
00:01:43,304 --> 00:01:45,704
And then you're like, wait, that's not actually that word
36
00:01:45,704 --> 00:01:46,585
doesn't exist.
37
00:01:46,904 --> 00:01:48,504
How does AI handle that?
38
00:01:48,504 --> 00:01:53,670
And these and more are all the questions that I'm
39
00:01:53,670 --> 00:01:55,670
seeking to answer in this training data.
40
00:01:55,749 --> 00:01:58,469
Now, why was I trying to fine tune Whisper?
41
00:01:58,469 --> 00:01:59,670
And what is Whisper?
42
00:01:59,670 --> 00:02:02,630
As I said, I'm going to try to record this
43
00:02:02,630 --> 00:02:06,564
at a couple of different levels of technicality for folks
44
00:02:06,564 --> 00:02:11,684
who are in the normal world and not totally stuck
45
00:02:11,684 --> 00:02:13,684
down the rabbit hole of AI, which you have to
46
00:02:13,684 --> 00:02:17,605
say is a really wonderful rabbit hole to be down.
47
00:02:17,764 --> 00:02:20,839
It's a really interesting area and speech and voice tech
48
00:02:20,839 --> 00:02:24,279
is the aspect of it that I find actually
49
00:02:24,279 --> 00:02:27,159
most, I'm not sure I would say the most interesting,
50
00:02:27,159 --> 00:02:30,679
because there's just so much that is fascinating in AI.
51
00:02:31,320 --> 00:02:34,054
But the one that I find the most personally transformative
52
00:02:34,054 --> 00:02:38,454
in terms of the impact that it's had on my
53
00:02:38,454 --> 00:02:41,174
daily work life and productivity and how I sort of
54
00:02:41,174 --> 00:02:41,815
work.
55
00:02:42,855 --> 00:02:47,420
I'm persevering hard with the task of trying to get
56
00:02:47,420 --> 00:02:50,859
a good solution working for Linux, which if anyone actually
57
00:02:50,859 --> 00:02:52,859
does listen to this, not just for the training data
58
00:02:52,859 --> 00:02:56,620
but for the actual content, is what sparked this.
59
00:02:56,620 --> 00:02:59,900
I had, besides the fine tune not working, well that
60
00:02:59,900 --> 00:03:01,305
was the failure.
61
00:03:02,424 --> 00:03:06,665
I used Claude Code because one thinks these days that
62
00:03:06,665 --> 00:03:13,200
there is nothing short of solving, you know, the
63
00:03:13,200 --> 00:03:17,519
reason of life or something that Claude and agentic AI
64
00:03:17,519 --> 00:03:19,600
can't do, which is not really the case.
65
00:03:19,600 --> 00:03:23,119
It does seem that way sometimes, but it fails a
66
00:03:23,119 --> 00:03:23,679
lot as well.
67
00:03:23,679 --> 00:03:26,559
And this is one of those instances where last week
68
00:03:26,559 --> 00:03:30,744
I put together an hour of voice training data, basically
69
00:03:30,744 --> 00:03:33,385
speaking just random things for three minutes.
70
00:03:35,385 --> 00:03:38,024
It was actually kind of tedious because the texts were
71
00:03:38,024 --> 00:03:38,584
really weird.
72
00:03:38,584 --> 00:03:41,290
Some of them were, it was like it was AI
73
00:03:41,290 --> 00:03:42,170
generated.
74
00:03:42,489 --> 00:03:44,809
I tried before to read Sherlock Holmes for an hour
75
00:03:44,809 --> 00:03:47,609
and I just couldn't, I was so bored after ten
76
00:03:47,609 --> 00:03:50,489
minutes that I was like, okay, no, I'm just gonna
77
00:03:50,489 --> 00:03:51,850
have to find something else to read.
78
00:03:51,850 --> 00:03:58,204
So I used, I created with AI Studio, vibe coded, a
79
00:03:58,204 --> 00:04:03,084
synthetic text generator which actually I thought was probably a
80
00:04:03,084 --> 00:04:05,165
better way of doing it because it would give me
81
00:04:05,165 --> 00:04:08,989
more short samples with more varied content.
82
00:04:08,989 --> 00:04:11,630
So I was like, okay, give me a voice note
83
00:04:11,630 --> 00:04:14,829
like I'm recording an email, give me a short story
84
00:04:14,829 --> 00:04:18,109
to read, give me prose to read.
85
00:04:18,109 --> 00:04:20,554
So I came up with all these different things and
86
00:04:20,554 --> 00:04:22,634
I added a little timer to it so I could
87
00:04:22,634 --> 00:04:24,875
see how close I was to one hour.
88
00:04:25,835 --> 00:04:29,035
And I spent like an hour one afternoon or probably
89
00:04:29,035 --> 00:04:33,035
two hours by the time you do retakes and whatever
90
00:04:33,035 --> 00:04:36,089
because you want to. It gave me a source of
91
00:04:36,089 --> 00:04:39,929
truth which I'm not sure if that's the scientific way
92
00:04:39,929 --> 00:04:44,089
to approach this topic of gathering training data but I
93
00:04:44,089 --> 00:04:45,369
thought made sense.
94
00:04:46,410 --> 00:04:49,384
I have a lot of audio data from recording voice
95
00:04:49,384 --> 00:04:53,464
notes which I've also kind of used, been experimenting with
96
00:04:53,464 --> 00:04:54,984
using for a different purpose.
97
00:04:55,304 --> 00:04:58,665
Slightly different annotating task types.
98
00:04:58,665 --> 00:05:03,170
It's more a text classification experiment or Well, it's more
99
00:05:03,170 --> 00:05:03,730
than that actually.
100
00:05:03,730 --> 00:05:04,929
I'm working on a voice app.
101
00:05:04,929 --> 00:05:09,249
So it's a prototype, I guess, is really more accurate.
102
00:05:11,329 --> 00:05:13,889
But you can do that and you can work backwards.
103
00:05:13,889 --> 00:05:18,274
Listen back to a voice note and you painfully go
104
00:05:18,274 --> 00:05:21,394
through one of those transcribing, where you start and stop
105
00:05:21,394 --> 00:05:23,554
and scrub around it and you fix the errors, but
106
00:05:23,554 --> 00:05:25,795
it's really, really boring to do that.
107
00:05:26,035 --> 00:05:27,954
So I thought it would be less tedious in the
108
00:05:27,954 --> 00:05:31,634
long term if I just recorded the source of truth.
109
00:05:31,989 --> 00:05:34,309
So it gave me these three minute snippets.
110
00:05:34,309 --> 00:05:37,429
I recorded them and saved an MP3 and a TXT
111
00:05:37,670 --> 00:05:40,230
in the same folder and I created an hour of that
112
00:05:40,230 --> 00:05:40,869
data.
113
00:05:41,910 --> 00:05:44,790
So I was very hopeful, quietly, a little bit hopeful
114
00:05:44,790 --> 00:05:46,949
that I would be able, that I could actually fine
115
00:05:46,949 --> 00:05:47,670
tune Whisper.
116
00:05:48,285 --> 00:05:51,005
I want to fine tune Whisper because when I got
117
00:05:51,005 --> 00:05:54,924
into voice tech last November, my wife was in the
118
00:05:54,924 --> 00:05:57,165
US and I was alone at home.
119
00:05:57,244 --> 00:06:00,924
And when crazy people like me do really wild things
120
00:06:00,924 --> 00:06:03,900
like use voice to text technology.
121
00:06:03,900 --> 00:06:06,859
That was basically when I started doing it, I didn't
122
00:06:06,859 --> 00:06:09,500
feel like a crazy person speaking to myself.
123
00:06:09,900 --> 00:06:12,700
And my expectations weren't that high.
124
00:06:13,100 --> 00:06:17,605
I'd used speech tech now and again, tried it out.
125
00:06:17,605 --> 00:06:18,804
I was like, it'd be really cool if you could
126
00:06:18,804 --> 00:06:22,324
just like speak into your computer and whatever I tried
127
00:06:22,324 --> 00:06:25,845
out that had Linux support was just, it was not
128
00:06:25,845 --> 00:06:26,725
good basically.
129
00:06:27,285 --> 00:06:29,444
And this blew me away from the first go.
130
00:06:29,444 --> 00:06:32,259
I mean, it wasn't one hundred percent accurate out of
131
00:06:32,259 --> 00:06:34,420
the box and it took work, but it was good
132
00:06:34,420 --> 00:06:36,739
enough that there was a solid foundation and it kind
133
00:06:36,739 --> 00:06:41,059
of passed that pivot point that it's actually worth doing
134
00:06:41,059 --> 00:06:41,540
this.
135
00:06:41,859 --> 00:06:43,859
You know, there's a point where it's so like, the
136
00:06:43,859 --> 00:06:46,405
transcript is you don't have to get one hundred percent
137
00:06:46,405 --> 00:06:49,445
accuracy for it to be worth your time for speech
138
00:06:49,445 --> 00:06:51,845
to text to be a worthwhile addition to your productivity.
139
00:06:51,845 --> 00:06:53,605
But you do need to get above, let's say, I
140
00:06:53,605 --> 00:06:55,045
don't know, eighty five percent.
141
00:06:55,525 --> 00:06:58,725
If it's sixty percent or fifty percent, you inevitably say,
142
00:06:58,960 --> 00:07:00,239
Screw it, I'll just type it.
143
00:07:00,239 --> 00:07:03,600
Because you end up missing errors in the transcript and
144
00:07:03,600 --> 00:07:04,960
it becomes actually worse.
145
00:07:04,960 --> 00:07:06,640
You end up in a worse position than you started
146
00:07:06,640 --> 00:07:06,960
with it.
147
00:07:06,960 --> 00:07:08,160
That's been my experience.
148
00:07:08,480 --> 00:07:12,400
So I was like, Oh, this is actually really, really
149
00:07:12,400 --> 00:07:12,880
good now.
150
00:07:12,880 --> 00:07:13,600
How did that happen?
151
00:07:13,600 --> 00:07:17,915
And the answer is ASR, Whisper being open sourced and
152
00:07:18,634 --> 00:07:21,514
the transformer architecture, if you want to go back to
153
00:07:21,514 --> 00:07:26,314
the underpinnings, which really blows my mind and it's on
154
00:07:26,314 --> 00:07:29,750
my list to read through that paper.
155
00:07:30,309 --> 00:07:35,910
"Attention Is All You Need," as attentively as can be
156
00:07:35,910 --> 00:07:39,270
done with my limited brain because it's super super high
157
00:07:39,270 --> 00:07:42,965
level stuff, super advanced stuff, I mean.
158
00:07:43,205 --> 00:07:48,004
That I think of all the things that are fascinating
159
00:07:48,004 --> 00:07:52,484
about the sudden rise in AI and the dramatic capabilities,
160
00:07:53,259 --> 00:07:55,339
I find it fascinating that few people are like, hang
161
00:07:55,339 --> 00:07:58,220
on, you've got this thing that can speak to you
162
00:07:58,220 --> 00:07:59,980
like a chatbot, an LLM.
163
00:08:00,540 --> 00:08:02,780
And then you've got image generation.
164
00:08:02,780 --> 00:08:03,100
Okay.
165
00:08:03,100 --> 00:08:07,020
So firstly, those two things on the surface have nothing in
166
00:08:07,020 --> 00:08:07,339
common.
167
00:08:08,285 --> 00:08:11,964
So how did that just happen all at the same
168
00:08:11,964 --> 00:08:12,205
time?
169
00:08:12,205 --> 00:08:15,884
And then when you extend that further, you're like, Suno.
170
00:08:15,884 --> 00:08:19,405
You can sing a song and AI will come up
171
00:08:19,405 --> 00:08:21,085
with an instrumental.
172
00:08:21,405 --> 00:08:23,405
And then you've got Whisper and you're like, Wait a
173
00:08:23,405 --> 00:08:23,645
second.
174
00:08:24,020 --> 00:08:28,100
How did all this stuff If it's all AI, there
175
00:08:28,100 --> 00:08:29,460
has to be some commonality.
176
00:08:29,460 --> 00:08:35,059
Otherwise, these are totally different technologies on the surface of it.
177
00:08:35,140 --> 00:08:39,304
And the transformer architecture is, as far as I know,
178
00:08:39,304 --> 00:08:40,184
the answer.
179
00:08:40,184 --> 00:08:42,905
And I can't even say, can't even pretend that I
180
00:08:42,905 --> 00:08:47,304
really understand what the transformer architecture means in depth.
181
00:08:47,304 --> 00:08:49,785
But I have scanned this and as I said, I
182
00:08:49,785 --> 00:08:52,799
want to print it and really kind of think over
183
00:08:52,799 --> 00:08:54,080
it at some point.
184
00:08:54,799 --> 00:08:58,000
And I'll probably feel bad about myself, I think, because
185
00:08:58,000 --> 00:08:59,599
weren't those guys in their twenties?
186
00:09:00,240 --> 00:09:01,760
Like, that's crazy.
187
00:09:02,080 --> 00:09:06,080
I think I asked ChatGPT once who wrote that paper
188
00:09:06,465 --> 00:09:09,184
and how old were they when it was published in
189
00:09:09,184 --> 00:09:09,745
arXiv?
190
00:09:09,745 --> 00:09:13,025
And I was expecting like, I don't know, what do
191
00:09:13,025 --> 00:09:13,505
you imagine?
192
00:09:13,505 --> 00:09:15,585
I personally imagine kind of like, you know, you have these
193
00:09:15,585 --> 00:09:19,665
breakthroughs during COVID and things like that, where like these
194
00:09:19,665 --> 00:09:22,549
kind of really obscure scientists who are in their 50s
195
00:09:22,549 --> 00:09:26,790
and they've just kind of been laboring in labs and
196
00:09:26,790 --> 00:09:29,750
wearily writing and publishing in kind of obscure academic
197
00:09:29,750 --> 00:09:30,630
publications.
198
00:09:30,790 --> 00:09:33,589
And they finally hit it big or win a Nobel
199
00:09:33,589 --> 00:09:36,155
Prize and then they're household names.
200
00:09:36,554 --> 00:09:38,554
So that was kind of what I had in mind.
201
00:09:38,554 --> 00:09:42,074
That was the mental image I'd formed of the birth
202
00:09:42,074 --> 00:09:42,875
of arXiv.
203
00:09:42,875 --> 00:09:45,515
Like I wasn't expecting twenty somethings in San Francisco.
204
00:09:45,515 --> 00:09:48,714
I thought that was both very funny, very cool, and
205
00:09:48,714 --> 00:09:49,995
actually kind of inspiring.
206
00:09:50,474 --> 00:09:55,150
It's nice to think that people who, you might
207
00:09:55,150 --> 00:09:58,429
put them in the kind of milieu or bubble or
208
00:09:58,429 --> 00:10:02,589
world that you are in, or credibly in through a series
209
00:10:02,589 --> 00:10:05,755
of connections, are coming up with such literally world
210
00:10:05,755 --> 00:10:07,755
changing innovations.
211
00:10:07,834 --> 00:10:11,194
So that was, I thought, anyway, that was
212
00:10:11,194 --> 00:10:11,755
cool.
213
00:10:12,155 --> 00:10:12,474
Okay.
214
00:10:12,474 --> 00:10:13,354
Voice training data.
215
00:10:13,354 --> 00:10:14,074
How are we doing?
216
00:10:14,074 --> 00:10:17,275
We're about ten minutes in, and I'm still talking about voice
217
00:10:17,275 --> 00:10:18,155
technology.
218
00:10:18,554 --> 00:10:22,099
So Whisper was brilliant, and I was so excited that
219
00:10:22,099 --> 00:10:25,780
my first instinct was to, I guess, like, Oh my gosh,
220
00:10:25,780 --> 00:10:27,939
I have to get a really good microphone for this.
221
00:10:28,099 --> 00:10:31,299
So I didn't go on a spending spree because I
222
00:10:31,299 --> 00:10:33,219
said, I'm gonna have to just wait a month and
223
00:10:33,219 --> 00:10:34,660
see if I still use this.
224
00:10:35,140 --> 00:10:38,795
And it just kind of became it's become really part
225
00:10:38,795 --> 00:10:40,875
of my daily routine.
226
00:10:41,674 --> 00:10:44,235
Like if I'm writing an email, I'll record a voice
227
00:10:44,235 --> 00:10:47,515
note and then I've developed and it's nice to see
228
00:10:47,515 --> 00:10:50,679
that everyone is like developing the same things in parallel.
229
00:10:50,679 --> 00:10:53,319
That's kind of a weird thing to say, when I
230
00:10:53,319 --> 00:11:00,199
started working on these prototypes on GitHub, which is where
231
00:11:00,199 --> 00:11:03,959
I just kind of share very freely and loosely ideas
232
00:11:03,959 --> 00:11:06,865
and first iterations on concepts.
233
00:11:08,944 --> 00:11:10,624
And for want of a better word, I called it
234
00:11:10,624 --> 00:11:14,865
like LLM post processing or clean up or basically a
235
00:11:14,865 --> 00:11:17,665
system prompt that after you get back the raw text
236
00:11:17,665 --> 00:11:21,540
from Whisper, you run it through a model and say,
237
00:11:21,540 --> 00:11:26,259
okay, this is crappy text like add sentence structure and,
238
00:11:26,259 --> 00:11:27,379
you know, fix it up.
239
00:11:27,780 --> 00:11:32,499
And now when I'm exploring the different tools that are
240
00:11:32,499 --> 00:11:35,554
out there that people have built, I see quite a
241
00:11:35,554 --> 00:11:39,395
number of projects have basically done the same thing.
242
00:11:40,674 --> 00:11:43,155
Lest that be misconstrued, I'm not saying for a millisecond
243
00:11:43,155 --> 00:11:44,515
that I inspired them.
244
00:11:44,515 --> 00:11:47,954
I'm sure this has been a thing that's been integrated
245
00:11:47,954 --> 00:11:51,210
into tools for a while, but it's the kind of
246
00:11:51,210 --> 00:11:53,610
thing that when you start using these tools every day,
247
00:11:53,610 --> 00:11:57,530
the need for it is almost instantly apparent because text
248
00:11:57,530 --> 00:12:01,449
that doesn't have any punctuation or paragraph spacing takes a
249
00:12:01,449 --> 00:12:03,885
long time to, you know, it takes so long to
250
00:12:03,885 --> 00:12:08,924
get it into a presentable email that again, moves speech
251
00:12:08,924 --> 00:12:13,005
tech back to before that inflection point where you're like,
252
00:12:13,005 --> 00:12:13,885
nah, it's just not worth it.
253
00:12:13,885 --> 00:12:16,844
It's like, it'll just be quicker to type this.
254
00:12:17,199 --> 00:12:19,760
So it's a big, it's a little touch that actually
255
00:12:20,000 --> 00:12:21,120
is a big deal.
256
00:12:21,439 --> 00:12:25,360
So I was on Whisper and I've been using Whisper
257
00:12:25,360 --> 00:12:27,679
and I kind of early on found a couple of
258
00:12:27,679 --> 00:12:28,319
tools.
259
00:12:28,319 --> 00:12:30,559
I couldn't find what I was looking for on Linux,
260
00:12:30,559 --> 00:12:35,844
which is basically just something that'll run in the background.
261
00:12:35,844 --> 00:12:38,165
You'll give it an API key and it will just
262
00:12:38,165 --> 00:12:42,964
like transcribe with like a little key to start and
263
00:12:42,964 --> 00:12:43,765
stop the dictation.
264
00:12:45,000 --> 00:12:48,360
And the issue was, I discovered that like most people
265
00:12:48,360 --> 00:12:51,960
involved in creating these projects were very much focused on
266
00:12:51,960 --> 00:12:55,720
local models, running Whisper locally because you can.
267
00:12:56,199 --> 00:12:58,120
And I tried that a bunch of times and just
268
00:12:58,120 --> 00:13:00,974
never got results that were as good as the cloud.
269
00:13:01,375 --> 00:13:03,535
And when I began looking at the cost of the
270
00:13:03,535 --> 00:13:06,574
speech to text APIs and what I was spending, I
271
00:13:06,574 --> 00:13:09,775
just thought there is it's actually, in my opinion, just
272
00:13:09,775 --> 00:13:13,080
one of the better deals in API spending in the
273
00:13:13,080 --> 00:13:13,400
cloud.
274
00:13:13,400 --> 00:13:15,640
Like, it's just not that expensive for very, very good
275
00:13:15,640 --> 00:13:19,559
models that are much more, you know, you're gonna be
276
00:13:19,559 --> 00:13:22,679
able to run the full model, the latest model versus
277
00:13:22,679 --> 00:13:26,525
whatever you can run on your average GPU unless you
278
00:13:26,525 --> 00:13:28,765
want to buy a crazy GPU.
279
00:13:28,765 --> 00:13:29,964
It doesn't really make sense to me.
280
00:13:29,964 --> 00:13:33,084
Privacy is another concern that I know is kind of
281
00:13:33,084 --> 00:13:35,245
like a very much a separate thing that people just
282
00:13:35,245 --> 00:13:38,765
don't want their voice data and their voice leaving their
283
00:13:38,765 --> 00:13:42,380
local environment maybe for regulatory reasons as well.
284
00:13:42,620 --> 00:13:43,900
But I'm not in that.
285
00:13:44,140 --> 00:13:48,460
I don't really care about people listening to my grocery
286
00:13:48,460 --> 00:13:51,500
list, consisting of reminding myself that I need to buy
287
00:13:51,500 --> 00:13:54,699
more beer, Cheetos, and hummus, which is kind of the
288
00:13:55,254 --> 00:13:59,494
three staples of my diet during periods of poor nutrition.
289
00:13:59,814 --> 00:14:02,295
But the kind of stuff that I transcribe, it's just
290
00:14:02,295 --> 00:14:02,614
not.
291
00:14:02,614 --> 00:14:07,734
It's not a privacy thing I'm that sort of sensitive
292
00:14:07,734 --> 00:14:13,189
about and I don't do anything so sensitive or secure
293
00:14:13,189 --> 00:14:14,710
that requires air gapping.
294
00:14:15,590 --> 00:14:17,510
I looked at the pricing and especially the kind of
295
00:14:17,510 --> 00:14:18,870
older mini models.
296
00:14:19,510 --> 00:14:21,830
Some of them are very, very affordable and I did
297
00:14:21,830 --> 00:14:26,684
a calculation once with ChatGPT and I was like, okay,
298
00:14:26,684 --> 00:14:30,285
this is the API price for I can't remember whatever
299
00:14:30,285 --> 00:14:31,324
the model was.
300
00:14:31,724 --> 00:14:34,365
Let's say I just go at it like nonstop, which
301
00:14:34,365 --> 00:14:35,485
rarely happens.
302
00:14:35,564 --> 00:14:38,879
Probably, I would say on average I might dictate thirty
303
00:14:38,879 --> 00:14:41,679
to sixty minutes per day if I was probably summing
304
00:14:41,679 --> 00:14:47,920
up the emails, documents, outlines, which is a lot, but
305
00:14:47,920 --> 00:14:50,079
it's it's still a fairly modest amount.
306
00:14:50,079 --> 00:14:51,759
And I was like, well, some days I do go
307
00:14:51,759 --> 00:14:54,854
on like one or two days where I've been usually
308
00:14:54,854 --> 00:14:56,775
when I'm like kind of out of the house and
309
00:14:56,775 --> 00:15:00,455
just have something like I have nothing else to do.
310
00:15:00,455 --> 00:15:03,095
Like if I'm at a hospital, we have a newborn
311
00:15:03,495 --> 00:15:07,219
and you're waiting for like eight hours and hours for
312
00:15:07,219 --> 00:15:08,020
an appointment.
313
00:15:08,099 --> 00:15:11,939
And I would probably have listened to podcasts before becoming
314
00:15:11,939 --> 00:15:12,900
a speech fanatic.
315
00:15:12,900 --> 00:15:15,299
And I'm like, Oh, wait, let me just get down.
316
00:15:15,299 --> 00:15:17,299
Let me just get these ideas out of my head.
317
00:15:17,460 --> 00:15:20,665
And that's when I'll go on my speech binges.
318
00:15:20,665 --> 00:15:22,584
But those are like once every few months, like not
319
00:15:22,584 --> 00:15:23,464
frequently.
320
00:15:23,704 --> 00:15:25,704
But I said, okay, let's just say if I'm going
321
00:15:25,704 --> 00:15:28,104
to price out cloud STT.
322
00:15:28,905 --> 00:15:33,420
If I was like dedicated every second of every waking
323
00:15:33,420 --> 00:15:37,740
hour to transcribing for some odd reason, I mean I'd
324
00:15:37,740 --> 00:15:39,740
have to eat and use the toilet.
325
00:15:40,460 --> 00:15:42,620
There's only so many hours I'm awake for.
326
00:15:42,620 --> 00:15:46,939
So let's just say a maximum of forty five minutes
327
00:15:47,125 --> 00:15:49,125
in the hour, then I said, All right, let's just
328
00:15:49,125 --> 00:15:50,085
say fifty.
329
00:15:50,564 --> 00:15:51,285
Who knows?
330
00:15:51,285 --> 00:15:52,724
You're dictating on the toilet.
331
00:15:52,724 --> 00:15:53,525
We do it.
332
00:15:53,844 --> 00:15:56,804
So you could just do sixty, but whatever I did
333
00:15:57,045 --> 00:16:01,099
and every day, like you're going flat out seven days
334
00:16:01,099 --> 00:16:02,540
a week dictating nonstop.
335
00:16:02,540 --> 00:16:05,499
I was like, What's my monthly API bill going to
336
00:16:05,499 --> 00:16:06,620
be at this price?
337
00:16:06,699 --> 00:16:09,259
And it came out to like seventy or eighty bucks.
338
00:16:09,259 --> 00:16:12,540
And I was like, Well, that would be an extraordinary
339
00:16:12,860 --> 00:16:14,299
amount of dictation.
340
00:16:14,299 --> 00:16:18,025
And I would hope that there was some compelling reason
341
00:16:18,665 --> 00:16:21,704
worth more than seventy dollars that I embarked upon that
342
00:16:21,704 --> 00:16:22,344
project.
343
00:16:22,584 --> 00:16:24,505
So given that that's kind of the max point for
344
00:16:24,505 --> 00:16:27,224
me I said that's actually very very affordable.
345
00:16:27,944 --> 00:16:30,424
Now you're gonna if you want to spec out the
346
00:16:30,424 --> 00:16:33,829
costs and you want to do the post processing that
347
00:16:33,829 --> 00:16:36,709
I really do feel is valuable, that's going to cost
348
00:16:36,709 --> 00:16:37,670
some more as well.
349
00:16:37,990 --> 00:16:43,189
Unless you're using Gemini, which, needless to say, as a
350
00:16:43,189 --> 00:16:45,110
random person sitting in Jerusalem,
351
00:16:45,775 --> 00:16:49,375
I have no affiliation, neither with Google nor Anthropic nor
352
00:16:49,375 --> 00:16:52,334
Gemini nor any major tech vendor for that matter.
353
00:16:53,775 --> 00:16:57,135
I like Gemini not so much as an everyday model.
354
00:16:57,375 --> 00:16:59,854
It's kind of underwhelmed in that respect, I would say.
355
00:17:00,299 --> 00:17:02,699
But for multimodal, I think it's got a lot to
356
00:17:02,699 --> 00:17:03,259
offer.
357
00:17:03,579 --> 00:17:07,099
And I think that the transcribing functionality whereby it can
358
00:17:07,979 --> 00:17:12,300
process audio with a system prompt and give you a
359
00:17:12,300 --> 00:17:13,820
transcription that's cleaned up.
360
00:17:13,820 --> 00:17:15,259
That reduces two steps to one.
361
00:17:15,755 --> 00:17:18,874
And that for me is a very, very big deal.
362
00:17:18,875 --> 00:17:22,394
And I feel like even Google hasn't really sort of
363
00:17:22,475 --> 00:17:27,115
thought through how useful that modality is and what
364
00:17:27,115 --> 00:17:29,620
kind of use cases you can achieve with it.
365
00:17:29,620 --> 00:17:32,259
Because I found in the course of this year just
366
00:17:32,259 --> 00:17:37,939
an endless list of really kind of system prompt stuff
367
00:17:37,939 --> 00:17:40,820
that I can say, okay, I've used it to capture
368
00:17:40,820 --> 00:17:44,035
context data for AI, which is literally I might speak
369
00:17:44,035 --> 00:17:46,675
for if I wanted to have a good bank of
370
00:17:46,675 --> 00:17:49,955
context data about, who knows, my childhood.
371
00:17:50,354 --> 00:17:54,275
More realistically, maybe my career goals, something that would just
372
00:17:54,275 --> 00:17:56,115
be like really boring to type out.
373
00:17:56,115 --> 00:18:00,420
So I'll just like sit in my car and record
374
00:18:00,420 --> 00:18:01,380
it for ten minutes.
375
00:18:01,380 --> 00:18:03,699
And in that ten minutes you get a lot of information
376
00:18:03,699 --> 00:18:04,339
in.
377
00:18:05,539 --> 00:18:07,620
Emails, which is short text.
378
00:18:08,580 --> 00:18:10,339
Just there is a whole bunch.
379
00:18:10,340 --> 00:18:13,295
And all these workflows kind of require a little bit
380
00:18:13,295 --> 00:18:15,054
of treatment afterwards and different treatment.
381
00:18:15,054 --> 00:18:18,334
My context pipeline is kind of like just extract the
382
00:18:18,334 --> 00:18:19,215
bare essentials.
383
00:18:19,215 --> 00:18:22,094
You end up with me talking very loosely about sort
384
00:18:22,094 --> 00:18:24,414
of what I've done in my career, where I've worked,
385
00:18:24,414 --> 00:18:25,374
where I might like to work.
386
00:18:25,920 --> 00:18:29,039
And it goes, it condenses that down to very robotic
387
00:18:29,039 --> 00:18:32,640
language that is easy to chunk, parse, and maybe put
388
00:18:32,640 --> 00:18:33,920
into a vector database.
389
00:18:33,920 --> 00:18:36,160
Daniel has worked in technology.
390
00:18:36,160 --> 00:18:39,760
Daniel has been working in, you know, stuff like that.
391
00:18:39,760 --> 00:18:42,975
That's not how you would speak, but I figure it's
392
00:18:42,975 --> 00:18:46,414
probably easier to parse for, after all, robots.
393
00:18:46,735 --> 00:18:48,654
So we've almost got to twenty minutes and this is
394
00:18:48,654 --> 00:18:53,054
actually a success because I wasted twenty minutes of my
395
00:18:53,455 --> 00:18:57,120
of the evening speaking into a microphone and the
396
00:18:57,120 --> 00:19:01,039
levels were shot and it was clipping and I said I
397
00:19:01,039 --> 00:19:02,320
can't really do an evaluation.
398
00:19:02,320 --> 00:19:03,360
I have to be fair.
399
00:19:03,360 --> 00:19:06,320
I have to give the models a chance to do
400
00:19:06,320 --> 00:19:06,880
their thing.
401
00:19:07,425 --> 00:19:09,505
What am I hoping to achieve in this?
402
00:19:09,505 --> 00:19:11,584
Okay, my fine tune was a dud as mentioned.
403
00:19:11,665 --> 00:19:15,185
Deepgram STT, I'm really, really hopeful that this prototype will
404
00:19:15,185 --> 00:19:17,985
work and it's a build in public open source so
405
00:19:17,985 --> 00:19:20,304
anyone is welcome to use it if I make anything
406
00:19:20,304 --> 00:19:20,625
good.
407
00:19:21,560 --> 00:19:23,800
But that was really exciting for me last night when
408
00:19:23,800 --> 00:19:28,840
after hours of trying my own prototype, seeing someone just
409
00:19:28,840 --> 00:19:32,039
made something that works like that, you know, you're not gonna
410
00:19:32,039 --> 00:19:36,374
have to build a custom conda environment and image.
411
00:19:36,374 --> 00:19:39,974
I have an AMD GPU which makes things much more complicated.
412
00:19:40,214 --> 00:19:42,614
I didn't find it and I was about to give
413
00:19:42,614 --> 00:19:43,894
up and I said, All right, let me just give
414
00:19:43,894 --> 00:19:46,455
Deepgram's Linux thing a shot.
415
00:19:47,029 --> 00:19:49,589
And if this doesn't work, I'm just gonna go back
416
00:19:49,589 --> 00:19:51,349
to trying to vibe code something myself.
417
00:19:51,670 --> 00:19:55,509
And when I ran the script, I was using Claude
418
00:19:55,509 --> 00:19:59,029
Code to do the installation process, it ran the script
419
00:19:59,029 --> 00:20:01,189
and, oh my gosh, it works just like that.
420
00:20:01,824 --> 00:20:05,985
The tricky thing for all those who want to know
421
00:20:05,985 --> 00:20:11,425
all the nitty, ditty, nitty gritty details was that I
422
00:20:11,425 --> 00:20:14,624
don't think it was actually struggling with transcription, but pasting
423
00:20:14,705 --> 00:20:17,539
Wayland makes life very hard.
424
00:20:17,539 --> 00:20:19,140
And I think there was something not running at the
425
00:20:19,140 --> 00:20:19,699
right time.
426
00:20:19,699 --> 00:20:22,979
Anyway, Deepgram, I looked at how they actually handle that
427
00:20:22,979 --> 00:20:25,140
because it worked out of the box when other stuff
428
00:20:25,140 --> 00:20:25,779
didn't.
429
00:20:26,100 --> 00:20:28,900
And it was quite a clever little mechanism.
430
00:20:29,495 --> 00:20:32,135
And but more so than that, the accuracy was brilliant.
431
00:20:32,135 --> 00:20:33,574
Now what am I what am I doing here?
432
00:20:33,574 --> 00:20:37,175
This is gonna be a twenty minute audio sample.
433
00:20:38,375 --> 00:20:42,410
And I'm I think I've done one or two of
434
00:20:42,410 --> 00:20:47,130
these before, but I did it with short, snappy voice
435
00:20:47,130 --> 00:20:47,610
notes.
436
00:20:47,610 --> 00:20:49,370
This is kind of long form.
437
00:20:49,449 --> 00:20:51,929
This actually might be a better approximation for what's useful
438
00:20:51,929 --> 00:20:53,849
to me than voice memos.
439
00:20:53,849 --> 00:20:56,894
Like, I need to buy three liters of milk tomorrow
440
00:20:56,894 --> 00:21:00,175
and pita bread, which is probably how half my voice
441
00:21:00,175 --> 00:21:00,735
notes sound.
442
00:21:00,735 --> 00:21:04,094
Like if anyone were to find my phone they'd be
443
00:21:04,094 --> 00:21:05,934
like this is the most boring person in the world.
444
00:21:06,015 --> 00:21:10,050
Although actually there are some journaling thoughts as well, but
445
00:21:10,050 --> 00:21:11,810
it's a lot of content like that.
446
00:21:11,810 --> 00:21:14,610
And the probably for the evaluation, the most useful thing
447
00:21:14,610 --> 00:21:21,834
is slightly obscure tech, GitHub, Nucleano, Hugging Face, not so
448
00:21:21,834 --> 00:21:24,474
obscure that it's not gonna have a chance of knowing
449
00:21:24,474 --> 00:21:27,194
it, but hopefully sufficiently well known that the model should
450
00:21:27,194 --> 00:21:27,834
get it.
451
00:21:27,914 --> 00:21:29,995
I tried to do a little bit of speaking really
452
00:21:29,995 --> 00:21:32,394
fast and speaking very slowly.
453
00:21:32,394 --> 00:21:35,529
I would say in general, I've spoken, delivered this at a
454
00:21:35,529 --> 00:21:39,130
faster pace than I usually would owing to strong coffee
455
00:21:39,130 --> 00:21:40,570
flowing through my bloodstream.
456
00:21:41,130 --> 00:21:43,529
And the thing that I'm not gonna get in this
457
00:21:43,529 --> 00:21:46,090
benchmark is background noise, which in my first take that
458
00:21:46,090 --> 00:21:48,455
I had to get rid of, my wife came in
459
00:21:48,455 --> 00:21:51,495
with my son for a good night kiss.
460
00:21:51,574 --> 00:21:55,094
And that actually would have been super helpful to get
461
00:21:55,094 --> 00:21:57,814
in because it was non diarized or if we had
462
00:21:57,814 --> 00:21:58,695
diarization.
463
00:21:59,334 --> 00:22:01,414
A female, I could say, I want the male voice
464
00:22:01,414 --> 00:22:03,094
and that wasn't intended for transcription.
465
00:22:04,509 --> 00:22:06,269
And we're not going to get background noise like people
466
00:22:06,269 --> 00:22:08,989
honking their horns, which is something I've done in my
467
00:22:09,150 --> 00:22:11,870
main data set where I am trying to go back
468
00:22:11,870 --> 00:22:15,070
to some of my voice notes, annotate them and run
469
00:22:15,070 --> 00:22:15,709
a benchmark.
470
00:22:15,709 --> 00:22:18,265
But this is going to be just a pure quick
471
00:22:18,265 --> 00:22:19,064
test.
472
00:22:19,785 --> 00:22:24,025
And as someone I'm working on a voice note idea.
473
00:22:24,025 --> 00:22:28,185
That's my sort of end motivation besides thinking it's an
474
00:22:28,185 --> 00:22:31,785
absolutely outstanding technology that's coming to viability.
475
00:22:31,785 --> 00:22:34,400
And really, I know this sounds cheesy, can actually have
476
00:22:34,400 --> 00:22:36,479
a very transformative effect.
477
00:22:37,920 --> 00:22:43,120
Voice technology has been life changing for folks living with
478
00:22:43,999 --> 00:22:45,039
disabilities.
479
00:22:45,920 --> 00:22:48,545
And I think there's something really nice about the fact
480
00:22:48,545 --> 00:22:52,545
that it can also benefit folks who are able-bodied and
481
00:22:52,545 --> 00:22:57,904
we can all in different ways make this tech as
482
00:22:57,904 --> 00:23:00,705
useful as possible regardless of the exact way that we're
483
00:23:00,705 --> 00:23:01,025
using it.
484
00:23:02,199 --> 00:23:04,439
And I think there's something very powerful in that, and
485
00:23:04,439 --> 00:23:05,559
it can be very cool.
486
00:23:06,120 --> 00:23:07,559
I see huge potential.
487
00:23:07,559 --> 00:23:09,319
What excites me about voice tech?
488
00:23:09,719 --> 00:23:11,159
A lot of things actually.
489
00:23:12,120 --> 00:23:14,839
Firstly, the fact that it's cheap and accurate, as I
490
00:23:14,839 --> 00:23:17,785
mentioned at the very start of this, and it's getting
491
00:23:17,785 --> 00:23:20,104
better and better with stuff like accent handling.
492
00:23:20,745 --> 00:23:23,304
I'm not sure my fine tune will actually ever come
493
00:23:23,304 --> 00:23:25,225
to fruition in the sense that I'll use it day
494
00:23:25,225 --> 00:23:26,584
to day as I imagine.
495
00:23:26,664 --> 00:23:30,505
I get like superb, flawless word error rates because I'm
496
00:23:30,505 --> 00:23:34,949
just kind of skeptical about local speech to text, as
497
00:23:34,949 --> 00:23:35,670
I mentioned.
498
00:23:36,070 --> 00:23:39,830
And I think the pace of innovation and improvement in
499
00:23:39,830 --> 00:23:42,310
the models, the main reasons for fine tuning from what
500
00:23:42,310 --> 00:23:46,150
I've seen have been people who are something that really
501
00:23:46,150 --> 00:23:50,375
blows blows my mind about ASR is the idea that
502
00:23:50,375 --> 00:23:55,574
it's inherently alingual or multilingual, phonetic based.
503
00:23:56,295 --> 00:24:00,375
So as folks who use speak very obscure languages that
504
00:24:00,375 --> 00:24:03,094
there may be very there might be a paucity of
505
00:24:02,229 --> 00:24:05,030
training data or almost none at all, and therefore the
506
00:24:05,030 --> 00:24:06,790
accuracy is significantly reduced.
507
00:24:06,790 --> 00:24:11,350
Or folks in very critical environments, I know there are
508
00:24:11,510 --> 00:24:15,350
this is used extensively in medical transcription and dispatcher work
509
00:24:15,350 --> 00:24:19,064
as, you know, the call centers who send out ambulances
510
00:24:19,064 --> 00:24:19,864
etc.
511
00:24:20,265 --> 00:24:23,545
Where accuracy is absolutely paramount and in the case of
512
00:24:23,545 --> 00:24:27,545
doctors, radiologists, they might be using very specialized vocab all
513
00:24:27,545 --> 00:24:27,865
the time.
514
00:24:28,630 --> 00:24:30,229
So those are kind of the main two things, and
515
00:24:30,229 --> 00:24:32,150
I'm not sure that really just for trying to make
516
00:24:32,150 --> 00:24:36,390
it better on a few random tech words with my
517
00:24:36,390 --> 00:24:39,429
slightly I mean, I have an accent, but, like, not,
518
00:24:39,429 --> 00:24:42,469
you know, an accent that a few other million people
519
00:24:42,870 --> 00:24:43,910
have ish.
520
00:24:44,685 --> 00:24:47,965
I'm not sure that my little fine tune is gonna
521
00:24:47,965 --> 00:24:52,604
actually like, the bump in word error reduction, if I
522
00:24:52,604 --> 00:24:54,205
ever actually figure out how to do it and get
523
00:24:54,205 --> 00:24:56,365
it up to the cloud, by the time we've done
524
00:24:56,365 --> 00:24:59,959
that, I suspect that the next generation of ASR will
525
00:24:59,959 --> 00:25:01,719
just be so good that it will kind of be,
526
00:25:01,959 --> 00:25:03,959
well, that would have been cool if it worked out,
527
00:25:03,959 --> 00:25:05,479
but I'll just use this instead.
528
00:25:05,719 --> 00:25:10,679
So that's gonna be it for today's episode of voice
529
00:25:10,679 --> 00:25:11,640
training data.
530
00:25:11,880 --> 00:25:14,255
Single, long shot evaluation.
531
00:25:14,495 --> 00:25:15,694
Who am I gonna compare?
532
00:25:16,414 --> 00:25:18,574
Whisper is always good as a benchmark, but I'm more
533
00:25:18,574 --> 00:25:22,175
interested in seeing Whisper head to head with two things
534
00:25:22,175 --> 00:25:22,894
really.
535
00:25:23,295 --> 00:25:25,134
One is Whisper variants.
536
00:25:25,134 --> 00:25:27,695
So you've got these projects like Faster Whisper.
537
00:25:29,110 --> 00:25:29,989
Distil-Whisper.
538
00:25:29,989 --> 00:25:30,709
It's a bit confusing.
539
00:25:30,709 --> 00:25:31,909
There's a whole bunch of them.
540
00:25:32,150 --> 00:25:35,110
And the emerging ASRs, which are also a thing.
541
00:25:35,269 --> 00:25:37,110
My intention for this is I'm not sure I'm gonna
542
00:25:37,110 --> 00:25:39,910
have the time at any point in the foreseeable future
543
00:25:39,910 --> 00:25:44,775
to go back to this whole episode and create a
544
00:25:44,775 --> 00:25:48,294
proper source of truth where I fix everything.
545
00:25:49,255 --> 00:25:51,894
Might do it if I can get one transcription that's
546
00:25:51,894 --> 00:25:54,134
sufficiently close to perfection.
547
00:25:54,934 --> 00:25:58,400
But what I would actually love to do on Hugging
548
00:25:58,400 --> 00:26:00,479
Face, I think would be a great probably how I
549
00:26:00,479 --> 00:26:04,400
might visualize this is having the audio waveform play and
550
00:26:04,400 --> 00:26:08,880
then have the transcript for each model below it and
551
00:26:08,880 --> 00:26:13,765
maybe even a, like, you know, to scale and maybe
552
00:26:13,765 --> 00:26:16,644
even a local one as well, like local Whisper versus
553
00:26:16,644 --> 00:26:19,684
OpenAI API, etcetera.
554
00:26:19,765 --> 00:26:23,124
And I can then actually listen back to segments or
555
00:26:23,124 --> 00:26:25,285
anyone who wants to can listen back to segments of
556
00:26:25,285 --> 00:26:30,219
this recording and see where a particular model struggled and
557
00:26:30,219 --> 00:26:33,099
others didn't as well as the sort of headline finding
558
00:26:33,099 --> 00:26:35,579
of which had the best WER but that
559
00:26:35,579 --> 00:26:37,659
would require the source of truth.
560
00:26:37,660 --> 00:26:38,459
Okay, that's it.
561
00:26:38,425 --> 00:26:40,985
I hope this was, I don't know, maybe useful for
562
00:26:40,985 --> 00:26:42,904
other folks interested in STT.
563
00:26:42,985 --> 00:26:45,945
You want to see I always think I've just said
564
00:26:45,945 --> 00:26:47,624
it as something I didn't intend to.
565
00:26:47,864 --> 00:26:49,624
STT, I said, for those
566
00:26:49,624 --> 00:26:53,049
listening carefully, including hopefully the models themselves.
567
00:26:53,289 --> 00:26:55,049
This has been myself, Daniel Rosehill.
568
00:26:55,049 --> 00:26:59,370
For more jumbled repositories about my roving interest in AI
569
00:26:59,370 --> 00:27:04,009
but particularly agentic, MCP and voice tech you can find me
570
00:27:04,009 --> 00:27:05,689
on GitHub.
571
00:27:05,929 --> 00:27:06,650
Hugging Face.
572
00:27:08,045 --> 00:27:08,924
Where else?
573
00:27:08,925 --> 00:27:11,725
danielrosehill dot com, which is my personal website, as well
574
00:27:11,725 --> 00:27:15,485
as this podcast whose name I sadly cannot remember.
575
00:27:15,644 --> 00:27:16,685
Until next time.
576
00:27:16,685 --> 00:27:17,324
Thanks for listening.