danielrosehill committed
Commit b5a4032 · 1 Parent(s): d32b5ac
data/audio/podcast.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0b200895f2a2ab1640d70eb1d7cc4aeaf8b7d853ba966814aa6acd1452d087a1
+ size 20171455
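Because podcast.mp3 is tracked with Git LFS, the blob checked into the repository is not the audio itself but a three-line pointer file: the spec version, the SHA-256 digest of the real object, and its size in bytes (20171455, roughly 20 MB). A minimal sketch of reading such a pointer follows; the helper name is illustrative, not part of this repository.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its space-separated key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# The exact pointer content committed above for data/audio/podcast.mp3
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:0b200895f2a2ab1640d70eb1d7cc4aeaf8b7d853ba966814aa6acd1452d087a1
size 20171455
"""

info = parse_lfs_pointer(pointer)
print(info["size"])  # size of the real MP3 blob in bytes
```

Verifying that the SHA-256 of the downloaded blob matches `info["oid"]` is how LFS clients detect corruption.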
data/ground-truth/truth_1.srt ADDED
@@ -0,0 +1,1032 @@
+ 1
+ 00:00:00,000 --> 00:00:08,640
+ Hello and welcome to a audio dataset consisting of one single episode of a non-existent podcast.
+
+ 2
+ 00:00:08,640 --> 00:00:19,120
+ Or, it eh, I may append this to a podcast that I set up recently regarding my with my thoughts on speech
+
+ 3
+ 00:00:19,120 --> 00:00:28,720
+ tech and AI in particular. More AI and generative AI I would, I would say. But in any event, the purpose of this
+
+ 4
+ 00:00:30,080 --> 00:00:37,120
+ voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the
+
+ 5
+ 00:00:37,120 --> 00:00:42,320
+ envelope evaluation as they might say for different speech to text models. And I'm doing this because I
+
+ 6
+ 00:00:42,800 --> 00:00:48,560
+ I thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in
+
+ 7
+ 00:00:48,560 --> 00:00:55,120
+ the elusive task of fine-tuning Whisper. Whisper is, and I'm going to just talk, I'm trying to
+
+ 8
+ 00:00:55,760 --> 00:01:01,600
+ mix up, I'm going to try a few different styles of speaking. I might whisper something at some
+
+ 9
+ 00:01:01,600 --> 00:01:07,760
+ points as well. And I'll go back to speaking loud in different parts. I'm going to sound really
+
+ 10
+ 00:01:07,760 --> 00:01:15,200
+ like a crazy person because I'm also going to try to speak at different pitches and cadences
+
+ 11
+ 00:01:15,200 --> 00:01:21,600
+ in order to really try to put a speech to text model through its paces, which is trying to make
+
+ 12
+ 00:01:21,600 --> 00:01:30,320
+ sense of "is this guy just rambling on incoherently in one long sentence?" Or "are these just actually
+
+ 13
+ 00:01:30,320 --> 00:01:38,320
+ a series of step standalone stepalone standalone sentences?" And how is it going to handle stepalone?! That's not a
+
+ 14
+ 00:01:38,320 --> 00:01:43,919
+ word! What happens when you use speech to text and you use a fake word and then you're like, wait,
+
+ 15
+ 00:01:43,919 --> 00:01:51,520
+ that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the
+
+ 16
+ 00:01:52,880 --> 00:01:57,359
+ questions that I'm seeking to answer in this training data. Now, why did why was I trying to
+
+ 17
+ 00:01:57,360 --> 00:02:01,040
+ fine tune whisper? And what is Whisper? As I said, I'm going to try to
+
+ 18
+ 00:02:02,080 --> 00:02:04,240
+ record this at a couple of different levels of
+
+ 19
+ 00:02:04,880 --> 00:02:10,320
+ technicality - for folks who are in the normal world and not totally
+
+ 20
+ 00:02:11,360 --> 00:02:16,079
+ stuck down the rabbit hole of AI. Which I have to say is a really wonderful rabbit hole to be
+
+ 21
+ 00:02:16,720 --> 00:02:23,440
+ to be down. It's a really interesting area. And speech and voice tech is the aspect of it that
+
+ 22
+ 00:02:23,440 --> 00:02:28,880
+ I find actually most - I'm not sure I would say the most interesting because there's just so much
+
+ 23
+ 00:02:28,880 --> 00:02:34,560
+ that is fascinating in AI. But the most that I find the most personally transformative in terms of
+
+ 24
+ 00:02:34,560 --> 00:02:42,240
+ the impact that it's had on my daily work life and productivity and how I sort of work. And
+
+ 25
+ 00:02:42,960 --> 00:02:49,920
+ I am persevering hard with the task of trying to get a good solution working for Linux.
+
+ 26
+ 00:02:49,920 --> 00:02:53,440
+ Which if anyone actually does listen to this not just for the training data and for the
+
+ 27
+ 00:02:53,440 --> 00:03:00,399
+ actual content, this has sparked. I had, besides the fine tune not working, well that was
+
+ 28
+ 00:03:00,399 --> 00:03:07,679
+ the failure. I used Claude Code. Because one thinks these days that there is nothing
+
+ 29
+ 00:03:08,560 --> 00:03:16,799
+ short of solving, you know, the reason of life or something that Claude and
+
+ 30
+ 00:03:16,800 --> 00:03:22,720
+ agentic AI can't do. Which is not really the case. It does seem that way sometimes. But it
+
+ 31
+ 00:03:22,720 --> 00:03:28,080
+ fails a lot as well. And this is one of those instances where last week I put together an hour
+
+ 32
+ 00:03:28,080 --> 00:03:33,600
+ of voice training data: basically speaking just random things for three minutes. And
+
+ 33
+ 00:03:35,600 --> 00:03:40,160
+ it was actually kind of tedious because the texts were really weird. Some of them were it was like,
+
+ 34
+ 00:03:40,160 --> 00:03:45,440
+ it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't,
+
+ 35
+ 00:03:45,440 --> 00:03:51,120
+ I was so bored after 10 minutes that I was like, "okay, no, I'm just gonna have to find something
+
+ 36
+ 00:03:51,120 --> 00:03:59,920
+ else to read." So I used I created with AI Studio, vibe coded, a synthetic text generator,
+
+ 37
+ 00:04:00,800 --> 00:04:05,680
+ which actually I thought was probably a better way of doing it because it would give me more
+
+ 38
+ 00:04:05,680 --> 00:04:12,000
+ short samples with more varied content. So I was like, okay, give me a voice note. Like I'm
+
+ 39
+ 00:04:12,000 --> 00:04:18,800
+ recording an email. Give me a short story to read. Give me prose. So I came up with all
+
+ 40
+ 00:04:18,800 --> 00:04:24,240
+ these different things and I added a little timer to it so I could see how close I was to one
+
+ 41
+ 00:04:24,240 --> 00:04:32,480
+ hour. And I spent like an hour one afternoon or probably two hours by the time you do retakes
+
+ 42
+ 00:04:32,480 --> 00:04:39,120
+ and whatever because you want to. It gave me a source of truth which I'm not sure if that's the
+
+ 43
+ 00:04:39,120 --> 00:04:45,120
+ scientific way to approach this topic of gathering training data but I thought made sense.
+
+ 44
+ 00:04:46,560 --> 00:04:50,880
+ I have a lot of audio data from recording voice notes which I've also kind of used
+
+ 45
+ 00:04:52,000 --> 00:04:56,720
+ been experimenting with using for a different purpose. It's slightly different - annotating
+
+ 46
+ 00:04:57,840 --> 00:05:03,680
+ task types. It's more text classification experiment. Or well it's more than that actually
+
+ 47
+ 00:05:03,680 --> 00:05:08,880
+ I'm working on a voice app. So it's a prototype I guess is really more accurate.
+
+ 48
+ 00:05:11,280 --> 00:05:15,920
+ But you can do that and you can work backwards. You listen back to a voice note and you
+
+ 49
+ 00:05:17,520 --> 00:05:22,400
+ painfully go through one of those - transcribing where you start and stop and scrub around it and
+
+ 50
+ 00:05:22,400 --> 00:05:27,680
+ you fix the errors . But it's really really boring to do that. So I thought it would be less tedious
+
+ 51
+ 00:05:27,680 --> 00:05:34,240
+ in the long term if I just recorded the source of truth. So it gave me these three minute snippets.
+
+ 52
+ 00:05:34,240 --> 00:05:40,480
+ I recorded them and saved an MP3 and a TXT in the same folder and I created an hour of that data.
+
+ 53
+ 00:05:41,840 --> 00:05:47,280
+ So I was very hopeful - quitely, you know, a little bit hopeful - that I would be able that I could actually fine tune
+
+ 54
+ 00:05:47,280 --> 00:05:54,720
+ Whisper. I want to fine tune Whisper because when I got into voice tech last November my wife was in
+
+ 55
+ 00:05:54,720 --> 00:06:01,920
+ the US and I was alone at home. And when crazy people like me do really wild things like use voice
+
+ 56
+ 00:06:01,920 --> 00:06:08,320
+ to tech technology that was basically when I started doing it. I didn't feel like a crazy person
+
+ 57
+ 00:06:08,320 --> 00:06:15,760
+ speaking to myself. And my expectations weren't that high. I used speech tech now and again
+
+ 58
+ 00:06:16,960 --> 00:06:21,200
+ tried it out. I was like "it'd be really cool if you could just like speak into your computer." And
+
+ 59
+ 00:06:21,280 --> 00:06:28,479
+ whatever I tried out that had Linux support was just - it was not good, basically. And this blew me away
+
+ 60
+ 00:06:28,479 --> 00:06:34,400
+ from the first go. I mean it wasn't 100% accurate out of the box. And it took work. But it was good
+
+ 61
+ 00:06:34,400 --> 00:06:40,320
+ enough that there was a solid foundation. And it kind of passed that pivot point that it's actually
+
+ 62
+ 00:06:40,320 --> 00:06:46,320
+ worth doing this. You know, there's a point where it's. So like the transcript is you don't have to get 100%
+
+ 63
+ 00:06:46,400 --> 00:06:51,040
+ accuracy for it to be worth your time for speech to text to be a worthwhile addition to your
+
+ 64
+ 00:06:51,040 --> 00:06:58,320
+ productivity. But you do need to get above let's say I don't know 85%. If it's 60% or 50% you inevitably
+
+ 65
+ 00:06:58,320 --> 00:07:03,920
+ say "screw it I'll just type it."Because you end up missing errors in the transcript and it becomes
+
+ 66
+ 00:07:03,920 --> 00:07:07,840
+ actually worse. You end up in a worse position than you started with it. That's been my experience.
+
+ 67
+ 00:07:08,400 --> 00:07:14,400
+ So I was like "oh, this is actually really really good. Now how did that happen?" The answer is
+
+ 68
+ 00:07:14,400 --> 00:07:21,599
+ ASR, Whisper being open-sourced. and the transformer architecture if you want to go back to the
+
+ 69
+ 00:07:23,200 --> 00:07:29,440
+ to the underpinnings. Which really blows my mind. And it's on my list to read through that paper
+
+ 70
+ 00:07:30,239 --> 00:07:38,400
+ 'All You Need Is Attention' as attentively as can be done with my limited brain. Because it's super
+
+ 71
+ 00:07:38,960 --> 00:07:45,679
+ high-level stuff - super advanced stuff I mean. But that I think of all the things that
+
+ 72
+ 00:07:47,280 --> 00:07:54,080
+ are fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating
+
+ 73
+ 00:07:54,080 --> 00:07:59,599
+ that few people are like "hang on, you've got this thing that can speak to you like a chatbot - an LLM.
+
+ 74
+ 00:08:00,640 --> 00:08:06,799
+ Then you've got image generation. Okay, so firstly those two things on the surface have nothing
+
+ 75
+ 00:08:06,800 --> 00:08:12,560
+ in common. So like, "how are they ... how did THAT just happen all at the same time?" And then when you
+
+ 76
+ 00:08:12,560 --> 00:08:19,920
+ extend that further you're like Suno right. You can sing a song and AI will like come up with
+
+ 77
+ 00:08:19,920 --> 00:08:25,200
+ an instrumental. And then you've got Whisper. And then you're like "wait a second how did all this stuff
+
+ 78
+ 00:08:25,200 --> 00:08:30,880
+ like if it's all AI what's like, there has to be some commonality. Otherwise these are four these are
+
+ 79
+ 00:08:31,600 --> 00:08:38,640
+ totally different technologies on the surface of it and the transformer architecture is as far as
+
+ 80
+ 00:08:38,640 --> 00:08:44,720
+ I know the answer. And I can't even say I can't even pretend that I really understand what the
+
+ 81
+ 00:08:44,720 --> 00:08:51,200
+ transformer architecture means in depth. But I have scanned it. And as I said I want to print it and
+
+ 82
+ 00:08:51,200 --> 00:08:57,760
+ really kind of think over it's at some point. And I'll probably feel bad about myself I think!
+
+ 83
+ 00:08:57,760 --> 00:09:03,280
+ Because weren't those guys in their in their 20s like? That's crazy! I think I asked ChatGPT
+
+ 84
+ 00:09:03,280 --> 00:09:09,439
+ once "who were the? Who wrote that paper and how old were they when it was published in Arxiv?"
+
+ 85
+ 00:09:09,439 --> 00:09:14,640
+ And I was expecting like, I don't know. What do you what do you imagine? I personally imagine kind of
+
+ 86
+ 00:09:14,640 --> 00:09:19,840
+ like you know you have these breakthroughs during COVID and things like that where like these kind
+
+ 87
+ 00:09:19,840 --> 00:09:24,480
+ of really obscure scientists who are like in their 50s and they've just kind of been laboring in
+
+ 88
+ 00:09:24,640 --> 00:09:31,120
+ labs and wearily writing and publishing in kind of obscure academic publications and they
+
+ 89
+ 00:09:31,120 --> 00:09:37,200
+ finally like hit a big or win a Nobel Prize. And then they're household household names. So I that
+
+ 90
+ 00:09:37,200 --> 00:09:42,680
+ was kind of what I had in mind. That was the mental image I'd formed of the birth of Arxiv.
+
+ 91
+ 00:09:42,680 --> 00:09:47,760
+ Like, I wasn't expecting 20-somethings in San Francisco! Though I thought that was both very very
+
+ 92
+ 00:09:47,760 --> 00:09:54,160
+ funny, very cool, and actually kind of inspiring. It's nice to think that people who you know just
+
+ 93
+ 00:09:54,160 --> 00:10:01,439
+ you might put them in the kind of milieu or bubble or world that you are in or credibly in through
+
+ 94
+ 00:10:01,439 --> 00:10:06,079
+ you know the series of connections that are coming up with such literally world changing
+
+ 95
+ 00:10:06,880 --> 00:10:13,439
+ innovations. So that was I thought anyway that that was cool. Okay voice training data. How
+
+ 96
+ 00:10:13,439 --> 00:10:19,280
+ are we doing? We're about 10 minutes. And I'm still talking about voice technology! So Whisper was
+
+ 97
+ 00:10:19,280 --> 00:10:25,680
+ brilliant. And I was so excited that I was my first instinct was to like guess it's like "oh my gosh
+
+ 98
+ 00:10:25,680 --> 00:10:31,040
+ I have to get like a really good microphone for this." So I didn't go on a spending spree because
+
+ 99
+ 00:10:31,040 --> 00:10:37,760
+ I said I'm gonna have to just wait a month and see if I still use this." And it just kind of became
+
+ 100
+ 00:10:37,760 --> 00:10:44,800
+ it's become really part of my daily routine. Like, if I'm writing an email I'll record a voice note
+
+ 101
+ 00:10:44,880 --> 00:10:50,079
+ and then I'll develop it and it's nice to see that everyone is like developing the same things in
+
+ 102
+ 00:10:50,079 --> 00:10:56,319
+ parallel. Like, that's maybe kind of a weird thing to say. But when I look, I kind of came when I started
+
+ 103
+ 00:10:56,319 --> 00:11:02,640
+ working on this these prototypes on GitHub, which is where I just kind of share very freely and loosely
+
+ 104
+ 00:11:03,199 --> 00:11:10,800
+ ideas and you know first iterations on concepts. And for want of a better word I called it like
+
+ 105
+ 00:11:11,439 --> 00:11:17,680
+ "LLM post processing." Or cleanup. Or basically a system prompt that after you get back the raw text
+
+ 106
+ 00:11:17,680 --> 00:11:25,920
+ from Whisper, you run it through model and say "okay this is crappy text like add sentence structure
+
+ 107
+ 00:11:25,920 --> 00:11:33,199
+ and you know fix it up. " And now when I'm exploring the different tools that are out there that people
+
+ 108
+ 00:11:33,200 --> 00:11:39,040
+ have built, I see quite a number of projects have basically you know done the same thing.
+
+ 109
+ 00:11:40,640 --> 00:11:45,040
+ Lest that be misconstrued, I'm not saying for a millisecond that I inspired them. I'm sure this
+
+ 110
+ 00:11:45,040 --> 00:11:51,440
+ has been a thing that's been integrated into tools for a while. But it's, it's the kind of thing that
+
+ 111
+ 00:11:51,440 --> 00:11:57,520
+ when you start using these tools every day the need for it is almost instantly apparent. Because text
+
+ 112
+ 00:11:57,600 --> 00:12:03,520
+ that doesn't have any punctuation or paragraph spacing takes a long time to you know, it takes so
+
+ 113
+ 00:12:03,520 --> 00:12:10,079
+ long to get it into a presentable email that,again, it moves speech tech into that,
+
+ 114
+ 00:12:11,280 --> 00:12:16,000
+ before that inflection point where you're like "nah it's just not worth." It it's like it'll just be
+
+ 115
+ 00:12:16,000 --> 00:12:20,800
+ quicker to type this. So it's it's a big - it's a little touch that actually is a big deal
+
+ 116
+ 00:12:21,520 --> 00:12:28,319
+ So I was on Whisper and I've been using Whisper and I kind of early on find a couple of tools.
+
+ 117
+ 00:12:28,319 --> 00:12:33,680
+ I couldn't find what I was looking for on Linux which is basically just something that'll run
+
+ 118
+ 00:12:34,800 --> 00:12:39,120
+ in the background. You'll give it an API key and it'll just like transcribe.
+
+ 119
+ 00:12:41,439 --> 00:12:47,359
+ With like a little key to start and stop the dictation. And the issues wer I discovered that
+
+ 120
+ 00:12:47,440 --> 00:12:52,720
+ like most people involved in creating these projects were very much focused on local models.
+
+ 121
+ 00:12:52,720 --> 00:12:58,400
+ And running Whisper locally because you can. And I tried that a bunch of times and just never
+
+ 122
+ 00:12:58,400 --> 00:13:03,920
+ got results that were as good as the cloud. And when I began looking at the cost of the speech to
+
+ 123
+ 00:13:03,920 --> 00:13:10,080
+ text APIs and what I was spending just thought there it's actually in my opinion just one of
+
+ 124
+ 00:13:10,080 --> 00:13:15,600
+ the better deals in API spending and in cloud. Like, it's just not that expensive for very, very good
+
+ 125
+ 00:13:15,600 --> 00:13:22,240
+ models that are much more. You know, you're going to be able to run the full model, the latest model
+
+ 126
+ 00:13:22,240 --> 00:13:28,960
+ versus whatever you can run on your average GPU. Unless you want to buy a crazy GPU. It doesn't
+
+ 127
+ 00:13:28,960 --> 00:13:34,000
+ really make sense to me. Now, I privacy is another concern that I know is kind of like a very much
+
+ 128
+ 00:13:34,000 --> 00:13:38,720
+ a separate thing. That people just don't want their voice data and their voice leaving their
+
+ 129
+ 00:13:38,720 --> 00:13:45,360
+ local environment. Maybe for regulatory reasons as well. But I'm not in that. I'm don't really really
+
+ 130
+ 00:13:45,360 --> 00:13:51,440
+ care about people listening to my grocery list consisting of reminding myself that I need to buy
+
+ 131
+ 00:13:51,440 --> 00:13:58,240
+ more beer, Cheetos and hummus. Which is kind of the three three staples of my diet during periods of
+
+ 132
+ 00:13:58,240 --> 00:14:04,560
+ poor nutrition. But the kind of stuff that I transcribe most it's just not it's not a it's not a
+
+ 133
+ 00:14:04,560 --> 00:14:12,640
+ privacy thing. I'm not that sort of sensitive about. And I don't do anything so you know sensitive
+
+ 134
+ 00:14:12,640 --> 00:14:17,680
+ or secure that requires airgapping. So I looked at the pricing and especially the kind of older
+
+ 135
+ 00:14:17,680 --> 00:14:24,400
+ models mini. Some of them are very very affordable. And I did a back of the, I did a calculation once
+
+ 136
+ 00:14:24,400 --> 00:14:30,239
+ with ChatGPT and I was like "okay, this is the, this is the API price for I can't remember whatever
+
+ 137
+ 00:14:30,320 --> 00:14:37,040
+ the model was. Let's say I just go at it like nonstop which rarely happens. Probably I would say an
+
+ 138
+ 00:14:37,040 --> 00:14:45,200
+ average I might dictate 30 to 60 minutes per day if I was probably summing up the emails, documents,
+
+ 139
+ 00:14:45,200 --> 00:14:51,360
+ outlines. Which is a lot. But it's it's still a fairly modest amount. And I was like well some days I
+
+ 140
+ 00:14:51,360 --> 00:14:56,720
+ do go on like one or two days where I've been usually when I'm like kind of out of the house and
+
+ 141
+ 00:14:56,720 --> 00:15:02,800
+ just have something like I've nothing else to do. Like if I'm at a hospital. We have a newborn.
+
+ 142
+ 00:15:04,000 --> 00:15:09,040
+ And you're waiting for like hours and hours for an appointment. And I would probably have
+
+ 143
+ 00:15:09,040 --> 00:15:15,280
+ listened to podcasts before becoming a speech fanatic. And I'm like "oh wait let me just get down
+
+ 144
+ 00:15:15,280 --> 00:15:20,880
+ let me just get these ideas out of my head." And that's when I'll go on my speech binges. But those
+
+ 145
+ 00:15:20,880 --> 00:15:26,240
+ are like once every few months - like not frequently. But I said okay let's just say if I'm gonna price
+
+ 146
+ 00:15:26,240 --> 00:15:35,440
+ out cloud STT. If I was like dedicated every second of every waking hour to transcribing for some
+
+ 147
+ 00:15:35,440 --> 00:15:41,600
+ odd reason. I mean, I'd have to like eat and use the toilet! Like, you know there's only so many hours
+
+ 148
+ 00:15:41,600 --> 00:15:48,480
+ I'm awake for. So like let's just say a maximum of like 40 hour 45 minutes in the hours and I said
+
+ 149
+ 00:15:48,480 --> 00:15:55,360
+ all right let's just say 50. Who knows? You're dictating on the toilet! We do it! So you could just do 60.
+
+ 150
+ 00:15:55,440 --> 00:16:02,560
+ But whatever I did - and every day. Like you're going flat out, seven days a week dictating nonstop
+
+ 151
+ 00:16:02,560 --> 00:16:08,640
+ as like "what's my monthly API bill gonna be at this price?" And it came out to like 70 or
+
+ 152
+ 00:16:08,640 --> 00:16:14,960
+ 80 bucks. And I was like, well that would be an extraordinary amount of dictation! And I would hope
+
+ 153
+ 00:16:15,600 --> 00:16:21,680
+ that there was some compelling reason more worth more than 70 dollars that I embarked upon that.
+
+ 154
+ 00:16:22,640 --> 00:16:26,959
+ So given that that's kind of the max point for me I said that's actually very very affordable.
+
+ 155
+ 00:16:27,920 --> 00:16:32,640
+ Now you're gonna if you want to spec out the costs and you want to do the post processing
+
+ 156
+ 00:16:33,599 --> 00:16:39,199
+ that I really do feel is valuable that's gonna cost more as well. Unless you're using
+
+ 157
+ 00:16:40,160 --> 00:16:47,839
+ Gemini which needless to say as a random person sitting in Jerusalem I have no affiliation nor with
+
+ 158
+ 00:16:47,840 --> 00:16:54,800
+ Google nor Anthropic nor Gemini nor any major tech vendor for that matter. Um I like Gemini
+
+ 159
+ 00:16:54,800 --> 00:17:00,080
+ not so much as a everyday model. Um it's kind of underwhelmed in that respect I would say.
+
+ 160
+ 00:17:00,080 --> 00:17:05,920
+ But for multimodal I think it's got a lot to offer. And I think that the transcribing functionality
+
+ 161
+ 00:17:05,920 --> 00:17:13,280
+ whereby it can um process audio with the system prompt and both give you a transcription that's
+
+ 162
+ 00:17:13,280 --> 00:17:20,079
+ cleaned up - that reduces two steps to one. And that for me is a very very big deal. And uh I feel like
+
+ 163
+ 00:17:20,079 --> 00:17:27,280
+ even Google hasn't really sort of thought through how useful the that modality is and what kind of
+
+ 164
+ 00:17:27,280 --> 00:17:33,280
+ use cases uh you can achieve with it. Because I found in the course of this year just an endless
+
+ 165
+ 00:17:33,280 --> 00:17:40,399
+ list of really kind of system prompt system prompt stuff that I can say "okay I've used it
+
+ 166
+ 00:17:40,560 --> 00:17:45,920
+ to capture context data for AI which is literally I might speak for if I wanted to have a good
+
+ 167
+ 00:17:45,920 --> 00:17:52,560
+ bank of context data about who knows my childhood uh more realistically maybe my career goals
+
+ 168
+ 00:17:53,520 --> 00:17:59,520
+ something that would just be like really boring to type out so I'll just like sit in my car
+
+ 169
+ 00:17:59,520 --> 00:18:06,640
+ and record it for 10 minutes. And that 10 minutes you get a lot of information in um emails which is
+
+ 170
+ 00:18:06,640 --> 00:18:13,200
+ short text uh just there is a whole bunch. And all these workflows kind of require a little bit
+
+ 171
+ 00:18:13,200 --> 00:18:18,320
+ of treatment afterwards and different treatment. My context pipeline is kind of like just extract the
+
+ 172
+ 00:18:18,320 --> 00:18:23,520
+ bare essential. So you end up with me talking very loosely about sort of what I've done in my career,
+
+ 173
+ 00:18:23,520 --> 00:18:30,000
+ where I've worked, where I might like to work. And it goes - it condenses that down to very robotic language
+
+ 174
+ 00:18:30,000 --> 00:18:36,000
+ that is easy to chunk, parse, and maybe put into a vector database. "Daniel has worked in technology!
+
+ 175
+ 00:18:36,080 --> 00:18:42,400
+ Daniel is a has been working in marketing." Stuff like that. That's not how you would speak um but I
+
+ 176
+ 00:18:42,400 --> 00:18:48,480
+ figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes. And this
+
+ 177
+ 00:18:48,480 --> 00:18:56,880
+ is actually a success because I wasted 20 minutes of my uh of the evening speaking into microphone and
+
+ 178
+ 00:18:56,880 --> 00:19:02,720
+ the levels were shot and it uh it was clipping. And I said I can't really do an evaluation. I have to
+
+ 179
+ 00:19:02,720 --> 00:19:09,440
+ be fair. I have to give the models a chance to do their thing. Uh what am I hoping to achieve in this?
+
+ 180
+ 00:19:09,440 --> 00:19:14,960
+ Okay my fine tune was a dud as mentioned. Deepgram STT - I'm really really hopeful that this prototype
+
+ 181
+ 00:19:14,960 --> 00:19:20,560
+ will work. And it's a build in public open source. So anyone is welcome to use it if I make anything good
+
+ 182
+ 00:19:21,600 --> 00:19:28,000
+ But what was really exciting for me last night when after hours of um trying my own prototypes, seeing
+
+ 183
+ 00:19:28,080 --> 00:19:33,120
+ someone just made something that works like that. You know, you're not going to have to build a custom
+
+ 184
+ 00:19:34,240 --> 00:19:40,960
+ Conda environment and image. I have AMD GPU which makes things much more complicated. I didn't find it
+
+ 185
+ 00:19:41,840 --> 00:19:46,400
+ And I was about to give up and I said "all right. Let me just give Deepgram's Linux thing a shot
+
+ 186
+ 00:19:47,040 --> 00:19:50,960
+ and if this doesn't work um I'm just going to go back to trying to vibe code something myself."
+
+ 187
+ 00:19:51,600 --> 00:19:57,360
+ And when I ran the script - I was using Claude Code to do the installation process -
+
+ 188
+ 00:19:58,160 --> 00:20:02,800
+ it ran the script and "oh my gosh, it works!" Just like that! Uh the tricky thing
+
+ 189
+ 00:20:04,480 --> 00:20:12,480
+ for all those ones who want to know all the nitty gritty details um was that I don't think it was actually
+
+ 190
+ 00:20:12,480 --> 00:20:18,160
+ struggling with transcription but pasting. Wayland makes life very hard. And I think there was
+
+ 191
+ 00:20:18,160 --> 00:20:22,800
+ something not running at the right time. Anyway, Deepgram - I looked at how they actually handled
+
+ 192
+ 00:20:22,960 --> 00:20:28,960
+ that because it worked out of the box when other stuff didn't. And it was quite a clever little mechanism
+
+ 193
+ 00:20:29,520 --> 00:20:34,560
+ and but more so than that the accuracy was brilliant. Now, what am I doing here? This is going to be a 20
+
+ 194
+ 00:20:34,560 --> 00:20:44,399
+ minute audio uh sample and I'm I think I've done one or two of these before but I did it with
+
+ 195
+ 00:20:45,360 --> 00:20:51,120
+ short, snappy voice notes. This is kind of long form. This actually might be a better approximation
+
+ 196
+ 00:20:51,120 --> 00:20:55,040
+ for what's useful to me than voice memos like "I need to buy three
+
+ 197
+ 00:20:55,840 --> 00:20:59,840
+ liters of milk tomorrow and pita bread." Which is probably how like half my voice note
+
+ 198
+ 00:20:59,840 --> 00:21:04,399
+ voice notes sound. Like if anyone were to I don't know like find my phone they'd be like "this is
+
+ 199
+ 00:21:04,399 --> 00:21:09,280
+ the most boring person in the world!" Although actually there are some like kind of uh journaling
+
+ 200
+ 00:21:09,280 --> 00:21:14,080
+ thoughts as well. But it's a lot of content like that. And the probably for the evaluation the most
+
+ 201
+ 00:21:14,080 --> 00:21:22,560
+ useful thing is slightly obscure tech: Github, Nuclino, Hugging Face. Not so obscure that it's not
+
+ 202
+ 00:21:22,560 --> 00:21:27,360
+ going to have a chance of knowing it. But hopefully sufficiently well known that the models should get
+
+ 203
+ 00:21:27,360 --> 00:21:32,800
+ it. Uh I tried to do a little bit of speaking really fast and speaking very slowly. I would say in
+
+ 204
+ 00:21:32,800 --> 00:21:38,960
+ general I've spoken delivered this at a faster pace than I usually would owing to strong coffee
+
+ 205
+ 00:21:39,120 --> 00:21:44,240
+ flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is
+
+ 206
+ 00:21:44,240 --> 00:21:49,920
+ background noise. Which in my first take that I had to get rid of my wife came in with my son
+
+ 207
+ 00:21:49,920 --> 00:21:55,680
827
+ for a good night kiss. And that actually would have been super helpful to get in because it was
828
+
829
+ 208
830
+ 00:21:56,400 --> 00:22:01,600
831
+ non-diarised. Or if we had diarisation a female I could say I want the male voice and that
832
+
833
+ 209
834
+ 00:22:01,600 --> 00:22:06,240
835
+ wasn't intended for transcription um. And we're not going to get background noise like people
836
+
837
+ 210
838
+ 00:22:06,240 --> 00:22:11,840
839
+ honking their horns. Which is something I've done in my main dataset where I am trying to go back
840
+
841
+ 211
842
+ 00:22:11,840 --> 00:22:16,880
843
+ to some of my voice notes, annotate them, and run a benchmark. But this is going to be just a pure
844
+
845
+ 212
846
+ 00:22:17,680 --> 00:22:24,960
847
+ quick test. And as someone I'm working on a voice note idea that's my sort of end
848
+
849
+ 213
850
+ 00:22:26,560 --> 00:22:30,320
851
+ motivation besides thinking it's an absolute outstanding technology that's coming to
852
+
853
+ 214
854
+ 00:22:30,960 --> 00:22:36,240
855
+ viability and really - I know this sounds cheesy - can actually have a very transformative effect.
856
+
857
+ 215
858
+ 00:22:37,120 --> 00:22:42,720
859
+ It's, you know, voice technology has been life changing for folks living with
860
+
861
+ 216
862
+ 00:22:44,000 --> 00:22:49,760
863
+ disabilities. And I think there's something really nice about the fact that it can also benefit
864
+
865
+ 217
866
+ 00:22:50,480 --> 00:22:54,639
867
+ you know folks who are able-bodied. And like we can all in different ways
868
+
869
+ 218
870
+ 00:22:55,120 --> 00:23:02,560
871
+ um make this tech as useful as possible regardless of the exact way that we're using it um. And I
872
+
873
+ 219
874
+ 00:23:02,560 --> 00:23:07,760
875
+ think there's something very powerful in that. And it can be very cool um I see huge potential. What
876
+
877
+ 220
878
+ 00:23:07,760 --> 00:23:14,480
879
+ excites me about voice tech - a lot of things actually. Firstly the fact that it's cheap and accurate
880
+
881
+ 221
882
+ 00:23:14,480 --> 00:23:19,040
883
+ as I mentioned at the very start of this um. And it's getting better and better with stuff like
884
+
885
+ 222
886
+ 00:23:19,040 --> 00:23:24,160
887
+ accent handling um. I'm not sure my my fine tune will actually ever come to fruition in the
888
+
889
+ 223
890
+ 00:23:24,160 --> 00:23:30,240
891
+ sense that I'll use it day to day as I imagine and get like superb flawless words error rates. Because
892
+
893
+ 224
894
+ 00:23:30,240 --> 00:23:37,680
895
+ I'm just kind of skeptical about local speech to tech as I mentioned. And I think the pace of
896
+
897
+ 225
898
+ 00:23:37,680 --> 00:23:42,720
899
+ innovation and improvement in the models. The main reasons for fine tuning from what I've seen
900
+
901
+ 226
902
+ 00:23:44,320 --> 00:23:50,480
903
+ have been people who are something that really blows blows my mind about ASR is the idea that it's
904
+
905
+ 227
906
+ 00:23:50,480 --> 00:24:00,080
907
+ inherently a-llingual. Or multilingual. Phonetic-based. So as folks who use speak very obscure languages
908
+
909
+ 228
910
+ 00:24:00,080 --> 00:24:04,800
911
+ that there may be there there might be a paucity of training data or almost none at all. And therefore
912
+
913
+ 229
914
+ 00:24:04,800 --> 00:24:11,440
915
+ the accuracy is significantly reduced. Or folks in very critical environments. I know there are
916
+
917
+ 230
918
+ 00:24:11,440 --> 00:24:17,680
919
+ this is used extensively in medical transcription and dispatcher work as um you know the call
920
+
921
+ 231
922
+ 00:24:17,680 --> 00:24:24,000
923
+ centers who send out ambulances etc where accuracy is absolutely paramount. And in the case of doctors,
924
+
925
+ 232
926
+ 00:24:24,560 --> 00:24:29,680
927
+ radiologists they might be using very specialized vocab all the time. So those are kind of the main
928
+
929
+ 233
930
+ 00:24:29,680 --> 00:24:35,680
931
+ two things. And I'm not sure that really just for trying to make it better on a few random tech words
932
+
933
+ 234
934
+ 00:24:35,680 --> 00:24:41,840
935
+ with my slightly. I mean, I have an accent! But like, not you know an accent that a few other million
936
+
937
+ 235
938
+ 00:24:41,840 --> 00:24:50,720
939
+ people have. Ish. I'm not sure that my little fine tune is going to actually like the bump in
940
+
941
+ 236
942
+ 00:24:50,720 --> 00:24:55,760
943
+ word error reduction if I ever actually figure out how to do it and get it up to the cloud. By the
944
+
945
+ 237
946
+ 00:24:55,760 --> 00:25:00,879
947
+ time we've done that I suspect that the next generation of ASR will just be so good that it will
948
+
949
+ 238
950
+ 00:25:00,879 --> 00:25:07,040
951
+ kind of be "nah, well, that would be cool if it worked out. But I'll just use this instead." So that's going to be
952
+
953
+ 239
954
+ 00:25:07,280 --> 00:25:15,040
955
+ it for today's episodes of voice training data single long shot evaluation. Who am I going to
956
+
957
+ 240
958
+ 00:25:15,040 --> 00:25:21,200
959
+ compare? Whisper is always good as a benchmark. But I'm more interested in seeing Whisper head-to-head
960
+
961
+ 241
962
+ 00:25:21,200 --> 00:25:27,680
963
+ with two things really. One is Whisper variants. So you've got these projects like Faster Whisper,
964
+
965
+ 242
966
+ 00:25:29,120 --> 00:25:34,000
967
+ Distilled Whisper. It's a bit confusing. There's a whole bunch of them. And the emerging ASRs which
968
+
969
+ 243
970
+ 00:25:34,160 --> 00:25:38,960
971
+ are also a thing. My intention for this is I'm not sure I'm going to have the time in any point
972
+
973
+ 244
974
+ 00:25:38,960 --> 00:25:46,320
975
+ of the foreseeable future to go back through this whole episode and create a proper source truth or I fix
976
+
977
+ 245
978
+ 00:25:47,520 --> 00:25:53,760
979
+ everything. I might do it if I can get one transcription that's sufficiently close to perfection.
980
+
981
+ 246
982
+ 00:25:54,960 --> 00:26:00,560
983
+ But what I would actually love to do on Hugging Face I think would be a great probably how I might
984
+
985
+ 247
986
+ 00:26:00,560 --> 00:26:08,080
987
+ visualize this is having the audio waveform play. And then have the transcript for each model below
988
+
989
+ 248
990
+ 00:26:08,080 --> 00:26:16,320
991
+ it. And maybe even a like you know to scale. And maybe even a local one as well like local Whisper
992
+
993
+ 249
994
+ 00:26:16,320 --> 00:26:23,919
995
+ versus Open AI API etc. And I can then actually listen back to segments. Or anyone who wants to
996
+
997
+ 250
998
+ 00:26:24,000 --> 00:26:30,000
999
+ can listen back to segments of this recording and see where a particular model struggled
1000
+
1001
+ 251
1002
+ 00:26:30,000 --> 00:26:35,600
1003
+ while others didn't, as well as the sort of headline finding of which had the best WER. But that would
1004
+
1005
+ 252
1006
+ 00:26:35,600 --> 00:26:41,120
1007
+ require the source of truth. Okay, that's it. Hope this was, I don't know, maybe useful for other
1008
+
1009
+ 253
1010
+ 00:26:41,120 --> 00:26:46,480
1011
+ folks interested in STT. You want to see - that I always feel think I've just said as something I
1012
+
1013
+ 254
1014
+ 00:26:46,480 --> 00:26:52,800
1015
+ didn't intend to. STT I said for those listening carefully! Including hopefully the models themselves!
1016
+
1017
+ 255
1018
+ 00:26:53,280 --> 00:26:58,960
1019
+ This has been myself Daniel Rosehill. For more um jumbled repositories about my uh roving interests
1020
+
1021
+ 256
1022
+ 00:26:58,960 --> 00:27:06,639
1023
+ in AI. But particularly agentic AI, MCP, and voice tech, you can find me on Github, Hugging Face.
1024
+
1025
+ 257
1026
+ 00:27:08,080 --> 00:27:14,000
1027
+ Where else? DanielRosehilll.com which is my personal website. As well as this podcast whose name
1028
+
1029
+ 258
1030
+ 00:27:14,000 --> 00:27:17,280
1031
+ I sadly cannot remember! Until next time, thanks for listening!
1032
+
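For anyone reusing `truth_1.srt` outside this repo: the cue layout in the diff above (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, caption text, and a blank-line separator) is plain SubRip. A minimal parsing sketch; the function names and regex here are illustrative, not anything shipped in this repository:

```python
import re

# Matches an SRT timing line, e.g. "00:19:14,960 --> 00:19:20,560"
TIME_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def to_seconds(h, m, s, ms):
    """Convert the four captured timestamp fields into seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(text):
    """Parse SRT text into a list of (start_sec, end_sec, caption) tuples."""
    cues = []
    # Cues are blank-line separated: index line, timing line, caption line(s)
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2:
            continue
        m = TIME_RE.search(lines[1])
        if not m:
            continue
        start = to_seconds(*m.groups()[:4])
        end = to_seconds(*m.groups()[4:])
        cues.append((start, end, " ".join(lines[2:])))
    return cues

sample = """182
00:19:21,600 --> 00:19:28,000
But what was really exciting for me last night

183
00:19:28,080 --> 00:19:33,120
someone just made something that works like that."""

print(parse_srt(sample)[0][2])  # → "But what was really exciting for me last night"
```

In practice a maintained library such as `srt` or `pysrt` would handle multi-line captions and malformed cues more robustly; this sketch is only meant to show the structure of the file above.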
data/ground-truth/truth_1.txt ADDED
@@ -0,0 +1 @@
+ Hello and welcome to a audio dataset consisting of one single episode of a non-existent podcast. Or, it eh, I may append this to a podcast that I set up recently regarding my with my thoughts on speech tech and AI in particular. More AI and generative AI I would, I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation as they might say, for different speech to text models. And I'm doing this because I I thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in the elusive task of fine-tuning Whisper. Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking. I might whisper something at some points as well. And I'll go back to speaking loud in different parts. I'm going to sound really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to put a speech to text model through its paces, which is trying to make sense of "is this guy just rambling on incoherently in one long sentence?" Or "are these just actually a series of step standalone stepalone standalone sentences?" And how is it going to handle stepalone?! That's not a word! What happens when you use speech to text and you use a fake word and then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why did why was I trying to fine tune whisper? And what is Whisper? As I said, I'm going to try to record this at a couple of different levels of technicality - for folks who are in the normal world and not totally stuck down the rabbit hole of AI. Which I have to say is a really wonderful rabbit hole to be to be down. It's a really interesting area. 
And speech and voice tech is the aspect of it that I find actually most - I'm not sure I would say the most interesting because there's just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I am persevering hard with the task of trying to get a good solution working for Linux. Which if anyone actually does listen to this not just for the training data and for the actual content, this has sparked. I had, besides the fine tune not working, well that was the failure. I used Claude Code. Because one thinks these days that there is nothing short of solving, you know, the reason of life or something that Claude and agentic AI can't do. Which is not really the case. It does seem that way sometimes. But it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data: basically speaking just random things for three minutes. And it was actually kind of tedious because the texts were really weird. Some of them were it was like, it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was like, "okay, no, I'm just gonna have to find something else to read." So I used I created with AI Studio, vibe coded, a synthetic text generator, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note. Like I'm recording an email. Give me a short story to read. Give me prose. So I came up with all these different things and I added a little timer to it so I could see how close I was to one hour. And I spent like an hour one afternoon or probably two hours by the time you do retakes and whatever because you want to. 
It gave me a source of truth which I'm not sure if that's the scientific way to approach this topic of gathering training data but I thought made sense. I have a lot of audio data from recording voice notes which I've also kind of used been experimenting with using for a different purpose. It's slightly different - annotating task types. It's more text classification experiment. Or well it's more than that actually I'm working on a voice app. So it's a prototype I guess is really more accurate. But you can do that and you can work backwards. You listen back to a voice note and you painfully go through one of those - transcribing where you start and stop and scrub around it and you fix the errors . But it's really really boring to do that. So I thought it would be less tedious in the long term if I just recorded the source of truth. So it gave me these three minute snippets. I recorded them and saved an MP3 and a TXT in the same folder and I created an hour of that data. So I was very hopeful - quitely, you know, a little bit hopeful - that I would be able that I could actually fine tune Whisper. I want to fine tune Whisper because when I got into voice tech last November my wife was in the US and I was alone at home. And when crazy people like me do really wild things like use voice to tech technology that was basically when I started doing it. I didn't feel like a crazy person speaking to myself. And my expectations weren't that high. I used speech tech now and again tried it out. I was like "it'd be really cool if you could just like speak into your computer." And whatever I tried out that had Linux support was just - it was not good, basically. And this blew me away from the first go. I mean it wasn't 100% accurate out of the box. And it took work. But it was good enough that there was a solid foundation. And it kind of passed that pivot point that it's actually worth doing this. You know, there's a point where it's. 
So like the transcript is you don't have to get 100% accuracy for it to be worth your time for speech to text to be a worthwhile addition to your productivity. But you do need to get above let's say I don't know 85%. If it's 60% or 50% you inevitably say "screw it I'll just type it."Because you end up missing errors in the transcript and it becomes actually worse. You end up in a worse position than you started with it. That's been my experience. So I was like "oh, this is actually really really good. Now how did that happen?" The answer is ASR, Whisper being open-sourced. and the transformer architecture if you want to go back to the to the underpinnings. Which really blows my mind. And it's on my list to read through that paper 'All You Need Is Attention' as attentively as can be done with my limited brain. Because it's super high-level stuff - super advanced stuff I mean. But that I think of all the things that are fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating that few people are like "hang on, you've got this thing that can speak to you like a chatbot - an LLM. Then you've got image generation. Okay, so firstly those two things on the surface have nothing in common. So like, "how are they ... how did THAT just happen all at the same time?" And then when you extend that further you're like Suno right. You can sing a song and AI will like come up with an instrumental. And then you've got Whisper. And then you're like "wait a second how did all this stuff like if it's all AI what's like, there has to be some commonality. Otherwise these are four these are totally different technologies on the surface of it and the transformer architecture is as far as I know the answer. And I can't even say I can't even pretend that I really understand what the transformer architecture means in depth. But I have scanned it. And as I said I want to print it and really kind of think over it's at some point. 
And I'll probably feel bad about myself I think! Because weren't those guys in their in their 20s like? That's crazy! I think I asked ChatGPT once "who were the? Who wrote that paper and how old were they when it was published in Arxiv?" And I was expecting like, I don't know. What do you what do you imagine? I personally imagine kind of like you know you have these breakthroughs during COVID and things like that where like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring in labs and wearily writing and publishing in kind of obscure academic publications and they finally like hit a big or win a Nobel Prize. And then they're household household names. So I that was kind of what I had in mind. That was the mental image I'd formed of the birth of Arxiv. Like, I wasn't expecting 20-somethings in San Francisco! Though I thought that was both very very funny, very cool, and actually kind of inspiring. It's nice to think that people who you know just you might put them in the kind of milieu or bubble or world that you are in or credibly in through you know the series of connections that are coming up with such literally world changing innovations. So that was I thought anyway that that was cool. Okay voice training data. How are we doing? We're about 10 minutes. And I'm still talking about voice technology! So Whisper was brilliant. And I was so excited that I was my first instinct was to like guess it's like "oh my gosh I have to get like a really good microphone for this." So I didn't go on a spending spree because I said I'm gonna have to just wait a month and see if I still use this." And it just kind of became it's become really part of my daily routine. Like, if I'm writing an email I'll record a voice note and then I'll develop it and it's nice to see that everyone is like developing the same things in parallel. Like, that's maybe kind of a weird thing to say. 
But when I look, I kind of came when I started working on this these prototypes on GitHub, which is where I just kind of share very freely and loosely ideas and you know first iterations on concepts. And for want of a better word I called it like "LLM post processing." Or cleanup. Or basically a system prompt that after you get back the raw text from Whisper, you run it through model and say "okay this is crappy text like add sentence structure and you know fix it up." And now when I'm exploring the different tools that are out there that people have built, I see quite a number of projects have basically you know done the same thing. Lest that be misconstrued, I'm not saying for a millisecond that I inspired them. I'm sure this has been a thing that's been integrated into tools for a while. But it's, it's the kind of thing that when you start using these tools every day the need for it is almost instantly apparent. Because text that doesn't have any punctuation or paragraph spacing takes a long time to you know, it takes so long to get it into a presentable email that, again, it moves speech tech into that, before that inflection point where you're like "nah it's just not worth." It it's like it'll just be quicker to type this. So it's it's a big - it's a little touch that actually is a big deal. So I was on Whisper and I've been using Whisper and I kind of early on find a couple of tools. I couldn't find what I was looking for on Linux which is basically just something that'll run in the background. You'll give it an API key and it'll just like transcribe. With like a little key to start and stop the dictation. And the issues were I discovered that like most people involved in creating these projects were very much focused on local models. And running Whisper locally because you can. And I tried that a bunch of times and just never got results that were as good as the cloud.
And when I began looking at the cost of the speech to text APIs and what I was spending just thought there it's actually in my opinion just one of the better deals in API spending and in cloud. Like, it's just not that expensive for very, very good models that are much more. You know, you're going to be able to run the full model, the latest model versus whatever you can run on your average GPU. Unless you want to buy a crazy GPU. It doesn't really make sense to me. Now, I privacy is another concern that I know is kind of like a very much a separate thing. That people just don't want their voice data and their voice leaving their local environment. Maybe for regulatory reasons as well. But I'm not in that. I'm don't really really care about people listening to my grocery list consisting of reminding myself that I need to buy more beer, Cheetos and hummus. Which is kind of the three three staples of my diet during periods of poor nutrition. But the kind of stuff that I transcribe most it's just not it's not a it's not a privacy thing. I'm not that sort of sensitive about. And I don't do anything so you know sensitive or secure that requires airgapping. So I looked at the pricing and especially the kind of older models mini. Some of them are very very affordable. And I did a back of the, I did a calculation once with ChatGPT and I was like "okay, this is the, this is the API price for I can't remember whatever the model was. Let's say I just go at it like nonstop which rarely happens. Probably I would say an average I might dictate 30 to 60 minutes per day if I was probably summing up the emails, documents, outlines. Which is a lot. But it's it's still a fairly modest amount. And I was like well some days I do go on like one or two days where I've been usually when I'm like kind of out of the house and just have something like I've nothing else to do. Like if I'm at a hospital. We have a newborn. And you're waiting for like hours and hours for an appointment. 
And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like "oh wait let me just get down let me just get these ideas out of my head." And that's when I'll go on my speech binges. But those are like once every few months - like not frequently. But I said okay let's just say if I'm gonna price out cloud STT. If I was like dedicated every second of every waking hour to transcribing for some odd reason. I mean, I'd have to like eat and use the toilet! Like, you know there's only so many hours I'm awake for. So like let's just say a maximum of like 40 hour 45 minutes in the hours and I said all right let's just say 50. Who knows? You're dictating on the toilet! We do it! So you could just do 60. But whatever I did - and every day. Like you're going flat out, seven days a week dictating nonstop as like "what's my monthly API bill gonna be at this price?" And it came out to like 70 or 80 bucks. And I was like, well that would be an extraordinary amount of dictation! And I would hope that there was some compelling reason more worth more than 70 dollars that I embarked upon that. So given that that's kind of the max point for me I said that's actually very very affordable. Now you're gonna if you want to spec out the costs and you want to do the post processing that I really do feel is valuable that's gonna cost more as well. Unless you're using Gemini which needless to say as a random person sitting in Jerusalem I have no affiliation nor with Google nor Anthropic nor Gemini nor any major tech vendor for that matter. Um I like Gemini not so much as a everyday model. Um it's kind of underwhelmed in that respect I would say. But for multimodal I think it's got a lot to offer. And I think that the transcribing functionality whereby it can um process audio with the system prompt and both give you a transcription that's cleaned up - that reduces two steps to one. And that for me is a very very big deal. 
And uh I feel like even Google hasn't really sort of thought through how useful the that modality is and what kind of use cases uh you can achieve with it. Because I found in the course of this year just an endless list of really kind of system prompt system prompt stuff that I can say "okay I've used it to capture context data for AI which is literally I might speak for if I wanted to have a good bank of context data about who knows my childhood uh more realistically maybe my career goals something that would just be like really boring to type out so I'll just like sit in my car and record it for 10 minutes. And that 10 minutes you get a lot of information in um emails which is short text uh just there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards and different treatment. My context pipeline is kind of like just extract the bare essential. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work. And it goes - it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database. "Daniel has worked in technology! Daniel is a has been working in marketing." Stuff like that. That's not how you would speak um but I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes. And this is actually a success because I wasted 20 minutes of my uh of the evening speaking into microphone and the levels were shot and it uh it was clipping. And I said I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. Uh what am I hoping to achieve in this? Okay my fine tune was a dud as mentioned. Deepgram STT - I'm really really hopeful that this prototype will work. And it's a build in public open source. 
So anyone is welcome to use it if I make anything good But what was really exciting for me last night when after hours of um trying my own prototypes, seeing someone just made something that works like that. You know, you're not going to have to build a custom Conda environment and image. I have AMD GPU which makes things much more complicated. I didn't find it And I was about to give up and I said "all right. Let me just give Deepgram's Linux thing a shot and if this doesn't work um I'm just going to go back to trying to vibe code something myself." And when I ran the script - I was using Claude Code to do the installation process - it ran the script and "oh my gosh, it works!" Just like that! Uh the tricky thing for all those ones who want to know all the nitty gritty details um was that I don't think it was actually struggling with transcription but pasting. Wayland makes life very hard. And I think there was something not running at the right time. Anyway, Deepgram - I looked at how they actually handled that because it worked out of the box when other stuff didn't. And it was quite a clever little mechanism and but more so than that the accuracy was brilliant. Now, what am I doing here? This is going to be a 20 minute audio uh sample and I'm I think I've done one or two of these before but I did it with short, snappy voice notes. This is kind of long form. This actually might be a better approximation for what's useful to me than voice memos like "I need to buy three liters of milk tomorrow and pita bread." Which is probably how like half my voice note voice notes sound. Like if anyone were to I don't know like find my phone they'd be like "this is the most boring person in the world!" Although actually there are some like kind of uh journaling thoughts as well. But it's a lot of content like that. And the probably for the evaluation the most useful thing is slightly obscure tech: Github, Nuclino, Hugging Face. 
Not so obscure that it's not going to have a chance of knowing it. But hopefully sufficiently well known that the models should get it. Uh I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general I've spoken delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise. Which in my first take that I had to get rid of my wife came in with my son for a good night kiss. And that actually would have been super helpful to get in because it was non-diarised. Or if we had diarisation a female I could say I want the male voice and that wasn't intended for transcription um. And we're not going to get background noise like people honking their horns. Which is something I've done in my main dataset where I am trying to go back to some of my voice notes, annotate them, and run a benchmark. But this is going to be just a pure quick test. And as someone I'm working on a voice note idea that's my sort of end motivation besides thinking it's an absolute outstanding technology that's coming to viability and really - I know this sounds cheesy - can actually have a very transformative effect. It's, you know, voice technology has been life changing for folks living with disabilities. And I think there's something really nice about the fact that it can also benefit you know folks who are able-bodied. And like we can all in different ways um make this tech as useful as possible regardless of the exact way that we're using it um. And I think there's something very powerful in that. And it can be very cool um I see huge potential. What excites me about voice tech - a lot of things actually. Firstly the fact that it's cheap and accurate as I mentioned at the very start of this um. And it's getting better and better with stuff like accent handling um. 
I'm not sure my my fine tune will actually ever come to fruition in the sense that I'll use it day to day as I imagine and get like superb flawless words error rates. Because I'm just kind of skeptical about local speech to tech as I mentioned. And I think the pace of innovation and improvement in the models. The main reasons for fine tuning from what I've seen have been people who are something that really blows blows my mind about ASR is the idea that it's inherently a-llingual. Or multilingual. Phonetic-based. So as folks who use speak very obscure languages that there may be there there might be a paucity of training data or almost none at all. And therefore the accuracy is significantly reduced. Or folks in very critical environments. I know there are this is used extensively in medical transcription and dispatcher work as um you know the call centers who send out ambulances etc where accuracy is absolutely paramount. And in the case of doctors, radiologists they might be using very specialized vocab all the time. So those are kind of the main two things. And I'm not sure that really just for trying to make it better on a few random tech words with my slightly. I mean, I have an accent! But like, not you know an accent that a few other million people have. Ish. I'm not sure that my little fine tune is going to actually like the bump in word error reduction if I ever actually figure out how to do it and get it up to the cloud. By the time we've done that I suspect that the next generation of ASR will just be so good that it will kind of be "nah, well, that would be cool if it worked out. But I'll just use this instead." So that's going to be it for today's episodes of voice training data single long shot evaluation. Who am I going to compare? Whisper is always good as a benchmark. But I'm more interested in seeing Whisper head-to-head with two things really. One is Whisper variants. So you've got these projects like Faster Whisper, Distilled Whisper. 
It's a bit confusing. There's a whole bunch of them. And the emerging ASRs which are also a thing. My intention for this is I'm not sure I'm going to have the time in any point of the foreseeable future to go back through this whole episode and create a proper source truth or I fix everything. I might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face I think would be a great probably how I might visualize this is having the audio waveform play. And then have the transcript for each model below it. And maybe even a like you know to scale. And maybe even a local one as well like local Whisper versus Open AI API etc. And I can then actually listen back to segments. Or anyone who wants to can listen back to segments of this recording and see where a particular model struggled while others didn't, as well as the sort of headline finding of which had the best WER. But that would require the source of truth. Okay, that's it. Hope this was, I don't know, maybe useful for other folks interested in STT. You want to see - that I always feel think I've just said as something I didn't intend to. STT I said for those listening carefully! Including hopefully the models themselves! This has been myself Daniel Rosehill. For more um jumbled repositories about my uh roving interests in AI. But particularly agentic AI, MCP, and voice tech, you can find me on Github, Hugging Face. Where else? DanielRosehilll.com which is my personal website. As well as this podcast whose name I sadly cannot remember! Until next time, thanks for listening!
data/inference/benchmark_results.json ADDED
@@ -0,0 +1,134 @@
+ {
+ "ground_truth_file": "/home/daniel/repos/github/Long-Form-Audio-Eval/data/ground-truth/truth_1.txt",
+ "total_runs_evaluated": 8,
+ "results": [
+ {
+ "run_id": "run-1",
+ "run_type": "local-stt",
+ "provider": "local",
+ "model": "whisper-base",
+ "engine": "Buzz",
+ "metrics": {
+ "wer": 17.52,
+ "cer": 5.38,
+ "word_accuracy": 82.48,
+ "insertions": 44,
+ "deletions": 62,
+ "substitutions": 726,
+ "hits": 3960
+ }
+ },
+ {
+ "run_id": "run-2",
+ "run_type": "local-stt",
+ "provider": "local",
+ "model": "whisper-tiny",
+ "engine": "Buzz",
+ "metrics": {
+ "wer": 22.49,
+ "cer": 8.39,
+ "word_accuracy": 77.51,
+ "insertions": 82,
+ "deletions": 155,
+ "substitutions": 831,
+ "hits": 3762
+ }
+ },
+ {
+ "run_id": "run-3",
+ "run_type": "local-stt",
+ "provider": "local",
+ "model": "whisper-base",
+ "engine": "Buzz",
+ "metrics": {
+ "wer": 17.52,
+ "cer": 5.38,
+ "word_accuracy": 82.48,
+ "insertions": 44,
+ "deletions": 62,
+ "substitutions": 726,
+ "hits": 3960
+ }
+ },
+ {
+ "run_id": "manual-1",
+ "run_type": "cloud-stt",
+ "provider": "gladia",
+ "model": "solaria-1",
+ "engine": "api",
+ "metrics": {
+ "wer": 20.83,
+ "cer": 6.3,
+ "word_accuracy": 79.17,
+ "insertions": 100,
+ "deletions": 92,
+ "substitutions": 797,
+ "hits": 3859
+ }
+ },
+ {
+ "run_id": "manual-2",
+ "run_type": "cloud-stt",
+ "provider": "deepgram",
+ "model": "nova-3",
+ "engine": "api",
+ "metrics": {
+ "wer": 18.72,
+ "cer": 7.33,
+ "word_accuracy": 81.28,
+ "insertions": 60,
+ "deletions": 214,
+ "substitutions": 615,
+ "hits": 3919
+ }
+ },
+ {
+ "run_id": "manual-3",
+ "run_type": "cloud-stt",
+ "provider": "assemblyai",
+ "model": "best",
+ "engine": "api",
+ "metrics": {
+ "wer": 18.79,
+ "cer": 6.24,
+ "word_accuracy": 81.21,
+ "insertions": 64,
+ "deletions": 156,
+ "substitutions": 672,
+ "hits": 3920
+ }
+ },
+ {
+ "run_id": "manual-4",
+ "run_type": "cloud-stt",
+ "provider": "speechmatics",
+ "model": "slam-1-global-english",
+ "engine": "api",
+ "metrics": {
+ "wer": 21.65,
+ "cer": 7.15,
+ "word_accuracy": 78.35,
+ "insertions": 158,
+ "deletions": 51,
+ "substitutions": 819,
+ "hits": 3878
+ }
+ },
+ {
+ "run_id": "manual-5",
+ "run_type": "cloud-stt",
+ "provider": "openai",
+ "model": "whisper-1",
+ "engine": "api",
+ "metrics": {
+ "wer": 19.27,
+ "cer": 6.4,
+ "word_accuracy": 80.73,
+ "insertions": 114,
+ "deletions": 106,
+ "substitutions": 695,
+ "hits": 3947
+ }
+ }
+ ]
+ }
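For readers skimming the metrics above: the reported `wer` and `word_accuracy` follow the standard edit-distance accounting over insertions, deletions, substitutions, and hits, where the reference length is hits + substitutions + deletions. A minimal sketch (the function name is illustrative, not the repo's actual scorer):

```python
def wer_from_counts(insertions: int, deletions: int, substitutions: int, hits: int) -> float:
    """Word error rate as a percentage, from edit-operation counts.

    Every reference word is either matched (hit), substituted, or deleted,
    so the reference length is hits + substitutions + deletions.
    """
    reference_words = hits + substitutions + deletions
    return 100.0 * (insertions + deletions + substitutions) / reference_words

# run-1 (whisper-base via Buzz) from benchmark_results.json:
wer = wer_from_counts(insertions=44, deletions=62, substitutions=726, hits=3960)
print(round(wer, 2))        # 17.52
print(round(100 - wer, 2))  # 82.48 -> the reported word_accuracy
```

The same arithmetic reproduces the other rows (e.g. whisper-tiny: 82 + 155 + 831 errors over 4748 reference words gives 22.49).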
data/inference/punctuation_results.json ADDED
@@ -0,0 +1,486 @@
+ {
+ "ground_truth_file": "/home/daniel/repos/github/Long-Form-Audio-Eval/data/ground-truth/truth_1.txt",
+ "total_runs_evaluated": 8,
+ "results": [
+ {
+ "run_id": "run-1",
+ "provider": "local",
+ "model": "whisper-base",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 292,
+ "difference": -396
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 6.17
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 42,
+ "count_accuracy": 15.97
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 10,
+ "count_accuracy": 30.3
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 31,
+ "count_accuracy": 29.81
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 202,
+ "count_accuracy": 99.51
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 7,
+ "count_accuracy": 36.84
+ }
+ },
+ "context_match_accuracy": 13.02,
+ "overall_punctuation_score": 21.9
+ }
+ },
+ {
+ "run_id": "run-2",
+ "provider": "local",
+ "model": "whisper-tiny",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 288,
+ "difference": -400
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 6.16
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 45,
+ "count_accuracy": 17.11
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 5,
+ "count_accuracy": 15.15
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 34,
+ "count_accuracy": 32.69
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 199,
+ "count_accuracy": 98.03
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 5,
+ "count_accuracy": 26.32
+ }
+ },
+ "context_match_accuracy": 8.6,
+ "overall_punctuation_score": 18.78
+ }
+ },
+ {
+ "run_id": "run-3",
+ "provider": "local",
+ "model": "whisper-base",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 292,
+ "difference": -396
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 6.17
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 42,
+ "count_accuracy": 15.97
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 10,
+ "count_accuracy": 30.3
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 31,
+ "count_accuracy": 29.81
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 202,
+ "count_accuracy": 99.51
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 7,
+ "count_accuracy": 36.84
+ }
+ },
+ "context_match_accuracy": 13.02,
+ "overall_punctuation_score": 21.9
+ }
+ },
+ {
+ "run_id": "manual-1",
+ "provider": "gladia",
+ "model": "solaria-1",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 651,
+ "difference": -37
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 13.69
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 180,
+ "count_accuracy": 68.44
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 9,
+ "count_accuracy": 27.27
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 251,
+ "count_accuracy": 0
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 197,
+ "count_accuracy": 97.04
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 14,
+ "count_accuracy": 73.68
+ }
+ },
+ "context_match_accuracy": 22.56,
+ "overall_punctuation_score": 44.13
+ }
+ },
+ {
+ "run_id": "manual-2",
+ "provider": "deepgram",
+ "model": "nova-3",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 698,
+ "difference": 10
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 15.19
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 222,
+ "count_accuracy": 84.41
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 3,
+ "count_accuracy": 9.09
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 265,
+ "count_accuracy": 0
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 189,
+ "count_accuracy": 93.1
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 19,
+ "count_accuracy": 100.0
+ }
+ },
+ "context_match_accuracy": 32.33,
+ "overall_punctuation_score": 51.17
+ }
+ },
+ {
+ "run_id": "manual-3",
+ "provider": "assemblyai",
+ "model": "best",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 791,
+ "difference": 103
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 16.99
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 218,
+ "count_accuracy": 82.89
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 7,
+ "count_accuracy": 21.21
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 356,
+ "count_accuracy": 0
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 191,
+ "count_accuracy": 94.09
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 19,
+ "count_accuracy": 100.0
+ }
+ },
+ "context_match_accuracy": 33.72,
+ "overall_punctuation_score": 48.43
+ }
+ },
+ {
+ "run_id": "manual-4",
+ "provider": "speechmatics",
+ "model": "slam-1-global-english",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 1003,
+ "difference": 315
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 20.66
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 238,
+ "count_accuracy": 90.49
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 4,
+ "count_accuracy": 12.12
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 549,
+ "count_accuracy": 0
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 195,
+ "count_accuracy": 96.06
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 17,
+ "count_accuracy": 89.47
+ }
+ },
+ "context_match_accuracy": 30.0,
+ "overall_punctuation_score": 38.23
+ }
+ },
+ {
+ "run_id": "manual-5",
+ "provider": "openai",
+ "model": "whisper-1",
+ "metrics": {
+ "total_punctuation": {
+ "reference": 688,
+ "hypothesis": 911,
+ "difference": 223
+ },
+ "punctuation_density": {
+ "reference_percent": 14.49,
+ "hypothesis_percent": 19.15
+ },
+ "mark_accuracy": {
+ "!": {
+ "reference_count": 19,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ "\"": {
+ "reference_count": 45,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ".": {
+ "reference_count": 263,
+ "hypothesis_count": 221,
+ "count_accuracy": 84.03
+ },
+ "-": {
+ "reference_count": 33,
+ "hypothesis_count": 6,
+ "count_accuracy": 18.18
+ },
+ ":": {
+ "reference_count": 2,
+ "hypothesis_count": 0,
+ "count_accuracy": 0
+ },
+ ",": {
+ "reference_count": 104,
+ "hypothesis_count": 471,
+ "count_accuracy": 0
+ },
+ "'": {
+ "reference_count": 203,
+ "hypothesis_count": 197,
+ "count_accuracy": 97.04
+ },
+ "?": {
+ "reference_count": 19,
+ "hypothesis_count": 16,
+ "count_accuracy": 84.21
+ }
+ },
+ "context_match_accuracy": 34.42,
+ "overall_punctuation_score": 44.44
+ }
+ }
+ ]
+ }
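The per-mark `count_accuracy` values above are consistent with a simple count-difference ratio, floored at zero when a mark is heavily over-produced. This is an inferred formula that matches the reported numbers, not necessarily the repo's exact implementation:

```python
def count_accuracy(reference_count: int, hypothesis_count: int) -> float:
    """Per-mark count accuracy (percent): 100 when the counts match,
    falling off linearly with the absolute count difference, and
    floored at 0 when the difference exceeds the reference count
    (e.g. a heavily over-produced comma scores 0)."""
    if reference_count == 0:
        return 0.0
    diff_ratio = abs(reference_count - hypothesis_count) / reference_count
    return max(0.0, 100.0 * (1 - diff_ratio))

# Spot-checks against punctuation_results.json:
print(round(count_accuracy(263, 42), 2))   # 15.97  (whisper-base '.')
print(round(count_accuracy(104, 251), 2))  # 0.0    (gladia ',' over-produced)
print(round(count_accuracy(19, 19), 2))    # 100.0  (deepgram '?')
```

The same formula also reproduces the hyphen and apostrophe rows (e.g. 33 vs 10 gives 30.3; 203 vs 202 gives 99.51).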
data/inference/runs-config.json ADDED
@@ -0,0 +1,501 @@
+ {
+ "runs": [
+ {
+ "run_id": "run-1",
+ "run_type": "local-stt",
+ "model": "whisper-base",
+ "provider": "local",
+ "inference_provider": null,
+ "engine": "Buzz",
+ "run_method": {
+ "interface": "gui",
+ "automation": "manual",
+ "description": "Processed using Buzz desktop application"
+ },
+ "settings": {
+ "language": "en",
+ "task": "transcribe"
+ },
+ "output_dir": "runs/local-stt/run-1",
+ "completed": true,
+ "notes": "Whisper Base (local inference) using Buzz",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-2",
+ "run_type": "local-stt",
+ "model": "whisper-tiny",
+ "provider": "local",
+ "inference_provider": null,
+ "engine": "Buzz",
+ "run_method": {
+ "interface": "gui",
+ "automation": "manual",
+ "description": "Processed using Buzz desktop application"
+ },
+ "settings": {
+ "language": "en",
+ "task": "transcribe"
+ },
+ "output_dir": "runs/local-stt/run-2",
+ "completed": true,
+ "notes": "Whisper Tiny (local inference) using Buzz",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-3",
+ "run_type": "local-stt",
+ "model": "whisper-base",
+ "provider": "local",
+ "inference_provider": null,
+ "engine": "Buzz",
+ "run_method": {
+ "interface": "gui",
+ "automation": "manual",
+ "description": "Processed using Buzz desktop application"
+ },
+ "settings": {
+ "language": "auto-detect",
+ "task": "transcribe"
+ },
+ "output_dir": "runs/local-stt/run-3",
+ "completed": true,
+ "notes": "Whisper Base (as run 1) locally but with language set to detect rather than specified as English",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": true,
+ "language_specified": false,
+ "language_code": "auto"
+ }
+ }
+ },
+ {
+ "run_id": "run-4",
+ "run_type": "cloud-stt",
+ "model": "whisper-1",
+ "provider": "openai",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en",
+ "temperature": 0.0
+ },
+ "output_dir": "runs/cloud-stt/run-4",
+ "completed": true,
+ "notes": "OpenAI Whisper API",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-5",
+ "run_type": "cloud-stt",
+ "model": "nova-2",
+ "provider": "deepgram",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en",
+ "smart_format": true,
+ "punctuate": true
+ },
+ "output_dir": "runs/cloud-stt/run-5",
+ "completed": false,
+ "notes": "Deepgram Nova-2 model",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-6",
+ "run_type": "cloud-stt",
+ "model": "chirp",
+ "provider": "assemblyai",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language_code": "en",
+ "punctuate": true,
+ "format_text": true
+ },
+ "output_dir": "runs/cloud-stt/run-6",
+ "completed": false,
+ "notes": "AssemblyAI Universal-1 (Chirp) model",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-7",
+ "run_type": "cloud-stt",
+ "model": "whisper-large-v3",
+ "provider": "groq",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en",
+ "temperature": 0.0
+ },
+ "output_dir": "runs/cloud-stt/run-7",
+ "completed": false,
+ "notes": "Groq Whisper Large V3",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-8",
+ "run_type": "cloud-stt",
+ "model": "enhanced",
+ "provider": "speechmatics",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en",
+ "operating_point": "enhanced",
+ "enable_partials": false
+ },
+ "output_dir": "runs/cloud-stt/run-8",
+ "completed": false,
+ "notes": "Speechmatics Enhanced model via Eden AI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-9",
+ "run_type": "cloud-stt",
+ "model": "whisper",
+ "provider": "google",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en"
+ },
+ "output_dir": "runs/cloud-stt/run-9",
+ "completed": false,
+ "notes": "Google Speech-to-Text via Eden AI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-10",
+ "run_type": "cloud-stt",
+ "model": "transcribe",
+ "provider": "amazon",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en"
+ },
+ "output_dir": "runs/cloud-stt/run-10",
+ "completed": false,
+ "notes": "Amazon Transcribe via Eden AI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-11",
+ "run_type": "cloud-stt",
+ "model": "azure-stt",
+ "provider": "microsoft",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en"
+ },
+ "output_dir": "runs/cloud-stt/run-11",
+ "completed": false,
+ "notes": "Microsoft Azure Speech-to-Text via Eden AI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-12",
+ "run_type": "cloud-stt",
+ "model": "default",
+ "provider": "symbl",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en"
+ },
+ "output_dir": "runs/cloud-stt/run-12",
+ "completed": false,
+ "notes": "Symbl.ai via Eden AI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "run-13",
+ "run_type": "cloud-stt",
+ "model": "fast",
+ "provider": "gladia",
+ "inference_provider": "edenai",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "automated",
+ "description": "Executed via edenai_stt_runner.py script"
+ },
+ "settings": {
+ "language": "en"
+ },
+ "output_dir": "runs/cloud-stt/run-13",
+ "completed": false,
+ "notes": "Gladia fast model via Eden AI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "manual-1",
+ "run_type": "cloud-stt",
+ "model": "solaria-1",
+ "provider": "gladia",
+ "inference_provider": "direct",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "manual",
+ "description": "Manual run - standardized output format"
+ },
+ "settings": {
+ "language": "en"
+ },
+ "output_dir": "runs/cloud-stt/manual-1",
+ "completed": true,
+ "notes": "Gladia Solaria 1 model - manual run",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "manual-2",
+ "run_type": "cloud-stt",
+ "model": "nova-3",
+ "provider": "deepgram",
+ "inference_provider": "direct",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "manual",
+ "description": "Manual run - standardized output format"
+ },
+ "settings": {
+ "language": "en",
+ "smart_format": true,
+ "punctuate": true
+ },
+ "output_dir": "runs/cloud-stt/manual-2",
+ "completed": true,
+ "notes": "Deepgram Nova-3 model - manual run",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ },
+ "processing_time": 3.0,
+ "audio_duration": 1637.9652
+ }
+ },
+ {
+ "run_id": "manual-3",
+ "run_type": "cloud-stt",
+ "model": "best",
+ "provider": "assemblyai",
+ "inference_provider": "direct",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "manual",
+ "description": "Manual run - standardized output format"
+ },
+ "settings": {
+ "language_code": "en",
+ "punctuate": true,
+ "format_text": true
+ },
+ "output_dir": "runs/cloud-stt/manual-3",
+ "completed": true,
+ "notes": "AssemblyAI Best model - manual run",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "manual-4",
+ "run_type": "cloud-stt",
+ "model": "slam-1-global-english",
+ "provider": "speechmatics",
+ "inference_provider": "direct",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "manual",
+ "description": "Manual run - standardized output format"
+ },
+ "settings": {
+ "language": "en",
+ "operating_point": "enhanced",
+ "enable_partials": false
+ },
+ "output_dir": "runs/cloud-stt/manual-4",
+ "completed": true,
+ "notes": "Speechmatics Slam 1 Global English - manual run",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ },
+ {
+ "run_id": "manual-5",
+ "run_type": "cloud-stt",
+ "model": "whisper-1",
+ "provider": "openai",
+ "inference_provider": "direct",
+ "engine": "api",
+ "run_method": {
+ "interface": "api",
+ "automation": "manual",
+ "description": "Manual run - standardized output format"
+ },
+ "settings": {
+ "language": "en",
+ "temperature": 0.0
+ },
+ "output_dir": "runs/cloud-stt/manual-5",
+ "completed": true,
+ "notes": "OpenAI Whisper-1 model - manual run via web UI",
+ "run_notes": {
+ "language_detection": {
+ "auto_detect": false,
+ "language_specified": true,
+ "language_code": "en"
+ }
+ }
+ }
+ ],
+ "source_audio": "data/audio/podcast.mp3",
+ "source_of_truth": "data/ground-truth/truth_1.txt",
+ "evaluation_metrics": [
+ "wer",
+ "cer",
+ "word_accuracy",
+ "processing_time",
+ "cost"
+ ]
+ }
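This config acts as the run manifest: each entry records how a transcription was produced and whether it finished, so downstream scoring can skip incomplete runs. A short sketch of that filtering (hypothetical helper, shown on a minimal inline excerpt of the schema rather than the real file):

```python
import json

def completed_run_ids(config: dict) -> list[str]:
    """Run IDs whose transcription finished, in manifest order."""
    return [run["run_id"] for run in config["runs"] if run["completed"]]

# Minimal excerpt mirroring the runs-config.json schema:
config = json.loads("""
{"runs": [
  {"run_id": "run-1", "completed": true},
  {"run_id": "run-5", "completed": false},
  {"run_id": "manual-1", "completed": true}
]}
""")
print(completed_run_ids(config))  # ['run-1', 'manual-1']
```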
data/inference/runs/cloud-stt/manual-1/raw_response.json ADDED
@@ -0,0 +1,3 @@
+ {
+ "full_transcript": "Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast or it uh i may append this to a podcast that i set up recently um regarding my uh with my thoughts on speech tech and ai in particular more ai and generative ai i would uh i would say but in any event the purpose of this um voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation, as they might say, for different speech to text models. And I'm doing this because I thought I'd made a great breakthrough in my journey with speech tech, and that was succeeding in the elusive task of fine tuning Whisper. Whisper is, and I'm going to just talk i'm trying to mix up uh i'm going to try a few different styles of speaking i might whisper something at some points as well and i'll go back to speaking loud in uh in different parts i'm going to sound really like a crazy person because i'm also going to try to speak at different pitches and cadences in order to really try to put a speech attacks model through its paces which is trying to make sense of is this guy just rambling on incoherently in one long sentence or are these just actually a series of step, standalone, step alone, standalone sentences. And how is it going to handle step alone? That's not a word. What happens when you use speech to text and you use a fake word? And then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why did, why was it trying to fine tune Whisper? 
what is whisper as i said i'm gonna try to uh record this at a couple of different levels of technicality for folks who are uh you know in the normal uh world and not totally stuck down the rabbit hole of ai which i have to say is a really wonderful uh rabbit hole to be to be down um it's a really interesting area and speech and voice tech is is the aspect of it that i find actually most i'm not sure i would say the most interesting because there's Just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I'm persevering hard with the task of training, I guess, a good solution working for Linux, which if anyone actually does listen to this, not just for the training data and for the actual content, this is this is sparked. I had besides the fine-tune not working well that was the failure um i used plod code because one thinks these days that there is nothing short of solving you know the uh the reason of life or something uh that plod and agentic ai can't do uh which is not really the case uh it does seem that way sometimes but it fails a lot as well and this is one of those instances where last week I put together an hour of voice training data, basically speaking, just random things for three minutes. And it was actually kind of tedious because the texts were really weird. Some of them were, it was like, it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was like, okay, I know I'm just going to have to find something else to read. So I used... 
a created with AI studio vibe coded a synthetic text generator which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content so I was like okay give me a voice note like I'm recording an email give me a short story to read give me prose to read so it came up with all these different things and they added a little timer to it so I could see. how close i was to one hour um and uh i spent like an hour one afternoon or probably two hours by the time you um you do retakes and whatever because you want to it gave me a source of truth which i'm not sure if that's the scientific way to approach this topic of gathering uh training data but i thought made sense um i have a lot of audio data from recording voice notes which I've also kind of used Bean. experimenting with using for a different purpose slightly different annotating task types it's more text classification experiment or Well, it's more than that, actually. I'm working on a voice app. So it's a prototype, I guess, is really more accurate. But you can do that and you can work backwards. You're like, you listen back to a voice note and you painfully go through one of those transcribing, you know, where you start and stop and scrub around it and you fix the errors. But it's really, really boring to do that. So I thought it would be less tedious in the long term if I just recorded the source of truth. So it gave me these three minute snippets. I recorded them and saved an MP3 and a TXT in the same folder. And I created an error of that data. So I was very hopeful, quietly, you know, a little bit hopeful that I would be able that I could actually fine tune Whisper. I want to fine tune Whisper because when I got into voice tech last November, my wife was in the US and I was alone at home and, you know, went crazy. 
people like me do really wild things like use voice to tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high I used speech tech now and again tried it out I was like it'd be really cool if you could just like speak into your computer and whatever I tried out that had support was just it was not good basically And this blew me away from the first go. I mean, it wasn't 100% accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. You know, there's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time, for a speech to text to be a worthwhile addition to your productivity. But you do need to get above, let's say, I don't know, 85%. percent. If it's 60% or 50%, you inevitably say, screw it, I'll just type it because you end up missing errors in the transcript and it becomes actually worse. You end up in a worse position than you started with it. That's been my experience. So I was like, oh, this is actually really, really good now. How did that happen? And the answer is ASR, Whisper being open sourced and the transformer architecture. If you want to go back to the to the underpinnings, which really blows my mind. And it's on my list to read through that paper. All you need is attention as attentively as can be done with my limited brain because it's super, super high level stuff. Super advanced stuff, I mean. But that, I think of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities. I find it fascinating that few people are like, hang on, you've got this thing that can speak to you like a chatbot, an LLM. And then you've got image generation. OK, so firstly, those two things on the surface have nothing in common. So like, how are they? 
How did that just happen all at the same time? And then when you extend that further, you're like Suno, right? You can sing a song and AI will like come up with an instrumental. And then you've got Whisper. And you're like, wait a second. How did all this stuff, like if it's all AI, what's, like there has to be some commonality. Otherwise, these are totally different technologies on the surface of it. And the transformer architecture is, as far as I know, the answer. And I can't even say, can't even pretend that I really understand what the transformer architecture means in depth. But I have scanned this. And as I said, I want to... printed and really kind of think over it at some point and I'll probably feel bad about myself I think because weren't those guys in their in their 20s like that's crazy I think I asked chat gpt once who were the who wrote that paper and how old were they when it was published in arcs if and I was expecting like I don't know what do you what do you imagine I personally imagine kind of like you know you have these breakthroughs during covid and things like that where like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring in labs and uh wearily and writing and publishing in kind of obscure academic publications and they finally like hit a big or win a noble prize and then their household household names uh so that was kind of what i had in mind that was the mental image i'd formed of the birth of arcs of like i wasn't expecting 20 somethings in san francisco though i i thought that was both very very funny very cool and actually kind of inspiring It's nice to think that people who, you know, just you might put them in the kind of. 
milieu or bubble or world that you are in or credibly in through you know the series of connections that are coming up with such literally world-changing um innovations uh so that was i thought anyway that's that that was cool okay voice training data how are we doing we're about 10 minutes and i'm still talking about voice technology um so whisper was brilliant and i was so excited that i was my first instinct was to like guess It's like, oh my gosh, I have to get like a really good microphone for this. So I didn't go on a spending spree because I said, I'm going to have to just wait a month and see if I still use this. And it just kind of became, it's become really part of my daily routine. Like if I'm writing an email, I'll record a voice note. And then I've developed. And it's nice to see that everyone is like developing the same things in parallel. Like that's kind of a weird thing to say. But when I look, I... kind of came when i started working on this uh these prototypes on github which is where i just kind of share very freely and loosely uh ideas and you know first iterations on on concepts um and for want of a better word i called it like uh llm post-processing or cleanup or basically a system prompt that after you get back the raw text from whisper you run it through a model and say, okay, this is crappy. 
text like add sentence structure and you know fix it up and now when I'm exploring the different tools that are out there that people have built I see quite a number of projects have basically you know done the same thing lest that be misconstrued I'm not saying for a millisecond that I inspired them I'm sure this has been a thing that's been integrated into tools for a while but it's It's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent because text that doesn't have any punctuation or paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that again, it's, it's, it, it moves speech tech into that before that inflection point where you're like, nah, it's just not worth it. It's like, it'll just be quicker to type this. So it's a big, it's a little touch that actually. is a big deal. So I was on Whisper and I've been using Whisper and I kind of early on found a couple of tools. I couldn't find what I was looking for on Linux, which is basically just something that'll run in the background. You'll give it an API key and it will just like transcribe with like a little key to start and stop the dictation. And the issues were I discovered that like most people involved in creating these projects were very much focused on local models running whisper locally because you can and i tried that a bunch of times and just never got results that were as good as the cloud and when i began looking at the cost of the speech to text apis and what i was spending i just thought there is it's actually in my opinion just one of the better deals in api spending and in cloud like it's just not that expensive for very very good models That are much more, you know, you're going to be able to run the full model, the latest model versus whatever you can run on your average GPU, unless you want to buy a crazy GPU. It doesn't really make sense to me. 
Privacy is another concern that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment, maybe for regulatory reasons as well. But I'm not in that. I'm neither really care. about people listening to my grocery list consisting of reminding myself that I need to buy more beer, Cheetos and hummus, which is kind of the three, three staples of my diet during periods of poor nutrition. But the kind of stuff that I transcribe, it's just not, it's not a, it's not a privacy thing I'm that sort of sensitive about. And I don't do anything so, you know, sensitive or secure that requires air gapping. So. I looked at the pricing and especially the kind of older models, mini, some of them are very, very affordable. And I did a calculation once with ChatGPT and I was like, OK, this is the API price for I can't remember whatever the model was. Let's say I just go at it like nonstop, which it rarely happens. Probably I would say on average I might dictate 30 to 60 minutes per day if I was probably summing up the emails. uh, documents, outlines, um, which is a lot, but it's, it's still a fairly modest amount. And I was like, well, some days I do go on like one or two days where I've been. Usually when I'm like kind of out of the house and just have something like I have nothing else to do. Like if I'm at a hospital, we have a newborn and you're waiting for like eight hours and hours for an appointment. And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, oh, wait, let me just get down. Let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say if I'm going to price out cloud STT, if I was like dedicated every second of every waking hour to transcribing for some odd reason, um, I mean, it'd have to like eat and use the toilet. 
Like, you know, there's only so many hours I'm awake for. So like, let's just say a maximum of like 40 hours, 45 minutes in the hour. Then I said, all right, let's just say 50. Who knows? You're dictating on the toilet. We do it. So you could just do 60, but whatever I did and every day, like you're going flat out seven days a week dictating nonstop. I was like, what's my monthly API bill going to be at this price? And it came out to like 70 or 80 bucks. And I was like, well, that would be an extraordinary amount of dictation. And I would hope that there was some compelling reason worth more than $70 that I embarked upon that project. So given that that's kind of the max point for me, I said that's actually very, very affordable. Now, you're going to if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable, that's going to cost some more as well. Unless you're using Gemini, which needless to say, is a random person sitting in Jerusalem. I have no affiliation, nor with Google, nor Anthropic, nor Gemini, nor any major tech vendor for that matter. Um, I like Gemini not so much as a everyday model. Um, it's kind of underwhelmed in that respect, I would say, but for multimodal, I think it's got a lot to offer. And I think that the transcribing functionality whereby it can, um, process audio with a system prompt and both give you transcription that's cleaned up, that reduces two steps to one. And that for me is a very, very big deal. And, uh, I feel like even Google has haven't really sort of thought through how useful the that modality is and what kind of use cases you can achieve with it because i found in the course of this year just an endless list of really kind of system prompt system prompt stuff that i can say okay i've used it to capture context data for ai which is literally i might speak for if i wanted to have a good bank of context data about who knows my childhood. 
more realistically maybe my career goals something that would just be like really boring to type out so I'll just like sit in my car and record it for 10 minutes and that 10 minutes you get a lot of information in emails which is short text just there is a whole bunch and all these workflows kind of require a little bit of treatment afterwards and different treatment my context pipeline is kind of like just extract the bare essentials so you end up with me talking very loosely about sort of what i've done in my career where i've worked where i might like to work and it goes it condenses that down to very robotic language that is easy to chunk parse and maybe put into a vector database daniel has worked in technology daniel is a has been working in martin you know stuff like that that's not how you would speak um but i figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes and this is actually a success because I wasted 20 minutes of the evening speaking into a microphone and the levels were shot and it was clipping and I said I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. What am I hoping to achieve in this? Okay, my fine tune was a dud as mentioned. Deepgram SDT, I'm really, really hopeful that this prototype will work. And it's a built in public open source. So anyone is welcome to use it if I make anything good. But that was really exciting for me last night when after hours of trying my own prototype, seeing someone just made something that works like that, you know, you're not going to have to build a custom conda environment and image. I have AMD GPU, which makes things much more complicated. I didn't find it. And I was about to give up. And I said, all right, let me just give Deepgram's Linux thing. 
shot and if it doesn't work, I'm just gonna go back to trying to vibe code something myself and when I ran the script I was using cloud code to do the installation process. It ran the script and oh my gosh, it works just like that. The tricky thing for all those who wants to know all the nitty gritty details was that I don't think it was actually struggling with transcription, but pasting, Wayland makes life very hard. And I think there was something not running at the right time. Anyway, Deepgram, I looked at how they actually handled that because it worked out of the... box when other stuff didn't and it was quite a clever little mechanism and but more so than that the accuracy was brilliant now what am i doing here this is going to be a 20 minute audio sample and i'm i think i've done one or two of these before but i did it with short snappy voice notes this is kind of long form this actually might be a better approximation for what's useful to me then voice memos like i need to buy three liters of milk tomorrow and peter bread which is probably how like half my voice note voice notes sound like if anyone were to i don't know like find my phone they'd be like this is the most boring person in the world although actually there are some like kind of uh journaling thoughts as well but it's a lot of content like that and the probably for the evaluation the most useful thing is slightly obscure tech github nucleano uh hugging face not so obscure that it's not going to have a chance of knowing it but hopefully sufficiently well known that the model should get it i tried to do a little bit of speaking really fast and speaking very slowly i would say in general i've spoken delivered this at a faster pace than i usually would owing to strong coffee flowing through my bloodstream and the thing that i'm not going to get in this benchmark is background noise which in my first take that i had to get rid of my wife came in with my son and for a good night kiss And that 
actually would have been super helpful to get in because it was non-diarized or if we had diarization, a female, I could say, I want the male voice and that wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I've done in my main data set where I am trying to go back to some of my voice notes, annotate them and run a benchmark. But this is going to be just a pure, quick test and As someone working on a voice note idea, that's my sort of end motivation, besides thinking it's an absolutely outstanding technology that's coming to viability. And really, I know this sounds cheesy, can actually have a very transformative effect. It's, you know, voice technology has been life changing for folks living with disabilities. And I think there's something really nice about the fact that it can also benefit. you know, folks who are able-bodied and like we can all in different ways make this tech as useful as possible, regardless of the exact way that we're using it. And I think there's something very powerful in that and it can be very cool. I see huge potential. What excites me about voice tech? A lot of things, actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this, and it's getting better and better with stuff like accent handling. I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day, as I imagine. I get like superb, flawless words, error rates, because I'm just kind of skeptical about local speech to text, as I mentioned. 
And I think the pace of innovation and improvement in the models, the main reasons for fine tuning from what I've seen have been people who are something that really blows my mind about ASR is the idea that it's inherently alingual or multilingual phonetic based so as folks who use speak very obscure languages that there may be very there might be a paucity of training data or almost none at all and therefore the accuracy is significantly reduced or folks in very critical environments i know there you this is used extensively in medical transcription and dispatcher your work as, um, you know, the call centers who send out ambulances, et cetera, where accuracy is absolutely paramount. And in the case of doctors, radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things. And I'm not sure that really just for trying to make it better on a few random tech words with my slightly, I mean, I have an accent, but like not, you know, an accent that a few other million people have it. I'm not sure that. my little fine tune is going to actually like the bump in word error reduction if I ever actually figure out how to do it and get it up to the cloud by the time I've done that I suspect that the next generation of ASR will just be so good that it will kind of be, no, well, that would have been cool if it worked out, but I'll just use this instead. So that's going to be it for today's episode of voice training data. Single, long shot evaluation. Who am I going to compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head-to-head with two things, really. One is Whisper variants. So you've got these... 
projects like faster whisper uh distill whisper it's a bit confusing there's a whole bunch of them and the emerging asrs which are also a thing my intention for this is i'm not sure i'm going to have the time in any point in the foreseeable future to go back through this whole episode and create a proper source truth where i fix everything might do it if i can get one transcriptions as sufficiently close to perfection but What I would actually love to do on Hugging Face, I think would be a great, probably how I might visualize this is having the audio waveform play and then have the transcript for each model below it. And maybe even a, like, you know, two scale and maybe even a local one as well, like Local Whisper versus OpenAI API, et cetera. And... I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER, but that would require the source of truth. Okay, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. You want to see that I always feel, think I've just said as something I didn't intend to. STT, I said for those. listen carefully, including hopefully the models themselves. This has been myself, Daniel Rosehill. For more jumbled repositories about my roving interest in AI, but particularly agentic, MCP and voice tech, you can find me on GitHub, Hugging Face, where else? Danielrosehill.com, which is my personal website, as well as this podcast, whose name I sadly cannot remember. Until next time, thanks for listening."
+ }
data/inference/runs/cloud-stt/manual-1/transcript.txt ADDED
@@ -0,0 +1 @@
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast or it uh i may append this to a podcast that i set up recently um regarding my uh with my thoughts on speech tech and ai in particular more ai and generative ai i would uh i would say but in any event the purpose of this um voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation, as they might say, for different speech to text models. And I'm doing this because I thought I'd made a great breakthrough in my journey with speech tech, and that was succeeding in the elusive task of fine tuning Whisper. Whisper is, and I'm going to just talk i'm trying to mix up uh i'm going to try a few different styles of speaking i might whisper something at some points as well and i'll go back to speaking loud in uh in different parts i'm going to sound really like a crazy person because i'm also going to try to speak at different pitches and cadences in order to really try to put a speech attacks model through its paces which is trying to make sense of is this guy just rambling on incoherently in one long sentence or are these just actually a series of step, standalone, step alone, standalone sentences. And how is it going to handle step alone? That's not a word. What happens when you use speech to text and you use a fake word? And then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why did, why was it trying to fine tune Whisper? 
what is whisper as i said i'm gonna try to uh record this at a couple of different levels of technicality for folks who are uh you know in the normal uh world and not totally stuck down the rabbit hole of ai which i have to say is a really wonderful uh rabbit hole to be to be down um it's a really interesting area and speech and voice tech is is the aspect of it that i find actually most i'm not sure i would say the most interesting because there's Just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I'm persevering hard with the task of training, I guess, a good solution working for Linux, which if anyone actually does listen to this, not just for the training data and for the actual content, this is this is sparked. I had besides the fine-tune not working well that was the failure um i used plod code because one thinks these days that there is nothing short of solving you know the uh the reason of life or something uh that plod and agentic ai can't do uh which is not really the case uh it does seem that way sometimes but it fails a lot as well and this is one of those instances where last week I put together an hour of voice training data, basically speaking, just random things for three minutes. And it was actually kind of tedious because the texts were really weird. Some of them were, it was like, it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was like, okay, I know I'm just going to have to find something else to read. So I used... 
a created with AI studio vibe coded a synthetic text generator which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content so I was like okay give me a voice note like I'm recording an email give me a short story to read give me prose to read so it came up with all these different things and they added a little timer to it so I could see. how close i was to one hour um and uh i spent like an hour one afternoon or probably two hours by the time you um you do retakes and whatever because you want to it gave me a source of truth which i'm not sure if that's the scientific way to approach this topic of gathering uh training data but i thought made sense um i have a lot of audio data from recording voice notes which I've also kind of used Bean. experimenting with using for a different purpose slightly different annotating task types it's more text classification experiment or Well, it's more than that, actually. I'm working on a voice app. So it's a prototype, I guess, is really more accurate. But you can do that and you can work backwards. You're like, you listen back to a voice note and you painfully go through one of those transcribing, you know, where you start and stop and scrub around it and you fix the errors. But it's really, really boring to do that. So I thought it would be less tedious in the long term if I just recorded the source of truth. So it gave me these three minute snippets. I recorded them and saved an MP3 and a TXT in the same folder. And I created an error of that data. So I was very hopeful, quietly, you know, a little bit hopeful that I would be able that I could actually fine tune Whisper. I want to fine tune Whisper because when I got into voice tech last November, my wife was in the US and I was alone at home and, you know, went crazy. 
people like me do really wild things like use voice to tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high I used speech tech now and again tried it out I was like it'd be really cool if you could just like speak into your computer and whatever I tried out that had support was just it was not good basically And this blew me away from the first go. I mean, it wasn't 100% accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. You know, there's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time, for a speech to text to be a worthwhile addition to your productivity. But you do need to get above, let's say, I don't know, 85%. percent. If it's 60% or 50%, you inevitably say, screw it, I'll just type it because you end up missing errors in the transcript and it becomes actually worse. You end up in a worse position than you started with it. That's been my experience. So I was like, oh, this is actually really, really good now. How did that happen? And the answer is ASR, Whisper being open sourced and the transformer architecture. If you want to go back to the to the underpinnings, which really blows my mind. And it's on my list to read through that paper. All you need is attention as attentively as can be done with my limited brain because it's super, super high level stuff. Super advanced stuff, I mean. But that, I think of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities. I find it fascinating that few people are like, hang on, you've got this thing that can speak to you like a chatbot, an LLM. And then you've got image generation. OK, so firstly, those two things on the surface have nothing in common. So like, how are they? 
How did that just happen all at the same time? And then when you extend that further, you're like Suno, right? You can sing a song and AI will like come up with an instrumental. And then you've got Whisper. And you're like, wait a second. How did all this stuff, like if it's all AI, what's, like there has to be some commonality. Otherwise, these are totally different technologies on the surface of it. And the transformer architecture is, as far as I know, the answer. And I can't even say, can't even pretend that I really understand what the transformer architecture means in depth. But I have scanned this. And as I said, I want to... printed and really kind of think over it at some point and I'll probably feel bad about myself I think because weren't those guys in their in their 20s like that's crazy I think I asked chat gpt once who were the who wrote that paper and how old were they when it was published in arcs if and I was expecting like I don't know what do you what do you imagine I personally imagine kind of like you know you have these breakthroughs during covid and things like that where like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring in labs and uh wearily and writing and publishing in kind of obscure academic publications and they finally like hit a big or win a noble prize and then their household household names uh so that was kind of what i had in mind that was the mental image i'd formed of the birth of arcs of like i wasn't expecting 20 somethings in san francisco though i i thought that was both very very funny very cool and actually kind of inspiring It's nice to think that people who, you know, just you might put them in the kind of. 
milieu or bubble or world that you are in, or credibly in, through, you know, the series of connections, that are coming up with such literally world-changing, um, innovations. Uh, so that was, I thought anyway, that's, that was cool. Okay, voice training data. How are we doing? We're about 10 minutes in and I'm still talking about voice technology. Um, so Whisper was brilliant, and I was so excited that my first instinct was like, oh my gosh, I have to get like a really good microphone for this. So I didn't go on a spending spree, because I said, I'm going to have to just wait a month and see if I still use this. And it just kind of became, it's become really part of my daily routine. Like if I'm writing an email, I'll record a voice note. And then I've developed. And it's nice to see that everyone is like developing the same things in parallel. Like that's kind of a weird thing to say. But when I look, I... it kind of came when I started working on these prototypes on GitHub, which is where I just kind of share very freely and loosely, uh, ideas and, you know, first iterations on concepts. Um, and for want of a better word, I called it like, uh, LLM post-processing or cleanup, or basically a system prompt that after you get back the raw text from Whisper, you run it through a model and say, okay, this is crappy
text, like, add sentence structure and, you know, fix it up. And now, when I'm exploring the different tools that are out there that people have built, I see quite a number of projects have basically, you know, done the same thing. Lest that be misconstrued, I'm not saying for a millisecond that I inspired them. I'm sure this has been a thing that's been integrated into tools for a while. But it's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent, because text that doesn't have any punctuation or paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that, again, it moves speech tech back before that inflection point where you're like, nah, it's just not worth it. It's like, it'll just be quicker to type this. So it's a little touch that actually is a big deal. So I was on Whisper, and I've been using Whisper, and I kind of early on found a couple of tools. I couldn't find what I was looking for on Linux, which is basically just something that'll run in the background. You'll give it an API key and it will just, like, transcribe, with like a little key to start and stop the dictation. And the issue was, I discovered that, like, most people involved in creating these projects were very much focused on local models, running Whisper locally, because you can. And I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech to text APIs and what I was spending, I just thought, it's actually, in my opinion, just one of the better deals in API spending and in cloud. Like, it's just not that expensive for very, very good models that are much more, you know, you're going to be able to run the full model, the latest model, versus whatever you can run on your average GPU, unless you want to buy a crazy GPU. It doesn't really make sense to me.
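The two-step cleanup flow described here (raw Whisper text, then an LLM pass with a system prompt) can be sketched as a request builder. This is a minimal sketch assuming a chat-completions style API; the model name and prompt wording are illustrative, not the exact prompt from these prototypes:

```python
# Sketch of the "LLM post-processing" step: wrap raw speech-to-text output
# in a cleanup request for a chat-style model. Prompt wording and the model
# name are illustrative assumptions.
CLEANUP_SYSTEM_PROMPT = (
    "You will receive raw speech-to-text output with little or no punctuation. "
    "Add sentence structure, punctuation, and paragraph breaks. "
    "Do not add, remove, or reword any content."
)

def build_cleanup_request(raw_transcript: str) -> dict:
    """Build the payload for the second (cleanup) step of the pipeline."""
    return {
        "model": "gpt-4o-mini",  # any small, cheap text model would do
        "messages": [
            {"role": "system", "content": CLEANUP_SYSTEM_PROMPT},
            {"role": "user", "content": raw_transcript},
        ],
    }

req = build_cleanup_request(
    "ok so first we transcribe then we run it through a model no punctuation yet"
)
print(req["messages"][0]["role"])  # system
```

The same shape works whether the transcription step was whisper-1 in the cloud or a local Whisper build; only the raw text going in changes.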
Privacy is another concern that I know is very much a separate thing: people just don't want their voice data and their voice leaving their local environment, maybe for regulatory reasons as well. But I'm not in that camp. I don't really care about people listening to my grocery list, consisting of reminding myself that I need to buy more beer, Cheetos and hummus, which are kind of the three staples of my diet during periods of poor nutrition. But the kind of stuff that I transcribe, it's just not, it's not a privacy thing I'm that sort of sensitive about. And I don't do anything so, you know, sensitive or secure that requires air gapping. So I looked at the pricing, and especially the kind of older models, mini, some of them are very, very affordable. And I did a calculation once with ChatGPT and I was like, OK, this is the API price for, I can't remember whatever the model was. Let's say I just go at it like nonstop, which rarely happens. Probably I would say on average I might dictate 30 to 60 minutes per day if I was, probably, summing up the emails, uh, documents, outlines, um, which is a lot, but it's still a fairly modest amount. And I was like, well, some days I do go on, like one or two days where I've been, usually when I'm like kind of out of the house and just have something, like I have nothing else to do. Like if I'm at a hospital, we have a newborn, and you're waiting for like eight hours and hours for an appointment. And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, oh, wait, let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say, if I'm going to price out cloud STT, if I was, like, dedicated every second of every waking hour to transcribing for some odd reason, um, I mean, I'd have to, like, eat and use the toilet.
Like, you know, there's only so many hours I'm awake for. So, like, let's just say a maximum of, like, 45 minutes in the hour. Then I said, all right, let's just say 50. Who knows? You're dictating on the toilet. We do it. So you could just do 60, but whatever I did. And every day, like, you're going flat out, seven days a week, dictating nonstop. I was like, what's my monthly API bill going to be at this price? And it came out to like 70 or 80 bucks. And I was like, well, that would be an extraordinary amount of dictation. And I would hope that there was some compelling reason worth more than $70 that I embarked upon that project. So given that that's kind of the max point for me, I said that's actually very, very affordable. Now, if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable, that's going to cost some more as well. Unless you're using Gemini, which, needless to say, as a random person sitting in Jerusalem, I have no affiliation with Google, nor Anthropic, nor Gemini, nor any major tech vendor for that matter. Um, I like Gemini not so much as an everyday model. Um, it's kind of underwhelmed in that respect, I would say. But for multimodal, I think it's got a lot to offer. And I think that the transcribing functionality, whereby it can, um, process audio with a system prompt and both give you a transcription that's cleaned up, that reduces two steps to one. And that for me is a very, very big deal. And, uh, I feel like even Google hasn't really sort of thought through how useful that modality is and what kind of use cases you can achieve with it, because I found in the course of this year just an endless list of really kind of system prompt stuff that I can say, okay, I've used it to capture context data for AI, which is literally, I might speak, if I wanted to have a good bank of context data about, who knows, my childhood,
more realistically maybe my career goals, something that would just be, like, really boring to type out. So I'll just, like, sit in my car and record it for 10 minutes, and in that 10 minutes you get a lot of information in. Emails, which is short text, there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards, and different treatment. My context pipeline is kind of like: just extract the bare essentials. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work, and it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database. Daniel has worked in technology, Daniel has been working in martin, you know, stuff like that. That's not how you would speak, um, but I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes, and this is actually a success, because I wasted 20 minutes of the evening speaking into a microphone when the levels were shot and it was clipping, and I said, I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. What am I hoping to achieve in this? Okay, my fine tune was a dud, as mentioned. Deepgram STT, I'm really, really hopeful that this prototype will work. And it's built in public, open source, so anyone is welcome to use it if I make anything good. But that was really exciting for me last night when, after hours of trying my own prototype, seeing someone just made something that works like that, you know, you're not going to have to build a custom conda environment and image. I have an AMD GPU, which makes things much more complicated. I didn't find it. And I was about to give up. And I said, all right, let me just give Deepgram's Linux thing a
shot. And if it doesn't work, I'm just gonna go back to trying to vibe code something myself. And when I ran the script, I was using Claude Code to do the installation process. It ran the script and, oh my gosh, it works just like that. The tricky thing, for all those who want to know all the nitty gritty details, was that I don't think it was actually struggling with transcription, but pasting. Wayland makes life very hard. And I think there was something not running at the right time. Anyway, Deepgram, I looked at how they actually handled that, because it worked out of the box when other stuff didn't, and it was quite a clever little mechanism. But more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20 minute audio sample, and I think I've done one or two of these before, but I did it with short, snappy voice notes. This is kind of long form. This actually might be a better approximation for what's useful to me than voice memos, like, I need to buy three liters of milk tomorrow and pita bread, which is probably how, like, half my voice notes sound. Like, if anyone were to, I don't know, like, find my phone, they'd be like, this is the most boring person in the world. Although actually there are some, like, kind of, uh, journaling thoughts as well, but it's a lot of content like that. And probably, for the evaluation, the most useful thing is slightly obscure tech: GitHub, Nucleano, uh, Hugging Face. Not so obscure that it's not going to have a chance of knowing it, but hopefully sufficiently well known that the model should get it. I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general I've spoken, delivered this at a faster pace than I usually would, owing to strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise, which in my first take, that I had to get rid of, my wife came in with my son for a good night kiss. And that
actually would have been super helpful to get in because it was non-diarized or if we had diarization, a female, I could say, I want the male voice and that wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I've done in my main data set where I am trying to go back to some of my voice notes, annotate them and run a benchmark. But this is going to be just a pure, quick test and As someone working on a voice note idea, that's my sort of end motivation, besides thinking it's an absolutely outstanding technology that's coming to viability. And really, I know this sounds cheesy, can actually have a very transformative effect. It's, you know, voice technology has been life changing for folks living with disabilities. And I think there's something really nice about the fact that it can also benefit. you know, folks who are able-bodied and like we can all in different ways make this tech as useful as possible, regardless of the exact way that we're using it. And I think there's something very powerful in that and it can be very cool. I see huge potential. What excites me about voice tech? A lot of things, actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this, and it's getting better and better with stuff like accent handling. I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day, as I imagine. I get like superb, flawless words, error rates, because I'm just kind of skeptical about local speech to text, as I mentioned. 
And I think the pace of innovation and improvement in the models, the main reasons for fine tuning, from what I've seen, have been people who are, something that really blows my mind about ASR is the idea that it's inherently alingual or multilingual, phonetic based. So, as folks who speak very obscure languages, where there might be a paucity of training data, or almost none at all, and therefore the accuracy is significantly reduced. Or folks in very critical environments. I know this is used extensively in medical transcription and dispatcher work, as, um, you know, the call centers who send out ambulances, et cetera, where accuracy is absolutely paramount. And in the case of doctors, radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things. And I'm not sure that, really, just for trying to make it better on a few random tech words with my slightly, I mean, I have an accent, but, like, not, you know, an accent that a few million other people have. I'm not sure that my little fine tune is going to actually, like, the bump in word error reduction, if I ever actually figure out how to do it and get it up to the cloud, by the time I've done that, I suspect that the next generation of ASR will just be so good that it will kind of be, no, well, that would have been cool if it worked out, but I'll just use this instead. So that's going to be it for today's episode of voice training data. Single, long shot evaluation. Who am I going to compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head-to-head with two things, really. One is Whisper variants. So you've got these...
projects like Faster Whisper, uh, Distil-Whisper, it's a bit confusing, there's a whole bunch of them, and the emerging ASRs, which are also a thing. My intention for this is, I'm not sure I'm going to have the time at any point in the foreseeable future to go back through this whole episode and create a proper source of truth where I fix everything. Might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face, I think would be great, probably how I might visualize this, is having the audio waveform play and then have the transcript for each model below it. And maybe even a, like, you know, two scale and maybe even a local one as well, like local Whisper versus OpenAI API, et cetera. And... I can then actually listen back to segments, or anyone who wants to can listen back to segments of this recording, and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER. But that would require the source of truth. Okay, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. You see, I always feel, I think, I've just said something I didn't intend to. STT, I said, for those listening carefully, including hopefully the models themselves. This has been myself, Daniel Rosehill. For more jumbled repositories about my roving interest in AI, but particularly agentic, MCP and voice tech, you can find me on GitHub, Hugging Face, where else? Danielrosehill.com, which is my personal website, as well as this podcast, whose name I sadly cannot remember. Until next time, thanks for listening.
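Once a corrected source of truth exists, the headline word error rate for each model's transcript can be computed with a plain word-level edit distance. A minimal, dependency-free sketch (libraries like jiwer do the same with more normalization):

```python
# Word error rate: word-level Levenshtein distance between a reference
# (the source of truth) and a hypothesis (one model's transcript),
# divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

truth = "i need to buy three liters of milk tomorrow and pita bread"
guess = "i need to buy three liters of milk tomorrow and peter bread"
print(round(wer(truth, guess), 3))  # 0.083
```

In practice you would lowercase and strip punctuation from both sides first, so models are not penalized for formatting rather than recognition.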
data/inference/runs/cloud-stt/manual-2/transcript.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ Hello and welcome to an audio dataset consisting of one single episode of a nonexistent podcast. Or I may append this to a podcast that I set up recently with my thoughts on speech tech and AI. In particular, more AI and generative AI, I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation, as they might say, for different speech tech models. I'm doing this because I thought I'd made a great breakthrough in my journey with speech tech, and that was succeeding in the elusive task of fine tuning Whisper. Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking, whisper something at some points as well. And I'll go back to speaking loud in different parts. I'm going to sound really like a crazy person, because I'm also going to try to speak at different pitches and cadences in order to really try to push a speech to text model through its paces, which is trying to make sense of: is this guy just rambling on incoherently in one long sentence, or are these actually a series of step standalone, standalone, standalone sentences? And how is it going to handle step alone? That's not a word. What happens when you use speech to text and you use a fake word? And then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why was I trying to fine tune Whisper? And what is Whisper? As I said, I'm going to try to record this at a couple of different levels of technicality for folks who are in the normal world and not totally stuck down the rabbit hole of AI, which I have to say is a really wonderful rabbit hole to be down.
It's a really interesting area, and speech and voice tech is the aspect of it that I find actually most, I'm not sure I would say the most interesting, because there's just so much that is fascinating in AI, but the most, that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. I'm persevering hard with the task of trying to get a good solution working for Linux, which, if anyone actually does listen to this, not just for the training data and for the actual content, is sparked. I had, besides the fine tune not working, well, that was the failure. I used Claude Code, because one thinks these days that there is nothing short of solving, you know, the reason of life or something, that Claude and agentic AI can't do, which is not really the case. It does seem that way sometimes, but it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data, basically speaking just random things for three minutes. It was actually kind of tedious because the texts were really weird. Some of them were, it was like it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't, I was so bored after ten minutes that I was like, okay, no, I'm just gonna have to find something else to read. So I created with AI Studio, vibe coded, a synthetic text generator, which actually I thought was probably a better way of doing it, because it would give me more short samples with more varied content. So I was like, okay, give me a voice note like I'm recording an email, give me a short story to read, give me prose to read. So it came up with all these different things, and I added a little timer to it so I could see how close I was to one hour.
And I spent like an hour one afternoon, or probably two hours by the time you do retakes and whatever, because you want to, it gave me a source of truth, which I'm not sure is the scientific way to approach this topic of gathering training data, but I thought made sense. I have a lot of audio data from recording voice notes which I've also kind of used, been experimenting with using for a different purpose. Slightly different: annotating task types. It's more a text classification experiment or, well, it's more than that actually. I'm working on a voice app. So it's a prototype, I guess, is really more accurate. But you can do that and you can work backwards. Listen back to a voice note and you painfully go through one of those transcribing, where you start and stop and scrub around it and you fix the errors, but it's really, really boring to do that. So I thought it would be less tedious in the long term if I just recorded the source of truth. So it gave me these three-minute snippets. I recorded them and saved an MP3 and a TXT in the same folder, and I created an hour of that data. So I was very hopeful, quietly, a little bit hopeful, that I would be able, that I could actually fine tune Whisper. I wanted to fine tune Whisper, because when I got into voice tech last November, my wife was in the US and I was alone at home. And when crazy people like me do really wild things like use voice to tech technology. That was basically when I started doing it, I didn't feel like a crazy person speaking to myself. And my expectations weren't that high. I'd used speech tech now and again, tried it out. I was like, it'd be really cool if you could just like speak into your computer, and whatever I tried out that had Linux support was just, it was not good basically. And this blew me away from the first go.
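The matched MP3 plus TXT "source of truth" pairs described above can be swept into (audio, text) training records with a simple folder scan. A minimal sketch; the file names are hypothetical stand-ins:

```python
from pathlib import Path
import tempfile

# Pair each transcript .txt with the .mp3 of the same name, skipping
# any audio that has no matching source-of-truth text.
def collect_pairs(root):
    root = Path(root)
    pairs = []
    for txt in sorted(root.rglob("*.txt")):
        mp3 = txt.with_suffix(".mp3")
        if mp3.exists():
            pairs.append({"audio": str(mp3), "text": txt.read_text().strip()})
    return pairs

# Demo with throwaway files (hypothetical names):
with tempfile.TemporaryDirectory() as d:
    Path(d, "clip_001.mp3").write_bytes(b"")           # stand-in audio
    Path(d, "clip_001.txt").write_text("read prose")   # its source of truth
    Path(d, "clip_002.mp3").write_bytes(b"")           # unlabeled, skipped
    pairs = collect_pairs(d)
print(len(pairs), pairs[0]["text"])  # 1 read prose
```

Records in this shape drop straight into an audio dataset (for example via the Hugging Face `datasets` library) for Whisper fine-tuning.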
I mean, it wasn't one hundred percent accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. You know, there's a point where it's so like, the transcript is, you don't have to get one hundred percent accuracy for it to be worth your time, for speech to text to be a worthwhile addition to your productivity. But you do need to get above, let's say, I don't know, eighty five percent. If it's sixty percent or fifty percent, you inevitably say, Screw it, I'll just type it. Because you end up missing errors in the transcript and it becomes actually worse. You end up in a worse position than you started with it. That's been my experience. So I was like, Oh, this is actually really, really good now. How did that happen? And the answer is ASR, Whisper being open sourced and the transformer architecture, if you want to go back to the underpinnings, which really blows my mind and it's on my list to read through that paper. All you need is attention as attentively as can be done with my limited brain, because it's super super high level stuff, super advanced stuff, I mean. But that, I think, of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities, I find it fascinating that few people are like, hang on, you've got this thing that can speak to you like a chatbot, an LLM. And then you've got image generation. Okay. So firstly, those two things on the surface have nothing in common. So how did that just happen all at the same time? And then when you extend that further, you're like, Suno. You can sing a song and AI will come up with an instrumental. And then you've got Whisper and you're like, Wait a second. How did all this stuff, if it's all AI, there has to be some commonality. Otherwise, these are totally different technologies on the surface of it. And the transformer architecture is, as far as I know, the answer.
And I can't even say, can't even pretend that I really understand what the transformer architecture means in-depth. But I have scanned this and as I said, I want to print it and really kind of think over it at some point. And I'll probably feel bad about myself, I think, because weren't those guys in their twenties? Like, that's crazy. I think I asked ChatGPT once who wrote that paper and how old were they when it was published on arXiv? And I was expecting like, I don't know, what do you imagine? I personally imagine kind of like, you know, you have these breakthroughs during COVID and things like that, where like these kind of really obscure scientists who are in their 50s and they've just kind of been laboring in labs wearily and writing and publishing in kind of obscure academic publications. And they finally hit it big or win a Nobel Prize and then they're household names. So that was kind of what I had in mind. That was the mental image I'd formed of the birth of ArcSim. Like I wasn't expecting twenty somethings in San Francisco. I thought that was both very funny, very cool, and actually kind of inspiring. It's nice to think that people who just you might put them in the kind of milieu or bubble or world that you are in, or credibly in, through a series of connections, that are coming up with such literally world changing innovations. So that was, I thought anyway, that's, that was cool. Okay. Voice training data. How are we doing? We're about ten minutes in, and I'm still talking about voice technology. So Whisper was brilliant, and I was so excited that my first instinct was to guess, like, Oh my gosh, I have to get a really good microphone for this. So I didn't go on a spending spree because I said, I'm gonna have to just wait a month and see if I still use this. And it just kind of became, it's become really part of my daily routine.
Like if I'm writing an email, I'll record a voice note and then I've developed and it's nice to see that everyone is like developing the same things in parallel. That's kind of a weird thing to say, when I started working on these prototypes on GitHub, which is where I just kind of share very freely and loosely ideas and first iterations on concepts. And for want of a better word, I called it like LLM post processing or clean up or basically a system prompt that after you get back the raw text from Whisper, you run it through a model and say, okay, this is crappy text, like add sentence structure and, you know, fix it up. And now when I'm exploring the different tools that are out there that people have built, I see quite a number of projects have basically done the same thing. Lest that be misconstrued, I'm not saying for a millisecond that I inspired them. I'm sure this has been a thing that's been integrated into tools for a while, but it's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent because text that doesn't have any punctuation or paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that again, moves speech tech into that before that inflection point where you're like, nah, it's just not worth it. It's like, it'll just be quicker to type this. So it's a big, it's a little touch that actually is a big deal. So I was on Whisper and I've been using Whisper and I kind of early on found a couple of tools. I couldn't find what I was looking for on Linux, which is basically just something that'll run in the background. You'll give it an API key and it will just like transcribe with like a little key to start and stop the dictation. And the issues were, I discovered that like most people involved in creating these projects were very much focused on local models, running Whisper locally because you can.
And I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech to text APIs and what I was spending, I just thought it's actually, in my opinion, just one of the better deals in API spending in the cloud. Like, it's just not that expensive for very, very good models that are much more, you know, you're gonna be able to run the full model, the latest model versus whatever you can run on your average GPU unless you want to buy a crazy GPU. It doesn't really make sense to me. Privacy is another concern that I know is kind of like very much a separate thing, that people just don't want their voice data and their voice leaving their local environment, maybe for regulatory reasons as well. But I'm not in that. I don't really care about people listening to my grocery list, consisting of reminding myself that I need to buy more beer, Cheetos, and hummus, which is kind of the three staples of my diet during periods of poor nutrition. But the kind of stuff that I transcribe, it's just not. It's not a privacy thing I'm that sort of sensitive about, and I don't do anything so sensitive or secure that requires air gapping. I looked at the pricing and especially the kind of older models, mini. Some of them are very, very affordable, and I did a calculation once with ChatGPT and I was like, okay, this is the API price for, I can't remember whatever the model was. Let's say I just go at it like nonstop, which rarely happens. Probably, I would say on average I might dictate thirty to sixty minutes per day if I was probably summing up the emails, documents, outlines, which is a lot, but it's still a fairly modest amount. And I was like, well, some days I do go on, like one or two days where I've been, usually when I'm like kind of out of the house and just have something, like I have nothing else to do.
Like if I'm at a hospital, we have a newborn, and you're waiting for like eight hours and hours for an appointment. And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, Oh, wait, let me just get down. Let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say if I'm going to price out cloud STT. If I was like dedicated every second of every waking hour to transcribing for some odd reason, I mean I'd have to eat and use the toilet. There's only so many hours I'm awake for. So let's just say a maximum of forty five minutes in the hour, then I said, All right, let's just say fifty. Who knows? You're dictating on the toilet. We do it. So you could just do sixty, but whatever I did, and every day, like you're going flat out seven days a week dictating nonstop. I was like, What's my monthly API bill going to be at this price? And it came out to like seventy or eighty bucks. And I was like, Well, that would be an extraordinary amount of dictation. And I would hope that there was some compelling reason worth more than seventy dollars that I embarked upon that project. So given that that's kind of the max point for me, I said that's actually very, very affordable. Now, if you want to spec out the costs and you want to do the post processing that I really do feel is valuable, that's going to cost some more as well. Unless you're using Gemini, which, needless to say, as a random person sitting in Jerusalem, I have no affiliation with Google, nor Anthropic, nor Gemini, nor any major tech vendor for that matter. I like Gemini not so much as an everyday model. It's kind of underwhelmed in that respect, I would say. But for multimodal, I think it's got a lot to offer.
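The back-of-the-envelope estimate above is easy to reproduce. The per-minute rate below is an illustrative assumption for a cheap "mini"-tier STT model, not a quoted price; plug in whatever the current API pricing actually is:

```python
# Worst-case monthly cloud STT bill: dictating `minutes_per_hour` out of
# every waking hour, every day of the month, at `rate` dollars per minute.
RATE_PER_MIN = 0.003  # USD/min, hypothetical "mini"-tier price

def monthly_stt_cost(minutes_per_hour=50, waking_hours=16, days=30, rate=RATE_PER_MIN):
    return minutes_per_hour * waking_hours * days * rate

print(round(monthly_stt_cost(), 2))  # flat-out worst case
print(round(monthly_stt_cost(minutes_per_hour=60 / 16), 2))  # a more typical ~60 min/day
```

Even the flat-out scenario lands around the seventy-to-eighty-dollar ballpark discussed above, which is why an average of thirty to sixty minutes of dictation per day barely registers on the bill.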
And I think that the transcribing functionality, whereby it can process audio with a system prompt and both give you a transcription that's cleaned up, that reduces two steps to one. And that for me is a very, very big deal. And I feel like even Google hasn't really sort of thought through how useful that modality is and what kind of use cases you can achieve with it. Because I found in the course of this year just an endless list of really kind of system prompt stuff that I can say, okay, I've used it to capture context data for AI, which is literally, I might speak, if I wanted to have a good bank of context data about, who knows, my childhood. More realistically, maybe my career goals, something that would just be like really boring to type out. So I'll just like sit in my car and record it for ten minutes. And in that ten minutes you get a lot of information in. Emails, which is short text. Just there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards, and different treatment. My context pipeline is kind of like just extract the bare essentials. You end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work. And it goes, it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database. Daniel has worked in technology. Daniel has been working in, you know, stuff like that. That's not how you would speak, but I figure it's probably easier to parse for, after all, robots. So we've almost got to twenty minutes, and this is actually a success, because I wasted twenty minutes of the evening speaking into a microphone and the levels were shot and it was clipping, and I said I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. What am I hoping to achieve in this? Okay, my fine tune was a dud as mentioned.
Deepgram STT, I'm really, really hopeful that this prototype will work, and it's a build-in-public open source project, so anyone is welcome to use it if I make anything good. But that was really exciting for me last night when, after hours of trying my own prototype, I saw someone had just made something that works like that. You're not gonna have to build a custom conda environment and image. I have an AMD GPU, which makes things much more complicated. I didn't find it and I was about to give up, and I said, all right, let me just give Deepgram's Linux thing a shot. And if this doesn't work, I'm just gonna go back to trying to vibe code something myself. And when I ran the script, I was using Claude Code to do the installation process, it ran the script and, oh my gosh, it works just like that. The tricky thing, for all those who want to know all the nitty gritty details, was that I don't think it was actually struggling with transcription, but pasting on Wayland makes life very hard. And I think there was something not running at the right time. Anyway, Deepgram, I looked at how they actually handle that, because it worked out of the box when other stuff didn't. And it was quite a clever little mechanism. But more so than that, the accuracy was brilliant. Now what am I doing here? This is gonna be a twenty minute audio sample. And I think I've done one or two of these before, but I did it with short, snappy voice notes. This is kind of long form. This actually might be a better approximation for what's useful to me than voice memos. Like, I need to buy three liters of milk tomorrow and pita bread, which is probably how half my voice notes sound. Like if anyone were to find my phone they'd be like, this is the most boring person in the world. Although actually there are some journaling thoughts as well, but it's a lot of content like that. 
And probably, for the evaluation, the most useful thing is slightly obscure tech: GitHub, Nucleano, Hugging Face. Not so obscure that it's not gonna have a chance of knowing it, but hopefully sufficiently well known that the model should get it. I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general I've spoken, delivered this, at a faster pace than I usually would, owing to strong coffee flowing through my bloodstream. And the thing that I'm not gonna get in this benchmark is background noise, which, in my first take that I had to get rid of, my wife came in with my son for a good night kiss. And that actually would have been super helpful to get in, because it was non-diarized, or if we had diarization, a female voice, and I could say I want the male voice, and that wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I've done in my main data set, where I am trying to go back to some of my voice notes, annotate them, and run a benchmark. But this is going to be just a pure quick test. And, as someone working on a voice note idea, that's my sort of end motivation, besides thinking it's an absolutely outstanding technology that's coming to viability. And really, I know this sounds cheesy, it can actually have a very transformative effect. Voice technology has been life changing for folks living with disabilities. And I think there's something really nice about the fact that it can also benefit folks who are able-bodied, and we can all in different ways make this tech as useful as possible regardless of the exact way that we're using it. And I think there's something very powerful in that, and it can be very cool. I see huge potential. What excites me about voice tech? A lot of things actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this, and it's getting better and better with stuff like accent handling. 
I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day as I imagine, getting superb, flawless word error rates, because I'm just kind of skeptical about local speech to text, as I mentioned, and I think about the pace of innovation and improvement in the models. The main reasons for fine tuning, from what I've seen, have been, and something that really blows my mind about ASR is the idea that it's inherently alingual or multilingual, phonetic based, folks who speak very obscure languages, where there might be a paucity of training data or almost none at all, and therefore the accuracy is significantly reduced. Or folks in very critical environments. I know this is used extensively in medical transcription and dispatcher work, you know, the call centers who send out ambulances, etc., where accuracy is absolutely paramount, and in the case of doctors and radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things, and I'm not sure that really, just for trying to make it better on a few random tech words with my slightly, I mean, I have an accent, but, like, not, you know, an accent that a few other million people have, ish. I'm not sure that my little fine tune is gonna actually, like, deliver the bump in word error reduction. If I ever actually figure out how to do it and get it up to the cloud, by the time we've done that, I suspect that the next generation of ASR will just be so good that it will kind of be, well, that would have been cool if it worked out, but I'll just use this instead. So that's gonna be it for today's episode of voice training data. Single, long shot evaluation. Who am I gonna compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head to head with two things really. One is Whisper variants. So you've got these projects like Faster Whisper and Distil-Whisper. It's a bit confusing. 
There's a whole bunch of them. And the emerging ASRs, which are also a thing. My intention for this is, I'm not sure I'm gonna have the time at any point in the foreseeable future to go back through this whole episode and create a proper source of truth where I fix everything. I might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face, I think would be great, probably how I might visualize this, is having the audio waveform play and then have the transcript for each model below it, maybe even to scale, and maybe even a local one as well, like local Whisper versus the OpenAI API, etcetera. And I can then actually listen back to segments, or anyone who wants to can listen back to segments of this recording, and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER. But that would require the source of truth. Okay, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. You want to see, I always think I've just said it as something I didn't intend to. STT, I said, for those listening carefully, including hopefully the models themselves. This has been myself, Daniel Rosehill. For more jumbled repositories about my roving interests in AI, but particularly agentic, MCP and voice tech, you can find me on GitHub, Hugging Face, and, where else, danielrosehill.com, which is my personal website, as well as this podcast whose name I sadly cannot remember. Until next time. Thanks for listening.
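The headline WER comparison mentioned above is just word-level edit distance divided by the length of the reference transcript. A minimal self-contained sketch, with a toy reference/hypothesis pair for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance between the
    source-of-truth transcript and a model's output, divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("pita" -> "peter") across six reference words.
print(wer("i need to buy pita bread", "i need to buy peter bread"))
```

In practice a library such as jiwer (which also normalizes case and punctuation) would be used per model against the fixed source of truth, but the metric itself is no more than this.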
data/inference/runs/cloud-stt/manual-3/raw_response.json ADDED
The diff for this file is too large to render. See raw diff
 
data/inference/runs/cloud-stt/manual-3/transcript.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast. Or I may append this to a podcast that I set up recently regarding my with my thoughts on speech tech and AI in particular, more AI in generative AI, I would say. But in any event, the purpose of this Voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation, as they might say, for different speech attack models. And I'm doing this because I thought I had made a great breakthrough in my journey with speech tech, and that was succeeding in the elusive task of fine-tuning Whisper. Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking. I might whisper something at some point. As well. And I'll go back to speaking loud in, in different parts. I'm going to sound really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to put a speech attacks model through its paces, which is trying to make sense of is this guy just rambling on incoherently in one long sentence or are these just actually a series of step, standalone, step alone, standalone sentences? And how is it gonna handle step alone? That's not a word. What happens when you use speech to text and you use a fake word? And then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why was it trying to fine tune Whisper? And what is Whisper? As I said, I'm going to try to record this at a couple of different levels of technicality for folks who are, you know, in the normal world and not totally stuck down the rabbit hole of AI, which I have to say is a really wonderful rabbit hole to be down. 
It's a really interesting area and speech and voice tech is the aspect of it that I find actually the most, I'm not sure I would say the most interesting because there's just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I'm persevering hard with the task of trying to get a good solution working for Linux, which if anyone actually does listen to this, not just for the training data and for the actual content, this is sparked I had, besides the fine tune not working, well, that was the failure. Um, I used Claude code because one thinks these days that there is nothing short of solving, you know, the, the reason of life or something, that Claude and agentic AI can't do, which is not really the case. Uh, it does seem that way sometimes, but it fails a lot as well. And this is one of those, instances where last week I put together an hour of voice training data, basically speaking, just random things for 3 minutes. And it was actually kind of tedious because the texts were really weird. Some of them were it was like it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't. I was so bored after 10 minutes that I was like, okay, no, I'm just going to have to find something else to read. So I used a created with AI studio vibe coded a synthetic text generator. Which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note, like I'm recording an email, give me a short story to read, give me prose to read. So I came up with all these different things and they added a little timer to it so I could see how close I was to one hour. And I spent like an hour one afternoon or probably two hours by the time you you do retakes. 
And whatever, because you want to, it gave me a source of truth, which I'm not sure if that's the scientific way to approach this topic of gathering, training data, but I thought made sense. Um, I have a lot of audio data from recording voice notes, which I've also kind of used, been experimenting with using for a different purpose, slightly different annotating task types. It's more a text classification experiment or, Well, it's more than that actually. I'm working on a voice app. So it's a prototype, I guess, is really more accurate. But you can do that and you can work backwards. You're like, you listen back to a voice note and you painfully go through one of those transcribing, you know, where you start and stop and scrub around it and you fix the errors, but it's really, really boring to do that. So I thought it would be less tedious in the long term if I just recorded the source of truth. So it gave me these three minute snippets. I recorded them. It saved an MP3 and a TXT in the same folder, and I created an error with that data. So I was very hopeful, quietly, a little bit hopeful that I could actually fine tune Whisper. I want to fine tune Whisper because when I got into Voicetech last November, my wife was in the US and I was alone at home. And when crazy people like me do really wild things like use voice to tech technology. That was basically when I started doing it, I didn't feel like a crazy person speaking to myself. And my expectations weren't that high. I used speech tech now and again, tried it out. It was like, it'd be really cool if you could just, like, speak into your computer. And whatever I tried out that had Linux support was just. It was not good, basically. And this blew me away from the first go. I mean, it wasn't 100% accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. 
You know, there's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for speech attacks to be a worthwhile addition to your productivity, but you do need to get above, let's say, I don't know, 85%. If it's 60% or 50%, you inevitably say, screw it, I'll just type it because you end up missing errors in the transcript and it becomes actually worse. You end up in a worse position than you started with. That's been my experience. So I was like, oh, this is actually really, really good now. How did that happen? And the answer is ASR whisper being open source and the transformer architecture. If you want to go back to the to the underpinnings, which really blows my mind and it's on my list. To read through that paper. All you need is attention as attentively as can be done with my limited brain because it's super, super high level stuff, super advanced stuff, I mean. But that, I think of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities. I find it fascinating that a few people are like, hang on, you've got this thing that can speak to you, like a chatbot, an LLM, and then you've got image generation. Okay, so firstly, those two things on the surface have nothing in common. So like, how are they, how did that just happen all at the same time? And then when you extend that further, you're like, Suno, right? You can sing a song and AI will come up with and instrumental. And then you've got Whisper and you're like, wait a second, how did all this stuff, like, if it's all AI, what's like, there has to be some commonality. Otherwise, these are totally different technologies on the surface of it. And the Transformer architecture is, as far as I know, the answer. And I can't even say, can't even pretend that I really understand what the Transformer architecture means. 
In depth, but I have scanned it and as I said, I want to print it and really kind of think over it at some point. And I'll probably feel bad about myself, I think, because weren't those guys in their 20s? Like, that's crazy. I think I asked ChatGPT once who wrote that paper and how old were they when it was published in Arciv? And I was expecting, like, I don't know, What do you imagine? I personally imagine kind of like, you know, you have these breakthroughs during COVID and things like that where like these kind of really obscure scientists are like in their 50s and they've just kind of been laboring in labs and wearily in writing and publishing in kind of obscure academic publications. And they finally like hit a big or win a Nobel Prize and then their household names. So that was kind of what I had in mind. That was the mental image I'd formed of the birth of Arcsight. Like I wasn't expecting 20-somethings in San Francisco, though. I thought that was both very, very funny, very cool, and actually kind of inspiring. It's nice to think that people who, you know, just you might put them in the kind of milieu or bubble or world that you are in are credibly in through, you know, the series of connections that are coming up with such literally world changing innovations. So that was, I thought, anyway. That's that was cool. Okay, voice training data. How are we doing? We're about 10 minutes and I'm still talking about voice technology. So Whisper was brilliant and I was so excited that I was my first instinct was to like guess like, oh my gosh, I have to get like a really good microphone for this. So I didn't go on a spending spree because I said, I'm gonna have to just wait a month and see if I still use this. And It just kind of became, it's become really part of my daily routine. Like if I'm writing an email, I'll record a voice note. And then I've developed and it's nice to see that everyone is like developing the same things in parallel. 
Like that's my kind of a weird thing to say, but when I look, I kind of came, when I started working on this, these prototypes on GitHub, which is where I just kind of share very freely and loosely, ideas and first iterations on concepts. And for want of a better word, I called it like LLM post-processing or cleanup or basically a system prompt that after you get back the raw text from Whisper, you run it through a model and say, okay, this is crappy text, like add sentence structure and fix it up. And now when I'm exploring the different tools that are out there that people have built, I see quite a number of projects have basically done the same thing, lest that be misconstrued. I'm not saying for a millisecond that I inspired them. I'm sure this has been a thing that's been integrated into tools for a while, but it's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent because text that doesn't have any punctuation or Paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that again, it's, it's, it, it moves speech tech into that before that inflection point where you're like, no, it's just not worth it. It's like, it's, it'll just be quicker to type this. So it's a big, it's a little touch that actually is a big deal. Uh, so I was on Whisper and I've been using Whisper and I kind of, early on found a couple of tools. I couldn't find what I was looking for on Linux, which is basically just something that'll run in the background. It'll give it an API key and it will just like transcribe with like a little key to start and stop the dictation. And the issues were I discovered that like most people involved in creating these projects were very much focused on local models, running Whisper locally because you can. And I tried that a bunch of times and just never got results that were as good as the cloud. 
And when I began looking at the cost of the speech to text APIs and what I was spending, I just thought there is, it's actually, in my opinion, just one of the better deals in API spending and in cloud. Like it's just not that expensive for very, very good models that are much more, you know, you're gonna be able to run the full model. The latest model versus whatever you can run on your average GPU, unless you want to buy a crazy GPU. It doesn't really make sense to me. Now, privacy is another concern that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment, maybe for regulatory reasons as well. But I'm not in that. I neither really care about people listening to my grocery list consisting of reminding myself that I need to buy more beer, Cheetos, and hummus, which is kind of the three staples of my diet during periods of poorer nutrition. But the kind of stuff that I transcribe, it's just not, it's not a privacy thing I'm that sort of sensitive about and I don't do anything so sensitive or secure that requires air gapping. So I looked at the pricing and especially the kind of older model mini Some of them are very, very affordable. And I did a back of the, I did a calculation once with ChatGPT and I was like, okay, this is the API price for I can't remember whatever the model was. Let's say I just go at it like nonstop, which it rarely happens. Probably, I would say on average, I might dictate 30 to 60 minutes per day if I was probably summing up the emails, documents, outlines, which is a lot, but it's still a fairly modest amount. And I was like, Some days I do go on like one or two days where I've been usually when I'm like kind of out of the house and just have something like I have nothing else to do. Like if I'm at a hospital, we have a newborn and you're waiting for like eight hours and hours for an appointment. 
And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, oh, wait, let me just get down. Let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say if I'm gonna price out Cloud SCT, if I was like dedicated every second of every waking hour to transcribing for some odd reason, I mean, I'd have to like eat and use the toilet. Like, you know, there's only so many hours I'm awake for. So like, let's just say a maximum of like 40 hour, 45 minutes. In the hour. Then I said, all right, let's just say 50. Who knows? You're dictating on the toilet. We do it. So it could be. You could just do 60. But whatever I did. And every day, like, you're going flat out seven days a week dictating non-stop I was like, what's my monthly API bill gonna be at this price? And it came out to, like, 70 or 80 bucks. And I was like, well, that would be an extraordinary. Amount of dictation. And I would hope that there was some compelling reason more worth more than $70 that I embarked upon that project. So given that that's kind of the max point for me, I said that's actually very, very affordable. Now you're gonna, if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable, that's gonna cost some more as well, unless you're using Gemini, which needless to say is a random person sitting in Jerusalem. I have no affiliation, nor with Google, nor anthropic, nor Gemini, nor any major tech vendor for that matter. I like Gemini not so much as a everyday model. It's kind of underwhelmed in that respect, I would say. But for multimodal, I think it's got a lot to offer. And I think that the transcribing functionality whereby it can process audio with a system prompt and both give you transcription that's cleaned up that reduces two steps to one. And that for me is a very, very big deal. 
And I feel like even Google has haven't really sort of thought through how useful the that modality is and what kind of use cases you can achieve with it. Because I found in the course of this year, just an endless list of really kind of system prompt system prompt stuff that I can say, okay, I've used it to capture context data for AI, which is literally I might speak for if I wanted to have a good bank of context data about who knows my childhood more realistically, maybe my career goals, something that would just be like really boring to type out. So I'll just like sit in my car and record it for 10 minutes. And that 10 minutes you get a lot of information in. Um, emails, which is short text, just there is a whole bunch and all these workflows kind of require a little bit of treatment afterwards and different treatment. My context pipeline is kind of like just extract the bare essentials. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work. And it goes, it condenses that down to very robotic language that is easy to chunk parse and maybe put into a vector database. Daniel has worked in technology. Daniel has been working in, you know, stuff like that. That's not how you would speak, but I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes and this is actually a success because I wasted 20 minutes of the evening speaking into a microphone and the levels were shot and it was clipping and I said, I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. What am I hoping to achieve in this? Okay, my fine tune was a dud as mentioned. DeepChrom ST, I'm really, really hopeful that this prototype will work and it's a build in public open source, so anyone is welcome to use it if I make anything good. 
But that was really exciting for me last night when after hours of trying my own prototype, seeing someone just made something that works like that, you know, you're not gonna have to build a custom conda environment and image. I have AMD GPU, which makes things much more complicated. I didn't find it. And I was about to give up and I said, all right, let me just give Deep Grams Linux thing a shot. And if this doesn't work, I'm just going to go back to trying to Vibe code something myself. And when I ran the script, I was using Claude code to do the installation process. It ran the script and oh my gosh, it works just like that. The tricky thing For all those who want to know all the nitty gritty details, was that I don't think it was actually struggling with transcription, but pasting Wayland makes life very hard. And I think there was something not running the right time. Anyway, Deepgram, I looked at how they actually handled that because it worked out of the box when other stuff didn't. And it was quite a clever little mechanism. And but more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20 minute audio sample. And I think I've done one or two of these before, but I did it with short snappy voice notes. This is kind of long form. This actually might be a better approximation for what's useful to me than voice memos. Like, I need to buy three Bread, eaters of milk tomorrow and Peter bread, which is probably how like half my voice notes sound. Like if anyone were to, I don't know, like find my phone, they'd be like, this is the most boring person in the world. Although actually, there are some like kind of journaling thoughts as well, but it's a lot of content like that. And the probably for the evaluation, the most useful thing is slightly obscure tech, GitHub, NeocleNo, hugging face, Not so obscure that it's not going to have a chance of knowing it, but hopefully sufficiently well known that the model should get it. 
I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general, I've spoken, delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise, which in my first take that I had to get rid of, My wife came in with my son and for a goodnight kiss. And that actually would have been super helpful to get in because it was non diarized or if we had diarization, a female, I could say, I want the male voice and that wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I've done in my main data set where I am trying to go back to some of my voice notes. Annotate them and run a benchmark. But this is going to be just a pure quick test. And as someone, I'm working on a voice note idea. That's my sort of end motivation. Besides thinking it's an ask to the outstanding technology that's coming to viability. And really, I know this sounds cheesy, can actually have a very transformative effect. It's, you know, voice technology has been life changing for folks living with disabilities. And I think there's something really nice about the fact that it can also benefit, you know, folks who are able bodied and like we can all in different ways make this tech as useful as possible, regardless of the exact way that we're using it. And I think there's something very powerful in that and it can be very cool. I see huge potential. What excites me about Voicetech? A lot of things actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this. And it's getting better and better with stuff like accent handling. I'm not sure my fine-tune will actually ever come to fruition in the sense that I'll use it day to day as I imagine. 
I get like superb flawless words error rates because I'm just kind of skeptical about Local speech to text, as I mentioned, and I think the pace of innovation and improvement in the models, the main reasons for fine tuning from what I've seen have been people who are something that really blows my mind about ASR is the idea that it's inherently a lingual or multilingual phonetic based. So as folks who use speak very obscure languages, that there might be a paucity of training data or almost none at all, and therefore the accuracy is significantly reduced. Or folks in very critical environments, I know this is used extensively in medical transcription and dispatcher work, the call centers who send out ambulances, et cetera, where accuracy is absolutely paramount. And in the case of doctors, radiologist, they might be using very specialized vocab all the time. So those are kind of the main two things that I'm not sure that really just for trying to make it better on a few random tech words with my slightly, I mean, I have an accent, but like not, you know, an accent that a few other million people have ish. I'm not sure that my little fine tune is gonna actually like the bump in word error reduction, if I ever actually figure out how to do it and get it up to the cloud. By the time we've done that, I suspect that the next generation of ASR will just be so good that it will kind of be, well, that would have been cool if it worked out, but I'll just use this instead. So that's going to be it for today's episode of voice training data. Single long shot evaluation. Who am I going to compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head to head with two things, really. One is Whisper variants. So you've got these projects like faster Distill Whisper, it's a bit confusing, there's a whole bunch of them. And the emerging ASRs, which are also a thing. 
My intention for this is I'm not sure I'm going to have the time in any point in the foreseeable future to go back through this whole episode and create a proper source truth, where I fix everything. Might do it if I can get one transcriptions that sufficiently close to perfection. But what I would actually love to do on Hugging Face, I think would be a great probably how I might visualize this is having the audio waveform play and then have the transcript for each model below it and maybe even a like, you know, to scale and maybe even a local one as well, like local whisper versus OpenAI API, et cetera. And, I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER, but that would require the source of truth. Okay, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. You want to see that I always feel think I've just said as something I didn't intend to. STT, I said for those. Listen carefully, including hopefully the models themselves. This has been myself, Daniel Rosell. For more jumbled repositories about my roving interests in AI, but particularly agentic, MCP and Voicetech, you can find me on GitHub, huggingface.com, which is my personal website, as well as this podcast, whose name I sadly cannot remember. Until next time, thanks for listening.
data/inference/runs/cloud-stt/manual-4/transcript.txt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast. Or it, uh, I may append this to a podcast that I set up recently. Um, regarding my, uh, with my thoughts on speech, tech and AI in particular, more AI and generative AI, I would, uh, I would say, but in any event, the purpose of this, um, voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation, as they might say, for different speech to text models. And I'm doing this because I, uh, I thought I'd made a great breakthrough in my journey with speech tech, and that was succeeding in the elusive task of fine tuning. Whisper, whisper is. And I'm going to just talk. I'm trying to mix up, uh, I'm going to try a few different styles of speaking. I might whisper something at some point as well, and I'll go back to speaking loud in, uh, in different parts. I'm going to sound really like a crazy person, because I'm also going to try to speak at different pitches and cadences in order to really try to put a speech to text model through its paces, which is trying to make sense of, is this guy just on incoherently in one long sentence, or are these just actually a series of step standalone, standalone, standalone sentences? And how is it going to handle step alone? That's not a word. Uh, what happens when you use speech to text and you use a fake word and then you're like, wait, that's not actually that word doesn't exist. How does AI handle that? And, uh, these and more are all the questions that I'm seeking to answer in this training data. Now, why did why was it trying to fine tune a whisper? And what is whisper? As I said, I'm gonna try to, uh, record this at a couple of different levels of technicality for folks who are, uh, you know, in the normal, uh, world and not totally stuck down the rabbit hole of AI, uh, which I have to say is a really wonderful, uh, rabbit hole to be to be down. 
Um, it's a really interesting area. And speech and voice tech is is the aspect of it that I find actually most. I'm not sure I would say the most interesting, because there's just so much that is fascinating in AI. Uh, but the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I'm persevering hard with the task of trying to guess a good solution working for Linux, which if anyone actually does listen to this, not just for the training data and for the actual content, uh, this is this is has sparked I had besides the fine tune not working. Well, that was the failure. Um, I used clod code because one thinks these days that there is nothing short of solving, you know, the, uh, the reason of life or something. Uh, that clod and agentic AI can't do, uh, which is not really the case. Uh, it does seem that way sometimes, but it fails a lot as well. And this is one of those, uh, instances where last week I put together an hour of voice training data, basically speaking just random things for three minutes. And, um, it was actually kind of tedious because the texts were really weird. Some of them were it was like it was AI generated. Um, I tried before to read Sherlock Holmes for an hour and I just couldn't. I was so bored, uh, after ten minutes that I was like, okay, now I'm just gonna have to find something else to read. So I used a created with AI studio vibe coded. A synthetic text generator. Um, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note, like I'm recording an email, give me a short story to read, give me prose, um, to read. So I came up with all these different things, and I added a little timer to it so I could see how close I was to one hour. 
Um, and, uh, I spent like an hour one afternoon or probably two hours by the time you, um, you do retakes or whatever because you want to. It gave me a source of truth, which I'm not sure if that's the scientific way to approach this topic of gathering, uh, training data, but I thought it made sense. Um, I have a lot of audio data from recording voice notes, which I've also kind of used, um, been experimenting with using for a different purpose, slightly different annotating task types. It's more text classification experiment or uh, well, it's more than that, actually. I'm working on a voice app, so it's a prototype I guess is really more accurate. Um, but you can do that and you can work backwards. You're like, you listen back to a voice note and you painfully go through one of those transcribing, you know, where you start and stop and scrub around it and you fix the errors. But it's really, really boring to do that. So I thought it would be less tedious in the long term if I just recorded The Source of truth. So it gave me these three minute snippets. I recorded them and saved an MP3 and a txt in the same folder, and I created an hour of that data. Uh, so I was very hopeful, quietly, you know, a little bit hopeful that I would be able that I could actually fine tune, whisper. Um, I want to fine tune whisper because when I got into voice tech last November, my wife was in the US and I was alone at home. And you know, when crazy people like me do really wild things like use voice to tech, uh, technology. That was basically, um, when I started doing it, I didn't feel like a crazy person speaking to myself, and my expectations weren't that high. Uh, I used speech tech now and again. Um, tried it out. I was like, it'd be really cool if you could just, like, speak into your computer. And whatever I tried out that had Linux support was just. It was not good, basically. Um, and this blew me away from the first go. 
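The MP3-plus-txt folder layout described above maps naturally onto a fine-tuning manifest. A sketch of the pairing step, assuming one same-stem `.txt` source of truth per `.mp3`; the JSONL field names are illustrative, not taken from any particular training script:

```python
import json
from pathlib import Path

def build_manifest(folder: str, out_path: str) -> int:
    """Pair each .mp3 with the same-stem .txt and write one JSONL record
    per sample; returns how many pairs were written."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for mp3 in sorted(Path(folder).glob("*.mp3")):
            txt = mp3.with_suffix(".txt")
            if not txt.exists():
                continue  # orphan recording with no source-of-truth text
            record = {"audio": str(mp3), "text": txt.read_text(encoding="utf-8").strip()}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```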
I mean, it wasn't 100% accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that, uh, pivot point that it's actually worth doing this. You know, there's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for speech to text to be a worthwhile addition to your productivity. But you do need to get above. Let's say, I don't know, 85%. If it's 60% or 50%, you inevitably say, screw it. I'll just type it because you end up missing errors in the transcript and it becomes actually worse. You end up in a worse position than you started with. And that's been my experience. So, um, I was like, oh, this is actually really, really good. Now how did that happen? And the answer is ASR whisper being open sourced and the transformer architecture, if you want to go back to the, um, to the underpinnings, which really blows my mind and it's on my list to read through that paper. Um, all you need is attention as attentively as can be done with my limited brain because it's super, super high level stuff. Um, super advanced stuff. I mean, uh, but that I think of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities. I find it fascinating that few people are like, hang on, you've got this thing that can speak to you like a chatbot, an LLM, and then you've got image generation. Okay, so firstly, those two things on the surface have nothing in common. Um, so like how are they how did that just happen all at the same time. And then when you extend that further, um, you're like sooner, right? You can sing a song and AI will like, come up with an instrumental and then you've got whisper and you're like, wait a second, how did all this stuff, like, if it's all AI, what's like there has to be some commonality. Otherwise these are four. These are totally different technologies on the surface of it. 
And, uh, the transformer architecture is, as far as I know, the answer. And I can't even say can't even pretend that I really understand what the transformer architecture means in depth, but I have scanned it and as I said, I want to print it and really kind of think over it at some point, and I'll probably feel bad about myself, I think, because weren't those guys in their in their 20s like, that's crazy. I think I asked ChatGPT once who were the who wrote that paper and how old were they when it was published in arXiv? And I was expecting like, I don't know, what do you what do you imagine? I personally imagine kind of like, you know, you have these breakthroughs during Covid and things like that where like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring in labs and, uh, wearily and writing in publishing in kind of obscure academic publications. And they finally, like, hit a big or win a Nobel Prize and then their household household names. Uh, so that was kind of what I had in mind. That was the mental image I'd formed of the birth of arXiv. Like, I wasn't expecting 20 somethings in San Francisco, though I thought that was both very, very funny, very cool, and actually kind of inspiring. It's nice to think that people who, you know, just you might put them in the kind of milieu or bubble or world that you are in or credibly in, through, you know, a series of connections that are coming up with such literally world changing, um, innovations. Uh, so that was, I thought, anyway, that, that that was cool. Okay. Voice training data. How are we doing? We're about ten minutes, and I'm still talking about voice technology. Um, so whisper was brilliant, and I was so excited that I was. My first instinct was to, like, get like, oh, my gosh, I have to get, like, a really good microphone for this. So, um, I didn't go on a spending spree because I said, I'm gonna have to just wait a month and see if I still use this. 
And it just kind of became it's become really part of my daily routine. Like, if I'm writing an email, I'll record a voice note. And then I've developed and it's nice to see that everyone is like developing the same
2
+ things in parallel. Like, that's kind of a weird thing to say, but when I look, I kind of came when I started working on this, these prototypes on GitHub, which is where I just kind of share very freely and loosely, uh, ideas and, you know, first iterations on, on concepts, um, and for want of a better word, I called it like, uh, lm post-processing or cleanup or basically a system prompt that after you get back the raw text from whisper, you run it through a model and say, okay, this is crappy text, like add sentence structure and, you know, fix it up. And, um, now when I'm exploring the different tools that are out there that people have built, I see, uh, quite a number of projects have basically done the same thing, um, less that be misconstrued. I'm not saying for a millisecond that I inspired them. I'm sure this has been a thing that's been integrated into tools for a while, but it's it's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent, uh, because text that doesn't have any punctuation or paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that again, it's it's it moves speech tech into that before that inflection point where you're like, no, it's just not worth it. It's like it'll just be quicker to type this. So it's a big it's a little touch. That actually is a big deal. Uh, so I was on whisper and I've been using whisper and I kind of early on found a couple of tools. I couldn't find what I was looking for on Linux, which is, um, basically just something that'll run in the background. You'll give it an API key and it will just transcribe. Um. with, like, a little key to start and stop the dictation. Uh, and the issues were I discovered that, like most people involved in creating these projects were very much focused on local models running whisper locally, because you can. 
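The LLM post-processing step described above is essentially one system prompt wrapped around the raw STT output. A sketch, with the prompt wording and model name as illustrative assumptions, not taken from this repo:

```python
CLEANUP_SYSTEM_PROMPT = (
    "You are a transcript cleanup assistant. The user message is raw "
    "speech-to-text output. Add punctuation, sentence structure and "
    "paragraph breaks. Do not add or remove information."
)

def build_cleanup_messages(raw_transcript: str) -> list:
    """Chat payload for the second pass: raw STT text in, tidy text out."""
    return [
        {"role": "system", "content": CLEANUP_SYSTEM_PROMPT},
        {"role": "user", "content": raw_transcript},
    ]

# With the official OpenAI SDK the call would look roughly like:
#   client = openai.OpenAI()
#   resp = client.chat.completions.create(
#       model="gpt-4o-mini",  # hypothetical choice
#       messages=build_cleanup_messages(raw_text),
#   )
#   cleaned = resp.choices[0].message.content
```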
And I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech to text APIs and what I was spending, I just thought there's it's actually, in my opinion, just one of the better deals in API spending and in cloud. Like it's just not that expensive for very, very good models that are much more, you know, you're going to be able to run the full model, the latest model versus whatever you can run on your average GPU. Unless you want to buy a crazy GPU. It doesn't really make sense to me. Now, privacy is another concern. Um, that I know is kind of like a very much a separate thing that people just don't want their voice, data, and their voice leaving their local environment, maybe for regulatory reasons as well. Um, but I'm not in that. Um, I'm neither really care about people listening to my, uh, grocery list consisting of, uh, reminding myself that I need to buy more beer, Cheetos and hummus, which is kind of the three, three staples of my diet. Um, during periods of poor nutrition. Uh, but the kind of stuff that I transcribe, it's just not it's not a, it's not a privacy thing and that sort of sensitive about and, uh, I don't do anything so, you know, sensitive or secure, that requires air gapping. So, um, I looked at the pricing and especially the kind of older models, mini, um, some of them are very, very affordable. And I did a back of the I did a calculation once with ChatGPT and I was like, okay, this is a, this is the API price for I can't remember whatever the model was. Uh, let's say I just go at it like nonstop, which it rarely happens. Probably. I would say on average, I might dictate 30 to 60 minutes per day if I was probably summing up the emails, documents, outlines, um, which is a lot, but it's it's still a fairly modest amount. And I was like, well, some days I do go on like 1 or 2 days where I've been. 
Usually when I'm like kind of out of the house and just have something like, I have nothing else to do. Like if I'm at a hospital with a newborn, uh, and you're waiting for like eight hours and hours for an appointment, and I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, oh, wait, let me just get down. Let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say if I'm gonna price out. Cloud asked if I was like, dedicated every second of every waking hour to transcribing for some odd reason. Um. I mean, it'd have to, like, eat and use the toilet and, like, you know, there's only so many hours I'm awake for. So, like, let's just say a maximum of, like, 40 hours, 45 minutes in the hour. Then I said, all right, let's just say 50. Who knows? You're dictating on the toilet. We do it. Uh, so it could be you could just do 60. But whatever I did, and every day, like, you're going flat out seven days a week dictating non-stop. I was like, what's my monthly API bill going to be at this price? And it came out to like 70 or 80 bucks. And I was like, well, that would be an extraordinary amount of dictation. And I would hope that there was some compelling reason, more worth more than $70, that I embarked upon that project. Uh, so given that that's kind of the max point for me, I said, that's actually very, very affordable. Um, now you're gonna if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable. Um, that's going to cost some more as well, unless you're using Gemini, which, uh, needless to say, is a random person sitting in Jerusalem. Uh, I have no affiliation, nor with Google, nor anthropic, nor Gemini, nor any major tech vendor for that matter. Um, I like Gemini. Not so much as a everyday model. Um, it's kind of underwhelmed in that respect, I would say. 
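The back-of-the-envelope bill described above is plain arithmetic. A sketch with hypothetical numbers; the $0.006/minute figure matches OpenAI's published whisper-1 rate at the time of writing, but treat both inputs as assumptions:

```python
def monthly_stt_cost(minutes_per_day: float, price_per_minute: float, days: int = 30) -> float:
    """Back-of-the-envelope monthly cloud speech-to-text bill."""
    return minutes_per_day * days * price_per_minute

# ~400 spoken minutes a day, flat out, lands near the $70-80/month
# ceiling mentioned above; a realistic 30-60 min/day is an order of
# magnitude cheaper.
flat_out = monthly_stt_cost(400, 0.006)
realistic = monthly_stt_cost(60, 0.006)
```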
But for multimodal, I think it's got a lot to offer. And I think that the transcribing functionality whereby it can, um, process audio with a system prompt and both give you transcription that's cleaned up, that reduces two steps to one. And that for me is a very, very big deal. And, uh, I feel like even Google has haven't really sort of thought through how useful the that modality is and what kind of use cases you can achieve with it. Because I found in the course of this year just an endless list of really kind of system prompt, system prompt stuff that I can say, okay, I've used it to capture context data for AI, which is literally I might speak for if I wanted to have a good bank of context data about, who knows, my childhood. Uh, more realistically, maybe my career goals, uh, something that would just be, like, really boring to type out. So I'll just, like, sit in my car and record it for ten minutes. And that ten minutes, you get a lot of information in, um, emails, which is short text. Um, just there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards and different treatment. My context pipeline is kind of like just extract the bare essentials. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work, and it goes it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database. Daniel has worked in technology, Daniel is a has been working in, you know, stuff like that. That's not how you would speak. Um, but I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes. And this is actually a success because I wasted 20 minutes of my, uh, of the evening speaking into a microphone, and, uh, the levels were shot and, uh, it, uh, it was clipping and I said, I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. 
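The single-step "transcribe and clean up" modality described above can be sketched with the google-generativeai SDK. The model name, prompt wording, and environment variable are assumptions for illustration; the SDK import is deferred so the snippet loads without it installed:

```python
import os

# Hypothetical one-step prompt; not taken from this repo.
TRANSCRIBE_PROMPT = (
    "Transcribe this audio, then add punctuation and paragraph breaks. "
    "Return only the cleaned transcript."
)

def transcribe_with_gemini(audio_path: str, model_name: str = "gemini-1.5-flash") -> str:
    """One network call that both transcribes and applies the cleanup prompt."""
    import google.generativeai as genai  # pip install google-generativeai
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    audio = genai.upload_file(audio_path)
    model = genai.GenerativeModel(model_name)
    return model.generate_content([TRANSCRIBE_PROMPT, audio]).text
```

Collapsing transcription and post-processing into one call is the two-steps-to-one reduction mentioned above.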
Uh, what am I hoping to achieve in this? Okay, my fine tune was a dud, as mentioned Deepgram SVT. I'm really, really hopeful that this prototype will work. And it's a built in public open source, so anyone is welcome to use it if I make anything good. Um, but that was really exciting for me last night when after hours of, um, trying my own prototype, seeing someone just made something that works like that. You know, you're not going to have to build a custom conda environment and image. I have AMD GPU, which makes things much more complicated. I didn't find it and I was about to give up and I said, all right, let me just give deep grams Linux thing a shot. And if this doesn't work, um, I'm just going to go back to trying to code something myself. And when I ran the script, I was using cloud code to do the installation process. It ran the script and oh my gosh, it works just like that. Uh, the tricky thing for all those who wants to know all the nitty gritty, nitty gritty details, um, was that I don't think it was actually struggling with transcription, but pasting Wayland makes life very hard, and I think there was something not running in the right time anyway. Deepgram I looked at how they actually handle that because it worked out of the box when other stuff didn't, and it was quite a clever little mechanism, and but more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20 minute audio sample, and I'm I think I've done 1 or 2 of these before, but I did it with short, snappy voice notes. This is kind of long form. This actually might be a better approximation for what's useful to me than voice memos. Like I need to buy three liters of milk tomorrow, and pita bread, which is probably how like half my voice voice notes sound like if anyone were to, I don't know, like find my phone, they'd be like, this is the most boring person in the world. Although actually there are some like kind of, uh, journaling thoughts as well. 
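A Deepgram prerecorded transcription call, for reference, looks roughly like the sketch below: raw audio bytes POSTed to `/v1/listen` with a `Token` auth header. The model name and response-path are as I recall Deepgram's docs; verify both against the current API reference:

```python
import json
import os
import urllib.request

def listen_url(model: str = "nova-2") -> str:
    """Deepgram's prerecorded endpoint; the model name is an assumption."""
    return f"https://api.deepgram.com/v1/listen?model={model}&smart_format=true"

def transcribe_with_deepgram(audio_path: str, model: str = "nova-2") -> str:
    """POST the file's bytes and pull the first alternative's transcript."""
    with open(audio_path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        listen_url(model),
        data=data,
        headers={
            "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
            "Content-Type": "audio/mpeg",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["results"]["channels"][0]["alternatives"][0]["transcript"]
```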
But it's a lot of content like that. And the probably for the evaluation, the most useful thing is slightly obscure tech GitHub uh, hugging face not so
3
+ obscure that it's not going to have a chance of knowing it, but hopefully sufficiently well known that the model should get it. I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general, I've spoken, delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise, which in my first take that I had to get rid of, my wife came in with my son and for a good night kiss. And that actually would have been super helpful to get in because it was not diarised. Or if we had diarisation a female, I could say I want the male voice and that wasn't intended for transcription. Um, and we're not going to get background noise like people honking their horns, which is something I've done in my main data set where I am trying to go back to some of my voice notes, annotate them, and run a benchmark. But this is going to be just a pure quick test. And as someone I'm working on a voice note idea, that's my sort of end motivation. Besides thinking it's an absolutely outstanding technology that's coming to viability. And really, I know this sounds cheesy can actually have a very transformative effect. Um, it's, you know, voice technology has been life changing for, uh, folks living with, um, disabilities. And I think there's something really nice about the fact that it can also benefit, you know, folks who are able bodied and like, we can all in different ways, um, make this tech as useful as possible, regardless of the exact way that we're using it. Um, and I think there's something very powerful in that, and it can be very cool. Um, I see use potential. What excites me about voice tech? A lot of things, actually. 
Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this, um, and it's getting better and better with stuff like accent handling, um, I'm not sure my, my fine tune will actually ever come to fruition in the sense that I'll use it day to day, as I imagine I get like superb, flawless word error rates because I'm just kind of skeptical about local speech to texts, as I mentioned. And I think the pace of innovation and improvement in the models, the main reasons for fine tuning from what I've seen have been people who are something that really blows, blows my mind about ASR is the idea that it's inherently a lingual or multilingual phonetic based. So as folks who use speak very obscure languages that there may be there might be a paucity of training data or almost none at all, and therefore the accuracy is significantly reduced or folks in very critical environments. I know there are. This is used extensively in medical transcription and dispatcher work as, um, you know, the call centers who send out ambulances, etc., where accuracy is absolutely paramount. And in the case of doctors, radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things. And I'm not sure that really just for trying to make it better on a few random tech words with my slightly. I mean, I have an accent, but like, not, you know, an accent that a few other million people have. Ish. I'm not sure that my little fine tune is going to actually like the bump in word error rate reduction. If I ever actually figure out how to do it and get it up to the cloud by the time I've done that. I suspect that the next generation of ASR will just be so good that it will kind of be. Ah, well, that would be cool if it worked out, but I'll just use this instead. So that's going to be it for today's episode of, uh, voice training data. Single long shot evaluation. Who am I going to compare? 
Whisper is always good as a benchmark, but I'm more interested in seeing Whisperer head to head with two things, really. One is whisper variance. So you've got these projects like faster Whisper, Still whisper. It's a bit confusing. There's a whole bunch of them and the emerging acers, which are also a thing. My intention for this is I'm not sure I'm going to have the time in any point in the foreseeable future to go back through this whole episode and create a proper source, truth or a fix. Everything might do it if I can get one transcription that sufficiently close to perfection. But what I would actually love to do on Hugging Face I think would be a great. Probably how I might visualize this is having the audio waveform play, and then have the transcript for each model below it, and maybe even a, um, like, you know, two scale and maybe even a local one as well, like local whisper versus open AI API, Etc. and, um, I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best, uh, wer. But that would require the source of truth. Okay. That's it. Hope this was, I don't know, maybe useful for other folks interested in stuff you want to see. I always feel think I've just said something I didn't intend to say. I said for those, listen carefully. Including, hopefully, the models themselves. This has been myself, Daniel Rosehill, for more, um, jumbled repositories about my, uh, roving interest in AI, but particularly Agentic, MCP and voice tech. Uh, you can find me on GitHub. Hugging face. Where else? Daniel, which is my personal website, as well as this podcast whose name I sadly cannot remember. Until next time. Thanks for listening.
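Running one of the Whisper variants named above locally, for the head-to-head, can be sketched with the faster-whisper package. The model identifier is an assumption (faster-whisper publishes several, including distilled ones); the import is deferred so the snippet loads without the package:

```python
def transcribe_locally(audio_path: str, model_size: str = "distil-large-v3") -> str:
    """Transcribe a file with a locally run Whisper variant on CPU int8."""
    from faster_whisper import WhisperModel  # pip install faster-whisper
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    return " ".join(segment.text.strip() for segment in segments)
```

Feeding its output and each cloud model's output through the same WER computation against the ground truth gives the local-versus-API comparison described above.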
data/inference/runs/cloud-stt/manual-5/raw_response.json ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ {
2
+ "run_id": "run-4",
3
+ "provider": "openai",
4
+ "model": "whisper-1",
5
+
6
+ "text": "Hello and welcome to a audio dataset consisting of one single episode of a non-existent podcast. Or I may append this to a podcast that I set up recently regarding my, with my thoughts on speech tech and AI in particular, more AI and generative AI I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation as they might say for different speech attacks models. And I'm doing this because I thought I had made a great breakthrough in my journey with speech tech and that was succeeding in the elusive task of fine tuning whisper. And I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking. I might whisper something at some points as well. And I'll go back to speaking loud in different parts, I'm going to sound really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to put a speech attacks model through its paces, which is trying to make sense of, is this guy just rambling on incoherently in one long sentence or are these just actually a series of step standalone, step alone, standalone sentences and how is it going to handle step alone? That's not a word. What happens when you use speech attacks and you use a fake word and then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why was I trying to fine tune whisper and what is whisper? As I said, I'm going to try to record this at a couple of different levels of technicality for folks who are, you know, in the normal world and not totally stuck down the rabbit hole of AI, which I have to say is a really wonderful rabbit hole to be down. 
It's a really interesting area and speech and voice tech is the aspect of it that I find actually most, I'm not sure I would say the most interesting because there's just so much that is fascinating in AI, but the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work and I'm persevering hard with the task of trying to get a good solution working for Linux, which if anyone actually does listen to this, not just for the training data and for the actual content, this is sparked. I had, besides the fine tune not working, well, that was the failure. I used flawed code because one thinks these days that there is nothing short of solving, you know, the reason of life or something that flawed and agentic AI can't do, which is not really the case. It does seem that way sometimes, but it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data, basically speaking, just random things for three minutes. And it was actually kind of tedious because the texts were really weird. Some of them were, it was like it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't. I was so bored after 10 minutes that I was like, okay, no, I'm just going to have to find something else to read. So I used a, created with AI Studio, vibe coded a synthetic text generator, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note, like I'm recording an email, give me a short story to read, give me prose to read. So it came up with all these different things and they added a little timer to it so I could see how close I was to one hour. 
And I spent like an hour, one afternoon or probably two hours by the time you, you do retakes and whatever, because you want to, it gave me a source of truth, which I'm not sure if that's the scientific way to approach this topic of gathering training data, but I thought made sense. Um, I have a lot of audio data from recording voice notes, which I've also kind of used, uh, been experimenting with using for a different purpose, slightly different annotating task types. It's more text classification experiments or, uh, well, it's more than that actually working on a voice app. So it's a prototype, I guess, is really more accurate. Um, but you can do that and you can work backwards. You're like, you listen back to a voice note and you painfully go through one of those transcribing, you know, where you start and stop and scrub around it and you fix the errors, but it's really, really boring to do that. So I thought it would be less tedious in the longterm if I just recorded the source of truth. So it gave me these three minute snippets. I recorded them and saved an MP3 and a TXT in the same folder and I created an error that data. Uh, so I was very hopeful, quietly, you know, a little bit hopeful that I would be able that I could actually fine tune Whisper. Um, I want to fine tune Whisper because when I got into voice tech, uh, last November, uh, my wife was in the U S and I was alone at home. And, uh, you know, when crazy people like me do really wild things, like use voice to tech, uh, technology, that was basically, um, when I started doing it, I didn't feel like a crazy person speaking to myself. And my expectations weren't that high. Uh, I used speech tech now and again, um, tried it out. I was like, it'd be really cool if you could just like speak into your computer and whatever I tried out that had Linux support was just, it was not good basically. Um, and this blew me away from the first go. 
I mean, it wasn't 100% accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that, uh, pivot point that it's actually worth doing this. You know, there's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for a speech attacks to be a worthwhile addition to your productivity, but you do need to get above, let's say, I don't know, 85%. If it's 60% or 50%, you inevitably say, screw it. I'll just type it because you end up missing, um, errors in the transcript and it becomes actually worse. You end up in a worse position than you started with it. That's been my experience. So, um, I was like, oh, this is actually really, really good. Now, how did that happen? And the answer is ASR, Whisper being open sourced and the transformer architecture. If you want to go back to the, uh, to the underpinnings, which really blows my mind and it's on my list to read through that paper. All you need is attention as attentively as can be done with my limited brain, because it's super, super high level stuff. Um, super advanced stuff, I mean, uh, but that I think of all the things that are fascinating about the sudden rise and AI and the dramatic capabilities, I find it fascinating. A few people are like, hang on, you've got this thing that can speak to you like a chat bot, an LLM, and then you've got image generation. Okay. So firstly, those two things on the surface have nothing in common. Um, so like, how are they, how did that just happen all at the same time? And then when you extend that further, um, you're like Suno, right? You can sing a song and AI will like come up with an instrumental and then you've got Whisper and you're like, wait a second, how did all this stuff, like if it's all AI, what's like, there has to be some commonality. Otherwise these are four, these are totally different technologies on the surface of it. 
And, uh, the transformer architecture is as far as I know, the answer. And I can't even say, can't even pretend that I really understand what the transformer architecture means in depth, but I have scanned this. And as I said, I want to print it and really kind of think over it at some point. And I'll probably feel bad about myself, I think, because weren't those guys in their twenties? Like, that's crazy. I think I asked Chad TPT once who were the, who wrote that paper and how old were they when it was published in ARCSYV. And I was expecting like, I don't know, what do you, what do you imagine? I personally imagine kind of like, you know, you have these breakthroughs during COVID and things like that, where like these kind of really obscure scientists who are like in their fifties and they've just kind of been laboring in labs and, uh, wearily and writing and publishing and kind of obscure academic publications. And they finally like hit a big or win a Nobel prize. And then their household, household names. Uh, so that was kind of what I had in mind. That was the mental image I'd formed of the birth of ARCSYV. Like I wasn't expecting 20 somethings in San Francisco though. I thought that was both very, very funny, very cool. And actually kind of inspiring. It's nice to think that people who, you know, just, you might put them in the kind of milieu or bubble or world that you are in or credibly in through, you know, the series of connections that are coming up with such literally world-changing, um, innovations. Uh, so that was, I thought, anyway, that's, that, that was cool. Okay. Voice training data. How are we doing? We're about 10 minutes and I'm still talking about voice technology. Um, so Whisper was brilliant and I was so excited that I was, my first instinct was to like guess it's like, Oh my gosh, I have to get like a really good microphone for this. 
So, um, I didn't go on a spending spree because I said, I'm going to have to just wait a month and see if I still use this. And it just kind of became, it's become really part of my daily routine. Like if I'm writing an email, I'll record a voice note and then I've developed. And it's nice to see that everyone is like developing the same things in parallel. Like that's kind of a weird thing to say, but when I look, I, I kind of came when I started working on this, uh, these prototypes on GitHub, which is where I just kind of share very freely and loosely, uh, ideas and, you know, first iterations on, on concepts. Um, and for want of a better word, I called it like, uh, LLM post-processing or cleanup, or basically a system prompt that after you get back the raw text from Whisper, you run it through a model and say, okay, this is crappy, uh, text, like add sentence structure and, you know, fix it up. And, um, now when I'm exploring the different tools that are out there that people have built, I see, uh, quite a number of projects have basically, you know, done the same thing. Um, lest that be misconstrued. I'm not saying for a millisecond that I inspired them. I'm sure this has been a, a thing that's been, uh, integrated into tools for a while, but it's, it's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent, uh, because text that doesn't have any punctuation or paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that again, it's, it's, it, it moves speech tech into that before that inflection point where you're like, nah, it's just not worth it. It's like, it'll just be quicker to type this. So it's a big, it's a little touch that actually is a big deal. Uh, so I was on Whisper and I've been using Whisper and I kind of early on found a couple of tools. 
I couldn't find what I was looking for on Linux, which is, um, basically just something that'll run in the background. You'll give it an API key and it will just like transcribe, um, with like a little key to start and stop the dictation. Uh, and the issues were, I discovered that like most people involved in creating these projects were very much focused on local models, uh, running, running Whisper locally because you can, and I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech to text APIs and what I was spending, I just thought there's, it's actually, in my opinion, just one of the better deals in API spending and in cloud. Like it's just not that expensive for very, very good models that are much more, you know, you're going to be able to run the full model, the latest model versus whatever you can run on your average GPU, unless you want to buy a crazy GPU. It doesn't really make sense to me now. Privacy is another concern, um, that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment, maybe for regulatory reasons as well. Um, but I'm not in that, um, I'm neither really care about people listening to my, uh, grocery list, uh, consisting of, uh, reminding myself that I need to buy more beer, Cheetos and hummus, which is kind of the three, three staples of my diet, um, during, uh, periods of poor nutrition. Uh, but the kind of stuff that I transcribe, it's just not, it's not a, it's not a privacy thing I'm that sort of sensitive about. And, uh, I don't do anything so, you know, sensitive or secure that requires air gapping. So, um, I looked at the pricing and especially the kind of older models, mini, um, some of them are very, very affordable. 
And I did a back of the, I did a calculation once with chat GBT and I was like, okay, this is the, this is the API price for, I can't remember whatever the model was. Uh, let's say I just go at it like nonstop, which it rarely happens. Probably I would say on average, I might dictate 30 to 60 minutes per day. If I was probably summing up the emails, uh, documents, outlines, um, which is a lot, but it's still a fairly modest amount. And I was like, well, some days I do go on like one or two days where I've been usually when I'm like kind of out of the house and just have something like I've nothing else to do. Like if I'm at a hospital, we have a newborn, uh, and you're waiting for like eight hours and hours for an appointment. And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, oh wait, let me just get down. Let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say if I'm going to price out cloudSTT, if I was like dedicated every second of every waking hour to transcribing for some odd reason, um, I mean, it'd have to like eat and use the toilet. Like, you know, there's only so many hours I'm awake for. So like, let's just say a maximum of like 40 hours, 45 minutes in the hours. And I said, all right, let's just say 50. Who knows? You're dictating on the toilet. We do it. So you could just do 60, but whatever I did and every day, like you're going flat out seven days a week, dictating nonstop. I was like, what's my monthly API bill going to be at this price? And it came out to like 70 or 80 bucks. And I was like, well, that would be an extraordinary amount of dictation. And I would hope that there was some compelling reason worth more than $70 that I embarked upon that project. So given that that's kind of the max point for me, I said, that's actually very, very affordable. 
Now you're going to, if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable, that's going to cost them more as well. Unless you're using Gemini, which needless to say is a random person sitting in Jerusalem. I have no affiliation, nor with Google, nor Anthropic, nor Gemini, nor any major tech vendor for that matter. I like Gemini, not so much as a everyday model. It's kind of underwhelmed in that respect, I would say. But for multimodal, I think it's got a lot to offer. And I think that the transcribing functionality whereby it can process audio with a system prompt and both give you transcription, that's cleaned up. That reduces two steps to one. And that for me is a very, very big deal. And I feel like even Google hasn't really sort of thought through how useful that modality is and what kind of use cases you can achieve with it. Because I found in the course of this year, just an endless list of really kind of system prompt stuff that I can say, OK, I've used it to capture context data for AI, which is literally I might speak for if I wanted to have a good bank of context data about, who knows, my childhood. More realistically, maybe my career goals, something that would just be like really boring to type out. So I'll just like sit in my car and record it for 10 minutes and that 10 minutes, you get a lot of information in emails, which is short text. Just there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards and different treatment. My context pipeline is kind of like just extract the bare essentials. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work. And it goes, it condenses that down to very robotic language that is easy to chunk, parse and maybe put into a vector database. Daniel has worked in technology. Daniel is a has been working in, you know, stuff like that. 
That's not how you would speak. But I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes, and this is actually a success because I wasted 20 minutes of my of the evening speaking into a microphone and the levels were shot and it was clipping. And I said, I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. What am I hoping to achieve in this? OK, my fine tune was a dud, as mentioned, DeepGram SDT. I'm really, really hopeful that this prototype will work and it's a built in public open source. So anyone is welcome to use it if I make anything good. But that was really exciting for me last night when after hours of trying my own prototype, seeing someone just made something that works like that. You know, you're not going to have to build a custom Conda environment, an image. I have AMD GPU, which makes things much more complicated. I didn't find it. And I was about to give up and I said, all right, let me just give DeepGram's Linux thing a shot. And if it doesn't work, I'm just going to go back to trying to code something myself. And when I ran the script, I was using cloud code to do the installation process. It ran the script and oh, my gosh, it works just like that. The tricky thing. For all those wants to know all the nitty gritty details was that I don't think it was actually struggling with transcription, but pasting. Wayland makes life very hard. And I think there was something not running at the right time. Anyway, DeepGram, I looked at how they actually handle that because it worked out of the box when other stuff didn't. And it was quite a clever little mechanism. And but more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20 minute audio sample. And I'm I think I've done one or two of these before, but I did it with short, snappy voice notes. This is kind of long form. 
This actually might be a better approximation for what's useful to me than voice memos. Like, I need to buy three liters of milk tomorrow and pita bread, which is probably how like half my voice note voice note sound like if anyone were to, I don't know, like find my phone, they'd be like, this is the most boring person in the world. Although, actually, there are some like kind of journaling thoughts as well, but it's a lot of content like that. And the probably for the evaluation, the most useful thing is slightly obscure tech. GitHub, Nuclino, Hugging Face, not so obscure that it's not going to have a chance of knowing it, but hopefully sufficiently well known that the model should get it. I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general, I've spoken, delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise, which in my first take that I had to get rid of, my wife came in with my son and for a good night kiss. And that actually would have been super helpful to get in because it was non-diarized or if we had diarization, a female, I could say, I want the male voice and that wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I've done in my main data set where I am trying to go back to some of my voice notes, annotate them and run a benchmark. But this is going to be just a pure quick test. And as someone working on a voice note idea, that's my sort of end motivation besides thinking it's an absolutely outstanding technology that's coming to viability. And really, I know this sounds cheesy, can actually have a very transformative effect. It's, you know, voice technology has been life changing for folks living with disabilities. 
And I think there's something really nice about the fact that it can also benefit, you know, folks who are able-bodied and like we can all in different ways make this tech as useful as possible, regardless of the exact way that we're using it. And I think there's something very powerful in that and it can be very cool. I see huge potential. What excites me about voice tech? A lot of things, actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this. And it's getting better and better with stuff like accent handling. I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day, as I imagine. I get like superb, flawless words, error rates, because I'm just kind of skeptical about local speech to text, as I mentioned. And I think the pace of innovation and improvement in the models, the main reasons for fine tuning from what I've seen have been people who are something that really blows my mind about ASR is the idea that it's inherently alingual or multilingual, phonetic based. So as folks who speak very obscure languages, that there might be a paucity of training data or almost none at all, and therefore the accuracy is significantly reduced. Or folks in very critical environments. I know this is used extensively in medical transcription and dispatcher work as, you know, the call centers who send out ambulances, et cetera, where accuracy is absolutely paramount. And in the case of doctors, radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things. And I'm not sure that really just for trying to make it better on a few random tech words with my slightly, I mean, I have an accent, but like not, you know, an accent that a few other million people have it. 
I'm not sure that my little fine tune is going to actually, like the bump in word error reduction, if I ever actually figure out how to do it and get it up to the cloud, by the time I've done that, I suspect that the next generation of ASR will just be so good that it will kind of be, well, that would have been cool if it worked out, but I'll just use this instead. So that's going to be it for today's episode of voice training data. Single long shot evaluation. Who am I going to compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head to head with two things really. One is Whisper variants. So you've got these projects like Faster Whisper, Distilled Whisper. It's a bit confusing. There's a whole bunch of them. And the emerging ASRs, which are also a thing. My intention for this is I'm not sure I'm going to have the time in any point in the foreseeable future to go back through this whole episode and create a proper source truth where I fix everything. I might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face, I think would be a great, probably how I might visualize this, is having the audio waveform play and then have the transcript for each model below it. And maybe even a, like, you know, two scale and maybe even a local one as well. Like local Whisper versus OpenAI API, etc. And I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER. But that would require the source of truth. OK, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. You want to see that, I always feel, think I've just said something I didn't intend to. STT, I said for those. Listen carefully, including hopefully the models themselves. 
This has been myself, Daniel Rosehill. For more jumbled repositories about my roving interest in AI, but particularly agentic, MCP and voice tech, you can find me on GitHub, Hugging Face. Where else? Danielrosehill.com, which is my personal website, as well as this podcast, whose name I sadly cannot remember. Until next time, thanks for listening.
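The evaluation described in the transcript above boils down to computing a word error rate (WER) between each model's transcript and the source of truth. As a minimal sketch of that headline metric (standard-library Python only; the sample strings below are hypothetical and not drawn from this dataset):

```python
# Minimal word error rate (WER) sketch: word-level Levenshtein distance,
# as used when scoring a model transcript against a ground-truth transcript.
# Sample strings are illustrative only, not taken from this dataset.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # DP table: d[i][j] = edit cost of aligning ref[:i] with hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

truth = "i want to fine tune whisper"
model_output = "i want to fine tune whisperer"
print(round(wer(truth, model_output), 3))  # one substitution over six words
```

In practice a library such as `jiwer` (with its normalization pipeline) would be used for a real benchmark run, but the arithmetic is the same.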
7
+
data/inference/runs/cloud-stt/manual-5/transcript.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ Hello and welcome to a audio dataset consisting of one single episode of a non-existent podcast. Or I may append this to a podcast that I set up recently regarding my, with my thoughts on speech tech and AI in particular, more AI and generative AI I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation as they might say for different speech attacks models. And I'm doing this because I thought I had made a great breakthrough in my journey with speech tech and that was succeeding in the elusive task of fine tuning whisper. And I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking. I might whisper something at some points as well. And I'll go back to speaking loud in different parts, I'm going to sound really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to put a speech attacks model through its paces, which is trying to make sense of, is this guy just rambling on incoherently in one long sentence or are these just actually a series of step standalone, step alone, standalone sentences and how is it going to handle step alone? That's not a word. What happens when you use speech attacks and you use a fake word and then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why was I trying to fine tune whisper and what is whisper? As I said, I'm going to try to record this at a couple of different levels of technicality for folks who are, you know, in the normal world and not totally stuck down the rabbit hole of AI, which I have to say is a really wonderful rabbit hole to be down. 
It's a really interesting area and speech and voice tech is the aspect of it that I find actually most, I'm not sure I would say the most interesting because there's just so much that is fascinating in AI, but the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work and I'm persevering hard with the task of trying to get a good solution working for Linux, which if anyone actually does listen to this, not just for the training data and for the actual content, this is sparked. I had, besides the fine tune not working, well, that was the failure. I used flawed code because one thinks these days that there is nothing short of solving, you know, the reason of life or something that flawed and agentic AI can't do, which is not really the case. It does seem that way sometimes, but it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data, basically speaking, just random things for three minutes. And it was actually kind of tedious because the texts were really weird. Some of them were, it was like it was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't. I was so bored after 10 minutes that I was like, okay, no, I'm just going to have to find something else to read. So I used a, created with AI Studio, vibe coded a synthetic text generator, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note, like I'm recording an email, give me a short story to read, give me prose to read. So it came up with all these different things and they added a little timer to it so I could see how close I was to one hour. 
And I spent like an hour, one afternoon or probably two hours by the time you, you do retakes and whatever, because you want to, it gave me a source of truth, which I'm not sure if that's the scientific way to approach this topic of gathering training data, but I thought made sense. Um, I have a lot of audio data from recording voice notes, which I've also kind of used, uh, been experimenting with using for a different purpose, slightly different annotating task types. It's more text classification experiments or, uh, well, it's more than that actually working on a voice app. So it's a prototype, I guess, is really more accurate. Um, but you can do that and you can work backwards. You're like, you listen back to a voice note and you painfully go through one of those transcribing, you know, where you start and stop and scrub around it and you fix the errors, but it's really, really boring to do that. So I thought it would be less tedious in the longterm if I just recorded the source of truth. So it gave me these three minute snippets. I recorded them and saved an MP3 and a TXT in the same folder and I created an error that data. Uh, so I was very hopeful, quietly, you know, a little bit hopeful that I would be able that I could actually fine tune Whisper. Um, I want to fine tune Whisper because when I got into voice tech, uh, last November, uh, my wife was in the U S and I was alone at home. And, uh, you know, when crazy people like me do really wild things, like use voice to tech, uh, technology, that was basically, um, when I started doing it, I didn't feel like a crazy person speaking to myself. And my expectations weren't that high. Uh, I used speech tech now and again, um, tried it out. I was like, it'd be really cool if you could just like speak into your computer and whatever I tried out that had Linux support was just, it was not good basically. Um, and this blew me away from the first go. 
I mean, it wasn't 100% accurate out of the box and it took work, but it was good enough that there was a solid foundation and it kind of passed that, uh, pivot point that it's actually worth doing this. You know, there's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for a speech attacks to be a worthwhile addition to your productivity, but you do need to get above, let's say, I don't know, 85%. If it's 60% or 50%, you inevitably say, screw it. I'll just type it because you end up missing, um, errors in the transcript and it becomes actually worse. You end up in a worse position than you started with it. That's been my experience. So, um, I was like, oh, this is actually really, really good. Now, how did that happen? And the answer is ASR, Whisper being open sourced and the transformer architecture. If you want to go back to the, uh, to the underpinnings, which really blows my mind and it's on my list to read through that paper. All you need is attention as attentively as can be done with my limited brain, because it's super, super high level stuff. Um, super advanced stuff, I mean, uh, but that I think of all the things that are fascinating about the sudden rise and AI and the dramatic capabilities, I find it fascinating. A few people are like, hang on, you've got this thing that can speak to you like a chat bot, an LLM, and then you've got image generation. Okay. So firstly, those two things on the surface have nothing in common. Um, so like, how are they, how did that just happen all at the same time? And then when you extend that further, um, you're like Suno, right? You can sing a song and AI will like come up with an instrumental and then you've got Whisper and you're like, wait a second, how did all this stuff, like if it's all AI, what's like, there has to be some commonality. Otherwise these are four, these are totally different technologies on the surface of it. 
And, uh, the transformer architecture is as far as I know, the answer. And I can't even say, can't even pretend that I really understand what the transformer architecture means in depth, but I have scanned this. And as I said, I want to print it and really kind of think over it at some point. And I'll probably feel bad about myself, I think, because weren't those guys in their twenties? Like, that's crazy. I think I asked Chad TPT once who were the, who wrote that paper and how old were they when it was published in ARCSYV. And I was expecting like, I don't know, what do you, what do you imagine? I personally imagine kind of like, you know, you have these breakthroughs during COVID and things like that, where like these kind of really obscure scientists who are like in their fifties and they've just kind of been laboring in labs and, uh, wearily and writing and publishing and kind of obscure academic publications. And they finally like hit a big or win a Nobel prize. And then their household, household names. Uh, so that was kind of what I had in mind. That was the mental image I'd formed of the birth of ARCSYV. Like I wasn't expecting 20 somethings in San Francisco though. I thought that was both very, very funny, very cool. And actually kind of inspiring. It's nice to think that people who, you know, just, you might put them in the kind of milieu or bubble or world that you are in or credibly in through, you know, the series of connections that are coming up with such literally world-changing, um, innovations. Uh, so that was, I thought, anyway, that's, that, that was cool. Okay. Voice training data. How are we doing? We're about 10 minutes and I'm still talking about voice technology. Um, so Whisper was brilliant and I was so excited that I was, my first instinct was to like guess it's like, Oh my gosh, I have to get like a really good microphone for this. 
So, um, I didn't go on a spending spree because I said, I'm going to have to just wait a month and see if I still use this. And it just kind of became, it's become really part of my daily routine. Like if I'm writing an email, I'll record a voice note and then I've developed. And it's nice to see that everyone is like developing the same things in parallel. Like that's kind of a weird thing to say, but when I look, I, I kind of came when I started working on this, uh, these prototypes on GitHub, which is where I just kind of share very freely and loosely, uh, ideas and, you know, first iterations on, on concepts. Um, and for want of a better word, I called it like, uh, LLM post-processing or cleanup, or basically a system prompt that after you get back the raw text from Whisper, you run it through a model and say, okay, this is crappy, uh, text, like add sentence structure and, you know, fix it up. And, um, now when I'm exploring the different tools that are out there that people have built, I see, uh, quite a number of projects have basically, you know, done the same thing. Um, lest that be misconstrued. I'm not saying for a millisecond that I inspired them. I'm sure this has been a, a thing that's been, uh, integrated into tools for a while, but it's, it's the kind of thing that when you start using these tools every day, the need for it is almost instantly apparent, uh, because text that doesn't have any punctuation or paragraph spacing takes a long time to, you know, it takes so long to get it into a presentable email that again, it's, it's, it, it moves speech tech into that before that inflection point where you're like, nah, it's just not worth it. It's like, it'll just be quicker to type this. So it's a big, it's a little touch that actually is a big deal. Uh, so I was on Whisper and I've been using Whisper and I kind of early on found a couple of tools. 
I couldn't find what I was looking for on Linux, which is, um, basically just something that'll run in the background. You'll give it an API key and it will just like transcribe, um, with like a little key to start and stop the dictation. Uh, and the issues were, I discovered that like most people involved in creating these projects were very much focused on local models, uh, running, running Whisper locally because you can, and I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech to text APIs and what I was spending, I just thought there's, it's actually, in my opinion, just one of the better deals in API spending and in cloud. Like it's just not that expensive for very, very good models that are much more, you know, you're going to be able to run the full model, the latest model versus whatever you can run on your average GPU, unless you want to buy a crazy GPU. It doesn't really make sense to me now. Privacy is another concern, um, that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment, maybe for regulatory reasons as well. Um, but I'm not in that, um, I'm neither really care about people listening to my, uh, grocery list, uh, consisting of, uh, reminding myself that I need to buy more beer, Cheetos and hummus, which is kind of the three, three staples of my diet, um, during, uh, periods of poor nutrition. Uh, but the kind of stuff that I transcribe, it's just not, it's not a, it's not a privacy thing I'm that sort of sensitive about. And, uh, I don't do anything so, you know, sensitive or secure that requires air gapping. So, um, I looked at the pricing and especially the kind of older models, mini, um, some of them are very, very affordable. 
And I did a back of the, I did a calculation once with chat GBT and I was like, okay, this is the, this is the API price for, I can't remember whatever the model was. Uh, let's say I just go at it like nonstop, which it rarely happens. Probably I would say on average, I might dictate 30 to 60 minutes per day. If I was probably summing up the emails, uh, documents, outlines, um, which is a lot, but it's still a fairly modest amount. And I was like, well, some days I do go on like one or two days where I've been usually when I'm like kind of out of the house and just have something like I've nothing else to do. Like if I'm at a hospital, we have a newborn, uh, and you're waiting for like eight hours and hours for an appointment. And I would probably have listened to podcasts before becoming a speech fanatic. And I'm like, oh wait, let me just get down. Let me just get these ideas out of my head. And that's when I'll go on my speech binges. But those are like once every few months, like not frequently. But I said, okay, let's just say if I'm going to price out cloudSTT, if I was like dedicated every second of every waking hour to transcribing for some odd reason, um, I mean, it'd have to like eat and use the toilet. Like, you know, there's only so many hours I'm awake for. So like, let's just say a maximum of like 40 hours, 45 minutes in the hours. And I said, all right, let's just say 50. Who knows? You're dictating on the toilet. We do it. So you could just do 60, but whatever I did and every day, like you're going flat out seven days a week, dictating nonstop. I was like, what's my monthly API bill going to be at this price? And it came out to like 70 or 80 bucks. And I was like, well, that would be an extraordinary amount of dictation. And I would hope that there was some compelling reason worth more than $70 that I embarked upon that project. So given that that's kind of the max point for me, I said, that's actually very, very affordable. 
Now you're going to, if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable, that's going to cost them more as well. Unless you're using Gemini, which needless to say is a random person sitting in Jerusalem. I have no affiliation, nor with Google, nor Anthropic, nor Gemini, nor any major tech vendor for that matter. I like Gemini, not so much as a everyday model. It's kind of underwhelmed in that respect, I would say. But for multimodal, I think it's got a lot to offer. And I think that the transcribing functionality whereby it can process audio with a system prompt and both give you transcription, that's cleaned up. That reduces two steps to one. And that for me is a very, very big deal. And I feel like even Google hasn't really sort of thought through how useful that modality is and what kind of use cases you can achieve with it. Because I found in the course of this year, just an endless list of really kind of system prompt stuff that I can say, OK, I've used it to capture context data for AI, which is literally I might speak for if I wanted to have a good bank of context data about, who knows, my childhood. More realistically, maybe my career goals, something that would just be like really boring to type out. So I'll just like sit in my car and record it for 10 minutes and that 10 minutes, you get a lot of information in emails, which is short text. Just there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards and different treatment. My context pipeline is kind of like just extract the bare essentials. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work. And it goes, it condenses that down to very robotic language that is easy to chunk, parse and maybe put into a vector database. Daniel has worked in technology. Daniel is a has been working in, you know, stuff like that. 
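The "two steps to one" Gemini workflow described here — audio plus a system prompt in a single multimodal call — might look roughly like the sketch below. The model name, the `google-generativeai` calls, and the prompt wording are all my assumptions, not details from the episode; the SDK import is deferred so the sketch can be read and loaded without the package installed.

```python
# Hypothetical sketch: transcription and condensation in one multimodal call.
# Prompt wording and model name are illustrative assumptions.
CONTEXT_PIPELINE_PROMPT = (
    "Transcribe the attached audio, then condense it into short, factual, "
    "third-person statements that are easy to chunk, parse, and load into a "
    "vector database. Robotic phrasing is fine; do not invent facts."
)

def transcribe_and_condense(audio_path: str,
                            model_name: str = "gemini-1.5-flash") -> str:
    """One-step transcribe-plus-cleanup; requires the google-generativeai SDK."""
    # Imported lazily so the module loads even without the SDK installed.
    import google.generativeai as genai
    audio = genai.upload_file(audio_path)
    model = genai.GenerativeModel(model_name,
                                  system_instruction=CONTEXT_PIPELINE_PROMPT)
    return model.generate_content([audio, "Process this recording."]).text
```

The output of a prompt like this is exactly the terse "Daniel has worked in technology" style described just above.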
That's not how you would speak. But I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes, and this is actually a success because I wasted 20 minutes of my of the evening speaking into a microphone and the levels were shot and it was clipping. And I said, I can't really do an evaluation. I have to be fair. I have to give the models a chance to do their thing. What am I hoping to achieve in this? OK, my fine tune was a dud, as mentioned, DeepGram STT. I'm really, really hopeful that this prototype will work and it's a built in public open source. So anyone is welcome to use it if I make anything good. But that was really exciting for me last night when after hours of trying my own prototype, seeing someone just made something that works like that. You know, you're not going to have to build a custom Conda environment, an image. I have AMD GPU, which makes things much more complicated. I didn't find it. And I was about to give up and I said, all right, let me just give DeepGram's Linux thing a shot. And if it doesn't work, I'm just going to go back to trying to code something myself. And when I ran the script, I was using Claude Code to do the installation process. It ran the script and oh, my gosh, it works just like that. The tricky thing. For all those wants to know all the nitty gritty details was that I don't think it was actually struggling with transcription, but pasting. Wayland makes life very hard. And I think there was something not running at the right time. Anyway, DeepGram, I looked at how they actually handle that because it worked out of the box when other stuff didn't. And it was quite a clever little mechanism. And but more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20 minute audio sample. And I'm I think I've done one or two of these before, but I did it with short, snappy voice notes. This is kind of long form.
This actually might be a better approximation for what's useful to me than voice memos. Like, I need to buy three liters of milk tomorrow and pita bread, which is probably how like half my voice note voice note sound like if anyone were to, I don't know, like find my phone, they'd be like, this is the most boring person in the world. Although, actually, there are some like kind of journaling thoughts as well, but it's a lot of content like that. And the probably for the evaluation, the most useful thing is slightly obscure tech. GitHub, Nuclino, Hugging Face, not so obscure that it's not going to have a chance of knowing it, but hopefully sufficiently well known that the model should get it. I tried to do a little bit of speaking really fast and speaking very slowly. I would say in general, I've spoken, delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise, which in my first take that I had to get rid of, my wife came in with my son and for a good night kiss. And that actually would have been super helpful to get in because it was non-diarized or if we had diarization, a female, I could say, I want the male voice and that wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I've done in my main data set where I am trying to go back to some of my voice notes, annotate them and run a benchmark. But this is going to be just a pure quick test. And as someone working on a voice note idea, that's my sort of end motivation besides thinking it's an absolutely outstanding technology that's coming to viability. And really, I know this sounds cheesy, can actually have a very transformative effect. It's, you know, voice technology has been life changing for folks living with disabilities. 
And I think there's something really nice about the fact that it can also benefit, you know, folks who are able-bodied and like we can all in different ways make this tech as useful as possible, regardless of the exact way that we're using it. And I think there's something very powerful in that and it can be very cool. I see huge potential. What excites me about voice tech? A lot of things, actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this. And it's getting better and better with stuff like accent handling. I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day, as I imagine. I get like superb, flawless words, error rates, because I'm just kind of skeptical about local speech to text, as I mentioned. And I think the pace of innovation and improvement in the models, the main reasons for fine tuning from what I've seen have been people who are something that really blows my mind about ASR is the idea that it's inherently alingual or multilingual, phonetic based. So as folks who speak very obscure languages, that there might be a paucity of training data or almost none at all, and therefore the accuracy is significantly reduced. Or folks in very critical environments. I know this is used extensively in medical transcription and dispatcher work as, you know, the call centers who send out ambulances, et cetera, where accuracy is absolutely paramount. And in the case of doctors, radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things. And I'm not sure that really just for trying to make it better on a few random tech words with my slightly, I mean, I have an accent, but like not, you know, an accent that a few other million people have it. 
I'm not sure that my little fine tune is going to actually, like the bump in word error reduction, if I ever actually figure out how to do it and get it up to the cloud, by the time I've done that, I suspect that the next generation of ASR will just be so good that it will kind of be, well, that would have been cool if it worked out, but I'll just use this instead. So that's going to be it for today's episode of voice training data. Single long shot evaluation. Who am I going to compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head to head with two things really. One is Whisper variants. So you've got these projects like Faster Whisper, Distilled Whisper. It's a bit confusing. There's a whole bunch of them. And the emerging ASRs, which are also a thing. My intention for this is I'm not sure I'm going to have the time in any point in the foreseeable future to go back through this whole episode and create a proper source truth where I fix everything. I might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face, I think would be a great, probably how I might visualize this, is having the audio waveform play and then have the transcript for each model below it. And maybe even a, like, you know, two scale and maybe even a local one as well. Like local Whisper versus OpenAI API, etc. And I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER. But that would require the source of truth. OK, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. You want to see that, I always feel, think I've just said something I didn't intend to. STT, I said for those. Listen carefully, including hopefully the models themselves. 
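The head-to-head comparison described here — Whisper versus its variants versus the emerging ASRs — comes down to computing word error rate against a source of truth. As a reference point, a minimal WER implementation via word-level edit distance (real evaluations usually normalize case and punctuation first, which this sketch skips):

```python
# Minimal word error rate: word-level Levenshtein distance divided by the
# number of reference words. Reference implementation, not a library's.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("fine tuning whisper", "fine tuning whisper"))  # 0.0
print(wer("speech to text", "speech attacks"))  # 2 edits / 3 ref words ~ 0.67
```

Given a corrected source of truth for this episode, running each model's transcript through a function like this yields the "headline finding" of which model had the best WER.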
This has been myself, Daniel Rosehill. For more jumbled repositories about my roving interest in AI, but particularly agentic, MCP and voice tech, you can find me on GitHub, Hugging Face. Where else? Danielrosehill.com, which is my personal website, as well as this podcast, whose name I sadly cannot remember. Until next time, thanks for listening.
data/inference/runs/local-stt/run-1/transcript.srt ADDED
@@ -0,0 +1,1032 @@
+ 1
+ 00:00:00,000 --> 00:00:08,640
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast.
+
+ 2
+ 00:00:08,640 --> 00:00:19,120
+ Or, I may append this to a podcast that I set up recently regarding my with my thoughts on speech
+
+ 3
+ 00:00:19,120 --> 00:00:28,720
+ tech and AI in particular. More AI in generative AI I would say. But in any event, the purpose of this
+
+ 4
+ 00:00:30,080 --> 00:00:37,120
+ voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the
+
+ 5
+ 00:00:37,120 --> 00:00:42,320
+ envelope evaluation as they might say for different speech attacks models. And I'm doing this because I
+
+ 6
+ 00:00:42,800 --> 00:00:48,560
+ I thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in
+
+ 7
+ 00:00:48,560 --> 00:00:55,120
+ the elusive task of fine-tuning whisper. Whisper is, and I'm going to just talk, I'm trying to
+
+ 8
+ 00:00:55,760 --> 00:01:01,600
+ mix up, I'm going to try a few different styles of speaking. I might whisper something at some
+
+ 9
+ 00:01:01,600 --> 00:01:07,760
+ points as well. And I'll go back to speaking loud in different parts. I'm going to send really
+
+ 10
+ 00:01:07,760 --> 00:01:15,200
+ like a crazy person because I'm also going to try to speak at different pitches and cadences
+
+ 11
+ 00:01:15,200 --> 00:01:21,600
+ in order to really try to push a speech attacks model through its paces, which is trying to make
+
+ 12
+ 00:01:21,600 --> 00:01:30,320
+ sense of is this guy just rambling on and coherently in one long sentence or are these just actually
+
+ 13
+ 00:01:30,320 --> 00:01:38,320
+ series of step standalone standalone sentences? And how is it going to handle step alone? That's not a
+
+ 14
+ 00:01:38,320 --> 00:01:43,919
+ word. What happens when you use speech attacks and you use a fake word and then you're like, wait,
+
+ 15
+ 00:01:43,919 --> 00:01:51,520
+ that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the
+
+ 16
+ 00:01:52,880 --> 00:01:57,359
+ questions that I'm seeking to answer in this training data. Now, why did why was it trying to
+
+ 17
+ 00:01:57,360 --> 00:02:01,040
+ find China whisper? And what is whisper? As I said, I'm going to try to
+
+ 18
+ 00:02:02,080 --> 00:02:04,240
+ record this at a couple of different levels of
+
+ 19
+ 00:02:04,880 --> 00:02:10,320
+ technicality for folks who are in the normal world and not totally
+
+ 20
+ 00:02:11,360 --> 00:02:16,079
+ stuck down the rabbit hole of AI. What you have to say is a really wonderful rabbit hole to be
+
+ 21
+ 00:02:16,720 --> 00:02:23,440
+ to be done. It's a really interesting area and speech and voice tech is the aspect of it that
+
+ 22
+ 00:02:23,440 --> 00:02:28,880
+ I find actually most I'm not sure I would say the most interesting because there's just so much
+
+ 23
+ 00:02:28,880 --> 00:02:34,560
+ that is fascinating in AI. But the most that I find the most personally transformative in terms of
+
+ 24
+ 00:02:34,560 --> 00:02:42,240
+ the impact that it's had on my daily work life and productivity and how I sort of work. And
+
+ 25
+ 00:02:42,960 --> 00:02:49,920
+ I am persevering hard with the task of training, I guess, a good solution working for Linux.
+
+ 26
+ 00:02:49,920 --> 00:02:53,440
+ Would you have anyone actually does listen to this not just for the training data and for the
+
+ 27
+ 00:02:53,440 --> 00:03:00,399
+ actual content? This is this is sparked. I had, besides the fine tune not working, well that was
+
+ 28
+ 00:03:00,399 --> 00:03:07,679
+ the failure. I used plot code because one thing these days that there is nothing
+
+ 29
+ 00:03:08,560 --> 00:03:16,799
+ short of solving, you know, the reason of life or something that's flawed and
+
+ 30
+ 00:03:16,800 --> 00:03:22,720
+ agentically I can't do, which is not really the case. It does seem that way sometimes but it
+
+ 31
+ 00:03:22,720 --> 00:03:28,080
+ fails a lot as well. And this is one of those instances where last week I put together an hour
+
+ 32
+ 00:03:28,080 --> 00:03:33,600
+ of voice training data, basically speaking just random things for three minutes and
+
+ 33
+ 00:03:35,600 --> 00:03:40,160
+ it was actually kind of tedious because the text were really weird. Some of them were it was like,
+
+ 34
+ 00:03:40,160 --> 00:03:45,440
+ it was AI generated. I tried before to reach Sherlock Holmes for an hour and I just couldn't,
+
+ 35
+ 00:03:45,440 --> 00:03:51,120
+ I was so bored after 10 minutes that I was like, okay, knowing just gonna have to find something
+
+ 36
+ 00:03:51,120 --> 00:03:59,920
+ else to read. So I used a created with AI Studio, a vibe code is a synthetic text generator,
+
+ 37
+ 00:04:00,800 --> 00:04:05,680
+ which actually I thought was probably a better way of doing it because it would give me more
+
+ 38
+ 00:04:05,680 --> 00:04:12,000
+ short samples with more varied content. So I was like, okay, give me a voice note. Like I'm
+
+ 39
+ 00:04:12,000 --> 00:04:18,800
+ recording an email, give me a short story to read, give me pros to read. So I came up with all
+
+ 40
+ 00:04:18,800 --> 00:04:24,240
+ these different things and they added a little timer to it so I could see how close I was to one
+
+ 41
+ 00:04:24,240 --> 00:04:32,480
+ hour and I spent like an hour one afternoon or probably two hours by the time you do retakes
+
+ 42
+ 00:04:32,480 --> 00:04:39,120
+ and whatever because you want to, it gave me a source of truth which I'm not sure if that's the
+
+ 43
+ 00:04:39,120 --> 00:04:45,120
+ scientific way to approach this. Topic of gathering training data but I thought made sense.
+
+ 44
+ 00:04:46,560 --> 00:04:50,880
+ I have a lot of audio data from recording voice notes which I've also kind of used
+
+ 45
+ 00:04:52,000 --> 00:04:56,720
+ being experimenting with using for a different purpose. It's slightly different annotating
+
+ 46
+ 00:04:57,840 --> 00:05:03,680
+ task types. It's more text classification experiment or well it's more than that actually
+
+ 47
+ 00:05:03,680 --> 00:05:08,880
+ working on a voice app so it's a prototype I guess is really more accurate.
+
+ 48
+ 00:05:11,280 --> 00:05:15,920
+ But you can do that and you can work backwards. You listen back to a voice note and you
+
+ 49
+ 00:05:17,520 --> 00:05:22,400
+ painfully go through one of those transcribing where you start and stop and scrub around it and
+
+ 50
+ 00:05:22,400 --> 00:05:27,680
+ you fix the errors but it's really really pouring to do that. So I thought it would be last tedious
+
+ 51
+ 00:05:27,680 --> 00:05:34,240
+ in the long term if I just recorded this source of truth so it gave me these three minute snippets.
+
+ 52
+ 00:05:34,240 --> 00:05:40,480
+ I recorded them it saved in MP3 and the TXT in the same folder and I created an error that data.
+
+ 53
+ 00:05:41,840 --> 00:05:47,280
+ So I was very hopeful quite a little bit hopeful that I would be able that I could actually fine tune
+
+ 54
+ 00:05:47,280 --> 00:05:54,720
+ whisper. I want to fine tune whisper because when I got into voice tech last November my wife was in
+
+ 55
+ 00:05:54,720 --> 00:06:01,920
+ the US and I was alone at home and when crazy people like me do really wild things like use voice
+
+ 56
+ 00:06:01,920 --> 00:06:08,320
+ to tech technology that was basically when I started doing it I didn't feel like a crazy person
+
+ 57
+ 00:06:08,320 --> 00:06:15,760
+ speaking to myself and my expectations weren't that high. I used speech tech now and again
+
+ 58
+ 00:06:16,960 --> 00:06:21,200
+ try it out. It's like it'd be really cool if you could just like speak into your computer and
+
+ 59
+ 00:06:21,280 --> 00:06:28,479
+ whatever I tried out that had Linux support was just it was not good basically and this blew me away
+
+ 60
+ 00:06:28,479 --> 00:06:34,400
+ from the first go. I mean it wasn't 100% accurate out of the box and it took work but it was good
+
+ 61
+ 00:06:34,400 --> 00:06:40,320
+ enough that there was a solid foundation and it kind of passed that pivot point that it's actually
+
+ 62
+ 00:06:40,320 --> 00:06:46,320
+ worth doing this. There's a point where it's so like the transcript is you don't have to get 100%
+
+ 63
+ 00:06:46,400 --> 00:06:51,040
+ accuracy for it to be worth your time for it's speech attacks to be a worthwhile addition to your
+
+ 64
+ 00:06:51,040 --> 00:06:58,320
+ productivity but you do need to get above let's say I don't know 85% if it's 60% or 50% you inevitably
+
+ 65
+ 00:06:58,320 --> 00:07:03,920
+ say screw it I'll just type it because you end up missing errors in the transcript and it becomes
+
+ 66
+ 00:07:03,920 --> 00:07:07,840
+ actually worse you end up in a worse position than you started with it that's been my experience.
+
+ 67
+ 00:07:08,400 --> 00:07:14,400
+ So I was like oh this is actually really really good now how did that happen? The answer is
+
+ 68
+ 00:07:14,400 --> 00:07:21,599
+ ASR with per being open-sourced and the transformer architecture if you want to go back to the
+
+ 69
+ 00:07:23,200 --> 00:07:29,440
+ to the underpinnings which really blows my mind and it's on my list to read through that paper
+
+ 70
+ 00:07:30,239 --> 00:07:38,400
+ all you need is attention as attentively as can be done with my limited brain because it's super
+
+ 71
+ 00:07:38,960 --> 00:07:45,679
+ high-level stuff super advanced stuff I mean but that I think of all the things that
+
+ 72
+ 00:07:47,280 --> 00:07:54,080
+ are fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating
+
+ 73
+ 00:07:54,080 --> 00:07:59,599
+ a few people are like hang on you've got this thing that can speak to you like a chatbot and LLM
+
+ 74
+ 00:08:00,640 --> 00:08:06,799
+ then you've got image generation okay so firstly those two things on the surface have nothing
+
+ 75
+ 00:08:06,800 --> 00:08:12,560
+ in common so like how are they how did that just happen all at the same time and then when you
+
+ 76
+ 00:08:12,560 --> 00:08:19,920
+ extend that further you're like sooner right you can sing a song an AI will like come up with
+
+ 77
+ 00:08:19,920 --> 00:08:25,200
+ an instrumental and then you've got whisper and you're like wait a second how did all this stuff
+
+ 78
+ 00:08:25,200 --> 00:08:30,880
+ like if it's all AI what's like there has to be some commonality otherwise these are four these
+
+ 79
+ 00:08:31,600 --> 00:08:38,640
+ totally different technologies on the surface of it and the transformer architecture is as far as
+
+ 80
+ 00:08:38,640 --> 00:08:44,720
+ I know the answer and I can't even say I can't even pretend that I really understand what the
+
+ 81
+ 00:08:44,720 --> 00:08:51,200
+ transformer architecture means in depth but I have scandis and as I said I want to print it and
+
+ 82
+ 00:08:51,200 --> 00:08:57,760
+ really kind of think over it's at some point and I'll probably feel bad about myself I think
+
+ 83
+ 00:08:57,760 --> 00:09:03,280
+ because weren't those guys in their in their 20s like that's crazy I think I asked chat Gbt
+
+ 84
+ 00:09:03,280 --> 00:09:09,439
+ once who were the who wrote that paper and how old were they when it was published in ARC
+
+ 85
+ 00:09:09,439 --> 00:09:14,640
+ and I was expecting like I don't know what do you what do you imagine I personally imagine kind of
+
+ 86
+ 00:09:14,640 --> 00:09:19,840
+ like you know you have these breakthroughs during COVID and things like that were like these kind
+
+ 87
+ 00:09:19,840 --> 00:09:24,480
+ of really obscure scientists who are like in their 50s and they've just kind of been laboring
+
+ 88
+ 00:09:24,640 --> 00:09:31,120
+ labs and we're really in writing and publishing and kind of obscure academic publications and they
+
+ 89
+ 00:09:31,120 --> 00:09:37,200
+ finally like hit a big or win a Nobel Prize and then their household household names so I that
+
+ 90
+ 00:09:37,200 --> 00:09:42,680
+ was kind of what I had in mind that was the mental image I'd formed of the birth of ARC
+
+ 91
+ 00:09:42,680 --> 00:09:47,760
+ like I wasn't expecting 20 somethings in San Francisco though I thought that was both very very
+
+ 92
+ 00:09:47,760 --> 00:09:54,160
+ funny very cool and actually kind of inspiring it's nice to think that people who you know just
+
+ 93
+ 00:09:54,160 --> 00:10:01,439
+ you might put them in the kind of milieu or bubble or world that you are in are credibly in through
+
+ 94
+ 00:10:01,439 --> 00:10:06,079
+ you know the series of connections that are coming up with such literally world changing
+
+ 95
+ 00:10:06,880 --> 00:10:13,439
+ innovations so that was I thought anyway that that that was cool okay voice training data how
+
+ 96
+ 00:10:13,439 --> 00:10:19,280
+ were we doing we're about 10 minutes and I'm still talking about voice technology so whisper was
+
+ 97
+ 00:10:19,280 --> 00:10:25,680
+ brilliant and I was so excited that I was my first instinct was to like guess it's like oh my gosh
+
+ 98
+ 00:10:25,680 --> 00:10:31,040
+ I have to get like a really good microphone for this so I didn't go on a spending spree because
+
+ 99
+ 00:10:31,040 --> 00:10:37,760
+ I said I'm gonna have to just wait a month and see if I still use this and it just kind of became
+
+ 100
+ 00:10:37,760 --> 00:10:44,800
+ it's become really part of my daily routine like if I'm writing an email I'll record a voice note
+
+ 101
+ 00:10:44,880 --> 00:10:50,079
+ and then I've developed and it's nice to see that everyone is like developing the same things in
+
+ 102
+ 00:10:50,079 --> 00:10:56,319
+ parallel like that's my kind of a weird thing to say but when I look I kind of came when I started
+
+ 103
+ 00:10:56,319 --> 00:11:02,640
+ working on this these prototypes on GitHub which is where I just kind of share very freely and loosely
+
+ 104
+ 00:11:03,199 --> 00:11:10,800
+ ideas and you know first iterations on concepts and for one of a better word I called it like
+
+ 105
+ 00:11:11,439 --> 00:11:17,680
+ LLM post processing or cleanup or basically a system prompt that after you get back the raw text
+
+ 106
+ 00:11:17,680 --> 00:11:25,920
+ from whisper you run it through model and say okay this is crappy text like add sentence structure
+
+ 107
+ 00:11:25,920 --> 00:11:33,199
+ and you know fix it up and now when I'm exploring the different tools that are out there the people
+
+ 108
+ 00:11:33,200 --> 00:11:39,040
+ have built I see quite a number of projects have basically you know done the same thing
+
+ 109
+ 00:11:40,640 --> 00:11:45,040
+ less that be misconstrued I'm not saying for a millisecond that I inspired them I'm sure this
+
+ 110
+ 00:11:45,040 --> 00:11:51,440
+ has been a thing that's been integrated into tools for a while but it's it's the kind of thing that
+
+ 111
+ 00:11:51,440 --> 00:11:57,520
+ when you start using these tools every day the need for it is almost instantly apparent because text
+
+ 112
+ 00:11:57,600 --> 00:12:03,520
+ that doesn't have any punctuation or progress basing takes a long time to you know it takes so
+
+ 113
+ 00:12:03,520 --> 00:12:10,079
+ long to get it into a presentable email that again it's it moves speech tech into that
+
+ 114
+ 00:12:11,280 --> 00:12:16,000
+ before that inflection point where you're like nah she's not worth it it's like it'll just be
+
+ 115
+ 00:12:16,000 --> 00:12:20,800
+ quicker to type this so it's it's a big it's a little touch that actually is a big deal
+
+ 116
+ 00:12:21,520 --> 00:12:28,319
+ so I was on whisper and I've been using whisper and I kind of early on find a couple of tools
+
+ 117
+ 00:12:28,319 --> 00:12:33,680
+ I couldn't find what I was looking for on Linux which is basically just something that'll run
+
+ 118
+ 00:12:34,800 --> 00:12:39,120
+ in the background you'll give it an API key and it'll just like transcribe
+
+ 119
+ 00:12:41,439 --> 00:12:47,359
+ with like a little key to start and start the dictation and the issues where I discovered that
+
+ 120
+ 00:12:47,440 --> 00:12:52,720
+ like most people involved in creating these projects were very much focused on local models
+
+ 121
+ 00:12:52,720 --> 00:12:58,400
+ and running whisper locally because you can and I tried that a bunch of times and just never
+
+ 122
+ 00:12:58,400 --> 00:13:03,920
+ got results that were as good as the cloud and when I began looking at the cost of the speech
+
+ 123
+ 00:13:03,920 --> 00:13:10,080
+ text API is what I was spending I just thought there is it's actually in my opinion just one of
+
+ 124
+ 00:13:10,080 --> 00:13:15,600
+ the better deals in API spending and in cloud like it's just not that expensive for very very good
+
+ 125
+ 00:13:15,600 --> 00:13:22,240
+ models that are much more you know you're going to be able to run the full model the latest model
+
+ 126
+ 00:13:22,240 --> 00:13:28,960
+ versus whatever you can run on your average GPU unless you want to buy a crazy GPU it doesn't
+
+ 127
+ 00:13:28,960 --> 00:13:34,000
+ really make sense to me and I privacy is another concern that I know is kind of like a very much
+
+ 128
+ 00:13:34,000 --> 00:13:38,720
+ a separate thing that people just don't want their voice data and their voice leaving their
+
+ 129
+ 00:13:38,720 --> 00:13:45,360
+ local environment maybe for regulatory reasons as well but I'm not in that I'm neither really
+
+ 130
+ 00:13:45,360 --> 00:13:51,440
+ care about people listening to my grocery list consisting of reminding myself that I need to buy
+
+ 131
+ 00:13:51,440 --> 00:13:58,240
+ more beer cheetos and hummus which is kind of the three three staples of my diet during periods of
+
+ 132
+ 00:13:58,240 --> 00:14:04,560
+ poor nutrition but the kind of stuff that I transcribe most it's just not it's not a it's not a
+
+ 133
+ 00:14:04,560 --> 00:14:12,640
+ privacy thing I'm that sort of sensitive about and I don't do anything so you know sensitive
+
+ 134
+ 00:14:12,640 --> 00:14:17,680
+ or secure that requires airgapping so I looked at the pricing and especially the kind of older
+
+ 135
+ 00:14:17,680 --> 00:14:24,400
+ models mini some of them are very very affordable and I did it back to the I did a calculation once
+
+ 136
+ 00:14:24,400 --> 00:14:30,239
+ with Chachi BT and I was like okay this is the this is the API price for I can't remember whatever
+
+ 137
+ 00:14:30,320 --> 00:14:37,040
+ the model was let's say I just go at it like nonstop which rarely happens probably I would say an
+
+ 138
+ 00:14:37,040 --> 00:14:45,200
+ average I might dictate 30 to 60 minutes per day if I was probably summing up the emails documents
+
+ 139
+ 00:14:45,200 --> 00:14:51,360
+ outlines which is a lot but it's it's still a fairly modest amount and I was like well some days I
+
+ 140
+ 00:14:51,360 --> 00:14:56,720
+ do go on like one or two days where I've been usually when I'm like kind of out of the house and
+
+ 141
+ 00:14:56,720 --> 00:15:02,800
+ just have something like I've nothing else to do like if I'm at a hospital we have a newborn
+
+ 142
+ 00:15:04,000 --> 00:15:09,040
+ and you're waiting for like eight hours and hours for an appointment and I would probably have
+
+ 143
+ 00:15:09,040 --> 00:15:15,280
+ listened to podcasts before becoming a speech fanatic and I'm like oh wait let me just get down
+
+ 144
+ 00:15:15,280 --> 00:15:20,880
+ let me just get these ideas out of my head and that's when I'll go on my speech spinges but those
+
+ 145
+ 00:15:20,880 --> 00:15:26,240
+ are like ones every few months like not frequently but I said okay let's just say if I'm gonna price
580
+
581
+ 146
582
+ 00:15:26,240 --> 00:15:35,440
583
+ out cloud sgt if I was like dedicated every second of every waking hour to transcribing for some
584
+
585
+ 147
586
+ 00:15:35,440 --> 00:15:41,600
587
+ odd reason I mean it have to like ease and use the toilet like you know there's only so many hours
588
+
589
+ 148
590
+ 00:15:41,600 --> 00:15:48,480
591
+ I'm awake for so like let's just say a maximum of like 40 hour 45 minutes in the hours and I said
592
+
593
+ 149
594
+ 00:15:48,480 --> 00:15:55,360
595
+ all right let's just say 50 who knows you're dictating on the toilet we do it so you could just do 60
596
+
597
+ 150
598
+ 00:15:55,440 --> 00:16:02,560
599
+ but whatever I did and every day like you're going flat out seven days a week dictating nonstop
600
+
601
+ 151
602
+ 00:16:02,560 --> 00:16:08,640
603
+ as like what's my monthly API bill gonna be at this price and it came out to like seven to your
604
+
605
+ 152
606
+ 00:16:08,640 --> 00:16:14,960
607
+ 80 bucks and I was like well that would be an extraordinary amount of dictation and I would hope
608
+
609
+ 153
610
+ 00:16:15,600 --> 00:16:21,680
611
+ that there was some compelling reason more worth more than 70 dollars that I embarked upon that
612
+
613
+ 154
614
+ 00:16:22,640 --> 00:16:26,959
615
+ so given that that's kind of the max point for me I said that's actually very very affordable
616
+
617
+ 155
618
+ 00:16:27,920 --> 00:16:32,640
619
+ now you're gonna if you want to spec out the costs and you want to do the post processing
620
+
621
+ 156
622
+ 00:16:33,599 --> 00:16:39,199
623
+ that I really do feel as valuable that's gonna cost more as well on a less you're using
624
+
625
+ 157
626
+ 00:16:40,160 --> 00:16:47,839
627
+ Gemini which needless to say is a random person sitting in Jerusalem I have no affiliation nor with
628
+
629
+ 158
630
+ 00:16:47,840 --> 00:16:54,800
631
+ Google nor anthropic nor Gemini nor any major tech vendor for that matter um I like Gemini
632
+
633
+ 159
634
+ 00:16:54,800 --> 00:17:00,080
635
+ not so much as a everyday model um it's kind of underwhelmed in that respect I would say
636
+
637
+ 160
638
+ 00:17:00,080 --> 00:17:05,920
639
+ but for multimodal I think it's got a lot to offer and I think that the transcribing functionality
640
+
641
+ 161
642
+ 00:17:05,920 --> 00:17:13,280
643
+ whereby it can um process audio with the system prompt and both give you a transgression that's
644
+
645
+ 162
646
+ 00:17:13,280 --> 00:17:20,079
647
+ cleaned up that reduces two steps to one and that for me is a very very big deal and uh I feel like
648
+
649
+ 163
650
+ 00:17:20,079 --> 00:17:27,280
651
+ even Google hasn't really sort of thought through how useful the that modality is more kind of
652
+
653
+ 164
654
+ 00:17:27,280 --> 00:17:33,280
655
+ use cases uh you can achieve with it because I found in the course of this year just an endless
656
+
657
+ 165
658
+ 00:17:33,280 --> 00:17:40,399
659
+ list of really kind of system prompt system prompt stuff that I can say okay I've used it
660
+
661
+ 166
662
+ 00:17:40,560 --> 00:17:45,920
663
+ for a capture context data for AI which is literally I might speak for if I wanted to have a good
664
+
665
+ 167
666
+ 00:17:45,920 --> 00:17:52,560
667
+ bank of context data about who knows my childhood uh more realistically maybe my career goals
668
+
669
+ 168
670
+ 00:17:53,520 --> 00:17:59,520
671
+ something that would just be like really boring to type out so I'll just like sit in my car
672
+
673
+ 169
674
+ 00:17:59,520 --> 00:18:06,640
675
+ and record it for 10 minutes and that's 10 minutes you get a lot of information in um emails which is
676
+
677
+ 170
678
+ 00:18:06,640 --> 00:18:13,200
679
+ short text uh just there is a whole bunch and all these workflows kind of require a little bit
680
+
681
+ 171
682
+ 00:18:13,200 --> 00:18:18,320
683
+ of treatment afterwards and different treatment my context pipeline is kind of like just extract the
684
+
685
+ 172
686
+ 00:18:18,320 --> 00:18:23,520
687
+ bare essential so you end up with me talking very loosely about sort of what I've done in my career
688
+
689
+ 173
690
+ 00:18:23,520 --> 00:18:30,000
691
+ where I've worked where my light to work and it goes it condenses that down to very robotic language
692
+
693
+ 174
694
+ 00:18:30,000 --> 00:18:36,000
695
+ that is easy to chunk parts and maybe put into a vector database Daniel has worked in technology
696
+
697
+ 175
698
+ 00:18:36,080 --> 00:18:42,400
699
+ Daniel is a has been working in martino stuff like that that's not how you would speak um but I
700
+
701
+ 176
702
+ 00:18:42,400 --> 00:18:48,480
703
+ figure it's probably easier to parse for after all robots so we've almost got to 20 minutes and this
704
+
705
+ 177
706
+ 00:18:48,480 --> 00:18:56,880
707
+ is actually a success because I waste 20 minutes of my uh of the evening speaking into microphone and
708
+
709
+ 178
710
+ 00:18:56,880 --> 00:19:02,720
711
+ the levels were shot and it uh it was clipping and I said I can't read you an evaluation I have to
712
+
713
+ 179
714
+ 00:19:02,720 --> 00:19:09,440
715
+ be fair I have to give the models a chance to do their thing uh what am I hoping to achieve in this
716
+
717
+ 180
718
+ 00:19:09,440 --> 00:19:14,960
719
+ okay my fine shun was a dud as mentioned deep gram sdt I'm really really hopeful that this prototype
720
+
721
+ 181
722
+ 00:19:14,960 --> 00:19:20,560
723
+ will work and it's a built in public open source so anyone is welcome to use it if I make anything good
724
+
725
+ 182
726
+ 00:19:21,600 --> 00:19:28,000
727
+ but that was really exciting for me last night when after hours of um try my own prototype seeing
728
+
729
+ 183
730
+ 00:19:28,080 --> 00:19:33,120
731
+ someone just made something that works like that you know you're not going to have to build a custom
732
+
733
+ 184
734
+ 00:19:34,240 --> 00:19:40,960
735
+ condo environment and image I have AMD GPU which makes things much more complicated I didn't find it
736
+
737
+ 185
738
+ 00:19:41,840 --> 00:19:46,400
739
+ and I was about to give up and I said all right let me just give deep grams Linux thing a shot
740
+
741
+ 186
742
+ 00:19:47,040 --> 00:19:50,960
743
+ and if this doesn't work um I'm just going to go back to trying to vibe code something myself
744
+
745
+ 187
746
+ 00:19:51,600 --> 00:19:57,360
747
+ and when I ran the script I was using cloud code to do the installation process
748
+
749
+ 188
750
+ 00:19:58,160 --> 00:20:02,800
751
+ it ran the script and oh my gosh it works just like that uh the tricky thing
752
+
753
+ 189
754
+ 00:20:04,480 --> 00:20:12,480
755
+ for all those ones and all the nitty-ditty-ditty-gritty details um was that I don't think it was actually
756
+
757
+ 190
758
+ 00:20:12,480 --> 00:20:18,160
759
+ struggling with transcription but pasting wailant makes life very hard and I think there was
760
+
761
+ 191
762
+ 00:20:18,160 --> 00:20:22,800
763
+ something not running at the right time anyway deep gram I looked at how they actually handled
764
+
765
+ 192
766
+ 00:20:22,960 --> 00:20:28,960
767
+ that because it worked out in the box when other stuff didn't and it was quite a clever little mechanism
768
+
769
+ 193
770
+ 00:20:29,520 --> 00:20:34,560
771
+ and but more so than that the accuracy was brilliant now what am I doing here this is going to be a 20
772
+
773
+ 194
774
+ 00:20:34,560 --> 00:20:44,399
775
+ minute audio uh sample and I'm I think I've done one or two of these before but I did it with
776
+
777
+ 195
778
+ 00:20:45,360 --> 00:20:51,120
779
+ sure snappy voice notes this is kind of long form this actually might be a better approximation
780
+
781
+ 196
782
+ 00:20:51,120 --> 00:20:55,040
783
+ for what's useful to me than voice memos like I need to buy three
784
+
785
+ 197
786
+ 00:20:55,840 --> 00:20:59,840
787
+ beaters of moat tomorrow and peter bread which is probably how like half my voice note
788
+
789
+ 198
790
+ 00:20:59,840 --> 00:21:04,399
791
+ voice notes sound like if anyone were to I don't know like find my phone they'd be like this is
792
+
793
+ 199
794
+ 00:21:04,399 --> 00:21:09,280
795
+ the most boring person in the world although actually there are some like kind of uh journaling
796
+
797
+ 200
798
+ 00:21:09,280 --> 00:21:14,080
799
+ thoughts as well but it's a lot of content like that and the probably for the evaluation the most
800
+
801
+ 201
802
+ 00:21:14,080 --> 00:21:22,560
803
+ useful thing is slightly obscure tech github new cleano hugging face not so obscure that it's not
804
+
805
+ 202
806
+ 00:21:22,560 --> 00:21:27,360
807
+ going to have a chance of knowing it but hopefully sufficiently well known that the models should get
808
+
809
+ 203
810
+ 00:21:27,360 --> 00:21:32,800
811
+ it uh I tried to do a little bit of speaking really fast and speaking very slowly I would say in
812
+
813
+ 204
814
+ 00:21:32,800 --> 00:21:38,960
815
+ general I've spoken deliver this at a faster pace than I usually would go into strong coffee
816
+
817
+ 205
818
+ 00:21:39,120 --> 00:21:44,240
819
+ flowing through my bloodstream and the thing that I'm not going to get into spanish mark is
820
+
821
+ 206
822
+ 00:21:44,240 --> 00:21:49,920
823
+ background noise which is my first take that I had to get rid of my wife come in with my son
824
+
825
+ 207
826
+ 00:21:49,920 --> 00:21:55,680
827
+ and for a good night kiss and that actually would have been super helpful to get in because it was
828
+
829
+ 208
830
+ 00:21:56,400 --> 00:22:01,600
831
+ non-diarray sorry if we had diarrayization a female I could say I want the male voice and that
832
+
833
+ 209
834
+ 00:22:01,600 --> 00:22:06,240
835
+ wasn't intended for transcription um and we're not going to get background noise like people
836
+
837
+ 210
838
+ 00:22:06,240 --> 00:22:11,840
839
+ hunking their horns which is something I've done in my main data set where I am trying to go back
840
+
841
+ 211
842
+ 00:22:11,840 --> 00:22:16,880
843
+ to some of my voice notes annotate them and run a benchmark but this is going to be just a pure
844
+
845
+ 212
846
+ 00:22:17,680 --> 00:22:24,960
847
+ quick test and as someone I'm working on a voice note idea that's my sort of end
848
+
849
+ 213
850
+ 00:22:26,560 --> 00:22:30,320
851
+ motivation besides thinking it's an absolute outstanding technology that's coming to
852
+
853
+ 214
854
+ 00:22:30,960 --> 00:22:36,240
855
+ viability and really I know the same as cheesy can actually have a very transformative effect
856
+
857
+ 215
858
+ 00:22:37,120 --> 00:22:42,720
859
+ it's you know voice technology has been life changing for folks living with
860
+
861
+ 216
862
+ 00:22:44,000 --> 00:22:49,760
863
+ disabilities and I think there's something really nice about the fact that it can also benefit
864
+
865
+ 217
866
+ 00:22:50,480 --> 00:22:54,639
867
+ you know folks who are able bodies and like we can all in different ways
868
+
869
+ 218
870
+ 00:22:55,120 --> 00:23:02,560
871
+ um make this tech as useful as possible regardless of the exact way that we're using it um and I
872
+
873
+ 219
874
+ 00:23:02,560 --> 00:23:07,760
875
+ think there's something very powerful in that and it can be very cool um I see huge potential what
876
+
877
+ 220
878
+ 00:23:07,760 --> 00:23:14,480
879
+ excites me about voice tech a lot of things actually firstly the fact that it's cheap and accurate
880
+
881
+ 221
882
+ 00:23:14,480 --> 00:23:19,040
883
+ as I mentioned at the very start of this um and it's getting better and better with stuff like
884
+
885
+ 222
886
+ 00:23:19,040 --> 00:23:24,160
887
+ accent handling um I'm not sure my fight my fine tune will actually ever come to fruition in the
888
+
889
+ 223
890
+ 00:23:24,160 --> 00:23:30,240
891
+ sense that I'll use it day to day as I imagine and get likes you per flawless words error rates because
892
+
893
+ 224
894
+ 00:23:30,240 --> 00:23:37,680
895
+ I'm just kind of skeptical about local speech attacks as I mentioned and I think the pace of
896
+
897
+ 225
898
+ 00:23:37,680 --> 00:23:42,720
899
+ innovation and improvement in the models the main reasons for fine tuning from what I've seen
900
+
901
+ 226
902
+ 00:23:44,320 --> 00:23:50,480
903
+ have been people who are something that really blows blows my mind about ASR is the idea that it's
904
+
905
+ 227
906
+ 00:23:50,480 --> 00:24:00,080
907
+ inherently ailing you or multilingual phonetic based so as folks who use speak very obscure languages
908
+
909
+ 228
910
+ 00:24:00,080 --> 00:24:04,800
911
+ that there may be there there might be a positive training data or almost none at all and therefore
912
+
913
+ 229
914
+ 00:24:04,800 --> 00:24:11,440
915
+ the accuracy is significantly reduced or folks in very critical environments I know there are
916
+
917
+ 230
918
+ 00:24:11,440 --> 00:24:17,680
919
+ you this is used extensively in medical transcription and dispatch your work as um you know the call
920
+
921
+ 231
922
+ 00:24:17,680 --> 00:24:24,000
923
+ sentries who send out ambulances etc where accuracy is absolutely paramount and in the case of doctors
924
+
925
+ 232
926
+ 00:24:24,560 --> 00:24:29,680
927
+ radiologists they might be using very specialized vocab all the time so those are kind of the main
928
+
929
+ 233
930
+ 00:24:29,680 --> 00:24:35,680
931
+ two things and I'm not sure that really just for trying to make it better on a few random tech words
932
+
933
+ 234
934
+ 00:24:35,680 --> 00:24:41,840
935
+ with my slightly I mean I have an accent but like not you know an accent that a few other million
936
+
937
+ 235
938
+ 00:24:41,840 --> 00:24:50,720
939
+ people have ish I'm not sure that my little fine tune is going to actually like the bump in
940
+
941
+ 236
942
+ 00:24:50,720 --> 00:24:55,760
943
+ word error reduction if I ever actually figure out how to do it and get it up to the cloud by the
944
+
945
+ 237
946
+ 00:24:55,760 --> 00:25:00,879
947
+ time we've done that I suspect that the next generation of ASR will just be so good that it will
948
+
949
+ 238
950
+ 00:25:00,879 --> 00:25:07,040
951
+ kind of be well that would be cool for a dive but I'll just use this instead so that's going to be
952
+
953
+ 239
954
+ 00:25:07,280 --> 00:25:15,040
955
+ is for today's episodes of voice training data single long shot evaluation who am I going to
956
+
957
+ 240
958
+ 00:25:15,040 --> 00:25:21,200
959
+ compare whisper is always good as a benchmark but I'm more interested in seeing whisper head-to-head
960
+
961
+ 241
962
+ 00:25:21,200 --> 00:25:27,680
963
+ with two things really one is whisper variance so you've got these projects like faster whisper
964
+
965
+ 242
966
+ 00:25:29,120 --> 00:25:34,000
967
+ distilled whisper it's a bit confusing there's a whole bunch of them and the emerging ASRs which
968
+
969
+ 243
970
+ 00:25:34,160 --> 00:25:38,960
971
+ are also a thing my intention for this is I'm not sure I'm going to have the time in any point
972
+
973
+ 244
974
+ 00:25:38,960 --> 00:25:46,320
975
+ of the foreseeable future to go back to this whole episode and create a proper source truth or I fix
976
+
977
+ 245
978
+ 00:25:47,520 --> 00:25:53,760
979
+ everything might do it if I can get one transcriptions that's sufficiently close to perfection
980
+
981
+ 246
982
+ 00:25:54,960 --> 00:26:00,560
983
+ but what I would actually love to do on hogging face I think would be a great probably how I might
984
+
985
+ 247
986
+ 00:26:00,560 --> 00:26:08,080
987
+ visualize this is having the audio waveform play and then have the transcript for each model below
988
+
989
+ 248
990
+ 00:26:08,080 --> 00:26:16,320
991
+ it and maybe even a like you know two scale and maybe even a local one as well like local whisper
992
+
993
+ 249
994
+ 00:26:16,320 --> 00:26:23,919
995
+ versus open AI API etc and I can then actually listen back to segments or anyone who wants to
996
+
997
+ 250
998
+ 00:26:24,000 --> 00:26:30,000
999
+ can listen back to segments of this recording and see where a particular model to struggle
1000
+
1001
+ 251
1002
+ 00:26:30,000 --> 00:26:35,600
1003
+ with others didn't as well as the sort of headline finding of which had the best WER but that would
1004
+
1005
+ 252
1006
+ 00:26:35,600 --> 00:26:41,120
1007
+ require the source of truth okay that's it hope this was I don't know maybe useful for other
1008
+
1009
+ 253
1010
+ 00:26:41,120 --> 00:26:46,480
1011
+ folks interested in STT you want to see that I always feel think I've just said as something I
1012
+
1013
+ 254
1014
+ 00:26:46,480 --> 00:26:52,800
1015
+ didn't intend to STT I said for those isn't carefully including hopefully the models themselves
1016
+
1017
+ 255
1018
+ 00:26:53,280 --> 00:26:58,960
1019
+ this has been myself Daniel Rosal for more um jumbled repositories about my uh roving interests
1020
+
1021
+ 256
1022
+ 00:26:58,960 --> 00:27:06,639
1023
+ in AI but particularly agentic mcp and voice tech you can find me on github hogging face
1024
+
1025
+ 257
1026
+ 00:27:08,080 --> 00:27:14,000
1027
+ where else daniel rosal dot com which is my personal website as well as this podcast whose name
1028
+
1029
+ 258
1030
+ 00:27:14,000 --> 00:27:17,280
1031
+ I sadly cannot remember until next time thanks for listening
1032
+
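The back-of-the-envelope cost reasoning in the episode (30 to 60 dictated minutes on a typical day, a theoretical ceiling of roughly 45 transcribed minutes out of every waking hour, landing around $70-80/month) can be sketched in a few lines. The per-minute price and the 16-waking-hours figure below are illustrative assumptions, not numbers from the episode; substitute your provider's actual rate.

```python
# Back-of-the-envelope cloud STT cost model mirroring the episode's reasoning.
# PRICE_PER_MINUTE is an assumed rate for illustration only; the 16 waking
# hours/day figure is also an assumption.

PRICE_PER_MINUTE = 0.006  # USD per audio minute (hypothetical)

def monthly_cost(minutes_per_day: float, days: int = 30) -> float:
    """Dollars spent transcribing minutes_per_day, every day, for a month."""
    return minutes_per_day * days * PRICE_PER_MINUTE

# Typical usage described in the episode: 30-60 minutes of dictation per day.
typical_low = monthly_cost(30)   # 30 * 30 * 0.006 = 5.4
typical_high = monthly_cost(60)  # 60 * 30 * 0.006 = 10.8

# Absurd ceiling: ~45 dictated minutes of each of 16 waking hours, daily.
flat_out = monthly_cost(45 * 16)

print(f"typical month: ${typical_low:.2f}-${typical_high:.2f}")
print(f"flat-out ceiling: ${flat_out:.2f}")

# The episode's ~$75/month ceiling would imply a cheaper per-minute rate:
implied_rate = 75 / (45 * 16 * 30)  # dollars per minute
```

Even with the deliberately absurd ceiling, the bill stays in the tens of dollars, which is the point the episode is making about cloud STT being one of the better deals in API spending.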
data/inference/runs/local-stt/run-1/transcript.txt ADDED
@@ -0,0 +1,7 @@
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast. Or, I may append this to a podcast that I set up recently regarding my with my thoughts on speech tech and AI in particular. More AI in generative AI I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation as they might say for different speech attacks models. And I'm doing this because I I thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in the elusive task of fine-tuning whisper. Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking. I might whisper something at some points as well. And I'll go back to speaking loud in different parts. I'm going to send really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to push a speech attacks model through its paces, which is trying to make sense of is this guy just rambling on and coherently in one long sentence or are these just actually series of step standalone standalone sentences? And how is it going to handle step alone? That's not a word. What happens when you use speech attacks and you use a fake word and then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why did why was it trying to find China whisper? And what is whisper? As I said, I'm going to try to record this at a couple of different levels of technicality for folks who are in the normal world and not totally stuck down the rabbit hole of AI. What you have to say is a really wonderful rabbit hole to be to be done. 
It's a really interesting area and speech and voice tech is the aspect of it that I find actually most I'm not sure I would say the most interesting because there's just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I am persevering hard with the task of training, I guess, a good solution working for Linux. Would you have anyone actually does listen to this not just for the training data and for the actual content? This is this is sparked. I had, besides the fine tune not working, well that was the failure. I used plot code because one thing these days that there is nothing short of solving, you know, the reason of life or something that's flawed and agentically I can't do, which is not really the case. It does seem that way sometimes but it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data, basically speaking just random things for three minutes and
+
+ it was actually kind of tedious because the text were really weird. Some of them were it was like, it was AI generated. I tried before to reach Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was like, okay, knowing just gonna have to find something else to read. So I used a created with AI Studio, a vibe code is a synthetic text generator, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note. Like I'm recording an email, give me a short story to read, give me pros to read. So I came up with all these different things and they added a little timer to it so I could see how close I was to one hour and I spent like an hour one afternoon or probably two hours by the time you do retakes and whatever because you want to, it gave me a source of truth which I'm not sure if that's the scientific way to approach this. Topic of gathering training data but I thought made sense. I have a lot of audio data from recording voice notes which I've also kind of used being experimenting with using for a different purpose. It's slightly different annotating task types. It's more text classification experiment or well it's more than that actually working on a voice app so it's a prototype I guess is really more accurate.
+
+ But you can do that and you can work backwards. You listen back to a voice note and you painfully go through one of those transcribing where you start and stop and scrub around it and you fix the errors but it's really really pouring to do that. So I thought it would be last tedious in the long term if I just recorded this source of truth so it gave me these three minute snippets. I recorded them it saved in MP3 and the TXT in the same folder and I created an error that data. So I was very hopeful quite a little bit hopeful that I would be able that I could actually fine tune whisper. I want to fine tune whisper because when I got into voice tech last November my wife was in the US and I was alone at home and when crazy people like me do really wild things like use voice to tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high. I used speech tech now and again try it out. It's like it'd be really cool if you could just like speak into your computer and whatever I tried out that had Linux support was just it was not good basically and this blew me away from the first go. I mean it wasn't 100% accurate out of the box and it took work but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. There's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for it's speech attacks to be a worthwhile addition to your productivity but you do need to get above let's say I don't know 85% if it's 60% or 50% you inevitably say screw it I'll just type it because you end up missing errors in the transcript and it becomes actually worse you end up in a worse position than you started with it that's been my experience. So I was like oh this is actually really really good now how did that happen? 
The answer is ASR with per being open-sourced and the transformer architecture if you want to go back to the to the underpinnings which really blows my mind and it's on my list to read through that paper all you need is attention as attentively as can be done with my limited brain because it's super high-level stuff super advanced stuff I mean but that I think of all the things that are fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating a few people are like hang on you've got this thing that can speak to you like a chatbot and LLM then you've got image generation okay so firstly those two things on the surface have nothing in common so like how are they how did that just happen all at the same time and then when you extend that further you're like sooner right you can sing a song an AI will like come up with an instrumental and then you've got whisper and you're like wait a second how did all this stuff like if it's all AI what's like there has to be some commonality otherwise these are four these totally different technologies on the surface of it and the transformer architecture is as far as I know the answer and I can't even say I can't even pretend that I really understand what the transformer architecture means in depth but I have scandis and as I said I want to print it and really kind of think over it's at some point and I'll probably feel bad about myself I think because weren't those guys in their in their 20s like that's crazy I think I asked chat Gbt once who were the who wrote that paper and how old were they when it was published in ARC and I was expecting like I don't know what do you what do you imagine I personally imagine kind of like you know you have these breakthroughs during COVID and things like that were like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring labs and we're really in writing and publishing and kind of obscure academic publications 
and they finally like hit a big or win a Nobel Prize and then their household household names so I that was kind of what I had in mind that was the mental image I'd formed of the birth of ARC like I wasn't expecting 20 somethings in San Francisco though I thought that was both very very funny very cool and actually kind of inspiring it's nice to think that people who you know just you might put them in the kind of milieu or bubble or world that you are in are credibly in through you know the series of connections that are coming up with such literally world changing innovations so that was I thought anyway that that that was cool okay voice training data how were we doing we're about 10 minutes and I'm still talking about voice technology so whisper was brilliant and I was so excited that I was my first instinct was to like guess it's like oh my gosh I have to get like a really good microphone for this so I didn't go on a spending spree because I said I'm gonna have to just wait a month and see if I still use this and it just kind of became it's become really part of my daily routine like if I'm writing an email I'll record a voice note and then I've developed and it's nice to see that everyone is like developing the same things in parallel like that's my kind of a weird thing to say but when I look I kind of came when I started working on this these prototypes on GitHub which is where I just kind of share very freely and loosely ideas and you know first iterations on concepts and for one of a better word I called it like LLM post processing or cleanup or basically a system prompt that after you get back the raw text from whisper you run it through model and say okay this is crappy text like add sentence structure and you know fix it up and now when I'm exploring the different tools that are out there the people have built I see quite a number of projects have basically you know done the same thing less that be misconstrued I'm not saying for a millisecond that I 
inspired them I'm sure this has been a thing that's been integrated into tools for a while but it's it's the kind of thing that when you start using these tools every day the need for it is almost instantly apparent because text that doesn't have any punctuation or progress basing takes a long time to you know it takes so long to get it into a presentable email that again it's it moves speech tech into that before that inflection point where you're like nah she's not worth it it's like it'll just be quicker to type this so it's it's a big it's a little touch that actually is a big deal so I was on whisper and I've been using whisper and I kind of early on find a couple of tools I couldn't find what I was looking for on Linux which is basically just something that'll run in the background you'll give it an API key and it'll just like transcribe
+
+ with like a little key to start and start the dictation and the issues where I discovered that like most people involved in creating these projects were very much focused on local models and running whisper locally because you can and I tried that a bunch of times and just never got results that were as good as the cloud and when I began looking at the cost of the speech text API is what I was spending I just thought there is it's actually in my opinion just one of the better deals in API spending and in cloud like it's just not that expensive for very very good models that are much more you know you're going to be able to run the full model the latest model versus whatever you can run on your average GPU unless you want to buy a crazy GPU it doesn't really make sense to me and I privacy is another concern that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment maybe for regulatory reasons as well but I'm not in that I'm neither really care about people listening to my grocery list consisting of reminding myself that I need to buy more beer cheetos and hummus which is kind of the three three staples of my diet during periods of poor nutrition but the kind of stuff that I transcribe most it's just not it's not a it's not a privacy thing I'm that sort of sensitive about and I don't do anything so you know sensitive or secure that requires airgapping so I looked at the pricing and especially the kind of older models mini some of them are very very affordable and I did it back to the I did a calculation once with Chachi BT and I was like okay this is the this is the API price for I can't remember whatever the model was let's say I just go at it like nonstop which rarely happens probably I would say an average I might dictate 30 to 60 minutes per day if I was probably summing up the emails documents outlines which is a lot but it's it's still a fairly modest amount and I was 
like well some days I do go on like one or two days usually when I'm kind of out of the house and just have nothing else to do like if I'm at a hospital we have a newborn and you're waiting for like hours and hours for an appointment and I would probably have listened to podcasts before becoming a speech fanatic and I'm like oh wait let me just get these ideas out of my head and that's when I'll go on my speech binges but those are like once every few months like not frequently but I said okay let's just price out cloud STT if I was like dedicated every second of every waking hour to transcribing for some odd reason I mean I'd have to like eat and use the toilet like you know there's only so many hours I'm awake for so let's just say a maximum of like 40 or 45 minutes in the hour and I said all right let's just say 50 who knows if you're dictating on the toilet you could just do 60 but whatever I did and every day like you're going flat out seven days a week dictating nonstop as in what's my monthly API bill gonna be at this price and it came out to like seventy or 80 bucks and I was like well that would be an extraordinary amount of dictation and I would hope that there was some compelling reason worth more than 70 dollars that I embarked upon that so given that that's kind of the max point for me I said that's actually very very affordable now if you want to spec out the costs and you want to do the post processing that I really do feel is valuable that's gonna cost more as well unless you're using Gemini which needless to say as a random person sitting in Jerusalem I have no affiliation with Google nor Anthropic nor Gemini nor any major tech vendor for that matter um I like Gemini not so much as an everyday model um it's kind of underwhelming in that respect I would say but for multimodal I think it's got a lot to offer and I think 
that the transcribing functionality whereby it can um process audio with a system prompt and give you a transcription that's cleaned up reduces two steps to one and that for me is a very very big deal and uh I feel like even Google hasn't really sort of thought through how useful that modality is and how many kinds of use cases uh you can achieve with it because I found in the course of this year just an endless list of really kind of system prompt stuff where I can say okay I've used it to capture context data for AI which is literally I might speak for if I wanted to have a good bank of context data about who knows my childhood uh more realistically maybe my career goals something that would just be like really boring to type out so I'll just like sit in my car and record it for 10 minutes and that's 10 minutes you get a lot of information in um emails which is short text uh there is a whole bunch and all these workflows kind of require a little bit of treatment afterwards and different treatment my context pipeline is kind of like just extract the bare essentials so you end up with me talking very loosely about sort of what I've done in my career where I've worked where I'd like to work and it condenses that down to very robotic language that is easy to chunk and parse and maybe put into a vector database Daniel has worked in technology Daniel has been working in marketing stuff like that that's not how you would speak um but I figure it's probably easier to parse for after all robots so we've almost got to 20 minutes and this is actually a success because I wasted 20 minutes of my uh of the evening speaking into a microphone and the levels were shot and it uh it was clipping and I said I can't use that for an evaluation I have to be fair I have to give the models a chance to do their thing uh what am I hoping to achieve in this okay my fine-tune was a dud as mentioned Deepgram STT I'm really really hopeful that this prototype will 
work and it's built in public open source so anyone is welcome to use it if I make anything good but that was really exciting for me last night when after hours of um trying my own prototype seeing someone had just made something that works like that you know you're not going to have to build a custom conda environment and image I have an AMD GPU which makes things much more complicated I didn't find it and I was about to give up and I said all right let me just give Deepgram's Linux thing a shot and if this doesn't work um I'm just going to go back to trying to vibe code something myself and when I ran the script I was using Claude Code to do the installation process it ran the script and oh my gosh it works just like that uh the tricky thing for all those ones and all the nitty-gritty details um was that I don't think it was actually struggling with transcription but pasting on Wayland makes life very hard and I think there was something not running at the right time anyway Deepgram I looked at how they actually handled that because it worked out of the box when other stuff didn't and it was quite a clever little mechanism but more so than that the accuracy was brilliant now what am I doing here this is going to be a 20 minute audio uh sample and I think I've done one or two of these before but I did it with short snappy voice notes this is kind of long form this actually might be a better approximation for what's useful to me than voice memos like I need to buy three liters of milk tomorrow and pita bread which is probably how like half my voice notes sound like if anyone were to I don't know like find my phone they'd be like this is the most boring person in the world although actually there are some like kind of uh journaling thoughts as well but it's a lot of content like that and probably for the evaluation the most useful thing is slightly obscure tech GitHub Nuclino Hugging Face not so obscure that it's not going to have a 
chance of knowing it but hopefully sufficiently well known that the models should get it uh I tried to do a little bit of speaking really fast and speaking very slowly I would say in general I've delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream and the thing that I'm not going to get in this benchmark is background noise in my first take I had to get rid of my wife coming in with my son for a good night kiss and that actually would have been super helpful to get in because it was non-diarized sorry if we had diarization a female voice I could say I want the male voice and that wasn't intended for transcription um and we're not going to get background noise like people honking their horns which is something I've done in my main data set where I am trying to go back to some of my voice notes annotate them and run a benchmark but this is going to be just a pure quick test and as I said I'm working on a voice note idea that's my sort of end motivation besides thinking it's an absolutely outstanding technology that's coming to viability and really I know it sounds cheesy but it can actually have a very transformative effect it's you know voice technology has been life changing for folks living with disabilities and I think there's something really nice about the fact that it can also benefit you know folks who are able-bodied and like we can all in different ways um make this tech as useful as possible regardless of the exact way that we're using it um and I think there's something very powerful in that and it can be very cool um I see huge potential what excites me about voice tech a lot of things actually firstly the fact that it's cheap and accurate as I mentioned at the very start of this um and it's getting better and better with stuff like accent handling um I'm not sure my fine-tune will actually ever come to fruition in the sense that I'll use it day to day as I imagine and get like 
super flawless word error rates because I'm just kind of skeptical about local speech-to-text as I mentioned and I think the pace of innovation and improvement in the models the main reasons for fine tuning from what I've seen have been people who something that really blows my mind about ASR is the idea that it's inherently you know multilingual phonetic based so folks who speak very obscure languages where there might be a paucity of training data or almost none at all and therefore the accuracy is significantly reduced or folks in very critical environments I know this is used extensively in medical transcription and dispatcher work as um you know the call centers who send out ambulances etc where accuracy is absolutely paramount and in the case of doctors radiologists they might be using very specialized vocab all the time so those are kind of the main two things and I'm not sure that really just for trying to make it better on a few random tech words with my slightly I mean I have an accent but like not you know an accent that a few other million people have ish I'm not sure that my little fine-tune is going to actually deliver the bump in word error reduction if I ever actually figure out how to do it and get it up to the cloud by the time we've done that I suspect that the next generation of ASR will just be so good that it will kind of be well that would be cool for a day but I'll just use this instead so that's going to be it for today's episode of voice training data single long shot evaluation who am I going to compare whisper is always good as a benchmark but I'm more interested in seeing whisper head-to-head with two things really one is whisper variants so you've got these projects like faster-whisper distil-whisper it's a bit confusing there's a whole bunch of them and the emerging ASRs which are also a thing my intention for this is I'm not sure I'm going to have the time at any point in the 
foreseeable future to go back to this whole episode and create a proper source of truth where I fix everything might do it if I can get one transcription that's sufficiently close to perfection but what I would actually love to do on Hugging Face I think would be great probably how I might visualize this is having the audio waveform play and then have the transcript for each model below it and maybe even a like you know toggle and maybe even a local one as well like local whisper versus OpenAI API etc and I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled while others didn't as well as the sort of headline finding of which had the best WER but that would require the source of truth okay that's it hope this was I don't know maybe useful for other folks interested in STT you see there I always think I've just said something I didn't intend to STT I said for those listening carefully including hopefully the models themselves this has been myself Daniel Rosehill for more um jumbled repositories about my uh roving interests in AI but particularly agentic AI MCP and voice tech you can find me on GitHub Hugging Face where else danielrosehill.com which is my personal website as well as this podcast whose name I sadly cannot remember until next time thanks for listening
data/inference/runs/local-stt/run-2/whisper-tiny.srt ADDED
@@ -0,0 +1,1024 @@
1
+ 1
2
+ 00:00:00,000 --> 00:00:08,000
3
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast.
4
+
5
+ 2
6
+ 00:00:08,640 --> 00:00:16,000
7
+ Or, I may append this to a podcast that I set up recently regarding my
8
+
9
+ 3
10
+ 00:00:16,640 --> 00:00:26,000
11
+ with my thoughts on speech tech and AI in particular. More AI in generative AI, I would say.
12
+
13
+ 4
14
+ 00:00:26,720 --> 00:00:34,000
15
+ But in any event, the purpose of this voice recording is actually to create a lengthy voice
16
+
17
+ 5
18
+ 00:00:34,000 --> 00:00:39,840
19
+ sample for a quick evaluation of back of the envelope evaluation as they might say for
20
+
21
+ 6
22
+ 00:00:39,840 --> 00:00:44,240
23
+ different speech attacks models. And I'm doing this because I thought I had made a great breakthrough
24
+
25
+ 7
26
+ 00:00:44,240 --> 00:00:50,960
27
+ in my journey with speech tech and that was succeeding in the elusive task of fine-tuning whisper.
28
+
29
+ 8
30
+ 00:00:51,600 --> 00:00:58,800
31
+ Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different
32
+
33
+ 9
34
+ 00:00:59,360 --> 00:01:04,000
35
+ styles of speaking, I might whisper something at some points as well. And I'll go back to
36
+
37
+ 10
38
+ 00:01:04,000 --> 00:01:09,600
39
+ speaking loud in a different part. So I'm going to send really like a crazy person because I'm also
40
+
41
+ 11
42
+ 00:01:09,600 --> 00:01:17,520
43
+ going to try to speak at different pitches and cadences in order to really try to put
44
+
45
+ 12
46
+ 00:01:18,399 --> 00:01:23,280
47
+ a speech attacks model through its pieces, which is trying to make sense of, is this guy just
48
+
49
+ 13
50
+ 00:01:24,000 --> 00:01:30,880
51
+ ramling on and coherently in one long sentence or are these just actually a series of
52
+
53
+ 14
54
+ 00:01:32,960 --> 00:01:37,440
55
+ step of standalone, standalone, standalone sentences. And how is it going to handle
56
+
57
+ 15
58
+ 00:01:37,440 --> 00:01:43,280
59
+ step alone? That's not a word. What happens when you use speech attacks and you use a fake word.
60
+
61
+ 16
62
+ 00:01:43,280 --> 00:01:48,640
63
+ And then you're like, wait, that's not actually that word doesn't exist. How does AI handle that? And
64
+
65
+ 17
66
+ 00:01:49,520 --> 00:01:56,160
67
+ these and more are all the questions that I'm seeking to answer in this training data. Now,
68
+
69
+ 18
70
+ 00:01:56,160 --> 00:02:00,960
71
+ why was it trying to find you to whisper? And what is whisper, as I said, I'm going to try to
72
+
73
+ 19
74
+ 00:02:02,160 --> 00:02:08,240
75
+ record this at a couple of different levels of technicality for folks who are in the normal
76
+
77
+ 20
78
+ 00:02:09,120 --> 00:02:14,960
79
+ world and not totally stocked down the rabbit hole of AI. What you have to say is a really wonderful
80
+
81
+ 21
82
+ 00:02:14,960 --> 00:02:22,480
83
+ rabbit hole to be, to be done. It's a really interesting area and speech and voice attack is the
84
+
85
+ 22
86
+ 00:02:22,480 --> 00:02:28,000
87
+ aspect of it that I find actually most, I'm not sure I would say the most interesting because there's
88
+
89
+ 23
90
+ 00:02:28,000 --> 00:02:34,080
91
+ just so much that is fascinating in AI. But the most that I find the most personally transformative
92
+
93
+ 24
94
+ 00:02:34,160 --> 00:02:41,920
95
+ in terms of the impact that it's had on my daily work life and productivity and how I sort of work.
96
+
97
+ 25
98
+ 00:02:41,920 --> 00:02:49,920
99
+ And I'm persevering hard with the task of training, yes, a good solution working for Linux.
100
+
101
+ 26
102
+ 00:02:49,920 --> 00:02:53,760
103
+ Would you have anyone actually, does listen to this not just for the training data and for the actual
104
+
105
+ 27
106
+ 00:02:53,760 --> 00:03:00,960
107
+ content? This is this is sparked. I had, besides the fine tune not working. Well, that was the failure.
108
+
109
+ 28
110
+ 00:03:01,280 --> 00:03:09,920
111
+ I used Claude Codes because one thing's these days that there is nothing sort of solving
112
+
113
+ 29
114
+ 00:03:10,880 --> 00:03:18,960
115
+ you know, the reason of life or something at that's flawed and agentic AI can't do, which is not really
116
+
117
+ 30
118
+ 00:03:18,960 --> 00:03:24,320
119
+ the case. It does seem that way sometimes, but it fails a lot as well. And this is one of those
120
+
121
+ 31
122
+ 00:03:25,200 --> 00:03:29,760
123
+ instances where last week I put together an hour of voice training data.
124
+
125
+ 32
126
+ 00:03:30,399 --> 00:03:37,280
127
+ Basically speaking just random things for three minutes and it was actually kind of tedious because
128
+
129
+ 33
130
+ 00:03:37,280 --> 00:03:43,200
131
+ the texts were really weird. Some of them were it was like AI generated. I tried before
132
+
133
+ 34
134
+ 00:03:43,200 --> 00:03:48,640
135
+ to reach Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was
136
+
137
+ 35
138
+ 00:03:48,640 --> 00:03:56,720
139
+ okay knowing just gonna have to find something out to read. So I used a created with AI Studio
140
+
141
+ 36
142
+ 00:03:56,800 --> 00:04:03,600
143
+ vibe code as a synthetic text generator, which actually I thought was probably a better way of
144
+
145
+ 37
146
+ 00:04:03,600 --> 00:04:09,920
147
+ doing it because it would give me more short samples with more varied content. So I was like okay,
148
+
149
+ 38
150
+ 00:04:09,920 --> 00:04:16,160
151
+ give me a voice note like I'm recording an email, give me a short story to read, give me pros
152
+
153
+ 39
154
+ 00:04:17,440 --> 00:04:22,320
155
+ to read. So I came up with all these different things and they added a little timer to it so I
156
+
157
+ 40
158
+ 00:04:22,320 --> 00:04:29,040
159
+ could see how to let us say well as to one hour. And I spent like an hour, one afternoon or probably
160
+
161
+ 41
162
+ 00:04:29,040 --> 00:04:36,560
163
+ two hours by the time you do retakes on whatever because you want to, it gave me a source of truth
164
+
165
+ 42
166
+ 00:04:37,280 --> 00:04:43,440
167
+ which I'm not sure if that's the scientific way to approach this topic of gathering training data,
168
+
169
+ 43
170
+ 00:04:43,440 --> 00:04:50,159
171
+ but I thought made sense. I have a lot of audio data from recording voice notes which I've also
172
+
173
+ 44
174
+ 00:04:50,160 --> 00:04:56,160
175
+ kind of used being experimenting with using for a different purpose. It's slightly different
176
+
177
+ 45
178
+ 00:04:56,160 --> 00:05:03,680
179
+ annotating task types. It's more text classification experiment or well it's more than that actually
180
+
181
+ 46
182
+ 00:05:03,680 --> 00:05:12,480
183
+ I'm working on a voice app so it's a prototype I guess is really more accurate. But you can do that
184
+
185
+ 47
186
+ 00:05:12,480 --> 00:05:18,960
187
+ and you can work backwards. You listen back to a voice note and you painfully go through one of those
188
+
189
+ 48
190
+ 00:05:19,039 --> 00:05:24,080
191
+ transcribing where you start and stop and scrub around in a new fixie areas but it's really
192
+
193
+ 49
194
+ 00:05:24,080 --> 00:05:29,039
195
+ really pouring to do that. So I thought it would be less tedious in the long term if I just
196
+
197
+ 50
198
+ 00:05:29,599 --> 00:05:35,200
199
+ recorded this source of truth. So it gave me these three minutes snippets. I recorded them
200
+
201
+ 51
202
+ 00:05:35,200 --> 00:05:43,200
203
+ it saved an MP3 and a TXT in the same folder and I created an error that data. So I was very hopeful
204
+
205
+ 52
206
+ 00:05:43,520 --> 00:05:50,960
207
+ that I could actually find you in whisper. I want to find you in whisper because when I got
208
+
209
+ 53
210
+ 00:05:50,960 --> 00:05:59,280
211
+ into voice tech last November my wife was in the US and I was alone at home and when crazy
212
+
213
+ 54
214
+ 00:05:59,280 --> 00:06:05,840
215
+ people like me do really wild things like use voice tech technology that was basically when I
216
+
217
+ 55
218
+ 00:06:05,840 --> 00:06:11,520
219
+ started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't
220
+
221
+ 56
222
+ 00:06:11,599 --> 00:06:18,719
223
+ that high. I used speech tech now and again tried to write as like it would be really cool if you
224
+
225
+ 57
226
+ 00:06:18,719 --> 00:06:24,479
227
+ just like speak into your computer and whatever I tried I used that had Linux support was just
228
+
229
+ 58
230
+ 00:06:25,359 --> 00:06:30,719
231
+ it was not good basically and this blew me away from the first go. I mean it wasn't 100%
232
+
233
+ 59
234
+ 00:06:31,680 --> 00:06:36,240
235
+ accurate either the box and it took work but it was good enough that there was a solid foundation
236
+
237
+ 60
238
+ 00:06:36,240 --> 00:06:42,880
239
+ and it kind of passed that pivot point that it's actually worth doing this. There's a point where
240
+
241
+ 61
242
+ 00:06:42,880 --> 00:06:48,160
243
+ it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time
244
+
245
+ 62
246
+ 00:06:48,880 --> 00:06:52,960
247
+ for it's speech tech to be worth while it isn't your productivity but you do need to get above
248
+
249
+ 63
250
+ 00:06:52,960 --> 00:07:00,480
251
+ let's say 85%. If it's 60% or 50% you inevitably say screw it I'll just type it because you
252
+
253
+ 64
254
+ 00:07:00,480 --> 00:07:06,080
255
+ end up missing errors in the transcript and it becomes actually worse you end up in a worse position
256
+
257
+ 65
258
+ 00:07:06,080 --> 00:07:12,560
259
+ than you started with that's been my experience. So I was like oh this is actually really really good
260
+
261
+ 66
262
+ 00:07:12,560 --> 00:07:19,440
263
+ now how did that happen? The answer is ASR whisper being open sourced and the transformer
264
+
265
+ 67
266
+ 00:07:19,440 --> 00:07:26,160
267
+ architecture if you want to go back to the to the underpinnings which really blows my mind and it's
268
+
269
+ 68
270
+ 00:07:26,240 --> 00:07:37,280
271
+ on my list to reto that paper all you need is attention as attentively as can be done with my limited
272
+
273
+ 69
274
+ 00:07:37,280 --> 00:07:45,040
275
+ brain because it's super super high level stuff super advanced stuff I mean but that I think of all the
276
+
277
+ 70
278
+ 00:07:45,040 --> 00:07:53,680
279
+ things that are fascinating about the sudden rise and AI and the dramatic capabilities I find
280
+
281
+ 71
282
+ 00:07:53,680 --> 00:07:58,880
283
+ fascinating a few people are like hang on you've got this thing that can speak to you like a chatbot
284
+
285
+ 72
286
+ 00:07:58,880 --> 00:08:04,400
287
+ and LLM and then you've got image generation okay so first of all those two things
288
+
289
+ 73
290
+ 00:08:05,360 --> 00:08:12,240
291
+ on the surface have nothing in common so like how did that just happen all at the same time and
292
+
293
+ 74
294
+ 00:08:12,240 --> 00:08:19,200
295
+ then when you extend that further you're like Suno right you can sing a song an AI will like come
296
+
297
+ 75
298
+ 00:08:19,280 --> 00:08:24,880
299
+ up with an instrumental and then you've got whisper and you're like wait a second how did all this
300
+
301
+ 76
302
+ 00:08:24,880 --> 00:08:30,400
303
+ stuff like if it's all AI what's like there has to be some commonality otherwise these are
304
+
305
+ 77
306
+ 00:08:30,400 --> 00:08:37,439
307
+ for these are totally different technologies on the surface of it and the transformer architecture
308
+
309
+ 78
310
+ 00:08:37,439 --> 00:08:43,520
311
+ is as far as I know the answer and I can't even say you can't even pretend that I really understand
312
+
313
+ 79
314
+ 00:08:44,159 --> 00:08:50,160
315
+ what the transformer architecture means in depth but I have scandice and as I said once a
316
+
317
+ 80
318
+ 00:08:50,720 --> 00:08:57,360
319
+ printed I'm really kind of think over it's at some point and I'll probably feel bad about myself
320
+
321
+ 81
322
+ 00:08:57,360 --> 00:09:02,720
323
+ I think because when those guys in the in their 20s like that's crazy I think I asked
324
+
325
+ 82
326
+ 00:09:02,720 --> 00:09:09,199
327
+ you to be one who were the who wrote that paper and how old were they when it was published in
328
+
329
+ 83
330
+ 00:09:09,200 --> 00:09:14,800
331
+ arcs of I was expecting like I don't know what do you imagine I personally imagine kind of like
332
+
333
+ 84
334
+ 00:09:14,800 --> 00:09:20,400
335
+ you know you have these breakthroughs during Covid and things like that were like these kind of
336
+
337
+ 85
338
+ 00:09:20,400 --> 00:09:25,040
339
+ really obscure scientists who are like in their 50s and they've just kind of been laboring labs
340
+
341
+ 86
342
+ 00:09:25,040 --> 00:09:31,120
343
+ and we're really in writing and publishing and kind of obscure academic publications and they
344
+
345
+ 87
346
+ 00:09:31,120 --> 00:09:37,520
347
+ finally like hit a bake or when a noble apprise and then their household names so that was kind
348
+
349
+ 88
350
+ 00:09:37,600 --> 00:09:43,280
351
+ of what I had that was the mental image I'd formed of the birth of arcs of like I wasn't
352
+
353
+ 89
354
+ 00:09:43,280 --> 00:09:48,560
355
+ expecting 20 somethings in San Francisco though I thought that was both very very funny very cool
356
+
357
+ 90
358
+ 00:09:48,560 --> 00:09:55,600
359
+ and actually kind of inspiring it's nice to think that people who you know just you might put them
360
+
361
+ 91
362
+ 00:09:55,600 --> 00:10:02,079
363
+ in the kind of milieu or bubble or world that you are in or incredibly in through you know
364
+
365
+ 92
366
+ 00:10:02,080 --> 00:10:07,680
367
+ the series of connections that are coming up with such literally world changing innovations
368
+
369
+ 93
370
+ 00:10:07,680 --> 00:10:14,000
371
+ so that was I thought anyway that that that was cool okay voice training data how are we doing
372
+
373
+ 94
374
+ 00:10:14,000 --> 00:10:20,080
375
+ we're at by 10 minutes and I'm still talking about voice technology so whisper was brilliant and
376
+
377
+ 95
378
+ 00:10:20,880 --> 00:10:26,000
379
+ I was so excited that I was my first instinct was to like guess like oh my gosh I have to
380
+
381
+ 96
382
+ 00:10:26,000 --> 00:10:31,760
383
+ get like a really good microphone for this so I didn't go on a spending spree because I said
384
+
385
+ 97
386
+ 00:10:31,840 --> 00:10:37,840
387
+ I'm gonna have to just wait a month and see if I still use this and it just kind of became
388
+
389
+ 98
390
+ 00:10:37,840 --> 00:10:44,800
391
+ it's become really part of my daily routine like if I'm writing an email I'll record a voice note
392
+
393
+ 99
394
+ 00:10:44,800 --> 00:10:49,040
395
+ and then I've developed and it's nice to see that everyone is like developing the same
396
+
397
+ 100
398
+ 00:10:49,600 --> 00:10:55,040
399
+ things in parallel like that's my kind of a weird thing to say but when I look I kind of came
400
+
401
+ 101
402
+ 00:10:55,040 --> 00:11:01,200
403
+ when I started working on this these prototypes on GitHub which is where I just kind of share
404
+
405
+ 102
406
+ 00:11:01,200 --> 00:11:09,600
407
+ very freely and loosely ideas and you know first iterations on concepts and for one of
408
+
409
+ 103
410
+ 00:11:09,600 --> 00:11:15,760
411
+ a better word I called it like LLM post processing or cleanup or basically a system prompt that
412
+
413
+ 104
414
+ 00:11:15,760 --> 00:11:22,880
415
+ after you get back the raw text from whisper you run it through model and say okay this is crappy
416
+
417
+ 105
418
+ 00:11:23,600 --> 00:11:31,040
419
+ text like add sentence structure and you know fix it up and now when I'm exploring
420
+
421
+ 106
422
+ 00:11:31,040 --> 00:11:36,480
423
+ the different tools that are out there the people of built I see quite a number of projects have
424
+
425
+ 107
426
+ 00:11:37,280 --> 00:11:42,480
427
+ basically you know done the same thing last that we missed construit I'm not saying for a
428
+
429
+ 108
430
+ 00:11:42,480 --> 00:11:48,480
431
+ millisecond that I inspired them I'm sure this has been a thing that's been integrated into tools
432
+
433
+ 109
434
+ 00:11:48,480 --> 00:11:53,520
435
+ for a while but it's it's the kind of thing that when you start using these tools every day
436
+
437
+ 110
438
+ 00:11:53,520 --> 00:11:59,760
439
+ the need for it is almost instantly apparent because text that doesn't have any punctuation or
440
+
441
+ 111
442
+ 00:11:59,760 --> 00:12:05,360
443
+ paragraph spacing takes a long time to you know it takes so long to get it into a presentable email
444
+
445
+ 112
446
+ 00:12:05,360 --> 00:12:13,280
447
+ that again it's it moves speech tech into that before that inflection point we're like that's just
448
+
449
+ 113
450
+ 00:12:13,280 --> 00:12:19,040
451
+ not worth it it's like it's just be quicker to type this so it's it's a big it's a little touch that
452
+
453
+ 114
454
+ 00:12:19,040 --> 00:12:26,959
455
+ actually is a big deal so I was on whisper and I've been using whisper and I kind of early on
456
+
457
+ 115
458
+ 00:12:26,959 --> 00:12:32,640
459
+ find a couple of tools I couldn't find what I was looking for on Linux which is basically just
460
+
461
+ 116
462
+ 00:12:32,640 --> 00:12:39,120
463
+ something that'll run in the background you'll give it an API key and it'll just like transcribe
464
+
465
+ 117
466
+ 00:12:39,200 --> 00:12:47,120
467
+ with like a little key to start and start the dictation and the issues where I discovered
468
+
469
+ 118
470
+ 00:12:47,120 --> 00:12:52,800
471
+ that like most people involved in creating these projects were very much focused on local models
472
+
473
+ 119
474
+ 00:12:52,800 --> 00:12:58,800
475
+ running whisper locally because you can and I tried that a bunch of times and just never got
476
+
477
+ 120
478
+ 00:12:58,800 --> 00:13:04,160
479
+ results that were as good as the cloud and when I began looking at the cost of the speech tech
480
+
481
+ 121
482
+ 00:13:04,160 --> 00:13:10,160
483
+ to API is what I was spending I just thought there is it's actually in my opinion just one of the
484
+
485
+ 122
486
+ 00:13:10,160 --> 00:13:15,680
487
+ better deals in API spending and in cloud like it's just not that expensive for very very good
488
+
489
+ 123
490
+ 00:13:15,680 --> 00:13:22,240
491
+ models that are much more you know you're going to be able to run the full model the latest model
492
+
493
+ 124
494
+ 00:13:22,240 --> 00:13:29,199
495
+ versus whatever you can run on your average GPU unless you want to buy crazy GPU it doesn't really
496
+
497
+ 125
498
+ 00:13:29,280 --> 00:13:34,000
499
+ make sense to me and I've been a lot of things that I know is kind of like a very much
500
+
501
+ 126
502
+ 00:13:34,000 --> 00:13:39,040
503
+ just everything the people just don't want their voice data and their voice leaving their local
504
+
505
+ 127
506
+ 00:13:39,040 --> 00:13:45,920
507
+ environment maybe for regular few reasons as well but I'm not in that I'm neither really care
508
+
509
+ 128
510
+ 00:13:45,920 --> 00:13:51,680
511
+ about people listening to my grocery lists consisting of reminding myself that I need to buy more
512
+
513
+ 129
514
+ 00:13:51,680 --> 00:13:58,320
515
+ beer cheetos and hummus which is kind of the three three staples of my diet during periods of
516
+
517
+ 130
518
+ 00:13:58,320 --> 00:14:04,640
519
+ poor nutrition but the kind of stuff that I transcribe it's just not it's not a it's not a
520
+
521
+ 131
522
+ 00:14:04,640 --> 00:14:12,800
523
+ privacy thing that sort of sensitive about and I don't do anything so you know sensitive or
524
+
525
+ 132
526
+ 00:14:12,800 --> 00:14:17,760
527
+ secure that requires air gapping so I looked at the pricing and especially the kind of older
528
+
529
+ 133
530
+ 00:14:17,760 --> 00:14:24,320
531
+ model mini some of them were very very affordable and I did it back the I did a calculation once
532
+
533
+ 134
534
+ 00:14:24,400 --> 00:14:30,800
535
+ with ChatGPT and I was like okay this is the API price for I can't remember whatever the model
536
+
537
+ 135
538
+ 00:14:30,800 --> 00:14:37,440
539
+ was let's say I just go out at like nonstop which really happens probably I would say an average
540
+
541
+ 136
542
+ 00:14:37,440 --> 00:14:42,800
543
+ I might dictate 30 to 60 minutes per day if I was probably summing up the emails
544
+
545
+ 137
546
+ 00:14:44,560 --> 00:14:50,800
547
+ documents outlines which is a lot but it's still a fairly modest amount and I was like well
548
+
549
+ 138
550
+ 00:14:50,800 --> 00:14:56,319
551
+ some days I do go on like one or two days right being usually when I'm like kind of I'd do the
552
+
553
+ 139
554
+ 00:14:56,319 --> 00:15:02,800
555
+ house and just have something like I've nothing else to do like if I'm at a hospital with a newborn
556
+
557
+ 140
558
+ 00:15:04,079 --> 00:15:09,040
559
+ and you're waiting for like eight hours and hours for an appointment and I would probably have
560
+
561
+ 141
562
+ 00:15:09,040 --> 00:15:15,520
563
+ listened to podcasts before becoming a speech fanatic and I'm like oh wait let me just get down let me
564
+
565
+ 142
566
+ 00:15:15,520 --> 00:15:20,800
567
+ just get these ideas out of my head and that's when I'll go on my speech binges but those
568
+
569
+ 143
570
+ 00:15:20,800 --> 00:15:25,680
571
+ were like once every few months like not frequently but I said okay let's just say if I'm going
572
+
573
+ 144
574
+ 00:15:25,680 --> 00:15:34,960
575
+ to price out cloud STT if I was like dedicated every second of every waking hour to transcribing
576
+
577
+ 145
578
+ 00:15:34,960 --> 00:15:41,280
579
+ for some odd reason I mean it have to like ease and use the toilet like you know there's only so many
580
+
581
+ 146
582
+ 00:15:41,360 --> 00:15:47,920
583
+ hours I'm awake for so like let's just say a maximum of like 40 hour 45 minutes in the hour
584
+
585
+ 147
586
+ 00:15:47,920 --> 00:15:53,280
587
+ then I said all right let's just say 50 who knows you're dictating on the toilet we do it
588
+
589
+ 148
590
+ 00:15:53,920 --> 00:16:01,040
591
+ so you could just do 60 but whatever I did and every day like you're going flat out seven days
592
+
593
+ 149
594
+ 00:16:01,040 --> 00:16:07,199
595
+ a week dictating nonstop it's like what's my monthly API bill gonna be at this price and it came
596
+
597
+ 150
598
+ 00:16:07,200 --> 00:16:14,160
599
+ out to like 70 or 80 bucks and I was like well that would be an extraordinary amount of dictation
600
+
601
+ 151
602
+ 00:16:14,160 --> 00:16:20,960
603
+ and I would hope that there were some compelling reason worth more than 70 dollars that I
604
+
605
+ 152
606
+ 00:16:20,960 --> 00:16:25,760
607
+ embarked upon their project so given the dots kind of the max point for me I said that's actually
608
+
609
+ 153
610
+ 00:16:25,760 --> 00:16:32,080
611
+ very very affordable now you're gonna if you want to specide the costs and you want to do the post
612
+
613
+ 154
614
+ 00:16:32,080 --> 00:16:38,640
615
+ processing that I really do feel is valuable that's gonna cost you more as well unless
616
+
617
+ 155
618
+ 00:16:38,640 --> 00:16:46,320
619
+ you're using Gemini which needless to say is a random person sitting in Jerusalem I have no
620
+
621
+ 156
622
+ 00:16:46,320 --> 00:16:52,000
623
+ affiliation nor with Google nor anthropic nor Gemini nor any major tech vendor for that matter
624
+
625
+ 157
626
+ 00:16:53,760 --> 00:17:00,240
627
+ I like Gemini not so much as a everyday model it's kind of underwhelmed in that respect I would say
628
+
629
+ 158
630
+ 00:17:00,320 --> 00:17:05,839
631
+ but for multi-model I think it's got a lot to offer and I think that the transcribing functionality
632
+
633
+ 159
634
+ 00:17:05,839 --> 00:17:13,520
635
+ whereby it can process audio with the system prompt and both give you transcription that's cleaned
636
+
637
+ 160
638
+ 00:17:13,520 --> 00:17:20,720
639
+ up that reduces two steps to one and that for me is a very very big deal and I feel like even Google
640
+
641
+ 161
642
+ 00:17:20,720 --> 00:17:27,839
643
+ hasn't really sort of thought through how useful the downloadability is more kind of use cases
644
+
645
+ 162
646
+ 00:17:28,319 --> 00:17:34,000
647
+ you can achieve with it because I found in the course of this year just an endless list of
648
+
649
+ 163
650
+ 00:17:34,959 --> 00:17:40,639
651
+ really kind of system prompt system prompt stuff that I can say okay I've used the trick
652
+
653
+ 164
654
+ 00:17:40,639 --> 00:17:46,320
655
+ capture context data for AI which is literally I might speak for if I wanted to have a good bank
656
+
657
+ 165
658
+ 00:17:46,320 --> 00:17:52,560
659
+ of context data about who knows my childhood more realistically maybe my career goals
660
+
661
+ 166
662
+ 00:17:53,520 --> 00:17:59,520
663
+ something that would just be like really boring to type out so I'll just like sit in my car
664
+
665
+ 167
666
+ 00:17:59,520 --> 00:18:06,480
667
+ and record it for 10 minutes and that's 10 minutes you get a lot of information in emails which
668
+
669
+ 168
670
+ 00:18:06,480 --> 00:18:13,200
671
+ is short text just there is a whole bunch and all these workflows kind of require a little bit
672
+
673
+ 169
674
+ 00:18:13,200 --> 00:18:17,919
675
+ of treatment afterwards and different treatment my context pipeline is kind of like just to
676
+
677
+ 170
678
+ 00:18:17,920 --> 00:18:22,320
679
+ extract the bare essential so you end up with me talking very loosely about sort of what
680
+
681
+ 171
682
+ 00:18:22,320 --> 00:18:27,920
683
+ I've done in my career where I've worked where my light work and it goes it condenses that down
684
+
685
+ 172
686
+ 00:18:27,920 --> 00:18:33,920
687
+ to very robotic language that is easy to chunk parse and maybe put into a vector database
688
+
689
+ 173
690
+ 00:18:33,920 --> 00:18:39,760
691
+ Daniel has worked in technology Daniel has been working in martech you know stuff like that
692
+
693
+ 174
694
+ 00:18:39,760 --> 00:18:46,160
695
+ that's not how you would speak but I figure it's probably easier to parse for after all robots
696
+
697
+ 175
698
+ 00:18:46,800 --> 00:18:52,560
699
+ so we've almost got to 20 minutes and this is actually a success because I waste 20 minutes of my
700
+
701
+ 176
702
+ 00:18:53,600 --> 00:19:00,880
703
+ of the evening speaking into my headphone and the levels were a shot and it was clipping and I said
704
+
705
+ 177
706
+ 00:19:00,880 --> 00:19:06,880
707
+ I can't read each and evaluation I have to be fair I have to give the models a chance to do their thing
708
+
709
+ 178
710
+ 00:19:07,600 --> 00:19:11,520
711
+ at what am I hoping to achieve in this okay my function was a daughter's mentioned
712
+
713
+ 179
714
+ 00:19:11,840 --> 00:19:16,879
715
+ Deepgram STT I'm really really hopeful that this prototype will work and it's being built in
716
+
717
+ 180
718
+ 00:19:16,879 --> 00:19:22,960
719
+ public open source so anyone is welcome to use it if I make anything good but that was really exciting
720
+
721
+ 181
722
+ 00:19:22,960 --> 00:19:30,160
723
+ for me last night when after hours of trying my own prototype seeing someone just made something that
724
+
725
+ 182
726
+ 00:19:30,160 --> 00:19:36,320
727
+ works like that you know you're not going to have to build a custom conda environment and image
728
+
729
+ 183
730
+ 00:19:36,320 --> 00:19:42,399
731
+ I have AMD GPU which makes things much more complicated I didn't find it and I was about to
732
+
733
+ 184
734
+ 00:19:42,399 --> 00:19:48,240
735
+ give up and I said all right let me just give deep grams Linux thing a shot and if this doesn't work
736
+
737
+ 185
738
+ 00:19:48,879 --> 00:19:53,520
739
+ I'm just going to go back to trying to vibe code something myself and when I ran the script
740
+
741
+ 186
742
+ 00:19:54,560 --> 00:20:01,120
743
+ I was using cloud code to do the installation process it ran the script and oh my gosh it works just like that
744
+
745
+ 187
746
+ 00:20:01,760 --> 00:20:08,479
747
+ the tricky thing for all those who want to know all the nitty gritty details
748
+
749
+ 188
750
+ 00:20:09,919 --> 00:20:14,800
751
+ was that I don't think it was actually struggling with transcription but pasting
752
+
753
+ 189
754
+ 00:20:14,800 --> 00:20:20,320
755
+ Wayland makes life very hard and I think there was something not running in the right time anyway
756
+
757
+ 190
758
+ 00:20:20,320 --> 00:20:25,120
759
+ Deepgram I looked at how they actually handled that because it worked out of the box where other stuff
760
+
761
+ 191
762
+ 00:20:25,120 --> 00:20:32,159
763
+ didn't and it was quite a clever little mechanism and but more and so it's not the accuracy was brilliant
764
+
765
+ 192
766
+ 00:20:32,159 --> 00:20:40,639
767
+ now what am I doing here this is going to be a 20 minutes audio sample and I'm I think I've done
768
+
769
+ 193
770
+ 00:20:40,639 --> 00:20:49,120
771
+ one or two of these before but I did this with sure snappy voice notes this is kind of long form
772
+
773
+ 194
774
+ 00:20:49,520 --> 00:20:54,080
775
+ it actually might be a better approximation for what's useful to me than voice memos like I
776
+
777
+ 195
778
+ 00:20:54,080 --> 00:20:59,840
779
+ need to buy three liters of milk tomorrow and pita bread which is probably how like half my voice
780
+
781
+ 196
782
+ 00:20:59,840 --> 00:21:04,639
783
+ notes sound like if anyone were to I don't know like find my phone they'd be like this is the most
784
+
785
+ 197
786
+ 00:21:04,639 --> 00:21:09,760
787
+ boring person in the world although actually there are some like kind of journaling thoughts as well
788
+
789
+ 198
790
+ 00:21:09,760 --> 00:21:14,800
791
+ but it's a lot of content like that and the probably for the evaluation the most useful thing is
792
+
793
+ 199
794
+ 00:21:15,440 --> 00:21:23,280
795
+ slightly obscure tech GitHub the clean Hugging Face not so obscure that is not going to have a chance
796
+
797
+ 200
798
+ 00:21:23,280 --> 00:21:28,879
799
+ of knowing it but hopefully sufficiently well known that the models should get us I tried to do a
800
+
801
+ 201
802
+ 00:21:28,879 --> 00:21:33,919
803
+ little bit of speaking really fast and speaking very slowly I would say in general I've spoken
804
+
805
+ 202
806
+ 00:21:34,560 --> 00:21:40,159
807
+ delivered this at a faster pace than I usually would owing to strong coffee flowing through my bloodstream
808
+
809
+ 203
810
+ 00:21:41,040 --> 00:21:45,760
811
+ and the thing that I'm not going to get into spent work is background noise which in my first
812
+
813
+ 204
814
+ 00:21:45,760 --> 00:21:52,000
815
+ take that I had to get rid of my wife come in with my son and for a good night kiss and that actually
816
+
817
+ 205
818
+ 00:21:52,000 --> 00:21:58,560
819
+ would have been super helpful to get in because it was non-diarized or if we had diarization
820
+
821
+ 206
822
+ 00:21:59,440 --> 00:22:03,120
823
+ a female I could say I want the male voice and that wasn't intended for transcription
824
+
825
+ 207
826
+ 00:22:04,480 --> 00:22:07,920
827
+ and I'm not going to get background noise like people honking their horns which is something
828
+
829
+ 208
830
+ 00:22:08,080 --> 00:22:13,440
831
+ of done to my main data set where I am trying to go back to some of my voice notes
832
+
833
+ 209
834
+ 00:22:13,440 --> 00:22:19,920
835
+ and I take them and run a benchmark but this is going to be just a pure quick test and
836
+
837
+ 210
838
+ 00:22:21,200 --> 00:22:28,080
839
+ someone working on a voice note idea that's my sort of end motivation besides thinking it's
840
+
841
+ 211
842
+ 00:22:28,080 --> 00:22:33,440
843
+ an absolutely outstanding technology that's coming to viability and really I know it sounds cheesy
844
+
845
+ 212
846
+ 00:22:33,520 --> 00:22:40,880
847
+ can actually have a very transformative effect it's you know voice technology has been life changing
848
+
849
+ 213
850
+ 00:22:40,880 --> 00:22:48,640
851
+ for folks living with disabilities and I think there's something really nice about the fact that
852
+
853
+ 214
854
+ 00:22:48,640 --> 00:22:54,560
855
+ it can also benefit you know folks who are able bodies and like we can all in different ways
856
+
857
+ 215
858
+ 00:22:57,040 --> 00:23:01,040
859
+ make this tech as useful as possible regardless of the exact way that we're using it
860
+
861
+ 216
862
+ 00:23:02,000 --> 00:23:06,639
863
+ and I think there's something very powerful in that and it can be very cool I see
864
+
865
+ 217
866
+ 00:23:06,639 --> 00:23:10,800
867
+ huge potential what excites me about voice tech a lot of things actually
868
+
869
+ 218
870
+ 00:23:12,080 --> 00:23:16,000
871
+ firstly the fact that it's cheap and accurate as I mentioned at the very start of this
872
+
873
+ 219
874
+ 00:23:17,200 --> 00:23:19,760
875
+ and it's getting better and better with stuff like accent handling
876
+
877
+ 220
878
+ 00:23:20,879 --> 00:23:25,040
879
+ I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it
880
+
881
+ 221
882
+ 00:23:25,040 --> 00:23:30,800
883
+ day to day as I imagine and get likes you per flawless words error rates because I'm just kind of
884
+
885
+ 222
886
+ 00:23:30,800 --> 00:23:38,399
887
+ skeptical about local speech tech as I mentioned and I think the pace of innovation and
888
+
889
+ 223
890
+ 00:23:38,399 --> 00:23:42,720
891
+ improvement in the models the main reason for fine tuning from what I've seen
892
+
893
+ 224
894
+ 00:23:44,240 --> 00:23:51,120
895
+ have been people who are something that really blows my mind about ASR is the idea that it's inherently
896
+
897
+ 225
898
+ 00:23:52,320 --> 00:24:00,080
899
+ alien fuel or multilingual fanatic based so as folks who use speak very obsturately languages
900
+
901
+ 226
902
+ 00:24:00,159 --> 00:24:05,520
903
+ that there might be a policy of training data or almost none at all and therefore the accuracy
904
+
905
+ 227
906
+ 00:24:05,520 --> 00:24:11,840
907
+ is significantly reduced or folks in very critical environments I know they're you this is
908
+
909
+ 228
910
+ 00:24:11,840 --> 00:24:18,080
911
+ using extensively in medical transcription and dispatcher work as you know the call centers
912
+
913
+ 229
914
+ 00:24:18,080 --> 00:24:24,000
915
+ use send out ambulances etc where accuracy is absolutely paramount and in the case of doctors
916
+
917
+ 230
918
+ 00:24:24,560 --> 00:24:29,679
919
+ radiologists there might be using very specialized vocab all the time so those are kind of the main
920
+
921
+ 231
922
+ 00:24:29,760 --> 00:24:35,040
923
+ two things and I'm not sure that really just for training make it better on a few random
924
+
925
+ 232
926
+ 00:24:35,040 --> 00:24:41,360
927
+ tech words with my slightly I mean I have an accent but like not you know an accent that a few
928
+
929
+ 233
930
+ 00:24:41,360 --> 00:24:48,240
931
+ other million people have ish I'm not sure that my little fine tune is going to actually
932
+
933
+ 234
934
+ 00:24:49,440 --> 00:24:54,560
935
+ like the bump in word error reduction if I ever actually figure out how to do it and get it up to the
936
+
937
+ 235
938
+ 00:24:55,520 --> 00:25:00,560
939
+ by the time I've done that I suspect that the next generation of ASR will just be so good that
940
+
941
+ 236
942
+ 00:25:00,560 --> 00:25:06,399
943
+ it will kind of be no well that would be cool for a doubt but all this uses instead so that's
944
+
945
+ 237
946
+ 00:25:06,399 --> 00:25:14,480
947
+ going to be is for today's episodes of voice training data single long-shaw evaluation
948
+
949
+ 238
950
+ 00:25:14,480 --> 00:25:20,560
951
+ who am I going to compare with supposed to be a benchmark but I'm more interested in seeing whisper
952
+
953
+ 239
954
+ 00:25:20,639 --> 00:25:27,679
955
+ head to head with two things ready one is whisper variants so you've got these projects like faster whisper
956
+
957
+ 240
958
+ 00:25:29,120 --> 00:25:33,840
959
+ distilled whisper it's a bit confusing there's a whole bunch of them and the emerging ASRs
960
+
961
+ 241
962
+ 00:25:33,840 --> 00:25:38,879
963
+ which are also a thing my intention for this is I'm not sure I'm going to have the time in any point
964
+
965
+ 242
966
+ 00:25:38,879 --> 00:25:45,840
967
+ of the foreseeable future to go back to this whole episode and create a proper source of truth and
968
+
969
+ 243
970
+ 00:25:45,840 --> 00:25:53,760
971
+ fix everything might do it if I can get one transcription that's sufficiently close to perfection
972
+
973
+ 244
974
+ 00:25:54,959 --> 00:25:59,840
975
+ but what I would actually love to do on hugging face I think would be a great
976
+
977
+ 245
978
+ 00:25:59,840 --> 00:26:06,240
979
+ probably how I might visualize this is having the audio waveform play and then have the transcript
980
+
981
+ 246
982
+ 00:26:06,240 --> 00:26:14,720
983
+ for each model below it and maybe even a like you know two scale and maybe even a local one as
984
+
985
+ 247
986
+ 00:26:14,720 --> 00:26:22,960
987
+ well like local whisper versus open AI API etc and I can then actually listen back to segments
988
+
989
+ 248
990
+ 00:26:22,960 --> 00:26:28,960
991
+ or anyone who wants to can listen back to segments of this recording and see where a particular
992
+
993
+ 249
994
+ 00:26:28,960 --> 00:26:34,240
995
+ model struggled and others didn't as well as the sort of headline finding of which had the best
996
+
997
+ 250
998
+ 00:26:34,240 --> 00:26:40,240
999
+ WER but that would require the source of truth okay that's it I hope this was I don't know
1000
+
1001
+ 251
1002
+ 00:26:40,320 --> 00:26:45,760
1003
+ maybe useful for other folks interested in STT you want to see that I always feel think I've just
1004
+
1005
+ 252
1006
+ 00:26:45,760 --> 00:26:51,120
1007
+ said something I didn't and do STT I said for those it's in carefully including hopefully the
1008
+
1009
+ 253
1010
+ 00:26:51,840 --> 00:26:57,840
1011
+ models themselves this has been myself Daniel Rosehill for more jumbled repositories about my
1012
+
1013
+ 254
1014
+ 00:26:58,240 --> 00:27:04,640
1015
+ roving interest in AI but particularly agentic mcp and voice tech you can find me on
1016
+
1017
+ 255
1018
+ 00:27:04,800 --> 00:27:11,920
1019
+ GitHub Hugging Face where else danielrosehill.com which is my personal website as well as
1020
+
1021
+ 256
1022
+ 00:27:12,560 --> 00:27:17,360
1023
+ this podcast whose name I sadly cannot remember but until next time thanks for listening
1024
+
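The head-to-head evaluation described at the end of the episode — scoring each model's transcript against this ground-truth SRT by word error rate — could be sketched as below. This is a minimal sketch, not part of the dataset: the file paths in the comments are assumed from this repository's layout, and the WER here is a plain word-level Levenshtein distance without the text normalization that dedicated tools such as `jiwer` apply, so scores will differ from a tuned benchmark.

```python
def srt_to_text(srt: str) -> str:
    """Collapse an SRT file to plain lowercase text by dropping
    subtitle index lines, timestamp lines, and blank separators."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or "-->" in line:
            continue
        kept.append(line)
    return " ".join(kept).lower()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the reference length (so WER can exceed 1.0)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the current reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (r != h))   # substitution (free on match)
        prev = cur
    return prev[-1] / max(len(ref), 1)


# Usage against this repo's files (paths assumed from the commit):
#   truth = srt_to_text(open("data/ground-truth/truth_1.srt", encoding="utf-8").read())
#   hyp = open("data/inference/runs/local-stt/run-2/whisper-tiny.txt", encoding="utf-8").read().lower()
#   print(f"whisper-tiny WER: {wer(truth, hyp):.3f}")
```

Counting substitutions, insertions, and deletions separately (rather than exact-match only) matters here, because a model that drops filler words should score better than one that hallucinates extra text of the same length.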
data/inference/runs/local-stt/run-2/whisper-tiny.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast. Or, I may append this to a podcast that I set up recently regarding my with my thoughts on speech tech and AI in particular. More AI in generative AI, I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation of back of the envelope evaluation as they might say for different speech attacks models. And I'm doing this because I thought I had made a great breakthrough in my journey with speech tech and that was succeeding in the elusive task of fine-tuning whisper. Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking, I might whisper something at some points as well. And I'll go back to speaking loud in a different part. So I'm going to send really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to put a speech attacks model through its pieces, which is trying to make sense of, is this guy just ramling on and coherently in one long sentence or are these just actually a series of
2
+
3
+ step of standalone, standalone, standalone sentences. And how is it going to handle step alone? That's not a word. What happens when you use speech attacks and you use a fake word. And then you're like, wait, that's not actually that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why was it trying to find you to whisper? And what is whisper, as I said, I'm going to try to record this at a couple of different levels of technicality for folks who are in the normal world and not totally stocked down the rabbit hole of AI. What you have to say is a really wonderful rabbit hole to be, to be done. It's a really interesting area and speech and voice attack is the aspect of it that I find actually most, I'm not sure I would say the most interesting because there's just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I'm persevering hard with the task of training, yes, a good solution working for Linux. Would you have anyone actually, does listen to this not just for the training data and for the actual content? This is this is sparked. I had, besides the fine tune not working. Well, that was the failure. I used Claude Codes because one thing's these days that there is nothing sort of solving you know, the reason of life or something at that's flawed and agentic AI can't do, which is not really the case. It does seem that way sometimes, but it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data. Basically speaking just random things for three minutes and it was actually kind of tedious because the texts were really weird. Some of them were it was like AI generated. 
I tried before to reach Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was okay knowing just gonna have to find something out to read. So I used a created with AI Studio vibe code as a synthetic text generator, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like okay, give me a voice note like I'm recording an email, give me a short story to read, give me pros to read. So I came up with all these different things and they added a little timer to it so I could see how to let us say well as to one hour. And I spent like an hour, one afternoon or probably two hours by the time you do retakes on whatever because you want to, it gave me a source of truth which I'm not sure if that's the scientific way to approach this topic of gathering training data, but I thought made sense. I have a lot of audio data from recording voice notes which I've also kind of used being experimenting with using for a different purpose. It's slightly different annotating task types. It's more text classification experiment or well it's more than that actually I'm working on a voice app so it's a prototype I guess is really more accurate. But you can do that and you can work backwards. You listen back to a voice note and you painfully go through one of those transcribing where you start and stop and scrub around in a new fixie areas but it's really really pouring to do that. So I thought it would be less tedious in the long term if I just recorded this source of truth. So it gave me these three minutes snippets. I recorded them it saved an MP3 and a TXT in the same folder and I created an error that data. So I was very hopeful that I could actually find you in whisper. 
I want to find you in whisper because when I got into voice tech last November my wife was in the US and I was alone at home and when crazy people like me do really wild things like use voice tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high. I used speech tech now and again tried to write as like it would be really cool if you just like speak into your computer and whatever I tried I used that had Linux support was just it was not good basically and this blew me away from the first go. I mean it wasn't 100% accurate either the box and it took work but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. There's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for it's speech tech to be worth while it isn't your productivity but you do need to get above let's say 85%. If it's 60% or 50% you inevitably say screw it I'll just type it because you end up missing errors in the transcript and it becomes actually worse you end up in a worse position than you started with that's been my experience. So I was like oh this is actually really really good now how did that happen? 
The answer is ASR whisper being open sourced and the transformer architecture if you want to go back to the to the underpinnings which really blows my mind and it's on my list to reto that paper all you need is attention as attentively as can be done with my limited brain because it's super super high level stuff super advanced stuff I mean but that I think of all the things that are fascinating about the sudden rise and AI and the dramatic capabilities I find fascinating a few people are like hang on you've got this thing that can speak to you like a chatbot and LLM and then you've got image generation okay so first of all those two things on the surface have nothing in common so like how did that just happen all at the same time and then when you extend that further you're like Suno right you can sing a song an AI will like come up with an instrumental and then you've got whisper and you're like wait a second how did all this stuff like if it's all AI what's like there has to be some commonality otherwise these are for these are totally different technologies on the surface of it and the transformer architecture is as far as I know the answer and I can't even say you can't even pretend that I really understand what the transformer architecture means in depth but I have scandice and as I said once a printed I'm really kind of think over it's at some point and I'll probably feel bad about myself I think because when those guys in the in their 20s like that's crazy I think I asked you to be one who were the who wrote that paper and how old were they when it was published in arcs of I was expecting like I don't know what do you imagine I personally imagine kind of like you know you have these breakthroughs during Covid and things like that were like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring labs and we're really in writing and publishing and kind of obscure academic publications and they finally like hit 
a big, or win a Nobel Prize, and then they're household names. So that was kind of what I had, that was the mental image I'd formed of the birth of that arXiv paper. I wasn't expecting twenty-somethings in San Francisco, though I thought that was both very, very funny, very cool, and actually kind of inspiring. It's nice to think that people who, you know, you might put in the kind of milieu or bubble or world that you are in, or credibly in it through, you know, a series of connections, are coming up with such literally world-changing innovations. So that, I thought anyway, was cool.

Okay, voice training data, how are we doing? We're about 10 minutes in and I'm still talking about voice technology. So Whisper was brilliant, and I was so excited that my first instinct was like, oh my gosh, I have to get a really good microphone for this. But I didn't go on a spending spree, because I said I'm going to have to just wait a month and see if I still use this. And it just kind of became, it's become, really part of my daily routine: if I'm writing an email, I'll record a voice note. And it's nice to see that everyone is developing the same things in parallel. That's kind of a weird thing to say, but when I started working on these prototypes on GitHub, which is where I just kind of share, very freely and loosely, ideas and, you know, first iterations on concepts, for want of a better word I called it LLM post-processing, or cleanup: basically a system prompt so that, after you get back the raw text from Whisper, you run it through a model and say, okay, this is crappy text, add sentence structure and, you know, fix it up. And now, when I'm exploring the different tools out there that people have built, I see quite a number of projects have basically, you know, done the same thing. Lest that be misconstrued, I'm not saying for a millisecond that I inspired them; I'm sure this has been a thing that's been integrated into tools for a while. But it's the kind of thing that, when you start using these tools every day, the need for it is almost instantly apparent, because text that doesn't have any punctuation or paragraph spacing takes so long to get into a presentable email that, again, it moves speech tech back before that inflection point where you're like, that's just not worth it, it'd just be quicker to type this. So it's a little touch that actually is a big deal.

So I was on Whisper, and I've been using Whisper, and early on I found a couple of tools, but I couldn't find what I was looking for on Linux, which is basically just something that'll run in the background, you'll give it an API key, and it'll just transcribe, with like a little hotkey to start and stop the dictation. And the issue was, I discovered that most people involved in creating these projects were very much focused on local models, running Whisper locally, because you can. I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech-to-text APIs, what I was spending, I just thought it's actually, in my opinion, one of the better deals in API spending and in cloud. It's just not that expensive for very, very good models; you're going to be able to run the full model, the latest model, versus whatever you can run on your average GPU, and unless you want to buy a crazy GPU, it doesn't really make sense to me. And privacy is another concern that I know is kind of very much a separate thing: people just don't want their voice data, their voice, leaving their local environment, maybe for regulatory reasons as well. But I'm not in that camp. I don't really care about people listening to my grocery lists, consisting of reminding myself that I need to buy more beer, Cheetos, and hummus, which are kind of the three staples of my diet during periods of poor nutrition. The kind of stuff that I transcribe is just not a privacy thing that I'm sort of sensitive about, and I don't do anything so, you know, sensitive or secure that it requires air-gapping.

So I looked at the pricing, and especially the kind of older models, the minis, some of them were very, very affordable. I did a calculation once with ChatGPT and I was like, okay, this is the API price for, I can't remember whatever the model was; let's say I just go at it nonstop, which rarely happens. On average I might dictate 30 to 60 minutes per day, probably, summing up the emails, documents, outlines, which is a lot, but still a fairly modest amount. And some days I do go on like one- or two-day runs, usually when I'm out of the house and just have nothing else to do. Like if I'm at a hospital with a newborn and you're waiting hours and hours for an appointment, where I would probably have listened to podcasts before becoming a speech fanatic, and I'm like, oh wait, let me just get these ideas out of my head, and that's when I'll go on my speech binges. But those were like once every few months, not frequently. But I said, okay, if I'm going to price out cloud STT, let's just say I was dedicating every second of every waking hour to transcribing for some odd reason. I mean, I'd have to eat and use the toilet, and there are only so many hours I'm awake for, so let's say a maximum of like 45 minutes in the hour. Then I said, all right, let's just say 50, or, who knows, you're dictating on the toilet, so you could just do 60, but whatever I did. And every day, going flat out, seven days a week, dictating nonstop: what's my monthly API bill going to be at this price? And it came out to like 70 or 80 bucks. And I was like, well, that would be an extraordinary amount of dictation, and I would hope that there was some compelling reason, worth more than 70 dollars, that I'd embarked upon that project. So given that that's kind of the max point for me, I said that's actually very, very affordable. Now, if you want to add in the costs of the post-processing, which I really do feel is valuable, that's going to cost more as well, unless you're using Gemini.

Which, needless to say, as a random person sitting in Jerusalem, I have no affiliation with Google, nor Anthropic, nor any major tech vendor for that matter. I like Gemini not so much as an everyday model, it's kind of underwhelming in that respect, I would say, but for multimodal I think it's got a lot to offer. And I think the transcribing functionality, whereby it can process audio with a system prompt and give you a transcription that's cleaned up, reduces two steps to one, and that for me is a very, very big deal. I feel like even Google hasn't really thought through how useful that is and how many kinds of use cases you can achieve with it, because I found, in the course of this year, just an endless list of really kind of system-prompt stuff. I've used the trick to capture context data for AI, which is literally, I might speak, for, if I wanted to have a good bank of context data about, who knows, my childhood, or more realistically maybe my career goals, something that would just be really boring to type out, I'll just sit in my car and record it for 10 minutes, and in 10 minutes you get a lot of information. Emails, which are short texts, there's a whole bunch. And all these workflows kind of require a little bit of treatment afterwards, and different treatment. My context pipeline is kind of just to extract the bare essentials, so you end up with me talking very loosely about sort of what I've done in my career, where I've worked, what I'd like to work on, and it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database: Daniel has worked in technology, Daniel has been working in marketing, you know, stuff like that. That's not how you would speak, but I figure it's probably easier to parse for, after all, robots.

So we've almost got to 20 minutes, and this is actually a success, because I wasted 20 minutes of the evening speaking into my headphone while the levels were shot and it was clipping, and I said I can't use that for an evaluation; I have to be fair, I have to give the models a chance to do their thing. And what am I hoping to achieve in this? Okay, as I mentioned, Deepgram STT: I'm really, really hopeful that this prototype will work, and it's a build-in-public, open-source thing, so anyone is welcome to use it if I make anything good. That was really exciting for me last night when, after hours of trying my own prototype, I saw someone had just made something that works, like that, you know, without having to build a custom conda environment and image. I have an AMD GPU, which makes things much more complicated; I couldn't get it working and I was about to give up, and I said, all right, let me just give Deepgram's Linux thing a shot, and if this doesn't work I'm just going to go back to trying to vibe-code something myself. And when I ran the script, I was using Claude Code to do the installation process, it ran the script and, oh my gosh, it works, just like that. The tricky thing, for all those who want to know all the nitty-gritty details, was that I don't think it was actually struggling with transcription, but pasting in Wayland makes life very hard, and I think there was something not running at the right time. Anyway, I looked at how Deepgram actually handled that, because it worked out of the box when other stuff didn't, and it was quite a clever little mechanism. And what's more, the accuracy was brilliant.

Now, what am I doing here? This is going to be a 20-minute audio sample, and I think I've done one or two of these before, but I did those with short, snappy voice notes; this is kind of long-form, which actually might be a better approximation for what's useful to me than voice memos like "I need to buy three liters of milk tomorrow and pita bread," which is probably how like half my voice notes sound. If anyone were to, I don't know, find my phone, they'd be like, this is the most boring person in the world. Although actually there are some kind of journaling thoughts as well, but it's a lot of content like that. And probably, for the evaluation, the most useful thing is slightly obscure tech names, GitHub, Hugging Face, not so obscure that a model has no chance of knowing them, but hopefully sufficiently well known that the models should get them. I tried to do a little bit of speaking really fast and speaking very slowly; I would say in general I've delivered this at a faster pace than I usually would, owing to the strong coffee flowing through my bloodstream. And the thing that I'm not going to get in this sample is background noise. In my first take, which I had to get rid of, my wife came in with my son for a good-night kiss, and that actually would have been super helpful to keep, because it was non-diarized, or, if we had diarization, a female voice, and I could say I want the male voice, as that wasn't intended for transcription. And I'm not going to get background noise like people honking their horns, which is something I've done in my main data set, where I am trying to go back to some of my voice notes, take them, and run a benchmark. But this is going to be just a pure quick test from someone working on a voice note idea. That's my sort of end motivation, besides thinking it's an absolutely astounding technology that's coming to viability and really, I know it sounds cheesy, can actually have a very transformative effect. You know, voice technology has been life-changing for folks living with disabilities, and I think there's something really nice about the fact that it can also benefit folks who are able-bodied, and we can all, in different ways, make this tech as useful as possible, regardless of the exact way that we're using it. I think there's something very powerful in that, and it can be very cool; I see huge potential.

What excites me about voice tech? A lot of things, actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start of this, and it's getting better and better with stuff like accent handling. I'm not sure my fine-tune will actually ever come to fruition, in the sense that I'll use it day to day as I imagined and get, like, super flawless word error rates, because I'm just kind of skeptical about local speech-to-text, as I mentioned, and about the pace of innovation and improvement in the models. The main reasons for fine-tuning, from what I've seen, have been people who, and something that really blows my mind about ASR is the idea that it's inherently multilingual, so, folks who speak very obscure languages where there might be a paucity of training data, or almost none at all, and therefore the accuracy is significantly reduced; or folks in very critical environments. I know this is used extensively in medical transcription and dispatcher work, you know, the call centers that send out ambulances, etc., where accuracy is absolutely paramount, and in the case of doctors, radiologists, they might be using very specialized vocab all the time. So those are kind of the main two things, and I'm not sure that training really just to make it better on a few random tech words, with my slightly, I mean, I have an accent, but not, you know, an accent that only a few million-ish other people have, I'm not sure that my little fine-tune is actually going to be worth the bump in word error reduction, if I ever actually figure out how to do it. And by the time I've done that, I suspect that the next generation of ASR will just be so good that it will kind of be moot. Well, that would be cool, no doubt, and I'd use that instead.

So that's going to be it for today's episode of voice training data, a single long-shot evaluation. Who am I going to compare? It's supposed to be a benchmark, but I'm more interested in seeing Whisper head to head with two things really. One is Whisper variants, so you've got these projects like Faster-Whisper, Distil-Whisper, it's a bit confusing, there's a whole bunch of them; and the emerging ASRs, which are also a thing. My intention for this is, I'm not sure I'm going to have the time at any point in the foreseeable future to go back to this whole episode and create a proper source of truth where I fix everything; I might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face, and it's probably how I might visualize this, is having the audio waveform play and then have the transcript for each model below it, maybe even a local one as well, like local Whisper versus the OpenAI API, etc. And then I, or anyone who wants to, can listen back to segments of this recording and see where a particular model struggled and others didn't, as well as the sort of headline finding of which had the best WER. But that would require the source of truth. Okay, that's it. I hope this was, I don't know, maybe useful for other folks interested in STT. I always feel like I've just said something I didn't, for those listening carefully, including, hopefully, the models themselves. This has been myself, Daniel Rosehill. For more, my jumbled repositories are about my roving interests in AI, particularly agentic, MCP, and voice tech. You can find me on GitHub, Hugging Face, where else, DanielRosehill.com, which is my personal website, as well as this podcast, whose name I sadly cannot remember. Until the next time, thanks for listening.
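The head-to-head evaluation described above ultimately reduces to one computation: scoring each model's transcript against the source of truth by word error rate (WER). A minimal sketch in plain Python, no external ASR tooling; the sample sentence and model names below are hypothetical placeholders, not outputs from this dataset:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Single-row dynamic programme over the hypothesis tokens.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            diag, row[j] = row[j], min(
                row[j] + 1,       # deletion
                row[j - 1] + 1,   # insertion
                diag + (r != h),  # substitution (free when the words match)
            )
    return row[-1] / max(len(ref), 1)

# Hypothetical ground truth and per-model hypotheses.
truth = "okay voice training data how are we doing"
transcripts = {
    "local-whisper": "okay voice training data how were we doing",
    "deepgram": "okay voice training data how are we doing",
}

# Rank models best-first by WER against the source of truth.
for name, hyp in sorted(transcripts.items(), key=lambda kv: wer(truth, kv[1])):
    print(f"{name}: WER {wer(truth, hyp):.1%}")
# → deepgram: WER 0.0%
# → local-whisper: WER 12.5%
```

In practice the reference and hypothesis strings would come from stripping timestamps and indices out of the ground-truth and per-run SRT files before scoring.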
data/inference/runs/local-stt/run-3/transcript.srt ADDED
@@ -0,0 +1,1032 @@
1
+ 1
2
+ 00:00:00,000 --> 00:00:08,640
3
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast.
4
+
5
+ 2
6
+ 00:00:08,640 --> 00:00:19,120
7
+ Or, I may append this to a podcast that I set up recently regarding my with my thoughts on speech
8
+
9
+ 3
10
+ 00:00:19,120 --> 00:00:28,720
11
+ tech and AI in particular. More AI in generative AI I would say. But in any event, the purpose of this
12
+
13
+ 4
14
+ 00:00:30,080 --> 00:00:37,120
15
+ voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the
16
+
17
+ 5
18
+ 00:00:37,120 --> 00:00:42,320
19
+ envelope evaluation as they might say for different speech attacks models. And I'm doing this because I
20
+
21
+ 6
22
+ 00:00:42,800 --> 00:00:48,560
23
+ I thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in
24
+
25
+ 7
26
+ 00:00:48,560 --> 00:00:55,120
27
+ the elusive task of fine-tuning whisper. Whisper is, and I'm going to just talk, I'm trying to
28
+
29
+ 8
30
+ 00:00:55,760 --> 00:01:01,600
31
+ mix up, I'm going to try a few different styles of speaking. I might whisper something at some
32
+
33
+ 9
34
+ 00:01:01,600 --> 00:01:07,760
35
+ points as well. And I'll go back to speaking loud in different parts. I'm going to send really
36
+
37
+ 10
38
+ 00:01:07,760 --> 00:01:15,200
39
+ like a crazy person because I'm also going to try to speak at different pitches and cadences
40
+
41
+ 11
42
+ 00:01:15,200 --> 00:01:21,600
43
+ in order to really try to push a speech attacks model through its paces, which is trying to make
44
+
45
+ 12
46
+ 00:01:21,600 --> 00:01:30,320
47
+ sense of is this guy just rambling on and coherently in one long sentence or are these just actually
48
+
49
+ 13
50
+ 00:01:30,320 --> 00:01:38,320
51
+ series of step standalone standalone sentences? And how is it going to handle step alone? That's not a
52
+
53
+ 14
54
+ 00:01:38,320 --> 00:01:43,919
55
+ word. What happens when you use speech attacks and you use a fake word and then you're like, wait,
56
+
57
+ 15
58
+ 00:01:43,919 --> 00:01:51,520
59
+ that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the
60
+
61
+ 16
62
+ 00:01:52,880 --> 00:01:57,359
63
+ questions that I'm seeking to answer in this training data. Now, why did why was it trying to
64
+
65
+ 17
66
+ 00:01:57,360 --> 00:02:01,040
67
+ find China whisper? And what is whisper? As I said, I'm going to try to
68
+
69
+ 18
70
+ 00:02:02,080 --> 00:02:04,240
71
+ record this at a couple of different levels of
72
+
73
+ 19
74
+ 00:02:04,880 --> 00:02:10,320
75
+ technicality for folks who are in the normal world and not totally
76
+
77
+ 20
78
+ 00:02:11,360 --> 00:02:16,079
79
+ stuck down the rabbit hole of AI. What you have to say is a really wonderful rabbit hole to be
80
+
81
+ 21
82
+ 00:02:16,720 --> 00:02:23,440
83
+ to be done. It's a really interesting area and speech and voice tech is the aspect of it that
84
+
85
+ 22
86
+ 00:02:23,440 --> 00:02:28,880
87
+ I find actually most I'm not sure I would say the most interesting because there's just so much
88
+
89
+ 23
90
+ 00:02:28,880 --> 00:02:34,560
91
+ that is fascinating in AI. But the most that I find the most personally transformative in terms of
92
+
93
+ 24
94
+ 00:02:34,560 --> 00:02:42,240
95
+ the impact that it's had on my daily work life and productivity and how I sort of work. And
96
+
97
+ 25
98
+ 00:02:42,960 --> 00:02:49,920
99
+ I am persevering hard with the task of training, I guess, a good solution working for Linux.
100
+
101
+ 26
102
+ 00:02:49,920 --> 00:02:53,440
103
+ Would you have anyone actually does listen to this not just for the training data and for the
104
+
105
+ 27
106
+ 00:02:53,440 --> 00:03:00,399
107
+ actual content? This is this is sparked. I had, besides the fine tune not working, well that was
108
+
109
+ 28
110
+ 00:03:00,399 --> 00:03:07,679
111
+ the failure. I used plot code because one thing these days that there is nothing
112
+
113
+ 29
114
+ 00:03:08,560 --> 00:03:16,799
115
+ short of solving, you know, the reason of life or something that's flawed and
116
+
117
+ 30
118
+ 00:03:16,800 --> 00:03:22,720
119
+ agentically I can't do, which is not really the case. It does seem that way sometimes but it
120
+
121
+ 31
122
+ 00:03:22,720 --> 00:03:28,080
123
+ fails a lot as well. And this is one of those instances where last week I put together an hour
124
+
125
+ 32
126
+ 00:03:28,080 --> 00:03:33,600
127
+ of voice training data, basically speaking just random things for three minutes and
128
+
129
+ 33
130
+ 00:03:35,600 --> 00:03:40,160
131
+ it was actually kind of tedious because the text were really weird. Some of them were it was like,
132
+
133
+ 34
134
+ 00:03:40,160 --> 00:03:45,440
135
+ it was AI generated. I tried before to reach Sherlock Holmes for an hour and I just couldn't,
136
+
137
+ 35
138
+ 00:03:45,440 --> 00:03:51,120
139
+ I was so bored after 10 minutes that I was like, okay, knowing just gonna have to find something
140
+
141
+ 36
142
+ 00:03:51,120 --> 00:03:59,920
143
+ else to read. So I used a created with AI Studio, a vibe code is a synthetic text generator,
144
+
145
+ 37
146
+ 00:04:00,800 --> 00:04:05,680
147
+ which actually I thought was probably a better way of doing it because it would give me more
148
+
149
+ 38
150
+ 00:04:05,680 --> 00:04:12,000
151
+ short samples with more varied content. So I was like, okay, give me a voice note. Like I'm
152
+
153
+ 39
154
+ 00:04:12,000 --> 00:04:18,800
155
+ recording an email, give me a short story to read, give me pros to read. So I came up with all
156
+
157
+ 40
158
+ 00:04:18,800 --> 00:04:24,240
159
+ these different things and they added a little timer to it so I could see how close I was to one
160
+
161
+ 41
162
+ 00:04:24,240 --> 00:04:32,480
163
+ hour and I spent like an hour one afternoon or probably two hours by the time you do retakes
164
+
165
+ 42
166
+ 00:04:32,480 --> 00:04:39,120
167
+ and whatever because you want to, it gave me a source of truth which I'm not sure if that's the
168
+
169
+ 43
170
+ 00:04:39,120 --> 00:04:45,120
171
+ scientific way to approach this. Topic of gathering training data but I thought made sense.
172
+
173
+ 44
174
+ 00:04:46,560 --> 00:04:50,880
175
+ I have a lot of audio data from recording voice notes which I've also kind of used
176
+
177
+ 45
178
+ 00:04:52,000 --> 00:04:56,720
179
+ being experimenting with using for a different purpose. It's slightly different annotating
180
+
181
+ 46
182
+ 00:04:57,840 --> 00:05:03,680
183
+ task types. It's more text classification experiment or well it's more than that actually
184
+
185
+ 47
186
+ 00:05:03,680 --> 00:05:08,880
187
+ working on a voice app so it's a prototype I guess is really more accurate.
188
+
189
+ 48
190
+ 00:05:11,280 --> 00:05:15,920
191
+ But you can do that and you can work backwards. You listen back to a voice note and you
192
+
193
+ 49
194
+ 00:05:17,520 --> 00:05:22,400
195
+ painfully go through one of those transcribing where you start and stop and scrub around it and
196
+
197
+ 50
198
+ 00:05:22,400 --> 00:05:27,680
199
+ you fix the errors but it's really really pouring to do that. So I thought it would be last tedious
200
+
201
+ 51
202
+ 00:05:27,680 --> 00:05:34,240
203
+ in the long term if I just recorded this source of truth so it gave me these three minute snippets.
204
+
205
+ 52
206
+ 00:05:34,240 --> 00:05:40,480
207
+ I recorded them it saved in MP3 and the TXT in the same folder and I created an error that data.
208
+
209
+ 53
210
+ 00:05:41,840 --> 00:05:47,280
211
+ So I was very hopeful quite a little bit hopeful that I would be able that I could actually fine tune
212
+
213
+ 54
214
+ 00:05:47,280 --> 00:05:54,720
215
+ whisper. I want to fine tune whisper because when I got into voice tech last November my wife was in
216
+
217
+ 55
218
+ 00:05:54,720 --> 00:06:01,920
219
+ the US and I was alone at home and when crazy people like me do really wild things like use voice
220
+
221
+ 56
222
+ 00:06:01,920 --> 00:06:08,320
223
+ to tech technology that was basically when I started doing it I didn't feel like a crazy person
224
+
225
+ 57
226
+ 00:06:08,320 --> 00:06:15,760
227
+ speaking to myself and my expectations weren't that high. I used speech tech now and again
228
+
229
+ 58
230
+ 00:06:16,960 --> 00:06:21,200
231
+ try it out. It's like it'd be really cool if you could just like speak into your computer and
232
+
233
+ 59
234
+ 00:06:21,280 --> 00:06:28,479
235
+ whatever I tried out that had Linux support was just it was not good basically and this blew me away
236
+
237
+ 60
238
+ 00:06:28,479 --> 00:06:34,400
239
+ from the first go. I mean it wasn't 100% accurate out of the box and it took work but it was good
240
+
241
+ 61
242
+ 00:06:34,400 --> 00:06:40,320
243
+ enough that there was a solid foundation and it kind of passed that pivot point that it's actually
244
+
245
+ 62
246
+ 00:06:40,320 --> 00:06:46,320
247
+ worth doing this. There's a point where it's so like the transcript is you don't have to get 100%
248
+
249
+ 63
250
+ 00:06:46,400 --> 00:06:51,040
251
+ accuracy for it to be worth your time for it's speech attacks to be a worthwhile addition to your
252
+
253
+ 64
254
+ 00:06:51,040 --> 00:06:58,320
255
+ productivity but you do need to get above let's say I don't know 85% if it's 60% or 50% you inevitably
256
+
257
+ 65
258
+ 00:06:58,320 --> 00:07:03,920
259
+ say screw it I'll just type it because you end up missing errors in the transcript and it becomes
260
+
261
+ 66
262
+ 00:07:03,920 --> 00:07:07,840
263
+ actually worse you end up in a worse position than you started with it that's been my experience.
264
+
265
+ 67
266
+ 00:07:08,400 --> 00:07:14,400
267
+ So I was like oh this is actually really really good now how did that happen? The answer is
268
+
269
+ 68
270
+ 00:07:14,400 --> 00:07:21,599
271
+ ASR with per being open-sourced and the transformer architecture if you want to go back to the
272
+
273
+ 69
274
+ 00:07:23,200 --> 00:07:29,440
275
+ to the underpinnings which really blows my mind and it's on my list to read through that paper
276
+
277
+ 70
278
+ 00:07:30,239 --> 00:07:38,400
279
+ all you need is attention as attentively as can be done with my limited brain because it's super
280
+
281
+ 71
282
+ 00:07:38,960 --> 00:07:45,679
283
+ high-level stuff super advanced stuff I mean but that I think of all the things that
284
+
285
+ 72
286
+ 00:07:47,280 --> 00:07:54,080
287
+ are fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating
288
+
289
+ 73
290
+ 00:07:54,080 --> 00:07:59,599
291
+ a few people are like hang on you've got this thing that can speak to you like a chatbot and LLM
292
+
293
+ 74
294
+ 00:08:00,640 --> 00:08:06,799
295
+ then you've got image generation okay so firstly those two things on the surface have nothing
296
+
297
+ 75
298
+ 00:08:06,800 --> 00:08:12,560
299
+ in common so like how are they how did that just happen all at the same time and then when you
300
+
301
+ 76
302
+ 00:08:12,560 --> 00:08:19,920
303
+ extend that further you're like sooner right you can sing a song an AI will like come up with
304
+
305
+ 77
306
+ 00:08:19,920 --> 00:08:25,200
307
+ an instrumental and then you've got whisper and you're like wait a second how did all this stuff
308
+
309
+ 78
310
+ 00:08:25,200 --> 00:08:30,880
311
+ like if it's all AI what's like there has to be some commonality otherwise these are four these
312
+
313
+ 79
314
+ 00:08:31,600 --> 00:08:38,640
315
+ totally different technologies on the surface of it and the transformer architecture is as far as
316
+
317
+ 80
318
+ 00:08:38,640 --> 00:08:44,720
319
+ I know the answer and I can't even say I can't even pretend that I really understand what the
320
+
321
+ 81
322
+ 00:08:44,720 --> 00:08:51,200
323
+ transformer architecture means in depth but I have scandis and as I said I want to print it and
324
+
325
+ 82
326
+ 00:08:51,200 --> 00:08:57,760
327
+ really kind of think over it's at some point and I'll probably feel bad about myself I think
328
+
329
+ 83
330
+ 00:08:57,760 --> 00:09:03,280
331
+ because weren't those guys in their in their 20s like that's crazy I think I asked chat Gbt
332
+
333
+ 84
334
+ 00:09:03,280 --> 00:09:09,439
335
+ once who were the who wrote that paper and how old were they when it was published in ARC
336
+
337
+ 85
338
+ 00:09:09,439 --> 00:09:14,640
339
+ and I was expecting like I don't know what do you what do you imagine I personally imagine kind of
340
+
341
+ 86
342
+ 00:09:14,640 --> 00:09:19,840
343
+ like you know you have these breakthroughs during COVID and things like that were like these kind
344
+
345
+ 87
346
+ 00:09:19,840 --> 00:09:24,480
347
+ of really obscure scientists who are like in their 50s and they've just kind of been laboring
348
+
349
+ 88
350
+ 00:09:24,640 --> 00:09:31,120
351
+ labs and we're really in writing and publishing and kind of obscure academic publications and they
352
+
353
+ 89
354
+ 00:09:31,120 --> 00:09:37,200
355
+ finally like hit a big or win a Nobel Prize and then their household household names so I that
356
+
357
+ 90
358
+ 00:09:37,200 --> 00:09:42,680
359
+ was kind of what I had in mind that was the mental image I'd formed of the birth of ARC
360
+
361
+ 91
362
+ 00:09:42,680 --> 00:09:47,760
363
+ like I wasn't expecting 20 somethings in San Francisco though I thought that was both very very
364
+
365
+ 92
366
+ 00:09:47,760 --> 00:09:54,160
367
+ funny very cool and actually kind of inspiring it's nice to think that people who you know just
368
+
369
+ 93
370
+ 00:09:54,160 --> 00:10:01,439
371
+ you might put them in the kind of milieu or bubble or world that you are in are credibly in through
372
+
373
+ 94
374
+ 00:10:01,439 --> 00:10:06,079
375
+ you know the series of connections that are coming up with such literally world changing
376
+
377
+ 95
378
+ 00:10:06,880 --> 00:10:13,439
379
+ innovations so that was I thought anyway that that that was cool okay voice training data how
380
+
381
+ 96
382
+ 00:10:13,439 --> 00:10:19,280
383
+ were we doing we're about 10 minutes and I'm still talking about voice technology so whisper was
384
+
385
+ 97
386
+ 00:10:19,280 --> 00:10:25,680
387
+ brilliant and I was so excited that I was my first instinct was to like guess it's like oh my gosh
388
+
389
+ 98
390
+ 00:10:25,680 --> 00:10:31,040
391
+ I have to get like a really good microphone for this so I didn't go on a spending spree because
392
+
393
+ 99
394
+ 00:10:31,040 --> 00:10:37,760
395
+ I said I'm gonna have to just wait a month and see if I still use this and it just kind of became
396
+
397
+ 100
398
+ 00:10:37,760 --> 00:10:44,800
399
+ it's become really part of my daily routine like if I'm writing an email I'll record a voice note
400
+
401
+ 101
402
+ 00:10:44,880 --> 00:10:50,079
403
+ and then I've developed and it's nice to see that everyone is like developing the same things in
404
+
405
+ 102
406
+ 00:10:50,079 --> 00:10:56,319
407
+ parallel like that's my kind of a weird thing to say but when I look I kind of came when I started
408
+
409
+ 103
410
+ 00:10:56,319 --> 00:11:02,640
411
+ working on this these prototypes on GitHub which is where I just kind of share very freely and loosely
412
+
413
+ 104
414
+ 00:11:03,199 --> 00:11:10,800
415
+ ideas and you know first iterations on concepts and for one of a better word I called it like
416
+
417
+ 105
418
+ 00:11:11,439 --> 00:11:17,680
419
+ LLM post processing or cleanup or basically a system prompt that after you get back the raw text
420
+
421
+ 106
422
+ 00:11:17,680 --> 00:11:25,920
423
+ from whisper you run it through model and say okay this is crappy text like add sentence structure
424
+
425
+ 107
426
+ 00:11:25,920 --> 00:11:33,199
427
+ and you know fix it up and now when I'm exploring the different tools that are out there the people
428
+
429
+ 108
430
+ 00:11:33,200 --> 00:11:39,040
431
+ have built I see quite a number of projects have basically you know done the same thing
432
+
433
+ 109
434
+ 00:11:40,640 --> 00:11:45,040
435
+ less that be misconstrued I'm not saying for a millisecond that I inspired them I'm sure this
436
+
437
+ 110
438
+ 00:11:45,040 --> 00:11:51,440
439
+ has been a thing that's been integrated into tools for a while but it's it's the kind of thing that
440
+
441
+ 111
442
+ 00:11:51,440 --> 00:11:57,520
443
+ when you start using these tools every day the need for it is almost instantly apparent because text
444
+
445
+ 112
446
+ 00:11:57,600 --> 00:12:03,520
447
+ that doesn't have any punctuation or progress basing takes a long time to you know it takes so
448
+
449
+ 113
450
+ 00:12:03,520 --> 00:12:10,079
451
+ long to get it into a presentable email that again it's it moves speech tech into that
452
+
453
+ 114
454
+ 00:12:11,280 --> 00:12:16,000
455
+ before that inflection point where you're like nah she's not worth it it's like it'll just be
456
+
457
+ 115
458
+ 00:12:16,000 --> 00:12:20,800
459
+ quicker to type this so it's it's a big it's a little touch that actually is a big deal
460
+
461
+ 116
462
+ 00:12:21,520 --> 00:12:28,319
463
+ so I was on whisper and I've been using whisper and I kind of early on find a couple of tools
464
+
465
+ 117
466
+ 00:12:28,319 --> 00:12:33,680
467
+ I couldn't find what I was looking for on Linux which is basically just something that'll run
468
+
469
+ 118
470
+ 00:12:34,800 --> 00:12:39,120
471
+ in the background you'll give it an API key and it'll just like transcribe
472
+
473
+ 119
474
+ 00:12:41,439 --> 00:12:47,359
475
+ with like a little key to start and start the dictation and the issues where I discovered that
476
+
477
+ 120
478
+ 00:12:47,440 --> 00:12:52,720
479
+ like most people involved in creating these projects were very much focused on local models
480
+
481
+ 121
482
+ 00:12:52,720 --> 00:12:58,400
483
+ and running whisper locally because you can and I tried that a bunch of times and just never
484
+
485
+ 122
486
+ 00:12:58,400 --> 00:13:03,920
487
+ got results that were as good as the cloud and when I began looking at the cost of the speech
488
+
489
+ 123
490
+ 00:13:03,920 --> 00:13:10,080
491
+ text APIs and what I was spending I just thought there is it's actually in my opinion just one of
492
+
493
+ 124
494
+ 00:13:10,080 --> 00:13:15,600
495
+ the better deals in API spending and in cloud like it's just not that expensive for very very good
496
+
497
+ 125
498
+ 00:13:15,600 --> 00:13:22,240
499
+ models that are much more you know you're going to be able to run the full model the latest model
500
+
501
+ 126
502
+ 00:13:22,240 --> 00:13:28,960
503
+ versus whatever you can run on your average GPU unless you want to buy a crazy GPU it doesn't
504
+
505
+ 127
506
+ 00:13:28,960 --> 00:13:34,000
507
+ really make sense to me and privacy is another concern that I know is kind of like a very much
508
+
509
+ 128
510
+ 00:13:34,000 --> 00:13:38,720
511
+ a separate thing that people just don't want their voice data and their voice leaving their
512
+
513
+ 129
514
+ 00:13:38,720 --> 00:13:45,360
515
+ local environment maybe for regulatory reasons as well but I'm not in that camp I don't really
516
+
517
+ 130
518
+ 00:13:45,360 --> 00:13:51,440
519
+ care about people listening to my grocery list consisting of reminding myself that I need to buy
520
+
521
+ 131
522
+ 00:13:51,440 --> 00:13:58,240
523
+ more beer cheetos and hummus which is kind of the three three staples of my diet during periods of
524
+
525
+ 132
526
+ 00:13:58,240 --> 00:14:04,560
527
+ poor nutrition but the kind of stuff that I transcribe most it's just not it's not a it's not a
528
+
529
+ 133
530
+ 00:14:04,560 --> 00:14:12,640
531
+ privacy thing I'm that sort of sensitive about and I don't do anything so you know sensitive
532
+
533
+ 134
534
+ 00:14:12,640 --> 00:14:17,680
535
+ or secure that requires airgapping so I looked at the pricing and especially the kind of older
536
+
537
+ 135
538
+ 00:14:17,680 --> 00:14:24,400
539
+ models mini some of them are very very affordable and I did it back to the I did a calculation once
540
+
541
+ 136
542
+ 00:14:24,400 --> 00:14:30,239
543
+ with ChatGPT and I was like okay this is the this is the API price for I can't remember whatever
544
+
545
+ 137
546
+ 00:14:30,320 --> 00:14:37,040
547
+ the model was let's say I just go at it like nonstop which rarely happens probably I would say an
548
+
549
+ 138
550
+ 00:14:37,040 --> 00:14:45,200
551
+ average I might dictate 30 to 60 minutes per day if I was probably summing up the emails documents
552
+
553
+ 139
554
+ 00:14:45,200 --> 00:14:51,360
555
+ outlines which is a lot but it's it's still a fairly modest amount and I was like well some days I
556
+
557
+ 140
558
+ 00:14:51,360 --> 00:14:56,720
559
+ do go on like one or two days where I've been usually when I'm like kind of out of the house and
560
+
561
+ 141
562
+ 00:14:56,720 --> 00:15:02,800
563
+ just have something like I've nothing else to do like if I'm at a hospital we have a newborn
564
+
565
+ 142
566
+ 00:15:04,000 --> 00:15:09,040
567
+ and you're waiting for like eight hours and hours for an appointment and I would probably have
568
+
569
+ 143
570
+ 00:15:09,040 --> 00:15:15,280
571
+ listened to podcasts before becoming a speech fanatic and I'm like oh wait let me just get down
572
+
573
+ 144
574
+ 00:15:15,280 --> 00:15:20,880
575
+ let me just get these ideas out of my head and that's when I'll go on my speech binges but those
576
+
577
+ 145
578
+ 00:15:20,880 --> 00:15:26,240
579
+ are like once every few months like not frequently but I said okay let's just say if I'm gonna price
580
+
581
+ 146
582
+ 00:15:26,240 --> 00:15:35,440
583
+ out cloud stt if I was like dedicated every second of every waking hour to transcribing for some
584
+
585
+ 147
586
+ 00:15:35,440 --> 00:15:41,600
587
+ odd reason I mean I'd have to like eat and use the toilet like you know there's only so many hours
588
+
589
+ 148
590
+ 00:15:41,600 --> 00:15:48,480
591
+ I'm awake for so like let's just say a maximum of like 40 or 45 minutes in the hour and I said
592
+
593
+ 149
594
+ 00:15:48,480 --> 00:15:55,360
595
+ all right let's just say 50 who knows you're dictating on the toilet we do it so you could just do 60
596
+
597
+ 150
598
+ 00:15:55,440 --> 00:16:02,560
599
+ but whatever I did and every day like you're going flat out seven days a week dictating nonstop
600
+
601
+ 151
602
+ 00:16:02,560 --> 00:16:08,640
603
+ as like what's my monthly API bill gonna be at this price and it came out to like seventy to
604
+
605
+ 152
606
+ 00:16:08,640 --> 00:16:14,960
607
+ 80 bucks and I was like well that would be an extraordinary amount of dictation and I would hope
608
+
609
+ 153
610
+ 00:16:15,600 --> 00:16:21,680
611
+ that there was some compelling reason more worth more than 70 dollars that I embarked upon that
612
+
613
+ 154
614
+ 00:16:22,640 --> 00:16:26,959
615
+ so given that that's kind of the max point for me I said that's actually very very affordable
616
+
617
+ 155
618
+ 00:16:27,920 --> 00:16:32,640
619
+ now you're gonna if you want to spec out the costs and you want to do the post processing
620
+
621
+ 156
622
+ 00:16:33,599 --> 00:16:39,199
623
+ that I really do feel is valuable that's gonna cost more as well unless you're using
624
+
625
+ 157
626
+ 00:16:40,160 --> 00:16:47,839
627
+ Gemini which needless to say as a random person sitting in Jerusalem I have no affiliation nor with
628
+
629
+ 158
630
+ 00:16:47,840 --> 00:16:54,800
631
+ Google nor anthropic nor Gemini nor any major tech vendor for that matter um I like Gemini
632
+
633
+ 159
634
+ 00:16:54,800 --> 00:17:00,080
635
+ not so much as an everyday model um it's kind of underwhelmed in that respect I would say
636
+
637
+ 160
638
+ 00:17:00,080 --> 00:17:05,920
639
+ but for multimodal I think it's got a lot to offer and I think that the transcribing functionality
640
+
641
+ 161
642
+ 00:17:05,920 --> 00:17:13,280
643
+ whereby it can um process audio with a system prompt and both give you a transcription that's
644
+
645
+ 162
646
+ 00:17:13,280 --> 00:17:20,079
647
+ cleaned up that reduces two steps to one and that for me is a very very big deal and uh I feel like
648
+
649
+ 163
650
+ 00:17:20,079 --> 00:17:27,280
651
+ even Google hasn't really sort of thought through how useful that modality is and the kind of
652
+
653
+ 164
654
+ 00:17:27,280 --> 00:17:33,280
655
+ use cases uh you can achieve with it because I found in the course of this year just an endless
656
+
657
+ 165
658
+ 00:17:33,280 --> 00:17:40,399
659
+ list of really kind of system prompt system prompt stuff that I can say okay I've used it
660
+
661
+ 166
662
+ 00:17:40,560 --> 00:17:45,920
663
+ for capturing context data for AI which is literally I might speak if I wanted to have a good
664
+
665
+ 167
666
+ 00:17:45,920 --> 00:17:52,560
667
+ bank of context data about who knows my childhood uh more realistically maybe my career goals
668
+
669
+ 168
670
+ 00:17:53,520 --> 00:17:59,520
671
+ something that would just be like really boring to type out so I'll just like sit in my car
672
+
673
+ 169
674
+ 00:17:59,520 --> 00:18:06,640
675
+ and record it for 10 minutes and that's 10 minutes you get a lot of information in um emails which is
676
+
677
+ 170
678
+ 00:18:06,640 --> 00:18:13,200
679
+ short text uh just there is a whole bunch and all these workflows kind of require a little bit
680
+
681
+ 171
682
+ 00:18:13,200 --> 00:18:18,320
683
+ of treatment afterwards and different treatment my context pipeline is kind of like just extract the
684
+
685
+ 172
686
+ 00:18:18,320 --> 00:18:23,520
687
+ bare essential so you end up with me talking very loosely about sort of what I've done in my career
688
+
689
+ 173
690
+ 00:18:23,520 --> 00:18:30,000
691
+ where I've worked where I'd like to work and it goes it condenses that down to very robotic language
692
+
693
+ 174
694
+ 00:18:30,000 --> 00:18:36,000
695
+ that is easy to chunk and parse and maybe put into a vector database Daniel has worked in technology
696
+
697
+ 175
698
+ 00:18:36,080 --> 00:18:42,400
699
+ Daniel is a has been working in martino stuff like that that's not how you would speak um but I
700
+
701
+ 176
702
+ 00:18:42,400 --> 00:18:48,480
703
+ figure it's probably easier to parse for after all robots so we've almost got to 20 minutes and this
704
+
705
+ 177
706
+ 00:18:48,480 --> 00:18:56,880
707
+ is actually a success because I wasted 20 minutes of my uh of the evening speaking into a microphone and
708
+
709
+ 178
710
+ 00:18:56,880 --> 00:19:02,720
711
+ the levels were shot and it uh it was clipping and I said I can't read you an evaluation I have to
712
+
713
+ 179
714
+ 00:19:02,720 --> 00:19:09,440
715
+ be fair I have to give the models a chance to do their thing uh what am I hoping to achieve in this
716
+
717
+ 180
718
+ 00:19:09,440 --> 00:19:14,960
719
+ okay my fine tune was a dud as mentioned deepgram stt I'm really really hopeful that this prototype
720
+
721
+ 181
722
+ 00:19:14,960 --> 00:19:20,560
723
+ will work and it's a built in public open source so anyone is welcome to use it if I make anything good
724
+
725
+ 182
726
+ 00:19:21,600 --> 00:19:28,000
727
+ but that was really exciting for me last night when after hours of um trying my own prototype seeing
728
+
729
+ 183
730
+ 00:19:28,080 --> 00:19:33,120
731
+ someone just made something that works like that you know you're not going to have to build a custom
732
+
733
+ 184
734
+ 00:19:34,240 --> 00:19:40,960
735
+ conda environment and image I have an AMD GPU which makes things much more complicated I didn't find it
736
+
737
+ 185
738
+ 00:19:41,840 --> 00:19:46,400
739
+ and I was about to give up and I said all right let me just give deepgram's Linux thing a shot
740
+
741
+ 186
742
+ 00:19:47,040 --> 00:19:50,960
743
+ and if this doesn't work um I'm just going to go back to trying to vibe code something myself
744
+
745
+ 187
746
+ 00:19:51,600 --> 00:19:57,360
747
+ and when I ran the script I was using claude code to do the installation process
748
+
749
+ 188
750
+ 00:19:58,160 --> 00:20:02,800
751
+ it ran the script and oh my gosh it works just like that uh the tricky thing
752
+
753
+ 189
754
+ 00:20:04,480 --> 00:20:12,480
755
+ for all those ones and all the nitty-ditty-ditty-gritty details um was that I don't think it was actually
756
+
757
+ 190
758
+ 00:20:12,480 --> 00:20:18,160
759
+ struggling with transcription but pasting Wayland makes life very hard and I think there was
760
+
761
+ 191
762
+ 00:20:18,160 --> 00:20:22,800
763
+ something not running at the right time anyway deepgram I looked at how they actually handled
764
+
765
+ 192
766
+ 00:20:22,960 --> 00:20:28,960
767
+ that because it worked out of the box when other stuff didn't and it was quite a clever little mechanism
768
+
769
+ 193
770
+ 00:20:29,520 --> 00:20:34,560
771
+ and but more so than that the accuracy was brilliant now what am I doing here this is going to be a 20
772
+
773
+ 194
774
+ 00:20:34,560 --> 00:20:44,399
775
+ minute audio uh sample and I'm I think I've done one or two of these before but I did it with
776
+
777
+ 195
778
+ 00:20:45,360 --> 00:20:51,120
779
+ short snappy voice notes this is kind of long form this actually might be a better approximation
780
+
781
+ 196
782
+ 00:20:51,120 --> 00:20:55,040
783
+ for what's useful to me than voice memos like I need to buy three
784
+
785
+ 197
786
+ 00:20:55,840 --> 00:20:59,840
787
+ liters of milk tomorrow and pita bread which is probably how like half my voice note
788
+
789
+ 198
790
+ 00:20:59,840 --> 00:21:04,399
791
+ voice notes sound like if anyone were to I don't know like find my phone they'd be like this is
792
+
793
+ 199
794
+ 00:21:04,399 --> 00:21:09,280
795
+ the most boring person in the world although actually there are some like kind of uh journaling
796
+
797
+ 200
798
+ 00:21:09,280 --> 00:21:14,080
799
+ thoughts as well but it's a lot of content like that and the probably for the evaluation the most
800
+
801
+ 201
802
+ 00:21:14,080 --> 00:21:22,560
803
+ useful thing is slightly obscure tech github new cleano hugging face not so obscure that it's not
804
+
805
+ 202
806
+ 00:21:22,560 --> 00:21:27,360
807
+ going to have a chance of knowing it but hopefully sufficiently well known that the models should get
808
+
809
+ 203
810
+ 00:21:27,360 --> 00:21:32,800
811
+ it uh I tried to do a little bit of speaking really fast and speaking very slowly I would say in
812
+
813
+ 204
814
+ 00:21:32,800 --> 00:21:38,960
815
+ general I've delivered this at a faster pace than I usually would owing to strong coffee
816
+
817
+ 205
818
+ 00:21:39,120 --> 00:21:44,240
819
+ flowing through my bloodstream and the thing that I'm not going to get in this benchmark is
820
+
821
+ 206
822
+ 00:21:44,240 --> 00:21:49,920
823
+ background noise in my first take which I had to get rid of my wife came in with my son
824
+
825
+ 207
826
+ 00:21:49,920 --> 00:21:55,680
827
+ and for a good night kiss and that actually would have been super helpful to get in because it was
828
+
829
+ 208
830
+ 00:21:56,400 --> 00:22:01,600
831
+ non-diarized sorry if we had diarization a female I could say I want the male voice and that
832
+
833
+ 209
834
+ 00:22:01,600 --> 00:22:06,240
835
+ wasn't intended for transcription um and we're not going to get background noise like people
836
+
837
+ 210
838
+ 00:22:06,240 --> 00:22:11,840
839
+ honking their horns which is something I've done in my main data set where I am trying to go back
840
+
841
+ 211
842
+ 00:22:11,840 --> 00:22:16,880
843
+ to some of my voice notes annotate them and run a benchmark but this is going to be just a pure
844
+
845
+ 212
846
+ 00:22:17,680 --> 00:22:24,960
847
+ quick test and as someone I'm working on a voice note idea that's my sort of end
848
+
849
+ 213
850
+ 00:22:26,560 --> 00:22:30,320
851
+ motivation besides thinking it's an absolutely outstanding technology that's coming to
852
+
853
+ 214
854
+ 00:22:30,960 --> 00:22:36,240
855
+ viability and really I know it sounds cheesy it can actually have a very transformative effect
856
+
857
+ 215
858
+ 00:22:37,120 --> 00:22:42,720
859
+ it's you know voice technology has been life changing for folks living with
860
+
861
+ 216
862
+ 00:22:44,000 --> 00:22:49,760
863
+ disabilities and I think there's something really nice about the fact that it can also benefit
864
+
865
+ 217
866
+ 00:22:50,480 --> 00:22:54,639
867
+ you know folks who are able bodied and like we can all in different ways
868
+
869
+ 218
870
+ 00:22:55,120 --> 00:23:02,560
871
+ um make this tech as useful as possible regardless of the exact way that we're using it um and I
872
+
873
+ 219
874
+ 00:23:02,560 --> 00:23:07,760
875
+ think there's something very powerful in that and it can be very cool um I see huge potential what
876
+
877
+ 220
878
+ 00:23:07,760 --> 00:23:14,480
879
+ excites me about voice tech a lot of things actually firstly the fact that it's cheap and accurate
880
+
881
+ 221
882
+ 00:23:14,480 --> 00:23:19,040
883
+ as I mentioned at the very start of this um and it's getting better and better with stuff like
884
+
885
+ 222
886
+ 00:23:19,040 --> 00:23:24,160
887
+ accent handling um I'm not sure my fight my fine tune will actually ever come to fruition in the
888
+
889
+ 223
890
+ 00:23:24,160 --> 00:23:30,240
891
+ sense that I'll use it day to day as I imagined and get like super flawless word error rates because
892
+
893
+ 224
894
+ 00:23:30,240 --> 00:23:37,680
895
+ I'm just kind of skeptical about local speech to text as I mentioned and I think the pace of
896
+
897
+ 225
898
+ 00:23:37,680 --> 00:23:42,720
899
+ innovation and improvement in the models the main reasons for fine tuning from what I've seen
900
+
901
+ 226
902
+ 00:23:44,320 --> 00:23:50,480
903
+ have been people who are something that really blows blows my mind about ASR is the idea that it's
904
+
905
+ 227
906
+ 00:23:50,480 --> 00:24:00,080
907
+ inherently you know multilingual phonetic based so folks who speak very obscure languages
908
+
909
+ 228
910
+ 00:24:00,080 --> 00:24:04,800
911
+ that there may be there there might be a paucity of training data or almost none at all and therefore
912
+
913
+ 229
914
+ 00:24:04,800 --> 00:24:11,440
915
+ the accuracy is significantly reduced or folks in very critical environments I know there are
916
+
917
+ 230
918
+ 00:24:11,440 --> 00:24:17,680
919
+ you know this is used extensively in medical transcription and dispatcher work as um you know the call
920
+
921
+ 231
922
+ 00:24:17,680 --> 00:24:24,000
923
+ centers who send out ambulances etc where accuracy is absolutely paramount and in the case of doctors
924
+
925
+ 232
926
+ 00:24:24,560 --> 00:24:29,680
927
+ radiologists they might be using very specialized vocab all the time so those are kind of the main
928
+
929
+ 233
930
+ 00:24:29,680 --> 00:24:35,680
931
+ two things and I'm not sure that really just for trying to make it better on a few random tech words
932
+
933
+ 234
934
+ 00:24:35,680 --> 00:24:41,840
935
+ with my slightly I mean I have an accent but like not you know an accent that a few other million
936
+
937
+ 235
938
+ 00:24:41,840 --> 00:24:50,720
939
+ people have ish I'm not sure that my little fine tune is going to actually be worth the bump in
940
+
941
+ 236
942
+ 00:24:50,720 --> 00:24:55,760
943
+ word error reduction if I ever actually figure out how to do it and get it up to the cloud by the
944
+
945
+ 237
946
+ 00:24:55,760 --> 00:25:00,879
947
+ time we've done that I suspect that the next generation of ASR will just be so good that it will
948
+
949
+ 238
950
+ 00:25:00,879 --> 00:25:07,040
951
+ kind of be well that would be cool for a dive but I'll just use this instead so that's going to be
952
+
953
+ 239
954
+ 00:25:07,280 --> 00:25:15,040
955
+ it for today's episode of voice training data single long shot evaluation who am I going to
956
+
957
+ 240
958
+ 00:25:15,040 --> 00:25:21,200
959
+ compare whisper is always good as a benchmark but I'm more interested in seeing whisper head-to-head
960
+
961
+ 241
962
+ 00:25:21,200 --> 00:25:27,680
963
+ with two things really one is whisper variants so you've got these projects like faster whisper
964
+
965
+ 242
966
+ 00:25:29,120 --> 00:25:34,000
967
+ distilled whisper it's a bit confusing there's a whole bunch of them and the emerging ASRs which
968
+
969
+ 243
970
+ 00:25:34,160 --> 00:25:38,960
971
+ are also a thing my intention for this is I'm not sure I'm going to have the time at any point
972
+
973
+ 244
974
+ 00:25:38,960 --> 00:25:46,320
975
+ of the foreseeable future to go back to this whole episode and create a proper source of truth where I fix
976
+
977
+ 245
978
+ 00:25:47,520 --> 00:25:53,760
979
+ everything might do it if I can get one transcription that's sufficiently close to perfection
980
+
981
+ 246
982
+ 00:25:54,960 --> 00:26:00,560
983
+ but what I would actually love to do on hugging face I think would be a great probably how I might
984
+
985
+ 247
986
+ 00:26:00,560 --> 00:26:08,080
987
+ visualize this is having the audio waveform play and then have the transcript for each model below
988
+
989
+ 248
990
+ 00:26:08,080 --> 00:26:16,320
991
+ it and maybe even a like you know two scale and maybe even a local one as well like local whisper
992
+
993
+ 249
994
+ 00:26:16,320 --> 00:26:23,919
995
+ versus open AI API etc and I can then actually listen back to segments or anyone who wants to
996
+
997
+ 250
998
+ 00:26:24,000 --> 00:26:30,000
999
+ can listen back to segments of this recording and see where a particular model struggled
1000
+
1001
+ 251
1002
+ 00:26:30,000 --> 00:26:35,600
1003
+ while others didn't as well as the sort of headline finding of which had the best WER but that would
1004
+
1005
+ 252
1006
+ 00:26:35,600 --> 00:26:41,120
1007
+ require the source of truth okay that's it hope this was I don't know maybe useful for other
1008
+
1009
+ 253
1010
+ 00:26:41,120 --> 00:26:46,480
1011
+ folks interested in STT you want to see that I always feel think I've just said something I
1012
+
1013
+ 254
1014
+ 00:26:46,480 --> 00:26:52,800
1015
+ didn't intend to STT I said for those listening carefully including hopefully the models themselves
1016
+
1017
+ 255
1018
+ 00:26:53,280 --> 00:26:58,960
1019
+ this has been myself Daniel Rosehill for more um jumbled repositories about my uh roving interests
1020
+
1021
+ 256
1022
+ 00:26:58,960 --> 00:27:06,639
1023
+ in AI but particularly agentic mcp and voice tech you can find me on github hugging face
1024
+
1025
+ 257
1026
+ 00:27:08,080 --> 00:27:14,000
1027
+ where else daniel rosehill dot com which is my personal website as well as this podcast whose name
1028
+
1029
+ 258
1030
+ 00:27:14,000 --> 00:27:17,280
1031
+ I sadly cannot remember until next time thanks for listening
1032
+
data/inference/runs/local-stt/run-3/transcript.txt ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast. Or, I may append this to a podcast that I set up recently regarding my with my thoughts on speech tech and AI in particular. More AI in generative AI I would say. But in any event, the purpose of this voice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the envelope evaluation as they might say for different speech attacks models. And I'm doing this because I I thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in the elusive task of fine-tuning whisper. Whisper is, and I'm going to just talk, I'm trying to mix up, I'm going to try a few different styles of speaking. I might whisper something at some points as well. And I'll go back to speaking loud in different parts. I'm going to send really like a crazy person because I'm also going to try to speak at different pitches and cadences in order to really try to push a speech attacks model through its paces, which is trying to make sense of is this guy just rambling on and coherently in one long sentence or are these just actually series of step standalone standalone sentences? And how is it going to handle step alone? That's not a word. What happens when you use speech attacks and you use a fake word and then you're like, wait, that's not actually, that word doesn't exist. How does AI handle that? And these and more are all the questions that I'm seeking to answer in this training data. Now, why did why was it trying to find China whisper? And what is whisper? As I said, I'm going to try to record this at a couple of different levels of technicality for folks who are in the normal world and not totally stuck down the rabbit hole of AI. What you have to say is a really wonderful rabbit hole to be to be done. 
It's a really interesting area and speech and voice tech is the aspect of it that I find actually most I'm not sure I would say the most interesting because there's just so much that is fascinating in AI. But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work. And I am persevering hard with the task of training, I guess, a good solution working for Linux. Would you have anyone actually does listen to this not just for the training data and for the actual content? This is this is sparked. I had, besides the fine tune not working, well that was the failure. I used plot code because one thing these days that there is nothing short of solving, you know, the reason of life or something that's flawed and agentically I can't do, which is not really the case. It does seem that way sometimes but it fails a lot as well. And this is one of those instances where last week I put together an hour of voice training data, basically speaking just random things for three minutes and
2
+
3
+ it was actually kind of tedious because the text were really weird. Some of them were it was like, it was AI generated. I tried before to reach Sherlock Holmes for an hour and I just couldn't, I was so bored after 10 minutes that I was like, okay, knowing just gonna have to find something else to read. So I used a created with AI Studio, a vibe code is a synthetic text generator, which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content. So I was like, okay, give me a voice note. Like I'm recording an email, give me a short story to read, give me pros to read. So I came up with all these different things and they added a little timer to it so I could see how close I was to one hour and I spent like an hour one afternoon or probably two hours by the time you do retakes and whatever because you want to, it gave me a source of truth which I'm not sure if that's the scientific way to approach this. Topic of gathering training data but I thought made sense. I have a lot of audio data from recording voice notes which I've also kind of used being experimenting with using for a different purpose. It's slightly different annotating task types. It's more text classification experiment or well it's more than that actually working on a voice app so it's a prototype I guess is really more accurate.
4
+
5
+ But you can do that and you can work backwards. You listen back to a voice note and you painfully go through one of those transcribing where you start and stop and scrub around it and you fix the errors but it's really really pouring to do that. So I thought it would be last tedious in the long term if I just recorded this source of truth so it gave me these three minute snippets. I recorded them it saved in MP3 and the TXT in the same folder and I created an error that data. So I was very hopeful quite a little bit hopeful that I would be able that I could actually fine tune whisper. I want to fine tune whisper because when I got into voice tech last November my wife was in the US and I was alone at home and when crazy people like me do really wild things like use voice to tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high. I used speech tech now and again try it out. It's like it'd be really cool if you could just like speak into your computer and whatever I tried out that had Linux support was just it was not good basically and this blew me away from the first go. I mean it wasn't 100% accurate out of the box and it took work but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this. There's a point where it's so like the transcript is you don't have to get 100% accuracy for it to be worth your time for it's speech attacks to be a worthwhile addition to your productivity but you do need to get above let's say I don't know 85% if it's 60% or 50% you inevitably say screw it I'll just type it because you end up missing errors in the transcript and it becomes actually worse you end up in a worse position than you started with it that's been my experience. So I was like oh this is actually really really good now how did that happen? 
The answer is ASR with per being open-sourced and the transformer architecture if you want to go back to the to the underpinnings which really blows my mind and it's on my list to read through that paper all you need is attention as attentively as can be done with my limited brain because it's super high-level stuff super advanced stuff I mean but that I think of all the things that are fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating a few people are like hang on you've got this thing that can speak to you like a chatbot and LLM then you've got image generation okay so firstly those two things on the surface have nothing in common so like how are they how did that just happen all at the same time and then when you extend that further you're like sooner right you can sing a song an AI will like come up with an instrumental and then you've got whisper and you're like wait a second how did all this stuff like if it's all AI what's like there has to be some commonality otherwise these are four these totally different technologies on the surface of it and the transformer architecture is as far as I know the answer and I can't even say I can't even pretend that I really understand what the transformer architecture means in depth but I have scandis and as I said I want to print it and really kind of think over it's at some point and I'll probably feel bad about myself I think because weren't those guys in their in their 20s like that's crazy I think I asked chat Gbt once who were the who wrote that paper and how old were they when it was published in ARC and I was expecting like I don't know what do you what do you imagine I personally imagine kind of like you know you have these breakthroughs during COVID and things like that were like these kind of really obscure scientists who are like in their 50s and they've just kind of been laboring labs and we're really in writing and publishing and kind of obscure academic publications 
and they finally like hit a big or win a Nobel Prize and then their household household names so I that was kind of what I had in mind that was the mental image I'd formed of the birth of ARC like I wasn't expecting 20 somethings in San Francisco though I thought that was both very very funny very cool and actually kind of inspiring it's nice to think that people who you know just you might put them in the kind of milieu or bubble or world that you are in are credibly in through you know the series of connections that are coming up with such literally world changing innovations so that was I thought anyway that that that was cool okay voice training data how were we doing we're about 10 minutes and I'm still talking about voice technology so whisper was brilliant and I was so excited that I was my first instinct was to like guess it's like oh my gosh I have to get like a really good microphone for this so I didn't go on a spending spree because I said I'm gonna have to just wait a month and see if I still use this and it just kind of became it's become really part of my daily routine like if I'm writing an email I'll record a voice note and then I've developed and it's nice to see that everyone is like developing the same things in parallel like that's my kind of a weird thing to say but when I look I kind of came when I started working on this these prototypes on GitHub which is where I just kind of share very freely and loosely ideas and you know first iterations on concepts and for one of a better word I called it like LLM post processing or cleanup or basically a system prompt that after you get back the raw text from whisper you run it through model and say okay this is crappy text like add sentence structure and you know fix it up and now when I'm exploring the different tools that are out there the people have built I see quite a number of projects have basically you know done the same thing less that be misconstrued I'm not saying for a millisecond that I 
inspired them I'm sure this has been a thing that's been integrated into tools for a while but it's it's the kind of thing that when you start using these tools every day the need for it is almost instantly apparent because text that doesn't have any punctuation or progress basing takes a long time to you know it takes so long to get it into a presentable email that again it's it moves speech tech into that before that inflection point where you're like nah she's not worth it it's like it'll just be quicker to type this so it's it's a big it's a little touch that actually is a big deal so I was on whisper and I've been using whisper and I kind of early on find a couple of tools I couldn't find what I was looking for on Linux which is basically just something that'll run in the background you'll give it an API key and it'll just like transcribe
6
+
7
with a little hotkey to start and stop the dictation. The issue was that I discovered most people creating these projects were very much focused on local models and running Whisper locally, because you can. I tried that a bunch of times and just never got results as good as the cloud. And when I began looking at the cost of the speech-to-text APIs and what I was spending, I thought: this is actually, in my opinion, one of the better deals in API and cloud spending. It's just not that expensive for very, very good models; you get to run the full, latest model versus whatever you can run on your average GPU, and unless you want to buy a crazy GPU it doesn't really make sense to me. Privacy is another concern, which I know is very much a separate thing: people just don't want their voice data leaving their local environment, maybe for regulatory reasons as well. But I'm not in that camp. I don't really care about people listening to my grocery list, which consists of reminding myself that I need to buy more beer, Cheetos, and hummus, the three staples of my diet during periods of poor nutrition. The kind of stuff I transcribe most just isn't something I'm that privacy-sensitive about, and I don't do anything so sensitive or secure that it requires air-gapping. So I looked at the pricing, and especially the older and mini models, some of them are very, very affordable. I did a calculation once with ChatGPT: okay, this is the API price for, I can't remember whatever the model was; let's say I just go at it nonstop, which rarely happens. On average I might dictate 30 to 60 minutes per day, summing up the emails, documents, and outlines, which is a lot, but still a fairly modest amount. And I was
like, well, some days I do go longer. There have been one or two days, usually when I'm out of the house with nothing else to do, say when you're at a hospital with a newborn, waiting hours for an appointment, where before becoming a speech fanatic I would have listened to podcasts, and now I think: wait, let me just get these ideas out of my head. That's when I go on my speech binges, but those happen once every few months, not frequently. So I said, okay, let's price out the worst case: if I dedicated every second of every waking hour to transcribing for some odd reason. I mean, I have to eat and use the toilet, there are only so many hours I'm awake, so let's say a maximum of 45 minutes of speech in each hour. All right, let's say 50, who knows, maybe you're dictating on the toilet; you could even say 60, but whatever. Every day, going flat out, seven days a week, dictating nonstop: what's my monthly API bill going to be at this price? It came out to something like 70 or 80 bucks, and I thought, well, that would be an extraordinary amount of dictation, and I would hope there was some compelling reason worth more than 70 dollars for embarking on it. Given that that's the maximum for me, that's actually very, very affordable. Now, if you want to spec out the costs and you want the post-processing, which I really do feel is valuable, that's going to cost more as well, unless you're using Gemini. Needless to say, I'm a random person sitting in Jerusalem with no affiliation with Google, nor Anthropic, nor any major tech vendor for that matter. I like Gemini, not so much as an everyday model, where it's kind of underwhelmed me, I would say, but for multimodal I think it's got a lot to offer, and I think
that the transcription functionality, whereby it can process audio together with a system prompt and give you back a transcript that's already cleaned up, reduces two steps to one, and that for me is a very big deal. I feel like even Google hasn't really thought through how useful that modality is and how many use cases you can achieve with it, because over the course of this year I've found an endless list of system-prompt-driven workflows. I've used it to capture context data for AI, which means I might speak for ten minutes about, who knows, my childhood, or more realistically my career goals, something that would be really boring to type out; I'll just sit in my car and record it, and in ten minutes you get a lot of information. Emails, short texts, there's a whole bunch, and all these workflows require a little bit of treatment afterwards, and different treatment. My context pipeline just extracts the bare essentials, so you start with me talking very loosely about what I've done in my career, where I've worked, where I'd like to work, and it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database: "Daniel has worked in technology," stuff like that. That's not how you would speak, but I figure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes, and this is actually a success, because yesterday I wasted 20 minutes of the evening speaking into a microphone where the levels were shot and it was clipping, and I said, I can't base an evaluation on that; I have to be fair and give the models a chance to do their thing. What am I hoping to achieve here? Okay, my fine-tune was a dud, as mentioned. Deepgram STT: I'm really, really hopeful that this prototype will
work, and it's built in public, open source, so anyone is welcome to use it if I make anything good. It was really exciting last night when, after hours of trying my own prototype, I saw someone had just made something that works, where you don't have to build a custom conda environment and image. I have an AMD GPU, which makes things much more complicated; I couldn't get it working and was about to give up, and I said, all right, let me just give Deepgram's Linux thing a shot, and if this doesn't work, I'll go back to trying to vibe-code something myself. When I ran the script, I was using Claude Code to handle the installation process; it ran the script, and oh my gosh, it worked just like that. The tricky thing, for anyone who wants the nitty-gritty details, was that I don't think the earlier tools were actually struggling with transcription, but with pasting: Wayland makes life very hard, and I think something wasn't running at the right time. Anyway, I looked at how Deepgram handled that, because it worked out of the box when other stuff didn't, and it was quite a clever little mechanism. But more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20-minute audio sample. I think I've done one or two of these before, but with short, snappy voice notes; this is long-form, which actually might be a better approximation of what's useful to me than voice memos like "I need to buy three liters of milk tomorrow and pita bread," which is probably how half my voice notes sound. If anyone were to find my phone, they'd think this is the most boring person in the world, although actually there are some journaling thoughts as well. But it's a lot of content like that, and probably the most useful thing for the evaluation is slightly obscure tech vocabulary: GitHub, Hugging Face, and the like; not so obscure that the models won't have a
chance of knowing it, but hopefully sufficiently well known that they should get it. I tried to do a little bit of speaking really fast and speaking very slowly. In general I've delivered this at a faster pace than I usually would, owing to strong coffee flowing through my bloodstream. The thing I'm not going to get into in this sample is background noise. In my first take, which I had to get rid of, my wife came in with my son for a good-night kiss, and that actually would have been super helpful to keep, because with diarization I could say I only want the male voice, since the female voice wasn't intended for transcription. And we're not going to get background noise like people honking their horns, which is something I do have in my main dataset, where I'm trying to go back to some of my voice notes, annotate them, and run a benchmark. This is going to be just a pure quick test. As mentioned, I'm working on a voice note idea; that's my end motivation, besides thinking this is an absolutely outstanding technology that's coming to viability and, as cheesy as it sounds, can have a very transformative effect. Voice technology has been life-changing for folks living with disabilities, and I think there's something really nice about the fact that it can also benefit folks who are able-bodied, and that we can all, in different ways, make this tech as useful as possible regardless of the exact way we're using it. I think there's something very powerful in that, and it can be very cool. I see huge potential. What excites me about voice tech? A lot of things, actually. Firstly, the fact that it's cheap and accurate, as I mentioned at the very start, and it's getting better and better with stuff like accent handling. I'm not sure my fine-tune will ever come to fruition in the sense that I'll use it day to day as I imagined and get
near-flawless word error rates, because I'm just kind of skeptical about local speech-to-text, as I mentioned, and about keeping pace with the innovation and improvement in the models. The main reasons for fine-tuning, from what I've seen, and something that really blows my mind about ASR is that it's inherently multilingual and phonetically based, are folks who speak very obscure languages, where there might be a paucity of training data or almost none at all, so accuracy is significantly reduced; and folks in very critical environments. I know this is used extensively in medical transcription and in dispatch work, the call centers that send out ambulances and so on, where accuracy is absolutely paramount, and doctors and radiologists might be using very specialized vocab all the time. Those are the main two cases, and I'm not sure that just trying to make it better on a few random tech words, with my accent, which a few other million people have, ish, is worth it. I'm not sure my little fine-tune will actually deliver the bump in word error reduction, and by the time I figure out how to do it and get it up to the cloud, I suspect the next generation of ASR will just be so good that it'll be a case of "well, that would have been cool," and I'll use that instead. So that's going to be it for today's episode of voice training data: a single long-shot evaluation. Who am I going to compare? Whisper is always good as a benchmark, but I'm more interested in seeing Whisper head-to-head with two things, really. One is Whisper variants: you've got these projects like faster-whisper and Distil-Whisper, it's a bit confusing, there's a whole bunch of them. The other is the emerging ASRs, which are also a thing. My intention for this: I'm not sure I'm going to have the time at any point in the
foreseeable future to go back through this whole episode and create a proper source of truth where I fix everything. I might do it if I can get one transcription that's sufficiently close to perfection. But what I would actually love to do on Hugging Face, and probably how I might visualize this, is to have the audio waveform play and the transcript for each model below it, maybe even with a toggle, and maybe a local option as well: local Whisper versus the OpenAI API, etc. Then I, or anyone who wants to, can listen back to segments of this recording and see where a particular model struggled while others didn't, as well as the headline finding of which had the best WER. But that would require the source of truth. Okay, that's it. Hope this was, I don't know, maybe useful for other folks interested in STT. You see, I always think I've just said something I didn't intend to. STT, I said, for those listening carefully, including hopefully the models themselves. This has been myself, Daniel Rosehill. For more jumbled repositories about my roving interests in AI, particularly agentic, MCP, and voice tech, you can find me on GitHub and Hugging Face, and, where else, danielrosehill.com, which is my personal website, as well as on this podcast, whose name I sadly cannot remember. Until next time, thanks for listening.
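The "LLM post-processing" step described earlier — running raw Whisper output through a model with a cleanup system prompt — can be sketched roughly as below. The prompt wording and the `callLLM` hook are illustrative assumptions, not any particular tool's API:

```javascript
// Sketch of the two-step STT pipeline: raw transcript in, cleaned prose out.
// callLLM is a placeholder for whatever chat-completion client you use.
function buildCleanupPrompt(rawTranscript) {
  return [
    "The text below came from speech-to-text and has no punctuation or paragraphs.",
    "Add sentence structure and paragraph breaks; fix obvious mis-transcriptions.",
    "Do not add, remove, or reorder any information.",
    "",
    rawTranscript,
  ].join("\n");
}

async function cleanTranscript(rawTranscript, callLLM) {
  const prompt = buildCleanupPrompt(rawTranscript);
  return callLLM(prompt); // resolves to the cleaned-up text
}
```

With a multimodal model like Gemini, the same instruction can accompany the audio itself, collapsing transcription and cleanup into one call, which is the one-step workflow praised above.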
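The back-of-the-envelope cost estimate above generalizes to a one-line formula. The per-minute rate below is an assumed placeholder, roughly in line with budget cloud STT tiers; substitute your provider's published price:

```javascript
// Monthly STT API spend: minutes dictated per day × per-minute rate × days.
function monthlySttCostUSD(minutesPerDay, ratePerMinuteUSD, daysPerMonth = 30) {
  return minutesPerDay * ratePerMinuteUSD * daysPerMonth;
}

const rate = 0.006; // assumed $/minute; check your provider's pricing page

// Typical usage described above: 30-60 minutes of dictation per day.
const typical = monthlySttCostUSD(45, rate); // ≈ $8 per month

// "Flat out" ceiling: ~45 minutes of speech in every hour of a 16-hour day.
const ceiling = monthlySttCostUSD(45 * 16, rate); // order of $100+ per month
```

The point of the episode's calculation survives any reasonable rate: even the absurd flat-out ceiling stays in low-double or low-triple digits per month, while realistic daily use is a few dollars.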
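The headline metric mentioned, WER, is word-level edit distance divided by the reference length. A minimal implementation (no normalization for casing or punctuation, which a real benchmark against the ground truth would want):

```javascript
// Word error rate: (substitutions + insertions + deletions) / reference words,
// computed with a standard Levenshtein dynamic-programming table over words.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);
  // d[i][j] = edit distance between first i reference and first j hypothesis words
  const d = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,      // deletion
        d[i][j - 1] + 1,      // insertion
        d[i - 1][j - 1] + cost // substitution or match
      );
    }
  }
  return d[ref.length][hyp.length] / ref.length;
}
```

Running each model's SRT text through this against the ground-truth transcript would yield the head-to-head WER table the episode describes.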
index.html CHANGED
@@ -1,19 +1,250 @@
1
  <!doctype html>
2
- <html>
3
  <head>
4
  <meta charset="utf-8" />
5
  <meta name="viewport" content="width=device-width" />
6
- <title>My static Space</title>
7
  <link rel="stylesheet" href="style.css" />
8
  </head>
9
  <body>
10
- <div class="card">
11
- <h1>Welcome to your static Space!</h1>
12
- <p>You can modify this app directly by editing <i>index.html</i> in the Files and versions tab.</p>
13
- <p>
14
- Also don't forget to check the
15
- <a href="https://huggingface.co/docs/hub/spaces" target="_blank">Spaces documentation</a>.
16
- </p>
17
- </div>
 
18
  </body>
19
  </html>
 
1
  <!doctype html>
2
+ <html lang="en">
3
  <head>
4
  <meta charset="utf-8" />
5
  <meta name="viewport" content="width=device-width" />
6
+ <title>STT Comparison Playground</title>
7
  <link rel="stylesheet" href="style.css" />
8
  </head>
9
  <body>
10
+ <main class="app">
11
+ <section class="hero">
12
+ <div>
13
+ <h1>Speech-to-Text Comparison</h1>
14
+ <p>
15
+ Play the sample podcast and compare how each transcription model handled it.
16
+ The ground-truth reference stays on top in green so you can quickly gauge accuracy.
17
+ </p>
18
+ </div>
19
+ <div class="audio-shell">
20
+ <audio id="audio" controls preload="auto" src="data/audio/podcast.mp3"></audio>
21
+ <canvas id="waveform" role="img" aria-label="Audio waveform preview"></canvas>
22
+ </div>
23
+ </section>
24
+ <section class="transcripts">
25
+ <div id="reference-track" aria-live="polite"></div>
26
+ <div class="models-grid" id="models-grid" aria-live="polite"></div>
27
+ </section>
28
+ </main>
29
+ <script src="transcripts.js"></script>
30
+ <script type="module">
31
+ const referenceTrackEl = document.getElementById("reference-track");
32
+ const modelsGridEl = document.getElementById("models-grid");
33
+ const audioElem = document.getElementById("audio");
34
+ const waveformCanvas = document.getElementById("waveform");
35
+ const transcriptSources = window.TRANSCRIPTS || {};
36
+ const tracks = [
37
+ {
38
+ id: "truth",
39
+ label: "Ground Truth",
40
+ file: "data/ground-truth/truth_1.srt",
41
+ accent: "#00b894",
42
+ emphasis: true
43
+ },
44
+ {
45
+ id: "assembly",
46
+ label: "AssemblyAI",
47
+ file: "srt-out/assembly.srt",
48
+ accent: "#4070f4"
49
+ },
50
+ {
51
+ id: "gladia",
52
+ label: "Gladia",
53
+ file: "srt-out/gladia.srt",
54
+ accent: "#9b5de5"
55
+ },
56
+ {
57
+ id: "nova3",
58
+ label: "Deepgram Nova 3",
59
+ file: "srt-out/nova3.srt",
60
+ accent: "#ff6b6b"
61
+ },
62
+ {
63
+ id: "speechmatics",
64
+ label: "Speechmatics",
65
+ file: "srt-out/speechmatics.srt",
66
+ accent: "#ffa600"
67
+ }
68
+ ];
69
+
70
+ const segmentNodes = [];
71
+
72
+ function parseTimestamp(value) {
73
+ const [time, millisecondPart] = value.split(",");
74
+ const [hours, minutes, seconds] = time.split(":").map(Number);
75
+ const milliseconds = Number(millisecondPart);
76
+ return hours * 3600 + minutes * 60 + seconds + milliseconds / 1000;
77
+ }
78
+
79
+ function parseSrt(text) {
80
+ const blocks = text.replace(/\r/g, "").trim().split(/\n{2,}/);
81
+ return blocks
82
+ .map((block) => {
83
+ const lines = block.split("\n");
84
+ if (lines.length < 3) return null;
85
+ const timing = lines[1];
86
+ const [start, end] = timing.split("-->").map((part) => parseTimestamp(part.trim()));
87
+ const content = lines.slice(2).join(" ").replace(/\s+/g, " ").trim();
88
+ return { start, end, content };
89
+ })
90
+ .filter(Boolean);
91
+ }
92
+
93
+ function createSegmentElement(segment, accent) {
94
+ const segmentEl = document.createElement("div");
95
+ segmentEl.className = "segment";
96
+ segmentEl.dataset.start = segment.start;
97
+ segmentEl.dataset.end = segment.end;
98
+ segmentEl.innerHTML = `<span class="segment-time">${formatTime(segment.start)}</span><p>${segment.content}</p>`;
99
+ segmentEl.style.setProperty("--accent", accent);
100
+ return segmentEl;
101
+ }
102
+
103
+ function formatTime(seconds) {
104
+ const minutes = Math.floor(seconds / 60)
105
+ .toString()
106
+ .padStart(2, "0");
107
+ const secs = Math.floor(seconds % 60)
108
+ .toString()
109
+ .padStart(2, "0");
110
+ return `${minutes}:${secs}`;
111
+ }
112
+
113
+ function getTranscriptText(track) {
114
+ const cached = transcriptSources[track.id];
115
+ if (cached) {
116
+ return cached.replace(/^\ufeff/, "");
117
+ }
118
+ return null;
119
+ }
120
+
121
+ async function fetchTranscript(track) {
122
+ const response = await fetch(track.file);
123
+ if (!response.ok) {
124
+ throw new Error(`Unable to load ${track.label}`);
125
+ }
126
+ return (await response.text()).replace(/^\ufeff/, "");
127
+ }
128
+
129
+ async function loadTrack(track) {
130
+ let transcriptText = getTranscriptText(track);
131
+ if (!transcriptText) {
132
+ try {
133
+ transcriptText = await fetchTranscript(track);
134
+ } catch (error) {
135
+ renderFallbackCard(track, error.message);
136
+ throw error;
137
+ }
138
+ }
139
+ const transcript = parseSrt(transcriptText);
140
+
141
+ const trackEl = document.createElement("article");
142
+ trackEl.className = "track";
143
+ if (track.emphasis) {
144
+ trackEl.classList.add("track--emphasis");
145
+ }
146
+ trackEl.style.setProperty("--accent", track.accent);
147
+
148
+ trackEl.innerHTML = `
149
+ <header>
150
+ <h2>${track.label}</h2>
151
+ <span class="badge">Segments: ${transcript.length}</span>
152
+ </header>
153
+ `;
154
+
155
+ const contentEl = document.createElement("div");
156
+ contentEl.className = "track-body";
157
+ transcript.forEach((segment) => {
158
+ const segmentEl = createSegmentElement(segment, track.accent);
159
+ contentEl.appendChild(segmentEl);
160
+ segmentNodes.push(segmentEl);
161
+ });
162
+
163
+ trackEl.appendChild(contentEl);
164
+ if (track.emphasis) {
165
+ referenceTrackEl.appendChild(trackEl);
166
+ } else {
167
+ modelsGridEl.appendChild(trackEl);
168
+ }
169
+ }
170
+
171
+ function updateActiveSegments(time) {
172
+ segmentNodes.forEach((node) => {
173
+ const start = Number(node.dataset.start);
174
+ const end = Number(node.dataset.end);
175
+ const isActive = time >= start && time <= end;
176
+ node.classList.toggle("is-active", isActive);
177
+ });
178
+ }
179
+
180
+ async function drawWaveform() {
181
+ const response = await fetch(audioElem.currentSrc || audioElem.src);
182
+ if (!response.ok) return;
183
+ const arrayBuffer = await response.arrayBuffer();
184
+ const audioContext = new AudioContext();
185
+ const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
186
+ const rawData = audioBuffer.getChannelData(0);
187
+ const canvas = waveformCanvas;
188
+ const dpr = window.devicePixelRatio || 1;
189
+ canvas.width = canvas.clientWidth * dpr;
190
+ canvas.height = canvas.clientHeight * dpr;
191
+ const ctx = canvas.getContext("2d");
192
+ ctx.scale(dpr, dpr);
193
+ ctx.clearRect(0, 0, canvas.clientWidth, canvas.clientHeight);
194
+ const sliceWidth = Math.floor(rawData.length / canvas.clientWidth);
195
+ const halfHeight = canvas.clientHeight / 2;
196
+ ctx.lineWidth = 1.25;
197
+ ctx.strokeStyle = "#1f2937";
198
+ ctx.beginPath();
199
+ for (let i = 0; i < canvas.clientWidth; i++) {
200
+ const sliceStart = i * sliceWidth;
201
+ let sum = 0;
202
+ for (let j = 0; j < sliceWidth; j++) {
203
+ sum += Math.abs(rawData[sliceStart + j] || 0);
204
+ }
205
+ const amplitude = sum / sliceWidth;
206
+ const y = halfHeight - amplitude * halfHeight;
207
+ const yBottom = halfHeight + amplitude * halfHeight;
208
+ ctx.moveTo(i, y);
209
+ ctx.lineTo(i, yBottom);
210
+ }
211
+ ctx.stroke();
212
+ }
213
+
214
+ async function bootstrap() {
215
+ await Promise.all(
216
+ tracks.map(async (track) => {
217
+ try {
218
+ await loadTrack(track);
219
+ } catch (error) {
220
+ console.error(error);
221
+ }
222
+ })
223
+ );
224
+ void drawWaveform();
225
+ }
226
+
227
+ function renderFallbackCard(track, message) {
228
+ const card = document.createElement("article");
229
+ card.className = "track track--error";
230
+ card.innerHTML = `
231
+ <header>
232
+ <h2>${track.label}</h2>
233
+ <span class="badge badge--error">Unavailable</span>
234
+ </header>
235
+ <p class="track-error">${message}</p>
236
+ `;
237
+ if (track.emphasis) {
238
+ referenceTrackEl.appendChild(card);
239
+ } else {
240
+ modelsGridEl.appendChild(card);
241
+ }
242
+ }
243
+
244
+ audioElem.addEventListener("timeupdate", () => updateActiveSegments(audioElem.currentTime));
245
+ window.addEventListener("resize", () => drawWaveform());
246
+
247
+ bootstrap();
248
+ </script>
249
  </body>
250
  </html>
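The page's `parseTimestamp` helper converts SRT `HH:MM:SS,mmm` stamps into seconds for the segment highlighting. The same logic, copied standalone here for a quick sanity check outside the browser:

```javascript
// Standalone copy of the page's SRT timestamp conversion.
// "00:01:04,880" -> 64.88 seconds
function parseTimestamp(value) {
  const [time, millisecondPart] = value.split(",");
  const [hours, minutes, seconds] = time.split(":").map(Number);
  return hours * 3600 + minutes * 60 + seconds + Number(millisecondPart) / 1000;
}
```

Note that SRT uses a comma before the milliseconds, while WebVTT uses a period, so this parser is SRT-specific by design.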
srt-out/assembly.srt ADDED
@@ -0,0 +1,1880 @@
1
+ 1
2
+ 00:00:00,080 --> 00:00:05,680
3
+ Hello and welcome to a audio data set consisting
4
+
5
+ 2
6
+ 00:00:05,680 --> 00:00:10,640
7
+ of one single episode of a non-existent podcast. Or I
8
+
9
+ 3
10
+ 00:00:10,720 --> 00:00:13,360
11
+ may append this to a podcast that I set up
12
+
13
+ 4
14
+ 00:00:13,600 --> 00:00:19,200
15
+ recently regarding my with my thoughts on speech
16
+
17
+ 5
18
+ 00:00:19,280 --> 00:00:24,000
19
+ tech and AI in particular, more AI in generative AI,
20
+
21
+ 6
22
+ 00:00:24,240 --> 00:00:28,640
23
+ I would say. But in any event, the purpose of
24
+
25
+ 7
26
+ 00:00:28,720 --> 00:00:33,850
27
+ this Voice recording is actually to create a lengthy
28
+
29
+ 8
30
+ 00:00:33,930 --> 00:00:37,130
31
+ voice sample for a quick evaluation, a back of the
32
+
33
+ 9
34
+ 00:00:37,130 --> 00:00:40,650
35
+ envelope evaluation, as they might say, for different speech attack
36
+
37
+ 10
38
+ 00:00:40,890 --> 00:00:43,450
39
+ models. And I'm doing this because I thought I had
40
+
41
+ 11
42
+ 00:00:43,450 --> 00:00:46,810
43
+ made a great breakthrough in my journey with speech tech,
44
+
45
+ 12
46
+ 00:00:47,130 --> 00:00:50,730
47
+ and that was succeeding in the elusive task of fine-tuning
48
+
49
+ 13
50
+ 00:00:50,730 --> 00:00:54,810
51
+ Whisper. Whisper is, and I'm going to just talk, I'm
52
+
53
+ 14
54
+ 00:00:54,890 --> 00:00:58,250
55
+ trying to mix up, I'm going to try a few
56
+
57
+ 15
58
+ 00:00:58,410 --> 00:01:01,530
59
+ different styles of speaking. I might whisper something at some
60
+
61
+ 16
62
+ 00:01:01,610 --> 00:01:04,880
63
+ point. As well. And I'll go back to speaking loud
64
+
65
+ 17
66
+ 00:01:04,960 --> 00:01:08,080
67
+ in, in different parts. I'm going to sound really like
68
+
69
+ 18
70
+ 00:01:08,160 --> 00:01:11,120
71
+ a crazy person because I'm also going to try to
72
+
73
+ 19
74
+ 00:01:11,280 --> 00:01:16,240
75
+ speak at different pitches and cadences in order to really
76
+
77
+ 20
78
+ 00:01:16,560 --> 00:01:20,560
79
+ try to put a speech attacks model through its paces,
80
+
81
+ 21
82
+ 00:01:20,720 --> 00:01:23,040
83
+ which is trying to make sense of is this guy
84
+
85
+ 22
86
+ 00:01:23,200 --> 00:01:28,060
87
+ just rambling on incoherently in one long sentence or are
88
+
89
+ 23
90
+ 00:01:28,460 --> 00:01:34,220
91
+ these just actually a series of step, standalone,
92
+
93
+ 24
94
+ 00:01:34,380 --> 00:01:37,420
95
+ step alone, standalone sentences? And how is it gonna handle
96
+
97
+ 25
98
+ 00:01:37,500 --> 00:01:40,460
99
+ step alone? That's not a word. What happens when you
100
+
101
+ 26
102
+ 00:01:40,540 --> 00:01:43,020
103
+ use speech to text and you use a fake word?
104
+
105
+ 27
106
+ 00:01:43,180 --> 00:01:45,580
107
+ And then you're like, wait, that's not actually, that word
108
+
109
+ 28
110
+ 00:01:45,740 --> 00:01:50,220
111
+ doesn't exist. How does AI handle that? And these and
112
+
113
+ 29
114
+ 00:01:50,460 --> 00:01:54,300
115
+ more are all the questions that I'm seeking to answer
116
+
117
+ 30
118
+ 00:01:54,460 --> 00:01:57,500
119
+ in this training data. Now, why was it trying to
120
+
121
+ 31
122
+ 00:01:57,500 --> 00:02:00,290
123
+ fine tune Whisper? And what is Whisper? As I said,
124
+
125
+ 32
126
+ 00:02:00,370 --> 00:02:03,010
127
+ I'm going to try to record this at a couple
128
+
129
+ 33
130
+ 00:02:03,170 --> 00:02:07,490
131
+ of different levels of technicality for folks who are, you
132
+
133
+ 34
134
+ 00:02:07,490 --> 00:02:11,730
135
+ know, in the normal world and not totally stuck down
136
+
137
+ 35
138
+ 00:02:11,810 --> 00:02:13,810
139
+ the rabbit hole of AI, which I have to say
140
+
141
+ 36
142
+ 00:02:13,970 --> 00:02:18,130
143
+ is a really wonderful rabbit hole to be down. It's
+
+ 37
+ 00:02:18,210 --> 00:02:21,570
+ a really interesting area and speech and voice tech is
+
+ 38
+ 00:02:21,970 --> 00:02:24,610
+ the aspect of it that I find actually the most,
+
+ 39
+ 00:02:25,010 --> 00:02:27,410
+ I'm not sure I would say the most interesting because
+
+ 40
+ 00:02:27,650 --> 00:02:31,370
+ there's just so much that is fascinating in AI. But
+
+ 41
+ 00:02:31,530 --> 00:02:34,330
+ the most that I find the most personally transformative in
+
+ 42
+ 00:02:34,410 --> 00:02:38,970
+ terms of the impact that it's had on my daily
+
+ 43
+ 00:02:39,050 --> 00:02:41,530
+ work life and productivity and how I sort of work.
+
+ 44
+ 00:02:42,170 --> 00:02:47,290
+ And I'm persevering hard with the task of trying
+
+ 45
+ 00:02:47,290 --> 00:02:50,330
+ to get a good solution working for Linux, which if
+
+ 46
+ 00:02:50,330 --> 00:02:52,330
+ anyone actually does listen to this, not just for the
+
+ 47
+ 00:02:52,330 --> 00:02:56,490
+ training data and for the actual content, this is sparked
+
+ 48
+ 00:02:56,830 --> 00:03:00,030
+ I had, besides the fine tune not working, well, that
+
+ 49
+ 00:03:00,110 --> 00:03:05,310
+ was the failure. Um, I used Claude Code because one
+
+ 50
+ 00:03:05,550 --> 00:03:10,030
+ thinks these days that there is nothing short of solving,
+
+ 51
+ 00:03:11,070 --> 00:03:15,470
+ you know, the, the reason of life or something, that
+
+ 52
+ 00:03:15,870 --> 00:03:19,070
+ Claude and agentic AI can't do, which is not really
+
+ 53
+ 00:03:19,150 --> 00:03:22,270
+ the case. Uh, it does seem that way sometimes, but
+
+ 54
+ 00:03:22,430 --> 00:03:24,270
+ it fails a lot as well. And this is one
+
+ 55
+ 00:03:24,270 --> 00:03:27,710
+ of those instances where last week I put together an
+
+ 56
+ 00:03:27,790 --> 00:03:32,090
+ hour of voice training data, basically speaking, just random things
+
+ 57
+ 00:03:32,330 --> 00:03:37,130
+ for 3 minutes. And it was actually kind of tedious
+
+ 58
+ 00:03:37,210 --> 00:03:39,290
+ because the texts were really weird. Some of them were
+
+ 59
+ 00:03:39,530 --> 00:03:43,130
+ it was like it was AI generated. I tried before
+
+ 60
+ 00:03:43,290 --> 00:03:45,210
+ to read Sherlock Holmes for an hour and I just
+
+ 61
+ 00:03:45,210 --> 00:03:48,410
+ couldn't. I was so bored after 10 minutes that I
+
+ 62
+ 00:03:48,410 --> 00:03:50,810
+ was like, okay, no, I'm just going to have to
+
+ 63
+ 00:03:50,810 --> 00:03:55,370
+ find something else to read. So I used a created
+
+ 64
+ 00:03:55,770 --> 00:04:01,360
+ with AI Studio vibe coded a synthetic text generator. Which
+
+ 65
+ 00:04:01,680 --> 00:04:03,920
+ actually I thought was probably a better way of doing
+
+ 66
+ 00:04:04,000 --> 00:04:07,520
+ it because it would give me more short samples with
+
+ 67
+ 00:04:07,760 --> 00:04:10,560
+ more varied content. So I was like, okay, give me
+
+ 68
+ 00:04:10,960 --> 00:04:13,840
+ a voice note, like I'm recording an email, give me
+
+ 69
+ 00:04:14,080 --> 00:04:17,760
+ a short story to read, give me prose to read.
+
+ 70
+ 00:04:18,080 --> 00:04:20,480
+ So I came up with all these different things and
+
+ 71
+ 00:04:20,640 --> 00:04:22,640
+ they added a little timer to it so I could
+
+ 72
+ 00:04:22,800 --> 00:04:26,480
+ see how close I was to one hour. And I
+
+ 73
+ 00:04:26,640 --> 00:04:29,680
+ spent like an hour one afternoon or probably two hours
+
+ 74
+ 00:04:29,840 --> 00:04:33,410
+ by the time you you do retakes. And whatever, because
+
+ 75
+ 00:04:33,490 --> 00:04:36,690
+ you want to, it gave me a source of truth,
+
+ 76
+ 00:04:37,410 --> 00:04:40,130
+ which I'm not sure if that's the scientific way to
+
+ 77
+ 00:04:40,290 --> 00:04:44,290
+ approach this topic of gathering training data, but I thought
+
+ 78
+ 00:04:44,530 --> 00:04:48,210
+ made sense. Um, I have a lot of audio data
+
+ 79
+ 00:04:48,290 --> 00:04:50,850
+ from recording voice notes, which I've also kind of used,
+
+ 80
+ 00:04:52,130 --> 00:04:55,890
+ been experimenting with using for a different purpose, slightly different
+
+ 81
+ 00:04:56,290 --> 00:05:01,490
+ annotating task types. It's more a text classification experiment
+
+ 82
+ 00:05:01,810 --> 00:05:04,240
+ or, well, it's more than that actually. I'm working on
+
+ 83
+ 00:05:04,240 --> 00:05:08,160
+ a voice app. So it's a prototype, I guess, is
+
+ 84
+ 00:05:08,320 --> 00:05:12,800
+ really more accurate. But you can do that and you
+
+ 85
+ 00:05:12,800 --> 00:05:15,280
+ can work backwards. You're like, you listen back to a
+
+ 86
+ 00:05:15,280 --> 00:05:18,800
+ voice note and you painfully go through one of those
+
+ 87
+ 00:05:19,120 --> 00:05:21,920
+ transcribing, you know, where you start and stop and scrub
+
+ 88
+ 00:05:22,080 --> 00:05:24,000
+ around it and you fix the errors, but it's really,
+
+ 89
+ 00:05:24,160 --> 00:05:26,800
+ really boring to do that. So I thought it would
+
+ 90
+ 00:05:26,880 --> 00:05:29,120
+ be less tedious in the long term if I just
+
+ 91
+ 00:05:30,139 --> 00:05:33,020
+ recorded the source of truth. So it gave me these
+
+ 92
+ 00:05:33,100 --> 00:05:36,220
+ three minute snippets. I recorded them. It saved an MP3
+
+ 93
+ 00:05:36,460 --> 00:05:39,580
+ and a TXT in the same folder, and I created
+
+ 94
+ 00:05:39,660 --> 00:05:42,940
+ an error with that data. So I was very hopeful,
+
+ 95
+ 00:05:43,340 --> 00:05:46,940
+ quietly, a little bit hopeful that I could actually fine
+
+ 96
+ 00:05:47,020 --> 00:05:50,540
+ tune Whisper. I want to fine tune Whisper because when
+
+ 97
+ 00:05:50,620 --> 00:05:54,860
+ I got into voice tech last November, my wife was in
+
+ 98
+ 00:05:54,860 --> 00:05:58,220
+ the US and I was alone at home. And when
+
+ 99
+ 00:05:58,680 --> 00:06:01,480
+ crazy people like me do really wild things like use
+
+ 100
+ 00:06:01,720 --> 00:06:06,200
+ voice-to-text technology. That was basically when I started
+
+ 101
+ 00:06:06,280 --> 00:06:08,840
+ doing it, I didn't feel like a crazy person speaking
+
+ 102
+ 00:06:08,920 --> 00:06:13,800
+ to myself. And my expectations weren't that high. I used
+
+ 103
+ 00:06:14,360 --> 00:06:17,720
+ speech tech now and again, tried it out. It was
+
+ 104
+ 00:06:17,720 --> 00:06:19,240
+ like, it'd be really cool if you could just, like,
+
+ 105
+ 00:06:19,400 --> 00:06:22,840
+ speak into your computer. And whatever I tried out that
+
+ 106
+ 00:06:23,080 --> 00:06:26,670
+ had Linux support was just. It was not good, basically.
+
+ 107
+ 00:06:27,310 --> 00:06:29,550
+ And this blew me away from the first go. I
+
+ 108
+ 00:06:29,550 --> 00:06:32,830
+ mean, it wasn't 100% accurate out of the box and
+
+ 109
+ 00:06:32,910 --> 00:06:34,990
+ it took work, but it was good enough that there
+
+ 110
+ 00:06:35,070 --> 00:06:37,550
+ was a solid foundation and it kind of passed that
+
+ 111
+ 00:06:38,750 --> 00:06:41,950
+ pivot point that it's actually worth doing this. You know,
+
+ 112
+ 00:06:42,110 --> 00:06:44,750
+ there's a point where it's so like the transcript is
+
+ 113
+ 00:06:44,990 --> 00:06:47,390
+ you don't have to get 100% accuracy for it to
+
+ 114
+ 00:06:47,390 --> 00:06:50,110
+ be worth your time for speech tech to be a
+
+ 115
+ 00:06:50,110 --> 00:06:52,510
+ worthwhile addition to your productivity, but you do need to
+
+ 116
+ 00:06:52,510 --> 00:06:56,050
+ get above, let's say, I don't know, 85%. If it's
+
+ 117
+ 00:06:56,210 --> 00:06:59,890
+ 60% or 50%, you inevitably say, screw it, I'll just
+
+ 118
+ 00:06:59,890 --> 00:07:02,850
+ type it because you end up missing errors in the
+
+ 119
+ 00:07:02,850 --> 00:07:05,570
+ transcript and it becomes actually worse. You end up in
+
+ 120
+ 00:07:05,570 --> 00:07:07,650
+ a worse position than you started with. That's been my
+
+ 121
+ 00:07:07,730 --> 00:07:12,050
+ experience. So I was like, oh, this is actually really,
+
+ 122
+ 00:07:12,210 --> 00:07:14,050
+ really good now. How did that happen? And the answer
+
+ 123
+ 00:07:14,210 --> 00:07:19,490
+ is ASR, Whisper being open source and the transformer
+
+ 124
+ 00:07:19,490 --> 00:07:23,250
+ architecture. If you want to go back to the to
+
+ 125
+ 00:07:23,330 --> 00:07:26,450
+ the underpinnings, which really blows my mind and it's on
+
+ 126
+ 00:07:26,530 --> 00:07:30,760
+ my list. To read through that paper. All you need
+
+ 127
+ 00:07:30,840 --> 00:07:36,040
+ is attention as attentively as can be done
+
+ 128
+ 00:07:36,280 --> 00:07:39,400
+ with my limited brain because it's super, super high level
+
+ 129
+ 00:07:39,720 --> 00:07:44,600
+ stuff, super advanced stuff, I mean. But that, I think
+
+ 130
+ 00:07:44,760 --> 00:07:49,400
+ of all the things that are fascinating about the sudden
+
+ 131
+ 00:07:49,720 --> 00:07:53,780
+ rise in AI and the dramatic capabilities. I find it
+
+ 132
+ 00:07:53,780 --> 00:07:56,180
+ fascinating that a few people are like, hang on, you've
+
+ 133
+ 00:07:56,180 --> 00:07:58,500
+ got this thing that can speak to you, like a
+
+ 134
+ 00:07:58,500 --> 00:08:03,060
+ chatbot, an LLM, and then you've got image generation. Okay,
+
+ 135
+ 00:08:03,140 --> 00:08:06,660
+ so firstly, those two things on the surface have nothing
+
+ 136
+ 00:08:06,980 --> 00:08:10,820
+ in common. So like, how are they, how did that
+
+ 137
+ 00:08:10,980 --> 00:08:12,580
+ just happen all at the same time? And then when
+
+ 138
+ 00:08:12,580 --> 00:08:16,660
+ you extend that further, you're like, Suno, right? You can
+
+ 139
+ 00:08:17,140 --> 00:08:20,110
+ sing a song and AI will come up with an
+
+ 140
+ 00:08:20,270 --> 00:08:23,470
+ instrumental. And then you've got Whisper and you're like, wait
+
+ 141
+ 00:08:23,470 --> 00:08:25,950
+ a second, how did all this stuff, like, if it's
+
+ 142
+ 00:08:25,950 --> 00:08:29,310
+ all AI, what's like, there has to be some commonality.
+
+ 143
+ 00:08:29,550 --> 00:08:34,670
+ Otherwise, these are totally different technologies on the surface of
+
+ 144
+ 00:08:34,670 --> 00:08:38,910
+ it. And the Transformer architecture is, as far as I
+
+ 145
+ 00:08:38,990 --> 00:08:41,630
+ know, the answer. And I can't even say, can't even
+
+ 146
+ 00:08:41,710 --> 00:08:46,350
+ pretend that I really understand what the Transformer architecture means.
+
+ 147
+ 00:08:46,850 --> 00:08:49,330
+ In depth, but I have scanned it and as I
+
+ 148
+ 00:08:49,490 --> 00:08:51,890
+ said, I want to print it and really kind of
+
+ 149
+ 00:08:52,290 --> 00:08:56,130
+ think over it at some point. And I'll probably feel
+
+ 150
+ 00:08:56,370 --> 00:08:59,330
+ bad about myself, I think, because weren't those guys in
+
+ 151
+ 00:08:59,410 --> 00:09:03,490
+ their 20s? Like, that's crazy. I think I asked ChatGPT
+
+ 152
+ 00:09:03,570 --> 00:09:07,970
+ once who wrote that paper and how old were they
+
+ 153
+ 00:09:08,130 --> 00:09:10,850
+ when it was published on arXiv? And I was expecting,
+
+ 154
+ 00:09:11,090 --> 00:09:13,970
+ like, I don't know, what do you imagine? I personally
+
+ 155
+ 00:09:14,050 --> 00:09:16,290
+ imagine kind of like, you know, you have these breakthroughs
+
+ 156
+ 00:09:16,450 --> 00:09:19,890
+ during COVID and things like that where like these kind
+
+ 157
+ 00:09:19,970 --> 00:09:22,850
+ of really obscure scientists are like in their 50s and
+
+ 158
+ 00:09:22,850 --> 00:09:27,250
+ they've just kind of been laboring in labs and wearily
+
+ 159
+ 00:09:27,250 --> 00:09:30,530
+ writing and publishing in kind of obscure academic publications.
+
+ 160
+ 00:09:30,850 --> 00:09:33,250
+ And they finally like hit it big or win a
+
+ 161
+ 00:09:33,250 --> 00:09:37,330
+ Nobel Prize and then they're household names. So that was
+
+ 162
+ 00:09:37,410 --> 00:09:39,070
+ kind of what I had in mind. That was the
+
+ 163
+ 00:09:39,070 --> 00:09:43,070
+ mental image I'd formed of the birth of arXiv. Like
+
+ 164
+ 00:09:43,070 --> 00:09:46,350
+ I wasn't expecting 20-somethings in San Francisco, though. I thought
+
+ 165
+ 00:09:46,430 --> 00:09:48,910
+ that was both very, very funny, very cool, and actually
+
+ 166
+ 00:09:49,070 --> 00:09:52,590
+ kind of inspiring. It's nice to think that people who,
+
+ 167
+ 00:09:53,390 --> 00:09:56,190
+ you know, just you might put them in the kind
+
+ 168
+ 00:09:56,270 --> 00:09:59,630
+ of milieu or bubble or world that you are in
+
+ 169
+ 00:09:59,710 --> 00:10:03,310
+ are credibly in through, you know, the series of connections
+
+ 170
+ 00:10:03,390 --> 00:10:07,470
+ that are coming up with such literally world changing innovations.
+
+ 171
+ 00:10:07,950 --> 00:10:11,540
+ So that was, I thought, anyway. That's that was cool.
+
+ 172
+ 00:10:11,940 --> 00:10:14,580
+ Okay, voice training data. How are we doing? We're about
+
+ 173
+ 00:10:14,580 --> 00:10:18,660
+ 10 minutes and I'm still talking about voice technology. So
+
+ 174
+ 00:10:18,740 --> 00:10:22,180
+ Whisper was brilliant and I was so excited that I
+
+ 175
+ 00:10:22,260 --> 00:10:25,460
+ was my first instinct was to like guess like, oh
+
+ 176
+ 00:10:25,460 --> 00:10:26,900
+ my gosh, I have to get like a really good
+
+ 177
+ 00:10:26,900 --> 00:10:30,660
+ microphone for this. So I didn't go on a spending
+
+ 178
+ 00:10:30,660 --> 00:10:32,820
+ spree because I said, I'm gonna have to just wait
+
+ 179
+ 00:10:32,820 --> 00:10:35,220
+ a month and see if I still use this. And
+
+ 180
+ 00:10:36,510 --> 00:10:38,990
+ it just kind of became, it's become really part of
+
+ 181
+ 00:10:39,150 --> 00:10:43,470
+ my daily routine. Like if I'm writing an email, I'll
+
+ 182
+ 00:10:43,550 --> 00:10:47,070
+ record a voice note. And then I've developed and it's
+
+ 183
+ 00:10:47,070 --> 00:10:49,150
+ nice to see that everyone is like developing the same
+
+ 184
+ 00:10:49,630 --> 00:10:52,030
+ things in parallel. Like that's my kind of a weird
+
+ 185
+ 00:10:52,030 --> 00:10:54,590
+ thing to say, but when I look, I kind of
+
+ 186
+ 00:10:54,750 --> 00:10:59,070
+ came, when I started working on this, these prototypes on
+
+ 187
+ 00:10:59,150 --> 00:11:01,550
+ GitHub, which is where I just kind of share very
+
+ 188
+ 00:11:01,790 --> 00:11:06,810
+ freely and loosely, ideas and first iterations on concepts.
+
+ 189
+ 00:11:08,570 --> 00:11:10,730
+ And for want of a better word, I called it
+
+ 190
+ 00:11:10,810 --> 00:11:15,530
+ like LLM post-processing or cleanup or basically a system prompt
+
+ 191
+ 00:11:15,610 --> 00:11:18,970
+ that after you get back the raw text from Whisper,
+
+ 192
+ 00:11:19,130 --> 00:11:22,090
+ you run it through a model and say, okay, this
+
+ 193
+ 00:11:22,170 --> 00:11:27,050
+ is crappy text, like add sentence structure and fix it
+
+ 194
+ 00:11:27,130 --> 00:11:32,330
+ up. And now when I'm exploring the different tools that
+
+ 195
+ 00:11:32,410 --> 00:11:35,260
+ are out there that people have built, I see quite
+
+ 196
+ 00:11:35,500 --> 00:11:39,180
+ a number of projects have basically done the same thing,
+
+ 197
+ 00:11:40,540 --> 00:11:43,260
+ lest that be misconstrued. I'm not saying for a millisecond
+
+ 198
+ 00:11:43,340 --> 00:11:46,300
+ that I inspired them. I'm sure this has been a
+
+ 199
+ 00:11:46,380 --> 00:11:49,580
+ thing that's been integrated into tools for a while, but
+
+ 200
+ 00:11:50,460 --> 00:11:52,380
+ it's the kind of thing that when you start using
+
+ 201
+ 00:11:52,380 --> 00:11:54,860
+ these tools every day, the need for it is almost
+
+ 202
+ 00:11:55,020 --> 00:11:59,500
+ instantly apparent because text that doesn't have any punctuation or
+
+ 203
+ 00:11:59,880 --> 00:12:03,080
+ paragraph spacing takes a long time to, you know, it
+
+ 204
+ 00:12:03,240 --> 00:12:05,480
+ takes so long to get it into a presentable email
+
+ 205
+ 00:12:05,640 --> 00:12:09,800
+ that again, it's, it's, it, it moves speech tech into
+
+ 206
+ 00:12:10,040 --> 00:12:13,560
+ that before that inflection point where you're like, no, it's
+
+ 207
+ 00:12:13,560 --> 00:12:16,040
+ just not worth it. It's like, it's, it'll just be
+
+ 208
+ 00:12:16,120 --> 00:12:18,600
+ quicker to type this. So it's a big, it's a
+
+ 209
+ 00:12:18,600 --> 00:12:21,640
+ little touch that actually is a big deal. Uh, so
+
+ 210
+ 00:12:21,800 --> 00:12:25,720
+ I was on Whisper and I've been using Whisper and
+
+ 211
+ 00:12:25,720 --> 00:12:28,190
+ I kind of, early on found a couple of tools.
+
+ 212
+ 00:12:28,350 --> 00:12:30,590
+ I couldn't find what I was looking for on Linux,
+
+ 213
+ 00:12:30,750 --> 00:12:35,550
+ which is basically just something that'll run in the background.
+
+ 214
+ 00:12:35,790 --> 00:12:38,110
+ I'll give it an API key and it will just
+
+ 215
+ 00:12:38,270 --> 00:12:42,990
+ like transcribe with like a little key to start and
+
+ 216
+ 00:12:43,070 --> 00:12:47,390
+ stop the dictation. And the issues were I discovered that
+
+ 217
+ 00:12:47,550 --> 00:12:51,150
+ like most people involved in creating these projects were very
+
+ 218
+ 00:12:51,310 --> 00:12:55,150
+ much focused on local models, running Whisper locally because you
+
+ 219
+ 00:12:55,230 --> 00:12:58,020
+ can. And I tried that a bunch of times and
+
+ 220
+ 00:12:58,100 --> 00:13:00,420
+ just never got results that were as good as the
+
+ 221
+ 00:13:00,420 --> 00:13:03,220
+ cloud. And when I began looking at the cost of
+
+ 222
+ 00:13:03,300 --> 00:13:05,780
+ the speech to text APIs and what I was spending,
+
+ 223
+ 00:13:06,340 --> 00:13:09,540
+ I just thought there is, it's actually, in my opinion,
+
+ 224
+ 00:13:09,700 --> 00:13:12,900
+ just one of the better deals in API spending and
+
+ 225
+ 00:13:12,900 --> 00:13:15,220
+ in cloud. Like it's just not that expensive for very,
+
+ 226
+ 00:13:15,380 --> 00:13:19,380
+ very good models that are much more, you know, you're
+
+ 227
+ 00:13:19,380 --> 00:13:21,960
+ gonna be able to run the full model. The latest
+
+ 228
+ 00:13:21,960 --> 00:13:25,960
+ model versus whatever you can run on your average GPU,
+
+ 229
+ 00:13:26,200 --> 00:13:29,240
+ unless you want to buy a crazy GPU. It doesn't
+
+ 230
+ 00:13:29,240 --> 00:13:31,160
+ really make sense to me. Now, privacy is another concern
+
+ 231
+ 00:13:32,200 --> 00:13:33,960
+ that I know is kind of like a very much
+
+ 232
+ 00:13:34,040 --> 00:13:36,840
+ a separate thing that people just don't want their voice
+
+ 233
+ 00:13:37,080 --> 00:13:40,760
+ data and their voice leaving their local environment, maybe for
+
+ 234
+ 00:13:40,760 --> 00:13:44,280
+ regulatory reasons as well. But I'm not in that. I
+
+ 235
+ 00:13:44,680 --> 00:13:48,920
+ neither really care about people listening to my grocery list
+
+ 236
+ 00:13:49,160 --> 00:13:51,800
+ consisting of reminding myself that I need to buy more
+
+ 237
+ 00:13:51,880 --> 00:13:55,230
+ beer, Cheetos, and hummus, which is kind of the three
+
+ 238
+ 00:13:55,390 --> 00:13:59,950
+ staples of my diet during periods of poorer nutrition. But
+
+ 239
+ 00:14:00,030 --> 00:14:02,510
+ the kind of stuff that I transcribe, it's just not,
+
+ 240
+ 00:14:04,030 --> 00:14:07,790
+ it's not a privacy thing I'm that sort of sensitive
+
+ 241
+ 00:14:07,870 --> 00:14:13,230
+ about and I don't do anything so sensitive or secure
+
+ 242
+ 00:14:13,310 --> 00:14:16,510
+ that requires air gapping. So I looked at the pricing
+
+ 243
+ 00:14:16,590 --> 00:14:19,870
+ and especially the kind of older mini models. Some of
+
+ 244
+ 00:14:19,950 --> 00:14:22,030
+ them are very, very affordable. And I did a back
+
+ 245
+ 00:14:22,270 --> 00:14:25,950
+ of the, I did a calculation once with ChatGPT and
+
+ 246
+ 00:14:25,950 --> 00:14:29,310
+ I was like, okay, this is the API price for
+
+ 247
+ 00:14:29,470 --> 00:14:32,350
+ I can't remember whatever the model was. Let's say I
+
+ 248
+ 00:14:32,430 --> 00:14:35,310
+ just go at it like nonstop, which it rarely happens.
+
+ 249
+ 00:14:35,550 --> 00:14:38,910
+ Probably, I would say on average, I might dictate 30
+
+ 250
+ 00:14:38,990 --> 00:14:41,870
+ to 60 minutes per day if I was probably summing
+
+ 251
+ 00:14:41,870 --> 00:14:47,070
+ up the emails, documents, outlines, which
+
+ 252
+ 00:14:47,310 --> 00:14:49,950
+ is a lot, but it's still a fairly modest amount.
+
+ 253
+ 00:14:50,110 --> 00:14:52,020
+ And I was like, some days I do go on
+
+ 254
+ 00:14:52,180 --> 00:14:54,980
+ like one or two days where I've been usually when
+
+ 255
+ 00:14:54,980 --> 00:14:57,060
+ I'm like kind of out of the house and just
+
+ 256
+ 00:14:57,300 --> 00:15:00,580
+ have something like I have nothing else to do. Like
+
+ 257
+ 00:15:00,740 --> 00:15:04,100
+ if I'm at a hospital, we have a newborn and
+
+ 258
+ 00:15:04,260 --> 00:15:07,380
+ you're waiting for like eight hours and hours for an
+
+ 259
+ 00:15:07,460 --> 00:15:10,900
+ appointment. And I would probably have listened to podcasts before
+
+ 260
+ 00:15:11,460 --> 00:15:14,260
+ becoming a speech fanatic. And I'm like, oh, wait, let
+
+ 261
+ 00:15:14,420 --> 00:15:16,339
+ me just get down. Let me just get these ideas
+
+ 262
+ 00:15:16,500 --> 00:15:18,620
+ out of my head. And that's when I'll go on
+
+ 263
+ 00:15:19,340 --> 00:15:21,900
+ my speech binges. But those are like once every few
+
+ 264
+ 00:15:21,900 --> 00:15:25,020
+ months, like not frequently. But I said, okay, let's just
+
+ 265
+ 00:15:25,100 --> 00:15:29,180
+ say if I'm gonna price out cloud STT, if I
+
+ 266
+ 00:15:29,260 --> 00:15:33,980
+ was like dedicated every second of every waking hour to
+
+ 267
+ 00:15:34,140 --> 00:15:37,980
+ transcribing for some odd reason, I mean, I'd have to
+
+ 268
+ 00:15:38,060 --> 00:15:40,860
+ like eat and use the toilet. Like, you know, there's
+
+ 269
+ 00:15:40,940 --> 00:15:43,500
+ only so many hours I'm awake for. So like, let's
+
+ 270
+ 00:15:43,500 --> 00:15:46,700
+ just say a maximum of like 40 hour, 45 minutes
+
+ 271
+ 00:15:47,290 --> 00:15:49,370
+ in the hour. Then I said, all right, let's just
+
+ 272
+ 00:15:49,370 --> 00:15:52,970
+ say 50. Who knows? You're dictating on the toilet. We
+
+ 273
+ 00:15:53,130 --> 00:15:55,130
+ do it. So it could be. You could just do
+
+ 274
+ 00:15:55,210 --> 00:15:59,370
+ 60. But whatever I did. And every day, like, you're
+
+ 275
+ 00:15:59,450 --> 00:16:02,810
+ going flat out seven days a week dictating non-stop I
+
+ 276
+ 00:16:02,810 --> 00:16:05,930
+ was like, what's my monthly API bill gonna be at
+
+ 277
+ 00:16:06,010 --> 00:16:08,650
+ this price? And it came out to, like, 70 or
+
+ 278
+ 00:16:08,650 --> 00:16:10,810
+ 80 bucks. And I was like, well, that would be
+
+ 279
+ 00:16:11,210 --> 00:16:15,780
+ an extraordinary amount of dictation. And I would hope that
+
+ 280
+ 00:16:16,260 --> 00:16:20,020
+ there was some compelling reason worth more than $70
+
+ 281
+ 00:16:20,340 --> 00:16:23,540
+ that I embarked upon that project. So given that that's
+
+ 282
+ 00:16:23,540 --> 00:16:25,540
+ kind of the max point for me, I said that's
+
+ 283
+ 00:16:25,620 --> 00:16:29,220
+ actually very, very affordable. Now you're gonna, if you want
+
+ 284
+ 00:16:29,300 --> 00:16:31,780
+ to spec out the costs and you want to do
+
+ 285
+ 00:16:31,780 --> 00:16:36,340
+ the post-processing that I really do feel is valuable, that's
+
+ 286
+ 00:16:36,420 --> 00:16:40,900
+ gonna cost some more as well, unless you're using Gemini,
+
+ 287
+ 00:16:41,380 --> 00:16:44,500
+ which needless to say, as a random person sitting in
+
+ 288
+ 00:16:44,580 --> 00:16:49,140
+ Jerusalem, I have no affiliation, nor with Google, nor Anthropic,
+
+ 289
+ 00:16:49,220 --> 00:16:52,100
+ nor Gemini, nor any major tech vendor for that matter.
+
+ 290
+ 00:16:53,700 --> 00:16:56,900
+ I like Gemini not so much as an everyday model.
+
+ 291
+ 00:16:57,380 --> 00:16:59,940
+ It's kind of underwhelmed in that respect, I would say.
+
+ 292
+ 00:17:00,340 --> 00:17:02,820
+ But for multimodal, I think it's got a lot to
+
+ 293
+ 00:17:02,820 --> 00:17:06,580
+ offer. And I think that the transcribing functionality whereby it
+
+ 294
+ 00:17:06,660 --> 00:17:11,980
+ can process audio with a system prompt and both give
+
+ 295
+ 00:17:12,140 --> 00:17:15,180
+ you transcription that's cleaned up that reduces two steps to
+
+ 296
+ 00:17:15,340 --> 00:17:18,300
+ one. And that for me is a very, very big
+
+ 297
+ 00:17:18,460 --> 00:17:21,660
+ deal. And I feel like even Google hasn't really
+
+ 298
+ 00:17:21,900 --> 00:17:26,780
+ sort of thought through how useful that modality is
+
+ 299
+ 00:17:26,860 --> 00:17:29,340
+ and what kind of use cases you can achieve with
+
+ 300
+ 00:17:29,420 --> 00:17:31,340
+ it. Because I found in the course of this year,
+
+ 301
+ 00:17:31,980 --> 00:17:36,620
+ just an endless list of really kind of system prompt
+
+ 302
+ 00:17:36,940 --> 00:17:40,300
+ system prompt stuff that I can say, okay, I've used
+
+ 303
+ 00:17:40,300 --> 00:17:43,500
+ it to capture context data for AI, which is literally
+
+ 304
+ 00:17:43,580 --> 00:17:45,740
+ I might speak for if I wanted to have a
+
+ 305
+ 00:17:45,740 --> 00:17:49,820
+ good bank of context data about who knows my childhood
+
+ 306
+ 00:17:50,380 --> 00:17:54,300
+ more realistically, maybe my career goals, something that would just
+
+ 307
+ 00:17:54,380 --> 00:17:56,780
+ be like really boring to type out. So I'll just
+
+ 308
+ 00:17:56,860 --> 00:18:00,860
+ like sit in my car and record it for 10
+
+ 309
+ 00:18:00,940 --> 00:18:03,180
+ minutes. And that 10 minutes you get a lot of
+
+ 310
+ 00:18:03,340 --> 00:18:08,730
+ information in. Um, emails, which is short text, just
+
+ 311
+ 00:18:09,130 --> 00:18:12,330
+ there is a whole bunch and all these workflows kind
+
+ 312
+ 00:18:12,490 --> 00:18:14,490
+ of require a little bit of treatment afterwards and different
+
+ 313
+ 00:18:14,730 --> 00:18:18,170
+ treatment. My context pipeline is kind of like just extract
+
+ 314
+ 00:18:18,250 --> 00:18:21,050
+ the bare essentials. So you end up with me talking
+
+ 315
+ 00:18:21,130 --> 00:18:23,050
+ very loosely about sort of what I've done in my
+
+ 316
+ 00:18:23,130 --> 00:18:25,450
+ career, where I've worked, where I might like to work.
+
+ 317
+ 00:18:25,930 --> 00:18:29,050
+ And it goes, it condenses that down to very robotic
+
+ 318
+ 00:18:29,290 --> 00:18:32,570
+ language that is easy to chunk, parse and maybe put
+
+ 319
+ 00:18:32,650 --> 00:18:36,630
+ into a vector database. Daniel has worked in technology. Daniel
+
+ 320
+ 00:18:37,510 --> 00:18:40,230
+ has been working in, you know, stuff like that. That's
+
+ 321
+ 00:18:40,230 --> 00:18:43,190
+ not how you would speak, but I figure it's probably
+
+ 322
+ 00:18:43,430 --> 00:18:47,430
+ easier to parse for, after all, robots. So we've almost
+
+ 323
+ 00:18:47,510 --> 00:18:49,350
+ got to 20 minutes and this is actually a success
+
+ 324
+ 00:18:49,830 --> 00:18:55,190
+ because I wasted 20 minutes of the evening speaking
+
+ 325
+ 00:18:55,270 --> 00:18:59,990
+ into a microphone and the levels were shot and it
+
+ 326
+ 00:18:59,990 --> 00:19:01,670
+ was clipping and I said, I can't really do an
+
+ 327
+ 00:19:01,750 --> 00:19:04,070
+ evaluation. I have to be fair. I have to give
+
+ 328
+ 00:19:04,640 --> 00:19:08,000
+ the models a chance to do their thing. What am
+
+ 329
+ 00:19:08,000 --> 00:19:10,400
+ I hoping to achieve in this? Okay, my fine tune
+
+ 330
+ 00:19:10,400 --> 00:19:13,440
+ was a dud as mentioned. Deepgram STT, I'm really, really
+
+ 331
+ 00:19:13,520 --> 00:19:16,560
+ hopeful that this prototype will work and it's a build
+
+ 332
+ 00:19:16,800 --> 00:19:19,360
+ in public open source, so anyone is welcome to use
+
+ 333
+ 00:19:19,440 --> 00:19:22,400
+ it if I make anything good. But that was really
+
+ 334
+ 00:19:22,560 --> 00:19:26,560
+ exciting for me last night when after hours of trying
+
+ 335
+ 00:19:26,640 --> 00:19:30,560
+ my own prototype, seeing someone just made something that works
+
+ 336
+ 00:19:30,720 --> 00:19:32,480
+ like that, you know, you're not gonna have to build
+
+ 337
+ 00:19:32,720 --> 00:19:37,540
+ a custom conda environment and image. I have an AMD GPU,
+
+ 338
+ 00:19:37,700 --> 00:19:41,060
+ which makes things much more complicated. I didn't find it.
+
+ 339
+ 00:19:41,620 --> 00:19:43,060
+ And I was about to give up and I said,
+
+ 340
+ 00:19:43,140 --> 00:19:45,540
+ all right, let me just give Deepgram's Linux thing
+
+ 341
+ 00:19:46,020 --> 00:19:49,300
+ a shot. And if this doesn't work, I'm just going
+
+ 342
+ 00:19:49,300 --> 00:19:51,060
+ to go back to trying to vibe code something myself.
+
+ 343
+ 00:19:51,700 --> 00:19:55,540
+ And when I ran the script, I was using Claude
+
+ 344
+ 00:19:55,620 --> 00:19:59,140
+ Code to do the installation process. It ran the script
+
1377
+ 345
1378
+ 00:19:59,220 --> 00:20:02,100
1379
+ and oh my gosh, it works just like that. The
1380
+
1381
+ 346
1382
+ 00:20:02,180 --> 00:20:06,060
1383
+ tricky thing, for all those who want to know all
1384
+
1385
+ 347
1386
+ 00:20:06,060 --> 00:20:11,340
1387
+ the nitty gritty details, was that I
1388
+
1389
+ 348
1390
+ 00:20:11,340 --> 00:20:14,460
1391
+ don't think it was actually struggling with transcription, but pasting
1392
+
1393
+ 349
1394
+ 00:20:14,780 --> 00:20:18,220
1395
+ Wayland makes life very hard. And I think there was
1396
+
1397
+ 350
1398
+ 00:20:18,300 --> 00:20:21,580
1399
+ something not running at the right time. Anyway, Deepgram, I looked
1400
+
1401
+ 351
1402
+ 00:20:21,580 --> 00:20:23,900
1403
+ at how they actually handled that because it worked out
1404
+
1405
+ 352
1406
+ 00:20:23,980 --> 00:20:26,620
1407
+ of the box when other stuff didn't. And it was
1408
+
1409
+ 353
1410
+ 00:20:27,180 --> 00:20:30,650
1411
+ quite a clever little mechanism. And but more so than
1412
+
1413
+ 354
1414
+ 00:20:30,730 --> 00:20:33,370
1415
+ that, the accuracy was brilliant. Now, what am I doing
1416
+
1417
+ 355
1418
+ 00:20:33,370 --> 00:20:36,010
1419
+ here? This is going to be a 20 minute audio
1420
+
1421
+ 356
1422
+ 00:20:36,570 --> 00:20:42,090
1423
+ sample. And I think I've done one or two
1424
+
1425
+ 357
1426
+ 00:20:42,250 --> 00:20:46,650
1427
+ of these before, but I did it with short snappy
1428
+
1429
+ 358
1430
+ 00:20:46,810 --> 00:20:49,850
1431
+ voice notes. This is kind of long form. This actually
1432
+
1433
+ 359
1434
+ 00:20:50,090 --> 00:20:52,250
1435
+ might be a better approximation for what's useful to me
1436
+
1437
+ 360
1438
+ 00:20:52,410 --> 00:20:55,970
1439
+ than voice memos. Like, I need to buy three bread,
1440
+
1441
+ 361
1442
+ 00:20:56,050 --> 00:20:58,690
1443
+ liters of milk tomorrow and pita bread, which is probably
1444
+
1445
+ 362
1446
+ 00:20:58,850 --> 00:21:01,410
1447
+ how like half my voice notes sound. Like if anyone
1448
+
1449
+ 363
1450
+ 00:21:01,890 --> 00:21:04,130
1451
+ were to, I don't know, like find my phone, they'd
1452
+
1453
+ 364
1454
+ 00:21:04,130 --> 00:21:05,650
1455
+ be like, this is the most boring person in the
1456
+
1457
+ 365
1458
+ 00:21:05,650 --> 00:21:09,410
1459
+ world. Although actually, there are some like kind of journaling
1460
+
1461
+ 366
1462
+ 00:21:09,410 --> 00:21:11,570
1463
+ thoughts as well, but it's a lot of content like
1464
+
1465
+ 367
1466
+ 00:21:11,570 --> 00:21:14,530
1467
+ that. And the probably for the evaluation, the most useful
1468
+
1469
+ 368
1470
+ 00:21:14,610 --> 00:21:20,290
1471
+ thing is slightly obscure tech, GitHub, NeocleNo, Hugging
1472
+
1473
+ 369
1474
+ 00:21:20,370 --> 00:21:23,020
1475
+ Face. Not so obscure that it's not going to have
1476
+
1477
+ 370
1478
+ 00:21:23,100 --> 00:21:26,540
1479
+ a chance of knowing it, but hopefully sufficiently well known
1480
+
1481
+ 371
1482
+ 00:21:26,540 --> 00:21:28,780
1483
+ that the model should get it. I tried to do
1484
+
1485
+ 372
1486
+ 00:21:28,860 --> 00:21:31,660
1487
+ a little bit of speaking really fast and speaking very
1488
+
1489
+ 373
1490
+ 00:21:31,820 --> 00:21:35,100
1491
+ slowly. I would say in general, I've spoken, delivered this
1492
+
1493
+ 374
1494
+ 00:21:35,260 --> 00:21:37,580
1495
+ at a faster pace than I usually would owing to
1496
+
1497
+ 375
1498
+ 00:21:38,060 --> 00:21:42,540
1499
+ strong coffee flowing through my bloodstream. And the thing that
1500
+
1501
+ 376
1502
+ 00:21:42,540 --> 00:21:44,780
1503
+ I'm not going to get in this benchmark is background
1504
+
1505
+ 377
1506
+ 00:21:44,860 --> 00:21:46,540
1507
+ noise, which in my first take that I had to
1508
+
1509
+ 378
1510
+ 00:21:46,540 --> 00:21:49,790
1511
+ get rid of, my wife came in with my son
1512
+
1513
+ 379
1514
+ 00:21:50,110 --> 00:21:52,430
1515
+ and for a goodnight kiss. And that actually would have
1516
+
1517
+ 380
1518
+ 00:21:52,430 --> 00:21:56,590
1519
+ been super helpful to get in because it was non
1520
+
1521
+ 381
1522
+ 00:21:56,670 --> 00:22:00,270
1523
+ diarized or if we had diarization, a female, I could
1524
+
1525
+ 382
1526
+ 00:22:00,270 --> 00:22:02,510
1527
+ say, I want the male voice and that wasn't intended
1528
+
1529
+ 383
1530
+ 00:22:02,510 --> 00:22:05,950
1531
+ for transcription. And we're not going to get background noise
1532
+
1533
+ 384
1534
+ 00:22:06,030 --> 00:22:08,350
1535
+ like people honking their horns, which is something I've done
1536
+
1537
+ 385
1538
+ 00:22:08,510 --> 00:22:11,230
1539
+ in my main data set where I am trying to
1540
+
1541
+ 386
1542
+ 00:22:11,470 --> 00:22:14,420
1543
+ go back to some of my voice notes, annotate them
1544
+
1545
+ 387
1546
+ 00:22:14,660 --> 00:22:16,500
1547
+ and run a benchmark. But this is going to be
1548
+
1549
+ 388
1550
+ 00:22:16,500 --> 00:22:21,780
1551
+ just a pure quick test. And as someone,
1552
+
1553
+ 389
1554
+ 00:22:22,340 --> 00:22:24,740
1555
+ I'm working on a voice note idea. That's my sort
1556
+
1557
+ 390
1558
+ 00:22:24,740 --> 00:22:28,740
1559
+ of end motivation. Besides thinking it's an ask to the
1560
+
1561
+ 391
1562
+ 00:22:28,740 --> 00:22:32,420
1563
+ outstanding technology that's coming to viability. And really, I know
1564
+
1565
+ 392
1566
+ 00:22:32,500 --> 00:22:36,020
1567
+ this sounds cheesy, can actually have a very transformative effect.
1568
+
1569
+ 393
1570
+ 00:22:37,060 --> 00:22:41,210
1571
+ It's, you know, voice technology has been life changing for
1572
+
1573
+ 394
1574
+ 00:22:42,010 --> 00:22:47,050
1575
+ folks living with disabilities. And I think
1576
+
1577
+ 395
1578
+ 00:22:47,210 --> 00:22:49,050
1579
+ there's something really nice about the fact that it can
1580
+
1581
+ 396
1582
+ 00:22:49,210 --> 00:22:52,570
1583
+ also benefit, you know, folks who are able bodied and
1584
+
1585
+ 397
1586
+ 00:22:52,730 --> 00:22:57,770
1587
+ like we can all in different ways make this tech
1588
+
1589
+ 398
1590
+ 00:22:57,850 --> 00:23:00,490
1591
+ as useful as possible, regardless of the exact way that
1592
+
1593
+ 399
1594
+ 00:23:00,490 --> 00:23:03,850
1595
+ we're using it. And I think there's something very powerful
1596
+
1597
+ 400
1598
+ 00:23:03,930 --> 00:23:06,520
1599
+ in that and it can be very cool. I see
1600
+
1601
+ 401
1602
+ 00:23:06,680 --> 00:23:10,280
1603
+ huge potential. What excites me about Voicetech? A lot of
1604
+
1605
+ 402
1606
+ 00:23:10,360 --> 00:23:14,440
1607
+ things actually. Firstly, the fact that it's cheap and accurate,
1608
+
1609
+ 403
1610
+ 00:23:14,520 --> 00:23:17,160
1611
+ as I mentioned at the very start of this. And
1612
+
1613
+ 404
1614
+ 00:23:17,320 --> 00:23:19,960
1615
+ it's getting better and better with stuff like accent handling.
1616
+
1617
+ 405
1618
+ 00:23:20,760 --> 00:23:23,480
1619
+ I'm not sure my fine-tune will actually ever come to
1620
+
1621
+ 406
1622
+ 00:23:23,560 --> 00:23:25,400
1623
+ fruition in the sense that I'll use it day to
1624
+
1625
+ 407
1626
+ 00:23:25,480 --> 00:23:28,920
1627
+ day as I imagine. I get like superb flawless word
1628
+
1629
+ 408
1630
+ 00:23:29,000 --> 00:23:33,420
1631
+ error rates because I'm just kind of skeptical about local
1632
+
1633
+ 409
1634
+ 00:23:33,580 --> 00:23:37,180
1635
+ speech to text, as I mentioned, and I think the
1636
+
1637
+ 410
1638
+ 00:23:37,260 --> 00:23:40,780
1639
+ pace of innovation and improvement in the models, the main
1640
+
1641
+ 411
1642
+ 00:23:40,940 --> 00:23:44,700
1643
+ reasons for fine tuning from what I've seen have been
1644
+
1645
+ 412
1646
+ 00:23:44,860 --> 00:23:47,500
1647
+ people who are something that really blows my mind about
1648
+
1649
+ 413
1650
+ 00:23:48,060 --> 00:23:53,180
1651
+ ASR is the idea that it's inherently a lingual or
1652
+
1653
+ 414
1654
+ 00:23:53,340 --> 00:23:58,650
1655
+ multilingual phonetic based. So as folks who use speak
1656
+
1657
+ 415
1658
+ 00:23:58,970 --> 00:24:02,330
1659
+ very obscure languages, that there might be a paucity of
1660
+
1661
+ 416
1662
+ 00:24:02,330 --> 00:24:04,970
1663
+ training data or almost none at all, and therefore the
1664
+
1665
+ 417
1666
+ 00:24:04,970 --> 00:24:10,170
1667
+ accuracy is significantly reduced. Or folks in very critical
1668
+
1669
+ 418
1670
+ 00:24:10,410 --> 00:24:14,330
1671
+ environments, I know this is used extensively in medical transcription
1672
+
1673
+ 419
1674
+ 00:24:14,410 --> 00:24:19,210
1675
+ and dispatcher work, the call centers who send out ambulances,
1676
+
1677
+ 420
1678
+ 00:24:19,290 --> 00:24:23,210
1679
+ et cetera, where accuracy is absolutely paramount. And in the
1680
+
1681
+ 421
1682
+ 00:24:23,210 --> 00:24:26,940
1683
+ case of doctors, radiologists, they might be using very specialized
1684
+
1685
+ 422
1686
+ 00:24:26,940 --> 00:24:29,500
1687
+ vocab all the time. So those are kind of the
1688
+
1689
+ 423
1690
+ 00:24:29,580 --> 00:24:31,500
1691
+ main two things that I'm not sure that really just
1692
+
1693
+ 424
1694
+ 00:24:31,580 --> 00:24:35,020
1695
+ for trying to make it better on a few random
1696
+
1697
+ 425
1698
+ 00:24:35,020 --> 00:24:37,980
1699
+ tech words with my slightly, I mean, I have an
1700
+
1701
+ 426
1702
+ 00:24:38,060 --> 00:24:41,100
1703
+ accent, but like not, you know, an accent that a
1704
+
1705
+ 427
1706
+ 00:24:41,180 --> 00:24:45,980
1707
+ few other million people have ish. I'm not sure that
1708
+
1709
+ 428
1710
+ 00:24:46,460 --> 00:24:50,380
1711
+ my little fine tune is gonna actually like the bump
1712
+
1713
+ 429
1714
+ 00:24:50,540 --> 00:24:53,580
1715
+ in word error reduction, if I ever actually figure out
1716
+
1717
+ 430
1718
+ 00:24:53,580 --> 00:24:54,700
1719
+ how to do it and get it up to the
1720
+
1721
+ 431
1722
+ 00:24:54,780 --> 00:24:57,950
1723
+ cloud. By the time we've done that, I suspect that
1724
+
1725
+ 432
1726
+ 00:24:58,270 --> 00:25:00,510
1727
+ the next generation of ASR will just be so good
1728
+
1729
+ 433
1730
+ 00:25:00,590 --> 00:25:03,070
1731
+ that it will kind of be, well, that would have
1732
+
1733
+ 434
1734
+ 00:25:03,070 --> 00:25:04,750
1735
+ been cool if it worked out, but I'll just use
1736
+
1737
+ 435
1738
+ 00:25:04,830 --> 00:25:08,590
1739
+ this instead. So that's going to be it for today's
1740
+
1741
+ 436
1742
+ 00:25:08,910 --> 00:25:14,110
1743
+ episode of voice training data. Single long shot evaluation.
1744
+
1745
+ 437
1746
+ 00:25:14,430 --> 00:25:17,230
1747
+ Who am I going to compare? Whisper is always good
1748
+
1749
+ 438
1750
+ 00:25:17,230 --> 00:25:20,590
1751
+ as a benchmark, but I'm more interested in seeing Whisper
1752
+
1753
+ 439
1754
+ 00:25:20,670 --> 00:25:24,590
1755
+ head to head with two things, really. One is Whisper
1756
+
1757
+ 440
1758
+ 00:25:24,670 --> 00:25:29,780
1759
+ variants. So you've got these projects like Faster-Whisper, Distil-Whisper,
1760
+
1761
+ 441
1762
+ 00:25:29,860 --> 00:25:31,780
1763
+ it's a bit confusing, there's a whole bunch of them.
1764
+
1765
+ 442
1766
+ 00:25:32,100 --> 00:25:35,380
1767
+ And the emerging ASRs, which are also a thing. My
1768
+
1769
+ 443
1770
+ 00:25:35,460 --> 00:25:37,300
1771
+ intention for this is I'm not sure I'm going to
1772
+
1773
+ 444
1774
+ 00:25:37,300 --> 00:25:39,940
1775
+ have the time in any point in the foreseeable future
1776
+
1777
+ 445
1778
+ 00:25:40,260 --> 00:25:44,660
1779
+ to go back through this whole episode and create a
1780
+
1781
+ 446
1782
+ 00:25:44,740 --> 00:25:49,780
1783
+ proper source of truth, where I fix everything. Might do
1784
+
1785
+ 447
1786
+ 00:25:49,860 --> 00:25:52,820
1787
+ it if I can get one transcription that's sufficiently close
1788
+
1789
+ 448
1790
+ 00:25:53,060 --> 00:25:57,120
1791
+ to perfection. But what I would actually love to do
1792
+
1793
+ 449
1794
+ 00:25:57,280 --> 00:26:00,000
1795
+ on Hugging Face, I think would be a great probably
1796
+
1797
+ 450
1798
+ 00:26:00,320 --> 00:26:02,960
1799
+ how I might visualize this is having the audio waveform
1800
+
1801
+ 451
1802
+ 00:26:03,280 --> 00:26:08,240
1803
+ play and then have the transcript for each model below
1804
+
1805
+ 452
1806
+ 00:26:08,240 --> 00:26:12,640
1807
+ it and maybe even a like, you know, to scale
1808
+
1809
+ 453
1810
+ 00:26:13,200 --> 00:26:15,680
1811
+ and maybe even a local one as well, like local
1812
+
1813
+ 454
1814
+ 00:26:15,840 --> 00:26:21,180
1815
+ Whisper versus OpenAI API, et cetera. And I
1816
+
1817
+ 455
1818
+ 00:26:21,260 --> 00:26:23,580
1819
+ can then actually listen back to segments or anyone who
1820
+
1821
+ 456
1822
+ 00:26:23,580 --> 00:26:25,900
1823
+ wants to can listen back to segments of this recording
1824
+
1825
+ 457
1826
+ 00:26:26,220 --> 00:26:31,020
1827
+ and see where a particular model struggled and others didn't,
1828
+
1829
+ 458
1830
+ 00:26:31,500 --> 00:26:33,420
1831
+ as well as the sort of headline finding of which
1832
+
1833
+ 459
1834
+ 00:26:33,580 --> 00:26:36,940
1835
+ had the best WER, but that would require the source
1836
+
1837
+ 460
1838
+ 00:26:36,940 --> 00:26:39,660
1839
+ of truth. Okay, that's it. I hope this was, I
1840
+
1841
+ 461
1842
+ 00:26:39,660 --> 00:26:42,620
1843
+ don't know, maybe useful for other folks interested in STT.
1844
+
1845
+ 462
1846
+ 00:26:42,940 --> 00:26:45,740
1847
+ You want to see that I always feel think I've
1848
+
1849
+ 463
1850
+ 00:26:45,740 --> 00:26:48,950
1851
+ just said as something I didn't intend to. STT, I
1852
+
1853
+ 464
1854
+ 00:26:48,950 --> 00:26:52,550
1855
+ said for those. Listen carefully, including hopefully the models themselves.
1856
+
1857
+ 465
1858
+ 00:26:53,270 --> 00:26:57,350
1859
+ This has been myself, Daniel Rosehill. For more jumbled repositories
1860
+
1861
+ 466
1862
+ 00:26:57,430 --> 00:27:01,830
1863
+ about my roving interests in AI, but particularly agentic, MCP
1864
+
1865
+ 467
1866
+ 00:27:02,070 --> 00:27:07,109
1867
+ and Voicetech, you can find me on GitHub, huggingface.com,
1868
+
1869
+ 468
1870
+ 00:27:10,310 --> 00:27:13,350
1871
+ which is my personal website, as well as this podcast,
1872
+
1873
+ 469
1874
+ 00:27:13,590 --> 00:27:17,030
1875
+ whose name I sadly cannot remember. Until next time, thanks
1876
+
1877
+ 470
1878
+ 00:27:17,030 --> 00:27:17,590
1879
+ for listening.
1880
+
srt-out/gladia.srt ADDED
@@ -0,0 +1,2003 @@
1
+ 1
2
+ 00:00:00.172 --> 00:00:15.108
3
+ Hello and welcome to a audio data set consisting of one single episode of a non-existent podcast or it uh i may append this to a podcast that i set up recently um
4
+
5
+ 2
6
+ 00:00:15.467 --> 00:00:29.435
7
+ regarding my uh with my thoughts on speech tech and ai in particular more ai and generative ai i would uh i would say but in any event the purpose of this um
8
+
9
+ 3
10
+ 00:00:30.219 --> 00:00:36.545
11
+ voice recording is actually to create a lengthy voice sample for a quick evaluation,
12
+
13
+ 4
14
+ 00:00:36.546 --> 00:00:38.088
15
+ a back of the envelope evaluation,
16
+
17
+ 5
18
+ 00:00:38.390 --> 00:00:39.148
19
+ as they might say,
20
+
21
+ 6
22
+ 00:00:39.749 --> 00:00:41.273
23
+ for different speech to text models.
24
+
25
+ 7
26
+ 00:00:41.274 --> 00:00:42.195
27
+ And I'm doing this because
28
+
29
+ 8
30
+ 00:00:42.975 --> 00:00:46.655
31
+ I thought I'd made a great breakthrough in my journey with speech tech,
32
+
33
+ 9
34
+ 00:00:47.234 --> 00:00:50.999
35
+ and that was succeeding in the elusive task of fine tuning Whisper.
36
+
37
+ 10
38
+ 00:00:51.780 --> 00:00:52.655
39
+ Whisper is,
40
+
41
+ 11
42
+ 00:00:52.920 --> 00:00:58.890
43
+ and I'm going to just talk i'm trying to mix up uh i'm going to try a few different
44
+
45
+ 12
46
+ 00:00:59.524 --> 00:01:18.581
47
+ styles of speaking i might whisper something at some points as well and i'll go back to speaking loud in uh in different parts i'm going to sound really like a crazy person because i'm also going to try to speak at different pitches and cadences in order to really try to put a
48
+
49
+ 13
50
+ 00:01:18.706 --> 00:01:28.831
51
+ speech attacks model through its paces which is trying to make sense of is this guy just rambling on incoherently in one long sentence or are these
52
+
53
+ 14
54
+ 00:01:29.652 --> 00:01:33.436
55
+ just actually a series of step,
56
+
57
+ 15
58
+ 00:01:33.734 --> 00:01:34.355
59
+ standalone,
60
+
61
+ 16
62
+ 00:01:34.415 --> 00:01:34.918
63
+ step alone,
64
+
65
+ 17
66
+ 00:01:35.016 --> 00:01:36.200
67
+ standalone sentences.
68
+
69
+ 18
70
+ 00:01:36.519 --> 00:01:38.040
71
+ And how is it going to handle step alone?
72
+
73
+ 19
74
+ 00:01:38.078 --> 00:01:38.680
75
+ That's not a word.
76
+
77
+ 20
78
+ 00:01:39.859 --> 00:01:43.343
79
+ What happens when you use speech to text and you use a fake word?
80
+
81
+ 21
82
+ 00:01:43.367 --> 00:01:43.884
83
+ And then you're like,
84
+
85
+ 22
86
+ 00:01:43.923 --> 00:01:44.063
87
+ wait,
88
+
89
+ 23
90
+ 00:01:44.087 --> 00:01:44.703
91
+ that's not actually,
92
+
93
+ 24
94
+ 00:01:45.468 --> 00:01:46.328
95
+ that word doesn't exist.
96
+
97
+ 25
98
+ 00:01:47.048 --> 00:01:48.266
99
+ How does AI handle that?
100
+
101
+ 26
102
+ 00:01:48.484 --> 00:01:55.359
103
+ And these and more are all the questions that I'm seeking to answer in this training data.
104
+
105
+ 27
106
+ 00:01:56.001 --> 00:01:56.141
107
+ Now,
108
+
109
+ 28
110
+ 00:01:56.359 --> 00:01:56.718
111
+ why did,
112
+
113
+ 29
114
+ 00:01:56.843 --> 00:01:58.266
115
+ why was it trying to fine tune Whisper?
116
+
117
+ 30
118
+ 00:01:58.787 --> 00:02:16.968
119
+ what is whisper as i said i'm gonna try to uh record this at a couple of different levels of technicality for folks who are uh you know in the normal uh world and not totally stuck down the rabbit hole of ai which i have to say is a really wonderful uh rabbit hole to be to
120
+
121
+ 31
122
+ 00:02:16.969 --> 00:02:27.735
123
+ be down um it's a really interesting area and speech and voice tech is is the aspect of it that i find actually most i'm not sure i would say the most interesting because there's
124
+
125
+ 32
126
+ 00:02:28.147 --> 00:02:30.349
127
+ Just so much that is fascinating in AI.
128
+
129
+ 33
130
+ 00:02:31.372 --> 00:02:41.520
131
+ But the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work.
132
+
133
+ 34
134
+ 00:02:42.082 --> 00:02:42.379
135
+ And
136
+
137
+ 35
138
+ 00:02:43.183 --> 00:02:47.230
139
+ I'm persevering hard with the task of training,
140
+
141
+ 36
142
+ 00:02:47.231 --> 00:02:47.527
143
+ I guess,
144
+
145
+ 37
146
+ 00:02:47.730 --> 00:02:49.762
147
+ a good solution working for Linux,
148
+
149
+ 38
150
+ 00:02:50.122 --> 00:02:51.683
151
+ which if anyone actually does listen to this,
152
+
153
+ 39
154
+ 00:02:51.777 --> 00:02:54.355
155
+ not just for the training data and for the actual content,
156
+
157
+ 40
158
+ 00:02:55.247 --> 00:02:56.497
159
+ this is this is sparked.
160
+
161
+ 41
162
+ 00:02:56.762 --> 00:02:57.044
163
+ I had
164
+
165
+ 42
166
+ 00:02:58.056 --> 00:03:13.229
167
+ besides the fine-tune not working well that was the failure um i used plod code because one thinks these days that there is nothing short of solving you know the uh the
168
+
169
+ 43
170
+ 00:03:13.368 --> 00:03:24.518
171
+ reason of life or something uh that plod and agentic ai can't do uh which is not really the case uh it does seem that way sometimes but it fails a lot as well and this is one of those
172
+
173
+ 44
174
+ 00:03:25.304 --> 00:03:29.768
175
+ instances where last week I put together an hour of voice training data,
176
+
177
+ 45
178
+ 00:03:30.528 --> 00:03:31.229
179
+ basically speaking,
180
+
181
+ 46
182
+ 00:03:31.271 --> 00:03:33.174
183
+ just random things for three minutes.
184
+
185
+ 47
186
+ 00:03:33.407 --> 00:03:38.618
187
+ And it was actually kind of tedious because the texts were really weird.
188
+
189
+ 48
190
+ 00:03:38.674 --> 00:03:39.174
191
+ Some of them were,
192
+
193
+ 49
194
+ 00:03:39.556 --> 00:03:40.080
195
+ it was like,
196
+
197
+ 50
198
+ 00:03:40.361 --> 00:03:40.596
199
+ it was
200
+
201
+ 51
202
+ 00:03:41.127 --> 00:03:41.939
203
+ AI generated.
204
+
205
+ 52
206
+ 00:03:42.721 --> 00:03:45.518
207
+ I tried before to read Sherlock Holmes for an hour and I just couldn't,
208
+
209
+ 53
210
+ 00:03:45.564 --> 00:03:48.893
211
+ I was so bored after 10 minutes that I was like,
212
+
213
+ 54
214
+ 00:03:48.894 --> 00:03:49.064
215
+ okay,
216
+
217
+ 55
218
+ 00:03:49.066 --> 00:03:51.705
219
+ I know I'm just going to have to find something else to read.
220
+
221
+ 56
222
+ 00:03:51.752 --> 00:03:51.877
223
+ So
224
+
225
+ 57
226
+ 00:03:52.907 --> 00:03:53.705
227
+ I used...
228
+
229
+ 58
230
+ 00:03:54.207 --> 00:04:11.201
231
+ a created with AI studio vibe coded a synthetic text generator which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content so I was like okay give me a
232
+
233
+ 59
234
+ 00:04:11.248 --> 00:04:22.858
235
+ voice note like I'm recording an email give me a short story to read give me prose to read so it came up with all these different things and they added a little timer to it so I could see.
236
+
237
+ 60
238
+ 00:04:23.295 --> 00:04:50.961
239
+ how close i was to one hour um and uh i spent like an hour one afternoon or probably two hours by the time you um you do retakes and whatever because you want to it gave me a source of truth which i'm not sure if that's the scientific way to approach this topic of gathering uh training data but i thought made sense um i have a lot of audio data from recording voice notes which I've also kind of used
240
+
241
+ 61
242
+ 00:04:52.117 --> 00:04:52.384
243
+ Bean.
244
+
245
+ 62
246
+ 00:04:52.755 --> 00:05:02.007
247
+ experimenting with using for a different purpose slightly different annotating task types it's more text classification experiment or
248
+
249
+ 63
250
+ 00:05:02.836 --> 00:05:02.956
251
+ Well,
252
+
253
+ 64
254
+ 00:05:02.956 --> 00:05:03.497
255
+ it's more than that,
256
+
257
+ 65
258
+ 00:05:03.536 --> 00:05:03.776
259
+ actually.
260
+
261
+ 66
262
+ 00:05:03.778 --> 00:05:04.857
263
+ I'm working on a voice app.
264
+
265
+ 67
266
+ 00:05:04.937 --> 00:05:07.660
267
+ So it's a prototype,
268
+
269
+ 68
270
+ 00:05:07.680 --> 00:05:07.980
271
+ I guess,
272
+
273
+ 69
274
+ 00:05:08.019 --> 00:05:09.000
275
+ is really more accurate.
276
+
277
+ 70
278
+ 00:05:11.382 --> 00:05:13.805
279
+ But you can do that and you can work backwards.
280
+
281
+ 71
282
+ 00:05:13.843 --> 00:05:14.187
283
+ You're like,
284
+
285
+ 72
286
+ 00:05:14.343 --> 00:05:19.757
287
+ you listen back to a voice note and you painfully go through one of those transcribing,
288
+
289
+ 73
290
+ 00:05:19.992 --> 00:05:20.226
291
+ you know,
292
+
293
+ 74
294
+ 00:05:20.274 --> 00:05:23.413
295
+ where you start and stop and scrub around it and you fix the errors.
296
+
297
+ 75
298
+ 00:05:23.415 --> 00:05:24.117
299
+ But it's really,
300
+
301
+ 76
302
+ 00:05:24.180 --> 00:05:25.538
303
+ really boring to do that.
304
+
305
+ 77
306
+ 00:05:26.163 --> 00:05:31.680
307
+ So I thought it would be less tedious in the long term if I just recorded the source of truth.
308
+
309
+ 78
310
+ 00:05:32.247 --> 00:05:34.190
311
+ So it gave me these three minute snippets.
312
+
313
+ 79
314
+ 00:05:34.428 --> 00:05:38.593
315
+ I recorded them and saved an MP3 and a TXT in the same folder.
316
+
317
+ 80
318
+ 00:05:38.855 --> 00:05:40.500
319
+ And I created an error of that data.
320
+
321
+ 81
322
+ 00:05:41.975 --> 00:05:43.038
323
+ So I was very hopeful,
324
+
325
+ 82
326
+ 00:05:43.398 --> 00:05:43.781
327
+ quietly,
328
+
329
+ 83
330
+ 00:05:43.898 --> 00:05:44.117
331
+ you know,
332
+
333
+ 84
334
+ 00:05:44.117 --> 00:05:47.725
335
+ a little bit hopeful that I would be able that I could actually fine tune Whisper.
336
+
337
+ 85
338
+ 00:05:48.586 --> 00:05:53.100
339
+ I want to fine tune Whisper because when I got into voice tech last November,
340
+
341
+ 86
342
+ 00:05:54.242 --> 00:05:57.538
343
+ my wife was in the US and I was alone at home and,
344
+
345
+ 87
346
+ 00:05:57.819 --> 00:05:58.053
347
+ you know,
348
+
349
+ 88
350
+ 00:05:58.069 --> 00:05:59.117
351
+ went crazy.
352
+
353
+ 89
354
+ 00:05:59.444 --> 00:06:12.454
355
+ people like me do really wild things like use voice to tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high
356
+
357
+ 90
358
+ 00:06:13.336 --> 00:06:26.509
359
+ I used speech tech now and again tried it out I was like it'd be really cool if you could just like speak into your computer and whatever I tried out that had support was just it was not good basically
360
+
361
+ 91
362
+ 00:06:27.500 --> 00:06:29.440
363
+ And this blew me away from the first go.
364
+
365
+ 92
366
+ 00:06:29.480 --> 00:06:29.701
367
+ I mean,
368
+
369
+ 93
370
+ 00:06:29.701 --> 00:06:30.860
371
+ it wasn't 100%
372
+
373
+ 94
374
+ 00:06:31.841 --> 00:06:33.360
375
+ accurate out of the box and it took work,
376
+
377
+ 95
378
+ 00:06:33.942 --> 00:06:41.302
379
+ but it was good enough that there was a solid foundation and it kind of passed that pivot point that it's actually worth doing this.
380
+
381
+ 96
382
+ 00:06:41.942 --> 00:06:42.185
383
+ You know,
384
+
385
+ 97
386
+ 00:06:42.185 --> 00:06:46.418
387
+ there's a point where it's so like the transcript is you don't have to get 100%
388
+
389
+ 98
390
+ 00:06:46.482 --> 00:06:48.262
391
+ accuracy for it to be worth your time,
392
+
393
+ 99
394
+ 00:06:49.091 --> 00:06:51.668
395
+ for a speech to text to be a worthwhile addition to your productivity.
396
+
397
+ 100
398
+ 00:06:51.778 --> 00:06:53.043
399
+ But you do need to get above,
400
+
401
+ 101
402
+ 00:06:53.091 --> 00:06:53.418
403
+ let's say,
404
+
405
+ 102
406
+ 00:06:53.528 --> 00:06:53.887
407
+ I don't know,
408
+
409
+ 103
410
+ 00:06:53.966 --> 00:06:54.451
411
+ 85%.
412
+
413
+ 104
414
+ 00:06:54.466 --> 00:06:54.887
415
+ percent.
416
+
417
+ 105
418
+ 00:06:55.711 --> 00:06:56.651
419
+ If it's 60%
420
+
421
+ 106
422
+ 00:06:57.031 --> 00:06:57.413
423
+ or 50%,
424
+
425
+ 107
426
+ 00:06:57.793 --> 00:06:58.692
427
+ you inevitably say,
428
+
429
+ 108
430
+ 00:06:59.173 --> 00:06:59.512
431
+ screw it,
432
+
433
+ 109
434
+ 00:06:59.514 --> 00:07:05.033
435
+ I'll just type it because you end up missing errors in the transcript and it becomes actually worse.
436
+
437
+ 110
438
+ 00:07:05.110 --> 00:07:06.978
439
+ You end up in a worse position than you started with it.
440
+
441
+ 111
442
+ 00:07:06.978 --> 00:07:07.915
443
+ That's been my experience.
444
+
445
+ 112
446
+ 00:07:08.555 --> 00:07:08.673
447
+ So
448
+
449
+ 113
450
+ 00:07:10.572 --> 00:07:10.915
451
+ I was like,
452
+
453
+ 114
454
+ 00:07:10.994 --> 00:07:11.134
455
+ oh,
456
+
457
+ 115
458
+ 00:07:11.158 --> 00:07:12.228
459
+ this is actually really,
460
+
461
+ 116
462
+ 00:07:12.274 --> 00:07:12.838
463
+ really good now.
464
+
465
+ 117
466
+ 00:07:12.930 --> 00:07:13.555
467
+ How did that happen?
468
+
469
+ 118
470
+ 00:07:13.603 --> 00:07:15.040
471
+ And the answer is ASR,
472
+
473
+ 119
474
+ 00:07:15.680 --> 00:07:20.072
475
+ Whisper being open sourced and the transformer architecture.
476
+
477
+ 120
478
+ 00:07:20.072 --> 00:07:21.619
479
+ If you want to go back to the
480
+
481
+ 121
482
+ 00:07:23.319 --> 00:07:24.120
483
+ to the underpinnings,
484
+
485
+ 122
486
+ 00:07:24.139 --> 00:07:25.660
487
+ which really blows my mind.
488
+
489
+ 123
490
+ 00:07:25.920 --> 00:07:29.480
491
+ And it's on my list to read through that paper.
492
+
493
+ 124
494
+ 00:07:30.422 --> 00:07:38.444
495
+ All you need is attention as attentively as can be done with my limited brain because it's super,
496
+
497
+ 125
498
+ 00:07:38.500 --> 00:07:39.819
499
+ super high level stuff.
500
+
501
+ 126
502
+ 00:07:41.461 --> 00:07:42.350
503
+ Super advanced stuff,
504
+
505
+ 127
506
+ 00:07:42.367 --> 00:07:42.678
507
+ I mean.
508
+
509
+ 128
510
+ 00:07:43.100 --> 00:07:44.100
511
+ But that,
512
+
513
+ 129
514
+ 00:07:44.507 --> 00:07:52.600
515
+ I think of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities.
516
+
517
+ 130
518
+ 00:07:53.507 --> 00:07:55.048
519
+ I find it fascinating that few people are like,
520
+
521
+ 131
522
+ 00:07:55.189 --> 00:07:55.490
523
+ hang on,
524
+
525
+ 132
526
+ 00:07:56.009 --> 00:07:58.994
527
+ you've got this thing that can speak to you like a chatbot,
528
+
529
+ 133
530
+ 00:07:58.995 --> 00:07:59.634
531
+ an LLM.
532
+
533
+ 134
534
+ 00:08:00.576 --> 00:08:02.600
535
+ And then you've got image generation.
536
+
537
+ 135
538
+ 00:08:02.959 --> 00:08:03.076
539
+ OK,
540
+
541
+ 136
542
+ 00:08:03.139 --> 00:08:03.521
543
+ so firstly,
544
+
545
+ 137
546
+ 00:08:03.639 --> 00:08:07.341
547
+ those two things on the surface have nothing in common.
548
+
549
+ 138
550
+ 00:08:08.545 --> 00:08:08.826
551
+ So like,
552
+
553
+ 139
554
+ 00:08:08.904 --> 00:08:09.505
555
+ how are they?
556
+
557
+ 140
558
+ 00:08:10.427 --> 00:08:12.286
559
+ How did that just happen all at the same time?
560
+
561
+ 141
562
+ 00:08:12.302 --> 00:08:13.411
563
+ And then when you extend that further,
564
+
565
+ 142
566
+ 00:08:14.944 --> 00:08:15.630
567
+ you're like Suno,
568
+
569
+ 143
570
+ 00:08:16.036 --> 00:08:16.225
571
+ right?
572
+
573
+ 144
574
+ 00:08:16.271 --> 00:08:20.896
575
+ You can sing a song and AI will like come up with an instrumental.
576
+
577
+ 145
578
+ 00:08:21.516 --> 00:08:22.637
579
+ And then you've got Whisper.
580
+
581
+ 146
582
+ 00:08:22.757 --> 00:08:23.077
583
+ And you're like,
584
+
585
+ 147
586
+ 00:08:23.079 --> 00:08:23.699
587
+ wait a second.
588
+
589
+ 148
590
+ 00:08:24.158 --> 00:08:25.201
591
+ How did all this stuff,
592
+
593
+ 149
594
+ 00:08:25.319 --> 00:08:26.598
595
+ like if it's all AI,
596
+
597
+ 150
598
+ 00:08:27.262 --> 00:08:27.603
599
+ what's,
600
+
601
+ 151
602
+ 00:08:27.942 --> 00:08:29.384
603
+ like there has to be some commonality.
604
+
605
+ 152
606
+ 00:08:29.543 --> 00:08:30.161
607
+ Otherwise,
608
+
609
+ 153
610
+ 00:08:30.865 --> 00:08:34.707
611
+ these are totally different technologies on the surface of it.
612
+
613
+ 154
614
+ 00:08:34.888 --> 00:08:37.990
615
+ And the transformer architecture is,
616
+
617
+ 155
618
+ 00:08:38.349 --> 00:08:39.067
619
+ as far as I know,
620
+
621
+ 156
622
+ 00:08:39.240 --> 00:08:40.162
623
+ the answer.
624
+
625
+ 157
626
+ 00:08:40.332 --> 00:08:41.192
627
+ And I can't even say,
628
+
629
+ 158
630
+ 00:08:41.302 --> 00:08:47.287
631
+ can't even pretend that I really understand what the transformer architecture means in depth.
632
+
633
+ 159
634
+ 00:08:47.317 --> 00:08:48.457
635
+ But I have scanned this.
636
+
637
+ 160
638
+ 00:08:48.707 --> 00:08:49.629
639
+ And as I said,
640
+
641
+ 161
642
+ 00:08:49.707 --> 00:08:50.599
643
+ I want to...
644
+
645
+ 162
646
+ 00:08:50.840 --> 00:09:01.552
647
+ printed and really kind of think over it at some point, and I'll probably feel bad about myself, I think, because weren't those guys in their, in their 20s? Like, that's crazy.
648
+
649
+ 163
650
+ 00:09:02.208 --> 00:09:11.177
651
+ I think I asked ChatGPT once who were the, who wrote that paper, and how old were they when it was published on arXiv. And I was expecting, like,
652
+
653
+ 164
654
+ 00:09:11.662 --> 00:09:20.067
655
+ I don't know, what do you, what do you imagine? I personally imagine kind of, like, you know, you have these breakthroughs during COVID and things like that, where, like, these kind of
656
+
657
+ 165
658
+ 00:09:20.543 --> 00:09:22.184
659
+ really obscure scientists who are like in their
660
+
661
+ 166
662
+ 00:09:22.524 --> 00:09:41.356
663
+ 50s, and they've just kind of been laboring in labs, and, uh, wearily, and writing and publishing in kind of obscure academic publications, and they finally, like, hit it big or win a Nobel Prize, and then they're household, household names. Uh, so that was kind of what I had in mind; that was the mental image I'd formed of the
664
+
665
+ 167
666
+ 00:09:41.919 --> 00:09:49.809
667
+ birth of arXiv. Like, I wasn't expecting 20-somethings in San Francisco, though. I, I thought that was both very, very funny, very cool, and actually kind of inspiring.
668
+
669
+ 168
670
+ 00:09:50.580 --> 00:09:52.484
671
+ It's nice to think that people who,
672
+
673
+ 169
674
+ 00:09:53.488 --> 00:09:53.729
675
+ you know,
676
+
677
+ 170
678
+ 00:09:53.927 --> 00:09:56.294
679
+ just you might put them in the kind of.
680
+
681
+ 171
682
+ 00:09:56.966 --> 00:10:12.508
683
+ milieu or bubble or world that you are in, or credibly in through, you know, the series of connections, that are coming up with such literally world-changing, um, innovations. Uh, so that was, I thought, anyway, that's, that, that was cool. Okay,
684
+
685
+ 172
686
+ 00:10:12.570 --> 00:10:24.687
687
+ voice training data. How are we doing? We're about 10 minutes in, and I'm still talking about voice technology. Um, so Whisper was brilliant, and I was so excited that I was, my first instinct was to, like, guess.
688
+
689
+ 173
690
+ 00:10:25.066 --> 00:10:25.326
691
+ It's like,
692
+
693
+ 174
694
+ 00:10:25.326 --> 00:10:25.807
695
+ oh my gosh,
696
+
697
+ 175
698
+ 00:10:25.826 --> 00:10:27.609
699
+ I have to get like a really good microphone for this.
700
+
701
+ 176
702
+ 00:10:28.169 --> 00:10:28.288
703
+ So
704
+
705
+ 177
706
+ 00:10:29.370 --> 00:10:31.471
707
+ I didn't go on a spending spree because I said,
708
+
709
+ 178
710
+ 00:10:31.592 --> 00:10:34.432
711
+ I'm going to have to just wait a month and see if I still use this.
712
+
713
+ 179
714
+ 00:10:35.198 --> 00:10:37.596
715
+ And it just kind of became,
716
+
717
+ 180
718
+ 00:10:38.019 --> 00:10:40.823
719
+ it's become really part of my daily routine.
720
+
721
+ 181
722
+ 00:10:41.863 --> 00:10:43.003
723
+ Like if I'm writing an email,
724
+
725
+ 182
726
+ 00:10:43.269 --> 00:10:44.503
727
+ I'll record a voice note.
728
+
729
+ 183
730
+ 00:10:45.049 --> 00:10:46.284
731
+ And then I've developed.
732
+
733
+ 184
734
+ 00:10:46.784 --> 00:10:50.534
735
+ And it's nice to see that everyone is like developing the same things in parallel.
736
+
737
+ 185
738
+ 00:10:50.566 --> 00:10:52.409
739
+ Like that's kind of a weird thing to say.
740
+
741
+ 186
742
+ 00:10:52.488 --> 00:10:53.549
743
+ But when I look,
744
+
745
+ 187
746
+ 00:10:53.659 --> 00:10:53.769
747
+ I...
748
+
749
+ 188
750
+ 00:10:54.298 --> 00:11:11.754
751
+ kind of came when I started working on this, uh, these prototypes on GitHub, which is where I just kind of share very freely and loosely, uh, ideas and, you know, first iterations on, on concepts. Um, and for want of a better word, I called it, like, uh,
752
+
753
+ 189
754
+ 00:11:11.754 --> 00:11:21.441
755
+ llm post-processing or cleanup or basically a system prompt that after you get back the raw text from whisper you run it through a model and say,
756
+
757
+ 190
758
+ 00:11:21.566 --> 00:11:21.738
759
+ okay,
760
+
761
+ 191
762
+ 00:11:21.784 --> 00:11:22.909
763
+ this is crappy.
764
+
765
+ 192
766
+ 00:11:23.785 --> 00:11:33.653
767
+ text, like, add sentence structure and, you know, fix it up. And now, when I'm exploring the different tools that are out there that people have built,
768
+
769
+ 193
770
+ 00:11:34.216 --> 00:11:49.996
771
+ I see quite a number of projects have basically, you know, done the same thing. Lest that be misconstrued, I'm not saying for a millisecond that I inspired them; I'm sure this has been a thing that's been integrated into tools for a while. But it's,
772
+
773
+ 194
774
+ 00:11:50.710 --> 00:11:53.312
775
+ It's the kind of thing that when you start using these tools every day,
776
+
777
+ 195
778
+ 00:11:53.613 --> 00:12:02.100
779
+ the need for it is almost instantly apparent because text that doesn't have any punctuation or paragraph spacing takes a long time to,
780
+
781
+ 196
782
+ 00:12:02.842 --> 00:12:03.086
783
+ you know,
784
+
785
+ 197
786
+ 00:12:03.163 --> 00:12:06.023
787
+ it takes so long to get it into a presentable email that again,
788
+
789
+ 198
790
+ 00:12:06.086 --> 00:12:06.241
791
+ it's,
792
+
793
+ 199
794
+ 00:12:06.428 --> 00:12:06.600
795
+ it's,
796
+
797
+ 200
798
+ 00:12:06.788 --> 00:12:06.928
799
+ it,
800
+
801
+ 201
802
+ 00:12:07.086 --> 00:12:13.006
803
+ it moves speech tech into that before that inflection point where you're like,
804
+
805
+ 202
806
+ 00:12:13.008 --> 00:12:13.131
807
+ nah,
808
+
809
+ 203
810
+ 00:12:13.133 --> 00:12:13.836
811
+ it's just not worth it.
812
+
813
+ 204
814
+ 00:12:13.850 --> 00:12:14.491
815
+ It's like,
816
+
817
+ 205
818
+ 00:12:15.178 --> 00:12:16.898
819
+ it'll just be quicker to type this.
820
+
821
+ 206
822
+ 00:12:17.428 --> 00:12:18.336
823
+ So it's a big,
824
+
825
+ 207
826
+ 00:12:18.350 --> 00:12:19.461
827
+ it's a little touch that actually.
828
+
829
+ 208
830
+ 00:12:20.289 --> 00:12:20.791
831
+ is a big deal.
832
+
833
+ 209
834
+ 00:12:21.672 --> 00:12:22.373
835
+ So I was on
836
+
837
+ 210
838
+ 00:12:22.712 --> 00:12:28.100
839
+ Whisper and I've been using Whisper and I kind of early on found a couple of tools.
840
+
841
+ 211
842
+ 00:12:28.458 --> 00:12:30.419
843
+ I couldn't find what I was looking for on Linux,
844
+
845
+ 212
846
+ 00:12:30.498 --> 00:12:35.725
847
+ which is basically just something that'll run in the background.
848
+
849
+ 213
850
+ 00:12:36.044 --> 00:12:43.873
851
+ You'll give it an API key and it will just like transcribe with like a little key to start and stop the dictation.
852
+
853
+ 214
854
+ 00:12:45.248 --> 00:12:47.061
855
+ And the issues were I discovered
856
+
857
+ 215
858
+ 00:12:47.241 --> 00:13:06.619
859
+ that, like, most people involved in creating these projects were very much focused on local models, running Whisper locally, because you can. And I tried that a bunch of times and just never got results that were as good as the cloud. And when I began looking at the cost of the speech-to-text APIs and what I was spending, I
860
+
861
+ 216
862
+ 00:13:06.682 --> 00:13:16.104
863
+ just thought, there is, it's actually, in my opinion, just one of the better deals in API spending and in cloud. Like, it's just not that expensive for very, very good models
864
+
865
+ 217
866
+ 00:13:16.730 --> 00:13:18.470
867
+ That are much more,
868
+
869
+ 218
870
+ 00:13:19.070 --> 00:13:19.291
871
+ you know,
872
+
873
+ 219
874
+ 00:13:19.292 --> 00:13:20.688
875
+ you're going to be able to run the full model,
876
+
877
+ 220
878
+ 00:13:21.572 --> 00:13:24.916
879
+ the latest model versus whatever you can run on your average
880
+
881
+ 221
882
+ 00:13:25.533 --> 00:13:28.711
883
+ GPU, unless you want to buy a crazy GPU.
884
+
885
+ 222
886
+ 00:13:28.751 --> 00:13:29.892
887
+ It doesn't really make sense to me.
888
+
889
+ 223
890
+ 00:13:30.033 --> 00:13:39.619
891
+ Privacy is another concern that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment,
892
+
893
+ 224
894
+ 00:13:40.352 --> 00:13:42.197
895
+ maybe for regulatory reasons as well.
896
+
897
+ 225
898
+ 00:13:42.916 --> 00:13:43.727
899
+ But I'm not in that.
900
+
901
+ 226
902
+ 00:13:44.291 --> 00:13:45.744
903
+ I neither really care
904
+
905
+ 227
906
+ 00:13:46.118 --> 00:13:52.018
907
+ about people listening to my grocery list consisting of reminding myself that I need to buy more beer,
908
+
909
+ 228
910
+ 00:13:52.619 --> 00:13:53.721
911
+ Cheetos and hummus,
912
+
913
+ 229
914
+ 00:13:53.759 --> 00:13:54.716
915
+ which is kind of the three,
916
+
917
+ 230
918
+ 00:13:55.264 --> 00:13:59.240
919
+ three staples of my diet during periods of poor nutrition.
920
+
921
+ 231
922
+ 00:14:00.020 --> 00:14:01.458
923
+ But the kind of stuff that I transcribe,
924
+
925
+ 232
926
+ 00:14:01.498 --> 00:14:02.153
927
+ it's just not,
928
+
929
+ 233
930
+ 00:14:02.154 --> 00:14:03.106
931
+ it's not a,
932
+
933
+ 234
934
+ 00:14:04.248 --> 00:14:08.059
935
+ it's not a privacy thing I'm that sort of sensitive about.
936
+
937
+ 235
938
+ 00:14:08.356 --> 00:14:08.748
939
+ And
940
+
941
+ 236
942
+ 00:14:09.606 --> 00:14:10.544
943
+ I don't do anything so,
944
+
945
+ 237
946
+ 00:14:11.559 --> 00:14:11.826
947
+ you know,
948
+
949
+ 238
950
+ 00:14:12.356 --> 00:14:14.356
951
+ sensitive or secure that requires air gapping.
952
+
953
+ 239
954
+ 00:14:14.403 --> 00:14:14.528
955
+ So.
956
+
957
+ 240
958
+ 00:14:15.770 --> 00:14:18.131
959
+ I looked at the pricing and especially the kind of older models,
960
+
961
+ 241
962
+ 00:14:18.273 --> 00:14:18.493
963
+ mini,
964
+
965
+ 242
966
+ 00:14:19.714 --> 00:14:20.417
967
+ some of them are very,
968
+
969
+ 243
970
+ 00:14:20.495 --> 00:14:21.174
971
+ very affordable.
972
+
973
+ 244
974
+ 00:14:21.256 --> 00:14:21.475
975
+ And
976
+
977
+ 245
978
+ 00:14:22.937 --> 00:14:24.721
979
+ I did a calculation once with
980
+
981
+ 246
982
+ 00:14:25.361 --> 00:14:26.339
983
+ ChatGPT and I was like,
984
+
985
+ 247
986
+ 00:14:26.424 --> 00:14:26.542
987
+ OK,
988
+
989
+ 248
990
+ 00:14:27.322 --> 00:14:27.783
991
+ this is the
992
+
993
+ 249
994
+ 00:14:28.464 --> 00:14:31.027
995
+ API price for I can't remember whatever the model was.
996
+
997
+ 250
998
+ 00:14:31.971 --> 00:14:33.861
999
+ Let's say I just go at it like nonstop,
1000
+
1001
+ 251
1002
+ 00:14:34.269 --> 00:14:35.408
1003
+ which it rarely happens.
1004
+
1005
+ 252
1006
+ 00:14:35.549 --> 00:14:36.033
1007
+ Probably
1008
+
1009
+ 253
1010
+ 00:14:36.691 --> 00:14:42.956
1011
+ I would say on average I might dictate 30 to 60 minutes per day if I was probably summing up the emails.
1012
+
1013
+ 254
1014
+ 00:14:44.114 --> 00:14:44.234
1015
+ uh,
1016
+
1017
+ 255
1018
+ 00:14:44.635 --> 00:14:45.236
1019
+ documents,
1020
+
1021
+ 256
1022
+ 00:14:45.356 --> 00:14:46.080
1023
+ outlines,
1024
+
1025
+ 257
1026
+ 00:14:46.760 --> 00:14:47.100
1027
+ um,
1028
+
1029
+ 258
1030
+ 00:14:47.201 --> 00:14:47.763
1031
+ which is a lot,
1032
+
1033
+ 259
1034
+ 00:14:47.802 --> 00:14:48.182
1035
+ but it's,
1036
+
1037
+ 260
1038
+ 00:14:48.484 --> 00:14:49.889
1039
+ it's still a fairly modest amount.
1040
+
1041
+ 261
1042
+ 00:14:50.327 --> 00:14:50.730
1043
+ And I was like,
1044
+
1045
+ 262
1046
+ 00:14:50.750 --> 00:14:50.870
1047
+ well,
1048
+
1049
+ 263
1050
+ 00:14:50.952 --> 00:14:53.840
1051
+ some days I do go on like one or two days where I've been.
1052
+
1053
+ 264
1054
+ 00:14:54.749 --> 00:15:00.255
1055
+ Usually when I'm like kind of out of the house and just have something like I have nothing else to do.
1056
+
1057
+ 265
1058
+ 00:15:00.354 --> 00:15:01.813
1059
+ Like if I'm at a hospital,
1060
+
1061
+ 266
1062
+ 00:15:01.856 --> 00:15:07.841
1063
+ we have a newborn and you're waiting for like eight hours and hours for an appointment.
1064
+
1065
+ 267
1066
+ 00:15:08.380 --> 00:15:12.865
1067
+ And I would probably have listened to podcasts before becoming a speech fanatic.
1068
+
1069
+ 268
1070
+ 00:15:12.942 --> 00:15:13.475
1071
+ And I'm like,
1072
+
1073
+ 269
1074
+ 00:15:13.520 --> 00:15:13.645
1075
+ oh,
1076
+
1077
+ 270
1078
+ 00:15:13.662 --> 00:15:13.865
1079
+ wait,
1080
+
1081
+ 271
1082
+ 00:15:14.302 --> 00:15:15.255
1083
+ let me just get down.
1084
+
1085
+ 272
1086
+ 00:15:15.427 --> 00:15:16.975
1087
+ Let me just get these ideas out of my head.
1088
+
1089
+ 273
1090
+ 00:15:17.567 --> 00:15:20.645
1091
+ And that's when I'll go on my speech binges.
1092
+
1093
+ 274
1094
+ 00:15:20.692 --> 00:15:22.067
1095
+ But those are like once every few months,
1096
+
1097
+ 275
1098
+ 00:15:22.130 --> 00:15:23.270
1099
+ like not frequently.
1100
+
1101
+ 276
1102
+ 00:15:23.832 --> 00:15:24.192
1103
+ But I said,
1104
+
1105
+ 277
1106
+ 00:15:24.232 --> 00:15:24.413
1107
+ okay,
1108
+
1109
+ 278
1110
+ 00:15:24.494 --> 00:15:27.597
1111
+ let's just say if I'm going to price out cloud STT,
1112
+
1113
+ 279
1114
+ 00:15:29.038 --> 00:15:36.043
1115
+ if I was like dedicated every second of every waking hour to transcribing for some odd reason,
1116
+
1117
+ 280
1118
+ 00:15:36.823 --> 00:15:37.129
1119
+ um,
1120
+
1121
+ 281
1122
+ 00:15:37.323 --> 00:15:37.590
1123
+ I mean,
1124
+
1125
+ 282
1126
+ 00:15:37.591 --> 00:15:39.465
1127
+ it'd have to like eat and use the toilet.
1128
+
1129
+ 283
1130
+ 00:15:39.823 --> 00:15:40.090
1131
+ Like,
1132
+
1133
+ 284
1134
+ 00:15:40.527 --> 00:15:40.730
1135
+ you know,
1136
+
1137
+ 285
1138
+ 00:15:40.730 --> 00:15:42.527
1139
+ there's only so many hours I'm awake for.
1140
+
1141
+ 286
1142
+ 00:15:42.652 --> 00:15:43.090
1143
+ So like,
1144
+
1145
+ 287
1146
+ 00:15:43.198 --> 00:15:45.495
1147
+ let's just say a maximum of like 40 hours,
1148
+
1149
+ 288
1150
+ 00:15:45.620 --> 00:15:48.058
1151
+ 45 minutes in the hour.
1152
+
1153
+ 289
1154
+ 00:15:48.120 --> 00:15:48.573
1155
+ Then I said,
1156
+
1157
+ 290
1158
+ 00:15:48.590 --> 00:15:48.840
1159
+ all right,
1160
+
1161
+ 291
1162
+ 00:15:48.855 --> 00:15:49.823
1163
+ let's just say 50.
1164
+
1165
+ 292
1166
+ 00:15:50.715 --> 00:15:51.277
1167
+ Who knows?
1168
+
1169
+ 293
1170
+ 00:15:51.495 --> 00:15:52.573
1171
+ You're dictating on the toilet.
1172
+
1173
+ 294
1174
+ 00:15:52.855 --> 00:15:53.323
1175
+ We do it.
1176
+
1177
+ 295
1178
+ 00:15:54.144 --> 00:15:55.385
1179
+ So you could just do 60,
1180
+
1181
+ 296
1182
+ 00:15:55.524 --> 00:15:58.764
1183
+ but whatever I did and every day,
1184
+
1185
+ 297
1186
+ 00:15:58.986 --> 00:16:02.525
1187
+ like you're going flat out seven days a week dictating nonstop.
1188
+
1189
+ 298
1190
+ 00:16:02.565 --> 00:16:02.964
1191
+ I was like,
1192
+
1193
+ 299
1194
+ 00:16:03.104 --> 00:16:06.424
1195
+ what's my monthly API bill going to be at this price?
1196
+
1197
+ 300
1198
+ 00:16:06.947 --> 00:16:09.307
1199
+ And it came out to like 70 or 80 bucks.
1200
+
1201
+ 301
1202
+ 00:16:09.307 --> 00:16:09.745
1203
+ And I was like,
1204
+
1205
+ 302
1206
+ 00:16:09.854 --> 00:16:10.042
1207
+ well,
1208
+
1209
+ 303
1210
+ 00:16:10.135 --> 00:16:14.167
1211
+ that would be an extraordinary amount of dictation.
1212
+
1213
+ 304
1214
+ 00:16:14.322 --> 00:16:22.104
1215
+ And I would hope that there was some compelling reason worth more than $70 that I embarked upon that project.
1216
+
1217
+ 305
1218
+ 00:16:22.832 --> 00:16:24.716
1219
+ So given that that's kind of the max point for me,
1220
+
1221
+ 306
1222
+ 00:16:24.895 --> 00:16:26.116
1223
+ I said that's actually very,
1224
+
1225
+ 307
1226
+ 00:16:26.296 --> 00:16:26.996
1227
+ very affordable.
1228
+
1229
+ 308
1230
+ 00:16:28.099 --> 00:16:28.220
1231
+ Now,
1232
+
1233
+ 309
1234
+ 00:16:28.278 --> 00:16:35.504
1235
+ you're going to if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable,
1236
+
1237
+ 310
1238
+ 00:16:36.207 --> 00:16:37.365
1239
+ that's going to cost some more as well.
1240
+
1241
+ 311
1242
+ 00:16:38.091 --> 00:16:39.309
1243
+ Unless you're using
1244
+
1245
+ 312
1246
+ 00:16:40.309 --> 00:16:42.996
1247
+ Gemini, which needless to say,
1248
+
1249
+ 313
1250
+ 00:16:43.013 --> 00:16:45.091
1251
+ is a random person sitting in Jerusalem.
1252
+
1253
+ 314
1254
+ 00:16:46.013 --> 00:16:46.934
1255
+ I have no affiliation,
1256
+
1257
+ 315
1258
+ 00:16:47.216 --> 00:16:48.341
1259
+ nor with Google,
1260
+
1261
+ 316
1262
+ 00:16:48.403 --> 00:16:49.184
1263
+ nor Anthropic,
1264
+
1265
+ 317
1266
+ 00:16:49.231 --> 00:16:49.903
1267
+ nor Gemini,
1268
+
1269
+ 318
1270
+ 00:16:49.966 --> 00:16:52.028
1271
+ nor any major tech vendor for that matter.
1272
+
1273
+ 319
1274
+ 00:16:52.688 --> 00:16:52.908
1275
+ Um,
1276
+
1277
+ 320
1278
+ 00:16:53.951 --> 00:16:56.770
1279
+ I like Gemini not so much as an everyday model.
1280
+
1281
+ 321
1282
+ 00:16:57.072 --> 00:16:57.412
1283
+ Um,
1284
+
1285
+ 322
1286
+ 00:16:57.513 --> 00:16:59.416
1287
+ it's kind of underwhelmed in that respect,
1288
+
1289
+ 323
1290
+ 00:16:59.434 --> 00:16:59.837
1291
+ I would say,
1292
+
1293
+ 324
1294
+ 00:17:00.477 --> 00:17:01.653
1295
+ but for multimodal,
1296
+
1297
+ 325
1298
+ 00:17:01.716 --> 00:17:02.934
1299
+ I think it's got a lot to offer.
1300
+
1301
+ 326
1302
+ 00:17:03.576 --> 00:17:06.762
1303
+ And I think that the transcribing functionality whereby it can,
1304
+
1305
+ 327
1306
+ 00:17:07.584 --> 00:17:07.840
1307
+ um,
1308
+
1309
+ 328
1310
+ 00:17:08.059 --> 00:17:13.809
1311
+ process audio with a system prompt and both give you transcription that's cleaned up,
1312
+
1313
+ 329
1314
+ 00:17:13.873 --> 00:17:15.373
1315
+ that reduces two steps to one.
1316
+
1317
+ 330
1318
+ 00:17:15.965 --> 00:17:18.012
1319
+ And that for me is a very,
1320
+
1321
+ 331
1322
+ 00:17:18.076 --> 00:17:18.653
1323
+ very big deal.
1324
+
1325
+ 332
1326
+ 00:17:18.873 --> 00:17:19.090
1327
+ And,
1328
+
1329
+ 333
1330
+ 00:17:19.840 --> 00:17:19.951
1331
+ uh,
1332
+
1333
+ 334
1334
+ 00:17:19.951 --> 00:17:22.045
1335
+ I feel like even Google hasn't really sort of
1336
+
1337
+ 335
1338
+ 00:17:22.669 --> 00:17:39.968
1339
+ thought through how useful the, that modality is, and what kind of use cases you can achieve with it. Because I found, in the course of this year, just an endless list of really kind of system prompt, system prompt stuff, that I can say, okay,
1340
+
1341
+ 336
1342
+ 00:17:40.125 --> 00:17:49.733
1343
+ I've used it to capture context data for AI, which is literally, I might speak for, if I wanted to have a good bank of context data about, who knows, my childhood.
1344
+
1345
+ 337
1346
+ 00:17:50.480 --> 00:18:06.348
1347
+ More realistically, maybe my career goals, something that would just be, like, really boring to type out. So I'll just, like, sit in my car and record it for 10 minutes, and that 10 minutes, you get a lot of information in. Emails,
1348
+
1349
+ 338
1350
+ 00:18:06.458 --> 00:18:15.864
1351
+ which is short text, just, there is a whole bunch. And all these workflows kind of require a little bit of treatment afterwards, and different treatment. My context
1352
+
1353
+ 339
1354
+ 00:18:16.441 --> 00:18:37.698
1355
+ pipeline is kind of, like, just extract the bare essentials. So you end up with me talking very loosely about sort of what I've done in my career, where I've worked, where I might like to work, and it goes, it condenses that down to very robotic language that is easy to chunk, parse, and maybe put into a vector database: Daniel has worked in technology, Daniel is a, has
1356
+
1357
+ 340
1358
+ 00:18:37.979 --> 00:18:44.526
1359
+ been working in martech, you know, stuff like that. That's not how you would speak, um, but I figure it's probably easier to parse for,
1360
+
1361
+ 341
1362
+ 00:18:44.962 --> 00:18:45.432
1363
+ after all,
1364
+
1365
+ 342
1366
+ 00:18:45.759 --> 00:18:46.104
1367
+ robots.
1368
+
1369
+ 343
1370
+ 00:18:46.930 --> 00:19:02.180
1371
+ So we've almost got to 20 minutes and this is actually a success because I wasted 20 minutes of the evening speaking into a microphone and the levels were shot and it was clipping and I said I can't really do an evaluation.
1372
+
1373
+ 344
1374
+ 00:19:02.539 --> 00:19:03.320
1375
+ I have to be fair.
1376
+
1377
+ 345
1378
+ 00:19:03.398 --> 00:19:06.961
1379
+ I have to give the models a chance to do their thing.
1380
+
1381
+ 346
1382
+ 00:19:07.852 --> 00:19:09.430
1383
+ What am I hoping to achieve in this?
1384
+
1385
+ 347
1386
+ 00:19:09.586 --> 00:19:09.789
1387
+ Okay,
1388
+
1389
+ 348
1390
+ 00:19:09.852 --> 00:19:11.352
1391
+ my fine tune was a dud as mentioned.
1392
+
1393
+ 349
1394
+ 00:19:11.977 --> 00:19:12.648
1395
+ Deepgram STT,
1396
+
1397
+ 350
1398
+ 00:19:12.789 --> 00:19:13.180
1399
+ I'm really,
1400
+
1401
+ 351
1402
+ 00:19:13.211 --> 00:19:15.477
1403
+ really hopeful that this prototype will work.
1404
+
1405
+ 352
1406
+ 00:19:16.060 --> 00:19:17.843
1407
+ And it's a built in public open source.
1408
+
1409
+ 353
1410
+ 00:19:17.844 --> 00:19:20.624
1411
+ So anyone is welcome to use it if I make anything good.
1412
+
1413
+ 354
1414
+ 00:19:21.788 --> 00:19:27.515
1415
+ But that was really exciting for me last night when after hours of trying my own prototype,
1416
+
1417
+ 355
1418
+ 00:19:27.593 --> 00:19:31.054
1419
+ seeing someone just made something that works like that,
1420
+
1421
+ 356
1422
+ 00:19:31.451 --> 00:19:31.654
1423
+ you know,
1424
+
1425
+ 357
1426
+ 00:19:31.655 --> 00:19:36.279
1427
+ you're not going to have to build a custom conda environment and image.
1428
+
1429
+ 358
1430
+ 00:19:36.468 --> 00:19:37.482
1431
+ I have an AMD GPU,
1432
+
1433
+ 359
1434
+ 00:19:37.546 --> 00:19:39.811
1435
+ which makes things much more complicated.
1436
+
1437
+ 360
1438
+ 00:19:40.311 --> 00:19:41.029
1439
+ I didn't find it.
1440
+
1441
+ 361
1442
+ 00:19:42.093 --> 00:19:42.843
1443
+ And I was about to give up.
1444
+
1445
+ 362
1446
+ 00:19:42.844 --> 00:19:43.140
1447
+ And I said,
1448
+
1449
+ 363
1450
+ 00:19:43.171 --> 00:19:43.421
1451
+ all right,
1452
+
1453
+ 364
1454
+ 00:19:43.422 --> 00:19:45.468
1455
+ let me just give Deepgram's Linux thing a
1456
+
1457
+ 365
1458
+ 00:19:46.178 --> 00:19:48.265
1459
+ shot, and if it doesn't work,
1460
+
1461
+ 366
1462
+ 00:19:49.027 --> 00:19:53.621
1463
+ I'm just gonna go back to trying to vibe code something myself and when I ran the script
1464
+
1465
+ 367
1466
+ 00:19:54.367 --> 00:19:57.450
1467
+ I was using Claude Code to do the installation process.
1468
+
1469
+ 368
1470
+ 00:19:58.271 --> 00:20:00.114
1471
+ It ran the script and oh my gosh,
1472
+
1473
+ 369
1474
+ 00:20:00.192 --> 00:20:01.195
1475
+ it works just like that.
1476
+
1477
+ 370
1478
+ 00:20:01.977 --> 00:20:10.789
1479
+ The tricky thing, for all those who want to know all the nitty-gritty details, was that
1480
+
1481
+ 371
1482
+ 00:20:11.398 --> 00:20:13.648
1483
+ I don't think it was actually struggling with transcription,
1484
+
1485
+ 372
1486
+ 00:20:13.680 --> 00:20:14.352
1487
+ but pasting,
1488
+
1489
+ 373
1490
+ 00:20:14.884 --> 00:20:17.509
1491
+ Wayland makes life very hard.
1492
+
1493
+ 374
1494
+ 00:20:17.617 --> 00:20:19.634
1495
+ And I think there was something not running at the right time.
1496
+
1497
+ 375
1498
+ 00:20:19.695 --> 00:20:19.977
1499
+ Anyway,
1500
+
1501
+ 376
1502
+ 00:20:20.617 --> 00:20:21.117
1503
+ Deepgram,
1504
+
1505
+ 377
1506
+ 00:20:21.273 --> 00:20:24.134
1507
+ I looked at how they actually handled that because it worked out of the...
1508
+
1509
+ 378
1510
+ 00:20:24.203 --> 00:20:40.180
1511
+ box when other stuff didn't, and it was quite a clever little mechanism. And, but more so than that, the accuracy was brilliant. Now, what am I doing here? This is going to be a 20-minute audio sample, and I'm, I
1512
+
1513
+ 379
1514
+ 00:20:40.181 --> 00:20:52.413
1515
+ think I've done one or two of these before, but I did it with short, snappy voice notes. This is kind of long form. This actually might be a better approximation for what's useful to me than
1516
+
1517
+ 380
1518
+ 00:20:53.144 --> 00:21:09.383
1519
+ voice memos, like, I need to buy three liters of milk tomorrow and pita bread, which is probably how, like, half my voice note, voice notes sound. Like, if anyone were to, I don't know, like, find my phone, they'd be like, this is the most boring person in the world. Although, actually, there are some, like, kind of, uh, journaling
1520
+
1521
+ 381
1522
+ 00:21:09.398 --> 00:21:21.586
1523
+ thoughts as well, but it's a lot of content like that. And the, probably, for the evaluation, the most useful thing is slightly obscure tech: GitHub, nucleano, uh, Hugging Face. Not
1524
+
1525
+ 382
1526
+ 00:21:21.743 --> 00:21:38.417
1527
+ so obscure that it's not going to have a chance of knowing it, but hopefully sufficiently well known that the model should get it. I tried to do a little bit of speaking really fast and speaking very slowly. I would say, in general, I've spoken, delivered this at a faster pace than I usually would, owing to strong
1528
+
1529
+ 383
1530
+ 00:21:38.542 --> 00:21:51.214
1531
+ coffee flowing through my bloodstream. And the thing that I'm not going to get in this benchmark is background noise, which, in my first take that I had to get rid of, my wife came in with my son, and, for a good night kiss.
1532
+
1533
+ 384
1534
+ 00:21:51.675 --> 00:21:58.541
1535
+ And that actually would have been super helpful to get in because it was non-diarized or if we had diarization,
1536
+
1537
+ 385
1538
+ 00:21:59.502 --> 00:21:59.968
1539
+ a female,
1540
+
1541
+ 386
1542
+ 00:22:00.007 --> 00:22:00.443
1543
+ I could say,
1544
+
1545
+ 387
1546
+ 00:22:00.607 --> 00:22:03.171
1547
+ I want the male voice and that wasn't intended for transcription.
1548
+
1549
+ 388
1550
+ 00:22:04.724 --> 00:22:07.029
1551
+ And we're not going to get background noise like people honking their horns,
1552
+
1553
+ 389
1554
+ 00:22:07.146 --> 00:22:13.099
+ which is something I've done in my main data set where I am trying to go back to some of my voice notes,
+
+ 390
+ 00:22:13.818 --> 00:22:15.740
+ annotate them and run a benchmark.
+
+ 391
+ 00:22:15.741 --> 00:22:17.007
+ But this is going to be just a pure,
+
+ 392
+ 00:22:17.788 --> 00:22:20.007
+ quick test and
+
+ 393
+ 00:22:21.152 --> 00:22:24.012
+ As someone working on a voice note idea,
+
+ 394
+ 00:22:24.071 --> 00:22:27.272
+ that's my sort of end motivation,
+
+ 395
+ 00:22:27.332 --> 00:22:31.694
+ besides thinking it's an absolutely outstanding technology that's coming to viability.
+
+ 396
+ 00:22:31.772 --> 00:22:32.172
+ And really,
+
+ 397
+ 00:22:32.211 --> 00:22:33.094
+ I know this sounds cheesy,
+
+ 398
+ 00:22:33.633 --> 00:22:36.336
+ can actually have a very transformative effect.
+
+ 399
+ 00:22:37.272 --> 00:22:37.429
+ It's,
+
+ 400
+ 00:22:37.836 --> 00:22:38.069
+ you know,
+
+ 401
+ 00:22:38.101 --> 00:22:44.897
+ voice technology has been life changing for folks living with disabilities.
+
+ 402
+ 00:22:45.851 --> 00:22:46.258
+ And
+
+ 403
+ 00:22:47.054 --> 00:22:49.851
+ I think there's something really nice about the fact that it can also benefit.
+
+ 404
+ 00:22:50.619 --> 00:22:50.859
+ you know,
+
+ 405
+ 00:22:51.019 --> 00:22:58.787
+ folks who are able-bodied and like we can all in different ways make this tech as useful as possible,
+
+ 406
+ 00:22:59.231 --> 00:23:01.051
+ regardless of the exact way that we're using it.
+
+ 407
+ 00:23:02.490 --> 00:23:05.294
+ And I think there's something very powerful in that and it can be very cool.
+
+ 408
+ 00:23:06.395 --> 00:23:07.451
+ I see huge potential.
+
+ 409
+ 00:23:07.715 --> 00:23:08.934
+ What excites me about voice tech?
+
+ 410
+ 00:23:09.903 --> 00:23:10.512
+ A lot of things,
+
+ 411
+ 00:23:10.576 --> 00:23:10.872
+ actually.
+
+ 412
+ 00:23:12.294 --> 00:23:12.622
+ Firstly,
+
+ 413
+ 00:23:13.028 --> 00:23:14.278
+ the fact that it's cheap and accurate,
+
+ 414
+ 00:23:14.715 --> 00:23:16.122
+ as I mentioned at the very start of this,
+
+ 415
+ 00:23:17.372 --> 00:23:19.809
+ and it's getting better and better with stuff like accent handling.
+
+ 416
+ 00:23:21.053 --> 00:23:25.577
+ I'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day,
+
+ 417
+ 00:23:25.675 --> 00:23:26.878
+ as I imagine.
+
+ 418
+ 00:23:26.880 --> 00:23:27.878
+ I get like superb,
+
+ 419
+ 00:23:28.000 --> 00:23:28.942
+ flawless words,
+
+ 420
+ 00:23:29.058 --> 00:23:29.582
+ error rates,
+
+ 421
+ 00:23:29.597 --> 00:23:34.489
+ because I'm just kind of skeptical about local speech to text,
+
+ 422
+ 00:23:34.847 --> 00:23:35.503
+ as I mentioned.
+
+ 423
+ 00:23:36.105 --> 00:23:36.371
+ And
+
+ 424
+ 00:23:36.792 --> 00:23:40.386
+ I think the pace of innovation and improvement in the models,
+
+ 425
+ 00:23:40.574 --> 00:23:47.511
+ the main reasons for fine tuning from what I've seen have been people who are something that really blows my mind about
+
+ 426
+ 00:23:48.199 --> 00:23:49.278
+ ASR is
+
+ 427
+ 00:23:49.531 --> 00:24:04.644
+ the idea that it's inherently alingual or multilingual phonetic based so as folks who use speak very obscure languages that there may be very there might be a paucity of training data or almost none at all and
+
+ 428
+ 00:24:04.644 --> 00:24:15.738
+ therefore the accuracy is significantly reduced or folks in very critical environments i know there you this is used extensively in medical transcription and dispatcher your work as,
+
+ 429
+ 00:24:15.955 --> 00:24:16.894
+ um,
+
+ 430
+ 00:24:17.195 --> 00:24:17.435
+ you know,
+
+ 431
+ 00:24:17.455 --> 00:24:19.137
+ the call centers who send out ambulances,
+
+ 432
+ 00:24:19.199 --> 00:24:19.618
+ et cetera,
+
+ 433
+ 00:24:20.397 --> 00:24:22.441
+ where accuracy is absolutely paramount.
+
+ 434
+ 00:24:22.660 --> 00:24:24.125
+ And in the case of doctors,
+
+ 435
+ 00:24:24.721 --> 00:24:25.461
+ radiologists,
+
+ 436
+ 00:24:25.461 --> 00:24:28.008
+ they might be using very specialized vocab all the time.
+
+ 437
+ 00:24:28.827 --> 00:24:30.147
+ So those are kind of the main two things.
+
+ 438
+ 00:24:30.148 --> 00:24:37.093
+ And I'm not sure that really just for trying to make it better on a few random tech words with my slightly,
+
+ 439
+ 00:24:37.530 --> 00:24:37.750
+ I mean,
+
+ 440
+ 00:24:37.750 --> 00:24:38.358
+ I have an accent,
+
+ 441
+ 00:24:38.436 --> 00:24:39.218
+ but like not,
+
+ 442
+ 00:24:39.530 --> 00:24:39.797
+ you know,
+
+ 443
+ 00:24:40.233 --> 00:24:43.936
+ an accent that a few other million people have it.
+
+ 444
+ 00:24:44.922 --> 00:24:46.172
+ I'm not sure that.
+
+ 445
+ 00:24:46.579 --> 00:24:56.540
+ my little fine tune is going to actually like the bump in word error reduction if I ever actually figure out how to do it and get it up to the cloud by the time I've done that
+
+ 446
+ 00:24:57.029 --> 00:25:01.308
+ I suspect that the next generation of ASR will just be so good that it will kind of be,
+
+ 447
+ 00:25:02.051 --> 00:25:02.173
+ no,
+
+ 448
+ 00:25:02.430 --> 00:25:02.630
+ well,
+
+ 449
+ 00:25:02.808 --> 00:25:03.833
+ that would have been cool if it worked out,
+
+ 450
+ 00:25:03.872 --> 00:25:05.192
+ but I'll just use this instead.
+
+ 451
+ 00:25:05.972 --> 00:25:11.294
+ So that's going to be it for today's episode of voice training data.
+
+ 452
+ 00:25:12.011 --> 00:25:12.333
+ Single,
+
+ 453
+ 00:25:12.933 --> 00:25:14.028
+ long shot evaluation.
+
+ 454
+ 00:25:14.636 --> 00:25:15.450
+ Who am I going to compare?
+
+ 455
+ 00:25:16.622 --> 00:25:17.855
+ Whisper is always good as a benchmark,
+
+ 456
+ 00:25:17.886 --> 00:25:22.278
+ but I'm more interested in seeing Whisper head-to-head with two things,
+
+ 457
+ 00:25:22.308 --> 00:25:22.511
+ really.
+
+ 458
+ 00:25:23.450 --> 00:25:25.169
+ One is Whisper variants.
+
+ 459
+ 00:25:25.200 --> 00:25:25.950
+ So you've got these...
+
+ 460
+ 00:25:26.178 --> 00:25:44.617
+ projects like faster whisper uh distill whisper it's a bit confusing there's a whole bunch of them and the emerging asrs which are also a thing my intention for this is i'm not sure i'm going to have the time in any point in the foreseeable future to go back through this whole episode and create
+
+ 461
+ 00:25:44.618 --> 00:25:55.430
+ a proper source truth where i fix everything might do it if i can get one transcriptions as sufficiently close to perfection but
+
+ 462
+ 00:25:55.942 --> 00:25:57.241
+ What I would actually love to do on
+
+ 463
+ 00:25:58.102 --> 00:25:58.903
+ Hugging Face,
+
+ 464
+ 00:25:59.021 --> 00:25:59.800
+ I think would be a great,
+
+ 465
+ 00:25:59.984 --> 00:26:08.324
+ probably how I might visualize this is having the audio waveform play and then have the transcript for each model below it.
+
+ 466
+ 00:26:08.824 --> 00:26:09.722
+ And maybe even a,
+
+ 467
+ 00:26:11.144 --> 00:26:11.364
+ like,
+
+ 468
+ 00:26:11.489 --> 00:26:11.722
+ you know,
+
+ 469
+ 00:26:11.871 --> 00:26:15.105
+ two scale and maybe even a local one as well,
+
+ 470
+ 00:26:15.371 --> 00:26:17.903
+ like Local Whisper versus OpenAI API,
+
+ 471
+ 00:26:18.903 --> 00:26:19.449
+ et cetera.
+
+ 472
+ 00:26:19.746 --> 00:26:20.105
+ And...
+
+ 473
+ 00:26:21.238 --> 00:26:30.903
+ I can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't,
+
+ 474
+ 00:26:31.606 --> 00:26:34.090
+ as well as the sort of headline finding of which had the best
+
+ 475
+ 00:26:34.731 --> 00:26:37.372
+ WER, but that would require the source of truth.
+
+ 476
+ 00:26:37.919 --> 00:26:38.090
+ Okay,
+
+ 477
+ 00:26:38.137 --> 00:26:38.434
+ that's it.
+
+ 478
+ 00:26:38.637 --> 00:26:39.372
+ I hope this was,
+
+ 479
+ 00:26:39.622 --> 00:26:39.997
+ I don't know,
+
+ 480
+ 00:26:40.419 --> 00:26:42.403
+ maybe useful for other folks interested in STT.
+
+ 481
+ 00:26:43.106 --> 00:26:43.762
+ You want to see that
+
+ 482
+ 00:26:44.137 --> 00:26:44.919
+ I always feel,
+
+ 483
+ 00:26:45.434 --> 00:26:47.247
+ think I've just said as something I didn't intend to.
+
+ 484
+ 00:26:48.044 --> 00:26:48.481
+ STT,
+
+ 485
+ 00:26:48.872 --> 00:26:49.528
+ I said for those.
+
+ 486
+ 00:26:49.817 --> 00:26:50.378
+ listen carefully,
+
+ 487
+ 00:26:50.419 --> 00:26:52.902
+ including hopefully the models themselves.
+
+ 488
+ 00:26:53.441 --> 00:26:54.163
+ This has been myself,
+
+ 489
+ 00:26:54.304 --> 00:26:54.902
+ Daniel Rosehill.
+
+ 490
+ 00:26:55.022 --> 00:26:59.404
+ For more jumbled repositories about my roving interest in AI,
+
+ 491
+ 00:26:59.507 --> 00:27:00.765
+ but particularly agentic,
+
+ 492
+ 00:27:01.451 --> 00:27:03.015
+ MCP and voice tech,
+
+ 493
+ 00:27:03.413 --> 00:27:04.335
+ you can find me on
+
+ 494
+ 00:27:04.990 --> 00:27:06.749
+ GitHub, Hugging Face,
+
+ 495
+ 00:27:08.279 --> 00:27:08.811
+ where else?
+
+ 496
+ 00:27:09.140 --> 00:27:10.154
+ Danielrosehill.com,
+
+ 497
+ 00:27:10.171 --> 00:27:11.296
+ which is my personal website,
+
+ 498
+ 00:27:11.374 --> 00:27:13.483
+ as well as this podcast,
+
+ 499
+ 00:27:13.624 --> 00:27:15.186
+ whose name I sadly cannot remember.
+
+ 500
+ 00:27:15.936 --> 00:27:16.499
+ Until next time,
+
+ 501
+ 00:27:16.826 --> 00:27:17.343
+ thanks for listening.
srt-out/nova3.srt ADDED
@@ -0,0 +1,2304 @@
+ 1
+ 00:00:00,080 --> 00:00:06,240
+ Hello and welcome to a audio dataset consisting of one
+
+ 2
+ 00:00:06,240 --> 00:00:08,400
+ single episode of a nonexistent podcast.
+
+ 3
+ 00:00:08,800 --> 00:00:12,880
+ Or it I may append this to a podcast that
+
+ 4
+ 00:00:12,880 --> 00:00:18,814
+ I set up recently regarding my with my thoughts on
+
+ 5
+ 00:00:18,815 --> 00:00:20,815
+ speech tech and A.
+
+ 6
+ 00:00:20,815 --> 00:00:21,214
+ I.
+
+ 7
+ 00:00:21,214 --> 00:00:22,814
+ In particular, more A.
+
+ 8
+ 00:00:22,814 --> 00:00:23,054
+ I.
+
+ 9
+ 00:00:23,054 --> 00:00:23,935
+ And generative A.
+
+ 10
+ 00:00:23,935 --> 00:00:24,095
+ I.
+
+ 11
+ 00:00:24,095 --> 00:00:26,494
+ I would I would say.
+
+ 12
+ 00:00:26,814 --> 00:00:30,869
+ But in any event, the purpose of this voice recording
+
+ 13
+ 00:00:30,869 --> 00:00:35,590
+ is actually to create a lengthy voice sample for a
+
+ 14
+ 00:00:35,590 --> 00:00:38,950
+ quick evaluation, a back of the envelope evaluation, they might
+
+ 15
+ 00:00:38,950 --> 00:00:41,429
+ say, for different speech attacks models.
+
+ 16
+ 00:00:41,429 --> 00:00:43,945
+ I'm doing this because I thought I'd made a great
+
+ 17
+ 00:00:43,945 --> 00:00:47,784
+ breakthrough in my journey with speech tech and that was
+
+ 18
+ 00:00:47,784 --> 00:00:51,385
+ succeeding in the elusive task of fine tuning whisper.
+
+ 19
+ 00:00:51,704 --> 00:00:56,424
+ Whisper is, and I'm to just talk, I'm trying to
+
+ 20
+ 00:00:55,829 --> 00:00:56,789
+ mix up.
+
+ 21
+ 00:00:56,869 --> 00:01:00,390
+ I'm going to try a few different styles of speaking
+
+ 22
+ 00:01:00,390 --> 00:01:02,869
+ whisper something at some points as well.
+
+ 23
+ 00:01:03,350 --> 00:01:06,790
+ And I'll go back to speaking loud in in different
+
+ 24
+ 00:01:06,790 --> 00:01:09,030
+ parts are going to sound really like a crazy person
+
+ 25
+ 00:01:09,030 --> 00:01:12,424
+ because I'm also going to try to speak at different
+
+ 26
+ 00:01:12,984 --> 00:01:18,025
+ pitches and cadences in order to really try to push
+
+ 27
+ 00:01:18,344 --> 00:01:21,145
+ a speech to text model through its paces, which is
+
+ 28
+ 00:01:21,145 --> 00:01:24,609
+ trying to make sense of is this guy just rambling
+
+ 29
+ 00:01:24,609 --> 00:01:30,049
+ on incoherently in one long sentence or are these just
+
+ 30
+ 00:01:30,049 --> 00:01:36,450
+ actually a series of step standalone, standalone, standalone sentences?
+
+ 31
+ 00:01:36,450 --> 00:01:38,130
+ And how is it going to handle step alone?
+
+ 32
+ 00:01:38,130 --> 00:01:38,770
+ That's not a word.
+
+ 33
+ 00:01:39,704 --> 00:01:42,025
+ What happens when you use speech to text and you
+
+ 34
+ 00:01:42,025 --> 00:01:43,384
+ use a fake word?
+
+ 35
+ 00:01:43,384 --> 00:01:45,784
+ And then you're like, wait, that's not actually that word
+
+ 36
+ 00:01:45,784 --> 00:01:46,665
+ doesn't exist.
+
+ 37
+ 00:01:46,984 --> 00:01:48,584
+ How does AI handle that?
+
+ 38
+ 00:01:48,584 --> 00:01:53,750
+ And these and more are all the questions that I'm
+
+ 39
+ 00:01:53,750 --> 00:01:55,750
+ seeking to answer in this training data.
+
+ 40
+ 00:01:55,829 --> 00:01:58,549
+ Now, why was I trying to fine tune Whisper?
+
+ 41
+ 00:01:58,549 --> 00:01:59,750
+ And what is Whisper?
+
+ 42
+ 00:01:59,750 --> 00:02:02,710
+ As I said, I'm going to try to record this
+
+ 43
+ 00:02:02,710 --> 00:02:06,644
+ at a couple of different levels of technicality for folks
+
+ 44
+ 00:02:06,644 --> 00:02:11,764
+ who are in the normal world and not totally stuck
+
+ 45
+ 00:02:11,764 --> 00:02:13,764
+ down the rabbit hole of AI, which you have to
+
+ 46
+ 00:02:13,764 --> 00:02:17,685
+ say is a really wonderful rabbit hole to be done.
+
+ 47
+ 00:02:17,844 --> 00:02:20,919
+ It's a really interesting area and speech and voice tech
+
+ 48
+ 00:02:20,919 --> 00:02:24,359
+ is is the aspect of it that I find actually
+
+ 49
+ 00:02:24,359 --> 00:02:27,239
+ most I'm not sure I would say the most interesting
+
+ 50
+ 00:02:27,239 --> 00:02:30,759
+ because there's just so much that is fascinating in AI.
+
+ 51
+ 00:02:31,400 --> 00:02:34,134
+ But the most that I find the most personally transformative
+
+ 52
+ 00:02:34,134 --> 00:02:38,534
+ in terms of the impact that it's had on my
+
+ 53
+ 00:02:38,534 --> 00:02:41,254
+ daily work life and productivity and how I sort of
+
+ 54
+ 00:02:41,254 --> 00:02:41,895
+ work.
+
+ 55
+ 00:02:42,935 --> 00:02:47,500
+ I'm persevering hard with the task of trying to get
+
+ 56
+ 00:02:47,500 --> 00:02:50,939
+ a good solution working for Linux, which if anyone actually
+
+ 57
+ 00:02:50,939 --> 00:02:52,939
+ does listen to this, not just for the training data
+
+ 58
+ 00:02:52,939 --> 00:02:56,700
+ and for the actual content, is sparked.
+
+ 59
+ 00:02:56,700 --> 00:02:59,980
+ I had, besides the fine tune not working, well that
+
+ 60
+ 00:02:59,980 --> 00:03:01,385
+ was the failure.
+
+ 61
+ 00:03:02,504 --> 00:03:06,745
+ I used Claude code because one thinks these days that
+
+ 62
+ 00:03:06,745 --> 00:03:13,280
+ there is nothing short of solving, you know, the the
+
+ 63
+ 00:03:13,280 --> 00:03:17,599
+ reason of life or something that clause and agentic AI
+
+ 64
+ 00:03:17,599 --> 00:03:19,680
+ can't do, which is not really the case.
+
+ 65
+ 00:03:19,680 --> 00:03:23,199
+ It does seem that way sometimes, but it fails a
+
+ 66
+ 00:03:23,199 --> 00:03:23,759
+ lot as well.
+
+ 67
+ 00:03:23,759 --> 00:03:26,639
+ And this is one of those instances where last week
+
+ 68
+ 00:03:26,639 --> 00:03:30,824
+ I put together an hour of voice training data, basically
+
+ 69
+ 00:03:30,824 --> 00:03:33,465
+ speaking just random things for three minutes.
+
+ 70
+ 00:03:35,465 --> 00:03:38,104
+ It was actually kind of tedious because the texts were
+
+ 71
+ 00:03:38,104 --> 00:03:38,664
+ really weird.
+
+ 72
+ 00:03:38,664 --> 00:03:41,370
+ Some of them were, it was like it was AI
+
+ 73
+ 00:03:41,370 --> 00:03:42,250
+ generated.
+
+ 74
+ 00:03:42,569 --> 00:03:44,889
+ I tried before to read Sherlock Holmes for an hour
+
+ 75
+ 00:03:44,889 --> 00:03:47,689
+ and I just couldn't, I was so bored after ten
+
+ 76
+ 00:03:47,689 --> 00:03:50,569
+ minutes that I was like, okay, no, I'm just gonna
+
+ 77
+ 00:03:50,569 --> 00:03:51,930
+ have to find something else to read.
+
+ 78
+ 00:03:51,930 --> 00:03:58,284
+ So I used a created with AI Studio, VibeCoded, a
+
+ 79
+ 00:03:58,284 --> 00:04:03,164
+ synthetic text generator which actually I thought was probably a
+
+ 80
+ 00:04:03,164 --> 00:04:05,245
+ better way of doing it because it would give me
+
+ 81
+ 00:04:05,245 --> 00:04:09,069
+ more short samples with more varied content.
+
+ 82
+ 00:04:09,069 --> 00:04:11,710
+ So I was like, okay, give me a voice note
+
+ 83
+ 00:04:11,710 --> 00:04:14,909
+ like I'm recording an email, give me a short story
+
+ 84
+ 00:04:14,909 --> 00:04:18,189
+ to read, give me prose to read.
+
+ 85
+ 00:04:18,189 --> 00:04:20,634
+ So I came up with all these different things and
+
+ 86
+ 00:04:20,634 --> 00:04:22,714
+ they added a little timer to it so I could
+
+ 87
+ 00:04:22,714 --> 00:04:24,955
+ see how close I was to one hour.
+
+ 88
+ 00:04:25,915 --> 00:04:29,115
+ And I spent like an hour one afternoon or probably
+
+ 89
+ 00:04:29,115 --> 00:04:33,115
+ two hours by the time you do retakes and whatever
+
+ 90
+ 00:04:33,115 --> 00:04:36,169
+ because you want to it gave me a source of
+
+ 91
+ 00:04:36,169 --> 00:04:40,009
+ truth which I'm not sure if that's the scientific way
+
+ 92
+ 00:04:40,009 --> 00:04:44,169
+ to approach this topic of gathering training data but I
+
+ 93
+ 00:04:44,169 --> 00:04:45,449
+ thought made sense.
+
+ 94
+ 00:04:46,490 --> 00:04:49,464
+ I have a lot of audio data from recording voice
+
+ 95
+ 00:04:49,464 --> 00:04:53,544
+ notes which I've also kind of used, been experimenting with
+
+ 96
+ 00:04:53,544 --> 00:04:55,064
+ using for a different purpose.
+
+ 97
+ 00:04:55,384 --> 00:04:58,745
+ Slightly different annotating task types.
+
+ 98
+ 00:04:58,745 --> 00:05:03,250
+ It's more a text classification experiment or Well, it's more
+
+ 99
+ 00:05:03,250 --> 00:05:03,810
+ than that actually.
+
+ 100
+ 00:05:03,810 --> 00:05:05,009
+ I'm working on a voice app.
+
+ 101
+ 00:05:05,009 --> 00:05:09,329
+ So it's a prototype, I guess, is really more accurate.
+
+ 102
+ 00:05:11,409 --> 00:05:13,969
+ But you can do that and you can work backwards.
+
+ 103
+ 00:05:13,969 --> 00:05:18,354
+ Listen back to a voice note and you painfully go
+
+ 104
+ 00:05:18,354 --> 00:05:21,474
+ through one of those transcribing, where you start and stop
+
+ 105
+ 00:05:21,474 --> 00:05:23,634
+ and scrub around it and you fix the errors, but
+
+ 106
+ 00:05:23,634 --> 00:05:25,875
+ it's really, really pouring to do that.
+
+ 107
+ 00:05:26,115 --> 00:05:28,034
+ So I thought it would be less tedious in the
+
+ 108
+ 00:05:28,034 --> 00:05:31,714
+ long term if I just recorded the source of truth.
+
+ 109
+ 00:05:32,069 --> 00:05:34,389
+ So it gave me these three minutes snippets.
+
+ 110
+ 00:05:34,389 --> 00:05:37,509
+ I recorded them and saved an MP3 and a TXT
+
+ 111
+ 00:05:37,750 --> 00:05:40,310
+ in the same folder and I created an error that
+
+ 112
+ 00:05:40,310 --> 00:05:40,949
+ data.
+
+ 113
+ 00:05:41,990 --> 00:05:44,870
+ So I was very hopeful, quietly, a little bit hopeful
+
+ 114
+ 00:05:44,870 --> 00:05:47,029
+ that I would be able, that I could actually fine
+
+ 115
+ 00:05:47,029 --> 00:05:47,750
+ tune Whisper.
+
+ 116
+ 00:05:48,365 --> 00:05:51,085
+ I want to fine tune Whisper because when I got
+
+ 117
+ 00:05:51,085 --> 00:05:55,004
+ into voice tech last November, my wife was in the
+
+ 118
+ 00:05:55,004 --> 00:05:57,245
+ US and I was alone at home.
+
+ 119
+ 00:05:57,324 --> 00:06:01,004
+ And when crazy people like me do really wild things
+
+ 120
+ 00:06:01,004 --> 00:06:03,980
+ like use voice to tech technology.
+
+ 121
+ 00:06:03,980 --> 00:06:06,939
+ That was basically when I started doing it, I didn't
+
+ 122
+ 00:06:06,939 --> 00:06:09,580
+ feel like a crazy person speaking to myself.
+
+ 123
+ 00:06:09,980 --> 00:06:12,780
+ And my expectations weren't that high.
+
+ 124
+ 00:06:13,180 --> 00:06:17,685
+ I'd used speech tech now and again, tried it out.
+
+ 125
+ 00:06:17,685 --> 00:06:18,884
+ I was like, it'd be really cool if you could
+
+ 126
+ 00:06:18,884 --> 00:06:22,404
+ just like speak into your computer and whatever I tried
+
+ 127
+ 00:06:22,404 --> 00:06:25,925
+ out that had Linux support was just, it was not
+
+ 128
+ 00:06:25,925 --> 00:06:26,805
+ good basically.
+
+ 129
+ 00:06:27,365 --> 00:06:29,524
+ And this blew me away from the first go.
+
+ 130
+ 00:06:29,524 --> 00:06:32,339
+ I mean, it wasn't one hundred percent accurate out of
+
+ 131
+ 00:06:32,339 --> 00:06:34,500
+ the box and it took work, but it was good
+
+ 132
+ 00:06:34,500 --> 00:06:36,819
+ enough that there was a solid foundation and it kind
+
+ 133
+ 00:06:36,819 --> 00:06:41,139
+ of passed that pivot point that it's actually worth doing
+
+ 134
+ 00:06:41,139 --> 00:06:41,620
+ this.
+
+ 135
+ 00:06:41,939 --> 00:06:43,939
+ You know, there's a point where it's so like, the
+
+ 136
+ 00:06:43,939 --> 00:06:46,485
+ transcript is you don't have to get one hundred percent
+
+ 137
+ 00:06:46,485 --> 00:06:49,525
+ accuracy for it to be worth your time for speech
+
+ 138
+ 00:06:49,525 --> 00:06:51,925
+ to text to be a worthwhile addition to your productivity.
+
+ 139
+ 00:06:51,925 --> 00:06:53,685
+ But you do need to get above, let's say, I
+
+ 140
+ 00:06:53,685 --> 00:06:55,125
+ don't know, eighty five percent.
+
+ 141
+ 00:06:55,605 --> 00:06:58,805
+ If it's sixty percent or fifty percent, you inevitably say,
+
+ 142
+ 00:06:59,040 --> 00:07:00,319
+ Screw it, I'll just type it.
+
+ 143
+ 00:07:00,319 --> 00:07:03,680
+ Because you end up missing errors in the transcript and
+
+ 144
+ 00:07:03,680 --> 00:07:05,040
+ it becomes actually worse.
+
+ 145
+ 00:07:05,040 --> 00:07:06,720
+ You end up in a worse position than you started
+
+ 146
+ 00:07:06,720 --> 00:07:07,040
+ with it.
+
+ 147
+ 00:07:07,040 --> 00:07:08,240
+ That's been my experience.
+
+ 148
+ 00:07:08,560 --> 00:07:12,480
+ So I was like, Oh, this is actually really, really
+
+ 149
+ 00:07:12,480 --> 00:07:12,960
+ good now.
+
+ 150
+ 00:07:12,960 --> 00:07:13,680
+ How did that happen?
+
+ 151
+ 00:07:13,680 --> 00:07:17,995
+ And the answer is ASR, Whisper being open sourced and
+
+ 152
+ 00:07:18,714 --> 00:07:21,594
+ the transformer architecture, if you want to go back to
+
+ 153
+ 00:07:21,594 --> 00:07:26,394
+ the underpinnings, which really blows my mind and it's on
+
+ 154
+ 00:07:26,394 --> 00:07:29,830
+ my list to read through that paper.
+
+ 155
+ 00:07:30,389 --> 00:07:35,990
+ All you need is attention as attentively as can be
+
+ 156
+ 00:07:35,990 --> 00:07:39,350
+ done with my limited brain because it's super super high
+
+ 157
+ 00:07:39,350 --> 00:07:43,045
+ level stuff, super advanced stuff, mean.
+
+ 158
+ 00:07:43,285 --> 00:07:48,084
+ That I think of all the things that are fascinating
+
+ 159
+ 00:07:48,084 --> 00:07:52,564
+ about the sudden rise in AI and the dramatic capabilities,
+
+ 160
+ 00:07:53,339 --> 00:07:55,419
+ I find it fascinating that few people are like, hang
+
+ 161
+ 00:07:55,419 --> 00:07:58,300
+ on, you've got this thing that can speak to you
+
+ 162
+ 00:07:58,300 --> 00:08:00,060
+ like a chatbot, an LLM.
+
+ 163
+ 00:08:00,620 --> 00:08:02,860
+ And then you've got image generation.
+
+ 164
+ 00:08:02,860 --> 00:08:03,180
+ Okay.
+
+ 165
+ 00:08:03,180 --> 00:08:07,100
+ So firstly, two things on the surface have nothing in
+
+ 166
+ 00:08:07,100 --> 00:08:07,419
+ common.
+
+ 167
+ 00:08:08,365 --> 00:08:12,044
+ So how did that just happen all at the same
+
+ 168
+ 00:08:12,044 --> 00:08:12,285
+ time?
+
+ 169
+ 00:08:12,285 --> 00:08:15,964
+ And then when you extend that further, you're like, Suno.
+
+ 170
+ 00:08:15,964 --> 00:08:19,485
+ You can sing a song and AI will come up
+
+ 171
+ 00:08:19,485 --> 00:08:21,165
+ with an instrumental.
+
+ 172
+ 00:08:21,485 --> 00:08:23,485
+ And then you've got Whisper and you're like, Wait a
+
+ 173
+ 00:08:23,485 --> 00:08:23,725
+ second.
+
+ 174
+ 00:08:24,100 --> 00:08:28,180
+ How did all this stuff If it's all AI, there
+
+ 175
+ 00:08:28,180 --> 00:08:29,540
+ has to be some commonality.
+
+ 176
+ 00:08:29,540 --> 00:08:35,139
+ Otherwise, are totally different technologies on the surface of it.
+
+ 177
+ 00:08:35,220 --> 00:08:39,384
+ And the transformer architecture is, as far as I know,
+
+ 178
+ 00:08:39,384 --> 00:08:40,264
+ the answer.
+
+ 179
+ 00:08:40,264 --> 00:08:42,985
+ And I can't even say, can't even pretend that I
+
+ 180
+ 00:08:42,985 --> 00:08:47,384
+ really understand what the transformer architecture means in-depth.
+
+ 181
+ 00:08:47,384 --> 00:08:49,865
+ But I have scanned this and as I said, I
+
+ 182
+ 00:08:49,865 --> 00:08:52,879
+ want to print it and really kind of think over
+
+ 183
+ 00:08:52,879 --> 00:08:54,160
+ it at some point.
+
+ 184
+ 00:08:54,879 --> 00:08:58,080
+ And I'll probably feel bad about myself, I think, because
+
+ 185
+ 00:08:58,080 --> 00:08:59,679
+ weren't those guys in twenties?
+
+ 186
+ 00:09:00,320 --> 00:09:01,840
+ Like, that's crazy.
+
+ 187
+ 00:09:02,160 --> 00:09:06,160
+ I think I asked ChatGPT once who wrote that paper
+
+ 188
+ 00:09:06,545 --> 00:09:09,264
+ and how old were they when it was published in
+
+ 189
+ 00:09:09,264 --> 00:09:09,825
+ ArcSiv?
+
+ 190
+ 00:09:09,825 --> 00:09:13,105
+ And I was expecting like, I don't know, what do
+
+ 191
+ 00:09:13,105 --> 00:09:13,585
+ you imagine?
+
+ 192
+ 00:09:13,585 --> 00:09:15,665
+ I personally imagine kind of like, you you have these
+
+ 193
+ 00:09:15,665 --> 00:09:19,745
+ breakthroughs during COVID and things like that, where like these
+
+ 194
+ 00:09:19,745 --> 00:09:22,629
+ kind of really obscure scientists who are in their 50s
+
+ 195
+ 00:09:22,629 --> 00:09:26,870
+ and they've just kind of been laboring in labs and
+
+ 196
+ 00:09:26,870 --> 00:09:29,830
+ wearily in writing and publishing in kind of obscure academic
+
+ 197
+ 00:09:29,830 --> 00:09:30,710
+ publications.
+
+ 198
+ 00:09:30,870 --> 00:09:33,669
+ And they finally hit a big or win a Nobel
+
+ 199
+ 00:09:33,669 --> 00:09:36,235
+ Prize and then their household names.
+
797
+ 200
798
+ 00:09:36,634 --> 00:09:38,634
799
+ So that was kind of what I had in mind.
800
+
801
+ 201
802
+ 00:09:38,634 --> 00:09:42,154
803
+ That was the mental image I'd formed of the birth
804
+
805
+ 202
806
+ 00:09:42,154 --> 00:09:42,955
807
+ of ArcSim.
808
+
809
+ 203
810
+ 00:09:42,955 --> 00:09:45,595
811
+ Like I wasn't expecting twenty somethings in San Francisco.
812
+
813
+ 204
814
+ 00:09:45,595 --> 00:09:48,794
815
+ I thought that was both very funny, very cool, and
816
+
817
+ 205
818
+ 00:09:48,794 --> 00:09:50,075
819
+ actually kind of inspiring.
820
+
821
+ 206
822
+ 00:09:50,554 --> 00:09:55,230
823
+ It's nice to think that people who just you might
824
+
825
+ 207
826
+ 00:09:55,230 --> 00:09:58,509
827
+ put them in the kind of milieu or bubble or
828
+
829
+ 208
830
+ 00:09:58,509 --> 00:10:02,669
831
+ world that you are in incredibly in through a series
832
+
833
+ 209
+ 00:10:02,669 --> 00:10:05,835
+ of connections that are coming up with such literally world
+
+ 210
+ 00:10:05,835 --> 00:10:07,835
+ changing innovations.
+
+ 211
+ 00:10:07,914 --> 00:10:11,274
+ So that was I thought anyway, that's that that was
+
+ 212
+ 00:10:11,274 --> 00:10:11,835
+ cool.
+
+ 213
+ 00:10:12,235 --> 00:10:12,554
+ Okay.
+
+ 214
+ 00:10:12,554 --> 00:10:13,434
+ Voice training data.
+
+ 215
+ 00:10:13,434 --> 00:10:14,154
+ How are we doing?
+
+ 216
+ 00:10:14,154 --> 00:10:17,355
+ We're about ten minutes, and I'm still talking about voice
+
+ 217
+ 00:10:17,355 --> 00:10:18,235
+ technology.
+
+ 218
+ 00:10:18,634 --> 00:10:22,179
+ So Whisper was brilliant, and I was so excited that
+
+ 219
+ 00:10:22,179 --> 00:10:25,860
+ my first instinct was to guess, like, Oh my gosh,
+
+ 220
+ 00:10:25,860 --> 00:10:28,019
+ I have to get a really good microphone for this.
+
+ 221
+ 00:10:28,179 --> 00:10:31,379
+ So I didn't go on a spending spree because I
+
+ 222
+ 00:10:31,379 --> 00:10:33,299
+ said, I'm gonna have to just wait a month and
+
+ 223
+ 00:10:33,299 --> 00:10:34,740
+ see if I still use this.
+
+ 224
+ 00:10:35,220 --> 00:10:38,875
+ And it just kind of became it's become really part
+
+ 225
+ 00:10:38,875 --> 00:10:40,955
+ of my daily routine.
+
+ 226
+ 00:10:41,754 --> 00:10:44,315
+ Like if I'm writing an email, I'll record a voice
+
+ 227
+ 00:10:44,315 --> 00:10:47,595
+ note and then I've developed and it's nice to see
+
+ 228
+ 00:10:47,595 --> 00:10:50,759
+ that everyone is like developing the same things in parallel.
+
+ 229
+ 00:10:50,759 --> 00:10:53,399
+ That's kind of a weird thing to say, when I
+
+ 230
+ 00:10:53,399 --> 00:11:00,279
+ started working on these prototypes on GitHub, which is where
+
+ 231
+ 00:11:00,279 --> 00:11:04,039
+ I just kind of share very freely and loosely ideas
+
+ 232
+ 00:11:04,039 --> 00:11:06,945
+ and first iterations on concepts.
+
+ 233
+ 00:11:09,024 --> 00:11:10,704
+ And for want of a better word, I called it
+
+ 234
+ 00:11:10,704 --> 00:11:14,945
+ like LLM post processing or clean up or basically a
+
+ 235
+ 00:11:14,945 --> 00:11:17,745
+ system prompt that after you get back the raw text
+
+ 236
+ 00:11:17,745 --> 00:11:21,620
+ from Whisper, you run it through a model and say,
+
+ 237
+ 00:11:21,620 --> 00:11:26,339
+ okay, this is crappy text like add sentence structure and,
+
+ 238
+ 00:11:26,339 --> 00:11:27,459
+ you know, fix it up.
+
+ 239
+ 00:11:27,860 --> 00:11:32,579
+ And now when I'm exploring the different tools that are
+
+ 240
+ 00:11:32,579 --> 00:11:35,634
+ out there that people have built, I see quite a
+
+ 241
+ 00:11:35,634 --> 00:11:39,475
+ number of projects have basically done the same thing.
+
+ 242
+ 00:11:40,754 --> 00:11:43,235
+ Lest that be misconstrued, I'm not saying for a millisecond
+
+ 243
+ 00:11:43,235 --> 00:11:44,595
+ that I inspired them.
+
+ 244
+ 00:11:44,595 --> 00:11:48,034
+ I'm sure this has been a thing that's been integrated
+
+ 245
+ 00:11:48,034 --> 00:11:51,290
+ into tools for a while, but it's the kind of
+
+ 246
+ 00:11:51,290 --> 00:11:53,690
+ thing that when you start using these tools every day,
+
+ 247
+ 00:11:53,690 --> 00:11:57,610
+ the need for it is almost instantly apparent because text
+
+ 248
+ 00:11:57,610 --> 00:12:01,529
+ that doesn't have any punctuation or paragraph spacing takes a
+
+ 249
+ 00:12:01,529 --> 00:12:03,965
+ long time to, you know, it takes so long to
+
+ 250
+ 00:12:03,965 --> 00:12:09,004
+ get it into a presentable email that again, moves speech
+
+ 251
+ 00:12:09,004 --> 00:12:13,085
+ tech into that before that inflection point where you're like,
+
+ 252
+ 00:12:13,085 --> 00:12:13,965
+ nah, it's just not worth it.
+
+ 253
+ 00:12:13,965 --> 00:12:16,924
+ It's like, it'll just be quicker to type this.
+
+ 254
+ 00:12:17,279 --> 00:12:19,840
+ So it's a big, it's a little touch that actually
+
+ 255
+ 00:12:20,080 --> 00:12:21,200
+ is a big deal.
+
+ 256
+ 00:12:21,519 --> 00:12:25,440
+ So I was on Whisper and I've been using Whisper
+
+ 257
+ 00:12:25,440 --> 00:12:27,759
+ and I kind of early on found a couple of
+
+ 258
+ 00:12:27,759 --> 00:12:28,399
+ tools.
+
+ 259
+ 00:12:28,399 --> 00:12:30,639
+ I couldn't find what I was looking for on Linux,
+
+ 260
+ 00:12:30,639 --> 00:12:35,924
+ which is basically just something that'll run in the background.
+
+ 261
+ 00:12:35,924 --> 00:12:38,245
+ You'll give it an API key and it will just
+
+ 262
+ 00:12:38,245 --> 00:12:43,044
+ like transcribe with like a little key to start and
+
+ 263
+ 00:12:43,044 --> 00:12:43,845
+ stop the dictation.
+
+ 264
+ 00:12:45,080 --> 00:12:48,440
+ And the issues where I discovered that like most people
+
+ 265
+ 00:12:48,440 --> 00:12:52,040
+ involved in creating these projects were very much focused on
+
+ 266
+ 00:12:52,040 --> 00:12:55,800
+ local models, running Whisper locally because you can.
+
+ 267
+ 00:12:56,279 --> 00:12:58,200
+ And I tried that a bunch of times and just
+
+ 268
+ 00:12:58,200 --> 00:13:01,054
+ never got results that were as good as the cloud.
+
+ 269
+ 00:13:01,455 --> 00:13:03,615
+ And when I began looking at the cost of the
+
+ 270
+ 00:13:03,615 --> 00:13:06,654
+ speech to text APIs and what I was spending, I
+
+ 271
+ 00:13:06,654 --> 00:13:09,855
+ just thought there is it's actually, in my opinion, just
+
+ 272
+ 00:13:09,855 --> 00:13:13,160
+ one of the better deals in API spending in the
+
+ 273
+ 00:13:13,160 --> 00:13:13,480
+ cloud.
+
+ 274
+ 00:13:13,480 --> 00:13:15,720
+ Like, it's just not that expensive for very, very good
+
+ 275
+ 00:13:15,720 --> 00:13:19,639
+ models that are much more, you know, you're gonna be
+
+ 276
+ 00:13:19,639 --> 00:13:22,759
+ able to run the full model, the latest model versus
+
+ 277
+ 00:13:22,759 --> 00:13:26,605
+ whatever you can run on your average GPU unless you
+
+ 278
+ 00:13:26,605 --> 00:13:28,845
+ want to buy a crazy GPU.
+
+ 279
+ 00:13:28,845 --> 00:13:30,044
+ It doesn't really make sense to me.
+
+ 280
+ 00:13:30,044 --> 00:13:33,164
+ Privacy is another concern that I know is kind of
+
+ 281
+ 00:13:33,164 --> 00:13:35,325
+ like a very much a separate thing that people just
+
+ 282
+ 00:13:35,325 --> 00:13:38,845
+ don't want their voice data and their voice leaving their
+
+ 283
+ 00:13:38,845 --> 00:13:42,460
+ local environment maybe for regulatory reasons as well.
+
+ 284
+ 00:13:42,700 --> 00:13:43,980
+ But I'm not in that.
+
+ 285
+ 00:13:44,220 --> 00:13:48,540
+ I neither really care about people listening to my, grocery
+
+ 286
+ 00:13:48,540 --> 00:13:51,580
+ list, consisting of, reminding myself that I need to buy
+
+ 287
+ 00:13:51,580 --> 00:13:54,779
+ more beer, Cheetos, and hummus, which is kind of the
+
+ 288
+ 00:13:55,334 --> 00:13:59,574
+ three staples of my diet during periods of poor nutrition.
+
+ 289
+ 00:13:59,894 --> 00:14:02,375
+ But the kind of stuff that I transcribe, it's just
+
+ 290
+ 00:14:02,375 --> 00:14:02,694
+ not.
+
+ 291
+ 00:14:02,694 --> 00:14:07,814
+ It's not a privacy thing I'm that sort of sensitive
+
+ 292
+ 00:14:07,814 --> 00:14:13,269
+ about and I don't do anything so sensitive or secure
+
+ 293
+ 00:14:13,269 --> 00:14:14,790
+ that requires air-gapping.
+
+ 294
+ 00:14:15,670 --> 00:14:17,590
+ I looked at the pricing and especially the kind of
+
+ 295
+ 00:14:17,590 --> 00:14:18,950
+ older model mini.
+
+ 296
+ 00:14:19,590 --> 00:14:21,910
+ Some of them are very, very affordable and I did
+
+ 297
+ 00:14:21,910 --> 00:14:26,764
+ a calculation once with ChatGPT and I was like, okay,
+
+ 298
+ 00:14:26,764 --> 00:14:30,365
+ this is the API price for I can't remember whatever
+
+ 299
+ 00:14:30,365 --> 00:14:31,404
+ the model was.
+
+ 300
+ 00:14:31,804 --> 00:14:34,445
+ Let's say I just go at it like nonstop, which
+
+ 301
+ 00:14:34,445 --> 00:14:35,565
+ rarely happens.
+
+ 302
+ 00:14:35,644 --> 00:14:38,959
+ Probably, I would say on average I might dictate thirty
+
+ 303
+ 00:14:38,959 --> 00:14:41,759
+ to sixty minutes per day if I was probably summing
+
+ 304
+ 00:14:41,759 --> 00:14:48,000
+ up the emails, documents, outlines, which is a lot, but
+
+ 305
+ 00:14:48,000 --> 00:14:50,159
+ it's it's still a fairly modest amount.
+
+ 306
+ 00:14:50,159 --> 00:14:51,839
+ And I was like, well, some days I do go
+
+ 307
+ 00:14:51,839 --> 00:14:54,934
+ on like one or two days where I've been usually
+
+ 308
+ 00:14:54,934 --> 00:14:56,855
+ when I'm like kind of out of the house and
+
+ 309
+ 00:14:56,855 --> 00:15:00,535
+ just have something like I have nothing else to do.
+
+ 310
+ 00:15:00,535 --> 00:15:03,175
+ Like if I'm at a hospital, we have a newborn
+
+ 311
+ 00:15:03,575 --> 00:15:07,299
+ and you're waiting for like eight hours and hours for
+
+ 312
+ 00:15:07,299 --> 00:15:08,100
+ an appointment.
+
+ 313
+ 00:15:08,179 --> 00:15:12,019
+ And I would probably have listened to podcasts before becoming
+
+ 314
+ 00:15:12,019 --> 00:15:12,980
+ a speech fanatic.
+
+ 315
+ 00:15:12,980 --> 00:15:15,379
+ And I'm like, Oh, wait, let me just get down.
+
+ 316
+ 00:15:15,379 --> 00:15:17,379
+ Let me just get these ideas out of my head.
+
+ 317
+ 00:15:17,540 --> 00:15:20,745
+ And that's when I'll go on my speech binges.
+
+ 318
+ 00:15:20,745 --> 00:15:22,664
+ But those are like once every few months, like not
+
+ 319
+ 00:15:22,664 --> 00:15:23,544
+ frequently.
+
+ 320
+ 00:15:23,784 --> 00:15:25,784
+ But I said, okay, let's just say if I'm going
+
+ 321
+ 00:15:25,784 --> 00:15:28,184
+ to price out cloud STT.
+
+ 322
+ 00:15:28,985 --> 00:15:33,500
+ If I was like dedicated every second of every waking
+
+ 323
+ 00:15:33,500 --> 00:15:37,820
+ hour to transcribing for some odd reason, I mean I'd
+
+ 324
+ 00:15:37,820 --> 00:15:39,820
+ have to eat and use the toilet.
+
+ 325
+ 00:15:40,540 --> 00:15:42,700
+ There's only so many hours I'm awake for.
+
+ 326
+ 00:15:42,700 --> 00:15:47,019
+ So let's just say a maximum of forty five minutes
+
+ 327
+ 00:15:47,205 --> 00:15:49,205
+ in the hour, then I said, All right, let's just
+
+ 328
+ 00:15:49,205 --> 00:15:50,165
+ say fifty.
+
+ 329
+ 00:15:50,644 --> 00:15:51,365
+ Who knows?
+
+ 330
+ 00:15:51,365 --> 00:15:52,804
+ You're dictating on the toilet.
+
+ 331
+ 00:15:52,804 --> 00:15:53,605
+ We do it.
+
+ 332
+ 00:15:53,924 --> 00:15:56,884
+ So you could just do sixty, but whatever I did
+
+ 333
+ 00:15:57,125 --> 00:16:01,179
+ and every day, like you're going flat out seven days
+
+ 334
+ 00:16:01,179 --> 00:16:02,620
+ a week dictating nonstop.
+
+ 335
+ 00:16:02,620 --> 00:16:05,579
+ I was like, What's my monthly API bill going to
+
+ 336
+ 00:16:05,579 --> 00:16:06,700
+ be at this price?
+
+ 337
+ 00:16:06,779 --> 00:16:09,339
+ And it came out to like seventy or eighty bucks.
+
+ 338
+ 00:16:09,339 --> 00:16:12,620
+ And I was like, Well, that would be an extraordinary
+
+ 339
+ 00:16:12,940 --> 00:16:14,379
+ amount of dictation.
+
+ 340
+ 00:16:14,379 --> 00:16:18,105
+ And I would hope that there was some compelling reason
+
+ 341
+ 00:16:18,745 --> 00:16:21,784
+ worth more than seventy dollars that I embarked upon that
+
+ 342
+ 00:16:21,784 --> 00:16:22,424
+ project.
+
+ 343
+ 00:16:22,664 --> 00:16:24,585
+ So given that that's kind of the max point for
+
+ 344
+ 00:16:24,585 --> 00:16:27,304
+ me I said that's actually very very affordable.
+
+ 345
+ 00:16:28,024 --> 00:16:30,504
+ Now you're gonna if you want to spec out the
+
+ 346
+ 00:16:30,504 --> 00:16:33,909
+ costs and you want to do the post processing that
+
+ 347
+ 00:16:33,909 --> 00:16:36,789
+ I really do feel is valuable, that's going to cost
+
+ 348
+ 00:16:36,789 --> 00:16:37,750
+ some more as well.
+
+ 349
+ 00:16:38,070 --> 00:16:43,269
+ Unless you're using Gemini, which needless to say I'm a
+
+ 350
+ 00:16:43,269 --> 00:16:45,190
+ random person sitting in Jerusalem.
+
+ 351
+ 00:16:45,855 --> 00:16:49,455
+ I have no affiliation nor with Google nor Anthropic nor
+
+ 352
+ 00:16:49,455 --> 00:16:52,414
+ Gemini nor any major tech vendor for that matter.
+
+ 353
+ 00:16:53,855 --> 00:16:57,215
+ I like Gemini not so much as an everyday model.
+
+ 354
+ 00:16:57,455 --> 00:16:59,934
+ It's kind of underwhelmed in that respect, I would say.
+
+ 355
+ 00:17:00,379 --> 00:17:02,779
+ But for multimodal, I think it's got a lot to
+
+ 356
+ 00:17:02,779 --> 00:17:03,339
+ offer.
+
+ 357
+ 00:17:03,659 --> 00:17:07,179
+ And I think that the transcribing functionality whereby it can,
+
+ 358
+ 00:17:08,059 --> 00:17:12,380
+ process audio with a system prompt and both give you
+
+ 359
+ 00:17:12,380 --> 00:17:13,900
+ transcription that's cleaned up.
+
+ 360
+ 00:17:13,900 --> 00:17:15,339
+ That reduces two steps to one.
+
+ 361
+ 00:17:15,835 --> 00:17:18,954
+ And that for me is a very, very big deal.
+
+ 362
+ 00:17:18,955 --> 00:17:22,474
+ And I feel like even Google hasn't really sort of
+
+ 363
+ 00:17:22,555 --> 00:17:27,195
+ thought through how useful that modality is and what
+
+ 364
+ 00:17:27,195 --> 00:17:29,700
+ kind of use cases you can achieve with it.
+
+ 365
+ 00:17:29,700 --> 00:17:32,339
+ Because I found in the course of this year just
+
+ 366
+ 00:17:32,339 --> 00:17:38,019
+ an endless list of really kind of system prompt stuff
+
+ 367
+ 00:17:38,019 --> 00:17:40,900
+ that I can say, okay, I've used it to capture
+
+ 368
+ 00:17:40,900 --> 00:17:44,115
+ context data for AI, which is literally I might speak
+
+ 369
+ 00:17:44,115 --> 00:17:46,755
+ for if I wanted to have a good bank of
+
+ 370
+ 00:17:46,755 --> 00:17:50,035
+ context data about who knows my childhood.
+
+ 371
+ 00:17:50,434 --> 00:17:54,355
+ More realistically, maybe my career goals, something that would just
+
+ 372
+ 00:17:54,355 --> 00:17:56,195
+ be like really boring to type out.
+
+ 373
+ 00:17:56,195 --> 00:18:00,500
+ So I'll just like sit in my car and record
+
+ 374
+ 00:18:00,500 --> 00:18:01,460
+ it for ten minutes.
+
+ 375
+ 00:18:01,460 --> 00:18:03,779
+ And that ten minutes you get a lot of information
+
+ 376
+ 00:18:03,779 --> 00:18:04,419
+ in.
+
+ 377
+ 00:18:05,619 --> 00:18:07,700
+ Emails, which is short text.
+
+ 378
+ 00:18:08,660 --> 00:18:10,419
+ Just there is a whole bunch.
+
+ 379
+ 00:18:10,420 --> 00:18:13,375
+ And all these workflows kind of require a little bit
+
+ 380
+ 00:18:13,375 --> 00:18:15,134
+ of treatment afterwards and different treatment.
+
+ 381
+ 00:18:15,134 --> 00:18:18,414
+ My context pipeline is kind of like just extract the
+
+ 382
+ 00:18:18,414 --> 00:18:19,295
+ bare essentials.
+
+ 383
+ 00:18:19,295 --> 00:18:22,174
+ You end up with me talking very loosely about sort
+
+ 384
+ 00:18:22,174 --> 00:18:24,494
+ of what I've done in my career, where I've worked,
+
+ 385
+ 00:18:24,494 --> 00:18:25,454
+ where I might like to work.
+
+ 386
+ 00:18:26,000 --> 00:18:29,119
+ And it goes, it condenses that down to very robotic
+
+ 387
+ 00:18:29,119 --> 00:18:32,720
+ language that is easy to chunk, parse, and maybe put
+
+ 388
+ 00:18:32,720 --> 00:18:34,000
+ into a vector database.
+
+ 389
+ 00:18:34,000 --> 00:18:36,240
+ Daniel has worked in technology.
+
+ 390
+ 00:18:36,240 --> 00:18:39,840
+ Daniel has been working in, you know, stuff like that.
+
+ 391
+ 00:18:39,840 --> 00:18:43,055
+ That's not how you would speak, but I figure it's
+
+ 392
+ 00:18:43,055 --> 00:18:46,494
+ probably easier to parse for, after all, robots.
+
+ 393
+ 00:18:46,815 --> 00:18:48,734
+ So we've almost got to twenty minutes and this is
+
+ 394
+ 00:18:48,734 --> 00:18:53,134
+ actually a success because I wasted twenty minutes of my
+
+ 395
+ 00:18:53,535 --> 00:18:57,200
+ of the evening speaking into a microphone and the
+
+ 396
+ 00:18:57,200 --> 00:19:01,119
+ levels were shot and was clipping and I said I
+
+ 397
+ 00:19:01,119 --> 00:19:02,400
+ can't really do an evaluation.
+
+ 398
+ 00:19:02,400 --> 00:19:03,440
+ I have to be fair.
+
+ 399
+ 00:19:03,440 --> 00:19:06,400
+ I have to give the models a chance to do
+
+ 400
+ 00:19:06,400 --> 00:19:06,960
+ their thing.
+
+ 401
+ 00:19:07,505 --> 00:19:09,585
+ What am I hoping to achieve in this?
+
+ 402
+ 00:19:09,585 --> 00:19:11,664
+ Okay, my fine tune was a dud as mentioned.
+
+ 403
+ 00:19:11,745 --> 00:19:15,265
+ Deepgram STT, I'm really, really hopeful that this prototype will
+
+ 404
+ 00:19:15,265 --> 00:19:18,065
+ work and it's a build in public open source so
+
+ 405
+ 00:19:18,065 --> 00:19:20,384
+ anyone is welcome to use it if I make anything
+
+ 406
+ 00:19:20,384 --> 00:19:20,705
+ good.
+
+ 407
+ 00:19:21,640 --> 00:19:23,880
+ But that was really exciting for me last night when
+
+ 408
+ 00:19:23,880 --> 00:19:28,920
+ after hours of trying my own prototype, seeing someone just
+
+ 409
+ 00:19:28,920 --> 00:19:32,119
+ made something that works like that, you you're not gonna
+
+ 410
+ 00:19:32,119 --> 00:19:36,454
+ have to build a custom conda environment and image.
+
+ 411
+ 00:19:36,454 --> 00:19:40,054
+ I have an AMD GPU which makes things much more complicated.
+
+ 412
+ 00:19:40,294 --> 00:19:42,694
+ I didn't find it and I was about to give
+
+ 413
+ 00:19:42,694 --> 00:19:43,974
+ up and I said, All right, let me just give
+
+ 414
+ 00:19:43,974 --> 00:19:46,535
+ Deepgram's Linux thing a shot.
+
+ 415
+ 00:19:47,109 --> 00:19:49,669
+ And if this doesn't work, I'm just gonna go back
+
+ 416
+ 00:19:49,669 --> 00:19:51,429
+ to trying to vibe code something myself.
+
+ 417
+ 00:19:51,750 --> 00:19:55,589
+ And when I ran the script, I was using Claude
+
+ 418
+ 00:19:55,589 --> 00:19:59,109
+ Code to do the installation process, it ran the script
+
+ 419
+ 00:19:59,109 --> 00:20:01,269
+ and, oh my gosh, it works just like that.
+
+ 420
+ 00:20:01,904 --> 00:20:06,065
+ The tricky thing for all those who want to know
+
+ 421
+ 00:20:06,065 --> 00:20:11,505
+ all the nitty, ditty, nitty gritty details was that I
+
+ 422
+ 00:20:11,505 --> 00:20:14,704
+ don't think it was actually struggling with transcription, but pasting
+
+ 423
+ 00:20:14,785 --> 00:20:17,619
+ Wayland makes life very hard.
+
+ 424
+ 00:20:17,619 --> 00:20:19,220
+ And I think there was something not running at the
+
+ 425
+ 00:20:19,220 --> 00:20:19,779
+ right time.
+
+ 426
+ 00:20:19,779 --> 00:20:23,059
+ Anyway, Deepgram, I looked at how they actually handle that
+
+ 427
+ 00:20:23,059 --> 00:20:25,220
+ because it worked out of the box when other stuff
+
+ 428
+ 00:20:25,220 --> 00:20:25,859
+ didn't.
+
+ 429
+ 00:20:26,180 --> 00:20:28,980
+ And it was quite a clever little mechanism.
+
+ 430
+ 00:20:29,575 --> 00:20:32,215
+ And but more so than that, the accuracy was brilliant.
+
+ 431
+ 00:20:32,215 --> 00:20:33,654
+ Now what am I what am I doing here?
+
+ 432
+ 00:20:33,654 --> 00:20:37,255
+ This is gonna be a twenty minute audio sample.
+
+ 433
+ 00:20:38,455 --> 00:20:42,490
+ And I'm I think I've done one or two of
+
+ 434
+ 00:20:42,490 --> 00:20:47,210
+ these before, but I did it with short, snappy voice
+
+ 435
+ 00:20:47,210 --> 00:20:47,690
+ notes.
+
+ 436
+ 00:20:47,690 --> 00:20:49,450
+ This is kind of long form.
+
+ 437
+ 00:20:49,529 --> 00:20:52,009
+ This actually might be a better approximation for what's useful
+
+ 438
+ 00:20:52,009 --> 00:20:53,929
+ to me than voice memos.
+
+ 439
+ 00:20:53,929 --> 00:20:56,974
+ Like, I need to buy three liters of milk tomorrow
+
+ 440
+ 00:20:56,974 --> 00:21:00,255
+ and pita bread, which is probably how half my voice
+
+ 441
+ 00:21:00,255 --> 00:21:00,815
+ notes sound.
+
+ 442
+ 00:21:00,815 --> 00:21:04,174
+ Like if anyone were to find my phone they'd be
+
+ 443
+ 00:21:04,174 --> 00:21:06,014
+ like this is the most boring person in the world.
+
+ 444
+ 00:21:06,095 --> 00:21:10,130
+ Although actually there are some journaling thoughts as well, but
+
+ 445
+ 00:21:10,130 --> 00:21:11,890
+ it's a lot of content like that.
+
+ 446
+ 00:21:11,890 --> 00:21:14,690
+ And the probably for the evaluation, the most useful thing
+
+ 447
+ 00:21:14,690 --> 00:21:21,914
+ is slightly obscure tech, GitHub, Nucleano, Hugging Face, not so
+
+ 448
+ 00:21:21,914 --> 00:21:24,554
+ obscure that it's not gonna have a chance of knowing
+
+ 449
+ 00:21:24,554 --> 00:21:27,274
+ it, but hopefully sufficiently well known that the model should
+
+ 450
+ 00:21:27,274 --> 00:21:27,914
+ get it.
+
+ 451
+ 00:21:27,994 --> 00:21:30,075
+ I tried to do a little bit of speaking really
+
+ 452
+ 00:21:30,075 --> 00:21:32,474
+ fast and speaking very slowly.
+
+ 453
+ 00:21:32,474 --> 00:21:35,609
+ Would say in general, I've spoken, delivered this at a
+
+ 454
+ 00:21:35,609 --> 00:21:39,210
+ faster pace than I usually would owing to strong coffee
+
+ 455
+ 00:21:39,210 --> 00:21:40,650
+ flowing through my bloodstream.
+
+ 456
+ 00:21:41,210 --> 00:21:43,609
+ And the thing that I'm not gonna get in this
+
+ 457
+ 00:21:43,609 --> 00:21:46,170
+ benchmark is background noise, which in my first take that
+
+ 458
+ 00:21:46,170 --> 00:21:48,535
+ I had to get rid of, my wife came in
+
+ 459
+ 00:21:48,535 --> 00:21:51,575
+ with my son and for a good night kiss.
+
+ 460
+ 00:21:51,654 --> 00:21:55,174
+ And that actually would have been super helpful to get
+
+ 461
+ 00:21:55,174 --> 00:21:57,894
+ in because it was non diarized or if we had
+
+ 462
+ 00:21:57,894 --> 00:21:58,775
+ diarization.
+
+ 463
+ 00:21:59,414 --> 00:22:01,494
+ A female, I could say, I want the male voice
+
+ 464
+ 00:22:01,494 --> 00:22:03,174
+ and that wasn't intended for transcription.
+
+ 465
+ 00:22:04,589 --> 00:22:06,349
+ And we're not going to get background noise like people
+
+ 466
+ 00:22:06,349 --> 00:22:09,069
+ honking their horns, which is something I've done in my
+
+ 467
+ 00:22:09,230 --> 00:22:11,950
+ main data set where I am trying to go back
+
+ 468
+ 00:22:11,950 --> 00:22:15,150
+ to some of my voice notes, annotate them and run
+
+ 469
+ 00:22:15,150 --> 00:22:15,789
+ a benchmark.
+
+ 470
+ 00:22:15,789 --> 00:22:18,345
+ But this is going to be just a pure quick
+
+ 471
+ 00:22:18,345 --> 00:22:19,144
+ test.
+
+ 472
+ 00:22:19,865 --> 00:22:24,105
+ And as someone I'm working on a voice note idea.
+
+ 473
+ 00:22:24,105 --> 00:22:28,265
+ That's my sort of end motivation besides thinking it's an
+
+ 474
+ 00:22:28,265 --> 00:22:31,865
+ absolutely outstanding technology that's coming to viability.
+
+ 475
+ 00:22:31,865 --> 00:22:34,480
+ And really, I know this sounds cheesy, can actually have
+
+ 476
+ 00:22:34,480 --> 00:22:36,559
+ a very transformative effect.
+
+ 477
+ 00:22:38,000 --> 00:22:43,200
+ Voice technology has been life changing for folks living with
+
+ 478
+ 00:22:44,079 --> 00:22:45,119
+ disabilities.
+
+ 479
+ 00:22:46,000 --> 00:22:48,625
+ And I think there's something really nice about the fact
+
+ 480
+ 00:22:48,625 --> 00:22:52,625
+ that it can also benefit folks who are able-bodied and
+
+ 481
+ 00:22:52,625 --> 00:22:57,984
+ we can all in different ways make this tech as
+
+ 482
+ 00:22:57,984 --> 00:23:00,785
+ useful as possible regardless of the exact way that we're
+
+ 483
+ 00:23:00,785 --> 00:23:01,105
+ using it.
+
+ 484
+ 00:23:02,279 --> 00:23:04,519
+ And I think there's something very powerful in that, and
+
+ 485
+ 00:23:04,519 --> 00:23:05,639
+ it can be very cool.
+
+ 486
+ 00:23:06,200 --> 00:23:07,639
+ I see huge potential.
+
+ 487
+ 00:23:07,639 --> 00:23:09,399
+ What excites me about voice tech?
+
+ 488
+ 00:23:09,799 --> 00:23:11,239
+ A lot of things actually.
+
+ 489
+ 00:23:12,200 --> 00:23:14,919
+ Firstly, the fact that it's cheap and accurate, as I
+
+ 490
+ 00:23:14,919 --> 00:23:17,865
+ mentioned at the very start of this, and it's getting
+
+ 491
+ 00:23:17,865 --> 00:23:20,184
+ better and better with stuff like accent handling.
+
+ 492
+ 00:23:20,825 --> 00:23:23,384
+ I'm not sure my fine tune will actually ever come
+
+ 493
+ 00:23:23,384 --> 00:23:25,305
+ to fruition in the sense that I'll use it day
+
+ 494
+ 00:23:25,305 --> 00:23:26,664
+ to day as I imagine.
+
+ 495
+ 00:23:26,744 --> 00:23:30,585
+ I get like superb, flawless word error rates because I'm
+
+ 496
+ 00:23:30,585 --> 00:23:35,029
+ just kind of skeptical about local speech to text, as
+
+ 497
+ 00:23:35,029 --> 00:23:35,750
+ I mentioned.
+
+ 498
+ 00:23:36,150 --> 00:23:39,910
+ And I think the pace of innovation and improvement in
+
+ 499
+ 00:23:39,910 --> 00:23:42,390
+ the models, the main reasons for fine tuning from what
+
+ 500
+ 00:23:42,390 --> 00:23:46,230
+ I've seen have been people who are something that really
+
+ 501
+ 00:23:46,230 --> 00:23:50,455
+ blows blows my mind about ASR is the idea that
+
+ 502
+ 00:23:50,455 --> 00:23:55,654
+ it's inherently alingual or multilingual, phonetic based.
+
+ 503
+ 00:23:56,375 --> 00:24:00,455
+ So as folks who speak very obscure languages that
+
+ 504
+ 00:24:00,455 --> 00:24:03,174
+ there may be very there might be a paucity of
+
+ 505
+ 00:24:02,309 --> 00:24:05,110
+ training data or almost none at all, and therefore the
+
+ 506
+ 00:24:05,110 --> 00:24:06,870
+ accuracy is significantly reduced.
2024
+
2025
+ 507
2026
+ 00:24:06,870 --> 00:24:11,430
2027
+ Or folks in very critical environments, I know there are
2028
+
2029
+ 508
2030
+ 00:24:11,590 --> 00:24:15,430
2031
+ this is used extensively in medical transcription and dispatcher work
2032
+
2033
+ 509
2034
+ 00:24:15,430 --> 00:24:19,144
2035
+ as, you know the call centers who send out ambulances
2036
+
2037
+ 510
2038
+ 00:24:19,144 --> 00:24:19,944
2039
+ etc.
2040
+
2041
+ 511
2042
+ 00:24:20,345 --> 00:24:23,625
2043
+ Where accuracy is absolutely paramount and in the case of
2044
+
2045
+ 512
2046
+ 00:24:23,625 --> 00:24:27,625
2047
+ doctors radiologists they might be using very specialized vocab all
2048
+
2049
+ 513
2050
+ 00:24:27,625 --> 00:24:27,945
2051
+ the time.
2052
+
2053
+ 514
2054
+ 00:24:28,710 --> 00:24:30,309
2055
+ So those are kind of the main two things, and
2056
+
2057
+ 515
2058
+ 00:24:30,309 --> 00:24:32,230
2059
+ I'm not sure that really just for trying to make
2060
+
2061
+ 516
2062
+ 00:24:32,230 --> 00:24:36,470
2063
+ it better on a few random tech words with my
2064
+
2065
+ 517
2066
+ 00:24:36,470 --> 00:24:39,509
2067
+ slightly I mean, I have an accent, but, like, not,
2068
+
2069
+ 518
2070
+ 00:24:39,509 --> 00:24:42,549
2071
+ you know, an accent that a few other million people
2072
+
2073
+ 519
2074
+ 00:24:42,950 --> 00:24:43,990
2075
+ have ish.
2076
+
2077
+ 520
2078
+ 00:24:44,765 --> 00:24:48,045
2079
+ I'm not sure that my little fine tune is gonna
2080
+
2081
+ 521
2082
+ 00:24:48,045 --> 00:24:52,684
2083
+ actually like, the bump in word error reduction, if I
2084
+
2085
+ 522
2086
+ 00:24:52,684 --> 00:24:54,285
2087
+ ever actually figure out how to do it and get
2088
+
2089
+ 523
2090
+ 00:24:54,285 --> 00:24:56,445
2091
+ it up to the cloud, by the time we've done
2092
+
2093
+ 524
2094
+ 00:24:56,445 --> 00:25:00,039
2095
+ that, I suspect that the next generation of ASR will
2096
+
2097
+ 525
2098
+ 00:25:00,039 --> 00:25:01,799
2099
+ just be so good that it will kind of be,
2100
+
2101
+ 526
2102
+ 00:25:02,039 --> 00:25:04,039
2103
+ well, that would have been cool if it worked out,
2104
+
2105
+ 527
2106
+ 00:25:04,039 --> 00:25:05,559
2107
+ but I'll just use this instead.
2108
+
2109
+ 528
2110
+ 00:25:05,799 --> 00:25:10,759
2111
+ So that's gonna be it for today's episode of voice
2112
+
2113
+ 529
2114
+ 00:25:10,759 --> 00:25:11,720
2115
+ training data.
2116
+
2117
+ 530
2118
+ 00:25:11,960 --> 00:25:14,335
2119
+ Single, long shot evaluation.
2120
+
2121
+ 531
2122
+ 00:25:14,575 --> 00:25:15,774
2123
+ Who am I gonna compare?
2124
+
2125
+ 532
2126
+ 00:25:16,494 --> 00:25:18,654
2127
+ Whisper is always good as a benchmark, but I'm more
2128
+
2129
+ 533
2130
+ 00:25:18,654 --> 00:25:22,255
2131
+ interested in seeing Whisper head to head with two things
2132
+
2133
+ 534
2134
+ 00:25:22,255 --> 00:25:22,974
2135
+ really.
2136
+
2137
+ 535
2138
+ 00:25:23,375 --> 00:25:25,214
2139
+ One is Whisper variants.
2140
+
2141
+ 536
2142
+ 00:25:25,214 --> 00:25:27,775
2143
+ So you've got these projects like Faster Whisper.
2144
+
2145
+ 537
2146
+ 00:25:29,190 --> 00:25:30,069
2147
+ Distill Whisper.
2148
+
2149
+ 538
2150
+ 00:25:30,069 --> 00:25:30,789
2151
+ It's a bit confusing.
2152
+
2153
+ 539
2154
+ 00:25:30,789 --> 00:25:31,989
2155
+ There's a whole bunch of them.
2156
+
2157
+ 540
2158
+ 00:25:32,230 --> 00:25:35,190
2159
+ And the emerging ASRs, which are also a thing.
2160
+
2161
+ 541
2162
+ 00:25:35,349 --> 00:25:37,190
2163
+ My intention for this is I'm not sure I'm gonna
2164
+
2165
+ 542
2166
+ 00:25:37,190 --> 00:25:39,990
2167
+ have the time in any point in the foreseeable future
2168
+
2169
+ 543
2170
+ 00:25:39,990 --> 00:25:44,855
2171
+ to go back to this whole episode and create a
2172
+
2173
+ 544
2174
+ 00:25:44,855 --> 00:25:48,374
2175
+ proper source truth where I fix everything.
2176
+
2177
+ 545
2178
+ 00:25:49,335 --> 00:25:51,974
2179
+ Might do it if I can get one transcription that's
2180
+
2181
+ 546
2182
+ 00:25:51,974 --> 00:25:54,214
2183
+ sufficiently close to perfection.
2184
+
2185
+ 547
2186
+ 00:25:55,014 --> 00:25:58,480
2187
+ But what I would actually love to do on Hugging
2188
+
2189
+ 548
2190
+ 00:25:58,480 --> 00:26:00,559
2191
+ Face, I think would be a great probably how I
2192
+
2193
+ 549
2194
+ 00:26:00,559 --> 00:26:04,480
2195
+ might visualize this is having the audio waveform play and
2196
+
2197
+ 550
2198
+ 00:26:04,480 --> 00:26:08,960
2199
+ then have the transcript for each model below it and
2200
+
2201
+ 551
2202
+ 00:26:08,960 --> 00:26:13,845
2203
+ maybe even a, like, you know, to scale and maybe
2204
+
2205
+ 552
2206
+ 00:26:13,845 --> 00:26:16,724
2207
+ even a local one as well, like local whisper versus
2208
+
2209
+ 553
2210
+ 00:26:16,724 --> 00:26:19,764
2211
+ OpenAI API, etcetera.
2212
+
2213
+ 554
2214
+ 00:26:19,845 --> 00:26:23,204
2215
+ And I can then actually listen back to segments or
2216
+
2217
+ 555
2218
+ 00:26:23,204 --> 00:26:25,365
2219
+ anyone who wants to can listen back to segments of
2220
+
2221
+ 556
2222
+ 00:26:25,365 --> 00:26:30,299
2223
+ this recording and see where a particular model struggled and
2224
+
2225
+ 557
2226
+ 00:26:30,299 --> 00:26:33,179
2227
+ others didn't as well as the sort of headline finding
2228
+
2229
+ 558
2230
+ 00:26:33,179 --> 00:26:35,659
2231
+ of which had the best W E R but that
2232
+
2233
+ 559
2234
+ 00:26:35,659 --> 00:26:37,739
2235
+ would require the source of truth.
2236
+
2237
+ 560
2238
+ 00:26:37,740 --> 00:26:38,539
2239
+ Okay, that's it.
2240
+
2241
+ 561
2242
+ 00:26:38,505 --> 00:26:41,065
2243
+ I hope this was, I don't know, maybe useful for
2244
+
2245
+ 562
2246
+ 00:26:41,065 --> 00:26:42,984
2247
+ other folks interested in STT.
2248
+
2249
+ 563
2250
+ 00:26:43,065 --> 00:26:46,025
2251
+ You want to see I always think I've just said
2252
+
2253
+ 564
2254
+ 00:26:46,025 --> 00:26:47,704
2255
+ it as something I didn't intend to.
2256
+
2257
+ 565
2258
+ 00:26:47,944 --> 00:26:49,704
2259
+ STT, I said for those.
2260
+
2261
+ 566
2262
+ 00:26:49,704 --> 00:26:53,129
2263
+ Listen carefully, including hopefully the models themselves.
2264
+
2265
+ 567
2266
+ 00:26:53,369 --> 00:26:55,129
2267
+ This has been myself, Daniel Rosol.
2268
+
2269
+ 568
2270
+ 00:26:55,129 --> 00:26:59,450
2271
+ For more jumbled repositories about my roving interest in AI
2272
+
2273
+ 569
2274
+ 00:26:59,450 --> 00:27:04,089
2275
+ but particularly AgenTic, MCP and VoiceTech you can find me
2276
+
2277
+ 570
2278
+ 00:27:04,089 --> 00:27:05,769
2279
+ on GitHub.
2280
+
2281
+ 571
2282
+ 00:27:06,009 --> 00:27:06,730
2283
+ Hugging Face.
2284
+
2285
+ 572
2286
+ 00:27:08,125 --> 00:27:09,004
2287
+ Where else?
2288
+
2289
+ 573
2290
+ 00:27:09,005 --> 00:27:11,805
2291
+ DanielRosel dot com, which is my personal website, as well
2292
+
2293
+ 574
2294
+ 00:27:11,805 --> 00:27:15,565
2295
+ as this podcast whose name I sadly cannot remember.
2296
+
2297
+ 575
2298
+ 00:27:15,724 --> 00:27:16,765
2299
+ Until next time.
2300
+
2301
+ 576
2302
+ 00:27:16,765 --> 00:27:17,404
2303
+ Thanks for listening.
2304
+
srt-out/speechmatics.srt ADDED
@@ -0,0 +1,2069 @@
+ 1
+ 00:00:00,120 --> 00:00:06,520
+ Hello and welcome to a audio data
+ set consisting of one single
+
+ 2
+ 00:00:06,520 --> 00:00:12,120
+ episode of a non-existent podcast.
+ Or it, uh, I may append this to a
+
+ 3
+ 00:00:12,120 --> 00:00:16,640
+ podcast that I set up recently.
+ Um, regarding my, uh,
+
+ 4
+ 00:00:16,680 --> 00:00:21,960
+ with my thoughts on speech,
+ tech and AI in particular,
+
+ 5
+ 00:00:22,240 --> 00:00:27,960
+ more AI and generative AI, I would,
+ uh, I would say, but in any event,
+
+ 6
+ 00:00:27,960 --> 00:00:32,480
+ the purpose of this, um,
+ voice recording is actually to create
+
+ 7
+ 00:00:32,680 --> 00:00:37,560
+ a lengthy voice sample for a quick
+ evaluation, a back of the envelope
+
+ 8
+ 00:00:37,560 --> 00:00:41,160
+ evaluation, as they might say,
+ for different speech to text models.
+
+ 9
+ 00:00:41,160 --> 00:00:43,800
+ And I'm doing this because I,
+ uh, I thought I'd made a great
+
+ 10
+ 00:00:43,800 --> 00:00:48,320
+ breakthrough in my journey with
+ speech tech, and that was succeeding
+
+ 11
+ 00:00:48,320 --> 00:00:52,720
+ in the elusive task of fine tuning.
+ Whisper, whisper is.
+
+ 12
+ 00:00:52,840 --> 00:00:56,960
+ And I'm going to just talk.
+ I'm trying to mix up, uh,
+
+ 13
+ 00:00:56,960 --> 00:01:00,470
+ I'm going to try a few different
+ styles of speaking.
+
+ 14
+ 00:01:00,470 --> 00:01:02,630
+ I might whisper something at
+ some point as well,
+
+ 15
+ 00:01:03,190 --> 00:01:07,150
+ and I'll go back to speaking loud in,
+ uh, in different parts.
+
+ 16
+ 00:01:07,150 --> 00:01:09,710
+ I'm going to sound really like a
+ crazy person, because I'm also
+
+ 17
+ 00:01:09,710 --> 00:01:15,870
+ going to try to speak at different
+ pitches and cadences in order to
+
+ 18
+ 00:01:15,910 --> 00:01:20,630
+ really try to put a speech to
+ text model through its paces,
+
+ 19
+ 00:01:20,630 --> 00:01:25,870
+ which is trying to make sense of,
+ is this guy just on incoherently in
+
+ 20
+ 00:01:25,870 --> 00:01:34,350
+ one long sentence, or are these just
+ actually a series of step standalone,
+
+ 21
+ 00:01:34,350 --> 00:01:37,510
+ standalone, standalone sentences?
+ And how is it going to handle
+
+ 22
+ 00:01:37,510 --> 00:01:40,750
+ step alone? That's not a word.
+ Uh, what happens when you use
+
+ 23
+ 00:01:40,750 --> 00:01:44,030
+ speech to text and you use a fake
+ word and then you're like, wait,
+
+ 24
+ 00:01:44,030 --> 00:01:48,350
+ that's not actually that word doesn't
+ exist. How does AI handle that?
+
+ 25
+ 00:01:48,390 --> 00:01:53,910
+ And, uh, these and more are all
+ the questions that I'm seeking
+
+ 26
+ 00:01:53,910 --> 00:01:57,350
+ to answer in this training data.
+ Now, why did why was it trying
+
+ 27
+ 00:01:57,350 --> 00:01:59,740
+ to fine tune a whisper?
+ And what is whisper?
+
+ 28
+ 00:01:59,780 --> 00:02:03,540
+ As I said, I'm gonna try to, uh,
+ record this at a couple of different
+
+ 29
+ 00:02:03,540 --> 00:02:09,060
+ levels of technicality for folks who
+ are, uh, you know, in the normal, uh,
+
+ 30
+ 00:02:09,060 --> 00:02:13,460
+ world and not totally stuck down
+ the rabbit hole of AI, uh, which I
+
+ 31
+ 00:02:13,460 --> 00:02:17,460
+ have to say is a really wonderful,
+ uh, rabbit hole to be to be down.
+
+ 32
+ 00:02:17,580 --> 00:02:21,700
+ Um, it's a really interesting area.
+ And speech and voice tech is is
+
+ 33
+ 00:02:21,940 --> 00:02:24,980
+ the aspect of it that I find
+ actually most.
+
+ 34
+ 00:02:25,180 --> 00:02:28,340
+ I'm not sure I would say the most
+ interesting, because there's just
+
+ 35
+ 00:02:28,340 --> 00:02:32,700
+ so much that is fascinating in AI.
+ Uh, but the most that I find the
+
+ 36
+ 00:02:32,700 --> 00:02:36,220
+ most personally transformative
+ in terms of the impact that it's
+
+ 37
+ 00:02:36,220 --> 00:02:41,660
+ had on my daily work life and
+ productivity and how I sort of work.
+
+ 38
+ 00:02:41,940 --> 00:02:48,020
+ And I'm persevering hard with the
+ task of trying to guess a good
+
+ 39
+ 00:02:48,020 --> 00:02:51,700
+ solution working for Linux, which if
+ anyone actually does listen to this,
+
+ 40
+ 00:02:51,700 --> 00:02:55,100
+ not just for the training data
+ and for the actual content, uh,
+
+ 41
+ 00:02:55,140 --> 00:02:59,600
+ this is this is has sparked I had
+ besides the fine tune not working.
+
+ 42
+ 00:02:59,600 --> 00:03:05,560
+ Well, that was the failure.
+ Um, I used clod code because one
+
+ 43
+ 00:03:05,560 --> 00:03:10,160
+ thinks these days that there is
+ nothing short of solving,
+
+ 44
+ 00:03:11,040 --> 00:03:14,680
+ you know, the, uh,
+ the reason of life or something.
+
+ 45
+ 00:03:15,080 --> 00:03:19,560
+ Uh, that clod and agentic AI can't
+ do, uh, which is not really the case.
+
+ 46
+ 00:03:19,600 --> 00:03:23,600
+ Uh, it does seem that way sometimes,
+ but it fails a lot as well.
+
+ 47
+ 00:03:23,600 --> 00:03:26,960
+ And this is one of those, uh,
+ instances where last week I put
+
+ 48
+ 00:03:26,960 --> 00:03:31,400
+ together an hour of voice training
+ data, basically speaking just
+
+ 49
+ 00:03:31,400 --> 00:03:35,040
+ random things for three minutes.
+ And, um,
+
+ 50
+ 00:03:35,720 --> 00:03:38,520
+ it was actually kind of tedious
+ because the texts were really weird.
+
+ 51
+ 00:03:38,520 --> 00:03:42,120
+ Some of them were it was like it
+ was AI generated.
+
+ 52
+ 00:03:42,320 --> 00:03:44,920
+ Um, I tried before to read
+ Sherlock Holmes for an hour and
+
+ 53
+ 00:03:44,920 --> 00:03:47,000
+ I just couldn't.
+ I was so bored, uh,
+
+ 54
+ 00:03:47,040 --> 00:03:50,800
+ after ten minutes that I was like,
+ okay, now I'm just gonna have to
+
+ 55
+ 00:03:50,800 --> 00:03:56,470
+ find something else to read.
+ So I used a created with AI
+
+ 56
+ 00:03:56,510 --> 00:04:00,150
+ studio vibe coded.
+ A synthetic text generator.
+
+ 57
+ 00:04:00,390 --> 00:04:03,990
+ Um, which actually I thought was
+ probably a better way of doing it
+
+ 58
+ 00:04:03,990 --> 00:04:08,870
+ because it would give me more short
+ samples with more varied content.
+
+ 59
+ 00:04:08,870 --> 00:04:13,310
+ So I was like, okay, give me a voice
+ note, like I'm recording an email,
+
+ 60
+ 00:04:13,310 --> 00:04:18,110
+ give me a short story to read,
+ give me prose, um, to read.
+
+ 61
+ 00:04:18,110 --> 00:04:21,310
+ So I came up with all these
+ different things, and I added a
+
+ 62
+ 00:04:21,310 --> 00:04:24,750
+ little timer to it so I could
+ see how close I was to one hour.
+
+ 63
+ 00:04:24,990 --> 00:04:29,830
+ Um, and, uh, I spent like an hour one
+ afternoon or probably two hours by
+
+ 64
+ 00:04:29,830 --> 00:04:34,190
+ the time you, um, you do retakes
+ or whatever because you want to.
+
+ 65
+ 00:04:34,990 --> 00:04:39,190
+ It gave me a source of truth,
+ which I'm not sure if that's the
+
+ 66
+ 00:04:39,190 --> 00:04:43,550
+ scientific way to approach this topic
+ of gathering, uh, training data,
+
+ 67
+ 00:04:43,550 --> 00:04:48,070
+ but I thought it made sense.
+ Um, I have a lot of audio data
+
+ 68
+ 00:04:48,070 --> 00:04:52,070
+ from recording voice notes,
+ which I've also kind of used, um,
+
+ 69
+ 00:04:52,070 --> 00:04:55,780
+ been experimenting with using for
+ a different purpose, slightly
+
+ 70
+ 00:04:55,780 --> 00:05:00,820
+ different annotating task types.
+ It's more text classification
+
+ 71
+ 00:05:00,820 --> 00:05:03,740
+ experiment or uh, well,
+ it's more than that, actually.
+
+ 72
+ 00:05:03,740 --> 00:05:08,100
+ I'm working on a voice app,
+ so it's a prototype I guess is
+
+ 73
+ 00:05:08,100 --> 00:05:12,780
+ really more accurate.
+ Um, but you can do that and you
+
+ 74
+ 00:05:12,780 --> 00:05:14,220
+ can work backwards.
+ You're like,
+
+ 75
+ 00:05:14,260 --> 00:05:18,620
+ you listen back to a voice note
+ and you painfully go through one
+
+ 76
+ 00:05:18,620 --> 00:05:21,980
+ of those transcribing, you know,
+ where you start and stop and scrub
+
+ 77
+ 00:05:21,980 --> 00:05:24,100
+ around it and you fix the errors.
+ But it's really,
+
+ 78
+ 00:05:24,100 --> 00:05:27,220
+ really boring to do that.
+ So I thought it would be less
+
+ 79
+ 00:05:27,220 --> 00:05:31,860
+ tedious in the long term if I just
+ recorded The Source of truth.
+
+ 80
+ 00:05:32,180 --> 00:05:34,300
+ So it gave me these three minute
+ snippets.
+
+ 81
+ 00:05:34,300 --> 00:05:38,780
+ I recorded them and saved an MP3
+ and a txt in the same folder,
+
+ 82
+ 00:05:38,780 --> 00:05:43,820
+ and I created an hour of that data.
+ Uh, so I was very hopeful, quietly,
+
+ 83
+ 00:05:43,860 --> 00:05:46,380
+ you know, a little bit hopeful
+ that I would be able that I could
+
+ 84
+ 00:05:46,380 --> 00:05:49,700
+ actually fine tune, whisper.
+ Um, I want to fine tune whisper
+
+ 85
+ 00:05:49,700 --> 00:05:54,840
+ because when I got into voice tech
+ last November, my wife was in
+
+ 86
+ 00:05:54,840 --> 00:05:59,600
+ the US and I was alone at home.
+ And you know, when crazy people
+
+ 87
+ 00:05:59,600 --> 00:06:03,760
+ like me do really wild things like
+ use voice to tech, uh, technology.
+
+ 88
+ 00:06:03,760 --> 00:06:06,520
+ That was basically, um,
+ when I started doing it,
+
+ 89
+ 00:06:06,520 --> 00:06:10,280
+ I didn't feel like a crazy person
+ speaking to myself, and my
+
+ 90
+ 00:06:10,280 --> 00:06:16,120
+ expectations weren't that high.
+ Uh, I used speech tech now and again.
+
+ 91
+ 00:06:16,200 --> 00:06:18,480
+ Um, tried it out.
+ I was like, it'd be really cool
+
+ 92
+ 00:06:18,480 --> 00:06:20,520
+ if you could just, like,
+ speak into your computer.
+
+ 93
+ 00:06:20,880 --> 00:06:24,720
+ And whatever I tried out that
+ had Linux support was just.
+
+ 94
+ 00:06:25,440 --> 00:06:28,640
+ It was not good, basically.
+ Um, and this blew me away from
+
+ 95
+ 00:06:28,640 --> 00:06:32,040
+ the first go.
+ I mean, it wasn't 100% accurate
+
+ 96
+ 00:06:32,080 --> 00:06:35,160
+ out of the box and it took work,
+ but it was good enough that there was
+
+ 97
+ 00:06:35,160 --> 00:06:39,720
+ a solid foundation and it kind of
+ passed that, uh, pivot point that
+
+ 98
+ 00:06:39,720 --> 00:06:42,880
+ it's actually worth doing this.
+ You know, there's a point where
+
+ 99
+ 00:06:42,880 --> 00:06:46,920
+ it's so like the transcript is you
+ don't have to get 100% accuracy
+
+ 100
+ 00:06:46,920 --> 00:06:50,630
+ for it to be worth your time for
+ speech to text to be a worthwhile
+
+ 101
+ 00:06:50,630 --> 00:06:53,070
+ addition to your productivity.
+ But you do need to get above.
+
+ 102
+ 00:06:53,110 --> 00:06:57,750
+ Let's say, I don't know, 85%.
+ If it's 60% or 50%,
+
+ 103
+ 00:06:57,750 --> 00:07:00,790
+ you inevitably say, screw it.
+ I'll just type it because you end up
+
+ 104
+ 00:07:00,790 --> 00:07:05,070
+ missing errors in the transcript
+ and it becomes actually worse.
+
+ 105
+ 00:07:05,070 --> 00:07:06,830
+ You end up in a worse position
+ than you started with.
+
+ 106
+ 00:07:06,830 --> 00:07:11,030
+ And that's been my experience.
+ So, um, I was like, oh,
+
+ 107
+ 00:07:11,070 --> 00:07:13,550
+ this is actually really, really good.
+ Now how did that happen?
+
+ 108
+ 00:07:13,550 --> 00:07:18,910
+ And the answer is ASR whisper
+ being open sourced and the
+
+ 109
+ 00:07:18,910 --> 00:07:21,910
+ transformer architecture,
+ if you want to go back to the,
+
+ 110
+ 00:07:22,510 --> 00:07:26,750
+ um, to the underpinnings, which
+ really blows my mind and it's on my
+
+ 111
+ 00:07:26,750 --> 00:07:32,430
+ list to read through that paper.
+ Um, all you need is attention as
+
+ 112
+ 00:07:33,470 --> 00:07:38,470
+ attentively as can be done with my
+ limited brain because it's super,
+
+ 113
+ 00:07:38,470 --> 00:07:42,310
+ super high level stuff.
+ Um, super advanced stuff.
+
+ 114
+ 00:07:42,350 --> 00:07:48,070
+ I mean, uh, but that I think of all
+ the things that are fascinating
+
+ 115
+ 00:07:48,180 --> 00:07:52,820
+ about the sudden rise in AI and
+ the dramatic capabilities.
+
+ 116
+ 00:07:53,420 --> 00:07:55,700
+ I find it fascinating that few
+ people are like, hang on,
+
+ 117
+ 00:07:55,860 --> 00:07:59,740
+ you've got this thing that can speak
+ to you like a chatbot, an LLM,
+
+ 118
+ 00:08:00,420 --> 00:08:05,580
+ and then you've got image generation.
+ Okay, so firstly, those two things on
+
+ 119
+ 00:08:05,580 --> 00:08:10,860
+ the surface have nothing in common.
+ Um, so like how are they how did that
+
+ 120
+ 00:08:10,860 --> 00:08:13,100
+ just happen all at the same time.
+ And then when you extend that
+
+ 121
+ 00:08:13,100 --> 00:08:16,180
+ further, um, you're like sooner,
+ right?
+
+ 122
+ 00:08:16,180 --> 00:08:21,700
+ You can sing a song and AI will like,
+ come up with an instrumental and then
+
+ 123
+ 00:08:21,700 --> 00:08:23,860
+ you've got whisper and you're like,
+ wait a second,
+
+ 124
+ 00:08:24,060 --> 00:08:28,100
+ how did all this stuff, like,
+ if it's all AI, what's like there
+
+ 125
+ 00:08:28,100 --> 00:08:30,700
+ has to be some commonality.
+ Otherwise these are four.
+
+ 126
+ 00:08:30,780 --> 00:08:34,780
+ These are totally different
+ technologies on the surface of it.
+
+ 127
+ 00:08:34,780 --> 00:08:40,220
+ And, uh, the transformer architecture
+ is, as far as I know, the answer.
+
+ 128
+ 00:08:40,220 --> 00:08:43,860
+ And I can't even say can't even
+ pretend that I really understand
+
+ 129
+ 00:08:44,140 --> 00:08:47,290
+ what the transformer
+ architecture means in depth,
+
+ 130
+ 00:08:47,290 --> 00:08:51,810
+ but I have scanned it and as I said,
+ I want to print it and really kind
+
+ 131
+ 00:08:51,810 --> 00:08:56,770
+ of think over it at some point,
+ and I'll probably feel bad about
+
+ 132
+ 00:08:56,770 --> 00:08:59,090
+ myself, I think,
+ because weren't those guys in their
+
+ 133
+ 00:08:59,130 --> 00:09:04,010
+ in their 20s like, that's crazy.
+ I think I asked ChatGPT once who
+
+ 134
+ 00:09:04,050 --> 00:09:08,370
+ were the who wrote that paper
+ and how old were they when it
+
+ 135
+ 00:09:08,370 --> 00:09:11,290
+ was published in arXiv?
+ And I was expecting like,
+
+ 136
+ 00:09:11,530 --> 00:09:13,450
+ I don't know,
+ what do you what do you imagine?
+
+ 137
+ 00:09:13,450 --> 00:09:15,050
+ I personally imagine kind of like,
+ you know,
+
+ 138
+ 00:09:15,090 --> 00:09:19,210
+ you have these breakthroughs during
+ Covid and things like that where
+
+ 139
+ 00:09:19,250 --> 00:09:22,210
+ like these kind of really obscure
+ scientists who are like in their
+
+ 140
+ 00:09:22,210 --> 00:09:27,250
+ 50s and they've just kind of been
+ laboring in labs and, uh, wearily
+
+ 141
+ 00:09:27,250 --> 00:09:30,650
+ and writing in publishing in kind
+ of obscure academic publications.
+
+ 142
+ 00:09:30,850 --> 00:09:34,050
+ And they finally, like,
+ hit a big or win a Nobel Prize and
+
+ 143
+ 00:09:34,050 --> 00:09:37,930
+ then their household household names.
+ Uh, so that was kind of what I
+
+ 144
+ 00:09:37,930 --> 00:09:39,770
+ had in mind.
+ That was the mental image I'd
+
+ 145
+ 00:09:39,770 --> 00:09:44,010
+ formed of the birth of arXiv.
+ Like, I wasn't expecting 20
+
+ 146
+ 00:09:44,050 --> 00:09:47,430
+ somethings in San Francisco,
+ though I thought that was both very,
+
+ 147
+ 00:09:47,430 --> 00:09:49,990
+ very funny, very cool,
+ and actually kind of inspiring.
+
+ 148
+ 00:09:50,510 --> 00:09:55,630
+ It's nice to think that people who,
+ you know, just you might put them
+
+ 149
+ 00:09:55,630 --> 00:10:01,030
+ in the kind of milieu or bubble or
+ world that you are in or credibly in,
+
+ 150
+ 00:10:01,070 --> 00:10:03,710
+ through, you know,
+ a series of connections that are
+
+ 151
+ 00:10:03,710 --> 00:10:07,750
+ coming up with such literally
+ world changing, um, innovations.
+
+ 152
+ 00:10:07,790 --> 00:10:11,550
+ Uh, so that was, I thought,
+ anyway, that, that that was cool.
+
+ 153
+ 00:10:12,190 --> 00:10:14,070
+ Okay. Voice training data.
+ How are we doing?
+
+ 154
+ 00:10:14,070 --> 00:10:18,110
+ We're about ten minutes, and I'm
+ still talking about voice technology.
+
+ 155
+ 00:10:18,310 --> 00:10:22,470
+ Um, so whisper was brilliant,
+ and I was so excited that I was.
+
+ 156
+ 00:10:22,470 --> 00:10:25,750
+ My first instinct was to, like,
+ get like, oh, my gosh,
+
+ 157
+ 00:10:25,750 --> 00:10:27,830
+ I have to get, like,
+ a really good microphone for this.
+
+ 158
+ 00:10:28,070 --> 00:10:31,750
+ So, um, I didn't go on a
+ spending spree because I said,
+
+ 159
+ 00:10:31,790 --> 00:10:34,590
+ I'm gonna have to just wait a
+ month and see if I still use this.
+
+ 160
+ 00:10:35,030 --> 00:10:40,110
+ And it just kind of became it's
+ become really part of my daily
+
+ 161
+ 00:10:40,110 --> 00:10:43,110
+ routine.
+ Like, if I'm writing an email,
+
+ 162
+ 00:10:43,110 --> 00:10:47,140
+ I'll record a voice note.
+ And then I've developed and it's
+
+ 163
+ 00:10:47,140 --> 00:10:50,020
+ nice to see that everyone is
+ like developing the same things
+
+ 164
+ 00:10:50,020 --> 00:10:52,020
+ in parallel.
+ Like, that's kind of a weird thing
+
+ 165
+ 00:10:52,060 --> 00:10:57,460
+ to say, but when I look, I kind of
+ came when I started working on this,
+
+ 166
+ 00:10:57,500 --> 00:11:00,820
+ these prototypes on GitHub,
+ which is where I just kind of
830
+
831
+ 167
832
+ 00:11:00,860 --> 00:11:04,860
833
+ share very freely and loosely,
834
+ uh, ideas and, you know,
835
+
836
+ 168
837
+ 00:11:04,900 --> 00:11:10,140
838
+ first iterations on, on concepts,
839
+ um, and for want of a better word,
840
+
841
+ 169
842
+ 00:11:10,140 --> 00:11:14,020
843
+ I called it like, uh,
844
+ LLM post-processing or cleanup or
845
+
846
+ 170
847
+ 00:11:14,260 --> 00:11:18,220
848
+ basically a system prompt that after
849
+ you get back the raw text from
850
+
851
+ 171
852
+ 00:11:18,540 --> 00:11:24,220
853
+ whisper, you run it through a model
854
+ and say, okay, this is crappy text,
855
+
856
+ 172
857
+ 00:11:24,260 --> 00:11:27,260
858
+ like add sentence structure and,
859
+ you know, fix it up.
860
+
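The post-processing step described above can be sketched roughly as below: wrap the raw speech-to-text output in a chat payload whose system prompt asks a model to add punctuation and structure. The prompt wording and the `build_cleanup_messages` helper are illustrative assumptions, not the speaker's actual setup.

```python
# Sketch of the "LLM post-processing" step: raw transcript text goes in,
# an OpenAI-style chat `messages` payload comes out, ready to send to a
# model for cleanup. Prompt text and helper name are illustrative.

CLEANUP_SYSTEM_PROMPT = (
    "You will receive raw speech-to-text output with little or no "
    "punctuation. Add sentence structure, punctuation, and paragraph "
    "spacing. Do not change the meaning or add new content."
)

def build_cleanup_messages(raw_transcript: str) -> list[dict]:
    """Build a chat-completion messages list for transcript cleanup."""
    return [
        {"role": "system", "content": CLEANUP_SYSTEM_PROMPT},
        {"role": "user", "content": raw_transcript},
    ]

msgs = build_cleanup_messages(
    "okay so whisper was brilliant and i was so excited"
)
```

Keeping the cleanup instruction in the system role means the raw text can be passed through verbatim as the user turn, whatever it contains.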
861
+ 173
862
+ 00:11:27,700 --> 00:11:32,780
863
+ And, um, now when I'm exploring the
864
+ different tools that are out there
865
+
866
+ 174
867
+ 00:11:32,820 --> 00:11:36,700
868
+ that people have built, I see, uh,
869
+ quite a number of projects have
870
+
871
+ 175
872
+ 00:11:37,300 --> 00:11:41,820
873
+ basically done the same thing,
874
+ um, lest that be misconstrued.
875
+
876
+ 176
877
+ 00:11:41,820 --> 00:11:44,490
878
+ I'm not saying for a millisecond
879
+ that I inspired them.
880
+
881
+ 177
882
+ 00:11:44,490 --> 00:11:49,010
883
+ I'm sure this has been a thing that's
884
+ been integrated into tools for a
885
+
886
+ 178
887
+ 00:11:49,050 --> 00:11:52,410
888
+ while, but it's it's the kind of
889
+ thing that when you start using these
890
+
891
+ 179
892
+ 00:11:52,410 --> 00:11:56,850
893
+ tools every day, the need for it
894
+ is almost instantly apparent, uh,
895
+
896
+ 180
897
+ 00:11:56,850 --> 00:12:00,890
898
+ because text that doesn't have any
899
+ punctuation or paragraph spacing
900
+
901
+ 181
902
+ 00:12:00,930 --> 00:12:04,370
903
+ takes a long time to, you know,
904
+ it takes so long to get it into
905
+
906
+ 182
907
+ 00:12:04,370 --> 00:12:09,490
908
+ a presentable email that again,
909
+ it's it's it moves speech tech
910
+
911
+ 183
912
+ 00:12:09,530 --> 00:12:13,050
913
+ into that before that inflection
914
+ point where you're like, no,
915
+
916
+ 184
917
+ 00:12:13,050 --> 00:12:16,370
918
+ it's just not worth it.
919
+ It's like it'll just be quicker
920
+
921
+ 185
922
+ 00:12:16,370 --> 00:12:18,970
923
+ to type this.
924
+ So it's a big it's a little touch.
925
+
926
+ 186
927
+ 00:12:18,970 --> 00:12:24,210
928
+ That actually is a big deal.
929
+ Uh, so I was on whisper and I've
930
+
931
+ 187
932
+ 00:12:24,210 --> 00:12:28,290
933
+ been using whisper and I kind of
934
+ early on found a couple of tools.
935
+
936
+ 188
937
+ 00:12:28,330 --> 00:12:31,050
938
+ I couldn't find what I was
939
+ looking for on Linux, which is,
940
+
941
+ 189
942
+ 00:12:31,490 --> 00:12:35,890
943
+ um, basically just something
944
+ that'll run in the background.
945
+
946
+ 190
947
+ 00:12:35,930 --> 00:12:40,250
948
+ You'll give it an API key and it
949
+ will just transcribe. Um.
950
+
951
+ 191
952
+ 00:12:41,400 --> 00:12:44,120
953
+ with, like, a little key to
954
+ start and stop the dictation.
955
+
956
+ 192
957
+ 00:12:44,720 --> 00:12:49,160
958
+ Uh, and the issues were I discovered
959
+ that, like most people involved in
960
+
961
+ 193
962
+ 00:12:49,160 --> 00:12:54,040
963
+ creating these projects were very
964
+ much focused on local models running
965
+
966
+ 194
967
+ 00:12:54,040 --> 00:12:57,520
968
+ whisper locally, because you can.
969
+ And I tried that a bunch of
970
+
971
+ 195
972
+ 00:12:57,520 --> 00:13:00,960
973
+ times and just never got results
974
+ that were as good as the cloud.
975
+
976
+ 196
977
+ 00:13:01,280 --> 00:13:04,760
978
+ And when I began looking at the
979
+ cost of the speech to text APIs
980
+
981
+ 197
982
+ 00:13:04,760 --> 00:13:08,640
983
+ and what I was spending,
984
+ I just thought there's it's actually,
985
+
986
+ 198
987
+ 00:13:08,840 --> 00:13:13,320
988
+ in my opinion, just one of the better
989
+ deals in API spending and in cloud.
990
+
991
+ 199
992
+ 00:13:13,360 --> 00:13:17,400
993
+ Like it's just not that expensive
994
+ for very, very good models that are
995
+
996
+ 200
997
+ 00:13:17,520 --> 00:13:20,960
998
+ much more, you know, you're going
999
+ to be able to run the full model,
1000
+
1001
+ 201
1002
+ 00:13:21,480 --> 00:13:26,080
1003
+ the latest model versus whatever
1004
+ you can run on your average GPU.
1005
+
1006
+ 202
1007
+ 00:13:26,120 --> 00:13:29,880
1008
+ Unless you want to buy a crazy GPU.
1009
+ It doesn't really make sense to me.
1010
+
1011
+ 203
1012
+ 00:13:29,880 --> 00:13:33,600
1013
+ Now, privacy is another concern.
1014
+ Um, that I know is kind of like a
1015
+
1016
+ 204
1017
+ 00:13:33,640 --> 00:13:37,040
1018
+ very much a separate thing that
1019
+ people just don't want their voice
1020
+
1021
+ 205
1022
+ 00:13:37,040 --> 00:13:39,910
1023
+ data and their voice leaving
1024
+ their local environment,
1025
+
1026
+ 206
1027
+ 00:13:40,230 --> 00:13:43,950
1028
+ maybe for regulatory reasons as well.
1029
+ Um, but I'm not in that.
1030
+
1031
+ 207
1032
+ 00:13:44,030 --> 00:13:48,030
1033
+ Um, I neither really care about
1034
+ people listening to my, uh,
1035
+
1036
+ 208
1037
+ 00:13:48,070 --> 00:13:51,310
1038
+ grocery list consisting of, uh,
1039
+ reminding myself that I need to
1040
+
1041
+ 209
1042
+ 00:13:51,350 --> 00:13:54,910
1043
+ buy more beer, Cheetos and hummus,
1044
+ which is kind of the three,
1045
+
1046
+ 210
1047
+ 00:13:55,110 --> 00:13:59,430
1048
+ three staples of my diet.
1049
+ Um, during periods of poor nutrition.
1050
+
1051
+ 211
1052
+ 00:13:59,710 --> 00:14:03,430
1053
+ Uh, but the kind of stuff that I
1054
+ transcribe, it's just not it's not a,
1055
+
1056
+ 212
1057
+ 00:14:04,110 --> 00:14:09,470
1058
+ it's not a privacy thing that I'm
1059
+ sort of sensitive about, and, uh,
1060
+
1061
+ 213
1062
+ 00:14:09,470 --> 00:14:13,190
1063
+ I don't do anything so,
1064
+ you know, sensitive or secure,
1065
+
1066
+ 214
1067
+ 00:14:13,190 --> 00:14:16,710
1068
+ that requires air gapping.
1069
+ So, um, I looked at the pricing and
1070
+
1071
+ 215
1072
+ 00:14:16,710 --> 00:14:20,390
1073
+ especially the kind of older models,
1074
+ mini, um, some of them are very,
1075
+
1076
+ 216
1077
+ 00:14:20,390 --> 00:14:23,230
1078
+ very affordable.
1079
+ And I did a back of the I did a
1080
+
1081
+ 217
1082
+ 00:14:23,230 --> 00:14:27,270
1083
+ calculation once with ChatGPT
1084
+ and I was like, okay, this is a,
1085
+
1086
+ 218
1087
+ 00:14:27,270 --> 00:14:31,190
1088
+ this is the API price for I can't
1089
+ remember whatever the model was.
1090
+
1091
+ 219
1092
+ 00:14:31,670 --> 00:14:34,030
1093
+ Uh, let's say I just go at it
1094
+ like nonstop,
1095
+
1096
+ 220
1097
+ 00:14:34,150 --> 00:14:37,530
1098
+ which it rarely happens. Probably.
1099
+ I would say on average,
1100
+
1101
+ 221
1102
+ 00:14:37,530 --> 00:14:42,010
1103
+ I might dictate 30 to 60 minutes per
1104
+ day if I was probably summing up
1105
+
1106
+ 222
1107
+ 00:14:42,010 --> 00:14:48,610
1108
+ the emails, documents, outlines,
1109
+ um, which is a lot, but it's it's
1110
+
1111
+ 223
1112
+ 00:14:48,610 --> 00:14:50,850
1113
+ still a fairly modest amount.
1114
+ And I was like, well,
1115
+
1116
+ 224
1117
+ 00:14:50,890 --> 00:14:54,050
1118
+ some days I do go on like 1 or 2
1119
+ days where I've been.
1120
+
1121
+ 225
1122
+ 00:14:54,570 --> 00:14:58,570
1123
+ Usually when I'm like kind of out of
1124
+ the house and just have something
1125
+
1126
+ 226
1127
+ 00:14:59,210 --> 00:15:02,370
1128
+ like, I have nothing else to do.
1129
+ Like if I'm at a hospital with a
1130
+
1131
+ 227
1132
+ 00:15:02,370 --> 00:15:07,090
1133
+ newborn, uh, and you're waiting
1134
+ for like eight hours and hours
1135
+
1136
+ 228
1137
+ 00:15:07,090 --> 00:15:10,330
1138
+ for an appointment, and I would
1139
+ probably have listened to podcasts
1140
+
1141
+ 229
1142
+ 00:15:10,610 --> 00:15:14,130
1143
+ before becoming a speech fanatic.
1144
+ And I'm like, oh, wait,
1145
+
1146
+ 230
1147
+ 00:15:14,170 --> 00:15:16,490
1148
+ let me just get down.
1149
+ Let me just get these ideas out
1150
+
1151
+ 231
1152
+ 00:15:16,530 --> 00:15:19,730
1153
+ of my head.
1154
+ And that's when I'll go on my
1155
+
1156
+ 232
1157
+ 00:15:19,770 --> 00:15:21,650
1158
+ speech binges.
1159
+ But those are like once every
1160
+
1161
+ 233
1162
+ 00:15:21,650 --> 00:15:25,090
1163
+ few months, like not frequently.
1164
+ But I said, okay, let's just say
1165
+
1166
+ 234
1167
+ 00:15:25,090 --> 00:15:30,770
1168
+ if I'm gonna price out
1169
+ cloud ASR as if I was, like, dedicated
1170
+
1171
+ 235
1172
+ 00:15:30,770 --> 00:15:37,000
1173
+ every second of every waking hour to
1174
+ transcribing for some odd reason. Um.
1175
+
1176
+ 236
1177
+ 00:15:37,320 --> 00:15:39,800
1178
+ I mean, I'd have to, like,
1179
+ eat and use the toilet and,
1180
+
1181
+ 237
1182
+ 00:15:39,840 --> 00:15:42,640
1183
+ like, you know, there's only so
1184
+ many hours I'm awake for.
1185
+
1186
+ 238
1187
+ 00:15:42,640 --> 00:15:44,800
1188
+ So, like,
1189
+ let's just say a maximum of, like,
1190
+
1191
+ 239
1192
+ 00:15:44,840 --> 00:15:48,800
1193
+ 40 hours, 45 minutes in the hour.
1194
+ Then I said, all right,
1195
+
1196
+ 240
1197
+ 00:15:48,800 --> 00:15:52,720
1198
+ let's just say 50. Who knows?
1199
+ You're dictating on the toilet.
1200
+
1201
+ 241
1202
+ 00:15:52,760 --> 00:15:54,000
1203
+ We do it.
1204
+ Uh,
1205
+
1206
+ 242
1207
+ 00:15:54,000 --> 00:15:58,840
1208
+ so it could be you could just do 60.
1209
+ But whatever I did, and every day,
1210
+
1211
+ 243
1212
+ 00:15:58,880 --> 00:16:02,560
1213
+ like, you're going flat out seven
1214
+ days a week dictating non-stop.
1215
+
1216
+ 244
1217
+ 00:16:02,600 --> 00:16:06,560
1218
+ I was like, what's my monthly API
1219
+ bill going to be at this price?
1220
+
1221
+ 245
1222
+ 00:16:06,840 --> 00:16:09,240
1223
+ And it came out to like 70 or 80
1224
+ bucks.
1225
+
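The ceiling-cost estimate described here is easy to reproduce. The per-minute rate below is an assumption for illustration (OpenAI's published whisper-1 price of $0.006/minute); the speaker's exact model and rate aren't specified.

```python
# Back-of-the-envelope dictation cost: nonstop dictation at an assumed
# per-minute API rate, scaled from hours per week to a monthly bill.

RATE_PER_MINUTE = 0.006          # USD; assumed whisper-1-style pricing

def monthly_cost(hours_per_week: float, rate: float = RATE_PER_MINUTE) -> float:
    """Monthly API cost for a given weekly dictation volume."""
    weeks_per_month = 52 / 12    # ~4.33 weeks per month
    minutes = hours_per_week * 60 * weeks_per_month
    return minutes * rate

# The "flat out" scenario: ~50 hours/week of dictation.
ceiling = monthly_cost(50)       # ~$78/month, in the 70-80 range cited

# A more typical hour per day is an order of magnitude cheaper.
typical = monthly_cost(7)        # ~$10.92/month
```

Even the deliberately absurd every-waking-hour scenario lands under $100/month at this rate, which is the point being made.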
1226
+ 246
1227
+ 00:16:09,240 --> 00:16:14,200
1228
+ And I was like, well, that would be
1229
+ an extraordinary amount of dictation.
1230
+
1231
+ 247
1232
+ 00:16:14,200 --> 00:16:17,960
1233
+ And I would hope that there was
1234
+ some compelling reason,
1235
+
1236
+ 248
1237
+ 00:16:18,160 --> 00:16:22,320
1238
+ worth more than $70,
1239
+ that I embarked upon that project.
1240
+
1241
+ 249
1242
+ 00:16:22,520 --> 00:16:25,320
1243
+ Uh, so given that that's kind of the
1244
+ max point for me, I said, that's
1245
+
1246
+ 250
1247
+ 00:16:25,360 --> 00:16:29,120
1248
+ actually very, very affordable.
1249
+ Um, now you're gonna if you want
1250
+
1251
+ 251
1252
+ 00:16:29,160 --> 00:16:34,200
1253
+ to spec out the costs and you want
1254
+ to do the post-processing that I
1255
+
1256
+ 252
1257
+ 00:16:34,270 --> 00:16:37,230
1258
+ really do feel is valuable.
1259
+ Um, that's going to cost some more as
1260
+
1261
+ 253
1262
+ 00:16:37,230 --> 00:16:43,230
1263
+ well, unless you're using Gemini,
1264
+ which, uh, needless to say, is a
1265
+
1266
+ 254
1267
+ 00:16:43,230 --> 00:16:47,070
1268
+ random person sitting in Jerusalem.
1269
+ Uh, I have no affiliation,
1270
+
1271
+ 255
1272
+ 00:16:47,070 --> 00:16:51,470
1273
+ nor with Google, nor Anthropic,
1274
+ nor Gemini, nor any major tech vendor
1275
+
1276
+ 256
1277
+ 00:16:51,470 --> 00:16:56,910
1278
+ for that matter. Um, I like Gemini.
1279
+ Not so much as an everyday model.
1280
+
1281
+ 257
1282
+ 00:16:56,990 --> 00:16:59,950
1283
+ Um, it's kind of underwhelmed in
1284
+ that respect, I would say.
1285
+
1286
+ 258
1287
+ 00:17:00,350 --> 00:17:03,150
1288
+ But for multimodal,
1289
+ I think it's got a lot to offer.
1290
+
1291
+ 259
1292
+ 00:17:03,430 --> 00:17:06,990
1293
+ And I think that the transcribing
1294
+ functionality whereby it can,
1295
+
1296
+ 260
1297
+ 00:17:07,390 --> 00:17:12,270
1298
+ um, process audio with a system
1299
+ prompt and both give you
1300
+
1301
+ 261
1302
+ 00:17:12,310 --> 00:17:15,510
1303
+ transcription that's cleaned up,
1304
+ that reduces two steps to one.
1305
+
1306
+ 262
1307
+ 00:17:15,830 --> 00:17:18,750
1308
+ And that for me is a very,
1309
+ very big deal.
1310
+
1311
+ 263
1312
+ 00:17:18,750 --> 00:17:23,110
1313
+ And, uh, I feel like even Google
1314
+ hasn't really sort of thought
1315
+
1316
+ 264
1317
+ 00:17:23,110 --> 00:17:27,550
1318
+ through how useful that
1319
+ modality is and what kind of use
1320
+
1321
+ 265
1322
+ 00:17:27,550 --> 00:17:30,910
1323
+ cases you can achieve with it.
1324
+ Because I found in the course of
1325
+
1326
+ 266
1327
+ 00:17:30,910 --> 00:17:36,610
1328
+ this year just an endless list
1329
+ of really kind of system prompt,
1330
+
1331
+ 267
1332
+ 00:17:36,850 --> 00:17:41,410
1333
+ system prompt stuff that I can say,
1334
+ okay, I've used it to capture context
1335
+
1336
+ 268
1337
+ 00:17:41,410 --> 00:17:45,690
1338
+ data for AI, which is literally I
1339
+ might speak for if I wanted to have a
1340
+
1341
+ 269
1342
+ 00:17:45,690 --> 00:17:49,850
1343
+ good bank of context data about,
1344
+ who knows, my childhood.
1345
+
1346
+ 270
1347
+ 00:17:50,130 --> 00:17:53,570
1348
+ Uh, more realistically,
1349
+ maybe my career goals, uh,
1350
+
1351
+ 271
1352
+ 00:17:53,570 --> 00:17:56,130
1353
+ something that would just be,
1354
+ like, really boring to type out.
1355
+
1356
+ 272
1357
+ 00:17:56,250 --> 00:18:01,250
1358
+ So I'll just, like, sit in my car
1359
+ and record it for ten minutes.
1360
+
1361
+ 273
1362
+ 00:18:01,250 --> 00:18:04,210
1363
+ And that ten minutes,
1364
+ you get a lot of information in,
1365
+
1366
+ 274
1367
+ 00:18:04,650 --> 00:18:10,210
1368
+ um, emails, which is short text.
1369
+ Um, just there is a whole bunch.
1370
+
1371
+ 275
1372
+ 00:18:10,210 --> 00:18:13,690
1373
+ And all these workflows kind of
1374
+ require a little bit of treatment
1375
+
1376
+ 276
1377
+ 00:18:13,690 --> 00:18:17,610
1378
+ afterwards and different treatment.
1379
+ My context pipeline is kind of like
1380
+
1381
+ 277
1382
+ 00:18:17,610 --> 00:18:21,330
1383
+ just extract the bare essentials.
1384
+ So you end up with me talking very
1385
+
1386
+ 278
1387
+ 00:18:21,330 --> 00:18:24,370
1388
+ loosely about sort of what I've done
1389
+ in my career, where I've worked,
1390
+
1391
+ 279
1392
+ 00:18:24,370 --> 00:18:27,730
1393
+ where I might like to work,
1394
+ and it goes it condenses that
1395
+
1396
+ 280
1397
+ 00:18:27,730 --> 00:18:31,720
1398
+ down to very robotic language
1399
+ that is easy to chunk, parse,
1400
+
1401
+ 281
1402
+ 00:18:31,720 --> 00:18:36,080
1403
+ and maybe put into a vector database.
1404
+ Daniel has worked in technology,
1405
+
1406
+ 282
1407
+ 00:18:36,120 --> 00:18:39,760
1408
+ Daniel is a has been working in,
1409
+ you know, stuff like that.
1410
+
1411
+ 283
1412
+ 00:18:39,760 --> 00:18:43,720
1413
+ That's not how you would speak.
1414
+ Um, but I figure it's probably easier
1415
+
1416
+ 284
1417
+ 00:18:43,720 --> 00:18:48,240
1418
+ to parse for, after all, robots.
1419
+ So we've almost got to 20 minutes.
1420
+
1421
+ 285
1422
+ 00:18:48,240 --> 00:18:52,760
1423
+ And this is actually a success
1424
+ because I wasted 20 minutes of my,
1425
+
1426
+ 286
1427
+ 00:18:52,920 --> 00:18:57,000
1428
+ uh, of the evening speaking into
1429
+ a microphone, and, uh,
1430
+
1431
+ 287
1432
+ 00:18:57,040 --> 00:19:00,960
1433
+ the levels were shot and, uh, it,
1434
+ uh, it was clipping and I said,
1435
+
1436
+ 288
1437
+ 00:19:00,960 --> 00:19:03,320
1438
+ I can't really do an evaluation.
1439
+ I have to be fair.
1440
+
1441
+ 289
1442
+ 00:19:03,320 --> 00:19:07,120
1443
+ I have to give the models a
1444
+ chance to do their thing.
1445
+
1446
+ 290
1447
+ 00:19:07,640 --> 00:19:09,480
1448
+ Uh,
1449
+ what am I hoping to achieve in this?
1450
+
1451
+ 291
1452
+ 00:19:09,520 --> 00:19:12,720
1453
+ Okay, my fine tune was a dud,
1454
+ as mentioned. Deepgram STT:
1455
+
1456
+ 292
1457
+ 00:19:12,760 --> 00:19:15,640
1458
+ I'm really, really hopeful that
1459
+ this prototype will work.
1460
+
1461
+ 293
1462
+ 00:19:15,920 --> 00:19:19,080
1463
+ And it's a built in public open
1464
+ source, so anyone is welcome to
1465
+
1466
+ 294
1467
+ 00:19:19,120 --> 00:19:23,040
1468
+ use it if I make anything good.
1469
+ Um, but that was really exciting for
1470
+
1471
+ 295
1472
+ 00:19:23,040 --> 00:19:27,520
1473
+ me last night when after hours of,
1474
+ um, trying my own prototype,
1475
+
1476
+ 296
1477
+ 00:19:27,520 --> 00:19:31,350
1478
+ seeing someone just made
1479
+ something that works like that.
1480
+
1481
+ 297
1482
+ 00:19:31,390 --> 00:19:32,790
1483
+ You know,
1484
+ you're not going to have to build a
1485
+
1486
+ 298
1487
+ 00:19:32,790 --> 00:19:38,350
1488
+ custom conda environment and image.
1489
+ I have an AMD GPU, which makes
1490
+
1491
+ 299
1492
+ 00:19:38,350 --> 00:19:42,430
1493
+ things much more complicated.
1494
+ I didn't find it and I was about
1495
+
1496
+ 300
1497
+ 00:19:42,430 --> 00:19:44,110
1498
+ to give up and I said,
1499
+ all right, let me just give
1500
+
1501
+ 301
1502
+ 00:19:44,110 --> 00:19:48,870
1503
+ Deepgram's Linux thing a shot.
1504
+ And if this doesn't work, um,
1505
+
1506
+ 302
1507
+ 00:19:48,870 --> 00:19:51,270
1508
+ I'm just going to go back to
1509
+ trying to code something myself.
1510
+
1511
+ 303
1512
+ 00:19:51,630 --> 00:19:56,310
1513
+ And when I ran the script,
1514
+ I was using Claude Code to do the
1515
+
1516
+ 304
1517
+ 00:19:56,310 --> 00:20:00,150
1518
+ installation process.
1519
+ It ran the script and oh my gosh,
1520
+
1521
+ 305
1522
+ 00:20:00,190 --> 00:20:05,470
1523
+ it works just like that.
1524
+ Uh, the tricky thing for all those
1525
+
1526
+ 306
1527
+ 00:20:05,470 --> 00:20:10,430
1528
+ who wants to know all the nitty
1529
+ gritty, nitty gritty details, um, was
1530
+
1531
+ 307
1532
+ 00:20:10,430 --> 00:20:13,870
1533
+ that I don't think it was actually
1534
+ struggling with transcription, but
1535
+
1536
+ 308
1537
+ 00:20:13,870 --> 00:20:18,670
1538
+ pasting. Wayland makes life very hard,
1539
+ and I think there was something not
1540
+
1541
+ 309
1542
+ 00:20:18,670 --> 00:20:21,990
1543
+ running at the right time anyway.
1544
+ Deepgram I looked at how they
1545
+
1546
+ 310
1547
+ 00:20:21,990 --> 00:20:24,830
1548
+ actually handle that because it
1549
+ worked out of the box when other
1550
+
1551
+ 311
1552
+ 00:20:24,830 --> 00:20:29,260
1553
+ stuff didn't, and it was quite a
1554
+ clever little mechanism,
1555
+
1556
+ 312
1557
+ 00:20:29,580 --> 00:20:32,220
1558
+ and but more so than that,
1559
+ the accuracy was brilliant.
1560
+
1561
+ 313
1562
+ 00:20:32,260 --> 00:20:35,140
1563
+ Now, what am I doing here?
1564
+ This is going to be a 20 minute
1565
+
1566
+ 314
1567
+ 00:20:35,380 --> 00:20:43,100
1568
+ audio sample, and I'm I think
1569
+ I've done 1 or 2 of these before,
1570
+
1571
+ 315
1572
+ 00:20:43,100 --> 00:20:49,300
1573
+ but I did it with short, snappy voice
1574
+ notes. This is kind of long form.
1575
+
1576
+ 316
1577
+ 00:20:49,580 --> 00:20:51,860
1578
+ This actually might be a better
1579
+ approximation for what's useful
1580
+
1581
+ 317
1582
+ 00:20:51,860 --> 00:20:56,220
1583
+ to me than voice memos.
1584
+ Like I need to buy three liters
1585
+
1586
+ 318
1587
+ 00:20:56,220 --> 00:20:59,300
1588
+ of milk tomorrow, and pita bread,
1589
+ which is probably how like half
1590
+
1591
+ 319
1592
+ 00:20:59,300 --> 00:21:02,940
1593
+ my voice notes sound like
1594
+ if anyone were to, I don't know,
1595
+
1596
+ 320
1597
+ 00:21:02,980 --> 00:21:04,700
1598
+ like find my phone,
1599
+ they'd be like, this is the most
1600
+
1601
+ 321
1602
+ 00:21:04,700 --> 00:21:07,540
1603
+ boring person in the world.
1604
+ Although actually there are some
1605
+
1606
+ 322
1607
+ 00:21:07,580 --> 00:21:09,820
1608
+ like kind of, uh,
1609
+ journaling thoughts as well.
1610
+
1611
+ 323
1612
+ 00:21:09,820 --> 00:21:13,820
1613
+ But it's a lot of content like that.
1614
+ And the probably for the evaluation,
1615
+
1616
+ 324
1617
+ 00:21:13,820 --> 00:21:20,780
1618
+ the most useful thing is slightly
1619
+ obscure tech GitHub uh, hugging face
1620
+
1621
+ 325
1622
+ 00:21:21,300 --> 00:21:24,780
1623
+ not so obscure that it's not going
1624
+ to have a chance of knowing it,
1625
+
1626
+ 326
1627
+ 00:21:24,780 --> 00:21:27,760
1628
+ but hopefully sufficiently well
1629
+ known that the model should get it.
1630
+
1631
+ 327
1632
+ 00:21:28,320 --> 00:21:30,880
1633
+ I tried to do a little bit of
1634
+ speaking really fast and
1635
+
1636
+ 328
1637
+ 00:21:30,880 --> 00:21:33,320
1638
+ speaking very slowly.
1639
+ I would say in general,
1640
+
1641
+ 329
1642
+ 00:21:33,320 --> 00:21:37,000
1643
+ I've spoken, delivered this at a
1644
+ faster pace than I usually would
1645
+
1646
+ 330
1647
+ 00:21:37,040 --> 00:21:40,400
1648
+ owing to strong coffee flowing
1649
+ through my bloodstream.
1650
+
1651
+ 331
1652
+ 00:21:41,040 --> 00:21:44,320
1653
+ And the thing that I'm not going
1654
+ to get in this benchmark is
1655
+
1656
+ 332
1657
+ 00:21:44,320 --> 00:21:47,000
1658
+ background noise, which in my first
1659
+ take that I had to get rid of,
1660
+
1661
+ 333
1662
+ 00:21:47,800 --> 00:21:51,360
1663
+ my wife came in with my son
1664
+ for a good night kiss.
1665
+
1666
+ 334
1667
+ 00:21:51,560 --> 00:21:55,240
1668
+ And that actually would have
1669
+ been super helpful to get in
1670
+
1671
+ 335
1672
+ 00:21:55,240 --> 00:21:59,880
1673
+ because it was not diarised.
1674
+ Or if we had diarisation, a female,
1675
+
1676
+ 336
1677
+ 00:22:00,000 --> 00:22:02,400
1678
+ I could say I want the male
1679
+ voice and that wasn't intended
1680
+
1681
+ 337
1682
+ 00:22:02,400 --> 00:22:05,400
1683
+ for transcription.
1684
+ Um, and we're not going to get
1685
+
1686
+ 338
1687
+ 00:22:05,400 --> 00:22:07,080
1688
+ background noise like people
1689
+ honking their horns,
1690
+
1691
+ 339
1692
+ 00:22:07,080 --> 00:22:11,400
1693
+ which is something I've done in my
1694
+ main data set where I am trying to
1695
+
1696
+ 340
1697
+ 00:22:11,560 --> 00:22:15,640
1698
+ go back to some of my voice notes,
1699
+ annotate them, and run a benchmark.
1700
+
1701
+ 341
1702
+ 00:22:15,640 --> 00:22:19,080
1703
+ But this is going to be just a
1704
+ pure quick test.
1705
+
1706
+ 342
1707
+ 00:22:19,560 --> 00:22:24,000
1708
+ And as someone working on a
1709
+ voice note idea,
1710
+
1711
+ 343
1712
+ 00:22:24,000 --> 00:22:28,350
1713
+ that's my sort of end motivation.
1714
+ Besides thinking it's an
1715
+
1716
+ 344
1717
+ 00:22:28,350 --> 00:22:31,710
1718
+ absolutely outstanding technology
1719
+ that's coming to viability.
1720
+
1721
+ 345
1722
+ 00:22:31,710 --> 00:22:34,790
1723
+ And really, I know this sounds
1724
+ cheesy can actually have a very
1725
+
1726
+ 346
1727
+ 00:22:34,790 --> 00:22:38,950
1728
+ transformative effect.
1729
+ Um, it's, you know, voice technology
1730
+
1731
+ 347
1732
+ 00:22:38,990 --> 00:22:45,030
1733
+ has been life changing for, uh,
1734
+ folks living with, um, disabilities.
1735
+
1736
+ 348
1737
+ 00:22:45,750 --> 00:22:48,670
1738
+ And I think there's something
1739
+ really nice about the fact that
1740
+
1741
+ 349
1742
+ 00:22:48,670 --> 00:22:52,830
1743
+ it can also benefit, you know,
1744
+ folks who are able bodied and like,
1745
+
1746
+ 350
1747
+ 00:22:52,870 --> 00:22:59,070
1748
+ we can all in different ways, um,
1749
+ make this tech as useful as possible,
1750
+
1751
+ 351
1752
+ 00:22:59,110 --> 00:23:01,230
1753
+ regardless of the exact way that
1754
+ we're using it.
1755
+
1756
+ 352
1757
+ 00:23:01,630 --> 00:23:04,830
1758
+ Um, and I think there's something
1759
+ very powerful in that, and it can be
1760
+
1761
+ 353
1762
+ 00:23:04,830 --> 00:23:09,030
1763
+ very cool. Um, I see use potential.
1764
+ What excites me about voice tech?
1765
+
1766
+ 354
1767
+ 00:23:09,870 --> 00:23:13,670
1768
+ A lot of things, actually.
1769
+ Firstly, the fact that it's cheap
1770
+
1771
+ 355
1772
+ 00:23:13,670 --> 00:23:17,230
1773
+ and accurate, as I mentioned at
1774
+ the very start of this, um,
1775
+
1776
+ 356
1777
+ 00:23:17,230 --> 00:23:20,910
1778
+ and it's getting better and better
1779
+ with stuff like accent handling, um,
1780
+
1781
+ 357
1782
+ 00:23:20,910 --> 00:23:24,300
1783
+ I'm not sure my, my fine tune will
1784
+ actually ever come to fruition in the
1785
+
1786
+ 358
1787
+ 00:23:24,300 --> 00:23:27,980
1788
+ sense that I'll use it day to day,
1789
+ as I imagine I get like superb,
1790
+
1791
+ 359
1792
+ 00:23:27,980 --> 00:23:33,660
1793
+ flawless word error rates because I'm
1794
+ just kind of skeptical about local
1795
+
1796
+ 360
1797
+ 00:23:33,660 --> 00:23:38,220
1798
+ speech to text, as I mentioned.
1799
+ And I think the pace of innovation
1800
+
1801
+ 361
1802
+ 00:23:38,220 --> 00:23:42,180
1803
+ and improvement in the models,
1804
+ the main reasons for fine tuning from
1805
+
1806
+ 362
1807
+ 00:23:42,180 --> 00:23:46,460
1808
+ what I've seen have been people who
1809
+ are something that really blows,
1810
+
1811
+ 363
1812
+ 00:23:46,500 --> 00:23:53,060
1813
+ blows my mind about ASR is the idea
1814
+ that it's inherently alingual
1815
+
1816
+ 364
1817
+ 00:23:53,060 --> 00:23:59,220
1818
+ or multilingual phonetic based.
1819
+ So for folks who speak very
1820
+
1821
+ 365
1822
+ 00:23:59,260 --> 00:24:02,340
1823
+ obscure languages, there may
1824
+ be a paucity of
1825
+
1826
+ 366
1827
+ 00:24:02,340 --> 00:24:05,620
1828
+ training data or almost none at all,
1829
+ and therefore the accuracy is
1830
+
1831
+ 367
1832
+ 00:24:05,620 --> 00:24:10,780
1833
+ significantly reduced or folks
1834
+ in very critical environments.
1835
+
1836
+ 368
1837
+ 00:24:10,820 --> 00:24:13,500
1838
+ I know there are.
1839
+ This is used extensively in medical
1840
+
1841
+ 369
1842
+ 00:24:13,500 --> 00:24:18,260
1843
+ transcription and dispatcher work as,
1844
+ um, you know, the call centers who
1845
+
1846
+ 370
1847
+ 00:24:18,260 --> 00:24:22,610
1848
+ send out ambulances, etc., where
1849
+ accuracy is absolutely paramount.
1850
+
1851
+ 371
1852
+ 00:24:22,610 --> 00:24:26,170
1853
+ And in the case of doctors,
1854
+ radiologists, they might be using
1855
+
1856
+ 372
1857
+ 00:24:26,170 --> 00:24:29,730
1858
+ very specialized vocab all the time.
1859
+ So those are kind of the main
1860
+
1861
+ 373
1862
+ 00:24:29,730 --> 00:24:31,650
1863
+ two things.
1864
+ And I'm not sure that really just for
1865
+
1866
+ 374
1867
+ 00:24:31,650 --> 00:24:37,410
1868
+ trying to make it better on a few
1869
+ random tech words with my slightly.
1870
+
1871
+ 375
1872
+ 00:24:37,450 --> 00:24:41,370
1873
+ I mean, I have an accent, but like,
1874
+ not, you know, an accent that a few
1875
+
1876
+ 376
1877
+ 00:24:41,410 --> 00:24:47,330
1878
+ other million people have. Ish.
1879
+ I'm not sure that my little fine
1880
+
1881
+ 377
1882
+ 00:24:47,330 --> 00:24:52,370
1883
+ tune is actually going to, like, deliver the
1884
+ bump in word error rate reduction.
1885
+
1886
+ 378
1887
+ 00:24:52,370 --> 00:24:54,690
1888
+ If I ever actually figure out how
1889
+ to do it and get it up to the
1890
+
1891
+ 379
1892
+ 00:24:54,690 --> 00:24:58,730
1893
+ cloud by the time I've done that.
1894
+ I suspect that the next
1895
+
1896
+ 380
1897
+ 00:24:58,730 --> 00:25:01,530
1898
+ generation of ASR will just be
1899
+ so good that it will kind of be.
1900
+
1901
+ 381
1902
+ 00:25:02,050 --> 00:25:03,890
1903
+ Ah, well,
1904
+ that would be cool if it worked out,
1905
+
1906
+ 382
1907
+ 00:25:03,890 --> 00:25:08,850
1908
+ but I'll just use this instead.
1909
+ So that's going to be it for today's
1910
+
1911
+ 383
1912
+ 00:25:08,850 --> 00:25:14,250
1913
+ episode of, uh, voice training data.
1914
+ Single long shot evaluation.
1915
+
1916
+ 384
1917
+ 00:25:14,530 --> 00:25:17,450
1918
+ Who am I going to compare?
1919
+ Whisper is always good as a
1920
+
1921
+ 385
1922
+ 00:25:17,450 --> 00:25:20,720
1923
+ benchmark, but I'm more
1924
+ interested in seeing Whisper
1925
+
1926
+ 386
1927
+ 00:25:20,720 --> 00:25:25,200
1928
+ head to head with two things,
1929
+ really. One is whisper variance.
1930
+
1931
+ 387
1932
+ 00:25:25,200 --> 00:25:30,000
1933
+ So you've got these projects like
1934
+ faster-whisper, distil-whisper.
1935
+
1936
+ 388
1937
+ 00:25:30,000 --> 00:25:31,760
1938
+ It's a bit confusing.
1939
+ There's a whole bunch of them
1940
+
1941
+ 389
1942
+ 00:25:32,040 --> 00:25:34,920
1943
+ and the emerging ASRs,
1944
+ which are also a thing.
1945
+
1946
+ 390
1947
+ 00:25:35,320 --> 00:25:37,800
1948
+ My intention for this is I'm not
1949
+ sure I'm going to have the time
1950
+
1951
+ 391
1952
+ 00:25:37,800 --> 00:25:41,760
1953
+ in any point in the foreseeable
1954
+ future to go back through this whole
1955
+
1956
+ 392
1957
+ 00:25:41,760 --> 00:25:46,680
1958
+ episode and create a proper source
1959
+ of truth or fix
1960
+
1961
+ 393
1962
+ 00:25:47,440 --> 00:25:51,800
1963
+ everything. Might do it if I can
1964
+ get one transcription that's
1965
+
1966
+ 394
1967
+ 00:25:51,800 --> 00:25:56,840
1968
+ sufficiently close to perfection.
1969
+ But what I would actually love
1970
+
1971
+ 395
1972
+ 00:25:56,840 --> 00:25:59,920
1973
+ to do on Hugging Face I think
1974
+ would be a great.
1975
+
1976
+ 396
1977
+ 00:25:59,920 --> 00:26:03,680
1978
+ Probably how I might visualize this
1979
+ is having the audio waveform play,
1980
+
1981
+ 397
1982
+ 00:26:04,160 --> 00:26:09,920
1983
+ and then have the transcript for each
1984
+ model below it, and maybe even a,
1985
+
1986
+ 398
1987
+ 00:26:10,600 --> 00:26:15,240
1988
+ um, like, you know, two scale and
1989
+ maybe even a local one as well,
1990
+
1991
+ 399
1992
+ 00:26:15,280 --> 00:26:21,820
1993
+ like local whisper versus open
1994
+ AI API, Etc. and, um, I can then
1995
+
1996
+ 400
1997
+ 00:26:21,820 --> 00:26:24,500
1998
+ actually listen back to segments
1999
+ or anyone who wants to can listen
2000
+
2001
+ 401
2002
+ 00:26:24,500 --> 00:26:29,540
2003
+ back to segments of this recording
2004
+ and see where a particular model
2005
+
2006
+ 402
2007
+ 00:26:29,580 --> 00:26:33,060
2008
+ struggled and others didn't, as well
2009
+ as the sort of headline finding
2010
+
2011
+ 403
2012
+ 00:26:33,100 --> 00:26:36,900
2013
+ of which had the best, uh, wer.
2014
+ But that would require the source
2015
+
2016
+ 404
2017
+ 00:26:36,900 --> 00:26:40,140
2018
+ of truth. Okay. That's it.
2019
+ Hope this was, I don't know,
2020
+
2021
+ 405
2022
+ 00:26:40,300 --> 00:26:43,580
2023
+ maybe useful for other folks
2024
+ interested in stuff you want to see.
2025
+
2026
+ 406
2027
+ 00:26:44,060 --> 00:26:48,220
2028
+ I always feel think I've just said
2029
+ something I didn't intend to say.
2030
+
2031
+ 407
2032
+ 00:26:48,780 --> 00:26:51,140
2033
+ I said for those, listen carefully.
2034
+ Including, hopefully,
2035
+
2036
+ 408
2037
+ 00:26:51,140 --> 00:26:54,180
2038
+ the models themselves.
2039
+ This has been myself,
2040
+
2041
+ 409
2042
+ 00:26:54,220 --> 00:26:58,020
2043
+ Daniel Rosehill, for more, um,
2044
+ jumbled repositories about my,
2045
+
2046
+ 410
2047
+ 00:26:58,060 --> 00:27:00,940
2048
+ uh, roving interest in AI,
2049
+ but particularly Agentic,
2050
+
2051
+ 411
2052
+ 00:27:01,300 --> 00:27:05,460
2053
+ MCP and voice tech.
2054
+ Uh, you can find me on GitHub.
2055
+
2056
+ 412
2057
+ 00:27:05,940 --> 00:27:11,260
2058
+ Hugging face. Where else?
2059
+ Daniel, which is my personal website,
2060
+
2061
+ 413
2062
+ 00:27:11,260 --> 00:27:15,380
2063
+ as well as this podcast whose
2064
+ name I sadly cannot remember.
2065
+
2066
+ 414
2067
+ 00:27:15,820 --> 00:27:17,540
2068
+ Until next time.
2069
+ Thanks for listening.
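
With a ground-truth SRT now in the repo, the "headline finding of which had the best WER" the episode mentions can be sketched in a few lines. This is a hypothetical helper, not code from this commit: `srtToText` and `wer` are assumed names, and WER here is the standard word-level edit distance (substitutions + deletions + insertions) divided by the reference word count.

```javascript
// Strip SRT cue indices and timestamp lines, keeping only the spoken text.
function srtToText(srt) {
  return srt
    .split(/\r?\n/)
    .filter(
      (line) =>
        line.trim() !== "" &&        // blank separators
        !/^\d+$/.test(line.trim()) && // cue index lines like "382"
        !/-->/.test(line)             // timestamp lines
    )
    .join(" ");
}

// Word error rate via classic dynamic-programming edit distance over words.
function wer(reference, hypothesis) {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  // d[i][j] = edit distance between first i reference words, first j hypothesis words.
  const d = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,     // deletion
        d[i][j - 1] + 1,     // insertion
        d[i - 1][j - 1] + cost // substitution or match
      );
    }
  }
  return d[ref.length][hyp.length] / ref.length;
}
```

Run `srtToText` over `truth_1.srt` and over each model's SRT, then compare with `wer` to get the per-model headline number.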
style.css CHANGED
@@ -1,28 +1,198 @@
+*,
+*::before,
+*::after {
+  box-sizing: border-box;
+}
+
+:root {
+  color-scheme: light;
+  font-family: "Inter", "Segoe UI", system-ui, -apple-system, BlinkMacSystemFont, "Arial", sans-serif;
+  background: #f5f6fa;
+  color: #1f2937;
+}
+
 body {
-  padding: 2rem;
-  font-family: -apple-system, BlinkMacSystemFont, "Arial", sans-serif;
+  margin: 0;
+  background: radial-gradient(circle at top, #ffffff, #eef2ff 60%);
+  min-height: 100vh;
+  padding: 2.5rem clamp(1rem, 4vw, 4rem);
+}
+
+.app {
+  max-width: 960px;
+  margin: 0 auto;
+  display: grid;
+  gap: 2rem;
 }
 
 h1 {
-  font-size: 16px;
-  margin-top: 0;
+  font-weight: 700;
+  margin-bottom: 0.25rem;
 }
 
 p {
-  color: rgb(107, 114, 128);
-  font-size: 15px;
-  margin-bottom: 10px;
-  margin-top: 5px;
+  margin: 0;
+  color: #4b5563;
+  line-height: 1.6;
 }
 
-.card {
-  max-width: 620px;
-  margin: 0 auto;
-  padding: 16px;
-  border: 1px solid lightgray;
-  border-radius: 16px;
+.hero {
+  display: grid;
+  gap: 1.5rem;
+  padding: 1.5rem;
+  border-radius: 24px;
+  border: 1px solid rgba(255, 255, 255, 0.6);
+  background: rgba(255, 255, 255, 0.85);
+  box-shadow: 0 25px 60px rgba(15, 23, 42, 0.08);
+}
+
+.audio-shell {
+  background: rgba(15, 23, 42, 0.9);
+  border-radius: 18px;
+  padding: 1rem;
+  box-shadow: inset 0 0 0 1px rgba(255, 255, 255, 0.1);
+  display: grid;
+  gap: 0.75rem;
+}
+
+audio {
+  width: 100%;
+  filter: drop-shadow(0 5px 20px rgba(0, 0, 0, 0.3));
+}
+
+#waveform {
+  width: 100%;
+  height: 140px;
+  border-radius: 12px;
+  background: rgba(255, 255, 255, 0.05);
 }
 
-.card p:last-child {
+.transcripts {
+  display: grid;
+  gap: 1.75rem;
+}
+
+#reference-track {
+  display: grid;
+}
+
+.models-grid {
+  display: grid;
+  gap: 1.25rem;
+}
+
+.track {
+  background: #ffffff;
+  border-radius: 20px;
+  padding: 1.25rem;
+  box-shadow: 0 20px 40px rgba(15, 23, 42, 0.07);
+  border: 1px solid rgba(31, 41, 55, 0.08);
+}
+
+.track--error {
+  border: 1px dashed rgba(239, 68, 68, 0.6);
+  background: rgba(254, 242, 242, 0.9);
+  box-shadow: none;
+}
+
+.track--emphasis {
+  border: 2px solid #00b894;
+  box-shadow: 0 25px 45px rgba(0, 184, 148, 0.15);
+}
+
+.track header {
+  display: flex;
+  flex-wrap: wrap;
+  align-items: center;
+  justify-content: space-between;
+  gap: 0.75rem;
+  margin-bottom: 1rem;
+}
+
+.track h2 {
+  font-size: 1.25rem;
+  margin: 0;
+}
+
+.badge {
+  background: rgba(31, 41, 55, 0.05);
+  color: #1f2937;
+  padding: 0.2rem 0.75rem;
+  border-radius: 999px;
+  font-size: 0.85rem;
+}
+
+.badge--error {
+  background: rgba(239, 68, 68, 0.15);
+  color: #b91c1c;
+}
+
+.track-error {
+  margin: 0;
+  color: #b91c1c;
+  font-weight: 500;
+}
+
+.track-body {
+  display: grid;
+  gap: 0.5rem;
+  max-height: 300px;
+  overflow: auto;
+  scrollbar-width: thin;
+}
+
+.track-body::-webkit-scrollbar {
+  width: 6px;
+}
+
+.track-body::-webkit-scrollbar-thumb {
+  background: rgba(31, 41, 55, 0.25);
+  border-radius: 999px;
+}
+
+.segment {
+  padding: 0.85rem 1rem;
+  border-radius: 14px;
+  background: rgba(15, 23, 42, 0.03);
+  border: 1px solid rgba(0, 0, 0, 0.04);
+  transition: background 0.25s ease, transform 0.25s ease, border 0.25s ease;
+}
+
+.segment p {
+  margin-top: 0.35rem;
   margin-bottom: 0;
+  color: #111827;
+}
+
+.segment-time {
+  display: inline-flex;
+  align-items: center;
+  gap: 0.35rem;
+  font-family: "JetBrains Mono", "SFMono-Regular", Consolas, monospace;
+  font-size: 0.85rem;
+  color: rgba(55, 65, 81, 0.9);
+}
+
+.segment.is-active {
+  background: rgba(64, 112, 244, 0.08);
+  border-color: var(--accent, rgba(64, 112, 244, 0.5));
+  box-shadow: 0 10px 20px rgba(64, 112, 244, 0.15);
+  transform: translateY(-2px);
+}
+
+.track--emphasis .segment.is-active {
+  background: rgba(0, 184, 148, 0.1);
+  box-shadow: 0 12px 24px rgba(0, 184, 148, 0.2);
+  border-color: rgba(0, 184, 148, 0.6);
+}
+
+@media (min-width: 720px) {
+  .hero {
+    grid-template-columns: 1.1fr 0.9fr;
+    align-items: center;
+  }
+
+  .models-grid {
+    grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
+  }
 }
transcripts.js ADDED
The diff for this file is too large to render. See raw diff
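
The stylesheet above animates `.segment.is-active`, which implies transcripts.js (too large to render here) toggles that class as the audio plays. A minimal sketch of what that sync logic likely looks like — function names and the `{start, end}` cue shape are assumptions, not code from this commit:

```javascript
// Convert an SRT timestamp like "00:25:03,890" into seconds.
function srtTimeToSeconds(stamp) {
  const [h, m, rest] = stamp.split(":");
  const [s, ms] = rest.split(",");
  return Number(h) * 3600 + Number(m) * 60 + Number(s) + Number(ms) / 1000;
}

// Pure helper: index of the cue covering `time`, or -1 if between cues.
function activeSegmentIndex(segments, time) {
  return segments.findIndex((seg) => time >= seg.start && time < seg.end);
}

// Browser wiring (hypothetical element lookups, mirroring the CSS classes):
// audio.addEventListener("timeupdate", () => {
//   const i = activeSegmentIndex(cues, audio.currentTime);
//   track.querySelectorAll(".segment").forEach((el, j) =>
//     el.classList.toggle("is-active", j === i)
//   );
// });
```

Keeping the time math in pure helpers like these makes the per-track highlight trivially testable outside the browser.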