Spaces:

danielrosehill
/

STT-Comparison

Running

App Files Files Community

STT-Comparison / srt-out /nova3.srt

danielrosehill

Fix SRT timestamp alignment with ground truth

0aa8adc about 2 months ago

raw

history blame contribute delete

44.5 kB

	1
	00:00:00,000 --> 00:00:06,160
	Hello and welcome to a audio dataset consisting of one

	2
	00:00:06,160 --> 00:00:08,320
	single episode of a nonexistent podcast.

	3
	00:00:08,720 --> 00:00:12,800
	Or it I may append this to a podcast that

	4
	00:00:12,800 --> 00:00:18,734
	I set up recently regarding my with my thoughts on

	5
	00:00:18,735 --> 00:00:20,735
	speech tech and A.

	6
	00:00:20,735 --> 00:00:21,134
	I.

	7
	00:00:21,134 --> 00:00:22,734
	In particular, more A.

	8
	00:00:22,734 --> 00:00:22,974
	I.

	9
	00:00:22,974 --> 00:00:23,855
	And generative A.

	10
	00:00:23,855 --> 00:00:24,015
	I.

	11
	00:00:24,015 --> 00:00:26,414
	I would I would say.

	12
	00:00:26,734 --> 00:00:30,789
	But in any event, the purpose of this voice recording

	13
	00:00:30,789 --> 00:00:35,510
	is actually to create a lengthy voice sample for a

	14
	00:00:35,510 --> 00:00:38,870
	quick evaluation, a back of the envelope evaluation, they might

	15
	00:00:38,870 --> 00:00:41,349
	say, for different speech attacks models.

	16
	00:00:41,349 --> 00:00:43,865
	I'm doing this because I thought I'd made a great

	17
	00:00:43,865 --> 00:00:47,704
	breakthrough in my journey with speech tech and that was

	18
	00:00:47,704 --> 00:00:51,305
	succeeding in the elusive task of fine tuning whisper.

	19
	00:00:51,624 --> 00:00:56,344
	Whisper is, and I'm to just talk, I'm trying to

	20
	00:00:55,749 --> 00:00:56,709
	mix up.

	21
	00:00:56,789 --> 00:01:00,310
	I'm going to try a few different styles of speaking

	22
	00:01:00,310 --> 00:01:02,789
	whisper something at some points as well.

	23
	00:01:03,270 --> 00:01:06,710
	And I'll go back to speaking loud in in different

	24
	00:01:06,710 --> 00:01:08,950
	parts are going to sound really like a crazy person

	25
	00:01:08,950 --> 00:01:12,344
	because I'm also going to try to speak at different

	26
	00:01:12,904 --> 00:01:17,945
	pitches and cadences in order to really try to push

	27
	00:01:18,264 --> 00:01:21,065
	a speech to text model through its paces, which is

	28
	00:01:21,065 --> 00:01:24,529
	trying to make sense of is this guy just rambling

	29
	00:01:24,529 --> 00:01:29,969
	on incoherently in one long sentence or are these just

	30
	00:01:29,969 --> 00:01:36,370
	actually a series of step standalone, standalone, standalone sentences?

	31
	00:01:36,370 --> 00:01:38,050
	And how is it going to handle step alone?

	32
	00:01:38,050 --> 00:01:38,690
	That's not a word.

	33
	00:01:39,624 --> 00:01:41,945
	What happens when you use speech to text and you

	34
	00:01:41,945 --> 00:01:43,304
	use a fake word?

	35
	00:01:43,304 --> 00:01:45,704
	And then you're like, wait, that's not actually that word

	36
	00:01:45,704 --> 00:01:46,585
	doesn't exist.

	37
	00:01:46,904 --> 00:01:48,504
	How does AI handle that?

	38
	00:01:48,504 --> 00:01:53,670
	And these and more are all the questions that I'm

	39
	00:01:53,670 --> 00:01:55,670
	seeking to answer in this training data.

	40
	00:01:55,749 --> 00:01:58,469
	Now, why was I trying to fine tune Whisper?

	41
	00:01:58,469 --> 00:01:59,670
	And what is Whisper?

	42
	00:01:59,670 --> 00:02:02,630
	As I said, I'm going to try to record this

	43
	00:02:02,630 --> 00:02:06,564
	at a couple of different levels of technicality for folks

	44
	00:02:06,564 --> 00:02:11,684
	who are in the normal world and not totally stuck

	45
	00:02:11,684 --> 00:02:13,684
	down the rabbit hole of AI, which you have to

	46
	00:02:13,684 --> 00:02:17,605
	say is a really wonderful rabbit hole to be done.

	47
	00:02:17,764 --> 00:02:20,839
	It's a really interesting area and speech and voice tech

	48
	00:02:20,839 --> 00:02:24,279
	is is the aspect of it that I find actually

	49
	00:02:24,279 --> 00:02:27,159
	most I'm not sure I would say the most interesting

	50
	00:02:27,159 --> 00:02:30,679
	because there's just so much that is fascinating in AI.

	51
	00:02:31,320 --> 00:02:34,054
	But the most that I find the most personally transformative

	52
	00:02:34,054 --> 00:02:38,454
	in terms of the impact that it's had on my

	53
	00:02:38,454 --> 00:02:41,174
	daily work life and productivity and how I sort of

	54
	00:02:41,174 --> 00:02:41,815
	work.

	55
	00:02:42,855 --> 00:02:47,420
	I'm persevering hard with the task of trying to get

	56
	00:02:47,420 --> 00:02:50,859
	a good solution working for Linux, which if anyone actually

	57
	00:02:50,859 --> 00:02:52,859
	does listen to this, not just for the training data

	58
	00:02:52,859 --> 00:02:56,620
	and for the actual content, is sparked.

	59
	00:02:56,620 --> 00:02:59,900
	I had, besides the fine tune not working, well that

	60
	00:02:59,900 --> 00:03:01,305
	was the failure.

	61
	00:03:02,424 --> 00:03:06,665
	I used Claude code because one thinks these days that

	62
	00:03:06,665 --> 00:03:13,200
	there is nothing short of solving, you know, the the

	63
	00:03:13,200 --> 00:03:17,519
	reason of life or something that clause and agentic AI

	64
	00:03:17,519 --> 00:03:19,600
	can't do, which is not really the case.

	65
	00:03:19,600 --> 00:03:23,119
	It does seem that way sometimes, but it fails a

	66
	00:03:23,119 --> 00:03:23,679
	lot as well.

	67
	00:03:23,679 --> 00:03:26,559
	And this is one of those instances where last week

	68
	00:03:26,559 --> 00:03:30,744
	I put together an hour of voice training data, basically

	69
	00:03:30,744 --> 00:03:33,385
	speaking just random things for three minutes.

	70
	00:03:35,385 --> 00:03:38,024
	It was actually kind of tedious because the texts were

	71
	00:03:38,024 --> 00:03:38,584
	really weird.

	72
	00:03:38,584 --> 00:03:41,290
	Some of them were, it was like it was AI

	73
	00:03:41,290 --> 00:03:42,170
	generated.

	74
	00:03:42,489 --> 00:03:44,809
	I tried before to read Sherlock Holmes for an hour

	75
	00:03:44,809 --> 00:03:47,609
	and I just couldn't, I was so bored after ten

	76
	00:03:47,609 --> 00:03:50,489
	minutes that I was like, okay, no, I'm just gonna

	77
	00:03:50,489 --> 00:03:51,850
	have to find something else to read.

	78
	00:03:51,850 --> 00:03:58,204
	So I used a created with AI Studio, VibeCoded, a

	79
	00:03:58,204 --> 00:04:03,084
	synthetic text generator which actually I thought was probably a

	80
	00:04:03,084 --> 00:04:05,165
	better way of doing it because it would give me

	81
	00:04:05,165 --> 00:04:08,989
	more short samples with more varied content.

	82
	00:04:08,989 --> 00:04:11,630
	So I was like, okay, give me a voice note

	83
	00:04:11,630 --> 00:04:14,829
	like I'm recording an email, give me a short story

	84
	00:04:14,829 --> 00:04:18,109
	to read, give me prose to read.

	85
	00:04:18,109 --> 00:04:20,554
	So I came up with all these different things and

	86
	00:04:20,554 --> 00:04:22,634
	they added a little timer to it so I could

	87
	00:04:22,634 --> 00:04:24,875
	see how close I was to one hour.

	88
	00:04:25,835 --> 00:04:29,035
	And I spent like an hour one afternoon or probably

	89
	00:04:29,035 --> 00:04:33,035
	two hours by the time you do retakes and whatever

	90
	00:04:33,035 --> 00:04:36,089
	because you want to it gave me a source of

	91
	00:04:36,089 --> 00:04:39,929
	truth which I'm not sure if that's the scientific way

	92
	00:04:39,929 --> 00:04:44,089
	to approach this topic of gathering training data but I

	93
	00:04:44,089 --> 00:04:45,369
	thought made sense.

	94
	00:04:46,410 --> 00:04:49,384
	I have a lot of audio data from recording voice

	95
	00:04:49,384 --> 00:04:53,464
	notes which I've also kind of used, been experimenting with

	96
	00:04:53,464 --> 00:04:54,984
	using for a different purpose.

	97
	00:04:55,304 --> 00:04:58,665
	Slightly different annotating task types.

	98
	00:04:58,665 --> 00:05:03,170
	It's more a text classification experiment or Well, it's more

	99
	00:05:03,170 --> 00:05:03,730
	than that actually.

	100
	00:05:03,730 --> 00:05:04,929
	I'm working on a voice app.

	101
	00:05:04,929 --> 00:05:09,249
	So it's a prototype, I guess, is really more accurate.

	102
	00:05:11,329 --> 00:05:13,889
	But you can do that and you can work backwards.

	103
	00:05:13,889 --> 00:05:18,274
	Listen back to a voice note and you painfully go

	104
	00:05:18,274 --> 00:05:21,394
	through one of those transcribing, where you start and stop

	105
	00:05:21,394 --> 00:05:23,554
	and scrub around it and you fix the errors, but

	106
	00:05:23,554 --> 00:05:25,795
	it's really, really pouring to do that.

	107
	00:05:26,035 --> 00:05:27,954
	So I thought it would be less tedious in the

	108
	00:05:27,954 --> 00:05:31,634
	long term if I just recorded the source of truth.

	109
	00:05:31,989 --> 00:05:34,309
	So it gave me these three minutes snippets.

	110
	00:05:34,309 --> 00:05:37,429
	I recorded them and saved an MP3 and a TXT

	111
	00:05:37,670 --> 00:05:40,230
	in the same folder and I created an error that

	112
	00:05:40,230 --> 00:05:40,869
	data.

	113
	00:05:41,910 --> 00:05:44,790
	So I was very hopeful, quietly, a little bit hopeful

	114
	00:05:44,790 --> 00:05:46,949
	that I would be able, that I could actually fine

	115
	00:05:46,949 --> 00:05:47,670
	tune Whisper.

	116
	00:05:48,285 --> 00:05:51,005
	I want to fine tune Whisper because when I got

	117
	00:05:51,005 --> 00:05:54,924
	into voice tech last November, my wife was in the

	118
	00:05:54,924 --> 00:05:57,165
	US and I was alone at home.

	119
	00:05:57,244 --> 00:06:00,924
	And when crazy people like me do really wild things

	120
	00:06:00,924 --> 00:06:03,900
	like use voice to tech technology.

	121
	00:06:03,900 --> 00:06:06,859
	That was basically when I started doing it, I didn't

	122
	00:06:06,859 --> 00:06:09,500
	feel like a crazy person speaking to myself.

	123
	00:06:09,900 --> 00:06:12,700
	And my expectations weren't that high.

	124
	00:06:13,100 --> 00:06:17,605
	I'd used speech tech now and again, tried it out.

	125
	00:06:17,605 --> 00:06:18,804
	I was like, it'd be really cool if you could

	126
	00:06:18,804 --> 00:06:22,324
	just like speak into your computer and whatever I tried

	127
	00:06:22,324 --> 00:06:25,845
	out that had Linux support was just, it was not

	128
	00:06:25,845 --> 00:06:26,725
	good basically.

	129
	00:06:27,285 --> 00:06:29,444
	And this blew me away from the first go.

	130
	00:06:29,444 --> 00:06:32,259
	I mean, it wasn't one hundred percent accurate out of

	131
	00:06:32,259 --> 00:06:34,420
	the box and it took work, but it was good

	132
	00:06:34,420 --> 00:06:36,739
	enough that there was a solid foundation and it kind

	133
	00:06:36,739 --> 00:06:41,059
	of passed that pivot point that it's actually worth doing

	134
	00:06:41,059 --> 00:06:41,540
	this.

	135
	00:06:41,859 --> 00:06:43,859
	You know, there's a point where it's so like, the

	136
	00:06:43,859 --> 00:06:46,405
	transcript is you don't have to get one hundred percent

	137
	00:06:46,405 --> 00:06:49,445
	accuracy for it to be worth your time for speech

	138
	00:06:49,445 --> 00:06:51,845
	to text to be a worthwhile addition to your productivity.

	139
	00:06:51,845 --> 00:06:53,605
	But you do need to get above, let's say, I

	140
	00:06:53,605 --> 00:06:55,045
	don't know, eighty five percent.

	141
	00:06:55,525 --> 00:06:58,725
	If it's sixty percent or fifty percent, you inevitably say,

	142
	00:06:58,960 --> 00:07:00,239
	Screw it, I'll just type it.

	143
	00:07:00,239 --> 00:07:03,600
	Because you end up missing errors in the transcript and

	144
	00:07:03,600 --> 00:07:04,960
	it becomes actually worse.

	145
	00:07:04,960 --> 00:07:06,640
	You end up in a worse position than you started

	146
	00:07:06,640 --> 00:07:06,960
	with it.

	147
	00:07:06,960 --> 00:07:08,160
	That's been my experience.

	148
	00:07:08,480 --> 00:07:12,400
	So I was like, Oh, this is actually really, really

	149
	00:07:12,400 --> 00:07:12,880
	good now.

	150
	00:07:12,880 --> 00:07:13,600
	How did that happen?

	151
	00:07:13,600 --> 00:07:17,915
	And the answer is ASR, Whisper being open sourced and

	152
	00:07:18,634 --> 00:07:21,514
	the transformer architecture, if you want to go back to

	153
	00:07:21,514 --> 00:07:26,314
	the underpinnings, which really blows my mind and it's on

	154
	00:07:26,314 --> 00:07:29,750
	my list to read through that paper.

	155
	00:07:30,309 --> 00:07:35,910
	All you need is attention as attentively as can be

	156
	00:07:35,910 --> 00:07:39,270
	done with my limited brain because it's super super high

	157
	00:07:39,270 --> 00:07:42,965
	level stuff, super advanced stuff, mean.

	158
	00:07:43,205 --> 00:07:48,004
	That I think of all the things that are fascinating

	159
	00:07:48,004 --> 00:07:52,484
	about the sudden rise in AI and the dramatic capabilities,

	160
	00:07:53,259 --> 00:07:55,339
	I find it fascinating that few people are like, hang

	161
	00:07:55,339 --> 00:07:58,220
	on, you've got this thing that can speak to you

	162
	00:07:58,220 --> 00:07:59,980
	like a chatbot, an LLM.

	163
	00:08:00,540 --> 00:08:02,780
	And then you've got image generation.

	164
	00:08:02,780 --> 00:08:03,100
	Okay.

	165
	00:08:03,100 --> 00:08:07,020
	So firstly, two things on the surface have nothing in

	166
	00:08:07,020 --> 00:08:07,339
	common.

	167
	00:08:08,285 --> 00:08:11,964
	So how did that just happen all at the same

	168
	00:08:11,964 --> 00:08:12,205
	time?

	169
	00:08:12,205 --> 00:08:15,884
	And then when you extend that further, you're like, Suno.

	170
	00:08:15,884 --> 00:08:19,405
	You can sing a song and AI will come up

	171
	00:08:19,405 --> 00:08:21,085
	with an instrumental.

	172
	00:08:21,405 --> 00:08:23,405
	And then you've got Whisper and you're like, Wait a

	173
	00:08:23,405 --> 00:08:23,645
	second.

	174
	00:08:24,020 --> 00:08:28,100
	How did all this stuff If it's all AI, there

	175
	00:08:28,100 --> 00:08:29,460
	has to be some commonality.

	176
	00:08:29,460 --> 00:08:35,059
	Otherwise, are totally different technologies on the surface of it.

	177
	00:08:35,140 --> 00:08:39,304
	And the transformer architecture is, as far as I know,

	178
	00:08:39,304 --> 00:08:40,184
	the answer.

	179
	00:08:40,184 --> 00:08:42,905
	And I can't even say, can't even pretend that I

	180
	00:08:42,905 --> 00:08:47,304
	really understand what the transformer architecture means in-depth.

	181
	00:08:47,304 --> 00:08:49,785
	But I have scanned this and as I said, I

	182
	00:08:49,785 --> 00:08:52,799
	want to print it and really kind of think over

	183
	00:08:52,799 --> 00:08:54,080
	it at some point.

	184
	00:08:54,799 --> 00:08:58,000
	And I'll probably feel bad about myself, I think, because

	185
	00:08:58,000 --> 00:08:59,599
	weren't those guys in twenties?

	186
	00:09:00,240 --> 00:09:01,760
	Like, that's crazy.

	187
	00:09:02,080 --> 00:09:06,080
	I think I asked ChatGPT once who wrote that paper

	188
	00:09:06,465 --> 00:09:09,184
	and how old were they when it was published in

	189
	00:09:09,184 --> 00:09:09,745
	ArcSiv?

	190
	00:09:09,745 --> 00:09:13,025
	And I was expecting like, I don't know, what do

	191
	00:09:13,025 --> 00:09:13,505
	you imagine?

	192
	00:09:13,505 --> 00:09:15,585
	I personally imagine kind of like, you you have these

	193
	00:09:15,585 --> 00:09:19,665
	breakthroughs during COVID and things like that, where like these

	194
	00:09:19,665 --> 00:09:22,549
	kind of really obscure scientists who are in their 50s

	195
	00:09:22,549 --> 00:09:26,790
	and they've just kind of been laboring in labs and

	196
	00:09:26,790 --> 00:09:29,750
	wearily in writing and publishing in kind of obscure academic

	197
	00:09:29,750 --> 00:09:30,630
	publications.

	198
	00:09:30,790 --> 00:09:33,589
	And they finally hit a big or win a Nobel

	199
	00:09:33,589 --> 00:09:36,155
	Prize and then their household names.

	200
	00:09:36,554 --> 00:09:38,554
	So that was kind of what I had in mind.

	201
	00:09:38,554 --> 00:09:42,074
	That was the mental image I'd formed of the birth

	202
	00:09:42,074 --> 00:09:42,875
	of ArcSim.

	203
	00:09:42,875 --> 00:09:45,515
	Like I wasn't expecting twenty somethings in San Francisco.

	204
	00:09:45,515 --> 00:09:48,714
	I thought that was both very funny, very cool, and

	205
	00:09:48,714 --> 00:09:49,995
	actually kind of inspiring.

	206
	00:09:50,474 --> 00:09:55,150
	It's nice to think that people who just you might

	207
	00:09:55,150 --> 00:09:58,429
	put them in the kind of milieu or bubble or

	208
	00:09:58,429 --> 00:10:02,589
	world that you are in incredibly in through a series

	209
	00:10:02,589 --> 00:10:05,755
	of connections that are coming up with such literally world

	210
	00:10:05,755 --> 00:10:07,755
	changing innovations.

	211
	00:10:07,834 --> 00:10:11,194
	So that was I thought anyway, that's that that was

	212
	00:10:11,194 --> 00:10:11,755
	cool.

	213
	00:10:12,155 --> 00:10:12,474
	Okay.

	214
	00:10:12,474 --> 00:10:13,354
	Voice training data.

	215
	00:10:13,354 --> 00:10:14,074
	How are we doing?

	216
	00:10:14,074 --> 00:10:17,275
	We're about ten minutes, and I'm still talking about voice

	217
	00:10:17,275 --> 00:10:18,155
	technology.

	218
	00:10:18,554 --> 00:10:22,099
	So Whisper was brilliant, and I was so excited that

	219
	00:10:22,099 --> 00:10:25,780
	my first instinct was to guess, like, Oh my gosh,

	220
	00:10:25,780 --> 00:10:27,939
	I have to get a really good microphone for this.

	221
	00:10:28,099 --> 00:10:31,299
	So I didn't go on a spending spree because I

	222
	00:10:31,299 --> 00:10:33,219
	said, I'm gonna have to just wait a month and

	223
	00:10:33,219 --> 00:10:34,660
	see if I still use this.

	224
	00:10:35,140 --> 00:10:38,795
	And it just kind of became it's become really part

	225
	00:10:38,795 --> 00:10:40,875
	of my daily routine.

	226
	00:10:41,674 --> 00:10:44,235
	Like if I'm writing an email, I'll record a voice

	227
	00:10:44,235 --> 00:10:47,515
	note and then I've developed and it's nice to see

	228
	00:10:47,515 --> 00:10:50,679
	that everyone is like developing the same things in parallel.

	229
	00:10:50,679 --> 00:10:53,319
	That's kind of a weird thing to say, when I

	230
	00:10:53,319 --> 00:11:00,199
	started working on these prototypes on GitHub, which is where

	231
	00:11:00,199 --> 00:11:03,959
	I just kind of share very freely and loosely ideas

	232
	00:11:03,959 --> 00:11:06,865
	and first iterations on concepts.

	233
	00:11:08,944 --> 00:11:10,624
	And for want of a better word, I called it

	234
	00:11:10,624 --> 00:11:14,865
	like LLM post processing or clean up or basically a

	235
	00:11:14,865 --> 00:11:17,665
	system prompt that after you get back the raw text

	236
	00:11:17,665 --> 00:11:21,540
	from Whisper, you run it through a model and say,

	237
	00:11:21,540 --> 00:11:26,259
	okay, this is crappy text like add sentence structure and,

	238
	00:11:26,259 --> 00:11:27,379
	you know, fix it up.

	239
	00:11:27,780 --> 00:11:32,499
	And now when I'm exploring the different tools that are

	240
	00:11:32,499 --> 00:11:35,554
	out there that people have built, I see quite a

	241
	00:11:35,554 --> 00:11:39,395
	number of projects have basically done the same thing.

	242
	00:11:40,674 --> 00:11:43,155
	Lest that be misconstrued, I'm not saying for a millisecond

	243
	00:11:43,155 --> 00:11:44,515
	that I inspired them.

	244
	00:11:44,515 --> 00:11:47,954
	I'm sure this has been a thing that's been integrated

	245
	00:11:47,954 --> 00:11:51,210
	into tools for a while, but it's the kind of

	246
	00:11:51,210 --> 00:11:53,610
	thing that when you start using these tools every day,

	247
	00:11:53,610 --> 00:11:57,530
	the need for it is almost instantly apparent because text

	248
	00:11:57,530 --> 00:12:01,449
	that doesn't have any punctuation or paragraph spacing takes a

	249
	00:12:01,449 --> 00:12:03,885
	long time to, you know, it takes so long to

	250
	00:12:03,885 --> 00:12:08,924
	get it into a presentable email that again, moves speech

	251
	00:12:08,924 --> 00:12:13,005
	tech into that before that inflection point where you're like,

	252
	00:12:13,005 --> 00:12:13,885
	nah, it's just not worth it.

	253
	00:12:13,885 --> 00:12:16,844
	It's like, it'll just be quicker to type this.

	254
	00:12:17,199 --> 00:12:19,760
	So it's a big, it's a little touch that actually

	255
	00:12:20,000 --> 00:12:21,120
	is a big deal.

	256
	00:12:21,439 --> 00:12:25,360
	So I was on Whisper and I've been using Whisper

	257
	00:12:25,360 --> 00:12:27,679
	and I kind of early on found a couple of

	258
	00:12:27,679 --> 00:12:28,319
	tools.

	259
	00:12:28,319 --> 00:12:30,559
	I couldn't find what I was looking for on Linux,

	260
	00:12:30,559 --> 00:12:35,844
	which is basically just something that'll run-in the background.

	261
	00:12:35,844 --> 00:12:38,165
	You'll give it an API key and it will just

	262
	00:12:38,165 --> 00:12:42,964
	like transcribe with like a little key to start and

	263
	00:12:42,964 --> 00:12:43,765
	stop the dictation.

	264
	00:12:45,000 --> 00:12:48,360
	And the issues where I discovered that like most people

	265
	00:12:48,360 --> 00:12:51,960
	involved in creating these projects were very much focused on

	266
	00:12:51,960 --> 00:12:55,720
	local models, running Whisper locally because you can.

	267
	00:12:56,199 --> 00:12:58,120
	And I tried that a bunch of times and just

	268
	00:12:58,120 --> 00:13:00,974
	never got results that were as good as the cloud.

	269
	00:13:01,375 --> 00:13:03,535
	And when I began looking at the cost of the

	270
	00:13:03,535 --> 00:13:06,574
	speech to text APIs and what I was spending, I

	271
	00:13:06,574 --> 00:13:09,775
	just thought there is it's actually, in my opinion, just

	272
	00:13:09,775 --> 00:13:13,080
	one of the better deals in API spending in the

	273
	00:13:13,080 --> 00:13:13,400
	cloud.

	274
	00:13:13,400 --> 00:13:15,640
	Like, it's just not that expensive for very, very good

	275
	00:13:15,640 --> 00:13:19,559
	models that are much more, you know, you're gonna be

	276
	00:13:19,559 --> 00:13:22,679
	able to run the full model, the latest model versus

	277
	00:13:22,679 --> 00:13:26,525
	whatever you can run on your average GPU unless you

	278
	00:13:26,525 --> 00:13:28,765
	want to buy a crazy GPU.

	279
	00:13:28,765 --> 00:13:29,964
	It doesn't really make sense to me.

	280
	00:13:29,964 --> 00:13:33,084
	Privacy is another concern that I know is kind of

	281
	00:13:33,084 --> 00:13:35,245
	like a very much a separate thing that people just

	282
	00:13:35,245 --> 00:13:38,765
	don't want their voice data and their voice leaving their

	283
	00:13:38,765 --> 00:13:42,380
	local environment maybe for regulatory reasons as well.

	284
	00:13:42,620 --> 00:13:43,900
	But I'm not in that.

	285
	00:13:44,140 --> 00:13:48,460
	I neither really care about people listening to my, grocery

	286
	00:13:48,460 --> 00:13:51,500
	list, consisting of, reminding myself that I need to buy

	287
	00:13:51,500 --> 00:13:54,699
	more beer, Cheetos, and hummus, which is kind of the

	288
	00:13:55,254 --> 00:13:59,494
	three staples of my diet during periods of poor nutrition.

	289
	00:13:59,814 --> 00:14:02,295
	But the kind of stuff that I transcribe, it's just

	290
	00:14:02,295 --> 00:14:02,614
	not.

	291
	00:14:02,614 --> 00:14:07,734
	It's not a privacy thing I'm that sort of sensitive

	292
	00:14:07,734 --> 00:14:13,189
	about and I don't do anything so sensitive or secure

	293
	00:14:13,189 --> 00:14:14,710
	that requires air capping.

	294
	00:14:15,590 --> 00:14:17,510
	I looked at the pricing and especially the kind of

	295
	00:14:17,510 --> 00:14:18,870
	older model mini.

	296
	00:14:19,510 --> 00:14:21,830
	Some of them are very, very affordable and I did

	297
	00:14:21,830 --> 00:14:26,684
	a calculation once with ChatGPT and I was like, okay,

	298
	00:14:26,684 --> 00:14:30,285
	this is the API price for I can't remember whatever

	299
	00:14:30,285 --> 00:14:31,324
	the model was.

	300
	00:14:31,724 --> 00:14:34,365
	Let's say I just go at it like nonstop, which

	301
	00:14:34,365 --> 00:14:35,485
	rarely happens.

	302
	00:14:35,564 --> 00:14:38,879
	Probably, I would say on average I might dictate thirty

	303
	00:14:38,879 --> 00:14:41,679
	to sixty minutes per day if I was probably summing

	304
	00:14:41,679 --> 00:14:47,920
	up the emails, documents, outlines, which is a lot, but

	305
	00:14:47,920 --> 00:14:50,079
	it's it's still a fairly modest amount.

	306
	00:14:50,079 --> 00:14:51,759
	And I was like, well, some days I do go

	307
	00:14:51,759 --> 00:14:54,854
	on like one or two days where I've been usually

	308
	00:14:54,854 --> 00:14:56,775
	when I'm like kind of out of the house and

	309
	00:14:56,775 --> 00:15:00,455
	just have something like I have nothing else to do.

	310
	00:15:00,455 --> 00:15:03,095
	Like if I'm at a hospital, we have a newborn

	311
	00:15:03,495 --> 00:15:07,219
	and you're waiting for like eight hours and hours for

	312
	00:15:07,219 --> 00:15:08,020
	an appointment.

	313
	00:15:08,099 --> 00:15:11,939
	And I would probably have listened to podcasts before becoming

	314
	00:15:11,939 --> 00:15:12,900
	a speech fanatic.

	315
	00:15:12,900 --> 00:15:15,299
	And I'm like, Oh, wait, let me just get down.

	316
	00:15:15,299 --> 00:15:17,299
	Let me just get these ideas out of my head.

	317
	00:15:17,460 --> 00:15:20,665
	And that's when I'll go on my speech binges.

	318
	00:15:20,665 --> 00:15:22,584
	But those are like once every few months, like not

	319
	00:15:22,584 --> 00:15:23,464
	frequently.

	320
	00:15:23,704 --> 00:15:25,704
	But I said, okay, let's just say if I'm going

	321
	00:15:25,704 --> 00:15:28,104
	to price out cloud STT.

	322
	00:15:28,905 --> 00:15:33,420
	If I was like dedicated every second of every waking

	323
	00:15:33,420 --> 00:15:37,740
	hour to transcribing for some odd reason, I mean I'd

	324
	00:15:37,740 --> 00:15:39,740
	have to eat and use the toilet.

	325
	00:15:40,460 --> 00:15:42,620
	There's only so many hours I'm awake for.

	326
	00:15:42,620 --> 00:15:46,939
	So let's just say a maximum of forty five minutes

	327
	00:15:47,125 --> 00:15:49,125
	in the hour, then I said, All right, let's just

	328
	00:15:49,125 --> 00:15:50,085
	say fifty.

	329
	00:15:50,564 --> 00:15:51,285
	Who knows?

	330
	00:15:51,285 --> 00:15:52,724
	You're dictating on the toilet.

	331
	00:15:52,724 --> 00:15:53,525
	We do it.

	332
	00:15:53,844 --> 00:15:56,804
	So you could just do sixty, but whatever I did

	333
	00:15:57,045 --> 00:16:01,099
	and every day, like you're going flat out seven days

	334
	00:16:01,099 --> 00:16:02,540
	a week dictating nonstop.

	335
	00:16:02,540 --> 00:16:05,499
	I was like, What's my monthly API bill going to

	336
	00:16:05,499 --> 00:16:06,620
	be at this price?

	337
	00:16:06,699 --> 00:16:09,259
	And it came out to like seventy or eighty bucks.

	338
	00:16:09,259 --> 00:16:12,540
	And I was like, Well, that would be an extraordinary

	339
	00:16:12,860 --> 00:16:14,299
	amount of dictation.

	340
	00:16:14,299 --> 00:16:18,025
	And I would hope that there was some compelling reason

	341
	00:16:18,665 --> 00:16:21,704
	worth more than seventy dollars that I embarked upon that

	342
	00:16:21,704 --> 00:16:22,344
	project.

	343
	00:16:22,584 --> 00:16:24,505
	So given that that's kind of the max point for

	344
	00:16:24,505 --> 00:16:27,224
	me I said that's actually very very affordable.

	345
	00:16:27,944 --> 00:16:30,424
	Now you're gonna if you want to spec out the

	346
	00:16:30,424 --> 00:16:33,829
	costs and you want to do the post processing that

	347
	00:16:33,829 --> 00:16:36,709
	I really do feel is valuable, that's going to cost

	348
	00:16:36,709 --> 00:16:37,670
	some more as well.

	349
	00:16:37,990 --> 00:16:43,189
	Unless you're using Gemini, which needless to say is a

	350
	00:16:43,189 --> 00:16:45,110
	random person sitting in Jerusalem.

	351
	00:16:45,775 --> 00:16:49,375
	I have no affiliation nor with Google nor Anthropic nor

	352
	00:16:49,375 --> 00:16:52,334
	Gemini nor any major tech vendor for that matter.

	353
	00:16:53,775 --> 00:16:57,135
	I like Gemini not so much as a everyday model.

	354
	00:16:57,375 --> 00:16:59,854
	It's kind of underwhelmed in that respect, I would say.

	355
	00:17:00,299 --> 00:17:02,699
	But for multimodal, I think it's got a lot to

	356
	00:17:02,699 --> 00:17:03,259
	offer.

	357
	00:17:03,579 --> 00:17:07,099
	And I think that the transcribing functionality whereby it can,

	358
	00:17:07,979 --> 00:17:12,300
	process audio with a system prompt and both give you

	359
	00:17:12,300 --> 00:17:13,820
	transcription that's cleaned up.

	360
	00:17:13,820 --> 00:17:15,259
	That reduces two steps to one.

	361
	00:17:15,755 --> 00:17:18,874
	And that for me is a very, very big deal.

	362
	00:17:18,875 --> 00:17:22,394
	And I feel like even Google hasn't really sort of

	363
	00:17:22,475 --> 00:17:27,115
	thought through how useful the that modality is and what

	364
	00:17:27,115 --> 00:17:29,620
	kind of use cases you can achieve with it.

	365
	00:17:29,620 --> 00:17:32,259
	Because I found in the course of this year just

	366
	00:17:32,259 --> 00:17:37,939
	an endless list of really kind of system prompt stuff

	367
	00:17:37,939 --> 00:17:40,820
	that I can say, okay, I've used it to capture

	368
	00:17:40,820 --> 00:17:44,035
	context data for AI, which is literally I might speak

	369
	00:17:44,035 --> 00:17:46,675
	for if I wanted to have a good bank of

	370
	00:17:46,675 --> 00:17:49,955
	context data about who knows my childhood.

	371
	00:17:50,354 --> 00:17:54,275
	More realistically, maybe my career goals, something that would just

	372
	00:17:54,275 --> 00:17:56,115
	be like really boring to type out.

	373
	00:17:56,115 --> 00:18:00,420
	So I'll just like sit in my car and record

	374
	00:18:00,420 --> 00:18:01,380
	it for ten minutes.

	375
	00:18:01,380 --> 00:18:03,699
	And that ten minutes you get a lot of information

	376
	00:18:03,699 --> 00:18:04,339
	in.

	377
	00:18:05,539 --> 00:18:07,620
	Emails, which is short text.

	378
	00:18:08,580 --> 00:18:10,339
	Just there is a whole bunch.

	379
	00:18:10,340 --> 00:18:13,295
	And all these workflows kind of require a little bit

	380
	00:18:13,295 --> 00:18:15,054
	of treatment afterwards and different treatment.

	381
	00:18:15,054 --> 00:18:18,334
	My context pipeline is kind of like just extract the

	382
	00:18:18,334 --> 00:18:19,215
	bare essentials.

	383
	00:18:19,215 --> 00:18:22,094
	You end up with me talking very loosely about sort

	384
	00:18:22,094 --> 00:18:24,414
	of what I've done in my career, where I've worked,

	385
	00:18:24,414 --> 00:18:25,374
	where I might like to work.

	386
	00:18:25,920 --> 00:18:29,039
	And it goes, it condenses that down to very robotic

	387
	00:18:29,039 --> 00:18:32,640
	language that is easy to chunk parse and maybe put

	388
	00:18:32,640 --> 00:18:33,920
	into a vector database.

	389
	00:18:33,920 --> 00:18:36,160
	Daniel has worked in technology.

	390
	00:18:36,160 --> 00:18:39,760
	Daniel has been working in, know, stuff like that.

	391
	00:18:39,760 --> 00:18:42,975
	That's not how you would speak, but I figure it's

	392
	00:18:42,975 --> 00:18:46,414
	probably easier to parse for, after all, robots.

	393
	00:18:46,735 --> 00:18:48,654
	So we've almost got to twenty minutes and this is

	394
	00:18:48,654 --> 00:18:53,054
	actually a success because I wasted twenty minutes of my

	395
	00:18:53,455 --> 00:18:57,120
	of the evening speaking into you in microphone and the

	396
	00:18:57,120 --> 00:19:01,039
	levels were shot and was clipping and I said I

	397
	00:19:01,039 --> 00:19:02,320
	can't really do an evaluation.

	398
	00:19:02,320 --> 00:19:03,360
	I have to be fair.

	399
	00:19:03,360 --> 00:19:06,320
	I have to give the models a chance to do

	400
	00:19:06,320 --> 00:19:06,880
	their thing.

	401
	00:19:07,425 --> 00:19:09,505
	What am I hoping to achieve in this?

	402
	00:19:09,505 --> 00:19:11,584
	Okay, my fine tune was a dud as mentioned.

	403
	00:19:11,665 --> 00:19:15,185
	Deepgram STT, I'm really, really hopeful that this prototype will

	404
	00:19:15,185 --> 00:19:17,985
	work and it's a build in public open source so

	405
	00:19:17,985 --> 00:19:20,304
	anyone is welcome to use it if I make anything

	406
	00:19:20,304 --> 00:19:20,625
	good.

	407
	00:19:21,560 --> 00:19:23,800
	But that was really exciting for me last night when

	408
	00:19:23,800 --> 00:19:28,840
	after hours of trying my own prototype, seeing someone just

	409
	00:19:28,840 --> 00:19:32,039
	made something that works like that, you you're not gonna

	410
	00:19:32,039 --> 00:19:36,374
	have to build a custom conda environment and image.

	411
	00:19:36,374 --> 00:19:39,974
	I have AMD GPU which makes things much more complicated.

	412
	00:19:40,214 --> 00:19:42,614
	I didn't find it and I was about to give

	413
	00:19:42,614 --> 00:19:43,894
	up and I said, All right, let me just give

	414
	00:19:43,894 --> 00:19:46,455
	Deepgram's Linux thing a shot.

	415
	00:19:47,029 --> 00:19:49,589
	And if this doesn't work, I'm just gonna go back

	416
	00:19:49,589 --> 00:19:51,349
	to trying to vibe code something myself.

	417
	00:19:51,670 --> 00:19:55,509
	And when I ran the script, I was using Cloud

	418
	00:19:55,509 --> 00:19:59,029
	Code to do the installation process, it ran the script

	419
	00:19:59,029 --> 00:20:01,189
	and, oh my gosh, it works just like that.

	420
	00:20:01,824 --> 00:20:05,985
	The tricky thing for all those who wants to know

	421
	00:20:05,985 --> 00:20:11,425
	all the nitty, ditty, nitty gritty details was that I

	422
	00:20:11,425 --> 00:20:14,624
	don't think it was actually struggling with transcription, but pasting

	423
	00:20:14,705 --> 00:20:17,539
	Weyland makes life very hard.

	424
	00:20:17,539 --> 00:20:19,140
	And I think there was something not running at the

	425
	00:20:19,140 --> 00:20:19,699
	right time.

	426
	00:20:19,699 --> 00:20:22,979
	Anyway, Deepgram, I looked at how they actually handle that

	427
	00:20:22,979 --> 00:20:25,140
	because it worked out of the box when other stuff

	428
	00:20:25,140 --> 00:20:25,779
	didn't.

	429
	00:20:26,100 --> 00:20:28,900
	And it was quite a clever little mechanism.

	430
	00:20:29,495 --> 00:20:32,135
	And but more so than that, the accuracy was brilliant.

	431
	00:20:32,135 --> 00:20:33,574
	Now what am I what am I doing here?

	432
	00:20:33,574 --> 00:20:37,175
	This is gonna be a twenty minute audio sample.

	433
	00:20:38,375 --> 00:20:42,410
	And I'm I think I've done one or two of

	434
	00:20:42,410 --> 00:20:47,130
	these before, but I did it with short, snappy voice

	435
	00:20:47,130 --> 00:20:47,610
	notes.

	436
	00:20:47,610 --> 00:20:49,370
	This is kind of long form.

	437
	00:20:49,449 --> 00:20:51,929
	This actually might be a better approximation for what's useful

	438
	00:20:51,929 --> 00:20:53,849
	to me than voice memos.

	439
	00:20:53,849 --> 00:20:56,894
	Like, I need to buy three liters of milk tomorrow

	440
	00:20:56,894 --> 00:21:00,175
	and peter bread, which is probably how half my voice

	441
	00:21:00,175 --> 00:21:00,735
	notes sound.

	442
	00:21:00,735 --> 00:21:04,094
	Like if anyone were to find my phone they'd be

	443
	00:21:04,094 --> 00:21:05,934
	like this is the most boring person in the world.

	444
	00:21:06,015 --> 00:21:10,050
	Although actually there are some journaling thoughts as well, but

	445
	00:21:10,050 --> 00:21:11,810
	it's a lot of content like that.

	446
	00:21:11,810 --> 00:21:14,610
	And the probably for the evaluation, the most useful thing

	447
	00:21:14,610 --> 00:21:21,834
	is slightly obscure tech, GitHub, Nucleano, hugging face, not so

	448
	00:21:21,834 --> 00:21:24,474
	obscure that it's not gonna have a chance of knowing

	449
	00:21:24,474 --> 00:21:27,194
	it, but hopefully sufficiently well known that the model should

	450
	00:21:27,194 --> 00:21:27,834
	get it.

	451
	00:21:27,914 --> 00:21:29,995
	I tried to do a little bit of speaking really

	452
	00:21:29,995 --> 00:21:32,394
	fast and speaking very slowly.

	453
	00:21:32,394 --> 00:21:35,529
	Would say in general, I've spoken, delivered this at a

	454
	00:21:35,529 --> 00:21:39,130
	faster pace than I usually would owing to strong coffee

	455
	00:21:39,130 --> 00:21:40,570
	flowing through my bloodstream.

	456
	00:21:41,130 --> 00:21:43,529
	And the thing that I'm not gonna get in this

	457
	00:21:43,529 --> 00:21:46,090
	benchmark is background noise, which in my first take that

	458
	00:21:46,090 --> 00:21:48,455
	I had to get rid of, my wife came in

	459
	00:21:48,455 --> 00:21:51,495
	with my son and for a good night kiss.

	460
	00:21:51,574 --> 00:21:55,094
	And that actually would have been super helpful to get

	461
	00:21:55,094 --> 00:21:57,814
	in because it was non diarized or if we had

	462
	00:21:57,814 --> 00:21:58,695
	diarization.

	463
	00:21:59,334 --> 00:22:01,414
	A female, I could say, I want the male voice

	464
	00:22:01,414 --> 00:22:03,094
	and that wasn't intended for transcription.

	465
	00:22:04,509 --> 00:22:06,269
	And we're not going to get background noise like people

	466
	00:22:06,269 --> 00:22:08,989
	honking their horns, which is something I've done in my

	467
	00:22:09,150 --> 00:22:11,870
	main data set where I am trying to go back

	468
	00:22:11,870 --> 00:22:15,070
	to some of my voice notes, annotate them and run

	469
	00:22:15,070 --> 00:22:15,709
	a benchmark.

	470
	00:22:15,709 --> 00:22:18,265
	But this is going to be just a pure quick

	471
	00:22:18,265 --> 00:22:19,064
	test.

	472
	00:22:19,785 --> 00:22:24,025
	And as someone I'm working on a voice note idea.

	473
	00:22:24,025 --> 00:22:28,185
	That's my sort of end motivation besides thinking it's an

	474
	00:22:28,185 --> 00:22:31,785
	absolutely outstanding technology that's coming to viability.

	475
	00:22:31,785 --> 00:22:34,400
	And really, I know this sounds cheesy, can actually have

	476
	00:22:34,400 --> 00:22:36,479
	a very transformative effect.

	477
	00:22:37,920 --> 00:22:43,120
	Voice technology has been life changing for folks living with

	478
	00:22:43,999 --> 00:22:45,039
	disabilities.

	479
	00:22:45,920 --> 00:22:48,545
	And I think there's something really nice about the fact

	480
	00:22:48,545 --> 00:22:52,545
	that it can also benefit folks who are able-bodied and

	481
	00:22:52,545 --> 00:22:57,904
	we can all in different ways make this tech as

	482
	00:22:57,904 --> 00:23:00,705
	useful as possible regardless of the exact way that we're

	483
	00:23:00,705 --> 00:23:01,025
	using it.

	484
	00:23:02,199 --> 00:23:04,439
	And I think there's something very powerful in that, and

	485
	00:23:04,439 --> 00:23:05,559
	it can be very cool.

	486
	00:23:06,120 --> 00:23:07,559
	I see huge potential.

	487
	00:23:07,559 --> 00:23:09,319
	What excites me about voice tech?

	488
	00:23:09,719 --> 00:23:11,159
	A lot of things actually.

	489
	00:23:12,120 --> 00:23:14,839
	Firstly, the fact that it's cheap and accurate, as I

	490
	00:23:14,839 --> 00:23:17,785
	mentioned at the very start of this, and it's getting

	491
	00:23:17,785 --> 00:23:20,104
	better and better with stuff like accent handling.

	492
	00:23:20,745 --> 00:23:23,304
	I'm not sure my fine tune will actually ever come

	493
	00:23:23,304 --> 00:23:25,225
	to fruition in the sense that I'll use it day

	494
	00:23:25,225 --> 00:23:26,584
	to day as I imagine.

	495
	00:23:26,664 --> 00:23:30,505
	I get like superb, flawless words error rates because I'm

	496
	00:23:30,505 --> 00:23:34,949
	just kind of skeptical about local speech to text, as

	497
	00:23:34,949 --> 00:23:35,670
	I mentioned.

	498
	00:23:36,070 --> 00:23:39,830
	And I think the pace of innovation and improvement in

	499
	00:23:39,830 --> 00:23:42,310
	the models, the main reasons for fine tuning from what

	500
	00:23:42,310 --> 00:23:46,150
	I've seen have been people who are something that really

	501
	00:23:46,150 --> 00:23:50,375
	blows blows my mind about ASR is the idea that

	502
	00:23:50,375 --> 00:23:55,574
	it's inherently ailingual or multilingual, phonetic based.

	503
	00:23:56,295 --> 00:24:00,375
	So as folks who use speak very obscure languages that

	504
	00:24:00,375 --> 00:24:03,094
	there may be very there might be a paucity of

	505
	00:24:02,229 --> 00:24:05,030
	training data or almost none at all, and therefore the

	506
	00:24:05,030 --> 00:24:06,790
	accuracy is significantly reduced.

	507
	00:24:06,790 --> 00:24:11,350
	Or folks in very critical environments, I know there are

	508
	00:24:11,510 --> 00:24:15,350
	this is used extensively in medical transcription and dispatcher work

	509
	00:24:15,350 --> 00:24:19,064
	as, you know the call centers who send out ambulances

	510
	00:24:19,064 --> 00:24:19,864
	etc.

	511
	00:24:20,265 --> 00:24:23,545
	Where accuracy is absolutely paramount and in the case of

	512
	00:24:23,545 --> 00:24:27,545
	doctors radiologists they might be using very specialized vocab all

	513
	00:24:27,545 --> 00:24:27,865
	the time.

	514
	00:24:28,630 --> 00:24:30,229
	So those are kind of the main two things, and

	515
	00:24:30,229 --> 00:24:32,150
	I'm not sure that really just for trying to make

	516
	00:24:32,150 --> 00:24:36,390
	it better on a few random tech words with my

	517
	00:24:36,390 --> 00:24:39,429
	slightly I mean, I have an accent, but, like, not,

	518
	00:24:39,429 --> 00:24:42,469
	you know, an accent that a few other million people

	519
	00:24:42,870 --> 00:24:43,910
	have ish.

	520
	00:24:44,685 --> 00:24:47,965
	I'm not sure that my little fine tune is gonna

	521
	00:24:47,965 --> 00:24:52,604
	actually like, the bump in word error reduction, if I

	522
	00:24:52,604 --> 00:24:54,205
	ever actually figure out how to do it and get

	523
	00:24:54,205 --> 00:24:56,365
	it up to the cloud, by the time we've done

	524
	00:24:56,365 --> 00:24:59,959
	that, I suspect that the next generation of ASR will

	525
	00:24:59,959 --> 00:25:01,719
	just be so good that it will kind of be,

	526
	00:25:01,959 --> 00:25:03,959
	well, that would have been cool if it worked out,

	527
	00:25:03,959 --> 00:25:05,479
	but I'll just use this instead.

	528
	00:25:05,719 --> 00:25:10,679
	So that's gonna be it for today's episode of voice

	529
	00:25:10,679 --> 00:25:11,640
	training data.

	530
	00:25:11,880 --> 00:25:14,255
	Single, long shot evaluation.

	531
	00:25:14,495 --> 00:25:15,694
	Who am I gonna compare?

	532
	00:25:16,414 --> 00:25:18,574
	Whisper is always good as a benchmark, but I'm more

	533
	00:25:18,574 --> 00:25:22,175
	interested in seeing Whisper head to head with two things

	534
	00:25:22,175 --> 00:25:22,894
	really.

	535
	00:25:23,295 --> 00:25:25,134
	One is Whisper variants.

	536
	00:25:25,134 --> 00:25:27,695
	So you've got these projects like Faster Whisper.

	537
	00:25:29,110 --> 00:25:29,989
	Distill Whisper.

	538
	00:25:29,989 --> 00:25:30,709
	It's a bit confusing.

	539
	00:25:30,709 --> 00:25:31,909
	There's a whole bunch of them.

	540
	00:25:32,150 --> 00:25:35,110
	And the emerging ASRs, which are also a thing.

	541
	00:25:35,269 --> 00:25:37,110
	My intention for this is I'm not sure I'm gonna

	542
	00:25:37,110 --> 00:25:39,910
	have the time in any point in the foreseeable future

	543
	00:25:39,910 --> 00:25:44,775
	to go back to this whole episode and create a

	544
	00:25:44,775 --> 00:25:48,294
	proper source truth where I fix everything.

	545
	00:25:49,255 --> 00:25:51,894
	Might do it if I can get one transcription that's

	546
	00:25:51,894 --> 00:25:54,134
	sufficiently close to perfection.

	547
	00:25:54,934 --> 00:25:58,400
	But what I would actually love to do on Hugging

	548
	00:25:58,400 --> 00:26:00,479
	Face, I think would be a great probably how I

	549
	00:26:00,479 --> 00:26:04,400
	might visualize this is having the audio waveform play and

	550
	00:26:04,400 --> 00:26:08,880
	then have the transcript for each model below it and

	551
	00:26:08,880 --> 00:26:13,765
	maybe even a, like, you know, to scale and maybe

	552
	00:26:13,765 --> 00:26:16,644
	even a local one as well, like local whisper versus

	553
	00:26:16,644 --> 00:26:19,684
	OpenAI API, etcetera.

	554
	00:26:19,765 --> 00:26:23,124
	And I can then actually listen back to segments or

	555
	00:26:23,124 --> 00:26:25,285
	anyone who wants to can listen back to segments of

	556
	00:26:25,285 --> 00:26:30,219
	this recording and see where a particular model struggled and

	557
	00:26:30,219 --> 00:26:33,099
	others didn't as well as the sort of headline finding

	558
	00:26:33,099 --> 00:26:35,579
	of which had the best W E R but that

	559
	00:26:35,579 --> 00:26:37,659
	would require the source of truth.

	560
	00:26:37,660 --> 00:26:38,459
	Okay, that's it.

	561
	00:26:38,425 --> 00:26:40,985
	I hope this was, I don't know, maybe useful for

	562
	00:26:40,985 --> 00:26:42,904
	other folks interested in STT.

	563
	00:26:42,985 --> 00:26:45,945
	You want to see I always think I've just said

	564
	00:26:45,945 --> 00:26:47,624
	it as something I didn't intend to.

	565
	00:26:47,864 --> 00:26:49,624
	STT, I said for those.

	566
	00:26:49,624 --> 00:26:53,049
	Listen carefully, including hopefully the models themselves.

	567
	00:26:53,289 --> 00:26:55,049
	This has been myself, Daniel Rosol.

	568
	00:26:55,049 --> 00:26:59,370
	For more jumbled repositories about my roving interest in AI

	569
	00:26:59,370 --> 00:27:04,009
	but particularly AgenTic, MCP and VoiceTech you can find me

	570
	00:27:04,009 --> 00:27:05,689
	on GitHub.

	571
	00:27:05,929 --> 00:27:06,650
	Hugging Face.

	572
	00:27:08,045 --> 00:27:08,924
	Where else?

	573
	00:27:08,925 --> 00:27:11,725
	DanielRosel dot com, which is my personal website, as well

	574
	00:27:11,725 --> 00:27:15,485
	as this podcast whose name I sadly cannot remember.

	575
	00:27:15,644 --> 00:27:16,685
	Until next time.

	576
	00:27:16,685 --> 00:27:17,324
	Thanks for listening.