Spaces:
Running
Running
| 1 | |
| 00:00:00,000 --> 00:00:06,160 | |
| Hello and welcome to a audio dataset consisting of one | |
| 2 | |
| 00:00:06,160 --> 00:00:08,320 | |
| single episode of a nonexistent podcast. | |
| 3 | |
| 00:00:08,720 --> 00:00:12,800 | |
| Or it I may append this to a podcast that | |
| 4 | |
| 00:00:12,800 --> 00:00:18,734 | |
| I set up recently regarding my with my thoughts on | |
| 5 | |
| 00:00:18,735 --> 00:00:20,735 | |
| speech tech and A. | |
| 6 | |
| 00:00:20,735 --> 00:00:21,134 | |
| I. | |
| 7 | |
| 00:00:21,134 --> 00:00:22,734 | |
| In particular, more A. | |
| 8 | |
| 00:00:22,734 --> 00:00:22,974 | |
| I. | |
| 9 | |
| 00:00:22,974 --> 00:00:23,855 | |
| And generative A. | |
| 10 | |
| 00:00:23,855 --> 00:00:24,015 | |
| I. | |
| 11 | |
| 00:00:24,015 --> 00:00:26,414 | |
| I would I would say. | |
| 12 | |
| 00:00:26,734 --> 00:00:30,789 | |
| But in any event, the purpose of this voice recording | |
| 13 | |
| 00:00:30,789 --> 00:00:35,510 | |
| is actually to create a lengthy voice sample for a | |
| 14 | |
| 00:00:35,510 --> 00:00:38,870 | |
| quick evaluation, a back of the envelope evaluation, they might | |
| 15 | |
| 00:00:38,870 --> 00:00:41,349 | |
| say, for different speech attacks models. | |
| 16 | |
| 00:00:41,349 --> 00:00:43,865 | |
| I'm doing this because I thought I'd made a great | |
| 17 | |
| 00:00:43,865 --> 00:00:47,704 | |
| breakthrough in my journey with speech tech and that was | |
| 18 | |
| 00:00:47,704 --> 00:00:51,305 | |
| succeeding in the elusive task of fine tuning whisper. | |
| 19 | |
| 00:00:51,624 --> 00:00:56,344 | |
| Whisper is, and I'm to just talk, I'm trying to | |
| 20 | |
| 00:00:55,749 --> 00:00:56,709 | |
| mix up. | |
| 21 | |
| 00:00:56,789 --> 00:01:00,310 | |
| I'm going to try a few different styles of speaking | |
| 22 | |
| 00:01:00,310 --> 00:01:02,789 | |
| whisper something at some points as well. | |
| 23 | |
| 00:01:03,270 --> 00:01:06,710 | |
| And I'll go back to speaking loud in in different | |
| 24 | |
| 00:01:06,710 --> 00:01:08,950 | |
| parts are going to sound really like a crazy person | |
| 25 | |
| 00:01:08,950 --> 00:01:12,344 | |
| because I'm also going to try to speak at different | |
| 26 | |
| 00:01:12,904 --> 00:01:17,945 | |
| pitches and cadences in order to really try to push | |
| 27 | |
| 00:01:18,264 --> 00:01:21,065 | |
| a speech to text model through its paces, which is | |
| 28 | |
| 00:01:21,065 --> 00:01:24,529 | |
| trying to make sense of is this guy just rambling | |
| 29 | |
| 00:01:24,529 --> 00:01:29,969 | |
| on incoherently in one long sentence or are these just | |
| 30 | |
| 00:01:29,969 --> 00:01:36,370 | |
| actually a series of step standalone, standalone, standalone sentences? | |
| 31 | |
| 00:01:36,370 --> 00:01:38,050 | |
| And how is it going to handle step alone? | |
| 32 | |
| 00:01:38,050 --> 00:01:38,690 | |
| That's not a word. | |
| 33 | |
| 00:01:39,624 --> 00:01:41,945 | |
| What happens when you use speech to text and you | |
| 34 | |
| 00:01:41,945 --> 00:01:43,304 | |
| use a fake word? | |
| 35 | |
| 00:01:43,304 --> 00:01:45,704 | |
| And then you're like, wait, that's not actually that word | |
| 36 | |
| 00:01:45,704 --> 00:01:46,585 | |
| doesn't exist. | |
| 37 | |
| 00:01:46,904 --> 00:01:48,504 | |
| How does AI handle that? | |
| 38 | |
| 00:01:48,504 --> 00:01:53,670 | |
| And these and more are all the questions that I'm | |
| 39 | |
| 00:01:53,670 --> 00:01:55,670 | |
| seeking to answer in this training data. | |
| 40 | |
| 00:01:55,749 --> 00:01:58,469 | |
| Now, why was I trying to fine tune Whisper? | |
| 41 | |
| 00:01:58,469 --> 00:01:59,670 | |
| And what is Whisper? | |
| 42 | |
| 00:01:59,670 --> 00:02:02,630 | |
| As I said, I'm going to try to record this | |
| 43 | |
| 00:02:02,630 --> 00:02:06,564 | |
| at a couple of different levels of technicality for folks | |
| 44 | |
| 00:02:06,564 --> 00:02:11,684 | |
| who are in the normal world and not totally stuck | |
| 45 | |
| 00:02:11,684 --> 00:02:13,684 | |
| down the rabbit hole of AI, which you have to | |
| 46 | |
| 00:02:13,684 --> 00:02:17,605 | |
| say is a really wonderful rabbit hole to be done. | |
| 47 | |
| 00:02:17,764 --> 00:02:20,839 | |
| It's a really interesting area and speech and voice tech | |
| 48 | |
| 00:02:20,839 --> 00:02:24,279 | |
| is is the aspect of it that I find actually | |
| 49 | |
| 00:02:24,279 --> 00:02:27,159 | |
| most I'm not sure I would say the most interesting | |
| 50 | |
| 00:02:27,159 --> 00:02:30,679 | |
| because there's just so much that is fascinating in AI. | |
| 51 | |
| 00:02:31,320 --> 00:02:34,054 | |
| But the most that I find the most personally transformative | |
| 52 | |
| 00:02:34,054 --> 00:02:38,454 | |
| in terms of the impact that it's had on my | |
| 53 | |
| 00:02:38,454 --> 00:02:41,174 | |
| daily work life and productivity and how I sort of | |
| 54 | |
| 00:02:41,174 --> 00:02:41,815 | |
| work. | |
| 55 | |
| 00:02:42,855 --> 00:02:47,420 | |
| I'm persevering hard with the task of trying to get | |
| 56 | |
| 00:02:47,420 --> 00:02:50,859 | |
| a good solution working for Linux, which if anyone actually | |
| 57 | |
| 00:02:50,859 --> 00:02:52,859 | |
| does listen to this, not just for the training data | |
| 58 | |
| 00:02:52,859 --> 00:02:56,620 | |
| and for the actual content, is sparked. | |
| 59 | |
| 00:02:56,620 --> 00:02:59,900 | |
| I had, besides the fine tune not working, well that | |
| 60 | |
| 00:02:59,900 --> 00:03:01,305 | |
| was the failure. | |
| 61 | |
| 00:03:02,424 --> 00:03:06,665 | |
| I used Claude code because one thinks these days that | |
| 62 | |
| 00:03:06,665 --> 00:03:13,200 | |
| there is nothing short of solving, you know, the the | |
| 63 | |
| 00:03:13,200 --> 00:03:17,519 | |
| reason of life or something that clause and agentic AI | |
| 64 | |
| 00:03:17,519 --> 00:03:19,600 | |
| can't do, which is not really the case. | |
| 65 | |
| 00:03:19,600 --> 00:03:23,119 | |
| It does seem that way sometimes, but it fails a | |
| 66 | |
| 00:03:23,119 --> 00:03:23,679 | |
| lot as well. | |
| 67 | |
| 00:03:23,679 --> 00:03:26,559 | |
| And this is one of those instances where last week | |
| 68 | |
| 00:03:26,559 --> 00:03:30,744 | |
| I put together an hour of voice training data, basically | |
| 69 | |
| 00:03:30,744 --> 00:03:33,385 | |
| speaking just random things for three minutes. | |
| 70 | |
| 00:03:35,385 --> 00:03:38,024 | |
| It was actually kind of tedious because the texts were | |
| 71 | |
| 00:03:38,024 --> 00:03:38,584 | |
| really weird. | |
| 72 | |
| 00:03:38,584 --> 00:03:41,290 | |
| Some of them were, it was like it was AI | |
| 73 | |
| 00:03:41,290 --> 00:03:42,170 | |
| generated. | |
| 74 | |
| 00:03:42,489 --> 00:03:44,809 | |
| I tried before to read Sherlock Holmes for an hour | |
| 75 | |
| 00:03:44,809 --> 00:03:47,609 | |
| and I just couldn't, I was so bored after ten | |
| 76 | |
| 00:03:47,609 --> 00:03:50,489 | |
| minutes that I was like, okay, no, I'm just gonna | |
| 77 | |
| 00:03:50,489 --> 00:03:51,850 | |
| have to find something else to read. | |
| 78 | |
| 00:03:51,850 --> 00:03:58,204 | |
| So I used a created with AI Studio, VibeCoded, a | |
| 79 | |
| 00:03:58,204 --> 00:04:03,084 | |
| synthetic text generator which actually I thought was probably a | |
| 80 | |
| 00:04:03,084 --> 00:04:05,165 | |
| better way of doing it because it would give me | |
| 81 | |
| 00:04:05,165 --> 00:04:08,989 | |
| more short samples with more varied content. | |
| 82 | |
| 00:04:08,989 --> 00:04:11,630 | |
| So I was like, okay, give me a voice note | |
| 83 | |
| 00:04:11,630 --> 00:04:14,829 | |
| like I'm recording an email, give me a short story | |
| 84 | |
| 00:04:14,829 --> 00:04:18,109 | |
| to read, give me prose to read. | |
| 85 | |
| 00:04:18,109 --> 00:04:20,554 | |
| So I came up with all these different things and | |
| 86 | |
| 00:04:20,554 --> 00:04:22,634 | |
| they added a little timer to it so I could | |
| 87 | |
| 00:04:22,634 --> 00:04:24,875 | |
| see how close I was to one hour. | |
| 88 | |
| 00:04:25,835 --> 00:04:29,035 | |
| And I spent like an hour one afternoon or probably | |
| 89 | |
| 00:04:29,035 --> 00:04:33,035 | |
| two hours by the time you do retakes and whatever | |
| 90 | |
| 00:04:33,035 --> 00:04:36,089 | |
| because you want to it gave me a source of | |
| 91 | |
| 00:04:36,089 --> 00:04:39,929 | |
| truth which I'm not sure if that's the scientific way | |
| 92 | |
| 00:04:39,929 --> 00:04:44,089 | |
| to approach this topic of gathering training data but I | |
| 93 | |
| 00:04:44,089 --> 00:04:45,369 | |
| thought made sense. | |
| 94 | |
| 00:04:46,410 --> 00:04:49,384 | |
| I have a lot of audio data from recording voice | |
| 95 | |
| 00:04:49,384 --> 00:04:53,464 | |
| notes which I've also kind of used, been experimenting with | |
| 96 | |
| 00:04:53,464 --> 00:04:54,984 | |
| using for a different purpose. | |
| 97 | |
| 00:04:55,304 --> 00:04:58,665 | |
| Slightly different annotating task types. | |
| 98 | |
| 00:04:58,665 --> 00:05:03,170 | |
| It's more a text classification experiment or Well, it's more | |
| 99 | |
| 00:05:03,170 --> 00:05:03,730 | |
| than that actually. | |
| 100 | |
| 00:05:03,730 --> 00:05:04,929 | |
| I'm working on a voice app. | |
| 101 | |
| 00:05:04,929 --> 00:05:09,249 | |
| So it's a prototype, I guess, is really more accurate. | |
| 102 | |
| 00:05:11,329 --> 00:05:13,889 | |
| But you can do that and you can work backwards. | |
| 103 | |
| 00:05:13,889 --> 00:05:18,274 | |
| Listen back to a voice note and you painfully go | |
| 104 | |
| 00:05:18,274 --> 00:05:21,394 | |
| through one of those transcribing, where you start and stop | |
| 105 | |
| 00:05:21,394 --> 00:05:23,554 | |
| and scrub around it and you fix the errors, but | |
| 106 | |
| 00:05:23,554 --> 00:05:25,795 | |
| it's really, really pouring to do that. | |
| 107 | |
| 00:05:26,035 --> 00:05:27,954 | |
| So I thought it would be less tedious in the | |
| 108 | |
| 00:05:27,954 --> 00:05:31,634 | |
| long term if I just recorded the source of truth. | |
| 109 | |
| 00:05:31,989 --> 00:05:34,309 | |
| So it gave me these three minutes snippets. | |
| 110 | |
| 00:05:34,309 --> 00:05:37,429 | |
| I recorded them and saved an MP3 and a TXT | |
| 111 | |
| 00:05:37,670 --> 00:05:40,230 | |
| in the same folder and I created an error that | |
| 112 | |
| 00:05:40,230 --> 00:05:40,869 | |
| data. | |
| 113 | |
| 00:05:41,910 --> 00:05:44,790 | |
| So I was very hopeful, quietly, a little bit hopeful | |
| 114 | |
| 00:05:44,790 --> 00:05:46,949 | |
| that I would be able, that I could actually fine | |
| 115 | |
| 00:05:46,949 --> 00:05:47,670 | |
| tune Whisper. | |
| 116 | |
| 00:05:48,285 --> 00:05:51,005 | |
| I want to fine tune Whisper because when I got | |
| 117 | |
| 00:05:51,005 --> 00:05:54,924 | |
| into voice tech last November, my wife was in the | |
| 118 | |
| 00:05:54,924 --> 00:05:57,165 | |
| US and I was alone at home. | |
| 119 | |
| 00:05:57,244 --> 00:06:00,924 | |
| And when crazy people like me do really wild things | |
| 120 | |
| 00:06:00,924 --> 00:06:03,900 | |
| like use voice to tech technology. | |
| 121 | |
| 00:06:03,900 --> 00:06:06,859 | |
| That was basically when I started doing it, I didn't | |
| 122 | |
| 00:06:06,859 --> 00:06:09,500 | |
| feel like a crazy person speaking to myself. | |
| 123 | |
| 00:06:09,900 --> 00:06:12,700 | |
| And my expectations weren't that high. | |
| 124 | |
| 00:06:13,100 --> 00:06:17,605 | |
| I'd used speech tech now and again, tried it out. | |
| 125 | |
| 00:06:17,605 --> 00:06:18,804 | |
| I was like, it'd be really cool if you could | |
| 126 | |
| 00:06:18,804 --> 00:06:22,324 | |
| just like speak into your computer and whatever I tried | |
| 127 | |
| 00:06:22,324 --> 00:06:25,845 | |
| out that had Linux support was just, it was not | |
| 128 | |
| 00:06:25,845 --> 00:06:26,725 | |
| good basically. | |
| 129 | |
| 00:06:27,285 --> 00:06:29,444 | |
| And this blew me away from the first go. | |
| 130 | |
| 00:06:29,444 --> 00:06:32,259 | |
| I mean, it wasn't one hundred percent accurate out of | |
| 131 | |
| 00:06:32,259 --> 00:06:34,420 | |
| the box and it took work, but it was good | |
| 132 | |
| 00:06:34,420 --> 00:06:36,739 | |
| enough that there was a solid foundation and it kind | |
| 133 | |
| 00:06:36,739 --> 00:06:41,059 | |
| of passed that pivot point that it's actually worth doing | |
| 134 | |
| 00:06:41,059 --> 00:06:41,540 | |
| this. | |
| 135 | |
| 00:06:41,859 --> 00:06:43,859 | |
| You know, there's a point where it's so like, the | |
| 136 | |
| 00:06:43,859 --> 00:06:46,405 | |
| transcript is you don't have to get one hundred percent | |
| 137 | |
| 00:06:46,405 --> 00:06:49,445 | |
| accuracy for it to be worth your time for speech | |
| 138 | |
| 00:06:49,445 --> 00:06:51,845 | |
| to text to be a worthwhile addition to your productivity. | |
| 139 | |
| 00:06:51,845 --> 00:06:53,605 | |
| But you do need to get above, let's say, I | |
| 140 | |
| 00:06:53,605 --> 00:06:55,045 | |
| don't know, eighty five percent. | |
| 141 | |
| 00:06:55,525 --> 00:06:58,725 | |
| If it's sixty percent or fifty percent, you inevitably say, | |
| 142 | |
| 00:06:58,960 --> 00:07:00,239 | |
| Screw it, I'll just type it. | |
| 143 | |
| 00:07:00,239 --> 00:07:03,600 | |
| Because you end up missing errors in the transcript and | |
| 144 | |
| 00:07:03,600 --> 00:07:04,960 | |
| it becomes actually worse. | |
| 145 | |
| 00:07:04,960 --> 00:07:06,640 | |
| You end up in a worse position than you started | |
| 146 | |
| 00:07:06,640 --> 00:07:06,960 | |
| with it. | |
| 147 | |
| 00:07:06,960 --> 00:07:08,160 | |
| That's been my experience. | |
| 148 | |
| 00:07:08,480 --> 00:07:12,400 | |
| So I was like, Oh, this is actually really, really | |
| 149 | |
| 00:07:12,400 --> 00:07:12,880 | |
| good now. | |
| 150 | |
| 00:07:12,880 --> 00:07:13,600 | |
| How did that happen? | |
| 151 | |
| 00:07:13,600 --> 00:07:17,915 | |
| And the answer is ASR, Whisper being open sourced and | |
| 152 | |
| 00:07:18,634 --> 00:07:21,514 | |
| the transformer architecture, if you want to go back to | |
| 153 | |
| 00:07:21,514 --> 00:07:26,314 | |
| the underpinnings, which really blows my mind and it's on | |
| 154 | |
| 00:07:26,314 --> 00:07:29,750 | |
| my list to read through that paper. | |
| 155 | |
| 00:07:30,309 --> 00:07:35,910 | |
| All you need is attention as attentively as can be | |
| 156 | |
| 00:07:35,910 --> 00:07:39,270 | |
| done with my limited brain because it's super super high | |
| 157 | |
| 00:07:39,270 --> 00:07:42,965 | |
| level stuff, super advanced stuff, mean. | |
| 158 | |
| 00:07:43,205 --> 00:07:48,004 | |
| That I think of all the things that are fascinating | |
| 159 | |
| 00:07:48,004 --> 00:07:52,484 | |
| about the sudden rise in AI and the dramatic capabilities, | |
| 160 | |
| 00:07:53,259 --> 00:07:55,339 | |
| I find it fascinating that few people are like, hang | |
| 161 | |
| 00:07:55,339 --> 00:07:58,220 | |
| on, you've got this thing that can speak to you | |
| 162 | |
| 00:07:58,220 --> 00:07:59,980 | |
| like a chatbot, an LLM. | |
| 163 | |
| 00:08:00,540 --> 00:08:02,780 | |
| And then you've got image generation. | |
| 164 | |
| 00:08:02,780 --> 00:08:03,100 | |
| Okay. | |
| 165 | |
| 00:08:03,100 --> 00:08:07,020 | |
| So firstly, two things on the surface have nothing in | |
| 166 | |
| 00:08:07,020 --> 00:08:07,339 | |
| common. | |
| 167 | |
| 00:08:08,285 --> 00:08:11,964 | |
| So how did that just happen all at the same | |
| 168 | |
| 00:08:11,964 --> 00:08:12,205 | |
| time? | |
| 169 | |
| 00:08:12,205 --> 00:08:15,884 | |
| And then when you extend that further, you're like, Suno. | |
| 170 | |
| 00:08:15,884 --> 00:08:19,405 | |
| You can sing a song and AI will come up | |
| 171 | |
| 00:08:19,405 --> 00:08:21,085 | |
| with an instrumental. | |
| 172 | |
| 00:08:21,405 --> 00:08:23,405 | |
| And then you've got Whisper and you're like, Wait a | |
| 173 | |
| 00:08:23,405 --> 00:08:23,645 | |
| second. | |
| 174 | |
| 00:08:24,020 --> 00:08:28,100 | |
| How did all this stuff If it's all AI, there | |
| 175 | |
| 00:08:28,100 --> 00:08:29,460 | |
| has to be some commonality. | |
| 176 | |
| 00:08:29,460 --> 00:08:35,059 | |
| Otherwise, are totally different technologies on the surface of it. | |
| 177 | |
| 00:08:35,140 --> 00:08:39,304 | |
| And the transformer architecture is, as far as I know, | |
| 178 | |
| 00:08:39,304 --> 00:08:40,184 | |
| the answer. | |
| 179 | |
| 00:08:40,184 --> 00:08:42,905 | |
| And I can't even say, can't even pretend that I | |
| 180 | |
| 00:08:42,905 --> 00:08:47,304 | |
| really understand what the transformer architecture means in-depth. | |
| 181 | |
| 00:08:47,304 --> 00:08:49,785 | |
| But I have scanned this and as I said, I | |
| 182 | |
| 00:08:49,785 --> 00:08:52,799 | |
| want to print it and really kind of think over | |
| 183 | |
| 00:08:52,799 --> 00:08:54,080 | |
| it at some point. | |
| 184 | |
| 00:08:54,799 --> 00:08:58,000 | |
| And I'll probably feel bad about myself, I think, because | |
| 185 | |
| 00:08:58,000 --> 00:08:59,599 | |
| weren't those guys in twenties? | |
| 186 | |
| 00:09:00,240 --> 00:09:01,760 | |
| Like, that's crazy. | |
| 187 | |
| 00:09:02,080 --> 00:09:06,080 | |
| I think I asked ChatGPT once who wrote that paper | |
| 188 | |
| 00:09:06,465 --> 00:09:09,184 | |
| and how old were they when it was published in | |
| 189 | |
| 00:09:09,184 --> 00:09:09,745 | |
| ArcSiv? | |
| 190 | |
| 00:09:09,745 --> 00:09:13,025 | |
| And I was expecting like, I don't know, what do | |
| 191 | |
| 00:09:13,025 --> 00:09:13,505 | |
| you imagine? | |
| 192 | |
| 00:09:13,505 --> 00:09:15,585 | |
| I personally imagine kind of like, you you have these | |
| 193 | |
| 00:09:15,585 --> 00:09:19,665 | |
| breakthroughs during COVID and things like that, where like these | |
| 194 | |
| 00:09:19,665 --> 00:09:22,549 | |
| kind of really obscure scientists who are in their 50s | |
| 195 | |
| 00:09:22,549 --> 00:09:26,790 | |
| and they've just kind of been laboring in labs and | |
| 196 | |
| 00:09:26,790 --> 00:09:29,750 | |
| wearily in writing and publishing in kind of obscure academic | |
| 197 | |
| 00:09:29,750 --> 00:09:30,630 | |
| publications. | |
| 198 | |
| 00:09:30,790 --> 00:09:33,589 | |
| And they finally hit a big or win a Nobel | |
| 199 | |
| 00:09:33,589 --> 00:09:36,155 | |
| Prize and then their household names. | |
| 200 | |
| 00:09:36,554 --> 00:09:38,554 | |
| So that was kind of what I had in mind. | |
| 201 | |
| 00:09:38,554 --> 00:09:42,074 | |
| That was the mental image I'd formed of the birth | |
| 202 | |
| 00:09:42,074 --> 00:09:42,875 | |
| of ArcSim. | |
| 203 | |
| 00:09:42,875 --> 00:09:45,515 | |
| Like I wasn't expecting twenty somethings in San Francisco. | |
| 204 | |
| 00:09:45,515 --> 00:09:48,714 | |
| I thought that was both very funny, very cool, and | |
| 205 | |
| 00:09:48,714 --> 00:09:49,995 | |
| actually kind of inspiring. | |
| 206 | |
| 00:09:50,474 --> 00:09:55,150 | |
| It's nice to think that people who just you might | |
| 207 | |
| 00:09:55,150 --> 00:09:58,429 | |
| put them in the kind of milieu or bubble or | |
| 208 | |
| 00:09:58,429 --> 00:10:02,589 | |
| world that you are in incredibly in through a series | |
| 209 | |
| 00:10:02,589 --> 00:10:05,755 | |
| of connections that are coming up with such literally world | |
| 210 | |
| 00:10:05,755 --> 00:10:07,755 | |
| changing innovations. | |
| 211 | |
| 00:10:07,834 --> 00:10:11,194 | |
| So that was I thought anyway, that's that that was | |
| 212 | |
| 00:10:11,194 --> 00:10:11,755 | |
| cool. | |
| 213 | |
| 00:10:12,155 --> 00:10:12,474 | |
| Okay. | |
| 214 | |
| 00:10:12,474 --> 00:10:13,354 | |
| Voice training data. | |
| 215 | |
| 00:10:13,354 --> 00:10:14,074 | |
| How are we doing? | |
| 216 | |
| 00:10:14,074 --> 00:10:17,275 | |
| We're about ten minutes, and I'm still talking about voice | |
| 217 | |
| 00:10:17,275 --> 00:10:18,155 | |
| technology. | |
| 218 | |
| 00:10:18,554 --> 00:10:22,099 | |
| So Whisper was brilliant, and I was so excited that | |
| 219 | |
| 00:10:22,099 --> 00:10:25,780 | |
| my first instinct was to guess, like, Oh my gosh, | |
| 220 | |
| 00:10:25,780 --> 00:10:27,939 | |
| I have to get a really good microphone for this. | |
| 221 | |
| 00:10:28,099 --> 00:10:31,299 | |
| So I didn't go on a spending spree because I | |
| 222 | |
| 00:10:31,299 --> 00:10:33,219 | |
| said, I'm gonna have to just wait a month and | |
| 223 | |
| 00:10:33,219 --> 00:10:34,660 | |
| see if I still use this. | |
| 224 | |
| 00:10:35,140 --> 00:10:38,795 | |
| And it just kind of became it's become really part | |
| 225 | |
| 00:10:38,795 --> 00:10:40,875 | |
| of my daily routine. | |
| 226 | |
| 00:10:41,674 --> 00:10:44,235 | |
| Like if I'm writing an email, I'll record a voice | |
| 227 | |
| 00:10:44,235 --> 00:10:47,515 | |
| note and then I've developed and it's nice to see | |
| 228 | |
| 00:10:47,515 --> 00:10:50,679 | |
| that everyone is like developing the same things in parallel. | |
| 229 | |
| 00:10:50,679 --> 00:10:53,319 | |
| That's kind of a weird thing to say, when I | |
| 230 | |
| 00:10:53,319 --> 00:11:00,199 | |
| started working on these prototypes on GitHub, which is where | |
| 231 | |
| 00:11:00,199 --> 00:11:03,959 | |
| I just kind of share very freely and loosely ideas | |
| 232 | |
| 00:11:03,959 --> 00:11:06,865 | |
| and first iterations on concepts. | |
| 233 | |
| 00:11:08,944 --> 00:11:10,624 | |
| And for want of a better word, I called it | |
| 234 | |
| 00:11:10,624 --> 00:11:14,865 | |
| like LLM post processing or clean up or basically a | |
| 235 | |
| 00:11:14,865 --> 00:11:17,665 | |
| system prompt that after you get back the raw text | |
| 236 | |
| 00:11:17,665 --> 00:11:21,540 | |
| from Whisper, you run it through a model and say, | |
| 237 | |
| 00:11:21,540 --> 00:11:26,259 | |
| okay, this is crappy text like add sentence structure and, | |
| 238 | |
| 00:11:26,259 --> 00:11:27,379 | |
| you know, fix it up. | |
| 239 | |
| 00:11:27,780 --> 00:11:32,499 | |
| And now when I'm exploring the different tools that are | |
| 240 | |
| 00:11:32,499 --> 00:11:35,554 | |
| out there that people have built, I see quite a | |
| 241 | |
| 00:11:35,554 --> 00:11:39,395 | |
| number of projects have basically done the same thing. | |
| 242 | |
| 00:11:40,674 --> 00:11:43,155 | |
| Lest that be misconstrued, I'm not saying for a millisecond | |
| 243 | |
| 00:11:43,155 --> 00:11:44,515 | |
| that I inspired them. | |
| 244 | |
| 00:11:44,515 --> 00:11:47,954 | |
| I'm sure this has been a thing that's been integrated | |
| 245 | |
| 00:11:47,954 --> 00:11:51,210 | |
| into tools for a while, but it's the kind of | |
| 246 | |
| 00:11:51,210 --> 00:11:53,610 | |
| thing that when you start using these tools every day, | |
| 247 | |
| 00:11:53,610 --> 00:11:57,530 | |
| the need for it is almost instantly apparent because text | |
| 248 | |
| 00:11:57,530 --> 00:12:01,449 | |
| that doesn't have any punctuation or paragraph spacing takes a | |
| 249 | |
| 00:12:01,449 --> 00:12:03,885 | |
| long time to, you know, it takes so long to | |
| 250 | |
| 00:12:03,885 --> 00:12:08,924 | |
| get it into a presentable email that again, moves speech | |
| 251 | |
| 00:12:08,924 --> 00:12:13,005 | |
| tech into that before that inflection point where you're like, | |
| 252 | |
| 00:12:13,005 --> 00:12:13,885 | |
| nah, it's just not worth it. | |
| 253 | |
| 00:12:13,885 --> 00:12:16,844 | |
| It's like, it'll just be quicker to type this. | |
| 254 | |
| 00:12:17,199 --> 00:12:19,760 | |
| So it's a big, it's a little touch that actually | |
| 255 | |
| 00:12:20,000 --> 00:12:21,120 | |
| is a big deal. | |
| 256 | |
| 00:12:21,439 --> 00:12:25,360 | |
| So I was on Whisper and I've been using Whisper | |
| 257 | |
| 00:12:25,360 --> 00:12:27,679 | |
| and I kind of early on found a couple of | |
| 258 | |
| 00:12:27,679 --> 00:12:28,319 | |
| tools. | |
| 259 | |
| 00:12:28,319 --> 00:12:30,559 | |
| I couldn't find what I was looking for on Linux, | |
| 260 | |
| 00:12:30,559 --> 00:12:35,844 | |
| which is basically just something that'll run-in the background. | |
| 261 | |
| 00:12:35,844 --> 00:12:38,165 | |
| You'll give it an API key and it will just | |
| 262 | |
| 00:12:38,165 --> 00:12:42,964 | |
| like transcribe with like a little key to start and | |
| 263 | |
| 00:12:42,964 --> 00:12:43,765 | |
| stop the dictation. | |
| 264 | |
| 00:12:45,000 --> 00:12:48,360 | |
| And the issues where I discovered that like most people | |
| 265 | |
| 00:12:48,360 --> 00:12:51,960 | |
| involved in creating these projects were very much focused on | |
| 266 | |
| 00:12:51,960 --> 00:12:55,720 | |
| local models, running Whisper locally because you can. | |
| 267 | |
| 00:12:56,199 --> 00:12:58,120 | |
| And I tried that a bunch of times and just | |
| 268 | |
| 00:12:58,120 --> 00:13:00,974 | |
| never got results that were as good as the cloud. | |
| 269 | |
| 00:13:01,375 --> 00:13:03,535 | |
| And when I began looking at the cost of the | |
| 270 | |
| 00:13:03,535 --> 00:13:06,574 | |
| speech to text APIs and what I was spending, I | |
| 271 | |
| 00:13:06,574 --> 00:13:09,775 | |
| just thought there is it's actually, in my opinion, just | |
| 272 | |
| 00:13:09,775 --> 00:13:13,080 | |
| one of the better deals in API spending in the | |
| 273 | |
| 00:13:13,080 --> 00:13:13,400 | |
| cloud. | |
| 274 | |
| 00:13:13,400 --> 00:13:15,640 | |
| Like, it's just not that expensive for very, very good | |
| 275 | |
| 00:13:15,640 --> 00:13:19,559 | |
| models that are much more, you know, you're gonna be | |
| 276 | |
| 00:13:19,559 --> 00:13:22,679 | |
| able to run the full model, the latest model versus | |
| 277 | |
| 00:13:22,679 --> 00:13:26,525 | |
| whatever you can run on your average GPU unless you | |
| 278 | |
| 00:13:26,525 --> 00:13:28,765 | |
| want to buy a crazy GPU. | |
| 279 | |
| 00:13:28,765 --> 00:13:29,964 | |
| It doesn't really make sense to me. | |
| 280 | |
| 00:13:29,964 --> 00:13:33,084 | |
| Privacy is another concern that I know is kind of | |
| 281 | |
| 00:13:33,084 --> 00:13:35,245 | |
| like a very much a separate thing that people just | |
| 282 | |
| 00:13:35,245 --> 00:13:38,765 | |
| don't want their voice data and their voice leaving their | |
| 283 | |
| 00:13:38,765 --> 00:13:42,380 | |
| local environment maybe for regulatory reasons as well. | |
| 284 | |
| 00:13:42,620 --> 00:13:43,900 | |
| But I'm not in that. | |
| 285 | |
| 00:13:44,140 --> 00:13:48,460 | |
| I neither really care about people listening to my, grocery | |
| 286 | |
| 00:13:48,460 --> 00:13:51,500 | |
| list, consisting of, reminding myself that I need to buy | |
| 287 | |
| 00:13:51,500 --> 00:13:54,699 | |
| more beer, Cheetos, and hummus, which is kind of the | |
| 288 | |
| 00:13:55,254 --> 00:13:59,494 | |
| three staples of my diet during periods of poor nutrition. | |
| 289 | |
| 00:13:59,814 --> 00:14:02,295 | |
| But the kind of stuff that I transcribe, it's just | |
| 290 | |
| 00:14:02,295 --> 00:14:02,614 | |
| not. | |
| 291 | |
| 00:14:02,614 --> 00:14:07,734 | |
| It's not a privacy thing I'm that sort of sensitive | |
| 292 | |
| 00:14:07,734 --> 00:14:13,189 | |
| about and I don't do anything so sensitive or secure | |
| 293 | |
| 00:14:13,189 --> 00:14:14,710 | |
| that requires air capping. | |
| 294 | |
| 00:14:15,590 --> 00:14:17,510 | |
| I looked at the pricing and especially the kind of | |
| 295 | |
| 00:14:17,510 --> 00:14:18,870 | |
| older model mini. | |
| 296 | |
| 00:14:19,510 --> 00:14:21,830 | |
| Some of them are very, very affordable and I did | |
| 297 | |
| 00:14:21,830 --> 00:14:26,684 | |
| a calculation once with ChatGPT and I was like, okay, | |
| 298 | |
| 00:14:26,684 --> 00:14:30,285 | |
| this is the API price for I can't remember whatever | |
| 299 | |
| 00:14:30,285 --> 00:14:31,324 | |
| the model was. | |
| 300 | |
| 00:14:31,724 --> 00:14:34,365 | |
| Let's say I just go at it like nonstop, which | |
| 301 | |
| 00:14:34,365 --> 00:14:35,485 | |
| rarely happens. | |
| 302 | |
| 00:14:35,564 --> 00:14:38,879 | |
| Probably, I would say on average I might dictate thirty | |
| 303 | |
| 00:14:38,879 --> 00:14:41,679 | |
| to sixty minutes per day if I was probably summing | |
| 304 | |
| 00:14:41,679 --> 00:14:47,920 | |
| up the emails, documents, outlines, which is a lot, but | |
| 305 | |
| 00:14:47,920 --> 00:14:50,079 | |
| it's it's still a fairly modest amount. | |
| 306 | |
| 00:14:50,079 --> 00:14:51,759 | |
| And I was like, well, some days I do go | |
| 307 | |
| 00:14:51,759 --> 00:14:54,854 | |
| on like one or two days where I've been usually | |
| 308 | |
| 00:14:54,854 --> 00:14:56,775 | |
| when I'm like kind of out of the house and | |
| 309 | |
| 00:14:56,775 --> 00:15:00,455 | |
| just have something like I have nothing else to do. | |
| 310 | |
| 00:15:00,455 --> 00:15:03,095 | |
| Like if I'm at a hospital, we have a newborn | |
| 311 | |
| 00:15:03,495 --> 00:15:07,219 | |
| and you're waiting for like eight hours and hours for | |
| 312 | |
| 00:15:07,219 --> 00:15:08,020 | |
| an appointment. | |
| 313 | |
| 00:15:08,099 --> 00:15:11,939 | |
| And I would probably have listened to podcasts before becoming | |
| 314 | |
| 00:15:11,939 --> 00:15:12,900 | |
| a speech fanatic. | |
| 315 | |
| 00:15:12,900 --> 00:15:15,299 | |
| And I'm like, Oh, wait, let me just get down. | |
| 316 | |
| 00:15:15,299 --> 00:15:17,299 | |
| Let me just get these ideas out of my head. | |
| 317 | |
| 00:15:17,460 --> 00:15:20,665 | |
| And that's when I'll go on my speech binges. | |
| 318 | |
| 00:15:20,665 --> 00:15:22,584 | |
| But those are like once every few months, like not | |
| 319 | |
| 00:15:22,584 --> 00:15:23,464 | |
| frequently. | |
| 320 | |
| 00:15:23,704 --> 00:15:25,704 | |
| But I said, okay, let's just say if I'm going | |
| 321 | |
| 00:15:25,704 --> 00:15:28,104 | |
| to price out cloud STT. | |
| 322 | |
| 00:15:28,905 --> 00:15:33,420 | |
| If I was like dedicated every second of every waking | |
| 323 | |
| 00:15:33,420 --> 00:15:37,740 | |
| hour to transcribing for some odd reason, I mean I'd | |
| 324 | |
| 00:15:37,740 --> 00:15:39,740 | |
| have to eat and use the toilet. | |
| 325 | |
| 00:15:40,460 --> 00:15:42,620 | |
| There's only so many hours I'm awake for. | |
| 326 | |
| 00:15:42,620 --> 00:15:46,939 | |
| So let's just say a maximum of forty five minutes | |
| 327 | |
| 00:15:47,125 --> 00:15:49,125 | |
| in the hour, then I said, All right, let's just | |
| 328 | |
| 00:15:49,125 --> 00:15:50,085 | |
| say fifty. | |
| 329 | |
| 00:15:50,564 --> 00:15:51,285 | |
| Who knows? | |
| 330 | |
| 00:15:51,285 --> 00:15:52,724 | |
| You're dictating on the toilet. | |
| 331 | |
| 00:15:52,724 --> 00:15:53,525 | |
| We do it. | |
| 332 | |
| 00:15:53,844 --> 00:15:56,804 | |
| So you could just do sixty, but whatever I did | |
| 333 | |
| 00:15:57,045 --> 00:16:01,099 | |
| and every day, like you're going flat out seven days | |
| 334 | |
| 00:16:01,099 --> 00:16:02,540 | |
| a week dictating nonstop. | |
| 335 | |
| 00:16:02,540 --> 00:16:05,499 | |
| I was like, What's my monthly API bill going to | |
| 336 | |
| 00:16:05,499 --> 00:16:06,620 | |
| be at this price? | |
| 337 | |
| 00:16:06,699 --> 00:16:09,259 | |
| And it came out to like seventy or eighty bucks. | |
| 338 | |
| 00:16:09,259 --> 00:16:12,540 | |
| And I was like, Well, that would be an extraordinary | |
| 339 | |
| 00:16:12,860 --> 00:16:14,299 | |
| amount of dictation. | |
| 340 | |
| 00:16:14,299 --> 00:16:18,025 | |
| And I would hope that there was some compelling reason | |
| 341 | |
| 00:16:18,665 --> 00:16:21,704 | |
| worth more than seventy dollars that I embarked upon that | |
| 342 | |
| 00:16:21,704 --> 00:16:22,344 | |
| project. | |
| 343 | |
| 00:16:22,584 --> 00:16:24,505 | |
| So given that that's kind of the max point for | |
| 344 | |
| 00:16:24,505 --> 00:16:27,224 | |
| me I said that's actually very very affordable. | |
| 345 | |
| 00:16:27,944 --> 00:16:30,424 | |
| Now you're gonna if you want to spec out the | |
| 346 | |
| 00:16:30,424 --> 00:16:33,829 | |
| costs and you want to do the post processing that | |
| 347 | |
| 00:16:33,829 --> 00:16:36,709 | |
| I really do feel is valuable, that's going to cost | |
| 348 | |
| 00:16:36,709 --> 00:16:37,670 | |
| some more as well. | |
| 349 | |
| 00:16:37,990 --> 00:16:43,189 | |
| Unless you're using Gemini, which needless to say is a | |
| 350 | |
| 00:16:43,189 --> 00:16:45,110 | |
| random person sitting in Jerusalem. | |
| 351 | |
| 00:16:45,775 --> 00:16:49,375 | |
| I have no affiliation nor with Google nor Anthropic nor | |
| 352 | |
| 00:16:49,375 --> 00:16:52,334 | |
| Gemini nor any major tech vendor for that matter. | |
| 353 | |
| 00:16:53,775 --> 00:16:57,135 | |
| I like Gemini not so much as a everyday model. | |
| 354 | |
| 00:16:57,375 --> 00:16:59,854 | |
| It's kind of underwhelmed in that respect, I would say. | |
| 355 | |
| 00:17:00,299 --> 00:17:02,699 | |
| But for multimodal, I think it's got a lot to | |
| 356 | |
| 00:17:02,699 --> 00:17:03,259 | |
| offer. | |
| 357 | |
| 00:17:03,579 --> 00:17:07,099 | |
| And I think that the transcribing functionality whereby it can, | |
| 358 | |
| 00:17:07,979 --> 00:17:12,300 | |
| process audio with a system prompt and both give you | |
| 359 | |
| 00:17:12,300 --> 00:17:13,820 | |
| transcription that's cleaned up. | |
| 360 | |
| 00:17:13,820 --> 00:17:15,259 | |
| That reduces two steps to one. | |
| 361 | |
| 00:17:15,755 --> 00:17:18,874 | |
| And that for me is a very, very big deal. | |
| 362 | |
| 00:17:18,875 --> 00:17:22,394 | |
| And I feel like even Google hasn't really sort of | |
| 363 | |
| 00:17:22,475 --> 00:17:27,115 | |
| thought through how useful the that modality is and what | |
| 364 | |
| 00:17:27,115 --> 00:17:29,620 | |
| kind of use cases you can achieve with it. | |
| 365 | |
| 00:17:29,620 --> 00:17:32,259 | |
| Because I found in the course of this year just | |
| 366 | |
| 00:17:32,259 --> 00:17:37,939 | |
| an endless list of really kind of system prompt stuff | |
| 367 | |
| 00:17:37,939 --> 00:17:40,820 | |
| that I can say, okay, I've used it to capture | |
| 368 | |
| 00:17:40,820 --> 00:17:44,035 | |
| context data for AI, which is literally I might speak | |
| 369 | |
| 00:17:44,035 --> 00:17:46,675 | |
| for if I wanted to have a good bank of | |
| 370 | |
| 00:17:46,675 --> 00:17:49,955 | |
| context data about who knows my childhood. | |
| 371 | |
| 00:17:50,354 --> 00:17:54,275 | |
| More realistically, maybe my career goals, something that would just | |
| 372 | |
| 00:17:54,275 --> 00:17:56,115 | |
| be like really boring to type out. | |
| 373 | |
| 00:17:56,115 --> 00:18:00,420 | |
| So I'll just like sit in my car and record | |
| 374 | |
| 00:18:00,420 --> 00:18:01,380 | |
| it for ten minutes. | |
| 375 | |
| 00:18:01,380 --> 00:18:03,699 | |
| And that ten minutes you get a lot of information | |
| 376 | |
| 00:18:03,699 --> 00:18:04,339 | |
| in. | |
| 377 | |
| 00:18:05,539 --> 00:18:07,620 | |
| Emails, which is short text. | |
| 378 | |
| 00:18:08,580 --> 00:18:10,339 | |
| Just there is a whole bunch. | |
| 379 | |
| 00:18:10,340 --> 00:18:13,295 | |
| And all these workflows kind of require a little bit | |
| 380 | |
| 00:18:13,295 --> 00:18:15,054 | |
| of treatment afterwards and different treatment. | |
| 381 | |
| 00:18:15,054 --> 00:18:18,334 | |
| My context pipeline is kind of like just extract the | |
| 382 | |
| 00:18:18,334 --> 00:18:19,215 | |
| bare essentials. | |
| 383 | |
| 00:18:19,215 --> 00:18:22,094 | |
| You end up with me talking very loosely about sort | |
| 384 | |
| 00:18:22,094 --> 00:18:24,414 | |
| of what I've done in my career, where I've worked, | |
| 385 | |
| 00:18:24,414 --> 00:18:25,374 | |
| where I might like to work. | |
| 386 | |
| 00:18:25,920 --> 00:18:29,039 | |
| And it goes, it condenses that down to very robotic | |
| 387 | |
| 00:18:29,039 --> 00:18:32,640 | |
| language that is easy to chunk parse and maybe put | |
| 388 | |
| 00:18:32,640 --> 00:18:33,920 | |
| into a vector database. | |
| 389 | |
| 00:18:33,920 --> 00:18:36,160 | |
| Daniel has worked in technology. | |
| 390 | |
| 00:18:36,160 --> 00:18:39,760 | |
| Daniel has been working in, know, stuff like that. | |
| 391 | |
| 00:18:39,760 --> 00:18:42,975 | |
| That's not how you would speak, but I figure it's | |
| 392 | |
| 00:18:42,975 --> 00:18:46,414 | |
| probably easier to parse for, after all, robots. | |
| 393 | |
| 00:18:46,735 --> 00:18:48,654 | |
| So we've almost got to twenty minutes and this is | |
| 394 | |
| 00:18:48,654 --> 00:18:53,054 | |
| actually a success because I wasted twenty minutes of my | |
| 395 | |
| 00:18:53,455 --> 00:18:57,120 | |
| of the evening speaking into you in microphone and the | |
| 396 | |
| 00:18:57,120 --> 00:19:01,039 | |
| levels were shot and was clipping and I said I | |
| 397 | |
| 00:19:01,039 --> 00:19:02,320 | |
| can't really do an evaluation. | |
| 398 | |
| 00:19:02,320 --> 00:19:03,360 | |
| I have to be fair. | |
| 399 | |
| 00:19:03,360 --> 00:19:06,320 | |
| I have to give the models a chance to do | |
| 400 | |
| 00:19:06,320 --> 00:19:06,880 | |
| their thing. | |
| 401 | |
| 00:19:07,425 --> 00:19:09,505 | |
| What am I hoping to achieve in this? | |
| 402 | |
| 00:19:09,505 --> 00:19:11,584 | |
| Okay, my fine tune was a dud as mentioned. | |
| 403 | |
| 00:19:11,665 --> 00:19:15,185 | |
| Deepgram STT, I'm really, really hopeful that this prototype will | |
| 404 | |
| 00:19:15,185 --> 00:19:17,985 | |
| work and it's a build in public open source so | |
| 405 | |
| 00:19:17,985 --> 00:19:20,304 | |
| anyone is welcome to use it if I make anything | |
| 406 | |
| 00:19:20,304 --> 00:19:20,625 | |
| good. | |
| 407 | |
| 00:19:21,560 --> 00:19:23,800 | |
| But that was really exciting for me last night when | |
| 408 | |
| 00:19:23,800 --> 00:19:28,840 | |
| after hours of trying my own prototype, seeing someone just | |
| 409 | |
| 00:19:28,840 --> 00:19:32,039 | |
| made something that works like that, you you're not gonna | |
| 410 | |
| 00:19:32,039 --> 00:19:36,374 | |
| have to build a custom conda environment and image. | |
| 411 | |
| 00:19:36,374 --> 00:19:39,974 | |
| I have AMD GPU which makes things much more complicated. | |
| 412 | |
| 00:19:40,214 --> 00:19:42,614 | |
| I didn't find it and I was about to give | |
| 413 | |
| 00:19:42,614 --> 00:19:43,894 | |
| up and I said, All right, let me just give | |
| 414 | |
| 00:19:43,894 --> 00:19:46,455 | |
| Deepgram's Linux thing a shot. | |
| 415 | |
| 00:19:47,029 --> 00:19:49,589 | |
| And if this doesn't work, I'm just gonna go back | |
| 416 | |
| 00:19:49,589 --> 00:19:51,349 | |
| to trying to vibe code something myself. | |
| 417 | |
| 00:19:51,670 --> 00:19:55,509 | |
| And when I ran the script, I was using Cloud | |
| 418 | |
| 00:19:55,509 --> 00:19:59,029 | |
| Code to do the installation process, it ran the script | |
| 419 | |
| 00:19:59,029 --> 00:20:01,189 | |
| and, oh my gosh, it works just like that. | |
| 420 | |
| 00:20:01,824 --> 00:20:05,985 | |
| The tricky thing for all those who wants to know | |
| 421 | |
| 00:20:05,985 --> 00:20:11,425 | |
| all the nitty, ditty, nitty gritty details was that I | |
| 422 | |
| 00:20:11,425 --> 00:20:14,624 | |
| don't think it was actually struggling with transcription, but pasting | |
| 423 | |
| 00:20:14,705 --> 00:20:17,539 | |
| Weyland makes life very hard. | |
| 424 | |
| 00:20:17,539 --> 00:20:19,140 | |
| And I think there was something not running at the | |
| 425 | |
| 00:20:19,140 --> 00:20:19,699 | |
| right time. | |
| 426 | |
| 00:20:19,699 --> 00:20:22,979 | |
| Anyway, Deepgram, I looked at how they actually handle that | |
| 427 | |
| 00:20:22,979 --> 00:20:25,140 | |
| because it worked out of the box when other stuff | |
| 428 | |
| 00:20:25,140 --> 00:20:25,779 | |
| didn't. | |
| 429 | |
| 00:20:26,100 --> 00:20:28,900 | |
| And it was quite a clever little mechanism. | |
| 430 | |
| 00:20:29,495 --> 00:20:32,135 | |
| And but more so than that, the accuracy was brilliant. | |
| 431 | |
| 00:20:32,135 --> 00:20:33,574 | |
| Now what am I what am I doing here? | |
| 432 | |
| 00:20:33,574 --> 00:20:37,175 | |
| This is gonna be a twenty minute audio sample. | |
| 433 | |
| 00:20:38,375 --> 00:20:42,410 | |
| And I'm I think I've done one or two of | |
| 434 | |
| 00:20:42,410 --> 00:20:47,130 | |
| these before, but I did it with short, snappy voice | |
| 435 | |
| 00:20:47,130 --> 00:20:47,610 | |
| notes. | |
| 436 | |
| 00:20:47,610 --> 00:20:49,370 | |
| This is kind of long form. | |
| 437 | |
| 00:20:49,449 --> 00:20:51,929 | |
| This actually might be a better approximation for what's useful | |
| 438 | |
| 00:20:51,929 --> 00:20:53,849 | |
| to me than voice memos. | |
| 439 | |
| 00:20:53,849 --> 00:20:56,894 | |
| Like, I need to buy three liters of milk tomorrow | |
| 440 | |
| 00:20:56,894 --> 00:21:00,175 | |
| and peter bread, which is probably how half my voice | |
| 441 | |
| 00:21:00,175 --> 00:21:00,735 | |
| notes sound. | |
| 442 | |
| 00:21:00,735 --> 00:21:04,094 | |
| Like if anyone were to find my phone they'd be | |
| 443 | |
| 00:21:04,094 --> 00:21:05,934 | |
| like this is the most boring person in the world. | |
| 444 | |
| 00:21:06,015 --> 00:21:10,050 | |
| Although actually there are some journaling thoughts as well, but | |
| 445 | |
| 00:21:10,050 --> 00:21:11,810 | |
| it's a lot of content like that. | |
| 446 | |
| 00:21:11,810 --> 00:21:14,610 | |
| And the probably for the evaluation, the most useful thing | |
| 447 | |
| 00:21:14,610 --> 00:21:21,834 | |
| is slightly obscure tech, GitHub, Nucleano, hugging face, not so | |
| 448 | |
| 00:21:21,834 --> 00:21:24,474 | |
| obscure that it's not gonna have a chance of knowing | |
| 449 | |
| 00:21:24,474 --> 00:21:27,194 | |
| it, but hopefully sufficiently well known that the model should | |
| 450 | |
| 00:21:27,194 --> 00:21:27,834 | |
| get it. | |
| 451 | |
| 00:21:27,914 --> 00:21:29,995 | |
| I tried to do a little bit of speaking really | |
| 452 | |
| 00:21:29,995 --> 00:21:32,394 | |
| fast and speaking very slowly. | |
| 453 | |
| 00:21:32,394 --> 00:21:35,529 | |
| Would say in general, I've spoken, delivered this at a | |
| 454 | |
| 00:21:35,529 --> 00:21:39,130 | |
| faster pace than I usually would owing to strong coffee | |
| 455 | |
| 00:21:39,130 --> 00:21:40,570 | |
| flowing through my bloodstream. | |
| 456 | |
| 00:21:41,130 --> 00:21:43,529 | |
| And the thing that I'm not gonna get in this | |
| 457 | |
| 00:21:43,529 --> 00:21:46,090 | |
| benchmark is background noise, which in my first take that | |
| 458 | |
| 00:21:46,090 --> 00:21:48,455 | |
| I had to get rid of, my wife came in | |
| 459 | |
| 00:21:48,455 --> 00:21:51,495 | |
| with my son and for a good night kiss. | |
| 460 | |
| 00:21:51,574 --> 00:21:55,094 | |
| And that actually would have been super helpful to get | |
| 461 | |
| 00:21:55,094 --> 00:21:57,814 | |
| in because it was non diarized or if we had | |
| 462 | |
| 00:21:57,814 --> 00:21:58,695 | |
| diarization. | |
| 463 | |
| 00:21:59,334 --> 00:22:01,414 | |
| A female, I could say, I want the male voice | |
| 464 | |
| 00:22:01,414 --> 00:22:03,094 | |
| and that wasn't intended for transcription. | |
| 465 | |
| 00:22:04,509 --> 00:22:06,269 | |
| And we're not going to get background noise like people | |
| 466 | |
| 00:22:06,269 --> 00:22:08,989 | |
| honking their horns, which is something I've done in my | |
| 467 | |
| 00:22:09,150 --> 00:22:11,870 | |
| main data set where I am trying to go back | |
| 468 | |
| 00:22:11,870 --> 00:22:15,070 | |
| to some of my voice notes, annotate them and run | |
| 469 | |
| 00:22:15,070 --> 00:22:15,709 | |
| a benchmark. | |
| 470 | |
| 00:22:15,709 --> 00:22:18,265 | |
| But this is going to be just a pure quick | |
| 471 | |
| 00:22:18,265 --> 00:22:19,064 | |
| test. | |
| 472 | |
| 00:22:19,785 --> 00:22:24,025 | |
| And as someone I'm working on a voice note idea. | |
| 473 | |
| 00:22:24,025 --> 00:22:28,185 | |
| That's my sort of end motivation besides thinking it's an | |
| 474 | |
| 00:22:28,185 --> 00:22:31,785 | |
| absolutely outstanding technology that's coming to viability. | |
| 475 | |
| 00:22:31,785 --> 00:22:34,400 | |
| And really, I know this sounds cheesy, can actually have | |
| 476 | |
| 00:22:34,400 --> 00:22:36,479 | |
| a very transformative effect. | |
| 477 | |
| 00:22:37,920 --> 00:22:43,120 | |
| Voice technology has been life changing for folks living with | |
| 478 | |
| 00:22:43,999 --> 00:22:45,039 | |
| disabilities. | |
| 479 | |
| 00:22:45,920 --> 00:22:48,545 | |
| And I think there's something really nice about the fact | |
| 480 | |
| 00:22:48,545 --> 00:22:52,545 | |
| that it can also benefit folks who are able-bodied and | |
| 481 | |
| 00:22:52,545 --> 00:22:57,904 | |
| we can all in different ways make this tech as | |
| 482 | |
| 00:22:57,904 --> 00:23:00,705 | |
| useful as possible regardless of the exact way that we're | |
| 483 | |
| 00:23:00,705 --> 00:23:01,025 | |
| using it. | |
| 484 | |
| 00:23:02,199 --> 00:23:04,439 | |
| And I think there's something very powerful in that, and | |
| 485 | |
| 00:23:04,439 --> 00:23:05,559 | |
| it can be very cool. | |
| 486 | |
| 00:23:06,120 --> 00:23:07,559 | |
| I see huge potential. | |
| 487 | |
| 00:23:07,559 --> 00:23:09,319 | |
| What excites me about voice tech? | |
| 488 | |
| 00:23:09,719 --> 00:23:11,159 | |
| A lot of things actually. | |
| 489 | |
| 00:23:12,120 --> 00:23:14,839 | |
| Firstly, the fact that it's cheap and accurate, as I | |
| 490 | |
| 00:23:14,839 --> 00:23:17,785 | |
| mentioned at the very start of this, and it's getting | |
| 491 | |
| 00:23:17,785 --> 00:23:20,104 | |
| better and better with stuff like accent handling. | |
| 492 | |
| 00:23:20,745 --> 00:23:23,304 | |
| I'm not sure my fine tune will actually ever come | |
| 493 | |
| 00:23:23,304 --> 00:23:25,225 | |
| to fruition in the sense that I'll use it day | |
| 494 | |
| 00:23:25,225 --> 00:23:26,584 | |
| to day as I imagine. | |
| 495 | |
| 00:23:26,664 --> 00:23:30,505 | |
| I get like superb, flawless words error rates because I'm | |
| 496 | |
| 00:23:30,505 --> 00:23:34,949 | |
| just kind of skeptical about local speech to text, as | |
| 497 | |
| 00:23:34,949 --> 00:23:35,670 | |
| I mentioned. | |
| 498 | |
| 00:23:36,070 --> 00:23:39,830 | |
| And I think the pace of innovation and improvement in | |
| 499 | |
| 00:23:39,830 --> 00:23:42,310 | |
| the models, the main reasons for fine tuning from what | |
| 500 | |
| 00:23:42,310 --> 00:23:46,150 | |
| I've seen have been people who are something that really | |
| 501 | |
| 00:23:46,150 --> 00:23:50,375 | |
| blows blows my mind about ASR is the idea that | |
| 502 | |
| 00:23:50,375 --> 00:23:55,574 | |
| it's inherently ailingual or multilingual, phonetic based. | |
| 503 | |
| 00:23:56,295 --> 00:24:00,375 | |
| So as folks who use speak very obscure languages that | |
| 504 | |
| 00:24:00,375 --> 00:24:03,094 | |
| there may be very there might be a paucity of | |
| 505 | |
| 00:24:02,229 --> 00:24:05,030 | |
| training data or almost none at all, and therefore the | |
| 506 | |
| 00:24:05,030 --> 00:24:06,790 | |
| accuracy is significantly reduced. | |
| 507 | |
| 00:24:06,790 --> 00:24:11,350 | |
| Or folks in very critical environments, I know there are | |
| 508 | |
| 00:24:11,510 --> 00:24:15,350 | |
| this is used extensively in medical transcription and dispatcher work | |
| 509 | |
| 00:24:15,350 --> 00:24:19,064 | |
| as, you know the call centers who send out ambulances | |
| 510 | |
| 00:24:19,064 --> 00:24:19,864 | |
| etc. | |
| 511 | |
| 00:24:20,265 --> 00:24:23,545 | |
| Where accuracy is absolutely paramount and in the case of | |
| 512 | |
| 00:24:23,545 --> 00:24:27,545 | |
| doctors radiologists they might be using very specialized vocab all | |
| 513 | |
| 00:24:27,545 --> 00:24:27,865 | |
| the time. | |
| 514 | |
| 00:24:28,630 --> 00:24:30,229 | |
| So those are kind of the main two things, and | |
| 515 | |
| 00:24:30,229 --> 00:24:32,150 | |
| I'm not sure that really just for trying to make | |
| 516 | |
| 00:24:32,150 --> 00:24:36,390 | |
| it better on a few random tech words with my | |
| 517 | |
| 00:24:36,390 --> 00:24:39,429 | |
| slightly I mean, I have an accent, but, like, not, | |
| 518 | |
| 00:24:39,429 --> 00:24:42,469 | |
| you know, an accent that a few other million people | |
| 519 | |
| 00:24:42,870 --> 00:24:43,910 | |
| have ish. | |
| 520 | |
| 00:24:44,685 --> 00:24:47,965 | |
| I'm not sure that my little fine tune is gonna | |
| 521 | |
| 00:24:47,965 --> 00:24:52,604 | |
| actually like, the bump in word error reduction, if I | |
| 522 | |
| 00:24:52,604 --> 00:24:54,205 | |
| ever actually figure out how to do it and get | |
| 523 | |
| 00:24:54,205 --> 00:24:56,365 | |
| it up to the cloud, by the time we've done | |
| 524 | |
| 00:24:56,365 --> 00:24:59,959 | |
| that, I suspect that the next generation of ASR will | |
| 525 | |
| 00:24:59,959 --> 00:25:01,719 | |
| just be so good that it will kind of be, | |
| 526 | |
| 00:25:01,959 --> 00:25:03,959 | |
| well, that would have been cool if it worked out, | |
| 527 | |
| 00:25:03,959 --> 00:25:05,479 | |
| but I'll just use this instead. | |
| 528 | |
| 00:25:05,719 --> 00:25:10,679 | |
| So that's gonna be it for today's episode of voice | |
| 529 | |
| 00:25:10,679 --> 00:25:11,640 | |
| training data. | |
| 530 | |
| 00:25:11,880 --> 00:25:14,255 | |
| Single, long shot evaluation. | |
| 531 | |
| 00:25:14,495 --> 00:25:15,694 | |
| Who am I gonna compare? | |
| 532 | |
| 00:25:16,414 --> 00:25:18,574 | |
| Whisper is always good as a benchmark, but I'm more | |
| 533 | |
| 00:25:18,574 --> 00:25:22,175 | |
| interested in seeing Whisper head to head with two things | |
| 534 | |
| 00:25:22,175 --> 00:25:22,894 | |
| really. | |
| 535 | |
| 00:25:23,295 --> 00:25:25,134 | |
| One is Whisper variants. | |
| 536 | |
| 00:25:25,134 --> 00:25:27,695 | |
| So you've got these projects like Faster Whisper. | |
| 537 | |
| 00:25:29,110 --> 00:25:29,989 | |
| Distill Whisper. | |
| 538 | |
| 00:25:29,989 --> 00:25:30,709 | |
| It's a bit confusing. | |
| 539 | |
| 00:25:30,709 --> 00:25:31,909 | |
| There's a whole bunch of them. | |
| 540 | |
| 00:25:32,150 --> 00:25:35,110 | |
| And the emerging ASRs, which are also a thing. | |
| 541 | |
| 00:25:35,269 --> 00:25:37,110 | |
| My intention for this is I'm not sure I'm gonna | |
| 542 | |
| 00:25:37,110 --> 00:25:39,910 | |
| have the time in any point in the foreseeable future | |
| 543 | |
| 00:25:39,910 --> 00:25:44,775 | |
| to go back to this whole episode and create a | |
| 544 | |
| 00:25:44,775 --> 00:25:48,294 | |
| proper source truth where I fix everything. | |
| 545 | |
| 00:25:49,255 --> 00:25:51,894 | |
| Might do it if I can get one transcription that's | |
| 546 | |
| 00:25:51,894 --> 00:25:54,134 | |
| sufficiently close to perfection. | |
| 547 | |
| 00:25:54,934 --> 00:25:58,400 | |
| But what I would actually love to do on Hugging | |
| 548 | |
| 00:25:58,400 --> 00:26:00,479 | |
| Face, I think would be a great probably how I | |
| 549 | |
| 00:26:00,479 --> 00:26:04,400 | |
| might visualize this is having the audio waveform play and | |
| 550 | |
| 00:26:04,400 --> 00:26:08,880 | |
| then have the transcript for each model below it and | |
| 551 | |
| 00:26:08,880 --> 00:26:13,765 | |
| maybe even a, like, you know, to scale and maybe | |
| 552 | |
| 00:26:13,765 --> 00:26:16,644 | |
| even a local one as well, like local whisper versus | |
| 553 | |
| 00:26:16,644 --> 00:26:19,684 | |
| OpenAI API, etcetera. | |
| 554 | |
| 00:26:19,765 --> 00:26:23,124 | |
| And I can then actually listen back to segments or | |
| 555 | |
| 00:26:23,124 --> 00:26:25,285 | |
| anyone who wants to can listen back to segments of | |
| 556 | |
| 00:26:25,285 --> 00:26:30,219 | |
| this recording and see where a particular model struggled and | |
| 557 | |
| 00:26:30,219 --> 00:26:33,099 | |
| others didn't as well as the sort of headline finding | |
| 558 | |
| 00:26:33,099 --> 00:26:35,579 | |
| of which had the best W E R but that | |
| 559 | |
| 00:26:35,579 --> 00:26:37,659 | |
| would require the source of truth. | |
| 560 | |
| 00:26:37,660 --> 00:26:38,459 | |
| Okay, that's it. | |
| 561 | |
| 00:26:38,425 --> 00:26:40,985 | |
| I hope this was, I don't know, maybe useful for | |
| 562 | |
| 00:26:40,985 --> 00:26:42,904 | |
| other folks interested in STT. | |
| 563 | |
| 00:26:42,985 --> 00:26:45,945 | |
| You want to see I always think I've just said | |
| 564 | |
| 00:26:45,945 --> 00:26:47,624 | |
| it as something I didn't intend to. | |
| 565 | |
| 00:26:47,864 --> 00:26:49,624 | |
| STT, I said for those. | |
| 566 | |
| 00:26:49,624 --> 00:26:53,049 | |
| Listen carefully, including hopefully the models themselves. | |
| 567 | |
| 00:26:53,289 --> 00:26:55,049 | |
| This has been myself, Daniel Rosol. | |
| 568 | |
| 00:26:55,049 --> 00:26:59,370 | |
| For more jumbled repositories about my roving interest in AI | |
| 569 | |
| 00:26:59,370 --> 00:27:04,009 | |
| but particularly AgenTic, MCP and VoiceTech you can find me | |
| 570 | |
| 00:27:04,009 --> 00:27:05,689 | |
| on GitHub. | |
| 571 | |
| 00:27:05,929 --> 00:27:06,650 | |
| Hugging Face. | |
| 572 | |
| 00:27:08,045 --> 00:27:08,924 | |
| Where else? | |
| 573 | |
| 00:27:08,925 --> 00:27:11,725 | |
| DanielRosel dot com, which is my personal website, as well | |
| 574 | |
| 00:27:11,725 --> 00:27:15,485 | |
| as this podcast whose name I sadly cannot remember. | |
| 575 | |
| 00:27:15,644 --> 00:27:16,685 | |
| Until next time. | |
| 576 | |
| 00:27:16,685 --> 00:27:17,324 | |
| Thanks for listening. | |