Buckets:

rtrm's picture
download
raw
77.4 kB
<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Pre-trained models for automatic speech recognition&quot;,&quot;local&quot;:&quot;pre-trained-models-for-automatic-speech-recognition&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Probing CTC Models&quot;,&quot;local&quot;:&quot;probing-ctc-models&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Graduation to Seq2Seq&quot;,&quot;local&quot;:&quot;graduation-to-seq2seq&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Long-Form Transcription and Timestamps&quot;,&quot;local&quot;:&quot;long-form-transcription-and-timestamps&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Summary&quot;,&quot;local&quot;:&quot;summary&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}">
<link href="/docs/audio-course/pr_201/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/entry/start.367c4d78.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/scheduler.f7e1785c.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/singletons.0d70d4cc.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/index.279db187.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/paths.274f629d.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/entry/app.4c54ebf9.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/index.9f8f0838.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/nodes/0.e329f606.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/each.e59479a4.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/nodes/28.3470cfbc.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/CodeBlock.b3510e34.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/EditOnGithub.5a9bb8c5.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Pre-trained models for automatic speech recognition&quot;,&quot;local&quot;:&quot;pre-trained-models-for-automatic-speech-recognition&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Probing CTC Models&quot;,&quot;local&quot;:&quot;probing-ctc-models&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Graduation to Seq2Seq&quot;,&quot;local&quot;:&quot;graduation-to-seq2seq&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Long-Form Transcription and Timestamps&quot;,&quot;local&quot;:&quot;long-form-transcription-and-timestamps&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Summary&quot;,&quot;local&quot;:&quot;summary&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="pre-trained-models-for-automatic-speech-recognition" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#pre-trained-models-for-automatic-speech-recognition"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Pre-trained models for automatic speech recognition</span></h1> <p data-svelte-h="svelte-qu6ugv">In this section, we’ll cover how to use the <code>pipeline()</code> to leverage pre-trained models for speech recognition. In <a href="../chapter2/asr_pipeline">Unit 2</a>,
we introduced the <code>pipeline()</code> as an easy way of running speech recognition tasks, with all pre- and post-processing handled under-the-hood
and the flexibility to quickly experiment with any pre-trained checkpoint on the Hugging Face Hub. In this Unit, we’ll go a
level deeper and explore the different attributes of speech recognition models and how we can use them to tackle a range
of different tasks.</p> <p data-svelte-h="svelte-1lo7cg7">As detailed in Unit 3, speech recognition model broadly fall into one of two categories:</p> <ol data-svelte-h="svelte-4r9go5"><li>Connectionist Temporal Classification (CTC): <em>encoder-only</em> models with a linear classification (CTC) head on top</li> <li>Sequence-to-sequence (Seq2Seq): <em>encoder-decoder</em> models, with a cross-attention mechanism between the encoder and decoder</li></ol> <p data-svelte-h="svelte-7bh989">Prior to 2022, CTC was the more popular of the two architectures, with encoder-only models such as Wav2Vec2, HuBERT and XLSR achieving
breakthoughs in the pre-training / fine-tuning paradigm for speech. Big corporations, such as Meta and Microsoft, pre-trained
the encoder on vast amounts of unlabelled audio data for many days or weeks. Users could then take a pre-trained checkpoint, and
fine-tune it with a CTC head on as little as <strong>10 minutes</strong> of labelled speech data to achieve strong performance on a downstream
speech recognition task.</p> <p data-svelte-h="svelte-1y7wlps">However, CTC models have their shortcomings. Appending a simple linear layer to an encoder gives a small, fast overall model, but can
be prone to phonetic spelling errors. We’ll demonstrate this for the Wav2Vec2 model below.</p> <h2 class="relative group"><a id="probing-ctc-models" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#probing-ctc-models"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Probing CTC Models</span></h2> <p data-svelte-h="svelte-6p3rx4">Let’s load a small excerpt of the <a href="hf-internal-testing/librispeech_asr_dummy">LibriSpeech ASR</a> dataset to demonstrate
Wav2Vec2’s speech transcription capabilities:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset
dataset = load_dataset(
<span class="hljs-string">&quot;hf-internal-testing/librispeech_asr_dummy&quot;</span>, <span class="hljs-string">&quot;clean&quot;</span>, split=<span class="hljs-string">&quot;validation&quot;</span>
)
dataset<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-title function_ invoke__">Dataset</span>({
<span class="hljs-attr">features</span>: [<span class="hljs-string">&#x27;file&#x27;</span>, <span class="hljs-string">&#x27;audio&#x27;</span>, <span class="hljs-string">&#x27;text&#x27;</span>, <span class="hljs-string">&#x27;speaker_id&#x27;</span>, <span class="hljs-string">&#x27;chapter_id&#x27;</span>, <span class="hljs-string">&#x27;id&#x27;</span>],
<span class="hljs-attr">num_rows</span>: <span class="hljs-number">73</span>
})<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1f14qa5">We can pick one of the 73 audio samples and inspect the audio sample as well as the transcription:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> IPython.display <span class="hljs-keyword">import</span> Audio
sample = dataset[<span class="hljs-number">2</span>]
<span class="hljs-built_in">print</span>(sample[<span class="hljs-string">&quot;text&quot;</span>])
Audio(sample[<span class="hljs-string">&quot;audio&quot;</span>][<span class="hljs-string">&quot;array&quot;</span>], rate=sample[<span class="hljs-string">&quot;audio&quot;</span>][<span class="hljs-string">&quot;sampling_rate&quot;</span>])<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->HE TELLS US THAT <span class="hljs-built_in">AT</span> THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS <span class="hljs-keyword">AND </span>ROAST <span class="hljs-keyword">BEEF </span>LOOMING <span class="hljs-keyword">BEFORE </span>US SIMILES DRAWN FROM EATING <span class="hljs-keyword">AND </span>ITS RESULTS OCCUR MOST READILY TO THE MIND<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-zwb92k">Alright! Christmas and roast beef, sounds great! 🎄 Having chosen a data sample, we now load a fine-tuned checkpoint into
the <code>pipeline()</code>. For this, we’ll use the official <a href="facebook/wav2vec2-base-100h">Wav2Vec2 base</a> checkpoint fine-tuned on
100 hours of LibriSpeech data:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline
pipe = pipeline(<span class="hljs-string">&quot;automatic-speech-recognition&quot;</span>, model=<span class="hljs-string">&quot;facebook/wav2vec2-base-100h&quot;</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-160y70r">Next, we’ll take an example from the dataset and pass its raw data to the pipeline. Since the <code>pipeline</code> <em>consumes</em> any
dictionary that we pass it (meaning it cannot be re-used), we’ll pass a copy of the data. This way, we can safely re-use
the same audio sample in the following examples:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pipe(sample[<span class="hljs-string">&quot;audio&quot;</span>].copy())<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-comment">{&quot;text&quot;: &quot;HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAUS AND ROSE BEEF LOOMING BEFORE US SIMALYIS DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND&quot;}</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1n1tyml">We can see that the Wav2Vec2 model does a pretty good job at transcribing this sample - at a first glance it looks generally correct.
Let’s put the target and prediction side-by-side and highlight the differences:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-symbol">Target:</span> HE TELLS US THAT <span class="hljs-built_in">AT</span> THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS <span class="hljs-keyword">AND </span>ROAST <span class="hljs-keyword">BEEF </span>LOOMING <span class="hljs-keyword">BEFORE </span>US SIMILES DRAWN FROM EATING <span class="hljs-keyword">AND </span>ITS RESULTS OCCUR MOST READILY TO THE MIND
<span class="hljs-symbol">Prediction:</span> HE TELLS US THAT <span class="hljs-built_in">AT</span> THIS FESTIVE SEASON OF THE YEAR WITH **CHRISTMAUS** <span class="hljs-keyword">AND </span>**ROSE** <span class="hljs-keyword">BEEF </span>LOOMING <span class="hljs-keyword">BEFORE </span>US **SIMALYIS** DRAWN FROM EATING <span class="hljs-keyword">AND </span>ITS RESULTS OCCUR MOST READILY TO THE MIND<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-fjdtrw">Comparing the target text to the predicted transcription, we can see that all words <em>sound</em> correct, but some are not spelled accurately. For example:</p> <ul data-svelte-h="svelte-11mc7qy"><li><em>CHRISTMAUS</em> vs. <em>CHRISTMAS</em></li> <li><em>ROSE</em> vs. <em>ROAST</em></li> <li><em>SIMALYIS</em> vs. <em>SIMILES</em></li></ul> <p data-svelte-h="svelte-18hr596">This highlights the shortcoming of a CTC model. A CTC model is essentially an ‘acoustic-only’ model: it consists of an encoder
which forms hidden-state representations from the audio inputs, and a linear layer which maps the hidden-states to characters:</p> <p data-svelte-h="svelte-mwf7be">This means that the system almost entirely bases its prediction on the acoustic input it was given (the phonetic sounds of the audio),
and so has a tendency to transcribe the audio in a phonetic way (e.g. <em>CHRISTMAUS</em>). It gives less importance to the
language modelling context of previous and successive letters, and so is prone to phonetic spelling errors. A more intelligent model
would identify that <em>CHRISTMAUS</em> is not a valid word in the English vocabulary, and correct it to <em>CHRISTMAS</em> when making
its predictions. We’re also missing two big features in our prediction - casing and punctuation - which limits the usefulness of
the model’s transcriptions to real-world applications.</p> <h2 class="relative group"><a id="graduation-to-seq2seq" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#graduation-to-seq2seq"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Graduation to Seq2Seq</span></h2> <p data-svelte-h="svelte-55oy83">Cue Seq2Seq models! As outlined in Unit 3, Seq2Seq models are formed of an encoder and decoder linked via a cross-attention
mechanism. The encoder plays the same role as before, computing hidden-state representations of the audio inputs, while the decoder
plays the role of a <strong>language model</strong>. The decoder processes the entire sequence of hidden-state representations
from the encoder and generates the corresponding text transcriptions. With global context of the audio input, the decoder
is able to use language modelling context as it makes its predictions, correcting for spelling mistakes on-the-fly and thus
circumventing the issue of phonetic predictions.</p> <p data-svelte-h="svelte-969zyv">There are two downsides to Seq2Seq models:</p> <ol data-svelte-h="svelte-osi73t"><li>They are inherently slower at decoding, since the decoding process happens one step at a time, rather than all at once</li> <li>They are more data hungry, requiring significantly more training data to reach convergence</li></ol> <p data-svelte-h="svelte-1bn96on">In particular, the need for large amounts of training data has been a bottleneck in the advancement of Seq2Seq architectures for
speech. Labelled speech data is difficult to come by, with the largest annotated datasets at the time clocking in at just
10,000 hours. This all changed in 2022 upon the release of <strong>Whisper</strong>. Whisper is a pre-trained model for speech recognition
published in <a href="https://openai.com/blog/whisper/" rel="nofollow">September 2022</a> by the authors Alec Radford et al. from OpenAI. Unlike
its CTC predecessors, which were pre-trained entirely on <strong>un-labelled</strong> audio data, Whisper is pre-trained on a vast quantity of
<strong>labelled</strong> audio-transcription data, 680,000 hours to be precise.</p> <p data-svelte-h="svelte-zyrdbw">This is an order of magnitude more data than the un-labelled audio data used to train Wav2Vec 2.0 (60,000 hours). What is
more, 117,000 hours of this pre-training data is multilingual (or “non-English”) data. This results in checkpoints that can be applied to
over 96 languages, many of which are considered <em>low-resource</em>, meaning the language lacks a large corpus of data suitable for training.</p> <p data-svelte-h="svelte-em8jox">When scaled to 680,000 hours of labelled pre-training data, Whisper models demonstrate a strong ability to generalise to
many datasets and domains. The pre-trained checkpoints achieve competitive results to state-of-the-art pipe systems, with
near 3% word error rate (WER) on the test-clean subset of LibriSpeech pipe and a new state-of-the-art on TED-LIUM with
4.7% WER (<em>c.f.</em> Table 8 of the <a href="https://cdn.openai.com/papers/whisper.pdf" rel="nofollow">Whisper paper</a>).</p> <p data-svelte-h="svelte-1g5392c">Of particular importance is Whisper’s ability to handle long-form audio samples, its robustness to input noise and ability
to predict cased and punctuated transcriptions. This makes it a viable candidate for real-world speech recognition systems.</p> <p data-svelte-h="svelte-5h4os6">The remainder of this section will show you how to use the pre-trained Whisper models for speech recognition using 🤗
Transformers. In many situations, the pre-trained Whisper checkpoints are extremely performant and give great results,
thus we encourage you to try using the pre-trained checkpoints as a first step to solving any speech recognition problem.
Through fine-tuning, the pre-trained checkpoints can be adapted for specific datasets and languages to further improve
upon these results. We’ll demonstrate how to do this in the upcoming subsection on <a href="fine-tuning">fine-tuning</a>.</p> <p data-svelte-h="svelte-uy2cuy">The Whisper checkpoints come in five configurations of varying model sizes. The smallest four are trained on either
English-only or multilingual data. The largest checkpoint is multilingual only. All nine of the pre-trained checkpoints
are available on the <a href="https://huggingface.co/models?search=openai/whisper" rel="nofollow">Hugging Face Hub</a>. The checkpoints are
summarised in the following table with links to the models on the Hub. “VRAM” denotes the required GPU memory to run the
model with the minimum batch size of 1. “Rel Speed” is the relative speed of a checkpoint compared to the largest model.
Based on this information, you can select a checkpoint that is best suited to your hardware.</p> <table data-svelte-h="svelte-leifmh"><thead><tr><th>Size</th> <th>Parameters</th> <th>VRAM / GB</th> <th>Rel Speed</th> <th>English-only</th> <th>Multilingual</th></tr></thead> <tbody><tr><td>tiny</td> <td>39 M</td> <td>1.4</td> <td>32</td> <td><a href="https://huggingface.co/openai/whisper-tiny.en" rel="nofollow"></a></td> <td><a href="https://huggingface.co/openai/whisper-tiny" rel="nofollow"></a></td></tr> <tr><td>base</td> <td>74 M</td> <td>1.5</td> <td>16</td> <td><a href="https://huggingface.co/openai/whisper-base.en" rel="nofollow"></a></td> <td><a href="https://huggingface.co/openai/whisper-base" rel="nofollow"></a></td></tr> <tr><td>small</td> <td>244 M</td> <td>2.3</td> <td>6</td> <td><a href="https://huggingface.co/openai/whisper-small.en" rel="nofollow"></a></td> <td><a href="https://huggingface.co/openai/whisper-small" rel="nofollow"></a></td></tr> <tr><td>medium</td> <td>769 M</td> <td>4.2</td> <td>2</td> <td><a href="https://huggingface.co/openai/whisper-medium.en" rel="nofollow"></a></td> <td><a href="https://huggingface.co/openai/whisper-medium" rel="nofollow"></a></td></tr> <tr><td>large</td> <td>1550 M</td> <td>7.5</td> <td>1</td> <td>x</td> <td><a href="https://huggingface.co/openai/whisper-large-v2" rel="nofollow"></a></td></tr></tbody></table> <p data-svelte-h="svelte-efpzbz">Let’s load the <a href="https://huggingface.co/openai/whisper-base" rel="nofollow">Whisper Base</a> checkpoint, which is of comparable size to the
Wav2Vec2 checkpoint we used previously. Preempting our move to multilingual speech recognition, we’ll load the multilingual
variant of the base checkpoint. We’ll also load the model on the GPU if available, or CPU otherwise. The <code>pipeline()</code> will
subsequently take care of moving all inputs / outputs from the CPU to the GPU as required:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline
device = <span class="hljs-string">&quot;cuda:0&quot;</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">&quot;cpu&quot;</span>
pipe = pipeline(
<span class="hljs-string">&quot;automatic-speech-recognition&quot;</span>, model=<span class="hljs-string">&quot;openai/whisper-base&quot;</span>, device=device
)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1feq0tt">Great! Now let’s transcribe the audio as before. The only change we make is passing an extra argument, <code>max_new_tokens</code>,
which tells the model the maximum number of tokens to generate when making its prediction:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pipe(sample[<span class="hljs-string">&quot;audio&quot;</span>], max_new_tokens=<span class="hljs-number">256</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{&#x27;<span class="hljs-built_in">text</span>&#x27;: &#x27; He tells us <span class="hljs-keyword">that</span> <span class="hljs-keyword">at</span> this festive season <span class="hljs-keyword">of</span> <span class="hljs-keyword">the</span> <span class="hljs-built_in">year</span>, <span class="hljs-keyword">with</span> Christmas <span class="hljs-keyword">and</span> roast beef looming <span class="hljs-keyword">before</span> us, similarly <span class="hljs-keyword">is</span> drawn <span class="hljs-keyword">from</span> eating <span class="hljs-keyword">and</span> <span class="hljs-keyword">its</span> results occur most readily <span class="hljs-keyword">to</span> <span class="hljs-keyword">the</span> mind.&#x27;}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-18db65j">Easy enough! The first thing you’ll notice is the presence of both casing and punctuation. Immediately this makes the
transcription easier to read compared to the un-cased and un-punctuated transcription from Wav2Vec2. Let’s put the transcription
side-by-side with the target:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-symbol">Target:</span> HE TELLS US THAT <span class="hljs-built_in">AT</span> THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS <span class="hljs-keyword">AND </span>ROAST <span class="hljs-keyword">BEEF </span>LOOMING <span class="hljs-keyword">BEFORE </span>US SIMILES DRAWN FROM EATING <span class="hljs-keyword">AND </span>ITS RESULTS OCCUR MOST READILY TO THE MIND
<span class="hljs-symbol">Prediction:</span> He tells us that <span class="hljs-built_in">at</span> this festive season of the year, with **Christmas** <span class="hljs-keyword">and </span>**roast** <span class="hljs-keyword">beef </span>looming <span class="hljs-keyword">before </span>us, **similarly** is drawn from eating <span class="hljs-keyword">and </span>its results occur most readily to the mind.<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-mm1pes">Whisper has done a great job at correcting the phonetic errors we saw from Wav2Vec2 - both <em>Christmas</em> and <em>roast</em> are
spelled correctly. We see that the model still struggles with <em>SIMILES</em>, being incorrectly transcribed as <em>similarly</em>, but
this time the prediction is a valid word from the English vocabulary. Using a larger Whisper checkpoint can help further
reduce transcription errors, at the expense of requiring more compute and a longer transcription time.</p> <p data-svelte-h="svelte-1ol57p5">We’ve been promised a model that can handle 96 languages, so lets leave English speech recognition for now and go global 🌎!
The <a href="https://huggingface.co/datasets/facebook/multilingual_librispeech" rel="nofollow">Multilingual LibriSpeech</a> (MLS) dataset is
the multilingual equivalent of the LibriSpeech dataset, with labelled audio data in six languages. We’ll load one sample
from the Spanish split of the MLS dataset, making use of <em>streaming</em> mode so that we don’t have to download the entire dataset:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->dataset = load_dataset(
<span class="hljs-string">&quot;facebook/multilingual_librispeech&quot;</span>, <span class="hljs-string">&quot;spanish&quot;</span>, split=<span class="hljs-string">&quot;validation&quot;</span>, streaming=<span class="hljs-literal">True</span>
)
sample = <span class="hljs-built_in">next</span>(<span class="hljs-built_in">iter</span>(dataset))<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1f9nh4o">Again, we’ll inspect the text transcription and take a listen to the audio segment:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-built_in">print</span>(sample[<span class="hljs-string">&quot;text&quot;</span>])
Audio(sample[<span class="hljs-string">&quot;audio&quot;</span>][<span class="hljs-string">&quot;array&quot;</span>], rate=sample[<span class="hljs-string">&quot;audio&quot;</span>][<span class="hljs-string">&quot;sampling_rate&quot;</span>])<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->entonces <span class="hljs-keyword">te</span> delelitarás <span class="hljs-keyword">en</span> jehová y yo <span class="hljs-keyword">te</span> haré subir sobre las alturas <span class="hljs-keyword">de</span> <span class="hljs-keyword">la</span> tierra y <span class="hljs-keyword">te</span> daré á comer <span class="hljs-keyword">la</span> heredad <span class="hljs-keyword">de</span> jacob tu padre porque <span class="hljs-keyword">la</span> boca <span class="hljs-keyword">de</span> jehová lo ha hablado<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-75q7fw">This is the target text that we’re aiming for with our Whisper transcription. Although we now know that we can
probably do better this, since our model is also going to predict punctuation and casing, neither of which are present in the
reference. Let’s forward the audio sample to the pipeline to get our text prediction. One thing to note is that the
pipeline <em>consumes</em> the dictionary of audio inputs that we input, meaning the dictionary can’t be re-used. To circumvent
this, we’ll pass a <em>copy</em> of the audio sample, so that we can re-use the same audio sample in the proceeding code examples:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pipe(sample[<span class="hljs-string">&quot;audio&quot;</span>].copy(), max_new_tokens=<span class="hljs-number">256</span>, generate_kwargs={<span class="hljs-string">&quot;task&quot;</span>: <span class="hljs-string">&quot;transcribe&quot;</span>})<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{&#x27;text&#x27;: &#x27; Entonces <span class="hljs-keyword">te</span> deleitarás <span class="hljs-keyword">en</span> Jehová y yo <span class="hljs-keyword">te</span> haré subir sobre las alturas <span class="hljs-keyword">de</span> <span class="hljs-keyword">la</span> tierra y <span class="hljs-keyword">te</span> daré a comer <span class="hljs-keyword">la</span> heredad <span class="hljs-keyword">de</span> Jacob tu padre porque <span class="hljs-keyword">la</span> boca <span class="hljs-keyword">de</span> Jehová lo ha hablado.&#x27;}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-uzlabu">Great - this looks very similar to our reference text (arguably better since it has punctuation and casing!). You’ll notice
that we forwarded the <code>&quot;task&quot;</code> as a <em>generate key-word argument</em> (generate kwarg). Setting the <code>&quot;task&quot;</code> to <code>&quot;transcribe&quot;</code>
forces Whisper to perform the task of <em>speech recognition</em>, where the audio is transcribed in the same language that the
speech was spoken in. Whisper is also capable of performing the closely related task of <em>speech translation</em>, where the
audio in Spanish can be translated to text in English. To achieve this, we set the <code>&quot;task&quot;</code> to <code>&quot;translate&quot;</code>:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pipe(sample[<span class="hljs-string">&quot;audio&quot;</span>], max_new_tokens=<span class="hljs-number">256</span>, generate_kwargs={<span class="hljs-string">&quot;task&quot;</span>: <span class="hljs-string">&quot;translate&quot;</span>})<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-comment">{&#x27;text&#x27;: &#x27; So you will choose in Jehovah and I will raise you on the heights of the earth and I will give you the honor of Jacob to your father because the voice of Jehovah has spoken to you.&#x27;}</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-joibr">Now that we know we can toggle between speech recognition and speech translation, we can pick our task depending on our
needs. Either we recognise from audio in language X to text in the same language X (e.g. Spanish audio to Spanish text),
or we translate from audio in any language X to text in English (e.g. Spanish audio to English text).</p> <p data-svelte-h="svelte-1aecqb1">To read more about how the <code>&quot;task&quot;</code> argument is used to control the properties of the generated text, refer to the
<a href="https://huggingface.co/openai/whisper-base#usage" rel="nofollow">model card</a> for the Whisper base model.</p> <h2 class="relative group"><a id="long-form-transcription-and-timestamps" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#long-form-transcription-and-timestamps"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Long-Form Transcription and Timestamps</span></h2> <p data-svelte-h="svelte-ad6ien">So far, we’ve focussed on transcribing short audio samples of less than 30 seconds. We mentioned that one of the appeals
of Whisper was its ability to work on long audio samples. We’ll tackle this task here!</p> <p data-svelte-h="svelte-1atykwj">Let’s create a long audio file by concatenating sequential samples from the MLS dataset. Since the MLS dataset is
curated by splitting long audiobook recordings into shorter segments, concatenating samples is one way of reconstructing
longer audiobook passages. Consequently, the resulting audio should be coherent across the entire sample.</p> <p data-svelte-h="svelte-8fkb86">We’ll set our target audio length to 5 minutes, and stop concatenating samples once we hit this value:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
target_length_in_m = <span class="hljs-number">5</span>
<span class="hljs-comment"># convert from minutes to seconds (* 60) to num samples (* sampling rate)</span>
sampling_rate = pipe.feature_extractor.sampling_rate
target_length_in_samples = target_length_in_m * <span class="hljs-number">60</span> * sampling_rate
<span class="hljs-comment"># iterate over our streaming dataset, concatenating samples until we hit our target</span>
long_audio = []
<span class="hljs-keyword">for</span> sample <span class="hljs-keyword">in</span> dataset:
long_audio.extend(sample[<span class="hljs-string">&quot;audio&quot;</span>][<span class="hljs-string">&quot;array&quot;</span>])
<span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(long_audio) &gt; target_length_in_samples:
<span class="hljs-keyword">break</span>
long_audio = np.asarray(long_audio)
<span class="hljs-comment"># how did we do?</span>
seconds = <span class="hljs-built_in">len</span>(long_audio) / <span class="hljs-number">16000</span>
minutes, seconds = <span class="hljs-built_in">divmod</span>(seconds, <span class="hljs-number">60</span>)
<span class="hljs-built_in">print</span>(<span class="hljs-string">f&quot;Length of audio sample is <span class="hljs-subst">{minutes}</span> minutes <span class="hljs-subst">{seconds:<span class="hljs-number">.2</span>f}</span> seconds&quot;</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-attribute">Length</span> of audio sample is <span class="hljs-number">5</span>.<span class="hljs-number">0</span> minutes <span class="hljs-number">17</span>.<span class="hljs-number">22</span> seconds<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-rlobar">Alright! 5 minutes and 17 seconds of audio to transcribe. There are two problems with forwarding this long audio sample
directly to the model:</p> <ol data-svelte-h="svelte-xow7at"><li>Whisper is inherently designed to work with 30 second samples: anything shorter than 30s is padded to 30s with silence, anything longer than 30s is truncated to 30s by cutting of the extra audio, so if we pass our audio directly we’ll only get the transcription for the first 30s</li> <li>Memory in a transformer network scales with the sequence length squared: doubling the input length quadruples the memory requirement, so passing super long audio files is bound to lead to an out-of-memory (OOM) error</li></ol> <p data-svelte-h="svelte-1kl6cah">The way long-form transcription works in 🤗 Transformers is by <em>chunking</em> the input audio into smaller, more manageable segments.
Each segment has a small amount of overlap with the previous one. This allows us to accurately stitch the segments back together
at the boundaries, since we can find the overlap between segments and merge the transcriptions accordingly:</p> <div class="flex justify-center" data-svelte-h="svelte-dbmn5s"><img src="https://huggingface.co/blog/assets/49_asr_chunking/Striding.png" alt="🤗 Transformers chunking algorithm. Source: https://huggingface.co/blog/asr-chunking."></div> <p>The advantage of chunking the samples is that we don’t need the result of chunk<!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex"> i </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em;"></span><span class="mord mathnormal">i</span></span></span></span><!-- HTML_TAG_END --> to transcribe the subsequent
chunk<!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi><mo>+</mo><mn>1</mn></mrow><annotation encoding="application/x-tex"> i + 1 </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7429em;vertical-align:-0.0833em;"></span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">1</span></span></span></span><!-- HTML_TAG_END -->. The stitching is done after we have transcribed all the chunks at the chunk boundaries, so it doesn’t
matter which order we transcribe chunks in. The algorithm is entirely <strong data-svelte-h="svelte-1caf2ri">stateless</strong>, so we can even do chunk<!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi><mo>+</mo><mn>1</mn></mrow><annotation encoding="application/x-tex"> i + 1 </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7429em;vertical-align:-0.0833em;"></span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">1</span></span></span></span><!-- HTML_TAG_END -->
at the same time as chunk<!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex"> i </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em;"></span><span class="mord mathnormal">i</span></span></span></span><!-- HTML_TAG_END -->! This allows us to <em data-svelte-h="svelte-1swb9sm">batch</em> the chunks and run them through the model in parallel,
providing a large computational speed-up compared to transcribing them sequentially. To read more about chunking in 🤗 Transformers,
you can refer to this <a href="https://huggingface.co/blog/asr-chunking" rel="nofollow" data-svelte-h="svelte-1d46t8h">blog post</a>.</p> <p data-svelte-h="svelte-1bfrg8c">To activate long-form transcriptions, we have to add one additional argument when we call the pipeline. This argument,
<code>chunk_length_s</code>, controls the length of the chunked segments in seconds. For Whisper, 30 second chunks are optimal,
since this matches the input length Whisper expects.</p> <p data-svelte-h="svelte-1ldveni">To activate batching, we need to pass the argument <code>batch_size</code> to the pipeline. Putting it all together, we can transcribe the
long audio sample with chunking and batching as follows:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pipe(
long_audio,
max_new_tokens=<span class="hljs-number">256</span>,
generate_kwargs={<span class="hljs-string">&quot;task&quot;</span>: <span class="hljs-string">&quot;transcribe&quot;</span>},
chunk_length_s=<span class="hljs-number">30</span>,
batch_size=<span class="hljs-number">8</span>,
)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{&#x27;text&#x27;: &#x27; Entonces <span class="hljs-keyword">te</span> deleitarás <span class="hljs-keyword">en</span> Jehová, y yo <span class="hljs-keyword">te</span> haré subir sobre las alturas <span class="hljs-keyword">de</span> <span class="hljs-keyword">la</span> tierra, y <span class="hljs-keyword">te</span> daré a comer <span class="hljs-keyword">la</span>
heredad <span class="hljs-keyword">de</span> Jacob tu padre, porque <span class="hljs-keyword">la</span> boca <span class="hljs-keyword">de</span> Jehová lo ha hablado. nosotros curados. Todos nosotros nos descarriamos
como bejas, cada cual <span class="hljs-keyword">se</span> apartó por <span class="hljs-keyword">su</span> camino, mas Jehová cargó <span class="hljs-keyword">en</span> é<span class="hljs-keyword">l</span> el pecado <span class="hljs-keyword">de</span> todos nosotros...<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-wy7qyj">We won’t print the entire output here since it’s pretty long (312 words total)! On a 16GB V100 GPU, you can expect the above
line to take approximately 3.45 seconds to run, which is pretty good for a 317 second audio sample. On a CPU, expect
closer to 30 seconds.</p> <p data-svelte-h="svelte-1pdqzp1">Whisper is also able to predict segment-level <em>timestamps</em> for the audio data. These timestamps indicate the start and end
time for a short passage of audio, and are particularly useful for aligning a transcription with the input audio. Suppose
we want to provide closed captions for a video - we need these timestamps to know which part of the transcription corresponds
to a certain segment of video, in order to display the correct transcription for that time.</p> <p data-svelte-h="svelte-1kodbk1">Activating timestamp prediction is straightforward, we just need to set the argument <code>return_timestamps=True</code>. Timestamps
are compatible with both the chunking and batching methods we used previously, so we can simply append the timestamp
argument to our previous call:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pipe(
long_audio,
max_new_tokens=<span class="hljs-number">256</span>,
generate_kwargs={<span class="hljs-string">&quot;task&quot;</span>: <span class="hljs-string">&quot;transcribe&quot;</span>},
chunk_length_s=<span class="hljs-number">30</span>,
batch_size=<span class="hljs-number">8</span>,
return_timestamps=<span class="hljs-literal">True</span>,
)[<span class="hljs-string">&quot;chunks&quot;</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[{<span class="hljs-symbol">&#x27;timestamp</span><span class="hljs-symbol">&#x27;:</span> (<span class="hljs-name">0.0</span>, <span class="hljs-number">26.4</span>),
<span class="hljs-symbol">&#x27;text</span><span class="hljs-symbol">&#x27;:</span> &#x27; Entonces te deleitarás en Jehová, y yo te haré subir sobre las alturas de la tierra, y te daré a comer la heredad de Jacob tu padre, porque la boca de Jehová lo ha hablado. nosotros curados. Todos nosotros nos descarriamos como bejas, cada cual se apartó por su camino,&#x27;},
{<span class="hljs-symbol">&#x27;timestamp</span><span class="hljs-symbol">&#x27;:</span> (<span class="hljs-name">26.4</span>, <span class="hljs-number">32.48</span>),
<span class="hljs-symbol">&#x27;text</span><span class="hljs-symbol">&#x27;:</span> &#x27; mas Jehová cargó en él el pecado de todos nosotros. No es que partas tu pan con el&#x27;},
{<span class="hljs-symbol">&#x27;timestamp</span><span class="hljs-symbol">&#x27;:</span> (<span class="hljs-name">32.48</span>, <span class="hljs-number">38.4</span>),
<span class="hljs-symbol">&#x27;text</span><span class="hljs-symbol">&#x27;:</span> &#x27; hambriento y a los hombres herrantes metas en casa, que cuando vieres al desnudo lo cubras y no&#x27;},
...<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-7n3xc4">And voila! We have our predicted text as well as corresponding timestamps.</p> <h2 class="relative group"><a id="summary" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#summary"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Summary</span></h2> <p data-svelte-h="svelte-1t35nks">Whisper is a strong pre-trained model for speech recognition and translation. Compared to Wav2Vec2, it has higher
transcription accuracy, with outputs that contain punctuation and casing. It can be used to transcribe speech in English
as well as 96 other languages, both on short audio segments and longer ones through <em>chunking</em>. These attributes make it
a viable model for many speech recognition and translation tasks without the need for fine-tuning. The <code>pipeline()</code> method
provides an easy way of running inference in one-line API calls with control over the generated predictions.</p> <p data-svelte-h="svelte-1eiih7q">While the Whisper model performs extremely well on many high-resource languages, it has lower transcription and translation
accuracy on low-resource languages, i.e. those with less readily available training data. There is also varying performance
across different accents and dialects of certain languages, including lower accuracy for speakers of different genders,
races, ages or other demographic criteria (<em>c.f.</em> <a href="https://arxiv.org/pdf/2212.04356.pdf" rel="nofollow">Whisper paper</a>).</p> <p data-svelte-h="svelte-a24nm6">To boost the performance on low-resource languages, accents or dialects, we can take the pre-trained Whisper model and
train it on a small corpus of appropriately selected data, in a process called <em>fine-tuning</em>. We’ll show that with
as little as ten hours of additional data, we can improve the performance of the Whisper model by over 100% on a low-resource
language. In the next section, we’ll cover the process behind selecting a dataset for fine-tuning.</p> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/audio-transformers-course/blob/main/chapters/en/chapter5/asr_models.mdx" target="_blank"><span data-svelte-h="svelte-1kd6by1">&lt;</span> <span data-svelte-h="svelte-x0xyl0">&gt;</span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p>
<script>
{
__sveltekit_yq3w38 = {
assets: "/docs/audio-course/pr_201/en",
base: "/docs/audio-course/pr_201/en",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/audio-course/pr_201/en/_app/immutable/entry/start.367c4d78.js"),
import("/docs/audio-course/pr_201/en/_app/immutable/entry/app.4c54ebf9.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 28],
data,
form: null,
error: null
});
});
}
</script>

Xet Storage Details

Size:
77.4 kB
·
Xet hash:
cb3cc001aa83193c35af8e04983aae6c01d0300bd1bb515c1aad51138b54651c

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.