Buckets:

rtrm's picture
download
raw
37.7 kB
<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Preprocessing an audio dataset&quot;,&quot;local&quot;:&quot;preprocessing-an-audio-dataset&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Resampling the audio data&quot;,&quot;local&quot;:&quot;resampling-the-audio-data&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Filtering the dataset&quot;,&quot;local&quot;:&quot;filtering-the-dataset&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Pre-processing audio data&quot;,&quot;local&quot;:&quot;pre-processing-audio-data&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}">
<link href="/docs/audio-course/pr_201/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/entry/start.367c4d78.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/scheduler.f7e1785c.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/singletons.0d70d4cc.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/index.279db187.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/paths.274f629d.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/entry/app.4c54ebf9.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/index.9f8f0838.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/nodes/0.e329f606.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/each.e59479a4.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/nodes/8.be3a7ddb.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/Tip.4575d9cf.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/CodeBlock.b3510e34.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/EditOnGithub.5a9bb8c5.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Preprocessing an audio dataset&quot;,&quot;local&quot;:&quot;preprocessing-an-audio-dataset&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Resampling the audio data&quot;,&quot;local&quot;:&quot;resampling-the-audio-data&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Filtering the dataset&quot;,&quot;local&quot;:&quot;filtering-the-dataset&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Pre-processing audio data&quot;,&quot;local&quot;:&quot;pre-processing-audio-data&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="preprocessing-an-audio-dataset" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#preprocessing-an-audio-dataset"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Preprocessing an audio dataset</span></h1> <p data-svelte-h="svelte-1b3xnqq">Loading a dataset with 🤗 Datasets is just half of the fun. If you plan to use it either for training a model, or for running
inference, you will need to pre-process the data first. In general, this will involve the following steps:</p> <ul data-svelte-h="svelte-1anvx2"><li>Resampling the audio data</li> <li>Filtering the dataset</li> <li>Converting audio data to model’s expected input</li></ul> <h2 class="relative group"><a id="resampling-the-audio-data" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#resampling-the-audio-data"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Resampling the audio data</span></h2> <p data-svelte-h="svelte-47dy14">The <code>load_dataset</code> function downloads audio examples with the sampling rate that they were published with. This is not
always the sampling rate expected by a model you plan to train, or use for inference. If there’s a discrepancy between
the sampling rates, you can resample the audio to the model’s expected sampling rate.</p> <p data-svelte-h="svelte-1c2c4da">Most of the available pretrained models have been pretrained on audio datasets at a sampling rate of 16 kHz.
When we explored MINDS-14 dataset, you may have noticed that it is sampled at 8 kHz, which means we will likely need
to upsample it.</p> <p data-svelte-h="svelte-x7frc7">To do so, use 🤗 Datasets’ <code>cast_column</code> method. This operation does not change the audio in-place, but rather signals
to datasets to resample the audio examples on the fly when they are loaded. The following code will set the sampling
rate to 16kHz:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> Audio
minds = minds.cast_column(<span class="hljs-string">&quot;audio&quot;</span>, Audio(sampling_rate=<span class="hljs-number">16_000</span>))<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-96ugis">Re-load the first audio example in the MINDS-14 dataset, and check that it has been resampled to the desired <code>sampling rate</code>:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->minds[<span class="hljs-number">0</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{
<span class="hljs-comment">&quot;path&quot;</span>: <span class="hljs-comment">&quot;/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-AU~PAY_BILL/response_4.wav&quot;</span>,
<span class="hljs-comment">&quot;audio&quot;</span>: {
<span class="hljs-comment">&quot;path&quot;</span>: <span class="hljs-comment">&quot;/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-AU~PAY_BILL/response_4.wav&quot;</span>,
<span class="hljs-comment">&quot;array&quot;</span>: array(
[
<span class="hljs-number">2.0634243e-05</span>,
<span class="hljs-number">1.9437837e-04</span>,
<span class="hljs-number">2.2419340e-04</span>,
...,
<span class="hljs-number">9.3852862e-04</span>,
<span class="hljs-number">1.1302452e-03</span>,
<span class="hljs-number">7.1531429e-04</span>,
],
dtype=float32,
),
<span class="hljs-comment">&quot;sampling_rate&quot;</span>: <span class="hljs-number">16000</span>,
},
<span class="hljs-comment">&quot;transcription&quot;</span>: <span class="hljs-comment">&quot;I would like to pay my electricity bill using my card can you please assist&quot;</span>,
<span class="hljs-comment">&quot;intent_class&quot;</span>: <span class="hljs-number">13</span>,
}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-hnb1k">You may notice that the array values are now also different. This is because we’ve now got twice the number of amplitude values for
every one that we had before.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400">💡 Some background on resampling: If an audio signal has been sampled at 8 kHz, so that it has 8000 sample readings per
second, we know that the audio does not contain any frequencies over 4 kHz. This is guaranteed by the Nyquist sampling
theorem. Because of this, we can be certain that in between the sampling points the original continuous signal always
makes a smooth curve. Upsampling to a higher sampling rate is then a matter of calculating additional sample values that go in between
the existing ones, by approximating this curve. Downsampling, however, requires that we first filter out any frequencies
that would be higher than the new Nyquist limit, before estimating the new sample points. In other words, you can&#39;t
downsample by a factor 2x by simply throwing away every other sample — this will create distortions in the signal called
aliases. Doing resampling correctly is tricky and best left to well-tested libraries such as librosa or 🤗 Datasets.</div> <h2 class="relative group"><a id="filtering-the-dataset" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#filtering-the-dataset"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Filtering the dataset</span></h2> <p data-svelte-h="svelte-s6y06a">You may need to filter the data based on some criteria. One of the common cases involves limiting the audio examples to a
certain duration. For instance, we might want to filter out any examples longer than 20s to prevent out-of-memory errors
when training a model.</p> <p data-svelte-h="svelte-1szy3nk">We can do this by using the 🤗 Datasets’ <code>filter</code> method and passing a function with filtering logic to it. Let’s start by writing a
function that indicates which examples to keep and which to discard. This function, <code>is_audio_length_in_range</code>,
returns <code>True</code> if a sample is shorter than 20s, and <code>False</code> if it is longer than 20s.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->MAX_DURATION_IN_SECONDS = <span class="hljs-number">20.0</span>
<span class="hljs-keyword">def</span> <span class="hljs-title function_">is_audio_length_in_range</span>(<span class="hljs-params">input_length</span>):
<span class="hljs-keyword">return</span> input_length &lt; MAX_DURATION_IN_SECONDS<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1btp9dp">The filtering function can be applied to a dataset’s column but we do not have a column with audio track duration in this
dataset. However, we can create one, filter based on the values in that column, and then remove it.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-comment"># use librosa to get example&#x27;s duration from the audio file</span>
new_column = [librosa.get_duration(path=x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> minds[<span class="hljs-string">&quot;path&quot;</span>]]
minds = minds.add_column(<span class="hljs-string">&quot;duration&quot;</span>, new_column)
<span class="hljs-comment"># use 🤗 Datasets&#x27; `filter` method to apply the filtering function</span>
minds = minds.<span class="hljs-built_in">filter</span>(is_audio_length_in_range, input_columns=[<span class="hljs-string">&quot;duration&quot;</span>])
<span class="hljs-comment"># remove the temporary helper column</span>
minds = minds.remove_columns([<span class="hljs-string">&quot;duration&quot;</span>])
minds<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-constructor">Dataset({<span class="hljs-params">features</span>: [<span class="hljs-string">&quot;path&quot;</span>, <span class="hljs-string">&quot;audio&quot;</span>, <span class="hljs-string">&quot;transcription&quot;</span>, <span class="hljs-string">&quot;intent_class&quot;</span>], <span class="hljs-params">num_rows</span>: 624})</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-ipsrxo">We can verify that dataset has been filtered down from 654 examples to 624.</p> <h2 class="relative group"><a id="pre-processing-audio-data" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#pre-processing-audio-data"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Pre-processing audio data</span></h2> <p data-svelte-h="svelte-jbw2e0">One of the most challenging aspects of working with audio datasets is preparing the data in the right format for model
training. As you saw, the raw audio data comes as an array of sample values. However, pre-trained models, whether you use them
for inference, or want to fine-tune them for your task, expect the raw data to be converted into input features. The
requirements for the input features may vary from one model to another — they depend on the model’s architecture, and the data it was
pre-trained with. The good news is, for every supported audio model, 🤗 Transformers offer a feature extractor class
that can convert raw audio data into the input features the model expects.</p> <p data-svelte-h="svelte-1jndbwe">So what does a feature extractor do with the raw audio data? Let’s take a look at <a href="https://huggingface.co/papers/2212.04356" rel="nofollow">Whisper</a>’s
feature extractor to understand some common feature extraction transformations. Whisper is a pre-trained model for
automatic speech recognition (ASR) published in September 2022 by Alec Radford et al. from OpenAI.</p> <p data-svelte-h="svelte-1y8nv8f">First, the Whisper feature extractor pads/truncates a batch of audio examples such that all
examples have an input length of 30s. Examples shorter than this are padded to 30s by appending zeros to the end of the
sequence (zeros in an audio signal correspond to no signal or silence). Examples longer than 30s are truncated to 30s.
Since all elements in the batch are padded/truncated to a maximum length in the input space, there is no need for an attention
mask. Whisper is unique in this regard, most other audio models require an attention mask that details
where sequences have been padded, and thus where they should be ignored in the self-attention mechanism. Whisper is
trained to operate without an attention mask and infer directly from the speech signals where to ignore the inputs.</p> <p data-svelte-h="svelte-1ys9rww">The second operation that the Whisper feature extractor performs is converting the padded audio arrays to log-mel spectrograms.
As you recall, these spectrograms describe how the frequencies of a signal change over time, expressed on the mel scale
and measured in decibels (the log part) to make the frequencies and amplitudes more representative of human hearing.</p> <p data-svelte-h="svelte-n9ngno">All these transformations can be applied to your raw audio data with a couple of lines of code. Let’s go ahead and load
the feature extractor from the pre-trained Whisper checkpoint to have ready for our audio data:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained(<span class="hljs-string">&quot;openai/whisper-small&quot;</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-kd1gw8">Next, you can write a function to pre-process a single audio example by passing it through the <code>feature_extractor</code>.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">prepare_dataset</span>(<span class="hljs-params">example</span>):
audio = example[<span class="hljs-string">&quot;audio&quot;</span>]
features = feature_extractor(
audio[<span class="hljs-string">&quot;array&quot;</span>], sampling_rate=audio[<span class="hljs-string">&quot;sampling_rate&quot;</span>], padding=<span class="hljs-literal">True</span>
)
<span class="hljs-keyword">return</span> features<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-6ixft2">We can apply the data preparation function to all of our training examples using 🤗 Datasets’ map method:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->minds = minds.<span class="hljs-built_in">map</span>(prepare_dataset)
minds<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mvdyro"><strong>Output:</strong></p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->Dataset(
<span class="hljs-punctuation">{</span>
<span class="hljs-symbol"> features:</span> [<span class="hljs-string">&quot;path&quot;</span>, <span class="hljs-string">&quot;audio&quot;</span>, <span class="hljs-string">&quot;transcription&quot;</span>, <span class="hljs-string">&quot;intent_class&quot;</span>, <span class="hljs-string">&quot;input_features&quot;</span>],
<span class="hljs-symbol"> num_rows:</span> <span class="hljs-number">624</span>,
<span class="hljs-punctuation">}</span>
)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-14djzfz">As easy as that, we now have log-mel spectrograms as <code>input_features</code> in the dataset.</p> <p data-svelte-h="svelte-1478ups">Let’s visualize it for one of the examples in the <code>minds</code> dataset:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
example = minds[<span class="hljs-number">0</span>]
input_features = example[<span class="hljs-string">&quot;input_features&quot;</span>]
plt.figure().set_figwidth(<span class="hljs-number">12</span>)
librosa.display.specshow(
np.asarray(input_features[<span class="hljs-number">0</span>]),
x_axis=<span class="hljs-string">&quot;time&quot;</span>,
y_axis=<span class="hljs-string">&quot;mel&quot;</span>,
sr=feature_extractor.sampling_rate,
hop_length=feature_extractor.hop_length,
)
plt.colorbar()<!-- HTML_TAG_END --></pre></div> <div class="flex justify-center" data-svelte-h="svelte-csckl"><img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/log_mel_whisper.png" alt="Log mel spectrogram plot"></div> <p data-svelte-h="svelte-pj9vlt">Now you can see what the audio input to the Whisper model looks like after preprocessing.</p> <p data-svelte-h="svelte-1x1eerf">The model’s feature extractor class takes care of transforming raw audio data to the format that the model expects. However,
many tasks involving audio are multimodal, e.g. speech recognition. In such cases 🤗 Transformers also offer model-specific
tokenizers to process the text inputs. For a deep dive into tokenizers, please refer to our <a href="https://huggingface.co/course/chapter2/4" rel="nofollow">NLP course</a>.</p> <p data-svelte-h="svelte-ngi3x1">You can load the feature extractor and tokenizer for Whisper and other multimodal models separately, or you can load both via
a so-called processor. To make things even simpler, use <code>AutoProcessor</code> to load a model’s feature extractor and processor from a
checkpoint, like this:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoProcessor
processor = AutoProcessor.from_pretrained(<span class="hljs-string">&quot;openai/whisper-small&quot;</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-gvmmsi">Here we have illustrated the fundamental data preparation steps. Of course, custom data may require more complex preprocessing.
In this case, you can extend the function <code>prepare_dataset</code> to perform any sort of custom data transformations. With 🤗 Datasets,
if you can write it as a Python function, you can <a href="https://huggingface.co/docs/datasets/audio_process" rel="nofollow">apply it</a> to your dataset!</p> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/audio-transformers-course/blob/main/chapters/en/chapter1/preprocessing.mdx" target="_blank"><span data-svelte-h="svelte-1kd6by1">&lt;</span> <span data-svelte-h="svelte-x0xyl0">&gt;</span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p>
<script>
{
__sveltekit_yq3w38 = {
assets: "/docs/audio-course/pr_201/en",
base: "/docs/audio-course/pr_201/en",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/audio-course/pr_201/en/_app/immutable/entry/start.367c4d78.js"),
import("/docs/audio-course/pr_201/en/_app/immutable/entry/app.4c54ebf9.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 8],
data,
form: null,
error: null
});
});
}
</script>

Xet Storage Details

Size:
37.7 kB
·
Xet hash:
fbca38f5de3b8319008dea1e60b1de607f8ee1c3a71fb06b26c44dd58e50d34a

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.