| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Introduction to audio data","local":"introduction-to-audio-data","sections":[{"title":"Sampling and sampling rate","local":"sampling-and-sampling-rate","sections":[],"depth":2},{"title":"Amplitude and bit depth","local":"amplitude-and-bit-depth","sections":[],"depth":2},{"title":"Audio as a waveform","local":"audio-as-a-waveform","sections":[],"depth":2},{"title":"The frequency spectrum","local":"the-frequency-spectrum","sections":[],"depth":2},{"title":"Spectrogram","local":"spectrogram","sections":[],"depth":2},{"title":"Mel spectrogram","local":"mel-spectrogram","sections":[],"depth":2}],"depth":1}"> | |
| <link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/EditOnGithub.5a9bb8c5.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Introduction to audio data","local":"introduction-to-audio-data","sections":[{"title":"Sampling and sampling rate","local":"sampling-and-sampling-rate","sections":[],"depth":2},{"title":"Amplitude and bit depth","local":"amplitude-and-bit-depth","sections":[],"depth":2},{"title":"Audio as a waveform","local":"audio-as-a-waveform","sections":[],"depth":2},{"title":"The frequency spectrum","local":"the-frequency-spectrum","sections":[],"depth":2},{"title":"Spectrogram","local":"spectrogram","sections":[],"depth":2},{"title":"Mel spectrogram","local":"mel-spectrogram","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="introduction-to-audio-data" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#introduction-to-audio-data"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Introduction to audio data</span></h1> <p data-svelte-h="svelte-1pmkh95">By nature, a sound wave is a continuous 
signal, meaning it contains an infinite number of signal values in a given time. | |
| This poses problems for digital devices, which expect finite arrays. To be processed, stored, and transmitted by digital | |
| devices, the continuous sound wave needs to be converted into a series of discrete values, known as a digital representation.</p> <p data-svelte-h="svelte-bb9jdf">If you look at any audio dataset, you’ll find digital files with sound excerpts, such as text narration or music. | |
| You may encounter different file formats such as <code>.wav</code> (Waveform Audio File), <code>.flac</code> (Free Lossless Audio Codec) | |
| and <code>.mp3</code> (MPEG-1 Audio Layer 3). These formats mainly differ in how they compress the digital representation of the audio signal.</p> <p data-svelte-h="svelte-1nxlqp8">Let’s take a look at how we get from a continuous signal to this representation. The analog signal is first captured by | |
| a microphone, which converts the sound waves into an electrical signal. The electrical signal is then digitized by an | |
| Analog-to-Digital Converter to get the digital representation through sampling.</p> <h2 class="relative group"><a id="sampling-and-sampling-rate" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#sampling-and-sampling-rate"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Sampling and sampling rate</span></h2> <p data-svelte-h="svelte-ou5bn8">Sampling is the process of measuring the value of a continuous signal at fixed time steps. The sampled waveform is <em>discrete</em>, | |
| since it contains a finite number of signal values at uniform intervals.</p> <div class="flex justify-center" data-svelte-h="svelte-1g809yq"><img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/Signal_Sampling.png" alt="Signal sampling illustration"></div> <p data-svelte-h="svelte-42toji"><em>Illustration from Wikipedia article: <a href="https://en.wikipedia.org/wiki/Sampling_(signal_processing)" rel="nofollow">Sampling (signal processing)</a></em></p> <p data-svelte-h="svelte-1bavgg">The <strong>sampling rate</strong> (also called sampling frequency) is the number of samples taken in one second and is measured in | |
| hertz (Hz). To give you a point of reference, CD-quality audio has a sampling rate of 44,100 Hz, meaning samples are taken | |
| 44,100 times per second. For comparison, high-resolution audio can use a sampling rate of up to 192,000 Hz, or 192 kHz. A common | |
| sampling rate used in training speech models is 16,000 Hz or 16 kHz.</p> <p data-svelte-h="svelte-erenf">The choice of sampling rate primarily determines the highest frequency that can be captured from the signal. This is also | |
| known as the Nyquist limit and is exactly half the sampling rate. Most of the audible frequencies in human speech are below 8 kHz | |
| and therefore sampling speech at 16 kHz is sufficient. Using a higher sampling rate will not capture more information and | |
| merely leads to an increase in the computational cost of processing such files. On the other hand, sampling audio at too | |
| low a sampling rate will result in information loss. Speech sampled at 8 kHz will sound muffled, as the higher frequencies | |
| cannot be captured at this rate.</p> <p data-svelte-h="svelte-vwfzx1">It’s important to ensure that all audio examples in your dataset have the same sampling rate when working on any audio task. | |
| If you plan to use custom audio data to fine-tune a pre-trained model, the sampling rate of your data should match the | |
| sampling rate of the data the model was pre-trained on. The sampling rate determines the time interval between successive | |
| audio samples, which impacts the temporal resolution of the audio data. Consider an example: a 5-second sound at a sampling | |
| rate of 16,000 Hz will be represented as a series of 80,000 values, while the same 5-second sound at a sampling rate of | |
| 8,000 Hz will be represented as a series of 40,000 values. Transformer models that solve audio tasks treat examples as | |
| sequences and rely on attention mechanisms to learn audio or multimodal representations. Since sequences are different for | |
| audio examples at different sampling rates, it will be challenging for models to generalize between sampling rates. | |
| <strong>Resampling</strong> is the process of making the sampling rates match, and is part of <a href="preprocessing#resampling-the-audio-data">preprocessing</a> the audio data.</p> <h2 class="relative group"><a id="amplitude-and-bit-depth" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#amplitude-and-bit-depth"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Amplitude and bit depth</span></h2> <p data-svelte-h="svelte-aeraln">While the sampling rate tells you how often the samples are taken, what exactly are the values in each sample?</p> <p data-svelte-h="svelte-c1topq">Sound is made by changes in air pressure at frequencies that are audible to humans. The <strong>amplitude</strong> of a sound describes | |
| the sound pressure level at any given instant and is measured in decibels (dB). We perceive the amplitude as loudness. | |
| To give you an example, a normal speaking voice is under 60 dB, and a rock concert can be at around 125 dB, pushing the | |
| limits of human hearing.</p> <p data-svelte-h="svelte-17m5t67">In digital audio, each audio sample records the amplitude of the audio wave at a point in time. The <strong>bit depth</strong> of the | |
| sample determines how precisely this amplitude value can be described. The higher the bit depth, the more | |
| faithfully the digital representation approximates the original continuous sound wave.</p> <p data-svelte-h="svelte-r3waof">The most common audio bit depths are 16-bit and 24-bit. Each is a binary term, representing the number of possible steps | |
| to which the amplitude value can be quantized when it’s converted from continuous to discrete: 65,536 steps for 16-bit audio, | |
| a whopping 16,777,216 steps for 24-bit audio. Because quantizing involves rounding off the continuous value to a discrete | |
| value, this step introduces noise. The higher the bit depth, the smaller this quantization noise. In practice, | |
| the quantization noise of 16-bit audio is already small enough to be inaudible, and using higher bit depths is generally | |
| not necessary.</p> <p data-svelte-h="svelte-1mkb3me">You may also come across 32-bit audio. This stores the samples as floating-point values, whereas 16-bit and 24-bit audio | |
| use integer samples. The precision of a 32-bit floating-point value is 24 bits, giving it the same bit depth as 24-bit audio. | |
| Floating-point audio samples are expected to lie within the [-1.0, 1.0] range. Since machine learning models naturally | |
| work on floating-point data, the audio must first be converted into floating-point format before it can be used to train | |
| the model. We’ll see how to do this in the next section on <a href="preprocessing">Preprocessing</a>.</p> <p data-svelte-h="svelte-1vo81sc">Just as with continuous audio signals, the amplitude of digital audio is typically expressed in decibels (dB). Since | |
| human hearing is logarithmic in nature — our ears are more sensitive to small fluctuations in quiet sounds than in loud | |
| sounds — the loudness of a sound is easier to interpret if the amplitudes are in decibels, which are also logarithmic. | |
| The decibel scale for real-world audio starts at 0 dB, which represents the quietest possible sound humans can hear, and | |
| louder sounds have larger values. However, for digital audio signals, 0 dB is the loudest possible amplitude, while all | |
| other amplitudes are negative. As a quick rule of thumb: every -6 dB is a halving of the amplitude, and anything below -60 dB | |
| is generally inaudible unless you really crank up the volume.</p> <h2 class="relative group"><a id="audio-as-a-waveform" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#audio-as-a-waveform"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Audio as a waveform</span></h2> <p data-svelte-h="svelte-1cl8jes">You may have seen sounds visualized as a <strong>waveform</strong>, which plots the sample values over time and illustrates the changes | |
| in the sound’s amplitude. This is also known as the <em>time domain</em> representation of sound.</p> <p data-svelte-h="svelte-1ef6fws">This type of visualization is useful for identifying specific features of the audio signal such as the timing of individual | |
| sound events, the overall loudness of the signal, and any irregularities or noise present in the audio.</p> <p data-svelte-h="svelte-7fbxqn">To plot the waveform for an audio signal, we can use a Python library called <code>librosa</code>:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip install librosa<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-4c289b">Let’s take an example sound called “trumpet” that comes with the library:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" 
focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> librosa | |
| array, sampling_rate = librosa.load(librosa.ex(<span class="hljs-string">"trumpet"</span>))<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1uo32p8">The example is loaded as a tuple of audio time series (here we call it <code>array</code>), and sampling rate (<code>sampling_rate</code>). | |
| Let’s take a look at this sound’s waveform by using librosa’s <code>waveshow()</code> function:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt | |
| <span class="hljs-keyword">import</span> librosa.display | |
| plt.figure().set_figwidth(<span class="hljs-number">12</span>) | |
| librosa.display.waveshow(array, sr=sampling_rate)<!-- HTML_TAG_END --></pre></div> <div class="flex justify-center" data-svelte-h="svelte-1cse5se"><img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/waveform_plot.png" alt="Waveform plot"></div> <p data-svelte-h="svelte-4i4t88">This plots the amplitude of the signal on the y-axis and time along the x-axis. In other words, each point corresponds | |
| to a single sample value that was taken when this sound was sampled. Also note that librosa returns the audio as | |
| floating-point values already, and that the amplitude values are indeed within the [-1.0, 1.0] range.</p> <p data-svelte-h="svelte-15w4tcn">Visualizing the audio along with listening to it can be a useful tool for understanding the data you are working with. | |
| You can see the shape of the signal, observe patterns, and learn to spot noise or distortion. If you preprocess the data in some | |
| way, such as normalization, resampling, or filtering, you can visually confirm that the preprocessing steps have been applied as expected. | |
| After training a model, you can also visualize samples where errors occur (e.g. in an audio classification task) to debug | |
| the issue.</p> <h2 class="relative group"><a id="the-frequency-spectrum" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#the-frequency-spectrum"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>The frequency spectrum</span></h2> <p data-svelte-h="svelte-zbggm7">Another way to visualize audio data is to plot the <strong>frequency spectrum</strong> of an audio signal, also known as the <em>frequency domain</em> | |
| representation. The spectrum is computed using the discrete Fourier transform or DFT. It describes the individual frequencies | |
| that make up the signal and how strong they are.</p> <p data-svelte-h="svelte-pcmpy3">Let’s plot the frequency spectrum for the same trumpet sound by taking the DFT using numpy’s <code>rfft()</code> function. While it | |
| is possible to plot the spectrum of the entire sound, it’s more useful to look at a small region instead. Here we’ll take | |
| the DFT over the first 4096 samples, which is roughly the length of the first note being played:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np | |
| dft_input = array[:<span class="hljs-number">4096</span>] | |
| <span class="hljs-comment"># calculate the DFT</span> | |
| window = np.hanning(<span class="hljs-built_in">len</span>(dft_input)) | |
| windowed_input = dft_input * window | |
| dft = np.fft.rfft(windowed_input) | |
| <span class="hljs-comment"># get the amplitude spectrum in decibels</span> | |
| amplitude = np.<span class="hljs-built_in">abs</span>(dft) | |
| amplitude_db = librosa.amplitude_to_db(amplitude, ref=np.<span class="hljs-built_in">max</span>) | |
| <span class="hljs-comment"># get the frequency bins</span> | |
| frequency = librosa.fft_frequencies(sr=sampling_rate, n_fft=<span class="hljs-built_in">len</span>(dft_input)) | |
| plt.figure().set_figwidth(<span class="hljs-number">12</span>) | |
| plt.plot(frequency, amplitude_db) | |
| plt.xlabel(<span class="hljs-string">"Frequency (Hz)"</span>) | |
| plt.ylabel(<span class="hljs-string">"Amplitude (dB)"</span>) | |
| plt.xscale(<span class="hljs-string">"log"</span>)<!-- HTML_TAG_END --></pre></div> <div class="flex justify-center" data-svelte-h="svelte-1eg4i6m"><img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/spectrum_plot.png" alt="Spectrum plot"></div> <p data-svelte-h="svelte-2gktww">This plots the strength of the various frequency components that are present in this audio segment. The frequency values are on | |
| the x-axis, usually plotted on a logarithmic scale, while their amplitudes are on the y-axis.</p> <p data-svelte-h="svelte-1iiz53k">The frequency spectrum that we plotted shows several peaks. These peaks correspond to the harmonics of the note that’s | |
| being played, with the higher harmonics being quieter. Since the first peak is at around 620 Hz, this is the frequency spectrum of an E♭ note.</p> <p data-svelte-h="svelte-1djmz7i">The output of the DFT is an array of complex numbers, made up of real and imaginary components. Taking | |
| the magnitude with <code>np.abs(dft)</code> extracts the amplitude information from the spectrum. The angle between the real and | |
| imaginary components provides the so-called phase spectrum, but this is often discarded in machine learning applications.</p> <p data-svelte-h="svelte-1bt1xj3">We used <code>librosa.amplitude_to_db()</code> to convert the amplitude values to the decibel scale, making it easier to see | |
| the finer details in the spectrum. Sometimes people use the <strong>power spectrum</strong>, which measures energy rather than amplitude; | |
| this is simply a spectrum with the amplitude values squared.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400">💡 In practice, people use the term FFT interchangeably with DFT, as the FFT or Fast Fourier Transform is the only efficient | |
| way to calculate the DFT on a computer.</div> <p data-svelte-h="svelte-59fha7">The frequency spectrum of an audio signal contains the exact same information as its waveform — they are simply two different | |
| ways of looking at the same data (here, the first 4096 samples from the trumpet sound). Where the waveform plots the amplitude | |
| of the audio signal over time, the spectrum visualizes the amplitudes of the individual frequencies at a fixed point in time.</p> <h2 class="relative group"><a id="spectrogram" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#spectrogram"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Spectrogram</span></h2> <p data-svelte-h="svelte-196s45l">What if we want to see how the frequencies in an audio signal change? The trumpet plays several notes and they all have | |
| different frequencies. The problem is that the spectrum only shows a frozen snapshot of the frequencies at a given instant. | |
| The solution is to take multiple DFTs, each covering only a small slice of time, and stack the resulting spectra together | |
| into a <strong>spectrogram</strong>.</p> <p data-svelte-h="svelte-1l55l7k">A spectrogram plots the frequency content of an audio signal as it changes over time. It allows you to see time, frequency, | |
| and amplitude all on one graph. The algorithm that performs this computation is the STFT or Short-Time Fourier Transform.</p> <p data-svelte-h="svelte-u0c2pj">The spectrogram is one of the most informative audio tools available to you. For example, when working with a music recording, | |
| you can see the various instruments and vocal tracks and how they contribute to the overall sound. In speech, you can | |
| identify different vowel sounds as each vowel is characterized by particular frequencies.</p> <p data-svelte-h="svelte-rcbtf3">Let’s plot a spectrogram for the same trumpet sound, using librosa’s <code>stft()</code> and <code>specshow()</code> functions:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np | |
| D = librosa.stft(array) | |
| S_db = librosa.amplitude_to_db(np.<span class="hljs-built_in">abs</span>(D), ref=np.<span class="hljs-built_in">max</span>) | |
| plt.figure().set_figwidth(<span class="hljs-number">12</span>) | |
| librosa.display.specshow(S_db, x_axis=<span class="hljs-string">"time"</span>, y_axis=<span class="hljs-string">"hz"</span>) | |
| plt.colorbar()<!-- HTML_TAG_END --></pre></div> <div class="flex justify-center" data-svelte-h="svelte-b6kbs6"><img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/spectrogram_plot.png" alt="Spectrogram plot"></div> <p data-svelte-h="svelte-1fap1oe">In this plot, the x-axis represents time as in the waveform visualization but now the y-axis represents frequency in Hz. | |
| The intensity of the color gives the amplitude or power of the frequency component at each point in time, measured in decibels (dB).</p> <p data-svelte-h="svelte-1v7yz7v">The spectrogram is created by taking short segments of the audio signal, typically lasting a few tens of milliseconds, and calculating | |
| the discrete Fourier transform of each segment to obtain its frequency spectrum. The resulting spectra are then stacked | |
| together on the time axis to create the spectrogram. Each vertical slice in this image corresponds to a single frequency | |
| spectrum, seen from the top. By default, <code>librosa.stft()</code> splits the audio signal into segments of 2048 samples, which | |
| gives a good trade-off between frequency resolution and time resolution.</p> <p data-svelte-h="svelte-13z5f9l">Since the spectrogram and the waveform are different views of the same data, it’s possible to turn the spectrogram back | |
| into the original waveform using the inverse STFT. However, this requires the phase information in addition to the amplitude | |
| information. If the spectrogram was generated by a machine learning model, it typically only outputs the amplitudes. In | |
| that case, we can use a phase reconstruction algorithm such as the classic Griffin-Lim algorithm, or a neural network | |
| called a vocoder, to reconstruct a waveform from the spectrogram.</p> <p data-svelte-h="svelte-fc7u1p">Spectrograms aren’t just used for visualization. Many machine learning models will take spectrograms as input — as opposed | |
| to waveforms — and produce spectrograms as output.</p> <p data-svelte-h="svelte-tu7x1">Now that we know what a spectrogram is and how it’s made, let’s take a look at a variant of it widely used for speech processing: the mel spectrogram.</p> <h2 class="relative group"><a id="mel-spectrogram" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#mel-spectrogram"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Mel spectrogram</span></h2> <p data-svelte-h="svelte-naa7dr">A mel spectrogram is a variation of the spectrogram that is commonly used in speech processing and machine learning tasks. | |
| It is similar to a spectrogram in that it shows the frequency content of an audio signal over time, but on a different frequency axis.</p> <p data-svelte-h="svelte-1x7yj9g">In a standard spectrogram, the frequency axis is linear and is measured in hertz (Hz). However, the human auditory system | |
| is more sensitive to changes in lower frequencies than higher frequencies, and this sensitivity decreases logarithmically | |
| as frequency increases. The mel scale is a perceptual scale that approximates the non-linear frequency response of the human ear.</p> <p data-svelte-h="svelte-9z7nz9">To create a mel spectrogram, the STFT is used just like before, splitting the audio into short segments to obtain a sequence | |
| of frequency spectra. Additionally, each spectrum is sent through a set of filters, the so-called mel filterbank, to | |
| transform the frequencies to the mel scale.</p> <p data-svelte-h="svelte-1bkzq64">Let’s see how we can plot a mel spectrogram using librosa’s <code>melspectrogram()</code> function, which performs all of those steps for us:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->S = librosa.feature.melspectrogram(y=array, sr=sampling_rate, n_mels=<span class="hljs-number">128</span>, fmax=<span class="hljs-number">8000</span>) | |
| S_dB = librosa.power_to_db(S, ref=np.<span class="hljs-built_in">max</span>) | |
| plt.figure().set_figwidth(<span class="hljs-number">12</span>) | |
| librosa.display.specshow(S_dB, x_axis=<span class="hljs-string">"time"</span>, y_axis=<span class="hljs-string">"mel"</span>, sr=sampling_rate, fmax=<span class="hljs-number">8000</span>) | |
| plt.colorbar()<!-- HTML_TAG_END --></pre></div> <div class="flex justify-center" data-svelte-h="svelte-1soplef"><img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/mel-spectrogram.png" alt="Mel spectrogram plot"></div> <p data-svelte-h="svelte-otxgpj">In the example above, <code>n_mels</code> stands for the number of mel bands to generate. The mel bands define a set of frequency | |
| ranges that divide the spectrum into perceptually meaningful components, using a set of filters whose shape and spacing | |
| are chosen to mimic the way the human ear responds to different frequencies. Common values for <code>n_mels</code> are 40 or 80. <code>fmax</code> | |
| indicates the highest frequency (in Hz) we care about.</p> <p data-svelte-h="svelte-fposmi">Just as with a regular spectrogram, it’s common practice to express the strength of the mel frequency components in | |
| decibels. This is commonly referred to as a <strong>log-mel spectrogram</strong>, because the conversion to decibels involves a | |
| logarithmic operation. The above example used <code>librosa.power_to_db()</code> as <code>librosa.feature.melspectrogram()</code> creates a power spectrogram.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400">💡 Not all mel spectrograms are the same! There are two different mel scales in common use ("htk" and "slaney"), | |
| and instead of the power spectrogram the amplitude spectrogram may be used. The conversion to a log-mel spectrogram doesn't | |
| always compute true decibels but may simply take the <code>log</code>. Therefore, if a machine learning model expects a mel spectrogram | |
| as input, double check to make sure you're computing it the same way.</div> <p data-svelte-h="svelte-ph55g">Creating a mel spectrogram is a lossy operation as it involves filtering the signal. Converting a mel spectrogram back | |
| into a waveform is more difficult than doing this for a regular spectrogram, as it requires estimating the frequencies | |
| that were thrown away. This is why machine learning models such as the HiFi-GAN vocoder are needed to produce a waveform from a mel | |
| spectrogram.</p> <p data-svelte-h="svelte-lsf6hk">Compared to a standard spectrogram, a mel spectrogram can capture more meaningful features of the audio signal for | |
| human perception, making it a popular choice in tasks such as speech recognition, speaker identification, and music genre classification.</p> <p data-svelte-h="svelte-1xaydvu">Now that you know how to visualize audio data examples, go ahead and try to see what your favorite sounds look like. :)</p> <p></p> | |
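<p>To make the procedure from the spectrogram section concrete (split the signal into short segments, take the Fourier transform of each one, stack the spectra along the time axis), here is a minimal NumPy-only sketch. It is an illustration under stated assumptions, not a reimplementation of <code>librosa.stft()</code>: the helper name <code>manual_spectrogram</code>, the hop of 512 samples, the Hann window, and the synthetic 440 Hz test tone are all choices made for this example, and the sketch skips the padding that librosa may apply at the signal edges.</p>

```python
import numpy as np


def manual_spectrogram(signal, n_fft=2048, hop_length=512):
    """Magnitude spectrogram built by hand: frame, window, FFT, stack.

    Illustrative helper (hypothetical name); no edge padding is applied.
    """
    # Taper each segment so its edges fade out, which reduces spectral leakage
    window = np.hanning(n_fft)

    # Number of complete segments that fit into the signal
    n_frames = 1 + (len(signal) - n_fft) // hop_length

    # Cut the signal into overlapping segments and apply the window to each
    frames = [
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ]

    # One real FFT per segment; keep only the magnitude (phase is discarded)
    spectra = np.abs(np.fft.rfft(np.stack(frames), axis=1))

    # Transpose so that rows are frequency bins and columns are time frames
    return spectra.T


# Assumed test signal: one second of a 440 Hz sine wave sampled at 16 kHz
sampling_rate = 16000
t = np.arange(sampling_rate) / sampling_rate
tone = np.sin(2 * np.pi * 440.0 * t)

S = manual_spectrogram(tone)

# Centre frequency (in Hz) of every row of the spectrogram
freqs = np.fft.rfftfreq(2048, d=1.0 / sampling_rate)

# The strongest row, averaged over time, should sit close to 440 Hz
peak_hz = freqs[S.mean(axis=1).argmax()]
print(S.shape, peak_hz)
```

<p>Each column of <code>S</code> is one frequency spectrum, so the array has the same layout that <code>specshow()</code> expects. Because the magnitude is taken with <code>np.abs</code>, the phase is lost at this step, which is why a waveform cannot be recovered from this array alone without a phase reconstruction method.</p>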
| <script> | |
| { | |
| __sveltekit_yq3w38 = { | |
| assets: "/docs/audio-course/pr_201/en", | |
| base: "/docs/audio-course/pr_201/en", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/audio-course/pr_201/en/_app/immutable/entry/start.367c4d78.js"), | |
| import("/docs/audio-course/pr_201/en/_app/immutable/entry/app.4c54ebf9.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 5], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |