Buckets:

hf-doc-build
/

doc-dev

Files

xet

hf-doc-build/doc-dev / audio-course /pr_231 /en /chapter1 /audio_data.md

rtrm

about 2 months ago

preview code

download

raw

17.7 kB

	# Introduction to audio data

	By nature, a sound wave is a continuous signal, meaning it contains an infinite number of signal values in a given time.
	This poses problems for digital devices which expect finite arrays. To be processed, stored, and transmitted by digital
	devices, the continuous sound wave needs to be converted into a series of discrete values, known as a digital representation.

	If you look at any audio dataset, you'll find digital files with sound excerpts, such as text narration or music.
	You may encounter different file formats such as `.wav` (Waveform Audio File), `.flac` (Free Lossless Audio Codec)
	and `.mp3` (MPEG-1 Audio Layer 3). These formats mainly differ in how they compress the digital representation of the audio signal.

	Let's take a look at how we arrive from a continuous signal to this representation. The analog signal is first captured by
	a microphone, which converts the sound waves into an electrical signal. The electrical signal is then digitized by an
	Analog-to-Digital Converter to get the digital representation through sampling.

	## Sampling and sampling rate

	Sampling is the process of measuring the value of a continuous signal at fixed time steps. The sampled waveform is _discrete_,
	since it contains a finite number of signal values at uniform intervals.



	Illustration from Wikipedia article: [Sampling (signal processing)](https://en.wikipedia.org/wiki/Sampling_(signal_processing))

	The sampling rate (also called sampling frequency) is the number of samples taken in one second and is measured in
	hertz (Hz). To give you a point of reference, CD-quality audio has a sampling rate of 44,100 Hz, meaning samples are taken
	44,100 times per second. For comparison, high-resolution audio has a sampling rate of 192,000 Hz or 192 kHz. A common
	sampling rate used in training speech models is 16,000 Hz or 16 kHz.

	The choice of sampling rate primarily determines the highest frequency that can be captured from the signal. This is also
	known as the Nyquist limit and is exactly half the sampling rate. The audible frequencies in human speech are below 8 kHz
	and therefore sampling speech at 16 kHz is sufficient. Using a higher sampling rate will not capture more information and
	merely leads to an increase in the computational cost of processing such files. On the other hand, sampling audio at too
	low a sampling rate will result in information loss. Speech sampled at 8 kHz will sound muffled, as the higher frequencies
	cannot be captured at this rate.

	It's important to ensure that all audio examples in your dataset have the same sampling rate when working on any audio task.
	If you plan to use custom audio data to fine-tune a pre-trained model, the sampling rate of your data should match the
	sampling rate of the data the model was pre-trained on. The sampling rate determines the time interval between successive
	audio samples, which impacts the temporal resolution of the audio data. Consider an example: a 5-second sound at a sampling
	rate of 16,000 Hz will be represented as a series of 80,000 values, while the same 5-second sound at a sampling rate of
	8,000 Hz will be represented as a series of 40,000 values. Transformer models that solve audio tasks treat examples as
	sequences and rely on attention mechanisms to learn audio or multimodal representation. Since sequences are different for
	audio examples at different sampling rates, it will be challenging for models to generalize between sampling rates.
	Resampling is the process of making the sampling rates match, and is part of [preprocessing](preprocessing#resampling-the-audio-data) the audio data.

	## Amplitude and bit depth

	While the sampling rate tells you how often the samples are taken, what exactly are the values in each sample?

	Sound is made by changes in air pressure at frequencies that are audible to humans. The amplitude of a sound describes
	the sound pressure level at any given instant and is measured in decibels (dB). We perceive the amplitude as loudness.
	To give you an example, a normal speaking voice is under 60 dB, and a rock concert can be at around 125 dB, pushing the
	limits of human hearing.

	In digital audio, each audio sample records the amplitude of the audio wave at a point in time. The bit depth of the
	sample determines with how much precision this amplitude value can be described. The higher the bit depth, the more
	faithfully the digital representation approximates the original continuous sound wave.

	The most common audio bit depths are 16-bit and 24-bit. Each is a binary term, representing the number of possible steps
	to which the amplitude value can be quantized when it's converted from continuous to discrete: 65,536 steps for 16-bit audio,
	a whopping 16,777,216 steps for 24-bit audio. Because quantizing involves rounding off the continuous value to a discrete
	value, the sampling process introduces noise. The higher the bit depth, the smaller this quantization noise. In practice,
	the quantization noise of 16-bit audio is already small enough to be inaudible, and using higher bit depths is generally
	not necessary.

	You may also come across 32-bit audio. This stores the samples as floating-point values, whereas 16-bit and 24-bit audio
	use integer samples. The precision of a 32-bit floating-point value is 24 bits, giving it the same bit depth as 24-bit audio.
	Floating-point audio samples are expected to lie within the [-1.0, 1.0] range. Since machine learning models naturally
	work on floating-point data, the audio must first be converted into floating-point format before it can be used to train
	the model. We'll see how to do this in the next section on [Preprocessing](preprocessing).

	Just as with continuous audio signals, the amplitude of digital audio is typically expressed in decibels (dB). Since
	human hearing is logarithmic in nature — our ears are more sensitive to small fluctuations in quiet sounds than in loud
	sounds — the loudness of a sound is easier to interpret if the amplitudes are in decibels, which are also logarithmic.
	The decibel scale for real-world audio starts at 0 dB, which represents the quietest possible sound humans can hear, and
	louder sounds have larger values. However, for digital audio signals, 0 dB is the loudest possible amplitude, while all
	other amplitudes are negative. As a quick rule of thumb: every -6 dB is a halving of the amplitude, and anything below -60 dB
	is generally inaudible unless you really crank up the volume.

	## Audio as a waveform

	You may have seen sounds visualized as a waveform, which plots the sample values over time and illustrates the changes
	in the sound's amplitude. This is also known as the time domain representation of sound.

	This type of visualization is useful for identifying specific features of the audio signal such as the timing of individual
	sound events, the overall loudness of the signal, and any irregularities or noise present in the audio.

	To plot the waveform for an audio signal, we can use a Python library called `librosa`:

	```bash
	pip install librosa
	```

	Let's take an example sound called "trumpet" that comes with the library:

	```py
	import librosa

	array, sampling_rate = librosa.load(librosa.ex("trumpet"))
	```

	The example is loaded as a tuple of audio time series (here we call it `array`), and sampling rate (`sampling_rate`).
	Let's take a look at this sound's waveform by using librosa's `waveshow()` function:

	```py
	import matplotlib.pyplot as plt
	import librosa.display

	plt.figure().set_figwidth(12)
	librosa.display.waveshow(array, sr=sampling_rate)
	```



	This plots the amplitude of the signal on the y-axis and time along the x-axis. In other words, each point corresponds
	to a single sample value that was taken when this sound was sampled. Also note that librosa returns the audio as
	floating-point values already, and that the amplitude values are indeed within the [-1.0, 1.0] range.

	Visualizing the audio along with listening to it can be a useful tool for understanding the data you are working with.
	You can see the shape of the signal, observe patterns, learn to spot noise or distortion. If you preprocess data in some
	ways, such as normalization, resampling, or filtering, you can visually confirm that preprocessing steps have been applied as expected.
	After training a model, you can also visualize samples where errors occur (e.g. in audio classification task) to debug
	the issue.

	## The frequency spectrum

	Another way to visualize audio data is to plot the frequency spectrum of an audio signal, also known as the frequency domain
	representation. The spectrum is computed using the discrete Fourier transform or DFT. It describes the individual frequencies
	that make up the signal and how strong they are.

	Let's plot the frequency spectrum for the same trumpet sound by taking the DFT using numpy's `rfft()` function. While it
	is possible to plot the spectrum of the entire sound, it's more useful to look at a small region instead. Here we'll take
	the DFT over the first 4096 samples, which is roughly the length of the first note being played:

	```py
	import numpy as np

	dft_input = array[:4096]

	# calculate the DFT
	window = np.hanning(len(dft_input))
	windowed_input = dft_input * window
	dft = np.fft.rfft(windowed_input)

	# get the amplitude spectrum in decibels
	amplitude = np.abs(dft)
	amplitude_db = librosa.amplitude_to_db(amplitude, ref=np.max)

	# get the frequency bins
	frequency = librosa.fft_frequencies(sr=sampling_rate, n_fft=len(dft_input))

	plt.figure().set_figwidth(12)
	plt.plot(frequency, amplitude_db)
	plt.xlabel("Frequency (Hz)")
	plt.ylabel("Amplitude (dB)")
	plt.xscale("log")
	```



	This plots the strength of the various frequency components that are present in this audio segment. The frequency values are on
	the x-axis, usually plotted on a logarithmic scale, while their amplitudes are on the y-axis.

	The frequency spectrum that we plotted shows several peaks. These peaks correspond to the harmonics of the note that's
	being played, with the higher harmonics being quieter. Since the first peak is at around 620 Hz, this is the frequency spectrum of an E♭ note.

	The output of the DFT is an array of complex numbers, made up of real and imaginary components. Taking
	the magnitude with `np.abs(dft)` extracts the amplitude information from the spectrogram. The angle between the real and
	imaginary components provides the so-called phase spectrum, but this is often discarded in machine learning applications.

	You used `librosa.amplitude_to_db()` to convert the amplitude values to the decibel scale, making it easier to see
	the finer details in the spectrum. Sometimes people use the power spectrum, which measures energy rather than amplitude;
	this is simply a spectrum with the amplitude values squared.

	💡 In practice, people use the term FFT interchangeably with DFT, as the FFT or Fast Fourier Transform is the only efficient
	way to calculate the DFT on a computer.

	The frequency spectrum of an audio signal contains the exact same information as its waveform — they are simply two different
	ways of looking at the same data (here, the first 4096 samples from the trumpet sound). Where the waveform plots the amplitude
	of the audio signal over time, the spectrum visualizes the amplitudes of the individual frequencies at a fixed point in time.

	## Spectrogram

	What if we want to see how the frequencies in an audio signal change? The trumpet plays several notes and they all have
	different frequencies. The problem is that the spectrum only shows a frozen snapshot of the frequencies at a given instant.
	The solution is to take multiple DFTs, each covering only a small slice of time, and stack the resulting spectra together
	into a spectrogram.

	A spectrogram plots the frequency content of an audio signal as it changes over time. It allows you to see time, frequency,
	and amplitude all on one graph. The algorithm that performs this computation is the STFT or Short Time Fourier Transform.

	The spectrogram is one of the most informative audio tools available to you. For example, when working with a music recording,
	you can see the various instruments and vocal tracks and how they contribute to the overall sound. In speech, you can
	identify different vowel sounds as each vowel is characterized by particular frequencies.

	Let's plot a spectrogram for the same trumpet sound, using librosa's `stft()` and `specshow()` functions:

	```py
	import numpy as np

	D = librosa.stft(array)
	S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

	plt.figure().set_figwidth(12)
	librosa.display.specshow(S_db, x_axis="time", y_axis="hz")
	plt.colorbar()
	```



	In this plot, the x-axis represents time as in the waveform visualization but now the y-axis represents frequency in Hz.
	The intensity of the color gives the amplitude or power of the frequency component at each point in time, measured in decibels (dB).

	The spectrogram is created by taking short segments of the audio signal, typically lasting a few milliseconds, and calculating
	the discrete Fourier transform of each segment to obtain its frequency spectrum. The resulting spectra are then stacked
	together on the time axis to create the spectrogram. Each vertical slice in this image corresponds to a single frequency
	spectrum, seen from the top. By default, `librosa.stft()` splits the audio signal into segments of 2048 samples, which
	gives a good trade-off between frequency resolution and time resolution.

	Since the spectrogram and the waveform are different views of the same data, it's possible to turn the spectrogram back
	into the original waveform using the inverse STFT. However, this requires the phase information in addition to the amplitude
	information. If the spectrogram was generated by a machine learning model, it typically only outputs the amplitudes. In
	that case, we can use a phase reconstruction algorithm such as the classic Griffin-Lim algorithm, or using a neural network
	called a vocoder, to reconstruct a waveform from the spectrogram.

	Spectrograms aren't just used for visualization. Many machine learning models will take spectrograms as input — as opposed
	to waveforms — and produce spectrograms as output.

	Now that we know what a spectrogram is and how it's made, let's take a look at a variant of it widely used for speech processing: the mel spectrogram.

	## Mel spectrogram

	A mel spectrogram is a variation of the spectrogram that is commonly used in speech processing and machine learning tasks.
	It is similar to a spectrogram in that it shows the frequency content of an audio signal over time, but on a different frequency axis.

	In a standard spectrogram, the frequency axis is linear and is measured in hertz (Hz). However, the human auditory system
	is more sensitive to changes in lower frequencies than higher frequencies, and this sensitivity decreases logarithmically
	as frequency increases. The mel scale is a perceptual scale that approximates the non-linear frequency response of the human ear.

	To create a mel spectrogram, the STFT is used just like before, splitting the audio into short segments to obtain a sequence
	of frequency spectra. Additionally, each spectrum is sent through a set of filters, the so-called mel filterbank, to
	transform the frequencies to the mel scale.

	Let's see how we can plot a mel spectrogram using librosa's `melspectrogram()` function, which performs all of those steps for us:

	```py
	S = librosa.feature.melspectrogram(y=array, sr=sampling_rate, n_mels=128, fmax=8000)
	S_dB = librosa.power_to_db(S, ref=np.max)

	plt.figure().set_figwidth(12)
	librosa.display.specshow(S_dB, x_axis="time", y_axis="mel", sr=sampling_rate, fmax=8000)
	plt.colorbar()
	```



	In the example above, `n_mels` stands for the number of mel bands to generate. The mel bands define a set of frequency
	ranges that divide the spectrum into perceptually meaningful components, using a set of filters whose shape and spacing
	are chosen to mimic the way the human ear responds to different frequencies. Common values for `n_mels` are 40 or 80. `fmax`
	indicates the highest frequency (in Hz) we care about.

	Just as with a regular spectrogram, it's common practice to express the strength of the mel frequency components in
	decibels. This is commonly referred to as a log-mel spectrogram, because the conversion to decibels involves a
	logarithmic operation. The above example used `librosa.power_to_db()` as `librosa.feature.melspectrogram()` creates a power spectrogram.

	💡 Not all mel spectrograms are the same! There are two different mel scales in common use ("htk" and "slaney"),
	and instead of the power spectrogram the amplitude spectrogram may be used. The conversion to a log-mel spectrogram doesn't
	always compute true decibels but may simply take the `log`. Therefore, if a machine learning model expects a mel spectrogram
	as input, double check to make sure you're computing it the same way.

	Creating a mel spectrogram is a lossy operation as it involves filtering the signal. Converting a mel spectrogram back
	into a waveform is more difficult than doing this for a regular spectrogram, as it requires estimating the frequencies
	that were thrown away. This is why machine learning models such as HiFiGAN vocoder are needed to produce a waveform from a mel
	spectrogram.

	Compared to a standard spectrogram, a mel spectrogram can capture more meaningful features of the audio signal for
	human perception, making it a popular choice in tasks such as speech recognition, speaker identification, and music genre classification.

	Now that you know how to visualize audio data examples, go ahead and try to see what your favorite sounds look like. :)

Xet Storage Details

Size:: 17.7 kB
Xet hash:: c4a898d4755f1dc51ba8af31a6997cf70df40fbe3f481bfe38d6b925de2cc637

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.