	# Preprocessing an audio dataset

	Loading a dataset with 🤗 Datasets is just half of the fun. If you plan to use it either for training a model, or for running
	inference, you will need to pre-process the data first. In general, this will involve the following steps:

	* Resampling the audio data
	* Filtering the dataset
	* Converting audio data to the model's expected input

	## Resampling the audio data

	The `load_dataset` function downloads audio examples with the sampling rate that they were published with. This is not
	always the sampling rate expected by a model you plan to train, or use for inference. If there's a discrepancy between
	the sampling rates, you can resample the audio to the model's expected sampling rate.

	Most of the available pretrained models have been pretrained on audio datasets at a sampling rate of 16 kHz.
	When we explored the MINDS-14 dataset, you may have noticed that it is sampled at 8 kHz, which means we will likely need
	to upsample it.

	To do so, use 🤗 Datasets' `cast_column` method. This operation does not change the audio in place; instead, it signals
	to 🤗 Datasets to resample the audio examples on the fly when they are loaded. The following code sets the sampling
	rate to 16 kHz:

	```py
	from datasets import Audio

	minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
	```

	Re-load the first audio example in the MINDS-14 dataset, and check that it has been resampled to the desired sampling rate:

	```py
	minds[0]
	```

	Output:
	```out
	{
	    "path": "/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-AU~PAY_BILL/response_4.wav",
	    "audio": {
	        "path": "/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-AU~PAY_BILL/response_4.wav",
	        "array": array(
	            [
	                2.0634243e-05,
	                1.9437837e-04,
	                2.2419340e-04,
	                ...,
	                9.3852862e-04,
	                1.1302452e-03,
	                7.1531429e-04,
	            ],
	            dtype=float32,
	        ),
	        "sampling_rate": 16000,
	    },
	    "transcription": "I would like to pay my electricity bill using my card can you please assist",
	    "intent_class": 13,
	}
	```

	You may notice that the array values are now also different. This is because we now have twice as many amplitude values
	as before: two samples for every one in the 8 kHz original.
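
To make the doubling concrete, here is a minimal pure-Python sketch that upsamples a short signal by a factor of 2 using linear interpolation. The helper `upsample_2x_linear` and the amplitude values are hypothetical, and the method is deliberately crude: well-tested resamplers use more careful interpolation and will generally change the original sample values as well. The sketch only illustrates that the sample count doubles while the duration stays the same.

```python
def upsample_2x_linear(samples):
    """Crude 2x upsampling: insert the midpoint between neighbouring samples."""
    upsampled = []
    for i, value in enumerate(samples):
        upsampled.append(value)
        # Use the next sample if there is one; otherwise hold the last value.
        next_value = samples[i + 1] if i + 1 < len(samples) else value
        upsampled.append((value + next_value) / 2)
    return upsampled


original_rate = 8_000
original = [0.0, 0.5, 1.0, 0.5]  # made-up amplitude values
resampled = upsample_2x_linear(original)
resampled_rate = 2 * original_rate

# Twice as many samples, same duration in seconds.
print(len(original), len(resampled))
print(len(original) / original_rate == len(resampled) / resampled_rate)
```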

	💡 Some background on resampling: If an audio signal has been sampled at 8 kHz, so that it has 8000 sample readings per
	second, we know that the audio does not contain any frequencies over 4 kHz. This is guaranteed by the Nyquist sampling
	theorem. Because of this, we can be certain that in between the sampling points the original continuous signal always
	makes a smooth curve. Upsampling to a higher sampling rate is then a matter of calculating additional sample values that go in between
	the existing ones, by approximating this curve. Downsampling, however, requires that we first filter out any frequencies
	that would be higher than the new Nyquist limit, before estimating the new sample points. In other words, you can't
	downsample by a factor of 2 simply by throwing away every other sample: this creates distortions in the signal called
	aliases. Resampling correctly is tricky and best left to well-tested libraries such as librosa or 🤗 Datasets.
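
The aliasing effect described above can be reproduced with a few lines of pure Python. This is an illustrative sketch with made-up parameters, not how any library resamples: a 3 kHz tone sampled at 8 kHz is naively downsampled to 4 kHz by dropping every other sample. The new Nyquist limit is 2 kHz, so the 3 kHz tone cannot be represented and shows up as a 1 kHz tone (with inverted sign) instead.

```python
import math


def sine(freq_hz, sampling_rate, num_samples):
    """Sample a unit-amplitude sine wave of the given frequency."""
    return [
        math.sin(2 * math.pi * freq_hz * n / sampling_rate)
        for n in range(num_samples)
    ]


# A 3 kHz tone sampled at 8 kHz (below the 4 kHz Nyquist limit, so it is valid).
tone_3k_at_8k = sine(3_000, 8_000, 64)

# Naive downsampling to 4 kHz: keep every other sample, with no filtering first.
naive_4k = tone_3k_at_8k[::2]

# A genuine 1 kHz tone sampled at 4 kHz, for comparison.
tone_1k_at_4k = sine(1_000, 4_000, 32)

# The naive result matches the 1 kHz tone with flipped sign: an alias.
max_diff = max(abs(a + b) for a, b in zip(naive_4k, tone_1k_at_4k))
print(max_diff < 1e-9)
```

A proper downsampler would first low-pass filter the signal, which removes the 3 kHz tone rather than turning it into a spurious 1 kHz one.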

	## Filtering the dataset

	You may need to filter the data based on some criteria. One of the common cases involves limiting the audio examples to a
	certain duration. For instance, we might want to filter out any examples longer than 20s to prevent out-of-memory errors
	when training a model.

	We can do this by using 🤗 Datasets' `filter` method and passing it a function that contains the filtering logic. Let's start by writing a
	function that indicates which examples to keep and which to discard. This function, `is_audio_length_in_range`,
	returns `True` if a sample is shorter than 20s, and `False` otherwise.

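	```py
	MAX_DURATION_IN_SECONDS = 20.0


	def is_audio_length_in_range(input_length):
	    # Keep an example only if its duration (in seconds) is below the limit
	    return input_length < MAX_DURATION_IN_SECONDS
	```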
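
As a rough, self-contained sketch of how such a predicate behaves, the snippet below applies it to a few made-up examples. The helper `duration_in_seconds`, the example names, and the sample counts are assumptions for illustration only; plain dictionaries stand in for dataset rows.

```python
MAX_DURATION_IN_SECONDS = 20.0


def is_audio_length_in_range(input_length):
    # Keep an example only if it is shorter than the limit.
    return input_length < MAX_DURATION_IN_SECONDS


def duration_in_seconds(num_samples, sampling_rate):
    # Duration is the number of samples divided by samples per second.
    return num_samples / sampling_rate


# Made-up stand-ins for dataset rows (hypothetical names and lengths).
examples = [
    {"name": "clip_a", "num_samples": 80_000, "sampling_rate": 16_000},   # 5 s
    {"name": "clip_b", "num_samples": 400_000, "sampling_rate": 16_000},  # 25 s
    {"name": "clip_c", "num_samples": 320_000, "sampling_rate": 16_000},  # 20 s
]

kept = [
    ex["name"]
    for ex in examples
    if is_audio_length_in_range(
        duration_in_seconds(ex["num_samples"], ex["sampling_rate"])
    )
]
print(kept)
```

Note that an example of exactly 20 s is discarded, because the comparison is strict. With a real dataset, the same predicate could be passed to `filter`, for instance through its `input_columns` argument, once each row has a column holding the duration.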

	Now you can see what the audio input to the Whisper model looks like after preprocessing.

	The model's feature extractor class takes care of transforming raw audio data to the format that the model expects. However,
	many tasks involving audio are multimodal, e.g. speech recognition. In such cases 🤗 Transformers also offers model-specific
	tokenizers to process the text inputs. For a deep dive into tokenizers, please refer to our [NLP course](https://huggingface.co/course/chapter2/4).

	You can load the feature extractor and tokenizer for Whisper and other multimodal models separately, or you can load both via
	a so-called processor. To make things even simpler, use `AutoProcessor` to load a model's feature extractor and tokenizer from a
	checkpoint, like this:

	```py
	from transformers import AutoProcessor

	processor = AutoProcessor.from_pretrained("openai/whisper-small")
	```
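
To illustrate the idea of a processor as a single object that bundles a feature extractor and a tokenizer, here is a toy sketch in plain Python. All three classes are hypothetical and are not part of 🤗 Transformers; the real Whisper feature extractor does far more (it produces log-mel spectrogram features), and real tokenizers use a fixed, pretrained vocabulary.

```python
class ToyFeatureExtractor:
    """Hypothetical stand-in: pads or truncates audio to a fixed length."""

    def __init__(self, target_length):
        self.target_length = target_length

    def __call__(self, samples):
        clipped = list(samples[: self.target_length])
        return clipped + [0.0] * (self.target_length - len(clipped))


class ToyTokenizer:
    """Hypothetical stand-in: maps each distinct word to an integer id."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, text):
        # New words get the next free id; known words reuse their id.
        return [
            self.vocab.setdefault(word, len(self.vocab))
            for word in text.lower().split()
        ]


class ToyProcessor:
    """Hypothetical wrapper that routes audio and text to the right component."""

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def __call__(self, audio=None, text=None):
        outputs = {}
        if audio is not None:
            outputs["input_features"] = self.feature_extractor(audio)
        if text is not None:
            outputs["labels"] = self.tokenizer(text)
        return outputs


toy_processor = ToyProcessor(ToyFeatureExtractor(target_length=6), ToyTokenizer())
batch = toy_processor(audio=[0.1, 0.2, 0.3], text="pay my bill pay")
print(batch)
```

The design point is that calling code only needs one object to prepare both the audio inputs and the text targets.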

	Here we have illustrated the fundamental data preparation steps. Of course, custom data may require more complex preprocessing.
	In this case, you can extend the function `prepare_dataset` to perform any sort of custom data transformations. With 🤗 Datasets,
	if you can write it as a Python function, you can [apply it](https://huggingface.co/docs/datasets/audio_process) to your dataset!
