# Preprocessing an audio dataset

Loading a dataset with 🤗 Datasets is just half of the fun. If you plan to use it either for training a model, or for running
inference, you will need to pre-process the data first. In general, this will involve the following steps:

* Resampling the audio data
* Filtering the dataset
* Converting the audio data to the model's expected input

## Resampling the audio data

The `load_dataset` function downloads audio examples with the sampling rate that they were published with. This is not
always the sampling rate expected by a model you plan to train, or use for inference. If there's a discrepancy between
the sampling rates, you can resample the audio to the model's expected sampling rate.

Most of the available pretrained models have been pretrained on audio datasets at a sampling rate of 16 kHz.
When we explored the MINDS-14 dataset, you may have noticed that it is sampled at 8 kHz, which means we will likely need
to upsample it.

To do so, use 🤗 Datasets' `cast_column` method. This operation does not change the audio in-place, but rather signals
to 🤗 Datasets to resample the audio examples on the fly when they are loaded. The following code will set the sampling
rate to 16 kHz:

```py
from datasets import Audio

minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
```

Re-load the first audio example in the MINDS-14 dataset, and check that it has been resampled to the desired sampling rate:

```py
minds[0]
```

**Output:**

```out
{
    "path": "/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-AU~PAY_BILL/response_4.wav",
    "audio": {
        "path": "/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-AU~PAY_BILL/response_4.wav",
        "array": array(
            [
                2.0634243e-05,
                1.9437837e-04,
                2.2419340e-04,
                ...,
                9.3852862e-04,
                1.1302452e-03,
                7.1531429e-04,
            ],
            dtype=float32,
        ),
        "sampling_rate": 16000,
    },
    "transcription": "I would like to pay my electricity bill using my card can you please assist",
    "intent_class": 13,
}
```

You may notice that the array values are now also different. This is because, after upsampling from 8 kHz to 16 kHz, we have
twice as many amplitude values as we had before.

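As a rough illustration of why the array grows, the number of samples in a clip is its duration multiplied by the sampling rate. The sketch below uses a hypothetical 3-second clip; the duration and the helper name `num_samples` are assumptions chosen for illustration, not values taken from MINDS-14:

```python
def num_samples(duration_seconds, sampling_rate):
    """Number of amplitude values needed to represent a clip."""
    return int(round(duration_seconds * sampling_rate))


clip_seconds = 3.0  # hypothetical clip length, chosen for illustration

samples_at_8k = num_samples(clip_seconds, 8_000)
samples_at_16k = num_samples(clip_seconds, 16_000)

# Doubling the sampling rate doubles the number of amplitude values.
ratio = samples_at_16k / samples_at_8k
```

The audio itself is no longer or shorter after resampling; only the number of values describing each second changes.
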
💡 Some background on resampling: If an audio signal has been sampled at 8 kHz, so that it has 8000 sample readings per
second, we know that the audio does not contain any frequencies over 4 kHz. This is guaranteed by the Nyquist sampling
theorem. Because of this, we can be certain that in between the sampling points the original continuous signal always
makes a smooth curve. Upsampling to a higher sampling rate is then a matter of calculating additional sample values that go in between
the existing ones, by approximating this curve. Downsampling, however, requires that we first filter out any frequencies
that would be higher than the new Nyquist limit, before estimating the new sample points. In other words, you can't
downsample by a factor of 2 by simply throwing away every other sample; doing so creates distortions in the signal called
aliases. Doing resampling correctly is tricky and best left to well-tested libraries such as librosa or 🤗 Datasets.

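To make the aliasing problem concrete, here is a minimal pure-Python sketch (not how librosa or 🤗 Datasets resample). It builds a 3 kHz tone sampled at 8 kHz, naively keeps every other sample, and then looks for the strongest frequency with a simple discrete Fourier transform. The tone frequency, the signal length, and the helper `dominant_frequency` are assumptions made for this illustration:

```python
import cmath
import math


def dominant_frequency(samples, sampling_rate):
    """Return the frequency (Hz) of the strongest DFT bin below the Nyquist limit."""
    n = len(samples)
    best_bin, best_magnitude = 0, 0.0
    for k in range(1, n // 2):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)
        )
        if abs(coeff) > best_magnitude:
            best_bin, best_magnitude = k, abs(coeff)
    return best_bin * sampling_rate / n


sampling_rate = 8_000
tone_hz = 3_000  # below the 4 kHz Nyquist limit of an 8 kHz signal

# 0.1 seconds of a pure 3 kHz sine wave sampled at 8 kHz.
signal = [math.sin(2 * math.pi * tone_hz * i / sampling_rate) for i in range(800)]

# Naive "downsampling": drop every other sample without low-pass filtering first.
naive = signal[::2]

original_peak = dominant_frequency(signal, sampling_rate)
aliased_peak = dominant_frequency(naive, sampling_rate // 2)
```

In this sketch `original_peak` is 3000.0 Hz, while `aliased_peak` comes out as 1000.0 Hz: the new Nyquist limit is 2 kHz, so the 3 kHz tone folds back and appears as a tone that was never in the recording. A proper resampler would remove the 3 kHz content before reducing the sampling rate.
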
## Filtering the dataset

You may need to filter the data based on some criteria. One of the common cases involves limiting the audio examples to a
certain duration. For instance, we might want to filter out any examples longer than 20s to prevent out-of-memory errors
when training a model.

We can do this by using 🤗 Datasets' `filter` method and passing a function with the filtering logic to it. Let's start by writing a
function that indicates which examples to keep and which to discard. This function, `is_audio_length_in_range`,
returns `True` if a sample is shorter than 20s, and `False` if it is longer than 20s.

```py
MAX_DURATION_IN_SECONDS = 20.0


def is_audio_length_in_range(input_length):
    # Keep an example only if its duration (in seconds) is below the limit.
    return input_length < MAX_DURATION_IN_SECONDS
```

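The predicate above expects a duration in seconds. As a hedged sketch of how it could be used, the example below computes each clip's duration from the length of its sample array and its sampling rate, then keeps only the clips that pass. The list of dictionaries is a hypothetical stand-in for dataset rows, and `duration_in_seconds` is a helper introduced here for illustration only:

```python
MAX_DURATION_IN_SECONDS = 20.0


def is_audio_length_in_range(input_length):
    # Keep only clips strictly shorter than the limit.
    return input_length < MAX_DURATION_IN_SECONDS


def duration_in_seconds(example):
    # Duration = number of samples / samples per second.
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical stand-in for dataset rows: silent clips of 5 s, 20 s and 30 s at 16 kHz.
examples = [
    {"audio": {"array": [0.0] * (seconds * 16_000), "sampling_rate": 16_000}}
    for seconds in (5, 20, 30)
]

durations = [duration_in_seconds(example) for example in examples]
kept = [
    example
    for example, duration in zip(examples, durations)
    if is_audio_length_in_range(duration)
]
```

Only the 5-second clip survives here, because the comparison is strict and a clip of exactly 20 seconds is discarded. With a real 🤗 Datasets object, the same predicate can be passed to `filter`, using its `input_columns` argument to point at a column that holds the durations.
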
Now you can see what the audio input to the Whisper model looks like after preprocessing.

The model's feature extractor class takes care of transforming raw audio data into the format that the model expects. However,
many tasks involving audio are multimodal, e.g. speech recognition. In such cases 🤗 Transformers also offers model-specific
tokenizers to process the text inputs. For a deep dive into tokenizers, please refer to our [NLP course](https://huggingface.co/course/chapter2/4).

You can load the feature extractor and tokenizer for Whisper and other multimodal models separately, or you can load both via
a so-called processor. To make things even simpler, use `AutoProcessor` to load a model's feature extractor and tokenizer from a
checkpoint, like this:

```py
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-small")
```

Here we have illustrated the fundamental data preparation steps. Of course, custom data may require more complex preprocessing.
In this case, you can extend the function `prepare_dataset` to perform any sort of custom data transformations. With 🤗 Datasets,
if you can write it as a Python function, you can [apply it](https://huggingface.co/docs/datasets/audio_process) to your dataset!