Buckets:
| # Process audio data | |
| This guide shows specific methods for processing audio datasets. Learn how to: | |
| - Resample the sampling rate. | |
| - Use [map()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.map) with audio datasets. | |
| For a guide on how to process any type of dataset, take a look at the general process guide. | |
| ## Cast | |
| The [cast_column()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.cast_column) function is used to cast a column to another feature to be decoded. When you use this function with the [Audio](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Audio) feature, you can resample the sampling rate: | |
| ```py | |
| >>> from datasets import load_dataset, Audio | |
| >>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train") | |
| >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000)) | |
| ``` | |
| Audio files are decoded and resampled on-the-fly, so the next time you access an example, the audio file is resampled to 16kHz: | |
| ```py | |
| >>> audio = dataset[0]["audio"] | |
| >>> audio = audio_dataset[0]["audio"] | |
| >>> samples = audio.get_all_samples() | |
| >>> samples.data | |
| tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, | |
| -1.9127e-04, -5.3330e-05]] | |
| >>> samples.sample_rate | |
| 16000 | |
| ``` | |
| ## Map | |
| The [map()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.map) function helps preprocess your entire dataset at once. Depending on the type of model you're working with, you'll need to either load a [feature extractor](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoFeatureExtractor) or a [processor](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoProcessor). | |
| - For pretrained speech recognition models, load a feature extractor and tokenizer and combine them in a `processor`: | |
| ```py | |
| >>> from transformers import AutoTokenizer, AutoFeatureExtractor, AutoProcessor | |
| >>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53" | |
| # after defining a vocab.json file you can instantiate a tokenizer object: | |
| >>> tokenizer = AutoTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|") | |
| >>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint) | |
| >>> processor = AutoProcessor.from_pretrained(feature_extractor=feature_extractor, tokenizer=tokenizer) | |
| ``` | |
| - For fine-tuned speech recognition models, you only need to load a `processor`: | |
| ```py | |
| >>> from transformers import AutoProcessor | |
| >>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h") | |
| ``` | |
| When you use [map()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.map) with your preprocessing function, include the `audio` column to ensure you're actually resampling the audio data: | |
| ```py | |
| >>> def prepare_dataset(batch): | |
| ... audio = batch["audio"] | |
| ... batch["input_values"] = processor(audio.get_all_samples().data, sampling_rate=audio["sampling_rate"]).input_values[0] | |
| ... batch["input_length"] = len(batch["input_values"]) | |
| ... with processor.as_target_processor(): | |
| ... batch["labels"] = processor(batch["sentence"]).input_ids | |
| ... return batch | |
| >>> dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names) | |
| ``` | |
Xet Storage Details
- Size:
- 3.3 kB
- Xet hash:
- 350c3c68c8db0e3e6211d468cf04cce58701e33071eb8e1378996642660694ea
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.