Buckets:
| # Load a dataset from the Hub | |
| Finding high-quality datasets that are reproducible and accessible can be difficult. One of ๐ค Datasets main goals is to provide a simple way to load a dataset of any format or type. The easiest way to get started is to discover an existing dataset on the [Hugging Face Hub](https://huggingface.co/datasets) - a community-driven collection of datasets for tasks in NLP, computer vision, and audio - and use ๐ค Datasets to download and generate the dataset. | |
| This tutorial uses the [rotten_tomatoes](https://huggingface.co/datasets/rotten_tomatoes) and [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) datasets, but feel free to load any dataset you want and follow along. Head over to the Hub now and find a dataset for your task! | |
| ## Load a dataset | |
| Before you take the time to download a dataset, it's often helpful to quickly get some general information about a dataset. A dataset's information is stored inside [DatasetInfo](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.DatasetInfo) and can include information such as the dataset description, features, and dataset size. | |
| Use the [load_dataset_builder()](/docs/datasets/pr_8021/en/package_reference/loading_methods#datasets.load_dataset_builder) function to load a dataset builder and inspect a dataset's attributes without committing to downloading it: | |
| ```py | |
| >>> from datasets import load_dataset_builder | |
| >>> ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes") | |
| # Inspect dataset description | |
| >>> ds_builder.info.description | |
| Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005. | |
| # Inspect dataset features | |
| >>> ds_builder.info.features | |
| {'label': ClassLabel(names=['neg', 'pos']), | |
| 'text': Value('string')} | |
| ``` | |
| If you're happy with the dataset, then load it with [load_dataset()](/docs/datasets/pr_8021/en/package_reference/loading_methods#datasets.load_dataset): | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") | |
| ``` | |
| ## Splits | |
| A split is a specific subset of a dataset like `train` and `test`. List a dataset's split names with the [get_dataset_split_names()](/docs/datasets/pr_8021/en/package_reference/loading_methods#datasets.get_dataset_split_names) function: | |
| ```py | |
| >>> from datasets import get_dataset_split_names | |
| >>> get_dataset_split_names("cornell-movie-review-data/rotten_tomatoes") | |
| ['train', 'validation', 'test'] | |
| ``` | |
| Then you can load a specific split with the `split` parameter. Loading a dataset `split` returns a [Dataset](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset) object: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") | |
| >>> dataset | |
| Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 8530 | |
| }) | |
| ``` | |
| If you don't specify a `split`, ๐ค Datasets returns a [DatasetDict](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.DatasetDict) object instead: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| DatasetDict({ | |
| train: Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 8530 | |
| }) | |
| validation: Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 1066 | |
| }) | |
| test: Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 1066 | |
| }) | |
| }) | |
| ``` | |
| ## Configurations | |
| Some datasets contain several sub-datasets. For example, the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has several sub-datasets, each one containing audio data in a different language. These sub-datasets are known as *configurations* or *subsets*, and you must explicitly select one when loading the dataset. If you don't provide a configuration name, ๐ค Datasets will raise a `ValueError` and remind you to choose a configuration. | |
| Use the [get_dataset_config_names()](/docs/datasets/pr_8021/en/package_reference/loading_methods#datasets.get_dataset_config_names) function to retrieve a list of all the possible configurations available to your dataset: | |
| ```py | |
| >>> from datasets import get_dataset_config_names | |
| >>> configs = get_dataset_config_names("PolyAI/minds14") | |
| >>> print(configs) | |
| ['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN', 'all'] | |
| ``` | |
| Then load the configuration you want: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> mindsFR = load_dataset("PolyAI/minds14", "fr-FR", split="train") | |
| ``` | |
Xet Storage Details
- Size:
- 4.86 kB
- Xet hash:
- 6e78478b7cea453bbdd3a0d32c202067d1a088e51fb48b6833ae5f8e0cc08f3e
ยท
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.