Buckets:
| # Use with NumPy | |
| This document is a quick introduction to using `datasets` with NumPy, with a particular focus on how to get | |
| `numpy.ndarray` objects out of our datasets, and how to use them to train models based on NumPy such as `scikit-learn` models. | |
| ## Dataset format | |
| By default, datasets return regular Python objects: integers, floats, strings, lists, etc.. | |
| To get NumPy arrays instead, you can set the format of the dataset to `numpy`: | |
| ```py | |
| >>> from datasets import Dataset | |
| >>> data = [[1, 2], [3, 4]] | |
| >>> ds = Dataset.from_dict({"data": data}) | |
| >>> ds = ds.with_format("numpy") | |
| >>> ds[0] | |
| {'data': array([1, 2])} | |
| >>> ds[:2] | |
| {'data': array([ | |
| [1, 2], | |
| [3, 4]])} | |
| ``` | |
| > [!TIP] | |
| > A [Dataset](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset) object is a wrapper of an Arrow table, which allows fast reads from arrays in the dataset to NumPy arrays. | |
| Note that the exact same procedure applies to `DatasetDict` objects, so that | |
| when setting the format of a `DatasetDict` to `numpy`, all the `Dataset`s there | |
| will be formatted as `numpy`: | |
| ```py | |
| >>> from datasets import DatasetDict | |
| >>> data = {"train": {"data": [[1, 2], [3, 4]]}, "test": {"data": [[5, 6], [7, 8]]}} | |
| >>> dds = DatasetDict.from_dict(data) | |
| >>> dds = dds.with_format("numpy") | |
| >>> dds["train"][:2] | |
| {'data': array([ | |
| [1, 2], | |
| [3, 4]])} | |
| ``` | |
| ### N-dimensional arrays | |
| If your dataset consists of N-dimensional arrays, you will see that by default they are considered as the same array if the shape is fixed: | |
| ```py | |
| >>> from datasets import Dataset | |
| >>> data = [[[1, 2],[3, 4]], [[5, 6],[7, 8]]] # fixed shape | |
| >>> ds = Dataset.from_dict({"data": data}) | |
| >>> ds = ds.with_format("numpy") | |
| >>> ds[0] | |
| {'data': array([[1, 2], | |
| [3, 4]])} | |
| ``` | |
| ```py | |
| >>> from datasets import Dataset | |
| >>> data = [[[1, 2],[3]], [[4, 5, 6],[7, 8]]] # varying shape | |
| >>> ds = Dataset.from_dict({"data": data}) | |
| >>> ds = ds.with_format("numpy") | |
| >>> ds[0] | |
| {'data': array([array([1, 2]), array([3])], dtype=object)} | |
| ``` | |
| However this logic often requires slow shape comparisons and data copies. | |
| To avoid this, you must explicitly use the `Array` feature type and specify the shape of your tensors: | |
| ```py | |
| >>> from datasets import Dataset, Features, Array2D | |
| >>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]] | |
| >>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')}) | |
| >>> ds = Dataset.from_dict({"data": data}, features=features) | |
| >>> ds = ds.with_format("numpy") | |
| >>> ds[0] | |
| {'data': array([[1, 2], | |
| [3, 4]])} | |
| >>> ds[:2] | |
| {'data': array([[[1, 2], | |
| [3, 4]], | |
| [[5, 6], | |
| [7, 8]]])} | |
| ``` | |
| ### Other feature types | |
| [ClassLabel](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.ClassLabel) data is properly converted to arrays: | |
| ```py | |
| >>> from datasets import Dataset, Features, ClassLabel | |
| >>> labels = [0, 0, 1] | |
| >>> features = Features({"label": ClassLabel(names=["negative", "positive"])}) | |
| >>> ds = Dataset.from_dict({"label": labels}, features=features) | |
| >>> ds = ds.with_format("numpy") | |
| >>> ds[:3] | |
| {'label': array([0, 0, 1])} | |
| ``` | |
| String and binary objects are unchanged, since NumPy only supports numbers. | |
| The [Image](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Image) and [Audio](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Audio) feature types are also supported. | |
| > [!TIP] | |
| > To use the [Image](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Image) feature type, you'll need to install the `vision` extra as | |
| > `pip install datasets[vision]`. | |
| ```py | |
| >>> from datasets import Dataset, Features, Image | |
| >>> images = ["path/to/image.png"] * 10 | |
| >>> features = Features({"image": Image()}) | |
| >>> ds = Dataset.from_dict({"image": images}, features=features) | |
| >>> ds = ds.with_format("numpy") | |
| >>> ds[0]["image"].shape | |
| (512, 512, 3) | |
| >>> ds[0] | |
| {'image': array([[[ 255, 255, 255], | |
| [ 255, 255, 255], | |
| ..., | |
| [ 255, 255, 255], | |
| [ 255, 255, 255]]], dtype=uint8)} | |
| >>> ds[:2]["image"].shape | |
| (2, 512, 512, 3) | |
| >>> ds[:2] | |
| {'image': array([[[[ 255, 255, 255], | |
| [ 255, 255, 255], | |
| ..., | |
| [ 255, 255, 255], | |
| [ 255, 255, 255]]]], dtype=uint8)} | |
| ``` | |
| > [!TIP] | |
| > To use the [Audio](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Audio) feature type, you'll need to install the `audio` extra as | |
| > `pip install datasets[audio]`. | |
| ```py | |
| >>> from datasets import Dataset, Features, Audio | |
| >>> audio = ["path/to/audio.wav"] * 10 | |
| >>> features = Features({"audio": Audio()}) | |
| >>> ds = Dataset.from_dict({"audio": audio}, features=features) | |
| >>> ds = ds.with_format("numpy") | |
| >>> ds[0]["audio"]["array"] | |
| array([-0.059021 , -0.03894043, -0.00735474, ..., 0.0133667 , | |
| 0.01809692, 0.00268555], dtype=float32) | |
| >>> ds[0]["audio"]["sampling_rate"] | |
| array(44100, weak_type=True) | |
| ``` | |
| ## Data loading | |
| NumPy doesn't have any built-in data loading capabilities, so you'll either need to materialize the NumPy arrays like `X, y` to use in `scikit-learn` or use a library such as [PyTorch](https://pytorch.org/) to load your data using a `DataLoader`. | |
| ### Using `with_format('numpy')` | |
| The easiest way to get NumPy arrays out of a dataset is to use the `with_format('numpy')` method. Lets assume | |
| that we want to train a neural network on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) available | |
| at the HuggingFace Hub at https://huggingface.co/datasets/mnist. | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("ylecun/mnist") | |
| >>> ds = ds.with_format("numpy") | |
| >>> ds["train"][0] | |
| {'image': array([[ 0, 0, 0, ...], | |
| [ 0, 0, 0, ...], | |
| ..., | |
| [ 0, 0, 0, ...], | |
| [ 0, 0, 0, ...]], dtype=uint8), | |
| 'label': array(5)} | |
| ``` | |
| Once the format is set we can feed the dataset to the model based on NumPy in batches using the `Dataset.iter()` | |
| method: | |
| ```py | |
| >>> for epoch in range(epochs): | |
| ... for batch in ds["train"].iter(batch_size=32): | |
| ... x, y = batch["image"], batch["label"] | |
| ... ... | |
| ``` | |
Xet Storage Details
- Size:
- 6.11 kB
- Xet hash:
- 08668e9b2a08ec2f5d10b05ba378d9b99e4bbb7a8cb811385829605dd2211e09
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.