	# Use with PyTorch

	This document is a quick introduction to using `datasets` with PyTorch, with a particular focus on how to get
	`torch.Tensor` objects out of our datasets, and how to use a PyTorch `DataLoader` and a Hugging Face `Dataset`
	with the best performance.

	## Dataset format

	By default, datasets return regular Python objects: integers, floats, strings, lists, etc.

	To get PyTorch tensors instead, you can set the format of the dataset to `torch` using [Dataset.with_format()](/docs/datasets/pr_7889/en/package_reference/main_classes#datasets.Dataset.with_format):

	```py
	>>> from datasets import Dataset
	>>> data = [[1, 2],[3, 4]]
	>>> ds = Dataset.from_dict({"data": data})
	>>> ds = ds.with_format("torch")
	>>> ds[0]
	{'data': tensor([1, 2])}
	>>> ds[:2]
	{'data': tensor([[1, 2],
	        [3, 4]])}
	```

	> [!TIP]
	> A [Dataset](/docs/datasets/pr_7889/en/package_reference/main_classes#datasets.Dataset) object is a wrapper of an Arrow table, which allows fast zero-copy reads from arrays in the dataset to PyTorch tensors.

	To load the data as tensors on a GPU, specify the `device` argument:

	```py
	>>> import torch
	>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	>>> ds = ds.with_format("torch", device=device)
	>>> ds[0]
	{'data': tensor([1, 2], device='cuda:0')}
	```

	### N-dimensional arrays

	If your dataset consists of N-dimensional arrays, they are returned as a single tensor by default when the shape is fixed, and as a list of tensors when the shape varies:

	```py
	>>> from datasets import Dataset
	>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]] # fixed shape
	>>> ds = Dataset.from_dict({"data": data})
	>>> ds = ds.with_format("torch")
	>>> ds[0]
	{'data': tensor([[1, 2],
	        [3, 4]])}
	```

	```py
	>>> from datasets import Dataset
	>>> data = [[[1, 2],[3]],[[4, 5, 6],[7, 8]]] # varying shape
	>>> ds = Dataset.from_dict({"data": data})
	>>> ds = ds.with_format("torch")
	>>> ds[0]
	{'data': [tensor([1, 2]), tensor([3])]}
	```

	However, this logic often requires slow shape comparisons and data copies.
	To avoid this, you must explicitly use one of the `Array` feature types (such as `Array2D`) and specify the shape of your tensors:

	```py
	>>> from datasets import Dataset, Features, Array2D
	>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
	>>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')})
	>>> ds = Dataset.from_dict({"data": data}, features=features)
	>>> ds = ds.with_format("torch")
	>>> ds[0]
	{'data': tensor([[1, 2],
	        [3, 4]])}
	>>> ds[:2]
	{'data': tensor([[[1, 2],
	         [3, 4]],

	        [[5, 6],
	         [7, 8]]])}
	```
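As a rough illustration of why a declared shape helps: without one, deciding whether nested lists can become a single tensor means inspecting the shape of every item. The sketch below mimics that check in plain Python. `nested_shape` and `can_stack` are hypothetical helpers written for this example, not functions from `datasets`, and the library's actual logic may differ.

```python
def nested_shape(value):
    """Return the shape of a nested list as a tuple, or None if it is ragged."""
    if not isinstance(value, list):
        return ()  # a scalar has an empty shape
    child_shapes = [nested_shape(item) for item in value]
    if any(shape is None for shape in child_shapes):
        return None  # a child is already ragged
    if len(set(child_shapes)) > 1:
        return None  # children disagree, so this list is ragged
    inner = child_shapes[0] if child_shapes else ()
    return (len(value),) + inner


def can_stack(items):
    """True if every item has the same, non-ragged shape."""
    shapes = [nested_shape(item) for item in items]
    return None not in shapes and len(set(shapes)) == 1


fixed = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
varying = [[[1, 2], [3]], [[4, 5, 6], [7, 8]]]

print(can_stack(fixed))    # True
print(can_stack(varying))  # False
```

Every element has to be visited before the answer is known, which is the cost that an explicit `Array2D(shape=...)` declaration avoids.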

	### Other feature types

	[ClassLabel](/docs/datasets/pr_7889/en/package_reference/main_classes#datasets.ClassLabel) data are properly converted to tensors:

	```py
	>>> from datasets import Dataset, Features, ClassLabel
	>>> labels = [0, 0, 1]
	>>> features = Features({"label": ClassLabel(names=["negative", "positive"])})
	>>> ds = Dataset.from_dict({"label": labels}, features=features)
	>>> ds = ds.with_format("torch")
	>>> ds[:3]
	{'label': tensor([0, 0, 1])}
	```

	String and binary objects are unchanged, since PyTorch only supports numbers.

	The [Image](/docs/datasets/pr_7889/en/package_reference/main_classes#datasets.Image) and [Audio](/docs/datasets/pr_7889/en/package_reference/main_classes#datasets.Audio) feature types are also supported.

	> [!TIP]
	> To use the [Image](/docs/datasets/pr_7889/en/package_reference/main_classes#datasets.Image) feature type, you'll need to install the `vision` extra as
	> `pip install datasets[vision]`.

	```py
	>>> from datasets import Dataset, Features, Image
	>>> images = ["path/to/image.png"] * 10
	>>> features = Features({"image": Image()})
	>>> ds = Dataset.from_dict({"image": images}, features=features)
	>>> ds = ds.with_format("torch")
	>>> ds[0]["image"].shape
	torch.Size([512, 512, 4])
	>>> ds[0]
	{'image': tensor([[[255, 215, 106, 255],
	[255, 215, 106, 255],
	...,
	[255, 255, 255, 255],
	[255, 255, 255, 255]]], dtype=torch.uint8)}
	>>> ds[:2]["image"].shape
	torch.Size([2, 512, 512, 4])
	>>> ds[:2]
	{'image': tensor([[[[255, 215, 106, 255],
	[255, 215, 106, 255],
	...,
	[255, 255, 255, 255],
	[255, 255, 255, 255]]]], dtype=torch.uint8)}
	```

	> [!TIP]
	> To use the [Audio](/docs/datasets/pr_7889/en/package_reference/main_classes#datasets.Audio) feature type, you'll need to install the `audio` extra as
	> `pip install datasets[audio]`.

	```py
	>>> from datasets import Dataset, Features, Audio
	>>> audio = ["path/to/audio.wav"] * 10
	>>> features = Features({"audio": Audio()})
	>>> ds = Dataset.from_dict({"audio": audio}, features=features)
	>>> ds = ds.with_format("torch")
	>>> ds[0]["audio"]["array"]
	tensor([ 6.1035e-05, 1.5259e-05, 1.6785e-04, ..., -1.5259e-05,
	-1.5259e-05, 1.5259e-05])
	>>> ds[0]["audio"]["sampling_rate"]
	tensor(44100)
	```

	## Data loading

	Like `torch.utils.data.Dataset` objects, a [Dataset](/docs/datasets/pr_7889/en/package_reference/main_classes#datasets.Dataset) can be passed directly to a PyTorch `DataLoader`:

	```py
	>>> import numpy as np
	>>> from datasets import Dataset
	>>> from torch.utils.data import DataLoader
	>>> data = np.random.rand(16)
	>>> label = np.random.randint(0, 2, size=16)
	>>> ds = Dataset.from_dict({"data": data, "label": label}).with_format("torch")
	>>> dataloader = DataLoader(ds, batch_size=4)
	>>> for batch in dataloader:
	...     print(batch)
	{'data': tensor([0.0047, 0.4979, 0.6726, 0.8105]), 'label': tensor([0, 1, 0, 1])}
	{'data': tensor([0.4832, 0.2723, 0.4259, 0.2224]), 'label': tensor([0, 0, 0, 0])}
	{'data': tensor([0.5837, 0.3444, 0.4658, 0.6417]), 'label': tensor([0, 1, 0, 0])}
	{'data': tensor([0.7022, 0.1225, 0.7228, 0.8259]), 'label': tensor([1, 1, 1, 1])}
	```

	### Optimize data loading

	There are several ways to speed up data loading, which can save you time, especially when working with large datasets.
	PyTorch offers parallelized data loading, retrieving batches of indices instead of one index at a time, and streaming to iterate over the dataset without downloading it to disk.

	#### Use multiple workers

	You can parallelize data loading with the `num_workers` argument of a PyTorch `DataLoader` and get a higher throughput.

	Under the hood, the `DataLoader` starts `num_workers` processes.
	Each process reloads the dataset passed to the `DataLoader` and is used to query examples.
	Reloading the dataset inside a worker doesn't fill up your RAM, since it simply memory-maps the dataset again from your disk.

	```py
	>>> import numpy as np
	>>> from datasets import Dataset, load_from_disk
	>>> from torch.utils.data import DataLoader
	>>> data = np.random.rand(10_000)
	>>> Dataset.from_dict({"data": data}).save_to_disk("my_dataset")
	>>> ds = load_from_disk("my_dataset").with_format("torch")
	>>> dataloader = DataLoader(ds, batch_size=32, num_workers=4)
	```

	### Stream data

	Stream a dataset by loading it as an [IterableDataset](/docs/datasets/pr_7889/en/package_reference/main_classes#datasets.IterableDataset). This allows you to progressively iterate over a remote dataset without downloading it to disk, or over local data files.
	Learn more about which type of dataset is best for your use case in the [choosing between a regular dataset or an iterable dataset](./about_mapstyle_vs_iterable) guide.

	An iterable dataset from `datasets` inherits from `torch.utils.data.IterableDataset` so you can pass it to a `torch.utils.data.DataLoader`:

	```py
	>>> import numpy as np
	>>> from datasets import Dataset, load_dataset
	>>> from torch.utils.data import DataLoader
	>>> data = np.random.rand(10_000)
	>>> Dataset.from_dict({"data": data}).push_to_hub("/my_dataset") # Upload to the Hugging Face Hub
	>>> my_iterable_dataset = load_dataset("/my_dataset", streaming=True, split="train")
	>>> dataloader = DataLoader(my_iterable_dataset, batch_size=32)
	```

	If the dataset is split into several shards (i.e. if the dataset consists of multiple data files), then you can stream in parallel using `num_workers`:

	```py
	>>> my_iterable_dataset = load_dataset("deepmind/code_contests", streaming=True, split="train")
	>>> my_iterable_dataset.num_shards
	39
	>>> dataloader = DataLoader(my_iterable_dataset, batch_size=32, num_workers=4)
	```

	In this case each worker is given a subset of the list of shards to stream from.
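To picture how the shards might be divided, here is a small sketch in plain Python. It assumes a simple round-robin assignment. `shards_for_worker` is a hypothetical helper, and the exact assignment used by `datasets` may differ between versions.

```python
def shards_for_worker(num_shards, num_workers, worker_id):
    """Round-robin assignment: worker i takes shards i, i + num_workers, ..."""
    return list(range(worker_id, num_shards, num_workers))


num_shards, num_workers = 39, 4  # values from the example above
assignment = {
    worker_id: shards_for_worker(num_shards, num_workers, worker_id)
    for worker_id in range(num_workers)
}

for worker_id, shards in assignment.items():
    print(f"worker {worker_id}: {len(shards)} shards, first few: {shards[:3]}")
```

With 39 shards and 4 workers, three workers receive 10 shards and one receives 9, so the workers do not necessarily finish at the same time. In this sketch, a worker whose id is greater than or equal to the number of shards would receive nothing to read.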

	### Checkpoint and resume

	If you need a DataLoader that you can checkpoint and resume in the middle of training, you can use the `StatefulDataLoader` from [torchdata](https://github.com/pytorch/data):

	```py
	>>> from datasets import load_dataset
	>>> from torchdata.stateful_dataloader import StatefulDataLoader
	>>> my_iterable_dataset = load_dataset("deepmind/code_contests", streaming=True, split="train")
	>>> dataloader = StatefulDataLoader(my_iterable_dataset, batch_size=32, num_workers=4)
	>>> # save in the middle of training
	>>> state_dict = dataloader.state_dict()
	>>> # and resume later
	>>> dataloader.load_state_dict(state_dict)
	```

	This is possible thanks to [IterableDataset.state_dict()](/docs/datasets/pr_7889/en/package_reference/main_classes#datasets.IterableDataset.state_dict) and [IterableDataset.load_state_dict()](/docs/datasets/pr_7889/en/package_reference/main_classes#datasets.IterableDataset.load_state_dict).

	### Distributed

	To split your dataset across your training nodes, you can use [datasets.distributed.split_dataset_by_node()](/docs/datasets/pr_7889/en/package_reference/main_classes#datasets.distributed.split_dataset_by_node):

	```python
	import os
	from datasets.distributed import split_dataset_by_node

	ds = split_dataset_by_node(ds, rank=int(os.environ["RANK"]), world_size=int(os.environ["WORLD_SIZE"]))
	```

	This works for both map-style datasets and iterable datasets.
	The dataset is split for the node at rank `rank` in a pool of nodes of size `world_size`.

	For map-style datasets:

	Each node is assigned a chunk of data, e.g. rank 0 is given the first chunk of the dataset.

	For iterable datasets:

	If the number of shards in the dataset is a multiple of `world_size` (i.e. if `dataset.num_shards % world_size == 0`),
	then the shards are evenly assigned across the nodes, which is the most efficient setup.
	Otherwise, each node keeps 1 example out of `world_size`, skipping the other examples.

	This can also be combined with a `torch.utils.data.DataLoader` if you want each node to use multiple workers to load the data.
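To make the splitting rules for map-style and iterable datasets more concrete, here is a sketch in plain Python. `map_style_chunk` and `iterable_plan` are hypothetical helpers, not part of `datasets`. The exact chunk boundaries and the order in which shards are handed out are assumptions made for illustration.

```python
def map_style_chunk(num_rows, rank, world_size):
    """Contiguous chunk of row indices for one node (map-style dataset)."""
    base, extra = divmod(num_rows, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


def iterable_plan(num_shards, rank, world_size):
    """Describe what one node reads from an iterable dataset."""
    if num_shards % world_size == 0:
        per_node = num_shards // world_size
        # Assumption: each node takes a contiguous block of shards
        shards = list(range(rank * per_node, (rank + 1) * per_node))
        return {"mode": "shards", "shards": shards}
    # Otherwise the node reads the stream and keeps 1 example out of world_size
    return {"mode": "skip_examples", "step": world_size, "offset": rank}


print(list(map_style_chunk(num_rows=10, rank=0, world_size=4)))
print(iterable_plan(num_shards=8, rank=1, world_size=4))
print(iterable_plan(num_shards=39, rank=1, world_size=4))
```

In the last call, 39 shards cannot be divided evenly among 4 nodes, so the sketch falls back to skipping examples. Each node then has to read data it ends up discarding, which is why an evenly divisible shard count is preferable.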
