	# Create a dataset

	When you're working with your own data, you may need to create a dataset yourself. Creating a dataset with 🤗 Datasets gives it all the advantages of the library: fast loading and processing, [streaming of enormous datasets](stream), [memory-mapping](https://huggingface.co/course/chapter5/4?fw=pt#the-magic-of-memory-mapping), and more. The low-code approaches in 🤗 Datasets let you create a dataset quickly, reducing the time it takes to start training a model. In many cases, it is as easy as [dragging and dropping](upload_dataset#upload-with-the-hub-ui) your data files into a dataset repository on the Hub.

	In this tutorial, you'll learn how to use the low-code methods in 🤗 Datasets for creating all types of datasets:

	- File-based builders for loading common file formats such as CSV, JSON, and Parquet
	- Folder-based builders for quickly creating an image or audio dataset
	- `from_` methods for creating datasets from Python generators and dictionaries

	## File-based builders

	🤗 Datasets supports many common formats, such as `csv`, `json/jsonl`, `parquet`, and `txt`.

	For example, it can read a dataset made up of one or several CSV files (for several files, pass them as a list):

	```py
	>>> from datasets import load_dataset
	>>> dataset = load_dataset("csv", data_files="my_file.csv")
	```
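As a minimal sketch of the list form mentioned above, the following writes two small CSV files and prepares them to be loaded together. The file names, the column names, and the helpers `write_csv_shards` and `load_csv_shards` are hypothetical; only the `load_dataset("csv", data_files=...)` call comes from the example above.

```python
import csv
import tempfile
from pathlib import Path


def write_csv_shards(folder):
    """Write two tiny CSV files with the same columns and return their paths."""
    shards = {
        "part_1.csv": [("bulbasaur", "grass"), ("charmander", "fire")],
        "part_2.csv": [("squirtle", "water")],
    }
    paths = []
    for name, rows in shards.items():
        path = Path(folder) / name
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["pokemon", "type"])  # header row, shared by every file
            writer.writerows(rows)
        paths.append(str(path))
    return paths


def load_csv_shards(paths):
    """Load several CSV files as one dataset (requires the `datasets` package)."""
    from datasets import load_dataset  # imported lazily so the helper above runs without it

    # Passing a list to data_files reads all of the files into the same dataset.
    return load_dataset("csv", data_files=paths)


csv_paths = write_csv_shards(tempfile.mkdtemp())
print(csv_paths)
```

With 🤗 Datasets installed, calling `load_csv_shards(csv_paths)` should return a dataset whose rows come from both files.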

	For the full list of supported formats and code examples, see the [loading guide](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

	## Folder-based builders

	There are two folder-based builders, `ImageFolder` and `AudioFolder`. These are low-code methods for quickly creating an image or audio dataset with several thousand examples. They are great for rapidly prototyping computer vision and speech models before scaling to a larger dataset. Folder-based builders take your data and automatically generate the dataset's features, splits, and labels. Under the hood:

	- `ImageFolder` uses the [Image](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Image) feature to decode an image file. Many image formats are supported, such as jpg and png; you can check the complete [list](https://github.com/huggingface/datasets/blob/b5672a956d5de864e6f5550e493527d962d6ae55/src/datasets/packaged_modules/imagefolder/imagefolder.py#L39) of supported image extensions.
	- `AudioFolder` uses the [Audio](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Audio) feature to decode an audio file. Extensions such as wav, mp3, and even mp4 are supported, and you can check the complete [list](https://ffmpeg.org/ffmpeg-formats.html) of supported audio extensions. Decoding is done via ffmpeg.

	The dataset splits are generated from the repository structure, and the label names are automatically inferred from the directory name.

	For example, if your image dataset (it is the same for an audio dataset) is stored like this:

	```
	pokemon/train/grass/bulbasaur.png
	pokemon/train/fire/charmander.png
	pokemon/train/water/squirtle.png

	pokemon/test/grass/ivysaur.png
	pokemon/test/fire/charmeleon.png
	pokemon/test/water/wartortle.png
	```

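	Then the folder-based builder generates each example from the file's location. For instance, `pokemon/train/grass/bulbasaur.png` becomes an example in the `train` split that is labeled `grass`.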

	Create the image dataset by specifying `imagefolder` in [load_dataset()](/docs/datasets/pr_8021/en/package_reference/loading_methods#datasets.load_dataset):

	```py
	>>> from datasets import load_dataset

	>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")
	```
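The rule that splits and labels come from the directory names can be pictured in plain Python. This is only an illustration of the `<data_dir>/<split>/<label>/<file>` layout shown earlier, not the library's actual implementation, and the function name `infer_split_and_label` is hypothetical.

```python
from pathlib import PurePosixPath


def infer_split_and_label(file_path, data_dir):
    """Illustrate the layout rule: <data_dir>/<split>/<label>/<file>."""
    relative = PurePosixPath(file_path).relative_to(data_dir)
    split, label, file_name = relative.parts  # expects exactly three path components
    return {"split": split, "label": label, "file_name": file_name}


files = [
    "pokemon/train/grass/bulbasaur.png",
    "pokemon/test/fire/charmeleon.png",
]
for f in files:
    print(infer_split_and_label(f, "pokemon"))
```

A path with a different depth raises a `ValueError` in this sketch, which makes the assumed layout explicit.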

	An audio dataset is created in the same way, except you specify `audiofolder` in [load_dataset()](/docs/datasets/pr_8021/en/package_reference/loading_methods#datasets.load_dataset) instead:

	```py
	>>> from datasets import load_dataset

	>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
	```

	Any additional information about your dataset, such as text captions or transcriptions, can be included with a `metadata.csv` file in the folder containing your dataset. The metadata file needs to have a `file_name` column that links the image or audio file to its corresponding metadata:

	```
	file_name, text
	bulbasaur.png, There is a plant seed on its back right from the day this Pokémon is born.
	charmander.png, It has a preference for hot things.
	squirtle.png, When it retracts its long neck into its shell, it squirts out water with vigorous force.
	```
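One way to produce such a file is Python's standard `csv` module, which also takes care of quoting captions that contain commas. This is a minimal sketch: the helper name `write_metadata` is hypothetical, and a temporary folder stands in for the folder that would hold your image files.

```python
import csv
import tempfile
from pathlib import Path


def write_metadata(folder, captions):
    """Write metadata.csv with a `file_name` column plus one extra `text` column."""
    path = Path(folder) / "metadata.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "text"])
        for file_name, text in captions.items():
            # csv.writer quotes any caption that contains a comma
            writer.writerow([file_name, text])
    return path


captions = {
    "charmander.png": "It has a preference for hot things.",
    "squirtle.png": "When it retracts its long neck into its shell, it squirts out water with vigorous force.",
}
metadata_path = write_metadata(tempfile.mkdtemp(), captions)
print(metadata_path.read_text(encoding="utf-8"))
```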

	To learn more about each of these folder-based builders, check out the ImageFolder or AudioFolder guides.

	## From Python dictionaries

	You can also create a dataset from data in Python dictionaries. There are two ways you can create a dataset using the `from_` methods:

	* The [from_generator()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.from_generator) method is the most memory-efficient way to create a dataset from a [generator](https://wiki.python.org/moin/Generators) due to a generator's iterative behavior. This is especially useful when you're working with a really large dataset that may not fit in memory, since the dataset is generated on disk progressively and then memory-mapped.

	```py
	>>> from datasets import Dataset
	>>> def gen():
	...     yield {"pokemon": "bulbasaur", "type": "grass"}
	...     yield {"pokemon": "squirtle", "type": "water"}
	...
	>>> ds = Dataset.from_generator(gen)
	>>> ds[0]
	{'pokemon': 'bulbasaur', 'type': 'grass'}
	```

	A generator-based [IterableDataset](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.IterableDataset) needs to be iterated over, for example with a `for` loop:

	```py
	>>> from datasets import IterableDataset
	>>> ds = IterableDataset.from_generator(gen)
	>>> for example in ds:
	...     print(example)
	...
	{'pokemon': 'bulbasaur', 'type': 'grass'}
	{'pokemon': 'squirtle', 'type': 'water'}
	```

	* The [from_dict()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.from_dict) method is a straightforward way to create a dataset from a dictionary:

	```py
	>>> from datasets import Dataset
	>>> ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
	>>> ds[0]
	{'pokemon': 'bulbasaur', 'type': 'grass'}
	```
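The dictionary passed to `from_dict()` above is column-oriented: each key is a column name, and each value is a list with one entry per row. If your records are stored as one dictionary per row instead, a small helper can transpose them first. This is a sketch under the assumption that every row has the same keys; the name `rows_to_columns` is hypothetical.

```python
def rows_to_columns(rows):
    """Turn a list of row dictionaries into a dictionary of column lists."""
    columns = {key: [] for key in rows[0]}  # assumes every row has the same keys
    for row in rows:
        for key in columns:
            columns[key].append(row[key])
    return columns


rows = [
    {"pokemon": "bulbasaur", "type": "grass"},
    {"pokemon": "squirtle", "type": "water"},
]
columns = rows_to_columns(rows)
print(columns)
```

The resulting `columns` dictionary has the same shape as the one passed to `from_dict()` in the example above.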

	To create an image or audio dataset, chain the [cast_column()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.cast_column) method with [from_dict()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.from_dict) and specify the column and feature type. For example, to create an audio dataset:

	```py
	>>> from datasets import Audio, Dataset
	>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
	```
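Returning to `from_generator()`: its memory efficiency depends on the generator producing one example at a time instead of building a full list first. The sketch below reads a text file line by line and forwards the file path through the `gen_kwargs` parameter of `from_generator()`. The tab-separated file format and the names `gen_from_file` and `build_dataset` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def gen_from_file(path):
    """Yield one example per line of a tab-separated text file, reading lazily."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            pokemon, pokemon_type = line.rstrip("\n").split("\t")
            yield {"pokemon": pokemon, "type": pokemon_type}


def build_dataset(path):
    """Build a dataset from the generator (requires the `datasets` package)."""
    from datasets import Dataset  # imported lazily so the generator can be tried on its own

    # gen_kwargs forwards keyword arguments to the generator function
    return Dataset.from_generator(gen_from_file, gen_kwargs={"path": str(path)})


sample_path = Path(tempfile.mkdtemp()) / "pokemon.tsv"
sample_path.write_text("bulbasaur\tgrass\nsquirtle\twater\n", encoding="utf-8")
print(next(gen_from_file(sample_path)))
```

Because the file is read inside the generator, only the current line needs to be held in memory while the examples are produced.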

	Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.
