	# Builder classes

	## Builders[[datasets.DatasetBuilder]]

	🤗 Datasets relies on two main classes during the dataset building process: [DatasetBuilder](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.DatasetBuilder) and [BuilderConfig](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.BuilderConfig).

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.DatasetBuilder</name><anchor>datasets.DatasetBuilder</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L210</source><parameters>[{"name": "cache_dir", "val": ": typing.Optional[str] = None"}, {"name": "dataset_name", "val": ": typing.Optional[str] = None"}, {"name": "config_name", "val": ": typing.Optional[str] = None"}, {"name": "hash", "val": ": typing.Optional[str] = None"}, {"name": "base_path", "val": ": typing.Optional[str] = None"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "token", "val": ": typing.Union[bool, str, NoneType] = None"}, {"name": "repo_id", "val": ": typing.Optional[str] = None"}, {"name": "data_files", "val": ": typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = None"}, {"name": "config_id", "val": ": typing.Optional[str] = None"}, {"name": "config_kwargs", "val": ""}]</parameters><paramsdesc>- cache_dir (`str`, optional) --
	Directory to cache data. Defaults to `"~/.cache/huggingface/datasets"`.
	- dataset_name (`str`, optional) --
	Name of the dataset, if different from the builder name. Useful for packaged builders
	like csv, imagefolder, audiofolder, etc. to reflect the difference between datasets
	that use the same packaged builder.
	- config_name (`str`, optional) --
	Name of the dataset configuration.
	It affects the data generated on disk. Different configurations will have their own subdirectories and
	versions.
	If not provided, the default configuration is used (if it exists).

	<Added version="2.3.0">

	Parameter `name` was renamed to `config_name`.

	</Added>
	- hash (`str`, optional) --
	Hash specific to the dataset builder code. Used to update the caching directory when the
	dataset builder code is updated (to avoid reusing old data).
	The typical caching directory (defined in `self._relative_data_dir`) is `name/version/hash/`.
	- base_path (`str`, optional) --
	Base path for relative paths that are used to download files.
	This can be a remote URL.
	- features ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), optional) --
	Features types to use with this dataset.
	It can be used to change the [Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features) types of a dataset, for example.
	- token (`str` or `bool`, optional) --
	String or boolean to use as a Bearer token for remote files on the
	Datasets Hub. If `True`, the token is read from `"~/.huggingface"`.
	- repo_id (`str`, optional) --
	ID of the dataset repository.
	Used to distinguish builders with the same name but not coming from the same namespace, for example "rajpurkar/squad"
	and "lhoestq/squad" repo IDs. In the latter, the builder name would be "lhoestq___squad".
	- data_files (`str` or `Sequence` or `Mapping`, optional) --
	Path(s) to source data file(s).
	For builders like "csv" or "json" that need the user to specify data files. They can be either
	local or remote files. For convenience, you can use a `DataFilesDict`.
	- data_dir (`str`, optional) --
	Path to directory containing source data file(s).
	Use only if `data_files` is not passed, in which case it is equivalent to passing
	`os.path.join(data_dir, "**")` as `data_files`.
	For builders that require manual download, it must be the path to the local directory containing the
	manually downloaded data.
	- storage_options (`dict`, optional) --
	Key/value pairs to be passed on to the dataset file-system backend, if any.
	- writer_batch_size (`int`, optional) --
	Batch size used by the ArrowWriter.
	It defines the number of samples that are kept in memory before writing them
	and also the length of the arrow chunks.
	None means that the ArrowWriter will use its default value.
	- **config_kwargs (additional keyword arguments) -- Keyword arguments to be passed to the corresponding builder
	configuration class, set on the class attribute [DatasetBuilder.BUILDER_CONFIG_CLASS](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.BuilderConfig). The builder
	configuration class is [BuilderConfig](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.BuilderConfig) or a subclass of it.</paramsdesc><paramgroups>0</paramgroups></docstring>
	Abstract base class for all datasets.

	`DatasetBuilder` has 3 key methods:

	- `DatasetBuilder.info`: Documents the dataset, including feature
	names, types, shapes, version, splits, citation, etc.
	- [DatasetBuilder.download_and_prepare()](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.DatasetBuilder.download_and_prepare): Downloads the source data
	and writes it to disk.
	- [DatasetBuilder.as_dataset()](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.DatasetBuilder.as_dataset): Generates a [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset).

	Some `DatasetBuilder`s expose multiple variants of the
	dataset by defining a [BuilderConfig](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.BuilderConfig) subclass and accepting a
	config object (or name) on construction. Configurable datasets expose a
	pre-defined set of configurations in `DatasetBuilder.builder_configs()`.
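
The `hash` and `repo_id` parameters described above both influence where prepared data ends up on disk. The following is a minimal sketch of that layout, assuming only the `name/version/hash/` pattern and the `namespace___name` naming mentioned in the parameter descriptions. `builder_name_from_repo_id` and `relative_data_dir` are hypothetical helpers written for illustration; they are not functions from 🤗 Datasets, and the real cache path may contain additional components.

```python
import posixpath


def builder_name_from_repo_id(repo_id: str) -> str:
    """Hypothetical helper: flatten "namespace/name" into "namespace___name"."""
    return repo_id.replace("/", "___")


def relative_data_dir(name: str, version: str, code_hash: str) -> str:
    """Hypothetical helper: join the parts in the documented name/version/hash order."""
    return posixpath.join(name, version, code_hash)


# The version and hash values below are made up for the example.
name = builder_name_from_repo_id("lhoestq/squad")
print(relative_data_dir(name, "0.0.0", "abc123"))
```

Because the hash is part of the path, a change in the builder code leads to a different directory, so data written by the older code is not reused.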





	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>as_dataset</name><anchor>datasets.DatasetBuilder.as_dataset</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L1030</source><parameters>[{"name": "split", "val": ": typing.Union[str, datasets.splits.Split, list[str], list[datasets.splits.Split], NoneType] = None"}, {"name": "run_post_process", "val": " = True"}, {"name": "verification_mode", "val": ": typing.Union[datasets.utils.info_utils.VerificationMode, str, NoneType] = None"}, {"name": "in_memory", "val": " = False"}]</parameters><paramsdesc>- split (`datasets.Split`) --
	Which subset of the data to return.
	- run_post_process (`bool`, defaults to `True`) --
	Whether to run post-processing dataset transforms and/or add
	indexes.
	- verification_mode ([VerificationMode](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.VerificationMode) or `str`, defaults to `BASIC_CHECKS`) --
	Verification mode determining the checks to run on the
	downloaded/processed dataset information (checksums/size/splits/...).

	<Added version="2.9.1"/>
	- in_memory (`bool`, defaults to `False`) --
	Whether to copy the data in-memory.</paramsdesc><paramgroups>0</paramgroups><retdesc>datasets.Dataset</retdesc></docstring>
	Return a Dataset for the specified split.





	<ExampleCodeBlock anchor="datasets.DatasetBuilder.as_dataset.example">

	Example:

	```py
	>>> from datasets import load_dataset_builder
	>>> builder = load_dataset_builder('cornell-movie-review-data/rotten_tomatoes')
	>>> builder.download_and_prepare()
	>>> ds = builder.as_dataset(split='train')
	>>> ds
	Dataset({
	features: ['text', 'label'],
	num_rows: 8530
	})
	```

	</ExampleCodeBlock>


	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>download_and_prepare</name><anchor>datasets.DatasetBuilder.download_and_prepare</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L694</source><parameters>[{"name": "output_dir", "val": ": typing.Optional[str] = None"}, {"name": "download_config", "val": ": typing.Optional[datasets.download.download_config.DownloadConfig] = None"}, {"name": "download_mode", "val": ": typing.Union[datasets.download.download_manager.DownloadMode, str, NoneType] = None"}, {"name": "verification_mode", "val": ": typing.Union[datasets.utils.info_utils.VerificationMode, str, NoneType] = None"}, {"name": "dl_manager", "val": ": typing.Optional[datasets.download.download_manager.DownloadManager] = None"}, {"name": "base_path", "val": ": typing.Optional[str] = None"}, {"name": "file_format", "val": ": str = 'arrow'"}, {"name": "max_shard_size", "val": ": typing.Union[str, int, NoneType] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "download_and_prepare_kwargs", "val": ""}]</parameters><paramsdesc>- output_dir (`str`, optional) --
	Output directory for the dataset.
	Defaults to this builder's `cache_dir`, which is inside `~/.cache/huggingface/datasets` by default.

	<Added version="2.5.0"/>
	- download_config (`DownloadConfig`, optional) --
	Specific download configuration parameters.
	- download_mode ([DownloadMode](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.DownloadMode) or `str`, optional) --
	Select the download/generate mode. Defaults to `REUSE_DATASET_IF_EXISTS`.
	- verification_mode ([VerificationMode](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.VerificationMode) or `str`, defaults to `BASIC_CHECKS`) --
	Verification mode determining the checks to run on the downloaded/processed dataset information (checksums/size/splits/...).

	<Added version="2.9.1"/>
	- dl_manager (`DownloadManager`, optional) --
	Specific `DownloadManager` to use.
	- base_path (`str`, optional) --
	Base path for relative paths that are used to download files. This can be a remote URL.
	If not specified, the value of the `base_path` attribute (`self.base_path`) will be used instead.
	- file_format (`str`, optional) --
	Format of the data files in which the dataset will be written.
	Supported formats: "arrow", "parquet". Defaults to the "arrow" format.
	If the format is "parquet", then image and audio data are embedded into the Parquet files instead of pointing to local files.

	<Added version="2.5.0"/>
	- max_shard_size (`Union[str, int]`, optional) --
	Maximum number of bytes written per shard, default is "500MB".
	The size is based on uncompressed data size, so in practice your shard files may be smaller than
	`max_shard_size` thanks to Parquet compression for example.

	<Added version="2.5.0"/>
	- num_proc (`int`, optional, defaults to `None`) --
	Number of processes when downloading and generating the dataset locally.
	Multiprocessing is disabled by default.

	<Added version="2.7.0"/>
	- storage_options (`dict`, optional) --
	Key/value pairs to be passed on to the caching file-system backend, if any.

	<Added version="2.5.0"/>
	- **download_and_prepare_kwargs (additional keyword arguments) -- Keyword arguments.</paramsdesc><paramgroups>0</paramgroups></docstring>
	Downloads and prepares dataset for reading.



	Example:

	<ExampleCodeBlock anchor="datasets.DatasetBuilder.download_and_prepare.example">

	Download and prepare the dataset as Arrow files that can be loaded as a Dataset using `builder.as_dataset()`:

	```py
	>>> from datasets import load_dataset_builder
	>>> builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
	>>> builder.download_and_prepare()
	```

	</ExampleCodeBlock>

	<ExampleCodeBlock anchor="datasets.DatasetBuilder.download_and_prepare.example-2">

	Download and prepare the dataset as sharded Parquet files locally:

	```py
	>>> from datasets import load_dataset_builder
	>>> builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
	>>> builder.download_and_prepare("./output_dir", file_format="parquet")
	```

	</ExampleCodeBlock>

	<ExampleCodeBlock anchor="datasets.DatasetBuilder.download_and_prepare.example-3">

	Download and prepare the dataset as sharded Parquet files in a cloud storage:

	```py
	>>> from datasets import load_dataset_builder
	>>> storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key}
	>>> builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
	>>> builder.download_and_prepare("s3://my-bucket/my_rotten_tomatoes", storage_options=storage_options, file_format="parquet")
	```

	</ExampleCodeBlock>
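
The `max_shard_size` parameter above accepts either an integer number of bytes or a string such as `"500MB"`. As a rough illustration of how that limit translates into a shard count, here is a minimal sketch. It assumes decimal units (1 MB = 10**6 bytes), and `size_to_bytes` and `estimate_num_shards` are hypothetical helpers, not part of 🤗 Datasets; the library's own parsing and sharding logic may differ.

```python
import math

# Assumed decimal units (1 MB = 10**6 bytes); the library's own parsing may differ.
_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}


def size_to_bytes(size) -> int:
    """Hypothetical helper: turn 1024 or "500MB" into a byte count."""
    if isinstance(size, int):
        return size
    text = size.strip().upper()
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    raise ValueError(f"unrecognized size: {size!r}")


def estimate_num_shards(total_uncompressed_bytes: int, max_shard_size="500MB") -> int:
    """Number of shards needed so that no shard exceeds max_shard_size (at least 1)."""
    return max(1, math.ceil(total_uncompressed_bytes / size_to_bytes(max_shard_size)))


# 1.2 GB of uncompressed data with the default limit needs three shards.
print(estimate_num_shards(1_200_000_000))
```

Since the limit applies to uncompressed data, the files on disk can be smaller than this estimate suggests when a compressed format such as Parquet is used.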


	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>get_imported_module_dir</name><anchor>datasets.DatasetBuilder.get_imported_module_dir</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L686</source><parameters>[]</parameters></docstring>
	Return the path of the module of this class or subclass.

	</div></div>

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.GeneratorBasedBuilder</name><anchor>datasets.GeneratorBasedBuilder</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L1359</source><parameters>[{"name": "cache_dir", "val": ": typing.Optional[str] = None"}, {"name": "dataset_name", "val": ": typing.Optional[str] = None"}, {"name": "config_name", "val": ": typing.Optional[str] = None"}, {"name": "hash", "val": ": typing.Optional[str] = None"}, {"name": "base_path", "val": ": typing.Optional[str] = None"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "token", "val": ": typing.Union[bool, str, NoneType] = None"}, {"name": "repo_id", "val": ": typing.Optional[str] = None"}, {"name": "data_files", "val": ": typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = None"}, {"name": "config_id", "val": ": typing.Optional[str] = None"}, {"name": "**config_kwargs", "val": ""}]</parameters></docstring>
	Base class for datasets with data generation based on dict generators.

	`GeneratorBasedBuilder` is a convenience class that abstracts away much
	of the data writing and reading of `DatasetBuilder`. It expects subclasses to
	implement generators of feature dictionaries across the dataset splits
	(`_split_generators`). See the method docstrings for details.


	</div>

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.ArrowBasedBuilder</name><anchor>datasets.ArrowBasedBuilder</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L1624</source><parameters>[{"name": "cache_dir", "val": ": typing.Optional[str] = None"}, {"name": "dataset_name", "val": ": typing.Optional[str] = None"}, {"name": "config_name", "val": ": typing.Optional[str] = None"}, {"name": "hash", "val": ": typing.Optional[str] = None"}, {"name": "base_path", "val": ": typing.Optional[str] = None"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "token", "val": ": typing.Union[bool, str, NoneType] = None"}, {"name": "repo_id", "val": ": typing.Optional[str] = None"}, {"name": "data_files", "val": ": typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = None"}, {"name": "config_id", "val": ": typing.Optional[str] = None"}, {"name": "**config_kwargs", "val": ""}]</parameters></docstring>
	Base class for datasets with data generation based on Arrow loading functions (CSV/JSON/Parquet).

	</div>

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.BuilderConfig</name><anchor>datasets.BuilderConfig</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L97</source><parameters>[{"name": "name", "val": ": str = 'default'"}, {"name": "version", "val": ": typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "data_files", "val": ": typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None"}, {"name": "description", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- name (`str`, defaults to `default`) --
	The name of the configuration.
	- version (`Version` or `str`, defaults to `0.0.0`) --
	The version of the configuration.
	- data_dir (`str`, optional) --
	Path to the directory containing the source data.
	- data_files (`str` or `Sequence` or `Mapping`, optional) --
	Path(s) to source data file(s).
	- description (`str`, optional) --
	A human description of the configuration.</paramsdesc><paramgroups>0</paramgroups></docstring>
	Base class for `DatasetBuilder` data configuration.

	`DatasetBuilder` subclasses with data configuration options should subclass
	`BuilderConfig` and add their own properties.





	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>create_config_id</name><anchor>datasets.BuilderConfig.create_config_id</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L140</source><parameters>[{"name": "config_kwargs", "val": ": dict"}, {"name": "custom_features", "val": ": typing.Optional[datasets.features.features.Features] = None"}]</parameters></docstring>

	The config id is used to build the cache directory.
	By default it is equal to the config name.
	However, the name of a config is not sufficient to have a unique identifier for the dataset being generated
	since it doesn't take into account:
	- the config kwargs that can be used to overwrite attributes
	- the custom features used to write the dataset
	- the data_files for json/text/csv/pandas datasets

	Therefore the config id is just the config name with an optional suffix based on these.
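
The idea of "config name plus an optional suffix" can be sketched as follows. This is an illustration only: `sketch_config_id` is a hypothetical helper, and the choice of JSON serialization and SHA-256 is an assumption made for the example, not the scheme used by `create_config_id`.

```python
import hashlib
import json


def sketch_config_id(config_name: str, config_kwargs: dict) -> str:
    """Hypothetical helper: config name plus a suffix derived from the overrides."""
    if not config_kwargs:
        return config_name  # no overrides: the id equals the name
    # Serialize deterministically so the same overrides always give the same suffix.
    payload = json.dumps(config_kwargs, sort_keys=True, default=str)
    suffix = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"{config_name}-{suffix}"


print(sketch_config_id("default", {}))
print(sketch_config_id("default", {"data_files": ["a.csv"]}))
```

Two calls with the same overrides map to the same id (and therefore the same cache directory), while different overrides produce different ids.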


	</div></div>

	## Download[[datasets.DownloadManager]]

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.DownloadManager</name><anchor>datasets.DownloadManager</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L71</source><parameters>[{"name": "dataset_name", "val": ": typing.Optional[str] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "download_config", "val": ": typing.Optional[datasets.download.download_config.DownloadConfig] = None"}, {"name": "base_path", "val": ": typing.Optional[str] = None"}, {"name": "record_checksums", "val": " = True"}]</parameters></docstring>



	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>download</name><anchor>datasets.DownloadManager.download</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L131</source><parameters>[{"name": "url_or_urls", "val": ""}]</parameters><paramsdesc>- url_or_urls (`str` or `list` or `dict`) --
	URL or `list` or `dict` of URLs to download. Each URL is a `str`.</paramsdesc><paramgroups>0</paramgroups><rettype>`str` or `list` or `dict`</rettype><retdesc>The downloaded paths matching the given input `url_or_urls`.</retdesc></docstring>
	Download given URL(s).

	By default, only one process is used for download. Pass a custom `download_config.num_proc` to change this behavior.







	<ExampleCodeBlock anchor="datasets.DownloadManager.download.example">

	Example:

	```py
	>>> downloaded_files = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz')
	```

	</ExampleCodeBlock>
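
The return value is described as "matching the given input": a single URL gives a single path, a list gives a list, and a dict gives a dict with the same keys. A minimal sketch of that shape-preserving behavior is shown below. `map_structure` and `fake_download` are hypothetical stand-ins written for this example; nothing is downloaded, and this is not the library's implementation.

```python
def map_structure(fn, url_or_urls):
    """Apply fn to every string while preserving the str/list/dict shape."""
    if isinstance(url_or_urls, str):
        return fn(url_or_urls)
    if isinstance(url_or_urls, dict):
        return {key: map_structure(fn, value) for key, value in url_or_urls.items()}
    return [map_structure(fn, item) for item in url_or_urls]


def fake_download(url: str) -> str:
    """Stand-in for a real download: map a URL to a made-up local cache path."""
    return "/cache/" + url.rsplit("/", 1)[-1]


# example.com URLs are placeholders.
urls = {
    "train": "https://example.com/train.csv",
    "test": ["https://example.com/test.csv"],
}
print(map_structure(fake_download, urls))
```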


	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>download_and_extract</name><anchor>datasets.DownloadManager.download_and_extract</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L310</source><parameters>[{"name": "url_or_urls", "val": ""}]</parameters><paramsdesc>- url_or_urls (`str` or `list` or `dict`) --
	URL or `list` or `dict` of URLs to download and extract. Each URL is a `str`.</paramsdesc><paramgroups>0</paramgroups><rettype>extracted_path(s)</rettype><retdesc>`str`, extracted paths of given URL(s).</retdesc></docstring>
	Download and extract given `url_or_urls`.

	<ExampleCodeBlock anchor="datasets.DownloadManager.download_and_extract.example">

	Is roughly equivalent to:

	```
	extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls))
	```

	</ExampleCodeBlock>








	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>extract</name><anchor>datasets.DownloadManager.extract</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L278</source><parameters>[{"name": "path_or_paths", "val": ""}]</parameters><paramsdesc>- path_or_paths (path or `list` or `dict`) --
	Path of file to extract. Each path is a `str`.</paramsdesc><paramgroups>0</paramgroups><rettype>extracted_path(s)</rettype><retdesc>`str`, The extracted paths matching the given input
	path_or_paths.</retdesc></docstring>
	Extract given path(s).







	<ExampleCodeBlock anchor="datasets.DownloadManager.extract.example">

	Example:

	```py
	>>> downloaded_files = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz')
	>>> extracted_files = dl_manager.extract(downloaded_files)
	```

	</ExampleCodeBlock>


	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>iter_archive</name><anchor>datasets.DownloadManager.iter_archive</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L234</source><parameters>[{"name": "path_or_buf", "val": ": typing.Union[str, _io.BufferedReader]"}]</parameters><paramsdesc>- path_or_buf (`str` or `io.BufferedReader`) --
	Archive path or archive binary file object.</paramsdesc><paramgroups>0</paramgroups><yieldtype>`tuple[str, io.BufferedReader]`</yieldtype><yielddesc>2-tuple (path_within_archive, file_object).
	File object is opened in binary mode.</yielddesc></docstring>
	Iterate over files within an archive.







	<ExampleCodeBlock anchor="datasets.DownloadManager.iter_archive.example">

	Example:

	```py
	>>> archive = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz')
	>>> files = dl_manager.iter_archive(archive)
	```

	</ExampleCodeBlock>
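
To make the yielded `(path_within_archive, file_object)` tuples more concrete, here is a standard-library sketch that iterates over a small in-memory tar archive. It is not the implementation of `iter_archive`; `iter_tar_members` is a hypothetical helper, and the archive contents are made up so that the example runs without network access.

```python
import io
import tarfile


def iter_tar_members(fileobj):
    """Yield (path_within_archive, file_object) for each regular file in a tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                # extractfile returns a binary file object for the member.
                yield member.name, tar.extractfile(member)


# Build a tiny in-memory archive with one made-up file.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    data = b"a positive review\n"
    info = tarfile.TarInfo(name="rt-polaritydata/pos.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buffer.seek(0)

for path, f in iter_tar_members(buffer):
    print(path, f.read())
```

Each file object should be read inside the loop, because it is tied to the archive that is still open during iteration.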


	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>iter_files</name><anchor>datasets.DownloadManager.iter_files</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L259</source><parameters>[{"name": "paths", "val": ": typing.Union[str, list[str]]"}]</parameters><paramsdesc>- paths (`str` or `list` of `str`) --
	Root paths.</paramsdesc><paramgroups>0</paramgroups><yieldtype>`str`</yieldtype><yielddesc>File path.</yielddesc></docstring>
	Iterate over file paths.







	<ExampleCodeBlock anchor="datasets.DownloadManager.iter_files.example">

	Example:

	```py
	>>> files = dl_manager.download_and_extract('https://huggingface.co/datasets/beans/resolve/main/data/train.zip')
	>>> files = dl_manager.iter_files(files)
	```

	</ExampleCodeBlock>
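
Conceptually, iterating over file paths means walking each root path and yielding every file found beneath it. The sketch below shows that idea with `os.walk` on a temporary directory. `iter_file_paths` is a hypothetical helper, not the library's `iter_files`, which may apply additional filtering or ordering rules.

```python
import os
import tempfile


def iter_file_paths(paths):
    """Yield every file path under the given root path(s), in sorted order."""
    if isinstance(paths, str):
        paths = [paths]
    for root in paths:
        if os.path.isfile(root):
            yield root
            continue
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # make traversal order deterministic
            for filename in sorted(filenames):
                yield os.path.join(dirpath, filename)


# Create a small made-up directory tree and list its files.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "train", "healthy"))
    for name in ("a.jpg", "b.jpg"):
        open(os.path.join(tmp, "train", "healthy", name), "wb").close()
    found = [os.path.relpath(p, tmp) for p in iter_file_paths(tmp)]
    print(found)
```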


	</div></div>

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.StreamingDownloadManager</name><anchor>datasets.StreamingDownloadManager</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/streaming_download_manager.py#L47</source><parameters>[{"name": "dataset_name", "val": ": typing.Optional[str] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "download_config", "val": ": typing.Optional[datasets.download.download_config.DownloadConfig] = None"}, {"name": "base_path", "val": ": typing.Optional[str] = None"}]</parameters></docstring>

	Download manager that uses the "::" separator to navigate through (possibly remote) compressed archives.
	Contrary to the regular `DownloadManager`, the `download` and `extract` methods do not actually download or extract
	data. Instead, they return the path or URL that can be opened with the `xopen` function, which extends the
	built-in `open` function to stream data from remote files.
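
The "::" separator chains an extraction protocol in front of the location of the archive. The sketch below shows how such chained strings can be composed and split. The exact string layout is an assumption based on fsspec-style URL chaining, `add_extraction_protocol` and `path_inside_archive` are hypothetical helpers, and the `example.com` URL is a placeholder.

```python
def add_extraction_protocol(url: str, protocol: str = "zip") -> str:
    """Hypothetical helper: mark a (possibly remote) archive as lazily extractable."""
    return f"{protocol}://::{url}"


def path_inside_archive(chained_url: str, inner_path: str) -> str:
    """Hypothetical helper: point at one file inside a chained archive URL."""
    protocol_part, _, rest = chained_url.partition("::")
    return f"{protocol_part}{inner_path}::{rest}"


chained = add_extraction_protocol("https://example.com/data/train.zip")
print(chained)
print(path_inside_archive(chained, "train/healthy/a.jpg"))
print(chained.split("::"))
```

No data is transferred while these strings are built; reading only happens once the resulting path is opened.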



	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>download</name><anchor>datasets.StreamingDownloadManager.download</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/streaming_download_manager.py#L75</source><parameters>[{"name": "url_or_urls", "val": ""}]</parameters><paramsdesc>- url_or_urls (`str` or `list` or `dict`) --
	URL(s) of files to stream data from. Each url is a `str`.</paramsdesc><paramgroups>0</paramgroups><rettype>url(s)</rettype><retdesc>(`str` or `list` or `dict`), URL(s) to stream data from matching the given input url_or_urls.</retdesc></docstring>
	Normalize URL(s) of files to stream data from.
	This is the lazy version of `DownloadManager.download` for streaming.







	<ExampleCodeBlock anchor="datasets.StreamingDownloadManager.download.example">

	Example:

	```py
	>>> downloaded_files = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz')
	```

	</ExampleCodeBlock>


	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>download_and_extract</name><anchor>datasets.StreamingDownloadManager.download_and_extract</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/streaming_download_manager.py#L151</source><parameters>[{"name": "url_or_urls", "val": ""}]</parameters><paramsdesc>- url_or_urls (`str` or `list` or `dict`) --
	URL(s) to stream data from. Each URL is a `str`.</paramsdesc><paramgroups>0</paramgroups><rettype>url(s)</rettype><retdesc>(`str` or `list` or `dict`), URL(s) to stream data from matching the given input `url_or_urls`.</retdesc></docstring>
	Prepare given `url_or_urls` for streaming (add extraction protocol).

	This is the lazy version of `DownloadManager.download_and_extract` for streaming.

	<ExampleCodeBlock anchor="datasets.StreamingDownloadManager.download_and_extract.example">

	Is equivalent to:

	```
	urls = dl_manager.extract(dl_manager.download(url_or_urls))
	```

	</ExampleCodeBlock>








	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>extract</name><anchor>datasets.StreamingDownloadManager.extract</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/streaming_download_manager.py#L102</source><parameters>[{"name": "url_or_urls", "val": ""}]</parameters><paramsdesc>- url_or_urls (`str` or `list` or `dict`) --
	URL(s) of files to stream data from. Each url is a `str`.</paramsdesc><paramgroups>0</paramgroups><rettype>url(s)</rettype><retdesc>(`str` or `list` or `dict`), URL(s) to stream data from matching the given input `url_or_urls`.</retdesc></docstring>
	Add extraction protocol for given url(s) for streaming.

	This is the lazy version of `DownloadManager.extract` for streaming.







	<ExampleCodeBlock anchor="datasets.StreamingDownloadManager.extract.example">

	Example:

	```py
	>>> downloaded_files = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz')
	>>> extracted_files = dl_manager.extract(downloaded_files)
	```

	</ExampleCodeBlock>


	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>iter_archive</name><anchor>datasets.StreamingDownloadManager.iter_archive</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/streaming_download_manager.py#L171</source><parameters>[{"name": "urlpath_or_buf", "val": ": typing.Union[str, _io.BufferedReader]"}]</parameters><paramsdesc>- urlpath_or_buf (`str` or `io.BufferedReader`) --
	Archive path or archive binary file object.</paramsdesc><paramgroups>0</paramgroups><yieldtype>`tuple[str, io.BufferedReader]`</yieldtype><yielddesc>2-tuple (path_within_archive, file_object).
	File object is opened in binary mode.</yielddesc></docstring>
	Iterate over files within an archive.







	<ExampleCodeBlock anchor="datasets.StreamingDownloadManager.iter_archive.example">

	Example:

	```py
	>>> archive = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz')
	>>> files = dl_manager.iter_archive(archive)
	```

	</ExampleCodeBlock>


	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>iter_files</name><anchor>datasets.StreamingDownloadManager.iter_files</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/streaming_download_manager.py#L196</source><parameters>[{"name": "urlpaths", "val": ": typing.Union[str, list[str]]"}]</parameters><paramsdesc>- urlpaths (`str` or `list` of `str`) --
	Root paths.</paramsdesc><paramgroups>0</paramgroups><yieldtype>str</yieldtype><yielddesc>File URL path.</yielddesc></docstring>
	Iterate over files.







	<ExampleCodeBlock anchor="datasets.StreamingDownloadManager.iter_files.example">

	Example:

	```py
	>>> files = dl_manager.download_and_extract('https://huggingface.co/datasets/beans/resolve/main/data/train.zip')
	>>> files = dl_manager.iter_files(files)
	```

	</ExampleCodeBlock>


	</div></div>

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.DownloadConfig</name><anchor>datasets.DownloadConfig</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_config.py#L10</source><parameters>[{"name": "cache_dir", "val": ": typing.Union[str, pathlib.Path, NoneType] = None"}, {"name": "force_download", "val": ": bool = False"}, {"name": "resume_download", "val": ": bool = False"}, {"name": "local_files_only", "val": ": bool = False"}, {"name": "proxies", "val": ": typing.Optional[dict] = None"}, {"name": "user_agent", "val": ": typing.Optional[str] = None"}, {"name": "extract_compressed_file", "val": ": bool = False"}, {"name": "force_extract", "val": ": bool = False"}, {"name": "delete_extracted", "val": ": bool = False"}, {"name": "extract_on_the_fly", "val": ": bool = False"}, {"name": "use_etag", "val": ": bool = True"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "max_retries", "val": ": int = 1"}, {"name": "token", "val": ": typing.Union[str, bool, NoneType] = None"}, {"name": "storage_options", "val": ": dict = <factory>"}, {"name": "download_desc", "val": ": typing.Optional[str] = None"}, {"name": "disable_tqdm", "val": ": bool = False"}]</parameters><paramsdesc>- cache_dir (`str` or `Path`, optional) --
	Specify a cache directory to save the file to (overrides the
	default cache dir).
	- force_download (`bool`, defaults to `False`) --
	If `True`, re-download the file even if it's already cached in
	the cache dir.
	- resume_download (`bool`, defaults to `False`) --
	If `True`, resume the download if an incompletely received file is
	found.
	- proxies (`dict`, optional) --
	- user_agent (`str`, optional) --
	Optional string or dict that will be appended to the user-agent on remote
	requests.
	- extract_compressed_file (`bool`, defaults to `False`) --
	If `True` and the path points to a zip or tar file,
	extract the compressed file in a folder alongside the archive.
	- force_extract (`bool`, defaults to `False`) --
	If `True` when `extract_compressed_file` is `True` and the archive
	was already extracted, re-extract the archive and overwrite the folder where it was extracted.
	- delete_extracted (`bool`, defaults to `False`) --
	Whether to delete (or keep) the extracted files.
	- extract_on_the_fly (`bool`, defaults to `False`) --
	If `True`, extract compressed files while they are being read.
	- use_etag (`bool`, defaults to `True`) --
	Whether to use the ETag HTTP response header to validate the cached files.
	- num_proc (`int`, optional) --
	The number of processes to launch to download the files in parallel.
	- max_retries (`int`, defaults to `1`) --
	The number of times to retry an HTTP request if it fails.
	- token (`str` or `bool`, optional) --
	Optional string or boolean to use as Bearer token
	for remote files on the Datasets Hub. If `True`, or not specified, will get token from `~/.huggingface`.
	- storage_options (`dict`, optional) --
	Key/value pairs to be passed on to the dataset file-system backend, if any.
	- download_desc (`str`, optional) --
	A description to be displayed alongside the progress bar while downloading the files.
	- disable_tqdm (`bool`, defaults to `False`) --
	Whether to disable the individual files download progress bar.</paramsdesc><paramgroups>0</paramgroups></docstring>
	Configuration for our cached path manager.




	</div>

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.DownloadMode</name><anchor>datasets.DownloadMode</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L50</source><parameters>[{"name": "value", "val": ""}, {"name": "names", "val": " = None"}, {"name": "module", "val": " = None"}, {"name": "qualname", "val": " = None"}, {"name": "type", "val": " = None"}, {"name": "start", "val": " = 1"}]</parameters></docstring>
	`Enum` for how to treat pre-existing downloads and data.

	The default mode is `REUSE_DATASET_IF_EXISTS`, which will reuse both
	raw downloads and the prepared dataset if they exist.

	The generation modes:

	| Mode                                | Downloads | Dataset |
	|-------------------------------------|-----------|---------|
	| `REUSE_DATASET_IF_EXISTS` (default) | Reuse     | Reuse   |
	| `REUSE_CACHE_IF_EXISTS`             | Reuse     | Fresh   |
	| `FORCE_REDOWNLOAD`                  | Fresh     | Fresh   |
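
The table can be read as a lookup from a mode to two independent decisions: whether raw downloads are reused and whether the prepared dataset is reused. The sketch below mirrors it with a standard-library `Enum`. `SketchDownloadMode` and `plan` are hypothetical names used only for illustration, and the string values are assumptions rather than the library's definitions.

```python
from enum import Enum


class SketchDownloadMode(Enum):
    """Hypothetical stand-in for the modes listed in the table."""

    REUSE_DATASET_IF_EXISTS = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD = "force_redownload"


def plan(mode: SketchDownloadMode) -> dict:
    """Return whether raw downloads and the prepared dataset are reused or rebuilt."""
    reuse_downloads = mode is not SketchDownloadMode.FORCE_REDOWNLOAD
    reuse_dataset = mode is SketchDownloadMode.REUSE_DATASET_IF_EXISTS
    return {
        "downloads": "Reuse" if reuse_downloads else "Fresh",
        "dataset": "Reuse" if reuse_dataset else "Fresh",
    }
```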



	</div>

	## Verification[[datasets.VerificationMode]]

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.VerificationMode</name><anchor>datasets.VerificationMode</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/utils/info_utils.py#L22</source><parameters>[{"name": "value", "val": ""}, {"name": "names", "val": " = None"}, {"name": "module", "val": " = None"}, {"name": "qualname", "val": " = None"}, {"name": "type", "val": " = None"}, {"name": "start", "val": " = 1"}]</parameters></docstring>
	`Enum` that specifies which verification checks to run.

	The default mode is `BASIC_CHECKS`, which will perform only rudimentary checks to avoid slowdowns
	when generating/downloading a dataset for the first time.

	The verification modes:

	| Mode                     | Verification checks |
	|--------------------------|---------------------|
	| `ALL_CHECKS`             | Split checks, uniqueness of the keys yielded in case of the GeneratorBuilder, and the validity (number of files, checksums, etc.) of downloaded files |
	| `BASIC_CHECKS` (default) | Same as `ALL_CHECKS` but without checking downloaded files |
	| `NO_CHECKS`              | None |



	</div>

	## Splits[[datasets.SplitGenerator]]

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.SplitGenerator</name><anchor>datasets.SplitGenerator</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/splits.py#L602</source><parameters>[{"name": "name", "val": ": str"}, {"name": "gen_kwargs", "val": ": dict = <factory>"}]</parameters><paramsdesc>- name (`str`) --
	Name of the `Split` for which the generator will
	create the examples.
	- **gen_kwargs (additional keyword arguments) --
	Keyword arguments to forward to the `DatasetBuilder._generate_examples` method
	of the builder.</paramsdesc><paramgroups>0</paramgroups></docstring>
	Defines the split information for the generator.

	This should be used as the return value of
	`GeneratorBasedBuilder._split_generators`.
	See `GeneratorBasedBuilder._split_generators` for more information and a usage
	example.



	<ExampleCodeBlock anchor="datasets.SplitGenerator.example">

	Example:

	```py
	>>> datasets.SplitGenerator(
	... name=datasets.Split.TRAIN,
	... gen_kwargs={"split_key": "train", "files": dl_manager.download_and_extract(url)},
	... )
	```

	</ExampleCodeBlock>
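
To make the role of `gen_kwargs` concrete, here is a minimal standard-library sketch of the forwarding pattern: the dictionary is unpacked into the keyword arguments of the example generator. `SketchSplitGenerator` and `generate_examples` are hypothetical stand-ins, not the library's classes, and the file names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class SketchSplitGenerator:
    """Hypothetical stand-in holding a split name and the generator kwargs."""

    name: str
    gen_kwargs: dict = field(default_factory=dict)


def generate_examples(split_key, files):
    """Hypothetical generator: yields (key, example) pairs for each file."""
    for idx, path in enumerate(files):
        yield f"{split_key}-{idx}", {"path": path}


split_gen = SketchSplitGenerator(
    name="train",
    gen_kwargs={"split_key": "train", "files": ["a.txt", "b.txt"]},
)
# The keys of gen_kwargs must match the generator's parameter names.
examples = list(generate_examples(**split_gen.gen_kwargs))
```

The practical consequence is that every key placed in `gen_kwargs` has to correspond to a parameter of the method that generates the examples.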


	</div>

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.Split</name><anchor>datasets.Split</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/splits.py#L406</source><parameters>[{"name": "name", "val": ""}]</parameters></docstring>
	`Enum` for dataset splits.

	Datasets are typically split into different subsets to be used at various
	stages of training and evaluation.

	- `TRAIN`: the training data.
	- `VALIDATION`: the validation data. If present, this is typically used as
	evaluation data while iterating on a model (e.g. changing hyperparameters,
	model architecture, etc.).
	- `TEST`: the testing data. This is the data to report metrics on. Typically
	you do not want to use this during model iteration as you may overfit to it.
	- `ALL`: the union of all defined dataset splits.

	All splits, including compositions, inherit from `datasets.SplitBase`.

	See the [guide](../load_hub#splits) on splits for more information.

	<ExampleCodeBlock anchor="datasets.Split.example">

	Example:

	```py
	>>> datasets.SplitGenerator(
	... name=datasets.Split.TRAIN,
	... gen_kwargs={"split_key": "train", "files": dl_manager.download_and_extract(url)},
	... ),
	... datasets.SplitGenerator(
	... name=datasets.Split.VALIDATION,
	... gen_kwargs={"split_key": "validation", "files": dl_manager.download_and_extract(url)},
	... ),
	... datasets.SplitGenerator(
	... name=datasets.Split.TEST,
	... gen_kwargs={"split_key": "test", "files": dl_manager.download_and_extract(url)},
	... )
	```

	</ExampleCodeBlock>


	</div>

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.NamedSplit</name><anchor>datasets.NamedSplit</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/splits.py#L314</source><parameters>[{"name": "name", "val": ""}]</parameters></docstring>
	Descriptor corresponding to a named split (train, test, ...).

	Example:
	<ExampleCodeBlock anchor="datasets.NamedSplit.example">

	Each descriptor can be composed with others using addition or slicing:

	```py
	split = datasets.Split.TRAIN.subsplit(datasets.percent[0:25]) + datasets.Split.TEST
	```

	</ExampleCodeBlock>

	The resulting split will correspond to 25% of the train split merged with
	100% of the test split.

	<ExampleCodeBlock anchor="datasets.NamedSplit.example-2">

	A split cannot be added twice, so the following will fail:

	```py
	split = (
	    datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
	    datasets.Split.TRAIN.subsplit(datasets.percent[75:])
	)  # Error
	split = datasets.Split.TEST + datasets.Split.ALL  # Error
	```

	</ExampleCodeBlock>

	<ExampleCodeBlock anchor="datasets.NamedSplit.example-3">

	Slices can be applied only once, so the following are valid:

	```py
	split = (
	    datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
	    datasets.Split.TEST.subsplit(datasets.percent[:50])
	)
	split = (datasets.Split.TRAIN + datasets.Split.TEST).subsplit(datasets.percent[:50])
	```

	</ExampleCodeBlock>

	<ExampleCodeBlock anchor="datasets.NamedSplit.example-4">

	But this is not valid:

	```py
	train = datasets.Split.TRAIN
	test = datasets.Split.TEST
	split = train.subsplit(datasets.percent[:25]).subsplit(datasets.percent[:25])
	split = (train.subsplit(datasets.percent[:25]) + test).subsplit(datasets.percent[:50])
	```

	</ExampleCodeBlock>


	</div>

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.NamedSplitAll</name><anchor>datasets.NamedSplitAll</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/splits.py#L391</source><parameters>[]</parameters></docstring>
	Split corresponding to the union of all defined dataset splits.

	</div>

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.ReadInstruction</name><anchor>datasets.ReadInstruction</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_reader.py#L456</source><parameters>[{"name": "split_name", "val": ""}, {"name": "rounding", "val": " = None"}, {"name": "from_", "val": " = None"}, {"name": "to", "val": " = None"}, {"name": "unit", "val": " = None"}]</parameters></docstring>
	Reading instruction for a dataset.

	<ExampleCodeBlock anchor="datasets.ReadInstruction.example">

	Examples:

	```python
	# The following lines are equivalent:
	ds = datasets.load_dataset('mnist', split='test[:33%]')
	ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec('test[:33%]'))
	ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction('test', to=33, unit='%'))
	ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
	    'test', from_=0, to=33, unit='%'))

	# The following lines are equivalent:
	ds = datasets.load_dataset('mnist', split='test[:33%]+train[1:-1]')
	ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
	    'test[:33%]+train[1:-1]'))
	ds = datasets.load_dataset('mnist', split=(
	    datasets.ReadInstruction('test', to=33, unit='%') +
	    datasets.ReadInstruction('train', from_=1, to=-1, unit='abs')))

	# The following lines are equivalent:
	ds = datasets.load_dataset('mnist', split='test[:33%](pct1_dropremainder)')
	ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
	    'test[:33%](pct1_dropremainder)'))
	ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
	    'test', from_=0, to=33, unit='%', rounding="pct1_dropremainder"))

	# 10-fold validation:
	tests = datasets.load_dataset(
	    'mnist',
	    split=[datasets.ReadInstruction('train', from_=k, to=k+10, unit='%')
	           for k in range(0, 100, 10)])
	trains = datasets.load_dataset(
	    'mnist',
	    split=[datasets.ReadInstruction('train', to=k, unit='%') + datasets.ReadInstruction('train', from_=k+10, unit='%')
	           for k in range(0, 100, 10)])
	```

	</ExampleCodeBlock>
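
The 10-fold pattern can also be expressed with string specs. The sketch below only builds the spec strings, one validation slice and its complementary training slice per fold. Passing them to `load_dataset(..., split=...)` is left out so that the snippet runs without downloading anything.

```python
# Build 10 validation slices and their complementary training slices.
val_specs = [f"train[{k}%:{k + 10}%]" for k in range(0, 100, 10)]
train_specs = [f"train[:{k}%]+train[{k + 10}%:]" for k in range(0, 100, 10)]

# Pair each validation slice with the training data that excludes it.
folds = list(zip(train_specs, val_specs))
```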



	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>from_spec</name><anchor>datasets.ReadInstruction.from_spec</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_reader.py#L536</source><parameters>[{"name": "spec", "val": ""}]</parameters><paramsdesc>- spec (`str`) --
	Split(s) + optional slice(s) to read + optional rounding
	if percents are used as the slicing unit. A slice can be specified
	using absolute numbers (`int`) or percentages (`int`).</paramsdesc><paramgroups>0</paramgroups><retdesc>ReadInstruction instance.</retdesc></docstring>
	Creates a `ReadInstruction` instance out of a string spec.



	<ExampleCodeBlock anchor="datasets.ReadInstruction.from_spec.example">

	Examples:

	```
	test: test split.
	test + validation: test split + validation split.
	test[10:]: test split, minus its first 10 records.
	test[:10%]: first 10% records of test split.
	test[:20%](pct1_dropremainder): first 20% records, rounded with the pct1_dropremainder rounding.
	test[:-5%]+train[40%:60%]: first 95% of test + middle 20% of train.
	```

	</ExampleCodeBlock>
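
To show how the pieces of a spec string fit together, the following sketch parses the forms listed above with a regular expression. It is a simplified illustration under stated assumptions, not the library's parser: `parse_spec` is a hypothetical helper, the pattern is deliberately permissive, and a slice is treated as a percentage whenever either boundary carries a `%`.

```python
import re

# Simplified pattern: split name, optional [start:stop] slice (each boundary
# may carry a %), and an optional (rounding) suffix.
_PART = re.compile(
    r"^\s*(?P<split>[\w.\-]+)"
    r"(\[(?P<start>-?\d+)?(?P<start_pct>%)?:(?P<stop>-?\d+)?(?P<stop_pct>%)?\])?"
    r"(\((?P<rounding>[^)]*)\))?\s*$"
)


def parse_spec(spec: str) -> list[dict]:
    """Parse a split spec into one dict per `+`-separated part (hypothetical helper)."""
    parsed = []
    for part in spec.split("+"):
        match = _PART.match(part)
        if match is None:
            raise ValueError(f"Unrecognized split spec: {part!r}")
        is_pct = bool(match["start_pct"] or match["stop_pct"])
        parsed.append({
            "split": match["split"],
            "from_": int(match["start"]) if match["start"] else None,
            "to": int(match["stop"]) if match["stop"] else None,
            "unit": "%" if is_pct else "abs",
            "rounding": match["rounding"],
        })
    return parsed
```

Each resulting dict uses the same field names as the `ReadInstruction` constructor parameters (`from_`, `to`, `unit`, `rounding`), which makes the correspondence between the string and keyword forms easy to see.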




	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>to_absolute</name><anchor>datasets.ReadInstruction.to_absolute</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_reader.py#L608</source><parameters>[{"name": "name2len", "val": ""}]</parameters><paramsdesc>- name2len (`dict`) --
	Mapping of split names to their number of examples.</paramsdesc><paramgroups>0</paramgroups><retdesc>list of _AbsoluteInstruction instances (corresponds to the + in spec).</retdesc></docstring>
	Translate instruction into a list of absolute instructions.

	Those absolute instructions are then to be added together.
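
As an illustration of what resolving an instruction to absolute boundaries involves, the sketch below converts a single slice into `(start, stop)` indices given a `name2len` mapping. `to_absolute_bounds` is a hypothetical helper, and the rounding rule (percent boundaries rounded to the closest integer) is an assumption made for the example rather than a statement about the library's behavior.

```python
def to_absolute_bounds(split, from_, to, unit, name2len):
    """Resolve one slice to absolute (start, stop) indices (hypothetical helper)."""
    num_examples = name2len[split]
    if unit == "%":
        # Assumption: round each percent boundary to the nearest example index.
        from_ = None if from_ is None else int(round(from_ * num_examples / 100.0))
        to = None if to is None else int(round(to * num_examples / 100.0))
    start = 0 if from_ is None else from_
    stop = num_examples if to is None else to
    # Negative boundaries count from the end, as in Python slicing.
    if start < 0:
        start += num_examples
    if stop < 0:
        stop += num_examples
    # Clamp both boundaries to the valid range.
    start = max(0, min(start, num_examples))
    stop = max(0, min(stop, num_examples))
    return start, stop
```

A spec made of several parts joined with `+` would produce one such pair per part.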






	</div></div>

	## Version[[datasets.Version]]

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class datasets.Version</name><anchor>datasets.Version</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/utils/version.py#L30</source><parameters>[{"name": "version_str", "val": ": str"}, {"name": "description", "val": ": typing.Optional[str] = None"}, {"name": "major", "val": ": typing.Union[str, int, NoneType] = None"}, {"name": "minor", "val": ": typing.Union[str, int, NoneType] = None"}, {"name": "patch", "val": ": typing.Union[str, int, NoneType] = None"}]</parameters><paramsdesc>- version_str (`str`) --
	The dataset version.
	- description (`str`) --
	A description of what is new in this version.
	- major (`str`) --
	- minor (`str`) --
	- patch (`str`) --</paramsdesc><paramgroups>0</paramgroups></docstring>
	Dataset version `MAJOR.MINOR.PATCH`.



	<ExampleCodeBlock anchor="datasets.Version.example">

	Example:

	```py
	>>> VERSION = datasets.Version("1.0.0")
	```

	</ExampleCodeBlock>
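
The `MAJOR.MINOR.PATCH` format is easiest to compare once it is split into integers, since plain string comparison would order `"1.10.0"` before `"1.9.0"`. The sketch below shows that idea with the standard library only. `parse_version` is a hypothetical helper and does not reflect how the class itself is implemented.

```python
import re


def parse_version(version_str: str) -> tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into integers (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version_str)
    if match is None:
        raise ValueError(f"Invalid version {version_str!r}, expected 'x.y.z'")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch
```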


	</div>

