| # Builder classes | |
| ## Builders[[datasets.DatasetBuilder]] | |
| 🤗 Datasets relies on two main classes during the dataset building process: [DatasetBuilder](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.DatasetBuilder) and [BuilderConfig](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.BuilderConfig). | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.DatasetBuilder</name><anchor>datasets.DatasetBuilder</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L210</source><parameters>[{"name": "cache_dir", "val": ": typing.Optional[str] = None"}, {"name": "dataset_name", "val": ": typing.Optional[str] = None"}, {"name": "config_name", "val": ": typing.Optional[str] = None"}, {"name": "hash", "val": ": typing.Optional[str] = None"}, {"name": "base_path", "val": ": typing.Optional[str] = None"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "token", "val": ": typing.Union[bool, str, NoneType] = None"}, {"name": "repo_id", "val": ": typing.Optional[str] = None"}, {"name": "data_files", "val": ": typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = None"}, {"name": "config_id", "val": ": typing.Optional[str] = None"}, {"name": "**config_kwargs", "val": ""}]</parameters><paramsdesc>- **cache_dir** (`str`, *optional*) -- | |
| Directory to cache data. Defaults to `"~/.cache/huggingface/datasets"`. | |
| - **dataset_name** (`str`, *optional*) -- | |
| Name of the dataset, if different from the builder name. Useful for packaged builders | |
| like csv, imagefolder, audiofolder, etc. to reflect the difference between datasets | |
| that use the same packaged builder. | |
| - **config_name** (`str`, *optional*) -- | |
| Name of the dataset configuration. | |
| It affects the data generated on disk. Different configurations will have their own subdirectories and | |
| versions. | |
| If not provided, the default configuration is used (if it exists). | |
| <Added version="2.3.0"> | |
| Parameter `name` was renamed to `config_name`. | |
| </Added> | |
| - **hash** (`str`, *optional*) -- | |
| Hash specific to the dataset builder code. Used to update the caching directory when the | |
| dataset builder code is updated (to avoid reusing old data). | |
| The typical caching directory (defined in `self._relative_data_dir`) is `name/version/hash/`. | |
| - **base_path** (`str`, *optional*) -- | |
| Base path for relative paths that are used to download files. | |
| This can be a remote URL. | |
| - **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), *optional*) -- | |
| Features types to use with this dataset. | |
| It can be used to change the [Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features) types of a dataset, for example. | |
| - **token** (`str` or `bool`, *optional*) -- | |
| String or boolean to use as Bearer token for remote files on the | |
| Datasets Hub. If `True`, will get token from `"~/.huggingface"`. | |
| - **repo_id** (`str`, *optional*) -- | |
| ID of the dataset repository. | |
| Used to distinguish builders with the same name but not coming from the same namespace, for example "rajpurkar/squad" | |
| and "lhoestq/squad" repo IDs. In the latter, the builder name would be "lhoestq___squad". | |
| - **data_files** (`str` or `Sequence` or `Mapping`, *optional*) -- | |
| Path(s) to source data file(s). | |
| Intended for builders like "csv" or "json" that need the user to specify data files. The files can be either | |
| local or remote. For convenience, you can use a `DataFilesDict`. | |
| - **data_dir** (`str`, *optional*) -- | |
| Path to directory containing source data file(s). | |
| Use only if `data_files` is not passed, in which case it is equivalent to passing | |
| `os.path.join(data_dir, "**")` as `data_files`. | |
| For builders that require manual download, it must be the path to the local directory containing the | |
| manually downloaded data. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the dataset file-system backend, if any. | |
| - **writer_batch_size** (`int`, *optional*) -- | |
| Batch size used by the ArrowWriter. | |
| It defines the number of samples that are kept in memory before writing them | |
| and also the length of the arrow chunks. | |
| None means that the ArrowWriter will use its default value. | |
| - ****config_kwargs** (additional keyword arguments) -- Keyword arguments to be passed to the corresponding builder | |
| configuration class, set on the class attribute [DatasetBuilder.BUILDER_CONFIG_CLASS](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.BuilderConfig). The builder | |
| configuration class is [BuilderConfig](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.BuilderConfig) or a subclass of it.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Abstract base class for all datasets. | |
| `DatasetBuilder` exposes three key members: | |
| - `DatasetBuilder.info`: Documents the dataset, including feature | |
| names, types, shapes, version, splits, citation, etc. | |
| - [DatasetBuilder.download_and_prepare()](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.DatasetBuilder.download_and_prepare): Downloads the source data | |
| and writes it to disk. | |
| - [DatasetBuilder.as_dataset()](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.DatasetBuilder.as_dataset): Generates a [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset). | |
| Some `DatasetBuilder`s expose multiple variants of the | |
| dataset by defining a [BuilderConfig](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.BuilderConfig) subclass and accepting a | |
| config object (or name) on construction. Configurable datasets expose a | |
| pre-defined set of configurations in `DatasetBuilder.builder_configs()`. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>as_dataset</name><anchor>datasets.DatasetBuilder.as_dataset</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L1030</source><parameters>[{"name": "split", "val": ": typing.Union[str, datasets.splits.Split, list[str], list[datasets.splits.Split], NoneType] = None"}, {"name": "run_post_process", "val": " = True"}, {"name": "verification_mode", "val": ": typing.Union[datasets.utils.info_utils.VerificationMode, str, NoneType] = None"}, {"name": "in_memory", "val": " = False"}]</parameters><paramsdesc>- **split** (`datasets.Split`) -- | |
| Which subset of the data to return. | |
| - **run_post_process** (`bool`, defaults to `True`) -- | |
| Whether to run post-processing dataset transforms and/or add | |
| indexes. | |
| - **verification_mode** ([VerificationMode](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.VerificationMode) or `str`, defaults to `BASIC_CHECKS`) -- | |
| Verification mode determining the checks to run on the | |
| downloaded/processed dataset information (checksums/size/splits/...). | |
| <Added version="2.9.1"/> | |
| - **in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory.</paramsdesc><paramgroups>0</paramgroups><retdesc>datasets.Dataset</retdesc></docstring> | |
| Return a Dataset for the specified split. | |
| <ExampleCodeBlock anchor="datasets.DatasetBuilder.as_dataset.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset_builder | |
| >>> builder = load_dataset_builder('cornell-movie-review-data/rotten_tomatoes') | |
| >>> builder.download_and_prepare() | |
| >>> ds = builder.as_dataset(split='train') | |
| >>> ds | |
| Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 8530 | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>download_and_prepare</name><anchor>datasets.DatasetBuilder.download_and_prepare</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L694</source><parameters>[{"name": "output_dir", "val": ": typing.Optional[str] = None"}, {"name": "download_config", "val": ": typing.Optional[datasets.download.download_config.DownloadConfig] = None"}, {"name": "download_mode", "val": ": typing.Union[datasets.download.download_manager.DownloadMode, str, NoneType] = None"}, {"name": "verification_mode", "val": ": typing.Union[datasets.utils.info_utils.VerificationMode, str, NoneType] = None"}, {"name": "dl_manager", "val": ": typing.Optional[datasets.download.download_manager.DownloadManager] = None"}, {"name": "base_path", "val": ": typing.Optional[str] = None"}, {"name": "file_format", "val": ": str = 'arrow'"}, {"name": "max_shard_size", "val": ": typing.Union[str, int, NoneType] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "**download_and_prepare_kwargs", "val": ""}]</parameters><paramsdesc>- **output_dir** (`str`, *optional*) -- | |
| Output directory for the dataset. | |
| Defaults to this builder's `cache_dir`, which is inside `~/.cache/huggingface/datasets` by default. | |
| <Added version="2.5.0"/> | |
| - **download_config** (`DownloadConfig`, *optional*) -- | |
| Specific download configuration parameters. | |
| - **download_mode** ([DownloadMode](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.DownloadMode) or `str`, *optional*) -- | |
| Select the download/generate mode. Defaults to `REUSE_DATASET_IF_EXISTS`. | |
| - **verification_mode** ([VerificationMode](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.VerificationMode) or `str`, defaults to `BASIC_CHECKS`) -- | |
| Verification mode determining the checks to run on the downloaded/processed dataset information (checksums/size/splits/...). | |
| <Added version="2.9.1"/> | |
| - **dl_manager** (`DownloadManager`, *optional*) -- | |
| Specific `DownloadManager` to use. | |
| - **base_path** (`str`, *optional*) -- | |
| Base path for relative paths that are used to download files. This can be a remote URL. | |
| If not specified, the value of the `base_path` attribute (`self.base_path`) will be used instead. | |
| - **file_format** (`str`, *optional*) -- | |
| Format of the data files in which the dataset will be written. | |
| Supported formats: "arrow", "parquet". Defaults to "arrow". | |
| If the format is "parquet", then image and audio data are embedded into the Parquet files instead of pointing to local files. | |
| <Added version="2.5.0"/> | |
| - **max_shard_size** (`Union[str, int]`, *optional*) -- | |
| Maximum number of bytes written per shard. Defaults to "500MB". | |
| The size is based on uncompressed data size, so in practice your shard files may be smaller than | |
| `max_shard_size` thanks to Parquet compression for example. | |
| <Added version="2.5.0"/> | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| Number of processes when downloading and generating the dataset locally. | |
| Multiprocessing is disabled by default. | |
| <Added version="2.7.0"/> | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the caching file-system backend, if any. | |
| <Added version="2.5.0"/> | |
| - ****download_and_prepare_kwargs** (additional keyword arguments) -- Keyword arguments.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Downloads and prepares dataset for reading. | |
| Example: | |
| <ExampleCodeBlock anchor="datasets.DatasetBuilder.download_and_prepare.example"> | |
| Download and prepare the dataset as Arrow files that can be loaded as a Dataset using `builder.as_dataset()`: | |
| ```py | |
| >>> from datasets import load_dataset_builder | |
| >>> builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes") | |
| >>> builder.download_and_prepare() | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.DatasetBuilder.download_and_prepare.example-2"> | |
| Download and prepare the dataset as sharded Parquet files locally: | |
| ```py | |
| >>> from datasets import load_dataset_builder | |
| >>> builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes") | |
| >>> builder.download_and_prepare("./output_dir", file_format="parquet") | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.DatasetBuilder.download_and_prepare.example-3"> | |
| Download and prepare the dataset as sharded Parquet files in cloud storage: | |
| ```py | |
| >>> from datasets import load_dataset_builder | |
| >>> storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key} | |
| >>> builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes") | |
| >>> builder.download_and_prepare("s3://my-bucket/my_rotten_tomatoes", storage_options=storage_options, file_format="parquet") | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>get_imported_module_dir</name><anchor>datasets.DatasetBuilder.get_imported_module_dir</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L686</source><parameters>[]</parameters></docstring> | |
| Return the path of the module of this class or subclass. | |
| </div></div> | |
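The `data_dir` parameter above is described as equivalent to passing `os.path.join(data_dir, "**")` as `data_files`. As a rough illustration of which files such a pattern covers, the sketch below expands it with the standard-library `glob` module. This is an assumption-level approximation: the library resolves patterns with its own machinery, and `expand_data_dir` is a hypothetical helper, not part of 🤗 Datasets.

```python
import glob
import os
import tempfile


def expand_data_dir(data_dir):
    """Return the sorted file paths matched by the `data_dir/**` pattern.

    Uses stdlib glob only; this is NOT the library's own pattern resolver.
    """
    pattern = os.path.join(data_dir, "**")
    matches = glob.glob(pattern, recursive=True)
    # The recursive pattern also matches directories, so keep files only.
    return sorted(p for p in matches if os.path.isfile(p))


# Build a tiny throwaway directory tree to run the helper on.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "train"))
    for rel in (os.path.join("train", "a.csv"), "test.csv"):
        with open(os.path.join(root, rel), "w") as f:
            f.write("x\n1\n")
    found = [os.path.relpath(p, root) for p in expand_data_dir(root)]
```

Files in nested subdirectories are picked up as well, which is why `data_dir` alone is often enough for folder-based builders.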
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.GeneratorBasedBuilder</name><anchor>datasets.GeneratorBasedBuilder</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L1359</source><parameters>[{"name": "cache_dir", "val": ": typing.Optional[str] = None"}, {"name": "dataset_name", "val": ": typing.Optional[str] = None"}, {"name": "config_name", "val": ": typing.Optional[str] = None"}, {"name": "hash", "val": ": typing.Optional[str] = None"}, {"name": "base_path", "val": ": typing.Optional[str] = None"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "token", "val": ": typing.Union[bool, str, NoneType] = None"}, {"name": "repo_id", "val": ": typing.Optional[str] = None"}, {"name": "data_files", "val": ": typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = None"}, {"name": "config_id", "val": ": typing.Optional[str] = None"}, {"name": "**config_kwargs", "val": ""}]</parameters></docstring> | |
| Base class for datasets with data generation based on dict generators. | |
| `GeneratorBasedBuilder` is a convenience class that abstracts away much | |
| of the data writing and reading of `DatasetBuilder`. It expects subclasses to | |
| implement generators of feature dictionaries across the dataset splits | |
| (`_split_generators`). See the method docstrings for details. | |
| </div> | |
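To make the idea of "generators of feature dictionaries" more concrete, here is a plain Python generator, independent of the library, that yields one `(key, example)` pair per record. In a real subclass this kind of logic would typically live in a `_generate_examples` method; the function name, field names, and sample data below are hypothetical.

```python
import io


def generate_examples(lines):
    """Yield (key, feature_dict) pairs from tab-separated `label<TAB>text` lines."""
    for key, line in enumerate(lines):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, text = line.split("\t", 1)
        yield key, {"text": text, "label": int(label)}


# In-memory stand-in for a downloaded data file.
sample = io.StringIO("1\tgreat film\n0\tdull film\n")
examples = list(generate_examples(sample))
```

Each yielded dictionary has the same keys, which is what lets the builder write the examples out with a fixed set of features.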
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.ArrowBasedBuilder</name><anchor>datasets.ArrowBasedBuilder</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L1624</source><parameters>[{"name": "cache_dir", "val": ": typing.Optional[str] = None"}, {"name": "dataset_name", "val": ": typing.Optional[str] = None"}, {"name": "config_name", "val": ": typing.Optional[str] = None"}, {"name": "hash", "val": ": typing.Optional[str] = None"}, {"name": "base_path", "val": ": typing.Optional[str] = None"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "token", "val": ": typing.Union[bool, str, NoneType] = None"}, {"name": "repo_id", "val": ": typing.Optional[str] = None"}, {"name": "data_files", "val": ": typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = None"}, {"name": "config_id", "val": ": typing.Optional[str] = None"}, {"name": "**config_kwargs", "val": ""}]</parameters></docstring> | |
| Base class for datasets with data generation based on Arrow loading functions (CSV/JSON/Parquet). | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.BuilderConfig</name><anchor>datasets.BuilderConfig</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L97</source><parameters>[{"name": "name", "val": ": str = 'default'"}, {"name": "version", "val": ": typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "data_files", "val": ": typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None"}, {"name": "description", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **name** (`str`, defaults to `default`) -- | |
| The name of the configuration. | |
| - **version** (`Version` or `str`, defaults to `0.0.0`) -- | |
| The version of the configuration. | |
| - **data_dir** (`str`, *optional*) -- | |
| Path to the directory containing the source data. | |
| - **data_files** (`str` or `Sequence` or `Mapping`, *optional*) -- | |
| Path(s) to source data file(s). | |
| - **description** (`str`, *optional*) -- | |
| A human-readable description of the configuration.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Base class for `DatasetBuilder` data configuration. | |
| `DatasetBuilder` subclasses with data configuration options should subclass | |
| `BuilderConfig` and add their own properties. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>create_config_id</name><anchor>datasets.BuilderConfig.create_config_id</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/builder.py#L140</source><parameters>[{"name": "config_kwargs", "val": ": dict"}, {"name": "custom_features", "val": ": typing.Optional[datasets.features.features.Features] = None"}]</parameters></docstring> | |
| The config id is used to build the cache directory. | |
| By default it is equal to the config name. | |
| However, the name of a config is not sufficient to uniquely identify the dataset being generated, | |
| since it doesn't take into account: | |
| - the config kwargs that can be used to overwrite attributes | |
| - the custom features used to write the dataset | |
| - the data_files for json/text/csv/pandas datasets | |
| Therefore the config id is just the config name with an optional suffix based on these. | |
| </div></div> | |
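The "config name plus optional suffix" idea can be sketched as follows. This is a simplified illustration and not the library's actual algorithm: the helper name `sketch_config_id`, the use of SHA-256, and the 16-character suffix length are all assumptions made for the example.

```python
import hashlib
import json


def sketch_config_id(config_name, config_kwargs=None):
    """Return `config_name`, plus a short hash suffix when extra kwargs are given."""
    if not config_kwargs:
        # Nothing overrides the config, so the name alone identifies it.
        return config_name
    # Sort keys so that the same kwargs always produce the same suffix.
    payload = json.dumps(config_kwargs, sort_keys=True, default=str)
    suffix = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"{config_name}-{suffix}"


plain = sketch_config_id("default")
with_semicolon = sketch_config_id("default", {"sep": ";"})
with_tab = sketch_config_id("default", {"sep": "\t"})
```

Two calls with different overrides get different ids, so each variant would end up in its own cache directory instead of silently reusing the other's data.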
| ## Download[[datasets.DownloadManager]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.DownloadManager</name><anchor>datasets.DownloadManager</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L71</source><parameters>[{"name": "dataset_name", "val": ": typing.Optional[str] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "download_config", "val": ": typing.Optional[datasets.download.download_config.DownloadConfig] = None"}, {"name": "base_path", "val": ": typing.Optional[str] = None"}, {"name": "record_checksums", "val": " = True"}]</parameters></docstring> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>download</name><anchor>datasets.DownloadManager.download</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L131</source><parameters>[{"name": "url_or_urls", "val": ""}]</parameters><paramsdesc>- **url_or_urls** (`str` or `list` or `dict`) -- | |
| URL or `list` or `dict` of URLs to download. Each URL is a `str`.</paramsdesc><paramgroups>0</paramgroups><rettype>`str` or `list` or `dict`</rettype><retdesc>The downloaded paths matching the given input `url_or_urls`.</retdesc></docstring> | |
| Download given URL(s). | |
| By default, only one process is used for download. Pass a custom `download_config.num_proc` to change this behavior. | |
| <ExampleCodeBlock anchor="datasets.DownloadManager.download.example"> | |
| Example: | |
| ```py | |
| >>> downloaded_files = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz') | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>download_and_extract</name><anchor>datasets.DownloadManager.download_and_extract</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L310</source><parameters>[{"name": "url_or_urls", "val": ""}]</parameters><paramsdesc>- **url_or_urls** (`str` or `list` or `dict`) -- | |
| URL or `list` or `dict` of URLs to download and extract. Each URL is a `str`.</paramsdesc><paramgroups>0</paramgroups><rettype>extracted_path(s)</rettype><retdesc>`str`, extracted paths of given URL(s).</retdesc></docstring> | |
| Download and extract given `url_or_urls`. | |
| <ExampleCodeBlock anchor="datasets.DownloadManager.download_and_extract.example"> | |
| Is roughly equivalent to: | |
| ``` | |
| extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls)) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>extract</name><anchor>datasets.DownloadManager.extract</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L278</source><parameters>[{"name": "path_or_paths", "val": ""}]</parameters><paramsdesc>- **path_or_paths** (path or `list` or `dict`) -- | |
| Path(s) of the file(s) to extract. Each path is a `str`.</paramsdesc><paramgroups>0</paramgroups><rettype>extracted_path(s)</rettype><retdesc>`str`, the extracted paths matching the given input | |
| `path_or_paths`.</retdesc></docstring> | |
| Extract given path(s). | |
| <ExampleCodeBlock anchor="datasets.DownloadManager.extract.example"> | |
| Example: | |
| ```py | |
| >>> downloaded_files = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz') | |
| >>> extracted_files = dl_manager.extract(downloaded_files) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>iter_archive</name><anchor>datasets.DownloadManager.iter_archive</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L234</source><parameters>[{"name": "path_or_buf", "val": ": typing.Union[str, _io.BufferedReader]"}]</parameters><paramsdesc>- **path_or_buf** (`str` or `io.BufferedReader`) -- | |
| Archive path or archive binary file object.</paramsdesc><paramgroups>0</paramgroups><yieldtype>`tuple[str, io.BufferedReader]`</yieldtype><yielddesc>2-tuple (path_within_archive, file_object). | |
| File object is opened in binary mode.</yielddesc></docstring> | |
| Iterate over files within an archive. | |
| <ExampleCodeBlock anchor="datasets.DownloadManager.iter_archive.example"> | |
| Example: | |
| ```py | |
| >>> archive = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz') | |
| >>> files = dl_manager.iter_archive(archive) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
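The 2-tuples yielded by `iter_archive` are usually consumed in a loop that reads each file object before moving on to the next member. The sketch below shows that consumption pattern with a stand-in generator built on the standard-library `tarfile` module and an in-memory archive, so it runs without any download. `iter_tar_members` and the sample file names are hypothetical and only mimic the documented `(path_within_archive, file_object)` shape.

```python
import io
import tarfile


def iter_tar_members(fileobj):
    """Yield (path_within_archive, file_object) for each regular file in a tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                # extractfile returns a binary file object for the member.
                yield member.name, tar.extractfile(member)


# Build a small gzip-compressed tar archive in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, data in [("pos/1.txt", b"great film"), ("neg/1.txt", b"dull film")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Read every member while the archive is still open.
contents = {path: f.read() for path, f in iter_tar_members(buf)}
```

Reading inside the loop matters: the file objects are tied to the open archive, so they should not be stored and read later.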
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>iter_files</name><anchor>datasets.DownloadManager.iter_files</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L259</source><parameters>[{"name": "paths", "val": ": typing.Union[str, list[str]]"}]</parameters><paramsdesc>- **paths** (`str` or `list` of `str`) -- | |
| Root paths.</paramsdesc><paramgroups>0</paramgroups><yieldtype>`str`</yieldtype><yielddesc>File path.</yielddesc></docstring> | |
| Iterate over file paths. | |
| <ExampleCodeBlock anchor="datasets.DownloadManager.iter_files.example"> | |
| Example: | |
| ```py | |
| >>> files = dl_manager.download_and_extract('https://huggingface.co/datasets/beans/resolve/main/data/train.zip') | |
| >>> files = dl_manager.iter_files(files) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div></div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.StreamingDownloadManager</name><anchor>datasets.StreamingDownloadManager</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/streaming_download_manager.py#L47</source><parameters>[{"name": "dataset_name", "val": ": typing.Optional[str] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "download_config", "val": ": typing.Optional[datasets.download.download_config.DownloadConfig] = None"}, {"name": "base_path", "val": ": typing.Optional[str] = None"}]</parameters></docstring> | |
| Download manager that uses the "::" separator to navigate through (possibly remote) compressed archives. | |
| Unlike the regular `DownloadManager`, the `download` and `extract` methods neither download nor extract | |
| data. Instead, they return the path or URL that can be opened with the `xopen` function, which extends the | |
| built-in `open` function to stream data from remote files. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>download</name><anchor>datasets.StreamingDownloadManager.download</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/streaming_download_manager.py#L75</source><parameters>[{"name": "url_or_urls", "val": ""}]</parameters><paramsdesc>- **url_or_urls** (`str` or `list` or `dict`) -- | |
| URL(s) of files to stream data from. Each URL is a `str`.</paramsdesc><paramgroups>0</paramgroups><rettype>url(s)</rettype><retdesc>(`str` or `list` or `dict`), URL(s) to stream data from matching the given input `url_or_urls`.</retdesc></docstring> | |
| Normalize URL(s) of files to stream data from. | |
| This is the lazy version of `DownloadManager.download` for streaming. | |
| <ExampleCodeBlock anchor="datasets.StreamingDownloadManager.download.example"> | |
| Example: | |
| ```py | |
| >>> downloaded_files = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz') | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>download_and_extract</name><anchor>datasets.StreamingDownloadManager.download_and_extract</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/streaming_download_manager.py#L151</source><parameters>[{"name": "url_or_urls", "val": ""}]</parameters><paramsdesc>- **url_or_urls** (`str` or `list` or `dict`) -- | |
| URL(s) to stream data from. Each URL is a `str`.</paramsdesc><paramgroups>0</paramgroups><rettype>url(s)</rettype><retdesc>(`str` or `list` or `dict`), URL(s) to stream data from matching the given input `url_or_urls`.</retdesc></docstring> | |
| Prepare given `url_or_urls` for streaming (add extraction protocol). | |
| This is the lazy version of `DownloadManager.download_and_extract` for streaming. | |
| <ExampleCodeBlock anchor="datasets.StreamingDownloadManager.download_and_extract.example"> | |
| Is equivalent to: | |
| ``` | |
| urls = dl_manager.extract(dl_manager.download(url_or_urls)) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>extract</name><anchor>datasets.StreamingDownloadManager.extract</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/streaming_download_manager.py#L102</source><parameters>[{"name": "url_or_urls", "val": ""}]</parameters><paramsdesc>- **url_or_urls** (`str` or `list` or `dict`) -- | |
| URL(s) of files to stream data from. Each URL is a `str`.</paramsdesc><paramgroups>0</paramgroups><rettype>url(s)</rettype><retdesc>(`str` or `list` or `dict`), URL(s) to stream data from matching the given input `url_or_urls`.</retdesc></docstring> | |
| Add extraction protocol for given url(s) for streaming. | |
| This is the lazy version of `DownloadManager.extract` for streaming. | |
| <ExampleCodeBlock anchor="datasets.StreamingDownloadManager.extract.example"> | |
| Example: | |
| ```py | |
| >>> downloaded_files = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz') | |
| >>> extracted_files = dl_manager.extract(downloaded_files) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>iter_archive</name><anchor>datasets.StreamingDownloadManager.iter_archive</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/streaming_download_manager.py#L171</source><parameters>[{"name": "urlpath_or_buf", "val": ": typing.Union[str, _io.BufferedReader]"}]</parameters><paramsdesc>- **urlpath_or_buf** (`str` or `io.BufferedReader`) -- | |
| Archive path or archive binary file object.</paramsdesc><paramgroups>0</paramgroups><yieldtype>`tuple[str, io.BufferedReader]`</yieldtype><yielddesc>2-tuple (path_within_archive, file_object). | |
| File object is opened in binary mode.</yielddesc></docstring> | |
| Iterate over files within an archive. | |
| <ExampleCodeBlock anchor="datasets.StreamingDownloadManager.iter_archive.example"> | |
| Example: | |
| ```py | |
| >>> archive = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz') | |
| >>> files = dl_manager.iter_archive(archive) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>iter_files</name><anchor>datasets.StreamingDownloadManager.iter_files</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/streaming_download_manager.py#L196</source><parameters>[{"name": "urlpaths", "val": ": typing.Union[str, list[str]]"}]</parameters><paramsdesc>- **urlpaths** (`str` or `list` of `str`) -- | |
| Root paths.</paramsdesc><paramgroups>0</paramgroups><yieldtype>str</yieldtype><yielddesc>File URL path.</yielddesc></docstring> | |
| Iterate over files. | |
| <ExampleCodeBlock anchor="datasets.StreamingDownloadManager.iter_files.example"> | |
| Example: | |
| ```py | |
| >>> files = dl_manager.download_and_extract('https://huggingface.co/datasets/beans/resolve/main/data/train.zip') | |
| >>> files = dl_manager.iter_files(files) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div></div> | |
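The idea behind `iter_files` (turning one or more root paths into a flat stream of file paths) can be sketched for local directories with the standard library alone. This is only an approximation under stated assumptions: `iter_file_paths` is a hypothetical helper, it handles local paths only, and any filtering the library applies is not reproduced here.

```python
import os
import tempfile


def iter_file_paths(roots):
    """Yield every file path found under the given root path(s).

    Hypothetical local-only helper; accepts a single path or a list of paths.
    """
    if isinstance(roots, str):
        roots = [roots]
    for root in roots:
        if os.path.isfile(root):
            yield root
            continue
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # make the traversal order deterministic
            for filename in sorted(filenames):
                yield os.path.join(dirpath, filename)


# Build a small sample tree and list it.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "train", "healthy"))
    for rel in ("train/healthy/1.jpg", "train/healthy/2.jpg"):
        open(os.path.join(tmp, *rel.split("/")), "wb").close()
    found = [
        os.path.relpath(path, tmp).replace(os.sep, "/")
        for path in iter_file_paths(tmp)
    ]
```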
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.DownloadConfig</name><anchor>datasets.DownloadConfig</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_config.py#L10</source><parameters>[{"name": "cache_dir", "val": ": typing.Union[str, pathlib.Path, NoneType] = None"}, {"name": "force_download", "val": ": bool = False"}, {"name": "resume_download", "val": ": bool = False"}, {"name": "local_files_only", "val": ": bool = False"}, {"name": "proxies", "val": ": typing.Optional[dict] = None"}, {"name": "user_agent", "val": ": typing.Optional[str] = None"}, {"name": "extract_compressed_file", "val": ": bool = False"}, {"name": "force_extract", "val": ": bool = False"}, {"name": "delete_extracted", "val": ": bool = False"}, {"name": "extract_on_the_fly", "val": ": bool = False"}, {"name": "use_etag", "val": ": bool = True"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "max_retries", "val": ": int = 1"}, {"name": "token", "val": ": typing.Union[str, bool, NoneType] = None"}, {"name": "storage_options", "val": ": dict = <factory>"}, {"name": "download_desc", "val": ": typing.Optional[str] = None"}, {"name": "disable_tqdm", "val": ": bool = False"}]</parameters><paramsdesc>- **cache_dir** (`str` or `Path`, *optional*) -- | |
| Specify a cache directory to save the file to (overrides the | |
| default cache dir). | |
| - **force_download** (`bool`, defaults to `False`) -- | |
| If `True`, re-download the file even if it's already cached in | |
| the cache dir. | |
| - **resume_download** (`bool`, defaults to `False`) -- | |
| If `True`, resume the download if an incompletely received file is | |
| found. | |
| - **proxies** (`dict`, *optional*) -- | |
| - **user_agent** (`str`, *optional*) -- | |
| Optional string or dict that will be appended to the user-agent on remote | |
| requests. | |
| - **extract_compressed_file** (`bool`, defaults to `False`) -- | |
| If `True` and the path points to a zip or tar file, | |
| extract the compressed file in a folder alongside the archive. | |
| - **force_extract** (`bool`, defaults to `False`) -- | |
| If `True` when `extract_compressed_file` is `True` and the archive | |
| was already extracted, re-extract the archive and overwrite the folder where it was extracted. | |
| - **delete_extracted** (`bool`, defaults to `False`) -- | |
| Whether to delete (or keep) the extracted files. | |
| - **extract_on_the_fly** (`bool`, defaults to `False`) -- | |
| If `True`, extract compressed files while they are being read. | |
| - **use_etag** (`bool`, defaults to `True`) -- | |
| Whether to use the ETag HTTP response header to validate the cached files. | |
| - **num_proc** (`int`, *optional*) -- | |
| The number of processes to launch to download the files in parallel. | |
| - **max_retries** (`int`, defaults to `1`) -- | |
| The number of times to retry an HTTP request if it fails. | |
| - **token** (`str` or `bool`, *optional*) -- | |
| Optional string or boolean to use as Bearer token | |
| for remote files on the Datasets Hub. If `True`, or not specified, will get token from `~/.huggingface`. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the dataset file-system backend, if any. | |
| - **download_desc** (`str`, *optional*) -- | |
| A description to be displayed alongside the progress bar while downloading the files. | |
| - **disable_tqdm** (`bool`, defaults to `False`) -- | |
| Whether to disable the per-file download progress bar.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Configuration for our cached path manager. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.DownloadMode</name><anchor>datasets.DownloadMode</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/download/download_manager.py#L50</source><parameters>[{"name": "value", "val": ""}, {"name": "names", "val": " = None"}, {"name": "module", "val": " = None"}, {"name": "qualname", "val": " = None"}, {"name": "type", "val": " = None"}, {"name": "start", "val": " = 1"}]</parameters></docstring> | |
| `Enum` for how to treat pre-existing downloads and data. | |
| The default mode is `REUSE_DATASET_IF_EXISTS`, which will reuse both | |
| raw downloads and the prepared dataset if they exist. | |
| The generation modes: | |
| | | Downloads | Dataset | | |
| |-------------------------------------|-----------|---------| | |
| | `REUSE_DATASET_IF_EXISTS` (default) | Reuse | Reuse | | |
| | `REUSE_CACHE_IF_EXISTS` | Reuse | Fresh | | |
| | `FORCE_REDOWNLOAD` | Fresh | Fresh | | |
| </div> | |
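The table above can be read as a small decision rule. The sketch below encodes it with a stand-in enum; `Mode`, `REUSE`, and `plan` are hypothetical names for this illustration, and the rule that raw files are only fetched when the dataset must be prepared again is an assumption about how the two columns interact, not a statement of the library's internals.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for `DownloadMode`."""

    REUSE_DATASET_IF_EXISTS = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD = "force_redownload"


# (reuse raw downloads?, reuse prepared dataset?) for each mode, per the table.
REUSE = {
    Mode.REUSE_DATASET_IF_EXISTS: (True, True),
    Mode.REUSE_CACHE_IF_EXISTS: (True, False),
    Mode.FORCE_REDOWNLOAD: (False, False),
}


def plan(mode, downloads_cached, dataset_cached):
    """Return which steps would need to run for the given cache state."""
    reuse_downloads, reuse_dataset = REUSE[mode]
    prepare = not (reuse_dataset and dataset_cached)
    # Assumption: raw files are only needed when the dataset is prepared again.
    download = prepare and not (reuse_downloads and downloads_cached)
    return {"download": download, "prepare": prepare}


# With everything cached, only FORCE_REDOWNLOAD fetches the raw files again.
plans = {mode.name: plan(mode, True, True) for mode in Mode}
```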
| ## Verification[[datasets.VerificationMode]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.VerificationMode</name><anchor>datasets.VerificationMode</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/utils/info_utils.py#L22</source><parameters>[{"name": "value", "val": ""}, {"name": "names", "val": " = None"}, {"name": "module", "val": " = None"}, {"name": "qualname", "val": " = None"}, {"name": "type", "val": " = None"}, {"name": "start", "val": " = 1"}]</parameters></docstring> | |
| `Enum` that specifies which verification checks to run. | |
| The default mode is `BASIC_CHECKS`, which will perform only rudimentary checks to avoid slowdowns | |
| when generating/downloading a dataset for the first time. | |
| The verification modes: | |
| | | Verification checks | | |
| |---------------------------|------------------------------------------------------------------------------ | | |
| | `ALL_CHECKS` | Split checks, uniqueness of the keys yielded by a `GeneratorBasedBuilder`, and the validity (number of files, checksums, etc.) of downloaded files | | |
| | `BASIC_CHECKS` (default) | Same as `ALL_CHECKS` but without checking downloaded files | | |
| | `NO_CHECKS` | None | | |
| </div> | |
| ## Splits[[datasets.SplitGenerator]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.SplitGenerator</name><anchor>datasets.SplitGenerator</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/splits.py#L602</source><parameters>[{"name": "name", "val": ": str"}, {"name": "gen_kwargs", "val": ": dict = <factory>"}]</parameters><paramsdesc>- **name** (`str`) -- | |
| Name of the `Split` for which the generator will | |
| create the examples. | |
| - ****gen_kwargs** (additional keyword arguments) -- | |
| Keyword arguments to forward to the `DatasetBuilder._generate_examples` method | |
| of the builder.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Defines the split information for the generator. | |
| This should be used as the return value of | |
| `GeneratorBasedBuilder._split_generators`. | |
| See `GeneratorBasedBuilder._split_generators` for more information and a usage | |
| example. | |
| <ExampleCodeBlock anchor="datasets.SplitGenerator.example"> | |
| Example: | |
| ```py | |
| >>> datasets.SplitGenerator( | |
| ... name=datasets.Split.TRAIN, | |
| ... gen_kwargs={"split_key": "train", "files": dl_manager.download_and_extract(url)}, | |
| ... ) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
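The forwarding of `gen_kwargs` can be pictured with a small self-contained sketch. It does not use the library: `SketchSplitGenerator` and `generate_examples` are hypothetical stand-ins for `SplitGenerator` and a builder's `_generate_examples`, and the file names are sample data.

```python
from dataclasses import dataclass, field


@dataclass
class SketchSplitGenerator:
    """Hypothetical stand-in holding a split name and its generator kwargs."""

    name: str
    gen_kwargs: dict = field(default_factory=dict)


def generate_examples(split_key, files):
    """Hypothetical example generator: yields (key, example) pairs."""
    for idx, path in enumerate(files):
        yield f"{split_key}-{idx}", {"path": path}


generators = [
    SketchSplitGenerator("train", {"split_key": "train", "files": ["a.txt", "b.txt"]}),
    SketchSplitGenerator("test", {"split_key": "test", "files": ["c.txt"]}),
]

# Each split's kwargs are unpacked into the example generator.
examples = {g.name: list(generate_examples(**g.gen_kwargs)) for g in generators}
```

Because the kwargs are unpacked by name, the keys of `gen_kwargs` have to match the parameter names of the generating function.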
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Split</name><anchor>datasets.Split</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/splits.py#L406</source><parameters>[{"name": "name", "val": ""}]</parameters></docstring> | |
| `Enum` for dataset splits. | |
| Datasets are typically split into different subsets to be used at various | |
| stages of training and evaluation. | |
| - `TRAIN`: the training data. | |
| - `VALIDATION`: the validation data. If present, this is typically used as | |
| evaluation data while iterating on a model (e.g. changing hyperparameters, | |
| model architecture, etc.). | |
| - `TEST`: the testing data. This is the data to report metrics on. Typically | |
| you do not want to use this during model iteration as you may overfit to it. | |
| - `ALL`: the union of all defined dataset splits. | |
| All splits, including compositions, inherit from `datasets.SplitBase`. | |
| See the [guide](../load_hub#splits) on splits for more information. | |
| <ExampleCodeBlock anchor="datasets.Split.example"> | |
| Example: | |
| ```py | |
| >>> datasets.SplitGenerator( | |
| ... name=datasets.Split.TRAIN, | |
| ... gen_kwargs={"split_key": "train", "files": dl_manager.download_and_extract(url)}, | |
| ... ), | |
| ... datasets.SplitGenerator( | |
| ... name=datasets.Split.VALIDATION, | |
| ... gen_kwargs={"split_key": "validation", "files": dl_manager.download_and_extract(url)}, | |
| ... ), | |
| ... datasets.SplitGenerator( | |
| ... name=datasets.Split.TEST, | |
| ... gen_kwargs={"split_key": "test", "files": dl_manager.download_and_extract(url)}, | |
| ... ) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.NamedSplit</name><anchor>datasets.NamedSplit</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/splits.py#L314</source><parameters>[{"name": "name", "val": ""}]</parameters></docstring> | |
| Descriptor corresponding to a named split (train, test, ...). | |
| Example: | |
| <ExampleCodeBlock anchor="datasets.NamedSplit.example"> | |
| Each descriptor can be composed with others using addition or slicing: | |
| ```py | |
| split = datasets.Split.TRAIN.subsplit(datasets.percent[0:25]) + datasets.Split.TEST | |
| ``` | |
| </ExampleCodeBlock> | |
| The resulting split will correspond to 25% of the train split merged with | |
| 100% of the test split. | |
| <ExampleCodeBlock anchor="datasets.NamedSplit.example-2"> | |
| A split cannot be added twice, so the following will fail: | |
| ```py | |
| split = ( | |
| datasets.Split.TRAIN.subsplit(datasets.percent[:25]) + | |
| datasets.Split.TRAIN.subsplit(datasets.percent[75:]) | |
| ) # Error | |
| split = datasets.Split.TEST + datasets.Split.ALL # Error | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.NamedSplit.example-3"> | |
| The slices can be applied only one time. So the following are valid: | |
| ```py | |
| split = ( | |
| datasets.Split.TRAIN.subsplit(datasets.percent[:25]) + | |
| datasets.Split.TEST.subsplit(datasets.percent[:50]) | |
| ) | |
| split = (datasets.Split.TRAIN + datasets.Split.TEST).subsplit(datasets.percent[:50]) | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.NamedSplit.example-4"> | |
| But this is not valid: | |
| ```py | |
| train = datasets.Split.TRAIN | |
| test = datasets.Split.TEST | |
| split = train.subsplit(datasets.percent[:25]).subsplit(datasets.percent[:25]) | |
| split = (train.subsplit(datasets.percent[:25]) + test).subsplit(datasets.percent[:50]) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.NamedSplitAll</name><anchor>datasets.NamedSplitAll</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/splits.py#L391</source><parameters>[]</parameters></docstring> | |
| Split corresponding to the union of all defined dataset splits. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.ReadInstruction</name><anchor>datasets.ReadInstruction</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_reader.py#L456</source><parameters>[{"name": "split_name", "val": ""}, {"name": "rounding", "val": " = None"}, {"name": "from_", "val": " = None"}, {"name": "to", "val": " = None"}, {"name": "unit", "val": " = None"}]</parameters></docstring> | |
| Reading instruction for a dataset. | |
| <ExampleCodeBlock anchor="datasets.ReadInstruction.example"> | |
| Examples: | |
| ```python | |
| # The following lines are equivalent: | |
| ds = datasets.load_dataset('mnist', split='test[:33%]') | |
| ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec('test[:33%]')) | |
| ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction('test', to=33, unit='%')) | |
| ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction( | |
| 'test', from_=0, to=33, unit='%')) | |
| # The following lines are equivalent: | |
| ds = datasets.load_dataset('mnist', split='test[:33%]+train[1:-1]') | |
| ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec( | |
| 'test[:33%]+train[1:-1]')) | |
| ds = datasets.load_dataset('mnist', split=( | |
| datasets.ReadInstruction('test', to=33, unit='%') + | |
| datasets.ReadInstruction('train', from_=1, to=-1, unit='abs'))) | |
| # The following lines are equivalent: | |
| ds = datasets.load_dataset('mnist', split='test[:33%](pct1_dropremainder)') | |
| ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec( | |
| 'test[:33%](pct1_dropremainder)')) | |
| ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction( | |
| 'test', from_=0, to=33, unit='%', rounding="pct1_dropremainder")) | |
| # 10-fold validation: | |
| tests = datasets.load_dataset( | |
| 'mnist', | |
| split=[datasets.ReadInstruction('train', from_=k, to=k+10, unit='%') | |
| for k in range(0, 100, 10)]) | |
| trains = datasets.load_dataset( | |
| 'mnist', | |
| split=[datasets.ReadInstruction('train', to=k, unit='%') + datasets.ReadInstruction('train', from_=k+10, unit='%') | |
| for k in range(0, 100, 10)]) | |
| ``` | |
| </ExampleCodeBlock> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_spec</name><anchor>datasets.ReadInstruction.from_spec</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_reader.py#L536</source><parameters>[{"name": "spec", "val": ""}]</parameters><paramsdesc>- **spec** (`str`) -- | |
| Split(s) + optional slice(s) to read + optional rounding | |
| if percents are used as the slicing unit. A slice can be specified, | |
| using absolute numbers (`int`) or percentages (`int`).</paramsdesc><paramgroups>0</paramgroups><retdesc>ReadInstruction instance.</retdesc></docstring> | |
| Creates a `ReadInstruction` instance out of a string spec. | |
| <ExampleCodeBlock anchor="datasets.ReadInstruction.from_spec.example"> | |
| Examples: | |
| ``` | |
| test: test split. | |
| test + validation: test split + validation split. | |
| test[10:]: test split, minus its first 10 records. | |
| test[:10%]: first 10% records of test split. | |
| test[:20%](pct1_dropremainder): first 20% records, rounded with the pct1_dropremainder rounding. | |
| test[:-5%]+train[40%:60%]: first 95% of test + middle 20% of train. | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_absolute</name><anchor>datasets.ReadInstruction.to_absolute</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_reader.py#L608</source><parameters>[{"name": "name2len", "val": ""}]</parameters><paramsdesc>- **name2len** (`dict`) -- | |
| Associating split names to number of examples.</paramsdesc><paramgroups>0</paramgroups><retdesc>list of _AbsoluteInstruction instances (corresponds to the + in spec).</retdesc></docstring> | |
| Translate instruction into a list of absolute instructions. | |
| Those absolute instructions are then to be added together. | |
| </div></div> | |
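How a relative instruction could be turned into absolute boundaries is sketched below. This is a simplified illustration, not the library's code: `pct_to_abs` and `to_absolute_slice` are hypothetical helpers, and the two rounding rules are assumed to behave as round-to-nearest (`closest`) and whole multiples of 1% of the split (`pct1_dropremainder`).

```python
import math


def pct_to_abs(boundary, num_examples, rounding="closest"):
    """Convert a percent boundary into an absolute index (assumed rules)."""
    if rounding == "closest":
        return int(round(boundary * num_examples / 100.0))
    if rounding == "pct1_dropremainder":
        # Every 1% has the same length; the remainder is dropped.
        return boundary * math.trunc(num_examples / 100.0)
    raise ValueError(f"Unknown rounding: {rounding!r}")


def to_absolute_slice(num_examples, from_=None, to=None, unit="abs", rounding="closest"):
    """Return (start, stop) absolute boundaries for one split."""
    if unit == "%":
        from_ = None if from_ is None else pct_to_abs(from_, num_examples, rounding)
        to = None if to is None else pct_to_abs(to, num_examples, rounding)
    # slice.indices resolves missing and negative boundaries and clamps them.
    start, stop, _ = slice(from_, to).indices(num_examples)
    return start, stop


closest = to_absolute_slice(9999, to=33, unit="%")
dropped = to_absolute_slice(9999, to=33, unit="%", rounding="pct1_dropremainder")
trimmed = to_absolute_slice(10, from_=1, to=-1)
```

With 9,999 examples, the two rounding modes give different upper boundaries for `[:33%]`, which is why the rounding matters when splits must be exact multiples of one another.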
| ## Version[[datasets.Version]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Version</name><anchor>datasets.Version</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/utils/version.py#L30</source><parameters>[{"name": "version_str", "val": ": str"}, {"name": "description", "val": ": typing.Optional[str] = None"}, {"name": "major", "val": ": typing.Union[str, int, NoneType] = None"}, {"name": "minor", "val": ": typing.Union[str, int, NoneType] = None"}, {"name": "patch", "val": ": typing.Union[str, int, NoneType] = None"}]</parameters><paramsdesc>- **version_str** (`str`) -- | |
| The dataset version. | |
| - **description** (`str`) -- | |
| A description of what is new in this version. | |
| - **major** (`str`) -- | |
| - **minor** (`str`) -- | |
| - **patch** (`str`) --</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Dataset version `MAJOR.MINOR.PATCH`. | |
| <ExampleCodeBlock anchor="datasets.Version.example"> | |
| Example: | |
| ```py | |
| >>> VERSION = datasets.Version("1.0.0") | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
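The `MAJOR.MINOR.PATCH` format lends itself to a simple parse-and-compare sketch. The helper below is hypothetical and independent of `datasets.Version`; it only shows why comparing parsed integers is safer than comparing the raw strings.

```python
import re

# Exactly three dot-separated non-negative integers.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(version_str):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints (hypothetical helper)."""
    match = _VERSION_RE.match(version_str)
    if match is None:
        raise ValueError(f"Invalid version {version_str!r}, expected MAJOR.MINOR.PATCH")
    return tuple(int(part) for part in match.groups())


major, minor, patch = parse_version("1.0.0")

# Tuples compare numerically, component by component; strings would not.
newer = parse_version("1.10.0") > parse_version("1.9.0")
```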