| # Main classes | |
| ## DatasetInfo[[datasets.DatasetInfo]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.DatasetInfo</name><anchor>datasets.DatasetInfo</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/info.py#L92</source><parameters>[{"name": "description", "val": ": str = <factory>"}, {"name": "citation", "val": ": str = <factory>"}, {"name": "homepage", "val": ": str = <factory>"}, {"name": "license", "val": ": str = <factory>"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "post_processed", "val": ": typing.Optional[datasets.info.PostProcessedInfo] = None"}, {"name": "supervised_keys", "val": ": typing.Optional[datasets.info.SupervisedKeysData] = None"}, {"name": "builder_name", "val": ": typing.Optional[str] = None"}, {"name": "dataset_name", "val": ": typing.Optional[str] = None"}, {"name": "config_name", "val": ": typing.Optional[str] = None"}, {"name": "version", "val": ": typing.Union[str, datasets.utils.version.Version, NoneType] = None"}, {"name": "splits", "val": ": typing.Optional[dict] = None"}, {"name": "download_checksums", "val": ": typing.Optional[dict] = None"}, {"name": "download_size", "val": ": typing.Optional[int] = None"}, {"name": "post_processing_size", "val": ": typing.Optional[int] = None"}, {"name": "dataset_size", "val": ": typing.Optional[int] = None"}, {"name": "size_in_bytes", "val": ": typing.Optional[int] = None"}]</parameters><paramsdesc>- **description** (`str`) -- | |
| A description of the dataset. | |
| - **citation** (`str`) -- | |
| A BibTeX citation of the dataset. | |
| - **homepage** (`str`) -- | |
| A URL to the official homepage for the dataset. | |
| - **license** (`str`) -- | |
| The dataset's license. It can be the name of the license or a paragraph containing the terms of the license. | |
| - **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), *optional*) -- | |
| The features used to specify the dataset's column types. | |
| - **post_processed** (`PostProcessedInfo`, *optional*) -- | |
| Information regarding the resources of a possible post-processing of a dataset. For example, it can contain the information of an index. | |
| - **supervised_keys** (`SupervisedKeysData`, *optional*) -- | |
| Specifies the input feature and the label for supervised learning if applicable for the dataset (legacy from TFDS). | |
| - **builder_name** (`str`, *optional*) -- | |
| The name of the `GeneratorBasedBuilder` subclass used to create the dataset. It is also the snake_case version of the dataset builder class name. | |
| - **config_name** (`str`, *optional*) -- | |
| The name of the configuration derived from [BuilderConfig](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.BuilderConfig). | |
| - **version** (`str` or [Version](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.Version), *optional*) -- | |
| The version of the dataset. | |
| - **splits** (`dict`, *optional*) -- | |
| The mapping between split name and metadata. | |
| - **download_checksums** (`dict`, *optional*) -- | |
| The mapping between the URL to download the dataset's checksums and corresponding metadata. | |
| - **download_size** (`int`, *optional*) -- | |
| The size of the files to download to generate the dataset, in bytes. | |
| - **post_processing_size** (`int`, *optional*) -- | |
| Size of the dataset in bytes after post-processing, if any. | |
| - **dataset_size** (`int`, *optional*) -- | |
| The combined size in bytes of the Arrow tables for all splits. | |
| - **size_in_bytes** (`int`, *optional*) -- | |
| The combined size in bytes of all files associated with the dataset (downloaded files + Arrow files). | |
| - ****config_kwargs** (additional keyword arguments) -- | |
| Keyword arguments to be passed to the [BuilderConfig](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.BuilderConfig) and used in the [DatasetBuilder](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.DatasetBuilder).</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Information about a dataset. | |
| `DatasetInfo` documents a dataset, including its name, version, and features. | |
| See the constructor arguments and properties for a full list. | |
| Not all fields are known at construction time; some may be updated later. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_directory</name><anchor>datasets.DatasetInfo.from_directory</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/info.py#L247</source><parameters>[{"name": "dataset_info_dir", "val": ": str"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **dataset_info_dir** (`str`) -- | |
| The directory containing the metadata file. This | |
| should be the root directory of a specific dataset version. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| <Added version="2.9.0"/></paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create [DatasetInfo](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetInfo) from the JSON file in `dataset_info_dir`. | |
| This function updates all the dynamically generated fields (num_examples, | |
| hash, time of creation,...) of the [DatasetInfo](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetInfo). | |
| This will overwrite all previous metadata. | |
| <ExampleCodeBlock anchor="datasets.DatasetInfo.from_directory.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import DatasetInfo | |
| >>> ds_info = DatasetInfo.from_directory("/path/to/directory/") | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>write_to_directory</name><anchor>datasets.DatasetInfo.write_to_directory</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/info.py#L186</source><parameters>[{"name": "dataset_info_dir", "val": ""}, {"name": "pretty_print", "val": " = False"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **dataset_info_dir** (`str`) -- | |
| Destination directory. | |
| - **pretty_print** (`bool`, defaults to `False`) -- | |
| If `True`, the JSON will be pretty-printed with an indent level of 4. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| <Added version="2.9.0"/></paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Write `DatasetInfo` and license (if present) as JSON files to `dataset_info_dir`. | |
| <ExampleCodeBlock anchor="datasets.DatasetInfo.write_to_directory.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds.info.write_to_directory("/path/to/directory/") | |
| ``` | |
| </ExampleCodeBlock> | |
| </div></div> | |
| ## Dataset[[datasets.Dataset]] | |
| The base class [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) implements a Dataset backed by an Apache Arrow table. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Dataset</name><anchor>datasets.Dataset</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L704</source><parameters>[{"name": "arrow_table", "val": ": Table"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "split", "val": ": typing.Optional[datasets.splits.NamedSplit] = None"}, {"name": "indices_table", "val": ": typing.Optional[datasets.table.Table] = None"}, {"name": "fingerprint", "val": ": typing.Optional[str] = None"}]</parameters></docstring> | |
| A Dataset backed by an Arrow table. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>add_column</name><anchor>datasets.Dataset.add_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L6076</source><parameters>[{"name": "name", "val": ": str"}, {"name": "column", "val": ": typing.Union[list, numpy.ndarray]"}, {"name": "new_fingerprint", "val": ": str"}, {"name": "feature", "val": ": typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.LargeList, datasets.features.features.List, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image, datasets.features.video.Video, datasets.features.pdf.Pdf, datasets.features.nifti.Nifti, datasets.features.dicom.Dicom, NoneType] = None"}]</parameters><paramsdesc>- **name** (`str`) -- | |
| Column name. | |
| - **column** (`list` or `np.array`) -- | |
| Column data to be added. | |
| - **feature** (`FeatureType` or `None`, defaults to `None`) -- | |
| Column datatype.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Add column to Dataset. | |
| <Added version="1.7"/> | |
| <ExampleCodeBlock anchor="datasets.Dataset.add_column.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> more_text = ds["text"] | |
| >>> ds = ds.add_column(name="text_2", column=more_text) | |
| >>> ds | |
| Dataset({ | |
| features: ['text', 'label', 'text_2'], | |
| num_rows: 1066 | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>add_item</name><anchor>datasets.Dataset.add_item</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L6334</source><parameters>[{"name": "item", "val": ": dict"}, {"name": "new_fingerprint", "val": ": str"}]</parameters><paramsdesc>- **item** (`dict`) -- | |
| Item data to be added.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Add item to Dataset. | |
| <Added version="1.7"/> | |
| <ExampleCodeBlock anchor="datasets.Dataset.add_item.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> new_review = {'label': 0, 'text': 'this movie is the absolute worst thing I have ever seen'} | |
| >>> ds = ds.add_item(new_review) | |
| >>> ds[-1] | |
| {'label': 0, 'text': 'this movie is the absolute worst thing I have ever seen'} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_file</name><anchor>datasets.Dataset.from_file</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L787</source><parameters>[{"name": "filename", "val": ": str"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "split", "val": ": typing.Optional[datasets.splits.NamedSplit] = None"}, {"name": "indices_filename", "val": ": typing.Optional[str] = None"}, {"name": "in_memory", "val": ": bool = False"}]</parameters><paramsdesc>- **filename** (`str`) -- | |
| File name of the dataset. | |
| - **info** (`DatasetInfo`, *optional*) -- | |
| Dataset information, like description, citation, etc. | |
| - **split** (`NamedSplit`, *optional*) -- | |
| Name of the dataset split. | |
| - **indices_filename** (`str`, *optional*) -- | |
| File name of the indices. | |
| - **in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Instantiate a Dataset backed by an Arrow table at filename. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_buffer</name><anchor>datasets.Dataset.from_buffer</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L827</source><parameters>[{"name": "buffer", "val": ": Buffer"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "split", "val": ": typing.Optional[datasets.splits.NamedSplit] = None"}, {"name": "indices_buffer", "val": ": typing.Optional[pyarrow.lib.Buffer] = None"}]</parameters><paramsdesc>- **buffer** (`pyarrow.Buffer`) -- | |
| Arrow buffer. | |
| - **info** (`DatasetInfo`, *optional*) -- | |
| Dataset information, like description, citation, etc. | |
| - **split** (`NamedSplit`, *optional*) -- | |
| Name of the dataset split. | |
| - **indices_buffer** (`pyarrow.Buffer`, *optional*) -- | |
| Indices Arrow buffer.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Instantiate a Dataset backed by an Arrow buffer. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_pandas</name><anchor>datasets.Dataset.from_pandas</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L859</source><parameters>[{"name": "df", "val": ": DataFrame"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "split", "val": ": typing.Optional[datasets.splits.NamedSplit] = None"}, {"name": "preserve_index", "val": ": typing.Optional[bool] = None"}]</parameters><paramsdesc>- **df** (`pandas.DataFrame`) -- | |
| Dataframe that contains the dataset. | |
| - **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), *optional*) -- | |
| Dataset features. | |
| - **info** (`DatasetInfo`, *optional*) -- | |
| Dataset information, like description, citation, etc. | |
| - **split** (`NamedSplit`, *optional*) -- | |
| Name of the dataset split. | |
| - **preserve_index** (`bool`, *optional*) -- | |
| Whether to store the index as an additional column in the resulting Dataset. | |
| The default of `None` will store the index as a column, except for `RangeIndex` which is stored as metadata only. | |
| Use `preserve_index=True` to force it to be stored as a column.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Convert `pandas.DataFrame` to a `pyarrow.Table` to create a [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset). | |
| The column types in the resulting Arrow Table are inferred from the dtypes of the `pandas.Series` in the | |
| DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the | |
| case of `object`, we need to guess the datatype by looking at the Python objects in this Series. | |
| Be aware that Series of the `object` dtype don't carry enough information to always lead to a meaningful Arrow | |
| type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only | |
| contains `None/nan` objects, the type is set to `null`. This behavior can be avoided by constructing explicit | |
| features and passing them to this function. | |
| Important: a dataset created with `from_pandas()` lives in memory | |
| and therefore doesn't have an associated cache directory. | |
| This may change in the future, but in the meantime, if you | |
| want to reduce memory usage, you should write it to disk | |
| and reload it, e.g. with `save_to_disk` / `load_from_disk`. | |
| <ExampleCodeBlock anchor="datasets.Dataset.from_pandas.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import Dataset | |
| >>> ds = Dataset.from_pandas(df) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_dict</name><anchor>datasets.Dataset.from_dict</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L973</source><parameters>[{"name": "mapping", "val": ": dict"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "split", "val": ": typing.Optional[datasets.splits.NamedSplit] = None"}]</parameters><paramsdesc>- **mapping** (`Mapping`) -- | |
| Mapping of strings to Arrays or Python lists. | |
| - **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), *optional*) -- | |
| Dataset features. | |
| - **info** (`DatasetInfo`, *optional*) -- | |
| Dataset information, like description, citation, etc. | |
| - **split** (`NamedSplit`, *optional*) -- | |
| Name of the dataset split.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Convert `dict` to a `pyarrow.Table` to create a [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset). | |
| Important: a dataset created with `from_dict()` lives in memory | |
| and therefore doesn't have an associated cache directory. | |
| This may change in the future, but in the meantime, if you | |
| want to reduce memory usage, you should write it to disk | |
| and reload it, e.g. with `save_to_disk` / `load_from_disk`. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_generator</name><anchor>datasets.Dataset.from_generator</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1123</source><parameters>[{"name": "generator", "val": ": typing.Callable"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "cache_dir", "val": ": str = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "gen_kwargs", "val": ": typing.Optional[dict] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "split", "val": ": NamedSplit = NamedSplit('train')"}, {"name": "fingerprint", "val": ": typing.Optional[str] = None"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **generator** (`Callable`) -- | |
| A generator function that yields examples. | |
| - **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), *optional*) -- | |
| Dataset features. | |
| - **cache_dir** (`str`, *optional*, defaults to `"~/.cache/huggingface/datasets"`) -- | |
| Directory to cache data. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory. | |
| - **gen_kwargs** (`dict`, *optional*) -- | |
| Keyword arguments to be passed to the `generator` callable. | |
| You can define a sharded dataset by passing the list of shards in `gen_kwargs` and setting `num_proc` greater than 1. | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| Number of processes when downloading and generating the dataset locally. | |
| This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default. | |
| If `num_proc` is greater than one, then all list values in `gen_kwargs` must be the same length. These values will be split between calls to the generator. The number of shards will be the minimum of the shortest list in `gen_kwargs` and `num_proc`. | |
| <Added version="2.7.0"/> | |
| - **split** ([NamedSplit](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.NamedSplit), defaults to `Split.TRAIN`) -- | |
| Split name to be assigned to the dataset. | |
| <Added version="2.21.0"/> | |
| - **fingerprint** (`str`, *optional*) -- | |
| Fingerprint that will be used to generate dataset ID. | |
| By default, `fingerprint` is generated by hashing the generator function and all its arguments, which can be slow | |
| if they include large objects such as AI models. | |
| <Added version="4.3.0"/> | |
| - ****kwargs** (additional keyword arguments) -- | |
| Keyword arguments to be passed to `GeneratorConfig`.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Create a Dataset from a generator. | |
| <ExampleCodeBlock anchor="datasets.Dataset.from_generator.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import Dataset | |
| >>> def gen(): | |
| ...     yield {"text": "Good", "label": 0} | |
| ...     yield {"text": "Bad", "label": 1} | |
| ... | |
| >>> ds = Dataset.from_generator(gen) | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.Dataset.from_generator.example-2"> | |
| ```py | |
| >>> def gen(shards): | |
| ...     for shard in shards: | |
| ...         with open(shard) as f: | |
| ...             for line in f: | |
| ...                 yield {"line": line} | |
| ... | |
| >>> shards = [f"data{i}.txt" for i in range(32)] | |
| >>> ds = Dataset.from_generator(gen, gen_kwargs={"shards": shards}) | |
| ``` | |
| </ExampleCodeBlock> | |
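The shard-count rule stated for `num_proc` (the minimum of the shortest list in `gen_kwargs` and `num_proc`) can be restated in plain Python. `expected_num_shards` below is a hypothetical helper written for illustration; it is not part of `datasets`, and returning 1 when `gen_kwargs` has no list values is an assumption.

```python
def expected_num_shards(gen_kwargs: dict, num_proc: int) -> int:
    """Hypothetical helper restating the documented rule; not part of `datasets`."""
    list_lengths = [len(value) for value in gen_kwargs.values() if isinstance(value, list)]
    if not list_lengths:
        # Assumption: with no list values there is nothing to split between calls.
        return 1
    if len(set(list_lengths)) > 1:
        raise ValueError("all list values in gen_kwargs must have the same length")
    return min(min(list_lengths), num_proc)


shards = [f"data{i}.txt" for i in range(32)]
print(expected_num_shards({"shards": shards}, num_proc=8))      # 8
print(expected_num_shards({"shards": shards[:4]}, num_proc=8))  # 4
```

With 32 shard files and `num_proc=8`, each process would receive a subset of the file list; with only 4 files, asking for 8 processes would not yield more than 4 shards.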
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>data</name><anchor>datasets.Dataset.data</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1825</source><parameters>[]</parameters></docstring> | |
| The Apache Arrow table backing the dataset. | |
| <ExampleCodeBlock anchor="datasets.Dataset.data.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds.data | |
| MemoryMappedTable | |
| text: string | |
| label: int64 | |
| ---- | |
| text: [["compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .","the soundtrack alone is worth the price of admission .","rodriguez does a splendid job of racial profiling hollywood style--casting excellent latin actors of all ages--a trend long overdue .","beneath the film's obvious determination to shock at any cost lies considerable skill and determination , backed by sheer nerve .","bielinsky is a filmmaker of impressive talent .","so beautifully acted and directed , it's clear that washington most certainly has a new career ahead of him if he so chooses .","a visual spectacle full of stunning images and effects .","a gentle and engrossing character study .","it's enough to watch huppert scheming , with her small , intelligent eyes as steady as any noir villain , and to enjoy the perfectly pitched web of tension that chabrol spins .","an engrossing portrait of uncompromising artists trying to create something original against the backdrop of a corporate music industry that only seems to care about the bottom line .",...,"ultimately , jane learns her place as a girl , softens up and loses some of the intensity that made her an interesting character to begin with .","ah-nuld's action hero days might be over .","it's clear why deuces wild , which was shot two years ago , has been gathering dust on mgm's shelf .","feels like nothing quite so much as a middle-aged moviemaker's attempt to surround himself with beautiful , half-naked women .","when the precise nature of matthew's predicament finally comes into sharp focus , the revelation fails to justify the build-up .","this picture is murder by numbers , and as easy to be bored by as your abc's , despite a few whopping shootouts .","hilarious musical comedy though stymied by accents thick as mud .","if you are into splatter movies , then you will probably have a reasonably good time with the salton sea .","a dull , simple-minded and stereotypical tale of drugs , death and mind-numbing indifference on the inner-city streets .","the feature-length stretch . . . strains the show's concept ."]] | |
| label: [[1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0]] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cache_files</name><anchor>datasets.Dataset.cache_files</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1845</source><parameters>[]</parameters></docstring> | |
| The cache files containing the Apache Arrow table backing the dataset. | |
| <ExampleCodeBlock anchor="datasets.Dataset.cache_files.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds.cache_files | |
| [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-validation.arrow'}] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>num_columns</name><anchor>datasets.Dataset.num_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1863</source><parameters>[]</parameters></docstring> | |
| Number of columns in the dataset. | |
| <ExampleCodeBlock anchor="datasets.Dataset.num_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds.num_columns | |
| 2 | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>num_rows</name><anchor>datasets.Dataset.num_rows</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1878</source><parameters>[]</parameters></docstring> | |
| Number of rows in the dataset (same as [Dataset.__len__()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.__len__)). | |
| <ExampleCodeBlock anchor="datasets.Dataset.num_rows.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds.num_rows | |
| 1066 | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>column_names</name><anchor>datasets.Dataset.column_names</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1895</source><parameters>[]</parameters></docstring> | |
| Names of the columns in the dataset. | |
| <ExampleCodeBlock anchor="datasets.Dataset.column_names.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds.column_names | |
| ['text', 'label'] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>shape</name><anchor>datasets.Dataset.shape</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1910</source><parameters>[]</parameters></docstring> | |
| Shape of the dataset (number of rows, number of columns). | |
| <ExampleCodeBlock anchor="datasets.Dataset.shape.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds.shape | |
| (1066, 2) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>unique</name><anchor>datasets.Dataset.unique</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1927</source><parameters>[{"name": "column", "val": ": str"}]</parameters><paramsdesc>- **column** (`str`) -- | |
| Column name (list all the column names with [column_names](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.column_names)).</paramsdesc><paramgroups>0</paramgroups><rettype>`list`</rettype><retdesc>List of unique elements in the given column.</retdesc></docstring> | |
| Return a list of the unique elements in a column. | |
| This is implemented in the low-level backend and is therefore very fast. | |
| <ExampleCodeBlock anchor="datasets.Dataset.unique.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds.unique('label') | |
| [1, 0] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>flatten</name><anchor>datasets.Dataset.flatten</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2033</source><parameters>[{"name": "new_fingerprint", "val": ": typing.Optional[str] = None"}, {"name": "max_depth", "val": " = 16"}]</parameters><paramsdesc>- **new_fingerprint** (`str`, *optional*) -- | |
| The new fingerprint of the dataset after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</rettype><retdesc>A copy of the dataset with flattened columns.</retdesc></docstring> | |
| Flatten the table. | |
| Each column with a struct type is flattened into one column per struct field. | |
| Other columns are left unchanged. | |
| <ExampleCodeBlock anchor="datasets.Dataset.flatten.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("rajpurkar/squad", split="train") | |
| >>> ds.features | |
| {'id': Value('string'), | |
| 'title': Value('string'), | |
| 'context': Value('string'), | |
| 'question': Value('string'), | |
| 'answers': {'text': List(Value('string')), | |
| 'answer_start': List(Value('int32'))}} | |
| >>> ds = ds.flatten() | |
| >>> ds | |
| Dataset({ | |
| features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'], | |
| num_rows: 87599 | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast</name><anchor>datasets.Dataset.cast</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2080</source><parameters>[{"name": "features", "val": ": Features"}, {"name": "batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "load_from_cache_file", "val": ": typing.Optional[bool] = None"}, {"name": "cache_file_name", "val": ": typing.Optional[str] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}]</parameters><paramsdesc>- **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features)) -- | |
| New features to cast the dataset to. | |
| The name of the fields in the features must match the current column names. | |
| The type of the data must also be convertible from one type to the other. | |
| For non-trivial conversion, e.g. `str` <-> `ClassLabel` you should use [map()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.map) to update the Dataset. | |
| - **batch_size** (`int`, defaults to `1000`) -- | |
| Number of examples per batch provided to cast. | |
| If `batch_size <= 0` or `batch_size == None` then provide the full dataset as a single batch to cast. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory. | |
| - **load_from_cache_file** (`bool`, defaults to `True` if caching is enabled) -- | |
| If a cache file storing the result of the current computation | |
| can be identified, use it instead of recomputing. | |
| - **cache_file_name** (`str`, *optional*, defaults to `None`) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| results of the computation instead of the automatically generated cache file name. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| Number of rows per write operation for the cache file writer. | |
| This value is a good trade-off between memory usage during the processing, and processing speed. | |
| A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running [map()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.map). | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| Number of processes for multiprocessing. By default it doesn't | |
| use multiprocessing.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</rettype><retdesc>A copy of the dataset with cast features.</retdesc></docstring> | |
| Cast the dataset to a new set of features. | |
| <ExampleCodeBlock anchor="datasets.Dataset.cast.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset, ClassLabel, Value | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds.features | |
| {'label': ClassLabel(names=['neg', 'pos']), | |
| 'text': Value('string')} | |
| >>> new_features = ds.features.copy() | |
| >>> new_features['label'] = ClassLabel(names=['bad', 'good']) | |
| >>> new_features['text'] = Value('large_string') | |
| >>> ds = ds.cast(new_features) | |
| >>> ds.features | |
| {'label': ClassLabel(names=['bad', 'good']), | |
| 'text': Value('large_string')} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast_column</name><anchor>datasets.Dataset.cast_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2164</source><parameters>[{"name": "column", "val": ": str"}, {"name": "feature", "val": ": typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.LargeList, datasets.features.features.List, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image, datasets.features.video.Video, datasets.features.pdf.Pdf, datasets.features.nifti.Nifti, datasets.features.dicom.Dicom]"}, {"name": "new_fingerprint", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **column** (`str`) -- | |
| Column name. | |
| - **feature** (`FeatureType`) -- | |
| Target feature. | |
| - **new_fingerprint** (`str`, *optional*) -- | |
| The new fingerprint of the dataset after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Cast column to feature for decoding. | |
| <ExampleCodeBlock anchor="datasets.Dataset.cast_column.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset, ClassLabel | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds.features | |
| {'label': ClassLabel(names=['neg', 'pos']), | |
| 'text': Value('string')} | |
| >>> ds = ds.cast_column('label', ClassLabel(names=['bad', 'good'])) | |
| >>> ds.features | |
| {'label': ClassLabel(names=['bad', 'good']), | |
| 'text': Value('string')} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>remove_columns</name><anchor>datasets.Dataset.remove_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2207</source><parameters>[{"name": "column_names", "val": ": typing.Union[str, list[str]]"}, {"name": "new_fingerprint", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **column_names** (`Union[str, List[str]]`) -- | |
| Name of the column(s) to remove. | |
| - **new_fingerprint** (`str`, *optional*) -- | |
| The new fingerprint of the dataset after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</rettype><retdesc>A copy of the dataset object without the columns to remove.</retdesc></docstring> | |
| Remove one or several column(s) in the dataset and the features associated to them. | |
| You can also remove a column using [map()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.map) with `remove_columns` but the present method | |
| doesn't copy the data of the remaining columns and is thus faster. | |
| <ExampleCodeBlock anchor="datasets.Dataset.remove_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds = ds.remove_columns('label') | |
| >>> ds | |
| Dataset({ | |
| features: ['text'], | |
| num_rows: 1066 | |
| }) | |
| >>> ds = ds.remove_columns(column_names=ds.column_names) # Removing all the columns returns an empty dataset with the `num_rows` property set to 0 | |
| >>> ds | |
| Dataset({ | |
| features: [], | |
| num_rows: 0 | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>rename_column</name><anchor>datasets.Dataset.rename_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2262</source><parameters>[{"name": "original_column_name", "val": ": str"}, {"name": "new_column_name", "val": ": str"}, {"name": "new_fingerprint", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **original_column_name** (`str`) -- | |
| Name of the column to rename. | |
| - **new_column_name** (`str`) -- | |
| New name for the column. | |
| - **new_fingerprint** (`str`, *optional*) -- | |
| The new fingerprint of the dataset after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</rettype><retdesc>A copy of the dataset with a renamed column.</retdesc></docstring> | |
| Rename a column in the dataset, and move the features associated to the original column under the new column | |
| name. | |
| <ExampleCodeBlock anchor="datasets.Dataset.rename_column.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds = ds.rename_column('label', 'label_new') | |
| >>> ds | |
| Dataset({ | |
| features: ['text', 'label_new'], | |
| num_rows: 1066 | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>rename_columns</name><anchor>datasets.Dataset.rename_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2328</source><parameters>[{"name": "column_mapping", "val": ": dict"}, {"name": "new_fingerprint", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **column_mapping** (`Dict[str, str]`) -- | |
| A mapping of columns to rename to their new names. | |
| - **new_fingerprint** (`str`, *optional*) -- | |
| The new fingerprint of the dataset after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</rettype><retdesc>A copy of the dataset with renamed columns</retdesc></docstring> | |
| Rename several columns in the dataset, and move the features associated to the original columns under | |
| the new column names. | |
| <ExampleCodeBlock anchor="datasets.Dataset.rename_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds = ds.rename_columns({'text': 'text_new', 'label': 'label_new'}) | |
| >>> ds | |
| Dataset({ | |
| features: ['text_new', 'label_new'], | |
| num_rows: 1066 | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>select_columns</name><anchor>datasets.Dataset.select_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2395</source><parameters>[{"name": "column_names", "val": ": typing.Union[str, list[str]]"}, {"name": "new_fingerprint", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **column_names** (`Union[str, List[str]]`) -- | |
| Name of the column(s) to keep. | |
| - **new_fingerprint** (`str`, *optional*) -- | |
| The new fingerprint of the dataset after transform. If `None`, | |
| the new fingerprint is computed using a hash of the previous | |
| fingerprint, and the transform arguments.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</rettype><retdesc>A copy of the dataset object which only consists of | |
| selected columns.</retdesc></docstring> | |
| Select one or several column(s) in the dataset and the features | |
| associated to them. | |
| <ExampleCodeBlock anchor="datasets.Dataset.select_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds = ds.select_columns(['text']) | |
| >>> ds | |
| Dataset({ | |
| features: ['text'], | |
| num_rows: 1066 | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class_encode_column</name><anchor>datasets.Dataset.class_encode_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1958</source><parameters>[{"name": "column", "val": ": str"}, {"name": "include_nulls", "val": ": bool = False"}]</parameters><paramsdesc>- **column** (`str`) -- | |
| The name of the column to cast (list all the column names with [column_names](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.column_names)) | |
| - **include_nulls** (`bool`, defaults to `False`) -- | |
| Whether to include null values in the class labels. If `True`, the null values will be encoded as the `"None"` class label. | |
| <Added version="1.14.2"/></paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Casts the given column as [ClassLabel](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.ClassLabel) and updates the table. | |
| <ExampleCodeBlock anchor="datasets.Dataset.class_encode_column.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("boolq", split="validation") | |
| >>> ds.features | |
| {'answer': Value('bool'), | |
| 'passage': Value('string'), | |
| 'question': Value('string')} | |
| >>> ds = ds.class_encode_column('answer') | |
| >>> ds.features | |
| {'answer': ClassLabel(num_classes=2, names=['False', 'True']), | |
| 'passage': Value('string'), | |
| 'question': Value('string')} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>__len__</name><anchor>datasets.Dataset.__len__</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2451</source><parameters>[]</parameters></docstring> | |
| Number of rows in the dataset. | |
| <ExampleCodeBlock anchor="datasets.Dataset.__len__.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> len(ds) | |
| 1066 | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>__iter__</name><anchor>datasets.Dataset.__iter__</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2468</source><parameters>[]</parameters></docstring> | |
| Iterate through the examples. | |
| If a formatting is set with [Dataset.set_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.set_format) rows will be returned with the | |
| selected format. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>iter</name><anchor>datasets.Dataset.iter</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2497</source><parameters>[{"name": "batch_size", "val": ": int"}, {"name": "drop_last_batch", "val": ": bool = False"}]</parameters><paramsdesc>- **batch_size** (`int`) -- size of each batch to yield. | |
| - **drop_last_batch** (`bool`, defaults to `False`) -- Whether a last batch smaller than `batch_size` should be | |
| dropped.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Iterate through the batches of size `batch_size`. | |
| If a formatting is set with [Dataset.set_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.set_format) rows will be returned with the | |
| selected format. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>formatted_as</name><anchor>datasets.Dataset.formatted_as</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2541</source><parameters>[{"name": "type", "val": ": typing.Optional[str] = None"}, {"name": "columns", "val": ": typing.Optional[list] = None"}, {"name": "output_all_columns", "val": ": bool = False"}, {"name": "**format_kwargs", "val": ""}]</parameters><paramsdesc>- **type** (`str`, *optional*) -- | |
| Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. | |
| `None` means `__getitem__` returns python objects (default). | |
| - **columns** (`List[str]`, *optional*) -- | |
| Columns to format in the output. | |
| `None` means `__getitem__` returns all columns (default). | |
| - **output_all_columns** (`bool`, defaults to `False`) -- | |
| Keep un-formatted columns as well in the output (as python objects). | |
| - **\*\*format_kwargs** (additional keyword arguments) -- | |
| Keyword arguments passed to the convert function like `np.array`, `torch.tensor` or `tensorflow.ragged.constant`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| To be used in a `with` statement. Set `__getitem__` return format (type and columns). | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>set_format</name><anchor>datasets.Dataset.set_format</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2573</source><parameters>[{"name": "type", "val": ": typing.Optional[str] = None"}, {"name": "columns", "val": ": typing.Optional[list] = None"}, {"name": "output_all_columns", "val": ": bool = False"}, {"name": "**format_kwargs", "val": ""}]</parameters><paramsdesc>- **type** (`str`, *optional*) -- | |
| Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. | |
| `None` means `__getitem__` returns python objects (default). | |
| - **columns** (`List[str]`, *optional*) -- | |
| Columns to format in the output. | |
| `None` means `__getitem__` returns all columns (default). | |
| - **output_all_columns** (`bool`, defaults to `False`) -- | |
| Keep un-formatted columns as well in the output (as python objects). | |
| - **\*\*format_kwargs** (additional keyword arguments) -- | |
| Keyword arguments passed to the convert function like `np.array`, `torch.tensor` or `tensorflow.ragged.constant`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Set `__getitem__` return format (type and columns). The data formatting is applied on-the-fly. | |
| The format `type` (for example "numpy") is used to format batches when using `__getitem__`. | |
| It's also possible to use custom transforms for formatting using [set_transform()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.set_transform). | |
| It is possible to call [map()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.map) after calling `set_format`. Since `map` may add new columns, the list of formatted columns gets updated. In this case, if you apply `map` on a dataset to add a new column, then this column will be formatted as: | |
| <ExampleCodeBlock anchor="datasets.Dataset.set_format.example"> | |
| ``` | |
| new formatted columns = (all columns - previously unformatted columns) | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.Dataset.set_format.example-2"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoTokenizer | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") | |
| >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) | |
| >>> ds.set_format(type='numpy', columns=['text', 'label']) | |
| >>> ds.format | |
| {'type': 'numpy', | |
| 'format_kwargs': {}, | |
| 'columns': ['text', 'label'], | |
| 'output_all_columns': False} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>set_transform</name><anchor>datasets.Dataset.set_transform</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2681</source><parameters>[{"name": "transform", "val": ": typing.Optional[typing.Callable]"}, {"name": "columns", "val": ": typing.Optional[list] = None"}, {"name": "output_all_columns", "val": ": bool = False"}]</parameters><paramsdesc>- **transform** (`Callable`, *optional*) -- | |
| User-defined formatting transform, replaces the format defined by [set_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.set_format). | |
| A formatting function is a callable that takes a batch (as a `dict`) as input and returns a batch. | |
| This function is applied right before returning the objects in `__getitem__`. | |
| - **columns** (`List[str]`, *optional*) -- | |
| Columns to format in the output. | |
| If specified, then the input batch of the transform only contains those columns. | |
| - **output_all_columns** (`bool`, defaults to `False`) -- | |
| Keep un-formatted columns as well in the output (as python objects). | |
| If set to True, then the other un-formatted columns are kept with the output of the transform.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Set `__getitem__` return format using this transform. The transform is applied on-the-fly on batches when `__getitem__` is called. | |
| As with [set_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.set_format), this can be reset using [reset_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.reset_format). | |
| <ExampleCodeBlock anchor="datasets.Dataset.set_transform.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoTokenizer | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') | |
| >>> def encode(batch): | |
| ...     return tokenizer(batch['text'], padding=True, truncation=True, return_tensors='pt') | |
| >>> ds.set_transform(encode) | |
| >>> ds[0] | |
| {'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1, 1]), | |
| 'input_ids': tensor([ 101, 29353, 2135, 15102, 1996, 9428, 20868, 2890, 8663, 6895, | |
| 20470, 2571, 3663, 2090, 4603, 3017, 3008, 1998, 2037, 24211, | |
| 5637, 1998, 11690, 2336, 1012, 102]), | |
| 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0])} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>reset_format</name><anchor>datasets.Dataset.reset_format</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2652</source><parameters>[]</parameters></docstring> | |
| Reset `__getitem__` return format to python objects and all columns. | |
| Same as `self.set_format()`. | |
| <ExampleCodeBlock anchor="datasets.Dataset.reset_format.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoTokenizer | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") | |
| >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) | |
| >>> ds.set_format(type='numpy', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']) | |
| >>> ds.format | |
| {'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'], | |
| 'format_kwargs': {}, | |
| 'output_all_columns': False, | |
| 'type': 'numpy'} | |
| >>> ds.reset_format() | |
| >>> ds.format | |
| {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], | |
| 'format_kwargs': {}, | |
| 'output_all_columns': False, | |
| 'type': None} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>with_format</name><anchor>datasets.Dataset.with_format</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2724</source><parameters>[{"name": "type", "val": ": typing.Optional[str] = None"}, {"name": "columns", "val": ": typing.Optional[list] = None"}, {"name": "output_all_columns", "val": ": bool = False"}, {"name": "**format_kwargs", "val": ""}]</parameters><paramsdesc>- **type** (`str`, *optional*) -- | |
| Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. | |
| `None` means `__getitem__` returns python objects (default). | |
| - **columns** (`List[str]`, *optional*) -- | |
| Columns to format in the output. | |
| `None` means `__getitem__` returns all columns (default). | |
| - **output_all_columns** (`bool`, defaults to `False`) -- | |
| Keep un-formatted columns as well in the output (as python objects). | |
| - **\*\*format_kwargs** (additional keyword arguments) -- | |
| Keyword arguments passed to the convert function like `np.array`, `torch.tensor` or `tensorflow.ragged.constant`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Set `__getitem__` return format (type and columns). The data formatting is applied on-the-fly. | |
| The format `type` (for example "numpy") is used to format batches when using `__getitem__`. | |
| It's also possible to use custom transforms for formatting using [with_transform()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.with_transform). | |
| Contrary to [set_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.set_format), `with_format` returns a new [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) object. | |
| <ExampleCodeBlock anchor="datasets.Dataset.with_format.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoTokenizer | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") | |
| >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) | |
| >>> ds.format | |
| {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], | |
| 'format_kwargs': {}, | |
| 'output_all_columns': False, | |
| 'type': None} | |
| >>> ds = ds.with_format("torch") | |
| >>> ds.format | |
| {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], | |
| 'format_kwargs': {}, | |
| 'output_all_columns': False, | |
| 'type': 'torch'} | |
| >>> ds[0] | |
| {'text': 'compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .', | |
| 'label': tensor(1), | |
| 'input_ids': tensor([ 101, 18027, 16310, 16001, 1103, 9321, 178, 11604, 7235, 6617, | |
| 1742, 2165, 2820, 1206, 6588, 22572, 12937, 1811, 2153, 1105, | |
| 1147, 12890, 19587, 6463, 1105, 15026, 1482, 119, 102, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0]), | |
| 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), | |
| 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>with_transform</name><anchor>datasets.Dataset.with_transform</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2795</source><parameters>[{"name": "transform", "val": ": typing.Optional[typing.Callable]"}, {"name": "columns", "val": ": typing.Optional[list] = None"}, {"name": "output_all_columns", "val": ": bool = False"}]</parameters><paramsdesc>- **transform** (`Callable`, *optional*) -- | |
| User-defined formatting transform, replaces the format defined by [set_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.set_format). | |
| A formatting function is a callable that takes a batch (as a `dict`) as input and returns a batch. | |
| This function is applied right before returning the objects in `__getitem__`. | |
| - **columns** (`List[str]`, *optional*) -- | |
| Columns to format in the output. | |
| If specified, then the input batch of the transform only contains those columns. | |
| - **output_all_columns** (`bool`, defaults to `False`) -- | |
| Keep un-formatted columns as well in the output (as python objects). | |
| If set to `True`, then the other un-formatted columns are kept with the output of the transform.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Set `__getitem__` return format using this transform. The transform is applied on-the-fly on batches when `__getitem__` is called. | |
| As with [set_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.set_format), this can be reset using [reset_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.reset_format). | |
| Contrary to [set_transform()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.set_transform), `with_transform` returns a new [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) object. | |
| <ExampleCodeBlock anchor="datasets.Dataset.with_transform.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoTokenizer | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") | |
| >>> def encode(example): | |
| ...     return tokenizer(example["text"], padding=True, truncation=True, return_tensors='pt') | |
| >>> ds = ds.with_transform(encode) | |
| >>> ds[0] | |
| {'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1, 1, 1, 1, 1]), | |
| 'input_ids': tensor([ 101, 18027, 16310, 16001, 1103, 9321, 178, 11604, 7235, 6617, | |
| 1742, 2165, 2820, 1206, 6588, 22572, 12937, 1811, 2153, 1105, | |
| 1147, 12890, 19587, 6463, 1105, 15026, 1482, 119, 102]), | |
| 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0])} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>__getitem__</name><anchor>datasets.Dataset.__getitem__</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2871</source><parameters>[{"name": "key", "val": ""}]</parameters></docstring> | |
| Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools). | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cleanup_cache_files</name><anchor>datasets.Dataset.cleanup_cache_files</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2884</source><parameters>[]</parameters><rettype>`int`</rettype><retdesc>Number of removed files.</retdesc></docstring> | |
| Clean up all cache files in the dataset cache directory, except the currently used cache file if there is | |
| one. | |
| Be careful when running this command that no other process is currently using other cache files. | |
| <ExampleCodeBlock anchor="datasets.Dataset.cleanup_cache_files.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds.cleanup_cache_files() | |
| 10 | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>map</name><anchor>datasets.Dataset.map</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L2931</source><parameters>[{"name": "function", "val": ": typing.Optional[typing.Callable] = None"}, {"name": "with_indices", "val": ": bool = False"}, {"name": "with_rank", "val": ": bool = False"}, {"name": "input_columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "batched", "val": ": bool = False"}, {"name": "batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "drop_last_batch", "val": ": bool = False"}, {"name": "remove_columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "load_from_cache_file", "val": ": typing.Optional[bool] = None"}, {"name": "cache_file_name", "val": ": typing.Optional[str] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "disable_nullable", "val": ": bool = False"}, {"name": "fn_kwargs", "val": ": typing.Optional[dict] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "suffix_template", "val": ": str = '_{rank:05d}_of_{num_proc:05d}'"}, {"name": "new_fingerprint", "val": ": typing.Optional[str] = None"}, {"name": "desc", "val": ": typing.Optional[str] = None"}, {"name": "try_original_type", "val": ": typing.Optional[bool] = True"}]</parameters><paramsdesc>- **function** (`Callable`) -- Function with one of the following signatures: | |
| - `function(example: Dict[str, Any]) -> Dict[str, Any]` if `batched=False` and `with_indices=False` and `with_rank=False` | |
| - `function(example: Dict[str, Any], *extra_args) -> Dict[str, Any]` if `batched=False` and `with_indices=True` and/or `with_rank=True` (one extra arg for each) | |
| - `function(batch: Dict[str, List]) -> Dict[str, List]` if `batched=True` and `with_indices=False` and `with_rank=False` | |
| - `function(batch: Dict[str, List], *extra_args) -> Dict[str, List]` if `batched=True` and `with_indices=True` and/or `with_rank=True` (one extra arg for each) | |
| For advanced usage, the function can also return a `pyarrow.Table`. | |
| If the function is asynchronous, then `map` will run your function in parallel. | |
| Moreover if your function returns nothing (`None`), then `map` will run your function and return the dataset unchanged. | |
| If no function is provided, default to identity function: `lambda x: x`. | |
| - **with_indices** (`bool`, defaults to `False`) -- | |
| Provide example indices to `function`. Note that in this case the | |
| signature of `function` should be `def function(example, idx[, rank]): ...`. | |
| - **with_rank** (`bool`, defaults to `False`) -- | |
| Provide process rank to `function`. Note that in this case the | |
| signature of `function` should be `def function(example[, idx], rank): ...`. | |
| - **input_columns** (`Optional[Union[str, List[str]]]`, defaults to `None`) -- | |
| The columns to be passed into `function` | |
| as positional arguments. If `None`, a `dict` mapping to all formatted columns is passed as one argument. | |
| - **batched** (`bool`, defaults to `False`) -- | |
| Provide batch of examples to `function`. | |
| - **batch_size** (`int`, *optional*, defaults to `1000`) -- | |
| Number of examples per batch provided to `function` if `batched=True`. | |
| If `batch_size <= 0` or `batch_size == None`, provide the full dataset as a single batch to `function`. | |
| - **drop_last_batch** (`bool`, defaults to `False`) -- | |
| Whether a last batch smaller than the batch_size should be | |
| dropped instead of being processed by the function. | |
| - **remove_columns** (`Optional[Union[str, List[str]]]`, defaults to `None`) -- | |
| Remove a selection of columns while doing the mapping. | |
| Columns will be removed before updating the examples with the output of `function`, i.e. if `function` is adding | |
| columns with names in `remove_columns`, these columns will be kept. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Keep the dataset in memory instead of writing it to a cache file. | |
| - **load_from_cache_file** (`Optional[bool]`, defaults to `True` if caching is enabled) -- | |
| If a cache file storing the current computation from `function` | |
| can be identified, use it instead of recomputing. | |
| - **cache_file_name** (`str`, *optional*, defaults to `None`) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| results of the computation instead of the automatically generated cache file name. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| Number of rows per write operation for the cache file writer. | |
| This value is a good trade-off between memory usage during the processing, and processing speed. | |
| Higher values make the processing do fewer lookups; lower values consume less temporary memory while running `map`. | |
| - **features** (`Optional[datasets.Features]`, defaults to `None`) -- | |
| Use a specific Features to store the cache file | |
| instead of the automatically generated one. | |
| - **disable_nullable** (`bool`, defaults to `False`) -- | |
| Disallow null values in the table. | |
| - **fn_kwargs** (`Dict`, *optional*, defaults to `None`) -- | |
| Keyword arguments to be passed to `function`. | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| The number of processes to use for multiprocessing. | |
| - If `None` or `0`, no multiprocessing is used and the operation runs in the main process. | |
| - If greater than `1`, one or multiple worker processes are used to process data in parallel. | |
| Note: The function passed to `map()` must be picklable for multiprocessing to work correctly | |
| (i.e., prefer functions defined at the top level of a module, not inside another function or class). | |
| - **suffix_template** (`str`, defaults to `"_{rank:05d}_of_{num_proc:05d}"`) -- | |
| If `cache_file_name` is specified, then this suffix | |
| will be added at the end of the base name of each cache file. For example, if `cache_file_name` is `"processed.arrow"`, then for | |
| `rank=1` and `num_proc=4`, the resulting file would be `"processed_00001_of_00004.arrow"` for the default suffix. | |
| - **new_fingerprint** (`str`, *optional*, defaults to `None`) -- | |
| The new fingerprint of the dataset after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments. | |
| - **desc** (`str`, *optional*, defaults to `None`) -- | |
| Meaningful description to be displayed alongside the progress bar while mapping examples. | |
| - **try_original_type** (`Optional[bool]`, defaults to `True`) -- | |
| Try to keep the types of the original columns (e.g. int32 -> int32). | |
| Set to False if you want to always infer new types.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Apply a function to all the examples in the table (individually or in batches) and update the table. | |
| If your function returns a column that already exists, then it overwrites it. | |
| You can specify whether the function should be batched or not with the `batched` parameter: | |
| - If batched is `False`, then the function takes 1 example in and should return 1 example. | |
| An example is a dictionary, e.g. `{"text": "Hello there !"}`. | |
| - If batched is `True` and `batch_size` is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. | |
| A batch is a dictionary, e.g. a batch of 1 example is `{"text": ["Hello there !"]}`. | |
| - If batched is `True` and `batch_size` is `n > 1`, then the function takes a batch of `n` examples as input and can return a batch with `n` examples, or with an arbitrary number of examples. | |
| Note that the last batch may have fewer than `n` examples. | |
| A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`. | |
| If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simultaneous calls. | |
| It is recommended to use an `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time. | |
| <ExampleCodeBlock anchor="datasets.Dataset.map.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> def add_prefix(example): | |
| ...     example["text"] = "Review: " + example["text"] | |
| ...     return example | |
| >>> ds = ds.map(add_prefix) | |
| >>> ds[0:3]["text"] | |
| ['Review: compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .', | |
| 'Review: the soundtrack alone is worth the price of admission .', | |
| 'Review: rodriguez does a splendid job of racial profiling hollywood style--casting excellent latin actors of all ages--a trend long overdue .'] | |
| # process a batch of examples (assumes a `tokenizer` is already defined, e.g. with `transformers.AutoTokenizer.from_pretrained`) | |
| >>> ds = ds.map(lambda example: tokenizer(example["text"]), batched=True) | |
| # set number of processors | |
| >>> ds = ds.map(add_prefix, num_proc=4) | |
| ``` | |
| </ExampleCodeBlock> | |
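The naming rule described for `suffix_template` can be reproduced with plain string formatting. The helper below, `shard_cache_name`, is a hypothetical function written for this illustration; it is not the library's implementation and only mirrors the documented example.

```python
from pathlib import Path

DEFAULT_SUFFIX_TEMPLATE = "_{rank:05d}_of_{num_proc:05d}"

def shard_cache_name(cache_file_name, rank, num_proc,
                     suffix_template=DEFAULT_SUFFIX_TEMPLATE):
    # Hypothetical helper: insert the formatted suffix between the base name
    # and the file extension, as in the documented example.
    path = Path(cache_file_name)
    suffix = suffix_template.format(rank=rank, num_proc=num_proc)
    return str(path.with_name(path.stem + suffix + path.suffix))

name = shard_cache_name("processed.arrow", rank=1, num_proc=4)
```

Here `name` matches the file name given in the parameter description for `rank=1` and `num_proc=4`.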
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>filter</name><anchor>datasets.Dataset.filter</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L3809</source><parameters>[{"name": "function", "val": ": typing.Optional[typing.Callable] = None"}, {"name": "with_indices", "val": ": bool = False"}, {"name": "with_rank", "val": ": bool = False"}, {"name": "input_columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "batched", "val": ": bool = False"}, {"name": "batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "load_from_cache_file", "val": ": typing.Optional[bool] = None"}, {"name": "cache_file_name", "val": ": typing.Optional[str] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "fn_kwargs", "val": ": typing.Optional[dict] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "suffix_template", "val": ": str = '_{rank:05d}_of_{num_proc:05d}'"}, {"name": "new_fingerprint", "val": ": typing.Optional[str] = None"}, {"name": "desc", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **function** (`Callable`) -- Callable with one of the following signatures: | |
| - `function(example: Dict[str, Any]) -> bool` if `batched=False` and `with_indices=False` and `with_rank=False` | |
| - `function(example: Dict[str, Any], *extra_args) -> bool` if `batched=False` and `with_indices=True` and/or `with_rank=True` (one extra arg for each) | |
| - `function(batch: Dict[str, List]) -> List[bool]` if `batched=True` and `with_indices=False` and `with_rank=False` | |
| - `function(batch: Dict[str, List], *extra_args) -> List[bool]` if `batched=True` and `with_indices=True` and/or `with_rank=True` (one extra arg for each) | |
| If the function is asynchronous, then `filter` will run your function in parallel. | |
| If no function is provided, defaults to an always `True` function: `lambda x: True`. | |
| - **with_indices** (`bool`, defaults to `False`) -- | |
| Provide example indices to `function`. Note that in this case the | |
| signature of `function` should be `def function(example, idx[, rank]): ...`. | |
| - **with_rank** (`bool`, defaults to `False`) -- | |
| Provide process rank to `function`. Note that in this case the | |
| signature of `function` should be `def function(example[, idx], rank): ...`. | |
| - **input_columns** (`str` or `List[str]`, *optional*) -- | |
| The columns to be passed into `function` as | |
| positional arguments. If `None`, a `dict` mapping to all formatted columns is passed as one argument. | |
| - **batched** (`bool`, defaults to `False`) -- | |
| Provide batch of examples to `function`. | |
| - **batch_size** (`int`, *optional*, defaults to `1000`) -- | |
| Number of examples per batch provided to `function` if | |
| `batched = True`. If `batched = False`, one example per batch is passed to `function`. | |
| If `batch_size <= 0` or `batch_size == None`, provide the full dataset as a single batch to `function`. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Keep the dataset in memory instead of writing it to a cache file. | |
| - **load_from_cache_file** (`Optional[bool]`, defaults to `True` if caching is enabled) -- | |
| If a cache file storing the current computation from `function` | |
| can be identified, use it instead of recomputing. | |
| - **cache_file_name** (`str`, *optional*) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| results of the computation instead of the automatically generated cache file name. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| Number of rows per write operation for the cache file writer. | |
| This value is a good trade-off between memory usage during the processing, and processing speed. | |
| Higher values make the processing do fewer lookups; lower values consume less temporary memory while running `filter`. | |
| - **fn_kwargs** (`dict`, *optional*) -- | |
| Keyword arguments to be passed to `function`. | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| The number of processes to use for multiprocessing. | |
| - If `None` or `0`, no multiprocessing is used and the operation runs in the main process. | |
| - If greater than `1`, one or multiple worker processes are used to process data in parallel. | |
| Note: The function passed to `filter()` must be picklable for multiprocessing to work correctly | |
| (i.e., prefer functions defined at the top level of a module, not inside another function or class). | |
| - **suffix_template** (`str`) -- | |
| If `cache_file_name` is specified, then this suffix will be added at the end of the base name of each. | |
| For example, if `cache_file_name` is `"processed.arrow"`, then for `rank = 1` and `num_proc = 4`, | |
| the resulting file would be `"processed_00001_of_00004.arrow"` for the default suffix (default | |
| `_{rank:05d}_of_{num_proc:05d}`). | |
| - **new_fingerprint** (`str`, *optional*) -- | |
| The new fingerprint of the dataset after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments. | |
| - **desc** (`str`, *optional*, defaults to `None`) -- | |
| Meaningful description to be displayed alongside the progress bar while filtering examples.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Apply a filter function to all the elements in the table in batches | |
| and update the table so that the dataset only includes the examples for which the filter function returns `True`. | |
| If the function is asynchronous, then `filter` will run your function in parallel, with up to one thousand simultaneous calls (configurable). | |
| It is recommended to use an `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time. | |
| <ExampleCodeBlock anchor="datasets.Dataset.filter.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds = ds.filter(lambda x: x["label"] == 1) | |
| >>> ds | |
| Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 533 | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
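The semaphore recommendation above can be sketched with the standard library alone. The coroutine `keep_long_reviews`, the limit `MAX_CONCURRENT`, and the counters are assumptions made for this illustration; the coroutine is driven here with `asyncio.gather` rather than through `filter`, so that the concurrency cap can be observed directly.

```python
import asyncio

MAX_CONCURRENT = 2  # illustrative limit
running = 0         # number of calls currently inside the semaphore
peak = 0            # highest value of `running` observed

async def keep_long_reviews(example, semaphore):
    global running, peak
    # At most MAX_CONCURRENT calls get past this line at the same time.
    async with semaphore:
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stands in for slow I/O such as an API call
        running -= 1
        return len(example["text"]) > 10

async def main():
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    examples = [
        {"text": "short"},
        {"text": "a much longer review"},
        {"text": "ok"},
        {"text": "another long review text"},
    ]
    return await asyncio.gather(*(keep_long_reviews(e, semaphore) for e in examples))

results = asyncio.run(main())
```

An asynchronous predicate of this shape, returning a `bool` per example, is the kind of function the paragraph above refers to; how the semaphore is shared with the function when it is passed to `filter` is left open in this sketch.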
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>select</name><anchor>datasets.Dataset.select</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L4036</source><parameters>[{"name": "indices", "val": ": Iterable"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "indices_cache_file_name", "val": ": typing.Optional[str] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "new_fingerprint", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **indices** (`range`, `list`, `iterable`, `ndarray` or `Series`) -- | |
| Range, list or 1D-array of integer indices for indexing. | |
| If the indices correspond to a contiguous range, the Arrow table is simply sliced. | |
| However passing a list of indices that are not contiguous creates indices mapping, which is much less efficient, | |
| but still faster than recreating an Arrow table made of the requested rows. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Keep the indices mapping in memory instead of writing it to a cache file. | |
| - **indices_cache_file_name** (`str`, *optional*, defaults to `None`) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| indices mapping instead of the automatically generated cache file name. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| Number of rows per write operation for the cache file writer. | |
| This value is a good trade-off between memory usage during the processing, and processing speed. | |
| Higher values make the processing do fewer lookups; lower values consume less temporary memory while running `select`. | |
| - **new_fingerprint** (`str`, *optional*, defaults to `None`) -- | |
| The new fingerprint of the dataset after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a new dataset with rows selected following the list/array of indices. | |
| <ExampleCodeBlock anchor="datasets.Dataset.select.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds = ds.select(range(4)) | |
| >>> ds | |
| Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 4 | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>sort</name><anchor>datasets.Dataset.sort</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L4374</source><parameters>[{"name": "column_names", "val": ": typing.Union[str, collections.abc.Sequence[str]]"}, {"name": "reverse", "val": ": typing.Union[bool, collections.abc.Sequence[bool]] = False"}, {"name": "null_placement", "val": ": str = 'at_end'"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "load_from_cache_file", "val": ": typing.Optional[bool] = None"}, {"name": "indices_cache_file_name", "val": ": typing.Optional[str] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "new_fingerprint", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **column_names** (`Union[str, Sequence[str]]`) -- | |
| Column name(s) to sort by. | |
| - **reverse** (`Union[bool, Sequence[bool]]`, defaults to `False`) -- | |
| If `True`, sort by descending order rather than ascending. If a single bool is provided, | |
| the value is applied to the sorting of all column names. Otherwise a list of bools with the | |
| same length and order as column_names must be provided. | |
| - **null_placement** (`str`, defaults to `at_end`) -- | |
| Put `None` values at the beginning if `at_start` or `first` or at the end if `at_end` or `last`. | |
| <Added version="1.14.2"/> | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Keep the sorted indices in memory instead of writing it to a cache file. | |
| - **load_from_cache_file** (`Optional[bool]`, defaults to `True` if caching is enabled) -- | |
| If a cache file storing the sorted indices | |
| can be identified, use it instead of recomputing. | |
| - **indices_cache_file_name** (`str`, *optional*, defaults to `None`) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| sorted indices instead of the automatically generated cache file name. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| Number of rows per write operation for the cache file writer. | |
| Higher values give smaller cache files; lower values consume less temporary memory. | |
| - **new_fingerprint** (`str`, *optional*, defaults to `None`) -- | |
| The new fingerprint of the dataset after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a new dataset sorted according to a single or multiple columns. | |
| <ExampleCodeBlock anchor="datasets.Dataset.sort.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset('cornell-movie-review-data/rotten_tomatoes', split='validation') | |
| >>> ds['label'][:10] | |
| [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] | |
| >>> sorted_ds = ds.sort('label') | |
| >>> sorted_ds['label'][:10] | |
| [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | |
| >>> another_sorted_ds = ds.sort(['label', 'text'], reverse=[True, False]) | |
| >>> another_sorted_ds['label'][:10] | |
| [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>shuffle</name><anchor>datasets.Dataset.shuffle</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L4502</source><parameters>[{"name": "seed", "val": ": typing.Optional[int] = None"}, {"name": "generator", "val": ": typing.Optional[numpy.random._generator.Generator] = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "load_from_cache_file", "val": ": typing.Optional[bool] = None"}, {"name": "indices_cache_file_name", "val": ": typing.Optional[str] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "new_fingerprint", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **seed** (`int`, *optional*) -- | |
| A seed to initialize the default BitGenerator if `generator=None`. | |
| If `None`, then fresh, unpredictable entropy will be pulled from the OS. | |
| If an `int` or `array_like[ints]` is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state. | |
| - **generator** (`numpy.random.Generator`, *optional*) -- | |
| Numpy random Generator to use to compute the permutation of the dataset rows. | |
| If `generator=None` (default), uses `np.random.default_rng` (the default BitGenerator (PCG64) of NumPy). | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Keep the shuffled indices in memory instead of writing them to a cache file. | |
| - **load_from_cache_file** (`Optional[bool]`, defaults to `True` if caching is enabled) -- | |
| If a cache file storing the shuffled indices | |
| can be identified, use it instead of recomputing. | |
| - **indices_cache_file_name** (`str`, *optional*) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| shuffled indices instead of the automatically generated cache file name. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| Number of rows per write operation for the cache file writer. | |
| This value is a good trade-off between memory usage during the processing, and processing speed. | |
| Higher values make the processing do fewer lookups; lower values consume less temporary memory while running `shuffle`. | |
| - **new_fingerprint** (`str`, *optional*, defaults to `None`) -- | |
| The new fingerprint of the dataset after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a new Dataset where the rows are shuffled. | |
| Currently shuffling uses numpy random generators. | |
| You can either supply a NumPy `Generator` to use, or a seed to initialize NumPy's default random generator (PCG64). | |
| Shuffling takes the list of indices `[0:len(my_dataset)]` and shuffles it to create an indices mapping. | |
| However, as soon as your [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) has an indices mapping, reading from it can become up to 10x slower. | |
| This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren't reading contiguous chunks of data anymore. | |
| To restore the speed, you'd need to rewrite the entire dataset on your disk again using [Dataset.flatten_indices()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.flatten_indices), which removes the indices mapping. | |
| <ExampleCodeBlock anchor="datasets.Dataset.shuffle.example"> | |
| This may take a lot of time depending on the size of your dataset though: | |
| ```python | |
| my_dataset[0] # fast | |
| my_dataset = my_dataset.shuffle(seed=42) | |
| my_dataset[0] # up to 10x slower | |
| my_dataset = my_dataset.flatten_indices() # rewrite the shuffled dataset on disk as contiguous chunks of data | |
| my_dataset[0] # fast again | |
| ``` | |
| </ExampleCodeBlock> | |
| In this case, we recommend switching to an [IterableDataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset) and leveraging its fast approximate shuffling method [IterableDataset.shuffle()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset.shuffle). | |
| <ExampleCodeBlock anchor="datasets.Dataset.shuffle.example-2"> | |
| It only shuffles the shards order and adds a shuffle buffer to your dataset, which keeps the speed of your dataset optimal: | |
| ```python | |
| my_iterable_dataset = my_dataset.to_iterable_dataset(num_shards=128) | |
| for example in my_iterable_dataset:  # fast | |
|     pass | |
| shuffled_iterable_dataset = my_iterable_dataset.shuffle(seed=42, buffer_size=100) | |
| for example in shuffled_iterable_dataset:  # as fast as before | |
|     pass | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.Dataset.shuffle.example-3"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds['label'][:10] | |
| [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] | |
| # set a seed | |
| >>> shuffled_ds = ds.shuffle(seed=42) | |
| >>> shuffled_ds['label'][:10] | |
| [1, 0, 1, 1, 0, 0, 0, 0, 0, 0] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>skip</name><anchor>datasets.Dataset.skip</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L4289</source><parameters>[{"name": "n", "val": ": int"}]</parameters><paramsdesc>- **n** (`int`) -- | |
| Number of elements to skip.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a new [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) that skips the first `n` elements. | |
| <ExampleCodeBlock anchor="datasets.Dataset.skip.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") | |
| >>> list(ds.take(3)) | |
| [{'label': 1, | |
| 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, | |
| {'label': 1, | |
| 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, | |
| {'label': 1, 'text': 'effective but too-tepid biopic'}] | |
| >>> ds = ds.skip(1) | |
| >>> list(ds.take(3)) | |
| [{'label': 1, | |
| 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, | |
| {'label': 1, 'text': 'effective but too-tepid biopic'}, | |
| {'label': 1, | |
| 'text': 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'}] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>take</name><anchor>datasets.Dataset.take</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L4351</source><parameters>[{"name": "n", "val": ": int"}]</parameters><paramsdesc>- **n** (`int`) -- | |
| Number of elements to take.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a new [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) with only the first `n` elements. | |
| <ExampleCodeBlock anchor="datasets.Dataset.take.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") | |
| >>> small_ds = ds.take(2) | |
| >>> list(small_ds) | |
| [{'label': 1, | |
| 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, | |
| {'label': 1, | |
| 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>train_test_split</name><anchor>datasets.Dataset.train_test_split</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L4634</source><parameters>[{"name": "test_size", "val": ": typing.Union[float, int, NoneType] = None"}, {"name": "train_size", "val": ": typing.Union[float, int, NoneType] = None"}, {"name": "shuffle", "val": ": bool = True"}, {"name": "stratify_by_column", "val": ": typing.Optional[str] = None"}, {"name": "seed", "val": ": typing.Optional[int] = None"}, {"name": "generator", "val": ": typing.Optional[numpy.random._generator.Generator] = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "load_from_cache_file", "val": ": typing.Optional[bool] = None"}, {"name": "train_indices_cache_file_name", "val": ": typing.Optional[str] = None"}, {"name": "test_indices_cache_file_name", "val": ": typing.Optional[str] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "train_new_fingerprint", "val": ": typing.Optional[str] = None"}, {"name": "test_new_fingerprint", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **test_size** (`Union[float, int, None]`, *optional*) -- | |
| Size of the test split. | |
| If `float`, should be between `0.0` and `1.0` and represent the proportion of the dataset to include in the test split. | |
| If `int`, represents the absolute number of test samples. | |
| If `None`, the value is set to the complement of the train size. | |
| If `train_size` is also `None`, it will be set to `0.25`. | |
| - **train_size** (`Union[float, int, None]`, *optional*) -- | |
| Size of the train split. | |
| If `float`, should be between `0.0` and `1.0` and represent the proportion of the dataset to include in the train split. | |
| If `int`, represents the absolute number of train samples. | |
| If `None`, the value is automatically set to the complement of the test size. | |
| - **shuffle** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to shuffle the data before splitting. | |
| - **stratify_by_column** (`str`, *optional*, defaults to `None`) -- | |
| The column name of labels to be used to perform stratified split of data. | |
| - **seed** (`int`, *optional*) -- | |
| A seed to initialize the default BitGenerator if `generator=None`. | |
| If `None`, then fresh, unpredictable entropy will be pulled from the OS. | |
| If an `int` or `array_like[ints]` is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state. | |
| - **generator** (`numpy.random.Generator`, *optional*) -- | |
| Numpy random Generator to use to compute the permutation of the dataset rows. | |
| If `generator=None` (default), uses `np.random.default_rng` (the default BitGenerator (PCG64) of NumPy). | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Keep the split indices in memory instead of writing them to a cache file. | |
| - **load_from_cache_file** (`Optional[bool]`, defaults to `True` if caching is enabled) -- | |
| If a cache file storing the splits indices | |
| can be identified, use it instead of recomputing. | |
| - **train_indices_cache_file_name** (`str`, *optional*) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| train split indices instead of the automatically generated cache file name. | |
| - **test_indices_cache_file_name** (`str`, *optional*) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| test split indices instead of the automatically generated cache file name. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| Number of rows per write operation for the cache file writer. | |
| This value is a good trade-off between memory usage during the processing, and processing speed. | |
| A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running `map`. | |
| - **train_new_fingerprint** (`str`, *optional*, defaults to `None`) -- | |
| The new fingerprint of the train set after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments. | |
| - **test_new_fingerprint** (`str`, *optional*, defaults to `None`) -- | |
| The new fingerprint of the test set after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Return a dictionary ([datasets.DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict)) with two random train and test subsets (`train` and `test` `Dataset` splits). | |
| Splits are created from the dataset according to `test_size`, `train_size` and `shuffle`. | |
| This method is similar to scikit-learn `train_test_split`. | |
| <ExampleCodeBlock anchor="datasets.Dataset.train_test_split.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> splits = ds.train_test_split(test_size=0.2, shuffle=True) | |
| >>> splits | |
| DatasetDict({ | |
| train: Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 852 | |
| }) | |
| test: Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 214 | |
| }) | |
| }) | |
| # set a seed | |
| >>> splits = ds.train_test_split(test_size=0.2, seed=42) | |
| # stratified split | |
| >>> ds = load_dataset("imdb", split="train") | |
| >>> ds | |
| Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 25000 | |
| }) | |
| >>> splits = ds.train_test_split(test_size=0.2, stratify_by_column="label") | |
| >>> splits | |
| DatasetDict({ | |
| train: Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 20000 | |
| }) | |
| test: Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 5000 | |
| }) | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
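The way `test_size` and `train_size` resolve to row counts can be sketched in plain Python. This is only an illustration of the rules listed above, not the library's code: the helper name `resolve_split_sizes` is hypothetical, and the rounding (ceil for a float `test_size`, floor for a float `train_size`) is an assumption borrowed from scikit-learn that happens to reproduce the 852/214 and 20000/5000 counts in the example.

```python
import math


def resolve_split_sizes(n_samples, test_size=None, train_size=None):
    """Return (n_train, n_test) for a dataset with n_samples rows.

    Assumption: float sizes are rounded as in scikit-learn
    (ceil for the test split, floor for the train split).
    """
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default when both sizes are None

    # Floats are proportions of the dataset, ints are absolute row counts.
    n_test = math.ceil(test_size * n_samples) if isinstance(test_size, float) else test_size
    n_train = math.floor(train_size * n_samples) if isinstance(train_size, float) else train_size

    # A missing size is the complement of the other one.
    if n_train is None:
        n_train = n_samples - n_test
    elif n_test is None:
        n_test = n_samples - n_train

    if n_train + n_test > n_samples:
        raise ValueError("train_size + test_size exceeds the number of rows")
    return n_train, n_test


print(resolve_split_sizes(1066, test_size=0.2))   # (852, 214)
print(resolve_split_sizes(25000, test_size=0.2))  # (20000, 5000)
print(resolve_split_sizes(1066))                  # (799, 267)
```

When both sizes are given and sum to less than the dataset, the leftover rows belong to neither split in this sketch.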
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>shard</name><anchor>datasets.Dataset.shard</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L4917</source><parameters>[{"name": "num_shards", "val": ": int"}, {"name": "index", "val": ": int"}, {"name": "contiguous", "val": ": bool = True"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "indices_cache_file_name", "val": ": typing.Optional[str] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}]</parameters><paramsdesc>- **num_shards** (`int`) -- | |
| How many shards to split the dataset into. | |
| - **index** (`int`) -- | |
| Which shard to select and return. | |
| - **contiguous** (`bool`, defaults to `True`) -- | |
| Whether to select contiguous blocks of indices for shards. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Keep the dataset in memory instead of writing it to a cache file. | |
| - **indices_cache_file_name** (`str`, *optional*) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| indices of each shard instead of the automatically generated cache file name. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| This only concerns the indices mapping. | |
| Number of indices per write operation for the cache file writer. | |
| This value is a good trade-off between memory usage during the processing, and processing speed. | |
| A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running `map`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Return the `index`-th shard of the dataset split into `num_shards` pieces. | |
| Sharding is deterministic. `dataset.shard(n, i)` splits the dataset into contiguous chunks, | |
| so the shards can easily be concatenated back together after processing. If `len(dataset) % n == l`, then the | |
| first `l` shards each have length `(len(dataset) // n) + 1`, and the remaining shards have length `(len(dataset) // n)`. | |
| `datasets.concatenate_datasets([dset.shard(n, i) for i in range(n)])` returns a dataset with the same order as the original. | |
| Note: `n` should be less than or equal to the number of elements in the dataset, `len(dataset)`. | |
| On the other hand, `dataset.shard(n, i, contiguous=False)` contains all elements of the dataset whose row index `k` satisfies `k % n == i`. | |
| Be sure to shard before using any randomizing operator (such as `shuffle`). | |
| It is best if the shard operator is used early in the dataset pipeline. | |
| <ExampleCodeBlock anchor="datasets.Dataset.shard.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") | |
| >>> ds | |
| Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 1066 | |
| }) | |
| >>> ds = ds.shard(num_shards=2, index=0) | |
| >>> ds | |
| Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 533 | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
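The index arithmetic described above can be illustrated without loading a dataset. The sketch below only computes which row positions each shard would contain under the stated rules; `shard_indices` is a hypothetical helper written for this illustration, not part of the `datasets` API.

```python
def shard_indices(num_rows, num_shards, index, contiguous=True):
    """Row positions that shard `index` out of `num_shards` would contain."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    if contiguous:
        div, mod = divmod(num_rows, num_shards)
        # The first `mod` shards each get one extra row.
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Strided: every row position k with k % num_shards == index.
    return list(range(index, num_rows, num_shards))


print([shard_indices(10, 3, i) for i in range(3)])
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print([shard_indices(10, 3, i, contiguous=False) for i in range(3)])
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Concatenating the contiguous shards in order restores the original row order, whereas the strided shards interleave rows.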
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>repeat</name><anchor>datasets.Dataset.repeat</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L4319</source><parameters>[{"name": "num_times", "val": ": int"}]</parameters><paramsdesc>- **num_times** (`int`) -- | |
| Number of times to repeat the dataset.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a new [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) that repeats the underlying dataset `num_times` times. | |
| Like itertools.repeat, repeating once just returns the full dataset. | |
| <ExampleCodeBlock anchor="datasets.Dataset.repeat.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") | |
| >>> ds = ds.take(3).repeat(2) | |
| >>> list(ds) | |
| [{'label': 1, | |
| 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, | |
| {'label': 1, | |
| 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, | |
| {'label': 1, 'text': 'effective but too-tepid biopic'}, | |
| {'label': 1, | |
| 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, | |
| {'label': 1, | |
| 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, | |
| {'label': 1, 'text': 'effective but too-tepid biopic'}] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_tf_dataset</name><anchor>datasets.Dataset.to_tf_dataset</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L331</source><parameters>[{"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "shuffle", "val": ": bool = False"}, {"name": "collate_fn", "val": ": typing.Optional[typing.Callable] = None"}, {"name": "drop_remainder", "val": ": bool = False"}, {"name": "collate_fn_args", "val": ": typing.Optional[dict[str, typing.Any]] = None"}, {"name": "label_cols", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "prefetch", "val": ": bool = True"}, {"name": "num_workers", "val": ": int = 0"}, {"name": "num_test_batches", "val": ": int = 20"}]</parameters><paramsdesc>- **batch_size** (`int`, *optional*) -- | |
| Size of batches to load from the dataset. Defaults to `None`, which implies that the dataset won't be | |
| batched, but the returned dataset can be batched later with `tf_dataset.batch(batch_size)`. | |
| - **columns** (`List[str]` or `str`, *optional*) -- | |
| Dataset column(s) to load in the `tf.data.Dataset`. | |
| Column names that are created by the `collate_fn` and that do not exist in the original dataset can be used. | |
| - **shuffle** (`bool`, defaults to `False`) -- | |
| Shuffle the dataset order when loading. Recommended `True` for training, `False` for | |
| validation/evaluation. | |
| - **drop_remainder** (`bool`, defaults to `False`) -- | |
| Drop the last incomplete batch when loading. Ensures | |
| that all batches yielded by the dataset will have the same length on the batch dimension. | |
| - **collate_fn** (`Callable`, *optional*) -- | |
| A function or callable object (such as a `DataCollator`) that will collate | |
| lists of samples into a batch. | |
| - **collate_fn_args** (`Dict`, *optional*) -- | |
| An optional `dict` of keyword arguments to be passed to the | |
| `collate_fn`. | |
| - **label_cols** (`List[str]` or `str`, defaults to `None`) -- | |
| Dataset column(s) to load as labels. | |
| Note that many models compute loss internally rather than letting Keras do it, in which case | |
| passing the labels here is optional, as long as they're in the input `columns`. | |
| - **prefetch** (`bool`, defaults to `True`) -- | |
| Whether to run the dataloader in a separate thread and maintain | |
| a small buffer of batches for training. Improves performance by allowing data to be loaded in the | |
| background while the model is training. | |
| - **num_workers** (`int`, defaults to `0`) -- | |
| Number of workers to use for loading the dataset. | |
| - **num_test_batches** (`int`, defaults to `20`) -- | |
| Number of batches to use to infer the output signature of the dataset. | |
| The higher this number, the more accurate the signature will be, but the longer it will take to | |
| create the dataset.</paramsdesc><paramgroups>0</paramgroups><retdesc>`tf.data.Dataset`</retdesc></docstring> | |
| Create a `tf.data.Dataset` from the underlying Dataset. This `tf.data.Dataset` will load and collate batches from | |
| the Dataset, and is suitable for passing to methods like `model.fit()` or `model.predict()`. The dataset will yield | |
| `dicts` for both inputs and labels unless the `dict` would contain only a single key, in which case a raw | |
| `tf.Tensor` is yielded instead. | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_tf_dataset.example"> | |
| Example: | |
| ```py | |
| >>> ds_train = ds["train"].to_tf_dataset( | |
| ... columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'], | |
| ... shuffle=True, | |
| ... batch_size=16, | |
| ... collate_fn=data_collator, | |
| ... ) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>push_to_hub</name><anchor>datasets.Dataset.push_to_hub</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L5658</source><parameters>[{"name": "repo_id", "val": ": str"}, {"name": "config_name", "val": ": str = 'default'"}, {"name": "set_default", "val": ": typing.Optional[bool] = None"}, {"name": "split", "val": ": typing.Optional[str] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "commit_message", "val": ": typing.Optional[str] = None"}, {"name": "commit_description", "val": ": typing.Optional[str] = None"}, {"name": "private", "val": ": typing.Optional[bool] = None"}, {"name": "token", "val": ": typing.Optional[str] = None"}, {"name": "revision", "val": ": typing.Optional[str] = None"}, {"name": "create_pr", "val": ": typing.Optional[bool] = False"}, {"name": "max_shard_size", "val": ": typing.Union[str, int, NoneType] = None"}, {"name": "num_shards", "val": ": typing.Optional[int] = None"}, {"name": "embed_external_files", "val": ": bool = True"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}]</parameters><paramsdesc>- **repo_id** (`str`) -- | |
| The ID of the repository to push to in the following format: `<user>/<dataset_name>` or | |
| `<org>/<dataset_name>`. Also accepts `<dataset_name>`, which will default to the namespace | |
| of the logged-in user. | |
| - **config_name** (`str`, defaults to "default") -- | |
| The configuration name (or subset) of a dataset. Defaults to "default". | |
| - **set_default** (`bool`, *optional*) -- | |
| Whether to set this configuration as the default one. Otherwise, the default configuration is the one | |
| named "default". | |
| - **split** (`str`, *optional*) -- | |
| The name of the split that will be given to that dataset. Defaults to `self.split`. | |
| - **data_dir** (`str`, *optional*) -- | |
| Directory name that will contain the uploaded data files. Defaults to the `config_name` if different | |
| from "default", else "data". | |
| <Added version="2.17.0"/> | |
| - **commit_message** (`str`, *optional*) -- | |
| Message to commit while pushing. Will default to `"Upload dataset"`. | |
| - **commit_description** (`str`, *optional*) -- | |
| Description of the commit that will be created. | |
| Additionally, description of the PR if a PR is created (`create_pr` is True). | |
| <Added version="2.16.0"/> | |
| - **private** (`bool`, *optional*) -- | |
| Whether to make the repo private. If `None` (default), the repo will be public unless the | |
| organization's default is private. This value is ignored if the repo already exists. | |
| - **token** (`str`, *optional*) -- | |
| An optional authentication token for the Hugging Face Hub. If no token is passed, will default | |
| to the token saved locally when logging in with `huggingface-cli login`. Will raise an error | |
| if no token is passed and the user is not logged-in. | |
| - **revision** (`str`, *optional*) -- | |
| Branch to push the uploaded files to. Defaults to the `"main"` branch. | |
| <Added version="2.15.0"/> | |
| - **create_pr** (`bool`, *optional*, defaults to `False`) -- | |
| Whether to create a PR with the uploaded files or directly commit. | |
| <Added version="2.15.0"/> | |
| - **max_shard_size** (`int` or `str`, *optional*, defaults to `"500MB"`) -- | |
| The maximum size of the dataset shards to be uploaded to the hub. If expressed as a string, needs to be digits followed by | |
| a unit (like `"5MB"`). | |
| - **num_shards** (`int`, *optional*) -- | |
| Number of shards to write. By default, the number of shards depends on `max_shard_size`. | |
| <Added version="2.8.0"/> | |
| - **embed_external_files** (`bool`, defaults to `True`) -- | |
| Whether to embed file bytes in the shards. | |
| In particular, this will do the following before the push for the fields of type: | |
| - [Audio](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Audio) and [Image](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Image): remove local path information and embed file content in the Parquet files. | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| Number of processes when preparing and uploading the dataset. | |
| This is helpful if the dataset is made of many samples or media files to embed. | |
| Multiprocessing is disabled by default. | |
| <Added version="4.0.0"/></paramsdesc><paramgroups>0</paramgroups><retdesc>huggingface_hub.CommitInfo</retdesc></docstring> | |
| Pushes the dataset to the hub as a Parquet dataset. | |
| The dataset is pushed using HTTP requests and requires neither git nor git-lfs to be installed. | |
| The resulting Parquet files are self-contained by default. If your dataset contains [Image](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Image), [Audio](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Audio) or [Video](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Video) | |
| data, the Parquet files will store the bytes of your images or audio files. | |
| You can disable this by setting `embed_external_files` to `False`. | |
| <ExampleCodeBlock anchor="datasets.Dataset.push_to_hub.example"> | |
| Example: | |
| ```python | |
| >>> dataset.push_to_hub("<organization>/<dataset_id>") | |
| >>> dataset_dict.push_to_hub("<organization>/<dataset_id>", private=True) | |
| >>> dataset.push_to_hub("<organization>/<dataset_id>", max_shard_size="1GB") | |
| >>> dataset.push_to_hub("<organization>/<dataset_id>", num_shards=1024) | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.Dataset.push_to_hub.example-2"> | |
| If your dataset has multiple splits (e.g. train/validation/test): | |
| ```python | |
| >>> train_dataset.push_to_hub("<organization>/<dataset_id>", split="train") | |
| >>> val_dataset.push_to_hub("<organization>/<dataset_id>", split="validation") | |
| >>> # later | |
| >>> dataset = load_dataset("<organization>/<dataset_id>") | |
| >>> train_dataset = dataset["train"] | |
| >>> val_dataset = dataset["validation"] | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.Dataset.push_to_hub.example-3"> | |
| If you want to add a new configuration (or subset) to a dataset (e.g. if the dataset has multiple tasks/versions/languages): | |
| ```python | |
| >>> english_dataset.push_to_hub("<organization>/<dataset_id>", "en") | |
| >>> french_dataset.push_to_hub("<organization>/<dataset_id>", "fr") | |
| >>> # later | |
| >>> english_dataset = load_dataset("<organization>/<dataset_id>", "en") | |
| >>> french_dataset = load_dataset("<organization>/<dataset_id>", "fr") | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>save_to_disk</name><anchor>datasets.Dataset.save_to_disk</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1510</source><parameters>[{"name": "dataset_path", "val": ": typing.Union[str, bytes, os.PathLike]"}, {"name": "max_shard_size", "val": ": typing.Union[str, int, NoneType] = None"}, {"name": "num_shards", "val": ": typing.Optional[int] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **dataset_path** (`path-like`) -- | |
| Path (e.g. `dataset/train`) or remote URI (e.g. `s3://my-bucket/dataset/train`) | |
| of the dataset directory where the dataset will be saved to. | |
| - **max_shard_size** (`int` or `str`, *optional*, defaults to `"500MB"`) -- | |
| The maximum size of the dataset shards to be saved to the filesystem. If expressed as a string, needs to be digits followed by a unit | |
| (like `"50MB"`). | |
| - **num_shards** (`int`, *optional*) -- | |
| Number of shards to write. By default the number of shards depends on `max_shard_size` and `num_proc`. | |
| <Added version="2.8.0"/> | |
| - **num_proc** (`int`, *optional*) -- | |
| Number of processes when downloading and generating the dataset locally. | |
| Multiprocessing is disabled by default. | |
| <Added version="2.8.0"/> | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| <Added version="2.8.0"/></paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Saves a dataset to a dataset directory, or in a filesystem using any implementation of `fsspec.spec.AbstractFileSystem`. | |
| For [Image](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Image), [Audio](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Audio) and [Video](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Video) data: | |
| All the Image(), Audio() and Video() data are stored in the arrow files. | |
| If you want to store paths or urls, please use the Value("string") type. | |
| <ExampleCodeBlock anchor="datasets.Dataset.save_to_disk.example"> | |
| Example: | |
| ```py | |
| >>> ds.save_to_disk("path/to/dataset/directory") | |
| >>> ds.save_to_disk("path/to/dataset/directory", max_shard_size="1GB") | |
| >>> ds.save_to_disk("path/to/dataset/directory", num_shards=1024) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>load_from_disk</name><anchor>datasets.Dataset.load_from_disk</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1704</source><parameters>[{"name": "dataset_path", "val": ": typing.Union[str, bytes, os.PathLike]"}, {"name": "keep_in_memory", "val": ": typing.Optional[bool] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **dataset_path** (`path-like`) -- | |
| Path (e.g. `"dataset/train"`) or remote URI (e.g. `"s3://my-bucket/dataset/train"`) | |
| of the dataset directory where the dataset will be loaded from. | |
| - **keep_in_memory** (`bool`, defaults to `None`) -- | |
| Whether to copy the dataset in-memory. If `None`, the | |
| dataset will not be copied in-memory unless explicitly enabled by setting | |
| `datasets.config.IN_MEMORY_MAX_SIZE` to nonzero. See more details in the | |
| [improve performance](../cache#improve-performance) section. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| <Added version="2.8.0"/></paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>- If `dataset_path` is a path of a dataset directory, the dataset requested. | |
| - If `dataset_path` is a path of a dataset dict directory, a `datasets.DatasetDict` with each split.</retdesc></docstring> | |
| Loads a dataset that was previously saved using `save_to_disk` from a dataset directory, or from a | |
| filesystem using any implementation of `fsspec.spec.AbstractFileSystem`. | |
| <ExampleCodeBlock anchor="datasets.Dataset.load_from_disk.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_from_disk | |
| >>> ds = load_from_disk("path/to/dataset/directory") | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>flatten_indices</name><anchor>datasets.Dataset.flatten_indices</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L3957</source><parameters>[{"name": "keep_in_memory", "val": ": bool = False"}, {"name": "cache_file_name", "val": ": typing.Optional[str] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "disable_nullable", "val": ": bool = False"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "new_fingerprint", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Keep the dataset in memory instead of writing it to a cache file. | |
| - **cache_file_name** (`str`, *optional*, defaults to `None`) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| results of the computation instead of the automatically generated cache file name. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| Number of rows per write operation for the cache file writer. | |
| This value is a good trade-off between memory usage during the processing, and processing speed. | |
| A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running `map`. | |
| - **features** (`Optional[datasets.Features]`, defaults to `None`) -- | |
| Use a specific [Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features) to store the cache file | |
| instead of the automatically generated one. | |
| - **disable_nullable** (`bool`, defaults to `False`) -- | |
| Disallow null values in the table. | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| Maximum number of processes when generating the cache. Already cached shards are loaded sequentially. | |
| - **new_fingerprint** (`str`, *optional*, defaults to `None`) -- | |
| The new fingerprint of the dataset after transform. | |
| If `None`, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create and cache a new Dataset by flattening the indices mapping. | |
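As a rough mental model: row-selecting operations such as `select`, `shuffle`, `shard` or `train_test_split` can leave the stored rows untouched and only record an indices mapping, and flattening rewrites the rows in mapped order so the mapping is no longer needed. The toy class below sketches that idea in plain Python. `IndexedView` and its attributes are hypothetical names for illustration only and do not reflect the library's internals.

```python
class IndexedView:
    """Toy stand-in for a dataset: stored rows plus an optional indices mapping."""

    def __init__(self, rows, indices=None):
        self.rows = rows        # the underlying stored data
        self.indices = indices  # None means rows are read in storage order

    def __getitem__(self, i):
        return self.rows[i if self.indices is None else self.indices[i]]

    def __len__(self):
        return len(self.rows) if self.indices is None else len(self.indices)

    def select(self, indices):
        # Cheap: only the mapping changes, the stored rows are shared.
        if self.indices is not None:
            indices = [self.indices[i] for i in indices]
        return IndexedView(self.rows, list(indices))

    def flatten_indices(self):
        # Rewrite the rows in mapped order and drop the mapping.
        return IndexedView([self[i] for i in range(len(self))])


view = IndexedView(["a", "b", "c", "d"]).select([3, 1])
flat = view.flatten_indices()
print(len(view.rows), view.indices)  # 4 [3, 1]
print(flat.rows, flat.indices)       # ['d', 'b'] None
```

In this sketch the selected view still references all four stored rows, while the flattened copy holds only the two selected ones.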
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_csv</name><anchor>datasets.Dataset.to_csv</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L4994</source><parameters>[{"name": "path_or_buf", "val": ": typing.Union[str, bytes, os.PathLike, typing.BinaryIO]"}, {"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "**to_csv_kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_buf** (`PathLike` or `FileOrBuffer`) -- | |
| Either a path to a file (e.g. `file.csv`), a remote URI (e.g. `hf://datasets/username/my_dataset_name/data.csv`), | |
| or a BinaryIO, where the dataset will be saved to in the specified format. | |
| - **batch_size** (`int`, *optional*) -- | |
| Size of the batch to load in memory and write at once. | |
| Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`. | |
| - **num_proc** (`int`, *optional*) -- | |
| Number of processes for multiprocessing. By default it doesn't | |
| use multiprocessing. `batch_size` in this case defaults to | |
| `datasets.config.DEFAULT_MAX_BATCH_SIZE` but feel free to make it 5x or 10x of the default | |
| value if you have sufficient compute power. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| <Added version="2.19.0"/> | |
| - ****to_csv_kwargs** (additional keyword arguments) -- | |
| Parameters to pass to pandas's [`pandas.DataFrame.to_csv`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html). | |
| <Changed version="2.10.0"> | |
| Now, `index` defaults to `False` if not specified. | |
| If you would like to write the index, pass `index=True` and also set a name for the index column by | |
| passing `index_label`. | |
| </Changed></paramsdesc><paramgroups>0</paramgroups><rettype>`int`</rettype><retdesc>The number of characters or bytes written.</retdesc></docstring> | |
| Exports the dataset to CSV. | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_csv.example"> | |
| Example: | |
| ```py | |
| >>> ds.to_csv("path/to/dataset/file.csv") | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_pandas</name><anchor>datasets.Dataset.to_pandas</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L5158</source><parameters>[{"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "batched", "val": ": bool = False"}]</parameters><paramsdesc>- **batch_size** (`int`, *optional*) -- | |
| The size (number of rows) of the batches if `batched` is `True`. | |
| Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`. | |
| - **batched** (`bool`) -- | |
| Set to `True` to return a generator that yields the dataset as batches | |
| of `batch_size` rows. Defaults to `False` (returns the whole dataset at once).</paramsdesc><paramgroups>0</paramgroups><retdesc>`pandas.DataFrame` or `Iterator[pandas.DataFrame]`</retdesc></docstring> | |
| Returns the dataset as a `pandas.DataFrame`. Can also return a generator for large datasets. | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_pandas.example"> | |
| Example: | |
| ```py | |
| >>> ds.to_pandas() | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_dict</name><anchor>datasets.Dataset.to_dict</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L5053</source><parameters>[{"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "batched", "val": ": bool = False"}]</parameters><paramsdesc>- **batch_size** (`int`, *optional*) -- The size (number of rows) of the batches if `batched` is `True`. | |
| Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`. | |
| - **batched** (`bool`) -- | |
| Set to `True` to return a generator that yields the dataset as batches | |
| of `batch_size` rows. Defaults to `False` (returns the whole dataset at once).</paramsdesc><paramgroups>0</paramgroups><retdesc>`dict` or `Iterator[dict]`</retdesc></docstring> | |
| Returns the dataset as a Python dict. Can also return a generator for large datasets. | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_dict.example"> | |
| Example: | |
| ```py | |
| >>> ds.to_dict() | |
| ``` | |
| </ExampleCodeBlock> | |
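As a rough illustration of what `batched=True` yields, the sketch below splits a column-oriented dict into chunks of at most `batch_size` rows using only the standard library. `iter_batches` is a hypothetical helper written for this example, not part of `datasets`, and the library's own batching may differ in detail.

```python
from typing import Dict, Iterator, List


def iter_batches(columns: Dict[str, List], batch_size: int) -> Iterator[Dict[str, List]]:
    """Yield column-oriented dicts holding at most `batch_size` rows each.

    Hypothetical stand-in for iterating over `ds.to_dict(batched=True)`.
    """
    # All columns are assumed to have the same length.
    num_rows = len(next(iter(columns.values()))) if columns else 0
    for start in range(0, num_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in columns.items()}


data = {"id": [0, 1, 2, 3, 4], "text": ["a", "b", "c", "d", "e"]}
batches = list(iter_batches(data, batch_size=2))
for batch in batches:
    print(batch)
```

Only one batch needs to be held in memory at a time when the generator is consumed lazily, which is the reason to prefer the batched form for large datasets.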
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_json</name><anchor>datasets.Dataset.to_json</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L5096</source><parameters>[{"name": "path_or_buf", "val": ": typing.Union[str, bytes, os.PathLike, typing.BinaryIO]"}, {"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "**to_json_kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_buf** (`PathLike` or `FileOrBuffer`) -- | |
| Either a path to a file (e.g. `file.json`), a remote URI (e.g. `hf://datasets/username/my_dataset_name/data.json`), | |
| or a BinaryIO, where the dataset will be saved to in the specified format. | |
| - **batch_size** (`int`, *optional*) -- | |
| Size of the batch to load in memory and write at once. | |
| Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`. | |
| - **num_proc** (`int`, *optional*) -- | |
| Number of processes for multiprocessing. By default, it doesn't | |
| use multiprocessing. `batch_size` in this case defaults to | |
| `datasets.config.DEFAULT_MAX_BATCH_SIZE` but feel free to make it 5x or 10x of the default | |
| value if you have sufficient compute power. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| <Added version="2.19.0"/> | |
| - ****to_json_kwargs** (additional keyword arguments) -- | |
| Parameters to pass to pandas's [`pandas.DataFrame.to_json`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html). | |
| Default arguments are `lines=True` and `orient="records"`. | |
| <Changed version="2.11.0"> | |
| The parameter `index` defaults to `False` if `orient` is `"split"` or `"table"`. | |
| If you would like to write the index, pass `index=True`. | |
| </Changed></paramsdesc><paramgroups>0</paramgroups><rettype>`int`</rettype><retdesc>The number of characters or bytes written.</retdesc></docstring> | |
| Export the dataset to JSON Lines or JSON. | |
| The default output format is [JSON Lines](https://jsonlines.org/). | |
| To export to [JSON](https://www.json.org), pass `lines=False` argument and the desired `orient`. | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_json.example"> | |
| Example: | |
| ```py | |
| >>> ds.to_json("path/to/dataset/directory/filename.jsonl") | |
| ``` | |
| </ExampleCodeBlock> | |
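To make the difference between the two layouts concrete, here is a minimal sketch built with the standard `json` module rather than with `to_json` itself. The sample `rows` are made up, and the exact whitespace of the library's output may differ; only the overall shape (one object per line versus a single array of records) is the point.

```python
import json

# Made-up sample rows standing in for a small dataset.
rows = [{"id": 0, "text": "a"}, {"id": 1, "text": "b"}]

# JSON Lines (the default, `lines=True`): one JSON object per line.
json_lines = "\n".join(json.dumps(row) for row in rows) + "\n"

# Plain JSON with `lines=False` and `orient="records"`: a single array of objects.
json_array = json.dumps(rows)

print(json_lines, end="")
print(json_array)
```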
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_parquet</name><anchor>datasets.Dataset.to_parquet</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L5257</source><parameters>[{"name": "path_or_buf", "val": ": typing.Union[str, bytes, os.PathLike, typing.BinaryIO]"}, {"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "**parquet_writer_kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_buf** (`PathLike` or `FileOrBuffer`) -- | |
| Either a path to a file (e.g. `file.parquet`), a remote URI (e.g. `hf://datasets/username/my_dataset_name/data.parquet`), | |
| or a BinaryIO, where the dataset will be saved to in the specified format. | |
| - **batch_size** (`int`, *optional*) -- | |
| Size of the batch to load in memory and write at once. | |
| By default it aims for row groups with maximum uncompressed byte size of "100MB", | |
| defined by `datasets.config.MAX_ROW_GROUP_SIZE`. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| <Added version="2.19.0"/> | |
| - ****parquet_writer_kwargs** (additional keyword arguments) -- | |
| Parameters to pass to PyArrow's `pyarrow.parquet.ParquetWriter`.</paramsdesc><paramgroups>0</paramgroups><rettype>`int`</rettype><retdesc>The number of characters or bytes written.</retdesc></docstring> | |
| Exports the dataset to Parquet. | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_parquet.example"> | |
| Example: | |
| ```py | |
| >>> ds.to_parquet("path/to/dataset/directory/filename.parquet") | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_sql</name><anchor>datasets.Dataset.to_sql</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L5297</source><parameters>[{"name": "name", "val": ": str"}, {"name": "con", "val": ": typing.Union[str, ForwardRef('sqlalchemy.engine.Connection'), ForwardRef('sqlalchemy.engine.Engine'), ForwardRef('sqlite3.Connection')]"}, {"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "**sql_writer_kwargs", "val": ""}]</parameters><paramsdesc>- **name** (`str`) -- | |
| Name of SQL table. | |
| - **con** (`str` or `sqlite3.Connection` or `sqlalchemy.engine.Connection` or `sqlalchemy.engine.Engine`) -- | |
| A [URI string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) or a SQLite3/SQLAlchemy connection object used to write to a database. | |
| - **batch_size** (`int`, *optional*) -- | |
| Size of the batch to load in memory and write at once. | |
| Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`. | |
| - ****sql_writer_kwargs** (additional keyword arguments) -- | |
| Parameters to pass to pandas's [`pandas.DataFrame.to_sql`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html). | |
| <Changed version="2.11.0"> | |
| Now, `index` defaults to `False` if not specified. | |
| If you would like to write the index, pass `index=True` and also set a name for the index column by | |
| passing `index_label`. | |
| </Changed></paramsdesc><paramgroups>0</paramgroups><rettype>`int`</rettype><retdesc>The number of records written.</retdesc></docstring> | |
| Exports the dataset to a SQL database. | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_sql.example"> | |
| Example: | |
| ```py | |
| >>> # con provided as a connection URI string | |
| >>> ds.to_sql("data", "sqlite:///my_own_db.sql") | |
| >>> # con provided as a sqlite3 connection object | |
| >>> import sqlite3 | |
| >>> con = sqlite3.connect("my_own_db.sql") | |
| >>> with con: | |
| ... ds.to_sql("data", con) | |
| ``` | |
| </ExampleCodeBlock> | |
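The role of `batch_size` and of the returned record count can be sketched with the standard `sqlite3` module alone. `write_in_batches` is a hypothetical helper for this illustration and the table layout is made up; it is not how `to_sql` is implemented, only a picture of writing rows batch by batch and reporting how many were written.

```python
import sqlite3
from typing import List, Tuple


def write_in_batches(con: sqlite3.Connection, table: str,
                     rows: List[Tuple[int, str]], batch_size: int) -> int:
    """Insert `rows` batch by batch and return the number of records written.

    Hypothetical helper; `table` is assumed to be a trusted identifier.
    """
    written = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        con.executemany(f"INSERT INTO {table} (id, text) VALUES (?, ?)", batch)
        written += len(batch)
    return written


con = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
con.execute("CREATE TABLE data (id INTEGER, text TEXT)")
with con:  # commits on success, rolls back on error
    num_written = write_in_batches(con, "data", [(i, f"row {i}") for i in range(5)], batch_size=2)

num_stored = con.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(num_written, num_stored)
```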
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_iterable_dataset</name><anchor>datasets.Dataset.to_iterable_dataset</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L5389</source><parameters>[{"name": "num_shards", "val": ": typing.Optional[int] = 1"}]</parameters><paramsdesc>- **num_shards** (`int`, default to `1`) -- | |
| Number of shards to define when instantiating the iterable dataset. This is especially useful for big datasets to be able to shuffle properly, | |
| and also to enable fast parallel loading using a PyTorch DataLoader or in distributed setups for example. | |
| Shards are defined using [datasets.Dataset.shard()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.shard): it simply slices the data without writing anything on disk.</paramsdesc><paramgroups>0</paramgroups><retdesc>[datasets.IterableDataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset)</retdesc></docstring> | |
| Get an [datasets.IterableDataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset) from a map-style [datasets.Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset). | |
| This is equivalent to loading a dataset in streaming mode with [datasets.load_dataset()](/docs/datasets/pr_7835/en/package_reference/loading_methods#datasets.load_dataset), but much faster since the data is streamed from local files. | |
| Contrary to map-style datasets, iterable datasets are lazy and can only be iterated over (e.g. using a for loop). | |
| Since they are read sequentially in training loops, iterable datasets are much faster than map-style datasets. | |
| All the transformations applied to iterable datasets like filtering or processing are done on-the-fly when you start iterating over the dataset. | |
| Still, it is possible to shuffle an iterable dataset using [datasets.IterableDataset.shuffle()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset.shuffle). | |
| This is a fast approximate shuffling that works best if you have multiple shards and if you specify a buffer size that is big enough. | |
| For the best speed, make sure your dataset doesn't have an indices mapping. | |
| If it does, the data are not read contiguously, which can be slow. | |
| You can use `ds = ds.flatten_indices()` to write your dataset in contiguous chunks of data and have optimal speed before switching to an iterable dataset. | |
| Example: | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_iterable_dataset.example"> | |
| Basic usage: | |
| ```python | |
| >>> ids = ds.to_iterable_dataset() | |
| >>> for example in ids: | |
| ... pass | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_iterable_dataset.example-2"> | |
| With lazy filtering and processing: | |
| ```python | |
| >>> ids = ds.to_iterable_dataset() | |
| >>> ids = ids.filter(filter_fn).map(process_fn) # will filter and process on-the-fly when you start iterating over the iterable dataset | |
| >>> for example in ids: | |
| ... pass | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_iterable_dataset.example-3"> | |
| With sharding to enable efficient shuffling: | |
| ```python | |
| >>> ids = ds.to_iterable_dataset(num_shards=64) # the dataset is split into 64 shards to be iterated over | |
| >>> ids = ids.shuffle(buffer_size=10_000) # will shuffle the shards order and use a shuffle buffer for fast approximate shuffling when you start iterating | |
| >>> for example in ids: | |
| ... pass | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_iterable_dataset.example-4"> | |
| With a PyTorch DataLoader: | |
| ```python | |
| >>> import torch | |
| >>> ids = ds.to_iterable_dataset(num_shards=64) | |
| >>> ids = ids.filter(filter_fn).map(process_fn) | |
| >>> dataloader = torch.utils.data.DataLoader(ids, num_workers=4) # will assign 64 / 4 = 16 shards to each worker to load, filter and process when you start iterating | |
| >>> for example in dataloader: | |
| ... pass | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_iterable_dataset.example-5"> | |
| With a PyTorch DataLoader and shuffling: | |
| ```python | |
| >>> import torch | |
| >>> ids = ds.to_iterable_dataset(num_shards=64) | |
| >>> ids = ids.shuffle(buffer_size=10_000) # will shuffle the shards order and use a shuffle buffer when you start iterating | |
| >>> dataloader = torch.utils.data.DataLoader(ids, num_workers=4) # will assign 64 / 4 = 16 shards from the shuffled list of shards to each worker when you start iterating | |
| >>> for example in dataloader: | |
| ... pass | |
| ``` | |
| </ExampleCodeBlock> | |
| In a distributed setup like PyTorch DDP with a PyTorch DataLoader and shuffling: | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_iterable_dataset.example-6"> | |
| ```python | |
| >>> import torch | |
| >>> from datasets.distributed import split_dataset_by_node | |
| >>> ids = ds.to_iterable_dataset(num_shards=512) | |
| >>> ids = ids.shuffle(buffer_size=10_000, seed=42) # will shuffle the shards order and use a shuffle buffer when you start iterating | |
| >>> ids = split_dataset_by_node(ids, world_size=8, rank=0) # will keep only 512 / 8 = 64 shards from the shuffled list of shards when you start iterating | |
| >>> dataloader = torch.utils.data.DataLoader(ids, num_workers=4) # will assign 64 / 4 = 16 shards from this node's list of shards to each worker when you start iterating | |
| >>> for example in dataloader: | |
| ... pass | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.Dataset.to_iterable_dataset.example-7"> | |
| With shuffling and multiple epochs: | |
| ```python | |
| >>> ids = ds.to_iterable_dataset(num_shards=64) | |
| >>> ids = ids.shuffle(buffer_size=10_000, seed=42) # will shuffle the shards order and use a shuffle buffer when you start iterating | |
| >>> for epoch in range(n_epochs): | |
| ... ids.set_epoch(epoch) # will use effective_seed = seed + epoch to shuffle the shards and for the shuffle buffer when you start iterating | |
| ... for example in ids: | |
| ... pass | |
| ``` | |
| </ExampleCodeBlock> | |
| Feel free to also use `IterableDataset.set_epoch()` when using a PyTorch DataLoader or in distributed setups. | |
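The shard arithmetic in the comments above (512 / 8 = 64 shards per node, 64 / 4 = 16 shards per worker) and the `effective_seed = seed + epoch` rule can be checked with a small standard-library sketch. `shuffled_shards` and `assign` are hypothetical helpers, and the round-robin slicing used here is an assumption for illustration: the library may hand out shards in a different order, only the counts are taken from the text.

```python
import random
from typing import List


def shuffled_shards(num_shards: int, seed: int, epoch: int) -> List[int]:
    """Return shard ids shuffled with effective_seed = seed + epoch."""
    order = list(range(num_shards))
    random.Random(seed + epoch).shuffle(order)
    return order


def assign(items: List[int], num_parts: int, part_index: int) -> List[int]:
    """Round-robin split (an assumption; the real assignment order may differ)."""
    return items[part_index::num_parts]


shards = shuffled_shards(512, seed=42, epoch=0)
node_shards = assign(shards, num_parts=8, part_index=0)         # rank 0 of 8 nodes
worker_shards = assign(node_shards, num_parts=4, part_index=0)  # worker 0 of 4

print(len(node_shards), len(worker_shards))
```

Because the seed changes with the epoch, every node and worker that calls `set_epoch()` with the same value derives the same shard order for that epoch, while successive epochs see different orders.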
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>add_faiss_index</name><anchor>datasets.Dataset.add_faiss_index</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L6127</source><parameters>[{"name": "column", "val": ": str"}, {"name": "index_name", "val": ": typing.Optional[str] = None"}, {"name": "device", "val": ": typing.Optional[int] = None"}, {"name": "string_factory", "val": ": typing.Optional[str] = None"}, {"name": "metric_type", "val": ": typing.Optional[int] = None"}, {"name": "custom_index", "val": ": typing.Optional[ForwardRef('faiss.Index')] = None"}, {"name": "batch_size", "val": ": int = 1000"}, {"name": "train_size", "val": ": typing.Optional[int] = None"}, {"name": "faiss_verbose", "val": ": bool = False"}, {"name": "dtype", "val": " = <class 'numpy.float32'>"}]</parameters><paramsdesc>- **column** (`str`) -- | |
| The column of the vectors to add to the index. | |
| - **index_name** (`str`, *optional*) -- | |
| The `index_name`/identifier of the index. | |
| This is the `index_name` that is used to call [get_nearest_examples()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.get_nearest_examples) or [search()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.search). | |
| By default it corresponds to `column`. | |
| - **device** (`Union[int, List[int]]`, *optional*) -- | |
| If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. | |
| If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU. | |
| - **string_factory** (`str`, *optional*) -- | |
| This is passed to the index factory of Faiss to create the index. | |
| Default index class is `IndexFlat`. | |
| - **metric_type** (`int`, *optional*) -- | |
| Type of metric. Ex: `faiss.METRIC_INNER_PRODUCT` or `faiss.METRIC_L2`. | |
| - **custom_index** (`faiss.Index`, *optional*) -- | |
| Custom Faiss index that you already have instantiated and configured for your needs. | |
| - **batch_size** (`int`) -- | |
| Size of the batch to use while adding vectors to the `FaissIndex`. Default value is `1000`. | |
| <Added version="2.4.0"/> | |
| - **train_size** (`int`, *optional*) -- | |
| If the index needs a training step, specifies how many vectors will be used to train the index. | |
| - **faiss_verbose** (`bool`, defaults to `False`) -- | |
| Enable the verbosity of the Faiss index. | |
| - **dtype** (`data-type`) -- | |
| The dtype of the numpy arrays that are indexed. | |
| Default is `np.float32`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Add a dense index using Faiss for fast retrieval. | |
| By default the index is built over the vectors of the specified column. | |
| You can specify `device` if you want to run it on GPU (`device` must be the GPU index). | |
| You can find more information about Faiss here: | |
| - For [string factory](https://github.com/facebookresearch/faiss/wiki/The-index-factory) | |
| <ExampleCodeBlock anchor="datasets.Dataset.add_faiss_index.example"> | |
| Example: | |
| ```python | |
| >>> ds = datasets.load_dataset('crime_and_punish', split='train') | |
| >>> ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['line'])}) | |
| >>> ds_with_embeddings.add_faiss_index(column='embeddings') | |
| >>> # query | |
| >>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed('my new query'), k=10) | |
| >>> # save index | |
| >>> ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss') | |
| >>> ds = datasets.load_dataset('crime_and_punish', split='train') | |
| >>> # load index | |
| >>> ds.load_faiss_index('embeddings', 'my_index.faiss') | |
| >>> # query | |
| >>> scores, retrieved_examples = ds.get_nearest_examples('embeddings', embed('my new query'), k=10) | |
| ``` | |
| </ExampleCodeBlock> | |
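For intuition about what a flat (exhaustive) L2 index computes, the following standard-library sketch compares the query against every stored vector and keeps the `k` closest. `flat_l2_search` is a hypothetical function written for this example; it does not use Faiss and says nothing about how Faiss is implemented or optimized. It returns squared L2 distances, which is what Faiss reports for flat L2 indexes.

```python
from typing import List, Sequence, Tuple


def flat_l2_search(vectors: Sequence[Sequence[float]], query: Sequence[float],
                   k: int = 10) -> Tuple[List[float], List[int]]:
    """Brute-force nearest neighbours by squared L2 distance (smaller is closer)."""
    scored = [
        (sum((v_i - q_i) ** 2 for v_i, q_i in zip(vector, query)), idx)
        for idx, vector in enumerate(vectors)
    ]
    scored.sort()  # ascending distance, ties broken by row index
    top = scored[:k]
    return [distance for distance, _ in top], [idx for _, idx in top]


vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, neighbour_ids = flat_l2_search(vectors, [0.9, 0.1], k=2)
print(distances, neighbour_ids)
```

An exhaustive search like this is exact but scales linearly with the number of vectors; `string_factory` or `custom_index` is where an approximate index would be selected instead.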
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>add_faiss_index_from_external_arrays</name><anchor>datasets.Dataset.add_faiss_index_from_external_arrays</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L6207</source><parameters>[{"name": "external_arrays", "val": ": <built-in function array>"}, {"name": "index_name", "val": ": str"}, {"name": "device", "val": ": typing.Optional[int] = None"}, {"name": "string_factory", "val": ": typing.Optional[str] = None"}, {"name": "metric_type", "val": ": typing.Optional[int] = None"}, {"name": "custom_index", "val": ": typing.Optional[ForwardRef('faiss.Index')] = None"}, {"name": "batch_size", "val": ": int = 1000"}, {"name": "train_size", "val": ": typing.Optional[int] = None"}, {"name": "faiss_verbose", "val": ": bool = False"}, {"name": "dtype", "val": " = <class 'numpy.float32'>"}]</parameters><paramsdesc>- **external_arrays** (`np.array`) -- | |
| If you want to use arrays from outside the lib for the index, you can set `external_arrays`. | |
| It will use `external_arrays` to create the Faiss index instead of the arrays in the given `column`. | |
| - **index_name** (`str`) -- | |
| The `index_name`/identifier of the index. | |
| This is the `index_name` that is used to call [get_nearest_examples()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.get_nearest_examples) or [search()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.search). | |
| - **device** (`Union[int, List[int]]`, *optional*) -- | |
| If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. | |
| If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU. | |
| - **string_factory** (`str`, *optional*) -- | |
| This is passed to the index factory of Faiss to create the index. | |
| Default index class is `IndexFlat`. | |
| - **metric_type** (`int`, *optional*) -- | |
| Type of metric. Ex: `faiss.METRIC_INNER_PRODUCT` or `faiss.METRIC_L2`. | |
| - **custom_index** (`faiss.Index`, *optional*) -- | |
| Custom Faiss index that you already have instantiated and configured for your needs. | |
| - **batch_size** (`int`, *optional*) -- | |
| Size of the batch to use while adding vectors to the FaissIndex. Default value is 1000. | |
| <Added version="2.4.0"/> | |
| - **train_size** (`int`, *optional*) -- | |
| If the index needs a training step, specifies how many vectors will be used to train the index. | |
| - **faiss_verbose** (`bool`, defaults to False) -- | |
| Enable the verbosity of the Faiss index. | |
| - **dtype** (`numpy.dtype`) -- | |
| The dtype of the numpy arrays that are indexed. Default is np.float32.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Add a dense index using Faiss for fast retrieval. | |
| The index is created using the vectors of `external_arrays`. | |
| You can specify `device` if you want to run it on GPU (`device` must be the GPU index). | |
| You can find more information about Faiss here: | |
| - For [string factory](https://github.com/facebookresearch/faiss/wiki/The-index-factory) | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>save_faiss_index</name><anchor>datasets.Dataset.save_faiss_index</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/search.py#L535</source><parameters>[{"name": "index_name", "val": ": str"}, {"name": "file", "val": ": typing.Union[str, pathlib.PurePath]"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **index_name** (`str`) -- The index_name/identifier of the index. This is the index_name that is used to call `.get_nearest_examples` or `.search`. | |
| - **file** (`str`) -- The path to the serialized faiss index on disk or remote URI (e.g. `"s3://my-bucket/index.faiss"`). | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| <Added version="2.11.0"/></paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Save a FaissIndex on disk. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>load_faiss_index</name><anchor>datasets.Dataset.load_faiss_index</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/search.py#L553</source><parameters>[{"name": "index_name", "val": ": str"}, {"name": "file", "val": ": typing.Union[str, pathlib.PurePath]"}, {"name": "device", "val": ": typing.Union[list[int], int, NoneType] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **index_name** (`str`) -- The index_name/identifier of the index. This is the index_name that is used to | |
| call `.get_nearest_examples` or `.search`. | |
| - **file** (`str`) -- The path to the serialized faiss index on disk or remote URI (e.g. `"s3://my-bucket/index.faiss"`). | |
| - **device** (`Union[int, List[int]]`, *optional*) -- If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. | |
| If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| <Added version="2.11.0"/></paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Load a FaissIndex from disk. | |
| If you want to do additional configurations, you can have access to the faiss index object by doing | |
| `.get_index(index_name).faiss_index` to make it fit your needs. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>add_elasticsearch_index</name><anchor>datasets.Dataset.add_elasticsearch_index</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L6266</source><parameters>[{"name": "column", "val": ": str"}, {"name": "index_name", "val": ": typing.Optional[str] = None"}, {"name": "host", "val": ": typing.Optional[str] = None"}, {"name": "port", "val": ": typing.Optional[int] = None"}, {"name": "es_client", "val": ": typing.Optional[ForwardRef('elasticsearch.Elasticsearch')] = None"}, {"name": "es_index_name", "val": ": typing.Optional[str] = None"}, {"name": "es_index_config", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **column** (`str`) -- | |
| The column of the documents to add to the index. | |
| - **index_name** (`str`, *optional*) -- | |
| The `index_name`/identifier of the index. | |
| This is the index name that is used to call [get_nearest_examples()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.get_nearest_examples) or [search()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.search). | |
| By default it corresponds to `column`. | |
| - **host** (`str`, *optional*, defaults to `localhost`) -- | |
| Host of where ElasticSearch is running. | |
| - **port** (`int`, *optional*, defaults to `9200`) -- | |
| Port of where ElasticSearch is running. | |
| - **es_client** (`elasticsearch.Elasticsearch`, *optional*) -- | |
| The elasticsearch client used to create the index if host and port are `None`. | |
| - **es_index_name** (`str`, *optional*) -- | |
| The elasticsearch index name used to create the index. | |
| - **es_index_config** (`dict`, *optional*) -- | |
| The configuration of the elasticsearch index. | |
| Default config is: | |
| ``` | |
| { | |
| "settings": { | |
| "number_of_shards": 1, | |
| "analysis": {"analyzer": {"stop_standard": {"type": "standard", "stopwords": "_english_"}}}, | |
| }, | |
| "mappings": { | |
| "properties": { | |
| "text": { | |
| "type": "text", | |
| "analyzer": "standard", | |
| "similarity": "BM25" | |
| }, | |
| } | |
| }, | |
| } | |
| ```</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Add a text index using ElasticSearch for fast retrieval. This is done in-place. | |
| <ExampleCodeBlock anchor="datasets.Dataset.add_elasticsearch_index.example"> | |
| Example: | |
| ```python | |
| >>> es_client = elasticsearch.Elasticsearch() | |
| >>> ds = datasets.load_dataset('crime_and_punish', split='train') | |
| >>> ds.add_elasticsearch_index(column='line', es_client=es_client, es_index_name="my_es_index") | |
| >>> scores, retrieved_examples = ds.get_nearest_examples('line', 'my new query', k=10) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>load_elasticsearch_index</name><anchor>datasets.Dataset.load_elasticsearch_index</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/search.py#L637</source><parameters>[{"name": "index_name", "val": ": str"}, {"name": "es_index_name", "val": ": str"}, {"name": "host", "val": ": typing.Optional[str] = None"}, {"name": "port", "val": ": typing.Optional[int] = None"}, {"name": "es_client", "val": ": typing.Optional[ForwardRef('Elasticsearch')] = None"}, {"name": "es_index_config", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **index_name** (`str`) -- | |
| The `index_name`/identifier of the index. This is the index name that is used to call `get_nearest_examples` or `search`. | |
| - **es_index_name** (`str`) -- | |
| The name of elasticsearch index to load. | |
| - **host** (`str`, *optional*, defaults to `localhost`) -- | |
| Host of where ElasticSearch is running. | |
| - **port** (`int`, *optional*, defaults to `9200`) -- | |
| Port of where ElasticSearch is running. | |
| - **es_client** (`elasticsearch.Elasticsearch`, *optional*) -- | |
| The elasticsearch client used to create the index if host and port are `None`. | |
| - **es_index_config** (`dict`, *optional*) -- | |
| The configuration of the elasticsearch index. | |
| Default config is: | |
| ``` | |
| { | |
| "settings": { | |
| "number_of_shards": 1, | |
| "analysis": {"analyzer": {"stop_standard": {"type": "standard", "stopwords": "_english_"}}}, | |
| }, | |
| "mappings": { | |
| "properties": { | |
| "text": { | |
| "type": "text", | |
| "analyzer": "standard", | |
| "similarity": "BM25" | |
| }, | |
| } | |
| }, | |
| } | |
| ```</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Load an existing text index using ElasticSearch for fast retrieval. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>list_indexes</name><anchor>datasets.Dataset.list_indexes</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/search.py#L438</source><parameters>[]</parameters></docstring> | |
| List the `index_name`/identifiers of all the attached indexes. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>get_index</name><anchor>datasets.Dataset.get_index</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/search.py#L442</source><parameters>[{"name": "index_name", "val": ": str"}]</parameters><paramsdesc>- **index_name** (`str`) -- Index name.</paramsdesc><paramgroups>0</paramgroups><retdesc>`BaseIndex`</retdesc></docstring> | |
| Return the index with the specified `index_name`. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>drop_index</name><anchor>datasets.Dataset.drop_index</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/search.py#L684</source><parameters>[{"name": "index_name", "val": ": str"}]</parameters><paramsdesc>- **index_name** (`str`) -- | |
| The `index_name`/identifier of the index.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Drop the index with the specified `index_name`. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>search</name><anchor>datasets.Dataset.search</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/search.py#L693</source><parameters>[{"name": "index_name", "val": ": str"}, {"name": "query", "val": ": typing.Union[str, <built-in function array>]"}, {"name": "k", "val": ": int = 10"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **index_name** (`str`) -- | |
| The name/identifier of the index. | |
| - **query** (`Union[str, np.ndarray]`) -- | |
| The query as a string if `index_name` is a text index or as a numpy array if `index_name` is a vector index. | |
| - **k** (`int`) -- | |
| The number of examples to retrieve.</paramsdesc><paramgroups>0</paramgroups><rettype>`(scores, indices)`</rettype><retdesc>A tuple of `(scores, indices)` where: | |
| - **scores** (`List[List[float]]`): the retrieval scores from either FAISS (`IndexFlatL2` by default) or ElasticSearch of the retrieved examples | |
| - **indices** (`List[List[int]]`): the indices of the retrieved examples</retdesc></docstring> | |
| Find the indices of the examples in the dataset that are nearest to the query. | |
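To make the `(scores, indices)` return value concrete, here is a minimal pure-Python sketch of a brute-force search over a tiny vector index. It is only a stand-in for what a flat L2 index computes (squared Euclidean distance, smallest first) and is not the library's implementation; `brute_force_search` and the toy `vectors` are hypothetical names introduced for illustration.

```python
def brute_force_search(vectors, query, k=10):
    """Return (scores, indices) of the k vectors closest to `query`.

    Scores are squared L2 distances, so a smaller score means a closer match.
    """
    distances = [
        (sum((v - q) ** 2 for v, q in zip(vector, query)), i)
        for i, vector in enumerate(vectors)
    ]
    distances.sort()  # ascending distance, ties broken by index
    top = distances[:k]
    scores = [score for score, _ in top]
    indices = [index for _, index in top]
    return scores, indices


# Hypothetical 2-dimensional embeddings, one per dataset row.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
scores, indices = brute_force_search(vectors, query=[0.9, 0.1], k=2)
print(indices)
```

With a real dataset, the vectors would live in an attached index and the equivalent call would be `ds.search(index_name, query, k=2)`.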
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>search_batch</name><anchor>datasets.Dataset.search_batch</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/search.py#L713</source><parameters>[{"name": "index_name", "val": ": str"}, {"name": "queries", "val": ": typing.Union[list[str], <built-in function array>]"}, {"name": "k", "val": ": int = 10"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **index_name** (`str`) -- | |
| The `index_name`/identifier of the index. | |
| - **queries** (`Union[List[str], np.ndarray]`) -- | |
| The queries as a list of strings if `index_name` is a text index or as a numpy array if `index_name` is a vector index. | |
| - **k** (`int`) -- | |
| The number of examples to retrieve per query.</paramsdesc><paramgroups>0</paramgroups><rettype>`(total_scores, total_indices)`</rettype><retdesc>A tuple of `(total_scores, total_indices)` where: | |
| - **total_scores** (`List[List[float]]`): the retrieval scores from either FAISS (`IndexFlatL2` by default) or ElasticSearch of the retrieved examples per query | |
| - **total_indices** (`List[List[int]]`): the indices of the retrieved examples per query</retdesc></docstring> | |
| Find the indices of the examples in the dataset that are nearest to each query. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>get_nearest_examples</name><anchor>datasets.Dataset.get_nearest_examples</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/search.py#L735</source><parameters>[{"name": "index_name", "val": ": str"}, {"name": "query", "val": ": typing.Union[str, <built-in function array>]"}, {"name": "k", "val": ": int = 10"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **index_name** (`str`) -- | |
| The `index_name`/identifier of the index. | |
| - **query** (`Union[str, np.ndarray]`) -- | |
| The query as a string if `index_name` is a text index or as a numpy array if `index_name` is a vector index. | |
| - **k** (`int`) -- | |
| The number of examples to retrieve.</paramsdesc><paramgroups>0</paramgroups><rettype>`(scores, examples)`</rettype><retdesc>A tuple of `(scores, examples)` where: | |
| - **scores** (`List[float]`): the retrieval scores from either FAISS (`IndexFlatL2` by default) or ElasticSearch of the retrieved examples | |
| - **examples** (`dict`): the retrieved examples</retdesc></docstring> | |
| Find the nearest examples in the dataset to the query. | |
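The relationship between `(scores, indices)` and `(scores, examples)` can be sketched in plain Python. The snippet assumes the retrieved examples come back in the same column-oriented `dict` layout as a dataset slice; `gather_examples` and the toy `columns` table are hypothetical and only illustrate the shape of the result.

```python
def gather_examples(columns, indices):
    """Select the rows at `indices` from a column-oriented table (dict of lists)."""
    return {name: [values[i] for i in indices] for name, values in columns.items()}


# Toy table standing in for a dataset with two columns.
columns = {"text": ["apple", "banana", "cherry"], "label": [0, 1, 0]}

# Hard-coded here; in practice these would come from a search over an index.
scores, indices = [0.02, 0.82], [1, 0]

examples = gather_examples(columns, indices)
print(examples)
```

Each list in `examples` is ordered like `scores`, so `examples["text"][0]` is the closest match.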
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>get_nearest_examples_batch</name><anchor>datasets.Dataset.get_nearest_examples_batch</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/search.py#L759</source><parameters>[{"name": "index_name", "val": ": str"}, {"name": "queries", "val": ": typing.Union[list[str], <built-in function array>]"}, {"name": "k", "val": ": int = 10"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **index_name** (`str`) -- | |
| The `index_name`/identifier of the index. | |
| - **queries** (`Union[List[str], np.ndarray]`) -- | |
| The queries as a list of strings if `index_name` is a text index or as a numpy array if `index_name` is a vector index. | |
| - **k** (`int`) -- | |
| The number of examples to retrieve per query.</paramsdesc><paramgroups>0</paramgroups><rettype>`(total_scores, total_examples)`</rettype><retdesc>A tuple of `(total_scores, total_examples)` where: | |
| - **total_scores** (`List[List[float]]`): the retrieval scores from either FAISS (`IndexFlatL2` by default) or ElasticSearch of the retrieved examples per query | |
| - **total_examples** (`List[dict]`): the retrieved examples per query</retdesc></docstring> | |
| Find the nearest examples in the dataset to each query. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>info</name><anchor>datasets.Dataset.info</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L167</source><parameters>[]</parameters></docstring> | |
| [DatasetInfo](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetInfo) object containing all the metadata in the dataset. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>split</name><anchor>datasets.Dataset.split</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L172</source><parameters>[]</parameters></docstring> | |
| [NamedSplit](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.NamedSplit) object corresponding to a named dataset split. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>builder_name</name><anchor>datasets.Dataset.builder_name</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L177</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>citation</name><anchor>datasets.Dataset.citation</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L181</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>config_name</name><anchor>datasets.Dataset.config_name</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L185</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>dataset_size</name><anchor>datasets.Dataset.dataset_size</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L189</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>description</name><anchor>datasets.Dataset.description</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L193</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>download_checksums</name><anchor>datasets.Dataset.download_checksums</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L197</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>download_size</name><anchor>datasets.Dataset.download_size</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L201</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>features</name><anchor>datasets.Dataset.features</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L780</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>homepage</name><anchor>datasets.Dataset.homepage</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L209</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>license</name><anchor>datasets.Dataset.license</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L213</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>size_in_bytes</name><anchor>datasets.Dataset.size_in_bytes</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L217</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>supervised_keys</name><anchor>datasets.Dataset.supervised_keys</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L221</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>version</name><anchor>datasets.Dataset.version</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L225</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_csv</name><anchor>datasets.Dataset.from_csv</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1070</source><parameters>[{"name": "path_or_paths", "val": ": typing.Union[str, bytes, os.PathLike, list[typing.Union[str, bytes, os.PathLike]]]"}, {"name": "split", "val": ": typing.Optional[datasets.splits.NamedSplit] = None"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "cache_dir", "val": ": str = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_paths** (`path-like` or list of `path-like`) -- | |
| Path(s) of the CSV file(s). | |
| - **split** ([NamedSplit](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.NamedSplit), *optional*) -- | |
| Split name to be assigned to the dataset. | |
| - **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), *optional*) -- | |
| Dataset features. | |
| - **cache_dir** (`str`, *optional*, defaults to `"~/.cache/huggingface/datasets"`) -- | |
| Directory to cache data. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory. | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| Number of processes when downloading and generating the dataset locally. | |
| This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default. | |
| <Added version="2.8.0"/> | |
| - ****kwargs** (additional keyword arguments) -- | |
| Keyword arguments to be passed to `pandas.read_csv`.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Create Dataset from CSV file(s). | |
| <ExampleCodeBlock anchor="datasets.Dataset.from_csv.example"> | |
| Example: | |
| ```py | |
| >>> ds = Dataset.from_csv('path/to/dataset.csv') | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_json</name><anchor>datasets.Dataset.from_json</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1206</source><parameters>[{"name": "path_or_paths", "val": ": typing.Union[str, bytes, os.PathLike, list[typing.Union[str, bytes, os.PathLike]]]"}, {"name": "split", "val": ": typing.Optional[datasets.splits.NamedSplit] = None"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "cache_dir", "val": ": str = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "field", "val": ": typing.Optional[str] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_paths** (`path-like` or list of `path-like`) -- | |
| Path(s) of the JSON or JSON Lines file(s). | |
| - **split** ([NamedSplit](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.NamedSplit), *optional*) -- | |
| Split name to be assigned to the dataset. | |
| - **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), *optional*) -- | |
| Dataset features. | |
| - **cache_dir** (`str`, *optional*, defaults to `"~/.cache/huggingface/datasets"`) -- | |
| Directory to cache data. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory. | |
| - **field** (`str`, *optional*) -- | |
| Name of the field in the JSON file that contains the dataset. | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| Number of processes when downloading and generating the dataset locally. | |
| This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default. | |
| <Added version="2.8.0"/> | |
| - ****kwargs** (additional keyword arguments) -- | |
| Keyword arguments to be passed to `JsonConfig`.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Create Dataset from JSON or JSON Lines file(s). | |
| <ExampleCodeBlock anchor="datasets.Dataset.from_json.example"> | |
| Example: | |
| ```py | |
| >>> ds = Dataset.from_json('path/to/dataset.json') | |
| ``` | |
| </ExampleCodeBlock> | |
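The `field` parameter matters when the records are nested under a top-level key instead of sitting at the root of the file. The following standard-library sketch shows which part of such a document would become the dataset rows; `select_field` and the sample document are hypothetical and do not reproduce the library's JSON loader.

```python
import json

# A JSON document whose rows are nested under the "data" key.
raw = '{"version": "0.1.0", "data": [{"a": 1, "b": 2.0}, {"a": 4, "b": 5.0}]}'


def select_field(document, field=None):
    """Parse `document` and return the part that holds the records."""
    parsed = json.loads(document)
    return parsed if field is None else parsed[field]


records = select_field(raw, field="data")
print(records)
```

For a file with this layout, the corresponding call would be `Dataset.from_json('path/to/dataset.json', field='data')`.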
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_parquet</name><anchor>datasets.Dataset.from_parquet</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1263</source><parameters>[{"name": "path_or_paths", "val": ": typing.Union[str, bytes, os.PathLike, list[typing.Union[str, bytes, os.PathLike]]]"}, {"name": "split", "val": ": typing.Optional[datasets.splits.NamedSplit] = None"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "cache_dir", "val": ": str = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "columns", "val": ": typing.Optional[list[str]] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_paths** (`path-like` or list of `path-like`) -- | |
| Path(s) of the Parquet file(s). | |
| - **split** (`NamedSplit`, *optional*) -- | |
| Split name to be assigned to the dataset. | |
| - **features** (`Features`, *optional*) -- | |
| Dataset features. | |
| - **cache_dir** (`str`, *optional*, defaults to `"~/.cache/huggingface/datasets"`) -- | |
| Directory to cache data. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory. | |
| - **columns** (`List[str]`, *optional*) -- | |
| If not `None`, only these columns will be read from the file. | |
| A column name may be a prefix of a nested field, e.g. 'a' will select | |
| 'a.b', 'a.c', and 'a.d.e'. | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| Number of processes when downloading and generating the dataset locally. | |
| This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default. | |
| <Added version="2.8.0"/> | |
| - ****kwargs** (additional keyword arguments) -- | |
| Keyword arguments to be passed to `ParquetConfig`.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Create Dataset from Parquet file(s). | |
| <ExampleCodeBlock anchor="datasets.Dataset.from_parquet.example"> | |
| Example: | |
| ```py | |
| >>> ds = Dataset.from_parquet('path/to/dataset.parquet') | |
| ``` | |
| </ExampleCodeBlock> | |
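The prefix rule described for `columns` can be illustrated with a short sketch over flattened field names. It assumes matching happens on whole path components (so `'a'` selects `'a.b'` but not `'ab'`); `select_columns` is a hypothetical helper, not part of the Parquet reader.

```python
def select_columns(flat_names, columns=None):
    """Keep the flattened field names selected by `columns`.

    A requested name matches itself and any nested field below it.
    """
    if columns is None:
        return list(flat_names)
    return [
        name
        for name in flat_names
        if any(name == column or name.startswith(column + ".") for column in columns)
    ]


flat_names = ["a.b", "a.c", "a.d.e", "ab", "f"]
selected = select_columns(flat_names, columns=["a"])
print(selected)
```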
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_text</name><anchor>datasets.Dataset.from_text</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1322</source><parameters>[{"name": "path_or_paths", "val": ": typing.Union[str, bytes, os.PathLike, list[typing.Union[str, bytes, os.PathLike]]]"}, {"name": "split", "val": ": typing.Optional[datasets.splits.NamedSplit] = None"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "cache_dir", "val": ": str = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_paths** (`path-like` or list of `path-like`) -- | |
| Path(s) of the text file(s). | |
| - **split** (`NamedSplit`, *optional*) -- | |
| Split name to be assigned to the dataset. | |
| - **features** (`Features`, *optional*) -- | |
| Dataset features. | |
| - **cache_dir** (`str`, *optional*, defaults to `"~/.cache/huggingface/datasets"`) -- | |
| Directory to cache data. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory. | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| Number of processes when downloading and generating the dataset locally. | |
| This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default. | |
| <Added version="2.8.0"/> | |
| - ****kwargs** (additional keyword arguments) -- | |
| Keyword arguments to be passed to `TextConfig`.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Create Dataset from text file(s). | |
| <ExampleCodeBlock anchor="datasets.Dataset.from_text.example"> | |
| Example: | |
| ```py | |
| >>> ds = Dataset.from_text('path/to/dataset.txt') | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_sql</name><anchor>datasets.Dataset.from_sql</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L1437</source><parameters>[{"name": "sql", "val": ": typing.Union[str, ForwardRef('sqlalchemy.sql.Selectable')]"}, {"name": "con", "val": ": typing.Union[str, ForwardRef('sqlalchemy.engine.Connection'), ForwardRef('sqlalchemy.engine.Engine'), ForwardRef('sqlite3.Connection')]"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "cache_dir", "val": ": str = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **sql** (`str` or `sqlalchemy.sql.Selectable`) -- | |
| SQL query to be executed or a table name. | |
| - **con** (`str` or `sqlite3.Connection` or `sqlalchemy.engine.Connection` or `sqlalchemy.engine.Engine`) -- | |
| A [URI string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) used to instantiate a database connection or a SQLite3/SQLAlchemy connection object. | |
| - **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), *optional*) -- | |
| Dataset features. | |
| - **cache_dir** (`str`, *optional*, defaults to `"~/.cache/huggingface/datasets"`) -- | |
| Directory to cache data. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory. | |
| - ****kwargs** (additional keyword arguments) -- | |
| Keyword arguments to be passed to `SqlConfig`.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset)</retdesc></docstring> | |
| Create Dataset from SQL query or database table. | |
| <ExampleCodeBlock anchor="datasets.Dataset.from_sql.example"> | |
| Example: | |
| ```py | |
| >>> # Fetch a database table | |
| >>> ds = Dataset.from_sql("test_data", "postgres:///db_name") | |
| >>> # Execute a SQL query on the table | |
| >>> ds = Dataset.from_sql("SELECT sentence FROM test_data", "postgres:///db_name") | |
| >>> # Use a Selectable object to specify the query | |
| >>> from sqlalchemy import select, text | |
| >>> stmt = select([text("sentence")]).select_from(text("test_data")) | |
| >>> ds = Dataset.from_sql(stmt, "postgres:///db_name") | |
| ``` | |
| </ExampleCodeBlock> | |
| > [!TIP] | |
| > The returned dataset can only be cached if `con` is specified as URI string. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>align_labels_with_mapping</name><anchor>datasets.Dataset.align_labels_with_mapping</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L6388</source><parameters>[{"name": "label2id", "val": ": dict"}, {"name": "label_column", "val": ": str"}]</parameters><paramsdesc>- **label2id** (`dict`) -- | |
| The label name to ID mapping to align the dataset with. | |
| - **label_column** (`str`) -- | |
| The column name of labels to align on.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Align the dataset's label ID and label name mapping to match an input `label2id` mapping. | |
| This is useful when you want to ensure that a model's predicted labels are aligned with the dataset. | |
| The alignment is done using the lowercase label names. | |
| <ExampleCodeBlock anchor="datasets.Dataset.align_labels_with_mapping.example"> | |
| Example: | |
| ```python | |
| >>> # dataset with mapping {'entailment': 0, 'neutral': 1, 'contradiction': 2} | |
| >>> ds = load_dataset("nyu-mll/glue", "mnli", split="train") | |
| >>> # mapping to align with | |
| >>> label2id = {'CONTRADICTION': 0, 'NEUTRAL': 1, 'ENTAILMENT': 2} | |
| >>> ds_aligned = ds.align_labels_with_mapping(label2id, "label") | |
| ``` | |
| </ExampleCodeBlock> | |
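The remapping itself is easy to follow on plain lists. This sketch assumes the alignment works by translating each label ID to its lowercase name and then looking that name up in the target mapping; `align_label_ids` is a hypothetical helper that operates on a list of IDs instead of a dataset column.

```python
def align_label_ids(label_ids, dataset_label2id, target_label2id):
    """Translate label IDs from the dataset's mapping to the target mapping."""
    # Lowercase both sides so 'ENTAILMENT' and 'entailment' are treated as equal.
    id2name = {i: name.lower() for name, i in dataset_label2id.items()}
    target = {name.lower(): i for name, i in target_label2id.items()}
    return [target[id2name[label_id]] for label_id in label_ids]


dataset_label2id = {"entailment": 0, "neutral": 1, "contradiction": 2}
target_label2id = {"CONTRADICTION": 0, "NEUTRAL": 1, "ENTAILMENT": 2}

aligned = align_label_ids([0, 1, 2, 2], dataset_label2id, target_label2id)
print(aligned)
```

An `entailment` row stored as `0` in the dataset becomes `2`, which is the ID the model expects for that label.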
| </div></div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>datasets.concatenate_datasets</name><anchor>datasets.concatenate_datasets</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/combine.py#L166</source><parameters>[{"name": "dsets", "val": ": list"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "split", "val": ": typing.Optional[datasets.splits.NamedSplit] = None"}, {"name": "axis", "val": ": int = 0"}]</parameters><paramsdesc>- **dsets** (`List[datasets.Dataset]`) -- | |
| List of Datasets to concatenate. | |
| - **info** (`DatasetInfo`, *optional*) -- | |
| Dataset information, like description, citation, etc. | |
| - **split** (`NamedSplit`, *optional*) -- | |
| Name of the dataset split. | |
| - **axis** (`{0, 1}`, defaults to `0`) -- | |
| Axis to concatenate over, where `0` means over rows (vertically) and `1` means over columns | |
| (horizontally). | |
| <Added version="1.6.0"/></paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Converts a list of [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) with the same schema into a single [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset). | |
| <ExampleCodeBlock anchor="datasets.concatenate_datasets.example"> | |
| Example: | |
| ```py | |
| >>> ds3 = concatenate_datasets([ds1, ds2]) | |
| ``` | |
| </ExampleCodeBlock> | |
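The difference between `axis=0` and `axis=1` can be sketched on column-oriented tables (dicts of lists). This is an illustration of the semantics under simplifying assumptions, not the Arrow-based implementation; `concatenate_tables` and the toy tables are hypothetical.

```python
def concatenate_tables(tables, axis=0):
    """Concatenate column-oriented tables (dicts of equal-length lists)."""
    if axis == 0:
        # Rows are stacked: every table must expose the same columns.
        names = list(tables[0])
        if any(set(table) != set(names) for table in tables):
            raise ValueError("axis=0 requires identical column names")
        return {name: [v for table in tables for v in table[name]] for name in names}
    if axis == 1:
        # Columns are placed side by side: names must not clash and row counts must match.
        merged = {}
        for table in tables:
            for name, values in table.items():
                if name in merged:
                    raise ValueError(f"duplicate column: {name}")
                merged[name] = list(values)
        if len({len(values) for values in merged.values()}) > 1:
            raise ValueError("axis=1 requires the same number of rows")
        return merged
    raise ValueError("axis must be 0 or 1")


ds1 = {"a": [0, 1, 2]}
ds2 = {"a": [3, 4]}
ds3 = {"b": ["x", "y", "z"]}

rows = concatenate_tables([ds1, ds2], axis=0)  # 5 rows, 1 column
cols = concatenate_tables([ds1, ds3], axis=1)  # 3 rows, 2 columns
print(rows, cols)
```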
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>datasets.interleave_datasets</name><anchor>datasets.interleave_datasets</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/combine.py#L18</source><parameters>[{"name": "datasets", "val": ": list"}, {"name": "probabilities", "val": ": typing.Optional[list[float]] = None"}, {"name": "seed", "val": ": typing.Optional[int] = None"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "split", "val": ": typing.Optional[datasets.splits.NamedSplit] = None"}, {"name": "stopping_strategy", "val": ": typing.Literal['first_exhausted', 'all_exhausted', 'all_exhausted_without_replacement'] = 'first_exhausted'"}]</parameters><paramsdesc>- **datasets** (`List[Dataset]` or `List[IterableDataset]`) -- | |
| List of datasets to interleave. | |
| - **probabilities** (`List[float]`, *optional*, defaults to `None`) -- | |
| If specified, the new dataset is constructed by sampling | |
| examples from one source at a time according to these probabilities. | |
| - **seed** (`int`, *optional*, defaults to `None`) -- | |
| The random seed used to choose a source for each example. | |
| - **info** ([DatasetInfo](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetInfo), *optional*) -- | |
| Dataset information, like description, citation, etc. | |
| <Added version="2.4.0"/> | |
| - **split** ([NamedSplit](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.NamedSplit), *optional*) -- | |
| Name of the dataset split. | |
| <Added version="2.4.0"/> | |
| - **stopping_strategy** (`str`, defaults to `first_exhausted`) -- | |
| Three strategies are currently available: `first_exhausted`, `all_exhausted` and `all_exhausted_without_replacement`. | |
| The default, `first_exhausted`, is an undersampling strategy, i.e. the dataset construction stops as soon as one dataset has run out of samples. | |
| `all_exhausted` is an oversampling strategy, i.e. the dataset construction stops as soon as every sample of every dataset has been added at least once. | |
| With `all_exhausted_without_replacement`, each sample in each dataset is sampled only once. | |
| Note that if the strategy is `all_exhausted`, the interleaved dataset size can get enormous: | |
| - with no probabilities, the resulting dataset will have `max_length_datasets*nb_dataset` samples. | |
| - with given probabilities, the resulting dataset will have more samples if some datasets have really low probability of visiting.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) or [IterableDataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset)</rettype><retdesc>Return type depends on the input `datasets` | |
| parameter. `Dataset` if the input is a list of `Dataset`, `IterableDataset` if the input is a list of | |
| `IterableDataset`.</retdesc></docstring> | |
| Interleave several datasets (sources) into a single dataset. | |
| The new dataset is constructed by alternating between the sources to get the examples. | |
| You can use this function on a list of [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) objects, or on a list of [IterableDataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset) objects. | |
| - If `probabilities` is `None` (default) the new dataset is constructed by cycling between each source to get the examples. | |
| - If `probabilities` is not `None`, the new dataset is constructed by getting examples from a random source at a time according to the provided probabilities. | |
| The resulting dataset ends when one of the source datasets runs out of examples, except when `stopping_strategy` is `all_exhausted`, | |
| in which case the resulting dataset ends when all datasets have run out of examples at least once. | |
| Note for iterable datasets: | |
| In a distributed setup or in PyTorch DataLoader workers, the stopping strategy is applied per process. | |
| Therefore the "first_exhausted" strategy on a sharded iterable dataset can generate fewer samples in total (up to 1 missing sample per subdataset per worker). | |
| Example: | |
| <ExampleCodeBlock anchor="datasets.interleave_datasets.example"> | |
| For regular datasets (map-style): | |
| ```python | |
| >>> from datasets import Dataset, interleave_datasets | |
| >>> d1 = Dataset.from_dict({"a": [0, 1, 2]}) | |
| >>> d2 = Dataset.from_dict({"a": [10, 11, 12]}) | |
| >>> d3 = Dataset.from_dict({"a": [20, 21, 22]}) | |
| >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted") | |
| >>> dataset["a"] | |
| [10, 0, 11, 1, 2, 20, 12, 10, 0, 1, 2, 21, 0, 11, 1, 2, 0, 1, 12, 2, 10, 0, 22] | |
| >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42) | |
| >>> dataset["a"] | |
| [10, 0, 11, 1, 2] | |
| >>> dataset = interleave_datasets([d1, d2, d3]) | |
| >>> dataset["a"] | |
| [0, 10, 20, 1, 11, 21, 2, 12, 22] | |
| >>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted") | |
| >>> dataset["a"] | |
| [0, 10, 20, 1, 11, 21, 2, 12, 22] | |
| >>> d1 = Dataset.from_dict({"a": [0, 1, 2]}) | |
| >>> d2 = Dataset.from_dict({"a": [10, 11, 12, 13]}) | |
| >>> d3 = Dataset.from_dict({"a": [20, 21, 22, 23, 24]}) | |
| >>> dataset = interleave_datasets([d1, d2, d3]) | |
| >>> dataset["a"] | |
| [0, 10, 20, 1, 11, 21, 2, 12, 22] | |
| >>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted") | |
| >>> dataset["a"] | |
| [0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 10, 24] | |
| >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42) | |
| >>> dataset["a"] | |
| [10, 0, 11, 1, 2] | |
| >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted") | |
| >>> dataset["a"] | |
| [10, 0, 11, 1, 2, 20, 12, 13, ..., 0, 1, 2, 0, 24] | |
| >>> # For datasets in streaming mode (iterable): | |
| >>> from datasets import interleave_datasets, load_dataset | |
| >>> d1 = load_dataset('allenai/c4', 'es', split='train', streaming=True) | |
| >>> d2 = load_dataset('allenai/c4', 'fr', split='train', streaming=True) | |
| >>> dataset = interleave_datasets([d1, d2]) | |
| >>> iterator = iter(dataset) | |
| >>> next(iterator) | |
| {'text': 'Comprar Zapatillas para niña en chancla con goma por...'} | |
| >>> next(iterator) | |
| {'text': 'Le sacre de philippe ier, 23 mai 1059 - Compte Rendu...'} | |
| ``` | |
| </ExampleCodeBlock> | |
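When `probabilities` is `None`, the cycling behaviour of `first_exhausted` and `all_exhausted` can be reproduced with a few lines of plain Python. The sketch below works on lists and is only meant to explain the outputs shown in the example above; `interleave_round_robin` is a hypothetical helper, and the sampled (`probabilities`) and `all_exhausted_without_replacement` cases are not covered.

```python
def interleave_round_robin(sources, stopping_strategy="first_exhausted"):
    """Cycle through `sources`, taking one example from each per round."""
    lengths = [len(source) for source in sources]
    if stopping_strategy == "first_exhausted":
        rounds = min(lengths)  # stop when the shortest source runs out
    elif stopping_strategy == "all_exhausted":
        rounds = max(lengths)  # keep going until the longest source is used up
    else:
        raise ValueError(f"unsupported strategy in this sketch: {stopping_strategy}")
    # Shorter sources wrap around (oversampling) once they are exhausted.
    return [source[i % len(source)] for i in range(rounds) for source in sources]


d1 = [0, 1, 2]
d2 = [10, 11, 12, 13]
d3 = [20, 21, 22, 23, 24]

undersampled = interleave_round_robin([d1, d2, d3])
oversampled = interleave_round_robin([d1, d2, d3], stopping_strategy="all_exhausted")
print(undersampled)
print(oversampled)
```

The oversampled result has `max(lengths) * len(sources)` items, which is why `all_exhausted` can grow large when one source is much longer than the others.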
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>datasets.distributed.split_dataset_by_node</name><anchor>datasets.distributed.split_dataset_by_node</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/distributed.py#L10</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "rank", "val": ": int"}, {"name": "world_size", "val": ": int"}]</parameters><paramsdesc>- **dataset** ([Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) or [IterableDataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset)) -- | |
| The dataset to split by node. | |
| - **rank** (`int`) -- | |
| Rank of the current node. | |
| - **world_size** (`int`) -- | |
| Total number of nodes.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) or [IterableDataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset)</rettype><retdesc>The dataset to be used on the node at rank `rank`.</retdesc></docstring> | |
| Split a dataset for the node at rank `rank` in a pool of nodes of size `world_size`. | |
| For map-style datasets: | |
| Each node is assigned a chunk of data, e.g. rank 0 is given the first chunk of the dataset. | |
| To maximize data loading throughput, chunks are made of contiguous data on disk if possible. | |
| For iterable datasets: | |
| If the dataset has a number of shards that is a multiple of `world_size` (i.e. if `dataset.num_shards % world_size == 0`), | |
| then the shards are evenly assigned across the nodes, which is the most efficient option. | |
| Otherwise, each node keeps 1 example out of `world_size`, skipping the other examples. | |
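The two strategies described above can be illustrated with plain Python lists. This is a minimal sketch, not the library's implementation: the helper names `contiguous_chunk` and `iterable_share` are hypothetical, and both the way an uneven remainder is distributed and the round-robin assignment of shards to ranks are assumptions made for illustration.

```python
from typing import List, Sequence


def contiguous_chunk(rows: Sequence, rank: int, world_size: int) -> List:
    """Map-style case: give each rank one contiguous slice of the rows."""
    # Assumption: the first ranks absorb the remainder when the split is uneven.
    base, extra = divmod(len(rows), world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return list(rows[start:stop])


def iterable_share(shards: Sequence[Sequence], rank: int, world_size: int) -> List:
    """Iterable case: assign whole shards when possible, else skip examples."""
    if len(shards) % world_size == 0:
        # The shard count is a multiple of world_size: each rank reads only
        # its own shards (assumed here to be assigned round-robin).
        own_shards = shards[rank::world_size]
        return [example for shard in own_shards for example in shard]
    # Otherwise every rank streams all shards and keeps 1 example out of world_size.
    stream = [example for shard in shards for example in shard]
    return stream[rank::world_size]


chunk = contiguous_chunk(list(range(10)), rank=1, world_size=3)
even = iterable_share([[0, 1], [2, 3], [4, 5], [6, 7]], rank=1, world_size=2)
uneven = iterable_share([[0, 1, 2], [3, 4, 5], [6, 7, 8]], rank=0, world_size=2)
```

In the uneven case every rank still has to read all the shards, which is why the shard-based assignment is the preferred path.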
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>datasets.enable_caching</name><anchor>datasets.enable_caching</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/fingerprint.py#L94</source><parameters>[]</parameters></docstring> | |
| When applying transforms on a dataset, the data are stored in cache files. | |
| The caching mechanism allows reloading an existing cache file if it has already been computed. | |
| Reloading a dataset is possible since the cache files are named using the dataset fingerprint, which is updated | |
| after each transform. | |
| If disabled, the library will no longer reload cached dataset files when applying transforms to the datasets. | |
| More precisely, if the caching is disabled: | |
| - cache files are always recreated | |
| - cache files are written to a temporary directory that is deleted when the session closes | |
| - cache files are named using a random hash instead of the dataset fingerprint | |
| - use [save_to_disk()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.save_to_disk) to save a transformed dataset or it will be deleted when the session closes | |
| - caching doesn't affect [load_dataset()](/docs/datasets/pr_7835/en/package_reference/loading_methods#datasets.load_dataset). If you want to regenerate a dataset from scratch you should use | |
| the `download_mode` parameter in [load_dataset()](/docs/datasets/pr_7835/en/package_reference/loading_methods#datasets.load_dataset). | |
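The fingerprint-based naming described above can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the library's code: `update_fingerprint` and `cached_transform` are hypothetical helpers, the fingerprint is derived only from the transform's name (the real fingerprint covers more than that), and the cache is a plain text file.

```python
import hashlib
import os
import tempfile
import uuid


def update_fingerprint(fingerprint: str, transform_name: str) -> str:
    """Derive a new fingerprint from the previous one and the transform applied."""
    payload = f"{fingerprint}:{transform_name}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def cached_transform(values, fingerprint, transform, cache_dir, caching_enabled=True):
    """Apply `transform`, reloading a cache file named after the new fingerprint."""
    new_fingerprint = update_fingerprint(fingerprint, transform.__name__)
    # With caching disabled the file gets a random name, so it is never found again.
    file_id = new_fingerprint if caching_enabled else uuid.uuid4().hex
    path = os.path.join(cache_dir, f"cache-{file_id}.txt")
    if caching_enabled and os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return [int(line) for line in handle], new_fingerprint, True
    result = [transform(value) for value in values]
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(str(value) for value in result))
    return result, new_fingerprint, False


def double(value: int) -> int:
    return value * 2


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_transform([1, 2, 3], "base", double, cache_dir)
    second = cached_transform([1, 2, 3], "base", double, cache_dir)
    uncached = cached_transform([1, 2, 3], "base", double, cache_dir, caching_enabled=False)
```

The third element of each returned tuple records whether the result was reloaded: the first call computes, the second reloads, and the call with caching disabled always recomputes.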
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>datasets.disable_caching</name><anchor>datasets.disable_caching</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/fingerprint.py#L115</source><parameters>[]</parameters></docstring> | |
| When applying transforms on a dataset, the data are stored in cache files. | |
| The caching mechanism allows reloading an existing cache file if it has already been computed. | |
| Reloading a dataset is possible since the cache files are named using the dataset fingerprint, which is updated | |
| after each transform. | |
| If disabled, the library will no longer reload cached dataset files when applying transforms to the datasets. | |
| More precisely, if the caching is disabled: | |
| - cache files are always recreated | |
| - cache files are written to a temporary directory that is deleted when the session closes | |
| - cache files are named using a random hash instead of the dataset fingerprint | |
| - use [save_to_disk()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.save_to_disk) to save a transformed dataset or it will be deleted when the session closes | |
| - caching doesn't affect [load_dataset()](/docs/datasets/pr_7835/en/package_reference/loading_methods#datasets.load_dataset). If you want to regenerate a dataset from scratch you should use | |
| the `download_mode` parameter in [load_dataset()](/docs/datasets/pr_7835/en/package_reference/loading_methods#datasets.load_dataset). | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>datasets.is_caching_enabled</name><anchor>datasets.is_caching_enabled</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/fingerprint.py#L136</source><parameters>[]</parameters></docstring> | |
| When applying transforms on a dataset, the data are stored in cache files. | |
| The caching mechanism allows reloading an existing cache file if it has already been computed. | |
| Reloading a dataset is possible since the cache files are named using the dataset fingerprint, which is updated | |
| after each transform. | |
| If disabled, the library will no longer reload cached dataset files when applying transforms to the datasets. | |
| More precisely, if the caching is disabled: | |
| - cache files are always recreated | |
| - cache files are written to a temporary directory that is deleted when the session closes | |
| - cache files are named using a random hash instead of the dataset fingerprint | |
| - use [save_to_disk()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.save_to_disk) to save a transformed dataset or it will be deleted when the session closes | |
| - caching doesn't affect [load_dataset()](/docs/datasets/pr_7835/en/package_reference/loading_methods#datasets.load_dataset). If you want to regenerate a dataset from scratch you should use | |
| the `download_mode` parameter in [load_dataset()](/docs/datasets/pr_7835/en/package_reference/loading_methods#datasets.load_dataset). | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Column</name><anchor>datasets.Column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L633</source><parameters>[{"name": "source", "val": ": typing.Union[ForwardRef('Dataset'), ForwardRef('Column')]"}, {"name": "column_name", "val": ": str"}]</parameters></docstring> | |
| An iterable for a specific column of a [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset). | |
| Example: | |
| <ExampleCodeBlock anchor="datasets.Column.example"> | |
| Iterate on the texts of the "text" column of a dataset: | |
| ```python | |
| for text in dataset["text"]: | |
| ... | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.Column.example-2"> | |
| It also works with nested columns: | |
| ```python | |
| for source in dataset["metadata"]["source"]: | |
| ... | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| ## DatasetDict[[datasets.DatasetDict]] | |
| Dictionary with split names as keys ('train', 'test' for example), and `Dataset` objects as values. | |
| It also has dataset transform methods like map or filter, to process all the splits at once. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.DatasetDict</name><anchor>datasets.DatasetDict</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L57</source><parameters>""</parameters></docstring> | |
| A dictionary (dict of str: datasets.Dataset) with dataset transform methods (map, filter, etc.). | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>data</name><anchor>datasets.DatasetDict.data</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L98</source><parameters>[]</parameters></docstring> | |
| The Apache Arrow tables backing each split. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.data.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds.data | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cache_files</name><anchor>datasets.DatasetDict.cache_files</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L113</source><parameters>[]</parameters></docstring> | |
| The cache files containing the Apache Arrow table backing each split. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.cache_files.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds.cache_files | |
| {'test': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-test.arrow'}], | |
| 'train': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-train.arrow'}], | |
| 'validation': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-validation.arrow'}]} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>num_columns</name><anchor>datasets.DatasetDict.num_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L131</source><parameters>[]</parameters></docstring> | |
| Number of columns in each split of the dataset. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.num_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds.num_columns | |
| {'test': 2, 'train': 2, 'validation': 2} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>num_rows</name><anchor>datasets.DatasetDict.num_rows</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L147</source><parameters>[]</parameters></docstring> | |
| Number of rows in each split of the dataset. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.num_rows.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds.num_rows | |
| {'test': 1066, 'train': 8530, 'validation': 1066} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>column_names</name><anchor>datasets.DatasetDict.column_names</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L163</source><parameters>[]</parameters></docstring> | |
| Names of the columns in each split of the dataset. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.column_names.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds.column_names | |
| {'test': ['text', 'label'], | |
| 'train': ['text', 'label'], | |
| 'validation': ['text', 'label']} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>shape</name><anchor>datasets.DatasetDict.shape</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L181</source><parameters>[]</parameters></docstring> | |
| Shape of each split of the dataset (number of rows, number of columns). | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.shape.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds.shape | |
| {'test': (1066, 2), 'train': (8530, 2), 'validation': (1066, 2)} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>unique</name><anchor>datasets.DatasetDict.unique</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L230</source><parameters>[{"name": "column", "val": ": str"}]</parameters><paramsdesc>- **column** (`str`) -- | |
| column name (list all the column names with [column_names](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict.column_names))</paramsdesc><paramgroups>0</paramgroups><rettype>Dict[`str`, `list`]</rettype><retdesc>Dictionary of unique elements in the given column.</retdesc></docstring> | |
| Return a list of the unique elements in a column for each split. | |
| This is implemented in the low-level backend and as such, very fast. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.unique.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds.unique("label") | |
| {'test': [1, 0], 'train': [1, 0], 'validation': [1, 0]} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cleanup_cache_files</name><anchor>datasets.DatasetDict.cleanup_cache_files</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L254</source><parameters>[]</parameters><retdesc>`Dict` with the number of removed files for each split</retdesc></docstring> | |
| Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one. | |
| Be careful when running this command that no other process is currently using other cache files. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.cleanup_cache_files.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds.cleanup_cache_files() | |
| {'test': 0, 'train': 0, 'validation': 0} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>map</name><anchor>datasets.DatasetDict.map</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L818</source><parameters>[{"name": "function", "val": ": typing.Optional[typing.Callable] = None"}, {"name": "with_indices", "val": ": bool = False"}, {"name": "with_rank", "val": ": bool = False"}, {"name": "with_split", "val": ": bool = False"}, {"name": "input_columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "batched", "val": ": bool = False"}, {"name": "batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "drop_last_batch", "val": ": bool = False"}, {"name": "remove_columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "load_from_cache_file", "val": ": typing.Optional[bool] = None"}, {"name": "cache_file_names", "val": ": typing.Optional[dict[str, typing.Optional[str]]] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "disable_nullable", "val": ": bool = False"}, {"name": "fn_kwargs", "val": ": typing.Optional[dict] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "desc", "val": ": typing.Optional[str] = None"}, {"name": "try_original_type", "val": ": typing.Optional[bool] = True"}]</parameters><paramsdesc>- **function** (`callable`) -- Callable with one of the following signatures: | |
| - `function(example: Dict[str, Any]) -> Dict[str, Any]` if `batched=False` and `with_indices=False` | |
| - `function(example: Dict[str, Any], indices: int) -> Dict[str, Any]` if `batched=False` and `with_indices=True` | |
| - `function(batch: Dict[str, list]) -> Dict[str, list]` if `batched=True` and `with_indices=False` | |
| - `function(batch: Dict[str, list], indices: list[int]) -> Dict[str, list]` if `batched=True` and `with_indices=True` | |
| For advanced usage, the function can also return a `pyarrow.Table`. | |
| If the function is asynchronous, then `map` will run your function in parallel. | |
| Moreover if your function returns nothing (`None`), then `map` will run your function and return the dataset unchanged. | |
| If no function is provided, default to identity function: `lambda x: x`. | |
| - **with_indices** (`bool`, defaults to `False`) -- | |
| Provide example indices to `function`. Note that in this case the signature of `function` should be `def function(example, idx): ...`. | |
| - **with_rank** (`bool`, defaults to `False`) -- | |
| Provide process rank to `function`. Note that in this case the | |
| signature of `function` should be `def function(example[, idx], rank): ...`. | |
| - **with_split** (`bool`, defaults to `False`) -- | |
| Provide the split name to `function`. Note that in this case the | |
| signature of `function` should be `def function(example[, idx], split): ...`. | |
| - **input_columns** (`[Union[str, list[str]]]`, *optional*, defaults to `None`) -- | |
| The columns to be passed into `function` as | |
| positional arguments. If `None`, a dict mapping to all formatted columns is passed as one argument. | |
| - **batched** (`bool`, defaults to `False`) -- | |
| Provide batch of examples to `function`. | |
| - **batch_size** (`int`, *optional*, defaults to `1000`) -- | |
| Number of examples per batch provided to `function` if `batched=True`. | |
| If `batch_size <= 0` or `batch_size == None`, the full dataset is provided as a single batch to `function`. | |
| - **drop_last_batch** (`bool`, defaults to `False`) -- | |
| Whether a last batch smaller than the batch_size should be | |
| dropped instead of being processed by the function. | |
| - **remove_columns** (`[Union[str, list[str]]]`, *optional*, defaults to `None`) -- | |
| Remove a selection of columns while doing the mapping. | |
| Columns will be removed before updating the examples with the output of `function`, i.e. if `function` is adding | |
| columns with names in `remove_columns`, these columns will be kept. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Keep the dataset in memory instead of writing it to a cache file. | |
| - **load_from_cache_file** (`Optional[bool]`, defaults to `True` if caching is enabled) -- | |
| If a cache file storing the current computation from `function` | |
| can be identified, use it instead of recomputing. | |
| - **cache_file_names** (`[Dict[str, str]]`, *optional*, defaults to `None`) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| results of the computation instead of the automatically generated cache file name. | |
| You have to provide one `cache_file_name` per dataset in the dataset dictionary. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| Number of rows per write operation for the cache file writer. | |
| This value is a good trade-off between memory usage during the processing, and processing speed. | |
| A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running `map`. | |
| - **features** (`[datasets.Features]`, *optional*, defaults to `None`) -- | |
| Use a specific [Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features) to store the cache file | |
| instead of the automatically generated one. | |
| - **disable_nullable** (`bool`, defaults to `False`) -- | |
| Disallow null values in the table. | |
| - **fn_kwargs** (`Dict`, *optional*, defaults to `None`) -- | |
| Keyword arguments to be passed to `function` | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| The number of processes to use for multiprocessing. | |
| - If `None` or `0`, no multiprocessing is used and the operation runs in the main process. | |
| - If greater than `1`, one or multiple worker processes are used to process data in parallel. | |
| Note: The function passed to `map()` must be picklable for multiprocessing to work correctly | |
| (i.e., prefer functions defined at the top level of a module, not inside another function or class). | |
| - **desc** (`str`, *optional*, defaults to `None`) -- | |
| Meaningful description to be displayed alongside the progress bar while mapping examples. | |
| - **try_original_type** (`Optional[bool]`, defaults to `True`) -- | |
| Try to keep the types of the original columns (e.g. int32 -> int32). | |
| Set to False if you want to always infer new types.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Apply a function to all the examples in the table (individually or in batches) and update the table. | |
| If your function returns a column that already exists, then it overwrites it. | |
| The transformation is applied to all the datasets of the dataset dictionary. | |
| You can specify whether the function should be batched or not with the `batched` parameter: | |
| - If batched is `False`, then the function takes 1 example in and should return 1 example. | |
| An example is a dictionary, e.g. `{"text": "Hello there !"}`. | |
| - If batched is `True` and `batch_size` is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. | |
| A batch is a dictionary, e.g. a batch of 1 example is `{"text": ["Hello there !"]}`. | |
| - If batched is `True` and `batch_size` is `n > 1`, then the function takes a batch of `n` examples as input and can return a batch with `n` examples, or with an arbitrary number of examples. | |
| Note that the last batch may have fewer than `n` examples. | |
| A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`. | |
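To make the batched contract above concrete, here is a minimal sketch of a batched function that returns more examples than it receives. It runs on a plain dictionary so that it needs nothing beyond the standard library; the function name `split_sentences` and the naive split on periods are illustrative assumptions.

```python
from typing import Dict, List


def split_sentences(batch: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Batched function: each input text may yield several output examples."""
    sentences: List[str] = []
    for text in batch["text"]:
        # Naive split on periods, dropping empty fragments.
        sentences.extend(part.strip() for part in text.split(".") if part.strip())
    return {"text": sentences}


# A batch of 2 examples goes in, a batch of 3 examples comes out.
batch = {"text": ["Hello there. How are you.", "Fine."]}
result = split_sentences(batch)
```

When a function like this is passed to `map` with `batched=True`, the original columns generally have to be dropped with `remove_columns`, since their lengths would no longer match the new number of examples.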
| If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simultaneous calls. | |
| It is recommended to use an `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time. | |
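The semaphore recommendation can be sketched as follows. This is a standalone illustration: `asyncio.gather` stands in for the concurrent scheduling that `map` performs and is not how the library is implemented, and the function name `annotate`, the limit of 2, and the sleep that replaces a real I/O call are all assumptions.

```python
import asyncio


async def main():
    # At most 2 calls may be inside the guarded section at the same time.
    semaphore = asyncio.Semaphore(2)
    running = 0
    peak = 0

    async def annotate(example):
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for a network or disk call
            running -= 1
        return {"length": len(example["text"])}

    examples = [{"text": "a" * n} for n in range(1, 6)]
    # Launch all calls at once; the semaphore caps how many make progress together.
    results = await asyncio.gather(*(annotate(example) for example in examples))
    return results, peak


results, peak = asyncio.run(main())
```

Here the semaphore is created inside the running event loop, which avoids loop-binding issues on older Python versions.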
| <ExampleCodeBlock anchor="datasets.DatasetDict.map.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> def add_prefix(example): | |
| ... example["text"] = "Review: " + example["text"] | |
| ... return example | |
| >>> ds = ds.map(add_prefix) | |
| >>> ds["train"][0:3]["text"] | |
| ['Review: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', | |
| 'Review: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .', | |
| 'Review: effective but too-tepid biopic'] | |
| # process a batch of examples | |
| >>> ds = ds.map(lambda example: tokenizer(example["text"]), batched=True) | |
| # set number of processors | |
| >>> ds = ds.map(add_prefix, num_proc=4) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>filter</name><anchor>datasets.DatasetDict.filter</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L979</source><parameters>[{"name": "function", "val": ": typing.Optional[typing.Callable] = None"}, {"name": "with_indices", "val": ": bool = False"}, {"name": "with_rank", "val": ": bool = False"}, {"name": "input_columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "batched", "val": ": bool = False"}, {"name": "batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "load_from_cache_file", "val": ": typing.Optional[bool] = None"}, {"name": "cache_file_names", "val": ": typing.Optional[dict[str, typing.Optional[str]]] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "fn_kwargs", "val": ": typing.Optional[dict] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "desc", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **function** (`Callable`) -- Callable with one of the following signatures: | |
| - `function(example: Dict[str, Any]) -> bool` if `batched=False` and `with_indices=False` and `with_rank=False` | |
| - `function(example: Dict[str, Any], *extra_args) -> bool` if `batched=False` and `with_indices=True` and/or `with_rank=True` (one extra arg for each) | |
| - `function(batch: Dict[str, list]) -> list[bool]` if `batched=True` and `with_indices=False` and `with_rank=False` | |
| - `function(batch: Dict[str, list], *extra_args) -> list[bool]` if `batched=True` and `with_indices=True` and/or `with_rank=True` (one extra arg for each) | |
| If no function is provided, defaults to an always `True` function: `lambda x: True`. | |
| - **with_indices** (`bool`, defaults to `False`) -- | |
| Provide example indices to `function`. Note that in this case the | |
| signature of `function` should be `def function(example, idx[, rank]): ...`. | |
| - **with_rank** (`bool`, defaults to `False`) -- | |
| Provide process rank to `function`. Note that in this case the | |
| signature of `function` should be `def function(example[, idx], rank): ...`. | |
| - **input_columns** (`[Union[str, list[str]]]`, *optional*, defaults to `None`) -- | |
| The columns to be passed into `function` as | |
| positional arguments. If `None`, a dict mapping to all formatted columns is passed as one argument. | |
| - **batched** (`bool`, defaults to `False`) -- | |
| Provide batch of examples to `function`. | |
| - **batch_size** (`int`, *optional*, defaults to `1000`) -- | |
| Number of examples per batch provided to `function` if `batched=True`. | |
| If `batch_size <= 0` or `batch_size == None`, the full dataset is provided as a single batch to `function`. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Keep the dataset in memory instead of writing it to a cache file. | |
| - **load_from_cache_file** (`Optional[bool]`, defaults to `True` if caching is enabled) -- | |
| If a cache file storing the current computation from `function` | |
| can be identified, use it instead of recomputing. | |
| - **cache_file_names** (`[Dict[str, str]]`, *optional*, defaults to `None`) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| results of the computation instead of the automatically generated cache file name. | |
| You have to provide one `cache_file_name` per dataset in the dataset dictionary. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| Number of rows per write operation for the cache file writer. | |
| This value is a good trade-off between memory usage during the processing, and processing speed. | |
| A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running `filter`. | |
| - **fn_kwargs** (`Dict`, *optional*, defaults to `None`) -- | |
| Keyword arguments to be passed to `function` | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| The number of processes to use for multiprocessing. | |
| - If `None` or `0`, no multiprocessing is used and the operation runs in the main process. | |
| - If greater than `1`, one or multiple worker processes are used to process data in parallel. | |
| Note: The function passed to `filter()` must be picklable for multiprocessing to work correctly | |
| (i.e., prefer functions defined at the top level of a module, not inside another function or class). | |
| - **desc** (`str`, *optional*, defaults to `None`) -- | |
| Meaningful description to be displayed alongside the progress bar while filtering examples.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Apply a filter function to all the elements in the table in batches | |
| and update the table so that the dataset only includes the examples for which the filter function returns `True`. | |
| The transformation is applied to all the datasets of the dataset dictionary. | |
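The batched signature listed in the parameters (`function(batch) -> list[bool]`) can be illustrated on a plain dictionary. This is a sketch under stated assumptions: `keep_positive` is a hypothetical predicate, and `apply_mask` is a hypothetical helper that mimics the effect of keeping only the selected rows; it is not part of the library.

```python
from typing import Dict, List


def keep_positive(batch: Dict[str, list]) -> List[bool]:
    """Batched predicate: return one boolean per example in the batch."""
    return [label == 1 for label in batch["label"]]


def apply_mask(batch: Dict[str, list], mask: List[bool]) -> Dict[str, list]:
    """Keep only the rows whose mask entry is True, column by column."""
    return {
        column: [value for value, keep in zip(values, mask) if keep]
        for column, values in batch.items()
    }


batch = {"text": ["great", "dull", "fun"], "label": [1, 0, 1]}
mask = keep_positive(batch)
kept = apply_mask(batch, mask)
```

A predicate like `keep_positive` would be passed to `filter` together with `batched=True`; without `batched=True` the function receives one example at a time and returns a single `bool`.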
| <ExampleCodeBlock anchor="datasets.DatasetDict.filter.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds.filter(lambda x: x["label"] == 1) | |
| DatasetDict({ | |
| train: Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 4265 | |
| }) | |
| validation: Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 533 | |
| }) | |
| test: Dataset({ | |
| features: ['text', 'label'], | |
| num_rows: 533 | |
| }) | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>sort</name><anchor>datasets.DatasetDict.sort</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L1144</source><parameters>[{"name": "column_names", "val": ": typing.Union[str, collections.abc.Sequence[str]]"}, {"name": "reverse", "val": ": typing.Union[bool, collections.abc.Sequence[bool]] = False"}, {"name": "null_placement", "val": ": str = 'at_end'"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "load_from_cache_file", "val": ": typing.Optional[bool] = None"}, {"name": "indices_cache_file_names", "val": ": typing.Optional[dict[str, typing.Optional[str]]] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}]</parameters><paramsdesc>- **column_names** (`Union[str, Sequence[str]]`) -- | |
| Column name(s) to sort by. | |
| - **reverse** (`Union[bool, Sequence[bool]]`, defaults to `False`) -- | |
| If `True`, sort by descending order rather than ascending. If a single bool is provided, | |
| the value is applied to the sorting of all column names. Otherwise a list of bools with the | |
| same length and order as column_names must be provided. | |
| - **null_placement** (`str`, defaults to `at_end`) -- | |
| Put `None` values at the beginning if `at_start` or `first` or at the end if `at_end` or `last` | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Keep the sorted indices in memory instead of writing them to a cache file. | |
| - **load_from_cache_file** (`Optional[bool]`, defaults to `True` if caching is enabled) -- | |
| If a cache file storing the sorted indices | |
| can be identified, use it instead of recomputing. | |
| - **indices_cache_file_names** (`[Dict[str, str]]`, *optional*, defaults to `None`) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| indices mapping instead of the automatically generated cache file name. | |
| You have to provide one `cache_file_name` per dataset in the dataset dictionary. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| Number of rows per write operation for the cache file writer. | |
| A higher value gives smaller cache files; a lower value consumes less temporary memory.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a new dataset sorted according to a single or multiple columns. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.sort.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset('cornell-movie-review-data/rotten_tomatoes') | |
| >>> ds['train']['label'][:10] | |
| [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] | |
| >>> sorted_ds = ds.sort('label') | |
| >>> sorted_ds['train']['label'][:10] | |
| [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | |
| >>> another_sorted_ds = ds.sort(['label', 'text'], reverse=[True, False]) | |
| >>> another_sorted_ds['train']['label'][:10] | |
| [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>shuffle</name><anchor>datasets.DatasetDict.shuffle</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L1211</source><parameters>[{"name": "seeds", "val": ": typing.Union[int, dict[str, typing.Optional[int]], NoneType] = None"}, {"name": "seed", "val": ": typing.Optional[int] = None"}, {"name": "generators", "val": ": typing.Optional[dict[str, numpy.random._generator.Generator]] = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "load_from_cache_file", "val": ": typing.Optional[bool] = None"}, {"name": "indices_cache_file_names", "val": ": typing.Optional[dict[str, typing.Optional[str]]] = None"}, {"name": "writer_batch_size", "val": ": typing.Optional[int] = 1000"}]</parameters><paramsdesc>- **seeds** (`Dict[str, int]` or `int`, *optional*) -- | |
| A seed to initialize the default BitGenerator if `generator=None`. | |
| If `None`, then fresh, unpredictable entropy will be pulled from the OS. | |
| If an `int` or `array_like[ints]` is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state. | |
| You can provide one `seed` per dataset in the dataset dictionary. | |
| - **seed** (`int`, *optional*) -- | |
| A seed to initialize the default BitGenerator if `generator=None`. Alias for seeds (a `ValueError` is raised if both are provided). | |
| - **generators** (`Dict[str, np.random.Generator]`, *optional*) -- | |
| Numpy random Generator to use to compute the permutation of the dataset rows. | |
| If `generator=None` (default), uses `np.random.default_rng` (the default BitGenerator (PCG64) of NumPy). | |
| You have to provide one `generator` per dataset in the dataset dictionary. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Keep the dataset in memory instead of writing it to a cache file. | |
| - **load_from_cache_file** (`Optional[bool]`, defaults to `True` if caching is enabled) -- | |
| If a cache file storing the shuffled indices | |
| can be identified, use it instead of recomputing. | |
| - **indices_cache_file_names** (`Dict[str, str]`, *optional*) -- | |
| Provide the name of a path for the cache file. It is used to store the | |
| indices mappings instead of the automatically generated cache file name. | |
| You have to provide one `cache_file_name` per dataset in the dataset dictionary. | |
| - **writer_batch_size** (`int`, defaults to `1000`) -- | |
| Number of rows per write operation for the cache file writer. | |
| This value is a good trade-off between memory usage during the processing, and processing speed. | |
| A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while shuffling.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a new Dataset where the rows are shuffled. | |
| The transformation is applied to all the datasets of the dataset dictionary. | |
| Currently shuffling uses numpy random generators. | |
| You can either supply NumPy `Generator` objects through `generators`, or a seed to initialize NumPy's default random generator (PCG64). | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.shuffle.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds["train"]["label"][:10] | |
| [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] | |
| # set a seed | |
| >>> shuffled_ds = ds.shuffle(seed=42) | |
| >>> shuffled_ds["train"]["label"][:10] | |
| [0, 1, 0, 1, 0, 0, 0, 0, 0, 0] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>set_format</name><anchor>datasets.DatasetDict.set_format</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L575</source><parameters>[{"name": "type", "val": ": typing.Optional[str] = None"}, {"name": "columns", "val": ": typing.Optional[list] = None"}, {"name": "output_all_columns", "val": ": bool = False"}, {"name": "**format_kwargs", "val": ""}]</parameters><paramsdesc>- **type** (`str`, *optional*) -- | |
| Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. | |
| `None` means `__getitem__` returns python objects (default). | |
| - **columns** (`list[str]`, *optional*) -- | |
| Columns to format in the output. | |
| `None` means `__getitem__` returns all columns (default). | |
| - **output_all_columns** (`bool`, defaults to `False`) -- | |
| Keep un-formatted columns as well in the output (as python objects). | |
| - ****format_kwargs** (additional keyword arguments) -- | |
| Keyword arguments passed to the convert function like `np.array`, `torch.tensor` or `tensorflow.ragged.constant`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Set `__getitem__` return format (type and columns). | |
| The format is set for every dataset in the dataset dictionary. | |
| It is possible to call `map` after calling `set_format`. Because `map` may add new columns, the list of formatted columns | |
| is updated accordingly. If you apply `map` to a dataset to add a new column, that column is also formatted: | |
| `new formatted columns = (all columns - previously unformatted columns)` | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.set_format.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoTokenizer | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") | |
| >>> ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True) | |
| >>> ds.set_format(type="numpy", columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']) | |
| >>> ds["train"].format | |
| {'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'], | |
| 'format_kwargs': {}, | |
| 'output_all_columns': False, | |
| 'type': 'numpy'} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>reset_format</name><anchor>datasets.DatasetDict.reset_format</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L626</source><parameters>[]</parameters></docstring> | |
| Reset `__getitem__` return format to python objects and all columns. | |
| The transformation is applied to all the datasets of the dataset dictionary. | |
| Same as `self.set_format()`. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.reset_format.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoTokenizer | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") | |
| >>> ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True) | |
| >>> ds.set_format(type="numpy", columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']) | |
| >>> ds["train"].format | |
| {'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'], | |
| 'format_kwargs': {}, | |
| 'output_all_columns': False, | |
| 'type': 'numpy'} | |
| >>> ds.reset_format() | |
| >>> ds["train"].format | |
| {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], | |
| 'format_kwargs': {}, | |
| 'output_all_columns': False, | |
| 'type': None} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>formatted_as</name><anchor>datasets.DatasetDict.formatted_as</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L535</source><parameters>[{"name": "type", "val": ": typing.Optional[str] = None"}, {"name": "columns", "val": ": typing.Optional[list] = None"}, {"name": "output_all_columns", "val": ": bool = False"}, {"name": "**format_kwargs", "val": ""}]</parameters><paramsdesc>- **type** (`str`, *optional*) -- | |
| Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. | |
| `None` means `__getitem__` returns python objects (default). | |
| - **columns** (`list[str]`, *optional*) -- | |
| Columns to format in the output. | |
| `None` means `__getitem__` returns all columns (default). | |
| - **output_all_columns** (`bool`, defaults to `False`) -- | |
| Keep un-formatted columns as well in the output (as python objects). | |
| - ****format_kwargs** (additional keyword arguments) -- | |
| Keyword arguments passed to the convert function like `np.array`, `torch.tensor` or `tensorflow.ragged.constant`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| To be used in a `with` statement. Set `__getitem__` return format (type and columns). | |
| The transformation is applied to all the datasets of the dataset dictionary. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>with_format</name><anchor>datasets.DatasetDict.with_format</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L687</source><parameters>[{"name": "type", "val": ": typing.Optional[str] = None"}, {"name": "columns", "val": ": typing.Optional[list] = None"}, {"name": "output_all_columns", "val": ": bool = False"}, {"name": "**format_kwargs", "val": ""}]</parameters><paramsdesc>- **type** (`str`, *optional*) -- | |
| Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. | |
| `None` means `__getitem__` returns python objects (default). | |
| - **columns** (`list[str]`, *optional*) -- | |
| Columns to format in the output. | |
| `None` means `__getitem__` returns all columns (default). | |
| - **output_all_columns** (`bool`, defaults to `False`) -- | |
| Keep un-formatted columns as well in the output (as python objects). | |
| - ****format_kwargs** (additional keyword arguments) -- | |
| Keyword arguments passed to the convert function like `np.array`, `torch.tensor` or `tensorflow.ragged.constant`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Set `__getitem__` return format (type and columns). The data formatting is applied on-the-fly. | |
| The format `type` (for example "numpy") is used to format batches when using `__getitem__`. | |
| The format is set for every dataset in the dataset dictionary. | |
| It's also possible to use custom transforms for formatting using [with_transform()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.with_transform). | |
| Contrary to [set_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict.set_format), `with_format` returns a new [DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict) object with new [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) objects. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.with_format.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoTokenizer | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") | |
| >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) | |
| >>> ds["train"].format | |
| {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], | |
| 'format_kwargs': {}, | |
| 'output_all_columns': False, | |
| 'type': None} | |
| >>> ds = ds.with_format("torch") | |
| >>> ds["train"].format | |
| {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], | |
| 'format_kwargs': {}, | |
| 'output_all_columns': False, | |
| 'type': 'torch'} | |
| >>> ds["train"][0] | |
| {'text': 'compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .', | |
| 'label': tensor(1), | |
| 'input_ids': tensor([ 101, 18027, 16310, 16001, 1103, 9321, 178, 11604, 7235, 6617, | |
| 1742, 2165, 2820, 1206, 6588, 22572, 12937, 1811, 2153, 1105, | |
| 1147, 12890, 19587, 6463, 1105, 15026, 1482, 119, 102, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0]), | |
| 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), | |
| 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>with_transform</name><anchor>datasets.DatasetDict.with_transform</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L764</source><parameters>[{"name": "transform", "val": ": typing.Optional[typing.Callable]"}, {"name": "columns", "val": ": typing.Optional[list] = None"}, {"name": "output_all_columns", "val": ": bool = False"}]</parameters><paramsdesc>- **transform** (`Callable`, *optional*) -- | |
| User-defined formatting transform, replaces the format defined by [set_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.set_format). | |
| A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. | |
| This function is applied right before returning the objects in `__getitem__`. | |
| - **columns** (`list[str]`, *optional*) -- | |
| Columns to format in the output. | |
| If specified, then the input batch of the transform only contains those columns. | |
| - **output_all_columns** (`bool`, defaults to `False`) -- | |
| Keep un-formatted columns as well in the output (as python objects). | |
| If set to `True`, then the other un-formatted columns are kept with the output of the transform.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Set `__getitem__` return format using this transform. The transform is applied on-the-fly on batches when `__getitem__` is called. | |
| The transform is set for every dataset in the dataset dictionary. | |
| As with [set_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.set_format), this can be reset using [reset_format()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.reset_format). | |
| Contrary to `set_transform()`, `with_transform` returns a new [DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict) object with new [Dataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset) objects. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.with_transform.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoTokenizer | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") | |
| >>> def encode(example): | |
| ... return tokenizer(example['text'], truncation=True, padding=True, return_tensors="pt") | |
| >>> ds = ds.with_transform(encode) | |
| >>> ds["train"][0] | |
| {'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1, 1, 1, 1, 1, 1, 1, 1, 1]), | |
| 'input_ids': tensor([ 101, 1103, 2067, 1110, 17348, 1106, 1129, 1103, 6880, 1432, | |
| 112, 188, 1207, 107, 14255, 1389, 107, 1105, 1115, 1119, | |
| 112, 188, 1280, 1106, 1294, 170, 24194, 1256, 3407, 1190, | |
| 170, 11791, 5253, 188, 1732, 7200, 10947, 12606, 2895, 117, | |
| 179, 7766, 118, 172, 15554, 1181, 3498, 6961, 3263, 1137, | |
| 188, 1566, 7912, 14516, 6997, 119, 102]), | |
| 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0])} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>flatten</name><anchor>datasets.DatasetDict.flatten</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L197</source><parameters>[{"name": "max_depth", "val": " = 16"}]</parameters></docstring> | |
| Flatten the Apache Arrow Table of each split (nested features are flattened). | |
| Each column with a struct type is flattened into one column per struct field. | |
| Other columns are left unchanged. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.flatten.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("rajpurkar/squad") | |
| >>> ds["train"].features | |
| {'id': Value('string'), | |
| 'title': Value('string'), | |
| 'context': Value('string'), | |
| 'question': Value('string'), | |
| 'answers': {'text': List(Value('string')), | |
| 'answer_start': List(Value('int32'))}} | |
| >>> ds.flatten() | |
| DatasetDict({ | |
| train: Dataset({ | |
| features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'], | |
| num_rows: 87599 | |
| }) | |
| validation: Dataset({ | |
| features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'], | |
| num_rows: 10570 | |
| }) | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast</name><anchor>datasets.DatasetDict.cast</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L278</source><parameters>[{"name": "features", "val": ": Features"}]</parameters><paramsdesc>- **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features)) -- | |
| New features to cast the dataset to. | |
| The name and order of the fields in the features must match the current column names. | |
| The type of the data must also be convertible from one type to the other. | |
| For non-trivial conversion, e.g. `string` <-> `ClassLabel` you should use [map()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict.map) to update the dataset.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Cast the dataset to a new set of features. | |
| The transformation is applied to all the datasets of the dataset dictionary. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.cast.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset, ClassLabel, Value | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds["train"].features | |
| {'label': ClassLabel(names=['neg', 'pos']), | |
| 'text': Value('string')} | |
| >>> new_features = ds["train"].features.copy() | |
| >>> new_features['label'] = ClassLabel(names=['bad', 'good']) | |
| >>> new_features['text'] = Value('large_string') | |
| >>> ds = ds.cast(new_features) | |
| >>> ds["train"].features | |
| {'label': ClassLabel(names=['bad', 'good']), | |
| 'text': Value('large_string')} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast_column</name><anchor>datasets.DatasetDict.cast_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L310</source><parameters>[{"name": "column", "val": ": str"}, {"name": "feature", "val": ""}]</parameters><paramsdesc>- **column** (`str`) -- | |
| Column name. | |
| - **feature** (`Feature`) -- | |
| Target feature.</paramsdesc><paramgroups>0</paramgroups><retdesc>[DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict)</retdesc></docstring> | |
| Cast column to feature for decoding. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.cast_column.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset, ClassLabel | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds["train"].features | |
| {'label': ClassLabel(names=['neg', 'pos']), | |
| 'text': Value('string')} | |
| >>> ds = ds.cast_column('label', ClassLabel(names=['bad', 'good'])) | |
| >>> ds["train"].features | |
| {'label': ClassLabel(names=['bad', 'good']), | |
| 'text': Value('string')} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>remove_columns</name><anchor>datasets.DatasetDict.remove_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L339</source><parameters>[{"name": "column_names", "val": ": typing.Union[str, list[str]]"}]</parameters><paramsdesc>- **column_names** (`Union[str, list[str]]`) -- | |
| Name of the column(s) to remove.</paramsdesc><paramgroups>0</paramgroups><rettype>[DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>A copy of the dataset object without the columns to remove.</retdesc></docstring> | |
| Remove one or several column(s) from each split in the dataset, | |
| along with the features associated with the column(s). | |
| The transformation is applied to all the splits of the dataset dictionary. | |
| You can also remove a column using [map()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict.map) with `remove_columns` but the present method | |
| doesn't copy the data of the remaining columns and is thus faster. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.remove_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds = ds.remove_columns("label") | |
| >>> ds | |
| DatasetDict({ | |
| train: Dataset({ | |
| features: ['text'], | |
| num_rows: 8530 | |
| }) | |
| validation: Dataset({ | |
| features: ['text'], | |
| num_rows: 1066 | |
| }) | |
| test: Dataset({ | |
| features: ['text'], | |
| num_rows: 1066 | |
| }) | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>rename_column</name><anchor>datasets.DatasetDict.rename_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L381</source><parameters>[{"name": "original_column_name", "val": ": str"}, {"name": "new_column_name", "val": ": str"}]</parameters><paramsdesc>- **original_column_name** (`str`) -- | |
| Name of the column to rename. | |
| - **new_column_name** (`str`) -- | |
| New name for the column.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Rename a column in the dataset and move the features associated with the original column under the new column name. | |
| The transformation is applied to all the datasets of the dataset dictionary. | |
| You can also rename a column using [map()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict.map) with `remove_columns` but the present method: | |
| - takes care of moving the original features under the new column name. | |
| - doesn't copy the data to a new dataset and is thus much faster. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.rename_column.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds = ds.rename_column("label", "label_new") | |
| >>> ds | |
| DatasetDict({ | |
| train: Dataset({ | |
| features: ['text', 'label_new'], | |
| num_rows: 8530 | |
| }) | |
| validation: Dataset({ | |
| features: ['text', 'label_new'], | |
| num_rows: 1066 | |
| }) | |
| test: Dataset({ | |
| features: ['text', 'label_new'], | |
| num_rows: 1066 | |
| }) | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>rename_columns</name><anchor>datasets.DatasetDict.rename_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L429</source><parameters>[{"name": "column_mapping", "val": ": dict"}]</parameters><paramsdesc>- **column_mapping** (`Dict[str, str]`) -- | |
| A mapping of columns to rename to their new names.</paramsdesc><paramgroups>0</paramgroups><rettype>[DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>A copy of the dataset with renamed columns.</retdesc></docstring> | |
| Rename several columns in the dataset, and move the features associated with the original columns under | |
| the new column names. | |
| The transformation is applied to all the datasets of the dataset dictionary. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.rename_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds.rename_columns({'text': 'text_new', 'label': 'label_new'}) | |
| DatasetDict({ | |
| train: Dataset({ | |
| features: ['text_new', 'label_new'], | |
| num_rows: 8530 | |
| }) | |
| validation: Dataset({ | |
| features: ['text_new', 'label_new'], | |
| num_rows: 1066 | |
| }) | |
| test: Dataset({ | |
| features: ['text_new', 'label_new'], | |
| num_rows: 1066 | |
| }) | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>select_columns</name><anchor>datasets.DatasetDict.select_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L467</source><parameters>[{"name": "column_names", "val": ": typing.Union[str, list[str]]"}]</parameters><paramsdesc>- **column_names** (`Union[str, list[str]]`) -- | |
| Name of the column(s) to keep.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Select one or several column(s) from each split in the dataset, along with | |
| the features associated with the column(s). | |
| The transformation is applied to all the splits of the dataset | |
| dictionary. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.select_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") | |
| >>> ds.select_columns("text") | |
| DatasetDict({ | |
| train: Dataset({ | |
| features: ['text'], | |
| num_rows: 8530 | |
| }) | |
| validation: Dataset({ | |
| features: ['text'], | |
| num_rows: 1066 | |
| }) | |
| test: Dataset({ | |
| features: ['text'], | |
| num_rows: 1066 | |
| }) | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class_encode_column</name><anchor>datasets.DatasetDict.class_encode_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L503</source><parameters>[{"name": "column", "val": ": str"}, {"name": "include_nulls", "val": ": bool = False"}]</parameters><paramsdesc>- **column** (`str`) -- | |
| The name of the column to cast. | |
| - **include_nulls** (`bool`, defaults to `False`) -- | |
| Whether to include null values in the class labels. If `True`, the null values will be encoded as the `"None"` class label. | |
| <Added version="1.14.2"/></paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Casts the given column as [ClassLabel](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.ClassLabel) and updates the tables. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.class_encode_column.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("boolq") | |
| >>> ds["train"].features | |
| {'answer': Value('bool'), | |
| 'passage': Value('string'), | |
| 'question': Value('string')} | |
| >>> ds = ds.class_encode_column("answer") | |
| >>> ds["train"].features | |
| {'answer': ClassLabel(num_classes=2, names=['False', 'True']), | |
| 'passage': Value('string'), | |
| 'question': Value('string')} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>push_to_hub</name><anchor>datasets.DatasetDict.push_to_hub</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L1616</source><parameters>[{"name": "repo_id", "val": ""}, {"name": "config_name", "val": ": str = 'default'"}, {"name": "set_default", "val": ": typing.Optional[bool] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "commit_message", "val": ": typing.Optional[str] = None"}, {"name": "commit_description", "val": ": typing.Optional[str] = None"}, {"name": "private", "val": ": typing.Optional[bool] = None"}, {"name": "token", "val": ": typing.Optional[str] = None"}, {"name": "revision", "val": ": typing.Optional[str] = None"}, {"name": "create_pr", "val": ": typing.Optional[bool] = False"}, {"name": "max_shard_size", "val": ": typing.Union[str, int, NoneType] = None"}, {"name": "num_shards", "val": ": typing.Optional[dict[str, int]] = None"}, {"name": "embed_external_files", "val": ": bool = True"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}]</parameters><paramsdesc>- **repo_id** (`str`) -- | |
| The ID of the repository to push to in the following format: `<user>/<dataset_name>` or | |
| `<org>/<dataset_name>`. Also accepts `<dataset_name>`, which will default to the namespace | |
| of the logged-in user. | |
| - **config_name** (`str`) -- | |
| Configuration name of a dataset. Defaults to "default". | |
| - **set_default** (`bool`, *optional*) -- | |
| Whether to set this configuration as the default one. Otherwise, the default configuration is the one | |
| named "default". | |
| - **data_dir** (`str`, *optional*) -- | |
| Directory name that will contain the uploaded data files. Defaults to the `config_name` if different | |
| from "default", else "data". | |
| <Added version="2.17.0"/> | |
| - **commit_message** (`str`, *optional*) -- | |
| Message to commit while pushing. Will default to `"Upload dataset"`. | |
| - **commit_description** (`str`, *optional*) -- | |
| Description of the commit that will be created. | |
| Additionally, description of the PR if a PR is created (`create_pr` is True). | |
| <Added version="2.16.0"/> | |
| - **private** (`bool`, *optional*) -- | |
| Whether to make the repo private. If `None` (default), the repo will be public unless the | |
| organization's default is private. This value is ignored if the repo already exists. | |
| - **token** (`str`, *optional*) -- | |
| An optional authentication token for the Hugging Face Hub. If no token is passed, will default | |
| to the token saved locally when logging in with `huggingface-cli login`. Will raise an error | |
| if no token is passed and the user is not logged-in. | |
| - **revision** (`str`, *optional*) -- | |
| Branch to push the uploaded files to. Defaults to the `"main"` branch. | |
| <Added version="2.15.0"/> | |
| - **create_pr** (`bool`, *optional*, defaults to `False`) -- | |
| Whether to create a PR with the uploaded files or directly commit. | |
| <Added version="2.15.0"/> | |
| - **max_shard_size** (`int` or `str`, *optional*, defaults to `"500MB"`) -- | |
| The maximum size of the dataset shards to be uploaded to the hub. If expressed as a string, needs to be digits followed by a unit | |
| (like `"500MB"` or `"1GB"`). | |
| - **num_shards** (`Dict[str, int]`, *optional*) -- | |
| Number of shards to write. By default, the number of shards depends on `max_shard_size`. | |
| Use a dictionary to define a different num_shards for each split. | |
| <Added version="2.8.0"/> | |
| - **embed_external_files** (`bool`, defaults to `True`) -- | |
| Whether to embed file bytes in the shards. | |
| In particular, this will do the following before the push for the fields of type: | |
| - [Audio](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Audio) and [Image](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Image): remove local path information and embed the file content in the Parquet files. | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| Number of processes when preparing and uploading the dataset. | |
| This is helpful if the dataset is made of many samples or media files to embed. | |
| Multiprocessing is disabled by default. | |
| <Added version="4.0.0"/></paramsdesc><paramgroups>0</paramgroups><retdesc>huggingface_hub.CommitInfo</retdesc></docstring> | |
| Pushes the [DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict) to the hub as a Parquet dataset. | |
| The [DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict) is pushed using HTTP requests, so neither git nor git-lfs needs to be installed. | |
| Each dataset split will be pushed independently. The pushed dataset will keep the original split names. | |
| The resulting Parquet files are self-contained by default: if your dataset contains [Image](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Image) or [Audio](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Audio) | |
| data, the Parquet files will store the bytes of your images or audio files. | |
| You can disable this by setting `embed_external_files` to False. | |
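Two defaults from the parameter list above can be summarized in plain Python: the namespace used when `repo_id` has no owner, and the directory that receives the data files. This is only a sketch of the documented behavior. `resolve_repo_id` and `resolve_data_dir` are hypothetical helper names, not part of the `datasets` API, and the username is passed in explicitly instead of being read from a saved token.

```python
def resolve_repo_id(repo_id, logged_in_user):
    """Return `<namespace>/<dataset_name>`, falling back to the user's namespace."""
    # A bare dataset name has no "/" and is placed under the logged-in user.
    return repo_id if "/" in repo_id else f"{logged_in_user}/{repo_id}"


def resolve_data_dir(config_name="default", data_dir=None):
    """Return the directory name that would hold the uploaded data files."""
    if data_dir is not None:
        return data_dir
    # Documented default: the config name, unless it is "default".
    return config_name if config_name != "default" else "data"


print(resolve_repo_id("my_dataset", "alice"))
print(resolve_data_dir("en"))
```

With these assumptions, pushing the `"en"` configuration of `my_dataset` as user `alice` would target the repository `alice/my_dataset` and the directory `en`.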
| <ExampleCodeBlock anchor="datasets.DatasetDict.push_to_hub.example"> | |
| Example: | |
| ```python | |
| >>> dataset_dict.push_to_hub("<organization>/<dataset_id>") | |
| >>> dataset_dict.push_to_hub("<organization>/<dataset_id>", private=True) | |
| >>> dataset_dict.push_to_hub("<organization>/<dataset_id>", max_shard_size="1GB") | |
| >>> dataset_dict.push_to_hub("<organization>/<dataset_id>", num_shards={"train": 1024, "test": 8}) | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.push_to_hub.example-2"> | |
| If you want to add a new configuration (or subset) to a dataset (e.g. if the dataset has multiple tasks/versions/languages): | |
| ```python | |
| >>> english_dataset.push_to_hub("<organization>/<dataset_id>", "en") | |
| >>> french_dataset.push_to_hub("<organization>/<dataset_id>", "fr") | |
| >>> # later | |
| >>> english_dataset = load_dataset("<organization>/<dataset_id>", "en") | |
| >>> french_dataset = load_dataset("<organization>/<dataset_id>", "fr") | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>save_to_disk</name><anchor>datasets.DatasetDict.save_to_disk</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L1294</source><parameters>[{"name": "dataset_dict_path", "val": ": typing.Union[str, bytes, os.PathLike]"}, {"name": "max_shard_size", "val": ": typing.Union[str, int, NoneType] = None"}, {"name": "num_shards", "val": ": typing.Optional[dict[str, int]] = None"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **dataset_dict_path** (`path-like`) -- | |
| Path (e.g. `dataset/train`) or remote URI (e.g. `s3://my-bucket/dataset/train`) | |
| of the dataset dict directory where the dataset dict will be saved to. | |
| - **max_shard_size** (`int` or `str`, *optional*, defaults to `"500MB"`) -- | |
| The maximum size of the dataset shards to be saved to the filesystem. If expressed as a string, needs to be digits followed by a unit | |
| (like `"50MB"`). | |
| - **num_shards** (`Dict[str, int]`, *optional*) -- | |
| Number of shards to write. By default the number of shards depends on `max_shard_size` and `num_proc`. | |
| You need to provide the number of shards for each dataset in the dataset dictionary. | |
| Use a dictionary to define a different num_shards for each split. | |
| <Added version="2.8.0"/> | |
| - **num_proc** (`int`, *optional*, default `None`) -- | |
| Number of processes to use when saving the dataset. | |
| Multiprocessing is disabled by default. | |
| <Added version="2.8.0"/> | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| <Added version="2.8.0"/></paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Saves a dataset dict to a filesystem using `fsspec.spec.AbstractFileSystem`. | |
| For [Image](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Image), [Audio](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Audio) and [Video](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Video) data: | |
| All the Image(), Audio() and Video() data are stored in the arrow files. | |
| If you want to store paths or urls, please use the Value("string") type. | |
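As a rough mental model of how `max_shard_size` and `num_proc` relate to the number of shards, the sketch below parses a size string and estimates a shard count. Everything here is an assumption made for illustration: `parse_size` and `estimate_num_shards` are hypothetical helpers, the unit table (decimal `MB`, binary `MiB`) and the rounding rule may differ from what the library actually does.

```python
import math
import re

# Assumed unit table: decimal units (1 MB = 10**6 bytes), binary units (1 MiB = 2**20 bytes).
_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "KIB": 2**10, "MIB": 2**20, "GIB": 2**30}


def parse_size(size):
    """Convert an int or a string such as "500MB" to a number of bytes."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", size.strip())
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"Expected digits followed by a unit, got {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def estimate_num_shards(dataset_nbytes, max_shard_size="500MB", num_proc=None):
    """Estimate a shard count: enough shards to respect the size limit,
    and at least one shard per process (assumed rule)."""
    by_size = math.ceil(dataset_nbytes / parse_size(max_shard_size))
    return max(by_size, num_proc or 1, 1)


print(estimate_num_shards(1_200_000_000))   # 1.2 GB split into 500 MB shards
print(estimate_num_shards(10, num_proc=4))  # tiny dataset, four processes
```

Passing `num_shards` explicitly bypasses any such estimate, which is why it takes one value per split.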
| <ExampleCodeBlock anchor="datasets.DatasetDict.save_to_disk.example"> | |
| Example: | |
| ```python | |
| >>> dataset_dict.save_to_disk("path/to/dataset/directory") | |
| >>> dataset_dict.save_to_disk("path/to/dataset/directory", max_shard_size="1GB") | |
| >>> dataset_dict.save_to_disk("path/to/dataset/directory", num_shards={"train": 1024, "test": 8}) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>load_from_disk</name><anchor>datasets.DatasetDict.load_from_disk</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L1368</source><parameters>[{"name": "dataset_dict_path", "val": ": typing.Union[str, bytes, os.PathLike]"}, {"name": "keep_in_memory", "val": ": typing.Optional[bool] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **dataset_dict_path** (`path-like`) -- | |
| Path (e.g. `"dataset/train"`) or remote URI (e.g. `"s3://my-bucket/dataset/train"`) | |
| of the dataset dict directory where the dataset dict will be loaded from. | |
| - **keep_in_memory** (`bool`, defaults to `None`) -- | |
| Whether to copy the dataset in-memory. If `None`, the | |
| dataset will not be copied in-memory unless explicitly enabled by setting | |
| `datasets.config.IN_MEMORY_MAX_SIZE` to nonzero. See more details in the | |
| [improve performance](../cache#improve-performance) section. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| <Added version="2.8.0"/></paramsdesc><paramgroups>0</paramgroups><retdesc>[DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict)</retdesc></docstring> | |
| Load a dataset that was previously saved using `save_to_disk` from a filesystem using `fsspec.spec.AbstractFileSystem`. | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.load_from_disk.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_from_disk | |
| >>> ds = load_from_disk('path/to/dataset/directory') | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_csv</name><anchor>datasets.DatasetDict.from_csv</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L1428</source><parameters>[{"name": "path_or_paths", "val": ": dict"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "cache_dir", "val": ": str = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_paths** (`dict` of path-like) -- | |
| Path(s) of the CSV file(s). | |
| - **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), *optional*) -- | |
| Dataset features. | |
| - **cache_dir** (`str`, *optional*, defaults to `"~/.cache/huggingface/datasets"`) -- | |
| Directory to cache data. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory. | |
| - ****kwargs** (additional keyword arguments) -- | |
| Keyword arguments to be passed to `pandas.read_csv`.</paramsdesc><paramgroups>0</paramgroups><retdesc>[DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict)</retdesc></docstring> | |
| Create [DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict) from CSV file(s). | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.from_csv.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import DatasetDict | |
| >>> ds = DatasetDict.from_csv({'train': 'path/to/dataset.csv'}) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_json</name><anchor>datasets.DatasetDict.from_json</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L1471</source><parameters>[{"name": "path_or_paths", "val": ": dict"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "cache_dir", "val": ": str = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_paths** (`dict` of path-like) -- | |
| Path(s) of the JSON Lines file(s). | |
| - **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), *optional*) -- | |
| Dataset features. | |
| - **cache_dir** (`str`, *optional*, defaults to `"~/.cache/huggingface/datasets"`) -- | |
| Directory to cache data. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory. | |
| - ****kwargs** (additional keyword arguments) -- | |
| Keyword arguments to be passed to `JsonConfig`.</paramsdesc><paramgroups>0</paramgroups><retdesc>[DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict)</retdesc></docstring> | |
| Create [DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict) from JSON Lines file(s). | |
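JSON Lines stores one JSON object per line. The sketch below uses only the standard library to write such a file and to build the split-to-path mapping that `path_or_paths` takes. The file name `train.jsonl` and the sample rows are made up for illustration.

```python
import json
import os
import tempfile

rows = [{"text": "Good", "label": 0}, {"text": "Bad", "label": 1}]

# Write one JSON object per line (the JSON Lines layout).
tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.jsonl")
with open(train_path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Mapping from split name to file path, the shape `path_or_paths` expects.
path_or_paths = {"train": train_path}

# Reading the file back line by line recovers the original rows.
with open(train_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded)
```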
| <ExampleCodeBlock anchor="datasets.DatasetDict.from_json.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import DatasetDict | |
| >>> ds = DatasetDict.from_json({'train': 'path/to/dataset.json'}) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_parquet</name><anchor>datasets.DatasetDict.from_parquet</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L1514</source><parameters>[{"name": "path_or_paths", "val": ": dict"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "cache_dir", "val": ": str = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "columns", "val": ": typing.Optional[list[str]] = None"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_paths** (`dict` of path-like) -- | |
| Path(s) of the Parquet file(s). | |
| - **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), *optional*) -- | |
| Dataset features. | |
| - **cache_dir** (`str`, *optional*, defaults to `"~/.cache/huggingface/datasets"`) -- | |
| Directory to cache data. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory. | |
| - **columns** (`list[str]`, *optional*) -- | |
| If not `None`, only these columns will be read from the file. | |
| A column name may be a prefix of a nested field, e.g. 'a' will select | |
| 'a.b', 'a.c', and 'a.d.e'. | |
| - ****kwargs** (additional keyword arguments) -- | |
| Keyword arguments to be passed to `ParquetConfig`.</paramsdesc><paramgroups>0</paramgroups><retdesc>[DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict)</retdesc></docstring> | |
| Create [DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict) from Parquet file(s). | |
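The prefix rule for `columns` can be pictured with a small plain-Python sketch. It assumes that "prefix" means a leading component of a dotted field path, so `'a'` matches `'a.b'` but not a sibling column named `'ab'`. `select_fields` is a hypothetical helper and does not read any Parquet file.

```python
def select_fields(field_paths, columns):
    """Keep the dotted field paths that equal a requested column
    or are nested under it."""
    selected = []
    for path in field_paths:
        if any(path == col or path.startswith(col + ".") for col in columns):
            selected.append(path)
    return selected


fields = ["a.b", "a.c", "a.d.e", "ab", "label"]
picked = select_fields(fields, ["a"])
print(picked)
```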
| <ExampleCodeBlock anchor="datasets.DatasetDict.from_parquet.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import DatasetDict | |
| >>> ds = DatasetDict.from_parquet({'train': 'path/to/dataset.parquet'}) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_text</name><anchor>datasets.DatasetDict.from_text</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L1563</source><parameters>[{"name": "path_or_paths", "val": ": dict"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "cache_dir", "val": ": str = None"}, {"name": "keep_in_memory", "val": ": bool = False"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_paths** (`dict` of path-like) -- | |
| Path(s) of the text file(s). | |
| - **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features), *optional*) -- | |
| Dataset features. | |
| - **cache_dir** (`str`, *optional*, defaults to `"~/.cache/huggingface/datasets"`) -- | |
| Directory to cache data. | |
| - **keep_in_memory** (`bool`, defaults to `False`) -- | |
| Whether to copy the data in-memory. | |
| - ****kwargs** (additional keyword arguments) -- | |
| Keyword arguments to be passed to `TextConfig`.</paramsdesc><paramgroups>0</paramgroups><retdesc>[DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict)</retdesc></docstring> | |
| Create [DatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetDict) from text file(s). | |
| <ExampleCodeBlock anchor="datasets.DatasetDict.from_text.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import DatasetDict | |
| >>> ds = DatasetDict.from_text({'train': 'path/to/dataset.txt'}) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div></div> | |
| <a id='package_reference_features'></a> | |
| ## IterableDataset[[datasets.IterableDataset]] | |
| The base class [IterableDataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset) implements an iterable Dataset backed by Python generators. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.IterableDataset</name><anchor>datasets.IterableDataset</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L2179</source><parameters>[{"name": "ex_iterable", "val": ": _BaseExamplesIterable"}, {"name": "info", "val": ": typing.Optional[datasets.info.DatasetInfo] = None"}, {"name": "split", "val": ": typing.Optional[datasets.splits.NamedSplit] = None"}, {"name": "formatting", "val": ": typing.Optional[datasets.iterable_dataset.FormattingConfig] = None"}, {"name": "shuffling", "val": ": typing.Optional[datasets.iterable_dataset.ShufflingConfig] = None"}, {"name": "distributed", "val": ": typing.Optional[datasets.iterable_dataset.DistributedConfig] = None"}, {"name": "token_per_repo_id", "val": ": typing.Optional[dict[str, typing.Union[str, bool, NoneType]]] = None"}]</parameters></docstring> | |
| A Dataset backed by an iterable. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_generator</name><anchor>datasets.IterableDataset.from_generator</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L2580</source><parameters>[{"name": "generator", "val": ": typing.Callable"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "gen_kwargs", "val": ": typing.Optional[dict] = None"}, {"name": "split", "val": ": NamedSplit = NamedSplit('train')"}]</parameters><paramsdesc>- **generator** (`Callable`) -- | |
| A generator function that `yields` examples. | |
| - **features** (`Features`, *optional*) -- | |
| Dataset features. | |
| - **gen_kwargs** (`dict`, *optional*) -- | |
| Keyword arguments to be passed to the `generator` callable. | |
| You can define a sharded iterable dataset by passing the list of shards in `gen_kwargs`. | |
| This can be used to improve shuffling and when iterating over the dataset with multiple workers. | |
| - **split** ([NamedSplit](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.NamedSplit), defaults to `Split.TRAIN`) -- | |
| Split name to be assigned to the dataset. | |
| <Added version="2.21.0"/></paramsdesc><paramgroups>0</paramgroups><rettype>`IterableDataset`</rettype></docstring> | |
| Create an Iterable Dataset from a generator. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.from_generator.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import IterableDataset | |
| >>> def gen(): | |
| ...     yield {"text": "Good", "label": 0} | |
| ...     yield {"text": "Bad", "label": 1} | |
| ... | |
| >>> ds = IterableDataset.from_generator(gen) | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.from_generator.example-2"> | |
| ```py | |
| >>> def gen(shards): | |
| ...     for shard in shards: | |
| ...         with open(shard) as f: | |
| ...             for line in f: | |
| ...                 yield {"line": line} | |
| ... | |
| >>> shards = [f"data{i}.txt" for i in range(32)] | |
| >>> ds = IterableDataset.from_generator(gen, gen_kwargs={"shards": shards}) | |
| >>> ds = ds.shuffle(seed=42, buffer_size=10_000) # shuffles the shards order + uses a shuffle buffer | |
| >>> from torch.utils.data import DataLoader | |
| >>> dataloader = DataLoader(ds.with_format("torch"), num_workers=4) # give each worker a subset of 32/4=8 shards | |
| ``` | |
| </ExampleCodeBlock> | |
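To see why a list in `gen_kwargs` makes the dataset shardable, the following plain-Python sketch splits a list of shards among workers and calls the generator once per worker. The round-robin assignment in `split_shards` is an assumption chosen for illustration and the helper is hypothetical; the library may assign shards to workers differently.

```python
def split_shards(shards, num_workers):
    """Assign shards to workers round-robin (assumed strategy)."""
    return [shards[worker_id::num_workers] for worker_id in range(num_workers)]


def gen(shards):
    # Stand-in generator: yields three fake lines per shard instead of reading files.
    for shard in shards:
        for i in range(3):
            yield {"shard": shard, "line": i}


shards = [f"data{i}.txt" for i in range(32)]
per_worker = split_shards(shards, num_workers=4)

# Each worker only iterates over its own subset of the shards.
rows_worker0 = list(gen(per_worker[0]))
print([len(subset) for subset in per_worker])
```

Because each worker receives a disjoint subset, no example is produced twice when the subsets are consumed in parallel.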
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>remove_columns</name><anchor>datasets.IterableDataset.remove_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3312</source><parameters>[{"name": "column_names", "val": ": typing.Union[str, list[str]]"}]</parameters><paramsdesc>- **column_names** (`Union[str, List[str]]`) -- | |
| Name of the column(s) to remove.</paramsdesc><paramgroups>0</paramgroups><rettype>`IterableDataset`</rettype><retdesc>A copy of the dataset object without the columns to remove.</retdesc></docstring> | |
| Remove one or several column(s) in the dataset and the features associated with them. | |
| The removal is done on-the-fly on the examples when iterating over the dataset. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.remove_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) | |
| >>> next(iter(ds)) | |
| {'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1} | |
| >>> ds = ds.remove_columns("label") | |
| >>> next(iter(ds)) | |
| {'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>select_columns</name><anchor>datasets.IterableDataset.select_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3347</source><parameters>[{"name": "column_names", "val": ": typing.Union[str, list[str]]"}]</parameters><paramsdesc>- **column_names** (`Union[str, List[str]]`) -- | |
| Name of the column(s) to select.</paramsdesc><paramgroups>0</paramgroups><rettype>`IterableDataset`</rettype><retdesc>A copy of the dataset object with selected columns.</retdesc></docstring> | |
| Select one or several column(s) in the dataset and the features | |
| associated with them. The selection is done on-the-fly on the examples | |
| when iterating over the dataset. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.select_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) | |
| >>> next(iter(ds)) | |
| {'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1} | |
| >>> ds = ds.select_columns("text") | |
| >>> next(iter(ds)) | |
| {'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'} | |
| ``` | |
| </ExampleCodeBlock> | |
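"On-the-fly" means that no example is touched until it is requested. The plain-Python generators below mimic that behavior for column removal and selection. They are illustrative stand-ins with the same names as the methods, not the library's implementation, and the sample rows are invented.

```python
def remove_columns(examples, column_names):
    """Lazily drop the given column(s) from each example."""
    drop = {column_names} if isinstance(column_names, str) else set(column_names)
    for example in examples:
        yield {key: value for key, value in example.items() if key not in drop}


def select_columns(examples, column_names):
    """Lazily keep only the given column(s) of each example."""
    keep = [column_names] if isinstance(column_names, str) else list(column_names)
    for example in examples:
        yield {key: example[key] for key in keep}


produced = []  # records which source examples were actually generated


def source():
    for i in range(3):
        produced.append(i)
        yield {"text": f"review {i}", "label": i % 2}


lazy = remove_columns(source(), "label")
first = next(lazy)  # only the first source example is generated here
print(first, produced)
```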
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast_column</name><anchor>datasets.IterableDataset.cast_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3398</source><parameters>[{"name": "column", "val": ": str"}, {"name": "feature", "val": ": typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.LargeList, datasets.features.features.List, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image, datasets.features.video.Video, datasets.features.pdf.Pdf, datasets.features.nifti.Nifti, datasets.features.dicom.Dicom]"}]</parameters><paramsdesc>- **column** (`str`) -- | |
| Column name. | |
| - **feature** (`Feature`) -- | |
| Target feature.</paramsdesc><paramgroups>0</paramgroups><rettype>`IterableDataset`</rettype></docstring> | |
| Cast column to feature for decoding. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.cast_column.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset, Audio | |
| >>> ds = load_dataset("PolyAI/minds14", name="en-US", split="train", streaming=True) | |
| >>> ds.features | |
| {'audio': Audio(sampling_rate=8000, mono=True, decode=True, id=None), | |
| 'english_transcription': Value('string'), | |
| 'intent_class': ClassLabel(num_classes=14, names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill']), | |
| 'lang_id': ClassLabel(num_classes=14, names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN']), | |
| 'path': Value('string'), | |
| 'transcription': Value('string')} | |
| >>> ds = ds.cast_column("audio", Audio(sampling_rate=16000)) | |
| >>> ds.features | |
| {'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None), | |
| 'english_transcription': Value('string'), | |
| 'intent_class': ClassLabel(num_classes=14, names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill']), | |
| 'lang_id': ClassLabel(num_classes=14, names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN']), | |
| 'path': Value('string'), | |
| 'transcription': Value('string')} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast</name><anchor>datasets.IterableDataset.cast</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3445</source><parameters>[{"name": "features", "val": ": Features"}]</parameters><paramsdesc>- **features** ([Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features)) -- | |
| New features to cast the dataset to. | |
| The name of the fields in the features must match the current column names. | |
| The type of the data must also be convertible from one type to the other. | |
| For non-trivial conversions, e.g. `string` <-> `ClassLabel`, you should use [map()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dataset.map) to update the dataset.</paramsdesc><paramgroups>0</paramgroups><rettype>`IterableDataset`</rettype><retdesc>A copy of the dataset with cast features.</retdesc></docstring> | |
| Cast the dataset to a new set of features. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.cast.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset, ClassLabel, Value | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) | |
| >>> ds.features | |
| {'label': ClassLabel(names=['neg', 'pos']), | |
| 'text': Value('string')} | |
| >>> new_features = ds.features.copy() | |
| >>> new_features["label"] = ClassLabel(names=["bad", "good"]) | |
| >>> new_features["text"] = Value("large_string") | |
| >>> ds = ds.cast(new_features) | |
| >>> ds.features | |
| {'label': ClassLabel(names=['bad', 'good']), | |
| 'text': Value('large_string')} | |
| ``` | |
| </ExampleCodeBlock> | |
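For the `string` to `ClassLabel` case mentioned in the parameter description, the function passed to `map()` typically replaces each string with its integer class id. Below is a minimal plain-Python sketch of such a function. `names`, `label2id`, and `encode_label` are illustrative names and the sample rows are invented.

```python
names = ["neg", "pos"]
label2id = {name: idx for idx, name in enumerate(names)}


def encode_label(example):
    """Return a copy of the example with the string label replaced by its class id."""
    return {**example, "label": label2id[example["label"]]}


examples = [
    {"text": "great film", "label": "pos"},
    {"text": "dull plot", "label": "neg"},
]
encoded = [encode_label(example) for example in examples]
print(encoded)
```

A function of this shape matches the non-batched signature `function(example) -> dict`, so it could be applied per example before casting the column to a class-label feature.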
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>decode</name><anchor>datasets.IterableDataset.decode</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3492</source><parameters>[{"name": "enable", "val": ": bool = True"}, {"name": "num_threads", "val": ": int = 0"}]</parameters><paramsdesc>- **enable** (`bool`, defaults to `True`) -- | |
| Enable or disable features decoding. | |
| - **num_threads** (`int`, defaults to `0`) -- | |
| Number of threads to use for feature decoding. The default `0` disables multithreading.</paramsdesc><paramgroups>0</paramgroups><rettype>`IterableDataset`</rettype><retdesc>A copy of the dataset with feature decoding enabled or disabled.</retdesc></docstring> | |
| Enable or disable the dataset features decoding for audio, image, video. | |
| When enabled (default), media types are decoded: | |
| * audio -> dict of "array" and "sampling_rate" and "path" | |
| * image -> PIL.Image | |
| * video -> torchvision.io.VideoReader | |
| You can enable multithreading using `num_threads`. This is especially useful to speed up remote | |
| data streaming. However it can be slower than `num_threads=0` for local data on fast disks. | |
| Disabling decoding is useful if you want to iterate on the paths or bytes of the media files | |
| without actually decoding their content. To disable decoding you can use `.decode(False)`, which | |
| is equivalent to calling `.cast()` or `.cast_column()` with all the Audio, Image and Video types | |
| set to `decode=False`. | |
| Examples: | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.decode.example"> | |
| Disable decoding: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("sshh12/planet-textures", split="train", streaming=True) | |
| >>> next(iter(ds)) | |
| {'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=2048x1024>, | |
| 'text': 'A distant celestial object with an icy crust, displaying a light blue shade, covered with round pits and rugged terrains.'} | |
| >>> ds = ds.decode(False) | |
| >>> ds.features | |
| {'image': Image(mode=None, decode=False, id=None), | |
| 'text': Value('string')} | |
| >>> next(iter(ds)) | |
| { | |
| 'image': { | |
| 'path': 'hf://datasets/sshh12/planet-textures@69dc4cef7a5c4b2cfe387727ec8ea73d4bff7302/train/textures/0000.png', | |
| 'bytes': None | |
| }, | |
| 'text': 'A distant celestial object with an icy crust, displaying a light blue shade, covered with round pits and rugged terrains.' | |
| } | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.decode.example-2"> | |
| Speed up streaming with multithreading: | |
| ```py | |
| >>> import os | |
| >>> from datasets import load_dataset | |
| >>> from tqdm import tqdm | |
| >>> ds = load_dataset("sshh12/planet-textures", split="train", streaming=True) | |
| >>> num_threads = min(32, (os.cpu_count() or 1) + 4) | |
| >>> ds = ds.decode(num_threads=num_threads) | |
| >>> for _ in tqdm(ds): # 20 times faster ! | |
| ...     ... | |
| ``` | |
| </ExampleCodeBlock> | |
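The reason threads help with remote data is that decoding is dominated by waiting on I/O, so several items can be in flight at once. The sketch below illustrates the idea with the standard library only. `fake_decode` simulates a slow fetch with `time.sleep`, and `decode_stream` is a hypothetical helper, not how the library implements `num_threads`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_decode(item):
    # Stand-in for fetching and decoding a media file: mostly waiting.
    time.sleep(0.01)
    return item.upper()


def decode_stream(items, num_threads=0):
    """Yield decoded items in order, optionally using a thread pool."""
    if num_threads <= 0:
        for item in items:
            yield fake_decode(item)
        return
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # `map` keeps the input order while overlapping the waits.
        yield from pool.map(fake_decode, items)


items = [f"sample-{i}" for i in range(20)]
sequential = list(decode_stream(items))
threaded = list(decode_stream(items, num_threads=4))
print(threaded[:3])
```

Both calls produce the same items in the same order. Only the waiting overlaps, which is also why threads bring little benefit when the data already sits on a fast local disk.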
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>__iter__</name><anchor>datasets.IterableDataset.__iter__</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L2517</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>iter</name><anchor>datasets.IterableDataset.iter</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L2542</source><parameters>[{"name": "batch_size", "val": ": int"}, {"name": "drop_last_batch", "val": ": bool = False"}]</parameters><paramsdesc>- **batch_size** (`int`) -- size of each batch to yield. | |
| - **drop_last_batch** (`bool`, default *False*) -- Whether a last batch smaller than the batch_size should be | |
| dropped</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Iterate through the batches of size *batch_size*. | |
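As a rough illustration of how `batch_size` and `drop_last_batch` interact, the sketch below groups a plain Python iterable of examples into column-oriented batches (a dict of lists). `iter_batches` is a hypothetical helper written for this note and is not the library's implementation.

```python
from itertools import islice


def iter_batches(examples, batch_size, drop_last_batch=False):
    """Yield column-oriented batches (dict of lists) from an iterable of dict examples."""
    it = iter(examples)
    while True:
        rows = list(islice(it, batch_size))
        if not rows:
            return
        if len(rows) < batch_size and drop_last_batch:
            # The final, incomplete batch is discarded.
            return
        # Turn a list of rows into a dict of columns, e.g. {"id": [0, 1]}.
        yield {key: [row[key] for row in rows] for key in rows[0]}


examples = [{"id": i} for i in range(5)]
all_batches = list(iter_batches(examples, batch_size=2))
full_batches = list(iter_batches(examples, batch_size=2, drop_last_batch=True))
```

With five examples and `batch_size=2`, the default keeps a final batch of one example, while `drop_last_batch=True` yields only the two full batches.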
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>map</name><anchor>datasets.IterableDataset.map</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L2752</source><parameters>[{"name": "function", "val": ": typing.Optional[typing.Callable] = None"}, {"name": "with_indices", "val": ": bool = False"}, {"name": "input_columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "batched", "val": ": bool = False"}, {"name": "batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "drop_last_batch", "val": ": bool = False"}, {"name": "remove_columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "features", "val": ": typing.Optional[datasets.features.features.Features] = None"}, {"name": "fn_kwargs", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **function** (`Callable`, *optional*, defaults to `None`) -- | |
| Function applied on-the-fly on the examples when you iterate on the dataset. | |
| It must have one of the following signatures: | |
| - `function(example: Dict[str, Any]) -> Dict[str, Any]` if `batched=False` and `with_indices=False` | |
| - `function(example: Dict[str, Any], idx: int) -> Dict[str, Any]` if `batched=False` and `with_indices=True` | |
| - `function(batch: Dict[str, List]) -> Dict[str, List]` if `batched=True` and `with_indices=False` | |
| - `function(batch: Dict[str, List], indices: List[int]) -> Dict[str, List]` if `batched=True` and `with_indices=True` | |
| For advanced usage, the function can also return a `pyarrow.Table`. | |
| If the function is asynchronous, then `map` will run your function in parallel. | |
| Moreover, if your function returns nothing (`None`), then `map` will run your function and return the dataset unchanged. | |
| If no function is provided, it defaults to the identity function: `lambda x: x`. | |
| - **with_indices** (`bool`, defaults to `False`) -- | |
| Provide example indices to `function`. Note that in this case the signature of `function` should be `def function(example, idx): ...`. | |
| - **input_columns** (`Optional[Union[str, List[str]]]`, defaults to `None`) -- | |
| The columns to be passed into `function` | |
| as positional arguments. If `None`, a dict mapping to all formatted columns is passed as one argument. | |
| - **batched** (`bool`, defaults to `False`) -- | |
| Provide batch of examples to `function`. | |
| - **batch_size** (`int`, *optional*, defaults to `1000`) -- | |
| Number of examples per batch provided to `function` if `batched=True`. | |
| If `batch_size <= 0` or `batch_size is None`, the full dataset is provided as a single batch to `function`. | |
| - **drop_last_batch** (`bool`, defaults to `False`) -- | |
| Whether a last batch smaller than the batch_size should be | |
| dropped instead of being processed by the function. | |
| - **remove_columns** (`[List[str]]`, *optional*, defaults to `None`) -- | |
| Remove a selection of columns while doing the mapping. | |
| Columns will be removed before updating the examples with the output of `function`, i.e. if `function` is adding | |
| columns with names in `remove_columns`, these columns will be kept. | |
| - **features** (`[Features]`, *optional*, defaults to `None`) -- | |
| Feature types of the resulting dataset. | |
| - **fn_kwargs** (`Dict`, *optional*, default `None`) -- | |
| Keyword arguments to be passed to `function`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Apply a function to all the examples in the iterable dataset (individually or in batches) and update them. | |
| If your function returns a column that already exists, then it overwrites it. | |
| The function is applied on-the-fly on the examples when iterating over the dataset. | |
| You can specify whether the function should be batched or not with the `batched` parameter: | |
| - If batched is `False`, then the function takes 1 example in and should return 1 example. | |
| An example is a dictionary, e.g. `{"text": "Hello there !"}`. | |
| - If batched is `True` and `batch_size` is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. | |
| A batch is a dictionary, e.g. a batch of 1 example is `{"text": ["Hello there !"]}`. | |
| - If batched is `True` and `batch_size` is `n` > 1, then the function takes a batch of `n` examples as input and can return a batch with `n` examples, or with an arbitrary number of examples. | |
| Note that the last batch may have fewer than `n` examples. | |
| A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`. | |
| If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simultaneous calls. | |
| It is recommended to use an `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time. | |
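The semaphore pattern mentioned above can be sketched with the standard library alone. In this sketch `add_length` and `MAX_CONCURRENCY` are hypothetical names, and `asyncio.gather` stands in for the concurrent calls that `map` would make; with a real dataset you would pass the coroutine function to `map` instead of driving it yourself.

```python
import asyncio

MAX_CONCURRENCY = 3  # hypothetical cap chosen for this sketch


async def main():
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    running = 0  # number of calls currently inside the guarded section
    peak = 0  # highest value `running` ever reached

    async def add_length(example):
        nonlocal running, peak
        async with semaphore:  # at most MAX_CONCURRENCY calls pass this point at once
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for an API call or other I/O
            running -= 1
        return {**example, "length": len(example["text"])}

    examples = [{"text": "x" * i} for i in range(10)]
    results = await asyncio.gather(*(add_length(e) for e in examples))
    return results, peak


results, peak = asyncio.run(main())
```

Although ten calls are started together, the recorded peak never exceeds the semaphore's limit, and `gather` returns the results in input order.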
| <ExampleCodeBlock anchor="datasets.IterableDataset.map.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) | |
| >>> def add_prefix(example): | |
| ...     example["text"] = "Review: " + example["text"] | |
| ...     return example | |
| >>> ds = ds.map(add_prefix) | |
| >>> list(ds.take(3)) | |
| [{'label': 1, | |
| 'text': 'Review: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, | |
| {'label': 1, | |
| 'text': 'Review: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, | |
| {'label': 1, 'text': 'Review: effective but too-tepid biopic'}] | |
| ``` | |
| </ExampleCodeBlock> | |
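The batched case, where a function may return a different number of examples than it received, can be illustrated without loading a dataset. `split_into_chunks` and its `chunk_size` parameter are hypothetical and only show the expected input and output shapes of a batched function.

```python
def split_into_chunks(batch, chunk_size=10):
    """Batched function: each input text may produce several output rows."""
    chunks = []
    for text in batch["text"]:
        # Slice the text into pieces of at most `chunk_size` characters.
        chunks.extend(text[i : i + chunk_size] for i in range(0, len(text), chunk_size))
    return {"chunk": chunks}


batch = {"text": ["Hello there !", "Hi"]}
out = split_into_chunks(batch)
```

Here a batch of two examples becomes three rows. When a function changes the number of rows like this, the original columns usually need to be dropped with `remove_columns`, since their lengths no longer match the new column; a call might look like `ds.map(split_into_chunks, batched=True, remove_columns=["text", "label"])`.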
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>rename_column</name><anchor>datasets.IterableDataset.rename_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3257</source><parameters>[{"name": "original_column_name", "val": ": str"}, {"name": "new_column_name", "val": ": str"}]</parameters><paramsdesc>- **original_column_name** (`str`) -- | |
| Name of the column to rename. | |
| - **new_column_name** (`str`) -- | |
| New name for the column.</paramsdesc><paramgroups>0</paramgroups><rettype>`IterableDataset`</rettype><retdesc>A copy of the dataset with a renamed column.</retdesc></docstring> | |
| Rename a column in the dataset, and move the features associated with the original column under the new column | |
| name. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.rename_column.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) | |
| >>> next(iter(ds)) | |
| {'label': 1, | |
| 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'} | |
| >>> ds = ds.rename_column("text", "movie_review") | |
| >>> next(iter(ds)) | |
| {'label': 1, | |
| 'movie_review': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>filter</name><anchor>datasets.IterableDataset.filter</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L2904</source><parameters>[{"name": "function", "val": ": typing.Optional[typing.Callable] = None"}, {"name": "with_indices", "val": " = False"}, {"name": "input_columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "batched", "val": ": bool = False"}, {"name": "batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "fn_kwargs", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **function** (`Callable`) -- | |
| Callable with one of the following signatures: | |
| - `function(example: Dict[str, Any]) -> bool` if `with_indices=False, batched=False` | |
| - `function(example: Dict[str, Any], idx: int) -> bool` if `with_indices=True, batched=False` | |
| - `function(batch: Dict[str, List]) -> List[bool]` if `with_indices=False, batched=True` | |
| - `function(batch: Dict[str, List], indices: List[int]) -> List[bool]` if `with_indices=True, batched=True` | |
| If the function is asynchronous, then `filter` will run your function in parallel. | |
| If no function is provided, defaults to an always True function: `lambda x: True`. | |
| - **with_indices** (`bool`, defaults to `False`) -- | |
| Provide example indices to `function`. Note that in this case the signature of `function` should be `def function(example, idx): ...`. | |
| - **input_columns** (`str` or `List[str]`, *optional*) -- | |
| The columns to be passed into `function` as | |
| positional arguments. If `None`, a dict mapping to all formatted columns is passed as one argument. | |
| - **batched** (`bool`, defaults to `False`) -- | |
| Provide batch of examples to `function`. | |
| - **batch_size** (`int`, *optional*, default `1000`) -- | |
| Number of examples per batch provided to `function` if `batched=True`. | |
| - **fn_kwargs** (`Dict`, *optional*, default `None`) -- | |
| Keyword arguments to be passed to `function`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Apply a filter function to all the elements so that the dataset only includes examples according to the filter function. | |
| The filtering is done on-the-fly when iterating over the dataset. | |
| If the function is asynchronous, then `filter` will run your function in parallel, with up to one thousand simultaneous calls (configurable). | |
| It is recommended to use an `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.filter.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) | |
| >>> ds = ds.filter(lambda x: x["label"] == 0) | |
| >>> list(ds.take(3)) | |
| [{'label': 0, 'text': 'simplistic , silly and tedious .'}, | |
| {'label': 0, | |
| 'text': "it's so laddish and juvenile , only teenage boys could possibly find it funny ."}, | |
| {'label': 0, | |
| 'text': 'exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .'}] | |
| ``` | |
| </ExampleCodeBlock> | |
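With `batched=True`, the filter function receives a batch (a dict of lists) and returns one boolean per example. The sketch below shows that contract on a hand-built batch; `keep_negative` is a hypothetical predicate, and the masking step only imitates what happens to the batch afterwards.

```python
def keep_negative(batch):
    """Batched predicate: return one boolean per example in the batch."""
    return [label == 0 for label in batch["label"]]


batch = {"label": [1, 0, 0], "text": ["good", "bad", "awful"]}
mask = keep_negative(batch)

# Apply the mask column by column to see which examples survive.
kept = {
    column: [value for value, keep in zip(values, mask) if keep]
    for column, values in batch.items()
}
```

With a streaming dataset, the equivalent call would be `ds.filter(keep_negative, batched=True)`.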
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>shuffle</name><anchor>datasets.IterableDataset.shuffle</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L2990</source><parameters>[{"name": "seed", "val": " = None"}, {"name": "generator", "val": ": typing.Optional[numpy.random._generator.Generator] = None"}, {"name": "buffer_size", "val": ": int = 1000"}]</parameters><paramsdesc>- **seed** (`int`, *optional*, defaults to `None`) -- | |
| Random seed that will be used to shuffle the dataset. | |
| It is used to sample from the shuffle buffer and also to shuffle the data shards. | |
| - **generator** (`numpy.random.Generator`, *optional*) -- | |
| Numpy random Generator to use to compute the permutation of the dataset rows. | |
| If `generator=None` (default), uses `np.random.default_rng` (the default BitGenerator (PCG64) of NumPy). | |
| - **buffer_size** (`int`, defaults to `1000`) -- | |
| Size of the buffer.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Randomly shuffles the elements of this dataset. | |
| This dataset fills a buffer with `buffer_size` elements, then randomly samples elements from this buffer, | |
| replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or | |
| equal to the full size of the dataset is required. | |
| For instance, if your dataset contains 10,000 elements but `buffer_size` is set to 1000, then `shuffle` will | |
| initially select a random element from only the first 1000 elements in the buffer. Once an element is | |
| selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, | |
| maintaining the 1000 element buffer. | |
| If the dataset is made of several shards, the order of the shards is shuffled as well. | |
| However if the order has been fixed by using [skip()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset.skip) or [take()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset.take) | |
| then the order of the shards is kept unchanged. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.shuffle.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) | |
| >>> list(ds.take(3)) | |
| [{'label': 1, | |
| 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, | |
| {'label': 1, | |
| 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, | |
| {'label': 1, 'text': 'effective but too-tepid biopic'}] | |
| >>> shuffled_ds = ds.shuffle(seed=42) | |
| >>> list(shuffled_ds.take(3)) | |
| [{'label': 1, | |
| 'text': "a sports movie with action that's exciting on the field and a story you care about off it ."}, | |
| {'label': 1, | |
| 'text': 'at its best , the good girl is a refreshingly adult take on adultery . . .'}, | |
| {'label': 1, | |
| 'text': "sam jones became a very lucky filmmaker the day wilco got dropped from their record label , proving that one man's ruin may be another's fortune ."}] | |
| ``` | |
| </ExampleCodeBlock> | |
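The buffer mechanism described above can be sketched in a few lines of plain Python. This is an approximation for illustration only: `buffer_shuffle` is a hypothetical helper, it uses the standard-library `random.Random` rather than a NumPy generator, and it ignores shard shuffling.

```python
import random


def buffer_shuffle(examples, buffer_size, seed=None):
    """Approximate shuffle of a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for example in examples:
        if len(buffer) < buffer_size:
            # Fill the buffer before yielding anything.
            buffer.append(example)
            continue
        # Pick a random slot, yield its element, and refill the slot
        # with the next element from the stream.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = example
    # The stream is exhausted: yield what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(10), buffer_size=3, seed=0))
```

Because the first output can only come from the first `buffer_size` elements, a small buffer keeps elements close to their original position, which is why a buffer at least as large as the dataset is needed for a uniform shuffle.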
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>batch</name><anchor>datasets.IterableDataset.batch</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3616</source><parameters>[{"name": "batch_size", "val": ": int"}, {"name": "drop_last_batch", "val": ": bool = False"}]</parameters><paramsdesc>- **batch_size** (`int`) -- The number of samples in each batch. | |
| - **drop_last_batch** (`bool`, defaults to `False`) -- Whether to drop the last incomplete batch.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Group samples from the dataset into batches. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.batch.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("some_dataset", split="train", streaming=True) | |
| >>> batched_ds = ds.batch(batch_size=32) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>skip</name><anchor>datasets.IterableDataset.skip</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3064</source><parameters>[{"name": "n", "val": ": int"}]</parameters><paramsdesc>- **n** (`int`) -- | |
| Number of elements to skip.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a new [IterableDataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset) that skips the first `n` elements. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.skip.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) | |
| >>> list(ds.take(3)) | |
| [{'label': 1, | |
| 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, | |
| {'label': 1, | |
| 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, | |
| {'label': 1, 'text': 'effective but too-tepid biopic'}] | |
| >>> ds = ds.skip(1) | |
| >>> list(ds.take(3)) | |
| [{'label': 1, | |
| 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, | |
| {'label': 1, 'text': 'effective but too-tepid biopic'}, | |
| {'label': 1, | |
| 'text': 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'}] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>take</name><anchor>datasets.IterableDataset.take</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3151</source><parameters>[{"name": "n", "val": ": int"}]</parameters><paramsdesc>- **n** (`int`) -- | |
| Number of elements to take.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a new [IterableDataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset) with only the first `n` elements. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.take.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) | |
| >>> small_ds = ds.take(2) | |
| >>> list(small_ds) | |
| [{'label': 1, | |
| 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, | |
| {'label': 1, | |
| 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}] | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>shard</name><anchor>datasets.IterableDataset.shard</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3188</source><parameters>[{"name": "num_shards", "val": ": int"}, {"name": "index", "val": ": int"}, {"name": "contiguous", "val": ": bool = True"}]</parameters><paramsdesc>- **num_shards** (`int`) -- | |
| How many shards to split the dataset into. | |
| - **index** (`int`) -- | |
| Which shard to select and return. | |
| - **contiguous** (`bool`, defaults to `True`) -- | |
| Whether to select contiguous blocks of indices for shards.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Return the `index`-th shard from the dataset split into `num_shards` pieces. | |
| This shards deterministically. `dataset.shard(n, i)` splits the dataset into contiguous chunks, | |
| so it can be easily concatenated back together after processing. If `dataset.num_shards % n == l`, then the | |
| first `l` datasets each have `(dataset.num_shards // n) + 1` shards, and the remaining datasets have `(dataset.num_shards // n)` shards. | |
| `datasets.concatenate_datasets([dset.shard(n, i) for i in range(n)])` returns a dataset with the same order as the original. | |
| In particular, `dataset.shard(dataset.num_shards, i)` returns a dataset with 1 shard. | |
| Note: `n` should be less than or equal to the number of shards in the dataset, `dataset.num_shards`. | |
| On the other hand, `dataset.shard(n, i, contiguous=False)` contains all the shards of the dataset whose index satisfies `index % n == i`. | |
| Be sure to shard before using any randomizing operator (such as `shuffle`). | |
| It is best if the shard operator is used early in the dataset pipeline. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.shard.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("amazon_polarity", split="train", streaming=True) | |
| >>> ds | |
| IterableDataset({ | |
| features: ['label', 'title', 'content'], | |
| num_shards: 4 | |
| }) | |
| >>> ds.shard(num_shards=2, index=0) | |
| IterableDataset({ | |
| features: ['label', 'title', 'content'], | |
| num_shards: 2 | |
| }) | |
| ``` | |
| </ExampleCodeBlock> | |
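The shard-assignment rules stated above can be checked with a small calculation. `shard_indices` is a hypothetical helper that returns which source-shard indices end up in shard `i` out of `n`, following the contiguous and non-contiguous rules as described in this section; it is not the library's code.

```python
def shard_indices(total_shards, n, i, contiguous=True):
    """Return the source-shard indices selected for shard `i` of `n`."""
    if contiguous:
        # The first `mod` shards get one extra source shard each.
        div, mod = divmod(total_shards, n)
        start = div * i + min(i, mod)
        end = start + div + (1 if i < mod else 0)
        return list(range(start, end))
    # Non-contiguous: every source shard whose index % n == i.
    return list(range(i, total_shards, n))


contiguous_split = [shard_indices(10, 4, i) for i in range(4)]
strided_split = [shard_indices(10, 4, i, contiguous=False) for i in range(4)]
```

With 10 source shards and `n=4`, `10 % 4 == 2`, so the first two pieces hold three source shards and the last two hold two. Concatenating the contiguous pieces in order restores the original shard order.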
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>repeat</name><anchor>datasets.IterableDataset.repeat</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3108</source><parameters>[{"name": "num_times", "val": ": typing.Optional[int]"}]</parameters><paramsdesc>- **num_times** (`int`) or (`None`) -- | |
| Number of times to repeat the dataset. If `None`, the dataset will be repeated indefinitely.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a new [IterableDataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset) that repeats the underlying dataset `num_times` times. | |
| N.B. The effect of calling `shuffle` after `repeat` depends significantly on the buffer size. | |
| With `buffer_size=1`, duplicate data is never seen in the same iteration, even after shuffling: | |
| `ds.repeat(n).shuffle(seed=42, buffer_size=1)` is equivalent to `ds.shuffle(seed=42, buffer_size=1).repeat(n)`, | |
| and only shuffles shard orders within each iteration. | |
| With `buffer_size >= (number of samples in the dataset * num_times)`, the repeated data is fully shuffled, i.e. duplicates can be observed in | |
| the same iteration. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.repeat.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) | |
| >>> ds = ds.take(3).repeat(2) | |
| >>> list(ds) | |
| [{'label': 1, | |
| 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, | |
| {'label': 1, | |
| 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, | |
| {'label': 1, 'text': 'effective but too-tepid biopic'}, | |
| {'label': 1, | |
| 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, | |
| {'label': 1, | |
| 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, | |
| {'label': 1, 'text': 'effective but too-tepid biopic'}] | |
| ``` | |
| </ExampleCodeBlock> | |
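The semantics of `num_times`, including the `None` case, can be sketched with a generator. `repeat_stream` and `make_iterator` are hypothetical names; the sketch assumes the underlying data can be re-iterated by calling `make_iterator()` again.

```python
from itertools import count, islice


def repeat_stream(make_iterator, num_times=None):
    """Re-iterate a source `num_times` times, or forever if `num_times` is None."""
    rounds = count() if num_times is None else range(num_times)
    for _ in rounds:
        yield from make_iterator()


twice = list(repeat_stream(lambda: iter([1, 2]), num_times=2))
# An endless stream must be bounded by the consumer, e.g. with islice.
first_five = list(islice(repeat_stream(lambda: iter([1, 2]), num_times=None), 5))
```

An indefinitely repeated dataset never terminates on its own, so it is typically combined with `take` or a step-based training loop.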
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_csv</name><anchor>datasets.IterableDataset.to_csv</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3743</source><parameters>[{"name": "path_or_buf", "val": ": typing.Union[str, bytes, os.PathLike, typing.BinaryIO]"}, {"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "**to_csv_kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_buf** (`PathLike` or `FileOrBuffer`) -- | |
| Either a path to a file (e.g. `file.csv`), a remote URI (e.g. `hf://datasets/username/my_dataset_name/data.csv`), | |
| or a BinaryIO, where the dataset will be saved to in the specified format. | |
| - **batch_size** (`int`, *optional*) -- | |
| Size of the batch to load in memory and write at once. | |
| Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| - ****to_csv_kwargs** (additional keyword arguments) -- | |
| Parameters to pass to pandas's [`pandas.DataFrame.to_csv`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html). | |
| The parameter `index` defaults to `False` if not specified. | |
| If you would like to write the index, pass `index=True` and also set a name for the index column by | |
| passing `index_label`.</paramsdesc><paramgroups>0</paramgroups><rettype>`int`</rettype><retdesc>The number of characters or bytes written.</retdesc></docstring> | |
| Exports the dataset to CSV. | |
| This iterates on the dataset and loads it completely in memory before writing it. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.to_csv.example"> | |
| Example: | |
| ```py | |
| >>> ds.to_csv("path/to/dataset/directory/filename.csv") | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_pandas</name><anchor>datasets.IterableDataset.to_pandas</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3677</source><parameters>[{"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "batched", "val": ": bool = False"}]</parameters><paramsdesc>- **batch_size** (`int`, *optional*) -- | |
| The size (number of rows) of the batches if `batched` is `True`. | |
| Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`. | |
| - **batched** (`bool`) -- | |
| Set to `True` to return a generator that yields the dataset as batches | |
| of `batch_size` rows. Defaults to `False` (returns the whole dataset at once).</paramsdesc><paramgroups>0</paramgroups><retdesc>`pandas.DataFrame` or `Iterator[pandas.DataFrame]`</retdesc></docstring> | |
| Returns the dataset as a `pandas.DataFrame`. Can also return a generator for large datasets. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.to_pandas.example"> | |
| Example: | |
| ```py | |
| >>> ds.to_pandas() | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_dict</name><anchor>datasets.IterableDataset.to_dict</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3639</source><parameters>[{"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "batched", "val": ": bool = False"}]</parameters><paramsdesc>- **batch_size** (`int`, *optional*) -- The size (number of rows) of the batches if `batched` is `True`. | |
| Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`.</paramsdesc><paramgroups>0</paramgroups><retdesc>`dict` or `Iterator[dict]`</retdesc></docstring> | |
| Returns the dataset as a Python dict. Can also return a generator for large datasets. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.to_dict.example"> | |
| Example: | |
| ```py | |
| >>> ds.to_dict() | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_json</name><anchor>datasets.IterableDataset.to_json</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3786</source><parameters>[{"name": "path_or_buf", "val": ": typing.Union[str, bytes, os.PathLike, typing.BinaryIO]"}, {"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "**to_json_kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_buf** (`PathLike` or `FileOrBuffer`) -- | |
| Either a path to a file (e.g. `file.json`), a remote URI (e.g. `hf://datasets/username/my_dataset_name/data.json`), | |
| or a BinaryIO, where the dataset will be saved to in the specified format. | |
| - **batch_size** (`int`, *optional*) -- | |
| Size of the batch to load in memory and write at once. | |
| Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| - ****to_json_kwargs** (additional keyword arguments) -- | |
| Parameters to pass to pandas's [`pandas.DataFrame.to_json`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html). | |
| Default arguments are `lines=True` and `orient="records"`. | |
| The parameter `index` defaults to `False` if `orient` is `"split"` or `"table"`. | |
| If you would like to write the index, pass `index=True`.</paramsdesc><paramgroups>0</paramgroups><rettype>`int`</rettype><retdesc>The number of characters or bytes written.</retdesc></docstring> | |
| Export the dataset to JSON Lines or JSON. | |
| This iterates on the dataset and loads it completely in memory before writing it. | |
| The default output format is [JSON Lines](https://jsonlines.org/). | |
| To export to [JSON](https://www.json.org), pass `lines=False` argument and the desired `orient`. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.to_json.example"> | |
| Example: | |
| ```py | |
| >>> ds.to_json("path/to/dataset/directory/filename.jsonl") | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.to_json.example-2"> | |
| ```py | |
| >>> num_shards = dataset.num_shards | |
| >>> for index in range(num_shards): | |
| ...     shard = dataset.shard(num_shards=num_shards, index=index) | |
| ...     shard.to_json(f"path/of/my/dataset/data-{index:05d}.jsonl") | |
| ``` | |
| </ExampleCodeBlock> | |
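The difference between the two output layouts can be seen with the standard-library `json` module. This sketch only illustrates the file formats: `write_json_lines` is a hypothetical helper and does not go through pandas, unlike `to_json`.

```python
import io
import json


def write_json_lines(examples, buf):
    """Write one JSON object per line (JSON Lines) and return the characters written."""
    written = 0
    for example in examples:
        written += buf.write(json.dumps(example) + "\n")
    return written


examples = [{"label": 1, "text": "good"}, {"label": 0, "text": "bad"}]

buf = io.StringIO()
n_chars = write_json_lines(examples, buf)
json_lines = buf.getvalue()  # two lines, one record each

json_array = json.dumps(examples)  # plain JSON: a single array of records
```

JSON Lines can be appended to and read back one record at a time, which suits batch-wise writing; a single JSON array has to be parsed as a whole.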
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_parquet</name><anchor>datasets.IterableDataset.to_parquet</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3882</source><parameters>[{"name": "path_or_buf", "val": ": typing.Union[str, bytes, os.PathLike, typing.BinaryIO]"}, {"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "storage_options", "val": ": typing.Optional[dict] = None"}, {"name": "**parquet_writer_kwargs", "val": ""}]</parameters><paramsdesc>- **path_or_buf** (`PathLike` or `FileOrBuffer`) -- | |
| Either a path to a file (e.g. `file.parquet`), a remote URI (e.g. `hf://datasets/username/my_dataset_name/data.parquet`), | |
| or a BinaryIO, where the dataset will be saved to in the specified format. | |
| - **batch_size** (`int`, *optional*) -- | |
| Size of the batch to load in memory and write at once. | |
| Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`. | |
| - **storage_options** (`dict`, *optional*) -- | |
| Key/value pairs to be passed on to the file-system backend, if any. | |
| <Added version="2.19.0"/> | |
| - ****parquet_writer_kwargs** (additional keyword arguments) -- | |
| Parameters to pass to PyArrow's `pyarrow.parquet.ParquetWriter`.</paramsdesc><paramgroups>0</paramgroups><rettype>`int`</rettype><retdesc>The number of characters or bytes written.</retdesc></docstring> | |
| Exports the dataset to Parquet. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.to_parquet.example"> | |
| Example: | |
| ```py | |
| >>> ds.to_parquet("path/to/dataset/directory/filename.parquet") | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.to_parquet.example-2"> | |
| ```py | |
| >>> num_shards = dataset.num_shards | |
| >>> for index in range(num_shards): | |
| ...     shard = dataset.shard(index, num_shards) | |
| ...     shard.to_parquet(f"path/of/my/dataset/data-{index:05d}.parquet") | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_sql</name><anchor>datasets.IterableDataset.to_sql</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L3840</source><parameters>[{"name": "name", "val": ": str"}, {"name": "con", "val": ": typing.Union[str, ForwardRef('sqlalchemy.engine.Connection'), ForwardRef('sqlalchemy.engine.Engine'), ForwardRef('sqlite3.Connection')]"}, {"name": "batch_size", "val": ": typing.Optional[int] = None"}, {"name": "**sql_writer_kwargs", "val": ""}]</parameters><paramsdesc>- **name** (`str`) -- | |
| Name of SQL table. | |
| - **con** (`str` or `sqlite3.Connection` or `sqlalchemy.engine.Connection` or `sqlalchemy.engine.Engine`) -- | |
| A [URI string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) or a SQLite3/SQLAlchemy connection object used to write to a database. | |
| - **batch_size** (`int`, *optional*) -- | |
| Size of the batch to load in memory and write at once. | |
| Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`. | |
| - ****sql_writer_kwargs** (additional keyword arguments) -- | |
| Parameters to pass to pandas's [`pandas.DataFrame.to_sql`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html). | |
| The parameter `index` defaults to `False` if not specified. | |
| If you would like to write the index, pass `index=True` and also set a name for the index column by | |
| passing `index_label`.</paramsdesc><paramgroups>0</paramgroups><rettype>`int`</rettype><retdesc>The number of records written.</retdesc></docstring> | |
| Exports the dataset to a SQL database. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.to_sql.example"> | |
| Example: | |
| ```py | |
| >>> # con provided as a connection URI string | |
| >>> ds.to_sql("data", "sqlite:///my_own_db.sql") | |
| >>> # con provided as a sqlite3 connection object | |
| >>> import sqlite3 | |
| >>> con = sqlite3.connect("my_own_db.sql") | |
| >>> with con: | |
| ...     ds.to_sql("data", con) | |
| ``` | |
| </ExampleCodeBlock> | |
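The role of `batch_size` and of the returned record count can be illustrated with a small standard-library sketch. This is not how `to_sql` is implemented (the documentation above says it forwards to `pandas.DataFrame.to_sql`); the helper `write_rows_in_batches`, the table schema and the in-memory database are assumptions made for illustration.

```python
import sqlite3
from itertools import islice


def write_rows_in_batches(rows, con, table, batch_size=2):
    """Insert rows batch by batch and return the number of records written."""
    iterator = iter(rows)
    written = 0
    while True:
        # Only `batch_size` rows are held in memory at any time.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        con.executemany(
            f"INSERT INTO {table} (id, text) VALUES (?, ?)",
            [(row["id"], row["text"]) for row in batch],
        )
        written += len(batch)
    return written


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (id INTEGER, text TEXT)")

# A generator stands in for a streamed dataset that cannot be indexed.
rows = ({"id": i, "text": f"row {i}"} for i in range(5))
with con:  # commits on success, rolls back on error
    n_written = write_rows_in_batches(rows, con, "data")

count = con.execute("SELECT COUNT(*) FROM data").fetchone()[0]
```

Writing in batches keeps memory bounded regardless of how many examples the stream yields.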
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>push_to_hub</name><anchor>datasets.IterableDataset.push_to_hub</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L4087</source><parameters>[{"name": "repo_id", "val": ": str"}, {"name": "config_name", "val": ": str = 'default'"}, {"name": "set_default", "val": ": typing.Optional[bool] = None"}, {"name": "split", "val": ": typing.Optional[str] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "commit_message", "val": ": typing.Optional[str] = None"}, {"name": "commit_description", "val": ": typing.Optional[str] = None"}, {"name": "private", "val": ": typing.Optional[bool] = None"}, {"name": "token", "val": ": typing.Optional[str] = None"}, {"name": "revision", "val": ": typing.Optional[str] = None"}, {"name": "create_pr", "val": ": typing.Optional[bool] = False"}, {"name": "num_shards", "val": ": typing.Optional[int] = None"}, {"name": "embed_external_files", "val": ": bool = True"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}]</parameters><paramsdesc>- **repo_id** (`str`) -- | |
| The ID of the repository to push to in the following format: `<user>/<dataset_name>` or | |
| `<org>/<dataset_name>`. Also accepts `<dataset_name>`, which will default to the namespace | |
| of the logged-in user. | |
| - **config_name** (`str`, defaults to "default") -- | |
| The configuration name (or subset) of a dataset. Defaults to "default". | |
| - **set_default** (`bool`, *optional*) -- | |
| Whether to set this configuration as the default one. Otherwise, the default configuration is the one | |
| named "default". | |
| - **split** (`str`, *optional*) -- | |
| The name of the split that will be given to that dataset. Defaults to `self.split`. | |
| - **data_dir** (`str`, *optional*) -- | |
| Directory name that will contain the uploaded data files. Defaults to the `config_name` if different | |
| from "default", else "data". | |
| - **commit_message** (`str`, *optional*) -- | |
| Message to commit while pushing. Will default to `"Upload dataset"`. | |
| - **commit_description** (`str`, *optional*) -- | |
| Description of the commit that will be created. | |
| Additionally, description of the PR if a PR is created (`create_pr` is True). | |
| - **private** (`bool`, *optional*) -- | |
| Whether to make the repo private. If `None` (default), the repo will be public unless the | |
| organization's default is private. This value is ignored if the repo already exists. | |
| - **token** (`str`, *optional*) -- | |
| An optional authentication token for the Hugging Face Hub. If no token is passed, will default | |
| to the token saved locally when logging in with `huggingface-cli login`. Will raise an error | |
| if no token is passed and the user is not logged in. | |
| - **revision** (`str`, *optional*) -- | |
| Branch to push the uploaded files to. Defaults to the `"main"` branch. | |
| - **create_pr** (`bool`, *optional*, defaults to `False`) -- | |
| Whether to create a PR with the uploaded files or directly commit. | |
| - **num_shards** (`int`, *optional*) -- | |
| Number of shards to write. Equal to this dataset's `.num_shards` by default. | |
| - **embed_external_files** (`bool`, defaults to `True`) -- | |
| Whether to embed file bytes in the shards. | |
| In particular, this will do the following before the push for the fields of type: | |
| - [Audio](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Audio) and [Image](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Image): remove local path information and embed file content in the Parquet files. | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| Number of processes when preparing and uploading the dataset. | |
| This is helpful if the dataset is made of many samples and transformations. | |
| Multiprocessing is disabled by default.</paramsdesc><paramgroups>0</paramgroups><retdesc>huggingface_hub.CommitInfo</retdesc></docstring> | |
| Pushes the dataset to the hub as a Parquet dataset. | |
| The dataset is pushed using HTTP requests and does not require git or git-lfs to be installed. | |
| The resulting Parquet files are self-contained by default. If your dataset contains [Image](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Image), [Audio](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Audio) or [Video](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Video) | |
| data, the Parquet files will store the bytes of your image, audio or video files. | |
| You can disable this by setting `embed_external_files` to `False`. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.push_to_hub.example"> | |
| Example: | |
| ```python | |
| >>> dataset.push_to_hub("<organization>/<dataset_id>") | |
| >>> dataset_dict.push_to_hub("<organization>/<dataset_id>", private=True) | |
| >>> dataset.push_to_hub("<organization>/<dataset_id>", num_shards=1024) | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.push_to_hub.example-2"> | |
| If your dataset has multiple splits (e.g. train/validation/test): | |
| ```python | |
| >>> train_dataset.push_to_hub("<organization>/<dataset_id>", split="train") | |
| >>> val_dataset.push_to_hub("<organization>/<dataset_id>", split="validation") | |
| >>> # later | |
| >>> dataset = load_dataset("<organization>/<dataset_id>") | |
| >>> train_dataset = dataset["train"] | |
| >>> val_dataset = dataset["validation"] | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.push_to_hub.example-3"> | |
| If you want to add a new configuration (or subset) to a dataset (e.g. if the dataset has multiple tasks/versions/languages): | |
| ```python | |
| >>> english_dataset.push_to_hub("<organization>/<dataset_id>", "en") | |
| >>> french_dataset.push_to_hub("<organization>/<dataset_id>", "fr") | |
| >>> # later | |
| >>> english_dataset = load_dataset("<organization>/<dataset_id>", "en") | |
| >>> french_dataset = load_dataset("<organization>/<dataset_id>", "fr") | |
| ``` | |
| </ExampleCodeBlock> | |
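The documented default for `data_dir` (the `config_name` if it differs from "default", else "data") can be written out as a tiny rule. The function `resolve_data_dir` below is a hypothetical helper used only to restate that rule; it is not part of the `datasets` API.

```python
from typing import Optional


def resolve_data_dir(config_name: str = "default", data_dir: Optional[str] = None) -> str:
    """Hypothetical helper restating the documented `data_dir` default."""
    if data_dir is not None:
        # An explicit value always wins.
        return data_dir
    return config_name if config_name != "default" else "data"


default_dir = resolve_data_dir()
english_dir = resolve_data_dir("en")
custom_dir = resolve_data_dir("en", data_dir="custom")
```

Under this rule, pushing the "en" and "fr" configurations from the example above would place their files in separate directories of the same repository.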
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>load_state_dict</name><anchor>datasets.IterableDataset.load_state_dict</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L2297</source><parameters>[{"name": "state_dict", "val": ": dict"}]</parameters></docstring> | |
| Load the state_dict of the dataset. | |
| The iteration will restart at the next example from when the state was saved. | |
| Resuming restarts exactly where the checkpoint was saved, except in two cases: | |
| 1. examples from shuffle buffers are lost when resuming and the buffers are refilled with new data | |
| 2. combinations of `.with_format("arrow")` and batched `.map()` may skip one batch. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.load_state_dict.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import Dataset | |
| >>> ds = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3) | |
| >>> for idx, example in enumerate(ds): | |
| ...     print(example) | |
| ...     if idx == 2: | |
| ...         state_dict = ds.state_dict() | |
| ...         print("checkpoint") | |
| ...         break | |
| >>> ds.load_state_dict(state_dict) | |
| >>> print("restart from checkpoint") | |
| >>> for example in ds: | |
| ...     print(example) | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.load_state_dict.example-2"> | |
| which returns: | |
| ``` | |
| {'a': 0} | |
| {'a': 1} | |
| {'a': 2} | |
| checkpoint | |
| restart from checkpoint | |
| {'a': 3} | |
| {'a': 4} | |
| {'a': 5} | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.load_state_dict.example-3"> | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> from torchdata.stateful_dataloader import StatefulDataLoader | |
| >>> ds = load_dataset("deepmind/code_contests", streaming=True, split="train") | |
| >>> dataloader = StatefulDataLoader(ds, batch_size=32, num_workers=4) | |
| >>> # checkpoint | |
| >>> state_dict = dataloader.state_dict() # uses ds.state_dict() under the hood | |
| >>> # resume from checkpoint | |
| >>> dataloader.load_state_dict(state_dict) # uses ds.load_state_dict() under the hood | |
| ``` | |
| </ExampleCodeBlock> | |
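The checkpoint and resume contract described above can be sketched with a toy iterable. `ResumableRange` is a hypothetical class written for illustration; it mimics the `state_dict()` / `load_state_dict()` pairing but says nothing about how `IterableDataset` tracks its state internally.

```python
class ResumableRange:
    """Toy iterable that yields {"a": i} and can resume from a saved position."""

    def __init__(self, n):
        self.n = n
        self._position = 0     # index of the next example to yield
        self._resume_from = 0  # where the next iteration should start

    def __iter__(self):
        # A loaded state is consumed by the next iteration only.
        self._position = self._resume_from
        self._resume_from = 0
        while self._position < self.n:
            example = {"a": self._position}
            self._position += 1
            yield example

    def state_dict(self):
        # State as of the latest yielded example.
        return {"position": self._position}

    def load_state_dict(self, state_dict):
        self._resume_from = state_dict["position"]


ds = ResumableRange(6)
seen = []
for idx, example in enumerate(ds):
    seen.append(example["a"])
    if idx == 2:
        state = ds.state_dict()  # checkpoint after the third example
        break

ds.load_state_dict(state)
resumed = [example["a"] for example in ds]
```

Because the saved position points at the next example, nothing is repeated or skipped in this toy case; the two exceptions listed above arise only when buffered or batched steps sit between the source and the consumer.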
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>state_dict</name><anchor>datasets.IterableDataset.state_dict</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L2244</source><parameters>[]</parameters><rettype>`dict`</rettype></docstring> | |
| Get the current state_dict of the dataset. | |
| It corresponds to the state at the latest example it yielded. | |
| Resuming restarts exactly where the checkpoint was saved, except in two cases: | |
| 1. examples from shuffle buffers are lost when resuming and the buffers are refilled with new data | |
| 2. combinations of `.with_format("arrow")` and batched `.map()` may skip one batch. | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.state_dict.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import Dataset | |
| >>> ds = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3) | |
| >>> for idx, example in enumerate(ds): | |
| ...     print(example) | |
| ...     if idx == 2: | |
| ...         state_dict = ds.state_dict() | |
| ...         print("checkpoint") | |
| ...         break | |
| >>> ds.load_state_dict(state_dict) | |
| >>> print("restart from checkpoint") | |
| >>> for example in ds: | |
| ...     print(example) | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.state_dict.example-2"> | |
| which returns: | |
| ``` | |
| {'a': 0} | |
| {'a': 1} | |
| {'a': 2} | |
| checkpoint | |
| restart from checkpoint | |
| {'a': 3} | |
| {'a': 4} | |
| {'a': 5} | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.IterableDataset.state_dict.example-3"> | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> from torchdata.stateful_dataloader import StatefulDataLoader | |
| >>> ds = load_dataset("deepmind/code_contests", streaming=True, split="train") | |
| >>> dataloader = StatefulDataLoader(ds, batch_size=32, num_workers=4) | |
| >>> # checkpoint | |
| >>> state_dict = dataloader.state_dict() # uses ds.state_dict() under the hood | |
| >>> # resume from checkpoint | |
| >>> dataloader.load_state_dict(state_dict) # uses ds.load_state_dict() under the hood | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>info</name><anchor>datasets.IterableDataset.info</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L167</source><parameters>[]</parameters></docstring> | |
| [DatasetInfo](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.DatasetInfo) object containing all the metadata in the dataset. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>split</name><anchor>datasets.IterableDataset.split</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L172</source><parameters>[]</parameters></docstring> | |
| [NamedSplit](/docs/datasets/pr_7835/en/package_reference/builder_classes#datasets.NamedSplit) object corresponding to a named dataset split. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>builder_name</name><anchor>datasets.IterableDataset.builder_name</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L177</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>citation</name><anchor>datasets.IterableDataset.citation</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L181</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>config_name</name><anchor>datasets.IterableDataset.config_name</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L185</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>dataset_size</name><anchor>datasets.IterableDataset.dataset_size</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L189</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>description</name><anchor>datasets.IterableDataset.description</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L193</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>download_checksums</name><anchor>datasets.IterableDataset.download_checksums</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L197</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>download_size</name><anchor>datasets.IterableDataset.download_size</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L201</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>features</name><anchor>datasets.IterableDataset.features</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L205</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>homepage</name><anchor>datasets.IterableDataset.homepage</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L209</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>license</name><anchor>datasets.IterableDataset.license</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L213</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>size_in_bytes</name><anchor>datasets.IterableDataset.size_in_bytes</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L217</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>supervised_keys</name><anchor>datasets.IterableDataset.supervised_keys</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L221</source><parameters>[]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>version</name><anchor>datasets.IterableDataset.version</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/arrow_dataset.py#L225</source><parameters>[]</parameters></docstring> | |
| </div></div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.IterableColumn</name><anchor>datasets.IterableColumn</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/iterable_dataset.py#L2146</source><parameters>[{"name": "source", "val": ": typing.Union[ForwardRef('IterableDataset'), ForwardRef('IterableColumn')]"}, {"name": "column_name", "val": ": str"}]</parameters></docstring> | |
| An iterable for a specific column of an [IterableDataset](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset). | |
| Example: | |
| <ExampleCodeBlock anchor="datasets.IterableColumn.example"> | |
| Iterate on the texts of the "text" column of a dataset: | |
| ```python | |
| for text in dataset["text"]: | |
|     ... | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.IterableColumn.example-2"> | |
| It also works with nested columns: | |
| ```python | |
| for source in dataset["metadata"]["source"]: | |
|     ... | |
| ``` | |
| </ExampleCodeBlock> | |
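The idea of a lazy column view, including the nested access shown above, can be sketched in a few lines. `ColumnView` is a hypothetical stand-in that mirrors the `source` and `column_name` parameters of the signature; it is an illustration under that assumption, not the library's implementation.

```python
class ColumnView:
    """Lazily iterate over one column of an iterable of dict-like examples."""

    def __init__(self, source, column_name):
        self.source = source            # an iterable of examples or another ColumnView
        self.column_name = column_name

    def __iter__(self):
        # Nothing is materialized: values are produced one example at a time.
        for example in self.source:
            yield example[self.column_name]

    def __getitem__(self, column_name):
        # Nested access wraps this view in another view.
        return ColumnView(self, column_name)


dataset = [
    {"text": "first", "metadata": {"source": "web"}},
    {"text": "second", "metadata": {"source": "book"}},
]

texts = list(ColumnView(dataset, "text"))
sources = list(ColumnView(dataset, "metadata")["source"])
```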
| </div> | |
| ## IterableDatasetDict[[datasets.IterableDatasetDict]] | |
| Dictionary with split names as keys ('train', 'test' for example), and `IterableDataset` objects as values. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.IterableDatasetDict</name><anchor>datasets.IterableDatasetDict</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L1986</source><parameters>""</parameters></docstring> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>map</name><anchor>datasets.IterableDatasetDict.map</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L2087</source><parameters>[{"name": "function", "val": ": typing.Optional[typing.Callable] = None"}, {"name": "with_indices", "val": ": bool = False"}, {"name": "with_split", "val": ": bool = False"}, {"name": "input_columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "batched", "val": ": bool = False"}, {"name": "batch_size", "val": ": int = 1000"}, {"name": "drop_last_batch", "val": ": bool = False"}, {"name": "remove_columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "fn_kwargs", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **function** (`Callable`, *optional*, defaults to `None`) -- | |
| Function applied on-the-fly on the examples when you iterate on the dataset. | |
| It must have one of the following signatures: | |
| - `function(example: Dict[str, Any]) -> Dict[str, Any]` if `batched=False` and `with_indices=False` | |
| - `function(example: Dict[str, Any], idx: int) -> Dict[str, Any]` if `batched=False` and `with_indices=True` | |
| - `function(batch: Dict[str, list]) -> Dict[str, list]` if `batched=True` and `with_indices=False` | |
| - `function(batch: Dict[str, list], indices: list[int]) -> Dict[str, list]` if `batched=True` and `with_indices=True` | |
| For advanced usage, the function can also return a `pyarrow.Table`. | |
| If the function is asynchronous, then `map` will run your function in parallel. | |
| Moreover if your function returns nothing (`None`), then `map` will run your function and return the dataset unchanged. | |
| If no function is provided, default to identity function: `lambda x: x`. | |
| - **with_indices** (`bool`, defaults to `False`) -- | |
| Provide example indices to `function`. Note that in this case the signature of `function` should be `def function(example, idx[, rank]): ...`. | |
| - **input_columns** (`str` or `list[str]`, *optional*, defaults to `None`) -- | |
| The columns to be passed into `function` | |
| as positional arguments. If `None`, a dict mapping to all formatted columns is passed as one argument. | |
| - **batched** (`bool`, defaults to `False`) -- | |
| Provide batch of examples to `function`. | |
| - **batch_size** (`int`, *optional*, defaults to `1000`) -- | |
| Number of examples per batch provided to `function` if `batched=True`. | |
| - **drop_last_batch** (`bool`, defaults to `False`) -- | |
| Whether a last batch smaller than the `batch_size` should be | |
| dropped instead of being processed by the function. | |
| - **remove_columns** (`str` or `list[str]`, *optional*, defaults to `None`) -- | |
| Remove a selection of columns while doing the mapping. | |
| Columns will be removed before updating the examples with the output of `function`, i.e. if `function` is adding | |
| columns with names in `remove_columns`, these columns will be kept. | |
| - **fn_kwargs** (`Dict`, *optional*, defaults to `None`) -- | |
| Keyword arguments to be passed to `function`</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Apply a function to all the examples in the iterable dataset (individually or in batches) and update them. | |
| If your function returns a column that already exists, then it overwrites it. | |
| The function is applied on-the-fly on the examples when iterating over the dataset. | |
| The transformation is applied to all the datasets of the dataset dictionary. | |
| You can specify whether the function should be batched or not with the `batched` parameter: | |
| - If batched is `False`, then the function takes 1 example in and should return 1 example. | |
| An example is a dictionary, e.g. `{"text": "Hello there !"}`. | |
| - If batched is `True` and `batch_size` is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. | |
| A batch is a dictionary, e.g. a batch of 1 example is `{"text": ["Hello there !"]}`. | |
| - If batched is `True` and `batch_size` is `n` > 1, then the function takes a batch of `n` examples as input and can return a batch with `n` examples, or with an arbitrary number of examples. | |
| Note that the last batch may have fewer than `n` examples. | |
| A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`. | |
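The batched contract in the list above (a batch is a dictionary of lists, and the function may return a different number of examples than it received) can be sketched without the library. The helpers `iter_batches` and `map_batched` are hypothetical, and for simplicity the sketch keeps only the columns returned by the function.

```python
from itertools import islice


def iter_batches(examples, batch_size):
    """Group examples into columnar batches: {"text": [..., ...]}."""
    iterator = iter(examples)
    while True:
        rows = list(islice(iterator, batch_size))
        if not rows:
            return
        yield {key: [row[key] for row in rows] for key in rows[0]}


def map_batched(examples, function, batch_size):
    """Apply `function` to each batch and yield the resulting examples one by one."""
    for batch in iter_batches(examples, batch_size):
        output = function(batch)
        n_out = len(next(iter(output.values())))
        for i in range(n_out):
            yield {key: values[i] for key, values in output.items()}


def split_words(batch):
    # Returns more examples than it receives: one per word.
    return {"word": [word for text in batch["text"] for word in text.split()]}


examples = [{"text": "Hello there !"}, {"text": "General Kenobi"}, {"text": "Hi"}]
words = list(map_batched(examples, split_words, batch_size=2))
```

With `batch_size=2`, the last batch holds a single example, which matches the note that the final batch may be smaller than `n`.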
| If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simultaneous calls. | |
| It is recommended to use an `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time. | |
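A minimal sketch of the semaphore pattern, using only `asyncio`: the runner `run_with_limit` is a hypothetical stand-in for whatever schedules the calls, and `asyncio.sleep` stands in for real I/O such as a request to a remote service. The inner coroutine `process` plays the role of the asynchronous function you would pass to `map`.

```python
import asyncio


async def run_with_limit(examples, max_concurrency):
    """Run `process` on every example with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    running = 0
    peak = 0

    async def process(example):
        nonlocal running, peak
        async with semaphore:  # waits here once the limit is reached
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # placeholder for real asynchronous work
            running -= 1
            return {"text": "Review: " + example["text"]}

    # gather preserves the order of the inputs in its results.
    results = await asyncio.gather(*(process(example) for example in examples))
    return results, peak


examples = [{"text": f"review {i}"} for i in range(10)]
results, peak = asyncio.run(run_with_limit(examples, max_concurrency=3))
```

The semaphore lives in the user function's scope, so the limit applies no matter how many calls are scheduled simultaneously.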
| <ExampleCodeBlock anchor="datasets.IterableDatasetDict.map.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) | |
| >>> def add_prefix(example): | |
| ...     example["text"] = "Review: " + example["text"] | |
| ...     return example | |
| >>> ds = ds.map(add_prefix) | |
| >>> next(iter(ds["train"])) | |
| {'label': 1, | |
| 'text': 'Review: the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>filter</name><anchor>datasets.IterableDatasetDict.filter</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L2187</source><parameters>[{"name": "function", "val": ": typing.Optional[typing.Callable] = None"}, {"name": "with_indices", "val": " = False"}, {"name": "input_columns", "val": ": typing.Union[str, list[str], NoneType] = None"}, {"name": "batched", "val": ": bool = False"}, {"name": "batch_size", "val": ": typing.Optional[int] = 1000"}, {"name": "fn_kwargs", "val": ": typing.Optional[dict] = None"}]</parameters><paramsdesc>- **function** (`Callable`) -- | |
| Callable with one of the following signatures: | |
| - `function(example: Dict[str, Any]) -> bool` if `with_indices=False, batched=False` | |
| - `function(example: Dict[str, Any], indices: int) -> bool` if `with_indices=True, batched=False` | |
| - `function(example: Dict[str, list]) -> list[bool]` if `with_indices=False, batched=True` | |
| - `function(example: Dict[str, list], indices: list[int]) -> list[bool]` if `with_indices=True, batched=True` | |
| If no function is provided, defaults to an always True function: `lambda x: True`. | |
| - **with_indices** (`bool`, defaults to `False`) -- | |
| Provide example indices to `function`. Note that in this case the signature of `function` should be `def function(example, idx): ...`. | |
| - **input_columns** (`str` or `list[str]`, *optional*) -- | |
| The columns to be passed into `function` as | |
| positional arguments. If `None`, a dict mapping to all formatted columns is passed as one argument. | |
| - **batched** (`bool`, defaults to `False`) -- | |
| Provide batch of examples to `function` | |
| - **batch_size** (`int`, *optional*, defaults to `1000`) -- | |
| Number of examples per batch provided to `function` if `batched=True`. | |
| - **fn_kwargs** (`Dict`, *optional*, defaults to `None`) -- | |
| Keyword arguments to be passed to `function`</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Apply a filter function to all the elements so that the dataset only includes the examples for which the function returns `True`. | |
| The filtering is done on-the-fly when iterating over the dataset. | |
| The filtering is applied to all the datasets of the dataset dictionary. | |
| <ExampleCodeBlock anchor="datasets.IterableDatasetDict.filter.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) | |
| >>> ds = ds.filter(lambda x: x["label"] == 0) | |
| >>> list(ds["train"].take(3)) | |
| [{'label': 0, 'text': 'simplistic , silly and tedious .'}, | |
| {'label': 0, | |
| 'text': "it's so laddish and juvenile , only teenage boys could possibly find it funny ."}, | |
| {'label': 0, | |
| 'text': 'exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .'}] | |
| ``` | |
| </ExampleCodeBlock> | |
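For the batched signatures listed in the parameters, the function returns one boolean per example and the batch is masked accordingly. The sketch below shows that masking step on a single batch; `apply_mask` is a hypothetical helper, not a `datasets` function.

```python
def apply_mask(batch, mask_function):
    """Keep only the rows of a columnar batch for which the mask is True."""
    mask = mask_function(batch)  # list[bool], one entry per example
    return {
        key: [value for value, keep in zip(values, mask) if keep]
        for key, values in batch.items()
    }


batch = {"label": [0, 1, 0], "text": ["bad", "good", "dull"]}
kept = apply_mask(batch, lambda b: [label == 0 for label in b["label"]])
```

Computing the mask once per batch is what makes batched filtering cheaper than calling the function on every example separately.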
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>shuffle</name><anchor>datasets.IterableDatasetDict.shuffle</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L2250</source><parameters>[{"name": "seed", "val": " = None"}, {"name": "generator", "val": ": typing.Optional[numpy.random._generator.Generator] = None"}, {"name": "buffer_size", "val": ": int = 1000"}]</parameters><paramsdesc>- **seed** (`int`, *optional*, defaults to `None`) -- | |
| Random seed that will be used to shuffle the dataset. | |
| It is used to sample from the shuffle buffer and also to shuffle the data shards. | |
| - **generator** (`numpy.random.Generator`, *optional*) -- | |
| Numpy random Generator to use to compute the permutation of the dataset rows. | |
| If `generator=None` (default), uses `np.random.default_rng` (the default BitGenerator (PCG64) of NumPy). | |
| - **buffer_size** (`int`, defaults to `1000`) -- | |
| Size of the buffer.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Randomly shuffles the elements of this dataset. | |
| The shuffling is applied to all the datasets of the dataset dictionary. | |
| This dataset fills a buffer with `buffer_size` elements, then randomly samples elements from this buffer, | |
| replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or | |
| equal to the full size of the dataset is required. | |
| For instance, if your dataset contains 10,000 elements but `buffer_size` is set to 1000, then `shuffle` will | |
| initially select a random element from only the first 1000 elements in the buffer. Once an element is | |
| selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, | |
| maintaining the 1000 element buffer. | |
| If the dataset is made of several shards, the order of the shards is shuffled as well. | |
| However, if the order has been fixed by using [skip()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset.skip) or [take()](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDataset.take), | |
| the order of the shards is kept unchanged. | |
| <ExampleCodeBlock anchor="datasets.IterableDatasetDict.shuffle.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) | |
| >>> list(ds["train"].take(3)) | |
| [{'label': 1, | |
| 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, | |
| {'label': 1, | |
| 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, | |
| {'label': 1, 'text': 'effective but too-tepid biopic'}] | |
| >>> ds = ds.shuffle(seed=42) | |
| >>> list(ds["train"].take(3)) | |
| [{'label': 1, | |
| 'text': "a sports movie with action that's exciting on the field and a story you care about off it ."}, | |
| {'label': 1, | |
| 'text': 'at its best , the good girl is a refreshingly adult take on adultery . . .'}, | |
| {'label': 1, | |
| 'text': "sam jones became a very lucky filmmaker the day wilco got dropped from their record label , proving that one man's ruin may be another's fortune ."}] | |
| ``` | |
| </ExampleCodeBlock> | |
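The buffer mechanism described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `buffer_shuffle` is hypothetical, and it uses the standard `random` module instead of a NumPy generator.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffer_shuffle(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items in approximately shuffled order using a fixed-size buffer.

    Hypothetical helper for illustration; assumes buffer_size >= 1.
    """
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill phase: keep reading until the buffer holds buffer_size elements.
            buffer.append(item)
            continue
        # Steady state: emit a random buffered element, then reuse its slot
        # for the element that was just read from the source.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # The source is exhausted: drain the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer


# 10,000 elements with a 1000-element buffer, as in the description above.
shuffled = list(buffer_shuffle(range(10_000), buffer_size=1000, seed=42))
```

With these numbers, the first element yielded always comes from the first 1000 source elements, and the k-th element yielded can never come from beyond the first 1000 + k source elements. This is why a buffer smaller than the dataset gives only an approximate shuffle.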
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>with_format</name><anchor>datasets.IterableDatasetDict.with_format</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L2041</source><parameters>[{"name": "type", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **type** (`str`, *optional*) -- | |
| Output type, one of `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. | |
| `None` means it returns python objects (default).</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Return a dataset with the specified format. | |
| <ExampleCodeBlock anchor="datasets.IterableDatasetDict.with_format.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoTokenizer | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation", streaming=True) | |
| >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") | |
| >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) | |
| >>> ds = ds.with_format("torch") | |
| >>> next(iter(ds)) | |
| {'text': 'compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .', | |
| 'label': tensor(1), | |
| 'input_ids': tensor([ 101, 18027, 16310, 16001, 1103, 9321, 178, 11604, 7235, 6617, | |
| 1742, 2165, 2820, 1206, 6588, 22572, 12937, 1811, 2153, 1105, | |
| 1147, 12890, 19587, 6463, 1105, 15026, 1482, 119, 102, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0]), | |
| 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), | |
| 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, | |
| 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast</name><anchor>datasets.IterableDatasetDict.cast</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L2458</source><parameters>[{"name": "features", "val": ": Features"}]</parameters><paramsdesc>- **features** (`Features`) -- | |
| New features to cast the dataset to. | |
| The name of the fields in the features must match the current column names. | |
| The type of the data must also be convertible from one type to the other. | |
| For non-trivial conversions, e.g. `string` <-> `ClassLabel`, you should use `map` to update the dataset.</paramsdesc><paramgroups>0</paramgroups><rettype>[IterableDatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDatasetDict)</rettype><retdesc>A copy of the dataset with cast features.</retdesc></docstring> | |
| Cast the dataset to a new set of features. | |
| The type casting is applied to all the datasets of the dataset dictionary. | |
| <ExampleCodeBlock anchor="datasets.IterableDatasetDict.cast.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset, ClassLabel, Value | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) | |
| >>> ds["train"].features | |
| {'label': ClassLabel(names=['neg', 'pos']), | |
| 'text': Value('string')} | |
| >>> new_features = ds["train"].features.copy() | |
| >>> new_features['label'] = ClassLabel(names=['bad', 'good']) | |
| >>> new_features['text'] = Value('large_string') | |
| >>> ds = ds.cast(new_features) | |
| >>> ds["train"].features | |
| {'label': ClassLabel(names=['bad', 'good']), | |
| 'text': Value('large_string')} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast_column</name><anchor>datasets.IterableDatasetDict.cast_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L2427</source><parameters>[{"name": "column", "val": ": str"}, {"name": "feature", "val": ": typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.LargeList, datasets.features.features.List, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image, datasets.features.video.Video, datasets.features.pdf.Pdf, datasets.features.nifti.Nifti, datasets.features.dicom.Dicom]"}]</parameters><paramsdesc>- **column** (`str`) -- | |
| Column name. | |
| - **feature** (`Feature`) -- | |
| Target feature.</paramsdesc><paramgroups>0</paramgroups><retdesc>[IterableDatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDatasetDict)</retdesc></docstring> | |
| Cast column to feature for decoding. | |
| The type casting is applied to all the datasets of the dataset dictionary. | |
| <ExampleCodeBlock anchor="datasets.IterableDatasetDict.cast_column.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset, ClassLabel | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) | |
| >>> ds["train"].features | |
| {'label': ClassLabel(names=['neg', 'pos']), | |
| 'text': Value('string')} | |
| >>> ds = ds.cast_column('label', ClassLabel(names=['bad', 'good'])) | |
| >>> ds["train"].features | |
| {'label': ClassLabel(names=['bad', 'good']), | |
| 'text': Value('string')} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>remove_columns</name><anchor>datasets.IterableDatasetDict.remove_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L2375</source><parameters>[{"name": "column_names", "val": ": typing.Union[str, list[str]]"}]</parameters><paramsdesc>- **column_names** (`Union[str, list[str]]`) -- | |
| Name of the column(s) to remove.</paramsdesc><paramgroups>0</paramgroups><rettype>[IterableDatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDatasetDict)</rettype><retdesc>A copy of the dataset object without the columns to remove.</retdesc></docstring> | |
| Remove one or several column(s) in the dataset and the features associated with them. | |
| The removal is done on-the-fly on the examples when iterating over the dataset. | |
| The removal is applied to all the datasets of the dataset dictionary. | |
| <ExampleCodeBlock anchor="datasets.IterableDatasetDict.remove_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) | |
| >>> ds = ds.remove_columns("label") | |
| >>> next(iter(ds["train"])) | |
| {'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'} | |
| ``` | |
| </ExampleCodeBlock> | |
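To illustrate what "on-the-fly" means here, the following is a minimal sketch in plain Python, assuming each split is just an iterable of row dictionaries. The helper `drop_columns` and the toy `splits` mapping are hypothetical stand-ins, not part of the library.

```python
from typing import Any, Dict, Iterable, Iterator, List, Union

Example = Dict[str, Any]


def drop_columns(examples: Iterable[Example], column_names: Union[str, List[str]]) -> Iterator[Example]:
    """Lazily yield a copy of each example without the given columns."""
    to_drop = {column_names} if isinstance(column_names, str) else set(column_names)
    for example in examples:
        # The filtering happens only when the consumer asks for the next row.
        yield {key: value for key, value in example.items() if key not in to_drop}


# A stand-in for a streamed dataset dictionary: split name -> iterable of rows.
splits = {
    "train": [{"text": "good movie", "label": 1}, {"text": "bad movie", "label": 0}],
    "test": [{"text": "fine movie", "label": 1}],
}

# Apply the same removal to every split; no row is touched until iteration starts.
stripped = {name: drop_columns(rows, "label") for name, rows in splits.items()}
first_train = next(stripped["train"])
```

Because the work is deferred to iteration time, the source rows are left untouched and no intermediate copy of the dataset is materialized.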
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>rename_column</name><anchor>datasets.IterableDatasetDict.rename_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L2311</source><parameters>[{"name": "original_column_name", "val": ": str"}, {"name": "new_column_name", "val": ": str"}]</parameters><paramsdesc>- **original_column_name** (`str`) -- | |
| Name of the column to rename. | |
| - **new_column_name** (`str`) -- | |
| New name for the column.</paramsdesc><paramgroups>0</paramgroups><rettype>[IterableDatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDatasetDict)</rettype><retdesc>A copy of the dataset with a renamed column.</retdesc></docstring> | |
| Rename a column in the dataset, and move the features associated with the original column under the new column | |
| name. | |
| The renaming is applied to all the datasets of the dataset dictionary. | |
| <ExampleCodeBlock anchor="datasets.IterableDatasetDict.rename_column.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) | |
| >>> ds = ds.rename_column("text", "movie_review") | |
| >>> next(iter(ds["train"])) | |
| {'label': 1, | |
| 'movie_review': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>rename_columns</name><anchor>datasets.IterableDatasetDict.rename_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L2347</source><parameters>[{"name": "column_mapping", "val": ": dict"}]</parameters><paramsdesc>- **column_mapping** (`Dict[str, str]`) -- | |
| A mapping of columns to rename to their new names.</paramsdesc><paramgroups>0</paramgroups><rettype>[IterableDatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDatasetDict)</rettype><retdesc>A copy of the dataset with renamed columns</retdesc></docstring> | |
| Rename several columns in the dataset, and move the features associated with the original columns under | |
| the new column names. | |
| The renaming is applied to all the datasets of the dataset dictionary. | |
| <ExampleCodeBlock anchor="datasets.IterableDatasetDict.rename_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) | |
| >>> ds = ds.rename_columns({"text": "movie_review", "label": "rating"}) | |
| >>> next(iter(ds["train"])) | |
| {'movie_review': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', | |
| 'rating': 1} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>select_columns</name><anchor>datasets.IterableDatasetDict.select_columns</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L2401</source><parameters>[{"name": "column_names", "val": ": typing.Union[str, list[str]]"}]</parameters><paramsdesc>- **column_names** (`Union[str, list[str]]`) -- | |
| Name of the column(s) to keep.</paramsdesc><paramgroups>0</paramgroups><rettype>[IterableDatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDatasetDict)</rettype><retdesc>A copy of the dataset object with only selected columns.</retdesc></docstring> | |
| Select one or several column(s) in the dataset and the features | |
| associated with them. The selection is done on-the-fly on the examples | |
| when iterating over the dataset. The selection is applied to all the | |
| datasets of the dataset dictionary. | |
| <ExampleCodeBlock anchor="datasets.IterableDatasetDict.select_columns.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) | |
| >>> ds = ds.select_columns("text") | |
| >>> next(iter(ds["train"])) | |
| {'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>push_to_hub</name><anchor>datasets.IterableDatasetDict.push_to_hub</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/dataset_dict.py#L2495</source><parameters>[{"name": "repo_id", "val": ""}, {"name": "config_name", "val": ": str = 'default'"}, {"name": "set_default", "val": ": typing.Optional[bool] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "commit_message", "val": ": typing.Optional[str] = None"}, {"name": "commit_description", "val": ": typing.Optional[str] = None"}, {"name": "private", "val": ": typing.Optional[bool] = None"}, {"name": "token", "val": ": typing.Optional[str] = None"}, {"name": "revision", "val": ": typing.Optional[str] = None"}, {"name": "create_pr", "val": ": typing.Optional[bool] = False"}, {"name": "num_shards", "val": ": typing.Optional[dict[str, int]] = None"}, {"name": "embed_external_files", "val": ": bool = True"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}]</parameters><paramsdesc>- **repo_id** (`str`) -- | |
| The ID of the repository to push to in the following format: `<user>/<dataset_name>` or | |
| `<org>/<dataset_name>`. Also accepts `<dataset_name>`, which will default to the namespace | |
| of the logged-in user. | |
| - **config_name** (`str`) -- | |
| Configuration name of a dataset. Defaults to "default". | |
| - **set_default** (`bool`, *optional*) -- | |
| Whether to set this configuration as the default one. Otherwise, the default configuration is the one | |
| named "default". | |
| - **data_dir** (`str`, *optional*) -- | |
| Directory name that will contain the uploaded data files. Defaults to the `config_name` if different | |
| from "default", else "data". | |
| <Added version="2.17.0"/> | |
| - **commit_message** (`str`, *optional*) -- | |
| Message to commit while pushing. Will default to `"Upload dataset"`. | |
| - **commit_description** (`str`, *optional*) -- | |
| Description of the commit that will be created. | |
| Additionally, description of the PR if a PR is created (`create_pr` is True). | |
| <Added version="2.16.0"/> | |
| - **private** (`bool`, *optional*) -- | |
| Whether to make the repo private. If `None` (default), the repo will be public unless the | |
| organization's default is private. This value is ignored if the repo already exists. | |
| - **token** (`str`, *optional*) -- | |
| An optional authentication token for the Hugging Face Hub. If no token is passed, will default | |
| to the token saved locally when logging in with `huggingface-cli login`. Will raise an error | |
| if no token is passed and the user is not logged-in. | |
| - **revision** (`str`, *optional*) -- | |
| Branch to push the uploaded files to. Defaults to the `"main"` branch. | |
| - **create_pr** (`bool`, *optional*, defaults to `False`) -- | |
| Whether to create a PR with the uploaded files or directly commit. | |
| - **num_shards** (`Dict[str, int]`, *optional*) -- | |
| Number of shards to write. Defaults to this dataset's `.num_shards`. | |
| Use a dictionary to define a different num_shards for each split. | |
| - **embed_external_files** (`bool`, defaults to `True`) -- | |
| Whether to embed file bytes in the shards. | |
| In particular, this will do the following before the push for the fields of type: | |
| - [Audio](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Audio) and [Image](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Image): removes local path information and embeds the file content in the Parquet files. | |
| - **num_proc** (`int`, *optional*, defaults to `None`) -- | |
| Number of processes when preparing and uploading the dataset. | |
| This is helpful if the dataset is made of many samples or media files to embed. | |
| Multiprocessing is disabled by default. | |
| <Added version="4.0.0"/></paramsdesc><paramgroups>0</paramgroups><retdesc>huggingface_hub.CommitInfo</retdesc></docstring> | |
| Pushes the [IterableDatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDatasetDict) to the Hub as a Parquet dataset. | |
| The [IterableDatasetDict](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.IterableDatasetDict) is pushed using HTTP requests and requires neither git nor git-lfs to be installed. | |
| Each dataset split will be pushed independently. The pushed dataset will keep the original split names. | |
| The resulting Parquet files are self-contained by default: if your dataset contains [Image](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Image) or [Audio](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Audio) | |
| data, the Parquet files will store the bytes of your images or audio files. | |
| You can disable this by setting `embed_external_files` to False. | |
| <ExampleCodeBlock anchor="datasets.IterableDatasetDict.push_to_hub.example"> | |
| Example: | |
| ```python | |
| >>> dataset_dict.push_to_hub("<organization>/<dataset_id>") | |
| >>> dataset_dict.push_to_hub("<organization>/<dataset_id>", private=True) | |
| >>> dataset_dict.push_to_hub("<organization>/<dataset_id>", num_shards={"train": 1024, "test": 8}) | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="datasets.IterableDatasetDict.push_to_hub.example-2"> | |
| If you want to add a new configuration (or subset) to a dataset (e.g. if the dataset has multiple tasks/versions/languages): | |
| ```python | |
| >>> english_dataset.push_to_hub("<organization>/<dataset_id>", "en") | |
| >>> french_dataset.push_to_hub("<organization>/<dataset_id>", "fr") | |
| >>> # later | |
| >>> english_dataset = load_dataset("<organization>/<dataset_id>", "en") | |
| >>> french_dataset = load_dataset("<organization>/<dataset_id>", "fr") | |
| ``` | |
| </ExampleCodeBlock> | |
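The default for `data_dir` described in the parameter list can be written as a small rule. The sketch below only restates that documented rule; the function name `resolve_data_dir` is hypothetical and is not part of the library's public API.

```python
from typing import Optional


def resolve_data_dir(config_name: str = "default", data_dir: Optional[str] = None) -> str:
    """Return the directory name for uploaded files under the documented default rule."""
    if data_dir is not None:
        # An explicitly passed data_dir always wins.
        return data_dir
    # Otherwise use the config name, except for the "default" config, which maps to "data".
    return config_name if config_name != "default" else "data"


default_dir = resolve_data_dir()
english_dir = resolve_data_dir("en")
custom_dir = resolve_data_dir("en", data_dir="custom")
```

Under this rule, pushing the `"en"` and `"fr"` configurations as in the example above would place their files in separate `en` and `fr` directories, while a dataset with only the default configuration would use `data`.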
| </div></div> | |
| ## Features[[datasets.Features]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Features</name><anchor>datasets.Features</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L1743</source><parameters>[{"name": "*args", "val": ""}, {"name": "**kwargs", "val": ""}]</parameters></docstring> | |
| A special dictionary that defines the internal structure of a dataset. | |
| Instantiated with a dictionary of type `dict[str, FieldType]`, where keys are the desired column names, | |
| and values are the type of that column. | |
| `FieldType` can be one of the following: | |
| - [Value](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Value) feature specifies a single data type value, e.g. `int64` or `string`. | |
| - [ClassLabel](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.ClassLabel) feature specifies a predefined set of classes which can have labels associated with them and | |
| will be stored as integers in the dataset. | |
| - Python `dict` specifies a composite feature containing a mapping of sub-fields to sub-features. | |
| It's possible to have nested fields of nested fields in an arbitrary manner. | |
| - [List](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.List) or [LargeList](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.LargeList) specifies a composite feature containing a sequence of | |
| sub-features, all of the same feature type. | |
| - [Array2D](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Array2D), [Array3D](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Array3D), [Array4D](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Array4D) or [Array5D](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Array5D) feature for multidimensional arrays. | |
| - [Audio](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Audio) feature to store the absolute path to an audio file or a dictionary with the relative path | |
| to an audio file ("path" key) and its bytes content ("bytes" key). | |
| This feature loads the audio lazily with a decoder. | |
| - [Image](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Image) feature to store the absolute path to an image file, an `np.ndarray` object, a `PIL.Image.Image` object | |
| or a dictionary with the relative path to an image file ("path" key) and its bytes content ("bytes" key). | |
| This feature extracts the image data. | |
| - [Video](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Video) feature to store the absolute path to a video file, a `torchcodec.decoders.VideoDecoder` object | |
| or a dictionary with the relative path to a video file ("path" key) and its bytes content ("bytes" key). | |
| This feature loads the video lazily with a decoder. | |
| - [Pdf](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Pdf) feature to store the absolute path to a PDF file, a `pdfplumber.pdf.PDF` object | |
| or a dictionary with the relative path to a PDF file ("path" key) and its bytes content ("bytes" key). | |
| This feature loads the PDF lazily with a PDF reader. | |
| - [Nifti](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Nifti) feature to store the absolute path to a NIfTI neuroimaging file, a `nibabel.Nifti1Image` object | |
| or a dictionary with the relative path to a NIfTI file ("path" key) and its bytes content ("bytes" key). | |
| This feature loads the NIfTI file lazily with nibabel. | |
| - [Dicom](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Dicom) feature to store the absolute path to a DICOM medical imaging file, a `pydicom.dataset.FileDataset` object | |
| or a dictionary with the relative path to a DICOM file ("path" key) and its bytes content ("bytes" key). | |
| This feature loads the DICOM file lazily with pydicom. | |
| - [Translation](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Translation) or [TranslationVariableLanguages](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.TranslationVariableLanguages) feature specific to Machine Translation. | |
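As an illustration of the "stored as integers" behaviour noted for class labels in the list above, here is a minimal plain-Python sketch. `SimpleClassLabel` is a hypothetical stand-in written for this example; it is not the library's `ClassLabel` class and only mimics the idea of a name-to-integer mapping.

```python
from typing import Dict, List


class SimpleClassLabel:
    """Hypothetical stand-in: maps class names to integer ids and back."""

    def __init__(self, names: List[str]) -> None:
        self.names = list(names)
        self._ids: Dict[str, int] = {name: index for index, name in enumerate(self.names)}

    def str2int(self, name: str) -> int:
        # The position of a name in `names` is the integer that gets stored.
        return self._ids[name]

    def int2str(self, index: int) -> str:
        return self.names[index]


label = SimpleClassLabel(names=["neg", "pos"])
encoded = [label.str2int(name) for name in ["pos", "neg", "pos"]]

# Renaming the classes changes only how the stored integers are displayed.
relabeled = SimpleClassLabel(names=["bad", "good"])
decoded = [relabeled.int2str(index) for index in encoded]
```

Because only integers are stored, swapping the names (for example from `neg`/`pos` to `bad`/`good`) leaves the underlying data unchanged.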
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>copy</name><anchor>datasets.Features.copy</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L2172</source><parameters>[]</parameters><retdesc>[Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features)</retdesc></docstring> | |
| Make a deep copy of [Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features). | |
| <ExampleCodeBlock anchor="datasets.Features.copy.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") | |
| >>> copy_of_features = ds.features.copy() | |
| >>> copy_of_features | |
| {'label': ClassLabel(names=['neg', 'pos']), | |
| 'text': Value('string')} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>decode_batch</name><anchor>datasets.Features.decode_batch</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L2145</source><parameters>[{"name": "batch", "val": ": dict"}, {"name": "token_per_repo_id", "val": ": typing.Optional[dict[str, typing.Union[str, bool, NoneType]]] = None"}]</parameters><paramsdesc>- **batch** (`dict[str, list[Any]]`) -- | |
| Dataset batch data. | |
| - **token_per_repo_id** (`dict`, *optional*) -- | |
| To access and decode audio or image files from private repositories on the Hub, you can pass | |
| a dictionary `repo_id (str) -> token (bool or str)`.</paramsdesc><paramgroups>0</paramgroups><retdesc>`dict[str, list[Any]]`</retdesc></docstring> | |
| Decode batch with custom feature decoding. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>decode_column</name><anchor>datasets.Features.decode_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L2120</source><parameters>[{"name": "column", "val": ": list"}, {"name": "column_name", "val": ": str"}, {"name": "token_per_repo_id", "val": ": typing.Optional[dict[str, typing.Union[str, bool, NoneType]]] = None"}]</parameters><paramsdesc>- **column** (`list[Any]`) -- | |
| Dataset column data. | |
| - **column_name** (`str`) -- | |
| Dataset column name.</paramsdesc><paramgroups>0</paramgroups><retdesc>`list[Any]`</retdesc></docstring> | |
| Decode column with custom feature decoding. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>decode_example</name><anchor>datasets.Features.decode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L2097</source><parameters>[{"name": "example", "val": ": dict"}, {"name": "token_per_repo_id", "val": ": typing.Optional[dict[str, typing.Union[str, bool, NoneType]]] = None"}]</parameters><paramsdesc>- **example** (`dict[str, Any]`) -- | |
| Dataset row data. | |
| - **token_per_repo_id** (`dict`, *optional*) -- | |
| To access and decode audio or image files from private repositories on the Hub, you can pass | |
| a dictionary `repo_id (str) -> token (bool or str)`.</paramsdesc><paramgroups>0</paramgroups><retdesc>`dict[str, Any]`</retdesc></docstring> | |
| Decode example with custom feature decoding. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_batch</name><anchor>datasets.Features.encode_batch</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L2078</source><parameters>[{"name": "batch", "val": ""}]</parameters><paramsdesc>- **batch** (`dict[str, list[Any]]`) -- | |
| Data in a Dataset batch.</paramsdesc><paramgroups>0</paramgroups><retdesc>`dict[str, list[Any]]`</retdesc></docstring> | |
| Encode batch into a format for Arrow. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_column</name><anchor>datasets.Features.encode_column</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L2062</source><parameters>[{"name": "column", "val": ""}, {"name": "column_name", "val": ": str"}]</parameters><paramsdesc>- **column** (`list[Any]`) -- | |
| Data in a Dataset column. | |
| - **column_name** (`str`) -- | |
| Dataset column name.</paramsdesc><paramgroups>0</paramgroups><retdesc>`list[Any]`</retdesc></docstring> | |
| Encode column into a format for Arrow. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_example</name><anchor>datasets.Features.encode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L2048</source><parameters>[{"name": "example", "val": ""}]</parameters><paramsdesc>- **example** (`dict[str, Any]`) -- | |
| Data in a Dataset row.</paramsdesc><paramgroups>0</paramgroups><retdesc>`dict[str, Any]`</retdesc></docstring> | |
| Encode example into a format for Arrow. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>flatten</name><anchor>datasets.Features.flatten</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L2245</source><parameters>[{"name": "max_depth", "val": " = 16"}]</parameters><rettype>[Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features)</rettype><retdesc>The flattened features.</retdesc></docstring> | |
| Flatten the features. Every dictionary column is removed and is replaced by | |
| all the subfields it contains. The new fields are named by concatenating the | |
| name of the original column and the subfield name like this: `<original>.<subfield>`. | |
| If a column contains nested dictionaries, then all the lower-level subfields names are | |
| also concatenated to form new columns: `<original>.<subfield>.<subsubfield>`, etc. | |
| <ExampleCodeBlock anchor="datasets.Features.flatten.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("rajpurkar/squad", split="train") | |
| >>> ds.features.flatten() | |
| {'answers.answer_start': List(Value('int32'), id=None), | |
| 'answers.text': List(Value('string'), id=None), | |
| 'context': Value('string'), | |
| 'id': Value('string'), | |
| 'question': Value('string'), | |
| 'title': Value('string')} | |
| ``` | |
| </ExampleCodeBlock> | |
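The naming scheme used by flattening can be sketched on plain nested dictionaries. This is an illustration only: `flatten_columns` is a hypothetical helper, the type strings are placeholders rather than real feature objects, and the library's own method may treat other feature types differently.

```python
from typing import Any, Dict


def flatten_columns(features: Dict[str, Any], max_depth: int = 16) -> Dict[str, Any]:
    """Replace every dict-valued column with dotted `<original>.<subfield>` columns."""
    flat = dict(features)
    for _ in range(max_depth):
        expanded: Dict[str, Any] = {}
        changed = False
        for name, value in flat.items():
            if isinstance(value, dict):
                # Unroll one level of nesting per pass.
                for sub_name, sub_value in value.items():
                    expanded[f"{name}.{sub_name}"] = sub_value
                changed = True
            else:
                expanded[name] = value
        flat = expanded
        if not changed:
            # Nothing left to unroll, so stop before reaching max_depth.
            break
    return flat


nested = {
    "id": "string",
    "answers": {"text": "list<string>", "answer_start": "list<int32>"},
    "meta": {"source": {"url": "string"}},
}
flat = flatten_columns(nested)
```

Here `answers` becomes `answers.text` and `answers.answer_start`, and the doubly nested `meta` column becomes `meta.source.url`.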
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_arrow_schema</name><anchor>datasets.Features.from_arrow_schema</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L1831</source><parameters>[{"name": "pa_schema", "val": ": Schema"}]</parameters><paramsdesc>- **pa_schema** (`pyarrow.Schema`) -- | |
| Arrow Schema.</paramsdesc><paramgroups>0</paramgroups><retdesc>[Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features)</retdesc></docstring> | |
| Construct [Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features) from Arrow Schema. | |
| It also checks the schema metadata for Hugging Face Datasets features. | |
| Non-nullable fields are not supported and are set to nullable. | |
| Also, `pa.dictionary` is not supported; its underlying value type is used instead. | |
| As a result, `datasets` converts `DictionaryArray` objects to their actual values. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>from_dict</name><anchor>datasets.Features.from_dict</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L1865</source><parameters>[{"name": "dic", "val": ""}]</parameters><paramsdesc>- **dic** (*dict[str, Any]*) -- | |
| Python dictionary.</paramsdesc><paramgroups>0</paramgroups><rettype>*Features*</rettype></docstring> | |
| Construct [Features](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Features) from a dict. | |
| Regenerate the nested feature object from a deserialized dict. | |
| The `_type` key is used to infer the dataclass name of each feature (its `FieldType`). | |
| This allows for a convenient constructor syntax | |
| to define features from deserialized JSON dictionaries. This function is used in particular when deserializing | |
| a `DatasetInfo` that was dumped to a JSON object. It acts as an analogue to | |
| `Features.from_arrow_schema` and handles the recursive field-by-field instantiation, but doesn't require | |
| any mapping to/from pyarrow, except that it takes advantage of the mapping of pyarrow primitive | |
| dtypes that `Value` performs automatically. | |
| <ExampleCodeBlock anchor="datasets.Features.from_dict.example"> | |
| Example: | |
| ```python | |
| >>> from datasets import Features | |
| >>> Features.from_dict({'_type': {'dtype': 'string', 'id': None, '_type': 'Value'}}) | |
| {'_type': Value('string')} | |
| ``` | |
| </ExampleCodeBlock> | |
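The role of the `_type` key can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the `Value` dataclass below is a stand-in for `datasets.Value`, and `FEATURE_TYPES` and `rebuild_from_dict` are names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Value:  # stand-in for illustration, not datasets.Value
    dtype: str
    id: Optional[str] = None


# Assumed registry mapping a "_type" string to the class to instantiate.
FEATURE_TYPES = {"Value": Value}


def rebuild_from_dict(obj):
    """Recursively rebuild feature objects from a deserialized dict."""
    if isinstance(obj, list):
        return [rebuild_from_dict(item) for item in obj]
    if not isinstance(obj, dict):
        return obj
    type_name = obj.get("_type")
    # A dict whose "_type" is a known class name describes one feature;
    # any other dict is treated as a mapping of column names to features.
    if isinstance(type_name, str) and type_name in FEATURE_TYPES:
        kwargs = {key: value for key, value in obj.items() if key != "_type"}
        return FEATURE_TYPES[type_name](**kwargs)
    return {key: rebuild_from_dict(value) for key, value in obj.items()}


serialized = {
    "stars": {"dtype": "int32", "id": None, "_type": "Value"},
    "meta": {"lang": {"dtype": "string", "_type": "Value"}},
}
features = rebuild_from_dict(serialized)
print(features)
```

Because only a string `_type` naming a known class triggers instantiation, a column that happens to be called `_type` is still handled as an ordinary column in this sketch.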
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>reorder_fields_as</name><anchor>datasets.Features.reorder_fields_as</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L2192</source><parameters>[{"name": "other", "val": ": Features"}]</parameters><paramsdesc>- **other** ([*Features*]) -- | |
| The other [*Features*] to align with.</paramsdesc><paramgroups>0</paramgroups><retdesc>[*Features*]</retdesc></docstring> | |
| Reorder the fields of this `Features` object to match the field order of another `Features` object. | |
| The order of the fields is important since it matters for the underlying Arrow data. | |
| Reordering the fields makes the underlying Arrow data types match. | |
| <ExampleCodeBlock anchor="datasets.Features.reorder_fields_as.example"> | |
| Example: | |
| ```python | |
| >>> from datasets import Features, Value | |
| >>> # let's say we have two features with a different order of nested fields (for a and b for example) | |
| >>> f1 = Features({"root": {"a": Value("string"), "b": Value("string")}}) | |
| >>> f2 = Features({"root": {"b": Value("string"), "a": Value("string")}}) | |
| >>> assert f1.type != f2.type | |
| >>> # re-ordering keeps the base structure (here a dict at the root level), but makes the field order match | |
| >>> f1.reorder_fields_as(f2) | |
| {'root': {'b': Value('string'), 'a': Value('string')}} | |
| >>> assert f1.reorder_fields_as(f2).type == f2.type | |
| ``` | |
| </ExampleCodeBlock> | |
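The reordering idea itself is independent of Arrow. As a minimal sketch on plain nested dictionaries (the helper `reorder_like` is hypothetical and far simpler than the real method, which also handles list-like features):

```python
def reorder_like(source, target):
    """Return `source` with nested dict keys ordered as in `target` (hypothetical helper)."""
    if isinstance(source, dict) and isinstance(target, dict):
        if set(source) != set(target):
            raise ValueError(f"Keys differ: {sorted(source)} vs {sorted(target)}")
        # Walk the keys in the target's order and recurse into nested dicts.
        return {key: reorder_like(source[key], target[key]) for key in target}
    return source


f1 = {"root": {"a": "string", "b": "string"}}
f2 = {"root": {"b": "string", "a": "string"}}
reordered = reorder_like(f1, f2)
print(list(reordered["root"]))
```

The content of `f1` is unchanged; only the order of the nested keys now follows `f2`.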
| </div></div> | |
| ### Scalar[[datasets.Value]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Value</name><anchor>datasets.Value</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L486</source><parameters>[{"name": "dtype", "val": ": str"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **dtype** (`str`) -- | |
| Name of the data type.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Scalar feature value of a particular data type. | |
| The possible dtypes of `Value` are as follows: | |
| - `null` | |
| - `bool` | |
| - `int8` | |
| - `int16` | |
| - `int32` | |
| - `int64` | |
| - `uint8` | |
| - `uint16` | |
| - `uint32` | |
| - `uint64` | |
| - `float16` | |
| - `float32` (alias float) | |
| - `float64` (alias double) | |
| - `time32[(s|ms)]` | |
| - `time64[(us|ns)]` | |
| - `timestamp[(s|ms|us|ns)]` | |
| - `timestamp[(s|ms|us|ns), tz=(tzstring)]` | |
| - `date32` | |
| - `date64` | |
| - `duration[(s|ms|us|ns)]` | |
| - `decimal128(precision, scale)` | |
| - `decimal256(precision, scale)` | |
| - `binary` | |
| - `large_binary` | |
| - `binary_view` | |
| - `string` | |
| - `large_string` | |
| - `string_view` | |
| <ExampleCodeBlock anchor="datasets.Value.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import Features, Value | |
| >>> features = Features({'stars': Value('int32')}) | |
| >>> features | |
| {'stars': Value('int32')} | |
| ``` | |
| </ExampleCodeBlock> | |
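Several dtypes in the list above are parameterized strings. To make the notation concrete, here is a small sketch that splits the `timestamp[...]` and `decimal128/256(...)` forms into their parts. The `parse_dtype` helper and its regular expressions are assumptions written for this illustration; they are not how the library parses dtype strings.

```python
import re

# Patterns for two of the parameterized forms listed above (illustrative only).
TIMESTAMP_RE = re.compile(r"^timestamp\[(s|ms|us|ns)(?:,\s*tz=(.+))?\]$")
DECIMAL_RE = re.compile(r"^decimal(128|256)\((\d+),\s*(-?\d+)\)$")


def parse_dtype(dtype):
    """Split a dtype string into its components (hypothetical helper)."""
    match = TIMESTAMP_RE.match(dtype)
    if match:
        return {"kind": "timestamp", "unit": match.group(1), "tz": match.group(2)}
    match = DECIMAL_RE.match(dtype)
    if match:
        return {
            "kind": "decimal",
            "bits": int(match.group(1)),
            "precision": int(match.group(2)),
            "scale": int(match.group(3)),
        }
    # Plain names such as "int32" or "string" carry no parameters.
    return {"kind": dtype}


print(parse_dtype("timestamp[ms, tz=UTC]"))
print(parse_dtype("decimal128(10, 2)"))
```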
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.ClassLabel</name><anchor>datasets.ClassLabel</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L978</source><parameters>[{"name": "num_classes", "val": ": dataclasses.InitVar[typing.Optional[int]] = None"}, {"name": "names", "val": ": list = None"}, {"name": "names_file", "val": ": dataclasses.InitVar[typing.Optional[str]] = None"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **num_classes** (`int`, *optional*) -- | |
| Number of classes. All labels must be < `num_classes`. | |
| - **names** (`list` of `str`, *optional*) -- | |
| String names for the integer classes. | |
| The order in which the names are provided is kept. | |
| - **names_file** (`str`, *optional*) -- | |
| Path to a file with names for the integer classes, one per line.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Feature type for integer class labels. | |
| There are 3 ways to define a `ClassLabel`, which correspond to the 3 arguments: | |
| * `num_classes`: Create 0 to (num_classes-1) labels. | |
| * `names`: List of label strings. | |
| * `names_file`: File containing the list of labels. | |
| Under the hood the labels are stored as integers. | |
| You can use negative integers to represent unknown/missing labels. | |
| <ExampleCodeBlock anchor="datasets.ClassLabel.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import Features, ClassLabel | |
| >>> features = Features({'label': ClassLabel(num_classes=3, names=['bad', 'ok', 'good'])}) | |
| >>> features | |
| {'label': ClassLabel(names=['bad', 'ok', 'good'])} | |
| ``` | |
| </ExampleCodeBlock> | |
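The relationship between label names and their stored integers can be sketched as follows. `LabelCodec` is a hypothetical stand-in, not `datasets.ClassLabel`; in particular, generating the names `"0"`, `"1"`, ... from `num_classes` is an assumption of this sketch.

```python
class LabelCodec:
    """Hypothetical name <-> integer mapping (illustration, not datasets.ClassLabel)."""

    def __init__(self, num_classes=None, names=None):
        if names is None:
            if num_classes is None:
                raise ValueError("Provide num_classes or names")
            # Assumption: without names, labels are named after their index.
            names = [str(i) for i in range(num_classes)]
        elif num_classes is not None and num_classes != len(names):
            raise ValueError("num_classes does not match len(names)")
        self.names = list(names)  # the given order is kept
        self._str2int = {name: i for i, name in enumerate(self.names)}

    def str2int(self, name):
        return self._str2int[name]

    def int2str(self, value):
        # Negative or out-of-range integers have no class name.
        if not 0 <= value < len(self.names):
            raise ValueError(f"Invalid integer class label {value}")
        return self.names[value]


codec = LabelCodec(names=["bad", "ok", "good"])
print(codec.str2int("good"), codec.int2str(0))
```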
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast_storage</name><anchor>datasets.ClassLabel.cast_storage</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L1143</source><parameters>[{"name": "storage", "val": ": typing.Union[pyarrow.lib.StringArray, pyarrow.lib.IntegerArray]"}]</parameters><paramsdesc>- **storage** (`Union[pa.StringArray, pa.IntegerArray]`) -- | |
| PyArrow array to cast.</paramsdesc><paramgroups>0</paramgroups><rettype>`pa.Int64Array`</rettype><retdesc>Array in the `ClassLabel` arrow storage type.</retdesc></docstring> | |
| Cast an Arrow array to the `ClassLabel` arrow storage type. | |
| The Arrow types that can be converted to the `ClassLabel` pyarrow storage type are: | |
| - `pa.string()` | |
| - any integer type, such as `pa.int64()` | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>int2str</name><anchor>datasets.ClassLabel.int2str</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L1097</source><parameters>[{"name": "values", "val": ": typing.Union[int, collections.abc.Iterable]"}]</parameters></docstring> | |
| Conversion `integer` => class name `string`. | |
| Regarding unknown/missing labels: passing negative integers raises `ValueError`. | |
| <ExampleCodeBlock anchor="datasets.ClassLabel.int2str.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") | |
| >>> ds.features["label"].int2str(0) | |
| 'neg' | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>str2int</name><anchor>datasets.ClassLabel.str2int</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L1052</source><parameters>[{"name": "values", "val": ": typing.Union[str, collections.abc.Iterable]"}]</parameters></docstring> | |
| Conversion class name `string` => `integer`. | |
| <ExampleCodeBlock anchor="datasets.ClassLabel.str2int.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") | |
| >>> ds.features["label"].str2int('neg') | |
| 0 | |
| ``` | |
| </ExampleCodeBlock> | |
| </div></div> | |
| ### Composite[[datasets.LargeList]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.LargeList</name><anchor>datasets.LargeList</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L1237</source><parameters>[{"name": "feature", "val": ": typing.Any"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **feature** (`FeatureType`) -- | |
| Child feature data type of each item within the large list.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Feature type for large list data composed of child feature data type. | |
| It is backed by `pyarrow.LargeListType`, which is like `pyarrow.ListType` but with 64-bit rather than 32-bit offsets. | |
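The practical difference between 32-bit and 64-bit offsets is how many child items a single array can address. The sketch below computes Arrow-style offsets in plain Python; `list_offsets` and `fits_32bit_offsets` are hypothetical helpers for illustration and do not call pyarrow.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit offset


def list_offsets(rows):
    """Return offsets where offsets[i] is the start of row i in the flat child array."""
    offsets = [0]
    for row in rows:
        offsets.append(offsets[-1] + len(row))
    return offsets


def fits_32bit_offsets(total_child_items):
    """True if the last offset can be stored as a signed 32-bit integer."""
    return total_child_items <= INT32_MAX


offsets = list_offsets([[1, 2], [], [3, 4, 5]])
print(offsets)
print(fits_32bit_offsets(3_000_000_000))
```

When the total number of child items in one array can exceed the 32-bit limit, 64-bit offsets are the reason to reach for a large list type.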
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.List</name><anchor>datasets.List</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L1209</source><parameters>[{"name": "feature", "val": ": typing.Any"}, {"name": "length", "val": ": int = -1"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **feature** (`FeatureType`) -- | |
| Child feature data type of each item within the list. | |
| - **length** (`int`, *optional*, defaults to `-1`) -- | |
| Length of the list if it is fixed. | |
| Defaults to `-1`, which means an arbitrary length.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Feature type for list data composed of child feature data type. | |
| It is backed by `pyarrow.ListType`, which uses 32-bit offsets or a fixed length. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Sequence</name><anchor>datasets.Sequence</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L1175</source><parameters>[{"name": "feature", "val": " = None"}, {"name": "length", "val": " = -1"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **feature** (`FeatureType`) -- | |
| Child feature data type of each item within the list. | |
| - **length** (`int`, *optional*, defaults to `-1`) -- | |
| Length of the list if it is fixed. | |
| Defaults to -1 which means an arbitrary length.</paramsdesc><paramgroups>0</paramgroups><retdesc>[List](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.List) of the specified feature, except `dict` of sub-features | |
| which are converted to `dict` of lists of sub-features for compatibility with TFDS.</retdesc></docstring> | |
| A `Sequence` is a utility that automatically converts an internal dictionary feature into a dictionary of | |
| lists. This behavior is implemented to provide a compatibility layer with the TensorFlow Datasets library but may be | |
| unwanted in some cases. If you don't want this behavior, you can use a [List](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.List) or a [LargeList](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.LargeList) | |
| instead of the [Sequence](/docs/datasets/pr_7835/en/package_reference/main_classes#datasets.Sequence). | |
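The conversion described above, from a list of dictionaries to a dictionary of lists, looks like this on plain Python data. The helper `dicts_to_dict_of_lists` is hypothetical and only illustrates the resulting layout:

```python
def dicts_to_dict_of_lists(rows, keys):
    """Turn a list of dicts into a dict of lists (hypothetical helper)."""
    return {key: [row[key] for row in rows] for key in keys}


answers = [
    {"text": "Paris", "answer_start": 12},
    {"text": "paris", "answer_start": 40},
]
columnar = dicts_to_dict_of_lists(answers, ["text", "answer_start"])
print(columnar)
```

With a `List` of dictionaries, the data would instead keep the first layout, one dictionary per item.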
| </div> | |
| ### Translation[[datasets.Translation]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Translation</name><anchor>datasets.Translation</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/translation.py#L12</source><parameters>[{"name": "languages", "val": ": list"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **languages** (`list` of `str`) -- | |
| The language codes. Each example is a dictionary mapping these string language codes to string translations.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| `Feature` for translations with fixed languages per example. | |
| Here for compatibility with tfds. | |
| <ExampleCodeBlock anchor="datasets.Translation.example"> | |
| Example: | |
| ```python | |
| >>> # At construction time: | |
| >>> datasets.features.Translation(languages=['en', 'fr', 'de']) | |
| >>> # During data generation: | |
| >>> yield { | |
| ... 'en': 'the cat', | |
| ... 'fr': 'le chat', | |
| ... 'de': 'die katze' | |
| ... } | |
| ``` | |
| </ExampleCodeBlock> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>flatten</name><anchor>datasets.Translation.flatten</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/translation.py#L44</source><parameters>[]</parameters></docstring> | |
| Flatten the Translation feature into a dictionary. | |
| </div></div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.TranslationVariableLanguages</name><anchor>datasets.TranslationVariableLanguages</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/translation.py#L52</source><parameters>[{"name": "languages", "val": ": typing.Optional[list] = None"}, {"name": "num_languages", "val": ": typing.Optional[int] = None"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **languages** (`list` of `str`, *optional*) -- | |
| The language codes. Each example is a dictionary mapping string language codes to one or more string translations. | |
| The languages present may vary from example to example.</paramsdesc><paramgroups>0</paramgroups><rettype>- `language` or `translation` (variable-length 1D `tf.Tensor` of `tf.string`)</rettype><retdesc>Language codes sorted in ascending order or plain text translations, sorted to align with language codes.</retdesc></docstring> | |
| `Feature` for translations with variable languages per example. | |
| Here for compatibility with tfds. | |
| <ExampleCodeBlock anchor="datasets.TranslationVariableLanguages.example"> | |
| Example: | |
| ```python | |
| >>> # At construction time: | |
| >>> datasets.features.TranslationVariableLanguages(languages=['en', 'fr', 'de']) | |
| >>> # During data generation: | |
| >>> yield { | |
| ... 'en': 'the cat', | |
| ... 'fr': ['le chat', 'la chatte'], | |
| ... 'de': 'die katze' | |
| ... } | |
| >>> # Encoded output: | |
| >>> { | |
| ... 'language': ['de', 'en', 'fr', 'fr'], | |
| ... 'translation': ['die katze', 'the cat', 'la chatte', 'le chat'], | |
| ... } | |
| ``` | |
| </ExampleCodeBlock> | |
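The return description states that language codes are sorted in ascending order and that translations are aligned with them. A minimal sketch of such an encoding, assuming the `(language, translation)` pairs are sorted together, could look like this. `encode_translations` is a hypothetical helper, not the library method.

```python
def encode_translations(translation_dict):
    """Flatten a per-language dict into aligned, sorted lists (hypothetical helper)."""
    pairs = []
    for lang, text in translation_dict.items():
        if isinstance(text, str):
            pairs.append((lang, text))
        else:
            # Several translations for one language become several pairs.
            pairs.extend((lang, item) for item in text)
    # Sorting the pairs keeps each translation next to its language code.
    pairs.sort()
    languages, translations = zip(*pairs)
    return {"language": list(languages), "translation": list(translations)}


encoded = encode_translations(
    {"en": "the cat", "fr": ["le chat", "la chatte"], "de": "die katze"}
)
print(encoded)
```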
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>flatten</name><anchor>datasets.TranslationVariableLanguages.flatten</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/translation.py#L122</source><parameters>[]</parameters></docstring> | |
| Flatten the TranslationVariableLanguages feature into a dictionary. | |
| </div></div> | |
| ### Arrays[[datasets.Array2D]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Array2D</name><anchor>datasets.Array2D</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L583</source><parameters>[{"name": "shape", "val": ": tuple"}, {"name": "dtype", "val": ": str"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **shape** (`tuple`) -- | |
| Size of each dimension. | |
| - **dtype** (`str`) -- | |
| Name of the data type.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a two-dimensional array. | |
| <ExampleCodeBlock anchor="datasets.Array2D.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import Features, Array2D | |
| >>> features = Features({'x': Array2D(shape=(1, 3), dtype='int32')}) | |
| ``` | |
| </ExampleCodeBlock> | |
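Since `shape` gives the size of each dimension, data for an `Array2D(shape=(1, 3), ...)` column is one row of three values per example. The following plain-Python sketch checks a nested list against a declared shape; `nested_shape` and `matches_shape` are hypothetical helpers and not part of the library.

```python
def nested_shape(data):
    """Infer the shape of a rectangular nested list; raise ValueError if it is ragged."""
    if not isinstance(data, list):
        return ()
    if not data:
        return (0,)
    inner = [nested_shape(item) for item in data]
    if any(shape != inner[0] for shape in inner):
        raise ValueError("ragged nested list")
    return (len(data),) + inner[0]


def matches_shape(data, shape):
    """True if the nested list has exactly the declared shape."""
    return nested_shape(data) == tuple(shape)


print(nested_shape([[1, 2, 3]]))
print(matches_shape([[1, 2, 3]], (1, 3)))
```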
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Array3D</name><anchor>datasets.Array3D</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L608</source><parameters>[{"name": "shape", "val": ": tuple"}, {"name": "dtype", "val": ": str"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **shape** (`tuple`) -- | |
| Size of each dimension. | |
| - **dtype** (`str`) -- | |
| Name of the data type.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a three-dimensional array. | |
| <ExampleCodeBlock anchor="datasets.Array3D.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import Features, Array3D | |
| >>> features = Features({'x': Array3D(shape=(1, 2, 3), dtype='int32')}) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Array4D</name><anchor>datasets.Array4D</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L633</source><parameters>[{"name": "shape", "val": ": tuple"}, {"name": "dtype", "val": ": str"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **shape** (`tuple`) -- | |
| Size of each dimension. | |
| - **dtype** (`str`) -- | |
| Name of the data type.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a four-dimensional array. | |
| <ExampleCodeBlock anchor="datasets.Array4D.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import Features, Array4D | |
| >>> features = Features({'x': Array4D(shape=(1, 2, 2, 3), dtype='int32')}) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Array5D</name><anchor>datasets.Array5D</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/features.py#L658</source><parameters>[{"name": "shape", "val": ": tuple"}, {"name": "dtype", "val": ": str"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **shape** (`tuple`) -- | |
| Size of each dimension. | |
| - **dtype** (`str`) -- | |
| Name of the data type.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Create a five-dimensional array. | |
| <ExampleCodeBlock anchor="datasets.Array5D.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import Features, Array5D | |
| >>> features = Features({'x': Array5D(shape=(1, 2, 2, 3, 3), dtype='int32')}) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| ### Audio[[datasets.Audio]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Audio</name><anchor>datasets.Audio</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/audio.py#L24</source><parameters>[{"name": "sampling_rate", "val": ": typing.Optional[int] = None"}, {"name": "decode", "val": ": bool = True"}, {"name": "num_channels", "val": ": typing.Optional[int] = None"}, {"name": "stream_index", "val": ": typing.Optional[int] = None"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **sampling_rate** (`int`, *optional*) -- | |
| Target sampling rate. If `None`, the native sampling rate is used. | |
| - **num_channels** (`int`, *optional*) -- | |
| The desired number of channels of the samples. By default, the number of channels of the source is used. | |
| Audio decoding returns samples with shape `(num_channels, num_samples)`. | |
| Currently `None` (number of channels of the source, default), `1` (mono) or `2` (stereo) channels are supported. | |
| The `num_channels` argument is passed to `torchcodec.decoders.AudioDecoder`. | |
| <Added version="4.4.0"/> | |
| - **decode** (`bool`, defaults to `True`) -- | |
| Whether to decode the audio data. If `False`, | |
| returns the underlying dictionary in the format `{"path": audio_path, "bytes": audio_bytes}`. | |
| - **stream_index** (`int`, *optional*) -- | |
| The stream index to use from the file. If `None`, defaults to the "best" index.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Audio `Feature` to extract audio data from an audio file. | |
| Input: The Audio feature accepts as input: | |
| - A `str`: Absolute path to the audio file (i.e. random access is allowed). | |
| - A `pathlib.Path`: path to the audio file (i.e. random access is allowed). | |
| - A `dict` with the keys: | |
| - `path`: String with relative path of the audio file to the archive file. | |
| - `bytes`: Bytes content of the audio file. | |
| This is useful for parquet or webdataset files which embed audio files. | |
| - A `dict` with the keys: | |
| - `array`: Array containing the audio sample | |
| - `sampling_rate`: Integer corresponding to the sampling rate of the audio sample. | |
| - A `torchcodec.decoders.AudioDecoder`: torchcodec audio decoder object. | |
| Output: The Audio features output data as `torchcodec.decoders.AudioDecoder` objects, with additional keys: | |
| - `array`: Array containing the audio sample | |
| - `sampling_rate`: Integer corresponding to the sampling rate of the audio sample. | |
| <ExampleCodeBlock anchor="datasets.Audio.example"> | |
| Example: | |
| ```py | |
| >>> from datasets import load_dataset, Audio | |
| >>> ds = load_dataset("PolyAI/minds14", name="en-US", split="train") | |
| >>> ds = ds.cast_column("audio", Audio(sampling_rate=44100, num_channels=2)) | |
| >>> ds[0]["audio"] | |
| <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0> | |
| >>> audio = ds[0]["audio"] | |
| >>> audio.get_samples_played_in_range(0, 10) | |
| AudioSamples: | |
| data (shape): torch.Size([2, 110592]) | |
| pts_seconds: 0.0 | |
| duration_seconds: 2.507755102040816 | |
| sample_rate: 44100 | |
| ``` | |
| </ExampleCodeBlock> | |
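The numbers in the output above are related by simple arithmetic: with samples shaped `(num_channels, num_samples)`, the duration is the number of samples per channel divided by the sampling rate. A short check using the values from the example (`duration_in_seconds` is a hypothetical helper):

```python
def duration_in_seconds(num_samples, sample_rate):
    """Duration of a clip given samples per channel and sampling rate in Hz."""
    return num_samples / sample_rate


# Values taken from the example output: torch.Size([2, 110592]) at 44100 Hz.
num_channels, num_samples = 2, 110592
sample_rate = 44100
duration_seconds = duration_in_seconds(num_samples, sample_rate)
print(num_channels, duration_seconds)
```

The result agrees with the `duration_seconds` value reported in the example, about 2.51 seconds.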
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast_storage</name><anchor>datasets.Audio.cast_storage</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/audio.py#L234</source><parameters>[{"name": "storage", "val": ": typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray]"}]</parameters><paramsdesc>- **storage** (`Union[pa.StringArray, pa.StructArray]`) -- | |
| PyArrow array to cast.</paramsdesc><paramgroups>0</paramgroups><rettype>`pa.StructArray`</rettype><retdesc>Array in the Audio arrow storage type, that is | |
| `pa.struct({"bytes": pa.binary(), "path": pa.string()})`</retdesc></docstring> | |
| Cast an Arrow array to the Audio arrow storage type. | |
| The Arrow types that can be converted to the Audio pyarrow storage type are: | |
| - `pa.string()` - it must contain the "path" data | |
| - `pa.binary()` - it must contain the audio bytes | |
| - `pa.struct({"bytes": pa.binary()})` | |
| - `pa.struct({"path": pa.string()})` | |
| - `pa.struct({"bytes": pa.binary(), "path": pa.string()})` - order doesn't matter | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>decode_example</name><anchor>datasets.Audio.decode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/audio.py#L164</source><parameters>[{"name": "value", "val": ": dict"}, {"name": "token_per_repo_id", "val": ": typing.Optional[dict[str, typing.Union[str, bool, NoneType]]] = None"}]</parameters><paramsdesc>- **value** (`dict`) -- | |
| A dictionary with keys: | |
| - `path`: String with relative audio file path. | |
| - `bytes`: Bytes of the audio file. | |
| - **token_per_repo_id** (`dict`, *optional*) -- | |
| To access and decode | |
| audio files from private repositories on the Hub, you can pass | |
| a dictionary repo_id (`str`) -> token (`bool` or `str`)</paramsdesc><paramgroups>0</paramgroups><retdesc>`torchcodec.decoders.AudioDecoder`</retdesc></docstring> | |
| Decode example audio file into audio data. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>embed_storage</name><anchor>datasets.Audio.embed_storage</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/audio.py#L274</source><parameters>[{"name": "storage", "val": ": StructArray"}, {"name": "token_per_repo_id", "val": " = None"}]</parameters><paramsdesc>- **storage** (`pa.StructArray`) -- | |
| PyArrow array to embed.</paramsdesc><paramgroups>0</paramgroups><rettype>`pa.StructArray`</rettype><retdesc>Array in the Audio arrow storage type, that is | |
| `pa.struct({"bytes": pa.binary(), "path": pa.string()})`.</retdesc></docstring> | |
| Embed audio files into the Arrow array. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_example</name><anchor>datasets.Audio.encode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/audio.py#L96</source><parameters>[{"name": "value", "val": ": typing.Union[str, bytes, bytearray, dict, ForwardRef('AudioDecoder')]"}]</parameters><paramsdesc>- **value** (`str`, `bytes`,`bytearray`,`dict`, `AudioDecoder`) -- | |
| Data passed as input to Audio feature.</paramsdesc><paramgroups>0</paramgroups><rettype>`dict`</rettype></docstring> | |
| Encode example into a format for Arrow. | |
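As a rough illustration of how the accepted input forms relate to the `{"bytes": ..., "path": ...}` storage layout, the sketch below normalizes a few of them. `to_audio_storage` is a hypothetical helper; it deliberately leaves out the `array`/`sampling_rate` dictionary and `AudioDecoder` inputs, which require actual audio encoding, and it is not the library's implementation.

```python
from pathlib import Path


def to_audio_storage(value):
    """Normalize some input forms to a {"bytes", "path"} dict (hypothetical helper)."""
    if isinstance(value, Path):
        return {"bytes": None, "path": str(value)}
    if isinstance(value, str):
        return {"bytes": None, "path": value}
    if isinstance(value, (bytes, bytearray)):
        return {"bytes": bytes(value), "path": None}
    if isinstance(value, dict) and ("path" in value or "bytes" in value):
        # Keep whichever of the two keys is present; the other becomes None.
        return {"bytes": value.get("bytes"), "path": value.get("path")}
    raise ValueError(f"Unsupported audio input: {type(value).__name__}")


print(to_audio_storage("clip.wav"))
print(to_audio_storage({"path": "clip.wav", "bytes": b"RIFF"}))
```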
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>flatten</name><anchor>datasets.Audio.flatten</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/audio.py#L223</source><parameters>[]</parameters></docstring> | |
| If in the decodable state, raise an error, otherwise flatten the feature into a dictionary. | |
| </div></div> | |
| ### Image[[datasets.Image]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Image</name><anchor>datasets.Image</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/image.py#L47</source><parameters>[{"name": "mode", "val": ": typing.Optional[str] = None"}, {"name": "decode", "val": ": bool = True"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **mode** (`str`, *optional*) -- | |
| The mode to convert the image to. If `None`, the native mode of the image is used. | |
| - **decode** (`bool`, defaults to `True`) -- | |
| Whether to decode the image data. If `False`, | |
| returns the underlying dictionary in the format `{"path": image_path, "bytes": image_bytes}`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Image `Feature` to read image data from an image file. | |
| Input: The Image feature accepts as input: | |
| - A `str`: Absolute path to the image file (i.e. random access is allowed). | |
| - A `pathlib.Path`: path to the image file (i.e. random access is allowed). | |
| - A `dict` with the keys: | |
| - `path`: String with relative path of the image file to the archive file. | |
| - `bytes`: Bytes of the image file. | |
| This is useful for parquet or webdataset files which embed image files. | |
| - An `np.ndarray`: NumPy array representing an image. | |
| - A `PIL.Image.Image`: PIL image object. | |
| Output: The Image features output data as `PIL.Image.Image` objects. | |
| <ExampleCodeBlock anchor="datasets.Image.example"> | |
| Examples: | |
| ```py | |
| >>> from datasets import load_dataset, Image | |
| >>> ds = load_dataset("AI-Lab-Makerere/beans", split="train") | |
| >>> ds.features["image"] | |
| Image(decode=True, id=None) | |
| >>> ds[0]["image"] | |
| <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x500 at 0x15E52E7F0> | |
| >>> ds = ds.cast_column('image', Image(decode=False)) | |
| >>> ds[0]["image"] | |
| {'bytes': None, | |
| 'path': '/root/.cache/huggingface/datasets/downloads/extracted/b0a21163f78769a2cf11f58dfc767fb458fc7cea5c05dccc0144a2c0f0bc1292/train/healthy/healthy_train.85.jpg'} | |
| ``` | |
| </ExampleCodeBlock> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast_storage</name><anchor>datasets.Image.cast_storage</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/image.py#L213</source><parameters>[{"name": "storage", "val": ": typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray, pyarrow.lib.ListArray]"}]</parameters><paramsdesc>- **storage** (`Union[pa.StringArray, pa.StructArray, pa.ListArray]`) -- | |
| PyArrow array to cast.</paramsdesc><paramgroups>0</paramgroups><rettype>`pa.StructArray`</rettype><retdesc>Array in the Image arrow storage type, that is | |
| `pa.struct({"bytes": pa.binary(), "path": pa.string()})`.</retdesc></docstring> | |
| Cast an Arrow array to the Image arrow storage type. | |
| The Arrow types that can be converted to the Image pyarrow storage type are: | |
| - `pa.string()` - it must contain the "path" data | |
| - `pa.large_string()` - it must contain the "path" data (will be cast to string if possible) | |
| - `pa.binary()` - it must contain the image bytes | |
| - `pa.struct({"bytes": pa.binary()})` | |
| - `pa.struct({"path": pa.string()})` | |
| - `pa.struct({"bytes": pa.binary(), "path": pa.string()})` - order doesn't matter | |
| - `pa.list_(*)` - it must contain the image array data | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>decode_example</name><anchor>datasets.Image.decode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/image.py#L139</source><parameters>[{"name": "value", "val": ": dict"}, {"name": "token_per_repo_id", "val": " = None"}]</parameters><paramsdesc>- **value** (`str` or `dict`) -- | |
| A string with the absolute image file path, or a dictionary with | |
| keys: | |
| - `path`: String with absolute or relative image file path. | |
| - `bytes`: The bytes of the image file. | |
| - **token_per_repo_id** (`dict`, *optional*) -- | |
| To access and decode | |
| image files from private repositories on the Hub, you can pass | |
| a dictionary repo_id (`str`) -> token (`bool` or `str`).</paramsdesc><paramgroups>0</paramgroups><retdesc>`PIL.Image.Image`</retdesc></docstring> | |
| Decode example image file into image data. | |
| </div> | |
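As a sketch of the shape of a `token_per_repo_id` mapping, the snippet below builds one and looks up the entry for a repository. The repository names, the placeholder token string, and the `token_for_repo` helper are all hypothetical. Reading `True` as "use the locally saved token" is an assumption based on the usual `huggingface_hub` convention, not something stated in this reference.

```python
from typing import Dict, Optional, Union

# Hypothetical mapping: repo_id (str) -> token (bool or str).
token_per_repo_id: Dict[str, Union[bool, str]] = {
    "my-org/private-images": "hf_xxx",  # placeholder, not a real token
    "my-org/other-private": True,       # assumption: use the locally saved token
}


def token_for_repo(
    repo_id: str, mapping: Dict[str, Union[bool, str]]
) -> Optional[Union[bool, str]]:
    """Hypothetical helper: return the token entry for a repo, or None."""
    return mapping.get(repo_id)


known = token_for_repo("my-org/private-images", token_per_repo_id)
unknown = token_for_repo("someone/public-data", token_per_repo_id)
```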
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>embed_storage</name><anchor>datasets.Image.embed_storage</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/image.py#L269</source><parameters>[{"name": "storage", "val": ": StructArray"}, {"name": "token_per_repo_id", "val": " = None"}]</parameters><paramsdesc>- **storage** (`pa.StructArray`) -- | |
| PyArrow array to embed.</paramsdesc><paramgroups>0</paramgroups><rettype>`pa.StructArray`</rettype><retdesc>Array in the Image arrow storage type, that is | |
| `pa.struct({"bytes": pa.binary(), "path": pa.string()})`.</retdesc></docstring> | |
| Embed image files into the Arrow array. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_example</name><anchor>datasets.Image.encode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/image.py#L98</source><parameters>[{"name": "value", "val": ": typing.Union[str, bytes, bytearray, dict, numpy.ndarray, ForwardRef('PIL.Image.Image')]"}]</parameters><paramsdesc>- **value** (`str`, `np.ndarray`, `PIL.Image.Image` or `dict`) -- | |
| Data passed as input to Image feature.</paramsdesc><paramgroups>0</paramgroups><retdesc>`dict` with "path" and "bytes" fields</retdesc></docstring> | |
| Encode example into a format for Arrow. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>flatten</name><anchor>datasets.Image.flatten</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/image.py#L200</source><parameters>[]</parameters></docstring> | |
| If in the decodable state, return the feature itself; otherwise, flatten the feature into a dictionary. | |
| </div></div> | |
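The `flatten` behavior described above can be sketched with a small stand-in class. `FileFeatureSketch` is hypothetical and is not the library's `Image` class. The dictionary keys mirror the `bytes`/`path` storage struct, and the plain strings used for the field types are placeholders for the library's own feature types.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class FileFeatureSketch:
    """Hypothetical stand-in for a file-backed feature such as Image."""

    decode: bool = True

    def flatten(self) -> Union["FileFeatureSketch", Dict[str, str]]:
        if self.decode:
            # Decodable state: the feature stays a single unit.
            return self
        # Non-decodable state: expose the two underlying storage fields.
        return {"bytes": "binary", "path": "string"}


decodable = FileFeatureSketch(decode=True)
raw = FileFeatureSketch(decode=False)
```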
| ### Video[[datasets.Video]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Video</name><anchor>datasets.Video</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/video.py#L29</source><parameters>[{"name": "decode", "val": ": bool = True"}, {"name": "stream_index", "val": ": typing.Optional[int] = None"}, {"name": "dimension_order", "val": ": typing.Literal['NCHW', 'NHWC'] = 'NCHW'"}, {"name": "num_ffmpeg_threads", "val": ": int = 1"}, {"name": "device", "val": ": typing.Union[str, ForwardRef('torch.device'), NoneType] = 'cpu'"}, {"name": "seek_mode", "val": ": typing.Literal['exact', 'approximate'] = 'exact'"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **decode** (`bool`, defaults to `True`) -- | |
| Whether to decode the video data. If `False`, | |
| returns the underlying dictionary in the format `{"path": video_path, "bytes": video_bytes}`. | |
| - **stream_index** (`int`, *optional*) -- | |
| The stream index to use from the file. If `None`, defaults to the "best" index. | |
| - **dimension_order** (`str`, defaults to `NCHW`) -- | |
| The dimension order of the decoded frames, | |
| where N is the batch size, C is the number of channels, | |
| H is the height, and W is the width of the frames. | |
| - **num_ffmpeg_threads** (`int`, defaults to `1`) -- | |
| The number of threads to use for decoding the video. Keeping this at 1 is recommended. | |
| - **device** (`str` or `torch.device`, defaults to `cpu`) -- | |
| The device to use for decoding the video. | |
| - **seek_mode** (`str`, defaults to `exact`) -- | |
| Determines whether frame access is "exact" or "approximate". | |
| Exact mode guarantees that requesting frame i always returns frame i, but it requires an initial scan of the file. | |
| Approximate mode is faster because it avoids scanning the file, but it is less accurate because it uses the file's metadata to estimate where frame i is. | |
| Read more [here](https://docs.pytorch.org/torchcodec/stable/generated_examples/approximate_mode.html#sphx-glr-generated-examples-approximate-mode-py).</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Video `Feature` to read video data from a video file. | |
| Input: The Video feature accepts as input: | |
| - A `str`: Absolute path to the video file (i.e. random access is allowed). | |
| - A `pathlib.Path`: path to the video file (i.e. random access is allowed). | |
| - A `dict` with the keys: | |
| - `path`: String with relative path of the video file in a dataset repository. | |
| - `bytes`: Bytes of the video file. | |
| This is useful for Parquet or WebDataset files that embed video files. | |
| - A `torchcodec.decoders.VideoDecoder`: torchcodec video decoder object. | |
| Output: The Video feature outputs data as `torchcodec.decoders.VideoDecoder` objects. | |
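The trade-off behind the `seek_mode` parameter can be illustrated without any video library. The sketch below is an illustration under stated assumptions, not torchcodec's algorithm: the timestamps, the duration, and all function names are hypothetical. It models a variable-frame-rate clip, where an estimate based on the average frame rate can land on a different frame than a lookup in scanned timestamps.

```python
from bisect import bisect_right
from typing import List

# Hypothetical per-frame presentation timestamps, as an initial scan would find them.
pts: List[float] = [0.0, 0.1, 0.2, 0.6, 0.7, 0.8]
duration_seconds = 1.0                      # hypothetical container metadata
average_fps = len(pts) / duration_seconds   # 6 frames per second on average


def exact_frame_time(timestamps: List[float], i: int) -> float:
    """Exact: read the timestamp of frame i from the scanned index."""
    return timestamps[i]


def approximate_frame_time(i: int, fps: float) -> float:
    """Approximate: estimate the timestamp of frame i from metadata only."""
    return i / fps


def frame_at(timestamps: List[float], t: float) -> int:
    """Index of the frame being displayed at time t."""
    return bisect_right(timestamps, t) - 1


i = 3
exact_hit = frame_at(pts, exact_frame_time(pts, i))
approx_hit = frame_at(pts, approximate_frame_time(i, average_fps))
print(exact_hit, approx_hit)
```

With evenly spaced frames the two lookups would agree; the gap between frames 2 and 3 in this made-up clip is what makes the estimate miss.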
| <ExampleCodeBlock anchor="datasets.Video.example"> | |
| Examples: | |
| ```py | |
| >>> from datasets import Dataset, Video | |
| >>> ds = Dataset.from_dict({"video":["path/to/Screen Recording.mov"]}).cast_column("video", Video()) | |
| >>> ds.features["video"] | |
| Video(decode=True, id=None) | |
| >>> ds[0]["video"] | |
| <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61e080> | |
| >>> video = ds[0]["video"] | |
| >>> video.get_frames_in_range(0, 10) | |
| FrameBatch: | |
| data (shape): torch.Size([10, 3, 50, 66]) | |
| pts_seconds: tensor([0.4333, 0.4333, 0.4333, 0.4333, 0.4333, 0.4333, 0.4333, 0.4333, 0.4333, | |
| 0.4333], dtype=torch.float64) | |
| duration_seconds: tensor([0.0167, 0.0167, 0.0167, 0.0167, 0.0167, 0.0167, 0.0167, 0.0167, 0.0167, | |
| 0.0167], dtype=torch.float64) | |
| >>> ds.cast_column('video', Video(decode=False))[0]["video"] | |
| {'bytes': None, | |
| 'path': 'path/to/Screen Recording.mov'} | |
| ``` | |
| </ExampleCodeBlock> | |
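The frame batch in the example above has shape `[10, 3, 50, 66]`, which follows the default `NCHW` order. As a small sketch of what `dimension_order` changes, the snippet below rearranges that shape tuple into `NHWC`. `nchw_to_nhwc_shape` is a hypothetical helper that works on shapes only, not on tensors.

```python
from typing import Tuple

Shape4 = Tuple[int, int, int, int]


def nchw_to_nhwc_shape(shape: Shape4) -> Shape4:
    """Hypothetical helper: reorder (N, C, H, W) into (N, H, W, C)."""
    n, c, h, w = shape
    return (n, h, w, c)


nchw = (10, 3, 50, 66)   # 10 frames, 3 channels, height 50, width 66
nhwc = nchw_to_nhwc_shape(nchw)
print(nhwc)  # (10, 50, 66, 3)
```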
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast_storage</name><anchor>datasets.Video.cast_storage</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/video.py#L241</source><parameters>[{"name": "storage", "val": ": typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray, pyarrow.lib.ListArray]"}]</parameters><paramsdesc>- **storage** (`Union[pa.StringArray, pa.StructArray, pa.ListArray]`) -- | |
| PyArrow array to cast.</paramsdesc><paramgroups>0</paramgroups><rettype>`pa.StructArray`</rettype><retdesc>Array in the Video arrow storage type, that is | |
| `pa.struct({"bytes": pa.binary(), "path": pa.string()})`.</retdesc></docstring> | |
| Cast an Arrow array to the Video arrow storage type. | |
| The Arrow types that can be converted to the Video pyarrow storage type are: | |
| - `pa.string()` - it must contain the "path" data | |
| - `pa.binary()` - it must contain the video bytes | |
| - `pa.struct({"bytes": pa.binary()})` | |
| - `pa.struct({"path": pa.string()})` | |
| - `pa.struct({"bytes": pa.binary(), "path": pa.string()})` - order doesn't matter | |
| - `pa.list(*)` - it must contain the video array data | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>decode_example</name><anchor>datasets.Video.decode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/video.py#L155</source><parameters>[{"name": "value", "val": ": typing.Union[str, datasets.features.video.Example]"}, {"name": "token_per_repo_id", "val": ": typing.Optional[dict[str, typing.Union[bool, str]]] = None"}]</parameters><paramsdesc>- **value** (`str` or `dict`) -- | |
| A string with the absolute video file path, or a dictionary with | |
| keys: | |
| - `path`: String with absolute or relative video file path. | |
| - `bytes`: The bytes of the video file. | |
| - **token_per_repo_id** (`dict`, *optional*) -- | |
| To access and decode | |
| video files from private repositories on the Hub, you can pass | |
| a dictionary repo_id (`str`) -> token (`bool` or `str`).</paramsdesc><paramgroups>0</paramgroups><retdesc>`torchcodec.decoders.VideoDecoder`</retdesc></docstring> | |
| Decode example video file into video data. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_example</name><anchor>datasets.Video.encode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/video.py#L107</source><parameters>[{"name": "value", "val": ": typing.Union[str, bytes, bytearray, datasets.features.video.Example, numpy.ndarray, ForwardRef('VideoDecoder')]"}]</parameters><paramsdesc>- **value** (`str`, `np.ndarray`, `bytes`, `bytearray`, `VideoDecoder` or `dict`) -- | |
| Data passed as input to Video feature.</paramsdesc><paramgroups>0</paramgroups><retdesc>`dict` with "path" and "bytes" fields</retdesc></docstring> | |
| Encode example into a format for Arrow. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>flatten</name><anchor>datasets.Video.flatten</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/video.py#L228</source><parameters>[]</parameters></docstring> | |
| If in the decodable state, return the feature itself; otherwise, flatten the feature into a dictionary. | |
| </div></div> | |
| ### Pdf[[datasets.Pdf]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Pdf</name><anchor>datasets.Pdf</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/pdf.py#L31</source><parameters>[{"name": "decode", "val": ": bool = True"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **decode** (`bool`, defaults to `True`) -- | |
| Whether to decode the pdf data. If `False`, | |
| returns the underlying dictionary in the format `{"path": pdf_path, "bytes": pdf_bytes}`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| **Experimental.** | |
| Pdf `Feature` to read pdf documents from a pdf file. | |
| Input: The Pdf feature accepts as input: | |
| - A `str`: Absolute path to the pdf file (i.e. random access is allowed). | |
| - A `pathlib.Path`: path to the pdf file (i.e. random access is allowed). | |
| - A `dict` with the keys: | |
| - `path`: String with relative path of the pdf file in a dataset repository. | |
| - `bytes`: Bytes of the pdf file. | |
| This is useful for archived files with sequential access. | |
| - A `pdfplumber.pdf.PDF`: pdfplumber pdf object. | |
| <ExampleCodeBlock anchor="datasets.Pdf.example"> | |
| Examples: | |
| ```py | |
| >>> from datasets import Dataset, Pdf | |
| >>> ds = Dataset.from_dict({"pdf": ["path/to/pdf/file.pdf"]}).cast_column("pdf", Pdf()) | |
| >>> ds.features["pdf"] | |
| Pdf(decode=True, id=None) | |
| >>> ds[0]["pdf"] | |
| <pdfplumber.pdf.PDF object at 0x7f8a1c2d8f40> | |
| >>> ds = ds.cast_column("pdf", Pdf(decode=False)) | |
| >>> ds[0]["pdf"] | |
| {'bytes': None, | |
| 'path': 'path/to/pdf/file.pdf'} | |
| ``` | |
| </ExampleCodeBlock> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast_storage</name><anchor>datasets.Pdf.cast_storage</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/pdf.py#L186</source><parameters>[{"name": "storage", "val": ": typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray, pyarrow.lib.ListArray]"}]</parameters><paramsdesc>- **storage** (`Union[pa.StringArray, pa.StructArray, pa.ListArray]`) -- | |
| PyArrow array to cast.</paramsdesc><paramgroups>0</paramgroups><rettype>`pa.StructArray`</rettype><retdesc>Array in the Pdf arrow storage type, that is | |
| `pa.struct({"bytes": pa.binary(), "path": pa.string()})`.</retdesc></docstring> | |
| Cast an Arrow array to the Pdf arrow storage type. | |
| The Arrow types that can be converted to the Pdf pyarrow storage type are: | |
| - `pa.string()` - it must contain the "path" data | |
| - `pa.binary()` - it must contain the pdf bytes | |
| - `pa.struct({"bytes": pa.binary()})` | |
| - `pa.struct({"path": pa.string()})` | |
| - `pa.struct({"bytes": pa.binary(), "path": pa.string()})` - order doesn't matter | |
| - `pa.list(*)` - it must contain the pdf array data | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>decode_example</name><anchor>datasets.Pdf.decode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/pdf.py#L115</source><parameters>[{"name": "value", "val": ": dict"}, {"name": "token_per_repo_id", "val": " = None"}]</parameters><paramsdesc>- **value** (`str` or `dict`) -- | |
| A string with the absolute pdf file path, or a dictionary with | |
| keys: | |
| - `path`: String with absolute or relative pdf file path. | |
| - `bytes`: The bytes of the pdf file. | |
| - **token_per_repo_id** (`dict`, *optional*) -- | |
| To access and decode pdf files from private repositories on | |
| the Hub, you can pass a dictionary | |
| repo_id (`str`) -> token (`bool` or `str`).</paramsdesc><paramgroups>0</paramgroups><retdesc>`pdfplumber.pdf.PDF`</retdesc></docstring> | |
| Decode example pdf file into pdf data. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>embed_storage</name><anchor>datasets.Pdf.embed_storage</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/pdf.py#L223</source><parameters>[{"name": "storage", "val": ": StructArray"}, {"name": "token_per_repo_id", "val": " = None"}]</parameters><paramsdesc>- **storage** (`pa.StructArray`) -- | |
| PyArrow array to embed.</paramsdesc><paramgroups>0</paramgroups><rettype>`pa.StructArray`</rettype><retdesc>Array in the Pdf arrow storage type, that is | |
| `pa.struct({"bytes": pa.binary(), "path": pa.string()})`.</retdesc></docstring> | |
| Embed PDF files into the Arrow array. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_example</name><anchor>datasets.Pdf.encode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/pdf.py#L80</source><parameters>[{"name": "value", "val": ": typing.Union[str, bytes, bytearray, dict, ForwardRef('pdfplumber.pdf.PDF')]"}]</parameters><paramsdesc>- **value** (`str`, `bytes`, `pdfplumber.pdf.PDF` or `dict`) -- | |
| Data passed as input to Pdf feature.</paramsdesc><paramgroups>0</paramgroups><retdesc>`dict` with "path" and "bytes" fields</retdesc></docstring> | |
| Encode example into a format for Arrow. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>flatten</name><anchor>datasets.Pdf.flatten</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/pdf.py#L173</source><parameters>[]</parameters></docstring> | |
| If in the decodable state, return the feature itself; otherwise, flatten the feature into a dictionary. | |
| </div></div> | |
| ### Nifti[[datasets.Nifti]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Nifti</name><anchor>datasets.Nifti</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/nifti.py#L23</source><parameters>[{"name": "decode", "val": ": bool = True"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **decode** (`bool`, defaults to `True`) -- | |
| Whether to decode the NIfTI data. If `False`, a string with the bytes is returned. `decode=False` is not supported when decoding examples.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| **Experimental.** | |
| Nifti `Feature` to read NIfTI neuroimaging files. | |
| Input: The Nifti feature accepts as input: | |
| - A `str`: Absolute path to the NIfTI file (i.e. random access is allowed). | |
| - A `pathlib.Path`: path to the NIfTI file (i.e. random access is allowed). | |
| - A `dict` with the keys: | |
| - `path`: String with relative path of the NIfTI file in a dataset repository. | |
| - `bytes`: Bytes of the NIfTI file. | |
| This is useful for archived files with sequential access. | |
| - A `nibabel` image object (e.g., `nibabel.nifti1.Nifti1Image`). | |
| <ExampleCodeBlock anchor="datasets.Nifti.example"> | |
| Examples: | |
| ```py | |
| >>> from datasets import Dataset, Nifti | |
| >>> ds = Dataset.from_dict({"nifti": ["path/to/file.nii.gz"]}).cast_column("nifti", Nifti()) | |
| >>> ds.features["nifti"] | |
| Nifti(decode=True, id=None) | |
| >>> ds[0]["nifti"] | |
| <nibabel.nifti1.Nifti1Image object at 0x7f8a1c2d8f40> | |
| >>> ds = ds.cast_column("nifti", Nifti(decode=False)) | |
| >>> ds[0]["nifti"] | |
| {'bytes': None, | |
| 'path': 'path/to/file.nii.gz'} | |
| ``` | |
| </ExampleCodeBlock> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast_storage</name><anchor>datasets.Nifti.cast_storage</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/nifti.py#L188</source><parameters>[{"name": "storage", "val": ": typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray, pyarrow.lib.BinaryArray]"}]</parameters><paramsdesc>- **storage** (`Union[pa.StringArray, pa.StructArray, pa.BinaryArray]`) -- | |
| PyArrow array to cast.</paramsdesc><paramgroups>0</paramgroups><rettype>`pa.StructArray`</rettype><retdesc>Array in the Nifti arrow storage type, that is | |
| `pa.struct({"bytes": pa.binary(), "path": pa.string()})`.</retdesc></docstring> | |
| Cast an Arrow array to the Nifti arrow storage type. | |
| The Arrow types that can be converted to the Nifti pyarrow storage type are: | |
| - `pa.string()` - it must contain the "path" data | |
| - `pa.binary()` - it must contain the NIfTI bytes | |
| - `pa.struct({"bytes": pa.binary()})` | |
| - `pa.struct({"path": pa.string()})` | |
| - `pa.struct({"bytes": pa.binary(), "path": pa.string()})` - order doesn't matter | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>decode_example</name><anchor>datasets.Nifti.decode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/nifti.py#L109</source><parameters>[{"name": "value", "val": ": dict"}, {"name": "token_per_repo_id", "val": " = None"}]</parameters><paramsdesc>- **value** (`str` or `dict`) -- | |
| A string with the absolute NIfTI file path, or a dictionary with | |
| keys: | |
| - `path`: String with absolute or relative NIfTI file path. | |
| - `bytes`: The bytes of the NIfTI file. | |
| - **token_per_repo_id** (`dict`, *optional*) -- | |
| To access and decode NIfTI files from private repositories on | |
| the Hub, you can pass a dictionary | |
| repo_id (`str`) -> token (`bool` or `str`).</paramsdesc><paramgroups>0</paramgroups><retdesc>`nibabel.Nifti1Image` objects</retdesc></docstring> | |
| Decode example NIfTI file into nibabel image object. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_example</name><anchor>datasets.Nifti.encode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/nifti.py#L69</source><parameters>[{"name": "value", "val": ": typing.Union[str, bytes, bytearray, dict, ForwardRef('nib.Nifti1Image')]"}]</parameters><paramsdesc>- **value** (`str`, `bytes`, `nibabel.Nifti1Image` or `dict`) -- | |
| Data passed as input to Nifti feature.</paramsdesc><paramgroups>0</paramgroups><retdesc>`dict` with "path" and "bytes" fields</retdesc></docstring> | |
| Encode example into a format for Arrow. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>flatten</name><anchor>datasets.Nifti.flatten</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/nifti.py#L175</source><parameters>[]</parameters></docstring> | |
| If in the decodable state, return the feature itself; otherwise, flatten the feature into a dictionary. | |
| </div></div> | |
| ### Dicom[[datasets.Dicom]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.Dicom</name><anchor>datasets.Dicom</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/dicom.py#L28</source><parameters>[{"name": "decode", "val": ": bool = True"}, {"name": "force", "val": ": bool = False"}, {"name": "id", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **decode** (`bool`, defaults to `True`) -- | |
| Whether to decode the DICOM data. If `False`, | |
| returns the underlying dictionary in the format `{"path": dicom_path, "bytes": dicom_bytes}`. | |
| - **force** (`bool`, defaults to `False`) -- | |
| Force reading files missing DICOM File Meta Information header or 'DICM' prefix. | |
| Passed to `pydicom.dcmread(force=...)`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| **Experimental.** | |
| Dicom `Feature` to read DICOM medical imaging files. | |
| Input: The Dicom feature accepts as input: | |
| - A `str`: Absolute path to the DICOM file (i.e. random access is allowed). | |
| - A `pathlib.Path`: path to the DICOM file (i.e. random access is allowed). | |
| - A `dict` with the keys: | |
| - `path`: String with relative path of the DICOM file in a dataset repository. | |
| - `bytes`: Bytes of the DICOM file. | |
| This is useful for archived files with sequential access. | |
| - A `pydicom.FileDataset`: pydicom dataset object. | |
| <ExampleCodeBlock anchor="datasets.Dicom.example"> | |
| Examples: | |
| ```py | |
| >>> from datasets import Dataset, Dicom | |
| >>> ds = Dataset.from_dict({"dicom": ["path/to/file.dcm"]}).cast_column("dicom", Dicom()) | |
| >>> ds.features["dicom"] | |
| Dicom(decode=True, force=False, id=None) | |
| >>> ds[0]["dicom"] | |
| <pydicom.dataset.FileDataset object at 0x7f8a1c2d8f40> | |
| >>> ds = ds.cast_column("dicom", Dicom(decode=False)) | |
| >>> ds[0]["dicom"] | |
| {'bytes': None, | |
| 'path': 'path/to/file.dcm'} | |
| ``` | |
| </ExampleCodeBlock> | |
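The `force` parameter concerns files that lack the File Meta Information header or the `DICM` prefix. In the DICOM file format, that prefix is the four bytes `DICM` placed right after a 128-byte preamble. The sketch below checks for it on raw bytes. `has_dicm_prefix` is a hypothetical helper and is not part of `datasets` or `pydicom`, and the byte strings are made-up test data rather than real DICOM content.

```python
PREAMBLE_LENGTH = 128  # bytes before the prefix in the DICOM file format
PREFIX = b"DICM"


def has_dicm_prefix(data: bytes) -> bool:
    """Hypothetical helper: True if the 128-byte preamble is followed by b'DICM'."""
    end = PREAMBLE_LENGTH + len(PREFIX)
    return len(data) >= end and data[PREAMBLE_LENGTH:end] == PREFIX


# Made-up byte strings for illustration only.
with_prefix = bytes(PREAMBLE_LENGTH) + PREFIX + b"\x02\x00"
without_prefix = b"\x08\x00\x05\x00"

print(has_dicm_prefix(with_prefix), has_dicm_prefix(without_prefix))
```

Files for which a check like this fails are the kind `Dicom(force=True)` is meant for, since the flag is passed on to `pydicom.dcmread(force=...)`.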
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>cast_storage</name><anchor>datasets.Dicom.cast_storage</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/dicom.py#L188</source><parameters>[{"name": "storage", "val": ": typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray, pyarrow.lib.BinaryArray]"}]</parameters><paramsdesc>- **storage** (`Union[pa.StringArray, pa.StructArray, pa.BinaryArray]`) -- | |
| PyArrow array to cast.</paramsdesc><paramgroups>0</paramgroups><rettype>`pa.StructArray`</rettype><retdesc>Array in the Dicom arrow storage type, that is | |
| `pa.struct({"bytes": pa.binary(), "path": pa.string()})`.</retdesc></docstring> | |
| Cast an Arrow array to the Dicom arrow storage type. | |
| The Arrow types that can be converted to the Dicom pyarrow storage type are: | |
| - `pa.string()` - it must contain the "path" data | |
| - `pa.binary()` - it must contain the DICOM bytes | |
| - `pa.struct({"bytes": pa.binary()})` | |
| - `pa.struct({"path": pa.string()})` | |
| - `pa.struct({"bytes": pa.binary(), "path": pa.string()})` - order doesn't matter | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>decode_example</name><anchor>datasets.Dicom.decode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/dicom.py#L116</source><parameters>[{"name": "value", "val": ": DicomDict"}, {"name": "token_per_repo_id", "val": ": typing.Optional[typing.Dict[str, typing.Union[str, bool]]] = None"}]</parameters><paramsdesc>- **value** (`dict`) -- | |
| A dictionary with keys: | |
| - `path`: String with absolute or relative DICOM file path. | |
| - `bytes`: The bytes of the DICOM file. | |
| - **token_per_repo_id** (`dict`, *optional*) -- | |
| To access and decode DICOM files from private repositories on | |
| the Hub, you can pass a dictionary | |
| repo_id (`str`) -> token (`bool` or `str`).</paramsdesc><paramgroups>0</paramgroups><retdesc>`pydicom.FileDataset` objects</retdesc></docstring> | |
| Decode example DICOM file into pydicom FileDataset object. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_example</name><anchor>datasets.Dicom.encode_example</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/dicom.py#L79</source><parameters>[{"name": "value", "val": ": typing.Union[str, bytes, bytearray, dict, ForwardRef('pydicom.FileDataset')]"}]</parameters><paramsdesc>- **value** (`str`, `bytes`, `pydicom.FileDataset` or `dict`) -- | |
| Data passed as input to Dicom feature.</paramsdesc><paramgroups>0</paramgroups><retdesc>`dict` with "path" and "bytes" fields</retdesc></docstring> | |
| Encode example into a format for Arrow. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>flatten</name><anchor>datasets.Dicom.flatten</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/features/dicom.py#L175</source><parameters>[]</parameters></docstring> | |
| If in the decodable state, return the feature itself; otherwise, flatten the feature into a dictionary. | |
| </div></div> | |
| ## Filesystems[[datasets.filesystems.is_remote_filesystem]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>datasets.filesystems.is_remote_filesystem</name><anchor>datasets.filesystems.is_remote_filesystem</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/filesystems/__init__.py#L28</source><parameters>[{"name": "fs", "val": ": AbstractFileSystem"}]</parameters><paramsdesc>- **fs** (`fsspec.spec.AbstractFileSystem`) -- | |
| An abstract super-class for pythonic file-systems, e.g. `fsspec.filesystem('file')` or `s3fs.S3FileSystem`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Checks if `fs` is a remote filesystem. | |
| </div> | |
| ## Fingerprint[[datasets.fingerprint.Hasher]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class datasets.fingerprint.Hasher</name><anchor>datasets.fingerprint.Hasher</anchor><source>https://github.com/huggingface/datasets/blob/r_7835/src/datasets/fingerprint.py#L170</source><parameters>[]</parameters></docstring> | |
| Hasher that accepts Python objects as inputs. | |
| </div> | |
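The general idea of hashing arbitrary Python objects can be sketched with the standard library. This is a conceptual stand-in, not the `Hasher` class: the serialization and hash function the library uses are not described here, so `pickle` and SHA-256 are assumptions chosen for illustration, and `fingerprint` is a hypothetical name.

```python
import hashlib
import pickle
from typing import Any


def fingerprint(obj: Any) -> str:
    """Hypothetical stand-in: serialize the object, then hash the bytes."""
    payload = pickle.dumps(obj, protocol=4)
    return hashlib.sha256(payload).hexdigest()


config_a = {"lowercase": True, "max_length": 128}
config_b = {"lowercase": True, "max_length": 128}
config_c = {"lowercase": False, "max_length": 128}

same = fingerprint(config_a) == fingerprint(config_b)
different = fingerprint(config_a) != fingerprint(config_c)
```

Equal inputs give the same hex digest, which is the property that lets a hash of this kind serve as a cache key.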