Buckets:
| # Cache management | |
| When you download a dataset from Hugging Face, the data are stored locally on your computer. | |
| Files from Hugging Face are stored as usual in the `huggingface_hub` cache, which is at `~/.cache/huggingface/hub` by default. | |
| See the [Hub cache documentation](https://huggingface.co/docs/huggingface_hub/guides/manage-cache) for more details and how to change its location. | |
| The Hub cache allows ๐ค Datasets to avoid re-downloading dataset files from Hugging Face every time you use them. | |
| ๐ค Datasets also has its own cache to store datasets converted in Arrow format (the format used by [Dataset](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset) objects). | |
| This guide focuses on the ๐ค Datasets cache and will show you how to: | |
| - Change the cache directory. | |
| - Control how a dataset is loaded from the cache. | |
| - Clean up cache files in the directory. | |
| - Enable or disable caching. | |
| ## Cache directory | |
| The default ๐ค Datasets cache directory is `~/.cache/huggingface/datasets`. Change the cache location by setting the shell environment variable, `HF_HOME` to another directory: | |
| ``` | |
| $ export HF_HOME="/path/to/another/directory/datasets" | |
| ``` | |
| Alternatively, you can set the `HF_DATASETS_CACHE` environment variable to control only the datasets-specific cache directory: | |
| ``` | |
| $ export HF_DATASETS_CACHE="/path/to/datasets_cache" | |
| ``` | |
| โ ๏ธ This only applies to files written by the `datasets` library (e.g., Arrow files and indices). | |
| It does **not** affect files downloaded from the Hugging Face Hub (such as models, tokenizers, or raw dataset sources), which are located in `~/.cache/huggingface/hub` by default and controlled separately via the `HF_HUB_CACHE` variable: | |
| ``` | |
| $ export HF_HUB_CACHE="/path/to/hub_cache" | |
| ``` | |
| ๐ก If you'd like to relocate all Hugging Face caches โ including datasets and hub downloads โ use the `HF_HOME` variable instead: | |
| ``` | |
| $ export HF_HOME="/path/to/cache_root" | |
| ``` | |
| This results in: | |
| - datasets cache โ `/path/to/cache_root/datasets` | |
| - hub cache โ `/path/to/cache_root/hub` | |
| These distinctions are especially useful when working in shared environments or networked file systems (e.g., NFS). | |
| See [issue #7480](https://github.com/huggingface/datasets/issues/7480) for discussion on how users encountered unexpected cache locations when `HF_HUB_CACHE` was not set alongside `HF_DATASETS_CACHE`. | |
| When you load a dataset, you also have the option to change where the data is cached. Change the `cache_dir` parameter to the path you want: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> dataset = load_dataset('username/dataset', cache_dir="/path/to/another/directory/datasets") | |
| ``` | |
| ## Download mode | |
| After you download a dataset, control how it is loaded by [load_dataset()](/docs/datasets/pr_8021/en/package_reference/loading_methods#datasets.load_dataset) with the `download_mode` parameter. By default, ๐ค Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below: | |
| ```py | |
| >>> from datasets import load_dataset | |
| >>> dataset = load_dataset('rajpurkar/squad', download_mode='force_redownload') | |
| ``` | |
| Refer to [DownloadMode](/docs/datasets/pr_8021/en/package_reference/builder_classes#datasets.DownloadMode) for a full list of download modes. | |
| ## Cache files | |
| Clean up the Arrow cache files in the directory with [Dataset.cleanup_cache_files()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.cleanup_cache_files): | |
| ```py | |
| # Returns the number of removed cache files | |
| >>> dataset.cleanup_cache_files() | |
| 2 | |
| ``` | |
| ## Enable or disable caching | |
| If you're using a cached file locally, it will automatically reload the dataset with any previous transforms you applied to the dataset. Disable this behavior by setting the argument `load_from_cache_file=False` in [Dataset.map()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.map): | |
| ```py | |
| >>> updated_dataset = small_dataset.map(add_prefix, load_from_cache_file=False) | |
| ``` | |
| In the example above, ๐ค Datasets will execute the function `add_prefix` over the entire dataset again instead of loading the dataset from its previous state. | |
| Disable caching on a global scale with [disable_caching()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.disable_caching): | |
| ```py | |
| >>> from datasets import disable_caching | |
| >>> disable_caching() | |
| ``` | |
| When you disable caching, ๐ค Datasets will no longer reload cached files when applying transforms to datasets. Any transform you apply on your dataset will be need to be reapplied. | |
| > [!TIP] | |
| > If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [load_dataset()](/docs/datasets/pr_8021/en/package_reference/loading_methods#datasets.load_dataset) instead. | |
| ## Improve performance | |
| Disabling the cache and copying the dataset in-memory will speed up dataset operations. There are two options for copying the dataset in-memory: | |
| 1. Set `datasets.config.IN_MEMORY_MAX_SIZE` to a nonzero value (in bytes) that fits in your RAM memory. | |
| 2. Set the environment variable `HF_DATASETS_IN_MEMORY_MAX_SIZE` to a nonzero value. Note that the first method takes higher precedence. | |
Xet Storage Details
- Size:
- 5.25 kB
- Xet hash:
- c83451e942e8e610a19167fa25024aeebcafd3d38e47a49563fdb15e97db2a44
ยท
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.