	# Cloud storage

	## Hugging Face Datasets

	The Hugging Face Dataset Hub is home to a growing collection of datasets that span a variety of domains and tasks.

	It is more than cloud storage: the Dataset Hub is a platform that provides data versioning through Git, as well as a Dataset Viewer to explore the data, making it a great place to store AI-ready datasets.

	This guide shows how to import data from other cloud storage providers using `fsspec` filesystem implementations.

	## Import data from a cloud storage

	Most cloud storage providers have an `fsspec` filesystem implementation, so the same code can import data from any of them.
	This is especially useful for publishing datasets on Hugging Face.

	The following table lists some examples of supported cloud storage providers:

	| Storage provider     | Filesystem implementation                                     |
	|----------------------|---------------------------------------------------------------|
	| Amazon S3            | [s3fs](https://s3fs.readthedocs.io/en/latest/)                |
	| Google Cloud Storage | [gcsfs](https://gcsfs.readthedocs.io/en/latest/)              |
	| Azure Blob/DataLake  | [adlfs](https://github.com/fsspec/adlfs)                      |
	| Oracle Cloud Storage | [ocifs](https://ocifs.readthedocs.io/en/latest/)              |
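
Each implementation in the table is a separate Python package, and `fsspec` selects it by protocol name. As a rough sketch, the hypothetical helper below (not part of `fsspec`) maps the protocol names mentioned in this guide to the packages from the table, which can help decide what to install for a given URL. The mapping is an illustrative subset and the example URL is made up.

```python
# Protocol names come from the comment in this guide's first code example;
# package names come from the table above. Illustrative subset only.
PROTOCOL_TO_PACKAGE = {
    "s3": "s3fs",
    "gcs": "gcsfs",
    "abfs": "adlfs",
    "adl": "adlfs",
    "oci": "ocifs",
}


def package_for(url: str) -> str:
    """Return the package providing the filesystem for `url` (hypothetical helper)."""
    if "://" not in url:
        raise ValueError(f"no protocol found in {url!r}")
    protocol = url.split("://", 1)[0]
    try:
        return PROTOCOL_TO_PACKAGE[protocol]
    except KeyError:
        raise ValueError(f"protocol {protocol!r} is not in this sketch's table") from None


print(package_for("s3://my-bucket/data/0001.parquet"))  # prints: s3fs
```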

	The rest of this guide shows how to import data files from any cloud storage provider and save them as a dataset on Hugging Face.

	As an example, suppose you want to publish a dataset on Hugging Face from Parquet files kept in cloud storage.

	First, instantiate your cloud storage filesystem and list the files you'd like to import:

	```python
	>>> import fsspec
	>>> fs = fsspec.filesystem("...") # s3 / gcs / abfs / adl / oci / ...
	>>> data_dir = "path/to/my/data/"
	>>> pattern = "*.parquet"
	>>> data_files = fs.glob(data_dir + pattern)
	>>> data_files
	["path/to/my/data/0001.parquet", "path/to/my/data/0002.parquet", ...]
	```
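
To try the listing step without cloud credentials, one option is `fsspec`'s built-in in-memory filesystem. The sketch below assumes `fsspec` is installed; the directory name and file contents are placeholders invented for the demonstration, not real Parquet data.

```python
import fsspec

# "memory" is fsspec's built-in in-memory filesystem; it needs no credentials.
fs = fsspec.filesystem("memory")

data_dir = "/demo-bucket/data/"  # hypothetical location, for illustration only
for name in ("0001.parquet", "0002.parquet", "notes.txt"):
    fs.pipe(data_dir + name, b"placeholder bytes")  # write small dummy files

# Same glob call as above: only the files matching the pattern are returned.
data_files = sorted(fs.glob(data_dir + "*.parquet"))

# Keep just the file names so the result is easy to inspect.
names = [path.rsplit("/", 1)[-1] for path in data_files]
print(names)
```

Once the pattern selects the files you expect, swapping `"memory"` for your provider's protocol should leave the rest of the code unchanged.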

	Then create a dataset repository on Hugging Face and upload the data files, for example:

	```python
	>>> from huggingface_hub import create_repo, upload_file
	>>> from tqdm.auto import tqdm
	>>> destination_dataset = "username/my-dataset"
	>>> create_repo(destination_dataset, repo_type="dataset")
	>>> for data_file in tqdm(fs.glob(data_dir + pattern)):
	...     with fs.open(data_file) as fileobj:
	...         path_in_repo = data_file[len(data_dir):]
	...         upload_file(
	...             path_or_fileobj=fileobj,
	...             path_in_repo=path_in_repo,
	...             repo_id=destination_dataset,
	...             repo_type="dataset",
	...         )
	```
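
The slice `data_file[len(data_dir):]` in the loop above relies on `data_dir` ending with a slash and on the globbed paths starting with exactly the same prefix. If that may not hold (for example when `data_dir` carries a `protocol://` prefix that the returned paths lack), a small helper can make the computation more forgiving. This is a sketch using only the standard library; `path_in_repo_for` is a hypothetical name, not part of `fsspec` or `huggingface_hub`.

```python
import posixpath


def _strip_protocol(path: str) -> str:
    # "s3://bucket/key" -> "bucket/key"; paths without "://" are returned unchanged.
    return path.split("://", 1)[-1]


def path_in_repo_for(data_file: str, data_dir: str) -> str:
    """Return data_file relative to data_dir, using forward slashes (hypothetical helper)."""
    # Anchor both paths at "/" so the result does not depend on the working directory.
    file_path = "/" + _strip_protocol(data_file).strip("/")
    dir_path = "/" + _strip_protocol(data_dir).strip("/")
    relative = posixpath.relpath(file_path, dir_path)
    if relative == ".." or relative.startswith("../"):
        raise ValueError(f"{data_file!r} is not inside {data_dir!r}")
    return relative


print(path_in_repo_for("path/to/my/data/0001.parquet", "path/to/my/data/"))
```

The returned value can be passed as `path_in_repo` in place of the slice, and subdirectories under `data_dir` are preserved in the repository path.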

	For more upload options, see the [upload guide](https://huggingface.co/docs/huggingface_hub/en/guides/upload) in the [huggingface_hub](https://huggingface.co/docs/huggingface_hub) documentation.

	Finally, you can load the dataset using 🤗 Datasets:

	```python
	>>> from datasets import load_dataset
	>>> ds = load_dataset("username/my-dataset")
	```
