	# Share a dataset using the CLI

	At Hugging Face, we are on a mission to democratize good Machine Learning and we believe in the value of open source. That's why we designed 🤗 Datasets so that anyone can share a dataset with the greater ML community. There are currently thousands of datasets in over 100 languages on the Hugging Face Hub, and the Hugging Face team always welcomes new contributions!

	Dataset repositories offer features such as:

	- Free dataset hosting
	- Dataset versioning
	- Commit history and diffs
	- Metadata for discoverability
	- Dataset cards for documentation, licensing, limitations, etc.
	- [Dataset Viewer](https://huggingface.co/docs/hub/datasets-viewer)

	This guide will show you how to share a dataset folder or repository that can be easily accessed by anyone.

	## Add a dataset

	You can share your dataset with the community with a dataset repository on the Hugging Face Hub.
	It can also be a private dataset if you want to control who has access to it.

	In a dataset repository, you can host all your data files and [configure your dataset](./repository_structure#define-your-splits-in-yaml) to define which file goes to which split.
	The following formats are supported: CSV, TSV, JSON, JSON lines, text, Parquet, Arrow, SQLite, WebDataset.
	Many kinds of compressed file types are also supported: GZ, BZ2, LZ4, LZMA or ZSTD.
	For example, your dataset can be made of `.json.gz` files.

	When loading a dataset from the Hub, all the files in the supported formats are loaded, following the [repository structure](./repository_structure).

	For more information on how to load a dataset from the Hub, take a look at the [load a dataset from the Hub](./load_hub) tutorial.

	### Create the repository

	Sharing a community dataset will require you to create an account on [hf.co](https://huggingface.co/join) if you don't have one yet.
	You can directly create a [new dataset repository](https://huggingface.co/login?next=%2Fnew-dataset) from your account on the Hugging Face Hub, but this guide will show you how to upload a dataset from the terminal.

	1. Make sure you are in the virtual environment where you installed Datasets, and run the following command:

	```
	huggingface-cli login
	```

	2. Log in using your Hugging Face Hub credentials, and create a new dataset repository:

	```
	huggingface-cli repo create my-cool-dataset --type dataset
	```

	Add the `--organization` flag to create a repository under a specific organization:

	```
	huggingface-cli repo create my-cool-dataset --type dataset --organization your-org-name
	```
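The same step can also be done from Python. The snippet below is a minimal sketch, assuming the `huggingface_hub` package is installed and that you are already logged in. The helper name `create_dataset_repo` and its `api` parameter are hypothetical, introduced only for illustration (the `api` parameter lets you pass a stand-in client when trying the helper without network access).

```python
def create_dataset_repo(name, organization=None, private=False, api=None):
    """Create a dataset repository on the Hub and return what `create_repo` returns."""
    if api is None:
        # Imported lazily so the helper can be exercised with a stand-in client.
        from huggingface_hub import HfApi

        api = HfApi()
    # A repository owned by an organization is addressed as "<organization>/<name>".
    repo_id = f"{organization}/{name}" if organization else name
    return api.create_repo(repo_id=repo_id, repo_type="dataset", private=private)
```

Calling `create_dataset_repo("my-cool-dataset", organization="your-org-name")` would then mirror the command above, and `private=True` would create a private dataset.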

	## Prepare your files

	Check your directory to ensure the only files you're uploading are:

	- The data files of the dataset

	- The dataset card `README.md`
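If you want to automate this check, the sketch below lists files that are neither the dataset card nor a recognizable data file. It uses only the standard library. The extension sets are assumptions derived from the formats listed earlier in this guide, not an official list, and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed extensions for the formats named in this guide (not an official list).
DATA_EXTENSIONS = {".csv", ".tsv", ".json", ".jsonl", ".txt", ".parquet", ".arrow", ".sqlite", ".tar"}
COMPRESSION_EXTENSIONS = {".gz", ".bz2", ".lz4", ".xz", ".zst"}


def is_expected_file(path):
    """Return True for README.md or a data file, optionally compressed."""
    path = Path(path)
    if path.name == "README.md":
        return True
    suffixes = [suffix.lower() for suffix in path.suffixes]
    # Ignore one trailing compression suffix, e.g. ".json.gz" -> ".json".
    if suffixes and suffixes[-1] in COMPRESSION_EXTENSIONS:
        suffixes = suffixes[:-1]
    return bool(suffixes) and suffixes[-1] in DATA_EXTENSIONS


def unexpected_files(folder):
    """List files under `folder` (relative paths) that would not be expected in the upload."""
    folder = Path(folder)
    return sorted(
        p.relative_to(folder).as_posix()
        for p in folder.rglob("*")
        if p.is_file() and not is_expected_file(p)
    )
```

Running `unexpected_files(".")` before uploading gives you a list of stray files, such as scripts or editor backups, to move out of the folder.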

	## huggingface-cli upload

	Use the `huggingface-cli upload` command to upload files to the Hub directly. Internally, it uses the same `upload_file` and `upload_folder` helpers described in the [Upload guide](https://huggingface.co/docs/huggingface_hub/guides/upload). In the examples below, we will walk through the most common use cases. For a full list of available options, you can run:

	```bash
	>>> huggingface-cli upload --help
	```

	For more general information about `huggingface-cli` you can check the [CLI guide](https://huggingface.co/docs/huggingface_hub/guides/cli).

	### Upload an entire folder

	The default usage for this command is:

	```bash
	# Usage: huggingface-cli upload [dataset_repo_id] [local_path] [path_in_repo] --repo-type dataset
	```

	To upload the current directory at the root of the repo, use:

	```bash
	>>> huggingface-cli upload my-cool-dataset . . --repo-type dataset
	https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main/
	```

	> [!TIP]
	> If the repo doesn't exist yet, it will be created automatically.

	You can also upload a specific folder:

	```bash
	>>> huggingface-cli upload my-cool-dataset ./data . --repo-type dataset
	https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main/
	```

	Finally, you can upload a folder to a specific destination on the repo:

	```bash
	>>> huggingface-cli upload my-cool-dataset ./path/to/curated/data /data/train --repo-type dataset
	https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main/data/train
	```
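Since the command relies on the `upload_folder` helper, a folder upload can also be written in Python. This is a minimal sketch, assuming `huggingface_hub` is installed and a token is saved locally. The wrapper name `upload_dataset_folder` and its `api` parameter are hypothetical and exist only so that the call can be tried with a stand-in client.

```python
def upload_dataset_folder(repo_id, folder_path, path_in_repo=None, commit_message=None, api=None):
    """Upload a local folder to a dataset repository via `HfApi.upload_folder`."""
    if api is None:
        # Imported lazily so the wrapper can be exercised without network access.
        from huggingface_hub import HfApi

        api = HfApi()
    return api.upload_folder(
        repo_id=repo_id,
        folder_path=folder_path,
        path_in_repo=path_in_repo,  # None uploads to the root of the repo
        commit_message=commit_message,
        repo_type="dataset",  # same role as `--repo-type dataset` on the command line
    )
```

For example, `upload_dataset_folder("Wauplin/my-cool-dataset", "./path/to/curated/data", path_in_repo="data/train")` corresponds to the last command above.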

	### Upload a single file

	You can also upload a single file by setting `local_path` to point to a file on your machine. If that's the case, `path_in_repo` is optional and will default to the name of your local file:

	```bash
	>>> huggingface-cli upload Wauplin/my-cool-dataset ./files/train.csv --repo-type dataset
	https://huggingface.co/datasets/Wauplin/my-cool-dataset/blob/main/train.csv
	```

	If you want to upload a single file to a specific directory, set `path_in_repo` accordingly:

	```bash
	>>> huggingface-cli upload Wauplin/my-cool-dataset ./files/train.csv /data/train.csv --repo-type dataset
	https://huggingface.co/datasets/Wauplin/my-cool-dataset/blob/main/data/train.csv
	```

	### Upload multiple files

	To upload multiple files from a folder at once without uploading the entire folder, use the `--include` and `--exclude` patterns. They can also be combined with the `--delete` option to delete files on the repo while uploading new ones. In the example below, we sync the local dataset by deleting remote files and uploading all CSV files:

	```bash
	# Sync local dataset with Hub (upload new CSV files, delete removed files)
	>>> huggingface-cli upload Wauplin/my-cool-dataset --repo-type dataset --include="/data/*.csv" --delete="*" --commit-message="Sync local dataset with Hub"
	...
	```
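Before running a filtered upload, it can help to preview which local files a pattern would select. The sketch below uses shell-style wildcard matching from Python's standard `fnmatch` module on relative paths. It is an approximation for illustration only: the function name `select_files` is hypothetical, the patterns are written without a leading slash, and the CLI's exact matching rules may differ.

```python
from fnmatch import fnmatchcase


def select_files(paths, include=None, exclude=None):
    """Keep paths matching any `include` pattern and no `exclude` pattern."""
    selected = []
    for path in paths:
        if include is not None and not any(fnmatchcase(path, pattern) for pattern in include):
            continue
        if exclude is not None and any(fnmatchcase(path, pattern) for pattern in exclude):
            continue
        selected.append(path)
    return selected


files = ["README.md", "data/train.csv", "data/test.csv", "data/notes.txt"]
print(select_files(files, include=["data/*.csv"]))
# ['data/train.csv', 'data/test.csv']
```

Note that with this kind of matching, `*` also matches `/`, so `data/*.csv` would select CSV files in subfolders of `data` as well.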

	### Upload to an organization

	To upload content to a repo owned by an organization instead of a personal repo, you must explicitly specify it in the `repo_id`:

	```bash
	>>> huggingface-cli upload MyCoolOrganization/my-cool-dataset . . --repo-type dataset
	https://huggingface.co/datasets/MyCoolOrganization/my-cool-dataset/tree/main/
	```

	### Upload to a specific revision

	By default, files are uploaded to the `main` branch. If you want to upload files to another branch or reference, use the `--revision` option:

	```bash
	# Upload files to a PR
	huggingface-cli upload bigcode/the-stack . . --repo-type dataset --revision refs/pr/104
	...
	```

	Note: if `revision` does not exist and `--create-pr` is not set, a branch will be created automatically from the `main` branch.

	### Upload and create a PR

	If you don't have the permission to push to a repo, you must open a PR and let the authors know about the changes you want to make. This can be done by setting the `--create-pr` option:

	```bash
	# Create a PR and upload the files to it
	>>> huggingface-cli upload bigcode/the-stack --repo-type dataset --revision refs/pr/104 --create-pr . .
	https://huggingface.co/datasets/bigcode/the-stack/blob/refs%2Fpr%2F104/
	```

	### Upload at regular intervals

	In some cases, you might want to push regular updates to a repo. For example, this is useful if your dataset is growing over time and you want to upload the data folder every 10 minutes. You can do this using the `--every` option:

	```bash
	# Upload the data folder every 10 minutes
	huggingface-cli upload my-cool-dynamic-dataset data/ --repo-type dataset --every=10
	```

	### Specify a commit message

	Use the `--commit-message` and `--commit-description` options to set a custom message and description for your commit instead of the default ones:

	```bash
	>>> huggingface-cli upload Wauplin/my-cool-dataset ./data . --repo-type dataset --commit-message="Version 2" --commit-description="Train size: 4321. Check Dataset Viewer for more details."
	...
	https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main
	```

	### Specify a token

	To upload files, you must use a token. By default, the token saved locally (using `huggingface-cli login`) will be used. If you want to authenticate explicitly, use the `--token` option:

	```bash
	>>> huggingface-cli upload Wauplin/my-cool-dataset ./data . --repo-type dataset --token=hf_****
	...
	https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main
	```

	### Quiet mode

	By default, the `huggingface-cli upload` command will be verbose. It will print details such as warning messages, information about the uploaded files, and progress bars. If you want to silence all of this, use the `--quiet` option. Only the last line (i.e. the URL to the uploaded files) is printed. This can prove useful if you want to pass the output to another command in a script.

	```bash
	>>> huggingface-cli upload Wauplin/my-cool-dataset ./data . --repo-type dataset --quiet
	https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main
	```
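When a script consumes the printed URL, it usually needs the repository and revision rather than the raw string. The following is a small sketch of such parsing, assuming the URL layout shown in the examples of this guide (`/datasets/<namespace>/<name>/<tree|blob>/<revision>/...`) and a revision that is either slash-free or percent-encoded. The function name `parse_upload_url` is hypothetical.

```python
from urllib.parse import unquote, urlparse


def parse_upload_url(url):
    """Split an upload URL into repo id, revision and path inside the repo."""
    parts = urlparse(url).path.strip("/").split("/")
    # Assumed layout: datasets/<namespace>/<name>/<tree|blob>/<revision>[/<path...>]
    if len(parts) < 5 or parts[0] != "datasets" or parts[3] not in ("tree", "blob"):
        raise ValueError(f"Unexpected dataset URL: {url}")
    return {
        "repo_id": f"{parts[1]}/{parts[2]}",
        "revision": unquote(parts[4]),  # e.g. "refs%2Fpr%2F104" -> "refs/pr/104"
        "path": "/".join(parts[5:]),
    }


info = parse_upload_url("https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main")
print(info["repo_id"], info["revision"])
```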

	## Enjoy!

	Congratulations, your dataset has now been uploaded to the Hugging Face Hub where anyone can load it in a single line of code! 🥳

	```
	from datasets import load_dataset
	dataset = load_dataset("Wauplin/my-cool-dataset")
	```

	If your dataset is supported, it should also have a [Dataset Viewer](https://huggingface.co/docs/hub/datasets-viewer) for everyone to explore the dataset content.

	Finally, don't forget to enrich the dataset card to document your dataset and make it discoverable! Check out the [Create a dataset card](dataset_card) guide to learn more.
