	# Ingesting Datasets

	Data generally lives in databases or cloud storage in forms that are not suited for AI workflows.
	Ingesting data to the [Hub](https://huggingface.co/datasets) publishes it as an AI-ready dataset, which enables easy and efficient data loading, processing, model training, and evaluation.

	## Using `huggingface_hub`

	The simplest way to ingest data is to upload the data files with `huggingface_hub`.

	The `huggingface_hub` Python library provides a rich feature set that allows you to manage repositories, including creating repos and uploading datasets to the Hub. Visit [the client library's documentation](/docs/huggingface_hub/index) to learn more.

	This approach fits when your data is static (frozen) and you can easily obtain a local dump in a format supported by the Hub (e.g., Parquet or JSON Lines) with a usable structure (e.g., well-defined fields for training and evaluation).
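
As a minimal sketch of this approach, assuming a local folder that already holds your Parquet or JSON Lines files, the helper below creates the dataset repository if needed and uploads the folder with `HfApi.create_repo` and `HfApi.upload_folder`. The helper name `upload_dump`, the repository id, and the folder path are illustrative placeholders, not part of the library.

```python
from pathlib import Path


def upload_dump(api, repo_id, folder):
    """Create the dataset repo if it does not exist, then upload `folder` to it.

    `api` is expected to be a `huggingface_hub.HfApi` instance.
    """
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"{folder} is not a directory")
    # exist_ok=True makes the call a no-op when the repo already exists
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    return api.upload_folder(
        folder_path=str(folder),
        repo_id=repo_id,
        repo_type="dataset",
    )


# Usage (requires `pip install huggingface_hub` and a configured access token):
#   from huggingface_hub import HfApi
#   upload_dump(HfApi(), "username/dataset_name", "path/to/local/dump")
```

Passing the client as an argument keeps the helper easy to test without network access.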

	## Using `dlt`

	[dlt](http://github.com/dlt-hub/dlt) is an open-source Python library for data movement (ETL), and is useful for developers (and their agents) building data pipelines.
	It can ingest data from diverse source types:

	* Cloud storage or files
	* REST APIs
	* SQL databases
	* Python generators

	Examples of source types:

	* `filesystem` (includes s3, gs, az, abfs, etc.)
	* `sql_database`, `mongodb`, `google_sheets`
	* `notion`, `hubspot`, `rest_api`

	Find your source type from the [list of sources](https://dlthub.com/docs/dlt-ecosystem/verified-sources) and create your `dlt` project:

	```bash
	dlt init <source_type> filesystem
	```

	You can then create a configuration file `.dlt/secrets.toml` in the root of your dlt project to define the Hub as a filesystem destination for your datasets, based on the `hf://` protocol:

	```toml
	[destination.filesystem]
	bucket_url = "hf://datasets/<namespace>"

	[destination.filesystem.credentials]
	hf_token = "hf_..." # Your Hugging Face Access Token
	```

	The namespace that follows `hf://datasets/` in `bucket_url` should be your username or the name of the organization (team) where you want to ingest your dataset.

	Each dlt dataset then creates or updates a Hugging Face dataset repository. The repository name is `<namespace>/<dataset_name>`, where `<namespace>` is the one used in `bucket_url` (your username or organization) and `<dataset_name>` is the pipeline's `dataset_name`.
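
The naming rule above can be sketched in a few lines. `hub_repo_id` is a hypothetical helper written for illustration only (it is not part of `dlt` or `huggingface_hub`); it assumes the `bucket_url` has the form `hf://datasets/` followed by a single namespace.

```python
def hub_repo_id(bucket_url, dataset_name):
    """Return the dataset repository id implied by a bucket_url and a dataset_name."""
    prefix = "hf://datasets/"
    if not bucket_url.startswith(prefix):
        raise ValueError(f"expected a URL starting with {prefix!r}, got {bucket_url!r}")
    # Everything after the prefix is the namespace (username or organization)
    namespace = bucket_url[len(prefix):].strip("/")
    if not namespace or "/" in namespace:
        raise ValueError(f"expected exactly one namespace in {bucket_url!r}")
    return f"{namespace}/{dataset_name}"


print(hub_repo_id("hf://datasets/my-org", "dataset_name"))  # my-org/dataset_name
```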

	Here is an example pipeline:

	```python
	import dlt
	import requests

	@dlt.resource
	def my_data():
	    # One of the functions auto-generated by `dlt init` that you can customize,
	    # or you can define your own Python generator function.

	    # Here is an example from the `chess` source type:
	    for player in ['magnuscarlsen', 'rpragchess']:
	        response = requests.get(f'https://api.chess.com/pub/player/{player}')
	        response.raise_for_status()
	        yield response.json()

	# Requires bucket_url = "hf://datasets/<namespace>" (your namespace) in .dlt/secrets.toml
	pipeline = dlt.pipeline(
	    pipeline_name="my_pipeline",
	    destination="filesystem",
	    dataset_name="dataset_name",
	)

	pipeline.run(my_data())
	```

	Customize the `dlt` resource to load the data you want and parse the fields you want to publish in your dataset, e.g. the text you need for training and evaluation.
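
One way to parse the fields you want is to wrap the records in a plain Python generator before yielding them from the resource. The sketch below is illustrative: `select_fields` is a hypothetical helper, and the sample record and its field names are made up for the example.

```python
def select_fields(records, fields):
    """Yield a copy of each record that keeps only the requested fields.

    Fields missing from a record are set to None so every row has the same columns.
    """
    for record in records:
        yield {name: record.get(name) for name in fields}


# Made-up sample record standing in for an API response
raw_records = [{"username": "player_one", "followers": 12, "internal_id": 99}]
for row in select_fields(raw_records, ["username", "followers"]):
    print(row)  # {'username': 'player_one', 'followers': 12}
```

Inside a `dlt` resource, the same helper could wrap the generator that produces the raw records, so that only the selected fields reach the published dataset.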

	## Using other libraries

	Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](./datasets-pandas), [Polars](./datasets-polars), [Dask](./datasets-dask), [DuckDB](./datasets-duckdb), [Spark](./datasets-spark), or [Daft](./datasets-daft) can ingest data from various places to the Hub.
	See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information.

	## Ingest raw data

	If you are ingesting raw data that need further curation before being published as AI-ready datasets or if you need an S3-like experience, consider ingesting them to [Hugging Face Storage Buckets](./storage-buckets).

	## Scheduled ingestion

	Updating the same file on the Hub thousands of times runs into limitations.
	For instance, you might want to ingest the generations of a running LLM inference server, live agent traces, or the logs of a running model training.
	In such cases, uploading the data as a dataset on the Hub makes sense, but it can be hard to do properly.
	The main reason is that versioning every update of your data would make the git repository unusable.

	Three options are available:

	* Use a Storage Bucket instead of a Dataset repository: [Storage Buckets](/docs/hub/storage-buckets) offer an S3-like experience that allows updating files very frequently, since they are not based on git. Storage Buckets are especially useful for data that are not ready to be published as a dataset, e.g. data that are still evolving or that need more curation.
	* Use a CommitScheduler: The `CommitScheduler` in `huggingface_hub` offers near real-time ingestion while keeping the git history of a Dataset repository manageable. It can be configured to make git commits at intervals defined in minutes.
	* Use Hugging Face Jobs to schedule ingestion scripts: Hugging Face Jobs provides a way to run and schedule Python scripts on Hugging Face infrastructure. Schedule ingestion scripts to run at intervals defined using the cron syntax.

	### High frequency using Storage Buckets

	Unlike Dataset repositories, which are based on git, Storage Buckets let you update files at a very high rate, offering quasi real-time ingestion.

	Use `batch_bucket_files()` in `huggingface_hub` to update files in a bucket:

	```python
	import os

	from huggingface_hub import batch_bucket_files

	def update_bucket(local_files):
	    # Upload each local file to the root of the bucket under its base name
	    destinations = [os.path.basename(local_file) for local_file in local_files]
	    batch_bucket_files(bucket_id="username/bucket_name", add=list(zip(local_files, destinations)))
	```

	Alternatively, you can append to files in a Bucket and `flush()` on every new item:

	```python
	import json

	from huggingface_hub import hffs

	# `live_texts_stream` is a placeholder for your own iterable of incoming texts
	with hffs.open("buckets/username/bucket_name/texts.jsonl", "a") as f:
	    for text in live_texts_stream:
	        f.write(json.dumps({"text": text}) + "\n")
	        f.flush()
	```

	The `HfFileSystem` is based on `fsspec`, which has a default block size of 5 MiB. Flushing therefore uploads data only once a full 5 MiB chunk of new data has been appended.
	To upload more often, lower `blocksize` in `hffs.open()` (e.g. `hffs.open(..., blocksize=100 * 2 ** 10)` for 100 KiB) or use `f.flush(force=True)`.
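
To get a feel for what the block size means in practice, the short calculation below estimates how many appended records it takes to fill one block. The 200-byte average record size is an assumption chosen for illustration; measure your own data to get a realistic figure.

```python
DEFAULT_BLOCKSIZE = 5 * 2**20  # 5 MiB, the default mentioned above
SMALL_BLOCKSIZE = 100 * 2**10  # 100 KiB, the lowered value from the example above


def records_to_fill_block(blocksize, avg_record_bytes):
    """Return how many records of the given average size are needed to fill one block."""
    # Integer ceiling division: a partially filled block does not count as full
    return (blocksize + avg_record_bytes - 1) // avg_record_bytes


print(records_to_fill_block(DEFAULT_BLOCKSIZE, 200))  # 26215
print(records_to_fill_block(SMALL_BLOCKSIZE, 200))    # 512
```

Under this assumption, the default block size delays the first upload by tens of thousands of records, while a 100 KiB block uploads after a few hundred.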

	Hugging Face storage is based on Xet, which enables efficient I/O when appending to files: uploads are deduplicated and only new data is uploaded.
	Find more information on doing dynamic data ingestion in buckets in the [buckets documentation on uploads](/docs/hub/storage-buckets#uploading-files) and in the [dataset editing documentation](./datasets-editing#only-upload-the-new-data).

	### Near real-time using a `CommitScheduler`

	The idea is to run a background job that regularly pushes a local folder to the Hub. This fits when you want to save data to the Hub (potentially millions of entries) but don't need to save each user's input in real time. Instead, you can append the data to a local JSON Lines file and upload it every 10 minutes. For example:

	```python
	import json
	from huggingface_hub import CommitScheduler

	folder_path = "path/to/files/to/ingest"
	every = 10  # ingest every 10 minutes

	with CommitScheduler(repo_id="username/dataset_name", repo_type="dataset", folder_path=folder_path, every=every) as scheduler:
	    # Write to the folder to ingest every 10 minutes
	    # For example (`text` is a placeholder for the item to ingest):
	    with open(folder_path + "/texts.jsonl", "a") as f:
	        f.write(json.dumps({"text": text}) + "\n")
	    ...
	```

	Check out how to ingest dynamic data without having to reupload everything every time in the documentation on [dataset editing](./datasets-editing#only-upload-the-new-data).

	Find more information on scheduled uploads in the [huggingface_hub documentation](/docs/huggingface_hub/guides/upload#scheduled-uploads).

	### Cron-based using Hugging Face Jobs

	Schedule Python scripts to ingest data at regular intervals.

	For example, to run a script `ingest.py` every 5 minutes:

	```bash
	hf jobs scheduled uv run "*/5 * * * *" ingest.py
	```

	Declare the script dependencies [in the header of the script](https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies) or use `--with`.
	For example, to run a `dlt` pipeline every day at midnight:

	```bash
	hf jobs scheduled uv run --with "dlt[hf]" "0 0 * * *" pipeline.py
	```

	You can check the logs of every run using `hf jobs logs` or directly in the Jobs page on your account on Hugging Face.

	Find more information about Hugging Face Jobs in the [Jobs documentation](/docs/hub/jobs-overview).
