| # Ingesting Datasets | |
| Data generally lives in databases or cloud storage in forms that are not suited for AI workflows. | |
| Ingesting data to the [Hub](https://huggingface.co/datasets) is a good way to publish it as AI-ready datasets, enabling easy and efficient data loading, processing, model training, and evaluation. | |
| ## Using `huggingface_hub` | |
| The simplest way to ingest data is to upload the data files with `huggingface_hub`. | |
| The `huggingface_hub` Python library provides a rich feature set that allows you to manage repositories, including creating repos and uploading datasets to the Hub. Visit [the client library's documentation](/docs/huggingface_hub/index) to learn more. | |
| This is relevant if your data is static/frozen and if you can easily obtain a local dump of the data in a format supported by the Hub (e.g., Parquet or JSON Lines) with a usable structure (e.g., well-defined fields for training and evaluation). | |
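As a rough illustration of this path, the sketch below exports a few records to a JSON Lines file and uploads the containing folder with `HfApi.create_repo` and `HfApi.upload_folder`. The helper names `write_jsonl` and `publish_folder`, the `export` folder, and the `username/dataset_name` repository are placeholders, and the upload step assumes you are already authenticated with a token that has write access.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write an iterable of dicts to a JSON Lines file, one object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def publish_folder(folder_path, repo_id):
    """Create the dataset repository if needed, then upload the whole folder."""
    # Imported lazily so the local export step works without the library installed.
    from huggingface_hub import HfApi

    api = HfApi()
    # exist_ok=True makes the call a no-op when the repository already exists.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(
        folder_path=str(folder_path),
        repo_id=repo_id,
        repo_type="dataset",
    )
```

For example, calling `write_jsonl(rows, "export/train.jsonl")` and then `publish_folder("export", "username/dataset_name")` would publish a single `train.jsonl` file to that dataset repository.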
| ## Using `dlt` | |
| [dlt](http://github.com/dlt-hub/dlt) is an open-source Python library for data movement (ETL), and is useful for developers (and their agents) building data pipelines. | |
| It can ingest data from diverse source types: | |
| * Cloud storage or files | |
| * REST APIs | |
| * SQL databases | |
| * Python generators | |
| Examples of source types: | |
| * `filesystem` (includes s3, gs, az, abfss, etc.) | |
| * `sql_database`, `mongodb`, `google_sheets` | |
| * `notion`, `hubspot`, `rest_api` | |
| Find your source type from the [list of sources](https://dlthub.com/docs/dlt-ecosystem/verified-sources) and create your `dlt` project: | |
| ``` | |
| dlt init filesystem | |
| ``` | |
| You can then create a configuration file `.dlt/secrets.toml` in the root of your dlt project to define the Hub as a filesystem destination for your datasets, based on the `hf://` protocol: | |
| ```toml | |
| [destination.filesystem] | |
| bucket_url = "hf://datasets/<namespace>" | |
| [destination.filesystem.credentials] | |
| hf_token = "hf_..." # Your Hugging Face Access Token | |
| ``` | |
| The `<namespace>` in `bucket_url` should be your user name or the name of the organization/team where you want to ingest your dataset. | |
| Each dlt dataset then creates or updates a Hugging Face dataset repository. The repository name is `<namespace>/<dataset_name>`, where `<namespace>` is the same one you used in the `bucket_url` and `<dataset_name>` is the pipeline's `dataset_name`. | |
| Here is an example pipeline: | |
| ```python | |
| import dlt | |
| import requests | |
| @dlt.resource | |
| def my_data(): | |
|     # One of the functions auto-generated by `dlt init` that you can customize, | |
|     # or you can define your own Python generator function. | |
|     # Here is an example from the `chess` source type: | |
|     for player in ['magnuscarlsen', 'rpragchess']: | |
|         response = requests.get(f'https://api.chess.com/pub/player/{player}') | |
|         response.raise_for_status() | |
|         yield response.json() | |
| # Requires bucket_url = "hf://datasets/<namespace>" in .dlt/secrets.toml | |
| pipeline = dlt.pipeline( | |
|     pipeline_name="my_pipeline", | |
|     destination="filesystem", | |
|     dataset_name="dataset_name", | |
| ) | |
| pipeline.run(my_data()) | |
| ``` | |
| Customize the `dlt` resource to load the data you want and parse the fields you want to publish in your dataset, e.g. the text you need for training and evaluation. | |
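One possible way to do this, sketched under assumptions: a plain generator trims each record to the fields you want before it is handed to `dlt`. The helper names `keep_fields` and `run_pipeline`, the `players` table name, and the `username` and `country` fields are illustrative choices, not part of the generated project.

```python
def keep_fields(records, fields):
    """Yield one flat dict per record, keeping only the requested fields."""
    for record in records:
        # Missing fields become None so every row has the same columns.
        yield {field: record.get(field) for field in fields}


def run_pipeline(records):
    """Load the trimmed records with dlt (needs dlt and the Hub destination config)."""
    # Imported lazily so keep_fields can be tried without dlt installed.
    import dlt

    pipeline = dlt.pipeline(
        pipeline_name="my_pipeline",
        destination="filesystem",
        dataset_name="dataset_name",
    )
    # Wrap the generator as a named resource; "players" is a hypothetical table name.
    resource = dlt.resource(keep_fields(records, ["username", "country"]), name="players")
    return pipeline.run(resource)
```

Keeping the field selection in a separate generator makes it easy to check on a few sample records before running the full pipeline.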
| ## Using other libraries | |
| Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](./datasets-pandas), [Polars](./datasets-polars), [Dask](./datasets-dask), [DuckDB](./datasets-duckdb), [Spark](./datasets-spark), or [Daft](./datasets-daft) can ingest data from various places to the Hub. | |
| See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information. | |
| ## Ingest raw data | |
| If you are ingesting raw data that need further curation before being published as AI-ready datasets or if you need an S3-like experience, consider ingesting them to [Hugging Face Storage Buckets](./storage-buckets). | |
| ## Scheduled ingestion | |
| There are some limitations when updating the same file on the Hub thousands of times. | |
| For instance, you might want to ingest generations of a running LLM inference server, live agent traces, or logs of a running model training. | |
| In such cases, uploading the data as a dataset on the Hub makes sense, but it can be hard to do properly. | |
| The main reason is that you don’t want to version every update of your data because it’ll make the git repository unusable. | |
| Three options are available: | |
| * **Use a Storage Bucket instead of a Dataset repository:** [Storage Buckets](/docs/hub/storage-buckets) offer an S3-like experience that allows updating files very frequently, since they are not based on git. Storage Buckets are especially useful for data that are not ready to be published as a dataset, e.g. data that are still evolving or that need more curation. | |
| * **Use a CommitScheduler**: The `CommitScheduler` in `huggingface_hub` offers near real-time ingestion while keeping the git history of a Dataset repository manageable. It can be configured to make git commits at intervals defined in minutes. | |
| * **Use Hugging Face Jobs to schedule ingestion scripts**: Hugging Face Jobs provides a way to run and schedule Python scripts on Hugging Face infrastructure. Schedule ingestion scripts to run at intervals defined using the Cron syntax. | |
| ### High frequency using Storage Buckets | |
| Contrary to Dataset repositories, which are based on git, Storage Buckets let you update files at a very high rate, offering quasi real-time ingestion. | |
| Use `batch_bucket_files()` in `huggingface_hub` to update files in a bucket: | |
| ```python | |
| import os | |
| from huggingface_hub import batch_bucket_files | |
| def update_bucket(local_files): | |
|     # Store each file at the bucket root under its base name | |
|     destinations = [os.path.basename(local_file) for local_file in local_files] | |
|     batch_bucket_files(bucket_id="username/bucket_name", add=list(zip(local_files, destinations))) | |
| ``` | |
| Alternatively, you can append to files in a Bucket and `flush()` on every new item: | |
| ```python | |
| import json | |
| from huggingface_hub import hffs | |
| with hffs.open("buckets/username/bucket_name/texts.jsonl", "a") as f: | |
|     # `live_texts_stream` stands for any iterable producing new texts | |
|     for text in live_texts_stream: | |
|         f.write(json.dumps({"text": text}) + "\n") | |
|         f.flush() | |
| ``` | |
| The `HfFileSystem` is based on `fsspec`, which has a default block size of 5 MiB. Flushing therefore only uploads data once a full 5 MiB chunk of new data has been appended. | |
| If you want to upload more often, lower `blocksize` in `hffs.open()` (e.g. `hffs.open(..., blocksize=100 * 2 ** 10)` for 100 KiB) or use `f.flush(force=True)`. | |
| Hugging Face storage is based on Xet which enables efficient I/O when appending to files: uploads are deduplicated and only new data are uploaded. | |
| Find more information on doing dynamic data ingestion in buckets in the [buckets documentation on uploads](/docs/hub/storage-buckets#uploading-files) and in the [dataset editing documentation](./datasets-editing#only-upload-the-new-data). | |
| ### Near real-time using a `CommitScheduler` | |
| The idea is to run a background job that regularly pushes a local folder to the Hub. This fits cases where you want to save data to the Hub (potentially millions of entries) but don’t need to save each entry in real time. Instead, you can append the data to a local JSON Lines file and upload it every 10 minutes. For example: | |
| ```python | |
| import json | |
| from huggingface_hub import CommitScheduler | |
| folder_path = "path/to/files/to/ingest" | |
| every = 10  # ingest every 10min | |
| with CommitScheduler(repo_id="username/dataset_name", repo_type="dataset", folder_path=folder_path, every=every) as scheduler: | |
|     # Write to the folder to ingest every 10min | |
|     # For example (`text` stands for any new item your application produces): | |
|     with open(folder_path + "/texts.jsonl", "a") as f: | |
|         f.write(json.dumps({"text": text}) + "\n") | |
|     ... | |
| ``` | |
| Check out how to ingest dynamic data without having to reupload everything every time in the documentation on [dataset editing](./datasets-editing#only-upload-the-new-data). | |
| Find more information on scheduled uploads in the [huggingface_hub documentation](/docs/huggingface_hub/guides/upload#scheduled-uploads). | |
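If your application writes from several threads, a scheduled commit could in principle run while a line is only half written. Below is a minimal sketch of one way to guard against that, assuming you pass the scheduler's `lock` attribute as the lock (any `threading.Lock` works for local testing). `append_record` is a hypothetical helper name.

```python
import json
import threading
from pathlib import Path


def append_record(path, record, lock):
    """Append one JSON line while holding the lock.

    Holding the same lock as the uploader means a commit never
    picks up a partially written line.
    """
    with lock:
        with Path(path).open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
```

Inside a `with CommitScheduler(...) as scheduler:` block, this would be called as `append_record(folder_path + "/texts.jsonl", {"text": text}, scheduler.lock)`.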
| ### Cron-based using Hugging Face Jobs | |
| Schedule Python scripts on Hugging Face infrastructure to ingest data at regular intervals. | |
| For example, to run a script `ingest.py` every 5 minutes: | |
| ```bash | |
| hf jobs scheduled uv run "*/5 * * * *" ingest.py | |
| ``` | |
| Declare the script dependencies [in the header of the script](https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies) or use `--with`. | |
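To make the header format concrete, here is a sketch of what a hypothetical `ingest.py` could look like, with its dependencies declared as inline script metadata. The `write_snapshot` helper, the `REPO_ID` constant, and the placeholder records are assumptions for illustration. The upload uses `HfApi.upload_file` and is skipped until `REPO_ID` is set, so the script can be tried locally first.

```python
# /// script
# requires-python = ">=3.9"
# dependencies = ["huggingface_hub"]
# ///
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Placeholder: set to e.g. "username/dataset_name" to enable the upload step.
REPO_ID = None


def write_snapshot(records, folder):
    """Write records to a new timestamped JSON Lines file and return its path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = folder / f"batch-{stamp}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def main():
    records = [{"text": "example"}]  # placeholder for the records fetched by this run
    path = write_snapshot(records, tempfile.mkdtemp())
    if REPO_ID is None:
        print(f"Wrote {path}; REPO_ID is not set, skipping upload")
        return
    # Resolved by uv from the dependencies declared in the header above.
    from huggingface_hub import HfApi

    HfApi().upload_file(
        path_or_fileobj=str(path),
        path_in_repo=f"data/{path.name}",
        repo_id=REPO_ID,
        repo_type="dataset",
    )


if __name__ == "__main__":
    main()
```

Writing each run to a new timestamped file, rather than rewriting one large file, means every scheduled run only adds the new data.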
| For example, to run a `dlt` pipeline every day at midnight: | |
| ```bash | |
| hf jobs scheduled uv run --with "dlt[hf]" "0 0 * * *" pipeline.py | |
| ``` | |
| You can check the logs of every run using `hf jobs logs` or directly in the Jobs page on your account on Hugging Face. | |
| Find more information about Hugging Face Jobs in the [Jobs documentation](/docs/hub/jobs-overview). | |