Using Storage Buckets as a Working Layer for Data Pipelines
The old approach: download everything, merge, re-upload
The previous pipeline ran on GitHub Actions every few hours:
- Download the entire existing dataset from the Hub (multiple GB of parquet)
- Fetch a batch of new/updated cards from the Hub API
- Merge new + existing, deduplicate
- Re-upload everything
Every run moved gigabytes of unchanged data. It was slow, fragile, and required disk space hacks in CI.
The new approach: Buckets as active storage
Storage Buckets are mutable, non-versioned object storage on the Hub. Unlike dataset repos (which are git-backed and version every change), buckets are designed for data that's actively being written to — no version history accumulating, no git overhead, just files you can write, overwrite, and sync freely.
Under the hood, buckets are powered by Xet, which breaks content into chunks and deduplicates across files. This is particularly useful for incremental pipelines: when successive fetch runs write overlapping data, only the genuinely new chunks need to be transferred. The same applies to `sync_bucket()` — it compares chunks, not whole files, so syncing a bucket where most data hasn't changed is fast.
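Xet's actual chunking is content-defined and handled by the client library, but the payoff for incremental writes can be sketched with a toy model — fixed-size chunks and SHA-256 here are simplifications, not Xet's real scheme:

```python
import hashlib

def chunk_hashes(data: bytes, chunk_size: int = 64) -> list[str]:
    """Split data into fixed-size chunks and hash each one."""
    return [
        hashlib.sha256(data[i : i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

run1 = b"record-A\n" * 100                        # first fetch run
run2 = b"record-A\n" * 100 + b"record-B\n" * 10   # next run overlaps heavily

seen = set(chunk_hashes(run1))
new_chunks = [h for h in chunk_hashes(run2) if h not in seen]

# Only the chunks covering the genuinely new records need uploading
print(f"{len(new_chunks)} new of {len(chunk_hashes(run2))} total chunks")
```

Even though `run2` is a superset of `run1`, almost all of its chunks are already known, so only a small tail would need to cross the wire.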
This makes buckets a natural fit as a working layer in a data pipeline:
```
[Fetch Jobs]  →  [Storage Bucket]   →  [Compile Job]  →  [Dataset Repo]
 (frequent)      (active storage)       (daily)          (published data)
```
Fetch jobs run frequently, collecting data and appending it to the bucket. They never need to read what's already there — they just write. If a run fails, previous data is untouched.
The bucket accumulates results over time. It's the single source of truth for raw collected data. Xet deduplication keeps storage efficient even as data accumulates across hundreds of runs.
A compile job runs daily, reads everything from the bucket, deduplicates, and publishes a clean dataset to a versioned repo. Consumers see a stable parquet dataset that updates once a day.
The key property: each stage only writes forward. Fetch jobs never read the bucket. The compile job never modifies the bucket. The published dataset can be regenerated from the bucket at any time.
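The write-forward contract can be sketched as a minimal local simulation — plain files stand in for the bucket, and the names are illustrative, not the pipeline's actual code:

```python
import json
import tempfile
from pathlib import Path

bucket = Path(tempfile.mkdtemp())  # stands in for the Storage Bucket

def fetch(run_id: int, records: list[dict]) -> None:
    """Fetch job: writes a new batch; never reads or modifies existing files."""
    path = bucket / "data" / f"run-{run_id:04d}.jsonl"
    path.parent.mkdir(exist_ok=True)
    path.write_text("\n".join(json.dumps(r) for r in records))

def compile_dataset() -> list[dict]:
    """Compile job: reads everything, keeps the latest record per id.

    Never writes to the bucket, so the output is regenerable at any time.
    """
    latest: dict[str, dict] = {}
    for f in sorted((bucket / "data").glob("*.jsonl")):  # later runs win
        for line in f.read_text().splitlines():
            rec = json.loads(line)
            latest[rec["id"]] = rec
    return list(latest.values())

fetch(1, [{"id": "a", "v": 1}, {"id": "b", "v": 1}])
fetch(2, [{"id": "a", "v": 2}])   # overlapping data, no merge step needed
dataset = compile_dataset()
```

A failed `fetch` leaves every earlier batch untouched, and re-running `compile_dataset` always reproduces the same published view from the bucket's contents.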
What this looks like in practice
Fetching: append JSONL batches to the bucket
Each fetch run writes its results as a single JSONL file:
```python
from huggingface_hub import batch_bucket_files

jsonl_data = "\n".join(record.model_dump_json() for record in results)

batch_bucket_files(
    "my-org/my-pipeline-bucket",
    add=[(jsonl_data.encode(), f"data/{timestamp}.jsonl")],
)
```
No need to read existing data. No merge step. Just append.
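The snippet above leaves `results` and `timestamp` undefined; here is one self-contained way they might be produced, using plain dicts in place of Pydantic models (the field names and timestamp format are illustrative):

```python
import json
from datetime import datetime, timezone

# Plain-dict records standing in for the Pydantic models above
results = [
    {"id": "card-1", "last_modified": "2024-01-01T00:00:00Z"},
    {"id": "card-2", "last_modified": "2024-01-02T00:00:00Z"},
]

# One JSONL payload per run, named by fetch time so batches never collide
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
jsonl_data = "\n".join(json.dumps(record) for record in results)
path_in_bucket = f"data/{timestamp}.jsonl"
```

Because each run writes to a fresh timestamped path, concurrent or retried runs can't clobber each other's batches.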
Compiling: sync, process, publish
The daily compile syncs the bucket locally, processes with Polars, and pushes to the Hub:
```python
from pathlib import Path

import polars as pl
from datasets import Dataset
from huggingface_hub import sync_bucket

local_dir = Path("bucket_data")

# Bulk sync from bucket
sync_bucket(
    source="hf://buckets/my-org/my-pipeline-bucket/data",
    dest=local_dir,
)

# Polars LazyFrame: dedup in one lazy pass, keeping the latest record per id
jsonl_files = sorted(local_dir.glob("*.jsonl"))
lf = pl.concat([pl.scan_ndjson(f) for f in jsonl_files])
lf = lf.sort("last_modified", descending=True)
lf = lf.unique(subset=["id"], keep="first")
df = lf.collect()

Dataset.from_polars(df).push_to_hub("my-org/my-dataset")
```
In this pipeline, that compile step processes hundreds of thousands of cards in about a minute.
Scheduling with HF Jobs
HF Jobs lets you run compute on HF infrastructure — from CPUs to A100s — with pay-per-second pricing. Scheduled Jobs add cron-style scheduling, so you can set up recurring pipelines without any CI/CD infrastructure.
Both steps run as scheduled UV scripts — dependencies are declared inline in the script header, so there's no Docker image to build or requirements.txt to maintain:
```shell
# Fetch every 2 hours
hf jobs scheduled uv run "8 */2 * * *" \
    -s HF_TOKEN --timeout 30m \
    https://huggingface.co/datasets/my-org/my-pipeline/resolve/main/fetch.py

# Compile daily
hf jobs scheduled uv run "3 3 * * *" \
    -s HF_TOKEN --flavor cpu-upgrade --timeout 1h \
    https://huggingface.co/datasets/my-org/my-pipeline/resolve/main/compile.py
```
A few things that make this workflow nice:
- Scripts live in an HF repo and are referenced by URL. Updating the pipeline is just a `git push`; the next scheduled run picks up the new code automatically.
- Secrets are passed securely with `-s`. `HF_TOKEN` is a special shorthand that passes your logged-in token.
- Hardware is configurable per job. The fetch runs on `cpu-basic` (I/O bound), the compile on `cpu-upgrade` (needs more RAM for the full dataset).
- Logs and status are available via `hf jobs logs` and `hf jobs scheduled ps`, or on the web UI.
- You can manage schedules with `hf jobs scheduled suspend`, `resume`, and `delete`; no config files to edit.
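The inline dependency header that uv reads is PEP 723 script metadata at the top of each file; for the fetch script it could look something like this (the dependency list is illustrative):

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "huggingface_hub",
#     "pydantic",
# ]
# ///
```

uv resolves and installs these into an ephemeral environment each run, which is why there's no Docker image or requirements.txt to keep in sync with the code.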
Design choices
JSONL batches, not one file per record. I initially stored one JSON file per record in the bucket — natural deduplication via filename. But with hundreds of thousands of records, syncing that many small files became the bottleneck. Switching to one JSONL file per batch (hundreds of files instead of hundreds of thousands) made the compile step fast. Deduplication moved to compile time, which Polars handles in seconds.
Bucket as recoverable source of truth. Each layer is independently recoverable. If the compile step breaks, the bucket still has all the data. If the bucket is lost, you re-fetch from the source API. The published dataset is a derived artifact that can be regenerated at any time.
When this pattern works well
This "bucket as working layer" approach fits pipelines where:
- Data is collected incrementally — you fetch new data on a schedule and accumulate it over time
- The published dataset is a processed view — deduplication, filtering, or transformation happens before publishing
- Fault tolerance matters — a failed fetch shouldn't corrupt existing data, and the published dataset should be rebuildable
For more on data ingestion patterns, see the Ingesting Datasets docs, which cover Buckets, Jobs, CommitScheduler, dlt, and other approaches.