Using Storage Buckets as a Working Layer for Data Pipelines
The old approach: download everything, merge, re-upload
The previous pipeline ran on GitHub Actions every few hours:
- Download the entire existing dataset from the Hub (multiple GB of parquet)
- Fetch a batch of new/updated cards from the Hub API
- Merge new + existing, deduplicate
- Re-upload everything
Every run moved gigabytes of unchanged data. It was slow, fragile, and required disk space hacks in CI.
The new approach: Buckets as active storage
Storage Buckets are mutable, non-versioned object storage on the Hub. Unlike dataset repos (which are git-backed and version every change), buckets are designed for data that's actively being written to — no version history accumulating, no git overhead, just files you can write, overwrite, and sync freely.
Under the hood, buckets are powered by Xet, which breaks content into chunks and deduplicates across files. This is particularly useful for incremental pipelines: when successive fetch runs write overlapping data, only the genuinely new chunks need to be transferred. The same applies to `sync_bucket()` — it compares chunks, not whole files, so syncing a bucket where most data hasn't changed is fast.
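Xet's actual chunking is content-defined and handled by the client library, but the payoff for incremental writes can be sketched with a toy model — fixed-size chunks and SHA-256 here are simplifications, not Xet's real scheme:

```python
import hashlib

def chunk_hashes(data: bytes, chunk_size: int = 64) -> list[str]:
    """Split data into fixed-size chunks and hash each one."""
    return [
        hashlib.sha256(data[i : i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

run1 = b"record-A\n" * 100                        # first fetch run
run2 = b"record-A\n" * 100 + b"record-B\n" * 10   # next run overlaps heavily

seen = set(chunk_hashes(run1))
new_chunks = [h for h in chunk_hashes(run2) if h not in seen]

# Only the chunks covering the genuinely new records need uploading
print(f"{len(new_chunks)} new of {len(chunk_hashes(run2))} total chunks")
```

Even though `run2` is a superset of `run1`, almost all of its chunks are already known, so only a small tail would need to cross the wire.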
This makes buckets a natural fit as a working layer in a data pipeline:
```
[Fetch Jobs]  →  [Storage Bucket]   →  [Compile Job]  →  [Dataset Repo]
 (frequent)      (active storage)       (daily)          (published data)
```
Fetch jobs run frequently, collecting data and appending it to the bucket. They never need to read what's already there — they just write. If a run fails, previous data is untouched.
The bucket accumulates results over time. It's the single source of truth for raw collected data. Xet deduplication keeps storage efficient even as data accumulates across hundreds of runs.
A compile job runs daily, reads everything from the bucket, deduplicates, and publishes a clean dataset to a versioned repo. Consumers see a stable parquet dataset that updates once a day.
The key property: each stage only writes forward. Fetch jobs never read the bucket. The compile job never modifies the bucket. The published dataset can be regenerated from the bucket at any time.
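The write-forward contract can be sketched as a minimal local simulation — plain files stand in for the bucket, and the names are illustrative, not the pipeline's actual code:

```python
import json
import tempfile
from pathlib import Path

bucket = Path(tempfile.mkdtemp())  # stands in for the Storage Bucket

def fetch(run_id: int, records: list[dict]) -> None:
    """Fetch job: writes a new batch; never reads or modifies existing files."""
    path = bucket / "data" / f"run-{run_id:04d}.jsonl"
    path.parent.mkdir(exist_ok=True)
    path.write_text("\n".join(json.dumps(r) for r in records))

def compile_dataset() -> list[dict]:
    """Compile job: reads everything, keeps the latest record per id.

    Never writes to the bucket, so the output is regenerable at any time.
    """
    latest: dict[str, dict] = {}
    for f in sorted((bucket / "data").glob("*.jsonl")):  # later runs win
        for line in f.read_text().splitlines():
            rec = json.loads(line)
            latest[rec["id"]] = rec
    return list(latest.values())

fetch(1, [{"id": "a", "v": 1}, {"id": "b", "v": 1}])
fetch(2, [{"id": "a", "v": 2}])   # overlapping data, no merge step needed
dataset = compile_dataset()
```

A failed `fetch` leaves every earlier batch untouched, and re-running `compile_dataset` always reproduces the same published view from the bucket's contents.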
What this looks like in practice
Fetching: append JSONL batches to the bucket
Each fetch run writes its results as a single JSONL file:
```python
from huggingface_hub import batch_bucket_files

jsonl_data = "\n".join(record.model_dump_json() for record in results)

batch_bucket_files(
    "my-org/my-pipeline-bucket",
    add=[(jsonl_data.encode(), f"data/{timestamp}.jsonl")],
)
```
No need to read existing data. No merge step. Just append.
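The snippet above leaves `results` and `timestamp` undefined; here is one self-contained way they might be produced, using plain dicts in place of Pydantic models (the field names and timestamp format are illustrative):

```python
import json
from datetime import datetime, timezone

# Plain-dict records standing in for the Pydantic models above
results = [
    {"id": "card-1", "last_modified": "2024-01-01T00:00:00Z"},
    {"id": "card-2", "last_modified": "2024-01-02T00:00:00Z"},
]

# One JSONL payload per run, named by fetch time so batches never collide
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
jsonl_data = "\n".join(json.dumps(record) for record in results)
path_in_bucket = f"data/{timestamp}.jsonl"
```

Because each run writes to a fresh timestamped path, concurrent or retried runs can't clobber each other's batches.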
Compiling: sync, process, publish
The daily compile syncs the bucket locally, processes with Polars, and pushes to the Hub:
```python
from pathlib import Path

import polars as pl
from datasets import Dataset
from huggingface_hub import sync_bucket

local_dir = Path("bucket_data")

# Bulk sync from bucket
sync_bucket(
    source="hf://buckets/my-org/my-pipeline-bucket/data",
    dest=local_dir,
)

# Polars LazyFrame: dedup in one lazy pass, keeping the latest record per id
jsonl_files = sorted(local_dir.glob("*.jsonl"))
lf = pl.concat([pl.scan_ndjson(f) for f in jsonl_files])
lf = lf.sort("last_modified", descending=True)
lf = lf.unique(subset=["id"], keep="first")
df = lf.collect()

Dataset.from_polars(df).push_to_hub("my-org/my-dataset")
```
In this pipeline, that compile step processes hundreds of thousands of cards in about a minute.
Scheduling with HF Jobs
HF Jobs lets you run compute on HF infrastructure — from CPUs to A100s — with pay-per-second pricing. Scheduled Jobs add cron-style scheduling, so you can set up recurring pipelines without any CI/CD infrastructure.
Both steps run as scheduled UV scripts — dependencies are declared inline in the script header, so there's no Docker image to build or requirements.txt to maintain:
```shell
# Fetch every 2 hours
hf jobs scheduled uv run "8 */2 * * *" \
    -s HF_TOKEN --timeout 30m \
    https://huggingface.co/datasets/my-org/my-pipeline/resolve/main/fetch.py

# Compile daily
hf jobs scheduled uv run "3 3 * * *" \
    -s HF_TOKEN --flavor cpu-upgrade --timeout 1h \
    https://huggingface.co/datasets/my-org/my-pipeline/resolve/main/compile.py
```
A few things that make this workflow nice:
- Scripts live in an HF repo and are referenced by URL. Updating the pipeline is just a `git push`; the next scheduled run picks up the new code automatically.
- Secrets are passed securely with `-s`. `HF_TOKEN` is a special shorthand that passes your logged-in token.
- Hardware is configurable per job. The fetch runs on `cpu-basic` (I/O bound), the compile on `cpu-upgrade` (needs more RAM for the full dataset).
- Logs and status are available via `hf jobs logs` and `hf jobs scheduled ps`, or on the web UI.
- You can manage schedules with `hf jobs scheduled suspend`, `resume`, and `delete`; no config files to edit.
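The inline dependency header that uv reads is PEP 723 script metadata at the top of each file; for the fetch script it could look something like this (the dependency list is illustrative):

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "huggingface_hub",
#     "pydantic",
# ]
# ///
```

uv resolves and installs these into an ephemeral environment each run, which is why there's no Docker image or requirements.txt to keep in sync with the code.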
Design choices
JSONL batches, not one file per record. I initially stored one JSON file per record in the bucket — natural deduplication via filename. But with hundreds of thousands of records, syncing that many small files became the bottleneck. Switching to one JSONL file per batch (hundreds of files instead of hundreds of thousands) made the compile step fast. Deduplication moved to compile time, which Polars handles in seconds.
Bucket as recoverable source of truth. Each layer is independently recoverable. If the compile step breaks, the bucket still has all the data. If the bucket is lost, you re-fetch from the source API. The published dataset is a derived artifact that can be regenerated at any time.
When this pattern works well
This "bucket as working layer" approach fits pipelines where:
- Data is collected incrementally — you fetch new data on a schedule and accumulate it over time
- The published dataset is a processed view — deduplication, filtering, or transformation happens before publishing
- Fault tolerance matters — a failed fetch shouldn't corrupt existing data, and the published dataset should be rebuildable
For more on data ingestion patterns, see the Ingesting Datasets docs, which cover Buckets, Jobs, CommitScheduler, dlt, and other approaches.