the-hf-stack/dagster-hf-datasets-examples / basic_hub_ingestion /README.md

AINovice2005

Basic Hub Ingestion

Load a public Hugging Face dataset as a Dagster asset with automatic metadata extraction.

What this example shows

Using @hf_dataset_asset to declare a Hub-backed Dagster asset
Returning MaterializeResult with dataset metadata (rows, columns, fingerprint)
Registering HuggingFaceResource and HFParquetIOManager in Definitions
Materializing two independent splits (train, test) as separate assets

Dataset

stanfordnlp/imdb: 50,000 movie reviews with binary sentiment labels. Small, text-only, and requires no authentication.

Split	Rows
train	25,000
test	25,000

Key API

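from dagster import AssetExecutionContext, MaterializeResult
from datasets import Dataset
# hf_dataset_asset comes from the integration package used by this example
# (its import path is not shown in this README).

@hf_dataset_asset(
    path="stanfordnlp/imdb",
    split="train",
    io_manager_key="hf_parquet_io_manager",
)
def imdb_train(context: AssetExecutionContext, dataset: Dataset) -> MaterializeResult:
    ...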

Note: Dagster does not call the decorated function body to load the dataset. HuggingFaceResource performs the load and passes the result in as the dataset parameter. The function body is where you inspect the dataset, log, and return the result.
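To make the note above concrete, here is a toy sketch of the "load outside, inspect inside" pattern. It is not the integration's code: toy_dataset_asset, fake_loader, and toy_asset are hypothetical names, the loader returns a plain list instead of a Dataset, and the context argument is omitted for brevity.

```python
from functools import wraps
from typing import Any, Callable


def toy_dataset_asset(loader: Callable[[], Any]) -> Callable:
    """Hypothetical stand-in for @hf_dataset_asset; not the library's code."""

    def decorator(body: Callable[[Any], Any]) -> Callable[[], Any]:
        @wraps(body)
        def run() -> Any:
            dataset = loader()    # the load happens outside the body
            return body(dataset)  # the body only receives the result

        return run

    return decorator


def fake_loader() -> list:
    # Stand-in for the Hub download that the resource would perform.
    return [{"text": "great film", "label": 1}, {"text": "dull", "label": 0}]


@toy_dataset_asset(fake_loader)
def toy_asset(dataset: list) -> dict:
    # Inspect and summarise; no loading logic lives here.
    return {"rows": len(dataset), "columns": sorted(dataset[0])}


result = toy_asset()
```

The point of the split is that the body stays a pure inspection step, so swapping the dataset path or split never requires touching it.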

Storage layout

After materialization, HFParquetIOManager writes:

.dagster_hf_storage/
├── imdb_train/          # Arrow format via save_to_disk()
└── imdb_test/
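
Given the layout above, a materialized split can presumably be read back outside Dagster. The sketch below assumes one directory per asset under .dagster_hf_storage/, relative to the working directory, and that the directories were written with save_to_disk() as the comment indicates. storage_path and load_materialized are hypothetical helper names; load_from_disk is the datasets library's counterpart to save_to_disk().

```python
from pathlib import Path

STORAGE_ROOT = Path(".dagster_hf_storage")  # directory name taken from the layout above


def storage_path(asset_name: str, root: Path = STORAGE_ROOT) -> Path:
    """Return the directory an asset would occupy, assuming one folder per asset."""
    return root / asset_name


def load_materialized(asset_name: str):
    """Reload a materialized split with the datasets library (imported lazily)."""
    from datasets import load_from_disk  # requires the datasets package

    return load_from_disk(str(storage_path(asset_name)))
```

For example, load_materialized("imdb_train") would return the stored split, provided the asset has been materialized from the same working directory.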

How to run

cd dagster_hf_datasets_examples

dagster dev -m basic_hub_ingestion.definitions

Then open http://localhost:3000, navigate to the Asset Catalog, and materialize imdb_train and imdb_test.

Metadata visible in the Dagster UI

Key	Description
`rows`	Row count for the materialized split
`columns`	List of column names
`source_dataset`	Hub dataset identifier
`split`	Which split was loaded
`fingerprint`	Reproducibility hash from the datasets library
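
As a rough illustration of how the five keys above could be assembled inside an asset body, the sketch below builds them from any object exposing num_rows and column_names, as a datasets Dataset does. build_metadata and StubDataset are hypothetical names, and reading the hash from the private _fingerprint attribute is an assumption about the datasets library rather than something this README specifies.

```python
from typing import Any, Dict


def build_metadata(dataset: Any, source_dataset: str, split: str) -> Dict[str, Any]:
    """Collect the five metadata keys from the table into a plain dict."""
    return {
        "rows": dataset.num_rows,
        "columns": list(dataset.column_names),
        "source_dataset": source_dataset,
        "split": split,
        # Assumption: the hash lives on `_fingerprint`; falls back to None if absent.
        "fingerprint": getattr(dataset, "_fingerprint", None),
    }


class StubDataset:
    """Tiny stand-in so the sketch runs without downloading anything."""

    num_rows = 25_000
    column_names = ["text", "label"]
    _fingerprint = "abc123"  # placeholder value, not a real hash


metadata = build_metadata(StubDataset(), "stanfordnlp/imdb", "train")
```

In a real asset body, a dict like this could be passed as MaterializeResult(metadata=...), which is how Dagster surfaces per-materialization metadata in the UI.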
