Basic Hub Ingestion

Load a public Hugging Face dataset as a Dagster asset with automatic metadata extraction.

What this example shows

  • Using @hf_dataset_asset to declare a Hub-backed Dagster asset
  • Returning MaterializeResult with dataset metadata (rows, columns, fingerprint)
  • Registering HuggingFaceResource and HFParquetIOManager in Definitions
  • Materializing two independent splits (train, test) as separate assets

Dataset

stanfordnlp/imdb: 50,000 movie reviews with binary sentiment labels. Small, text-only, and requires no authentication.

Split   Rows
train   25,000
test    25,000

Key API

@hf_dataset_asset(
    path="stanfordnlp/imdb",
    split="train",
    io_manager_key="hf_parquet_io_manager",
)
def imdb_train(context: AssetExecutionContext, dataset: Dataset) -> MaterializeResult:
    ...

Note: Dagster does not call the decorated function body to load the dataset. HuggingFaceResource performs the load and injects the result as the dataset parameter. Use the function body to inspect the dataset, log, and return the result.
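The injection behavior described in the note can be illustrated with a toy decorator. This is only a sketch of the general pattern, not the actual implementation of @hf_dataset_asset: `inject_dataset`, `fake_loader`, and `toy_asset` are hypothetical names, and the loader returns an in-memory list instead of contacting the Hub.

```python
from functools import wraps
from typing import Any, Callable


def inject_dataset(loader: Callable[[], Any]) -> Callable:
    """Hypothetical stand-in for the injection step: load first, then call the body."""

    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(context: Any) -> Any:
            # The load happens here, outside the decorated function body.
            dataset = loader()
            return fn(context, dataset)

        return wrapper

    return decorator


def fake_loader() -> list:
    # Placeholder for a Hub load; returns a tiny in-memory stand-in.
    return [{"text": "great film", "label": 1}, {"text": "dull", "label": 0}]


@inject_dataset(fake_loader)
def toy_asset(context: Any, dataset: list) -> dict:
    # The body only inspects the already-loaded data and returns a result.
    return {"rows": len(dataset)}


result = toy_asset(context=None)
```

The caller passes only the context; the dataset argument is supplied by the wrapper, which mirrors how the function signature in the example above receives dataset without loading it.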

Storage layout

After materialization, HFParquetIOManager writes:

.dagster_hf_storage/
├── imdb_train/          # Arrow format via save_to_disk()
└── imdb_test/
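To check what has been written after a run, a small helper can list the per-asset directories. This is a minimal sketch that assumes the layout shown above and that the storage root is relative to the working directory; `list_materialized_assets` is a hypothetical helper, not part of the example package.

```python
from pathlib import Path
from typing import List


def list_materialized_assets(storage_root: str = ".dagster_hf_storage") -> List[str]:
    """Return the names of the asset directories under the storage root, sorted."""
    root = Path(storage_root)
    if not root.is_dir():
        # Nothing has been materialized yet (or the root lives elsewhere).
        return []
    # Each materialized asset is assumed to be one subdirectory.
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

If the directories are in the save_to_disk() layout noted in the tree above, `datasets.load_from_disk(".dagster_hf_storage/imdb_train")` should be able to read a split back for inspection.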

How to run

cd dagster_hf_datasets_examples

dagster dev -m basic_hub_ingestion.definitions

Then open http://localhost:3000, navigate to the Asset Catalog, and materialize imdb_train and imdb_test.

Metadata visible in the Dagster UI

Key             Description
rows            Row count for the materialized split
columns         List of column names
source_dataset  Hub dataset identifier
split           Which split was loaded
fingerprint     Reproducibility hash from the datasets library
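The metadata keys above could be assembled inside the asset function roughly as follows. This is a sketch under assumptions: `build_metadata` and `StubDataset` are hypothetical, the example's real code may collect these values differently, and the fingerprint is read from the private `_fingerprint` attribute that `datasets.Dataset` objects carry.

```python
from typing import Any, Dict


def build_metadata(dataset: Any, source_dataset: str, split: str) -> Dict[str, Any]:
    """Collect the five metadata keys from a Dataset-like object."""
    return {
        "rows": dataset.num_rows,
        "columns": list(dataset.column_names),
        "source_dataset": source_dataset,
        "split": split,
        # Assumption: the hash comes from the private `_fingerprint` attribute.
        "fingerprint": getattr(dataset, "_fingerprint", None),
    }


class StubDataset:
    """Stand-in with the attributes used above, so the sketch runs offline."""

    num_rows = 25000
    column_names = ["text", "label"]
    _fingerprint = "abc123"  # made-up value for illustration


metadata = build_metadata(StubDataset(), "stanfordnlp/imdb", "train")
```

In a real asset, a dictionary like this would typically be returned as `MaterializeResult(metadata=metadata)` so that the values appear in the Dagster UI.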
