the-hf-stack/dagster-hf-datasets-examples / basic_hub_ingestion

210 kB

70 files

Updated 16 days ago

Name	Size	Uploaded	Xet hash
README.md	1.92 kB xet	16 days ago	1ac6b3a2
__init__.py	74 Bytes xet	16 days ago	f7f49e7c
assets.py	1.74 kB xet	16 days ago	d06862d7
definitions.py	590 Bytes xet	16 days ago	7b68b78b

README.md

Basic Hub Ingestion

Load a public Hugging Face dataset as a Dagster asset with automatic metadata extraction.

What this example shows

Using @hf_dataset_asset to declare a Hub-backed Dagster asset
Returning MaterializeResult with dataset metadata (rows, columns, fingerprint)
Registering HuggingFaceResource and HFParquetIOManager in Definitions
Materializing two independent splits (train, test) as separate assets

Dataset

stanfordnlp/imdb: 50,000 movie reviews with binary sentiment labels. Small, text-only, no authentication required.

Split	Rows
train	25,000
test	25,000

Key API

@hf_dataset_asset(
    path="stanfordnlp/imdb",
    split="train",
    io_manager_key="hf_parquet_io_manager",
)
def imdb_train(context: AssetExecutionContext, dataset: Dataset) -> MaterializeResult:
    ...

Note: Dagster does not call the decorated function body to load the dataset. HuggingFaceResource performs the load and injects the loaded dataset as the dataset parameter. Use the function body to inspect and log the dataset and to return the MaterializeResult.
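The injection pattern described in the note can be illustrated with a toy decorator. This is only a sketch of the general idea, not the implementation of @hf_dataset_asset: toy_hub_asset, fake_loader, and the loader parameter are hypothetical names invented for illustration, and no real Hub download happens.

```python
from functools import wraps
from typing import Any, Callable


def toy_hub_asset(path: str, split: str, loader: Callable[[str, str], Any]):
    """Hypothetical stand-in for a Hub-backed asset decorator (not the real one)."""

    def decorator(fn: Callable[[Any, Any], Any]):
        @wraps(fn)
        def wrapper(context: Any) -> Any:
            # The wrapper, not the user function, performs the load.
            dataset = loader(path, split)
            # The loaded object is then injected as the `dataset` argument.
            return fn(context, dataset)

        return wrapper

    return decorator


def fake_loader(path: str, split: str) -> list:
    # Stand-in for a real Hub download; returns two made-up rows.
    return [{"text": "great film", "label": 1}, {"text": "dull", "label": 0}]


@toy_hub_asset(path="stanfordnlp/imdb", split="train", loader=fake_loader)
def imdb_train(context, dataset):
    # The body only inspects the already-loaded data and returns a result.
    return {"rows": len(dataset)}


result = imdb_train(context=None)
```

The point of the design is that the user function never names a loading call; swapping the loader changes where the data comes from without touching the function body.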

Storage layout

After materialization, HFParquetIOManager writes:

.dagster_hf_storage/
├── imdb_train/          # Arrow format via save_to_disk()
└── imdb_test/
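To inspect a materialized split outside Dagster, something like the following may work. It is a sketch that assumes the directories are plain save_to_disk() output, as the comment in the layout suggests, and that the storage root is relative to the directory Dagster ran in. asset_storage_path and load_materialized are hypothetical helper names; load_from_disk is the datasets library's counterpart to save_to_disk().

```python
from pathlib import Path

# Directory name taken from the storage layout above.
STORAGE_ROOT = Path(".dagster_hf_storage")


def asset_storage_path(asset_name: str, root: Path = STORAGE_ROOT) -> Path:
    """Return the directory assumed to hold one materialized asset."""
    return root / asset_name


def load_materialized(asset_name: str, root: Path = STORAGE_ROOT):
    """Reload a materialized split, assuming it was written by save_to_disk()."""
    # Imported lazily so the path helper works without `datasets` installed.
    from datasets import load_from_disk

    return load_from_disk(str(asset_storage_path(asset_name, root)))
```

After materializing, load_materialized("imdb_train") would be expected to return the train split, provided the assumptions above hold.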

How to run

cd dagster_hf_datasets_examples

dagster dev -m basic_hub_ingestion.definitions

Then open http://localhost:3000, navigate to the Asset Catalog, and materialize imdb_train and imdb_test.

Metadata visible in the Dagster UI

Key	Description
`rows`	Row count for the materialized split
`columns`	List of column names
`source_dataset`	Hub dataset identifier
`split`	Which split was loaded
`fingerprint`	Reproducibility hash from the datasets library
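The keys in the table could be assembled roughly as follows. This is a hedged sketch rather than the code in assets.py: build_metadata and FakeDataset are hypothetical names, the fingerprint is read from the private _fingerprint attribute of datasets.Dataset (an assumption about how the example obtains it), and the stand-in class carries a made-up fingerprint value so the snippet runs without downloading anything.

```python
from typing import Any, Dict


def build_metadata(dataset: Any, source_dataset: str, split: str) -> Dict[str, Any]:
    """Collect the metadata keys listed in the table from a dataset-like object."""
    return {
        "rows": dataset.num_rows,
        "columns": list(dataset.column_names),
        "source_dataset": source_dataset,
        "split": split,
        # Assumption: the fingerprint lives on the private `_fingerprint` attribute.
        "fingerprint": getattr(dataset, "_fingerprint", None),
    }


class FakeDataset:
    """Stand-in exposing the attributes used above; values are illustrative."""

    num_rows = 25000
    column_names = ["text", "label"]
    _fingerprint = "abc123"  # made-up value, not a real fingerprint


metadata = build_metadata(FakeDataset(), "stanfordnlp/imdb", "train")
```

Inside an asset function, a dictionary like this could be returned as MaterializeResult(metadata=metadata) so that the values show up in the Dagster UI.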

Last updated: Jun 14