Basic Hub Ingestion
Load a public Hugging Face dataset as a Dagster asset with automatic metadata extraction.
What this example shows
- Using `@hf_dataset_asset` to declare a Hub-backed Dagster asset
- Returning `MaterializeResult` with dataset metadata (rows, columns, fingerprint)
- Registering `HuggingFaceResource` and `HFParquetIOManager` in `Definitions`
- Materializing two independent splits (`train`, `test`) as separate assets
Dataset
stanfordnlp/imdb: 50,000 movie reviews with binary sentiment labels. Small, text-only, no authentication required.
| Split | Rows |
|---|---|
| train | 25,000 |
| test | 25,000 |
Key API
@hf_dataset_asset(
    path="stanfordnlp/imdb",
    split="train",
    io_manager_key="hf_parquet_io_manager",
)
def imdb_train(context: AssetExecutionContext, dataset: Dataset) -> MaterializeResult:
    ...
Note: Dagster does not call the decorated function body to load the dataset.
`HuggingFaceResource` performs the load and injects `dataset` as a parameter. The function body is where you inspect, log, and return the result.
Storage layout
After materialization, HFParquetIOManager writes:
.dagster_hf_storage/
├── imdb_train/ # Arrow format via save_to_disk()
└── imdb_test/
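To confirm that a materialization produced the layout above, a small standard-library check can help. This is only a sketch: `missing_asset_dirs` is a hypothetical helper, not part of this package, and it assumes the storage root is `.dagster_hf_storage` relative to the working directory, as shown in the tree.

```python
from pathlib import Path


def missing_asset_dirs(storage_root, asset_names):
    """Return the asset names that have no directory under storage_root."""
    root = Path(storage_root)
    return [name for name in asset_names if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumed root and asset names, taken from the layout shown above.
    missing = missing_asset_dirs(".dagster_hf_storage", ["imdb_train", "imdb_test"])
    if missing:
        print("Not materialized yet:", ", ".join(missing))
    else:
        print("Both splits are present on disk.")
```

If the directories are indeed written with `save_to_disk()`, as the comment in the layout suggests, `datasets.load_from_disk` is the matching reader in the `datasets` library.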
How to run
cd dagster_hf_datasets_examples
dagster dev -m basic_hub_ingestion.definitions
Then open http://localhost:3000, navigate to the Asset Catalog,
and materialize imdb_train and imdb_test.
Metadata visible in the Dagster UI
| Key | Description |
|---|---|
| `rows` | Row count for the materialized split |
| `columns` | List of column names |
| `source_dataset` | Hub dataset identifier |
| `split` | Which split was loaded |
| `fingerprint` | Reproducibility hash from the datasets library |
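As a rough illustration of where these values can come from, the sketch below assembles the five keys from a dataset-like object. `build_split_metadata` is a hypothetical helper, not this package's API. It relies on `num_rows`, `column_names`, and the private `_fingerprint` attribute, which `datasets.Dataset` exposes. A stand-in object with placeholder values is used here so the example runs without downloading anything.

```python
from types import SimpleNamespace
from typing import Any, Dict


def build_split_metadata(dataset: Any, source_dataset: str, split: str) -> Dict[str, Any]:
    """Collect the metadata keys listed in the table from a dataset-like object."""
    return {
        "rows": dataset.num_rows,
        "columns": list(dataset.column_names),
        "source_dataset": source_dataset,
        "split": split,
        # `_fingerprint` is a private attribute of datasets.Dataset; fall back
        # to None if the object does not carry one.
        "fingerprint": getattr(dataset, "_fingerprint", None),
    }


if __name__ == "__main__":
    # Stand-in for an injected Dataset; the fingerprint is a made-up placeholder.
    fake_dataset = SimpleNamespace(
        num_rows=25_000,
        column_names=["text", "label"],
        _fingerprint="placeholder-fingerprint",
    )
    for key, value in build_split_metadata(fake_dataset, "stanfordnlp/imdb", "train").items():
        print(f"{key}: {value}")
```

Inside a decorated asset function, a dictionary like this could be passed as `MaterializeResult(metadata=...)`, assuming the function returns the result itself as in the Key API snippet.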