
the-hf-stack/dagster-hf-datasets-examples / basic_hub_ingestion /README.md

AINovice2005

21 days ago


	# Basic Hub Ingestion

	Load a public Hugging Face dataset as a Dagster asset with automatic metadata extraction.

	## What this example shows

	- Using `@hf_dataset_asset` to declare a Hub-backed Dagster asset
	- Returning `MaterializeResult` with dataset metadata (rows, columns, fingerprint)
	- Registering `HuggingFaceResource` and `HFParquetIOManager` in `Definitions`
	- Materializing two independent splits (`train`, `test`) as separate assets

	## Dataset

	[`stanfordnlp/imdb`](https://huggingface.co/datasets/stanfordnlp/imdb): 50,000 movie reviews with binary sentiment labels. Small, text-only, and no authentication required.

	| Split | Rows |
	|-------|------|
	| train | 25,000 |
	| test | 25,000 |

	## Key API

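	```python
	from dagster import AssetExecutionContext, MaterializeResult
	from datasets import Dataset

	# `hf_dataset_asset` comes from the integration package used by this example
	# (its import path is not shown in this README).
	@hf_dataset_asset(
	    path="stanfordnlp/imdb",
	    split="train",
	    io_manager_key="hf_parquet_io_manager",
	)
	def imdb_train(context: AssetExecutionContext, dataset: Dataset) -> MaterializeResult:
	    ...
	```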

	> Note: Dagster does not call the decorated function body to load the dataset.
	> `HuggingFaceResource` performs the load and passes the result in as the `dataset` parameter.
	> Use the function body to inspect, log, and return the result.
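
The injection pattern described in the note can be illustrated with a plain-Python sketch. Everything in it is hypothetical: `fake_load`, `dataset_asset_sketch`, and the list-of-dicts "dataset" are stand-ins, and this is not how `@hf_dataset_asset` is implemented. It only shows a decorator performing the load and handing the result to the function body.

```python
import functools
from typing import Callable


def fake_load(path: str, split: str) -> list[dict]:
    """Hypothetical stand-in for a Hub load; returns a tiny in-memory 'dataset'."""
    return [{"text": f"{path}:{split}:{i}", "label": i % 2} for i in range(3)]


def dataset_asset_sketch(path: str, split: str) -> Callable:
    """Illustrative decorator: load first, then inject `dataset` into the body."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(context: dict) -> dict:
            dataset = fake_load(path, split)  # the load happens outside the body
            return fn(context, dataset)       # the body only inspects and returns
        return wrapper
    return decorator


@dataset_asset_sketch(path="stanfordnlp/imdb", split="train")
def imdb_train_sketch(context: dict, dataset: list[dict]) -> dict:
    # The body never loads anything; it receives `dataset` ready to use.
    return {"rows": len(dataset), "columns": sorted(dataset[0])}


result = imdb_train_sketch(context={})
```

The caller passes only `context`, while the body still receives a loaded `dataset`, which mirrors the division of work the note describes.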

	## Storage layout

	After materialization, `HFParquetIOManager` writes:

	```
	.dagster_hf_storage/
	├── imdb_train/ # Arrow format via save_to_disk()
	└── imdb_test/
	```
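
One way to confirm the layout after a run is to list the directories under the storage root. The helper below is a minimal sketch using only the standard library. `list_materialized` is a hypothetical name, not part of the integration, and the demo builds a temporary directory that mimics the tree above rather than reading a real run.

```python
import tempfile
from pathlib import Path


def list_materialized(root: Path) -> list[str]:
    """Return the names of asset directories under the storage root, sorted."""
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Demo against a temporary directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".dagster_hf_storage"
    for name in ("imdb_train", "imdb_test"):
        (root / name).mkdir(parents=True)
    found = list_materialized(root)
    missing = list_materialized(Path(tmp) / "does_not_exist")
```

Against a real project, pointing `list_materialized` at `.dagster_hf_storage` in the working directory should show one entry per materialized asset, assuming the IO manager uses that default location.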

	## How to run

	```bash
	cd dagster_hf_datasets_examples

	dagster dev -m basic_hub_ingestion.definitions
	```

	Then open [http://localhost:3000](http://localhost:3000), navigate to the Asset Catalog,
	and materialize `imdb_train` and `imdb_test`.

	## Metadata visible in the Dagster UI

	| Key | Description |
	|-----|-------------|
	| `rows` | Row count for the materialized split |
	| `columns` | List of column names |
	| `source_dataset` | Hub dataset identifier |
	| `split` | Which split was loaded |
	| `fingerprint` | Reproducibility hash from the datasets library |
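
As a rough illustration of how these keys could be assembled, the sketch below builds the metadata dictionary from a stand-in object. `FakeSplit` and `build_metadata` are hypothetical names, and the fingerprint value is a placeholder. In the actual example the values would come from the injected `Dataset`, and the dictionary would presumably be passed as the `metadata` argument of `MaterializeResult`.

```python
from dataclasses import dataclass


@dataclass
class FakeSplit:
    """Hypothetical stand-in exposing only the attributes this sketch reads."""
    num_rows: int
    column_names: list[str]
    fingerprint: str


def build_metadata(ds: FakeSplit, source_dataset: str, split: str) -> dict:
    """Collect the five keys from the table above into one dictionary."""
    return {
        "rows": ds.num_rows,
        "columns": list(ds.column_names),
        "source_dataset": source_dataset,
        "split": split,
        "fingerprint": ds.fingerprint,
    }


metadata = build_metadata(
    FakeSplit(num_rows=25_000, column_names=["text", "label"], fingerprint="abc123"),
    source_dataset="stanfordnlp/imdb",
    split="train",
)
```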
