Basic Hub Ingestion
Load a public Hugging Face dataset as a Dagster asset with automatic metadata extraction.
What this example shows
- Using `@hf_dataset_asset` to declare a Hub-backed Dagster asset
- Returning `MaterializeResult` with dataset metadata (rows, columns, fingerprint)
- Registering `HuggingFaceResource` and `HFParquetIOManager` in `Definitions`
- Materializing two independent splits (`train`, `test`) as separate assets
Dataset
stanfordnlp/imdb: 50,000 movie reviews with binary sentiment labels. Small, text-only, no auth required.
| Split | Rows |
|---|---|
| train | 25,000 |
| test | 25,000 |
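Before wiring the dataset into Dagster, it can help to confirm the split sizes directly. The following is a minimal sketch, not part of this example's code: `EXPECTED_ROWS`, `load_split`, and `rows_match` are hypothetical names, and `load_split` assumes the `datasets` library is installed and the Hub is reachable.

```python
# Row counts taken from the table above.
EXPECTED_ROWS = {"train": 25_000, "test": 25_000}


def load_split(split: str):
    """Load one split from the Hub (hypothetical helper; needs network access)."""
    from datasets import load_dataset  # imported lazily so the check below has no dependencies

    return load_dataset("stanfordnlp/imdb", split=split)


def rows_match(split: str, num_rows: int) -> bool:
    """Return True when a split has the row count listed in the table."""
    return EXPECTED_ROWS.get(split) == num_rows
```

For instance, `rows_match("train", load_split("train").num_rows)` should hold if the table above is still accurate.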
Key API
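from dagster import AssetExecutionContext, MaterializeResult
from datasets import Dataset

@hf_dataset_asset(  # provided by this integration; its import is not shown here
    path="stanfordnlp/imdb",
    split="train",
    io_manager_key="hf_parquet_io_manager",
)
def imdb_train(context: AssetExecutionContext, dataset: Dataset) -> MaterializeResult:
    # Inspect or log the injected dataset here, then return a MaterializeResult.
    ...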
Note: Dagster does not call the decorated function body to load the dataset.
`HuggingFaceResource` performs the load and injects `dataset` as a parameter. The function body is where you inspect, log, and return the result.
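The load-then-inject pattern described in the note can be illustrated with a toy decorator. This is a simplified stand-in in plain Python, not the integration's actual implementation: `fake_load` and `inject_dataset` are hypothetical names, and a list of dictionaries takes the place of a real `Dataset`.

```python
from functools import wraps
from typing import Any, Callable


def fake_load(path: str, split: str) -> list:
    """Hypothetical stand-in for the Hub load that the resource performs."""
    return [{"text": f"{path}:{split}:{i}", "label": i % 2} for i in range(3)]


def inject_dataset(path: str, split: str) -> Callable:
    """Toy decorator: load first, then pass the result to the body as `dataset`."""

    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(context: Any) -> Any:
            dataset = fake_load(path, split)  # the load happens outside the body
            return fn(context, dataset)  # the body only receives the loaded object

        return wrapper

    return decorator


@inject_dataset(path="stanfordnlp/imdb", split="train")
def imdb_train(context: Any, dataset: list) -> dict:
    # The body inspects and reports; it never fetches anything itself.
    return {"rows": len(dataset), "columns": sorted(dataset[0])}


result = imdb_train(context=None)
```

The point of the sketch is the division of labor: the wrapper owns loading, so the decorated body stays a pure inspect-and-report step.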
Storage layout
After materialization, HFParquetIOManager writes:
.dagster_hf_storage/
├── imdb_train/ # Arrow format via save_to_disk()
└── imdb_test/
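To read a materialized split back outside Dagster, something like the following may work. It is a sketch under two assumptions: that each asset gets one folder named after it under `.dagster_hf_storage/`, as the layout above suggests, and that the folder was written by `save_to_disk()` so that `datasets.load_from_disk` can open it. `asset_dir` and `load_materialized` are hypothetical helper names.

```python
from pathlib import Path

STORAGE_ROOT = Path(".dagster_hf_storage")  # root directory shown in the layout above


def asset_dir(asset_name: str, root: Path = STORAGE_ROOT) -> Path:
    """Directory for one asset, assuming one folder per asset name."""
    return root / asset_name


def load_materialized(asset_name: str, root: Path = STORAGE_ROOT):
    """Reload a materialized split; assumes the folder was written by save_to_disk()."""
    from datasets import load_from_disk  # imported lazily so asset_dir has no dependencies

    return load_from_disk(str(asset_dir(asset_name, root)))
```

After materializing both assets, `load_materialized("imdb_train")` would be the call to try first.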
How to run
cd dagster_hf_datasets_examples
dagster dev -m basic_hub_ingestion.definitions
Then open http://localhost:3000, navigate to the Asset Catalog,
and materialize imdb_train and imdb_test.
Metadata visible in the Dagster UI
| Key | Description |
|---|---|
| `rows` | Row count for the materialized split |
| `columns` | List of column names |
| `source_dataset` | Hub dataset identifier |
| `split` | Which split was loaded |
| `fingerprint` | Reproducibility hash from the datasets library |
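The metadata keys listed above could be assembled from a loaded dataset roughly as follows. This is a sketch, not the example's actual code: `build_metadata` is a hypothetical helper, and reading the fingerprint from the private `_fingerprint` attribute is an assumption about how the value is obtained. A `SimpleNamespace` stands in for a real `Dataset`, and its fingerprint value is a placeholder.

```python
from types import SimpleNamespace
from typing import Any, Dict


def build_metadata(dataset: Any, source_dataset: str, split: str) -> Dict[str, Any]:
    """Collect the five metadata keys from a Dataset-like object."""
    return {
        "rows": dataset.num_rows,
        "columns": list(dataset.column_names),
        "source_dataset": source_dataset,
        "split": split,
        # Assumption: the hash lives on the private `_fingerprint` attribute.
        "fingerprint": getattr(dataset, "_fingerprint", None),
    }


# Stand-in object exposing the attributes used above; the fingerprint is a placeholder.
fake_dataset = SimpleNamespace(
    num_rows=25_000,
    column_names=["text", "label"],
    _fingerprint="placeholder",
)
metadata = build_metadata(fake_dataset, "stanfordnlp/imdb", "train")
```

Inside an asset body, a dictionary like this could be returned as `MaterializeResult(metadata=metadata)`.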