dagster-hf-datasets Examples
A curated collection of 17 end-to-end examples demonstrating how to build reproducible, observable, and production-ready ML data pipelines with dagster-hf-datasets.
Each example is self-contained, runnable with dagster dev -m <example_folder>.definitions (the -m flag takes a module name, so there is no .py suffix), and covers a distinct concept or workflow tier, from basic Hub ingestion to flagship end-to-end pipelines.
Prerequisites
pip install dagster dagster-hf-datasets
Some examples require additional dependencies (noted in their README). Set your Hugging Face token for gated datasets:
export HF_TOKEN=hf_...
All examples share the same Definitions boilerplate. The HuggingFaceResource handles Hub loading; HFParquetIOManager handles persistence:
from dagster import Definitions
from dagster_hf_datasets import HuggingFaceResource, HFParquetIOManager

defs = Definitions(
    assets=[...],
    resources={
        "huggingface": HuggingFaceResource(cache_dir=".hf_cache"),
        "hf_parquet_io_manager": HFParquetIOManager(base_dir=".dagster_hf_storage"),
    },
)
Running an Example
cd dagster_hf_datasets_examples
dagster dev -m <example_folder>.definitions
Every definitions.py is a standalone entrypoint.
Repository Structure
dagster-hf-datasets-examples/
├── README.md ← you are here
├── pyproject.toml
│
├── basic_hub_ingestion/
│ ├── assets.py ← @hf_dataset_asset / @hf_multi_asset definitions
│ ├── definitions.py ← Definitions() entrypoint for dagster dev
│ ├── README.md ← goal, dataset, API notes, expected output
│ └── __init__.py
│
└── <example_name>/ ← same layout for all 17 examples
    ├── assets.py
    ├── definitions.py
    ├── README.md
    └── __init__.py
Example Catalog
🏗️ Foundation
Core patterns every dagster-hf-datasets user encounters first.
| Folder | Description | Dataset | Key API |
|---|---|---|---|
| basic_hub_ingestion | Load a Hub dataset and materialize it as a Dagster asset with lineage metadata | stanfordnlp/imdb | @hf_dataset_asset |
| multi_asset_split_routing | Materialize train/validation/test splits as independently tracked assets | nyu-mll/glue (sst2) | @hf_multi_asset |
| sanitization_observability | Filter, deduplicate, and expose quality metrics via asset checks | HuggingFaceFW/fineweb-edu | @asset_check, Dataset.filter() |
| dataset_card_publishing | Generate and publish a Hub README from pipeline metadata | rajpurkar/squad | HFDatasetPublisher |
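The filter, deduplicate, and quality-metric steps listed for sanitization_observability can be illustrated without any library. The sketch below is a plain-Python stand-in, not the example's actual code: the function name, the min_chars threshold, and the metric keys are all assumptions made for illustration.

```python
import hashlib


def sanitize(texts, min_chars=20):
    """Drop short texts and exact duplicates; return kept rows plus metrics.

    Hypothetical helper: min_chars and the metric names are illustrative only.
    """
    seen = set()
    kept = []
    too_short = 0
    duplicates = 0
    for text in texts:
        # Normalize whitespace and case so trivially different copies collide.
        normalized = " ".join(text.split()).lower()
        if len(normalized) < min_chars:
            too_short += 1
            continue
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
            continue
        seen.add(digest)
        kept.append(text)
    total = len(texts)
    metrics = {
        "rows_in": total,
        "rows_out": len(kept),
        "dropped_short": too_short,
        "dropped_duplicate": duplicates,
        "keep_ratio": len(kept) / total if total else 0.0,
    }
    return kept, metrics
```

In a Dagster pipeline, numbers like these would typically be surfaced through an asset check rather than returned directly.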
🔄 Data Management & Automation
Patterns for scalability, automation, and memory-safe processing.
| Folder | Description | Dataset | Key API |
|---|---|---|---|
| event_driven_benchmark_refresh | Re-materialize evaluation datasets automatically when an upstream Hub revision changes | openai/openai_humaneval | Dagster sensors |
| streaming_oom_datasets | Process datasets larger than available memory using IterableDataset | allenai/dolma | IterableDataset, chunked Parquet writes |
| dynamic_bucket_partitioning | Partition datasets by language, domain, or task with Dagster's partition system | Helsinki-NLP/opus_books | StaticPartitionsDefinition, DynamicPartitionsDefinition |
| distributed_token_sharding | Pre-tokenize large datasets into training-ready shards | HuggingFaceFW/fineweb | Batched Dataset.map(), .npy shard caching |
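As a rough, library-free illustration of the sharding idea behind distributed_token_sharding, the sketch below packs token-id lists into fixed-size shards. It covers only the packing step and is not taken from the example; the function name and the shard_size parameter are hypothetical.

```python
def pack_into_shards(token_sequences, shard_size):
    """Concatenate token-id sequences and cut the stream into fixed-size shards.

    Works lazily on any iterable of lists. The final shard may be shorter
    than shard_size.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    buffer = []
    for seq in token_sequences:
        buffer.extend(seq)
        # Emit as many full shards as the buffer currently holds.
        while len(buffer) >= shard_size:
            yield buffer[:shard_size]
            buffer = buffer[shard_size:]
    if buffer:
        yield buffer


shards = list(pack_into_shards([[1, 2, 3], [4, 5], [6, 7, 8, 9]], shard_size=4))
# shards == [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

Each yielded shard could then be saved to its own file, which keeps memory use bounded by roughly one shard plus one input sequence.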
🖼️ Multimodal Workflows
Image, audio, and vision-language pipelines.
| Folder | Description | Dataset | Key API |
|---|---|---|---|
| multi_modal_data_profiling | Profile an image-text dataset: resolution stats, caption analysis, thumbnail gallery, health score | nlphuji/flickr30k | PIL, Dataset.from_list() |
| image_dataset_curation | Build a production-quality vision dataset: corrupt detection, dedup, aspect-ratio validation, multi-format export | zh-plus/tiny-imagenet | @multi_asset, WebDataset/Arrow/Parquet packaging |
| audio_dataset_curation | Construct a clean speech corpus: duration filtering, SNR gating, language validation | mozilla-foundation/common_voice_11_0 | soundfile, librosa |
| vision_dataset | Package image datasets into training-optimized formats with manifests | zh-plus/tiny-imagenet | WebDataset .tar, Arrow, Parquet |
| synthethic_multimodal_data | Caption pre-generated synthetic images, score caption↔prompt alignment, filter by quality | poloclub/diffusiondb | BLIP-base (CPU), MiniLM similarity |
| vision_language_scale_pipeline | Build CLIP/LLaVA-style image-caption datasets at scale | google-research-datasets/conceptual_captions | Image-caption pairing, schema validation |
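The aspect-ratio validation mentioned for image_dataset_curation can be sketched on plain (width, height) pairs. This is an illustrative stand-in, not the example's code: the function names and the max_ratio default of 3.0 are assumptions.

```python
def aspect_ratio_ok(width, height, max_ratio=3.0):
    """Return True if the longer side is at most max_ratio times the shorter."""
    if width <= 0 or height <= 0:
        return False  # degenerate sizes are treated as invalid
    return max(width, height) / min(width, height) <= max_ratio


def filter_by_aspect_ratio(sizes, max_ratio=3.0):
    """Split (width, height) pairs into kept and rejected lists."""
    kept, rejected = [], []
    for width, height in sizes:
        if aspect_ratio_ok(width, height, max_ratio=max_ratio):
            kept.append((width, height))
        else:
            rejected.append((width, height))
    return kept, rejected
```

Returning the rejected list alongside the kept one makes it easy to report how many images a threshold removes before committing to it.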
🤖 LLM & Alignment Workflows
Text, instruction, and preference-alignment pipelines.
| Folder | Description | Dataset | Key API |
|---|---|---|---|
| code_instruction_pipeline | Prepare code and instruction-following datasets for model training | bigcode/the-stack | Deduplication, language filtering |
| preference_alignment_data | Build RLHF/DPO-ready datasets from raw preference records | Anthropic/hh-rlhf | Prompt/chosen/rejected schema, pair validation |
| golden_pipeline | Complete end-to-end pipeline: ingest → clean → validate → split → tokenize → publish | allenai/c4 | Full asset graph, HFDatasetPublisher |
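To make the prompt/chosen/rejected schema and pair validation from preference_alignment_data concrete, here is a minimal sketch of a per-record validator. The specific rules (non-empty string fields, chosen differing from rejected) and the function name are assumptions, not the example's actual checks.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_pair(record):
    """Return a list of problems found in one preference record (empty if valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # Only compare the two responses once all fields are known to be present.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems
```

Returning a list of problems rather than a boolean makes it straightforward to count failure reasons across a dataset.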
Learning Path
Beginner — start here if you're new to dagster-hf-datasets
- basic_hub_ingestion: understand @hf_dataset_asset and MaterializeResult
- multi_asset_split_routing: see how @hf_multi_asset auto-resolves splits
- sanitization_observability: chain @hf_dataset_asset with downstream @asset nodes
- dataset_card_publishing: publish lineage-aware Hub documentation
Intermediate — build operational pipelines
- event_driven_benchmark_refresh: add sensor-based automation
- streaming_oom_datasets: handle datasets that exceed memory
- dynamic_bucket_partitioning: partition by language/domain/task
- distributed_token_sharding: generate training-ready tokenized shards
Advanced — multimodal and specialized workflows
- multi_modal_data_profiling
- image_dataset_curation
- audio_dataset_curation
- vision_dataset
- synthethic_multimodal_data
- vision_language_scale_pipeline
- code_instruction_pipeline
- preference_alignment_data
- golden_pipeline: run this last to see everything together
Common Gotchas
The decorated function body is not responsible for loading the dataset.
HuggingFaceResource performs the load; the function receives the injected dataset parameter. Your function body inspects, transforms, and returns MaterializeResult.
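The division of labor described here can be mimicked with a toy decorator. This is deliberately not the library's implementation or API: toy_dataset_asset and fake_hub_load are hypothetical names, and a plain dict stands in for MaterializeResult. It only shows that loading happens outside the function body.

```python
import functools


def toy_dataset_asset(loader):
    """Toy decorator: calls loader() and passes the result in as dataset."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper():
            dataset = loader()          # loading happens here, outside fn
            return fn(dataset=dataset)  # fn only sees the loaded data
        return wrapper
    return decorator


def fake_hub_load():
    # Stand-in for a Hub download; returns a small in-memory list.
    return [{"text": "great movie", "label": 1}, {"text": "dull", "label": 0}]


@toy_dataset_asset(fake_hub_load)
def imdb_summary(dataset):
    # No loading code here: inspect the injected data and return metadata.
    return {"num_rows": len(dataset), "columns": sorted(dataset[0])}


summary = imdb_summary()
# summary == {"num_rows": 2, "columns": ["label", "text"]}
```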
HFParquetIOManager does not persist IterableDataset.
Streaming datasets must be written to Parquet manually via pyarrow. See streaming_oom_datasets for the pattern.
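A minimal sketch of a manual chunked write, assuming rows arrive as dicts from some iterable. It is not the code from streaming_oom_datasets: the helper names, the part-00000.parquet file naming, and the chunk_size default are hypothetical. pyarrow is imported lazily inside the writer so the chunking helper can be used on its own.

```python
from itertools import islice
from pathlib import Path


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable, lazily."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def write_parquet_chunks(rows, out_dir, chunk_size=10_000):
    """Write an iterable of dict rows to numbered Parquet files, one per chunk.

    Requires pyarrow. Only one chunk is held in memory at a time.
    """
    import pyarrow as pa
    import pyarrow.parquet as pq

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for index, chunk in enumerate(iter_chunks(rows, chunk_size)):
        table = pa.Table.from_pylist(chunk)  # list of dicts -> Arrow table
        target = out_path / f"part-{index:05d}.parquet"
        pq.write_table(table, str(target))
        written.append(target)
    return written
```

Each chunk infers its own schema in this sketch, so rows with inconsistent fields would need an explicit schema passed to the table constructor.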
HFDatasetPublisher is in a private module.
Import from dagster_hf_datasets._export._publisher. Pin your dagster-hf-datasets version if you use this.
Splits from @hf_multi_asset are named {asset_name}_{split}.
A glue_sst2 multi-asset produces glue_sst2_train, glue_sst2_validation, glue_sst2_test as downstream dependency names.
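The naming rule is simple enough to express as a one-line helper, which can be handy when wiring downstream dependencies programmatically. The helper name is hypothetical; only the {asset_name}_{split} pattern comes from the text above.

```python
def split_asset_names(asset_name, splits):
    """Build downstream asset names following the {asset_name}_{split} pattern."""
    return [f"{asset_name}_{split}" for split in splits]


names = split_asset_names("glue_sst2", ["train", "validation", "test"])
# names == ["glue_sst2_train", "glue_sst2_validation", "glue_sst2_test"]
```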
Split before tokenizing.
Always perform the train/test split before tokenization to avoid data leakage. The golden_pipeline demonstrates the correct order.
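The ordering can be sketched in plain Python. This is not the golden_pipeline code: the function names are hypothetical, the seeded shuffle is one possible way to split, and a lowercase whitespace split stands in for a real tokenizer.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows deterministically and cut it into train and test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def tokenize(text):
    # Stand-in tokenizer: lowercase whitespace split.
    return text.lower().split()


def split_then_tokenize(rows, test_fraction=0.2, seed=0):
    """Split first, then tokenize each split independently."""
    train, test = train_test_split(rows, test_fraction, seed)
    return [tokenize(t) for t in train], [tokenize(t) for t in test]
```

Because the split is fixed before any tokenization runs, any statistics a tokenization step might compute can be restricted to the training rows.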