210 kB
70 files
Updated 13 days ago
Name
Size
audio_dataset_curation
basic_hub_ingestion
code_instruction_pipeline
dataset_card_publishing
distributed_token_sharding
dynamic_bucket_partitioning
event_driven_benchmark_refresh
golden_pipeline
image_dataset_curation
multi_asset_split_routing
multi_modal_data_profiling
preference_alignment_data
sanitization_observability
streaming_oom_datasets
synthethic_multimodal_data
vision_dataset
vision_language_scale_pipeline
README.md9.35 kB
xet
pyproject.toml721 Bytes
xet
README.md

dagster-hf-datasets Examples

A curated collection of 17 end-to-end examples demonstrating how to build reproducible, observable, and production-ready ML data pipelines with dagster-hf-datasets

Each example is self-contained, runnable with dagster dev -m <example_folder>.definitions.py, and covers a distinct concept or workflow tier — from basic Hub ingestion to flagship end-to-end pipelines.


Prerequisites

pip install dagster dagster-hf-datasets

Some examples require additional dependencies (noted in their README). Set your Hugging Face token for gated datasets:

export HF_TOKEN=hf_...

All examples share the same Definitions boilerplate. The HuggingFaceResource handles Hub loading; HFParquetIOManager handles persistence:

from dagster import Definitions
from dagster_hf_datasets import HuggingFaceResource, HFParquetIOManager

defs = Definitions(
    assets=[...],
    resources={
        "huggingface": HuggingFaceResource(cache_dir=".hf_cache"),
        "hf_parquet_io_manager": HFParquetIOManager(base_dir=".dagster_hf_storage"),
    },
)

Running an Example

cd dagster_hf_datasets_examples

dagster dev -m  <example_folder>.definitions.py

Every definitions.py is a standalone entrypoint.


Repository Structure

dagster-hf-datasets-examples/
├── README.md                      ← you are here
├── pyproject.toml
│
├── basic_hub_ingestion/
│   ├── assets.py                  ← @hf_dataset_asset / @hf_multi_asset definitions
│   ├── definitions.py             ← Definitions() entrypoint for dagster dev
│   ├── README.md                  ← goal, dataset, API notes, expected output
│   └── __init__.py
│
└── <example_name>/                ← same layout for all 17 examples
    ├── assets.py
    ├── definitions.py
    ├── README.md
    └── __init__.py

Example Catalog

🏗️ Foundation

Core patterns every dagster-hf-datasets user encounters first.

Folder Description Dataset Key API
basic_hub_ingestion Load a Hub dataset and materialize it as a Dagster asset with lineage metadata stanfordnlp/imdb @hf_dataset_asset
multi_asset_split_routing Materialize train/validation/test splits as independently tracked assets nyu-mll/glue (sst2) @hf_multi_asset
sanitization_observability Filter, deduplicate, and expose quality metrics via asset checks HuggingFaceFW/fineweb-edu @asset_check, Dataset.filter()
dataset_card_publishing Generate and publish a Hub README from pipeline metadata rajpurkar/squad HFDatasetPublisher

🔄 Data Management & Automation

Patterns for scalability, automation, and memory-safe processing.

Folder Description Dataset Key API
event_driven_benchmark_refresh Re-materialize evaluation datasets automatically when an upstream Hub revision changes openai/openai_humaneval Dagster sensors
streaming_oom_datasets Process datasets larger than available memory using IterableDataset allenai/dolma IterableDataset, chunked Parquet writes
dynamic_bucket_partitioning Partition datasets by language, domain, or task with Dagster's partition system Helsinki-NLP/opus_books StaticPartitionsDefinition, DynamicPartitionsDefinition
distributed_token_sharding Pre-tokenize large datasets into training-ready shards HuggingFaceFW/fineweb Batched Dataset.map(), .npy shard caching

🖼️ Multimodal Workflows

Image, audio, and vision-language pipelines.

Folder Description Dataset Key API
multi_modal_data_profiling Profile an image-text dataset: resolution stats, caption analysis, thumbnail gallery, health score nlphuji/flickr30k PIL, Dataset.from_list()
image_dataset_curation Build a production-quality vision dataset: corrupt detection, dedup, aspect-ratio validation, multi-format export zh-plus/tiny-imagenet @multi_asset, WebDataset/Arrow/Parquet packaging
audio_dataset_curation Construct a clean speech corpus: duration filtering, SNR gating, language validation mozilla-foundation/common_voice_11_0 soundfile, librosa
vision_dataset Package image datasets into training-optimized formats with manifests zh-plus/tiny-imagenet WebDataset .tar, Arrow, Parquet
synthethic_multimodal_data Caption pre-generated synthetic images, score caption↔prompt alignment, filter by quality poloclub/diffusiondb BLIP-base (CPU), MiniLM similarity
vision_language_scale_pipeline Build CLIP/LLaVA-style image-caption datasets at scale google-research-datasets/conceptual_captions Image-caption pairing, schema validation

🤖 LLM & Alignment Workflows

Text, instruction, and preference-alignment pipelines.

Folder Description Dataset Key API
code_instruction_pipeline Prepare code and instruction-following datasets for model training bigcode/the-stack Deduplication, language filtering
preference_alignment_data Build RLHF/DPO-ready datasets from raw preference records Anthropic/hh-rlhf Prompt/chosen/rejected schema, pair validation
golden_pipeline Complete end-to-end pipeline: ingest → clean → validate → split → tokenize → publish allenai/c4 Full asset graph, HFDatasetPublisher

Learning Path

Beginner — start here if you're new to dagster-hf-datasets

  1. basic_hub_ingestion — understand @hf_dataset_asset and MaterializeResult
  2. multi_asset_split_routing — see how @hf_multi_asset auto-resolves splits
  3. sanitization_observability — chain @hf_dataset_asset with downstream @asset nodes
  4. dataset_card_publishing — publish lineage-aware Hub documentation

Intermediate — build operational pipelines

  1. event_driven_benchmark_refresh — add sensor-based automation
  2. streaming_oom_datasets — handle datasets that exceed memory
  3. dynamic_bucket_partitioning — partition by language/domain/task
  4. distributed_token_sharding — generate training-ready tokenized shards

Advanced — multimodal and specialized workflows

  1. multi_modal_data_profiling
  2. image_dataset_curation
  3. audio_dataset_curation
  4. vision_dataset
  5. synthethic_multimodal_data
  6. vision_language_scale_pipeline
  7. code_instruction_pipeline
  8. preference_alignment_data
  9. golden_pipeline — run this last to see everything together

Common Gotchas

The decorated function body is not responsible for loading the dataset. HuggingFaceResource performs the load; the function receives the injected dataset parameter. Your function body inspects, transforms, and returns MaterializeResult.

HFParquetIOManager does not persist IterableDataset. Streaming datasets must be written to Parquet manually via pyarrow. See streaming_oom_datasets for the pattern.

HFDatasetPublisher is in a private module. Import from dagster_hf_datasets._export._publisher. Pin your dagster-hf-datasets version if you use this.

Splits from @hf_multi_asset are named {asset_name}_{split}. A glue_sst2 multi-asset produces glue_sst2_train, glue_sst2_validation, glue_sst2_test as downstream dependency names.

Split before tokenizing. Always train-test split before tokenization to avoid data leakage. The golden_pipeline demonstrates the correct order.


Resources

Total size
210 kB
Files
70
Last updated
Jun 14
Pre-warmed CDN
US EU US EU

Contributors