AINovice2005's picture
|
download
raw
9.35 kB
# dagster-hf-datasets Examples
A curated collection of 17 end-to-end examples demonstrating how to build reproducible, observable, and production-ready ML data pipelines with `dagster-hf-datasets`
Each example is self-contained, runnable with `dagster dev -m <example_folder>.definitions.py`, and covers a distinct concept or workflow tier — from basic Hub ingestion to flagship end-to-end pipelines.
---
## Prerequisites
```bash
pip install dagster dagster-hf-datasets
```
Some examples require additional dependencies (noted in their README). Set your Hugging Face token for gated datasets:
```bash
export HF_TOKEN=hf_...
```
All examples share the same `Definitions` boilerplate. The `HuggingFaceResource` handles Hub loading; `HFParquetIOManager` handles persistence:
```python
from dagster import Definitions
from dagster_hf_datasets import HuggingFaceResource, HFParquetIOManager
defs = Definitions(
assets=[...],
resources={
"huggingface": HuggingFaceResource(cache_dir=".hf_cache"),
"hf_parquet_io_manager": HFParquetIOManager(base_dir=".dagster_hf_storage"),
},
)
```
---
## Running an Example
```bash
cd dagster_hf_datasets_examples
dagster dev -m <example_folder>.definitions.py
```
Every `definitions.py` is a standalone entrypoint.
---
## Repository Structure
```
dagster-hf-datasets-examples/
├── README.md ← you are here
├── pyproject.toml
├── basic_hub_ingestion/
│ ├── assets.py ← @hf_dataset_asset / @hf_multi_asset definitions
│ ├── definitions.py ← Definitions() entrypoint for dagster dev
│ ├── README.md ← goal, dataset, API notes, expected output
│ └── __init__.py
└── <example_name>/ ← same layout for all 17 examples
├── assets.py
├── definitions.py
├── README.md
└── __init__.py
```
---
## Example Catalog
### 🏗️ Foundation
Core patterns every `dagster-hf-datasets` user encounters first.
| Folder | Description | Dataset | Key API |
|--------|-------------|---------|---------|
| [`basic_hub_ingestion`](./basic_hub_ingestion/) | Load a Hub dataset and materialize it as a Dagster asset with lineage metadata | `stanfordnlp/imdb` | `@hf_dataset_asset` |
| [`multi_asset_split_routing`](./multi_asset_split_routing/) | Materialize train/validation/test splits as independently tracked assets | `nyu-mll/glue` (sst2) | `@hf_multi_asset` |
| [`sanitization_observability`](./sanitization_observability/) | Filter, deduplicate, and expose quality metrics via asset checks | `HuggingFaceFW/fineweb-edu` | `@asset_check`, `Dataset.filter()` |
| [`dataset_card_publishing`](./dataset_card_publishing/) | Generate and publish a Hub README from pipeline metadata | `rajpurkar/squad` | `HFDatasetPublisher` |
---
### 🔄 Data Management & Automation
Patterns for scalability, automation, and memory-safe processing.
| Folder | Description | Dataset | Key API |
|--------|-------------|---------|---------|
| [`event_driven_benchmark_refresh`](./event_driven_benchmark_refresh/) | Re-materialize evaluation datasets automatically when an upstream Hub revision changes | `openai/openai_humaneval` | Dagster sensors |
| [`streaming_oom_datasets`](./streaming_oom_datasets/) | Process datasets larger than available memory using `IterableDataset` | `allenai/dolma` | `IterableDataset`, chunked Parquet writes |
| [`dynamic_bucket_partitioning`](./dynamic_bucket_partitioning/) | Partition datasets by language, domain, or task with Dagster's partition system | `Helsinki-NLP/opus_books` | `StaticPartitionsDefinition`, `DynamicPartitionsDefinition` |
| [`distributed_token_sharding`](./distributed_token_sharding/) | Pre-tokenize large datasets into training-ready shards | `HuggingFaceFW/fineweb` | Batched `Dataset.map()`, `.npy` shard caching |
---
### 🖼️ Multimodal Workflows
Image, audio, and vision-language pipelines.
| Folder | Description | Dataset | Key API |
|--------|-------------|---------|---------|
| [`multi_modal_data_profiling`](./multi_modal_data_profiling/) | Profile an image-text dataset: resolution stats, caption analysis, thumbnail gallery, health score | `nlphuji/flickr30k` | PIL, `Dataset.from_list()` |
| [`image_dataset_curation`](./image_dataset_curation/) | Build a production-quality vision dataset: corrupt detection, dedup, aspect-ratio validation, multi-format export | `zh-plus/tiny-imagenet` | `@multi_asset`, WebDataset/Arrow/Parquet packaging |
| [`audio_dataset_curation`](./audio_dataset_curation/) | Construct a clean speech corpus: duration filtering, SNR gating, language validation | `mozilla-foundation/common_voice_11_0` | `soundfile`, `librosa` |
| [`vision_dataset`](./vision_dataset/) | Package image datasets into training-optimized formats with manifests | `zh-plus/tiny-imagenet` | WebDataset `.tar`, Arrow, Parquet |
| [`synthethic_multimodal_data`](./synthethic_multimodal_data/) | Caption pre-generated synthetic images, score caption↔prompt alignment, filter by quality | `poloclub/diffusiondb` | BLIP-base (CPU), MiniLM similarity |
| [`vision_language_scale_pipeline`](./vision_language_scale_pipeline/) | Build CLIP/LLaVA-style image-caption datasets at scale | `google-research-datasets/conceptual_captions` | Image-caption pairing, schema validation |
---
### 🤖 LLM & Alignment Workflows
Text, instruction, and preference-alignment pipelines.
| Folder | Description | Dataset | Key API |
|--------|-------------|---------|---------|
| [`code_instruction_pipeline`](./code_instruction_pipeline/) | Prepare code and instruction-following datasets for model training | `bigcode/the-stack` | Deduplication, language filtering |
| [`preference_alignment_data`](./preference_alignment_data/) | Build RLHF/DPO-ready datasets from raw preference records | `Anthropic/hh-rlhf` | Prompt/chosen/rejected schema, pair validation |
| [`golden_pipeline`](./golden_pipeline/) | Complete end-to-end pipeline: ingest → clean → validate → split → tokenize → publish | `allenai/c4` | Full asset graph, `HFDatasetPublisher` |
---
## Learning Path
### Beginner — start here if you're new to dagster-hf-datasets
1. [`basic_hub_ingestion`](./basic_hub_ingestion/) — understand `@hf_dataset_asset` and `MaterializeResult`
2. [`multi_asset_split_routing`](./multi_asset_split_routing/) — see how `@hf_multi_asset` auto-resolves splits
3. [`sanitization_observability`](./sanitization_observability/) — chain `@hf_dataset_asset` with downstream `@asset` nodes
4. [`dataset_card_publishing`](./dataset_card_publishing/) — publish lineage-aware Hub documentation
### Intermediate — build operational pipelines
5. [`event_driven_benchmark_refresh`](./event_driven_benchmark_refresh/) — add sensor-based automation
6. [`streaming_oom_datasets`](./streaming_oom_datasets/) — handle datasets that exceed memory
7. [`dynamic_bucket_partitioning`](./dynamic_bucket_partitioning/) — partition by language/domain/task
8. [`distributed_token_sharding`](./distributed_token_sharding/) — generate training-ready tokenized shards
### Advanced — multimodal and specialized workflows
9. [`multi_modal_data_profiling`](./multi_modal_data_profiling/)
10. [`image_dataset_curation`](./image_dataset_curation/)
11. [`audio_dataset_curation`](./audio_dataset_curation/)
12. [`vision_dataset`](./vision_dataset/)
13. [`synthethic_multimodal_data`](./synthethic_multimodal_data/)
14. [`vision_language_scale_pipeline`](./vision_language_scale_pipeline/)
15. [`code_instruction_pipeline`](./code_instruction_pipeline/)
16. [`preference_alignment_data`](./preference_alignment_data/)
17. [`golden_pipeline`](./golden_pipeline/) — run this last to see everything together
---
## Common Gotchas
**The decorated function body is not responsible for loading the dataset.**
`HuggingFaceResource` performs the load; the function receives the injected `dataset` parameter. Your function body inspects, transforms, and returns `MaterializeResult`.
**`HFParquetIOManager` does not persist `IterableDataset`.**
Streaming datasets must be written to Parquet manually via `pyarrow`. See `streaming_oom_datasets` for the pattern.
**`HFDatasetPublisher` is in a private module.**
Import from `dagster_hf_datasets._export._publisher`. Pin your `dagster-hf-datasets` version if you use this.
**Splits from `@hf_multi_asset` are named `{asset_name}_{split}`.**
A `glue_sst2` multi-asset produces `glue_sst2_train`, `glue_sst2_validation`, `glue_sst2_test` as downstream dependency names.
**Split before tokenizing.**
Always train-test split before tokenization to avoid data leakage. The `golden_pipeline` demonstrates the correct order.
---
## Resources
- [dagster-hf-datasets on GitHub](https://github.com/dagster-io/community-integrations/tree/main/libraries/dagster-hf-datasets)
- [API Reference](https://github.com/dagster-io/community-integrations/blob/main/libraries/dagster-hf-datasets/docs/API.md)
- [Usage Guide](https://github.com/dagster-io/community-integrations/blob/main/libraries/dagster-hf-datasets/docs/Usage.md)
- [Dagster Docs: HF Datasets Integration](https://docs.dagster.io/integrations/libraries/hf-datasets)
- [Hugging Face Datasets Docs](https://huggingface.co/docs/datasets)

Xet Storage Details

Size:
9.35 kB
·
Xet hash:
e5feaa0bd53c4f7ad2b28fd43e3fc236367ce8b3da140783db952de80cb39c07

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.