AINovice2005 
posted an update May 29
Excited to share the release of dagster-hf-datasets: a Dagster-native integration that brings Hugging Face Datasets into Dagster's asset-oriented orchestration model.

The integration enables:

• 🤗 Dataset and DatasetDict assets
• 🐙 Dagster asset lineage and observability
• 📦 Parquet-backed materialization via HFParquetIOManager
• 🚀 Publishing curated datasets back to the Hugging Face Hub
• 📝 Automatic dataset card generation from pipeline metadata
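
To make the last bullet a bit more concrete, here is a rough sketch of what generating a dataset card from pipeline metadata could look like. This is not the dagster-hf-datasets API: the `render_dataset_card` helper and the metadata keys (`num_rows`, `columns`, `upstream_assets`) are hypothetical names chosen for illustration, and it uses only the standard library. The one thing it relies on from the Hub side is that a dataset card is a README.md with a YAML front-matter block (fields like `license`, `tags`, `pretty_name`).

```python
from typing import Any


def render_dataset_card(repo_name: str, metadata: dict[str, Any]) -> str:
    """Render a minimal dataset card (README.md contents) as a string.

    Hypothetical helper for illustration only; the real integration
    may structure its cards differently.
    """
    # YAML front matter: the Hub reads this block for license, tags, etc.
    front_matter = [
        "---",
        f"pretty_name: {repo_name}",
        f"license: {metadata['license']}",
        "tags:",
    ]
    for tag in metadata["tags"]:
        front_matter.append(f"- {tag}")
    front_matter.append("---")

    # Human-readable body built from (assumed) pipeline metadata.
    body = [
        f"# {repo_name}",
        "",
        "## Pipeline metadata",
        "",
        f"- Rows: {metadata['num_rows']}",
        f"- Columns: {', '.join(metadata['columns'])}",
        f"- Upstream assets: {', '.join(metadata['upstream_assets'])}",
    ]
    return "\n".join(front_matter + [""] + body) + "\n"


# Example metadata, as an orchestrator might record it for one asset.
example_metadata = {
    "license": "mit",
    "tags": ["dagster", "curated"],
    "num_rows": 3,
    "columns": ["text", "label"],
    "upstream_assets": ["raw_reviews"],
}

card = render_dataset_card("curated-reviews", example_metadata)
print(card)
```

The point of deriving the card from metadata rather than writing it by hand is that row counts, columns, and lineage stay in sync with whatever the pipeline last materialized.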

As the Hub grows past 1M datasets, orchestration, reproducibility, and observability are becoming increasingly important parts of the dataset lifecycle. I'm also working on a longer article covering the architecture and the data pipelines the integration enables.

More soon!

https://github.com/dagster-io/community-integrations/tree/main/libraries/dagster-hf-datasets

https://github.com/dagster-io/community-integrations/tree/main/libraries/dagster-hf-datasets/docs
