Buckets:

the-hf-stack
/

dagster-hf-datasets-examples

the-hf-stack/dagster-hf-datasets-examples / distributed_token_sharding

210 kB

70 files

Updated 17 days ago

Ctrl+K

Name	Size	Uploaded	Xet hash
README.md	2.68 kB xet	17 days ago	be07d3fd
__init__.py	94 Bytes xet	17 days ago	bb13d76d
assets.py	1.31 kB xet	17 days ago	c2d65c91
definitions.py	593 Bytes xet	17 days ago	7205581f

README.md

Distributed Token Sharding

Tokenize a large web dataset as a Dagster asset and persist the result through the Hugging Face Parquet IO manager.

What this example shows

Loading a large Hub dataset with @hf_dataset_asset
Applying a Hugging Face tokenizer in a downstream @asset
Using batched Dataset.map() for tokenization throughput
Persisting both raw and tokenized datasets with HFParquetIOManager
Recording tokenizer and row-count metadata in the Dagster UI

Dataset

HuggingFaceFW/fineweb (sample-100BT config) - a large-scale cleaned web corpus used for language-model pretraining experiments. This example keeps the asset graph intentionally small so the focus stays on the ingestion -> tokenization handoff.

Asset	Description
`fineweb_dataset`	Loads the FineWeb sample from the Hub
`tokenized_fineweb`	Tokenizes the `text` column with `bert-base-uncased`

Asset graph

fineweb_dataset
      |
      v
tokenized_fineweb

Key API

@asset(
    group_name="tokenization_shard_caching",
    io_manager_key="hf_parquet_io_manager",
)
def tokenized_fineweb(
    context: AssetExecutionContext,
    fineweb_dataset: Dataset,
) -> MaterializeResult:
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    tokenized = fineweb_dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True),
        batched=True,
        batch_size=1000,
    )

    return MaterializeResult(value=tokenized, metadata={"rows": len(tokenized)})

Dataset.map(..., batched=True) processes multiple records per tokenizer call, which is the standard pattern for keeping tokenization overhead manageable.

Metadata visible in the Dagster UI

Asset	Key	Description
`fineweb_dataset`	`rows`	Number of raw rows loaded from the Hub
`tokenized_fineweb`	`rows`	Number of rows after tokenization
`tokenized_fineweb`	`tokenizer`	Tokenizer used for the transformation

Storage layout

.dagster_hf_storage/
├── fineweb_dataset/
└── tokenized_fineweb/

Both assets are written by HFParquetIOManager, so downstream assets can receive the materialized Dataset object directly.

How to run

cd dagster_hf_datasets_examples

dagster dev -m distributed_token_sharding.definitions

Materialize fineweb_dataset first, then tokenized_fineweb.

Note: FineWeb configs can be large. For local testing, reduce the dataset inside fineweb_dataset() before tokenization if you do not want to process the full split.

Total size: 210 kB

Files: 70

Last updated: Jun 14

Pre-warmed CDN: US EU US EU