dignity045/grandline
dataset-preprocessing
llm-pretraining
tokenization
deduplication
data-pipeline
ml-intern
License: apache-2.0
grandline/scripts (branch: main, 17 kB)
1 contributor
History: 2 commits
Latest commit: ab68c56 (verified) by dignity045, 1 day ago
Add selective HF parquet shard download support (--hf-files, --hf-subdir, --max-shards, --list-shards)
File | Size | Last commit | Updated
inspect_dataset.py | 5.93 kB | Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining | 2 days ago
merge_manifests.py | 2.73 kB | Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining | 2 days ago
run_dataset.py | 8.33 kB | Add selective HF parquet shard download support (--hf-files, --hf-subdir, --max-shards, --list-shards) | 1 day ago