dignity045/grandline

dataset-preprocessing
llm-pretraining
tokenization
deduplication
data-pipeline
ml-intern
grandline / scripts
17 kB
  • 1 contributor
History: 2 commits
dignity045
Add selective HF parquet shard download support (--hf-files, --hf-subdir, --max-shards, --list-shards)
ab68c56 verified 1 day ago
  • inspect_dataset.py
    5.93 kB
    Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining 2 days ago
  • merge_manifests.py
    2.73 kB
    Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining 2 days ago
  • run_dataset.py
    8.33 kB
    Add selective HF parquet shard download support (--hf-files, --hf-subdir, --max-shards, --list-shards) 1 day ago