dignity045/grandline

dataset-preprocessing
llm-pretraining
tokenization
deduplication
data-pipeline
ml-intern
grandline
153 kB
  • 1 contributor
History: 9 commits
dignity045
Update ML Intern artifact metadata
f39f181 verified 1 day ago
  • configs
    Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining 1 day ago
  • scripts
    Add selective HF parquet shard download support (--hf-files, --hf-subdir, --max-shards, --list-shards) 1 day ago
  • src
    Add selective HF parquet shard download support (--hf-files, --hf-subdir, --max-shards, --list-shards) 1 day ago
  • state
    Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining 1 day ago
  • tests
    Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining 1 day ago
  • .gitattributes
    1.52 kB
    initial commit 1 day ago
  • README.md
    9.85 kB
    Update ML Intern artifact metadata 1 day ago
  • pyproject.toml
    1.16 kB
    Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining 1 day ago