dignity045/grandline
Tags: dataset-preprocessing, llm-pretraining, tokenization, deduplication, data-pipeline, ml-intern
License: apache-2.0
Branch: main (153 kB)
1 contributor
History: 9 commits
Latest commit: f39f181 (verified) by dignity045, 1 day ago: Update ML Intern artifact metadata
| Name | Size | Last commit message | Updated |
| --- | --- | --- | --- |
| configs | | Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining | 1 day ago |
| scripts | | Add selective HF parquet shard download support (--hf-files, --hf-subdir, --max-shards, --list-shards) | 1 day ago |
| src | | Add selective HF parquet shard download support (--hf-files, --hf-subdir, --max-shards, --list-shards) | 1 day ago |
| state | | Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining | 1 day ago |
| tests | | Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining | 1 day ago |
| .gitattributes | 1.52 kB | initial commit | 1 day ago |
| README.md | 9.85 kB | Update ML Intern artifact metadata | 1 day ago |
| pyproject.toml | 1.16 kB | Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining | 1 day ago |