dignity045/grandline
Tags: dataset-preprocessing, llm-pretraining, tokenization, deduplication, data-pipeline, ml-intern
License: apache-2.0
main: grandline/tests, 21.5 kB
1 contributor
History: 1 commit
dignity045, ed59144 (verified), 1 day ago: Initial GrandLine implementation: deterministic shard-first dataset preprocessing for LLM pretraining
test_dedup.py        4.52 kB  1 day ago
test_determinism.py  3.65 kB  1 day ago
test_filters.py      4.86 kB  1 day ago
test_pack.py         4.64 kB  1 day ago
test_tokenize.py     3.87 kB  1 day ago