The Synthetic Data Playbook: Generating Trillions of the Finest Tokens 📝 Explore synthetic data experiments in a bookshelf view • 174
HuggingFaceFW/finepdfs_edu_50BT-dclm_30BT-fineweb_edu_20BT-shuffled Viewer • Updated 12 days ago • 56.1M • 776
HuggingFaceFW/finepdfs_edu_50BT-dclm_30BT-fineweb_edu_20BT Viewer • Updated 12 days ago • 56.1M • 51.7k
HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT-shuffled Viewer • Updated 12 days ago • 62.1M • 829 • 3
HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT Viewer • Updated 12 days ago • 62.1M • 37.1k • 1
🤏 Smol-Data Collection Tried and tested mixes for strong pretraining. Inspired by https://huggingface.co/blog/codelion/optimal-dataset-mixing • 14 items • Updated 12 days ago • 12