Pre-training (PT) datasets
Note: Derived from Common Crawl (CC): https://commoncrawl.org/
Subset sizes: en: 305GB; en.noclean: 2.3TB; en.noblocklist: 380GB; realnewslike: 15GB; multilingual (mC4): 9.7TB (108 subsets, one per language)
HuggingFaceFW/fineweb
Note: From CC 2013-20 ~ 2025-26 (continuously updated). 18.5T tokens, English only.
Blog: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
HuggingFaceFW/fineweb-edu
Note: From CC 2013-20 ~ 2025-26 (continuously updated). 1.3T tokens.
Paper: https://arxiv.org/abs/2406.17557
HuggingFaceFW/fineweb-2
Note: From CC 2013-20 ~ 2024-18. 20TB on disk (3T words), 1000+ languages.
Blog: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks
togethercomputer/RedPajama-Data-1T
Note: From CC, C4, GitHub, Wikipedia, Gutenberg and Books3, ArXiv, and StackExchange. 1.2T tokens.
togethercomputer/RedPajama-Data-V2
Note: From CC 2014-15 ~ 2023-14. 30T tokens in English, French, Spanish, German, and Italian.
Blog: https://www.together.ai/blog/redpajama-data-v2
mlfoundations/dclm-baseline-1.0
Note: From CC 2013-20 ~ 2022-49. 2.6T tokens (baseline), 240T tokens (pool).
DCLM-pool: https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/index.html
DCLM-refinedweb: https://data.commoncrawl.org/contrib/datacomp/DCLM-refinedweb/index.html
DCLM-baseline: https://data.commoncrawl.org/contrib/datacomp/DCLM-baseline/index.html
nvidia/Nemotron-CC-v2
Note: From CC 2013-20 ~ 2025-13.
v1 (2013-20 ~ 2024-30, 3.3T tokens actual & 1.9T synthetic): https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html
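The crawl ranges in the notes above (for example "2013-20 ~ 2022-49") appear to use Common Crawl snapshot labels of the form YYYY-WW, as in crawl IDs like CC-MAIN-2013-20. The following is a minimal sketch, under that assumption, of how one might check whether a given snapshot falls inside a dataset's stated range. The names `Snapshot`, `parse_snapshot`, `parse_range`, and `covers` are hypothetical helpers written for this illustration and do not come from any of the listed datasets or libraries.

```python
import re
from typing import NamedTuple


class Snapshot(NamedTuple):
    """A crawl label split into (year, week); tuples compare in that order."""
    year: int
    week: int


# Accepts "2013-20" as well as the full form "CC-MAIN-2013-20" (assumed format).
_LABEL = re.compile(r"^(?:CC-MAIN-)?(\d{4})-(\d{2})$")


def parse_snapshot(label: str) -> Snapshot:
    """Parse a single snapshot label such as '2013-20'."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a snapshot label: {label!r}")
    return Snapshot(int(match.group(1)), int(match.group(2)))


def parse_range(text: str) -> tuple[Snapshot, Snapshot]:
    """Parse a range written as 'START ~ END', the notation used in the notes."""
    parts = text.split("~")
    if len(parts) != 2:
        raise ValueError(f"expected 'START ~ END', got: {text!r}")
    start, end = parse_snapshot(parts[0]), parse_snapshot(parts[1])
    if start > end:
        raise ValueError(f"range start is after range end: {text!r}")
    return start, end


def covers(range_text: str, label: str) -> bool:
    """Return True if `label` lies within the inclusive range `range_text`."""
    start, end = parse_range(range_text)
    return start <= parse_snapshot(label) <= end


# Ranges copied from the notes above.
dclm_range = "2013-20 ~ 2022-49"
fineweb2_range = "2013-20 ~ 2024-18"

print(covers(dclm_range, "2024-18"))      # label is later than the DCLM range
print(covers(fineweb2_range, "2024-18"))  # the end point is inclusive
```

This only compares labels. It does not check that a snapshot with that label actually exists, and it says nothing about which snapshots inside a range a dataset actually ingested.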