itsnotsplat's Collections

Pretraining

updated about 24 hours ago

General pretraining data for training a model from scratch, totaling around 5.37 trillion tokens.

  • ronantakizawa/github-top-code

    Viewer • Updated 11 days ago • 1.12M • 1.54k • 116

  • HuggingFaceFW/fineweb-edu

    Viewer • Updated Jul 11, 2025 • 3.5B • 221k • 971

  • openbmb/UltraData-Math

    Viewer • Updated 13 days ago • 181M • 83.4k • 257

  • nick007x/github-code-2025

    Viewer • Updated Oct 15, 2025 • 147M • 8.64k • 114

  • angie-chen55/python-github-code

    Viewer • Updated May 31, 2022 • 7.23M • 2.83k • 37

  • jblitzar/github-python

    Viewer • Updated Jul 30, 2025 • 60.3M • 936

  • tiiuae/falcon-refinedweb

    Viewer • Updated Jun 20, 2023 • 968M • 14.1k • 893

  • nick007x/arxiv-papers

    Viewer • Updated Oct 14, 2025 • 2.55M • 6.65k • 179

  • hoskinson-center/proof-pile

    Viewer • Updated Aug 19, 2023 • 363k • 1.36k • 63