itsnotsplat's Collections
Pretraining

updated Mar 29

General-purpose pretraining data for training a model from scratch, totaling roughly 2.1 trillion tokens.

  • ronantakizawa/github-top-code

    Viewer • Updated Feb 23 • 1.12M • 538 • 122

  • HuggingFaceFW/fineweb-edu

    Viewer • Updated Jul 11, 2025 • 3.5B • 615k • 1.08k

  • openbmb/UltraData-Math

    Viewer • Updated Apr 15 • 181M • 64.2k • 306

  • nick007x/github-code-2025

    Viewer • Updated Apr 1 • 148M • 1.35k • 117

  • angie-chen55/python-github-code

    Viewer • Updated May 31, 2022 • 7.23M • 5.34k • 37

  • tiiuae/falcon-refinedweb

    Viewer • Updated Jun 20, 2023 • 968M • 21.2k • 913

  • nick007x/arxiv-papers

    Viewer • Updated Apr 1 • 2.55M • 858k • 185

  • hoskinson-center/proof-pile

    Viewer • Updated Aug 19, 2023 • 363k • 2.29k • 67

  • HuggingFaceTB/finemath

    Viewer • Updated Feb 6, 2025 • 48.3M • 40.2k • 360
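
To get a rough sense of how the listed datasets compare in size, the abbreviated counts above can be turned into relative shares. This is only a sketch: it assumes the first count on each line is the dataset's row count, and the helper names (`parse_count`, `row_shares`) are hypothetical. Row shares are not token shares, since rows vary widely in length, so this does not reproduce the ~2.1 trillion token figure or any sampling mix the collection's author may use.

```python
from decimal import Decimal

# Multipliers for the abbreviated counts shown on the page (e.g. "3.5B", "363k").
SUFFIXES = {"k": 10**3, "m": 10**6, "b": 10**9}


def parse_count(text: str) -> int:
    """Convert an abbreviated count such as '1.12M' into an integer."""
    text = text.strip()
    suffix = text[-1].lower()
    if suffix in SUFFIXES:
        # Decimal avoids float rounding, e.g. 1.12 * 1e6 drifting off 1120000.
        return int(Decimal(text[:-1]) * SUFFIXES[suffix])
    return int(Decimal(text))


# Dataset IDs and the first count listed for each one in the collection.
# Assumption: that first count is the number of rows.
LISTED_ROWS = {
    "ronantakizawa/github-top-code": "1.12M",
    "HuggingFaceFW/fineweb-edu": "3.5B",
    "openbmb/UltraData-Math": "181M",
    "nick007x/github-code-2025": "148M",
    "angie-chen55/python-github-code": "7.23M",
    "tiiuae/falcon-refinedweb": "968M",
    "nick007x/arxiv-papers": "2.55M",
    "hoskinson-center/proof-pile": "363k",
    "HuggingFaceTB/finemath": "48.3M",
}


def row_shares(listed: dict[str, str]) -> dict[str, float]:
    """Return each dataset's fraction of the total listed row count."""
    rows = {name: parse_count(count) for name, count in listed.items()}
    total = sum(rows.values())
    return {name: n / total for name, n in rows.items()}


if __name__ == "__main__":
    shares = row_shares(LISTED_ROWS)
    # Print datasets from largest to smallest share of rows.
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:40s} {share:7.2%}")
```

Under these assumptions the web-text datasets dominate by row count, so a training mix that should emphasize code or math would need explicit weights rather than row-proportional sampling.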