Vegaandagev's Collections

Pretraining Datasets

updated 14 days ago

  • togethercomputer/RedPajama-Data-1T

    Updated Jun 17, 2024 • 1.73M rows • 2.15k downloads • 1.14k likes

  • EleutherAI/the_pile_deduplicated

    Updated Dec 2, 2022 • 134M rows • 27.8k downloads • 111 likes

  • karpathy/climbmix-400b-shuffle

    Updated Mar 3 • 553M rows • 195k downloads • 32 likes

  • allenai/dolma3_mix-6T

    Updated Jan 15 • 109k downloads • 24 likes

  • Skywork/SkyPile-150B

    Updated Dec 7, 2023 • 1.76M rows • 33.4k downloads • 404 likes

  • tiiuae/falcon-refinedweb

    Updated Jun 20, 2023 • 968M rows • 26k downloads • 902 likes

  • allenai/c4

    Updated Jan 9, 2024 • 10.4B rows • 625k downloads • 540 likes

  • HuggingFaceFW/fineweb

    Updated Jul 11, 2025 • 52.5B rows • 329k downloads • 2.73k likes

  • PleIAs/common_corpus

    Updated Feb 19 • 69.9k rows • 140k downloads • 388 likes

  • bigcode/the-stack-v2

    Updated Apr 23, 2024 • 5.45B rows • 10k downloads • 526 likes

  • ontocord/MixtureVitae-v1

    Updated Jan 17 • 59.9M rows • 1.99k downloads • 15 likes

  • wikimedia/wikipedia

    Updated Jan 9, 2024 • 61.6M rows • 96.2k downloads • 1.17k likes

  • manu/project_gutenberg

    Updated Sep 7, 2023 • 75.6k rows • 3.47k downloads • 71 likes

  • KiteFishAI/arxiv-tex-corpus-full

    Updated Feb 21 • 821k rows • 687 downloads • 10 likes