itsnotsplat's Collections
Pretraining
Updated about 24 hours ago
General-purpose pretraining data for training a model from scratch, totaling around 5.37 trillion tokens.
ronantakizawa/github-top-code • Updated 11 days ago • 1.12M • 1.54k • 116
HuggingFaceFW/fineweb-edu • Updated Jul 11, 2025 • 3.5B • 221k • 971
openbmb/UltraData-Math • Updated 13 days ago • 181M • 83.4k • 257
nick007x/github-code-2025 • Updated Oct 15, 2025 • 147M • 8.64k • 114
angie-chen55/python-github-code • Updated May 31, 2022 • 7.23M • 2.83k • 37
jblitzar/github-python • Updated Jul 30, 2025 • 60.3M • 936
tiiuae/falcon-refinedweb • Updated Jun 20, 2023 • 968M • 14.1k • 893
nick007x/arxiv-papers • Updated Oct 14, 2025 • 2.55M • 6.65k • 179
hoskinson-center/proof-pile • Updated Aug 19, 2023 • 363k • 1.36k • 63