Vegaandagev's Collections
Pretraining Datasets
Updated 14 days ago
togethercomputer/RedPajama-Data-1T • Viewer • Updated Jun 17, 2024 • 1.73M • 2.15k • 1.14k
EleutherAI/the_pile_deduplicated • Viewer • Updated Dec 2, 2022 • 134M • 27.8k • 111
karpathy/climbmix-400b-shuffle • Viewer • Updated Mar 3 • 553M • 195k • 32
allenai/dolma3_mix-6T • Preview • Updated Jan 15 • 109k • 24
Skywork/SkyPile-150B • Viewer • Updated Dec 7, 2023 • 1.76M • 33.4k • 404
tiiuae/falcon-refinedweb • Viewer • Updated Jun 20, 2023 • 968M • 26k • 902
allenai/c4 • Viewer • Updated Jan 9, 2024 • 10.4B • 625k • 540
HuggingFaceFW/fineweb • Viewer • Updated Jul 11, 2025 • 52.5B • 329k • 2.73k
PleIAs/common_corpus • Viewer • Updated Feb 19 • 69.9k • 140k • 388
bigcode/the-stack-v2 • Viewer • Updated Apr 23, 2024 • 5.45B • 10k • 526
ontocord/MixtureVitae-v1 • Viewer • Updated Jan 17 • 59.9M • 1.99k • 15
wikimedia/wikipedia • Viewer • Updated Jan 9, 2024 • 61.6M • 96.2k • 1.17k
manu/project_gutenberg • Viewer • Updated Sep 7, 2023 • 75.6k • 3.47k • 71
KiteFishAI/arxiv-tex-corpus-full • Viewer • Updated Feb 21 • 821k • 687 • 10