tjadamlee's Collections
text-pretrain-data
Updated Feb 21, 2024
Some pretraining datasets for LLMs
allenai/MADLAD-400 • Updated Sep 9, 2024 • 45.6k • 170
CASIA-LM/ChineseWebText • Viewer • Updated Nov 13, 2023 • 1k • 1.36k • 44
allenai/dolma • Updated Apr 17, 2024 • 4.42k • 1.05k
allenai/peS2o • Updated Oct 13, 2024 • 11k • 196
Skywork/SkyPile-150B • Viewer • Updated Dec 7, 2023 • 1.76M • 10k • 407
wenge-research/yayi2_pretrain_data • Viewer • Updated Dec 29, 2023 • 1.68M • 478 • 59
togethercomputer/RedPajama-Data-V2 • Updated Nov 21, 2024 • 7.82k • 403
tiiuae/falcon-refinedweb • Viewer • Updated Jun 20, 2023 • 968M • 12.1k • 929
togethercomputer/RedPajama-Data-1T • Viewer • Updated Jun 17, 2024 • 1.73M • 1.97k • 1.17k
Tele-AI/TeleChat-PTD • Updated Mar 20, 2024 • 662 • 176
open-web-math/open-web-math • Viewer • Updated Oct 17, 2023 • 6.32M • 33.6k • 349
GAIR/MathPile • Preview • Updated Apr 3, 2025 • 313 • 195
EleutherAI/proof-pile-2 • Updated Oct 25, 2023 • 16.5k • 225
vietgpt/open-web-math • Viewer • Updated Nov 21, 2023 • 6.32M • 242
meta-math/MetaMathQA • Viewer • Updated Dec 21, 2023 • 395k • 49.1k • 463
bigcode/the-stack-dedup • Viewer • Updated Aug 17, 2023 • 237M • 17k • 398
bigcode/the-stack • Viewer • Updated Apr 13, 2023 • 546M • 21.9k • 1.03k
nampdn-ai/tiny-strange-textbooks • Viewer • Updated Feb 2, 2024 • 1M • 102 • 92
uonlp/CulturaX • Viewer • Updated Dec 16, 2024 • 7.18B • 18.5k • 643
Locutusque/UltraTextbooks • Viewer • Updated Feb 2, 2024 • 5.52M • 2.23k • 200
HuggingFaceTB/cosmopedia • Viewer • Updated Aug 12, 2024 • 31.1M • 19.3k • 721