Hugging Face
ashish-soni08's Collections
Pre-Training-Data-for-LLMs
Updated Aug 16, 2024
Open-source datasets that have been used to pre-train large language models.
tiiuae/falcon-refinedweb • Updated Jun 20, 2023 • 968M rows • 20.9k downloads • 914 likes
togethercomputer/RedPajama-Data-1T • Updated Jun 17, 2024 • 1.73M rows • 2.57k downloads • 1.16k likes
mikex86/stackoverflow-posts • Updated Aug 1, 2023 • 58.3M rows • 5.55k downloads • 61 likes