rain2sun's Collections
Pretrain-Datasets
Updated Jun 9
Very large-scale open-source datasets used for pretraining
togethercomputer/RedPajama-Data-V2 • Updated Nov 21, 2024 • 6.75k downloads • 388 likes
bigcode/the-stack-v2 • Updated Apr 23, 2024 • 5.45B rows • 8.05k downloads • 436 likes
mlfoundations/dclm-baseline-1.0 • Updated Jul 22, 2024 • 465k downloads • 250 likes
LLM360/TxT360 • Updated May 26 • 38.2k downloads • 247 likes
opencsg/chinese-fineweb-edu-v2 • Updated 14 days ago • 188M rows • 16k downloads • 71 likes
HuggingFaceFW/fineweb-edu • Updated Jul 11 • 3.5B rows • 298k downloads • 884 likes
bigcode/the-stack-dedup • Updated Aug 17, 2023 • 237M rows • 12.8k downloads • 375 likes
HuggingFaceFW/fineweb-2 • Updated Oct 27 • 4.48B rows • 59.1k downloads • 707 likes
microsoft/RedStone • Updated Dec 5, 2024 • 137 downloads • 35 likes
CASIA-LM/ChineseWebText2.0 • Updated Dec 2, 2024 • 2k rows • 2.49k downloads • 27 likes
openbmb/Ultra-FineWeb • Updated 16 days ago • 1.29B rows • 64.8k downloads • 280 likes