Speech Data
Selected open-source speech data
MLCommons/peoples_speech • Viewer • Updated Nov 20, 2024 • 8.05M • 18.4k • 271
speechcolab/gigaspeech • Viewer • Updated Feb 7 • 11.9M • 28.9k • 165
keithito/lj_speech • Updated Aug 14, 2024 • 1.13k • 62
legacy-datasets/common_voice • Updated Aug 22, 2024 • 1.54k • 145

text-instruction-data
Text instruction data
Open-Orca/OpenOrca • Viewer • Updated Feb 19, 2025 • 2.94M • 18.4k • 1.55k
Open-Orca/SlimOrca-Dedup • Viewer • Updated May 19, 2025 • 363k • 1.52k • 93
Open-Orca/SlimOrca • Viewer • Updated Oct 12, 2023 • 518k • 3.25k • 299
argilla/distilabel-intel-orca-dpo-pairs • Viewer • Updated Aug 7, 2025 • 12.9k • 9.85k • 183

text-pretrain-data
Some pretraining datasets for LLMs
allenai/MADLAD-400 • Updated Sep 9, 2024 • 41.8k • 170
CASIA-LM/ChineseWebText • Viewer • Updated Nov 13, 2023 • 1k • 1.78k • 44
allenai/dolma • Updated Apr 17, 2024 • 4.33k • 1.05k
allenai/peS2o • Updated Oct 13, 2024 • 10.8k • 196