Nemotron-Pre-Training-Datasets Collection • Large-scale pre-training datasets used in the Nemotron family of models. • 12 items • Updated about 23 hours ago • 133
microsoft/Phi-4-multimodal-instruct Automatic Speech Recognition • 6B • Updated Dec 10, 2025 • 300k • 1.58k
instruction-pretrain/medicine-instruction-augmented-corpora Preview • Updated about 1 month ago • 144 • 13
datajuicer/the-pile-pubmed-central-refined-by-data-juicer Viewer • Updated Oct 23, 2023 • 100 • 10 • 2