STEM & Non-STEM Q&A Datasets for LLM Training Collection Sample datasets from a 6.5M+ enterprise-grade Q&A corpus across STEM and Non-STEM domains, built for LLM training, instruction tuning, and evaluation. • 5 items • Updated 9 days ago • 1
Academic Textbook Corpora for LLM Training Collection Sample of a 2.2B+ word textbook corpus across 32K+ books, 5K+ subjects, and 14 languages for LLM training and multilingual knowledge modeling. • 13 items • Updated 9 days ago • 1
Podcast Speech & Conversational Audio Datasets Collection Sample from a podcast audio dataset of diverse, real-world spoken content, designed for automatic speech recognition (ASR) and conversational AI training. • 12 items • Updated 9 days ago • 1
Dual Channel Global Customer-Agent Interaction Datasets Collection Sample datasets of dual-channel call center audio with separate agent and customer channels for ASR, diarization, and conversational AI training. • 22 items • Updated 9 days ago • 1
Healthcare AI Datasets for Clinical & LLM Training Collection Sample dataset from an enterprise-grade medical corpus built for clinical AI, diagnosis support, and healthcare LLM training. • 12 items • Updated 9 days ago • 1
Computer Vision & Multimodal Datasets Collection Sample dataset from a multilingual image corpus covering medical, STEM, Non-STEM, automobile, and complex domains for computer vision and multimodal AI. • 7 items • Updated 9 days ago • 1