Article RegMix: Data Mixture as Regression for Language Model Pre-training SivilTaram • Jul 11, 2024 • 16
🧠Reasoning datasets Collection Datasets with reasoning traces for math and code released by the community • 24 items • Updated May 19, 2025 • 190
TTS Datasets Collection My selection of multilingual TTS Datasets. • 280 items • Updated 17 days ago • 2
dataset collection Collection Contains datasets for training Indic Parler-TTS. • 30 items • Updated Mar 20, 2025 • 1
IndicTTS Datasets Collection Datasets derived from the Indic TTS Database, a special corpus of Indian languages developed by the Speech Technology Consortium at IIT Madras. • 13 items • Updated Mar 6, 2025 • 15
Article 🪆 Introduction to Matryoshka Embedding Models tomaarsen, Xenova, osanseviero • Feb 23, 2024 • 211