Update README.md

README.md CHANGED

@@ -29,7 +29,7 @@ language:
We introduce **MERaLiON-SpeechEncoder-2**, our next-generation multilingual speech encoder, pre-trained from scratch on a greatly expanded corpus of **1.4 million hours** of unlabeled audio, with a **strong focus on Southeast Asian (SEA) languages and accents**. As a speech foundation model, it encodes speech into a general-purpose, multilingual acoustic representation that can serve as a high-performance backbone for a wide range of downstream tasks, including automatic speech recognition (ASR), speech translation, speaker and language identification, and emotion recognition. **This model can be fine-tuned on custom datasets, allowing developers to build speech systems tailored to their specific needs.**
-Unlike many existing models optimized for high-resource, Western languages, MERaLiON-SpeechEncoder-2 is designed from the ground up to reflect the linguistic diversity and complexity of Southeast Asia. Our training data was curated to contain a substantial amount originating from Singapore and SEA, including 60,000 hours of Singapore-accented speech, with a further 160,000 hours covering Singapore’s official languages Chinese, Malay and Tamil, along with a smaller portion of dialects like Hokkien and Cantonese. SEA data amounts to 200,000 hours, including significant proportions of Malay, Thai, Indonesian, Vietnamese, with smaller amounts of Tagalog, Burmese, Javanese, Sundanese, Khmer and Lao. See below for a
+Unlike many existing models optimized for high-resource, Western languages, MERaLiON-SpeechEncoder-2 is designed from the ground up to reflect the linguistic diversity and complexity of Southeast Asia. Our training data was curated to contain a substantial amount of audio originating from Singapore and SEA, including 60,000 hours of Singapore-accented speech, with a further 160,000 hours covering Singapore’s official languages Chinese, Malay and Tamil, along with a smaller portion of dialects like Hokkien and Cantonese. SEA data amounts to 200,000 hours, including significant proportions of Malay, Thai, Indonesian and Vietnamese, with smaller amounts of Tagalog, Burmese, Javanese, Sundanese, Khmer and Lao. See below for a regional breakdown of the language coverage of our pre-training data.
<p align="center">
<img src="data2.svg" width="620"/>
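The updated README states that the encoder can be fine-tuned on custom datasets for downstream tasks such as emotion recognition. A minimal sketch of that workflow is below; note that `SpeechEncoderStub`, its shapes, and the classifier head are hypothetical stand-ins for illustration only, not the published MERaLiON-SpeechEncoder-2 API (the real checkpoint would be loaded through its own documented loading code).

```python
# Hedged sketch: fine-tuning a pre-trained speech encoder for utterance
# classification. SpeechEncoderStub is a hypothetical stand-in for the
# real MERaLiON-SpeechEncoder-2 checkpoint; all names here are assumptions.
import torch
import torch.nn as nn

class SpeechEncoderStub(nn.Module):
    """Stand-in encoder: maps raw 16 kHz audio to frame-level features."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        # Strided conv roughly mimics the ~20 ms frame rate typical of
        # speech encoders; the real model uses its own architecture.
        self.conv = nn.Conv1d(1, hidden_dim, kernel_size=400, stride=320)
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, samples) -> features: (batch, frames, hidden_dim)
        x = self.conv(audio.unsqueeze(1)).transpose(1, 2)
        return self.proj(x)

class ClassifierHead(nn.Module):
    """Mean-pools frame features and predicts one label per utterance."""
    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.linear(features.mean(dim=1))

encoder = SpeechEncoderStub()
head = ClassifierHead(hidden_dim=256, num_classes=4)  # e.g. 4 emotion classes

# One illustrative training step on random "audio" (1 second at 16 kHz).
batch = torch.randn(2, 16000)
labels = torch.tensor([0, 3])
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-4
)

logits = head(encoder(batch))  # shape (2, 4)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

In practice one would typically freeze or use a reduced learning rate for the pre-trained encoder and train the task head on labeled data for the target task.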