## Model Highlights

#### Small model size

With only **630M parameters (≈2.5 GB in memory)**, the model is easily deployable on most commercial GPUs, eliminating the need for distributed or large-scale compute setups.
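The quoted footprint follows directly from parameter count and numeric precision; a quick back-of-the-envelope check (the 630M figure is from this card, the rest is standard arithmetic):

```python
# Approximate memory needed to hold the model weights alone
# (excludes activations, optimizer state, and framework overhead).
params = 630e6          # 630M parameters, per the model card
bytes_per_param = 4     # float32; halve for float16/bfloat16

gib = params * bytes_per_param / 2**30
print(f"fp32 weights: ~{gib:.2f} GiB")      # ~2.35 GiB, i.e. the ≈2.5 GB above
print(f"fp16 weights: ~{gib / 2:.2f} GiB")  # half that in 16-bit precision
```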

#### Natively multilingual

Building on [MERaLiON-SpeechEncoder-v1](https://huggingface.co/MERaLiON/MERaLiON-SpeechEncoder-v1), which focused on English and Singlish, this version expands to **English, Chinese, Malay, Tamil, Thai, Indonesian, and Vietnamese, with code-switching support across these languages**. Given the wide language coverage of the training corpus, it may also be applicable beyond the officially supported languages.

#### Competitive performance on downstream speech tasks

The model retains near state-of-the-art results on the SUPERB benchmark for English and demonstrates strong multilingual capabilities through its integration into a [high-performance ASR system](#automatic-speech-recognition-asr).

#### Innovative pre-training techniques

MERaLiON-SpeechEncoder-2 was trained from scratch with a **novel extension of the BEST-RQ** self-supervised objective that uses more informative latent targets. We also adopted the **Muon optimizer**, which had previously been shown to outperform the popular AdamW only for LLM training; we find that its advantages carry over to speech-based models as well.
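This card does not spell out the extended objective, but the underlying BEST-RQ idea is simple: discrete training targets come from a frozen random projection and codebook rather than a learned quantizer. A minimal NumPy sketch of standard BEST-RQ target generation (all shapes, names, and sizes here are illustrative, not this model's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen, never-trained components: a random projection and a random codebook.
feat_dim, proj_dim, codebook_size = 80, 16, 8192
projection = rng.standard_normal((feat_dim, proj_dim))
codebook = rng.standard_normal((codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def bestrq_targets(features: np.ndarray) -> np.ndarray:
    """Map speech features (T, feat_dim) to discrete target ids (T,).

    Each projected frame is assigned its nearest codebook entry by
    cosine similarity; the encoder is then trained to predict these
    ids for masked frames.
    """
    z = features @ projection                        # (T, proj_dim)
    z /= np.linalg.norm(z, axis=1, keepdims=True)    # normalise frames
    return np.argmax(z @ codebook.T, axis=1)         # nearest-neighbour ids

targets = bestrq_targets(rng.standard_normal((100, feat_dim)))
print(targets.shape)  # one discrete id per frame
```

Because the quantizer is random and fixed, the objective needs no codebook learning tricks; the "more informative latent targets" mentioned above would replace or enrich this target-generation step.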

## Model Summary

### Automatic Speech Recognition (ASR)

<p align="center">
  <img src="overall_wer.svg" width="720"/>
  <img src="audiobench_wer.svg" width="720"/>
  <img src="fleurs_wer.svg" width="720"/>
</p>

Leveraging the multilingual capabilities of MERaLiON-SpeechEncoder-2, we further finetuned the model for ASR on supervised speech data to produce the lightweight MERaLiON-SpeechEncoder-2-ASR-CTC, which is competitive with models many times its size in transcribing the target languages while offering much faster inference. It outperforms the popular Whisper-large-v3 across most languages on [AudioBench](https://huggingface.co/spaces/MERaLiON/AudioBench-Leaderboard) and maintains close performance on FLEURS. Our comprehensive internal benchmarking, shown in the 'Overall ASR Performance' figure, also includes several private datasets in addition to AudioBench and FLEURS.

## Direct Use