---
license: other
license_name: meralion-public-license-v3
license_link: https://huggingface.co/datasets/MERaLiON/MERaLiON_Public_Licence/blob/main/MERaLiON-Public-Licence-v3.pdf
extra_gated_fields:
  First Name: text
  Last Name: text
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
      - Student
      - Research Graduate
      - AI developer/Researcher
      - Other
  I consent to being contacted by the MERaLiON team for feedback or follow-up regarding my experience using the model: checkbox
extra_gated_description: >-
  By downloading this model, you acknowledge that you have read and agree to be
  bound by the Terms and Conditions set out in this document
  [MERaLiON Public License v3](https://huggingface.co/datasets/MERaLiON/MERaLiON_Public_Licence/blob/main/MERaLiON-Public-Licence-v3.pdf).
  The information you provide will be collected, stored, processed, and shared
  in accordance with the [A*STAR Privacy Policy](https://www.a-star.edu.sg/privacy-statement).
extra_gated_button_content: Submit
datasets:
  - MERaLiON/Multitask-National-Speech-Corpus-v1
language:
  - en
  - zh
  - ms
  - ta
  - id
  - th
  - vi
metrics:
  - wer
  - bleu
base_model:
  - openai/whisper-large-v3
  - google/gemma-2-9b-it
library_name: transformers
tags:
  - meralion
  - meralion-3
---
[💻 Web Demo](https://meralion.org/demo/) | ⚙️ vLLM coming soon
## Introduction

We are pleased to announce the release of our flagship speech-text large language model, [**MERaLiON-3-10B**](https://huggingface.co/MERaLiON/MERaLiON-3-10B).

MERaLiON-3-10B demonstrates competitive performance across benchmark evaluations in Age Recognition, Gender Recognition, Spoken Question Answering (SQA), and Contextual Paralinguistic Question Answering (CPQA) in the Southeast Asian context. These results are comparable to those achieved by other state-of-the-art AudioLLMs, including Gemini 3 Flash and Qwen3 Omni Instruct. MERaLiON-3-10B maintains its competitive performance vis-à-vis MERaLiON-2-10B in other tasks such as Multilingual Automatic Speech Recognition (ASR), Speech Translation (ST), Audio Scene Understanding, and general speech comprehension.

We constructed a benchmark containing speech and prompts in Malay, Indonesian, English, Chinese, Tamil, Thai, and Vietnamese to better represent the Southeast Asian context. The following table presents task-specific evaluation scores, assessed using the LLM-as-a-Judge framework across multiple datasets. Higher scores indicate better performance. We will open-source these benchmarks separately as part of a paper. See the [Evaluation](#performance) section for detailed benchmarks.

| Benchmark | MERaLiON-3-10B | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Age (commonvoice-en, ta, th, vi, zh) | 75.41 | 61.77 | 70.38 | **77.00** | 68.90 |
| Gender (Multi-dataset) | **96.67** | 54.19 | 95.34 | 81.72 | 40.25 |
| Spoken Q&A (SQA) | **61.50** | 56.76 | 58.74 | 59.75 | 57.48 |
| Contextual paralinguistic Q&A (CPQA) | **57.33** | 48.31 | 54.21 | 54.07 | 54.54 |

## Model Description:

MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**earning **i**n **O**ne **N**etwork, with models tailored for **Singapore’s multilingual and multicultural landscape**, as well as the wider **Southeast Asian region**.
MERaLiON-3-10B is finetuned on **150,000 hours of speech and audio data** across **6 diverse tasks**: Automatic Speech Recognition (ASR), SQA, Spoken Dialogue Summarization (SDS), Audio Captioning (AC), Audio-Scene Question Answering (ASQA), and CPQA.

- **Developed by:** I2R, A\*STAR, Singapore
- **Model type:** Multimodal LLM
- **Language(s):** Primarily English (Global and Singapore) and Chinese, with support for audio in regional languages including Malay, Tamil, Indonesian, Thai, and Vietnamese.
- **Audio:** **Mono**-channel audio, **16,000** Hz sampling rate, up to **300** seconds.
- **License:** [MERaLiON Public License](https://huggingface.co/datasets/MERaLiON/MERaLiON_Public_Licence/blob/main/MERaLiON-Public-Licence-v3.pdf)
- **Demo:** [MERaLiON-AudioLLM Web Demo](https://meralion.org/demo/)

## Performance:

We benchmarked MERaLiON-3-10B against Qwen3 Omni, Gemini 3 Flash, GPT 4o Audio, and MERaLiON-2-10B; it performed best on 44 out of 59 benchmarks for tasks related to age recognition, gender recognition, SQA, and CPQA. MERaLiON-3-10B maintains competitive performance vis-à-vis MERaLiON-2-10B on the AudioBench benchmarks.

**Age recognition**

Age recognition tasks categorise speakers as teens (10-19), adults (20-59), or seniors (60-100). The prompts are either in English or in the same language as the audio. LLM-as-a-judge is used to evaluate the correctness of each response.
| Dataset | Lang | Var | MERaLiON-3-10B | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Commonvoice | en | eng | 64.30 | 63.10 | 64.20 | **68.00** | 65.00 |
| | | sea | 64.30 | 63.10 | 64.20 | **68.00** | 65.00 |
| | ta | eng | 78.00 | 64.65 | 73.50 | **79.00** | 71.00 |
| | | sea | 58.00 | 47.90 | 48.40 | **78.00** | 62.00 |
| | th | eng | **81.68** | 57.81 | 78.06 | 77.00 | 78.00 |
| | | sea | 76.39 | 42.19 | 64.13 | **84.00** | 53.00 |
| | vi | eng | **91.96** | 73.23 | 84.39 | 81.00 | 86.00 |
| | | sea | **91.48** | 64.35 | 77.67 | 87.00 | 81.00 |
| | zh | eng | 74.30 | 72.40 | **75.60** | 75.00 | 83.00 |
| | | sea | **73.70** | 69.00 | 73.60 | 73.00 | 45.00 |
| **Average** | | | 75.41 | 61.77 | 70.38 | **77.00** | 68.90 |

**Gender recognition**

The gender recognition benchmark consists of speech samples in Indonesian, Tamil, Thai, Vietnamese, Chinese, Malay, English, and Khmer. The text prompts are either in English or in the same language as the audio. LLM-as-a-judge is used to evaluate the correctness of each response.
| Dataset | Lang | Var | MERaLiON-3-10B | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| commonvoice | id | eng | **97.10** | 45.20 | 96.80 | 86.00 | 46.00 |
| | | sea | **97.20** | 57.30 | 96.10 | 90.00 | 53.93 |
| | ta | eng | **97.40** | 53.00 | 96.80 | 65.00 | 33.00 |
| | | sea | **97.10** | 40.40 | 81.90 | 71.00 | 35.00 |
| | th | eng | **97.86** | 50.07 | 96.92 | 87.00 | 50.00 |
| | | sea | **97.99** | 23.96 | 95.18 | 82.00 | 40.00 |
| | vi | eng | **99.22** | 24.05 | 98.82 | 87.00 | 26.00 |
| | | sea | **99.22** | 14.64 | 96.86 | 88.00 | 35.00 |
| | zh | eng | **98.20** | 53.70 | **98.20** | 89.00 | 49.00 |
| | | sea | **98.30** | 35.50 | 98.10 | 82.00 | 21.00 |
| emota | ta | eng | **100.00** | 67.31 | 99.89 | 83.00 | 25.00 |
| | | sea | **100.00** | 48.93 | 97.65 | 86.00 | 33.00 |
| fleurs | en | eng | **100.00** | 58.27 | **100.00** | 73.00 | 78.00 |
| | | sea | **100.00** | 58.27 | **100.00** | 73.00 | 78.00 |
| | km | eng | **100.00** | 56.60 | **100.00** | 94.00 | 62.00 |
| | | sea | 99.48 | 43.40 | **100.00** | 99.00 | 15.00 |
| indowavesentiment | id | eng | **100.00** | 71.67 | **100.00** | 84.00 | 60.00 |
| | | sea | **100.00** | 60.67 | **100.00** | 88.00 | 14.00 |
| m3ed | zh | eng | 93.30 | 84.30 | **94.30** | 73.00 | 23.00 |
| | | sea | 93.70 | 70.70 | **94.40** | 72.00 | 12.00 |
| openslr | ta | eng | **100.00** | 55.30 | 99.00 | 75.00 | 47.00 |
| | | sea | **100.00** | 37.80 | 87.90 | 81.00 | 36.00 |
| sg streets | en | eng | **100.00** | 89.63 | **100.00** | 87.00 | 32.00 |
| | | sea | **100.00** | 89.63 | **100.00** | 87.00 | 32.00 |
| asr-smaldusc | ms | eng | **99.40** | 52.40 | 98.60 | 97.00 | 76.00 |
| | | sea | **99.40** | 44.00 | 98.80 | 99.00 | 24.00 |
| thai elderly speech | th | eng | 99.09 | 68.15 | **99.29** | 77.00 | 46.00 |
| | | sea | **98.99** | 26.92 | 97.39 | 76.00 | 51.00 |
| thai ser | th | eng | **91.41** | 63.46 | 90.47 | 85.00 | 44.00 |
| | | sea | **91.41** | 61.78 | 89.74 | 76.00 | 34.00 |
| vietnam-celeb | vi | eng | **73.90** | 65.80 | 73.80 | 62.00 | 41.00 |
| | | sea | 73.80 | 61.40 | **74.00** | 61.00 | 36.00 |
| **Average** | | | **96.67** | 54.19 | 95.34 | 81.72 | 40.25 |

**Spoken question and answer (SQA)**

The benchmark consists of speech in English, Malay, Tamil, and Chinese, with text prompts in English containing questions related to the speech. As studies have found that LLM judges tend to favor longer, verbose answers even when they are less clear, high-quality, or accurate than shorter alternatives, we adjusted the judge's prompt to address verbosity bias.

| Dataset | MERaLiON-3-10B | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
| :--- | :---: | :---: | :---: | :---: | :---: |
| ytb_sqa_batch1 | **67.65** | 65.89 | 66.66 | 63.25 | 60.43 |
| ytb_sqa_batch3_ms | **58.00** | 50.40 | 56.25 | 57.75 | 55.80 |
| ytb_sqa_batch3_ta | 58.55 | 53.60 | 52.25 | **59.45** | 56.25 |
| ytb_sqa_batch3_zh_en | **61.80** | 57.15 | 59.80 | 58.55 | 57.45 |
| **Average** | **61.50** | 56.76 | 58.74 | 59.75 | 57.48 |

**Contextual paralinguistic question and answer (CPQA)**

The audio includes both speech and non-speech elements; when no speech is present, LLMs are expected to reason solely from acoustic or musical elements. The speech samples are in Chinese, Malay, Tamil, English, or a mix of these languages (codeswitch), and may include dialects such as Hokkien. To test robustness in instruction following, the text prompts were designed to be diverse and were written in any of the following languages: English, Malay, Tamil, Indonesian, Vietnamese, Chinese, or Thai. LLMs are expected to reply in the same language as the text prompt. As with SQA, we adjusted the judge's prompt to address verbosity bias.
| Dataset | MERaLiON-3-10B | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
| :--- | :---: | :---: | :---: | :---: | :---: |
| yx_youtube_zh | **59.40** | 50.18 | 57.27 | 54.67 | 54.79 |
| yx_youtube_codeswitch | **61.80** | 47.36 | 55.56 | 59.40 | 60.32 |
| yx_youtube_dialect | **59.20** | 47.72 | 56.36 | 55.36 | 54.92 |
| yx_youtube_ms | **60.40** | 46.16 | 53.88 | 57.00 | 56.36 |
| yx_youtube_ta | **58.40** | 38.88 | 49.60 | 56.60 | 54.64 |
| yx_youtube_en | **58.64** | 51.60 | 56.76 | 53.52 | 52.88 |
| ytb_short_eval_cpqa_human1 | **54.64** | 47.57 | 53.95 | 47.42 | 49.97 |
| ytb_short_eval_cpqa_llm1 | **59.42** | 56.25 | 56.07 | 54.94 | 52.44 |
| ytb_long_eval_cpqa_llm1 | **60.46** | 57.48 | 57.44 | 54.94 | 56.32 |
| ytb_long_eval_cpqa_human1 | **60.94** | 51.33 | 59.21 | 56.34 | 55.00 |
| Emotional-YTB-MY_zh_30_test_CPQA_v1 | 51.81 | 46.81 | 51.22 | 51.07 | **53.41** |
| Emotional-YTB-MY_ms_30_test_CPQA_v1 | 50.40 | 44.82 | 48.79 | 49.12 | **53.01** |
| Emotional-YTB-MY_ta_test_CPQA_v1 | 49.77 | 41.88 | 48.62 | 52.56 | **54.96** |
| **Average** | **57.33** | 48.31 | 54.21 | 54.07 | 54.54 |

**Automatic Speech Recognition (ASR), instruction following and audio understanding**

MERaLiON-3-10B continues to demonstrate competitive performance in ASR, instruction following, and audio understanding compared to MERaLiON-2-10B, with improvements on most AudioBench metrics. Please visit the [AudioBench benchmark](https://huggingface.co/spaces/MERaLiON/AudioBench-Leaderboard) for dataset-level evaluation results.
| Benchmark | MERaLiON-3-10B | MERaLiON-2-10B | MERaLiON-2-10B-ASR | MERaLiON-2-3B |
| :--- | :---: | :---: | :---: | :---: |
| ASR (lower is better) | **0.125** | 0.1485 | 0.1332 | 0.1697 |
| Speech Instruction | **76.90** | 70.20 | 13.40 | 19.10 |
| Audio Scene Question Answering | **56.98** | 51.14 | 49.51 | 46.14 |
| Spoken QA (Singlish) | **67.25** | 66.55 | 61.85 | 59.70 |
| Audio Captioning | **38.31** | 35.60 | 34.47 | 33.24 |
| Spoken Dialogue Summarisation | **56.45** | 53.10 | 55.80 | 48.55 |
| Spoken QA (English) | **83.42** | 79.74 | 73.98 | 68.72 |
| Music Understanding | **76.07** | 63.94 | 60.66 | 55.60 |
| Accent Recognition | 57.47 | 41.82 | 47.79 | **60.05** |
| Speech Translation | **28.83** | 27.39 | 28.54 | 22.13 |

## How to Use

> [!WARNING]
> **Out-of-scope use**: This model is not intended for use in tool calling, math, and coding tasks.

MERaLiON-3 requires `transformers` version `4.56.2`:

```
pip install transformers==4.56.2
pip install librosa
```

To run on GPU, MERaLiON-3 requires `flash-attn`:

```
pip install flash-attn --no-build-isolation
```

> [!TIP]
> Should you face any difficulties installing the above packages, you can try installing within this Docker container instead:
> `pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel`, whose CUDA and torch environments have been tested to work.

### Audio Input

- For ASR tasks, we suggest a maximum audio length of 30 seconds at 16,000 Hz.
- For general speech and audio understanding tasks, the maximum audio length we tested was up to 300 seconds at a 16,000 Hz sampling rate.

### Text Prompt

MERaLiON-3 is trained with this prompt template:

```
Instruction: