🔥 MERaLiON-3 🔥
💻 Web Demo | ⚙️ vLLM coming soon
Introduction
We are pleased to announce the release of our flagship speech-text large language model, MERaLiON-3-10B-preview. It demonstrates competitive performance against the latest AudioLLMs, including Gemini 3 Flash and Qwen3 Omni Instruct, on benchmark evaluations in Age Recognition, Gender Recognition, Spoken Question Answering (SQA), and Contextual Paralinguistic Question Answering (CPQA) in the Southeast Asian context. The benchmark contains speech and prompts in Malay, Indonesian, English, Chinese, Tamil, Thai, and Vietnamese to better represent the region. The following table presents task-specific evaluation scores, assessed using the LLM-as-a-Judge framework across multiple datasets; higher scores indicate better performance. We will open-source the benchmark separately as part of a paper. See the Performance section below for detailed benchmarking.
| Benchmark | MERaLiON-3-10B-preview | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
|---|---|---|---|---|---|
| Age (commonvoice-en, ta, th, vi, zh) | 76.84 | 61.77 | 70.38 | 77.00 | 68.90 |
| Gender (Multi-dataset) | 92.70 | 54.19 | 95.34 | 81.72 | 40.25 |
| Spoken Q&A (SQA) | 59.61 | 56.76 | 58.74 | 59.75 | 57.48 |
| Contextual paralinguistic Q&A (CPQA) | 57.02 | 48.31 | 54.21 | 54.07 | 54.54 |
MERaLiON-3-10B-preview also maintains its competitive performance in other tasks such as Multilingual Automatic Speech Recognition (ASR), Speech Translation (ST), Audio Scene Understanding and general speech comprehension vis-à-vis MERaLiON-2-10B.
Model Description:
MERaLiON stands for Multimodal Empathetic Reasoning and Learning in One Network, with models tailored for Singapore’s multilingual and multicultural landscape, as well as the wider Southeast Asian region.
MERaLiON-3-10B-preview is finetuned on 150,000 hours of speech and audio data across 6 diverse tasks: Automatic Speech Recognition (ASR), SQA, Spoken Dialogue Summarization (SDS), Audio Captioning (AC), Audio-Scene Question Answering (ASQA) and CPQA.
- Developed by: I2R, A*STAR, Singapore
- Model type: Multimodal LLM
- Language(s): Primarily English (Global and Singapore) and Chinese, with support for audio in regional languages including Malay, Tamil, Indonesian, Thai, and Vietnamese.
- Audio: Mono-channel audio, 16,000 Hz sampling rate, up to 300 seconds.
- License: MERaLiON Public License
- Demo: MERaLiON-AudioLLM Web Demo
Performance:
We benchmarked MERaLiON-3-10B-preview against Qwen3 Omni, Gemini 3 Flash, GPT 4o Audio, and MERaLiON-2-10B; it performed the best on 31 out of 59 benchmarks for tasks related to age recognition, gender recognition, SQA, and CPQA. MERaLiON-3-10B-preview also maintains competitive performance vis-à-vis MERaLiON-2-10B on the AudioBench benchmarks.
Age recognition
Age recognition tasks categorise speakers as teens (10-19), adults (20-59), or seniors (60-100). The prompts are either in English or in a Southeast Asian language; in the tables below, the Var column indicates the prompt language (eng: English, sea: Southeast Asian language). LLM-as-a-judge is used to evaluate the correctness of each response.
| Dataset | Lang | Var | MERaLiON-3-10B-preview | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
|---|---|---|---|---|---|---|---|
| Commonvoice | en | eng | 64.86 | 63.10 | 64.20 | 68.00 | 65.00 |
| | | sea | 64.86 | 63.10 | 64.20 | 68.00 | 65.00 |
| | ta | eng | 79.00 | 64.65 | 73.50 | 79.00 | 71.00 |
| | | sea | 59.90 | 47.90 | 48.40 | 78.00 | 62.00 |
| | th | eng | 83.72 | 57.81 | 78.06 | 77.00 | 78.00 |
| | | sea | 81.16 | 42.19 | 64.13 | 84.00 | 53.00 |
| | vi | eng | 92.32 | 73.23 | 84.39 | 81.00 | 86.00 |
| | | sea | 90.40 | 64.35 | 77.67 | 87.00 | 81.00 |
| | zh | eng | 77.45 | 72.40 | 75.60 | 75.00 | 83.00 |
| | | sea | 74.70 | 69.00 | 73.60 | 73.00 | 45.00 |
| Average | | | 76.84 | 61.77 | 70.38 | 77.00 | 68.90 |
Gender recognition
The gender recognition benchmark consists of speech samples in Indonesian, Tamil, Thai, Vietnamese, Chinese, Malay, English, and Khmer. The text prompts are either in English or in a Southeast Asian language. LLM-as-a-judge is used to evaluate the correctness of each response.
| Dataset | Lang | Var | MERaLiON-3-10B-preview | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
|---|---|---|---|---|---|---|---|
| commonvoice | id | eng | 97.10 | 45.20 | 96.80 | 86.00 | 46.00 |
| | | sea | 96.90 | 57.30 | 96.10 | 90.00 | 53.93 |
| | ta | eng | 97.10 | 53.00 | 96.80 | 65.00 | 33.00 |
| | | sea | 51.00 | 40.40 | 81.90 | 71.00 | 35.00 |
| | th | eng | 97.72 | 50.07 | 96.92 | 87.00 | 50.00 |
| | | sea | 96.92 | 23.96 | 95.18 | 82.00 | 40.00 |
| | vi | eng | 98.69 | 24.05 | 98.82 | 87.00 | 26.00 |
| | | sea | 98.56 | 14.64 | 96.86 | 88.00 | 35.00 |
| | zh | eng | 98.10 | 53.70 | 98.20 | 89.00 | 49.00 |
| | | sea | 97.80 | 35.50 | 98.10 | 82.00 | 21.00 |
| emota | ta | eng | 100.00 | 67.31 | 99.89 | 83.00 | 25.00 |
| | | sea | 63.68 | 48.93 | 97.65 | 86.00 | 33.00 |
| fleurs | en | eng | 99.69 | 58.27 | 100.00 | 73.00 | 78.00 |
| | | sea | 99.69 | 58.27 | 100.00 | 73.00 | 78.00 |
| | km | eng | 100.00 | 56.60 | 100.00 | 94.00 | 62.00 |
| | | sea | 97.39 | 43.40 | 100.00 | 99.00 | 15.00 |
| indowavesentiment | id | eng | 100.00 | 71.67 | 100.00 | 84.00 | 60.00 |
| | | sea | 100.00 | 60.67 | 100.00 | 88.00 | 14.00 |
| m3ed | zh | eng | 92.90 | 84.30 | 94.30 | 73.00 | 23.00 |
| | | sea | 91.80 | 70.70 | 94.40 | 72.00 | 12.00 |
| openslr | ta | eng | 100.00 | 55.30 | 99.00 | 75.00 | 47.00 |
| | | sea | 67.50 | 37.80 | 87.90 | 81.00 | 36.00 |
| sg streets | en | eng | 99.59 | 89.63 | 100.00 | 87.00 | 32.00 |
| | | sea | 99.59 | 89.63 | 100.00 | 87.00 | 32.00 |
| asr-smaldusc | ms | eng | 99.30 | 52.40 | 98.60 | 97.00 | 76.00 |
| | | sea | 99.60 | 44.00 | 98.80 | 99.00 | 24.00 |
| thai elderly speech | th | eng | 99.40 | 68.15 | 99.29 | 77.00 | 46.00 |
| | | sea | 99.29 | 26.92 | 97.39 | 76.00 | 51.00 |
| thai ser | th | eng | 91.20 | 63.46 | 90.47 | 85.00 | 44.00 |
| | | sea | 88.27 | 61.78 | 89.74 | 76.00 | 34.00 |
| vietnam-celeb | vi | eng | 73.70 | 65.80 | 73.80 | 62.00 | 41.00 |
| | | sea | 73.80 | 61.40 | 74.00 | 61.00 | 36.00 |
| Average | | | 92.70 | 54.19 | 95.34 | 81.72 | 40.25 |
Spoken question answering (SQA)
The benchmark consists of speech in English, Malay, Tamil, and Chinese, with text prompts in English containing questions related to the speech. Studies have found that LLM judges tend to favor longer, more verbose answers even when they are less clear, high-quality, or accurate than shorter alternatives, so we adjusted the judge's prompt to mitigate this verbosity bias.
| Dataset | MERaLiON-3-10B-preview | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
|---|---|---|---|---|---|
| ytb_sqa_batch1 | 65.60 | 65.89 | 66.66 | 63.25 | 60.43 |
| ytb_sqa_batch3_ms | 54.35 | 50.40 | 56.25 | 57.75 | 55.80 |
| ytb_sqa_batch3_ta | 57.34 | 53.60 | 52.25 | 59.45 | 56.25 |
| ytb_sqa_batch3_zh_en | 61.15 | 57.15 | 59.80 | 58.55 | 57.45 |
| Average | 59.61 | 56.76 | 58.74 | 59.75 | 57.48 |
Contextual paralinguistic question answering (CPQA)
The audio includes both speech and non-speech elements; when no speech is present, LLMs are expected to reason solely from acoustic or musical cues. The speech samples are in Chinese, Malay, Tamil, English, code-switched mixes of these languages, or dialects such as Hokkien. To test robustness in instruction following, the text prompts were designed to be diverse and were written in English, Malay, Tamil, Indonesian, Vietnamese, Chinese, or Thai. LLMs are expected to reply in the same language as the text prompt. As with SQA, we adjusted the judge's prompt to mitigate verbosity bias.
| Dataset | MERaLiON-3-10B-preview | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
|---|---|---|---|---|---|
| yx_youtube_zh | 58.88 | 50.18 | 57.27 | 54.67 | 54.79 |
| yx_youtube_codeswitch | 63.04 | 47.36 | 55.56 | 59.40 | 60.32 |
| yx_youtube_dialect | 61.12 | 47.72 | 56.36 | 55.36 | 54.92 |
| yx_youtube_ms | 62.00 | 46.16 | 53.88 | 57.00 | 56.36 |
| yx_youtube_ta | 58.12 | 38.88 | 49.60 | 56.60 | 54.64 |
| yx_youtube_en | 58.64 | 51.60 | 56.76 | 53.52 | 52.88 |
| ytb_short_eval_cpqa_human1 | 51.63 | 47.57 | 53.95 | 47.42 | 49.97 |
| ytb_short_eval_cpqa_llm1 | 57.18 | 56.25 | 56.07 | 54.94 | 52.44 |
| ytb_long_eval_cpqa_llm1 | 59.05 | 57.48 | 57.44 | 54.94 | 56.32 |
| ytb_long_eval_cpqa_human1 | 59.22 | 51.33 | 59.21 | 56.34 | 55.00 |
| Emotional-YTB-MY_zh_30_test_CPQA_v1 | 51.24 | 46.81 | 51.22 | 51.07 | 53.41 |
| Emotional-YTB-MY_ms_30_test_CPQA_v1 | 50.63 | 44.82 | 48.79 | 49.12 | 53.01 |
| Emotional-YTB-MY_ta_test_CPQA_v1 | 50.52 | 41.88 | 48.62 | 52.56 | 54.96 |
| Average | 57.02 | 48.31 | 54.21 | 54.07 | 54.54 |
Automatic Speech Recognition (ASR), instruction following and audio understanding
MERaLiON-3-10B-preview continues to demonstrate competitive performance in ASR, instruction following, and audio understanding as compared to MERaLiON-2-10B, with improvements on many AudioBench metrics. Please visit the AudioBench benchmark for dataset-level evaluation results.
| Benchmark | MERaLiON-3-10B-preview | MERaLiON-2-10B | MERaLiON-2-10B-ASR | MERaLiON-2-3B |
|---|---|---|---|---|
| ASR (lower is better) | 0.1325 | 0.1485 | 0.1332 | 0.1697 |
| Speech Instruction | 75.60 | 70.20 | 13.40 | 19.10 |
| Audio Scene Question Answering | 58.36 | 51.14 | 49.51 | 46.14 |
| Spoken QA (Singlish) | 66.38 | 66.55 | 61.85 | 59.70 |
| Audio Captioning | 36.86 | 35.60 | 34.47 | 33.24 |
| Spoken Dialogue Summarisation | 53.75 | 53.10 | 55.80 | 48.55 |
| Spoken QA (English) | 82.04 | 79.74 | 73.98 | 68.72 |
| Music Understanding | 70.43 | 63.94 | 60.66 | 55.60 |
| Accent Recognition | 41.39 | 41.82 | 47.79 | 60.05 |
| Speech Translation | 27.76 | 27.39 | 28.54 | 22.13 |
How to Use
Out-of-scope use: This model is not intended for tool calling, math, or coding tasks.
MERaLiON-3 requires transformers version 4.50.1:
pip install transformers==4.50.1
pip install librosa
To run on a GPU, MERaLiON-3 requires flash-attn:
pip install flash-attn --no-build-isolation
Should you face any difficulties installing the above packages, you can instead try installing them within the Docker container pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel, whose CUDA and PyTorch environments have been tested and work.
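Before loading the model, it can help to confirm that the environment matches the versions above. Below is a minimal sanity-check sketch (treat the exact output as illustrative):
# Minimal environment sanity check for the packages installed above.
import torch
import transformers

print("transformers:", transformers.__version__)   # expected: 4.50.1
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # only needed for GPU inference with flash_attention_2
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; GPU inference with flash_attention_2 will not be available.")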
Audio Input
- For ASR tasks, the maximum audio length is suggested to be 30 seconds at 16,000 Hz.
- For general speech and audio understanding tasks, the maximum audio length we tested was 300 seconds at a 16,000 Hz sampling rate; see the preprocessing sketch below.
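The sketch below shows one way to load audio with librosa at 16,000 Hz and truncate it to the suggested limits before passing it to the processor (the file path and the load_audio helper are placeholders, not part of the MERaLiON API):
import librosa

MAX_ASR_SECONDS = 30       # suggested limit for ASR tasks
MAX_GENERAL_SECONDS = 300  # tested limit for general speech & audio understanding

def load_audio(path, max_seconds=MAX_GENERAL_SECONDS):
    # librosa resamples to 16,000 Hz mono, matching the model's expected input.
    audio, sr = librosa.load(path, sr=16000, mono=True)
    max_samples = int(max_seconds * sr)
    if len(audio) > max_samples:
        # Truncate anything beyond the tested length; alternatively, split the audio
        # into chunks and run multiple inference calls.
        audio = audio[:max_samples]
    return audio

# Example: prepare a clip for ASR.
# asr_audio = load_audio("/path/to/your/audio/file", max_seconds=MAX_ASR_SECONDS)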
Text Prompt
MERaLiON-3 is trained with this prompt template:
Instruction: <TextHere> \nFollow the text instruction based on the following audio: <SpeechHere>
It is generally recommended to follow this template, i.e., replace <TextHere> with your text instruction while leaving <SpeechHere> untouched. We list a few useful example prompts here:
Standard prompts for better accuracy
prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: <SpeechHere>"
transcription_prompt = prompt_template.format(query="Please transcribe this speech.")
translation_prompt = prompt_template.format(query="Please translate the speech into Malay")
summarization_prompt = prompt_template.format(query="Please summarize this speech")
audio_captioning_prompt_1 = prompt_template.format(query="Please describe the audio")
audio_captioning_prompt_2 = prompt_template.format(query="Please create a caption for the audio")
audio_scene_understanding_prompt = prompt_template.format(query="Are there people crying in the audio?")
speech_as_instruction_prompt = prompt_template.format(query="Please respond to the audio") # use when a speech instruction is provided in the audio clip.
emotion_recognition_prompt_1 = prompt_template.format(query="What is the emotion of the speaker")
emotion_recognition_prompt_2 = prompt_template.format(query="Describe the paralinguistic features of the audio")
gender_recognition_prompt = prompt_template.format(query="What is the gender of the speaker")
More flexible prompts for enriched responses
prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: <SpeechHere>"
prompt_1 = prompt_template.format(query="describe the paralinguistic features and return in JSON format.")
prompt_2 = prompt_template.format(query="Please summarize the content of the speech and analyse the paralinguistics features of this audio. Return in json format.")
prompt_3 = prompt_template.format(query="Please translate this speech to Singapore's 4 official languages.")
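The JSON-style prompts above (prompt_1 and prompt_2) ask the model to return JSON, but the output is not guaranteed to be strictly valid. A hedged way to parse the decoded text (parse_json_response is a hypothetical helper; response is produced as in the inference examples below):
import json

def parse_json_response(text):
    # Attempt strict parsing first; fall back to keeping the raw text.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {"raw_response": text}

# Example, assuming `response` was obtained as in the inference sections below:
# result = parse_json_response(response[0])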
AI agent prompts (beyond the default prompt template)
prompt_1 = \
"""
You are MERaLiON-AudioLLM, an empathic AI assistant developed by A*STAR. MERaLiON stands for Multimodal Empathetic Reasoning and Learning in One Network.
You are a friendly and empathetic conversational partner, and are proficient in understanding human emotions, accents, and genders from paralinguistic features.
Maintain a tone that is warm, non-judgmental, and supportive while replying to the user.
User's voice: <SpeechHere>
"""
Huggingface Inference with CPU
import librosa
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
repo_id = "MERaLiON/MERaLiON-3-10B-preview"
processor = AutoProcessor.from_pretrained(
repo_id,
trust_remote_code=True,
)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
repo_id,
use_safetensors=True,
trust_remote_code=True,
)
prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: <SpeechHere>"
transcribe_prompt = "Please transcribe this speech."
translate_prompt = "Can you please translate this speech into written Chinese?"
# batch inference of 2 samples
conversation = [
[{"role": "user", "content": prompt_template.format(query=transcribe_prompt)}],
[{"role": "user", "content": prompt_template.format(query=translate_prompt)}],
]
chat_prompt = processor.tokenizer.apply_chat_template(
conversation=conversation,
tokenize=False,
add_generation_prompt=True
)
# Use audio at 16000hz.
audio_array, sample_rate = librosa.load("/path/to/your/audio/file", sr=16000)
audio_array = [audio_array]*2
inputs = processor(text=chat_prompt, audios=audio_array)
# adjust the `max_new_tokens` based on your use case.
# Please note the inclusion of `no_repeat_ngram_size=6`.
outputs = model.generate(**inputs, max_new_tokens=256, no_repeat_ngram_size=6)
response = processor.batch_decode(outputs, skip_special_tokens=True)
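processor.batch_decode returns one decoded string per conversation in the batch; a quick way to inspect the results:
# Print each decoded output alongside the instruction that produced it.
for query, answer in zip([transcribe_prompt, translate_prompt], response):
    print(f"Instruction: {query}\nResponse: {answer}\n")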
Huggingface GPU Inference
import torch
import librosa
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
repo_id = "MERaLiON/MERaLiON-3-10B-preview"
device = "cuda"
processor = AutoProcessor.from_pretrained(
repo_id,
trust_remote_code=True,
)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
repo_id,
use_safetensors=True,
trust_remote_code=True,
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16
).to(device)
prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: <SpeechHere>"
transcribe_prompt = "Please transcribe this speech."
translate_prompt = "Can you please translate this speech into written Chinese?"
# batch inference of 2 samples
conversation = [
[{"role": "user", "content": prompt_template.format(query=transcribe_prompt)}],
[{"role": "user", "content": prompt_template.format(query=translate_prompt)}],
]
chat_prompt = processor.tokenizer.apply_chat_template(
conversation=conversation,
tokenize=False,
add_generation_prompt=True
)
# Use audio at 16,000 Hz.
audio_array, sample_rate = librosa.load("/path/to/your/audio/file", sr=16000)
audio_array = [audio_array]*2
inputs = processor(text=chat_prompt, audios=audio_array)
inputs = inputs.to(device, dtype=torch.bfloat16)
# adjust the `max_new_tokens` based on your use case.
# Please note the inclusion of `no_repeat_ngram_size=6`.
outputs = model.generate(**inputs, max_new_tokens=256, no_repeat_ngram_size=6)
response = processor.batch_decode(outputs, skip_special_tokens=True)
⚠️ Disclaimer
The current MERaLiON-3 has not been specifically aligned for safety and may generate content that is inappropriate, offensive, or harmful. Developers and users are responsible for performing their own safety fine-tuning and implementing necessary security measures. The authors shall not be held liable for any claims, damages, or other liabilities arising from the use of the released models, weights, or code.
Compute and Infrastructure
MERaLiON-3 was trained on the ASPIRE 2A+ Supercomputer Cluster, provided by the National Supercomputing Centre (NSCC), Singapore. The ASPIRE 2A+ cluster provides multiple H100 nodes, each equipped with 8 Nvidia H100 GPUs, 2 TB of RAM, and 30 TB of locally attached NVMe storage. These nodes are interconnected via a rail-optimised, full fat-tree topology utilising 400 Gb/s NDR InfiniBand cables. Additionally, the cluster incorporates a 2.5 PB SSD-based Lustre file system, linked to the H100 nodes through high-speed InfiniBand connections.
With a global batch size of 768, we trained the current release of MERaLiON-3 for around 250k steps, which took around 2.5 days using 16 nodes (128 H100 GPUs).
📚 Citation
If you find our work useful, please cite our papers:
MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models
AudioBench: A Universal Benchmark for Audio Large Language Models
Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models
Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken QA Generation
Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data
MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages
Incorporating contextual paralinguistic understanding in large speech-language models
MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish
@misc{he2024meralionaudiollmtechnicalreport,
title={MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models},
author={{MERaLiON Team}},
year={2024},
eprint={2412.09818},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.09818},
}
@article{wang2024audiobench,
title={AudioBench: A Universal Benchmark for Audio Large Language Models},
author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
journal={NAACL},
year={2025}
}
@inproceedings{wang2025benchmarking,
title={Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data},
author={Wang, Qiongqiong and Sailor, Hardik Bhupendra and Liu, Tianchi and Zhang, Wenyu and Huzaifah, Muhammad and Lertcheva, Nattadaporn and Sun, Shuo and Chen, Nancy F and Wu, Jinyang and Aw, AiTi},
booktitle={Findings of EMNLP 2025},
year={2025}
}
@inproceedings{cpqa_interspeech,
title={Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken {QA} Generation},
author={Wang, Qiongqiong and Sailor, Hardik B and Liu, Tianchi and Aw, Ai Ti},
booktitle={Proc. Interspeech},
year={2025},
}
@inproceedings{cpqa_asru,
title={Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models},
author={Wang, Qiongqiong and Sailor, Hardik B and Wong, Jeremy H. M. and Liu, Tianchi and Sun, Shuo and Zhang, Wenyu and Huzaifah, Muhammad and Chen, Nancy and Aw, Ai Ti},
booktitle={Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
year={2025},
}
@article{wang2025advancing,
title={Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models},
author={Wang, Bin and Zou, Xunlong and Sun, Shuo and Zhang, Wenyu and He, Yingxu and Liu, Zhuohan and Wei, Chengwei and Chen, Nancy F and Aw, AiTi},
journal={arXiv preprint arXiv:2501.01034},
year={2025}
}
@article{zhang2024mowe,
title={MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders},
author={Zhang, Wenyu and Sun, Shuo and Wang, Bin and Zou, Xunlong and Liu, Zhuohan and He, Yingxu and Lin, Geyu and Chen, Nancy F and Aw, Ai Ti},
journal={ICASSP},
year={2025}
}
@misc{huang2025meraliontextllmcrosslingualunderstandinglarge,
title={MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish},
author={Xin Huang and Tarun Kumar Vangani and Minh Duc Pham and Xunlong Zou and Bin Wang and Zhengyuan Liu and Ai Ti Aw},
year={2025},
eprint={2501.08335},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.08335},
}