MERaLiON-3-ASR

MERaLiON-3-ASR is the speech-recognition line in the MERaLiON-3 generation of Speech-Text Large Language Models, developed by I2R, A*STAR, Singapore. It is purpose-built for Singapore- and Southeast-Asia-centric ASR, with broad coverage across regional languages, dialects, and natural conversational code-switching.

Coverage

Languages: English (Global and Singapore), Mandarin, Malay, Tamil, Indonesian, Thai, Vietnamese
Chinese dialects: Cantonese, Hokkien
Code-switching: natural conversational English ↔ {Mandarin, Malay, Tamil, Vietnamese, Cantonese, Hokkien} mixtures, including Singlish

MERaLiON-3-ASR-API

MERaLiON-3-ASR-API is the hosted production endpoint, tuned for real-time and enterprise transcription workloads.

Fast streaming ASR, ~600 ms first-token latency for interactive use cases.
Long-form decoding for continuous audio up to multiple hours.
Word- and segment-level timestamps with speaker diarization for offline transcription.
Reinforced regional-dialect recognition for Cantonese and Hokkien.
State-of-the-art Southeast Asian code-switching performance.

On the locked evaluation suite, MERaLiON-3-ASR-API achieves the lowest mean Word Error Rate of all systems tested in four sections: English (Singapore), at 12.52 vs. 25.97 / 27.34 for Gemini 3.5 Flash / GPT-4o; Cantonese, at 10.42; Hokkien, at 36.43 vs. 46.50 for the next-best system; and Code-switching, at 20.87 vs. 22.65 / 31.14. On Healthcare it is essentially on par with Gemini 3.5 Flash (20.47 vs. 20.04, a gap of under 0.5 pp).

MERaLiON-3-3B-ASR

MERaLiON-3-3B-ASR is the open-weights release published in this repository. At roughly one-third the size of its predecessor MERaLiON-2-10B-ASR, it matches or improves on the larger model in every evaluation section, with the largest gains in the sections that matter most for the Southeast Asian setting (mean error rate in %, lower is better):

Section	MERaLiON-2-10B-ASR	MERaLiON-3-3B-ASR	Δ (pp)
Thai	35.07	7.55	−27.5
Hokkien	59.32	46.50	−12.8
Cantonese	18.55	11.27	−7.3
Code-switching	26.24	22.65	−3.6
Tamil	28.29	25.83	−2.5
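
The Δ column is the absolute change in mean error rate, in percentage points, from MERaLiON-2-10B-ASR to MERaLiON-3-3B-ASR. A minimal sketch that recomputes it from the two columns above; the relative-reduction figure it also prints is an extra illustration derived from the same numbers, not a result reported in this README.

```python
# Mean error rate (%) per section, copied from the table above:
# (MERaLiON-2-10B-ASR, MERaLiON-3-3B-ASR)
wer_by_section = {
    "Thai": (35.07, 7.55),
    "Hokkien": (59.32, 46.50),
    "Cantonese": (18.55, 11.27),
    "Code-switching": (26.24, 22.65),
    "Tamil": (28.29, 25.83),
}


def delta_pp(old: float, new: float) -> float:
    """Absolute change in percentage points (negative means improvement)."""
    return round(new - old, 1)


def relative_reduction(old: float, new: float) -> float:
    """Error reduction as a percentage of the old error rate."""
    return round(100.0 * (old - new) / old, 1)


deltas = {name: delta_pp(old, new) for name, (old, new) in wer_by_section.items()}

for name, (old, new) in wer_by_section.items():
    print(f"{name:<15} {deltas[name]:+.1f} pp  "
          f"({relative_reduction(old, new):.1f}% relative reduction)")
```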

The smaller footprint makes the open model practical for self-hosting on a single 80 GB GPU.

Performance

The figure reports mean Word Error Rate (lower is better) across twelve language and domain sections, covering 66 evaluation datasets in total. Datasets in the Healthcare and Code-switching sections also appear in their primary language section, so they contribute to both. The full per-dataset breakdown is at SEA-SpeechBench.
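
For readers less familiar with the metric: WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words; CER is the same quantity over characters. The sketch below is a generic illustration that assumes whitespace tokenisation and no text normalisation. It is not the benchmark's scoring code, which is the set of normalisation and scoring scripts published with SEA-SpeechBench.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_tok in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                         # delete ref_tok
                curr[j - 1] + 1,                     # insert hyp_tok
                prev[j - 1] + (ref_tok != hyp_tok),  # substitute, or match at no cost
            )
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate (%), splitting on whitespace."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate (%), ignoring whitespace."""
    strip = lambda text: [ch for ch in text if not ch.isspace()]
    return error_rate(strip(reference), strip(hypothesis))


print(wer("the cat sat on the mat", "the cat sit on mat"))  # one substitution, one deletion
print(wer("a b", "a b c d e"))                              # three insertions
print(cer("他会教", "他会叫"))                                # one substituted character
```

Because insertions count as errors, the rate is not capped at 100: a hypothesis much longer than its reference scores above 100, which is how some entries in the per-dataset table exceed that value.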

The table below reports per-dataset Word/Character Error Rate (lower is better) for Cantonese and Hokkien — the two regional Chinese dialects where MERaLiON-3 shows its largest relative advantage. The best result on each row is highlighted.

Dataset	MERaLiON-3-ASR-API	MERaLiON-3-3B-ASR	MERaLiON-2-10B-ASR	Qwen3-ASR	gpt-4o	gpt-audio-1.5	gemini-3.5-flash
Cantonese
cv21_cantonese_test	11.65	13.88	18.29	25.08	8.74	21.96	20.34
fleurs_cantonese_test	8.75	8.93	10.53	11.30	6.31	7.55	9.63
mdcc_cantonese_test	4.95	5.11	6.95	8.67	7.56	18.51	11.27
wsyue_long_test	9.37	9.76	18.17	11.66	25.16	29.59	14.66
wsyue_short_test	11.19	12.56	24.55	17.77	18.23	33.99	31.01
ytb_asr_cantonese_short_v3	16.63	17.38	32.83	16.92	35.48	38.66	19.87
Hokkien
sarahwei_minnan_test	18.93	22.58	49.47	66.17	44.93	77.26	46.38
taiwan_tongues_hokkien_test	51.64	58.64	63.90	99.05	66.41	109.41	73.40
ytb_asr_hokkien_happycanalready_s4	38.72	58.30	64.58	128.21	78.89	68.55	63.37
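
The section-level figures quoted earlier can be related to this table with a short script. The sketch assumes that a section score is the unweighted mean of its per-dataset scores; under that assumption it reproduces the 10.42 (Cantonese) and 36.43 (Hokkien) quoted for MERaLiON-3-ASR-API. Means recomputed from these two-decimal values can differ in the last digit from figures computed on unrounded scores. The second helper picks out the lowest score on each row.

```python
from statistics import mean

SYSTEMS = [
    "MERaLiON-3-ASR-API", "MERaLiON-3-3B-ASR", "MERaLiON-2-10B-ASR",
    "Qwen3-ASR", "gpt-4o", "gpt-audio-1.5", "gemini-3.5-flash",
]

# Per-dataset error rates (%) copied from the table above, in SYSTEMS order.
RESULTS = {
    "Cantonese": {
        "cv21_cantonese_test": [11.65, 13.88, 18.29, 25.08, 8.74, 21.96, 20.34],
        "fleurs_cantonese_test": [8.75, 8.93, 10.53, 11.30, 6.31, 7.55, 9.63],
        "mdcc_cantonese_test": [4.95, 5.11, 6.95, 8.67, 7.56, 18.51, 11.27],
        "wsyue_long_test": [9.37, 9.76, 18.17, 11.66, 25.16, 29.59, 14.66],
        "wsyue_short_test": [11.19, 12.56, 24.55, 17.77, 18.23, 33.99, 31.01],
        "ytb_asr_cantonese_short_v3": [16.63, 17.38, 32.83, 16.92, 35.48, 38.66, 19.87],
    },
    "Hokkien": {
        "sarahwei_minnan_test": [18.93, 22.58, 49.47, 66.17, 44.93, 77.26, 46.38],
        "taiwan_tongues_hokkien_test": [51.64, 58.64, 63.90, 99.05, 66.41, 109.41, 73.40],
        "ytb_asr_hokkien_happycanalready_s4": [38.72, 58.30, 64.58, 128.21, 78.89, 68.55, 63.37],
    },
}


def section_means(section: str) -> dict:
    """Unweighted mean over the section's datasets, per system."""
    columns = zip(*RESULTS[section].values())  # transpose rows into per-system columns
    return {name: round(mean(col), 2) for name, col in zip(SYSTEMS, columns)}


def best_per_dataset(section: str) -> dict:
    """Name of the system with the lowest error rate on each dataset."""
    return {
        dataset: SYSTEMS[scores.index(min(scores))]
        for dataset, scores in RESULTS[section].items()
    }


for section in RESULTS:
    print(section, section_means(section))
    print(section, best_per_dataset(section))
```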

Qualitative examples

The four clips below illustrate cases where MERaLiON-3-ASR-API produces the most accurate transcript and a leading general-purpose model produces a substantially less faithful one. MERaLiON-3-3B-ASR is typically within a few characters of the API output, although on the Hokkien clip it drops several phrases.

Cantonese (with code-switching to English)

Reference	嚟我公司，In-house PR Director，有冇兴趣？哇，你问得咁直接，我唔怕直接答啊。
MERaLiON-3-ASR-API	嚟我公司，in-house PR director，有冇兴趣？你问得咁直接，我又唔怕直接答啊。
MERaLiON-3-3B-ASR	嚟我公司 in house p r director 有冇兴趣你问得咁直接我用唔怕直接答啊。
gemini-3.5-flash	你間公司 In-house PR director 係咪呀？你問得咁直接，我好難直接答呀。

Hokkien (with code-switching to English)

Reference	我……我……我新来的 recruit 啦！按怎无戴 helmet？我……我袂使戴！我一戴，人就认袂出是我了！哈哈哈！我是……我是 recruit 梁婆婆！哈哈哈！人夯 full pack，按怎汝无 full pack？有啊！我这个就是 full pack 啦！哈哈哈！
MERaLiON-3-ASR-API	我啊，我是来 recruit 啊！安怎无带 helmet？我未使带，我一个带人欲啉袂出是我了！哈哈哈！我是、我是 recruit，梁某无！哈哈哈！人提 full pack，安怎汝无 full pack？我有啊！我即个就是 full pack 啊！哈哈哈！
MERaLiON-3-3B-ASR	我是哪的 recruit 啊按怎没有 helmet 我是 recruit 咩有 full pack 按怎你没有 full pack 我有啊我今天就是 full pack。
gemini-3.5-flash	Oh my god, 我是那一個 record。安仔，無 feedback 係點呀？Oh my god, 乜嘢 feedback 呀？我係，我係 record 嗰個。人哋有 feedback，你點解無 feedback 呀？Oh my god, 我呢個就係 feedback。

Code-switching (Tamil + English)

Reference	நாம்ம என்ன snacks வீட்ல சைரோ எல்லா snacks கூட குடுக்கலாம்
MERaLiON-3-ASR-API	நம்ம என்ன snacks வீட்ல சேரு எல்லா snacks கூட கொடுக்கலாம்.
MERaLiON-3-3B-ASR	நம்ம என்ன snacks வீட்ல சேர்ற எல்லா snacks கூட குடுக்கலாம்.
gemini-3.5-flash	நம்ம என்ன ஸ்நாக்ஸ் வீட்டுல செய்யற எல்லா ஸ்நாக்ஸ் கூட கொடுக்கலாம்

Code-switching (English + Mandarin)

Reference	there's like two quarters 嘛 then 他会教
MERaLiON-3-ASR-API	there's like two quarters (mah) then 他会教。
MERaLiON-3-3B-ASR	there's like two quarters (mah) then 他会叫。
gemini-3.5-flash	that's like two quarters then how would you

A note on the code-switching benchmark

The code-switching section aggregates 13 datasets of natural Southeast Asian conversational speech, totalling 5,430 samples / 35.4 hours, in which English is mixed organically with another SEA language. Coverage by direction:

Code-switch direction	Samples	Hours
EN ↔ Mandarin (incl. SEAME)	506	5.3
IMDA Part 4 (Singlish, EN ↔ Mandarin)	1,000	7.3
EN ↔ Tamil	2,184	9.0
EN ↔ Cantonese	904	6.2
EN ↔ Hokkien	536	3.7
EN ↔ Malay	200	2.6
EN ↔ Vietnamese	100	1.3
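
The per-direction rows add up to the totals quoted above (5,430 samples, 35.4 hours). A quick check, which also derives each direction's share of the samples; the shares are computed here for illustration and are not reported elsewhere in this README.

```python
# (direction, samples, hours) copied from the table above
directions = [
    ("EN ↔ Mandarin (incl. SEAME)", 506, 5.3),
    ("IMDA Part 4 (Singlish, EN ↔ Mandarin)", 1000, 7.3),
    ("EN ↔ Tamil", 2184, 9.0),
    ("EN ↔ Cantonese", 904, 6.2),
    ("EN ↔ Hokkien", 536, 3.7),
    ("EN ↔ Malay", 200, 2.6),
    ("EN ↔ Vietnamese", 100, 1.3),
]

total_samples = sum(samples for _, samples, _ in directions)
total_hours = round(sum(hours for _, _, hours in directions), 1)

# Share of all samples contributed by each direction, in percent.
sample_share = {
    name: round(100.0 * samples / total_samples, 1)
    for name, samples, _ in directions
}

print(total_samples, total_hours)
for name, share in sample_share.items():
    print(f"{name:<40} {share:>5.1f}% of samples")
```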

The benchmark is intended to capture real-world SEA usage patterns rather than synthetic alternation between scripts. Of the constituent corpora, only IMDA Part 4 (from the National Speech Corpus, NSC) is publicly available; the remaining datasets are proprietary or curated in-house. Per-sample normalisation and scoring scripts for the whole evaluation are available at SEA-SpeechBench.

Model Description

Property	Value
Audio format	Mono, 16,000 Hz
Parameters	3 B
Precision	BF16
Supported languages	English (Global + Singapore), Mandarin, Malay, Tamil, Indonesian, Thai, Vietnamese, Cantonese, Hokkien
Supported backend	vLLM
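
As a rough sizing aid, the parameter count and precision above give the memory needed for the weights alone. This is a back-of-the-envelope sketch that treats "3 B" as exactly 3e9 parameters; the KV cache, activations, and audio buffers come on top and are not estimated here.

```python
BF16_BYTES = 2  # bfloat16 stores each parameter in 16 bits


def weight_memory_gb(num_params: float, bytes_per_param: int = BF16_BYTES) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


weights_3b_gb = weight_memory_gb(3e9)    # MERaLiON-3-3B-ASR, nominal size
weights_10b_gb = weight_memory_gb(10e9)  # a 10 B model at the same precision

print(f"3 B in BF16:  ~{weights_3b_gb:.0f} GB of weights")
print(f"10 B in BF16: ~{weights_10b_gb:.0f} GB of weights")
```

Note that vLLM reserves a configurable fraction of GPU memory up front (its `gpu_memory_utilization` setting), so observed usage on the device is usually well above the weight size.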

The model is an ASR-optimised fine-tune, trained on a multilingual mixture of curated Southeast Asian speech data with particular emphasis on Singapore English, regional dialects (Cantonese, Hokkien), and natural English-X code-switching.

How to Use

The recommended way to run MERaLiON-3-3B-ASR is through the meralion-3-asr Python package, which wraps a vLLM backend tuned for this model. Install from PyPI:

pip install meralion-3-asr

The package pre-wires the transcription prompt, decoding configuration, no-repeat-ngram guard, and 30 s audio chunking — on both the offline path and the served path. Callers only provide audio.
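
A no-repeat-n-gram guard blocks any next token that would make an n-gram occur twice in the output, a common defence against repetition loops in generated transcripts. The package's own implementation and its n-gram size are not documented in this README, so the following is a generic sketch of the idea; `banned_next_tokens` and the choice n = 3 are hypothetical.

```python
from typing import List, Set


def banned_next_tokens(generated: List[int], n: int) -> Set[int]:
    """Tokens that would complete an n-gram already present in `generated`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(generated) < n:
        return set()  # no complete n-gram exists yet, so nothing can repeat

    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):])

    banned = set()
    for start in range(len(generated) - n + 1):
        # If an earlier n-gram starts with the same prefix, ban its final token.
        if tuple(generated[start:start + n - 1]) == prefix:
            banned.add(generated[start + n - 1])
    return banned


# Hypothetical token ids: the trigram (5, 6, 7) already occurred, and the
# sequence now ends in (5, 6), so emitting 7 again would repeat it.
history = [5, 6, 7, 5, 6]
print(banned_next_tokens(history, n=3))
```

In a decoder, the scores of the banned tokens would be set to negative infinity before the next token is chosen.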

1. Offline batch (in-process vLLM)

from meralion_3_asr import Meralion3ASR

model = Meralion3ASR.from_pretrained("MERaLiON/MERaLiON-3-3B-ASR", backend="vllm")

text = model.transcribe("audio.wav")                          # str
texts = model.transcribe_batch(["a.wav", "b.wav", "c.wav"])   # List[str]

Inputs may be local file paths, https:// URLs, base64 data URLs, or (numpy_array, sample_rate) tuples. Audio is automatically resampled to mono 16 kHz; long files are chunked transparently.

2. Serving via the bundled sidecar

meralion-3-asr serve starts a FastAPI sidecar fronting an internal vllm serve process. The sidecar exposes a single OpenAI-compatible endpoint, /v1/audio/transcriptions, applies 30 s non-overlapping chunking server-side, and forwards each chunk to the internal vLLM. Clients send audio only — no prompt, no decoding flags.

meralion-3-asr serve --model MERaLiON/MERaLiON-3-3B-ASR --port 8000

OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("audio.wav", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="MERaLiON/MERaLiON-3-3B-ASR",
        file=f,
    )
print(resp.text)

curl:

curl -s -X POST http://localhost:8000/v1/audio/transcriptions \
    -F "model=MERaLiON/MERaLiON-3-3B-ASR" \
    -F "file=@audio.wav" \
  | jq -r .text

3. Transformers backend

vLLM (above) is the recommended backend. A pure transformers backend is also available — it loads the model in-process with AutoModelForSpeechSeq2Seq, which is convenient for debugging or environments without vLLM:

from meralion_3_asr import Meralion3ASR

model = Meralion3ASR.from_pretrained("MERaLiON/MERaLiON-3-3B-ASR", backend="transformers")

text = model.transcribe("audio.wav")                          # str
texts = model.transcribe_batch(["a.wav", "b.wav", "c.wav"])   # List[str]

The same prompt, decoding configuration, and 30 s chunking are applied on both backends.
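
The 30 s non-overlapping chunking mentioned above amounts to cutting the 16 kHz sample stream into consecutive windows, with a shorter final window when the duration is not a multiple of 30 s. A minimal sketch of the boundary arithmetic; `chunk_bounds` is a hypothetical helper rather than part of the package, and how the per-chunk transcripts are joined is not specified in this README.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # Hz, the input rate listed under Model Description
CHUNK_SECONDS = 30    # chunk length stated in this README


def chunk_bounds(
    num_samples: int,
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: int = CHUNK_SECONDS,
) -> List[Tuple[int, int]]:
    """Return [start, end) sample indices of consecutive, non-overlapping chunks."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    chunk_len = sample_rate * chunk_seconds
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# A 75 s recording yields two full 30 s chunks and one 15 s remainder.
bounds = chunk_bounds(75 * SAMPLE_RATE)
for start, end in bounds:
    print(f"{start / SAMPLE_RATE:5.1f} s to {end / SAMPLE_RATE:5.1f} s")
```

Because the windows do not overlap, a word that straddles a 30 s boundary is split between two chunks; this is a property of fixed-window chunking in general.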

Hardware & Infrastructure

MERaLiON-3 was trained on the ASPIRE 2A+ supercomputer cluster at the National Supercomputing Centre (NSCC), Singapore, hosted by A*STAR I2R:

GPUs: 128 NVIDIA H100 GPUs (16 nodes × 8 H100)
Memory: 2 TB RAM per node
Storage: 30 TB NVMe per node + 2.5 PB SSD-based Lustre filesystem
Interconnect: 400 Gb/s NDR InfiniBand (full fat-tree topology)

License

MERaLiON-3-Public-Licence

Disclaimer

⚠️ MERaLiON-3-3B-ASR has not been specifically aligned for safety and may transcribe content that is inappropriate, offensive, or harmful. Developers and users are responsible for performing their own safety fine-tuning and implementing necessary security measures. The authors shall not be held liable for any claims, damages, or other liabilities arising from the use of the released models, weights, or code.

Citation

If you use MERaLiON-3-3B-ASR in your work, please cite the foundational MERaLiON references:

@misc{he2024meralionaudiollmtechnicalreport,
      title={MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models},
      author={{MERaLiON Team}},
      year={2024},
      eprint={2412.09818},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.09818},
}

@article{wang2024audiobench,
    title={AudioBench: A Universal Benchmark for Audio Large Language Models},
    author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
    journal={NAACL},
    year={2025}
}

@article{wang2025advancing,
    title={Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models},
    author={Wang, Bin and Zou, Xunlong and Sun, Shuo and Zhang, Wenyu and He, Yingxu and Liu, Zhuohan and Wei, Chengwei and Chen, Nancy F and Aw, AiTi},
    journal={arXiv preprint arXiv:2501.01034},
    year={2025}
}

@article{zhang2024mowe,
    title={MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders},
    author={Zhang, Wenyu and Sun, Shuo and Wang, Bin and Zou, Xunlong and Liu, Zhuohan and He, Yingxu and Lin, Geyu and Chen, Nancy F and Aw, Ai Ti},
    journal={ICASSP},
    year={2025}
}