MERaLiON-3-ASR

MERaLiON-3-ASR is the speech-recognition line in the MERaLiON-3 generation of Speech-Text Large Language Models, developed by I2R, A*STAR, Singapore. It is purpose-built for Singapore- and Southeast-Asia-centric ASR, with broad coverage across regional languages, dialects, and natural conversational code-switching.

Coverage

  • Languages: English (Global and Singapore), Mandarin, Malay, Tamil, Indonesian, Thai, Vietnamese
  • Chinese dialects: Cantonese, Hokkien
  • Code-switching: natural conversational English ↔ {Mandarin, Malay, Tamil, Vietnamese, Cantonese, Hokkien} mixtures, including Singlish

MERaLiON-3-ASR-API

MERaLiON-3-ASR-API is the hosted production endpoint, tuned for real-time and enterprise transcription workloads.

  • Fast streaming ASR with roughly 600 ms first-token latency for interactive use cases.
  • Long-form decoding of continuous audio that runs for multiple hours.
  • Word- and segment-level timestamps with speaker diarization for offline transcription.
  • Reinforced regional-dialect recognition for Cantonese and Hokkien.
  • State-of-the-art Southeast Asian code-switching performance.

On the locked evaluation suite, MERaLiON-3-ASR-API achieves the lowest mean Word Error Rate among all systems tested on English (Singapore) (12.52, vs. 25.97 / 27.34 for Gemini 3.5 Flash / GPT-4o), Cantonese (10.42), Hokkien (36.43, vs. 46.50 for the next best), and Code-switching (20.87, vs. 22.65 / 31.14). On Healthcare it narrowly trails Gemini 3.5 Flash (20.47 vs. 20.04, a gap of under 0.5 pp).

MERaLiON-3-3B-ASR

MERaLiON-3-3B-ASR is the open-weights release published in this repository. At roughly one-third the size of its predecessor MERaLiON-2-10B-ASR, it matches or improves over the larger model on every evaluation section, with the largest gains concentrated where they matter most for the Southeast Asian setting:

Section         MERaLiON-2-10B-ASR  MERaLiON-3-3B-ASR  Δ (pp)
Thai            35.07               7.55               −27.5
Hokkien         59.32               46.50              −12.8
Cantonese       18.55               11.27              −7.3
Code-switching  26.24               22.65              −3.6
Tamil           28.29               25.83              −2.5
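
The Δ column is the absolute difference in WER between the two models, in percentage points. A minimal sketch that recomputes it from the values above; the relative change is not reported in this card and is derived here only for illustration, and the helper names are hypothetical:

```python
# WER values copied from the table above: (MERaLiON-2-10B-ASR, MERaLiON-3-3B-ASR).
SECTION_WER = {
    "Thai": (35.07, 7.55),
    "Hokkien": (59.32, 46.50),
    "Cantonese": (18.55, 11.27),
    "Code-switching": (26.24, 22.65),
    "Tamil": (28.29, 25.83),
}


def delta_pp(old: float, new: float) -> float:
    """Absolute change in WER in percentage points (negative means improvement)."""
    return round(new - old, 1)


def relative_change(old: float, new: float) -> float:
    """Change relative to the older model's WER, as a percentage."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    for section, (old, new) in SECTION_WER.items():
        print(f"{section}: {delta_pp(old, new):+.1f} pp, "
              f"{relative_change(old, new):+.1f}% relative")
```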

The smaller footprint makes the open model practical for self-hosting on a single 80 GB GPU.
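
As a rough back-of-the-envelope check on the footprint, the sketch below estimates the memory taken by the weights alone, using the nominal 3 B parameter count and BF16 precision listed under Model Description. It is an approximation: the exact parameter count may differ, and the KV cache, activations, and runtime overhead are not included.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes.

    bytes_per_param defaults to 2 because BF16 stores each value in 16 bits.
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    weights = weight_memory_gb(3e9)  # nominal 3 B parameters
    print(f"weights: ~{weights:.0f} GB on an 80 GB card")
```

Under these assumptions the weights occupy only a small fraction of an 80 GB card, which leaves the rest for the KV cache and batched requests.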

Performance

[Figure: ASR WER comparison across 12 language and domain sections]

The figure reports mean Word Error Rate (lower is better) across twelve language and domain sections, covering 66 evaluation datasets in total. Datasets in the Healthcare and Code-switching sections also appear in their primary language section, so they contribute to both. The full per-dataset breakdown is at SEA-SpeechBench.
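
For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance between the reference and hypothesis transcripts divided by the number of reference words. The sketch below is a generic implementation with plain whitespace tokenisation; it does not reproduce the normalisation used by the SEA-SpeechBench scoring scripts.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_tok != hyp_tok)
            deletion = prev[j] + 1       # reference token missing from hypothesis
            insertion = curr[j - 1] + 1  # extra token in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent, using whitespace tokenisation."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the rate can exceed 100 when a hypothesis is much longer than its reference.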

The table below reports per-dataset Word/Character Error Rate (lower is better) for Cantonese and Hokkien, the two regional Chinese dialects where MERaLiON-3 shows its largest relative advantage. The best result on each row is marked with an asterisk.

Dataset                             MERaLiON-3-ASR-API  MERaLiON-3-3B-ASR  MERaLiON-2-10B-ASR  Qwen3-ASR  gpt-4o  gpt-audio-1.5  gemini-3.5-flash
Cantonese
cv21_cantonese_test                 11.65               13.88              18.29               25.08      8.74*   21.96          20.34
fleurs_cantonese_test               8.75                8.93               10.53               11.30      6.31*   7.55           9.63
mdcc_cantonese_test                 4.95*               5.11               6.95                8.67       7.56    18.51          11.27
wsyue_long_test                     9.37*               9.76               18.17               11.66      25.16   29.59          14.66
wsyue_short_test                    11.19*              12.56              24.55               17.77      18.23   33.99          31.01
ytb_asr_cantonese_short_v3          16.63*              17.38              32.83               16.92      35.48   38.66          19.87
Hokkien
sarahwei_minnan_test                18.93*              22.58              49.47               66.17      44.93   77.26          46.38
taiwan_tongues_hokkien_test         51.64*              58.64              63.90               99.05      66.41   109.41         73.40
ytb_asr_hokkien_happycanalready_s4  38.72*              58.30              64.58               128.21     78.89   68.55          63.37
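
The section-level figures quoted earlier are consistent with an unweighted mean over the per-dataset scores in this table. That aggregation rule is inferred from the numbers rather than stated in the card, so treat the following as a sanity check rather than the official scoring procedure:

```python
from statistics import mean
from typing import Dict, List

# MERaLiON-3-ASR-API per-dataset scores, copied from the table above.
API_SCORES: Dict[str, List[float]] = {
    "Cantonese": [11.65, 8.75, 4.95, 9.37, 11.19, 16.63],
    "Hokkien": [18.93, 51.64, 38.72],
}


def section_mean(scores: List[float]) -> float:
    """Unweighted (macro) mean over datasets, rounded to two decimals."""
    return round(mean(scores), 2)


if __name__ == "__main__":
    for section, scores in API_SCORES.items():
        print(section, section_mean(scores))
```

The two results match the 10.42 (Cantonese) and 36.43 (Hokkien) section means quoted for MERaLiON-3-ASR-API.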

Qualitative examples

The four clips below illustrate cases where MERaLiON-3-ASR-API produces the most accurate transcript, MERaLiON-3-3B-ASR generally stays close to it (often within one or two character-level differences, although it drops several phrases in the Hokkien clip), and a leading general-purpose model produces a substantially less faithful transcript.

Cantonese (with code-switching to English)

Reference: 嚟我公司,In-house PR Director,有冇兴趣? 哇,你问得咁直接,我唔怕直接答啊。
MERaLiON-3-ASR-API: 嚟我公司,in-house PR director,有冇兴趣?你问得咁直接,我又唔怕直接答啊。
MERaLiON-3-3B-ASR: 嚟我公司 in house p r director 有冇兴趣你问得咁直接我用唔怕直接答啊。
gemini-3.5-flash: 你間公司 In-house PR director 係咪呀?你問得咁直接,我好難直接答呀。

Hokkien (with code-switching to English)

Reference: 我……我……我新来的 recruit 啦!按怎无戴 helmet? 我……我袂使戴!我一戴,人就认袂出是我了!哈哈哈!我是……我是 recruit 梁婆婆!哈哈哈!人夯 full pack,按怎汝无 full pack? 有啊!我这个就是 full pack 啦!哈哈哈!
MERaLiON-3-ASR-API: 我啊,我是来 recruit 啊!安怎无带 helmet?我未使带,我一个带人欲啉袂出是我了!哈哈哈!我是、我是 recruit,梁某无!哈哈哈!人提 full pack,安怎汝无 full pack?我有啊!我即个就是 full pack 啊!哈哈哈!
MERaLiON-3-3B-ASR: 我是哪的 recruit 啊按怎没有 helmet 我是 recruit 咩有 full pack 按怎你没有 full pack 我有啊我今天就是 full pack。
gemini-3.5-flash: Oh my god, 我是那一個 record。安仔,無 feedback 係點呀?Oh my god, 乜嘢 feedback 呀?我係,我係 record 嗰個。人哋有 feedback,你點解無 feedback 呀?Oh my god, 我呢個就係 feedback。

Code-switching (Tamil + English)

Reference: நாம்ம என்ன snacks வீட்ல சைரோ எல்லா snacks கூட குடுக்கலாம்
MERaLiON-3-ASR-API: நம்ம என்ன snacks வீட்ல சேரு எல்லா snacks கூட கொடுக்கலாம்.
MERaLiON-3-3B-ASR: நம்ம என்ன snacks வீட்ல சேர்ற எல்லா snacks கூட குடுக்கலாம்.
gemini-3.5-flash: நம்ம என்ன ஸ்நாக்ஸ் வீட்டுல செய்யற எல்லா ஸ்நாக்ஸ் கூட கொடுக்கலாம்

Code-switching (English + Mandarin)

Reference: there's like two quarters 嘛 then 他 会 教
MERaLiON-3-ASR-API: there's like two quarters (mah) then 他会教。
MERaLiON-3-3B-ASR: there's like two quarters (mah) then 他会叫。
gemini-3.5-flash: that's like two quarters then how would you

A note on the code-switching benchmark

The code-switching section aggregates 13 datasets of natural Southeast Asian conversational speech, totalling 5,430 samples / 35.4 hours, in which English is mixed organically with another SEA language. Coverage by direction:

Code-switch direction                  Samples  Hours
EN ↔ Mandarin (incl. SEAME)            506      5.3
IMDA Part 4 (Singlish, EN ↔ Mandarin)  1,000    7.3
EN ↔ Tamil                             2,184    9.0
EN ↔ Cantonese                         904      6.2
EN ↔ Hokkien                           536      3.7
EN ↔ Malay                             200      2.6
EN ↔ Vietnamese                        100      1.3

The benchmark is intended to capture real-world SEA usage patterns rather than synthetic alternation between scripts. Of the constituent corpora, only IMDA Part 4 (NSC) is publicly available; the remaining datasets are proprietary or curated in-house. Per-sample normalisation and scoring scripts for the whole evaluation are available at SEA-SpeechBench.
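
Scoring code-switched text usually requires a tokenisation that handles mixed scripts. One common convention, sometimes called mixed error rate, treats each Han character as its own token and keeps words in other scripts whole. Whether the SEA-SpeechBench scripts do exactly this is not stated here; the hypothetical `mixed_tokens` helper below only illustrates the idea.

```python
import unicodedata
from typing import List


def _is_han(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def mixed_tokens(text: str) -> List[str]:
    """Split text into single Han characters and whole words from other scripts.

    Whitespace and punctuation act as separators; apostrophes are kept so
    that contractions such as "there's" stay in one piece.
    """
    tokens: List[str] = []
    word = ""

    def flush() -> None:
        nonlocal word
        if word:
            tokens.append(word)
            word = ""

    for ch in unicodedata.normalize("NFKC", text).lower():
        if _is_han(ch):
            flush()
            tokens.append(ch)
        elif ch.isspace() or (ch != "'" and unicodedata.category(ch).startswith("P")):
            flush()
        else:
            word += ch
    flush()
    return tokens


if __name__ == "__main__":
    print(mixed_tokens("there's like two quarters 嘛 then 他会教。"))
```

The resulting token lists can be fed to any edit-distance based error rate, so that a Mandarin sentence is scored per character while the English portion is scored per word.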

Model Description

Property             Value
Audio format         Mono, 16,000 Hz
Parameters           3 B
Precision            BF16
Supported languages  English (Global + Singapore), Mandarin, Malay, Tamil, Indonesian, Thai, Vietnamese, Cantonese, Hokkien
Supported backend    vLLM
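
When preparing audio outside the package (for example in a preprocessing pipeline), it can be useful to check whether a WAV file already matches the audio format listed above. A minimal sketch using only the Python standard library; the function name is hypothetical and it handles PCM WAV files only:

```python
import wave

EXPECTED_RATE_HZ = 16_000  # sample rate from the table above
EXPECTED_CHANNELS = 1      # mono


def wav_matches_model_format(path: str) -> bool:
    """Return True if a PCM WAV file is already mono at 16 kHz."""
    with wave.open(path, "rb") as wav:
        return (wav.getnchannels() == EXPECTED_CHANNELS
                and wav.getframerate() == EXPECTED_RATE_HZ)
```

Files that fail the check are not necessarily unusable, since the package described under How to Use resamples its inputs automatically.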

The model is trained as an ASR-optimised fine-tune over a multilingual mixture of curated Southeast Asian speech data, with particular emphasis on Singapore English, regional dialects (Cantonese, Hokkien), and natural English-X code-switching.

How to Use

The recommended way to run MERaLiON-3-3B-ASR is through the meralion-3-asr Python package, which wraps a vLLM backend tuned for this model. Install from PyPI:

pip install meralion-3-asr

The package pre-wires the transcription prompt, decoding configuration, no-repeat-ngram guard, and 30 s audio chunking — on both the offline path and the served path. Callers only provide audio.
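
The no-repeat-ngram guard is configured inside the package, and its exact settings are not documented here. As a general illustration of the idea, the hypothetical helper below checks whether appending a candidate token would recreate an n-gram that already occurs in the output, which is how such guards suppress degenerate repetition loops during decoding. The default `n = 3` is an arbitrary choice for the example.

```python
from typing import Sequence


def would_repeat_ngram(tokens: Sequence[int], candidate: int, n: int = 3) -> bool:
    """True if appending `candidate` would create an n-gram already in `tokens`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(tokens) < n - 1:
        return False  # not enough context to form an n-gram yet
    # The n-gram that would end at the candidate token.
    new_ngram = tuple(tokens[len(tokens) - (n - 1):]) + (candidate,)
    # Every n-gram already present in the generated sequence.
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return new_ngram in seen


if __name__ == "__main__":
    history = [1, 2, 3, 1, 2]
    print(would_repeat_ngram(history, 3))  # (1, 2, 3) already occurred
    print(would_repeat_ngram(history, 4))  # (1, 2, 4) is new
```

A decoder applying this kind of guard would exclude any candidate for which the check returns true before choosing the next token.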

1. Offline batch (in-process vLLM)

from meralion_3_asr import Meralion3ASR

model = Meralion3ASR.from_pretrained("MERaLiON/MERaLiON-3-3B-ASR", backend="vllm")

text = model.transcribe("audio.wav")                          # str
texts = model.transcribe_batch(["a.wav", "b.wav", "c.wav"])   # List[str]

Inputs may be local file paths, https:// URLs, base64 data URLs, or (numpy_array, sample_rate) tuples. Audio is automatically resampled to mono 16 kHz; long files are chunked transparently.
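
Of the input forms listed above, the base64 data URL is the least self-explanatory. The sketch below builds one with the standard library from a short synthetic tone. The `audio/wav` MIME type is an assumption about what the package accepts, and both helper names are hypothetical.

```python
import base64
import io
import math
import struct
import wave


def sine_wav_bytes(seconds: float = 1.0, rate: int = 16_000, freq: float = 440.0) -> bytes:
    """Synthesise a mono 16-bit PCM WAV in memory (a stand-in for real audio)."""
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(int(seconds * rate))
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(frames)
    return buf.getvalue()


def to_data_url(audio_bytes: bytes, mime: str = "audio/wav") -> str:
    """Wrap raw file bytes in a base64 data URL."""
    payload = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime};base64,{payload}"


if __name__ == "__main__":
    url = to_data_url(sine_wav_bytes(seconds=0.5))
    print(url[:40] + "...")
```

The resulting string is the kind of value that could be passed to `model.transcribe(...)` in place of a file path.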

2. Serving via the bundled sidecar

meralion-3-asr serve starts a FastAPI sidecar fronting an internal vllm serve process. The sidecar exposes a single OpenAI-compatible endpoint, /v1/audio/transcriptions, applies 30 s non-overlapping chunking server-side, and forwards each chunk to the internal vLLM. Clients send audio only — no prompt, no decoding flags.

meralion-3-asr serve --model MERaLiON/MERaLiON-3-3B-ASR --port 8000
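
To make the 30 s non-overlapping chunking concrete, the sketch below computes the sample ranges such a scheme would produce for 16 kHz audio. It illustrates the splitting rule only; how the sidecar actually implements it, and how it joins the per-chunk transcripts, is not described here.

```python
from typing import Iterator, Tuple

SAMPLE_RATE_HZ = 16_000
CHUNK_SECONDS = 30


def chunk_bounds(num_samples: int,
                 sample_rate: int = SAMPLE_RATE_HZ,
                 chunk_seconds: int = CHUNK_SECONDS) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices of consecutive non-overlapping chunks."""
    chunk_len = sample_rate * chunk_seconds
    for start in range(0, num_samples, chunk_len):
        # The final chunk is shorter when the audio is not a multiple of 30 s.
        yield start, min(start + chunk_len, num_samples)


if __name__ == "__main__":
    # A 75-second clip yields two full chunks and one 15-second remainder.
    for start, end in chunk_bounds(75 * SAMPLE_RATE_HZ):
        print(start, end)
```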

OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("audio.wav", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="MERaLiON/MERaLiON-3-3B-ASR",
        file=f,
    )
print(resp.text)

curl:

curl -s -X POST http://localhost:8000/v1/audio/transcriptions \
    -F "model=MERaLiON/MERaLiON-3-3B-ASR" \
    -F "file=@audio.wav" \
  | jq -r .text
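
Python standard library only:

Where installing the OpenAI SDK is not an option, the same request can be sent with `urllib`. This is a sketch rather than a supported client: it hand-encodes the same two form fields as the curl call (`model` and `file`) and reads the `text` field of the JSON response. The helper names and the `application/octet-stream` part type are assumptions.

```python
import json
import urllib.request
import uuid
from typing import Tuple


def build_multipart(model: str, filename: str, audio: bytes) -> Tuple[bytes, str]:
    """Encode the `model` and `file` fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + audio + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str,
               base_url: str = "http://localhost:8000/v1",
               model: str = "MERaLiON/MERaLiON-3-3B-ASR") -> str:
    """POST an audio file to the sidecar and return the transcript text."""
    with open(path, "rb") as f:
        body, content_type = build_multipart(model, "audio.wav", f.read())
    request = urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]


if __name__ == "__main__":
    print(transcribe("audio.wav"))
```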

A transformers backend is also available for debugging and small-scale use, but is currently considered experimental — the vLLM backend is the only fully supported and benchmarked path.

Hardware & Infrastructure

MERaLiON-3 was trained on the ASPIRE 2A+ supercomputer cluster at the National Supercomputing Centre (NSCC), Singapore, hosted by A*STAR I2R:

  • GPUs: 128 NVIDIA H100 GPUs (16 nodes × 8 H100)
  • Memory: 2 TB RAM per node
  • Storage: 30 TB NVMe per node + 2.5 PB SSD-based Lustre filesystem
  • Interconnect: 400 Gb/s NDR InfiniBand (full fat-tree topology)

License

License: TBD.

Disclaimer

⚠️ MERaLiON-3-3B-ASR has not been specifically aligned for safety and may transcribe content that is inappropriate, offensive, or harmful. Developers and users are responsible for performing their own safety fine-tuning and implementing necessary security measures. The authors shall not be held liable for any claims, damages, or other liabilities arising from the use of the released models, weights, or code.

Citation

If you use MERaLiON-3-3B-ASR in your work, please cite the foundational MERaLiON references:

@misc{he2024meralionaudiollmtechnicalreport,
      title={MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models},
      author={{MERaLiON Team}},
      year={2024},
      eprint={2412.09818},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.09818},
}

@article{wang2024audiobench,
    title={AudioBench: A Universal Benchmark for Audio Large Language Models},
    author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
    journal={NAACL},
    year={2025}
}

@article{wang2025advancing,
    title={Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models},
    author={Wang, Bin and Zou, Xunlong and Sun, Shuo and Zhang, Wenyu and He, Yingxu and Liu, Zhuohan and Wei, Chengwei and Chen, Nancy F and Aw, AiTi},
    journal={arXiv preprint arXiv:2501.01034},
    year={2025}
}

@article{zhang2024mowe,
    title={MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders},
    author={Zhang, Wenyu and Sun, Shuo and Wang, Bin and Zou, Xunlong and Liu, Zhuohan and He, Yingxu and Lin, Geyu and Chen, Nancy F and Aw, Ai Ti},
    journal={ICASSP},
    year={2025}
}