MERaLiON-3-ASR
MERaLiON-3-ASR is the speech-recognition line in the MERaLiON-3 generation of Speech-Text Large Language Models, developed by I2R, A*STAR, Singapore. It is purpose-built for Singapore- and Southeast-Asia-centric ASR, with broad coverage across regional languages, dialects, and natural conversational code-switching.
Coverage
- Languages: English (Global and Singapore), Mandarin, Malay, Tamil, Indonesian, Thai, Vietnamese
- Chinese dialects: Cantonese, Hokkien
- Code-switching: natural conversational English ↔ {Mandarin, Malay, Tamil, Vietnamese, Cantonese, Hokkien} mixtures, including Singlish
MERaLiON-3-ASR-API
MERaLiON-3-ASR-API is the hosted production endpoint, tuned for real-time and enterprise transcription workloads.
- Fast streaming ASR with ~600 ms first-token latency for interactive use cases.
- Long-form decoding for continuous audio of up to several hours.
- Word- and segment-level timestamps with speaker diarization for offline transcription.
- Reinforced regional-dialect recognition for Cantonese and Hokkien.
- State-of-the-art Southeast Asian code-switching performance.
On the locked evaluation suite, MERaLiON-3-ASR-API achieves the lowest mean Word Error Rate among all systems tested on English (Singapore) (12.52, vs. 25.97 / 27.34 for Gemini 3.5 Flash / GPT-4o), Cantonese (10.42, best of all systems), Hokkien (36.43, vs. 46.50 next best), and Code-switching (20.87, vs. 22.65 / 31.14). On Healthcare it is statistically tied with Gemini 3.5 Flash (20.47 vs. 20.04, within 0.5 pp).
MERaLiON-3-3B-ASR
MERaLiON-3-3B-ASR is the open-weights release published in this repository. At roughly one-third the size of its predecessor MERaLiON-2-10B-ASR, it matches or improves over the larger model on every evaluation section, with the largest gains concentrated where they matter most for the Southeast Asian setting:
| Section | MERaLiON-2-10B-ASR | MERaLiON-3-3B-ASR | Δ (pp) |
|---|---|---|---|
| Thai | 35.07 | 7.55 | −27.5 |
| Hokkien | 59.32 | 46.50 | −12.8 |
| Cantonese | 18.55 | 11.27 | −7.3 |
| Code-switching | 26.24 | 22.65 | −3.6 |
| Tamil | 28.29 | 25.83 | −2.5 |
The smaller footprint makes the open model practical for self-hosting on a single 80 GB GPU.
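As a quick check on the comparison table above, the Δ column can be reproduced from the two WER columns. The snippet below is only an illustrative sketch: the values are copied from the table, and the helper names `delta_pp` and `relative_reduction` are hypothetical, not part of any released tooling.

```python
# WER values (%) copied from the table above:
# section -> (MERaLiON-2-10B-ASR, MERaLiON-3-3B-ASR)
wer_by_section = {
    "Thai": (35.07, 7.55),
    "Hokkien": (59.32, 46.50),
    "Cantonese": (18.55, 11.27),
    "Code-switching": (26.24, 22.65),
    "Tamil": (28.29, 25.83),
}


def delta_pp(old: float, new: float) -> float:
    """Absolute change in percentage points; negative means lower WER."""
    return round(new - old, 1)


def relative_reduction(old: float, new: float) -> float:
    """Relative WER reduction in percent of the older model's WER."""
    return round(100.0 * (old - new) / old, 1)


deltas = {name: delta_pp(old, new) for name, (old, new) in wer_by_section.items()}
for name, (old, new) in wer_by_section.items():
    print(f"{name}: {deltas[name]:+.1f} pp, {relative_reduction(old, new):.1f}% relative")
```

The relative figure is not reported in the table; it is included only to show that a modest absolute change on a low-WER section can still be a large proportional one.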
Performance
The figure reports mean Word Error Rate (lower is better) across twelve language and domain sections, covering 66 evaluation datasets in total. Datasets in the Healthcare and Code-switching sections also appear in their primary language section, so they contribute to both. The full per-dataset breakdown is at SEA-SpeechBench.
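For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance between hypothesis and reference divided by the number of reference words. The sketch below is a generic textbook implementation, not the SEA-SpeechBench scorer: it assumes whitespace tokenisation and applies none of the benchmark's text normalisation. The function name `word_error_rate` is chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("the cat sat down", "the cat sit"))
# Three inserted words against two reference words: WER exceeds 1.0.
print(word_error_rate("hello there", "hello hello there there you"))
```

Because insertions are counted in the numerator but not the denominator, a hypothesis with many extra words can score above 100%, which is one way values above 100 in a WER table can arise. Character Error Rate follows the same recipe with characters in place of words.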
The table below reports per-dataset Word/Character Error Rate (lower is better) for Cantonese and Hokkien — the two regional Chinese dialects where MERaLiON-3 shows its largest relative advantage. The best result on each row is highlighted.
| Dataset | MERaLiON-3-ASR-API | MERaLiON-3-3B-ASR | MERaLiON-2-10B-ASR | Qwen3-ASR | gpt-4o | gpt-audio-1.5 | gemini-3.5-flash |
|---|---|---|---|---|---|---|---|
| Cantonese | | | | | | | |
| cv21_cantonese_test | 11.65 | 13.88 | 18.29 | 25.08 | **8.74** | 21.96 | 20.34 |
| fleurs_cantonese_test | 8.75 | 8.93 | 10.53 | 11.30 | **6.31** | 7.55 | 9.63 |
| mdcc_cantonese_test | **4.95** | 5.11 | 6.95 | 8.67 | 7.56 | 18.51 | 11.27 |
| wsyue_long_test | **9.37** | 9.76 | 18.17 | 11.66 | 25.16 | 29.59 | 14.66 |
| wsyue_short_test | **11.19** | 12.56 | 24.55 | 17.77 | 18.23 | 33.99 | 31.01 |
| ytb_asr_cantonese_short_v3 | **16.63** | 17.38 | 32.83 | 16.92 | 35.48 | 38.66 | 19.87 |
| Hokkien | | | | | | | |
| sarahwei_minnan_test | **18.93** | 22.58 | 49.47 | 66.17 | 44.93 | 77.26 | 46.38 |
| taiwan_tongues_hokkien_test | **51.64** | 58.64 | 63.90 | 99.05 | 66.41 | 109.41 | 73.40 |
| ytb_asr_hokkien_happycanalready_s4 | **38.72** | 58.30 | 64.58 | 128.21 | 78.89 | 68.55 | 63.37 |
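The section-level figures quoted earlier for MERaLiON-3-ASR-API (10.42 for Cantonese, 36.43 for Hokkien) appear to be unweighted means over the datasets in each section. The sketch below checks that reading; the unweighted-mean assumption is mine, and the numbers are copied from the MERaLiON-3-ASR-API column above.

```python
from statistics import mean

# Per-dataset error rates (%) for MERaLiON-3-ASR-API, copied from the table above.
api_scores = {
    "Cantonese": [11.65, 8.75, 4.95, 9.37, 11.19, 16.63],
    "Hokkien": [18.93, 51.64, 38.72],
}

# Assumption: each dataset contributes equally, regardless of its size.
section_means = {
    section: round(mean(scores), 2) for section, scores in api_scores.items()
}
print(section_means)
```

Other columns may differ from the published section means in the last digit, since the table values are already rounded.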
Qualitative examples
The four clips below illustrate cases where MERaLiON-3-ASR-API produces the most accurate transcript, MERaLiON-3-3B-ASR a near-equivalent transcript with one or two character-level differences, and a leading general-purpose model produces a substantially less faithful transcript.
Cantonese (with code-switching to English)
| System | Transcript |
|---|---|
| Reference | 嚟我公司,In-house PR Director,有冇兴趣? 哇,你问得咁直接,我唔怕直接答啊。 |
| MERaLiON-3-ASR-API | 嚟我公司,in-house PR director,有冇兴趣?你问得咁直接,我又唔怕直接答啊。 |
| MERaLiON-3-3B-ASR | 嚟我公司 in house p r director 有冇兴趣你问得咁直接我用唔怕直接答啊。 |
| gemini-3.5-flash | 你間公司 In-house PR director 係咪呀?你問得咁直接,我好難直接答呀。 |
Hokkien (with code-switching to English)
| System | Transcript |
|---|---|
| Reference | 我……我……我新来的 recruit 啦!按怎无戴 helmet? 我……我袂使戴!我一戴,人就认袂出是我了!哈哈哈!我是……我是 recruit 梁婆婆!哈哈哈!人夯 full pack,按怎汝无 full pack? 有啊!我这个就是 full pack 啦!哈哈哈! |
| MERaLiON-3-ASR-API | 我啊,我是来 recruit 啊!安怎无带 helmet?我未使带,我一个带人欲啉袂出是我了!哈哈哈!我是、我是 recruit,梁某无!哈哈哈!人提 full pack,安怎汝无 full pack?我有啊!我即个就是 full pack 啊!哈哈哈! |
| MERaLiON-3-3B-ASR | 我是哪的 recruit 啊按怎没有 helmet 我是 recruit 咩有 full pack 按怎你没有 full pack 我有啊我今天就是 full pack。 |
| gemini-3.5-flash | Oh my god, 我是那一個 record。安仔,無 feedback 係點呀?Oh my god, 乜嘢 feedback 呀?我係,我係 record 嗰個。人哋有 feedback,你點解無 feedback 呀?Oh my god, 我呢個就係 feedback。 |
Code-switching (Tamil + English)
| System | Transcript |
|---|---|
| Reference | நாம்ம என்ன snacks வீட்ல சைரோ எல்லா snacks கூட குடுக்கலாம் |
| MERaLiON-3-ASR-API | நம்ம என்ன snacks வீட்ல சேரு எல்லா snacks கூட கொடுக்கலாம். |
| MERaLiON-3-3B-ASR | நம்ம என்ன snacks வீட்ல சேர்ற எல்லா snacks கூட குடுக்கலாம். |
| gemini-3.5-flash | நம்ம என்ன ஸ்நாக்ஸ் வீட்டுல செய்யற எல்லா ஸ்நாக்ஸ் கூட கொடுக்கலாம் |
Code-switching (English + Mandarin)
| System | Transcript |
|---|---|
| Reference | there's like two quarters 嘛 then 他 会 教 |
| MERaLiON-3-ASR-API | there's like two quarters (mah) then 他会教。 |
| MERaLiON-3-3B-ASR | there's like two quarters (mah) then 他会叫。 |
| gemini-3.5-flash | that's like two quarters then how would you |
A note on the code-switching benchmark
The code-switching section aggregates 13 datasets of natural Southeast Asian conversational speech, totalling 5,430 samples / 35.4 hours, in which English is mixed organically with another SEA language. Coverage by direction:
| Code-switch direction | Samples | Hours |
|---|---|---|
| EN ↔ Mandarin (incl. SEAME) | 506 | 5.3 |
| IMDA Part 4 (Singlish, EN ↔ Mandarin) | 1,000 | 7.3 |
| EN ↔ Tamil | 2,184 | 9.0 |
| EN ↔ Cantonese | 904 | 6.2 |
| EN ↔ Hokkien | 536 | 3.7 |
| EN ↔ Malay | 200 | 2.6 |
| EN ↔ Vietnamese | 100 | 1.3 |
The benchmark is intended to capture real-world SEA usage patterns rather than synthetic alternation between scripts. Of the constituent corpora, only IMDA Part 4 (NSC) is publicly available; the remaining datasets are proprietary or curated in-house. Per-sample normalisation and scoring scripts for the whole evaluation are available at SEA-SpeechBench.
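The per-direction rows in the table above can be summed to confirm the stated totals and to see how the benchmark is weighted by direction. This is a small consistency sketch over the published numbers; the variable names are illustrative.

```python
# (direction, samples, hours) copied from the coverage table above.
rows = [
    ("EN <-> Mandarin (incl. SEAME)", 506, 5.3),
    ("IMDA Part 4 (Singlish, EN <-> Mandarin)", 1000, 7.3),
    ("EN <-> Tamil", 2184, 9.0),
    ("EN <-> Cantonese", 904, 6.2),
    ("EN <-> Hokkien", 536, 3.7),
    ("EN <-> Malay", 200, 2.6),
    ("EN <-> Vietnamese", 100, 1.3),
]

total_samples = sum(samples for _, samples, _ in rows)
total_hours = round(sum(hours for _, _, hours in rows), 1)
print(total_samples, total_hours)

# Share of audio hours per direction, to show which pairs dominate the mean.
for direction, _, hours in rows:
    print(f"{direction}: {100 * hours / total_hours:.1f}% of hours")
```

The shares are only descriptive: how each dataset is weighted in the section mean depends on the benchmark's own aggregation, which is not restated here.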
Model Description
| Property | Value |
|---|---|
| Audio format | Mono, 16,000 Hz |
| Parameters | 3 B |
| Precision | BF16 |
| Supported languages | English (Global + Singapore), Mandarin, Malay, Tamil, Indonesian, Thai, Vietnamese, Cantonese, Hokkien |
| Supported backend | vLLM |
The model is trained as an ASR-optimised fine-tune over a multilingual mixture of curated Southeast Asian speech data, with particular emphasis on Singapore English, regional dialects (Cantonese, Hokkien), and natural English-X code-switching.
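A back-of-the-envelope estimate of weight memory follows from the parameter count and precision in the table above. The sketch assumes the nominal 3 B figure is the total parameter count and covers weights only; KV cache, activations, and serving-runtime overhead are extra and not estimated here.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def weight_memory_gib(num_params: float, precision: str = "bf16") -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


estimate = weight_memory_gib(3e9, "bf16")
print(f"~{estimate:.1f} GiB for weights at BF16")
```

Under these assumptions the weights occupy roughly 5.6 GiB, which leaves most of an 80 GB GPU free for batching and cache.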
How to Use
The recommended way to run MERaLiON-3-3B-ASR is through the meralion-3-asr Python package, which wraps a vLLM backend tuned for this model. Install from PyPI:
```bash
pip install meralion-3-asr
```
The package pre-wires the transcription prompt, decoding configuration, no-repeat-ngram guard, and 30 s audio chunking — on both the offline path and the served path. Callers only provide audio.
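To illustrate what a no-repeat-ngram guard does in general, the sketch below computes which next tokens would recreate an n-gram already present in the output, so a decoder can exclude them. It mirrors the common `no_repeat_ngram_size` idea from text-generation libraries; the package's actual n-gram size and implementation are not documented here, so treat the function name and behaviour as assumptions.

```python
from typing import List, Set


def banned_next_tokens(generated: List[int], ngram_size: int) -> Set[int]:
    """Return tokens that would complete an n-gram already seen in `generated`."""
    if ngram_size < 1:
        raise ValueError("ngram_size must be at least 1")
    if len(generated) < ngram_size:
        return set()  # no complete n-gram exists yet, so nothing can repeat

    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])

    banned: Set[int] = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = generated[start:start + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


# The output so far ends in (5, 6); the trigram (5, 6, 7) already occurred,
# so emitting 7 next would repeat it.
print(banned_next_tokens([5, 6, 7, 5, 6], ngram_size=3))
```

In ASR decoding this kind of guard is typically used to stop the model from looping on a phrase during silence or noise.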
1. Offline batch (in-process vLLM)
```python
from meralion_3_asr import Meralion3ASR

model = Meralion3ASR.from_pretrained("MERaLiON/MERaLiON-3-3B-ASR", backend="vllm")

text = model.transcribe("audio.wav")  # str
texts = model.transcribe_batch(["a.wav", "b.wav", "c.wav"])  # List[str]
```
Inputs may be local file paths, https:// URLs, base64 data URLs, or (numpy_array, sample_rate) tuples. Audio is automatically resampled to mono 16 kHz; long files are chunked transparently.
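The 30 s chunking mentioned above can be pictured as slicing the 16 kHz sample stream into fixed windows. The sketch below computes the window boundaries only, assuming non-overlapping chunks with a shorter final chunk; `chunk_bounds` is a hypothetical helper and not the package's internal code.

```python
from typing import List, Tuple


def chunk_bounds(
    num_samples: int,
    sample_rate: int = 16_000,
    chunk_seconds: float = 30.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive non-overlapping chunks."""
    chunk_len = int(round(sample_rate * chunk_seconds))
    if chunk_len <= 0:
        raise ValueError("chunk length must be positive")
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# A 70 s clip at 16 kHz yields two full 30 s chunks and one 10 s remainder.
bounds = chunk_bounds(70 * 16_000)
print(bounds)
```

Each `(start, end)` pair could then index into a mono sample array before the chunk is passed to the recogniser, with the per-chunk transcripts joined in order.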
2. Serving via the bundled sidecar
meralion-3-asr serve starts a FastAPI sidecar fronting an internal vllm serve process. The sidecar exposes a single OpenAI-compatible endpoint, /v1/audio/transcriptions, applies 30 s non-overlapping chunking server-side, and forwards each chunk to the internal vLLM. Clients send audio only — no prompt, no decoding flags.
```bash
meralion-3-asr serve --model MERaLiON/MERaLiON-3-3B-ASR --port 8000
```
OpenAI Python SDK:
```python
from openai import OpenAI

# The sidecar does not check the API key, but the SDK requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("audio.wav", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="MERaLiON/MERaLiON-3-3B-ASR",
        file=f,
    )

print(resp.text)
```
curl:
```bash
curl -s -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "model=MERaLiON/MERaLiON-3-3B-ASR" \
  -F "file=@audio.wav" \
  | jq -r .text
```
A transformers backend is also available for debugging and small-scale use, but it is currently experimental; the vLLM backend is the only fully supported and benchmarked path.
Hardware & Infrastructure
MERaLiON-3 was trained on the ASPIRE 2A+ supercomputer cluster at the National Supercomputing Centre (NSCC), Singapore, hosted by A*STAR I2R:
- GPUs: 128 NVIDIA H100 GPUs (16 nodes × 8 H100)
- Memory: 2 TB RAM per node
- Storage: 30 TB NVMe per node + 2.5 PB SSD-based Lustre filesystem
- Interconnect: 400 Gb/s NDR InfiniBand (full fat-tree topology)
Related Resources
License
License: TBD.
Disclaimer
⚠️ MERaLiON-3-3B-ASR has not been specifically aligned for safety and may transcribe content that is inappropriate, offensive, or harmful. Developers and users are responsible for performing their own safety fine-tuning and implementing necessary security measures. The authors shall not be held liable for any claims, damages, or other liabilities arising from the use of the released models, weights, or code.
Citation
If you use MERaLiON-3-3B-ASR in your work, please cite the foundational MERaLiON references:
```bibtex
@misc{he2024meralionaudiollmtechnicalreport,
  title={MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models},
  author={{MERaLiON Team}},
  year={2024},
  eprint={2412.09818},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.09818},
}

@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
  journal={NAACL},
  year={2025}
}

@article{wang2025advancing,
  title={Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models},
  author={Wang, Bin and Zou, Xunlong and Sun, Shuo and Zhang, Wenyu and He, Yingxu and Liu, Zhuohan and Wei, Chengwei and Chen, Nancy F and Aw, AiTi},
  journal={arXiv preprint arXiv:2501.01034},
  year={2025}
}

@article{zhang2024mowe,
  title={MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders},
  author={Zhang, Wenyu and Sun, Shuo and Wang, Bin and Zou, Xunlong and Liu, Zhuohan and He, Yingxu and Lin, Geyu and Chen, Nancy F and Aw, Ai Ti},
  journal={ICASSP},
  year={2025}
}
```