MERaLiON-3-ASR Model Card
Overview
MERaLiON-3-ASR is the speech-recognition family in the MERaLiON-3 generation of Speech-Text Large Language Models, developed by I2R, A*STAR, Singapore. The family is purpose-built for Singapore- and Southeast-Asia-centric ASR, with broad coverage across regional languages, regional dialects, and natural conversational code-switching:
- Languages: English (Global and Singapore), Mandarin, Malay, Tamil, Indonesian, Thai, Vietnamese
- Chinese dialects: Cantonese, Hokkien
- Code-switching: natural conversational English ↔ {Mandarin, Malay, Tamil, Vietnamese, Cantonese, Hokkien} mixtures, including Singlish
The family ships in two variants:
- MERaLiON-3-ASR-API: a production-oriented closed-source service.
- MERaLiON-3-3B-ASR: the openly released 3-billion-parameter checkpoint published in this repository.
MERaLiON-3-ASR-API
The closed MERaLiON-3-ASR-API is the production endpoint tuned for real-time and enterprise transcription workloads:
- Fast streaming ASR with ~600 ms first-token latency for interactive use cases.
- Long-form decoding for continuous audio up to multiple hours.
- Word- and segment-level timestamp prediction with speaker diarization for offline transcription.
- Reinforced regional-dialect recognition for Cantonese and Hokkien.
- State-of-the-art Southeast Asian code-switching performance.
On the locked evaluation suite, MERaLiON-3-ASR-API achieves the lowest mean WER among all systems tested on English (Singapore) (12.52 vs. 25.97 / 27.34 for Gemini 3.5 Flash / GPT-4o), Cantonese (10.42, best of all systems), Hokkien (36.43 vs. 46.50 next-best), and Code-switching (20.87 vs. 22.65 / 31.14). On Healthcare it is statistically tied with Gemini 3.5 Flash (20.47 vs. 20.04, within 0.5 pp).
MERaLiON-3-3B-ASR
MERaLiON-3-3B-ASR is the open-weights release in the family. While roughly one-third the size of its predecessor MERaLiON-2-10B-ASR, it matches or improves over the larger model on every section evaluated, with the largest gains concentrated where it matters most for the Southeast Asian setting:
| Section | MERaLiON-2-10B-ASR | MERaLiON-3-3B-ASR | Δ (pp) |
|---|---|---|---|
| Thai | 35.07 | 7.55 | −27.5 |
| Hokkien | 59.32 | 46.50 | −12.8 |
| Cantonese | 18.55 | 11.27 | −7.3 |
| Code-switching | 26.24 | 22.65 | −3.6 |
| Tamil | 28.29 | 25.83 | −2.5 |
Strong dialect and code-switch numbers are inherited from the same training mixture used by the production variant, while the smaller footprint makes the open model practical for self-hosting on a single 80 GB GPU.
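The Δ column in the table above reads as the absolute change in WER, in percentage points, from MERaLiON-2-10B-ASR to MERaLiON-3-3B-ASR. A minimal sketch that recomputes it from the table values (the variable names are illustrative, not part of any released tooling):

```python
# WER values (%) copied from the table above:
# section -> (MERaLiON-2-10B-ASR, MERaLiON-3-3B-ASR)
wer = {
    "Thai": (35.07, 7.55),
    "Hokkien": (59.32, 46.50),
    "Cantonese": (18.55, 11.27),
    "Code-switching": (26.24, 22.65),
    "Tamil": (28.29, 25.83),
}

# Delta in percentage points: new WER minus old WER (negative = improvement).
delta_pp = {section: round(new - old, 1) for section, (old, new) in wer.items()}

for section, delta in delta_pp.items():
    print(f"{section}: {delta:+.1f} pp")
```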
A note on the code-switching benchmark
The code-switching section aggregates 13 datasets of natural Southeast Asian conversational speech, totalling 5,430 samples / 35.4 hours, in which English is mixed organically with another SEA language. Coverage by direction:
| Code-switch direction | Samples | Hours |
|---|---|---|
| EN ↔ Mandarin (incl. SEAME) | 506 | 5.3 |
| IMDA Part 4 (Singlish, EN ↔ Mandarin) | 1,000 | 7.3 |
| EN ↔ Tamil | 2,184 | 9.0 |
| EN ↔ Cantonese | 904 | 6.2 |
| EN ↔ Hokkien | 536 | 3.7 |
| EN ↔ Malay | 200 | 2.6 |
| EN ↔ Vietnamese | 100 | 1.3 |
This is intended to capture real-world SEA usage patterns rather than synthetic alternation between scripts. Of the constituent corpora, only IMDA Part 4 (NSC) is publicly available; the remaining datasets are proprietary or curated in-house. Per-sample normalization and scoring scripts for the whole evaluation are available at SEA-SpeechBench.
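As a quick consistency check, the per-direction rows in the coverage table can be summed and compared with the stated totals of 5,430 samples and 35.4 hours. The sketch below only re-adds the published figures; the row labels are shortened for readability.

```python
# (samples, hours) per code-switch direction, copied from the coverage table.
coverage = {
    "EN-Mandarin (incl. SEAME)": (506, 5.3),
    "IMDA Part 4 (Singlish, EN-Mandarin)": (1000, 7.3),
    "EN-Tamil": (2184, 9.0),
    "EN-Cantonese": (904, 6.2),
    "EN-Hokkien": (536, 3.7),
    "EN-Malay": (200, 2.6),
    "EN-Vietnamese": (100, 1.3),
}

total_samples = sum(samples for samples, _ in coverage.values())
# Round to one decimal to avoid floating-point noise in the sum.
total_hours = round(sum(hours for _, hours in coverage.values()), 1)

print(total_samples, total_hours)
```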
Performance highlights
The figure reports mean Word Error Rate (lower is better) across twelve language and domain sections, covering 66 evaluation datasets in total. Datasets in the Healthcare and Code-switching sections also appear in their primary language section, so they contribute to both. The full per-dataset breakdown is at SEA-SpeechBench.
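For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the number of reference words. The sketch below assumes plain whitespace tokenization and no text normalization; the actual normalization and scoring used for this card live in SEA-SpeechBench and may differ, especially for languages written without spaces. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution in a four-word reference.
print(word_error_rate("a b c d", "a x c d"))
```

Multiplying the result by 100 gives the percentage scale used in the tables above.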
Model Description
| Property | Value |
|---|---|
| Audio Format | Mono, 16,000 Hz |
| Parameters | 3B |
| Precision | BF16 |
| Supported Languages | English (Global + Singapore), Mandarin, Malay, Tamil, Indonesian, Thai, Vietnamese, Cantonese, Hokkien |
| Supported Backend | vLLM (recommended); transformers is experimental |
The model is trained as an ASR-optimized fine-tune over a multilingual mixture of curated Southeast Asian speech data, with particular emphasis on Singapore English, regional dialects (Cantonese, Hokkien), and natural English-X code-switching.
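The Audio Format row above specifies mono audio at 16,000 Hz. When inspecting data independently of the meralion-3-asr package, a small standard-library check like the following can confirm whether a PCM WAV file already matches that format. This is a sketch only: the function name is hypothetical, and it handles uncompressed WAV files, not other containers.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # Hz, from the Model Description table
EXPECTED_CHANNELS = 1          # mono


def is_mono_16k(path: str) -> bool:
    """Return True if the PCM WAV file at `path` is mono at 16 kHz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getframerate() == EXPECTED_SAMPLE_RATE
        )
```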
How to Use
The recommended way to run MERaLiON-3-3B-ASR is through the meralion-3-asr Python package, which wraps a vLLM backend tuned for this model. The package is not yet published to PyPI; install directly from source:
pip install "meralion-3-asr[vllm] @ git+https://github.com/YingxuH/MERaLiON-3-ASR.git"
The package pre-wires the fixed transcription prompt, the locked sampling defaults (temperature=0, top_p=1.0, top_k=50, repetition_penalty=1.0, max_new_tokens=512), the no-repeat-ngram guard, and the 30 s audio chunker on both the offline path and the served path. Callers only provide audio.
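The card does not describe how the no-repeat-ngram guard is implemented. The usual form of the technique bans any next token that would complete an n-gram already present in the output, which stops the decoder from looping. The sketch below illustrates that general idea under this assumption; the function name and the choice of n are hypothetical and not taken from the package.

```python
from typing import List, Set


def banned_next_tokens(generated: List[int], n: int) -> Set[int]:
    """Tokens that would repeat an n-gram already present in `generated`."""
    if n < 1:
        raise ValueError("n must be >= 1")
    if len(generated) < n:
        return set()  # no complete n-gram exists yet

    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):])

    banned: Set[int] = set()
    for i in range(len(generated) - n + 1):
        # If an earlier n-gram starts with the same prefix, its final token
        # would create a repeat, so it is banned at the next step.
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


# After "1 2 3 1 2", emitting 3 again would repeat the trigram (1, 2, 3).
print(banned_next_tokens([1, 2, 3, 1, 2], n=3))
```

During decoding, the logits of the returned token IDs would be masked out before the next token is selected.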
1. Offline batch (in-process vLLM)
For local scripts that load the model once and transcribe many files in the same process:
from meralion_3_asr import Meralion3ASR
model = Meralion3ASR.from_pretrained("MERaLiON/MERaLiON-3-3B-ASR", backend="vllm")
# Single file
text = model.transcribe("audio.wav") # str
# Batch
texts = model.transcribe_batch(["a.wav", "b.wav", "c.wav"]) # List[str]
Inputs may be local file paths, https:// URLs, base64 data URLs, or (numpy_array, sample_rate) tuples. Audio is automatically resampled to mono 16 kHz and long files are chunked transparently.
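To make the chunking concrete: at 16 kHz, a 30 s chunk is 480,000 samples. The sketch below computes chunk boundaries assuming consecutive non-overlapping windows with a shorter final chunk. How the package itself treats the last partial chunk or joins the per-chunk transcripts is not specified here, and the function name is hypothetical.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # Hz, from the Model Description table
CHUNK_SECONDS = 30    # chunk length used by the package's audio chunker


def chunk_bounds(
    num_samples: int,
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: int = CHUNK_SECONDS,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive non-overlapping chunks."""
    chunk_len = sample_rate * chunk_seconds
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# A 65 s clip yields two full 30 s chunks and one 5 s remainder.
print(chunk_bounds(65 * SAMPLE_RATE))
```

Each `(start, end)` pair can be used to slice a 1-D sample array, for example `audio[start:end]`.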
2. Serving via the bundled sidecar
meralion-3-asr serve starts a small FastAPI sidecar fronting an internal vllm serve process. The sidecar exposes a single OpenAI-compatible endpoint, /v1/audio/transcriptions, applies 30 s non-overlapping chunking server-side, and forwards each chunk to the internal vLLM with the locked sampling defaults pinned via --override-generation-config. Clients send audio only: no prompt, no decoding flags.
meralion-3-asr serve --model MERaLiON/MERaLiON-3-3B-ASR --port 8000
2a. OpenAI Python SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
with open("audio.wav", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="MERaLiON/MERaLiON-3-3B-ASR",
        file=f,
    )
print(resp.text)
2b. curl
curl -s -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "model=MERaLiON/MERaLiON-3-3B-ASR" \
  -F "file=@audio.wav" \
  | jq -r .text
Backend notes
- The vLLM backend is the supported and benchmarked path. All published numbers in this card were produced with this backend through meralion-3-asr.
- A transformers-based backend is included for debugging and small-scale use. It has only been validated at transformers==4.50.1; newer versions can silently regress on edge cases (notably Gemma-2 softcap behavior under SDPA attention).
- The HF repo's generation_config.json is tuned for the transformers backend (repetition_penalty=1.05, no_repeat_ngram_size=6). The vLLM-tuned values (temperature=0, top_p=1.0, top_k=50, repetition_penalty=1.0, max_new_tokens=512) ship inside the meralion-3-asr package and are injected into the served vLLM via --override-generation-config, so neither backend's defaults pollute the other.
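To illustrate the last point, the vLLM-tuned values can be serialized as a JSON string of the kind that --override-generation-config accepts. This is a sketch only: the package builds this argument internally, and the exact keys it passes are an assumption here, mirrored from the parameter names listed above.

```python
import json

# vLLM-tuned sampling defaults as listed in this card. The key names are
# assumed to match what the package passes; they are not confirmed.
VLLM_SAMPLING_DEFAULTS = {
    "temperature": 0,
    "top_p": 1.0,
    "top_k": 50,
    "repetition_penalty": 1.0,
    "max_new_tokens": 512,
}

# JSON string suitable as the value of a command-line flag.
override_arg = json.dumps(VLLM_SAMPLING_DEFAULTS)
print(f"--override-generation-config '{override_arg}'")
```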
Hardware & Infrastructure
MERaLiON-3 was trained on the ASPIRE 2A+ Supercomputer Cluster at the National Supercomputing Centre (NSCC), Singapore, hosted by A*STAR I2R:
- GPUs: 128 Nvidia H100 GPUs (16 nodes with 8 H100s each)
- Memory: 2 TB RAM per node
- Storage: 30 TB NVMe per node + 2.5 PB SSD-based Lustre filesystem
- Interconnect: 400 Gb/s NDR InfiniBand (full fat-tree topology)
Related Resources
License
License: TBD.
Disclaimer
⚠️ The current MERaLiON-3-3B-ASR has not been specifically aligned for safety and may transcribe content that is inappropriate, offensive, or harmful. Developers and users are responsible for performing their own safety fine-tuning and implementing necessary security measures. The authors shall not be held liable for any claims, damages, or other liabilities arising from the use of the released models, weights, or code.
Citation
If you use MERaLiON-3-3B-ASR in your work, please cite the foundational MERaLiON references:
@misc{he2024meralionaudiollmtechnicalreport,
title={MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models},
author={{MERaLiON Team}},
year={2024},
eprint={2412.09818},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.09818},
}
@article{wang2024audiobench,
title={AudioBench: A Universal Benchmark for Audio Large Language Models},
author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
journal={NAACL},
year={2025}
}
@article{wang2025advancing,
title={Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models},
author={Wang, Bin and Zou, Xunlong and Sun, Shuo and Zhang, Wenyu and He, Yingxu and Liu, Zhuohan and Wei, Chengwei and Chen, Nancy F and Aw, AiTi},
journal={arXiv preprint arXiv:2501.01034},
year={2025}
}
@article{zhang2024mowe,
title={MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders},
author={Zhang, Wenyu and Sun, Shuo and Wang, Bin and Zou, Xunlong and Liu, Zhuohan and He, Yingxu and Lin, Geyu and Chen, Nancy F and Aw, Ai Ti},
journal={ICASSP},
year={2025}
}
