MERaLiON-3-ASR Model Card

Overview

MERaLiON-3-ASR is the speech-recognition family in the MERaLiON-3 generation of Speech-Text Large Language Models, developed by I2R, A*STAR, Singapore. The family is purpose-built for Singapore- and Southeast-Asia-centric ASR, with broad coverage across regional languages, regional dialects, and natural conversational code-switching:

  • Languages: English (Global and Singapore), Mandarin, Malay, Tamil, Indonesian, Thai, Vietnamese
  • Chinese dialects: Cantonese, Hokkien
  • Code-switching: natural conversational English ↔ {Mandarin, Malay, Tamil, Vietnamese, Cantonese, Hokkien} mixtures, including Singlish

The family ships in two variants:

  • MERaLiON-3-ASR-API: a production-oriented closed-source service.
  • MERaLiON-3-3B-ASR: the openly released 3-billion-parameter checkpoint published in this repository.

MERaLiON-3-ASR-API

The closed MERaLiON-3-ASR-API is the production endpoint tuned for real-time and enterprise transcription workloads:

  • Fast streaming ASR with ~600 ms first-token latency for interactive use cases.
  • Long-form decoding for continuous audio up to multiple hours.
  • Word- and segment-level timestamp prediction with speaker diarization for offline transcription.
  • Reinforced regional-dialect recognition for Cantonese and Hokkien.
  • State-of-the-art Southeast Asian code-switching performance.

On the locked evaluation suite, MERaLiON-3-ASR-API achieves the lowest mean WER among all systems tested on English (Singapore) (12.52 vs. 25.97 / 27.34 for Gemini 3.5 Flash / GPT-4o), Cantonese (10.42, best of all systems), Hokkien (36.43 vs. 46.50 next-best), and Code-switching (20.87 vs. 22.65 / 31.14). On Healthcare it is essentially tied with Gemini 3.5 Flash (20.47 vs. 20.04, a gap of under 0.5 pp).

MERaLiON-3-3B-ASR

MERaLiON-3-3B-ASR is the open-weights release in the family. While roughly one-third the size of its predecessor MERaLiON-2-10B-ASR, it matches or improves over the larger model on every section evaluated, with the largest gains concentrated where it matters most for the Southeast Asian setting:

| Section | MERaLiON-2-10B-ASR | MERaLiON-3-3B-ASR | Δ (pp) |
|---|---|---|---|
| Thai | 35.07 | 7.55 | −27.5 |
| Hokkien | 59.32 | 46.50 | −12.8 |
| Cantonese | 18.55 | 11.27 | −7.3 |
| Code-switching | 26.24 | 22.65 | −3.6 |
| Tamil | 28.29 | 25.83 | −2.5 |

Strong dialect and code-switch numbers are inherited from the same training mixture used by the production variant, while the smaller footprint makes the open model practical for self-hosting on a single 80 GB GPU.

A note on the code-switching benchmark

The code-switching section aggregates 13 datasets of natural Southeast Asian conversational speech, totalling 5,430 samples / 35.4 hours, in which English is mixed organically with another SEA language. Coverage by direction:

| Code-switch direction | Samples | Hours |
|---|---|---|
| EN ↔ Mandarin (incl. SEAME) | 506 | 5.3 |
| IMDA Part 4 (Singlish, EN ↔ Mandarin) | 1,000 | 7.3 |
| EN ↔ Tamil | 2,184 | 9.0 |
| EN ↔ Cantonese | 904 | 6.2 |
| EN ↔ Hokkien | 536 | 3.7 |
| EN ↔ Malay | 200 | 2.6 |
| EN ↔ Vietnamese | 100 | 1.3 |

This is intended to capture real-world SEA usage patterns rather than synthetic alternation between scripts. Of the constituent corpora, only IMDA Part 4 (NSC) is publicly available; the remaining datasets are proprietary or curated in-house. Per-sample normalization and scoring scripts for the whole evaluation are available at SEA-SpeechBench.
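The stated totals can be cross-checked against the per-direction rows. The short sketch below simply re-adds the sample counts and hours listed in the coverage table; the variable names are illustrative and not part of any released tooling.

```python
# (direction, samples, hours) rows copied from the coverage table above.
rows = [
    ("EN-Mandarin (incl. SEAME)", 506, 5.3),
    ("IMDA Part 4 (Singlish, EN-Mandarin)", 1000, 7.3),
    ("EN-Tamil", 2184, 9.0),
    ("EN-Cantonese", 904, 6.2),
    ("EN-Hokkien", 536, 3.7),
    ("EN-Malay", 200, 2.6),
    ("EN-Vietnamese", 100, 1.3),
]

total_samples = sum(samples for _, samples, _ in rows)
# Round to one decimal place to match the precision used in the card.
total_hours = round(sum(hours for _, _, hours in rows), 1)

print(total_samples, total_hours)  # 5430 35.4
```

Both values agree with the 5,430 samples / 35.4 hours quoted for the section.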

Performance highlights

Figure: ASR WER comparison across 12 language and domain sections

The figure reports mean Word Error Rate (lower is better) across twelve language and domain sections, covering 66 evaluation datasets in total. Datasets in the Healthcare and Code-switching sections also appear in their primary language section, so they contribute to both. The full per-dataset breakdown is at SEA-SpeechBench.
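For readers unfamiliar with the metric, the following is a minimal sketch of word-level WER computed with a standard Levenshtein alignment. It assumes plain whitespace tokenization and no text normalization; the scoring scripts at SEA-SpeechBench may normalize and tokenize differently (especially for languages without whitespace word boundaries), so this will not necessarily reproduce the reported numbers. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("me") and one substitution ("kopi" -> "copy") over 6 words.
wer = word_error_rate("can you pass me the kopi", "can you pass the copy")
print(f"{100 * wer:.2f}")  # 33.33
```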

Model Description

| Property | Value |
|---|---|
| Audio Format | Mono, 16,000 Hz |
| Parameters | 3B |
| Precision | BF16 |
| Supported Languages | English (Global + Singapore), Mandarin, Malay, Tamil, Indonesian, Thai, Vietnamese, Cantonese, Hokkien |
| Supported Backend | vLLM (recommended); transformers is experimental |

The model is trained as an ASR-optimized fine-tune over a multilingual mixture of curated Southeast Asian speech data, with particular emphasis on Singapore English, regional dialects (Cantonese, Hokkien), and natural English-X code-switching.

How to Use

The recommended way to run MERaLiON-3-3B-ASR is through the meralion-3-asr Python package, which wraps a vLLM backend tuned for this model. The package is not yet published to PyPI; install directly from source:

pip install "meralion-3-asr[vllm] @ git+https://github.com/YingxuH/MERaLiON-3-ASR.git"

The package pre-wires the fixed transcription prompt, locked sampling defaults (temperature=0, top_p=1.0, top_k=50, repetition_penalty=1.0, max_new_tokens=512), the no-repeat-ngram guard, and the 30 s audio chunker, on both the offline path and the served path. Callers only provide audio.
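To illustrate what a no-repeat-ngram guard does in general, here is a minimal sketch of the usual technique: before each decoding step, any token that would complete an n-gram already present in the output is banned. This is not the package's implementation, which is not shown in this card; the function name `banned_next_tokens` and the n-gram size of 3 are assumptions chosen for illustration.

```python
def banned_next_tokens(tokens: list[int], ngram_size: int) -> set[int]:
    """Return the tokens that would repeat an n-gram already seen in `tokens`."""
    if ngram_size < 1:
        raise ValueError("ngram_size must be at least 1")
    if len(tokens) < ngram_size:
        # No complete n-gram exists in the history yet, so nothing can repeat.
        return set()

    prefix_len = ngram_size - 1
    # The last (n - 1) generated tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - prefix_len:])

    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        if tuple(tokens[start:start + prefix_len]) == prefix:
            # This earlier n-gram begins with the same prefix, so its final
            # token would create a repeat if generated next.
            banned.add(tokens[start + prefix_len])
    return banned


# History ends in (5, 6); the trigram (5, 6, 7) already occurred, so 7 is banned.
print(banned_next_tokens([5, 6, 7, 8, 5, 6], ngram_size=3))  # {7}
```

In a real decoder the banned tokens would have their logits masked before sampling; that step is omitted here.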

1. Offline batch (in-process vLLM)

For local scripts that load the model once and transcribe many files in the same process:

from meralion_3_asr import Meralion3ASR

model = Meralion3ASR.from_pretrained("MERaLiON/MERaLiON-3-3B-ASR", backend="vllm")

# Single file
text = model.transcribe("audio.wav")           # str

# Batch
texts = model.transcribe_batch(["a.wav", "b.wav", "c.wav"])  # List[str]

Inputs may be local file paths, https:// URLs, base64 data URLs, or (numpy_array, sample_rate) tuples. Audio is automatically resampled to mono 16 kHz and long files are chunked transparently.
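As an illustration of the base64 data URL input form, the sketch below builds one second of 16 kHz mono silence in memory and encodes it using only the Python standard library. The helper name `wav_bytes_to_data_url` is hypothetical, and the `audio/wav` media type is an assumption; the card does not state which media types the package expects.

```python
import base64
import io
import wave


def wav_bytes_to_data_url(wav_bytes: bytes) -> str:
    """Encode raw WAV file bytes as a base64 data URL (media type assumed)."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return f"data:audio/wav;base64,{encoded}"


# Write one second of 16-bit mono silence at 16 kHz into an in-memory buffer.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(1)      # mono
    wav_file.setsampwidth(2)      # 16-bit samples
    wav_file.setframerate(16000)  # 16 kHz
    wav_file.writeframes(b"\x00\x00" * 16000)

data_url = wav_bytes_to_data_url(buffer.getvalue())
print(data_url[:22])  # data:audio/wav;base64,
```

Under the input types listed above, a string like `data_url` could then be passed to `model.transcribe(...)` in place of a file path.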

2. Serving via the bundled sidecar

meralion-3-asr serve starts a small FastAPI sidecar fronting an internal vllm serve process. The sidecar exposes a single OpenAI-compatible endpoint, /v1/audio/transcriptions, applies 30 s non-overlapping chunking server-side, and forwards each chunk to the internal vLLM with the locked sampling defaults pinned via --override-generation-config. Clients send audio only: no prompt, no decoding flags.

meralion-3-asr serve --model MERaLiON/MERaLiON-3-3B-ASR --port 8000
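The 30 s non-overlapping chunking described above can be sketched as follows. This is an illustration of the idea only, not the sidecar's code: the function name `split_into_chunks` is hypothetical, and how the real chunker treats the final partial chunk (padding, merging, or passing it through as-is) is not stated in this card.

```python
SAMPLE_RATE = 16_000   # Hz, the mono rate the model expects
CHUNK_SECONDS = 30     # chunk length stated in this card


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a sequence of audio samples into consecutive non-overlapping chunks."""
    chunk_len = sample_rate * chunk_seconds
    # Each slice starts where the previous one ended, so no sample is repeated.
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# 70 seconds of silence splits into two full chunks and a 10-second remainder.
audio = [0.0] * (SAMPLE_RATE * 70)
chunks = split_into_chunks(audio)
print([len(chunk) / SAMPLE_RATE for chunk in chunks])  # [30.0, 30.0, 10.0]
```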

2a. OpenAI Python SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("audio.wav", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="MERaLiON/MERaLiON-3-3B-ASR",
        file=f,
    )
print(resp.text)

2b. curl

curl -s -X POST http://localhost:8000/v1/audio/transcriptions \
    -F "model=MERaLiON/MERaLiON-3-3B-ASR" \
    -F "file=@audio.wav" \
  | jq -r .text

Backend notes

  • The vLLM backend is the supported and benchmarked path. All published numbers in this card were produced with this backend through meralion-3-asr.
  • A transformers-based backend is included for debugging and small-scale use. It has only been validated at transformers==4.50.1; newer versions can silently regress on edge cases (notably Gemma-2 softcap behavior under SDPA attention).
  • The HF repo's generation_config.json is tuned for the transformers backend (repetition_penalty=1.05, no_repeat_ngram_size=6). The vLLM-tuned values (temperature=0, top_p=1.0, top_k=50, repetition_penalty=1.0, max_new_tokens=512) ship inside the meralion-3-asr package and are injected into the served vLLM via --override-generation-config, so neither backend's defaults pollute the other.
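For reference, the vLLM-tuned values listed above can be serialized into the JSON string that a --override-generation-config flag takes. The sketch below only builds and prints an illustrative command line; it is an assumption about the general shape of such an invocation, not the command the meralion-3-asr sidecar actually constructs.

```python
import json
import shlex

# Sampling defaults quoted in this card for the vLLM backend.
VLLM_SAMPLING_DEFAULTS = {
    "temperature": 0,
    "top_p": 1.0,
    "top_k": 50,
    "repetition_penalty": 1.0,
    "max_new_tokens": 512,
}

override_json = json.dumps(VLLM_SAMPLING_DEFAULTS)

# Illustrative only: the sidecar may pass additional or different arguments.
command = [
    "vllm", "serve", "MERaLiON/MERaLiON-3-3B-ASR",
    "--override-generation-config", override_json,
]
print(shlex.join(command))
```

Because `shlex.join` quotes the JSON argument, the printed line can be pasted into a POSIX shell without further escaping.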

Hardware & Infrastructure

MERaLiON-3 was trained on the ASPIRE 2A+ Supercomputer Cluster at the National Supercomputing Centre (NSCC), Singapore, hosted by A*STAR I2R:

  • GPUs: 128 NVIDIA H100 GPUs (16 nodes with 8 H100s each)
  • Memory: 2 TB RAM per node
  • Storage: 30 TB NVMe per node + 2.5 PB SSD-based Lustre filesystem
  • Interconnect: 400 Gb/s NDR InfiniBand (full fat-tree topology)

Related Resources

License

License: TBD.

Disclaimer

⚠️ The current MERaLiON-3-3B-ASR has not been specifically aligned for safety and may transcribe content that is inappropriate, offensive, or harmful. Developers and users are responsible for performing their own safety fine-tuning and implementing necessary security measures. The authors shall not be held liable for any claims, damages, or other liabilities arising from the use of the released models, weights, or code.

📚 Citation

If you use MERaLiON-3-3B-ASR in your work, please cite the foundational MERaLiON references:

@misc{he2024meralionaudiollmtechnicalreport,
      title={MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models},
      author={{MERaLiON Team}},
      year={2024},
      eprint={2412.09818},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.09818},
}

@article{wang2024audiobench,
    title={AudioBench: A Universal Benchmark for Audio Large Language Models},
    author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
    journal={NAACL},
    year={2025}
}

@article{wang2025advancing,
    title={Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models},
    author={Wang, Bin and Zou, Xunlong and Sun, Shuo and Zhang, Wenyu and He, Yingxu and Liu, Zhuohan and Wei, Chengwei and Chen, Nancy F and Aw, AiTi},
    journal={arXiv preprint arXiv:2501.01034},
    year={2025}
}

@article{zhang2024mowe,
    title={MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders},
    author={Zhang, Wenyu and Sun, Shuo and Wang, Bin and Zou, Xunlong and Liu, Zhuohan and He, Yingxu and Lin, Geyu and Chen, Nancy F and Aw, Ai Ti},
    journal={ICASSP},
    year={2025}
}