About
Polyglot-Lion-0.6B is a compact multilingual automatic speech recognition (ASR) model tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Developed by Quy-Anh Dang and Chris Ngo at Knovel Engineering, the model is presented in the report "Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR".
The model is fine-tuned from Qwen3-ASR-0.6B exclusively on publicly available speech corpora. It achieves an average error rate of 16.52%, cutting the base model's 50.68% by more than two-thirds and outperforming Whisper-large-v3-turbo (33.04%), while running inference roughly 20× faster.
- Parameters: 0.6B
- Languages: English, Mandarin, Tamil, Malay
- Training cost: $81 on a single NVIDIA RTX PRO 6000 (48 h)
- Inference speed: ~0.10 s/sample on RTX PRO 4500
Methodology
Polyglot-Lion employs a two-stage balanced upsampling strategy to handle severe class imbalance across languages and datasets (a minimal sketch follows the list below):
- Stage 1 (Intra-language balancing): Within each language, smaller datasets are replicated and subsampled to match the largest dataset in that language.
- Stage 2 (Inter-language balancing): Across languages, per-language corpora are balanced so every language contributes equally to the final training set (~969 hours total).
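The sketch below illustrates the two stages on toy data; it is not the authors' actual pipeline. Corpora are represented as lists of utterance IDs rather than hours of audio, and the helper names (`upsample`, `balance_within_language`, `balance_across_languages`) and dataset sizes are illustrative assumptions.

```python
import random

def upsample(samples, target):
    # Replicate the dataset in full, then subsample the remainder,
    # so its size matches `target` (the "replicate and subsample" step).
    copies, remainder = divmod(target, len(samples))
    return samples * copies + random.sample(samples, remainder)

def balance_within_language(datasets):
    # Stage 1: grow every dataset of one language to the size of that
    # language's largest dataset.
    target = max(len(s) for s in datasets.values())
    return {name: upsample(s, target) for name, s in datasets.items()}

def balance_across_languages(corpora):
    # Stage 2: pool each language's stage-1-balanced datasets, then equalize
    # the per-language totals so every language contributes equally.
    pooled = {
        lang: [u for ds in balance_within_language(d).values() for u in ds]
        for lang, d in corpora.items()
    }
    target = max(len(p) for p in pooled.values())
    return {lang: upsample(p, target) for lang, p in pooled.items()}

# Toy corpora: language -> dataset -> utterance IDs (sizes are made up).
corpora = {
    "english": {"librispeech": list(range(900)), "nsc": list(range(300))},
    "tamil": {"slr65": list(range(120)), "fleurs": list(range(60))},
}
balanced = balance_across_languages(corpora)
print({lang: len(utts) for lang, utts in balanced.items()})  # every language ends up the same size
```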
The model deliberately omits language-tag conditioning during training, allowing it to learn to identify languages implicitly from the audio signal, which is critical for deployment-ready multilingual ASR in diverse linguistic environments.
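The Quick Start examples below reflect this: `language=None` is passed to `transcribe`, and the detected language is returned alongside each transcript.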
Results
| Model | Params | English (LS) | English (NSC) | Mandarin (CV) | Mandarin (AISH1) | Mandarin (AISH3) | Mandarin (Fleurs) | Tamil (CV) | Tamil (SLR65) | Tamil (SLR127) | Tamil (Fleurs) | Malay (Meso.) | Malay (Fleurs) | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3-turbo | 0.8B | 3.04 | 32.02 | 17.91 | 9.64 | 16.81 | 10.63 | 74.50 | 58.13 | 69.56 | 66.90 | 28.47 | 8.88 | 33.04 |
| SeaLLMs-Audio-7B | 7B | 94.74 | 9.53 | 8.68 | 9.65 | 9.76 | 37.09 | 126.70 | 127.24 | 138.65 | 105.31 | 71.34 | 26.25 | 63.75 |
| Qwen2.5-Omni-3B | 3B | 29.21 | 34.79 | 46.36 | 28.25 | 44.55 | 54.74 | 318.36 | 465.58 | 448.82 | 311.67 | 211.90 | 74.69 | 172.37 |
| Qwen2.5-Omni-7B | 7B | 13.80 | 22.96 | 14.49 | 7.33 | 22.58 | 16.68 | 252.06 | 239.15 | 303.96 | 326.43 | 158.06 | 43.92 | 118.45 |
| Qwen3-ASR-0.6B | 0.6B | 2.74 | 7.64 | 10.06 | 2.08 | 2.59 | 9.75 | 121.10 | 127.00 | 129.12 | 130.09 | 47.29 | 18.71 | 50.68 |
| Qwen3-ASR-1.7B | 1.7B | 2.31 | 6.22 | 7.50 | 1.52 | 2.08 | 9.33 | 139.96 | 134.63 | 144.49 | 147.23 | 39.00 | 10.87 | 53.76 |
| MERaLiON-2-10B-ASR | 10B | 2.54 | **4.62** | 8.83 | 3.09 | 4.07 | 11.99 | **31.78** | **19.29** | **22.42** | **28.68** | 25.90 | **8.55** | **14.32** |
| Polyglot-Lion-0.6B | 0.6B | 2.67 | 6.09 | 6.16 | 1.93 | 2.32 | 9.19 | 42.16 | 23.07 | 28.14 | 37.68 | 24.33 | 14.45 | 16.52 |
| Polyglot-Lion-1.7B | 1.7B | **2.10** | 5.28 | **4.91** | **1.45** | **1.86** | **8.00** | 39.19 | 19.75 | 26.83 | 37.28 | **21.51** | 9.98 | 14.85 |
WER (%) for English, Tamil, and Malay; CER (%) for Mandarin. Lower is better. Bold = best result in each column.
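As a reference for how these metrics are defined, here is a minimal pure-Python sketch of WER and CER (word-level and character-level edit distance normalized by reference length). It is a generic illustration, not the evaluation script used in the report, and the function names are placeholders.

```python
def edit_distance(ref, hyp):
    # Levenshtein distance between two token sequences (1-D DP).
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (r != h),    # substitution (free if tokens match)
            )
    return dp[-1]

def wer(ref, hyp):
    # Word error rate (%): word-level edit distance / number of reference words.
    ref_words = ref.split()
    return 100 * edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    # Character error rate (%): the same computation at the character level
    # (the metric used for Mandarin in the table above).
    return 100 * edit_distance(list(ref), list(hyp)) / len(ref)

print(f"{wer('the cat sat on the mat', 'the cat sit on mat'):.2f}")  # 33.33
print(f"{cer('你好世界', '你好时节'):.2f}")                            # 50.00
```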
Quick Start
Polyglot-Lion uses the qwen-asr package for inference.
```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create environment and install
uv venv --python 3.12 && source .venv/bin/activate
uv pip install qwen-asr hf_transfer
```
Transformers
```python
import torch
from qwen_asr import Qwen3ASRModel

# Load the fine-tuned checkpoint from the Hugging Face Hub.
model = Qwen3ASRModel.from_pretrained(
    "knoveleng/polyglot-lion-0.6b",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_new_tokens=256,
)

# language=None lets the model identify the language from the audio itself.
results = model.transcribe(audio="path/to/audio.wav", language=None)
print(results[0].language, results[0].text)
```
vLLM (faster)
```python
from qwen_asr import Qwen3ASRModel

if __name__ == "__main__":
    # Load the checkpoint on the vLLM backend for faster, batched inference.
    model = Qwen3ASRModel.LLM(
        model="knoveleng/polyglot-lion-0.6b",
        gpu_memory_utilization=0.7,
        max_new_tokens=4096,
    )

    # Transcribe several files in one call; the language is detected per sample.
    results = model.transcribe(audio=["audio1.wav", "audio2.wav"], language=None)
    for r in results:
        print(r.language, r.text)
```
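The `if __name__ == "__main__":` guard is kept from the original snippet; it is presumably there because the vLLM backend may spawn worker processes that re-import the calling script.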
For batch inference, timestamps, streaming, and server deployment, see the Qwen3-ASR documentation.
Citation
```bibtex
@misc{dang2026polyglotlion,
  title={Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR},
  author={Quy-Anh Dang and Chris Ngo},
  year={2026},
  eprint={2603.16184},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.16184},
}
```