TEA-ASR-1-mini · Taiwan Everyday Audio 🍵

TEA-ASR is an open, drop-in speech-recognition model purpose-built for Taiwan Mandarin. It turns real speech into natural Traditional Chinese with authentic Taiwan vocabulary, and it stays robust through the everyday Mandarin–English code-switching common in Taiwan. Adapted from the state-of-the-art Qwen3-ASR foundation and merged into a single self-contained checkpoint, TEA-ASR loads and runs exactly like stock Qwen3-ASR, with no converters and no post-processing. On every public benchmark we evaluate, the flagship TEA-ASR-1 matches or surpasses both a dedicated Taiwan specialist (Breeze-ASR-25) and a large multilingual model (Whisper-large-v3); TEA-ASR-1-mini does so on three of the four, trailing the specialist only on CSZS.

TEA-ASR-1-mini is the 780M compact model (best accuracy-per-parameter). A companion TEA-ASR-1 shares the identical recipe — see JacobLinCool/TEA-ASR-1.

Key features

🎯 Built for Taiwan Mandarin — Traditional script and Taiwan-style word choice, produced by the model itself.
🔀 Code-switch robust — handles natural zh-en mixing instead of translating Mandarin into English.
🧩 Drop-in Qwen3-ASR compatible — same loading and inference API as the base model; nothing else to install or call.
🪶 Lightweight adaptation — a small decoder LoRA on a frozen audio encoder, trained on a few hours of public audio, then merged for deployment.

Quick start

pip install qwen-asr

from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained("JacobLinCool/TEA-ASR-1-mini")
result = model.transcribe(audio="utterance.wav", language="Chinese")[0]
print(result.text)   # -> Traditional Chinese with Taiwan lexicon

Set language="Chinese" for Taiwan speech (recommended). You can also pass a context= string of hotwords (names, jargon) for contextual biasing, exactly as with the base Qwen3-ASR.
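The hotword option can be wrapped in a small helper. The following is a minimal sketch, not part of the qwen-asr package: `build_context` and `transcribe_with_hotwords` are hypothetical names, and joining hotwords with spaces is an assumption about what a reasonable `context` string looks like. It only relies on the `transcribe(audio=..., language=..., context=...)` call and the `.text` attribute shown above.

```python
from typing import Iterable


def build_context(hotwords: Iterable[str]) -> str:
    """Join hotwords into one context string, dropping blanks and duplicates.

    Assumption: a space-separated list is an acceptable `context` value.
    """
    seen = []
    for word in hotwords:
        word = word.strip()
        if word and word not in seen:
            seen.append(word)
    return " ".join(seen)


def transcribe_with_hotwords(model, audio_path: str, hotwords: Iterable[str]) -> str:
    """Transcribe one file with contextual biasing and return the text."""
    results = model.transcribe(
        audio=audio_path,
        language="Chinese",
        context=build_context(hotwords),
    )
    return results[0].text
```

With the model from the quick start, a call might look like `transcribe_with_hotwords(model, "utterance.wav", ["台積電", "LoRA"])`.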

Benchmark results

Mixed Error Rate (MER%, lower is better). All numbers come from a single self-measured run under one protocol (see Evaluation). Columns: the two TEA-ASR models, the original (unadapted) Qwen3-ASR bases, and two references: Breeze-ASR-25 (a Taiwan-specialist ASR) and Whisper-large-v3. This model is the TEA-ASR-1-mini column.

Benchmark	TEA-ASR-1	TEA-ASR-1-mini	Qwen3-ASR-1.7B	Qwen3-ASR-0.6B	Breeze-ASR-25	Whisper-large-v3
CommonVoice 19 (zh-TW)	3.64	5.14	3.90	5.79	8.03	10.17
ASCEND (zh-en)	10.59	12.49	10.57	12.54	17.53	19.61
CSZS (zh-en)	10.98	13.21	11.03	16.03	12.18	23.24
NTUML2021	6.80	7.37	10.12	11.03	7.50	9.68

How to read this. TEA-ASR-1-mini is the efficient model on this page. Across the suite, TEA-ASR-1 posts the best error rate on three benchmarks and is effectively tied for best on ASCEND (10.59 vs 10.57 for Qwen3-ASR-1.7B), ahead of the Taiwan-specialist Breeze-ASR-25 and far ahead of Whisper-large-v3. TEA-ASR-1-mini delivers most of that quality at about 40% of the parameters (780M vs 2B). Against the unadapted Qwen3-ASR base, the gain in this content-folded recognition metric is largest on in-domain lectures (NTUML2021); on the other sets recognition is on par or better. Importantly, the metric folds away script differences (see Evaluation), so it does not reflect the decisive practical change: TEA-ASR emits Traditional script and Taiwan vocabulary natively, whereas the base produces Simplified script.
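To express the comparison with the unadapted base in relative terms, the short sketch below computes the relative MER reduction of TEA-ASR-1-mini over Qwen3-ASR-0.6B. The numbers are copied from the benchmark table; `relative_reduction` is an illustrative helper, not part of any released tooling.

```python
# MER% values copied from the benchmark table.
BASE_0_6B = {
    "CommonVoice 19 (zh-TW)": 5.79,
    "ASCEND (zh-en)": 12.54,
    "CSZS (zh-en)": 16.03,
    "NTUML2021": 11.03,
}
TEA_MINI = {
    "CommonVoice 19 (zh-TW)": 5.14,
    "ASCEND (zh-en)": 12.49,
    "CSZS (zh-en)": 13.21,
    "NTUML2021": 7.37,
}


def relative_reduction(base: float, adapted: float) -> float:
    """Relative error reduction in percent; positive means the adapted model is better."""
    return 100.0 * (base - adapted) / base


reductions = {
    name: relative_reduction(BASE_0_6B[name], TEA_MINI[name]) for name in BASE_0_6B
}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative MER reduction")
```

The same helper applies to the TEA-ASR-1 and Qwen3-ASR-1.7B columns by swapping in their values.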

Speed & memory

Measured on an NVIDIA RTX 5090 (32 GB) with bf16, batch 1, 50 utterances, and greedy decoding. xRT = audio seconds processed per wall-clock second (higher is faster); RTF = wall-clock / audio (lower is faster), so RTF is the reciprocal of xRT; peak VRAM is the maximum allocated during inference.

Model	Params	xRT ↑	RTF ↓	Peak VRAM (GB) ↓
TEA-ASR-1	2B	11.0	0.091	4.16
TEA-ASR-1-mini	780M	8.1	0.124	1.65
Breeze-ASR-25	1.54B	5.5	0.182	4.41
Whisper-large-v3	1.54B	4.7	0.214	4.41
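The two speed columns are reciprocals of each other, which the sketch below makes explicit. It is an illustration of the definitions given above, not the benchmarking script used for the table; `speed_metrics` and `timed` are hypothetical helper names, and the durations in the example are made up.

```python
import time
from typing import Callable, Tuple


def speed_metrics(audio_seconds: float, wall_seconds: float) -> Tuple[float, float]:
    """Return (xRT, RTF) for a run that processed `audio_seconds` of audio."""
    if audio_seconds <= 0 or wall_seconds <= 0:
        raise ValueError("durations must be positive")
    xrt = audio_seconds / wall_seconds  # audio seconds per wall-clock second
    rtf = wall_seconds / audio_seconds  # wall-clock seconds per audio second
    return xrt, rtf


def timed(run: Callable[[], object]) -> float:
    """Wall-clock seconds taken by one call to `run`."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


# Illustrative numbers: 110 s of audio transcribed in 10 s of wall-clock time.
xrt, rtf = speed_metrics(audio_seconds=110.0, wall_seconds=10.0)
print(f"xRT = {xrt:.1f}, RTF = {rtf:.3f}")
```

In practice `timed` would wrap the transcription call for each utterance, and the summed audio and wall-clock durations would be passed to `speed_metrics`.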

Figures

Figure: accuracy across the four public benchmarks (content-fold MER%, lower is better).

Figure: speed and memory (single GPU, bf16, batch 1).

Figure: ablation, tokenizer × finetune. Content MER isolates the finetune gain (the script fold hides tokenizer effects); raw MER isolates the tokenizer-first localization that makes the output Traditional + Taiwan-lexicon.

Evaluation

Metric — Mixed Error Rate (MER). Character Error Rate for Chinese and Word Error Rate for the English tokens, computed jointly per utterance and micro-averaged.
Content fold (applied uniformly to every dataset and every system). Before scoring, both the reference and the hypothesis are normalized to a common form — converted to Simplified Chinese with OpenCC (t2s), lowercased, and stripped of punctuation. This isolates recognition from script style, so a Simplified-output model (e.g. the base) and a Traditional-output model (TEA-ASR) are compared fairly on content. (TEA-ASR's actual output is Traditional; the fold is only for scoring.)
Decoding. TEA-ASR and Qwen3-ASR are decoded with language=Chinese; Whisper-large-v3 and Breeze-ASR-25 use their own automatic language detection. All systems are scored with the same code on the same public splits; we do not import numbers reported elsewhere.
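The metric and the content fold can be sketched in a few lines. This is one common reading of MER (each Chinese character and each English word counts as one token, with a single edit distance over the mixed sequence), not the exact scoring code used for the tables. All function names are hypothetical, and the Simplified-Chinese converter is injected as a callable so the sketch runs without OpenCC installed.

```python
import re
from typing import Callable, Iterable, List, Tuple

# One token per CJK character (basic block + Extension A) or per lowercase
# English word. Everything else, including punctuation, is dropped.
_TOKEN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]|[a-z0-9']+")


def content_fold(text: str, to_simplified: Callable[[str], str]) -> str:
    """Fold script with the injected converter, then lowercase."""
    return to_simplified(text).lower()


def tokenize(folded: str) -> List[str]:
    """Split folded text into Chinese characters and English words."""
    return _TOKEN.findall(folded)


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def mixed_error_rate(
    pairs: Iterable[Tuple[str, str]],
    to_simplified: Callable[[str], str] = lambda s: s,
) -> float:
    """Micro-averaged MER in percent over (reference, hypothesis) pairs."""
    errors = 0
    ref_tokens = 0
    for ref, hyp in pairs:
        r = tokenize(content_fold(ref, to_simplified))
        h = tokenize(content_fold(hyp, to_simplified))
        errors += edit_distance(r, h)
        ref_tokens += len(r)
    if ref_tokens == 0:
        raise ValueError("references contain no scorable tokens")
    return 100.0 * errors / ref_tokens
```

To apply the script fold described above, pass an OpenCC `t2s` converter's `convert` method as `to_simplified`; the default identity function leaves the script unchanged. Micro-averaging here means total edit errors divided by total reference tokens across all utterances, rather than a mean of per-utterance rates.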

Dataset	What it tests	Eval split (n)
CommonVoice 19 (zh-TW)	Read Taiwan-Mandarin speech	full test (5013)
ASCEND	Spontaneous Mandarin–English code-switch conversation	full test (1315)
CSZS (zh-en)	Zero-resource code-switch benchmark	full test (3176)
NTUML2021	Mandarin lecture speech (university ML course)	test[:2000]

No train/test leakage. Fine-tuning used only training pools that are disjoint from every evaluation split: the NTUML2021 train split, the ASCEND train split, and a CommonVoice slice drawn from validated_without_test (CommonVoice's official non-test pool, disjoint from its test split). Evaluation therefore runs on untouched test data: the full CommonVoice and ASCEND test splits and the first 2,000 utterances of the NTUML2021 test split (test[:2000]). CSZS is a separate dataset not used in training at all. Every number above is leak-free.

How it was built

Base: Qwen/Qwen3-ASR-0.6B (frozen AuT audio encoder + Qwen3 decoder).
Adaptation: a rank-16 decoder-only LoRA trained on a few hours of public audio (CommonVoice zh-TW, ASCEND, NTUML2021), with general + code-switch replay to preserve the base model's broad and bilingual ability. The audio encoder is left frozen.
Localization: Traditional-script + Taiwan-lexicon output is rendered through the model's own tokenizer. The surface mapping is baked in once at build time, so the Traditional output comes straight from the tokenizer decode with no post-processing at inference.
Packaging: the adapter is merged into the base and the localized tokenizer is shipped with it, so the release is a single drop-in checkpoint that loads like stock Qwen3-ASR.
Decoding tip: pass language="Chinese" for Taiwan speech; this also prevents translation-style outputs on dense code-switch.
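The merge step in the packaging note follows the standard LoRA identity: a weight W with adapter factors B and A can be replaced by W + (alpha / r)·B·A without changing the layer's output. The toy sketch below checks that identity with plain Python lists. The rank-1 matrices and the value of `alpha` are illustrative assumptions (the release uses rank 16, and its alpha is not stated here), and `merge_lora` is a hypothetical helper rather than the actual build script.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank (rows of A)."""
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy layer: 3 inputs, 2 outputs, rank-1 adapter, alpha = 2 (all illustrative).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.0, -0.5]]
x = [[1.0], [2.0], [3.0]]

merged = merge_lora(W, B, A, alpha=2.0)

# Output through the merged weight vs. base path plus scaled adapter path.
y_merged = matmul(merged, x)
y_base = matmul(W, x)
y_adapter = matmul(B, matmul(A, x))
y_split = [[b[0] + 2.0 * a[0]] for b, a in zip(y_base, y_adapter)]
print(y_merged, y_split)
```

Because the two paths agree, the merged checkpoint needs no adapter library at inference, which is what allows it to load like the stock base model.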

Limitations

Dense synthetic code-switch (CSZS): the smaller TEA-ASR-1-mini trails the Taiwan specialist on this set; the flagship TEA-ASR-1 leads it. For heavy code-switch, prefer TEA-ASR-1.
Scope: validated on the Qwen3-ASR family (0.6B and 1.7B); the released models load via the qwen-asr package, exactly like the base.

Citation

@misc{teaasr2026,
  title  = {Tokenizer-First Adaptation of Mandarin ASR to Taiwan Mandarin},
  author = {TEA-ASR contributors},
  year   = {2026},
  note   = {TEA-ASR (Taiwan Everyday Audio); adapted from Qwen3-ASR}
}

Built on Qwen3-ASR (Apache-2.0). The TEA-ASR adaptation and this checkpoint are released under the MIT License; the underlying Qwen3-ASR weights remain subject to the Apache-2.0 license and its attribution/NOTICE terms.