TEA-ASR-1-mini · Taiwan Everyday Audio 🍵

TEA-ASR is an open, drop-in speech-recognition model purpose-built for Taiwan Mandarin. It turns real speech into natural Traditional Chinese with authentic Taiwan vocabulary, and it stays robust through the everyday Mandarin–English code-switching common in Taiwan. Adapted from the Qwen3-ASR foundation and merged into a single self-contained checkpoint, TEA-ASR loads and runs exactly like stock Qwen3-ASR, with no converters and no post-processing. In our measurements, the flagship TEA-ASR-1 matches or surpasses both a dedicated Taiwan specialist (Breeze-ASR-25) and a large multilingual model (Whisper-large-v3) on every public benchmark we evaluate; TEA-ASR-1-mini does so on three of the four and trails the specialist only on CSZS.

TEA-ASR-1-mini is the 780M compact model (best accuracy-per-parameter). A companion TEA-ASR-1 shares the identical recipe — see JacobLinCool/TEA-ASR-1.

Key features

  • 🎯 Built for Taiwan Mandarin — Traditional script and Taiwan-style word choice, produced by the model itself.
  • 🔀 Code-switch robust — handles natural zh-en mixing instead of translating Mandarin into English.
  • 🧩 Drop-in Qwen3-ASR compatible — same loading and inference API as the base model; nothing else to install or call.
  • 🪶 Lightweight adaptation — a small decoder LoRA on a frozen audio encoder, trained on a few hours of public audio, then merged for deployment.

Quick start

pip install qwen-asr

from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained("JacobLinCool/TEA-ASR-1-mini")
result = model.transcribe(audio="utterance.wav", language="Chinese")[0]
print(result.text)   # -> Traditional Chinese with Taiwan lexicon

Set language="Chinese" for Taiwan speech (recommended). You can also pass a context= string of hotwords (names, jargon) for contextual biasing, exactly as with the base Qwen3-ASR.
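As a minimal sketch of contextual biasing, the helper below joins a list of hotwords into one context string and forwards it to transcribe. The helper name transcribe_with_hotwords, the space-separated join, and the sample hotwords are assumptions for illustration; only the transcribe(audio=..., language=..., context=...) call and result.text come from the usage above.

```python
from typing import Any, Sequence


def transcribe_with_hotwords(model: Any, audio_path: str, hotwords: Sequence[str]) -> str:
    """Transcribe one file, biasing recognition toward the given hotwords.

    `model` is expected to be a loaded Qwen3ASRModel (see the quick start).
    Joining hotwords with spaces is an assumption of this sketch; `context=`
    is described above only as a free-form string of hotwords.
    """
    # Drop empty entries and surrounding whitespace before joining.
    context = " ".join(word.strip() for word in hotwords if word.strip())
    results = model.transcribe(audio=audio_path, language="Chinese", context=context)
    return results[0].text


# Usage (requires the downloaded checkpoint, so it is left commented out):
# from qwen_asr import Qwen3ASRModel
# model = Qwen3ASRModel.from_pretrained("JacobLinCool/TEA-ASR-1-mini")
# print(transcribe_with_hotwords(model, "utterance.wav", ["台積電", "LoRA", "NTUML"]))
```

Keeping the hotword list short and specific (names, product terms, course jargon) is a reasonable starting point; how strongly the context steers the output depends on the base model's behavior.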

Benchmark results

Mixed Error Rate (MER%, lower is better), all numbers from a single self-measured run under one protocol (see Evaluation). Columns: the two TEA-ASR models, the original (unadapted) Qwen3-ASR bases, and two references — Breeze-ASR-25 (a Taiwan-specialist ASR) and Whisper-large-v3. Bold = this model.

| Benchmark | TEA-ASR-1 | **TEA-ASR-1-mini** | Qwen3-ASR-1.7B | Qwen3-ASR-0.6B | Breeze-ASR-25 | Whisper-large-v3 |
| --- | --- | --- | --- | --- | --- | --- |
| CommonVoice 19 (zh-TW) | 3.64 | **5.14** | 3.90 | 5.79 | 8.03 | 10.17 |
| ASCEND (zh-en) | 10.59 | **12.49** | 10.57 | 12.54 | 17.53 | 19.61 |
| CSZS (zh-en) | 10.98 | **13.21** | 11.03 | 16.03 | 12.18 | 23.24 |
| NTUML2021 | 6.80 | **7.37** | 10.12 | 11.03 | 7.50 | 9.68 |

How to read this. TEA-ASR-1-mini is the efficient model on this page. TEA-ASR-1 posts the lowest error rate on CommonVoice, CSZS, and NTUML2021 and sits within 0.02 points of the best on ASCEND (10.59 vs 10.57 for Qwen3-ASR-1.7B); on every set it is ahead of the Taiwan-specialist Breeze-ASR-25 and far ahead of Whisper-large-v3. TEA-ASR-1-mini delivers most of that quality at well under half the parameters (780M vs 2B). Against the unadapted Qwen3-ASR bases, the gain in this content-folded recognition metric is largest on in-domain lectures (NTUML2021); on the other sets recognition is on par or better. Importantly, the metric folds away script differences (see Evaluation), so it does not reflect the decisive practical change: TEA-ASR emits Traditional script and Taiwan vocabulary natively, whereas the base produces Simplified script.
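The size of the adaptation gain is easier to compare across benchmarks as a relative MER reduction. The snippet below is a small sketch that recomputes it for TEA-ASR-1-mini against its Qwen3-ASR-0.6B base, using only numbers copied from the table above; the function name relative_reduction is illustrative.

```python
# MER% copied from the benchmark table: (Qwen3-ASR-0.6B base, TEA-ASR-1-mini).
MER = {
    "CommonVoice 19 (zh-TW)": (5.79, 5.14),
    "ASCEND (zh-en)": (12.54, 12.49),
    "CSZS (zh-en)": (16.03, 13.21),
    "NTUML2021": (11.03, 7.37),
}


def relative_reduction(base: float, adapted: float) -> float:
    """Relative error reduction in percent; positive means the adapted model is better."""
    return 100.0 * (base - adapted) / base


for name, (base, adapted) in MER.items():
    print(f"{name}: {relative_reduction(base, adapted):.1f}% relative MER reduction")
```

On these figures the reduction is roughly 33% on NTUML2021 and roughly 18% on CSZS, while ASCEND is essentially unchanged, which matches the reading given above.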

Speed & memory

Measured on an NVIDIA RTX 5090 (32 GB) with bf16, batch size 1, and greedy decoding, over 50 utterances. xRT = audio seconds processed per wall-clock second (higher is faster); RTF = wall-clock seconds per audio second (lower is faster), so RTF = 1 / xRT; peak VRAM is the maximum allocated during inference.

| Model | Params | xRT ↑ | RTF ↓ | Peak VRAM (GB) ↓ |
| --- | --- | --- | --- | --- |
| TEA-ASR-1 | 2B | 11.0 | 0.091 | 4.16 |
| TEA-ASR-1-mini | 780M | 8.1 | 0.124 | 1.65 |
| Breeze-ASR-25 | 1.54B | 5.5 | 0.182 | 4.41 |
| Whisper-large-v3 | 1.54B | 4.7 | 0.214 | 4.41 |
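The two speed columns are reciprocals of each other. The sketch below spells out the arithmetic; the function names and the timing inputs are hypothetical and chosen only to illustrate the definitions, not taken from the measurement run.

```python
def rtf(wall_clock_s: float, audio_s: float) -> float:
    """Real-time factor: seconds of compute per second of audio (lower is faster)."""
    return wall_clock_s / audio_s


def xrt(wall_clock_s: float, audio_s: float) -> float:
    """Throughput: seconds of audio processed per second of compute (higher is faster)."""
    return audio_s / wall_clock_s


# Hypothetical timing: 100 s of audio transcribed in 12.4 s of wall-clock time.
wall_clock, audio = 12.4, 100.0
print(f"RTF = {rtf(wall_clock, audio):.3f}, xRT = {xrt(wall_clock, audio):.1f}")
```

Because each column is rounded independently, recomputing one from the other in the table can differ in the last digit.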

Figures

Accuracy across the four public benchmarks (content-fold MER%, lower is better):

[Figure: Accuracy across benchmarks]

Speed and memory (single GPU, bf16, batch 1):

[Figure: Speed and memory]

Ablation — tokenizer × finetune. Content MER isolates the finetune gain (the script fold hides tokenizer effects); raw MER isolates the tokenizer-first localization that makes the output Traditional + Taiwan-lexicon:

[Figure: Ablation]

Evaluation

  • Metric — Mixed Error Rate (MER). Character Error Rate for Chinese and Word Error Rate for the English tokens, computed jointly per utterance and micro-averaged.
  • Content fold (applied uniformly to every dataset and every system). Before scoring, both the reference and the hypothesis are normalized to a common form — converted to Simplified Chinese with OpenCC (t2s), lowercased, and stripped of punctuation. This isolates recognition from script style, so a Simplified-output model (e.g. the base) and a Traditional-output model (TEA-ASR) are compared fairly on content. (TEA-ASR's actual output is Traditional; the fold is only for scoring.)
  • Decoding. TEA-ASR and Qwen3-ASR are decoded with language=Chinese; Whisper-large-v3 and Breeze-ASR-25 use their own automatic language detection. All systems are scored with the same code on the same public splits; we do not import numbers reported elsewhere.

| Dataset | What it tests | Eval split (n) |
| --- | --- | --- |
| CommonVoice 19 (zh-TW) | Read Taiwan-Mandarin speech | full test (5013) |
| ASCEND | Spontaneous Mandarin–English code-switch conversation | full test (1315) |
| CSZS (zh-en) | Zero-resource code-switch benchmark | full test (3176) |
| NTUML2021 | Mandarin lecture speech (university ML course) | test[:2000] |

  • No train/test leakage. Fine-tuning used only the training pools, disjoint from every evaluation split: the NTUML2021 train split, the ASCEND train split, and a CommonVoice slice drawn from validated_without_test (CommonVoice's official non-test pool, disjoint from its test split). Evaluation therefore runs on untouched test data: the full CommonVoice and ASCEND test splits and the first 2,000 examples of the NTUML2021 test split (test[:2000]). CSZS is a separate dataset not used in training at all. Every number above is leak-free.
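The content fold and the micro-averaged MER described above can be sketched in a few lines of dependency-free Python. This is an illustrative reading of the protocol, not the project's scoring code: the function names, the CJK character range, and the word pattern are assumptions, and the to_simplified argument is a stand-in for the OpenCC t2s conversion (for example, the convert method of an OpenCC converter).

```python
import re
from typing import Callable, Iterable, List, Tuple

# One token per CJK character, one token per run of ASCII letters, digits, or apostrophes.
# Assumption: the basic CJK block is enough for this illustration.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[a-z0-9']+")


def fold(text: str, to_simplified: Callable[[str], str] = lambda s: s) -> List[str]:
    """Content fold: script conversion, lowercasing, punctuation removal, tokenization."""
    text = to_simplified(text).lower()
    return _TOKEN.findall(text)  # unmatched characters (punctuation, spaces) are dropped


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over tokens (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def mixed_error_rate(pairs: Iterable[Tuple[str, str]],
                     to_simplified: Callable[[str], str] = lambda s: s) -> float:
    """Micro-averaged MER in percent: total edits / total reference tokens."""
    errors = tokens = 0
    for reference, hypothesis in pairs:
        ref, hyp = fold(reference, to_simplified), fold(hypothesis, to_simplified)
        errors += edit_distance(ref, hyp)
        tokens += len(ref)
    return 100.0 * errors / tokens if tokens else 0.0


# Toy stand-in for OpenCC t2s, covering only the characters used below.
toy_t2s = str.maketrans({"們": "们", "氣": "气"})
convert = lambda s: s.translate(toy_t2s)

pairs = [
    ("我們用 LoRA 做 fine-tuning。", "我们用 lora 做 fine tuning"),  # same content, different script
    ("今天天氣很好", "今天天氣很號"),                                  # one character substituted
]
print(f"MER = {mixed_error_rate(pairs, convert):.2f}%")
```

In the toy pairs, the first utterance scores zero errors once both sides are folded, which is the point of the fold: a Simplified-output system is not penalized for script alone. Micro-averaging sums edits and reference tokens over all utterances before dividing, so long utterances weigh more than short ones.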

How it was built

  • Base: Qwen/Qwen3-ASR-0.6B (frozen AuT audio encoder + Qwen3 decoder).
  • Adaptation: a rank-16 decoder-only LoRA trained on a few hours of public audio (CommonVoice zh-TW, ASCEND, NTUML2021), with general + code-switch replay to preserve the base model's broad and bilingual ability. The audio encoder is left frozen.
  • Localization: Traditional-script, Taiwan-lexicon output is rendered by the model's own tokenizer. The surface mapping is baked in once at build time, so there is no post-processing at inference; the Traditional output comes straight from the tokenizer's decode step.
  • Packaging: the adapter is merged into the base and the localized tokenizer is shipped with it, so the release is a single drop-in checkpoint that loads like stock Qwen3-ASR.
  • Decoding tip: pass language="Chinese" for Taiwan speech; this also prevents translation-style outputs on dense code-switch.
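One way to picture the tokenizer-first localization described above is as a one-time rewrite of the tokenizer's decode table. The toy below is an illustrative sketch only: the four-entry vocabulary, the SURFACE_MAP entries, and the function names are hypothetical, and the real Qwen3 tokenizer works on byte-level BPE pieces rather than whole words, so the actual build step is likely more involved.

```python
from typing import Dict, List

# Hypothetical base vocabulary: token id -> surface string (Simplified, mainland lexicon).
BASE_VOCAB: Dict[int, str] = {0: "这个", 1: "软件", 2: "很", 3: "好用"}

# Hypothetical build-time surface mapping, covering both script (这个 -> 這個)
# and lexicon (软件 -> 軟體).
SURFACE_MAP: Dict[str, str] = {"这个": "這個", "软件": "軟體"}


def localize_vocab(vocab: Dict[int, str], surface_map: Dict[str, str]) -> Dict[int, str]:
    """Build step, run once: rewrite each token's surface form; token ids are unchanged."""
    return {token_id: surface_map.get(surface, surface) for token_id, surface in vocab.items()}


def decode(token_ids: List[int], vocab: Dict[int, str]) -> str:
    """Inference-time decode: a plain table lookup, with no post-processing pass."""
    return "".join(vocab[i] for i in token_ids)


LOCALIZED_VOCAB = localize_vocab(BASE_VOCAB, SURFACE_MAP)
ids = [0, 1, 2, 3]  # the same ids decode differently under each table
print(decode(ids, BASE_VOCAB))       # 这个软件很好用
print(decode(ids, LOCALIZED_VOCAB))  # 這個軟體很好用
```

The design point this illustrates is that the mapping cost is paid once when the checkpoint is built, so inference code stays identical to stock decoding.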

Limitations

  • Dense synthetic code-switch (CSZS): the smaller TEA-ASR-1-mini trails the Taiwan specialist Breeze-ASR-25 on this set (13.21 vs 12.18 MER%), while the flagship TEA-ASR-1 leads it (10.98). For heavy code-switch, prefer TEA-ASR-1.
  • Scope: validated on the Qwen3-ASR family (0.6B and 1.7B); the released models load via the qwen-asr package, exactly like the base.

Citation

@misc{teaasr2026,
  title  = {Tokenizer-First Adaptation of Mandarin ASR to Taiwan Mandarin},
  author = {TEA-ASR contributors},
  year   = {2026},
  note   = {TEA-ASR (Taiwan Everyday Audio); adapted from Qwen3-ASR}
}

Built on Qwen3-ASR (Apache-2.0). The TEA-ASR adaptation and this checkpoint are released under the MIT License; the underlying Qwen3-ASR weights remain subject to the Apache-2.0 license and its attribution/NOTICE terms.
