---
base_model: Qwen/Qwen3-ASR-1.7B
license: mit
library_name: transformers
pipeline_tag: automatic-speech-recognition
language:
- zh
- en
tags:
- automatic-speech-recognition
- taiwan-mandarin
- traditional-chinese
- code-switching
- qwen3-asr
- speech
---

# TEA-ASR-1 · Taiwan Everyday Audio 🍵

**TEA-ASR is an open, drop-in speech-recognition model purpose-built for Taiwan Mandarin.** It turns real speech into natural **Traditional Chinese** with authentic **Taiwan vocabulary**, and it stays robust through the everyday **Mandarin–English code-switching** common in Taiwan.

Adapted from the state-of-the-art **Qwen3-ASR** foundation and merged into a single self-contained checkpoint, TEA-ASR **loads and runs exactly like stock Qwen3-ASR** (no converters, no post-processing) while matching or surpassing both a dedicated Taiwan specialist and a large multilingual model on every public benchmark we evaluate.

`TEA-ASR-1` is the **2B flagship (best accuracy)**. A companion **TEA-ASR-1-mini** shares the identical recipe; see [`JacobLinCool/TEA-ASR-1-mini`](https://huggingface.co/JacobLinCool/TEA-ASR-1-mini).

## Key features

- 🎯 **Built for Taiwan Mandarin**: Traditional script **and** Taiwan-style word choice, produced by the model itself.
- 🔀 **Code-switch robust**: handles natural zh-en mixing instead of translating Mandarin into English.
- 🧩 **Drop-in Qwen3-ASR compatible**: same loading and inference API as the base model; nothing else to install or call.
- 🪶 **Lightweight adaptation**: a small decoder LoRA on a frozen audio encoder, trained on a few hours of public audio, then merged for deployment.

## Quick start

```bash
pip install qwen-asr
```

```python
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained("JacobLinCool/TEA-ASR-1")
result = model.transcribe(audio="utterance.wav", language="Chinese")[0]
print(result.text)  # -> Traditional Chinese with Taiwan lexicon
```

Set `language="Chinese"` for Taiwan speech (recommended).
You can also pass a `context=` string of hotwords (names, jargon) for contextual biasing, exactly as with the base Qwen3-ASR.

## Benchmark results

Mixed Error Rate (MER%, **lower is better**), all numbers from a **single self-measured run under one protocol** (see [Evaluation](#evaluation)). Columns: the two TEA-ASR models, the original (unadapted) **Qwen3-ASR** bases, and two references, **Breeze-ASR-25** (a Taiwan-specialist ASR) and **Whisper-large-v3**. **Bold = this model.**

| Benchmark | TEA-ASR-1 | TEA-ASR-1-mini | Qwen3-ASR-1.7B | Qwen3-ASR-0.6B | Breeze-ASR-25 | Whisper-large-v3 |
|---|---|---|---|---|---|---|
| CommonVoice 19 (zh-TW) | **3.64** | 5.14 | 3.90 | 5.79 | 8.03 | 10.17 |
| ASCEND (zh-en) | **10.59** | 12.49 | 10.57 | 12.54 | 17.53 | 19.61 |
| CSZS (zh-en) | **10.98** | 13.21 | 11.03 | 16.03 | 12.18 | 23.24 |
| NTUML2021 | **6.80** | 7.37 | 10.12 | 11.03 | 7.50 | 9.68 |

**How to read this.** **TEA-ASR-1** is the flagship model on this page. It posts the lowest error rate on three of the four benchmarks and is within 0.02 MER of the unadapted Qwen3-ASR-1.7B on ASCEND (10.59 vs 10.57). On every set it is ahead of the Taiwan-specialist Breeze-ASR-25 and far ahead of Whisper-large-v3. **TEA-ASR-1-mini** delivers most of that quality at roughly 40% of the parameters (780M vs 2B).

Against the unadapted **Qwen3-ASR** base, the gain in this content-folded recognition metric is largest on in-domain lectures (NTUML2021); on the other sets recognition is on par or better. Importantly, the metric **folds away script differences** (see Evaluation), so it does *not* reflect the decisive practical change: TEA-ASR emits **Traditional script and Taiwan vocabulary natively**, whereas the base produces Simplified script.

## Speed & memory

Measured on **NVIDIA RTX 5090 (32 GB)** (bf16, batch 1, 50 utterances, greedy decode). **xRT = audio seconds processed per wall-clock second** (higher is faster); **RTF = wall-clock / audio** (lower is faster); **peak VRAM** is the maximum allocated during inference.

| Model | Params | xRT ↑ | RTF ↓ | Peak VRAM (GB) ↓ |
|---|---|---|---|---|
| TEA-ASR-1 | 2B | 11.0 | 0.091 | 4.16 |
| TEA-ASR-1-mini | 780M | 8.1 | 0.124 | 1.65 |
| Breeze-ASR-25 | 1.54B | 5.5 | 0.182 | 4.41 |
| Whisper-large-v3 | 1.54B | 4.7 | 0.214 | 4.41 |

## Figures

**Accuracy across the four public benchmarks** (content-fold MER%, lower is better):

![Accuracy across benchmarks](bench_mer.png)

**Speed and memory** (single GPU, bf16, batch 1):

![Speed and memory](bench_speed_vram.png)

**Ablation: tokenizer × finetune.** Content MER isolates the *finetune* gain (the script fold hides tokenizer effects); raw MER isolates the *tokenizer-first* localization that makes the output Traditional + Taiwan-lexicon:

![Ablation](ablation_2x2.png)

## Evaluation

- **Metric: Mixed Error Rate (MER).** Character Error Rate for Chinese and Word Error Rate for the English tokens, computed jointly per utterance and micro-averaged.
- **Content fold (applied uniformly to every dataset and every system).** Before scoring, both the reference and the hypothesis are normalized to a common form: **converted to Simplified Chinese with OpenCC (`t2s`)**, lowercased, and stripped of punctuation. This isolates *recognition* from *script style*, so a Simplified-output model (e.g. the base) and a Traditional-output model (TEA-ASR) are compared fairly on content. (TEA-ASR's actual output is Traditional; the fold is only for scoring.)
- **Decoding.** TEA-ASR and Qwen3-ASR are decoded with `language=Chinese`; Whisper-large-v3 and Breeze-ASR-25 use their own automatic language detection. All systems are scored with the **same code on the same public splits**; we do not import numbers reported elsewhere.

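The metric and content fold described above can be sketched in a few lines. This is an illustrative approximation, not the scoring code used for the tables: `TOY_T2S` is a tiny hand-written stand-in for OpenCC's `t2s` conversion, the helper names (`fold`, `tokenize`, `edit_distance`, `mer`) are hypothetical, and only the basic CJK Unified Ideographs block is treated as Chinese.

```python
import re
import unicodedata

# Assumption: a toy Traditional -> Simplified table standing in for OpenCC (t2s).
TOY_T2S = str.maketrans({"學": "学", "習": "习", "機": "机"})


def fold(text):
    """Content fold: Simplified script, lowercase, punctuation replaced by spaces."""
    text = text.translate(TOY_T2S).lower()
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )


def tokenize(text):
    """One token per Chinese character, one token per Latin-script word."""
    return re.findall(r"[\u4e00-\u9fff]|[a-z0-9]+", fold(text))


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def mer(pairs):
    """Micro-averaged MER%: total edits over total reference tokens."""
    errors = total = 0
    for ref, hyp in pairs:
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    return 100.0 * errors / total


pairs = [
    ("我在學習 machine learning。", "我在学习 machine learning"),  # script differs only
    ("機器學習", "机器学"),                                        # one character dropped
]
print(round(mer(pairs), 2))  # 10.0
```

The first pair scores zero errors because the fold removes the script and punctuation differences; the second contributes one deletion, giving 1 error over 10 reference tokens.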
| Dataset | What it tests | Eval split (n) |
|---|---|---|
| **CommonVoice 19 (zh-TW)** | Read Taiwan-Mandarin speech | full test (5013) |
| **ASCEND** | Spontaneous Mandarin–English code-switch conversation | full test (1315) |
| **CSZS (zh-en)** | Zero-resource code-switch benchmark | full test (3176) |
| **NTUML2021** | Mandarin lecture speech (university ML course) | test[:2000] |

- **No train/test leakage.** Fine-tuning used **only** the *training* pools, disjoint from every evaluation split: the NTUML2021 *train* split, the ASCEND *train* split, and a CommonVoice slice drawn from `validated_without_test` (CommonVoice's official non-test pool, disjoint from its *test* split). Evaluation therefore runs on **untouched** test data: the full CommonVoice and ASCEND *test* splits and the first 2000 utterances of the NTUML2021 *test* split. CSZS is a separate dataset not used in training at all. Every number above is leak-free.

## How it was built

- **Base**: `Qwen/Qwen3-ASR-1.7B` (frozen AuT audio encoder + Qwen3 decoder).
- **Adaptation**: a rank-16 **decoder-only LoRA** trained on **a few hours of public audio** (CommonVoice zh-TW, ASCEND, NTUML2021), with general + code-switch **replay** to preserve the base model's broad and bilingual ability. The audio encoder is left frozen.
- **Localization**: Traditional-script + Taiwan-lexicon output is rendered through the model's **own tokenizer** (the surface mapping is baked once at build time), so there is **no post-processing at inference**: the Traditional text comes straight out of the tokenizer decode.
- **Packaging**: the adapter is **merged** into the base and the localized tokenizer is shipped with it, so the release is a single drop-in checkpoint that loads like stock Qwen3-ASR.
- **Decoding tip**: pass `language="Chinese"` for Taiwan speech; this also prevents translation-style outputs on dense code-switch.

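To make the "merged" packaging step concrete, here is a minimal sketch of the standard LoRA merge, W' = W + (alpha / r) · B · A, on toy matrices. It is not the build script for this checkpoint: the `alpha` value, the matrix shapes, and the helper names `matmul` and `merge_lora` are assumptions for illustration (the card only states that the adapter is rank 16 and decoder-only), and a rank of 1 is used here to keep the numbers readable.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into a dense weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x3 weight with a rank-1 adapter (the released model uses rank 16).
weight = [[1, 0, 0], [0, 1, 0]]
lora_b = [[1], [2]]        # shape (out_features, rank)
lora_a = [[0.5, 0, 1]]     # shape (rank, in_features)

merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, 0.0, 2.0], [2.0, 1.0, 4.0]]

# The merged weight gives the same output as base path + scaled adapter path.
x = [[1], [2], [3]]
base_out = matmul(weight, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
combined = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]
print(matmul(merged, x) == combined)  # True
```

Because the update is folded into the dense weight, inference needs no adapter code path, which is why a merged checkpoint can load like the base model.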
## Limitations

- **Dense synthetic code-switch (CSZS)**: the smaller TEA-ASR-1-mini trails the Taiwan specialist on this set; the flagship TEA-ASR-1 leads it. For heavy code-switch, prefer TEA-ASR-1.
- **Scope**: validated on the Qwen3-ASR family (0.6B and 1.7B); the released models load via the `qwen-asr` package, exactly like the base.

## Citation

```bibtex
@misc{teaasr2026,
  title  = {Tokenizer-First Adaptation of Mandarin ASR to Taiwan Mandarin},
  author = {TEA-ASR contributors},
  year   = {2026},
  note   = {TEA-ASR (Taiwan Everyday Audio); adapted from Qwen3-ASR}
}
```

Built on [Qwen3-ASR](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) (Apache-2.0). The TEA-ASR adaptation and this checkpoint are released under the **MIT License**; the underlying Qwen3-ASR weights remain subject to the Apache-2.0 license and its attribution/NOTICE terms.