---
base_model: Qwen/Qwen3-ASR-1.7B
license: mit
library_name: transformers
pipeline_tag: automatic-speech-recognition
language:
  - zh
  - en
tags:
  - automatic-speech-recognition
  - taiwan-mandarin
  - traditional-chinese
  - code-switching
  - qwen3-asr
  - speech
---

# TEA-ASR-1 · Taiwan Everyday Audio 🍵

**TEA-ASR is an open, drop-in speech-recognition model purpose-built for Taiwan Mandarin.** It turns real speech
into natural **Traditional Chinese** with authentic **Taiwan vocabulary**, and it
stays robust through the everyday **Mandarin–English code-switching** common in Taiwan. Adapted from the
state-of-the-art **Qwen3-ASR** foundation and merged into a single self-contained checkpoint, TEA-ASR **loads and
runs exactly like stock Qwen3-ASR** (no converters, no post-processing) while matching or surpassing both a
dedicated Taiwan specialist and a large multilingual model on every public benchmark we evaluate.

`TEA-ASR-1` is the **2B flagship (best accuracy)**.
A companion **TEA-ASR-1-mini** shares the identical recipe; see [`JacobLinCool/TEA-ASR-1-mini`](https://huggingface.co/JacobLinCool/TEA-ASR-1-mini).

## Key features

- 🎯 **Built for Taiwan Mandarin**: Traditional script **and** Taiwan-style word choice, produced by the model
  itself.
- 🔀 **Code-switch robust**: handles natural zh-en mixing instead of translating Mandarin into English.
- 🧩 **Drop-in Qwen3-ASR compatible**: same loading and inference API as the base model; nothing else to install
  or call.
- 🪶 **Lightweight adaptation**: a small decoder LoRA on a frozen audio encoder, trained on a few hours of public
  audio, then merged for deployment.

## Quick start

```bash
pip install qwen-asr
```

```python
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained("JacobLinCool/TEA-ASR-1")
result = model.transcribe(audio="utterance.wav", language="Chinese")[0]
print(result.text)  # -> Traditional Chinese with Taiwan lexicon
```

Set `language="Chinese"` for Taiwan speech (recommended). You can also pass a `context=` string of hotwords
(names, jargon) for contextual biasing, exactly as with the base Qwen3-ASR.

## Benchmark results

Mixed Error Rate (MER%, **lower is better**), all numbers from a **single self-measured run under one protocol**
(see [Evaluation](#evaluation)). Columns: the two TEA-ASR models, the original (unadapted) **Qwen3-ASR** bases, and
two references: **Breeze-ASR-25** (a Taiwan-specialist ASR) and **Whisper-large-v3**. **Bold = this model.**

| Benchmark | TEA-ASR-1 | TEA-ASR-1-mini | Qwen3-ASR-1.7B | Qwen3-ASR-0.6B | Breeze-ASR-25 | Whisper-large-v3 |
|---|---|---|---|---|---|---|
| CommonVoice 19 (zh-TW) | **3.64** | 5.14 | 3.90 | 5.79 | 8.03 | 10.17 |
| ASCEND (zh-en) | **10.59** | 12.49 | 10.57 | 12.54 | 17.53 | 19.61 |
| CSZS (zh-en) | **10.98** | 13.21 | 11.03 | 16.03 | 12.18 | 23.24 |
| NTUML2021 | **6.80** | 7.37 | 10.12 | 11.03 | 7.50 | 9.68 |

**How to read this.** **TEA-ASR-1** is the flagship model on this page.
It posts the lowest error rate on three of the four benchmarks and is within 0.02 MER points of the best on the
fourth (ASCEND: 10.59 vs 10.57 for the unadapted Qwen3-ASR-1.7B). It is ahead of the Taiwan-specialist
Breeze-ASR-25 and far ahead of Whisper-large-v3 on every set; **TEA-ASR-1-mini** delivers most of that quality
at well under half the parameters (780M vs 2B). Against the unadapted **Qwen3-ASR** base, the gain in this
content-folded recognition metric is largest on in-domain lectures (NTUML2021); on the other sets recognition is
on par (within 0.1 points) or better. Importantly, the metric **folds away script differences** (see Evaluation),
so it does *not* reflect the decisive practical change: TEA-ASR emits **Traditional script and Taiwan vocabulary
natively**, whereas the base produces Simplified script.

## Speed & memory

Measured on an **NVIDIA RTX 5090 (32 GB)** (bf16, batch 1, 50 utterances, greedy decode). **xRT = audio seconds processed per
wall-clock second** (higher is faster); **RTF = wall-clock / audio** (lower is faster); **peak VRAM** is the maximum
allocated during inference.

| Model | Params | xRT ↑ | RTF ↓ | Peak VRAM (GB) ↓ |
|---|---|---|---|---|
| TEA-ASR-1 | 2B | 11.0 | 0.091 | 4.16 |
| TEA-ASR-1-mini | 780M | 8.1 | 0.124 | 1.65 |
| Breeze-ASR-25 | 1.54B | 5.5 | 0.182 | 4.41 |
| Whisper-large-v3 | 1.54B | 4.7 | 0.214 | 4.41 |
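

xRT and RTF are reciprocals of each other, so either column can be recovered from the other. The sketch below shows one way to derive both from per-utterance timings. It assumes audio durations and wall-clock times are pooled over all utterances before dividing (this card does not say whether its numbers are pooled or averaged per utterance), and `speed_metrics` and the sample timings are illustrative names and values, not the benchmark script used for the table.

```python
def speed_metrics(audio_seconds, wall_seconds):
    """Return (xRT, RTF) pooled over a set of utterances.

    audio_seconds: duration of each utterance, in seconds.
    wall_seconds:  wall-clock time spent transcribing each utterance.
    Pooling (sum before dividing) is an assumption of this sketch.
    """
    total_audio = sum(audio_seconds)
    total_wall = sum(wall_seconds)
    xrt = total_audio / total_wall  # audio seconds per wall-clock second
    rtf = total_wall / total_audio  # wall-clock seconds per audio second
    return xrt, rtf


# Illustrative timings only: 11 s of audio transcribed in 1 s of wall-clock time.
xrt, rtf = speed_metrics([5.0, 6.0], [0.5, 0.5])
print(f"xRT = {xrt:.1f}, RTF = {rtf:.3f}")
```

In a real measurement the wall-clock values would come from timing each transcription call (for example with `time.perf_counter()`) after a warm-up run.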

## Figures

**Accuracy across the four public benchmarks** (content-fold MER%, lower is better):



**Speed and memory** (single GPU, bf16, batch 1):



**Ablation: tokenizer × finetune.** Content MER isolates the *finetune* gain (the script fold hides tokenizer effects); raw MER isolates the *tokenizer-first* localization that makes the output Traditional + Taiwan-lexicon:



## Evaluation

- **Metric: Mixed Error Rate (MER).** Character Error Rate for Chinese characters and Word Error Rate for English
  tokens, computed jointly per utterance and micro-averaged.
- **Content fold (applied uniformly to every dataset and every system).** Before scoring, both the reference and
  the hypothesis are normalized to a common form: **converted to Simplified Chinese with OpenCC (`t2s`)**,
  lowercased, and stripped of punctuation. This isolates *recognition* from *script style*, so a Simplified-output
  model (e.g. the base) and a Traditional-output model (TEA-ASR) are compared fairly on content. (TEA-ASR's actual
  output is Traditional; the fold is only for scoring.)
- **Decoding.** TEA-ASR and Qwen3-ASR are decoded with `language="Chinese"`; Whisper-large-v3 and Breeze-ASR-25 use
  their own automatic language detection. All systems are scored with the **same code on the same public splits**;
  we do not import numbers reported elsewhere.

| Dataset | What it tests | Eval split (n) |
|---|---|---|
| **CommonVoice 19 (zh-TW)** | Read Taiwan-Mandarin speech | full test (5013) |
| **ASCEND** | Spontaneous Mandarin–English code-switch conversation | full test (1315) |
| **CSZS (zh-en)** | Zero-resource code-switch benchmark | full test (3176) |
| **NTUML2021** | Mandarin lecture speech (university ML course) | test[:2000] |

- **No train/test leakage.** Fine-tuning used **only** the *training* pools, disjoint from every evaluation
  split: the NTUML2021 *train* split, the ASCEND *train* split, and a CommonVoice slice drawn from
  `validated_without_test` (CommonVoice's official non-test pool, disjoint from its *test* split). Evaluation
  therefore runs on untouched *test* data: the full CommonVoice and ASCEND test splits and the first 2,000
  utterances of the NTUML2021 test split. CSZS is a separate dataset not used in training at all. Every number
  above is leak-free.
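

The metric and the content fold described above can be sketched in a few lines. This is a minimal illustration, not the scoring code behind the tables: it assumes micro-averaging means pooled edit errors divided by pooled reference tokens, it tokenizes with a simple regular expression, and it replaces the OpenCC `t2s` conversion with a tiny hypothetical character map (`TOY_T2S`) so the example has no dependencies. All function names are made up for this sketch.

```python
import re

# CJK ideographs are scored per character; Latin/digit runs are scored per word.
# Anything else (punctuation, whitespace) is dropped by the tokenizer.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[a-z0-9']+")

# Hypothetical stand-in for OpenCC t2s, covering only the demo characters.
TOY_T2S = str.maketrans({"這": "这", "個": "个", "軟": "软", "體": "体"})


def fold(text, to_simplified=None):
    """Lowercase, optionally unify the script, and tokenize without punctuation."""
    text = text.lower()
    if to_simplified is not None:
        text = to_simplified(text)
    return _TOKEN.findall(text)


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def mer(pairs, to_simplified=None):
    """Micro-averaged MER in percent over (reference, hypothesis) pairs."""
    errors = total = 0
    for ref, hyp in pairs:
        ref_tokens = fold(ref, to_simplified)
        errors += edit_distance(ref_tokens, fold(hyp, to_simplified))
        total += len(ref_tokens)
    return 100.0 * errors / total


# Traditional references against a Simplified hypothesis and one real error.
pairs = [
    ("這個軟體很 cool", "这个软体很 cool"),
    ("Hello, 世界!", "hello 世间"),
]
raw_mer = mer(pairs)                                            # script counts as error
content_mer = mer(pairs, lambda s: s.translate(TOY_T2S))        # script folded away
print(f"raw MER = {raw_mer:.2f}%, content MER = {content_mer:.2f}%")
# prints: raw MER = 55.56%, content MER = 11.11%
```

The gap between the two numbers is the effect the content fold removes: four of the five raw errors are pure script differences, and only the 界/间 substitution is a recognition error.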

## How it was built

- **Base**: `Qwen/Qwen3-ASR-1.7B` (frozen AuT audio encoder + Qwen3 decoder).
- **Adaptation**: a rank-16 **decoder-only LoRA** trained on **a few hours of public audio** (CommonVoice zh-TW,
  ASCEND, NTUML2021), with general + code-switch **replay** to preserve the base model's broad and bilingual
  ability. The audio encoder is left frozen.
- **Localization**: Traditional-script + Taiwan-lexicon output is rendered through the model's **own tokenizer**
  (the surface mapping is baked once at build time); there is **no post-processing at inference**. The
  Traditional output comes straight from the model's own tokenizer decode.
- **Packaging**: the adapter is **merged** into the base and the localized tokenizer is shipped with it, so the
  release is a single drop-in checkpoint that loads like stock Qwen3-ASR.
- **Decoding tip**: pass `language="Chinese"` for Taiwan speech; this also prevents translation-style outputs on
  dense code-switch.
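

Merging is what lets the checkpoint load without an adapter library. In the standard LoRA formulation, a frozen weight `W` plus a low-rank update `B @ A` scaled by `alpha / rank` can be folded into one matrix. The toy example below checks that the merged matrix gives the same output as the separate adapter path. It is a generic illustration with rank 1 and made-up numbers; the card states only that the real adapter has rank 16, so `alpha`, the matrix values, and the helper names are assumptions of this sketch, not details of the release.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every element of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


def merge_lora(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged weight."""
    return add(w, scale(matmul(b, a), alpha / rank))


# Toy shapes: W is 2x2, A is rank x 2, B is 2 x rank, with rank = 1.
W = [[1, 0], [0, 1]]
A = [[1, 2]]
B = [[0.5], [-1]]
ALPHA, RANK = 2, 1  # illustrative values only

x = [[3], [1]]  # one input column vector

merged_out = matmul(merge_lora(W, A, B, ALPHA, RANK), x)
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), ALPHA / RANK))
print(merged_out, adapter_out)
```

Both paths produce the same vector, which is why a merged checkpoint needs no extra code at inference time.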

## Limitations

- **Dense synthetic code-switch (CSZS)**: the smaller TEA-ASR-1-mini trails the Taiwan specialist on this set; the
  flagship TEA-ASR-1 leads it. For heavy code-switch, prefer TEA-ASR-1.
- **Scope**: validated on the Qwen3-ASR family (0.6B and 1.7B); the released models load via the `qwen-asr` package,
  exactly like the base.

## Citation

```bibtex
@misc{teaasr2026,
  title  = {Tokenizer-First Adaptation of Mandarin ASR to Taiwan Mandarin},
  author = {TEA-ASR contributors},
  year   = {2026},
  note   = {TEA-ASR (Taiwan Everyday Audio); adapted from Qwen3-ASR}
}
```

Built on [Qwen3-ASR](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) (Apache-2.0). The TEA-ASR adaptation and this checkpoint are
released under the **MIT License**; the underlying Qwen3-ASR weights remain subject to the Apache-2.0 license and its
attribution/NOTICE terms.