---
base_model: Qwen/Qwen3-ASR-1.7B
license: mit
library_name: transformers
pipeline_tag: automatic-speech-recognition
language:
- zh
- en
tags:
- automatic-speech-recognition
- taiwan-mandarin
- traditional-chinese
- code-switching
- qwen3-asr
- speech
---
# TEA-ASR-1 · Taiwan Everyday Audio 🍵
**TEA-ASR is an open, drop-in speech-recognition model purpose-built for Taiwan Mandarin.** It turns real speech
into natural **Traditional Chinese** with authentic **Taiwan vocabulary**, and it
stays robust through the everyday **Mandarin–English code-switching** common in Taiwan. Adapted from the
state-of-the-art **Qwen3-ASR** foundation and merged into a single self-contained checkpoint, TEA-ASR **loads and
runs exactly like stock Qwen3-ASR** — no converters, no post-processing — while matching or surpassing both a
dedicated Taiwan specialist and a large multilingual model on every public benchmark we evaluate.
`TEA-ASR-1` is the **2B flagship (best accuracy)**.
A companion **TEA-ASR-1-mini** shares the identical recipe — see [`JacobLinCool/TEA-ASR-1-mini`](https://huggingface.co/JacobLinCool/TEA-ASR-1-mini).
## Key features
- 🎯 **Built for Taiwan Mandarin** — Traditional script **and** Taiwan-style word choice, produced by the model
itself.
- 🔀 **Code-switch robust** — handles natural zh-en mixing instead of translating Mandarin into English.
- 🧩 **Drop-in Qwen3-ASR compatible** — same loading and inference API as the base model; nothing else to install
or call.
- 🪶 **Lightweight adaptation** — a small decoder LoRA on a frozen audio encoder, trained on a few hours of public
audio, then merged for deployment.
## Quick start
```bash
pip install qwen-asr
```
```python
from qwen_asr import Qwen3ASRModel
model = Qwen3ASRModel.from_pretrained("JacobLinCool/TEA-ASR-1")
result = model.transcribe(audio="utterance.wav", language="Chinese")[0]
print(result.text) # -> Traditional Chinese with Taiwan lexicon
```
Set `language="Chinese"` for Taiwan speech (recommended). You can also pass a `context=` string of hotwords
(names, jargon) for contextual biasing, exactly as with the base Qwen3-ASR.
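As a minimal sketch of contextual biasing, the helper below joins a list of hotwords into one `context=` string and forwards it to `transcribe`. The helper names (`build_context`, `transcribe_with_hotwords`) and the space-separated format are illustrative assumptions, not part of the `qwen-asr` package; only the `transcribe(audio=..., language=..., context=...)` call and `result.text` come from the usage shown above.

```python
from typing import Iterable


def build_context(hotwords: Iterable[str], sep: str = " ") -> str:
    """Join hotwords into one context string, dropping blanks and duplicates.

    Assumption: a plain separator-joined string is an acceptable context;
    the format the model responds to best is not specified in this card.
    """
    cleaned = (word.strip() for word in hotwords)
    # dict.fromkeys keeps first-seen order while removing duplicates.
    unique = dict.fromkeys(word for word in cleaned if word)
    return sep.join(unique)


def transcribe_with_hotwords(model, audio_path: str, hotwords: Iterable[str]) -> str:
    """Transcribe one file with hotword biasing and return the text."""
    result = model.transcribe(
        audio=audio_path,
        language="Chinese",
        context=build_context(hotwords),
    )[0]
    return result.text
```

With a model loaded as in the quick start, `transcribe_with_hotwords(model, "utterance.wav", ["台積電", "LoRA"])` would pass `"台積電 LoRA"` as the context.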
## Benchmark results
Mixed Error Rate (MER%, **lower is better**), all numbers from a **single self-measured run under one protocol**
(see [Evaluation](#evaluation)). Columns: the two TEA-ASR models, the original (unadapted) **Qwen3-ASR** bases, and
two references — **Breeze-ASR-25** (a Taiwan-specialist ASR) and **Whisper-large-v3**. **Bold = this model.**
| Benchmark | TEA-ASR-1 | TEA-ASR-1-mini | Qwen3-ASR-1.7B | Qwen3-ASR-0.6B | Breeze-ASR-25 | Whisper-large-v3 |
|---|---|---|---|---|---|---|
| CommonVoice 19 (zh-TW) | **3.64** | 5.14 | 3.90 | 5.79 | 8.03 | 10.17 |
| ASCEND (zh-en) | **10.59** | 12.49 | 10.57 | 12.54 | 17.53 | 19.61 |
| CSZS (zh-en) | **10.98** | 13.21 | 11.03 | 16.03 | 12.18 | 23.24 |
| NTUML2021 | **6.80** | 7.37 | 10.12 | 11.03 | 7.50 | 9.68 |
**How to read this.** **TEA-ASR-1** is the flagship model on this page.
It posts the lowest error rate on CommonVoice 19, CSZS, and NTUML2021, and is effectively tied with the unadapted
Qwen3-ASR-1.7B on ASCEND (10.59 vs 10.57). On all four benchmarks it is ahead of the Taiwan-specialist
Breeze-ASR-25 and far ahead of Whisper-large-v3; **TEA-ASR-1-mini** delivers most of that quality
at well under half the parameters (780M vs 2B). Against the unadapted **Qwen3-ASR** base, the gain in this content-folded
recognition metric is largest on in-domain lectures (NTUML2021, from 10.12 to 6.80); on the other sets recognition
is on par or better. Importantly, the metric **folds away script differences** (see Evaluation), so it does *not*
reflect the decisive practical change: TEA-ASR emits **Traditional script and Taiwan vocabulary natively**, whereas
the base produces Simplified script.
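To make the comparison against the unadapted base concrete, the sketch below computes the relative MER reduction of TEA-ASR-1 over Qwen3-ASR-1.7B. The numbers are copied from the table above; the function name `relative_reduction` is only illustrative.

```python
# MER% copied from the benchmark table: (TEA-ASR-1, Qwen3-ASR-1.7B).
MER = {
    "CommonVoice 19 (zh-TW)": (3.64, 3.90),
    "ASCEND (zh-en)": (10.59, 10.57),
    "CSZS (zh-en)": (10.98, 11.03),
    "NTUML2021": (6.80, 10.12),
}


def relative_reduction(adapted, base):
    """Relative error reduction in percent; negative means the base is better."""
    return 100.0 * (base - adapted) / base


for name, (adapted, base) in MER.items():
    print(f"{name}: {relative_reduction(adapted, base):+.1f}%")
```

This works out to roughly a third fewer errors on NTUML2021, a single-digit improvement on CommonVoice, and changes of well under 1% in either direction on ASCEND and CSZS.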
## Speed & memory
Measured on an **NVIDIA RTX 5090 (32 GB)** (bf16, batch 1, 50 utterances, greedy decode). **xRT = audio seconds processed per
wall-clock second** (higher is faster); **RTF = wall-clock / audio**, the reciprocal of xRT (lower is faster); **peak VRAM** is the maximum
GPU memory allocated during inference.
| Model | Params | xRT ↑ | RTF ↓ | Peak VRAM (GB) ↓ |
|---|---|---|---|---|
| TEA-ASR-1 | 2B | 11.0 | 0.091 | 4.16 |
| TEA-ASR-1-mini | 780M | 8.1 | 0.124 | 1.65 |
| Breeze-ASR-25 | 1.54B | 5.5 | 0.182 | 4.41 |
| Whisper-large-v3 | 1.54B | 4.7 | 0.214 | 4.41 |
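The two speed metrics can be sketched as follows. This is a minimal illustration that aggregates over total audio and total wall-clock time; whether the numbers above were aggregated this way or averaged per utterance is not stated, so treat that choice as an assumption.

```python
def speed_metrics(audio_seconds, wall_seconds):
    """Return (xRT, RTF) for a set of timed utterances.

    audio_seconds: durations of the input clips, in seconds.
    wall_seconds: matching wall-clock inference times, in seconds.
    Assumption: both metrics are computed from the totals.
    """
    total_audio = sum(audio_seconds)
    total_wall = sum(wall_seconds)
    xrt = total_audio / total_wall  # audio seconds per wall-clock second
    rtf = total_wall / total_audio  # wall-clock seconds per audio second
    return xrt, rtf
```

For example, two 5.5-second clips that each take 0.5 s to decode give `speed_metrics([5.5, 5.5], [0.5, 0.5])`, an xRT of 11.0 and an RTF of about 0.091. In PyTorch, peak allocated memory is commonly read with `torch.cuda.max_memory_allocated()` after `torch.cuda.reset_peak_memory_stats()`; this card does not say which tool produced the VRAM column.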
## Figures
**Accuracy across the four public benchmarks** (content-fold MER%, lower is better):
![Accuracy across benchmarks](bench_mer.png)
**Speed and memory** (single GPU, bf16, batch 1):
![Speed and memory](bench_speed_vram.png)
**Ablation — tokenizer × finetune.** Content MER isolates the *finetune* gain (the script fold hides tokenizer effects); raw MER isolates the *tokenizer-first* localization that makes the output Traditional + Taiwan-lexicon:
![Ablation](ablation_2x2.png)
## Evaluation
- **Metric — Mixed Error Rate (MER).** Character Error Rate for Chinese and Word Error Rate for the English tokens,
computed jointly per utterance over the mixed token sequence and micro-averaged (total edits divided by total reference tokens).
- **Content fold (applied uniformly to every dataset and every system).** Before scoring, both the reference and
the hypothesis are normalized to a common form — **converted to Simplified Chinese with OpenCC (`t2s`)**,
lowercased, and stripped of punctuation. This isolates *recognition* from *script style*, so a Simplified-output
model (e.g. the base) and a Traditional-output model (TEA-ASR) are compared fairly on content. (TEA-ASR's actual
output is Traditional; the fold is only for scoring.)
- **Decoding.** TEA-ASR and Qwen3-ASR are decoded with `language="Chinese"`; Whisper-large-v3 and Breeze-ASR-25 use
their own automatic language detection. All systems are scored with the **same code on the same public splits**;
we do not import numbers reported elsewhere.
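The metric and content fold described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the scoring code used for the table: the tokenization regex, the CJK ranges, and the `to_simplified` hook are assumptions. Passing `to_simplified=None` skips the script conversion, which approximates a raw (unfolded) score that still lowercases and strips punctuation.

```python
import re

# One token per CJK ideograph (basic block + Extension A) or per run of
# Latin letters, digits, and apostrophes. Punctuation and spaces are dropped.
_TOKEN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]|[a-z0-9']+")


def tokenize(text, to_simplified=None):
    """Apply the content fold, then split into mixed character/word tokens."""
    if to_simplified is not None:
        text = to_simplified(text)  # e.g. an OpenCC t2s converter
    return _TOKEN.findall(text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (two-row dynamic program)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def mer(refs, hyps, to_simplified=None):
    """Micro-averaged MER in percent: total edits / total reference tokens."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = tokenize(ref, to_simplified)
        errors += edit_distance(ref_tokens, tokenize(hyp, to_simplified))
        total += len(ref_tokens)
    return 100.0 * errors / total
```

With OpenCC installed, `to_simplified` could be the `convert` method of an `OpenCC` converter configured for `t2s` (the config may be spelled `t2s` or `t2s.json` depending on the Python binding). Scoring the same hypotheses with and without the converter shows how much of the error is script style rather than recognition.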
| Dataset | What it tests | Eval split (n) |
|---|---|---|
| **CommonVoice 19 (zh-TW)** | Read Taiwan-Mandarin speech | full test (5013) |
| **ASCEND** | Spontaneous Mandarin–English code-switch conversation | full test (1315) |
| **CSZS (zh-en)** | Zero-resource code-switch benchmark | full test (3176) |
| **NTUML2021** | Mandarin lecture speech (university ML course) | test[:2000] |
- **No train/test leakage.** Fine-tuning used **only** the *training* pools, disjoint from every evaluation
split: the NTUML2021 *train* split, the ASCEND *train* split, and a CommonVoice slice drawn from
`validated_without_test` (CommonVoice's official non-test pool, disjoint from its *test* split). Evaluation
therefore runs on **untouched** *test* data: the full CommonVoice and ASCEND test splits and the first 2,000
utterances of the NTUML2021 test split (`test[:2000]`, as listed in the table above); CSZS is a separate
dataset not used in training at all. Every number above is leak-free.
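A disjointness check of this kind can be sketched as below. It assumes every utterance carries a stable identifier (for example a file path or clip ID); the function name `find_leaks` and the dictionary layout are hypothetical and do not describe the tooling actually used.

```python
def find_leaks(train_pools, eval_splits):
    """Map each evaluation split to the IDs it shares with any training pool.

    train_pools / eval_splits: dicts of split name -> iterable of utterance IDs.
    Splits with no overlap are omitted, so an empty dict means no leakage.
    """
    train_ids = set().union(*train_pools.values())
    leaks = {}
    for name, ids in eval_splits.items():
        shared = sorted(train_ids & set(ids))
        if shared:
            leaks[name] = shared
    return leaks
```

An empty result for all four evaluation sets is what the leak-free claim above corresponds to under this ID-based definition of overlap.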
## How it was built
- **Base** `Qwen/Qwen3-ASR-1.7B` (frozen AuT audio encoder + Qwen3 decoder).
- **Adaptation**: a rank-16 **decoder-only LoRA** trained on **a few hours of public audio** (CommonVoice zh-TW,
ASCEND, NTUML2021), with general + code-switch **replay** to preserve the base model's broad and bilingual
ability. The audio encoder is left frozen.
- **Localization**: Traditional-script + Taiwan-lexicon output is rendered through the model's **own tokenizer**:
the surface mapping is baked in once at build time, so the Traditional output comes straight from the tokenizer's
decode step with **no post-processing at inference**.
- **Packaging**: the adapter is **merged** into the base and the localized tokenizer is shipped with it, so the
release is a single drop-in checkpoint that loads like stock Qwen3-ASR.
- **Decoding tip**: pass `language="Chinese"` for Taiwan speech; this also prevents translation-style outputs on
dense code-switch.
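The merge step in the packaging bullet can be illustrated with the standard LoRA identity, W' = W + (alpha / r) · B A, applied to a single weight matrix. This is a pure-Python toy on nested lists, not the actual build script; the `alpha` value and the tiny shapes are assumptions for illustration (only the rank, 16, is stated above).

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists (rows of a, columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into a dense weight: W + (alpha / rank) * B @ A.

    weight: out x in, lora_b: out x rank, lora_a: rank x in.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # low-rank update, same shape as weight
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example with rank 1 (the released adapter uses rank 16).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # rank x in
B = [[1.0], [0.0]]  # out x rank
merged = merge_lora(W, A, B, alpha=2.0, rank=1)
print(merged)  # [[3.0, 4.0], [0.0, 1.0]]
```

After merging, the layer is an ordinary dense matrix again, which is why the released checkpoint needs no adapter library at load time.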
## Limitations
- **Dense synthetic code-switch (CSZS)**: the smaller TEA-ASR-1-mini trails the Taiwan specialist Breeze-ASR-25 on
this set (13.21 vs 12.18 MER); the flagship TEA-ASR-1 leads it (10.98). For heavy code-switch, prefer TEA-ASR-1.
- **Scope**: validated on the Qwen3-ASR family (0.6B and 1.7B); the released models load via the `qwen-asr` package,
exactly like the base.
## Citation
```bibtex
@misc{teaasr2026,
title = {Tokenizer-First Adaptation of Mandarin ASR to Taiwan Mandarin},
author = {TEA-ASR contributors},
year = {2026},
note = {TEA-ASR (Taiwan Everyday Audio); adapted from Qwen3-ASR}
}
```
Built on [Qwen3-ASR](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) (Apache-2.0). The TEA-ASR adaptation and this checkpoint are
released under the **MIT License**; the underlying Qwen3-ASR weights remain subject to the Apache-2.0 license and its
attribution/NOTICE terms.