	---
	base_model: Qwen/Qwen3-ASR-1.7B
	license: mit
	library_name: transformers
	pipeline_tag: automatic-speech-recognition
	language:
	- zh
	- en
	tags:
	- automatic-speech-recognition
	- taiwan-mandarin
	- traditional-chinese
	- code-switching
	- qwen3-asr
	- speech
	---

	# TEA-ASR-1 · Taiwan Everyday Audio 🍵

	TEA-ASR is an open, drop-in speech-recognition model purpose-built for Taiwan Mandarin. It turns real speech
	into natural Traditional Chinese with authentic Taiwan vocabulary, and it
	stays robust through the everyday Mandarin–English code-switching common in Taiwan. Adapted from the
	state-of-the-art Qwen3-ASR foundation and merged into a single self-contained checkpoint, TEA-ASR **loads and
	runs exactly like stock Qwen3-ASR** — no converters, no post-processing — while matching or surpassing both a
	dedicated Taiwan specialist and a large multilingual model on every public benchmark we evaluate.

	`TEA-ASR-1` is the 2B flagship (best accuracy).
	A companion TEA-ASR-1-mini shares the identical recipe — see [`JacobLinCool/TEA-ASR-1-mini`](https://huggingface.co/JacobLinCool/TEA-ASR-1-mini).

	## Key features

	- 🎯 Built for Taiwan Mandarin — Traditional script and Taiwan-style word choice, produced by the model
	itself.
	- 🔀 Code-switch robust — handles natural zh-en mixing instead of translating Mandarin into English.
	- 🧩 Drop-in Qwen3-ASR compatible — same loading and inference API as the base model; nothing else to install
	or call.
	- 🪶 Lightweight adaptation — a small decoder LoRA on a frozen audio encoder, trained on a few hours of public
	audio, then merged for deployment.

	## Quick start

	```bash
	pip install qwen-asr
	```

	```python
	from qwen_asr import Qwen3ASRModel

	model = Qwen3ASRModel.from_pretrained("JacobLinCool/TEA-ASR-1")
	result = model.transcribe(audio="utterance.wav", language="Chinese")[0]
	print(result.text) # -> Traditional Chinese with Taiwan lexicon
	```

	Set `language="Chinese"` for Taiwan speech (recommended). You can also pass a `context=` string of hotwords
	(names, jargon) for contextual biasing, exactly as with the base Qwen3-ASR.
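
As a rough illustration of the `context=` option, the sketch below joins a list of hotwords into one string and forwards it to `transcribe`. The helper names `build_context` and `transcribe_with_hotwords` are hypothetical, and joining hotwords with a space is an assumption, since this card does not specify a preferred format.

```python
def build_context(hotwords):
    """Join hotwords into one context string, dropping blanks and duplicates.

    Assumption: a space-separated string is an acceptable `context=` value.
    """
    seen = []
    for word in hotwords:
        word = word.strip()
        if word and word not in seen:
            seen.append(word)
    return " ".join(seen)


def transcribe_with_hotwords(model, audio_path, hotwords):
    """Transcribe one file with contextual biasing (hypothetical helper).

    `model` is expected to expose transcribe(audio=..., language=..., context=...)
    and return a list of results with a `.text` attribute, as in the quick start.
    """
    results = model.transcribe(
        audio=audio_path,
        language="Chinese",
        context=build_context(hotwords),
    )
    return results[0].text
```

With the model loaded as in the quick start, a call such as `transcribe_with_hotwords(model, "utterance.wav", ["台積電", "LoRA"])` would pass both terms as biasing context.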

	## Benchmark results

	Mixed Error Rate (MER%, lower is better), all numbers from a single self-measured run under one protocol
	(see [Evaluation](#evaluation)). Columns: the two TEA-ASR models, the original (unadapted) Qwen3-ASR bases, and
	two references — Breeze-ASR-25 (a Taiwan-specialist ASR) and Whisper-large-v3. Bold = this model.

	| Benchmark | **TEA-ASR-1** | TEA-ASR-1-mini | Qwen3-ASR-1.7B | Qwen3-ASR-0.6B | Breeze-ASR-25 | Whisper-large-v3 |
	|---|---|---|---|---|---|---|
	| CommonVoice 19 (zh-TW) | **3.64** | 5.14 | 3.90 | 5.79 | 8.03 | 10.17 |
	| ASCEND (zh-en) | **10.59** | 12.49 | 10.57 | 12.54 | 17.53 | 19.61 |
	| CSZS (zh-en) | **10.98** | 13.21 | 11.03 | 16.03 | 12.18 | 23.24 |
	| NTUML2021 | **6.80** | 7.37 | 10.12 | 11.03 | 7.50 | 9.68 |

	How to read this. TEA-ASR-1 is the flagship model on this page.
	It posts the lowest error rate on CommonVoice 19, CSZS, and NTUML2021, and on ASCEND it is within 0.02 points
	of the unadapted Qwen3-ASR-1.7B base (10.59 vs 10.57). On every benchmark it is ahead of the
	Taiwan-specialist Breeze-ASR-25 and far ahead of Whisper-large-v3. TEA-ASR-1-mini delivers most of that quality
	at well under half the parameters (780M vs 2B). Against the unadapted Qwen3-ASR base, the gain in this
	content-folded recognition metric is largest on in-domain lectures (NTUML2021, 10.12 to 6.80 for the flagship);
	on the other sets recognition is on par or better. Importantly, the metric folds away script differences
	(see Evaluation), so it does not reflect the decisive practical change: TEA-ASR emits Traditional script and
	Taiwan vocabulary natively, whereas the base produces Simplified script.
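
The size of the gain over the unadapted base can be made concrete as a relative MER reduction. The sketch below uses only the TEA-ASR-1 and Qwen3-ASR-1.7B columns of the table above; the helper name `relative_reduction` is illustrative.

```python
# MER% values copied from the benchmark table above.
MER = {
    "CommonVoice 19 (zh-TW)": {"TEA-ASR-1": 3.64, "Qwen3-ASR-1.7B": 3.90},
    "ASCEND (zh-en)": {"TEA-ASR-1": 10.59, "Qwen3-ASR-1.7B": 10.57},
    "CSZS (zh-en)": {"TEA-ASR-1": 10.98, "Qwen3-ASR-1.7B": 11.03},
    "NTUML2021": {"TEA-ASR-1": 6.80, "Qwen3-ASR-1.7B": 10.12},
}


def relative_reduction(base, adapted):
    """Relative MER reduction in percent; positive means the adapted model is better."""
    return 100.0 * (base - adapted) / base


reductions = {
    name: relative_reduction(row["Qwen3-ASR-1.7B"], row["TEA-ASR-1"])
    for name, row in MER.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:+.1f}%")
```

On these numbers the NTUML2021 reduction is roughly a third of the base error, while the other three sets move by a few percent or less in either direction.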

	## Speed & memory

	Measured on NVIDIA RTX 5090 (32 GB) (bf16, batch 1, 50 utterances, greedy decode). **xRT** = audio seconds
	processed per wall-clock second (higher is faster); **RTF** = wall-clock / audio (lower is faster);
	**peak VRAM** is the maximum allocated during inference.

	| Model | Params | xRT ↑ | RTF ↓ | Peak VRAM (GB) ↓ |
	|---|---|---|---|---|
	| TEA-ASR-1 | 2B | 11.0 | 0.091 | 4.16 |
	| TEA-ASR-1-mini | 780M | 8.1 | 0.124 | 1.65 |
	| Breeze-ASR-25 | 1.54B | 5.5 | 0.182 | 4.41 |
	| Whisper-large-v3 | 1.54B | 4.7 | 0.214 | 4.41 |
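
xRT and RTF are reciprocals of each other, which the following minimal sketch makes explicit. The function name `speed_metrics` and the example durations are hypothetical; they are chosen only so that the result lines up with the TEA-ASR-1 row.

```python
def speed_metrics(audio_seconds, wall_seconds):
    """Return (xRT, RTF) for a set of transcribed utterances."""
    if audio_seconds <= 0 or wall_seconds <= 0:
        raise ValueError("durations must be positive")
    xrt = audio_seconds / wall_seconds  # audio seconds per wall-clock second
    rtf = wall_seconds / audio_seconds  # wall-clock seconds per audio second
    return xrt, rtf


# Hypothetical durations: 330 s of audio transcribed in 30 s of wall-clock time.
xrt, rtf = speed_metrics(330.0, 30.0)
print(f"xRT={xrt:.1f}, RTF={rtf:.3f}")  # prints "xRT=11.0, RTF=0.091"
```

Peak-memory figures of this kind are often read from `torch.cuda.max_memory_allocated()`, though this card does not say which tool was used.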

	## Figures

	Accuracy across the four public benchmarks (content-fold MER%, lower is better):

	![Accuracy across benchmarks](bench_mer.png)

	Speed and memory (single GPU, bf16, batch 1):

	![Speed and memory](bench_speed_vram.png)

	Ablation — tokenizer × finetune. Content MER isolates the finetune gain (the script fold hides tokenizer effects); raw MER isolates the tokenizer-first localization that makes the output Traditional + Taiwan-lexicon:

	![Ablation](ablation_2x2.png)

	## Evaluation

	- Metric — Mixed Error Rate (MER). Character Error Rate for Chinese and Word Error Rate for the English tokens,
	computed jointly per utterance and micro-averaged.
	- Content fold (applied uniformly to every dataset and every system). Before scoring, both the reference and
	the hypothesis are normalized to a common form — converted to Simplified Chinese with OpenCC (`t2s`),
	lowercased, and stripped of punctuation. This isolates recognition from script style, so a Simplified-output
	model (e.g. the base) and a Traditional-output model (TEA-ASR) are compared fairly on content. (TEA-ASR's actual
	output is Traditional; the fold is only for scoring.)
	- Decoding. TEA-ASR and Qwen3-ASR are decoded with `language="Chinese"`; Whisper-large-v3 and Breeze-ASR-25 use
	their own automatic language detection. All systems are scored with the same code on the same public splits;
	we do not import numbers reported elsewhere.
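
The metric and the content fold described above can be sketched as follows. This is not the scoring code used for the table, only a minimal illustration under stated assumptions: Chinese characters are detected with the basic CJK block only, other text is split on whitespace, apostrophes are kept inside English words, and the script conversion is injected as `to_simplified` (for example `opencc.OpenCC("t2s").convert`, assuming the optional `opencc` package is installed).

```python
import re
import unicodedata

CJK = re.compile(r"[\u4e00-\u9fff]")  # assumption: basic CJK block is enough


def content_fold(text, to_simplified=lambda s: s):
    """Normalize before scoring: script fold, lowercase, strip punctuation."""
    text = to_simplified(text).lower()
    # Replace punctuation and symbols (Unicode P*/S*) with spaces, keeping apostrophes.
    return "".join(
        " " if ch != "'" and unicodedata.category(ch)[0] in "PS" else ch
        for ch in text
    )


def tokenize(text):
    """Split into single Chinese characters and whitespace-delimited words."""
    return CJK.sub(lambda m: f" {m.group(0)} ", text).split()


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def mer(pairs, to_simplified=lambda s: s):
    """Micro-averaged MER (%) over (reference, hypothesis) pairs."""
    errors = tokens = 0
    for ref, hyp in pairs:
        ref_toks = tokenize(content_fold(ref, to_simplified))
        hyp_toks = tokenize(content_fold(hyp, to_simplified))
        errors += edit_distance(ref_toks, hyp_toks)
        tokens += len(ref_toks)
    return 100.0 * errors / tokens if tokens else 0.0


# Without a script fold, a Traditional/Simplified mismatch counts as an error.
pairs = [("我今天要去 meeting，好嗎？", "我今天要去 Meeting 好吗")]
print(mer(pairs))  # 12.5: one substitution over eight reference tokens
```

Passing a Traditional-to-Simplified converter as `to_simplified` removes that script-only error, which is the effect the content fold is meant to have.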

	| Dataset | What it tests | Eval split (n) |
	|---|---|---|
	| CommonVoice 19 (zh-TW) | Read Taiwan-Mandarin speech | full test (5013) |
	| ASCEND | Spontaneous Mandarin–English code-switch conversation | full test (1315) |
	| CSZS (zh-en) | Zero-resource code-switch benchmark | full test (3176) |
	| NTUML2021 | Mandarin lecture speech (university ML course) | test[:2000] |

	- No train/test leakage. Fine-tuning used only training pools that are disjoint from every evaluation
	split: the NTUML2021 train split, the ASCEND train split, and a CommonVoice slice drawn from
	`validated_without_test` (CommonVoice's official non-test pool, disjoint from its test split). Evaluation
	therefore runs on untouched test data: the full CommonVoice and ASCEND test splits and the first 2,000
	utterances of the NTUML2021 test split (`test[:2000]`). CSZS is a separate dataset not used in training at
	all. Every number above is leak-free.

	## How it was built

	- Base `Qwen/Qwen3-ASR-1.7B` (frozen AuT audio encoder + Qwen3 decoder).
	- Adaptation: a rank-16 decoder-only LoRA trained on a few hours of public audio (CommonVoice zh-TW,
	ASCEND, NTUML2021), with general + code-switch replay to preserve the base model's broad and bilingual
	ability. The audio encoder is left frozen.
	- Localization: Traditional-script + Taiwan-lexicon output is rendered through the model's own tokenizer
	(the surface mapping is baked once at build time), so the Traditional output comes straight from the
	tokenizer decode with no post-processing at inference.
	- Packaging: the adapter is merged into the base and the localized tokenizer is shipped with it, so the
	release is a single drop-in checkpoint that loads like stock Qwen3-ASR.
	- Decoding tip: pass `language="Chinese"` for Taiwan speech; this also prevents translation-style outputs on
	dense code-switch.
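
Merging a LoRA adapter folds the low-rank update into the base weight, which is why a merged checkpoint needs no adapter code at load time. The toy sketch below uses the standard LoRA formulation `W' = W + (alpha / r) * B @ A` in plain Python. The scaling factor `alpha`, the matrix sizes, and the helper names are illustrative assumptions, not the values or code used to build TEA-ASR.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy shapes: W is 2x3, B is 2x1, A is 1x3 (rank 1 for readability; the card uses rank 16).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
B = [[0.5], [-1.0]]
A = [[2.0, 0.0, 4.0]]
x = [[1.0], [2.0], [3.0]]  # column-vector input
ALPHA, RANK = 2.0, 1       # assumed values for illustration only

# Merged path: a single matrix multiply with the folded weight.
y_merged = matmul(merge_lora(W, A, B, ALPHA, RANK), x)

# Unmerged path: base output plus the scaled adapter output.
y_base = matmul(W, x)
y_adapter = matmul(B, matmul(A, x))
y_unmerged = [[b[0] + (ALPHA / RANK) * a[0]] for b, a in zip(y_base, y_adapter)]

print(y_merged == y_unmerged)  # both paths give the same output
```

Because the two paths agree, the adapter can be discarded after merging, leaving a checkpoint with the same layout as the base model.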

	## Limitations

	- Dense synthetic code-switch (CSZS): the smaller TEA-ASR-1-mini trails the Taiwan specialist on this set; the
	flagship TEA-ASR-1 leads it. For heavy code-switch, prefer TEA-ASR-1.
	- Scope: validated on the Qwen3-ASR family (0.6B and 1.7B); the released models load via the `qwen-asr` package,
	exactly like the base.

	## Citation

	```bibtex
	@misc{teaasr2026,
	title = {Tokenizer-First Adaptation of Mandarin ASR to Taiwan Mandarin},
	author = {TEA-ASR contributors},
	year = {2026},
	note = {TEA-ASR (Taiwan Everyday Audio); adapted from Qwen3-ASR}
	}
	```

	Built on [Qwen3-ASR](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) (Apache-2.0). The TEA-ASR adaptation and this checkpoint are
	released under the MIT License; the underlying Qwen3-ASR weights remain subject to the Apache-2.0 license and its
	attribution/NOTICE terms.