Model Description

BreezeASR-Taigi is an automatic speech recognition (ASR) model for Taiwanese Hokkien (Taigi / 台語). It was developed as part of Breeze Taigi, a framework of standardized benchmarks and evaluation methodologies for Taiwanese Hokkien speech technologies.

Taiwanese Hokkien, also known as Taigi, has deep historical and cultural significance in Taiwan. Speech recognition has advanced considerably for languages such as English and Mandarin Chinese, but building effective speech technologies for Taigi presents both opportunities and methodological challenges: the language has a complex phonological system, pronounced regional accents, and several coexisting conventions for writing it in Chinese characters.

BreezeASR-Taigi is built by fine-tuning Whisper, a multilingual speech recognition model, on approximately 10,000 hours of synthetic Taiwanese Hokkien speech. The model transcribes spoken Taigi audio into Mandarin Chinese characters. This output format exploits the substantial lexical overlap between Taigi and Mandarin and keeps evaluation pragmatic and reproducible.
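
A minimal inference sketch follows. It assumes the checkpoint loads through the standard Hugging Face `transformers` speech recognition pipeline, as Whisper fine-tunes usually do; this card does not confirm that. The repository ID `MediaTek-Research/Breeze-ASR-26` is the one shown on this model's hub page, while `transcribe` and the file name `taigi_sample.wav` are hypothetical names introduced for illustration.

```python
# Repository ID as shown on the model's hub page.
MODEL_ID = "MediaTek-Research/Breeze-ASR-26"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one Taigi audio file into Mandarin Chinese characters.

    Assumption: the checkpoint is compatible with the generic
    "automatic-speech-recognition" pipeline in `transformers`.
    """
    # Imported lazily so that defining this helper does not require
    # `transformers` (or a model download) until it is actually called.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # Whisper processes audio in 30-second windows
    )
    result = asr(audio_path)
    return result["text"]


if __name__ == "__main__":
    # Hypothetical local file; decoding a file path requires ffmpeg.
    print(transcribe("taigi_sample.wav"))
```

The first call downloads the weights, so it is worth creating the pipeline once and reusing it when transcribing many files.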


Intended Uses & Limitations

✅ Intended Uses

  • Automatic speech recognition for Taiwanese Hokkien (Taigi) audio input
  • Benchmarking and comparing Taigi ASR systems using standardized, reproducible evaluation protocols
  • Accessibility tools and conversational AI applications serving Taigi speakers
  • Serving as the reference ASR model for evaluating Taigi TTS systems via CER-based automatic assessment

⚠️ Limitations

  • The model outputs Mandarin Chinese characters rather than native Taigi orthography (台語正字). Due to the non-one-to-one nature of the Taigi–Mandarin mapping, a perfect Taigi ASR system would not necessarily achieve 0% CER on Mandarin transcriptions. Absolute CER values should therefore be interpreted carefully and used for relative system comparison rather than as a measure of absolute transcription accuracy.
  • Performance may degrade on samples with strong regional accents, heavy dialectal variation, or technical terminology from specialized domains.
  • The training data is entirely synthetic, which may introduce distributional differences from real-world spontaneous speech.
  • Per-sample CER on the benchmark ranges from 14.49% (best case) to 52.78% (most challenging sample), indicating variability across different test inputs.
  • Systems that output proper Taigi orthography (台語正字), such as Google Gemini 3 Flash and the Ministry of Education's Taiwanese Input Method, represent linguistically accurate transcriptions that cannot be directly compared against our Mandarin-based ground truth without translation.

Training Data

The model was fine-tuned on the Taigi-synthetic-speech dataset, which comprises approximately 10,000 hours of synthetic speech data generated through a large-scale speech synthesis pipeline. Key characteristics of the dataset include:

  • Diverse speakers: A wide range of simulated voices and speaking styles
  • Diverse acoustic environments: Various background acoustic conditions
  • Diverse conversational contexts: Natural, spontaneous speech with rich linguistic variations, including code-switching between Taigi and Mandarin — a common behavior in Taiwan's bilingual society
  • Unlike previous corpora that focused primarily on read speech or limited-domain recordings, this dataset captures authentic conversational Taigi, making it well-suited for robust ASR in real-world conditions

Training Procedure

  • Base model: Whisper — a large-scale multilingual speech recognition model pretrained on 680,000 hours of weakly supervised multilingual audio data, demonstrating robust performance across multiple languages and domains
  • Fine-tuning data: Taigi-synthetic-speech (~10,000 hours of synthetic Taiwanese Hokkien speech)
  • Tokenization: Whisper's multilingual tokenizer, used to accommodate the linguistic characteristics of the Taigi dataset
  • Hardware: NVIDIA GPU

Evaluation

Benchmark Dataset

The model is evaluated on the Taigi ASR Benchmark, which consists of 30 carefully curated Mandarin–Taigi parallel audio pairs sourced from public service announcement (PSA) monthly packages released by Taiwan's Executive Yuan (行政院廣播公共服務音檔). Each PSA is approximately 30 seconds in duration and covers diverse vocabulary and domains, including technical terms from the Ministry of Transportation, the Ministry of Labor, and the Judicial Yuan, as well as everyday conversational scenarios.

Ground truth transcriptions are derived from corresponding Mandarin PSA audio through a rigorous pipeline combining:

  1. State-of-the-art Mandarin ASR systems for initial transcription
  2. Large language model (LLM)-based text refinement
  3. Human expert verification against official scripts

Evaluation Metric

Character Error Rate (CER) is used as the primary evaluation metric, defined as:

CER = (Insertions + Deletions + Substitutions) / Total Reference Characters

CER is preferred over Word Error Rate (WER) for Chinese text because written Chinese does not mark word boundaries, so any word-level metric depends on the segmenter used. Character-level measurement gives an unambiguous and consistent metric across all systems. A lower CER indicates better performance.
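
The definition above can be sketched in a few lines of Python. This is an illustrative implementation based on the standard character-level Levenshtein distance, not the benchmark's own scoring script; the function names `edit_distance` and `cer` are chosen here for the example.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into nothing: i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete reference character r
                curr[j - 1] + 1,     # insert hypothesis character h
                prev[j - 1] + cost,  # match, or substitute r with h
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance over reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of six reference characters, i.e. 1/6.
    print(cer("今天天氣很好", "今天天气很好"))
```

Because the denominator is the reference length, a hypothesis with many insertions can score above 100%.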

To ensure fair cross-system comparison, output text normalization is applied to account for differences in punctuation, sentence breaks, and capitalization of English words across different ASR systems.
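
The exact normalization rules are not listed in this card. As a rough sketch under that caveat, the following assumes normalization means folding full-width forms, dropping punctuation and whitespace (which also removes sentence breaks), and lowercasing English letters. The function name `normalize` is hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Normalize ASR output before scoring (assumed rules, see lead-in)."""
    # Fold full-width and other compatibility forms, e.g. "," becomes ",".
    text = unicodedata.normalize("NFKC", text)
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # Skip punctuation (P*), separators (Z*) and any other whitespace
        # such as newlines, so sentence breaks do not count as errors.
        if category.startswith(("P", "Z")) or ch.isspace():
            continue
        kept.append(ch.lower())  # lowercasing is a no-op for Chinese characters
    return "".join(kept)


if __name__ == "__main__":
    print(normalize("今天,天氣很好。 Hello World!"))
```

Applying the same function to both the reference and every system's output keeps the comparison symmetric.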

Benchmark Results

| System | Average CER (%) |
|---|---|
| BreezeASR-Taigi (Ours) | 30.13 |
| Taiwanese Input Method (教育部台灣台語輸入法) | 30.70 |
| Yating (雅婷逐字稿) | 32.11 |
| Gemini 3 Flash | 32.52 |
| Breeze ASR 25 (ASR25, Whisper-large-v2 for Taiwanese Mandarin) | 49.99 |

Results are averaged across the 30 test samples. Results for Gemini 3 Flash and the Taiwanese Input Method are computed on Mandarin translations of their original Taigi-orthography outputs.

BreezeASR-Taigi achieves the lowest average CER at 30.13%, with per-sample CER ranging from 14.49% (best case) to 52.78% (most challenging sample). ASR25, a Whisper-large-v2 model optimized for Taiwanese Mandarin rather than Taigi, shows substantially higher error rates (30.71%–76.85%), which underlines the importance of Taigi-specific fine-tuning.
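
To put the averages in perspective, the short script below computes the absolute gap (in percentage points) and the relative CER reduction of BreezeASR-Taigi against each other system. It uses only the average CER values reported above; the shortened dictionary keys and the helper name `gap_to` are illustrative.

```python
# Average CER (%) as reported in the benchmark results.
AVG_CER = {
    "BreezeASR-Taigi": 30.13,
    "Taiwanese Input Method": 30.70,
    "Yating": 32.11,
    "Gemini 3 Flash": 32.52,
    "Breeze ASR 25": 49.99,
}
OURS = AVG_CER["BreezeASR-Taigi"]


def gap_to(system: str) -> tuple[float, float]:
    """Return (absolute gap in points, relative CER reduction in %)."""
    other = AVG_CER[system]
    absolute = round(other - OURS, 2)
    relative = round(100 * (other - OURS) / other, 1)
    return absolute, relative


if __name__ == "__main__":
    for name in AVG_CER:
        if name != "BreezeASR-Taigi":
            points, percent = gap_to(name)
            print(f"{name}: {points} points, {percent}% relative reduction")
```

The margin over the Taiwanese Input Method is 0.57 points (about 1.9% relative), whereas the margin over ASR25 is 19.86 points (about 39.7% relative). With only 30 samples, the small gaps among the top systems should be read cautiously.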


About Breeze Taigi

This model is part of the Breeze Taigi framework, which provides:

  • Standardized ASR benchmark: 30 curated Mandarin–Taigi audio pairs with normalized ground truth for reproducible evaluation
  • Standardized TTS benchmark: A dual evaluation framework combining automatic ASR-based CER assessment with human evaluation of pronunciation authenticity and naturalness
  • Open baseline models: BreezeASR-Taigi (ASR) and BreezyVoice-Taigi (TTS) as reference implementations

The framework aims to advance the digitalization of Taiwanese Hokkien and to offer a replicable methodology that can be applied to other low-resource languages.


Citation

If you use this model or the Breeze Taigi benchmark, please cite:

@misc{lan2026breezetaigibenchmarksmodels,
      title={Breeze Taigi: Benchmarks and Models for Taiwanese Hokkien Speech Recognition and Synthesis}, 
      author={Yu-Siang Lan and Chia-Sheng Liu and Yi-Chang Chen and Po-Chun Hsu and Allyson Chiu and Shun-Wen Lin and Da-shan Shiu and Yuan-Fu Liao},
      year={2026},
      eprint={2603.19259},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.19259}, 
}

License

This model is released under the Apache 2.0 License.

