Model Description

BreezeASR-Taigi is a Taiwanese Hokkien (Taigi / 台語) automatic speech recognition (ASR) model developed as part of Breeze Taigi, a framework centered on standardized benchmarks and evaluation methodologies for Taiwanese Hokkien speech technologies.

Taiwanese Hokkien, also known as Taigi, is a language with deep historical and cultural significance in Taiwan. Speech recognition for languages such as English and Mandarin Chinese has advanced considerably, but building effective speech technologies for Taigi presents both opportunities and methodological challenges because of its complex phonological system, pronounced regional accent variation, and the varied ways it is written in Chinese characters.

BreezeASR-Taigi is built by fine-tuning Whisper, a multilingual speech recognition model, on approximately 10,000 hours of synthetic Taiwanese Hokkien speech data. The model transcribes spoken Taigi audio into Mandarin Chinese characters. This design leverages the substantial lexical overlap between Taigi and Mandarin to keep evaluation pragmatic and reproducible.

Intended Uses & Limitations

✅ Intended Uses

Automatic speech recognition for Taiwanese Hokkien (Taigi) audio input
Benchmarking and comparing Taigi ASR systems using standardized, reproducible evaluation protocols
Accessibility tools and conversational AI applications serving Taigi speakers
Serving as the reference ASR model for evaluating Taigi TTS systems via CER-based automatic assessment
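
The last use case, CER-based assessment of a TTS system, can be sketched as a round trip: synthesize each text, transcribe the audio with the reference ASR model, and score the transcript against the input text. The snippet below is a minimal sketch, not the framework's actual tooling. The names `tts_round_trip_cer`, `synthesize`, `transcribe`, and `cer` are hypothetical, and the sketch assumes the input texts follow the same Mandarin-character convention as the ASR output.

```python
from typing import Callable, Sequence


def tts_round_trip_cer(
    texts: Sequence[str],
    synthesize: Callable[[str], bytes],   # hypothetical TTS system: text -> audio
    transcribe: Callable[[bytes], str],   # hypothetical ASR wrapper: audio -> text
    cer: Callable[[str, str], float],     # CER function: (reference, hypothesis) -> rate
) -> float:
    """Mean CER between each input text and the ASR transcript of its synthesized audio."""
    if not texts:
        raise ValueError("at least one text is required")
    scores = [cer(text, transcribe(synthesize(text))) for text in texts]
    return sum(scores) / len(scores)
```

Passing the three components in as callables keeps the sketch independent of any particular TTS or ASR library. A lower score suggests the synthesized speech is more intelligible to the reference ASR model, but it says nothing about naturalness.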

⚠️ Limitations

The model outputs Mandarin Chinese characters rather than native Taigi orthography (台語正字). Due to the non-one-to-one nature of the Taigi–Mandarin mapping, a perfect Taigi ASR system would not necessarily achieve 0% CER on Mandarin transcriptions. Absolute CER values should therefore be interpreted carefully and used for relative system comparison rather than as a measure of absolute transcription accuracy.
Performance may degrade on samples with strong regional accents, heavy dialectal variation, or technical terminology from specialized domains.
The training data is entirely synthetic, which may introduce distributional differences from real-world spontaneous speech.
Per-sample CER on the benchmark ranges from 14.49% (best case) to 52.78% (most challenging sample), indicating variability across different test inputs.
Systems that output proper Taigi orthography (台語正字), such as Google Gemini 3 Flash and the Ministry of Education's Taiwanese Input Method, represent linguistically accurate transcriptions that cannot be directly compared against our Mandarin-based ground truth without translation.

Training Data

The model was fine-tuned on the Taigi-synthetic-speech dataset, which comprises approximately 10,000 hours of synthetic speech data generated through a large-scale speech synthesis pipeline. Key characteristics of the dataset include:

Diverse speakers: A wide range of simulated voices and speaking styles
Diverse acoustic environments: Various background acoustic conditions
Diverse conversational contexts: Natural, spontaneous speech with rich linguistic variations, including code-switching between Taigi and Mandarin — a common behavior in Taiwan's bilingual society
Unlike previous corpora that focused primarily on read speech or limited-domain recordings, this dataset is designed to cover conversational Taigi, which makes it better suited to training ASR for real-world conditions

Training Procedure

Base model: Whisper — a large-scale multilingual speech recognition model pretrained on 680,000 hours of weakly supervised multilingual audio data, demonstrating robust performance across multiple languages and domains
Fine-tuning data: Taigi-synthetic-speech (~10,000 hours of synthetic Taiwanese Hokkien speech)
Tokenization: Whisper's multilingual tokenizer, used to accommodate the linguistic characteristics of the Taigi dataset
Hardware: NVIDIA GPU

Evaluation

Benchmark Dataset

The model is evaluated on the Taigi ASR Benchmark, which consists of 30 carefully curated Mandarin–Taigi parallel audio pairs sourced from public service announcement (PSA) monthly packages released by Taiwan's Executive Yuan (行政院廣播公共服務音檔). Each PSA is approximately 30 seconds in duration and covers diverse vocabulary and domains, including technical terms from the Ministry of Transportation, the Ministry of Labor, and the Judicial Yuan, as well as everyday conversational scenarios.

Ground truth transcriptions are derived from corresponding Mandarin PSA audio through a rigorous pipeline combining:

State-of-the-art Mandarin ASR systems for initial transcription
Large language model (LLM)-based text refinement
Human expert verification against official scripts

Evaluation Metric

Character Error Rate (CER) is used as the primary evaluation metric, defined as:

CER = (Insertions + Deletions + Substitutions) / Total Reference Characters

CER is preferred over Word Error Rate (WER) for Chinese text because written Chinese does not mark word boundaries. Character-level measurement provides an unambiguous and consistent evaluation metric across all systems. A lower CER indicates better performance.
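
As an illustration of the definition above, the following minimal sketch computes CER with a standard Levenshtein (edit distance) dynamic program over characters. It is not the benchmark's own scoring script. The function names are chosen for this example, and the sample strings in the usage note are made up.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn the reference into the hypothesis."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference character missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra character in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("台語辨識", "台語便是")` counts two substitutions against four reference characters. Because insertions are counted against the reference length, CER can exceed 100% when the hypothesis is much longer than the reference.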

To ensure fair cross-system comparison, output text normalization is applied to account for differences in punctuation, sentence breaks, and capitalization of English words across different ASR systems.
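
The exact normalization rules are not listed here, so the following is only a sketch of one plausible approach: fold full-width characters to their half-width forms, drop punctuation and whitespace (which also removes sentence breaks), and lowercase English letters. The function name `normalize_transcript` is hypothetical.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Strip punctuation and whitespace and lowercase letters before scoring.
    Assumed rules for illustration only, not the benchmark's official ones."""
    text = unicodedata.normalize("NFKC", text)  # e.g. full-width '，' becomes ','
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # 'P*' categories are punctuation, 'Z*' are separators such as spaces
        if category[0] in ("P", "Z") or ch.isspace():
            continue
        kept.append(ch.lower())
    return "".join(kept)
```

Applying the same function to both the reference and every system's output before computing CER keeps formatting differences from being counted as recognition errors.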

Benchmark Results

System	Average CER (%)
BreezeASR-Taigi (Ours)	30.13 ✅
Taiwanese Input Method (教育部台灣台語輸入法)	30.70
Yating (雅婷逐字稿)	32.11
Gemini 3 Flash	32.52
Breeze ASR 25 (ASR25, Whisper-large-v2 for Taiwanese Mandarin)	49.99

Results are averaged across 30 test samples. Gemini 3 Flash and Taiwanese Input Method produce Taigi orthography, so their results were computed after translating that output into Mandarin.

BreezeASR-Taigi achieves the lowest average CER at 30.13%, with per-sample CER ranging from 14.49% (best case) to 52.78% (most challenging sample). ASR25, a Whisper-large-v2 model optimized for Taiwanese Mandarin rather than Taigi, shows substantially higher error rates (49.99% on average, with per-sample CER from 30.71% to 76.85%), which supports the importance of Taigi-specific fine-tuning.
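
The averages and ranges above summarize per-sample scores. The sketch below assumes the reported average is the unweighted mean of per-sample CERs, which is one reading of "averaged across 30 test samples", and contrasts it with a pooled CER that weights each sample by its reference length. The function name and the toy numbers in the usage note are illustrative and are not benchmark data.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summarize_cer(per_sample: List[Tuple[int, int]]) -> Dict[str, float]:
    """per_sample holds one (edit_errors, reference_length) pair per test clip."""
    rates = [errors / ref_len for errors, ref_len in per_sample]
    total_errors = sum(errors for errors, _ in per_sample)
    total_chars = sum(ref_len for _, ref_len in per_sample)
    return {
        "macro_avg": mean(rates),              # every clip counts equally
        "pooled": total_errors / total_chars,  # longer clips count more
        "best": min(rates),
        "worst": max(rates),
    }
```

With the toy input `[(10, 100), (30, 100), (20, 50)]`, the macro average is about 0.267 while the pooled rate is 0.24. The two conventions can therefore differ noticeably when clip lengths vary, so comparisons should use the same one throughout.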

About Breeze Taigi

This model is part of the Breeze Taigi framework, which provides:

Standardized ASR benchmark: 30 curated Mandarin–Taigi audio pairs with normalized ground truth for reproducible evaluation
Standardized TTS benchmark: A dual evaluation framework combining automatic ASR-based CER assessment with human evaluation of pronunciation authenticity and naturalness
Open baseline models: BreezeASR-Taigi (ASR) and BreezyVoice-Taigi (TTS) as reference implementations

The framework aims to advance the digitalization of Taiwanese Hokkien and provide a replicable methodological framework applicable to other low-resource languages.

Citation

If you use this model or the Breeze Taigi benchmark, please cite:

@misc{lan2026breezetaigibenchmarksmodels,
      title={Breeze Taigi: Benchmarks and Models for Taiwanese Hokkien Speech Recognition and Synthesis}, 
      author={Yu-Siang Lan and Chia-Sheng Liu and Yi-Chang Chen and Po-Chun Hsu and Allyson Chiu and Shun-Wen Lin and Da-shan Shiu and Yuan-Fu Liao},
      year={2026},
      eprint={2603.19259},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.19259}, 
}

License

This model is released under the Apache 2.0 License.

Acknowledgments

After more than two years of dedicated effort, we (Speech AI Research Center@NYCU and MediaTek Research) have achieved initial Taigi-to-Chinese ASR results approaching commercial-grade performance, while Taigi-to-Taibun ASR versions are also under development. We sincerely thank the Ministry of Economic Affairs for its support in facilitating approval of the NVIDIA Taipei-1 computing resource application, Wistron for sponsoring two DGX-H100 systems, and Bronci for providing various forms of support.