Breeze-ASR-25: transcribe.cpp GGUF

GGUF conversions of MediaTek-Research/Breeze-ASR-25 for use with transcribe.cpp.

Ported from upstream commit cffe7ccb404d025296a00758d0a33468bec3a9d0, pinned 2026-06-29. Validated against the transformers reference at transcribe.cpp commit 3848875 on 2026-06-29.

MediaTek Research's Breeze-ASR-25 — a fine-tune of OpenAI Whisper large-v2, converted to GGUF for transcribe.cpp. Optimized for Taiwanese Mandarin (it emits Traditional Chinese) and English, with explicit support for Mandarin-English code-switching (intra- and inter-sentential). Trained on ~11,749 hours: 10,000 h synthetic Mandarin (ODC Synth), 1,738 h English (CommonVoice17) and 11 h real code-switch (NTUML2021). Architecturally identical to Whisper large-v2 (encoder-decoder transformer, 30-second windows with chunked long-form decoding); it retains Whisper's 99-language tokenizer, but only Mandarin and English are optimized and validated — other languages remain technically accessible but out of scope.

Downloads

Quantization	Download	Size	WER (LibriSpeech test-clean)	CER (FLEURS zh)
BF16	Breeze-ASR-25-BF16.gguf	3.10 GB	2.29%	8.12%
F16	Breeze-ASR-25-F16.gguf	3.11 GB	2.29%	8.11%
Q8_0	Breeze-ASR-25-Q8_0.gguf	1.67 GB	2.27%	8.10%
Q6_K	Breeze-ASR-25-Q6_K.gguf	1.30 GB	2.29%	8.12%
Q5_K_M	Breeze-ASR-25-Q5_K_M.gguf	1.16 GB	2.25%	8.12%
Q4_K_M	Breeze-ASR-25-Q4_K_M.gguf	1.00 GB	2.26%	8.08%

Two benchmarks, both full test splits, decoded on a Modal L40S with the transcribe.cpp default recipe (greedy + temperature fallback, suppress_tokens, segment timestamps), batch 1.
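
The "greedy + temperature fallback" part of that recipe can be sketched roughly as below. This is an illustration only: the temperature ladder and the two thresholds are assumed to match the defaults of OpenAI's reference Whisper decoder, since this card does not state the values transcribe.cpp uses. `decode_fn` is a hypothetical callback standing in for the real decoder.

```python
import zlib
from typing import Callable, Sequence, Tuple

# Assumed values: these mirror the defaults of OpenAI's reference Whisper
# decoder and are not confirmed settings of transcribe.cpp.
TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
COMPRESSION_RATIO_THRESHOLD = 2.4
LOGPROB_THRESHOLD = -1.0


def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size; high values suggest repetition."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def decode_with_fallback(
    decode_fn: Callable[[float], Tuple[str, float]],
    temperatures: Sequence[float] = TEMPERATURES,
) -> Tuple[str, float, float]:
    """Decode greedily first (temperature 0), then retry at higher temperatures.

    `decode_fn` is a hypothetical callback that returns (text, avg_logprob)
    for one audio window at the given temperature.
    Returns (text, avg_logprob, temperature_used).
    """
    text, avg_logprob, used = "", float("-inf"), temperatures[0]
    for used in temperatures:
        text, avg_logprob = decode_fn(used)
        too_repetitive = compression_ratio(text) > COMPRESSION_RATIO_THRESHOLD
        too_unlikely = avg_logprob < LOGPROB_THRESHOLD
        if not (too_repetitive or too_unlikely):
            break  # accept the first result that passes both checks
    return text, avg_logprob, used
```

If every temperature fails the checks, the sketch returns the last attempt rather than raising, which is one possible design choice among several.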

English — WER, LibriSpeech test-clean (2620 utterances). Standard Whisper/EnglishTextNormalizer scoring.

Chinese — CER, FLEURS cmn_hans_cn test (945 utterances). Breeze emits Traditional Chinese while FLEURS references are Simplified, so the CER reported here is computed after folding both hypothesis and reference to Simplified with OpenCC (t2s) — otherwise the raw, script-mismatched CER is ~35% and meaningless. FLEURS ships no Traditional Mandarin split, which is why a script-normalized Simplified set is used.
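
The effect of script folding on CER can be illustrated with a minimal sketch. It is not the scoring code used for the table: `TOY_T2S` is a hypothetical three-character stand-in for OpenCC's t2s conversion so that the example runs with the standard library alone, and the example strings are invented.

```python
from typing import Callable


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, fold: Callable[[str], str] = lambda s: s) -> float:
    """CER after applying the same folding function to reference and hypothesis."""
    ref, hyp = fold(ref), fold(hyp)
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical toy mapping standing in for OpenCC (t2s); illustration only.
TOY_T2S = str.maketrans({"學": "学", "習": "习", "語": "语"})


def toy_fold(text: str) -> str:
    return text.translate(TOY_T2S)


reference = "学习语言"   # Simplified, as in the FLEURS references
hypothesis = "學習語言"  # Traditional, as emitted by Breeze

raw_cer = cer(reference, hypothesis)               # script mismatch counted as errors
folded_cer = cer(reference, hypothesis, toy_fold)  # both sides folded to Simplified
print(f"raw CER: {raw_cer:.2f}, folded CER: {folded_cer:.2f}")
```

In real scoring, `toy_fold` would be replaced by a proper Traditional-to-Simplified converter such as OpenCC with its t2s configuration.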

Quantization is effectively free on both languages: every quant down to Q4_K_M (1.0 GB) sits within run-to-run noise of the BF16 reference (English 2.25-2.29%, Chinese 8.08-8.12%).
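
As a quick check of that claim, the short sketch below recomputes the size ratio and the error-rate deltas against BF16 from the figures in the Downloads table. The numbers are copied from the table; nothing here is a new measurement.

```python
# (size in GB, WER % on LibriSpeech test-clean, CER % on FLEURS zh),
# copied from the Downloads table.
QUANTS = {
    "BF16":   (3.10, 2.29, 8.12),
    "F16":    (3.11, 2.29, 8.11),
    "Q8_0":   (1.67, 2.27, 8.10),
    "Q6_K":   (1.30, 2.29, 8.12),
    "Q5_K_M": (1.16, 2.25, 8.12),
    "Q4_K_M": (1.00, 2.26, 8.08),
}

baseline_size, baseline_wer, baseline_cer = QUANTS["BF16"]

for name, (size, wer, cer) in QUANTS.items():
    print(
        f"{name:7s} size x{size / baseline_size:.2f}  "
        f"WER {wer - baseline_wer:+.2f} pt  CER {cer - baseline_cer:+.2f} pt"
    )

# Spread between the best and worst quantization, in percentage points.
wer_spread = max(q[1] for q in QUANTS.values()) - min(q[1] for q in QUANTS.values())
cer_spread = max(q[2] for q in QUANTS.values()) - min(q[2] for q in QUANTS.values())
print(f"WER spread: {wer_spread:.2f} pt, CER spread: {cer_spread:.2f} pt")
```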

Usage

Build transcribe.cpp from source:

git clone git@github.com:handy-computer/transcribe.cpp.git
cd transcribe.cpp
cmake -B build && cmake --build build

Run on a 16 kHz mono WAV:

build/bin/transcribe-cli \
  -m Breeze-ASR-25-Q8_0.gguf \
  input.wav

If your audio isn't already 16 kHz mono WAV, convert it first:

ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
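
To decide whether the conversion step is needed, a file can be checked with Python's standard `wave` module. This is a small sketch, not part of transcribe.cpp; the helper name `is_16khz_mono_wav` is made up for illustration.

```python
import wave


def is_16khz_mono_wav(path: str) -> bool:
    """Return True if `path` is a WAV file with one channel sampled at 16 kHz."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getnchannels() == 1 and wav.getframerate() == 16_000
    except (wave.Error, EOFError):
        # Not a WAV file that the standard-library reader understands.
        return False
```

The `wave` module only reads PCM-encoded WAV files, so a `False` result can also mean an encoding it does not handle; converting with ffmpeg as shown above is a safe fallback in that case.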

See the transcribe.cpp model page for performance numbers, numerical validation, and reproduction steps.

License

Inherited from the base model: Apache-2.0. See the upstream model card for full terms.

Original Model Card

The section below is reproduced from MediaTek-Research/Breeze-ASR-25 at commit cffe7ccb404d025296a00758d0a33468bec3a9d0 for offline reference. The upstream card is the authoritative source.

Breeze ASR 25

GitHub | Paper

Breeze ASR 25 is an advanced ASR model fine-tuned from Whisper-large-v2, with the following features:

Optimized for Taiwanese Mandarin (Traditional Chinese)
Optimized for Mandarin-English code-switching, including intra-sentential and inter-sentential switching
Enhanced timestamp alignment, suitable for automatic captioning

Example:

Enhanced example, Mandarin-English code-switching: MediaTek's 24th Anniversary

Breeze ASR 25:

面對不知道的我們怎麼用 open mind open heart 的心情去 explore
那 explore 過程也就是持續學習 不斷創新
當然如果能帶領 MediaTek 說達到這樣的 position
對做這樣的事情那覺得是一個 commitment
那也是一個 passion 那可以一直很努力的投入在做

Whisper-large-v2:

面對不知道的我們怎麼用開放心情去探索
把它探索過程也就是 仔細學習 不斷創新
當然如果能帶領MediaTek說 達到這樣的層次 對做這樣的事情
那覺得是一個貢獻那也是一個熱誠
那可以一直來努力地投入在做

Performance

Word error rates (WER) on benchmarks. The word error rate reduction (WERR), shown in parentheses, is reported relative to the Whisper-large-v2 automatic language detection (WLV2-Auto) baseline. "Breeze ASR 25" is referred to in the paper as "Twister".

Short-form Audio Datasets

Dataset\Model	Language	WLV2-Auto ↓	WLV3-Auto ↓	COOL-Whisper ↓	Breeze ASR 25 (Ours) ↓
ASCEND-OVERALL*	Mixed	21.14	23.22	19.71	17.74 (-16.08%)
- ASCEND-EN	English	27.36	27.21	29.39	26.64 (-2.63%)
- ASCEND-ZH	Mandarin	17.49	17.41	18.90	16.04 (-8.29%)
- ASCEND-MIX*	Mixed	21.01	25.13	17.34	16.38 (-22.01%)
CommonVoice16-zh-TW	Mandarin	9.84	8.95	11.86	7.97 (-19%)
CSZS-zh-en*	Mixed	29.49	26.43	20.90	13.01 (-55.88%)

Long-form Audio Datasets

Dataset\Model	Language	WLV2-Auto ↓	WLV3-Auto ↓	COOL-Whisper ↓	Breeze ASR 25 (Ours) ↓
ML-lecture-2021-long*	Mandarin	6.13	6.41	6.37	4.98 (-18.76%)
Formosa-Go	Mandarin	15.03	14.90	16.83	13.61 (-9.44%)
Formosa-Show	Mandarin	29.18	27.80	29.78	27.58 (-5.48%)
Formosa-Course	Mandarin	9.50	9.67	11.12	9.94 (+0.44%)
Formosa-General	Mandarin	11.45	11.46	13.33	11.37 (-0.69%)
FormosaSpeech	Mandarin	22.34	21.22	26.71	22.09 (-1.12%)

* Code-switching datasets
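
The parenthesized WERR values appear to be the relative change in WER against the WLV2-Auto baseline. The sketch below assumes that definition (the card does not spell out the formula) and recomputes three rows from the tables above.

```python
def werr(baseline_wer: float, model_wer: float) -> float:
    """Relative WER change versus the baseline, in percent (negative = improvement)."""
    return (model_wer - baseline_wer) / baseline_wer * 100.0


# (WLV2-Auto WER, Breeze ASR 25 WER) pairs copied from the tables above.
ROWS = {
    "ASCEND-OVERALL": (21.14, 17.74),
    "CSZS-zh-en": (29.49, 13.01),
    "ML-lecture-2021-long": (6.13, 4.98),
}

for name, (baseline, ours) in ROWS.items():
    print(f"{name}: {werr(baseline, ours):+.2f}%")
```

Under this assumed definition, the three rows reproduce the table's -16.08%, -55.88% and -18.76%.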

Training Data

The training data of Breeze ASR 25 is sampled from the following publicly available datasets with permissive open-source licenses; the Chinese portion consists entirely of synthetic speech:

Dataset Name	Type	Language	Total Hours	License
ODC Synth	Synthetic	Mandarin	10,000	Open Data Commons License Attribution + Apache2.0*
CommonVoice17-EN	Real	English	1,738	Creative Commons Zero
NTUML2021	Real	Code-switching	11	MIT License

*ODC Synth is generated using text from FineWeb2 (ODC License) and the TTS model BreezyVoice (Apache 2.0 License).

Additional code-switching samples are generated through data augmentation with these three datasets; further details can be found in our paper.

🔧 Usage Example

For subtitle file generation, please refer to the GitHub repository.

For quick testing, the Whisper architecture is supported in Hugging Face 🤗 Transformers. First, install the relevant packages:

pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate

The model can be used with the pipeline class to transcribe audio of arbitrary length. Simply change input_audio.wav in the following example to the actual filename of your audio.

import torchaudio
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration, AutomaticSpeechRecognitionPipeline

# 1. Load audio
audio_path = "./input_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)          

# 2. Preprocess
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0)                         
waveform = waveform.squeeze().numpy()                        

if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
    waveform = resampler(torch.tensor(waveform)).numpy()
    sample_rate = 16_000

# 3. Load Model
processor = WhisperProcessor.from_pretrained("MediaTek-Research/Breeze-ASR-25")
model = WhisperForConditionalGeneration.from_pretrained("MediaTek-Research/Breeze-ASR-25").to("cuda").eval()

# 4. Build Pipeline
asr_pipeline = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=0
)

# 5. Inference
output = asr_pipeline(waveform, return_timestamps=True)  
print("Result:", output["text"])
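
With `return_timestamps=True`, the pipeline output also carries a `chunks` list whose items hold a `timestamp` pair and a `text` string. The sketch below turns such a list into SRT subtitle text. It is only an illustration and is not the subtitle tool from the GitHub repository; the helper names are made up.

```python
from typing import Iterable, Mapping


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: Iterable[Mapping]) -> str:
    """Build SRT text from items shaped like {"timestamp": (start, end), "text": str}."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # The final chunk may lack an end time; fall back to its start.
            end = start
        entries.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)
```

Assuming `output` from the example above, `chunks_to_srt(output["chunks"])` returns a string that can be written to a `.srt` file.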

You can obtain a wav file for testing by loading from a benchmark:

from datasets import load_dataset
import torch
import torchaudio


ds = load_dataset("ky552/ML2021_ASR_ST", split="test")
sample = ds[1279]["audio"]

audio_array = sample["array"]
sampling_rate = sample["sampling_rate"]

# Cast to float32 so the saved WAV uses a widely supported sample format
waveform = torch.tensor(audio_array, dtype=torch.float32).unsqueeze(0)

torchaudio.save("input_audio.wav", waveform, sampling_rate)

# Decoding Results:
# Breeze ASR 25: "放進你的 training 裡面" (correct)
# Whisper: "放進你的權利裡面"

Acknowledgements

We thank NVIDIA for providing access to the Taipei-1 supercomputer.

We thank Professor Hung-yi Lee for his valuable guidance on this project.

📜 Citation

If you find this model useful, please cite our work:

Cheng-Kang Chou*, Chan-Jan Hsu*, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan-Po Huang, Hung-yi Lee
A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data

*Equal contribution

@article{chou2025selfrefiningframeworkenhancingasr,
  title={A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data},
  author={Cheng Kang Chou and Chan-Jan Hsu and Ho-Lam Chung and Liang-Hsuan Tseng and Hsi-Chun Cheng and Yu-Kuan Fu and Kuan Po Huang and Hung-Yi Lee},
  journal={arXiv preprint arXiv:2506.11130},
  year={2025}
}
