Breeze-ASR-25: transcribe.cpp GGUF

GGUF conversions of MediaTek-Research/Breeze-ASR-25 for use with transcribe.cpp.

Ported from upstream commit cffe7ccb404d025296a00758d0a33468bec3a9d0, pinned 2026-06-29. Validated against the transformers reference at transcribe.cpp commit 3848875 on 2026-06-29.

MediaTek Research's Breeze-ASR-25 is a fine-tune of OpenAI Whisper large-v2, converted to GGUF for transcribe.cpp. Optimized for Taiwanese Mandarin (it emits Traditional Chinese) and English, with explicit support for Mandarin-English code-switching (intra- and inter-sentential). Trained on ~11,749 hours: 10,000 h synthetic Mandarin (ODC Synth), 1,738 h English (CommonVoice17) and 11 h real code-switch (NTUML2021). Architecturally identical to Whisper large-v2 (encoder-decoder transformer, 30-second windows with chunked long-form decoding); it retains Whisper's 99-language tokenizer, but only Mandarin and English are optimized and validated. Other languages remain technically accessible but are out of scope.

Downloads

| Quantization | Download | Size | WER (LibriSpeech test-clean) | CER (FLEURS zh) |
| --- | --- | --- | --- | --- |
| BF16 | Breeze-ASR-25-BF16.gguf | 3.10 GB | 2.29% | 8.12% |
| F16 | Breeze-ASR-25-F16.gguf | 3.11 GB | 2.29% | 8.11% |
| Q8_0 | Breeze-ASR-25-Q8_0.gguf | 1.67 GB | 2.27% | 8.10% |
| Q6_K | Breeze-ASR-25-Q6_K.gguf | 1.30 GB | 2.29% | 8.12% |
| Q5_K_M | Breeze-ASR-25-Q5_K_M.gguf | 1.16 GB | 2.25% | 8.12% |
| Q4_K_M | Breeze-ASR-25-Q4_K_M.gguf | 1.00 GB | 2.26% | 8.08% |

Both benchmarks use the full test split and were decoded on a Modal L40S with the transcribe.cpp default recipe (greedy decoding with temperature fallback, suppress_tokens, segment timestamps) at batch size 1.

English: WER, LibriSpeech test-clean (2620 utterances). Standard Whisper/EnglishTextNormalizer scoring.

Chinese: CER, FLEURS cmn_hans_cn test (945 utterances). Breeze emits Traditional Chinese while FLEURS references are Simplified, so the CER reported here is computed after folding both hypothesis and reference to Simplified with OpenCC (t2s). Without that step the raw, script-mismatched CER is ~35% and meaningless. FLEURS ships no Traditional Mandarin split, which is why a script-normalized Simplified set is used.
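To illustrate why script folding matters for the Chinese score, here is a minimal sketch of a character error rate computed with and without folding. It is not the scoring code used for the numbers above: `edit_distance`, `cer`, and `toy_fold` are hypothetical helpers, and the two-character mapping is only a stand-in for OpenCC's t2s conversion (with the `opencc` package installed, `opencc.OpenCC("t2s").convert` is the assumed drop-in replacement).

```python
from typing import Callable


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, fold: Callable[[str], str] = lambda s: s) -> float:
    """CER after applying the same script-folding function to both sides."""
    ref, hyp = fold(ref), fold(hyp)
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical stand-in for OpenCC t2s: maps only the two characters used below.
TOY_T2S = str.maketrans({"這": "这", "樣": "样"})


def toy_fold(text: str) -> str:
    return text.translate(TOY_T2S)


reference = "这样"   # Simplified, as in the FLEURS references
hypothesis = "這樣"  # Traditional, as Breeze emits

raw_cer = cer(reference, hypothesis)                    # every character mismatches
folded_cer = cer(reference, hypothesis, fold=toy_fold)  # identical after folding
print(raw_cer, folded_cer)
```

In this toy case the transcript is correct in both runs; only the script differs, which is the effect that inflates the unfolded score.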

Quantization is effectively free on both languages: every quant down to Q4_K_M (1.0 GB) sits within run-to-run noise of the BF16 reference (English 2.25-2.29%, Chinese 8.08-8.12%).
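The claim can be checked directly from the Downloads figures. The sketch below copies those figures into a dictionary (the name `RESULTS` is an illustration, not part of transcribe.cpp) and reports each quantization's size relative to BF16 along with its shift in WER and CER, in percentage points.

```python
# (size in GB, WER %, CER %) copied from the Downloads figures above.
RESULTS = {
    "BF16":   (3.10, 2.29, 8.12),
    "F16":    (3.11, 2.29, 8.11),
    "Q8_0":   (1.67, 2.27, 8.10),
    "Q6_K":   (1.30, 2.29, 8.12),
    "Q5_K_M": (1.16, 2.25, 8.12),
    "Q4_K_M": (1.00, 2.26, 8.08),
}

ref_size, ref_wer, ref_cer = RESULTS["BF16"]

for name, (size, wer, cer) in RESULTS.items():
    print(
        f"{name:7s} size x{size / ref_size:.2f}  "
        f"WER {wer - ref_wer:+.2f}  CER {cer - ref_cer:+.2f}"
    )

# Largest absolute deviation from the BF16 reference, in percentage points.
max_wer_shift = max(abs(wer - ref_wer) for _, wer, _ in RESULTS.values())
max_cer_shift = max(abs(cer - ref_cer) for _, _, cer in RESULTS.values())
print(f"max WER shift {max_wer_shift:.2f}, max CER shift {max_cer_shift:.2f}")
```

Q4_K_M comes out at roughly a third of the BF16 file size, while no quantization moves either metric by more than 0.04 points.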

Usage

Build transcribe.cpp from source:

git clone git@github.com:handy-computer/transcribe.cpp.git
cd transcribe.cpp
cmake -B build && cmake --build build

Run on a 16 kHz mono WAV:

build/bin/transcribe-cli \
  -m Breeze-ASR-25-Q8_0.gguf \
  input.wav

If your audio isn't already 16 kHz mono WAV, convert it first:

ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
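To decide whether the conversion step is needed, a quick check with Python's standard `wave` module can help. This is a sketch, not part of transcribe.cpp; `is_16k_mono_wav` is a hypothetical helper, and it only recognizes PCM WAV files that the standard library can parse.

```python
import wave


def is_16k_mono_wav(path: str) -> bool:
    """Return True if path is a PCM WAV file with one channel at 16 kHz."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getnchannels() == 1 and wav.getframerate() == 16_000
    except (wave.Error, EOFError):
        # Not a WAV file the stdlib can read (for example MP3 or non-PCM WAV).
        return False
```

If `is_16k_mono_wav("input.wav")` returns False, run the ffmpeg command above first.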

See the transcribe.cpp model page for performance numbers, numerical validation, and reproduction steps.

License

Inherited from the base model: Apache-2.0. See the upstream model card for full terms.


Original Model Card

The section below is reproduced from MediaTek-Research/Breeze-ASR-25 at commit cffe7ccb404d025296a00758d0a33468bec3a9d0 for offline reference. The upstream card is the authoritative source.

Breeze ASR 25


GitHub | Paper

Breeze ASR 25 is an advanced ASR model fine-tuned from Whisper-large-v2, with the following features:

  • Optimized for Taiwanese Mandarin (Traditional Chinese)
  • Optimized for Mandarin-English code-switching scenarios, including intra-sentential and inter-sentential switching
  • Enhanced timestamp alignment, suitable for automatic captioning

Example:

Enhanced example, Mandarin-English code-switching: MediaTek's 24th Anniversary

Breeze ASR 25:

面對不知道的我們怎麼用 open mind open heart 的心情去 explore
那 explore 過程也就是持續學習 不斷創新
當然如果能帶領 MediaTek 說達到這樣的 position
對做這樣的事情那覺得是一個 commitment
那也是一個 passion 那可以一直很努力的投入在做

Whisper-large-v2:

้ขๅฐไธ็Ÿฅ้“็š„ๆˆ‘ๅ€‘ๆ€Ž้บผ็”จ้–‹ๆ”พๅฟƒๆƒ…ๅŽปๆŽข็ดข
ๆŠŠๅฎƒๆŽข็ดข้Ž็จ‹ไนŸๅฐฑๆ˜ฏ ไป”็ดฐๅญธ็ฟ’ ไธๆ–ทๅ‰ตๆ–ฐ
็•ถ็„ถๅฆ‚ๆžœ่ƒฝๅธถ้ ˜MediaTek่ชช ้”ๅˆฐ้€™ๆจฃ็š„ๅฑคๆฌก ๅฐๅš้€™ๆจฃ็š„ไบ‹ๆƒ…
้‚ฃ่ฆบๅพ—ๆ˜ฏไธ€ๅ€‹่ฒข็ป้‚ฃไนŸๆ˜ฏไธ€ๅ€‹็†ฑ่ช 
้‚ฃๅฏไปฅไธ€็›ดไพ†ๅŠชๅŠ›ๅœฐๆŠ•ๅ…ฅๅœจๅš

Performance

Word error rates (WER) on benchmarks. The word error rate reduction (WERR) is reported relative to the Whisper-large-v2 automatic language detection (WLV2-Auto) baseline. "Breeze ASR 25" is referred to in the paper as "Twister".

Short-form Audio Datasets

| Dataset\Model | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | Breeze ASR 25 (Ours) ↓ |
| --- | --- | --- | --- | --- | --- |
| ASCEND-OVERALL* | Mixed | 21.14 | 23.22 | 19.71 | 17.74 (-16.08%) |
| - ASCEND-EN | English | 27.36 | 27.21 | 29.39 | 26.64 (-2.63%) |
| - ASCEND-ZH | Mandarin | 17.49 | 17.41 | 18.90 | 16.04 (-8.29%) |
| - ASCEND-MIX* | Mixed | 21.01 | 25.13 | 17.34 | 16.38 (-22.01%) |
| CommonVoice16-zh-TW | Mandarin | 9.84 | 8.95 | 11.86 | 7.97 (-19%) |
| CSZS-zh-en* | Mixed | 29.49 | 26.43 | 20.90 | 13.01 (-55.88%) |

Long-form Audio Datasets

| Dataset\Model | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | Breeze ASR 25 (Ours) ↓ |
| --- | --- | --- | --- | --- | --- |
| ML-lecture-2021-long* | Mandarin | 6.13 | 6.41 | 6.37 | 4.98 (-18.76%) |
| Formosa-Go | Mandarin | 15.03 | 14.90 | 16.83 | 13.61 (-9.44%) |
| Formosa-Show | Mandarin | 29.18 | 27.80 | 29.78 | 27.58 (-5.48%) |
| Formosa-Course | Mandarin | 9.50 | 9.67 | 11.12 | 9.94 (+0.44%) |
| Formosa-General | Mandarin | 11.45 | 11.46 | 13.33 | 11.37 (-0.69%) |
| FormosaSpeech | Mandarin | 22.34 | 21.22 | 26.71 | 22.09 (-1.12%) |

* Code-switching datasets
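The percentages in parentheses can be read as the relative change in WER against the WLV2-Auto column. The sketch below assumes that interpretation; `werr` is a hypothetical helper, not code from the paper. It reproduces the ASCEND-OVERALL and CSZS-zh-en entries, although not every row matches exactly (Formosa-Course, for example, does not), so treat the formula as a reading of the table rather than the authors' definition.

```python
def werr(baseline_wer: float, model_wer: float) -> float:
    """Relative WER change versus a baseline, in percent (negative = improvement)."""
    return (model_wer - baseline_wer) / baseline_wer * 100.0


# WLV2-Auto and Breeze ASR 25 values copied from the benchmark figures above.
print(f"{werr(21.14, 17.74):.2f}%")  # ASCEND-OVERALL, prints -16.08%
print(f"{werr(29.49, 13.01):.2f}%")  # CSZS-zh-en, prints -55.88%
```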


Training Data

The training data of Breeze ASR 25 is sampled from the following publicly available datasets with permissive open-source licenses; all Chinese data is synthetic speech:

| Dataset Name | Type | Language | Total Hours | License |
| --- | --- | --- | --- | --- |
| ODC Synth | Synthetic | Mandarin | 10,000 | Open Data Commons License Attribution + Apache2.0* |
| CommonVoice17-EN | Real | English | 1,738 | Creative Commons Zero |
| NTUML2021 | Real | Code-switching | 11 | MIT License |

*ODC Synth is generated by using text from FineWeb2 (ODC License) and a TTS model BreezyVoice (Apache2.0 License)

Additional code-switching samples are generated through data augmentation with these three datasets; further details can be found in our paper.


🔧 Usage Example

For subtitle file generation, please refer to the GitHub repository.

For quick testing, the Whisper architecture is supported in Hugging Face 🤗 Transformers. First, install the relevant packages:

pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate

The model can be used with the pipeline class to transcribe audio of arbitrary length. Simply change input_audio.wav in the following example to the actual filename of your audio.

import torchaudio
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration, AutomaticSpeechRecognitionPipeline

# 1. Load audio
audio_path = "./input_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)          

# 2. Preprocess
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0)                         
waveform = waveform.squeeze().numpy()                        

if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
    waveform = resampler(torch.tensor(waveform)).numpy()
    sample_rate = 16_000

# 3. Load Model
processor = WhisperProcessor.from_pretrained("MediaTek-Research/Breeze-ASR-25")
model = WhisperForConditionalGeneration.from_pretrained("MediaTek-Research/Breeze-ASR-25").to("cuda").eval()

# 4. Build Pipeline
asr_pipeline = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=0
)

# 5. Inference
output = asr_pipeline(waveform, return_timestamps=True)  
print("Result:", output["text"])
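Because the pipeline is called with return_timestamps=True, the result should also carry per-segment timestamps. The sketch below assumes the usual Transformers output shape, a "chunks" list whose items have a "timestamp" (start, end) tuple and a "text" string, and turns it into SRT subtitle text. `to_srt_time` and `chunks_to_srt` are hypothetical helpers and the example chunks are made up for illustration; the upstream GitHub repository remains the reference for subtitle generation.

```python
from typing import Optional


def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: list, fallback_end: Optional[float] = None) -> str:
    """Build SRT text from chunks shaped like {"timestamp": (start, end), "text": str}."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # The final chunk may lack an end time; use the caller's value if given.
            end = fallback_end if fallback_end is not None else start
        text = chunk["text"].strip()
        blocks.append(f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n")
    return "\n".join(blocks)


# Made-up chunks in the shape assumed for output["chunks"].
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " 放進你的 training 裡面"},
    {"timestamp": (2.5, None), "text": " 不斷創新"},
]
print(chunks_to_srt(example_chunks, fallback_end=4.0))
```

With the pipeline result in hand, the assumed call would be `chunks_to_srt(output["chunks"])`, written to a .srt file with UTF-8 encoding.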

You can obtain a wav file for testing by loading from a benchmark:

from datasets import load_dataset
import torch
import torchaudio


ds = load_dataset("ky552/ML2021_ASR_ST", split="test")
sample = ds[1279]["audio"]

audio_array = sample["array"]
sampling_rate = sample["sampling_rate"]

waveform = torch.tensor(audio_array).unsqueeze(0)

torchaudio.save("input_audio.wav", waveform, sampling_rate)

# Decoding Results:
# Breeze ASR 25: "放進你的 training 裡面" (correct)
# Whisper: "放進你的權利裡面"

Acknowledgements

We thank NVIDIA for providing access to the Taipei-1 supercomputer.

We thank Professor Hung-yi Lee for his valuable guidance on this project.


📜 Citation

If you find this model useful, please cite our work:

Cheng-Kang Chou*, Chan-Jan Hsu*, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan-Po Huang, Hung-yi Lee
A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data

*Equal contribution

@article{chou2025selfrefiningframeworkenhancingasr,
  title={A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data},
  author={Cheng Kang Chou and Chan-Jan Hsu and Ho-Lam Chung and Liang-Hsuan Tseng and Hsi-Chun Cheng and Yu-Kuan Fu and Kuan Po Huang and Hung-Yi Lee},
  journal={arXiv preprint arXiv:2506.11130},
  year={2025}
}