Breeze-ASR-25: transcribe.cpp GGUF
GGUF conversions of MediaTek-Research/Breeze-ASR-25 for use with transcribe.cpp.
Ported from upstream commit cffe7ccb404d025296a00758d0a33468bec3a9d0, pinned 2026-06-29. Validated against the transformers reference at transcribe.cpp commit 3848875 on 2026-06-29.
MediaTek Research's Breeze-ASR-25 is a fine-tune of OpenAI Whisper large-v2, converted to GGUF for transcribe.cpp. It is optimized for Taiwanese Mandarin (it emits Traditional Chinese) and English, with explicit support for Mandarin-English code-switching (intra- and inter-sentential). It was trained on ~11,749 hours: 10,000 h synthetic Mandarin (ODC Synth), 1,738 h English (CommonVoice17) and 11 h real code-switch (NTUML2021). Architecturally it is identical to Whisper large-v2 (encoder-decoder transformer, 30-second windows with chunked long-form decoding). It retains Whisper's 99-language tokenizer, but only Mandarin and English are optimized and validated; other languages remain technically accessible but out of scope.
Downloads
| Quantization | Download | Size | WER (LibriSpeech test-clean) | CER (FLEURS zh) |
|---|---|---|---|---|
| BF16 | Breeze-ASR-25-BF16.gguf | 3.10 GB | 2.29% | 8.12% |
| F16 | Breeze-ASR-25-F16.gguf | 3.11 GB | 2.29% | 8.11% |
| Q8_0 | Breeze-ASR-25-Q8_0.gguf | 1.67 GB | 2.27% | 8.10% |
| Q6_K | Breeze-ASR-25-Q6_K.gguf | 1.30 GB | 2.29% | 8.12% |
| Q5_K_M | Breeze-ASR-25-Q5_K_M.gguf | 1.16 GB | 2.25% | 8.12% |
| Q4_K_M | Breeze-ASR-25-Q4_K_M.gguf | 1.00 GB | 2.26% | 8.08% |
Two benchmarks, both full test splits, decoded on a Modal L40S with the transcribe.cpp default recipe (greedy + temperature fallback, suppress_tokens, segment timestamps), batch 1.
English: WER on LibriSpeech test-clean (2620 utterances). Standard
Whisper/EnglishTextNormalizer scoring.
Chinese: CER on FLEURS cmn_hans_cn test (945 utterances). Breeze
emits Traditional Chinese while FLEURS references are Simplified,
so the CER reported here is computed after folding both hypothesis and
reference to Simplified with OpenCC (t2s). Without that step the raw,
script-mismatched CER is ~35% and meaningless. FLEURS ships no Traditional
Mandarin split, which is why a script-normalized Simplified set is used.
Quantization is effectively free on both languages: every quant down to Q4_K_M (1.0 GB) sits within run-to-run noise of the BF16 reference (English 2.25-2.29%, Chinese 8.08-8.12%).
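The script-folding step described above can be sketched as follows. This is a minimal illustration, not the benchmark harness: `toy_t2s` is a hypothetical four-character stand-in for OpenCC's t2s conversion, and the two strings are made-up examples rather than FLEURS data.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, fold=lambda s: s) -> float:
    """Character error rate after applying `fold` to both strings."""
    ref, hyp = fold(ref), fold(hyp)
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical stand-in for OpenCC t2s: covers only these four characters.
TOY_T2S = str.maketrans({"這": "这", "樣": "样", "學": "学", "習": "习"})


def toy_t2s(text: str) -> str:
    return text.translate(TOY_T2S)


reference = "这样学习"   # Simplified reference (made-up example)
hypothesis = "這樣學習"  # Traditional hypothesis, same words

raw_cer = cer(reference, hypothesis)
folded_cer = cer(reference, hypothesis, fold=toy_t2s)
print(f"raw CER: {raw_cer:.2f}, folded CER: {folded_cer:.2f}")
```

In the toy case every character differs only by script, so the unfolded score counts a fully correct transcript as entirely wrong. That is the effect the folding step is meant to remove.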
Usage
Build transcribe.cpp from source:
git clone git@github.com:handy-computer/transcribe.cpp.git
cd transcribe.cpp
cmake -B build && cmake --build build
Run on a 16 kHz mono WAV:
build/bin/transcribe-cli \
-m Breeze-ASR-25-Q8_0.gguf \
input.wav
If your audio isn't already 16 kHz mono WAV, convert it first:
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
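To decide whether the ffmpeg step is needed, a file can be inspected with Python's standard `wave` module. This is a sketch under assumptions: `is_16k_mono_wav` is a hypothetical helper, and it reports any file the `wave` module cannot parse (for example non-PCM encodings) as needing conversion, which may be stricter than what transcribe.cpp itself accepts.

```python
import os
import tempfile
import wave


def is_16k_mono_wav(path: str) -> bool:
    """Return True if `path` is a WAV the wave module can read, with 1 channel at 16 kHz."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getnchannels() == 1 and wav.getframerate() == 16000
    except (wave.Error, EOFError):
        # Not a WAV file, or an encoding the wave module does not support.
        return False


def write_silence(path: str, channels: int, rate: int) -> None:
    """Write 10 ms of 16-bit silence, used here only to demonstrate the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * (rate // 100))


demo_dir = tempfile.mkdtemp()
ready = os.path.join(demo_dir, "ready.wav")
needs_conversion = os.path.join(demo_dir, "stereo_44k.wav")
write_silence(ready, channels=1, rate=16000)
write_silence(needs_conversion, channels=2, rate=44100)

for path in (ready, needs_conversion):
    print(os.path.basename(path), is_16k_mono_wav(path))
```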
See the transcribe.cpp model page for performance numbers, numerical validation, and reproduction steps.
License
Inherited from the base model: Apache-2.0. See the upstream model card for full terms.
Original Model Card
The section below is reproduced from MediaTek-Research/Breeze-ASR-25 at commit
cffe7ccb404d025296a00758d0a33468bec3a9d0 for offline reference. The upstream card is the authoritative source.
Breeze ASR 25
Breeze ASR 25 is an advanced ASR model fine-tuned from Whisper-large-v2, with the following features:
- Optimized for Taiwanese Mandarin (Traditional Chinese)
- Optimized for Mandarin-English code-switching scenarios, including intra-sentential and inter-sentential switching
- Enhanced timestamp alignment, suitable for automatic captioning
Example:
Enhanced example, Mandarin-English code-switching: MediaTek's 24th Anniversary
Breeze ASR 25:
้ขๅฐไธ็ฅ้็ๆๅๆ้บผ็จ open mind open heart ็ๅฟๆ
ๅป explore
้ฃ explore ้็จไนๅฐฑๆฏๆ็บๅญธ็ฟ ไธๆทๅตๆฐ
็ถ็ถๅฆๆ่ฝๅธถ้ MediaTek ่ชช้ๅฐ้ๆจฃ็ position
ๅฐๅ้ๆจฃ็ไบๆ
้ฃ่ฆบๅพๆฏไธๅ commitment
้ฃไนๆฏไธๅ passion ้ฃๅฏไปฅไธ็ดๅพๅชๅ็ๆๅ
ฅๅจๅ
Whisper-large-v2:
้ขๅฐไธ็ฅ้็ๆๅๆ้บผ็จ้ๆพๅฟๆ
ๅปๆข็ดข
ๆๅฎๆข็ดข้็จไนๅฐฑๆฏ ไป็ดฐๅญธ็ฟ ไธๆทๅตๆฐ
็ถ็ถๅฆๆ่ฝๅธถ้ MediaTek่ชช ้ๅฐ้ๆจฃ็ๅฑคๆฌก ๅฐๅ้ๆจฃ็ไบๆ
้ฃ่ฆบๅพๆฏไธๅ่ฒข็ป้ฃไนๆฏไธๅ็ฑ่ช
้ฃๅฏไปฅไธ็ดไพๅชๅๅฐๆๅ
ฅๅจๅ
Performance
Word error rates (WER) on benchmarks. The WERR (word error rate reduction), shown in parentheses, is reported relative to the Whisper-large-v2 automatic language detection (WLV2-Auto) baseline. "Breeze ASR 25" is referred to in the paper as "Twister".
Short-form Audio Datasets
| Dataset\Model | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | Breeze ASR 25 (Ours) ↓ |
|---|---|---|---|---|---|
| ASCEND-OVERALL* | Mixed | 21.14 | 23.22 | 19.71 | 17.74 (-16.08%) |
| - ASCEND-EN | English | 27.36 | 27.21 | 29.39 | 26.64 (-2.63%) |
| - ASCEND-ZH | Mandarin | 17.49 | 17.41 | 18.90 | 16.04 (-8.29%) |
| - ASCEND-MIX* | Mixed | 21.01 | 25.13 | 17.34 | 16.38 (-22.01%) |
| CommonVoice16-zh-TW | Mandarin | 9.84 | 8.95 | 11.86 | 7.97 (-19%) |
| CSZS-zh-en* | Mixed | 29.49 | 26.43 | 20.90 | 13.01 (-55.88%) |
Long-form Audio Datasets
| Dataset\Model | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | Breeze ASR 25 (Ours) ↓ |
|---|---|---|---|---|---|
| ML-lecture-2021-long* | Mandarin | 6.13 | 6.41 | 6.37 | 4.98 (-18.76%) |
| Formosa-Go | Mandarin | 15.03 | 14.90 | 16.83 | 13.61 (-9.44%) |
| Formosa-Show | Mandarin | 29.18 | 27.80 | 29.78 | 27.58 (-5.48%) |
| Formosa-Course | Mandarin | 9.50 | 9.67 | 11.12 | 9.94 (+0.44%) |
| Formosa-General | Mandarin | 11.45 | 11.46 | 13.33 | 11.37 (-0.69%) |
| FormosaSpeech | Mandarin | 22.34 | 21.22 | 26.71 | 22.09 (-1.12%) |
* Code-switching datasets
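The parenthesized percentages in the tables above can be checked with a short calculation. The formula below is an assumption inferred from the numbers (relative change in WER against the WLV2-Auto column); the card itself does not spell it out. `werr` is a hypothetical helper name, and the three rows are copied from the tables.

```python
def werr(baseline_wer: float, model_wer: float) -> float:
    """Relative change in WER versus the baseline, in percent (negative means fewer errors)."""
    return (model_wer - baseline_wer) / baseline_wer * 100.0


# (WLV2-Auto WER, Breeze ASR 25 WER), taken from the tables above.
rows = {
    "ASCEND-OVERALL": (21.14, 17.74),
    "CSZS-zh-en": (29.49, 13.01),
    "ML-lecture-2021-long": (6.13, 4.98),
}

for name, (baseline, ours) in rows.items():
    print(f"{name}: {werr(baseline, ours):+.2f}%")
```

Under this assumed definition the three rows reproduce the listed values of -16.08%, -55.88% and -18.76%.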
Training Data
The training data of Breeze ASR 25 is sampled from the following publicly available sources with permissive open-source licenses; all Chinese data is synthetic speech:
| Dataset Name | Type | Language | Total Hours | License |
|---|---|---|---|---|
| ODC Synth | Synthetic | Mandarin | 10,000 | Open Data Commons License Attribution + Apache2.0* |
| CommonVoice17-EN | Real | English | 1,738 | Creative Commons Zero |
| NTUML2021 | Real | Code-switching | 11 | MIT License |
*ODC Synth is generated using text from FineWeb2 (ODC License) and the TTS model BreezyVoice (Apache2.0 License).
Additional code-switching samples are generated through data augmentation with these three datasets; further details can be found in our paper.
🔧 Usage Example
For subtitle file generation, please refer to the GitHub repository.
For quick testing, the Whisper architecture is supported in Hugging Face 🤗 Transformers. First, install the relevant packages:
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
The model can be used with the pipeline class to transcribe audio of arbitrary length.
Simply change input_audio.wav in the following example to the actual filename of your audio.
import torchaudio
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration, AutomaticSpeechRecognitionPipeline

# 1. Load audio
audio_path = "./input_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# 2. Preprocess: downmix to mono, then resample to 16 kHz if needed
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0)
waveform = waveform.squeeze().numpy()
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
    waveform = resampler(torch.tensor(waveform)).numpy()
    sample_rate = 16_000

# 3. Load model (this example assumes a CUDA device is available)
processor = WhisperProcessor.from_pretrained("MediaTek-Research/Breeze-ASR-25")
model = WhisperForConditionalGeneration.from_pretrained("MediaTek-Research/Breeze-ASR-25").to("cuda").eval()

# 4. Build pipeline
asr_pipeline = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=0
)

# 5. Inference
output = asr_pipeline(waveform, return_timestamps=True)
print("Result:", output["text"])
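Because the call above passes return_timestamps=True, the pipeline result should also carry an output["chunks"] list of {"timestamp": (start, end), "text": ...} entries. A minimal sketch of turning such a list into SRT captions follows. It is not the subtitle tool from the GitHub repository: `format_timestamp`, `chunks_to_srt` and the 2-second `fallback_duration` are hypothetical, and the example chunks use made-up timings.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks, fallback_duration: float = 2.0) -> str:
    """Render pipeline-style chunks as numbered SRT entries."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: a chunk may lack an end time; give it a nominal length.
            end = start + fallback_duration
        time_line = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        entries.append(f"{index}\n{time_line}\n{chunk['text'].strip()}\n")
    return "\n".join(entries)


# Made-up chunks in the shape described above; the timings are illustrative only.
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " 去 explore"},
    {"timestamp": (2.5, None), "text": " 那覺得是一個 commitment"},
]

srt_text = chunks_to_srt(example_chunks)
print(srt_text)
```

With a real run, `chunks_to_srt(output["chunks"])` would be the corresponding call, and the result could be written to a .srt file with UTF-8 encoding.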
You can obtain a wav file for testing by loading from a benchmark:
from datasets import load_dataset
import torch
import torchaudio
ds = load_dataset("ky552/ML2021_ASR_ST", split="test")
sample = ds[1279]["audio"]
audio_array = sample["array"]
sampling_rate = sample["sampling_rate"]
waveform = torch.tensor(audio_array).unsqueeze(0)
torchaudio.save("input_audio.wav", waveform, sampling_rate)
# Decoding Results:
# Breeze ASR 25: "放進你的 training 裡面" (correct)
# Whisper: "放進你的權利裡面"
Acknowledgements
We thank NVIDIA for providing access to the Taipei-1 supercomputer.
We thank Professor Hung-yi Lee for his valuable guidance on this project.
Citation
If you find this model useful, please cite our work:
Cheng-Kang Chou*, Chan-Jan Hsu*, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan-Po Huang, Hung-yi Lee
A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data
*Equal contribution
@article{chou2025selfrefiningframeworkenhancingasr,
title={A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data},
author={Cheng Kang Chou and Chan-Jan Hsu and Ho-Lam Chung and Liang-Hsuan Tseng and Hsi-Chun Cheng and Yu-Kuan Fu and Kuan Po Huang and Hung-Yi Lee},
journal={arXiv preprint arXiv:2506.11130},
year={2025}
}