# SoybeanMilk/Breeze-ASR-25-quantized.w4a16

This repository provides a high-performance, 4-bit quantized version of **MediaTek-Research/Breeze-ASR-25** in the **W4A16 (GPTQ)** format. It is built for the **vLLM** engine and offers a lightweight, easy-to-deploy solution for Traditional Chinese speech recognition.

### 🚀 Model Highlights

- **Weight Quantization**: 4-bit (W4A16) weights significantly reduce VRAM usage while maintaining high accuracy (see the rough estimate below).
- **Traditional Chinese Optimization**: Fully inherits Breeze-ASR-25's strong recognition of Taiwanese colloquial expressions, professional terminology, and various dialects.
- **Architecture Optimization**: Native vLLM support, enabling high-performance Marlin acceleration kernels out of the box.
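As a back-of-envelope check on the memory claim: assuming Breeze-ASR-25 keeps the ~1.55B-parameter Whisper-large-v2 backbone (an assumption, not stated in this card), the weights alone shrink roughly as sketched below. Only `Linear` weights are quantized, so the real footprint lands between the two numbers, and activations and KV cache come on top.

```python
# Rough weight-memory estimate. ASSUMPTION: ~1.55B parameters
# (Whisper-large-v2 backbone); activations/KV cache excluded.
params = 1.55e9

# fp16: 2 bytes per weight.
fp16_gib = params * 2 / 1024**3

# W4A16 with group size 128: 4-bit weights plus one fp16 scale
# per 128-weight group, i.e. roughly 4.1 bits per weight.
w4a16_gib = params * (4 + 16 / 128) / 8 / 1024**3

print(f"fp16 weights : ~{fp16_gib:.1f} GiB")   # ~2.9 GiB
print(f"W4A16 weights: ~{w4a16_gib:.1f} GiB")  # ~0.7 GiB
```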
### 🛠️ Inference Example

It is recommended to deploy with **vLLM (>= 0.15.1)** (`pip install "vllm>=0.15.1" librosa`; `librosa` is only used to load the audio file in this example).

```python
import librosa
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="SoybeanMilk/Breeze-ASR-25-quantized.w4a16",
    max_model_len=448,
    # ...
    enforce_eager=True,
)

# Prepare input: Whisper expects 16 kHz mono audio
audio, sample_rate = librosa.load("sample.wav", sr=16000)  # replace with your audio file

prompts = {
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {"audio": (audio, sample_rate)},
    },
    "decoder_prompt": "<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>",
}

# Run inference
outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=128))
print(outputs[0].outputs[0].text)
```
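If you would rather serve the model over HTTP, recent vLLM versions expose Whisper-family models through an OpenAI-compatible `/v1/audio/transcriptions` route. A minimal sketch, assuming a server started with `vllm serve SoybeanMilk/Breeze-ASR-25-quantized.w4a16 --max-model-len 448` and a local `sample.wav` (both illustrative):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as f:  # illustrative file name
    result = client.audio.transcriptions.create(
        model="SoybeanMilk/Breeze-ASR-25-quantized.w4a16",
        file=f,
        language="zh",
    )
print(result.text)
```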
### ⚙️ Quantization Details

The model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.

- **Algorithm**: GPTQ
- **Precision**: 4-bit (W4A16)
- **Calibration Dataset**: `MLCommons/peoples_speech`
- **Group Size**: 128
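llm-compressor records this scheme in the checkpoint's `config.json` under `quantization_config` (compressed-tensors format), so the settings above can be checked directly against the published repo; a small sketch (the exact keys vary across library versions):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("SoybeanMilk/Breeze-ASR-25-quantized.w4a16")
# Serialized by llm-compressor/compressed-tensors; may be absent or
# shaped differently depending on the versions used to export.
print(getattr(config, "quantization_config", None))
```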
### 🧪 How to Replicate

Below is the script used to generate this quantized model. Ensure you have `llmcompressor` and `librosa` installed in your environment.

```python
import torch
import librosa
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from transformers import WhisperProcessor

MODEL_ID = "MediaTek-Research/Breeze-ASR-25"

processor = WhisperProcessor.from_pretrained(MODEL_ID)

# 1. Prepare calibration data (People's Speech), resampled to 16 kHz
def preprocess_fn(batch):
    audio_data = batch["audio"]["array"]
    input_features = processor(
        audio_data,
        sampling_rate=16000,
        return_tensors="pt",
    ).input_features
    return {"input_features": input_features.squeeze(0)}

dataset = load_dataset("MLCommons/peoples_speech", "clean", split="train", streaming=True)
dataset = dataset.take(256)  # use 256 samples for calibration

# 2. Configure the quantization modifier
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    # ...
    dampening_fraction=0.01,
)

# 3. Execute compression via the script's compress() helper (definition elided).
# Note: loading Whisper's convolutional layers in float32 is recommended
# to avoid dtype conflicts during calibration.
compress(
    model_id=MODEL_ID,
    dataset=dataset,
    # ...
)
```
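Once compression finishes, the output folder can be pushed to the Hub with `huggingface_hub`; a minimal sketch, where the local folder path is illustrative and the repo id matches this card:

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="Breeze-ASR-25-quantized.w4a16",  # illustrative local output dir
    repo_id="SoybeanMilk/Breeze-ASR-25-quantized.w4a16",
    repo_type="model",
)
```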
---

*This model was quantized and optimized by SoybeanMilk, based on MediaTek-Research/Breeze-ASR-25.*