SoybeanMilk committed
Commit 8524653 · verified · 1 Parent(s): 078997d

Upload README.md with huggingface_hub

Files changed (1): README.md (+25 −25)
README.md CHANGED
 
# SoybeanMilk/Breeze-ASR-25-quantized.w4a16

This repository provides a high-performance, 4-bit quantized version of **MediaTek-Research/Breeze-ASR-25**, optimized using the **W4A16 (GPTQ)** format. It is specifically designed for the **vLLM** engine to provide a lightweight, easy-to-deploy solution for Traditional Chinese speech recognition.

### 🚀 Model Highlights
- **Weight Quantization**: 4-bit (W4A16) quantization significantly reduces VRAM usage while maintaining high accuracy (see the footprint sketch below).
- **Traditional Chinese Optimization**: Fully inherits Breeze-ASR-25's recognition of Taiwanese colloquial speech, professional terminology, and a range of dialects.
- **Architecture Optimization**: Native support for vLLM, enabling the high-performance Marlin acceleration kernels out of the box.

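To put the VRAM claim in perspective, here is a back-of-the-envelope weight-memory estimate for fp16 versus W4A16 with group size 128. The ~1.5B parameter count is an assumption (Breeze-ASR-25 is Whisper-large-v2 based), so treat the numbers as illustrative rather than measured:

```python
# Rough weight footprint: fp16 vs. 4-bit weights with one fp16 scale
# per 128-weight group. Parameter count is an assumption, not measured.
params = 1.5e9
fp16_gb = params * 2 / 1e9            # 2 bytes per fp16 weight
w4_gb = params * 0.5 / 1e9            # 4 bits = 0.5 bytes per packed weight
scales_gb = params / 128 * 2 / 1e9    # one fp16 scale per group of 128
print(f"fp16 weights : {fp16_gb:.2f} GB")             # ~3.00 GB
print(f"W4A16 weights: {w4_gb + scales_gb:.2f} GB")   # ~0.77 GB
```
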
### 🛠️ Inference Example
It is recommended to use **vLLM (>= 0.15.1)** for deployment.

```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="SoybeanMilk/Breeze-ASR-25-quantized.w4a16",
    max_model_len=448,
    # ... (argument elided in this diff view)
    enforce_eager=True
)

# Prepare input (16 kHz audio)
prompts = {
    "encoder_prompt": {
        "prompt": "",
        # ... (audio payload elided in this diff view; see the sketch below)
    },
    # Whisper task tokens: transcribe Chinese, without timestamps
    "decoder_prompt": "<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>",
}

# Run inference
outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=128))
print(outputs[0].outputs[0].text)
```

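The encoder prompt needs the actual audio payload, which is elided in this diff view. Below is a minimal sketch of wiring in a local file, assuming vLLM's Whisper-style audio input as a `(waveform, sampling_rate)` tuple under `multi_modal_data`; `sample.wav` is a placeholder path:

```python
import librosa

# Load and resample a local clip to the 16 kHz Whisper-family models expect.
waveform, sr = librosa.load("sample.wav", sr=16000)  # placeholder path

prompts = {
    "encoder_prompt": {
        "prompt": "",
        # Assumption: audio is passed as a (waveform, sampling_rate) tuple.
        "multi_modal_data": {"audio": (waveform, sr)},
    },
    "decoder_prompt": "<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>",
}
```
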
### ⚙️ Quantization Details
The model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
- **Algorithm**: GPTQ
- **Precision**: 4-bit (W4A16)
- **Calibration Dataset**: `MLCommons/peoples_speech`
- **Group Size**: 128

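To confirm the scheme a downloaded checkpoint actually carries, you can inspect its `config.json`, where llm-compressor records the quantization parameters. A small sketch, assuming the standard `quantization_config` field is present:

```python
import json
from huggingface_hub import hf_hub_download

# Fetch only the config file and print the recorded quantization scheme.
cfg_path = hf_hub_download("SoybeanMilk/Breeze-ASR-25-quantized.w4a16", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)
print(json.dumps(cfg.get("quantization_config", {}), indent=2))
```
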
### 🧪 How to Replicate
Below is the script used to generate this quantized model. Ensure you have `llmcompressor` and `librosa` installed in your environment.

```python
import torch
# ... (remaining imports and processor setup elided in this diff view)
import librosa

MODEL_ID = "MediaTek-Research/Breeze-ASR-25"

# 1. Prepare Calibration Data (Peoples Speech)
def preprocess_fn(batch):
    # Download and resample to 16 kHz
    audio_data = batch["audio"]["array"]
    input_features = processor(
        audio_data,
        # ... (remaining processor arguments elided in this diff view)
    )
    return {"input_features": input_features.squeeze(0)}

dataset = load_dataset("MLCommons/peoples_speech", "clean", split="train", streaming=True)
dataset = dataset.take(256)  # Use 256 samples for calibration

# 2. Configure Quantization Modifier
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    # ... (argument elided in this diff view)
    dampening_fraction=0.01,
)

# 3. Execute Compression
# Note: Loading Whisper convolutional layers in float32 is recommended
# to avoid dtype conflicts during calibration.
compress(
    model_id=MODEL_ID,
    dataset=dataset,
    # ... (remaining arguments elided in this diff view)
)
```
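
For online serving rather than offline batch transcription, recent vLLM builds expose an OpenAI-compatible transcription route for Whisper-family models. A hedged sketch, assuming a server launched with `vllm serve SoybeanMilk/Breeze-ASR-25-quantized.w4a16` and a build that supports `/v1/audio/transcriptions`; `sample.wav` is a placeholder:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (no real key needed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as f:  # placeholder path
    result = client.audio.transcriptions.create(
        model="SoybeanMilk/Breeze-ASR-25-quantized.w4a16",
        file=f,
        language="zh",
    )
print(result.text)
```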

---
*This model was quantized and optimized by SoybeanMilk, based on MediaTek-Research/Breeze-ASR-25.*