# SoybeanMilk/Breeze-ASR-25-quantized.w4a16

This repository provides a high-performance, 4-bit quantized version of **MediaTek-Research/Breeze-ASR-25** in the **W4A16 (GPTQ)** format. It is built for the **vLLM** engine and offers a lightweight, easy-to-deploy solution for Traditional Chinese speech recognition.

### 🚀 Model Highlights

- **Weight Quantization**: 4-bit (W4A16) weights significantly reduce VRAM usage while maintaining high accuracy (see the rough estimate below).
- **Traditional Chinese Optimization**: Fully inherits Breeze-ASR-25's strong recognition of Taiwanese colloquial expressions, professional terminology, and various dialects.
- **Architecture Optimization**: Native vLLM support, enabling high-performance Marlin acceleration kernels out of the box.
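As a back-of-envelope check on the memory claim: assuming Breeze-ASR-25 keeps the ~1.55B-parameter Whisper-large-v2 backbone (an assumption, not stated in this card), the weights alone shrink roughly as sketched below. Only `Linear` weights are quantized, so the real footprint lands between the two numbers, and activations and KV cache come on top.

```python
# Rough weight-memory estimate. ASSUMPTION: ~1.55B parameters
# (Whisper-large-v2 backbone); activations/KV cache excluded.
params = 1.55e9

# fp16: 2 bytes per weight.
fp16_gib = params * 2 / 1024**3

# W4A16 with group size 128: 4-bit weights plus one fp16 scale
# per 128-weight group, i.e. roughly 4.1 bits per weight.
w4a16_gib = params * (4 + 16 / 128) / 8 / 1024**3

print(f"fp16 weights : ~{fp16_gib:.1f} GiB")   # ~2.9 GiB
print(f"W4A16 weights: ~{w4a16_gib:.1f} GiB")  # ~0.7 GiB
```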
### 🛠️ Inference Example

It is recommended to deploy with **vLLM (>= 0.15.1)** (`pip install "vllm>=0.15.1" librosa`; `librosa` is only used to load the audio file in this example).

```python
import librosa
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="SoybeanMilk/Breeze-ASR-25-quantized.w4a16",
    max_model_len=448,
    # ...
    enforce_eager=True,
)

# Prepare input: Whisper expects 16 kHz mono audio
audio, sample_rate = librosa.load("sample.wav", sr=16000)  # replace with your audio file

prompts = {
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {"audio": (audio, sample_rate)},
    },
    "decoder_prompt": "<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>",
}

# Run inference
outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=128))
print(outputs[0].outputs[0].text)
```
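If you would rather serve the model over HTTP, recent vLLM versions expose Whisper-family models through an OpenAI-compatible `/v1/audio/transcriptions` route. A minimal sketch, assuming a server started with `vllm serve SoybeanMilk/Breeze-ASR-25-quantized.w4a16 --max-model-len 448` and a local `sample.wav` (both illustrative):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as f:  # illustrative file name
    result = client.audio.transcriptions.create(
        model="SoybeanMilk/Breeze-ASR-25-quantized.w4a16",
        file=f,
        language="zh",
    )
print(result.text)
```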
### ⚙️ Quantization Details

The model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.

- **Algorithm**: GPTQ
- **Precision**: 4-bit (W4A16)
- **Calibration Dataset**: `MLCommons/peoples_speech`
- **Group Size**: 128
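llm-compressor records this scheme in the checkpoint's `config.json` under `quantization_config` (compressed-tensors format), so the settings above can be checked directly against the published repo; a small sketch (the exact keys vary across library versions):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("SoybeanMilk/Breeze-ASR-25-quantized.w4a16")
# Serialized by llm-compressor/compressed-tensors; may be absent or
# shaped differently depending on the versions used to export.
print(getattr(config, "quantization_config", None))
```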
### 🧪 How to Replicate

Below is the script used to generate this quantized model. Ensure you have `llmcompressor` and `librosa` installed in your environment.

```python
import torch
import librosa
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from transformers import WhisperProcessor

MODEL_ID = "MediaTek-Research/Breeze-ASR-25"

processor = WhisperProcessor.from_pretrained(MODEL_ID)

# 1. Prepare calibration data (People's Speech), resampled to 16 kHz
def preprocess_fn(batch):
    audio_data = batch["audio"]["array"]
    input_features = processor(
        audio_data,
        sampling_rate=16000,
        return_tensors="pt",
    ).input_features
    return {"input_features": input_features.squeeze(0)}

dataset = load_dataset("MLCommons/peoples_speech", "clean", split="train", streaming=True)
dataset = dataset.take(256)  # use 256 samples for calibration

# 2. Configure the quantization modifier
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    # ...
    dampening_fraction=0.01,
)

# 3. Execute compression via the script's compress() helper (definition elided).
# Note: loading Whisper's convolutional layers in float32 is recommended
# to avoid dtype conflicts during calibration.
compress(
    model_id=MODEL_ID,
    dataset=dataset,
    # ...
)
```
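Once compression finishes, the output folder can be pushed to the Hub with `huggingface_hub`; a minimal sketch, where the local folder path is illustrative and the repo id matches this card:

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="Breeze-ASR-25-quantized.w4a16",  # illustrative local output dir
    repo_id="SoybeanMilk/Breeze-ASR-25-quantized.w4a16",
    repo_type="model",
)
```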
---

*This model was quantized and optimized by SoybeanMilk, based on MediaTek-Research/Breeze-ASR-25.*