# SoybeanMilk/Breeze-ASR-25-quantized.w4a16
This repository provides a high-performance, 4-bit quantized version of MediaTek-Research/Breeze-ASR-25, optimized using the W4A16 (GPTQ) format. It is specifically designed for the vLLM engine to provide a lightweight and easy-to-deploy solution for Traditional Chinese speech recognition.
## 🚀 Model Highlights
- Weight Quantization: 4-bit (W4A16) quantization significantly reduces VRAM usage while maintaining high accuracy.
- Traditional Chinese Optimization: Fully inherits Breeze-ASR-25's strong recognition of Taiwanese colloquial expressions, domain terminology, and regional dialects.
- Architecture Optimization: Native support for vLLM, enabling high-performance Marlin acceleration kernels out of the box.
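As a rough back-of-the-envelope check of the VRAM claim, the sketch below compares weight storage at fp16 versus W4A16. The ~1.55B parameter count (the scale of whisper-large-v2, which Breeze-ASR-25 derives from) and the one-fp16-scale-plus-zero-point-per-group overhead are illustrative assumptions, not measured numbers.

```python
# Back-of-envelope weight-memory estimate (weights only; excludes activations and KV cache).
# ASSUMPTIONS: ~1.55B parameters; the quantized format stores one fp16 scale and one
# fp16 zero-point per group of 128 weights.
def fp16_gib(n_params: float) -> float:
    return n_params * 2 / 2**30          # 2 bytes per weight

def w4a16_gib(n_params: float, group_size: int = 128) -> float:
    packed = n_params * 4 / 8            # 4 bits per weight, bit-packed
    scales = n_params / group_size * 2   # one fp16 scale per group
    zeros = n_params / group_size * 2    # one fp16 zero-point per group
    return (packed + scales + zeros) / 2**30

N = 1.55e9
print(f"fp16 weights: {fp16_gib(N):.2f} GiB -> W4A16: {w4a16_gib(N):.2f} GiB")
```

Under these assumptions, weight memory drops roughly 3.8×; actual end-to-end VRAM savings are smaller once activations and the KV cache are included.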
## 🛠️ Inference Example
It is recommended to use vLLM (>= 0.15.1) for deployment.
```python
from vllm import LLM, SamplingParams
import librosa

# Initialize the model
llm = LLM(
    model="SoybeanMilk/Breeze-ASR-25-quantized.w4a16",
    max_model_len=448,
    limit_mm_per_prompt={"audio": 1},
    enforce_eager=True,
)

# Prepare input: vLLM expects the audio as a (waveform, sampling_rate) tuple,
# so decode the file to a 16 kHz mono array first.
audio, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)

prompts = {
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {"audio": (audio, sr)},
    },
    "decoder_prompt": "<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>",
}

# Run inference
outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=128))
print(outputs[0].outputs[0].text)
```
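Whisper-family models expect 16 kHz mono input. If your audio is at another sample rate, resample it before building the prompt. The helper below is a minimal numpy-only sketch (the function name is hypothetical; in practice a proper resampler such as `librosa.load(..., sr=16000)` or torchaudio is preferable):

```python
import numpy as np

def resample_to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Linear-interpolation resample to 16 kHz (toy sketch, not anti-aliased)."""
    if orig_sr == 16000:
        return audio.astype(np.float32)
    n_out = int(round(len(audio) * 16000 / orig_sr))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio).astype(np.float32)

# Usage with the prompt dict above:
#   multi_modal_data={"audio": (resample_to_16k(raw_waveform, orig_sr), 16000)}
```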
## ⚙️ Quantization Details
The model was quantized using the llm-compressor library.
- Algorithm: GPTQ
- Precision: 4-bit (W4A16)
- Calibration Dataset: MLCommons/peoples_speech
- Group Size: 128
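To make "group size 128" concrete, here is a toy sketch of asymmetric 4-bit group quantization: each group of 128 weights shares one scale and one zero-point. GPTQ itself does considerably more (it uses calibration data to update remaining weights and minimize layer output error); this only illustrates the storage scheme, and all names here are illustrative.

```python
import numpy as np

def quantize_w4_group(w: np.ndarray, group_size: int = 128):
    """Toy asymmetric 4-bit group quantization: one scale/zero-point per group."""
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0             # 4 bits -> integer levels 0..15
    scale = np.where(scale == 0, 1.0, scale)   # guard all-constant groups
    q = np.clip(np.round((groups - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize(q: np.ndarray, scale: np.ndarray, w_min: np.ndarray) -> np.ndarray:
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)   # stand-in for one weight row
q, scale, zero = quantize_w4_group(w)
w_hat = dequantize(q, scale, zero).ravel()
print(f"max abs reconstruction error: {np.abs(w - w_hat).max():.3f}")
```

The per-element error is bounded by half a quantization step (scale / 2) in each group, which is why keeping groups small keeps accuracy high.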
## 🧪 How to Replicate
Below is the complete script used to generate this quantized model. Ensure you have llmcompressor and librosa installed in your environment.
```python
import torch
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from transformers import AutoProcessor, WhisperForConditionalGeneration

MODEL_ID = "MediaTek-Research/Breeze-ASR-25"

# Load in float32: keeping Whisper's convolutional layers in float32
# avoids dtype conflicts during calibration.
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=torch.float32)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# 1. Prepare Calibration Data (Peoples Speech)
def preprocess_fn(batch):
    audio_data = batch["audio"]["array"]
    input_features = processor(
        audio_data,
        sampling_rate=16000,
        return_tensors="pt",
    ).input_features
    return {"input_features": input_features.squeeze(0)}

dataset = load_dataset("MLCommons/peoples_speech", "clean", split="train", streaming=True)
dataset = dataset.take(256)  # Use 256 samples for calibration
dataset = dataset.map(preprocess_fn)

# 2. Configure Quantization Modifier
# The W4A16 scheme quantizes weights to 4 bits with group size 128.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

# 3. Execute Compression
oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    output_dir="Breeze-ASR-25-quantized.w4a16",
    max_seq_length=448,
    num_calibration_samples=256,
)
```
This model was quantized and optimized by SoybeanMilk based on MediaTek-Research/Breeze-ASR-25.
## Model Tree
- Base model: openai/whisper-large-v2
- Fine-tuned: MediaTek-Research/Breeze-ASR-25
- Quantized: SoybeanMilk/Breeze-ASR-25-quantized.w4a16