SoybeanMilk/Breeze-ASR-25-quantized.w4a16

This repository provides a high-performance, 4-bit quantized version of MediaTek-Research/Breeze-ASR-25, optimized using the W4A16 (GPTQ) format. It is specifically designed for the vLLM engine to provide a lightweight and easy-to-deploy solution for Traditional Chinese speech recognition.

πŸš€ Model Highlights

  • Weight Quantization: 4-bit (W4A16) quantization significantly reduces VRAM usage while maintaining high accuracy.
  • Traditional Chinese Optimization: Fully inherits Breeze-ASR-25's strong recognition of colloquial Taiwanese speech, professional terminology, and various dialects.
  • Architecture Optimization: Native support for vLLM, enabling high-performance Marlin acceleration kernels out of the box.
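To see why 4-bit weights matter for VRAM, here is a rough back-of-the-envelope estimate. It is a sketch, not a measurement: it assumes the ~0.3B parameter count reported for this model, one fp16 scale per 128-weight group, and ignores activations, the KV cache, and runtime overhead.

```python
# Rough weight-memory estimate for a ~0.3B-parameter model.
# Assumptions (not measured): fp16 baseline, 4-bit weights,
# one fp16 scale per group of 128 weights, no other overhead.
PARAMS = 0.3e9

fp16_bytes = PARAMS * 2                # 16 bits per weight
w4a16_bytes = PARAMS * 0.5             # 4 bits per weight
scale_bytes = (PARAMS / 128) * 2       # one fp16 scale per 128-weight group
total_w4a16 = w4a16_bytes + scale_bytes

print(f"fp16 weights : {fp16_bytes / 1e9:.2f} GB")
print(f"W4A16 weights: {total_w4a16 / 1e9:.2f} GB")
print(f"compression  : {fp16_bytes / total_w4a16:.1f}x")
```

The scale overhead is small (one extra fp16 value per 128 weights), so weight storage shrinks by close to the ideal 4x.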

πŸ› οΈ Inference Example

It is recommended to use vLLM (>= 0.15.1) for deployment.

import librosa
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="SoybeanMilk/Breeze-ASR-25-quantized.w4a16",
    max_model_len=448,
    limit_mm_per_prompt={"audio": 1},
    enforce_eager=True,
)

# Prepare input: vLLM expects the decoded waveform plus its sampling rate,
# so load the file as a 16 kHz mono array first.
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

prompts = {
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {"audio": (audio, sr)},
    },
    "decoder_prompt": "<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>",
}

# Run inference (greedy decoding)
outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=128))
print(outputs[0].outputs[0].text)
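The decoder prompt above is just Whisper's standard control-token prefix. A small helper (hypothetical, not part of this repository) makes it easy to change the language, task, or timestamp setting in one place; the token names follow the Whisper tokenizer conventions.

```python
# Hypothetical helper: assemble Whisper's decoder control-token prefix.
def build_decoder_prompt(language="zh", task="transcribe", timestamps=False):
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return "".join(tokens)

print(build_decoder_prompt())
# -> <|startoftranscript|><|zh|><|transcribe|><|notimestamps|>
```

Passing `timestamps=True` drops the `<|notimestamps|>` token, which asks the model to emit timestamp tokens in its output.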

βš™οΈ Quantization Details

The model was quantized using the llm-compressor library.

  • Algorithm: GPTQ
  • Precision: 4-bit (W4A16)
  • Calibration Dataset: MLCommons/peoples_speech
  • Group Size: 128

πŸ§ͺ How to Replicate

Below is the complete script used to generate this quantized model. Ensure you have llmcompressor, datasets, and librosa installed in your environment.

import torch
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from transformers import AutoProcessor, WhisperForConditionalGeneration

MODEL_ID = "MediaTek-Research/Breeze-ASR-25"

# Load the base model and its processor.
# Note: loading Whisper in float32 is recommended to avoid dtype
# conflicts in its convolutional layers during calibration.
model = WhisperForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float32, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# 1. Prepare Calibration Data (Peoples Speech)
def preprocess_fn(batch):
    audio_data = batch["audio"]["array"]
    input_features = processor(
        audio_data,
        sampling_rate=16000,
        return_tensors="pt",
    ).input_features
    return {"input_features": input_features.squeeze(0)}

dataset = load_dataset("MLCommons/peoples_speech", "clean", split="train", streaming=True)
dataset = dataset.take(256)           # use 256 samples for calibration
dataset = dataset.map(preprocess_fn)  # compute log-mel features lazily

# 2. Configure Quantization Modifier
# The "W4A16" scheme is grouped 4-bit weight quantization (group size 128).
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

# 3. Execute Compression
oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    output_dir="Breeze-ASR-25-quantized.w4a16",
    max_seq_length=448,
    num_calibration_samples=256,
)

This model was quantized and optimized by SoybeanMilk based on MediaTek-Research/Breeze-ASR-25.
