SoybeanMilk/Breeze-ASR-25-quantized.w4a16

This repository provides a high-performance, 4-bit quantized version of MediaTek-Research/Breeze-ASR-25, optimized using the W4A16 (GPTQ) format. It is specifically designed for the vLLM engine to provide a lightweight and easy-to-deploy solution for Traditional Chinese speech recognition.

πŸš€ Model Highlights

  • Weight Quantization: 4-bit (W4A16) quantization significantly reduces VRAM usage while maintaining high accuracy.
  • Traditional Chinese Optimization: Fully inherits Breeze-ASR-25's strong recognition of colloquial Taiwanese speech, professional terminology, and various dialects.
  • Architecture Optimization: Native support for vLLM, enabling high-performance Marlin acceleration kernels out of the box.
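To see why 4-bit weights matter for VRAM, here is a rough back-of-the-envelope estimate. It is a sketch, not a measurement: it assumes the ~0.3B parameter count reported for this model, one fp16 scale per 128-weight group, and ignores activations, the KV cache, and runtime overhead.

```python
# Rough weight-memory estimate for a ~0.3B-parameter model.
# Assumptions (not measured): fp16 baseline, 4-bit weights,
# one fp16 scale per group of 128 weights, no other overhead.
PARAMS = 0.3e9

fp16_bytes = PARAMS * 2                # 16 bits per weight
w4a16_bytes = PARAMS * 0.5             # 4 bits per weight
scale_bytes = (PARAMS / 128) * 2       # one fp16 scale per 128-weight group
total_w4a16 = w4a16_bytes + scale_bytes

print(f"fp16 weights : {fp16_bytes / 1e9:.2f} GB")
print(f"W4A16 weights: {total_w4a16 / 1e9:.2f} GB")
print(f"compression  : {fp16_bytes / total_w4a16:.1f}x")
```

The scale overhead is small (one extra fp16 value per 128 weights), so weight storage shrinks by close to the ideal 4x.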

πŸ› οΈ Inference Example

It is recommended to use vLLM (>= 0.15.1) for deployment.

import librosa
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="SoybeanMilk/Breeze-ASR-25-quantized.w4a16",
    max_model_len=448,
    limit_mm_per_prompt={"audio": 1},
    enforce_eager=True,
)

# Prepare input: vLLM expects the decoded waveform plus its sampling rate,
# so load the file as a 16 kHz mono array first.
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

prompts = {
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {"audio": (audio, sr)},
    },
    "decoder_prompt": "<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>",
}

# Run inference (greedy decoding)
outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=128))
print(outputs[0].outputs[0].text)
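The decoder prompt above is just Whisper's standard control-token prefix. A small helper (hypothetical, not part of this repository) makes it easy to change the language, task, or timestamp setting in one place; the token names follow the Whisper tokenizer conventions.

```python
# Hypothetical helper: assemble Whisper's decoder control-token prefix.
def build_decoder_prompt(language="zh", task="transcribe", timestamps=False):
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return "".join(tokens)

print(build_decoder_prompt())
# -> <|startoftranscript|><|zh|><|transcribe|><|notimestamps|>
```

Passing `timestamps=True` drops the `<|notimestamps|>` token, which asks the model to emit timestamp tokens in its output.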

βš™οΈ Quantization Details

The model was quantized using the llm-compressor library.

  • Algorithm: GPTQ
  • Precision: 4-bit (W4A16)
  • Calibration Dataset: MLCommons/peoples_speech
  • Group Size: 128

πŸ§ͺ How to Replicate

Below is the complete script used to generate this quantized model. Ensure you have llmcompressor, datasets, and librosa installed in your environment.

import torch
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from transformers import AutoProcessor, WhisperForConditionalGeneration

MODEL_ID = "MediaTek-Research/Breeze-ASR-25"

# Load the base model and its processor.
# Note: loading Whisper in float32 is recommended to avoid dtype
# conflicts in its convolutional layers during calibration.
model = WhisperForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float32, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# 1. Prepare Calibration Data (Peoples Speech)
def preprocess_fn(batch):
    audio_data = batch["audio"]["array"]
    input_features = processor(
        audio_data,
        sampling_rate=16000,
        return_tensors="pt",
    ).input_features
    return {"input_features": input_features.squeeze(0)}

dataset = load_dataset("MLCommons/peoples_speech", "clean", split="train", streaming=True)
dataset = dataset.take(256)           # use 256 samples for calibration
dataset = dataset.map(preprocess_fn)  # compute log-mel features lazily

# 2. Configure Quantization Modifier
# The "W4A16" scheme is grouped 4-bit weight quantization (group size 128).
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

# 3. Execute Compression
oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    output_dir="Breeze-ASR-25-quantized.w4a16",
    max_seq_length=448,
    num_calibration_samples=256,
)

This model was quantized and optimized by SoybeanMilk based on MediaTek-Research/Breeze-ASR-25.
