---
license: mit
base_model: microsoft/VibeVoice-ASR
tags:
  - automatic-speech-recognition
  - vibevoice
  - bitsandbytes
  - 8-bit
  - int8
  - quantized
  - diarization
  - multilingual
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# VibeVoice-ASR — Selective INT8 Quantization

Selectively quantized version of microsoft/VibeVoice-ASR for low-VRAM deployment.

Only the Qwen2.5-7B LLM backbone is quantized to INT8. Audio tokenizers, connectors, and lm_head remain in full BF16 precision — preserving diarization accuracy and transcription quality.

⚠️ This model uses the standalone `vibevoice` package (`pip install git+https://github.com/microsoft/VibeVoice.git`), NOT the HF-native `transformers >= 5.3.0` variant. It requires `transformers == 4.57.3`.

## Key details

|                    |                                                                                                                  |
|--------------------|------------------------------------------------------------------------------------------------------------------|
| Base model         | microsoft/VibeVoice-ASR                                                                                            |
| Quantization       | INT8 (bitsandbytes `Linear8bitLt`)                                                                                 |
| Modules quantized  | `model.language_model.model.layers.*` (196 layers)                                                                 |
| Modules in BF16    | `acoustic_tokenizer`, `semantic_tokenizer`, `acoustic_connector`, `semantic_connector`, `lm_head` (161 layers)     |
| Model size         | ~9.2 GB (down from 17.3 GB)                                                                                        |
| Peak VRAM          | ~12.5 GB (including inference activations)                                                                         |
| Transformers       | `== 4.57.3`                                                                                                        |
| bitsandbytes       | `>= 0.48.1`                                                                                                        |
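The size figures above can be sanity-checked with rough arithmetic. The parameter counts below are illustrative assumptions, not exact numbers read from the checkpoint:

```python
# Back-of-the-envelope size estimate (illustrative figures, not exact
# parameter counts from the checkpoint).
llm_params = 7.6e9      # assumed size of the Qwen2.5-7B backbone, incl. embeddings
other_params = 1.0e9    # assumed audio tokenizers + connectors + lm_head

bf16_bytes = 2          # BF16: 2 bytes per parameter
int8_bytes = 1          # INT8: 1 byte per parameter

full_bf16_gb = (llm_params + other_params) * bf16_bytes / 1e9
selective_gb = llm_params * int8_bytes / 1e9 + other_params * bf16_bytes / 1e9

print(f"full BF16:  ~{full_bf16_gb:.1f} GB")   # in the ballpark of 17.3 GB
print(f"selective:  ~{selective_gb:.1f} GB")   # in the ballpark of 9.2 GB
```

The estimate lands close to the measured 17.3 GB → 9.2 GB reduction; the remainder is accounted for by quantization metadata and non-linear layers.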

## Why selective quantization?

Naive INT8 quantization of the entire model produces [Unintelligible Speech] — the model detects speech boundaries but cannot decode content. The acoustic and semantic tokenizer encoders process raw audio signals where quantization errors propagate catastrophically. The LLM backbone (Qwen2.5-7B) handles INT8 quantization gracefully.

**Critical discovery:** the standalone `vibevoice` package uses different module names than the HF-native variant. The correct skip list for the standalone model is:

| Standalone (this model) | HF-native (won't work here)  |
|-------------------------|------------------------------|
| `acoustic_tokenizer`    | `acoustic_tokenizer_encoder` |
| `semantic_tokenizer`    | `semantic_tokenizer_encoder` |
| `acoustic_connector`    | `acoustic_projection`        |
| `semantic_connector`    | `semantic_projection`        |

Using the HF-native names with the standalone package silently quantizes audio-critical modules, producing garbage output.
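The failure mode is easy to see with a simplified stand-in for the skip-module matching (bitsandbytes effectively substring-matches skip entries against dotted module names; the module names below are illustrative, not an exhaustive list from the real model):

```python
STANDALONE_SKIP = ["acoustic_tokenizer", "semantic_tokenizer",
                   "acoustic_connector", "semantic_connector", "lm_head"]
HF_NATIVE_SKIP = ["acoustic_tokenizer_encoder", "semantic_tokenizer_encoder",
                  "acoustic_projection", "semantic_projection", "lm_head"]

def skipped(name, skip_list):
    # Simplified approximation of bitsandbytes' llm_int8_skip_modules
    # matching: a module is kept in BF16 if any skip entry occurs in
    # its dotted name.
    return any(entry in name for entry in skip_list)

# Illustrative module names in the style of the standalone package:
audio_module = "model.acoustic_tokenizer.encoder.conv1"
llm_module = "model.language_model.model.layers.0.self_attn.q_proj"

print(skipped(audio_module, STANDALONE_SKIP))  # kept in BF16, as intended
print(skipped(audio_module, HF_NATIVE_SKIP))   # NOT matched — silently quantized
print(skipped(llm_module, STANDALONE_SKIP))    # quantized to INT8, as intended
```

Because `acoustic_tokenizer_encoder` never appears in the standalone module path, the HF-native skip list matches nothing audio-related and the tokenizers get quantized without any warning.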

## Usage

```python
import torch
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

model_id = "Dubedo/VibeVoice-ASR-INT8"

# Load processor (no preprocessor_config.json — default ratio=3200 is correct)
processor = VibeVoiceASRProcessor.from_pretrained(
    model_id,
    language_model_pretrained_name="Qwen/Qwen2.5-7B",
)

# Load quantized model
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

# Transcribe
inputs = processor(
    audio=["path/to/audio.wav"],
    sampling_rate=None,
    return_tensors="pt",
    padding=True,
    add_generation_prompt=True,
)
inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        pad_token_id=processor.pad_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        do_sample=False,
    )

input_length = inputs["input_ids"].shape[1]
generated_ids = output_ids[0, input_length:]
text = processor.decode(generated_ids, skip_special_tokens=True)
segments = processor.post_process_transcription(text)
```
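The exact structure that `post_process_transcription` returns is not documented here; the sketch below assumes a list of dicts with hypothetical `start`, `end`, `speaker`, and `text` keys, purely to show one way such diarized segments could be rendered as SRT subtitles:

```python
def to_srt(segments):
    """Render segments (assumed format: dicts with 'start'/'end' in
    seconds, plus 'speaker' and 'text') as SRT subtitle text."""
    def ts(seconds):
        # SRT timestamps use HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n"
            f"[{seg['speaker']}] {seg['text']}\n"
        )
    return "\n".join(blocks)

example = [{"start": 0.0, "end": 2.5, "speaker": "S1", "text": "Hello."}]
print(to_srt(example))
```

Adapt the key names to whatever the processor actually returns in your version of the package.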

## Quantization method

Quantized on an NVIDIA L4 (22 GB) using the standalone `vibevoice` package with `BitsAndBytesConfig`:

```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)

model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    "microsoft/VibeVoice-ASR",
    quantization_config=quantization_config,
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```

## Important notes

- Do **NOT** create a `preprocessor_config.json` — the standalone processor's default fallback sets `speech_tok_compress_ratio=3200`, which is correct. Creating one with `ratio=320` causes a 10x mask shape mismatch and an `IndexError`.
- Requires `bitsandbytes >= 0.48.1` — v0.48.0 has a confirmed critical bug that breaks INT8 quantization.
- INT8 models cannot be moved between CPU and GPU — use a delete-and-reload pattern for VRAM management.
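The 10x mismatch in the first note follows from simple arithmetic: the compress ratio divides the number of audio samples to get the number of speech tokens (and hence mask positions). The 24 kHz sampling rate below is an assumption for illustration only:

```python
sample_rate = 24_000   # assumed input rate, for illustration only
duration_s = 30
samples = sample_rate * duration_s

correct_ratio = 3200   # standalone processor's default fallback
wrong_ratio = 320      # value from a mistakenly created preprocessor_config.json

correct_tokens = samples // correct_ratio   # 225 speech-token positions
wrong_tokens = samples // wrong_ratio       # 2250 positions — 10x too many

print(correct_tokens, wrong_tokens, wrong_tokens // correct_tokens)
# → 225 2250 10
```

The model produces masks sized for the correct token count, so the 10x-longer position index runs off the end of the mask, which is where the `IndexError` comes from.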

## Acknowledgments

Based on microsoft/VibeVoice-ASR. Built for the Dubedo AI video dubbing platform.