# VibeVoice-ASR-HF: Selective NF4 4-bit Quantization

Selectively quantized version of `microsoft/VibeVoice-ASR-HF` for low-VRAM deployment.

Only the Qwen2.5-7B LLM backbone is quantized. The acoustic tokenizer encoder, semantic tokenizer encoder, projection layers, and `lm_head` remain in full BF16 precision, preserving diarization accuracy and transcription quality.

## Key details

| | |
|:--|:--|
| Base model | `microsoft/VibeVoice-ASR-HF` |
| Quantization | NF4 4-bit (bitsandbytes, double quantization) |
| Modules quantized | `language_model.model.layers.*` only |
| Modules kept in BF16 | `acoustic_tokenizer_encoder`, `semantic_tokenizer_encoder`, `acoustic_projection`, `semantic_projection`, `lm_head` |
| Model size | ~5.5 GB (down from 17.3 GB) |
| VRAM usage | ~7–8 GB |
| transformers | >= 5.3.0 |
| bitsandbytes | >= 0.48.1 |
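
The size figure above can be sanity-checked after loading. A minimal sketch, using the model class and repo id from the Usage section below:

```python
import torch
from transformers import VibeVoiceAsrForConditionalGeneration

model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
    "Dubedo/VibeVoice-ASR-HF-NF4",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# get_memory_footprint() sums parameter and buffer sizes;
# expect roughly 5.5 GB of weights (peak VRAM at inference is higher, ~7-8 GB)
print(f"Weight footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```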

## Why selective quantization?

Naive 4-bit quantization of the entire model destroys diarization (all speakers collapse to SPEAKER_00) and degrades transcription quality significantly. The acoustic and semantic tokenizer encoders process raw audio signals where small numerical errors propagate catastrophically through the convolutional stages. The LLM backbone (Qwen2.5-7B) handles quantization gracefully since its weights follow a normal distribution well-suited for NF4.
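
One way to confirm the split is to inspect the loaded model: the backbone's Linear layers are replaced by bitsandbytes `Linear4bit` modules, while the skipped components keep plain BF16 `nn.Linear` layers. A rough sketch, assuming `model` was loaded as in the Usage section below:

```python
import torch
import bitsandbytes as bnb

# Tally quantized vs. full-precision Linear layers.
# Linear4bit subclasses nn.Linear, so it must be checked first.
quantized, full_precision = 0, 0
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        quantized += 1        # NF4-quantized (LLM backbone)
    elif isinstance(module, torch.nn.Linear):
        full_precision += 1   # BF16 (audio encoders, projections, lm_head)

print(f"Linear4bit: {quantized}, full-precision Linear: {full_precision}")
```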

## Usage

```python
import torch
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "Dubedo/VibeVoice-ASR-HF-NF4"

processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

inputs = processor.apply_transcription_request(
    audio="path/to/audio.wav",
    prompt="optional hotwords here",
).to(model.device, model.dtype)

output_ids = model.generate(**inputs)
# Strip the prompt tokens so only newly generated tokens are decoded
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]

# Structured output with speaker, timestamps, text
result = processor.decode(generated_ids, return_format="parsed")[0]
```
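
The exact schema of the parsed result isn't documented here; as an illustrative sketch, assuming each segment exposes speaker, start/end, and text fields (hypothetical names):

```python
# Hypothetical field names -- the actual parsed schema may differ
for segment in result:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] "
          f"{segment['speaker']}: {segment['text']}")
```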

## Quantization method

Quantized using `BitsAndBytesConfig` with `llm_int8_skip_modules` to protect the audio-critical components:

```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    # These modules are kept out of 4-bit and stay in BF16
    llm_int8_skip_modules=[
        "acoustic_tokenizer_encoder",
        "semantic_tokenizer_encoder",
        "acoustic_projection",
        "semantic_projection",
        "lm_head",
    ],
)
```
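
For reproducibility, a sketch of how the quantized checkpoint can be produced and serialized under this config (the actual export script isn't part of this card; repo ids as used above):

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

# Load the full-precision base model with on-the-fly NF4 quantization
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
    "microsoft/VibeVoice-ASR-HF",
    quantization_config=quantization_config,  # the config defined above
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("microsoft/VibeVoice-ASR-HF")

# Serialize the quantized weights; 4-bit serialization is supported by
# recent transformers + bitsandbytes versions
model.save_pretrained("VibeVoice-ASR-HF-NF4")
processor.save_pretrained("VibeVoice-ASR-HF-NF4")
```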

## Acknowledgments

Based on the selective quantization approach documented by FabioSarracino/VibeVoice-Large-Q8 and Enemyx-net/VibeVoice-ComfyUI, adapted for the HF-native ASR architecture in transformers 5.3.0.
