# VibeVoice-ASR-HF: Selective NF4 4-bit Quantization
Selectively quantized version of microsoft/VibeVoice-ASR-HF for low-VRAM deployment.
Only the Qwen2.5-7B LLM backbone is quantized. The acoustic tokenizer encoder, semantic tokenizer encoder, projection layers, and lm_head remain in full BF16 precision, preserving diarization accuracy and transcription quality.
## Key details
| Property | Value |
|---|---|
| Base model | microsoft/VibeVoice-ASR-HF |
| Quantization | NF4 4-bit (bitsandbytes, double quantization) |
| Modules quantized | `language_model.model.layers.*` only |
| Modules in BF16 | `acoustic_tokenizer_encoder`, `semantic_tokenizer_encoder`, `acoustic_projection`, `semantic_projection`, `lm_head` |
| Model size | ~5.5 GB (down from 17.3 GB) |
| VRAM usage | ~7–8 GB |
| Transformers | >= 5.3.0 |
| bitsandbytes | >= 0.48.1 |
## Why selective quantization?
Naive 4-bit quantization of the entire model destroys diarization (all speakers collapse to SPEAKER_00) and degrades transcription quality significantly. The acoustic and semantic tokenizer encoders process raw audio signals where small numerical errors propagate catastrophically through the convolutional stages. The LLM backbone (Qwen2.5-7B) handles quantization gracefully since its weights follow a normal distribution well-suited for NF4.
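The claim that near-Gaussian weights suit NF4 can be sanity-checked in isolation: quantizing a random normal tensor with a normal-quantile codebook yields lower reconstruction error than an evenly spaced 4-bit grid. The codebook below is a simplified illustrative stand-in for the true NF4 levels from the QLoRA paper, not the bitsandbytes implementation:

```python
import torch

def normal_codebook(n=16):
    # 16 levels placed at quantiles of a standard normal, rescaled to
    # [-1, 1] -- an approximation of the NF4 codebook for illustration,
    # not the exact bitsandbytes construction.
    probs = torch.linspace(0.02, 0.98, n)
    q = torch.sqrt(torch.tensor(2.0)) * torch.erfinv(2 * probs - 1)
    return q / q.abs().max()

def quantize(x, levels):
    # Absmax scaling, then round each value to the nearest codebook level.
    scale = x.abs().max()
    idx = ((x / scale).unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return levels[idx] * scale

torch.manual_seed(0)
w = torch.randn(10_000)                      # roughly Gaussian, like LLM weights
uniform_levels = torch.linspace(-1, 1, 16)   # plain evenly spaced int4-style grid

err_nf = (quantize(w, normal_codebook()) - w).pow(2).mean()
err_uniform = (quantize(w, uniform_levels) - w).pow(2).mean()
print(float(err_nf) < float(err_uniform))    # normal-quantile levels fit better
```

Convolutional audio encoders get no such benefit: their activations and filter statistics are far less forgiving, which is why they stay in BF16 here.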
## Usage
```python
import torch
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "Dubedo/VibeVoice-ASR-HF-NF4"
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

inputs = processor.apply_transcription_request(
    audio="path/to/audio.wav",
    prompt="optional hotwords here",
).to(model.device, model.dtype)

output_ids = model.generate(**inputs)
# Drop the prompt tokens so only the newly generated tokens remain
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]

# Structured output with speaker, timestamps, text
result = processor.decode(generated_ids, return_format="parsed")[0]
```
## Quantization method
Quantized using `BitsAndBytesConfig` with `llm_int8_skip_modules` to protect the audio-critical components:
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_skip_modules=[
        "acoustic_tokenizer_encoder",
        "semantic_tokenizer_encoder",
        "acoustic_projection",
        "semantic_projection",
        "lm_head",
    ],
)
```
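The skip list is matched against each Linear layer's dotted module path, so a single entry like `acoustic_projection` protects every submodule under it. A minimal sketch of that selection rule (a simplification of the matching logic in transformers' bnb integration, shown with hypothetical module paths):

```python
# Modules whose qualified name contains a skip entry stay in BF16;
# every other Linear is replaced with a 4-bit layer. The substring test
# below is an illustrative simplification, not the library's exact code.
SKIP_MODULES = [
    "acoustic_tokenizer_encoder",
    "semantic_tokenizer_encoder",
    "acoustic_projection",
    "semantic_projection",
    "lm_head",
]

def should_quantize(qualified_name: str, skip=SKIP_MODULES) -> bool:
    """Return True if a Linear at this dotted path would be converted to NF4."""
    return not any(key in qualified_name for key in skip)

# LLM backbone layers are converted:
print(should_quantize("language_model.model.layers.0.self_attn.q_proj"))  # True
# Audio-path modules and the output head keep BF16:
print(should_quantize("acoustic_projection.proj"))  # False
print(should_quantize("lm_head"))  # False
```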
## Acknowledgments
Based on the selective quantization approach documented by FabioSarracino/VibeVoice-Large-Q8 and Enemyx-net/VibeVoice-ComfyUI, adapted for the HF-native ASR architecture in transformers 5.3.0.