---
license: mit
base_model: microsoft/VibeVoice-ASR-HF
tags:
- automatic-speech-recognition
- vibevoice
- bitsandbytes
- 4-bit
- quantized
- diarization
language:
- multilingual
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# VibeVoice-ASR-HF — Selective NF4 4-bit Quantization

A selectively quantized version of [microsoft/VibeVoice-ASR-HF](https://huggingface.co/microsoft/VibeVoice-ASR-HF) for low-VRAM deployment. **Only the Qwen2.5-7B LLM backbone is quantized.** The acoustic tokenizer encoder, semantic tokenizer encoder, projection layers, and `lm_head` remain in full BF16 precision, preserving diarization accuracy and transcription quality.

## Key details

| | |
|---|---|
| Base model | [microsoft/VibeVoice-ASR-HF](https://huggingface.co/microsoft/VibeVoice-ASR-HF) |
| Quantization | NF4 4-bit (bitsandbytes, double quantization) |
| Modules quantized | `language_model.model.layers.*` only |
| Modules in BF16 | `acoustic_tokenizer_encoder`, `semantic_tokenizer_encoder`, `acoustic_projection`, `semantic_projection`, `lm_head` |
| Model size | ~5.5 GB (down from 17.3 GB) |
| VRAM usage | ~7–8 GB |
| Transformers | >= 5.3.0 |
| bitsandbytes | >= 0.48.1 |

## Why selective quantization?

Naive 4-bit quantization of the entire model destroys diarization (all speakers collapse to `SPEAKER_00`) and significantly degrades transcription quality. The acoustic and semantic tokenizer encoders process raw audio signals, where small numerical errors propagate catastrophically through the convolutional stages. The LLM backbone (Qwen2.5-7B) handles quantization gracefully, since its weights follow a roughly normal distribution well suited to NF4.
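For intuition on the size reduction: NF4 stores 4 bits per weight plus a small per-block scale, and double quantization compresses those scales further. The block sizes below are bitsandbytes defaults and the byte counts are back-of-envelope assumptions, not measurements from this checkpoint:

```python
# Approximate bytes per parameter: BF16 vs. NF4 with double quantization.
# Block sizes are bitsandbytes defaults (assumed): 64 weights per absmax,
# absmax values re-quantized to 8 bits in blocks of 256.
BF16_BYTES = 2.0                 # 16 bits per weight
nf4_weight = 4 / 8               # 4 bits per weight
absmax_l1 = 1 / 64               # 8-bit absmax per 64-weight block
absmax_l2 = 4 / (64 * 256)       # fp32 second-level absmax per 256 blocks
nf4_total = nf4_weight + absmax_l1 + absmax_l2

ratio = BF16_BYTES / nf4_total
print(f"NF4 bytes/param: {nf4_total:.4f}, compression vs BF16: {ratio:.1f}x")
# → NF4 bytes/param: 0.5159, compression vs BF16: 3.9x
```

This ~3.9x compression applies only to the quantized `language_model.model.layers.*`; the BF16 encoders, projections, and `lm_head` are stored at full size, which is why the overall checkpoint shrinks by roughly 3x rather than 4x.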
## Usage

```python
import torch
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "Dubedo/VibeVoice-ASR-HF-NF4"

processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

inputs = processor.apply_transcription_request(
    audio="path/to/audio.wav",
    prompt="optional hotwords here",
).to(model.device, model.dtype)

output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]

# Structured output with speaker, timestamps, text
result = processor.decode(generated_ids, return_format="parsed")[0]
```

## Quantization method

Quantized using `BitsAndBytesConfig` with `llm_int8_skip_modules` to protect the audio-critical components:

```python
BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_skip_modules=[
        "acoustic_tokenizer_encoder",
        "semantic_tokenizer_encoder",
        "acoustic_projection",
        "semantic_projection",
        "lm_head",
    ],
)
```

## Acknowledgments

Based on the selective quantization approach documented by [FabioSarracino/VibeVoice-Large-Q8](https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8) and [Enemyx-net/VibeVoice-ComfyUI](https://github.com/Enemyx-net/VibeVoice-ComfyUI), adapted for the HF-native ASR architecture in transformers 5.3.0.
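## Verifying the quantization

To confirm the skip list took effect after loading, you can count how many linear layers bitsandbytes replaced with its `Linear4bit` module versus how many stayed as plain `Linear`. The helper below is a hypothetical sketch (the function name is ours, and matching on class names rather than importing `bitsandbytes` classes is an approximation):

```python
# Hypothetical helper: tally quantized vs. full-precision linear layers.
# Linear4bit is the module class bitsandbytes substitutes for quantized
# layers; matching on the class name keeps this sketch dependency-free.
def summarize_quantization(model):
    counts = {"Linear4bit": 0, "Linear": 0}
    for name, module in model.named_modules():
        cls = type(module).__name__
        if cls in counts:
            counts[cls] += 1
    return counts
```

Run `summarize_quantization(model)` after the `from_pretrained` call in the Usage snippet; on this checkpoint, 4-bit layers should appear only under `language_model.model.layers.*`, with everything else reported as plain `Linear`.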