---
license: mit
base_model: microsoft/VibeVoice-ASR-HF
tags:
- automatic-speech-recognition
- vibevoice
- bitsandbytes
- 8-bit
- quantized
- diarization
language:
- multilingual
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# VibeVoice-ASR-HF — Selective INT8 8-bit Quantization

Selectively quantized version of [microsoft/VibeVoice-ASR-HF](https://huggingface.co/microsoft/VibeVoice-ASR-HF) for low-VRAM deployment.

**Only the Qwen2.5-7B LLM backbone is quantized.** The acoustic tokenizer encoder, semantic tokenizer encoder, projection layers, and `lm_head` remain in full BF16 precision, preserving diarization accuracy and transcription quality.

## Key details

| Property | Value |
|---|---|
| Base model | [microsoft/VibeVoice-ASR-HF](https://huggingface.co/microsoft/VibeVoice-ASR-HF) |
| Quantization | INT8 8-bit (bitsandbytes) |
| Modules quantized | `language_model.model.layers.*` only |
| Modules in BF16 | `acoustic_tokenizer_encoder`, `semantic_tokenizer_encoder`, `acoustic_projection`, `semantic_projection`, `lm_head` |
| Model size | ~9 GB (down from 17.3 GB) |
| VRAM usage | ~10–11 GB |
| Transformers | >= 5.3.0 |
| bitsandbytes | >= 0.48.1 |
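The ~9 GB figure is consistent with a back-of-the-envelope estimate. This is a rough sketch only: the 7.6B backbone parameter count is an assumption, and real checkpoints carry extra INT8 scale/metadata overhead.

```python
# Rough size estimate for selective INT8 quantization (assumed figures).
GB = 1e9

total_bf16_gb = 17.3                   # full-precision checkpoint size (from the card)
total_params = total_bf16_gb * GB / 2  # BF16 = 2 bytes per parameter
backbone_params = 7.6e9                # assumed Qwen2.5-7B backbone parameter count

skipped_params = total_params - backbone_params
# INT8 = 1 byte per backbone weight, BF16 = 2 bytes for the skipped modules.
est_gb = (backbone_params * 1 + skipped_params * 2) / GB

print(f"estimated quantized size: ~{est_gb:.1f} GB")
```

The estimate lands within roughly 10% of the measured ~9 GB; the gap comes from the assumed parameter split and per-tensor quantization metadata.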

## Why selective quantization?

Naive 8-bit quantization of the entire model destroys diarization (all speakers collapse to `SPEAKER_00`) and degrades transcription quality significantly. The acoustic and semantic tokenizer encoders process raw audio signals where small numerical errors propagate catastrophically through the convolutional stages. The LLM backbone (Qwen2.5-7B) handles quantization gracefully, since its weights follow a roughly normal distribution that is well-suited to INT8.
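The compounding effect can be illustrated with a toy, pure-Python experiment (not the model's actual encoders): each stage quantized with absmax INT8 contributes a sub-1% error, but the error through a deep stack of stages grows well beyond any single stage's.

```python
import random

def quantize_vec(ws):
    """Absmax per-tensor INT8 round-trip for a weight vector."""
    scale = max(abs(w) for w in ws) / 127
    return [max(-127, min(127, round(w / scale))) * scale for w in ws]

random.seed(0)
dim, depth = 64, 30          # toy width and number of stacked stages
x_exact = [1.0] * dim
x_quant = [1.0] * dim

stage_errs = []
for _ in range(depth):
    # Toy "layer": elementwise scaling by a random weight vector.
    ws = [random.gauss(1.0, 0.05) for _ in range(dim)]
    wq = quantize_vec(ws)
    stage_errs.append(max(abs(a - b) / abs(a) for a, b in zip(ws, wq)))
    x_exact = [x * w for x, w in zip(x_exact, ws)]
    x_quant = [x * w for x, w in zip(x_quant, wq)]

final_err = max(abs(a - b) / abs(a) for a, b in zip(x_exact, x_quant))
print(f"worst per-stage weight error: {max(stage_errs):.2%}")
print(f"error after {depth} stages:   {final_err:.2%}")
```

In a real convolutional tokenizer the stages mix channels rather than scale them elementwise, which amplifies the accumulation further; the LLM backbone, by contrast, is interleaved with normalization layers that keep activations in range.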

## Usage

```python
import torch
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "Dubedo/VibeVoice-ASR-HF-INT8"

processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

inputs = processor.apply_transcription_request(
    audio="path/to/audio.wav",
    prompt="optional hotwords here",
).to(model.device, model.dtype)

output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]

# Structured output with speaker, timestamps, text
result = processor.decode(generated_ids, return_format="parsed")[0]
```

## Quantization method

Quantized using `BitsAndBytesConfig` with `llm_int8_skip_modules` to protect audio-critical components:

```python
BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer_encoder",
        "semantic_tokenizer_encoder",
        "acoustic_projection",
        "semantic_projection",
        "lm_head",
    ],
)
```
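As a minimal sketch of what the skip list does, assuming the substring-matching behavior transformers applies when replacing layers (the module paths below are illustrative, not the model's exact names):

```python
def should_keep_bf16(module_name: str, skip_modules: list[str]) -> bool:
    """Mimic llm_int8_skip_modules: a module stays unquantized when any
    skip key appears in its dotted path (simplified sketch of the check)."""
    return any(key in module_name for key in skip_modules)

SKIP = [
    "acoustic_tokenizer_encoder",
    "semantic_tokenizer_encoder",
    "acoustic_projection",
    "semantic_projection",
    "lm_head",
]

# Illustrative module paths.
print(should_keep_bf16("model.acoustic_tokenizer_encoder.conv1", SKIP))      # stays BF16
print(should_keep_bf16("language_model.model.layers.0.mlp.up_proj", SKIP))   # quantized to INT8
print(should_keep_bf16("lm_head", SKIP))                                     # stays BF16
```

Because matching is by name rather than by type, one skip key protects every submodule under that path, which is why the five entries above are enough to keep all audio-critical components in BF16.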

## Acknowledgments

Based on the selective quantization approach documented by [FabioSarracino/VibeVoice-Large-Q8](https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8) and [Enemyx-net/VibeVoice-ComfyUI](https://github.com/Enemyx-net/VibeVoice-ComfyUI), adapted for the HF-native ASR architecture in transformers 5.3.0.