---
license: mit
base_model: microsoft/VibeVoice-ASR
tags:
- automatic-speech-recognition
- vibevoice
- bitsandbytes
- 8-bit
- int8
- quantized
- diarization
- multilingual
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# VibeVoice-ASR — Selective INT8 Quantization

Selectively quantized version of [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) for low-VRAM deployment.

**Only the Qwen2.5-7B LLM backbone is quantized to INT8.** Audio tokenizers, connectors, and `lm_head` remain in full BF16 precision — preserving diarization accuracy and transcription quality.

> ⚠️ This model uses the **standalone** `vibevoice` package (`pip install git+https://github.com/microsoft/VibeVoice.git`), NOT the HF-native `transformers >= 5.3.0` variant. It requires `transformers == 4.57.3`.

## Key details

| | |
|---|---|
| Base model | [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) |
| Quantization | INT8 (bitsandbytes `Linear8bitLt`) |
| Modules quantized | `model.language_model.model.layers.*` (196 layers) |
| Modules in BF16 | `acoustic_tokenizer`, `semantic_tokenizer`, `acoustic_connector`, `semantic_connector`, `lm_head` (161 layers) |
| Model size | ~9.2 GB (down from 17.3 GB) |
| Peak VRAM | ~12.5 GB (including inference activations) |
| Transformers | == 4.57.3 |
| bitsandbytes | >= 0.48.1 |

## Why selective quantization?

Naive INT8 quantization of the entire model produces `[Unintelligible Speech]`: the model still detects speech boundaries but cannot decode their content. The acoustic and semantic tokenizer encoders process raw audio signals, where quantization error propagates catastrophically; the Qwen2.5-7B LLM backbone, by contrast, handles INT8 quantization gracefully.

**Critical discovery:** the standalone `vibevoice` package uses different module names than the HF-native variant. The correct skip list for the standalone model is:

| Standalone (this model) | HF-native (won't work here) |
|---|---|
| `acoustic_tokenizer` | `acoustic_tokenizer_encoder` |
| `semantic_tokenizer` | `semantic_tokenizer_encoder` |
| `acoustic_connector` | `acoustic_projection` |
| `semantic_connector` | `semantic_projection` |

Using the HF-native names with the standalone package silently quantizes the audio-critical modules, producing garbage output.
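
To confirm the skip list took effect, a quick sanity check is to count which `Linear` layers ended up as bitsandbytes `Linear8bitLt`. A minimal sketch, assuming the model has been loaded as in the Usage section below:

```python
import bitsandbytes as bnb
import torch.nn as nn

# Collect quantized vs. full-precision linear layers by module path.
# Linear8bitLt subclasses nn.Linear, so it must be checked first.
int8_names, fp_names = [], []
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear8bitLt):
        int8_names.append(name)
    elif isinstance(module, nn.Linear):
        fp_names.append(name)

print(f"{len(int8_names)} INT8 layers")            # 196 per the table above
print(f"{len(fp_names)} full-precision layers")    # tokenizers, connectors, lm_head
# Every quantized layer should live under the LLM backbone.
assert all("language_model" in n for n in int8_names), "audio modules were quantized!"
```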

## Usage

```python
import torch
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

model_id = "Dubedo/VibeVoice-ASR-INT8"

# Load processor (no preprocessor_config.json — the default
# speech_tok_compress_ratio=3200 is correct; see "Important notes" below)
processor = VibeVoiceASRProcessor.from_pretrained(
    model_id,
    language_model_pretrained_name="Qwen/Qwen2.5-7B",
)

# Load quantized model
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

# Transcribe
inputs = processor(
    audio=["path/to/audio.wav"],
    sampling_rate=None,
    return_tensors="pt",
    padding=True,
    add_generation_prompt=True,
)
inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        pad_token_id=processor.pad_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        do_sample=False,
    )

# Decode only the newly generated tokens, then split into diarized segments
input_length = inputs["input_ids"].shape[1]
generated_ids = output_ids[0, input_length:]
text = processor.decode(generated_ids, skip_special_tokens=True)
segments = processor.post_process_transcription(text)
```
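
As a rough check of the size and VRAM figures in the Key details table, you can compare the model's reported weight footprint against peak GPU memory after a transcription run. A minimal sketch, assuming the standalone model class inherits transformers' `get_memory_footprint`:

```python
import torch

# Weight footprint (~9.2 GB expected for this checkpoint)
print(f"weights: {model.get_memory_footprint() / 1e9:.1f} GB")

torch.cuda.reset_peak_memory_stats()
# ... run the generate() call from the example above ...
# Peak VRAM including inference activations (~12.5 GB expected)
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```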

## Quantization method

Quantized on an NVIDIA L4 (22 GB) using the standalone `vibevoice` package with `BitsAndBytesConfig`:

```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)

model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    "microsoft/VibeVoice-ASR",
    quantization_config=quantization_config,
    dtype=torch.bfloat16,  # non-quantized modules stay in BF16
    device_map="auto",
    trust_remote_code=True,
)
```
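
If you are reproducing this checkpoint, the result can then be written out with the standard `save_pretrained` API, which transformers supports for bitsandbytes INT8 weights. A minimal sketch with an example output path (and heeding the caveat below about the processor config):

```python
# Serialize the selectively quantized checkpoint (INT8 backbone + BF16 audio stack).
# Do NOT add a preprocessor_config.json to the output directory (see notes below).
model.save_pretrained("VibeVoice-ASR-INT8")  # example output path
```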

## Important notes

- **Do NOT create a `preprocessor_config.json`** — the standalone processor's default fallback sets `speech_tok_compress_ratio=3200`, which is correct. Creating one with `ratio=320` causes a 10x mask-shape mismatch and an `IndexError`.
- **Requires `bitsandbytes >= 0.48.1`** — v0.48.0 has a confirmed critical bug that breaks INT8 quantization.
- **INT8 models cannot be moved between CPU and GPU** — use a delete-and-reload pattern for VRAM management, as sketched below.
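
For the last point, a minimal sketch of the delete-and-reload pattern, reusing `model_id` and the imports from the Usage example:

```python
import gc
import torch

# Free VRAM: bitsandbytes INT8 models don't support .to("cpu"),
# so the only way to release GPU memory is to drop the model entirely.
del model
gc.collect()
torch.cuda.empty_cache()

# ...later, reload from the local cache when needed again.
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
```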

## Acknowledgments

Based on [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR). Built for the [Dubedo](https://dubedo.com) AI video dubbing platform.