---
license: mit
base_model: microsoft/VibeVoice-ASR
tags:
- automatic-speech-recognition
- vibevoice
- bitsandbytes
- 8-bit
- int8
- quantized
- diarization
- multilingual
pipeline_tag: automatic-speech-recognition
library_name: transformers
---
# VibeVoice-ASR — Selective INT8 Quantization
Selectively quantized version of [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) for low-VRAM deployment.
**Only the Qwen2.5-7B LLM backbone is quantized to INT8.** Audio tokenizers, connectors, and lm_head remain in full BF16 precision — preserving diarization accuracy and transcription quality.
> ⚠️ This model uses the **standalone** `vibevoice` package (`pip install git+https://github.com/microsoft/VibeVoice.git`), NOT the HF-native variant that ships with `transformers >= 5.3.0`. This model requires `transformers == 4.57.3`.
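A matching environment can be set up roughly like this (version pins taken from the requirements stated in this card):

```shell
# Pinned transformers and bitsandbytes, then the standalone VibeVoice package
pip install "transformers==4.57.3" "bitsandbytes>=0.48.1"
pip install git+https://github.com/microsoft/VibeVoice.git
```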
## Key details
| | |
|---|---|
| Base model | [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) |
| Quantization | INT8 (bitsandbytes `Linear8bitLt`) |
| Modules quantized | `model.language_model.model.layers.*` (196 layers) |
| Modules in BF16 | `acoustic_tokenizer`, `semantic_tokenizer`, `acoustic_connector`, `semantic_connector`, `lm_head` (161 layers) |
| Model size | ~9.2 GB (down from 17.3 GB) |
| Peak VRAM | ~12.5 GB (including inference activations) |
| Transformers | == 4.57.3 |
| bitsandbytes | >= 0.48.1 |
## Why selective quantization?
Naive INT8 quantization of the entire model produces `[Unintelligible Speech]` — the model detects speech boundaries but cannot decode content. The acoustic and semantic tokenizer encoders process raw audio signals where quantization errors propagate catastrophically. The LLM backbone (Qwen2.5-7B) handles INT8 quantization gracefully.
**Critical discovery:** The standalone `vibevoice` package uses different module names than the HF-native variant. The correct skip list for the standalone model is:
| Standalone (this model) | HF-native (won't work here) |
|---|---|
| `acoustic_tokenizer` | `acoustic_tokenizer_encoder` |
| `semantic_tokenizer` | `semantic_tokenizer_encoder` |
| `acoustic_connector` | `acoustic_projection` |
| `semantic_connector` | `semantic_projection` |
Using the HF-native names with the standalone package silently quantizes audio-critical modules, producing garbage output.
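To confirm the skip list actually took effect, it helps to audit which `Linear` layers bitsandbytes replaced with `Linear8bitLt`. The sketch below is illustrative (the `audit_quantization` helper is not part of the package); it detects the quantized class by name so it also runs on an unquantized model.

```python
import torch.nn as nn

def audit_quantization(model):
    """Count INT8 (bitsandbytes `Linear8bitLt`) vs full-precision Linear
    layers under each top-level submodule. Detection is by class name, so
    this works whether or not bitsandbytes is importable."""
    counts = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):  # Linear8bitLt subclasses nn.Linear
            top = name.split(".")[0]
            bucket = counts.setdefault(top, {"int8": 0, "fp": 0})
            kind = "int8" if type(module).__name__ == "Linear8bitLt" else "fp"
            bucket[kind] += 1
    return counts
```

On this checkpoint, only the LLM backbone should report `int8` counts; the tokenizers, connectors, and `lm_head` should report only full-precision layers.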
## Usage
```python
import torch
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

model_id = "Dubedo/VibeVoice-ASR-INT8"

# Load processor (no preprocessor_config.json — default ratio=3200 is correct)
processor = VibeVoiceASRProcessor.from_pretrained(
    model_id,
    language_model_pretrained_name="Qwen/Qwen2.5-7B",
)

# Load quantized model
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

# Transcribe
inputs = processor(
    audio=["path/to/audio.wav"],
    sampling_rate=None,
    return_tensors="pt",
    padding=True,
    add_generation_prompt=True,
)
inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        pad_token_id=processor.pad_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        do_sample=False,
    )

# Strip the prompt tokens, decode, and split into diarized segments
input_length = inputs["input_ids"].shape[1]
generated_ids = output_ids[0, input_length:]
text = processor.decode(generated_ids, skip_special_tokens=True)
segments = processor.post_process_transcription(text)
```
## Quantization method
Quantized on NVIDIA L4 (22GB) using the standalone `vibevoice` package with `BitsAndBytesConfig`:
```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)

model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    "microsoft/VibeVoice-ASR",
    quantization_config=quantization_config,
    dtype=torch.bfloat16,  # `dtype` supersedes the deprecated `torch_dtype`
    device_map="auto",
    trust_remote_code=True,
)
```
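Once quantized, the checkpoint can be written back to disk with the standard `save_pretrained` API; recent transformers/bitsandbytes releases serialize INT8 weights directly. A minimal sketch, where `export_quantized` and the target directory are illustrative rather than part of the package:

```python
def export_quantized(model, save_dir):
    """Persist the selectively quantized model. INT8 bitsandbytes weights
    serialize directly via save_pretrained in recent transformers releases.
    Deliberately do NOT also write a preprocessor_config.json: per the
    notes below, a wrong speech_tok_compress_ratio there breaks inference."""
    model.save_pretrained(save_dir, safe_serialization=True)

# export_quantized(model, "VibeVoice-ASR-INT8")  # hypothetical local path
```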
## Important notes
- **Do NOT create a `preprocessor_config.json`** — the standalone processor's default fallback sets `speech_tok_compress_ratio=3200`, which is correct. Creating one with `ratio=320` causes a 10x mask shape mismatch and `IndexError`.
- **Requires `bitsandbytes >= 0.48.1`** — v0.48.0 has a confirmed critical bug breaking INT8 quantization.
- **INT8 models cannot be moved between CPU and GPU** — use delete+reload pattern for VRAM management.
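The delete+reload pattern from the last note can be sketched as follows. `reload_quantized_model` is an illustrative helper, not part of the package; the `load_fn` callable stands in for the `from_pretrained` call shown in the Usage section.

```python
import gc
import torch

def reload_quantized_model(model, load_fn):
    """Free a quantized model's VRAM, then rebuild it from disk.
    INT8 bitsandbytes weights cannot round-trip CPU <-> GPU, so instead of
    model.to("cpu") / model.to("cuda") we drop the object and reload.
    Pass the ONLY live reference to `model`; any other reference keeps its
    VRAM allocated."""
    del model                      # drop the last Python reference
    gc.collect()                   # let GC release the bitsandbytes buffers
    if torch.cuda.is_available():
        torch.cuda.empty_cache()   # return cached blocks to the driver
    return load_fn()               # reload fresh from disk
```

Here `load_fn` would be something like `lambda: VibeVoiceASRForConditionalGeneration.from_pretrained(model_id, device_map="auto", trust_remote_code=True)`.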
## Acknowledgments
Based on [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR). Built for the [Dubedo](https://dubedo.com) AI video dubbing platform.