---
license: mit
base_model: microsoft/VibeVoice-ASR
tags:
- automatic-speech-recognition
- vibevoice
- bitsandbytes
- 8-bit
- int8
- quantized
- diarization
- multilingual
pipeline_tag: automatic-speech-recognition
library_name: transformers
---
# VibeVoice-ASR — Selective INT8 Quantization
Selectively quantized version of [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) for low-VRAM deployment.
**Only the Qwen2.5-7B LLM backbone is quantized to INT8.** Audio tokenizers, connectors, and lm_head remain in full BF16 precision — preserving diarization accuracy and transcription quality.
> ⚠️ This model uses the **standalone** `vibevoice` package (`pip install git+https://github.com/microsoft/VibeVoice.git`), NOT the HF-native `transformers >= 5.3.0` variant. It requires `transformers == 4.57.3`.
## Key details
| | |
|---|---|
| Base model | [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) |
| Quantization | INT8 (bitsandbytes `Linear8bitLt`) |
| Modules quantized | `model.language_model.model.layers.*` (196 layers) |
| Modules in BF16 | `acoustic_tokenizer`, `semantic_tokenizer`, `acoustic_connector`, `semantic_connector`, `lm_head` (161 layers) |
| Model size | ~9.2 GB (down from 17.3 GB) |
| Peak VRAM | ~12.5 GB (including inference activations) |
| Transformers | == 4.57.3 |
| bitsandbytes | >= 0.48.1 |
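The size figures above can be sanity-checked with back-of-envelope arithmetic. The parameter split below is an assumption (roughly 7.6B backbone parameters, ~1B across the BF16 components), and the real checkpoint also carries INT8 scale factors and buffers, so the estimate is approximate:

```python
# Rough size estimate: INT8 = 1 byte/param, BF16 = 2 bytes/param.
# Parameter counts are assumptions, not exact values from the checkpoint.
def size_gb(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param / 1e9

llm = 7.6e9    # assumed Qwen2.5-7B backbone parameter count
other = 1.0e9  # assumed tokenizers + connectors + lm_head

full_bf16 = size_gb(llm + other, 2)              # everything in BF16
selective = size_gb(llm, 1) + size_gb(other, 2)  # only the backbone in INT8

print(f"{full_bf16:.1f} GB -> {selective:.1f} GB")  # 17.2 GB -> 9.6 GB
```

The estimate lands close to the measured ~17.3 GB → ~9.2 GB; the leftover gap is quantization metadata and rounding in the assumed parameter split.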
## Why selective quantization?
Naive INT8 quantization of the entire model produces `[Unintelligible Speech]` — the model detects speech boundaries but cannot decode content. The acoustic and semantic tokenizer encoders process raw audio signals where quantization errors propagate catastrophically. The LLM backbone (Qwen2.5-7B) handles INT8 quantization gracefully.
**Critical discovery:** The standalone `vibevoice` package uses different module names than the HF-native variant. The correct skip list for the standalone model is:
| Standalone (this model) | HF-native (won't work here) |
|---|---|
| `acoustic_tokenizer` | `acoustic_tokenizer_encoder` |
| `semantic_tokenizer` | `semantic_tokenizer_encoder` |
| `acoustic_connector` | `acoustic_projection` |
| `semantic_connector` | `semantic_projection` |
Using the HF-native names with the standalone package silently quantizes audio-critical modules, producing garbage output.
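Because a wrong skip list fails silently, a cheap sanity check is to partition a model's module names against the intended prefixes before quantizing. A minimal sketch on plain name strings (the example names follow the table above; the `partition` helper is illustrative, not part of the `vibevoice` API):

```python
# Split module names into "keep in BF16" vs "safe to quantize" using the
# standalone skip list. Illustrative helper, not part of the vibevoice API.
SKIP_PREFIXES = (
    "acoustic_tokenizer",
    "semantic_tokenizer",
    "acoustic_connector",
    "semantic_connector",
    "lm_head",
)

def partition(module_names):
    keep, quantize = [], []
    for name in module_names:
        (keep if name.startswith(SKIP_PREFIXES) else quantize).append(name)
    return keep, quantize

names = [
    "acoustic_tokenizer.encoder.conv_in",
    "semantic_connector.proj",
    "lm_head",
    "model.language_model.model.layers.0.self_attn.q_proj",
]
keep, quantize = partition(names)
print(keep)      # audio-critical modules that must stay BF16
print(quantize)  # backbone layers that tolerate INT8
```

After loading the quantized checkpoint, the backbone `Linear` layers should report type `Linear8bitLt` (bitsandbytes), while modules under the skip prefixes should remain ordinary BF16 `Linear` modules; if any skip-listed module shows up as `Linear8bitLt`, the wrong names were used.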
## Usage
```python
import torch
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor
model_id = "Dubedo/VibeVoice-ASR-INT8"
# Load processor (no preprocessor_config.json — default ratio=3200 is correct)
processor = VibeVoiceASRProcessor.from_pretrained(
model_id,
language_model_pretrained_name="Qwen/Qwen2.5-7B",
)
# Load quantized model
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
model_id,
device_map="auto",
trust_remote_code=True,
)
model.eval()
# Transcribe
inputs = processor(
audio=["path/to/audio.wav"],
sampling_rate=None,
return_tensors="pt",
padding=True,
add_generation_prompt=True,
)
inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
with torch.no_grad():
output_ids = model.generate(
**inputs,
max_new_tokens=32768,
pad_token_id=processor.pad_id,
eos_token_id=processor.tokenizer.eos_token_id,
do_sample=False,
)
input_length = inputs["input_ids"].shape[1]
generated_ids = output_ids[0, input_length:]
text = processor.decode(generated_ids, skip_special_tokens=True)
segments = processor.post_process_transcription(text)
```
## Quantization method
Quantized on NVIDIA L4 (22GB) using the standalone `vibevoice` package with `BitsAndBytesConfig`:
```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_skip_modules=[
"acoustic_tokenizer",
"semantic_tokenizer",
"acoustic_connector",
"semantic_connector",
"lm_head",
],
)
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
"microsoft/VibeVoice-ASR",
quantization_config=quantization_config,
dtype=torch.bfloat16,  # transformers 4.57 name; `torch_dtype` is the deprecated alias
device_map="auto",
trust_remote_code=True,
)
```
## Important notes
- **Do NOT create a `preprocessor_config.json`** — the standalone processor's default fallback sets `speech_tok_compress_ratio=3200`, which is correct. Creating one with `ratio=320` causes a 10x mask shape mismatch and `IndexError`.
- **Requires `bitsandbytes >= 0.48.1`** — v0.48.0 has a confirmed critical bug breaking INT8 quantization.
- **INT8 models cannot be moved between CPU and GPU** — use delete+reload pattern for VRAM management.
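The 10x mismatch in the first note falls straight out of the compress ratios. A sketch of the arithmetic (the 24 kHz sampling rate below is an assumption about the audio front end; the ratio logic is what matters):

```python
# Speech-token count = audio samples // compress ratio. A wrong ratio of 320
# produces a speech mask 10x longer than the model expects -> IndexError.
samples = 24_000 * 60              # one minute of audio at an assumed 24 kHz
tokens_correct = samples // 3200   # standalone processor's default ratio
tokens_wrong = samples // 320      # ratio from a stray preprocessor_config.json

print(tokens_correct, tokens_wrong)  # 450 vs 4500
```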
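For the last note, a minimal sketch of the delete-and-reload pattern, using a dummy module as a stand-in for the loaded model (the commented reload call mirrors the Usage section):

```python
import gc

import torch

model = torch.nn.Linear(8, 8)  # stand-in for the loaded INT8 model

# bitsandbytes INT8 weights cannot round-trip through .to("cpu") and back;
# to reclaim VRAM, drop every reference to the model and reload from disk.
del model
gc.collect()                   # collect any lingering reference cycles
if torch.cuda.is_available():
    torch.cuda.empty_cache()   # return cached blocks to the CUDA allocator

# Reload later when needed (same call as in Usage above):
# model = VibeVoiceASRForConditionalGeneration.from_pretrained(
#     "Dubedo/VibeVoice-ASR-INT8", device_map="auto", trust_remote_code=True)
```

Note that `del` only removes one reference; VRAM is freed only once no other variable still points at the model.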
## Acknowledgments
Based on [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR). Built for the [Dubedo](https://dubedo.com) AI video dubbing platform.