---
license: mit
base_model: microsoft/VibeVoice-ASR
tags:
  - automatic-speech-recognition
  - vibevoice
  - bitsandbytes
  - 8-bit
  - int8
  - quantized
  - diarization
  - multilingual
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# VibeVoice-ASR — Selective INT8 Quantization

Selectively quantized version of [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) for low-VRAM deployment.

**Only the Qwen2.5-7B LLM backbone is quantized to INT8.** Audio tokenizers, connectors, and lm_head remain in full BF16 precision — preserving diarization accuracy and transcription quality.

> ⚠️ This model uses the **standalone** `vibevoice` package (`pip install git+https://github.com/microsoft/VibeVoice.git`), NOT the HF-native `transformers >= 5.3.0` variant. It requires `transformers == 4.57.3`.

## Key details

| | |
|---|---|
| Base model | [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) |
| Quantization | INT8 (bitsandbytes `Linear8bitLt`) |
| Modules quantized | `model.language_model.model.layers.*` (196 linear layers) |
| Modules in BF16 | `acoustic_tokenizer`, `semantic_tokenizer`, `acoustic_connector`, `semantic_connector`, `lm_head` (161 linear layers) |
| Model size | ~9.2 GB (down from 17.3 GB) |
| Peak VRAM | ~12.5 GB (including inference activations) |
| Transformers | == 4.57.3 |
| bitsandbytes | >= 0.48.1 |

## Why selective quantization?

Naive INT8 quantization of the entire model produces `[Unintelligible Speech]` — the model detects speech boundaries but cannot decode content. The acoustic and semantic tokenizer encoders process raw audio signals where quantization errors propagate catastrophically. The LLM backbone (Qwen2.5-7B) handles INT8 quantization gracefully.
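To confirm that only the LLM backbone was converted, you can count which linear layers were swapped for bitsandbytes' `Linear8bitLt`. A minimal sanity check, assuming `model` has been loaded as shown in the Usage section below:

```python
import bitsandbytes as bnb
import torch.nn as nn

# Count INT8 vs. full-precision linear layers after loading.
# Linear8bitLt subclasses nn.Linear, so check it first.
int8_names, bf16_names = [], []
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear8bitLt):
        int8_names.append(name)
    elif isinstance(module, nn.Linear):
        bf16_names.append(name)

print(len(int8_names))  # 196 expected, all under the language model's layers
print(len(bf16_names))  # tokenizers, connectors, and lm_head stay nn.Linear
assert all(".layers." in n for n in int8_names)
```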

**Critical discovery:** The standalone `vibevoice` package uses different module names than the HF-native variant. The correct skip list for the standalone model is:

| Standalone (this model) | HF-native (won't work here) |
|---|---|
| `acoustic_tokenizer` | `acoustic_tokenizer_encoder` |
| `semantic_tokenizer` | `semantic_tokenizer_encoder` |
| `acoustic_connector` | `acoustic_projection` |
| `semantic_connector` | `semantic_projection` |

Using the HF-native names with the standalone package silently quantizes audio-critical modules, producing garbage output.
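If you are unsure which naming scheme your install uses, printing the matching module paths settles it. A quick check (the filter keys are illustrative, chosen to cover both variants):

```python
# List audio-related submodules; standalone installs should print
# acoustic_tokenizer / semantic_tokenizer / *_connector paths,
# while HF-native ones would show *_encoder / *_projection instead
for name, _ in model.named_modules():
    if any(key in name for key in ("tokenizer", "connector", "projection")):
        print(name)
```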

## Usage

```python
import torch
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

model_id = "Dubedo/VibeVoice-ASR-INT8"

# Load processor (no preprocessor_config.json — default ratio=3200 is correct)
processor = VibeVoiceASRProcessor.from_pretrained(
    model_id,
    language_model_pretrained_name="Qwen/Qwen2.5-7B",
)

# Load quantized model
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

# Transcribe
inputs = processor(
    audio=["path/to/audio.wav"],
    sampling_rate=None,
    return_tensors="pt",
    padding=True,
    add_generation_prompt=True,
)
inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        pad_token_id=processor.pad_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        do_sample=False,
    )

input_length = inputs["input_ids"].shape[1]
generated_ids = output_ids[0, input_length:]
text = processor.decode(generated_ids, skip_special_tokens=True)
segments = processor.post_process_transcription(text)
```
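To check the peak-VRAM figure from the table above on your own hardware, torch's CUDA memory counters give a rough estimate. A sketch to run right after generation:

```python
# Peak VRAM across the whole process; call
# torch.cuda.reset_peak_memory_stats() before loading to scope the measurement
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak allocated VRAM: {peak_gb:.1f} GB")  # ~12.5 GB expected per the table
```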

## Quantization method

Quantized on NVIDIA L4 (22GB) using the standalone `vibevoice` package with `BitsAndBytesConfig`:

```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # Standalone-package module names (see the table above)
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)

model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    "microsoft/VibeVoice-ASR",
    quantization_config=quantization_config,
    dtype=torch.bfloat16,  # non-quantized modules stay in BF16
    device_map="auto",
    trust_remote_code=True,
)
```
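The resulting INT8 checkpoint can then be serialized with `save_pretrained`; a sketch, assuming your bitsandbytes version supports 8-bit serialization (the output directory name is illustrative). The processor is deliberately not saved, so no `preprocessor_config.json` is written (see the notes below):

```python
# Save only the quantized model weights and config
model.save_pretrained("VibeVoice-ASR-INT8")
# Intentionally NOT calling processor.save_pretrained() here:
# a written preprocessor_config.json can override the correct
# speech_tok_compress_ratio default (see Important notes)
```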

## Important notes

- **Do NOT create a `preprocessor_config.json`** — the standalone processor's default fallback sets `speech_tok_compress_ratio=3200`, which is correct. Creating one with `ratio=320` causes a 10x mask shape mismatch and `IndexError`.
- **Requires `bitsandbytes >= 0.48.1`** — v0.48.0 has a confirmed critical bug breaking INT8 quantization.
- **INT8 models cannot be moved between CPU and GPU.** Use the delete-and-reload pattern sketched below for VRAM management.
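A minimal sketch of that delete-and-reload pattern, reusing `model_id` and the imports from the Usage section:

```python
import gc
import torch

# Free VRAM: INT8 weights can't be offloaded with .to("cpu"), so drop the model
del model
gc.collect()
torch.cuda.empty_cache()

# ...then reload from disk when it's needed again
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
```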

## Acknowledgments

Based on [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR). Built for the [Dubedo](https://dubedo.com) AI video dubbing platform.