---
license: mit
base_model: microsoft/VibeVoice-ASR
tags:
  - automatic-speech-recognition
  - vibevoice
  - bitsandbytes
  - 8-bit
  - int8
  - quantized
  - diarization
  - multilingual
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# VibeVoice-ASR — Selective INT8 Quantization

Selectively quantized version of [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) for low-VRAM deployment.

**Only the Qwen2.5-7B LLM backbone is quantized to INT8.** Audio tokenizers, connectors, and lm_head remain in full BF16 precision — preserving diarization accuracy and transcription quality.

> ⚠️ This model uses the **standalone** `vibevoice` package (`pip install git+https://github.com/microsoft/VibeVoice.git`), NOT the HF-native `transformers >= 5.3.0` variant. It requires `transformers == 4.57.3`.

## Key details

| | |
|---|---|
| Base model | [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) |
| Quantization | INT8 (bitsandbytes `Linear8bitLt`) |
| Modules quantized | `model.language_model.model.layers.*` (196 linear layers) |
| Modules in BF16 | `acoustic_tokenizer`, `semantic_tokenizer`, `acoustic_connector`, `semantic_connector`, `lm_head` (161 linear layers) |
| Model size | ~9.2 GB (down from 17.3 GB) |
| Peak VRAM | ~12.5 GB (including inference activations) |
| Transformers | == 4.57.3 |
| bitsandbytes | >= 0.48.1 |

## Why selective quantization?

Naive INT8 quantization of the entire model produces `[Unintelligible Speech]` — the model detects speech boundaries but cannot decode content. The acoustic and semantic tokenizer encoders process raw audio signals where quantization errors propagate catastrophically. The LLM backbone (Qwen2.5-7B) handles INT8 quantization gracefully.
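To confirm that only the LLM backbone was converted, you can count which linear layers were swapped for bitsandbytes' `Linear8bitLt`. A minimal sanity check, assuming `model` has been loaded as shown in the Usage section below:

```python
import bitsandbytes as bnb
import torch.nn as nn

# Count INT8 vs. full-precision linear layers after loading.
# Linear8bitLt subclasses nn.Linear, so check it first.
int8_names, bf16_names = [], []
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear8bitLt):
        int8_names.append(name)
    elif isinstance(module, nn.Linear):
        bf16_names.append(name)

print(len(int8_names))  # 196 expected, all under the language model's layers
print(len(bf16_names))  # tokenizers, connectors, and lm_head stay nn.Linear
assert all(".layers." in n for n in int8_names)
```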

**Critical discovery:** The standalone `vibevoice` package uses different module names than the HF-native variant. The correct skip list for the standalone model is:

| Standalone (this model) | HF-native (won't work here) |
|---|---|
| `acoustic_tokenizer` | `acoustic_tokenizer_encoder` |
| `semantic_tokenizer` | `semantic_tokenizer_encoder` |
| `acoustic_connector` | `acoustic_projection` |
| `semantic_connector` | `semantic_projection` |

Using the HF-native names with the standalone package silently quantizes audio-critical modules, producing garbage output.
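If you are unsure which naming scheme your install uses, printing the matching module paths settles it. A quick check (the filter keys are illustrative, chosen to cover both variants):

```python
# List audio-related submodules; standalone installs should print
# acoustic_tokenizer / semantic_tokenizer / *_connector paths,
# while HF-native ones would show *_encoder / *_projection instead
for name, _ in model.named_modules():
    if any(key in name for key in ("tokenizer", "connector", "projection")):
        print(name)
```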

## Usage

```python
import torch
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

model_id = "Dubedo/VibeVoice-ASR-INT8"

# Load processor (no preprocessor_config.json — default ratio=3200 is correct)
processor = VibeVoiceASRProcessor.from_pretrained(
    model_id,
    language_model_pretrained_name="Qwen/Qwen2.5-7B",
)

# Load quantized model
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

# Transcribe
inputs = processor(
    audio=["path/to/audio.wav"],
    sampling_rate=None,
    return_tensors="pt",
    padding=True,
    add_generation_prompt=True,
)
inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        pad_token_id=processor.pad_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        do_sample=False,
    )

input_length = inputs["input_ids"].shape[1]
generated_ids = output_ids[0, input_length:]
text = processor.decode(generated_ids, skip_special_tokens=True)
segments = processor.post_process_transcription(text)
```
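To check the peak-VRAM figure from the table above on your own hardware, torch's CUDA memory counters give a rough estimate. A sketch to run right after generation:

```python
# Peak VRAM across the whole process; call
# torch.cuda.reset_peak_memory_stats() before loading to scope the measurement
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak allocated VRAM: {peak_gb:.1f} GB")  # ~12.5 GB expected per the table
```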

## Quantization method

Quantized on NVIDIA L4 (22GB) using the standalone `vibevoice` package with `BitsAndBytesConfig`:

```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # Standalone-package module names (see the table above)
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)

model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    "microsoft/VibeVoice-ASR",
    quantization_config=quantization_config,
    dtype=torch.bfloat16,  # non-quantized modules stay in BF16
    device_map="auto",
    trust_remote_code=True,
)
```
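The resulting INT8 checkpoint can then be serialized with `save_pretrained`; a sketch, assuming your bitsandbytes version supports 8-bit serialization (the output directory name is illustrative). The processor is deliberately not saved, so no `preprocessor_config.json` is written (see the notes below):

```python
# Save only the quantized model weights and config
model.save_pretrained("VibeVoice-ASR-INT8")
# Intentionally NOT calling processor.save_pretrained() here:
# a written preprocessor_config.json can override the correct
# speech_tok_compress_ratio default (see Important notes)
```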

## Important notes

- **Do NOT create a `preprocessor_config.json`** — the standalone processor's default fallback sets `speech_tok_compress_ratio=3200`, which is correct. Creating one with `ratio=320` causes a 10x mask shape mismatch and `IndexError`.
- **Requires `bitsandbytes >= 0.48.1`** — v0.48.0 has a confirmed critical bug breaking INT8 quantization.
- **INT8 models cannot be moved between CPU and GPU.** Use the delete-and-reload pattern sketched below for VRAM management.
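A minimal sketch of that delete-and-reload pattern, reusing `model_id` and the imports from the Usage section:

```python
import gc
import torch

# Free VRAM: INT8 weights can't be offloaded with .to("cpu"), so drop the model
del model
gc.collect()
torch.cuda.empty_cache()

# ...then reload from disk when it's needed again
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
```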

## Acknowledgments

Based on [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR). Built for the [Dubedo](https://dubedo.com) AI video dubbing platform.