---

license: mit
base_model: microsoft/VibeVoice-ASR-HF
tags:
  - automatic-speech-recognition
  - vibevoice
  - bitsandbytes
  - 8-bit
  - quantized
  - diarization
language:
  - multilingual
pipeline_tag: automatic-speech-recognition
library_name: transformers
---


# VibeVoice-ASR-HF — Selective INT8 Quantization

Selectively quantized version of [microsoft/VibeVoice-ASR-HF](https://huggingface.co/microsoft/VibeVoice-ASR-HF) for low-VRAM deployment.

**Only the Qwen2.5-7B LLM backbone is quantized.** The acoustic tokenizer encoder, semantic tokenizer encoder, projection layers, and lm_head remain in full BF16 precision — preserving diarization accuracy and transcription quality.



## Key details



| | |
|---|---|
| Base model | [microsoft/VibeVoice-ASR-HF](https://huggingface.co/microsoft/VibeVoice-ASR-HF) |
| Quantization | INT8 (bitsandbytes) |
| Modules quantized | `language_model.model.layers.*` only |
| Modules in BF16 | `acoustic_tokenizer_encoder`, `semantic_tokenizer_encoder`, `acoustic_projection`, `semantic_projection`, `lm_head` |
| Model size | ~9 GB (down from 17.3 GB) |
| VRAM usage | ~10–11 GB |
| Transformers | >= 5.3.0 |
| bitsandbytes | >= 0.48.1 |



## Why selective quantization?



Naive 8-bit quantization of the entire model destroys diarization (every speaker collapses to SPEAKER_00) and significantly degrades transcription quality. The acoustic and semantic tokenizer encoders process raw audio signals, where small numerical errors compound catastrophically through the convolutional stages. The LLM backbone (Qwen2.5-7B) tolerates quantization well: its weights are approximately normally distributed, a range that INT8 represents with little loss.
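The intuition can be sketched with plain-Python absmax quantization, the per-row scheme that bitsandbytes' LLM.int8() builds on. The function names and sample weights below are illustrative, not from the model:

```python
# Minimal sketch of absmax INT8 quantization (illustrative, not the
# bitsandbytes implementation).
def quantize_int8(weights):
    # Map the largest-magnitude weight to 127; everything else scales with it.
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

# Bell-shaped weights (typical of LLM linear layers) survive the round trip:
weights = [0.02, -0.15, 0.31, -0.07, 0.12]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
print(max_err <= scale / 2)  # → True: error bounded by half a quantization step
```

For weights like these the round-trip error stays below half a quantization step; in the convolutional tokenizer encoders, by contrast, such per-layer errors accumulate across stages rather than averaging out, which is why those modules are skipped.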



## Usage



```python
import torch
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "Dubedo/VibeVoice-ASR-HF-INT8"

processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

inputs = processor.apply_transcription_request(
    audio="path/to/audio.wav",
    prompt="optional hotwords here",
).to(model.device, model.dtype)

output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]

# Structured output with speaker, timestamps, text
result = processor.decode(generated_ids, return_format="parsed")[0]
```



## Quantization method



Quantized using `BitsAndBytesConfig` with `llm_int8_skip_modules` to protect audio-critical components:



```python
from transformers import BitsAndBytesConfig

BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer_encoder",
        "semantic_tokenizer_encoder",
        "acoustic_projection",
        "semantic_projection",
        "lm_head",
    ],
)
```
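Conceptually, the skip list works by name matching: a linear layer stays in BF16 when its qualified module name contains a skipped entry. A minimal sketch of that idea (not the actual bitsandbytes code; `would_quantize` and the module names are illustrative):

```python
# Conceptual sketch of skip-module matching (illustrative, not the
# bitsandbytes implementation).
SKIP = [
    "acoustic_tokenizer_encoder",
    "semantic_tokenizer_encoder",
    "acoustic_projection",
    "semantic_projection",
    "lm_head",
]

def would_quantize(module_name: str) -> bool:
    # A layer is converted to INT8 only if no skip entry appears in its name.
    return not any(skip in module_name for skip in SKIP)

print(would_quantize("language_model.model.layers.0.mlp.down_proj"))  # → True
print(would_quantize("acoustic_tokenizer_encoder.conv_in"))           # → False
```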



## Acknowledgments



Based on the selective quantization approach documented by [FabioSarracino/VibeVoice-Large-Q8](https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8) and [Enemyx-net/VibeVoice-ComfyUI](https://github.com/Enemyx-net/VibeVoice-ComfyUI), adapted for the HF-native ASR architecture in transformers 5.3.0.