---
license: mit
base_model: microsoft/VibeVoice-ASR-HF
tags:
- automatic-speech-recognition
- vibevoice
- bitsandbytes
- 8-bit
- quantized
- diarization
language:
- multilingual
pipeline_tag: automatic-speech-recognition
library_name: transformers
---
# VibeVoice-ASR-HF — Selective INT8 Quantization
Selectively quantized version of [microsoft/VibeVoice-ASR-HF](https://huggingface.co/microsoft/VibeVoice-ASR-HF) for low-VRAM deployment.
**Only the Qwen2.5-7B LLM backbone is quantized.** The acoustic tokenizer encoder, semantic tokenizer encoder, projection layers, and lm_head remain in full BF16 precision — preserving diarization accuracy and transcription quality.
## Key details
| | |
|---|---|
| Base model | [microsoft/VibeVoice-ASR-HF](https://huggingface.co/microsoft/VibeVoice-ASR-HF) |
| Quantization | INT8 (bitsandbytes) |
| Modules quantized | `language_model.model.layers.*` only |
| Modules in BF16 | `acoustic_tokenizer_encoder`, `semantic_tokenizer_encoder`, `acoustic_projection`, `semantic_projection`, `lm_head` |
| Model size | ~9 GB (down from 17.3 GB) |
| VRAM usage | ~10–11 GB |
| Transformers | >= 5.3.0 |
| bitsandbytes | >= 0.48.1 |
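The size reduction in the table can be sanity-checked with back-of-envelope arithmetic: INT8 stores one byte per quantized parameter, BF16 two bytes. The parameter split below (~7.6B backbone, remainder in BF16) is an illustrative assumption derived from the 17.3 GB BF16 checkpoint, not an exact count; real INT8 checkpoints also carry per-row scale metadata, so the estimate lands slightly above the observed ~9 GB.

```python
def quantized_size_gb(total_params_b: float, int8_params_b: float) -> float:
    """Estimate checkpoint size: 1 byte per INT8 param + 2 bytes per BF16 param."""
    bf16_params_b = total_params_b - int8_params_b
    return int8_params_b * 1.0 + bf16_params_b * 2.0

total_b = 17.3 / 2   # ~8.65B params implied by the 17.3 GB BF16 checkpoint
backbone_b = 7.6     # assumed Qwen2.5-7B backbone parameter count (illustrative)
print(f"{quantized_size_gb(total_b, backbone_b):.1f} GB")  # ~9.7 GB estimate
```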
## Why selective quantization?
Naive 8-bit quantization of the entire model destroys diarization (all speakers collapse to SPEAKER_00) and significantly degrades transcription quality. The acoustic and semantic tokenizer encoders process raw audio signals, where small numerical errors propagate catastrophically through the convolutional stages. The LLM backbone (Qwen2.5-7B) handles quantization gracefully: its weight distributions are approximately normal with few outliers, which INT8 represents well.
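The outlier sensitivity can be illustrated with a toy absmax INT8 round-trip (a simplified stand-in for what bitsandbytes does per tensor, not the library's actual kernel): for near-Gaussian weights the relative error stays around 1%, but a single large outlier inflates the quantization step and swamps the small values.

```python
import random

def int8_roundtrip_error(values):
    """Mean relative error after absmax INT8 quantize/dequantize."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0  # map [-absmax, absmax] onto [-127, 127]
    dequant = [round(v / scale) * scale for v in values]
    num = sum(abs(a - b) for a, b in zip(values, dequant))
    den = sum(abs(v) for v in values)
    return num / den

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]  # well-behaved weights
spiky = weights + [2.0]                                     # one large outlier

print(f"gaussian:     {int8_roundtrip_error(weights):.3f}")
print(f"with outlier: {int8_roundtrip_error(spiky):.3f}")
```

The outlier case mimics the dynamic range seen in audio-path activations, which is why those modules are kept in BF16.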
## Usage
```python
import torch
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "Dubedo/VibeVoice-ASR-HF-INT8"

processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # non-quantized modules load in BF16
)

inputs = processor.apply_transcription_request(
    audio="path/to/audio.wav",
    prompt="optional hotwords here",
).to(model.device, model.dtype)

output_ids = model.generate(**inputs)
# Strip the prompt tokens, keeping only the newly generated ones
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]

# Structured output with speaker, timestamps, text
result = processor.decode(generated_ids, return_format="parsed")[0]
```
## Quantization method
Quantized using `BitsAndBytesConfig` with `llm_int8_skip_modules` to protect audio-critical components:
```python
BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_skip_modules=[
"acoustic_tokenizer_encoder",
"semantic_tokenizer_encoder",
"acoustic_projection",
"semantic_projection",
"lm_head",
],
)
```
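The effect of the skip list can be sketched as a partition over module paths. The module names below are hypothetical examples, and the substring matching on the dotted path mirrors how transformers resolves `llm_int8_skip_modules` (a simplification of the actual replacement logic):

```python
def plan(module_names, skip_modules):
    """Assign each module path to INT8 or BF16, assuming substring matching
    on the dotted path, as transformers does for llm_int8_skip_modules."""
    return {
        name: "bf16" if any(skip in name for skip in skip_modules) else "int8"
        for name in module_names
    }

# Hypothetical module paths for illustration only
module_names = [
    "language_model.model.layers.0.self_attn.q_proj",
    "language_model.model.layers.27.mlp.down_proj",
    "acoustic_tokenizer_encoder.conv_in",
    "semantic_projection",
    "lm_head",
]
skip_modules = [
    "acoustic_tokenizer_encoder",
    "semantic_tokenizer_encoder",
    "acoustic_projection",
    "semantic_projection",
    "lm_head",
]

for name, dtype in plan(module_names, skip_modules).items():
    print(f"{dtype:>5}  {name}")
```

Only the transformer layers of the backbone end up in INT8; everything on the audio path stays BF16.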
## Acknowledgments
Based on the selective quantization approach documented by [FabioSarracino/VibeVoice-Large-Q8](https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8) and [Enemyx-net/VibeVoice-ComfyUI](https://github.com/Enemyx-net/VibeVoice-ComfyUI), adapted for the HF-native ASR architecture in transformers 5.3.0.