---
license: mit
base_model: microsoft/VibeVoice-ASR
tags:
- automatic-speech-recognition
- vibevoice
- bitsandbytes
- 8-bit
- int8
- quantized
- diarization
- multilingual
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# VibeVoice-ASR — Selective INT8 Quantization

Selectively quantized version of [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) for low-VRAM deployment.

**Only the Qwen2.5-7B LLM backbone is quantized to INT8.** Audio tokenizers, connectors, and `lm_head` remain in full BF16 precision — preserving diarization accuracy and transcription quality.

> ⚠️ This model uses the **standalone** `vibevoice` package (`pip install git+https://github.com/microsoft/VibeVoice.git`), NOT the HF-native `transformers >= 5.3.0` variant. It requires `transformers == 4.57.3`.

## Key details

| | |
|---|---|
| Base model | [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) |
| Quantization | INT8 (bitsandbytes `Linear8bitLt`) |
| Modules quantized | `model.language_model.model.layers.*` (196 layers) |
| Modules in BF16 | `acoustic_tokenizer`, `semantic_tokenizer`, `acoustic_connector`, `semantic_connector`, `lm_head` (161 layers) |
| Model size | ~9.2 GB (down from 17.3 GB) |
| Peak VRAM | ~12.5 GB (including inference activations) |
| Transformers | == 4.57.3 |
| bitsandbytes | >= 0.48.1 |

## Why selective quantization?

Naive INT8 quantization of the entire model produces `[Unintelligible Speech]`: the model still detects speech boundaries but cannot decode their content. The acoustic and semantic tokenizer encoders process raw audio signals, where quantization error propagates catastrophically; the Qwen2.5-7B LLM backbone, by contrast, handles INT8 quantization gracefully.

**Critical discovery:** the standalone `vibevoice` package uses different module names than the HF-native variant. The correct skip list for the standalone model is:

| Standalone (this model) | HF-native (won't work here) |
|---|---|
| `acoustic_tokenizer` | `acoustic_tokenizer_encoder` |
| `semantic_tokenizer` | `semantic_tokenizer_encoder` |
| `acoustic_connector` | `acoustic_projection` |
| `semantic_connector` | `semantic_projection` |

Using the HF-native names with the standalone package silently quantizes the audio-critical modules, producing garbage output.
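
To confirm the skip list took effect, a quick sanity check is to count which `Linear` layers ended up as bitsandbytes `Linear8bitLt`. A minimal sketch, assuming the model has been loaded as in the Usage section below:

```python
import bitsandbytes as bnb
import torch.nn as nn

# Collect quantized vs. full-precision linear layers by module path.
# Linear8bitLt subclasses nn.Linear, so it must be checked first.
int8_names, fp_names = [], []
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear8bitLt):
        int8_names.append(name)
    elif isinstance(module, nn.Linear):
        fp_names.append(name)

print(f"{len(int8_names)} INT8 layers")            # 196 per the table above
print(f"{len(fp_names)} full-precision layers")    # tokenizers, connectors, lm_head
# Every quantized layer should live under the LLM backbone.
assert all("language_model" in n for n in int8_names), "audio modules were quantized!"
```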

## Usage

```python
import torch
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

model_id = "Dubedo/VibeVoice-ASR-INT8"

# Load processor (no preprocessor_config.json — the default
# speech_tok_compress_ratio=3200 is correct; see "Important notes" below)
processor = VibeVoiceASRProcessor.from_pretrained(
    model_id,
    language_model_pretrained_name="Qwen/Qwen2.5-7B",
)

# Load quantized model
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

# Transcribe
inputs = processor(
    audio=["path/to/audio.wav"],
    sampling_rate=None,
    return_tensors="pt",
    padding=True,
    add_generation_prompt=True,
)
inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        pad_token_id=processor.pad_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        do_sample=False,
    )

# Decode only the newly generated tokens, then split into diarized segments
input_length = inputs["input_ids"].shape[1]
generated_ids = output_ids[0, input_length:]
text = processor.decode(generated_ids, skip_special_tokens=True)
segments = processor.post_process_transcription(text)
```
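
As a rough check of the size and VRAM figures in the Key details table, you can compare the model's reported weight footprint against peak GPU memory after a transcription run. A minimal sketch, assuming the standalone model class inherits transformers' `get_memory_footprint`:

```python
import torch

# Weight footprint (~9.2 GB expected for this checkpoint)
print(f"weights: {model.get_memory_footprint() / 1e9:.1f} GB")

torch.cuda.reset_peak_memory_stats()
# ... run the generate() call from the example above ...
# Peak VRAM including inference activations (~12.5 GB expected)
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```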

## Quantization method

Quantized on an NVIDIA L4 (22 GB) using the standalone `vibevoice` package with `BitsAndBytesConfig`:

```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)

model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    "microsoft/VibeVoice-ASR",
    quantization_config=quantization_config,
    dtype=torch.bfloat16,  # non-quantized modules stay in BF16
    device_map="auto",
    trust_remote_code=True,
)
```
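
If you are reproducing this checkpoint, the result can then be written out with the standard `save_pretrained` API, which transformers supports for bitsandbytes INT8 weights. A minimal sketch with an example output path (and heeding the caveat below about the processor config):

```python
# Serialize the selectively quantized checkpoint (INT8 backbone + BF16 audio stack).
# Do NOT add a preprocessor_config.json to the output directory (see notes below).
model.save_pretrained("VibeVoice-ASR-INT8")  # example output path
```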

## Important notes

- **Do NOT create a `preprocessor_config.json`** — the standalone processor's default fallback sets `speech_tok_compress_ratio=3200`, which is correct. Creating one with `ratio=320` causes a 10x mask-shape mismatch and an `IndexError`.
- **Requires `bitsandbytes >= 0.48.1`** — v0.48.0 has a confirmed critical bug that breaks INT8 quantization.
- **INT8 models cannot be moved between CPU and GPU** — use a delete-and-reload pattern for VRAM management, as sketched below.
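
For the last point, a minimal sketch of the delete-and-reload pattern, reusing `model_id` and the imports from the Usage example:

```python
import gc
import torch

# Free VRAM: bitsandbytes INT8 models don't support .to("cpu"),
# so the only way to release GPU memory is to drop the model entirely.
del model
gc.collect()
torch.cuda.empty_cache()

# ...later, reload from the local cache when needed again.
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
```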

## Acknowledgments

Based on [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR). Built for the [Dubedo](https://dubedo.com) AI video dubbing platform.