	---
	license: apache-2.0
	library_name: mlx
	pipeline_tag: text-to-speech
	tags:
	- mlx
	- tts
	- text-to-speech
	- omnivoice
	- quantized
	base_model: k2-fsa/OmniVoice
	---

	# OmniVoice — int4 g=64 (MLX)

	4-bit, group-size-64 affine quantization of [k2-fsa/OmniVoice](https://huggingface.co/k2-fsa/OmniVoice),
	produced with `mlx-audio` for Apple Silicon.

	## Sizes

	| Variant | Backbone | Total |
	|---|---|---|
	| original (bf16; `audio_tokenizer/` is carried over unchanged into this repo) | ~1.2 GB | ~1.6 GB |
	| this repo (int4 g=64 backbone, bf16 tokenizer) | 329 MB | 724 MB |
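
As a rough sanity check on the backbone figure, the arithmetic below estimates the size of a 4-bit, group-size-64 model relative to bf16. It assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias alongside the 4-bit codes, and it treats the ~1.2 GB figure as 1200 MB. Both are simplifying assumptions, and any layers that stay unquantized are ignored.

```python
# Assumed storage layout: 4-bit codes plus one 16-bit scale and one 16-bit bias per group.
bits = 4
group_size = 64
scale_bits = 16
bias_bits = 16

# Effective bits per weight once the per-group metadata is amortized.
bits_per_weight = bits + (scale_bits + bias_bits) / group_size

# Size relative to a 16-bit (bf16) weight.
ratio_vs_bf16 = bits_per_weight / 16

# Assumption: the "~1.2 GB" bf16 backbone is taken as 1200 MB.
estimated_backbone_mb = 1200 * ratio_vs_bf16

# The part of this repo that is not the backbone, from the table (724 MB total, 329 MB backbone).
non_backbone_mb = 724 - 329

print(bits_per_weight, ratio_vs_bf16, estimated_backbone_mb, non_backbone_mb)
```

The estimate lands within a few percent of the 329 MB reported in the table, which is about what one would expect given the rounding in the bf16 figure.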

	Quantization applies only to the Qwen3 backbone Linear layers (and the tied
	audio embedding/head matmuls). The Higgs Audio V2 acoustic tokenizer
	(decoder, RVQ, semantic) is left at bfloat16 to preserve audio fidelity.
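
For readers unfamiliar with group-wise affine quantization, here is a minimal pure-Python sketch of the general idea for a single group of 64 weights. It is a simplified illustration, not MLX's actual kernel: the function names are hypothetical, and the rounding and range selection in MLX may differ.

```python
import random


def quantize_group(weights, bits=4):
    """Affine-quantize one group of floats to unsigned `bits`-bit codes."""
    levels = (1 << bits) - 1  # 15 distinct steps for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    # Fall back to a scale of 1.0 if every weight in the group is identical.
    scale = (w_max - w_min) / levels or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float weights."""
    return [c * scale + bias for c in codes]


random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(64)]  # one group of 64 weights

codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print(f"scale={scale:.6f} bias={bias:.6f} max_err={max_err:.6f}")
```

Each weight is reconstructed to within half a quantization step, which is why a small group size such as 64 helps: the step is set by the range of only 64 values rather than a whole row.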

	## Performance (M-series, mlx-audio 0.x)

	| Prompt | RTF (bf16) | RTF (this) |
	|---|---|---|
	| "Voice synthesis on Apple Silicon has come a long way. We can now generate full sentences in real time." | 3.68× | 4.59× (+25%) |

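	RTF is reported here as a multiple of real time (higher is faster), so the quantized model is about 25% faster than bf16 on this prompt.

	Whisper-small round-trip: the transcript of the quantized output is identical to the bf16 transcript on the prompt above.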
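
The relative gain in the table can be reproduced from the two speed figures. The sketch below assumes the `×` values mean seconds of audio produced per second of wall-clock time; the helper name and the 10-second clip are hypothetical and only there for illustration.

```python
def speedup_percent(baseline_x, quantized_x):
    """Relative speed gain of the quantized model over the baseline, in percent."""
    return (quantized_x / baseline_x - 1.0) * 100.0


bf16_x = 3.68  # from the table
q4_x = 4.59    # from the table

gain = speedup_percent(bf16_x, q4_x)

# Under the stated assumption, wall-clock time to synthesize a 10-second clip.
clip_seconds = 10.0
bf16_time = clip_seconds / bf16_x
q4_time = clip_seconds / q4_x

print(f"gain={gain:.1f}% bf16={bf16_time:.2f}s q4={q4_time:.2f}s")
```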

	## Usage (mlx-audio Python)

	```python
	import json

	import mlx.core as mx
	import mlx.nn as nn
	from huggingface_hub import snapshot_download
	from mlx_audio.tts.models.omnivoice.config import OmniVoiceConfig
	from mlx_audio.tts.models.omnivoice.omnivoice import Model

	# Download the repo snapshot and read its config.
	path = snapshot_download("lightsofapollo/omnivoice-mlx-q4-g64")
	with open(f"{path}/config.json") as f:
	    cfg_dict = json.load(f)

	# Build the model from only the keys that OmniVoiceConfig declares.
	known = OmniVoiceConfig.__dataclass_fields__
	model = Model(OmniVoiceConfig(**{k: v for k, v in cfg_dict.items() if k in known}))

	# IMPORTANT: convert the layers to their quantized form *before* loading
	# weights, so the parameter shapes match the quantized checkpoint.
	q = cfg_dict["quantization"]
	nn.quantize(
	    model,
	    group_size=q["group_size"],
	    bits=q["bits"],
	    mode=q.get("mode", "affine"),
	    class_predicate=lambda _p, m: hasattr(m, "to_quantized"),
	)

	# Load the checkpoint, remap its keys, and force evaluation of the parameters.
	raw = dict(mx.load(f"{path}/model.safetensors"))
	model.load_weights(list(model.sanitize(raw).items()))
	mx.eval(model.parameters())
	```
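
The predicate above quantizes every module that exposes `to_quantized`. If some layers in a checkpoint were left unquantized, a stricter predicate can restrict quantization to modules whose `.scales` tensor is actually present in the weights, a pattern also used by the `mlx-lm` loader. The sketch below is pure Python with stand-in classes so it runs anywhere; the helper name, the fake modules, and the weight keys are all hypothetical, and it assumes the sanitized checkpoint keys match the module paths that `nn.quantize` passes to the predicate.

```python
def make_class_predicate(weights):
    """Return a predicate that quantizes a module only if the checkpoint has scales for it."""
    def predicate(path, module):
        if not hasattr(module, "to_quantized"):
            return False
        return f"{path}.scales" in weights
    return predicate


class FakeLinear:
    """Stand-in for a quantizable layer (hypothetical, for illustration only)."""
    def to_quantized(self, **kwargs):
        return self


class FakeNorm:
    """Stand-in for a layer that cannot be quantized."""


# Hypothetical checkpoint keys: only the first projection was stored quantized.
weights = {
    "backbone.layers.0.q_proj.weight": None,
    "backbone.layers.0.q_proj.scales": None,
    "backbone.layers.0.q_proj.biases": None,
    "tokenizer.decoder.proj.weight": None,
}

predicate = make_class_predicate(weights)
quantized_backbone = predicate("backbone.layers.0.q_proj", FakeLinear())
quantized_tokenizer = predicate("tokenizer.decoder.proj", FakeLinear())
quantized_norm = predicate("backbone.layers.0.norm", FakeNorm())

print(quantized_backbone, quantized_tokenizer, quantized_norm)
```

With real weights, the result of `make_class_predicate(...)` would be passed as `class_predicate=` to `nn.quantize` after the checkpoint has been loaded and sanitized.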

	## How this was made

	```bash
	python -m mlx_audio.tts.models.omnivoice.convert \
	    --model k2-fsa/OmniVoice --output omnivoice-bf16 --dtype bfloat16

	python -m mlx_audio.convert \
	    --hf-path omnivoice-bf16 --mlx-path omnivoice-q4-g64 \
	    --quantize --q-bits 4 --q-group-size 64
	```
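
The loading snippet in the Usage section reads a `quantization` entry from `config.json`. Assuming the conversion records the values passed on the command line, that entry would look roughly like the fragment parsed below. The exact set of keys is an assumption inferred from the loading code (`group_size`, `bits`, and an optional `mode`), not copied from the repo.

```python
import json

# Illustrative fragment only; the real config.json contains many other keys.
example_config = """
{
  "quantization": {
    "group_size": 64,
    "bits": 4
  }
}
"""

cfg = json.loads(example_config)
q = cfg["quantization"]

# Mirror the defaults used when loading: "affine" if no mode is recorded.
group_size = q["group_size"]
bits = q["bits"]
mode = q.get("mode", "affine")

print(group_size, bits, mode)
```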

	## License

	Apache-2.0 (inherited from `k2-fsa/OmniVoice`).