---
license: apache-2.0
library_name: mlx
pipeline_tag: text-to-speech
tags:
  - mlx
  - tts
  - text-to-speech
  - omnivoice
  - quantized
base_model: k2-fsa/OmniVoice
---

# OmniVoice — int4 g=64 (MLX)

4-bit, group-size-64 affine quantization of [k2-fsa/OmniVoice](https://huggingface.co/k2-fsa/OmniVoice),
produced with `mlx-audio` for Apple Silicon.

## Sizes

| Variant | Backbone | Total |
|---|---|---|
| original (bf16; the `audio_tokenizer/` in this repo is unchanged from it) | ~1.2 GB | ~1.6 GB |
| **this repo (int4 g=64 backbone, bf16 tokenizer)** | **329 MB** | **724 MB** |

Quantization applies only to the Qwen3 backbone Linear layers (and the tied
audio embedding/head matmuls). The Higgs Audio V2 acoustic tokenizer
(decoder, RVQ, semantic) is left at bfloat16 to preserve audio fidelity.
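
To make "int4 g=64 affine" concrete, here is a minimal pure-Python sketch of affine group quantization in general. It is not the MLX kernel: the function names are hypothetical, and the assumption that each group stores one 16-bit scale and one 16-bit bias is ours, used only to estimate storage cost.

```python
def quantize_group(values, bits=4):
    """Affine-quantize one group of floats to unsigned `bits`-bit codes."""
    levels = (1 << bits) - 1            # 15 distinct steps for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo             # `lo` acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats: w_hat = code * scale + bias."""
    return [c * scale + bias for c in codes]


def effective_bits(bits=4, group_size=64, meta_bits=16):
    """Bits per weight, counting one scale and one bias per group (assumed 16-bit each)."""
    return bits + 2 * meta_bits / group_size


# One group of 64 weights spread evenly over [-0.5, 0.5].
group = [i / 63 - 0.5 for i in range(64)]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))

print(effective_bits())  # 4.5
```

Under those assumptions each weight costs about 4.5 bits instead of 16, so a ~1.2 GB bf16 backbone would shrink to roughly 0.34 GB, which is in line with the 329 MB reported in the table.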

## Performance (M-series, mlx-audio 0.x)

| Prompt | RTF (bf16) | RTF (this) |
|---|---|---|
| "Voice synthesis on Apple Silicon has come a long way. We can now generate full sentences in real time." | 3.68× | **4.59×** (+25%) |

Whisper-small round-trip: identical transcript to bf16 on the long prompt.
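
The RTF figures above are written as multiples where higher is faster, so they appear to mean seconds of audio produced per second of wall-clock time. Assuming that reading, the sketch below shows how such a factor and the relative gain could be computed; the helper names are hypothetical and no new measurements are implied.

```python
def speed_factor(audio_seconds, wall_seconds):
    """Seconds of audio generated per second of wall-clock time (higher is faster)."""
    return audio_seconds / wall_seconds


def relative_gain(baseline, candidate):
    """Percentage improvement of `candidate` over `baseline`."""
    return (candidate / baseline - 1.0) * 100.0


# Figures taken from the table above.
gain = relative_gain(3.68, 4.59)
print(f"{gain:.1f}%")  # 24.7%
```

The table rounds this to +25%.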

## Usage (mlx-audio Python)

```python
import json
import mlx.core as mx
import mlx.nn as nn
from huggingface_hub import snapshot_download
from mlx_audio.tts.models.omnivoice.config import OmniVoiceConfig
from mlx_audio.tts.models.omnivoice.omnivoice import Model

path = snapshot_download("lightsofapollo/omnivoice-mlx-q4-g64")
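with open(f"{path}/config.json") as f:
    cfg_dict = json.load(f)
# Keep only the keys that OmniVoiceConfig declares as dataclass fields.
fields = OmniVoiceConfig.__dataclass_fields__
model = Model(OmniVoiceConfig(**{k: v for k, v in cfg_dict.items() if k in fields}))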

# IMPORTANT: quantize the model structure *before* loading weights, so the
# parameter shapes match the packed int4 tensors stored in the checkpoint.
q = cfg_dict["quantization"]
nn.quantize(model, group_size=q["group_size"], bits=q["bits"], mode=q.get("mode", "affine"),
            class_predicate=lambda _p, m: hasattr(m, "to_quantized"))

raw = dict(mx.load(f"{path}/model.safetensors"))
model.load_weights(list(model.sanitize(raw).items()))
mx.eval(model.parameters())
```
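
The config-loading line in the snippet above filters `config.json` down to the keys the config dataclass declares. The standalone sketch below illustrates that pattern with a hypothetical `DemoConfig` standing in for `OmniVoiceConfig`; its field names and default values are made up for the example and are not the real model's.

```python
from dataclasses import dataclass


@dataclass
class DemoConfig:  # hypothetical stand-in, not the real OmniVoiceConfig
    hidden_size: int = 1024
    num_layers: int = 28


def from_dict_filtered(cls, raw):
    """Build a dataclass from a dict, ignoring keys the dataclass does not declare."""
    known = cls.__dataclass_fields__
    return cls(**{k: v for k, v in raw.items() if k in known})


# An extra key such as "quantization" would raise a TypeError if passed straight to the constructor.
raw = {"hidden_size": 2048, "quantization": {"group_size": 64, "bits": 4}}
cfg = from_dict_filtered(DemoConfig, raw)
print(cfg)
```

Filtering this way lets the same `config.json` carry extra sections without the config class having to know about them.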

## How this was made

```bash
python -m mlx_audio.tts.models.omnivoice.convert \
    --model k2-fsa/OmniVoice --output omnivoice-bf16 --dtype bfloat16

python -m mlx_audio.convert \
    --hf-path omnivoice-bf16 --mlx-path omnivoice-q4-g64 \
    --quantize --q-bits 4 --q-group-size 64
```

## License

Apache-2.0 (inherited from `k2-fsa/OmniVoice`).