Zyphra-ZONOS2-4bit (MLX, mixed-precision)

A mixed-precision quantized MLX build of Zyphra's ZONOS2 autoregressive text-to-speech model, for fast local inference on Apple Silicon with mlx-audio.

Base model: mlx-community/Zyphra-ZONOS2 (BF16)
Size: 4.68 GB (vs 15.34 GB BF16) — 4.88 bits/weight
Decode @ 44.1 kHz, 9 DAC codebooks, 16-expert top-1 MoE backbone, optional speaker cloning
Single-stream RTF ~0.85 (faster than real-time) and ~4–4.3× real-time batched on an M4 Pro (with our on-device batch-sampling optimization)
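
As a quick sanity check on the size figures above, the bits-per-weight number follows from the two file sizes alone. This is a minimal sketch that assumes the BF16 checkpoint stores exactly 16 bits per parameter and ignores file metadata; the helper name is ours, not part of mlx-audio.

```python
def bits_per_weight(quant_gb: float, bf16_gb: float) -> float:
    """Average bits per parameter of a quantized build.

    Assumes the BF16 file holds 2 bytes per parameter, so the
    parameter count is bf16_bytes / 2. The units cancel, so GB vs GiB
    does not matter as long as both sizes use the same one.
    """
    n_params = bf16_gb / 2.0          # in units of 1e9 parameters
    return quant_gb * 8.0 / n_params  # bits per parameter


bpw = bits_per_weight(quant_gb=4.68, bf16_gb=15.34)
print(f"{bpw:.2f} bits/weight")  # 4.88 bits/weight
```

The result sits above 4.0 because only the experts are 4-bit; attention, embeddings, and the bf16 layers pull the average up.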

Install

ZONOS2 needs an mlx-audio build that includes the ZONOS2 model and batching support. Install our optimized fork:

pip install git+https://github.com/Amal-David/mlx-audio.git@zonos2-optimized

Optimization details, benchmarks, and the quantization recipe live in the fork's OPTIMIZATIONS.md.

📁 Runnable scripts are in examples/: generate.py (single / clone / long-form), batch_generate.py (throughput runner), and quantize.py (reproduce this build from BF16).

Quick start

from mlx_audio.tts import load
from mlx_audio.audio_io import write as audio_write

model = load("amal-david/Zyphra-ZONOS2-4bit", lazy=False)

result = next(model.generate(
    text="Hello, this is the four bit ZONOS two model running locally with MLX audio.",
    max_tokens=1024,
))
audio_write("zonos2.wav", result.audio, result.sample_rate)
print(result.audio_duration, "RTF", result.real_time_factor)

Batch generation — the throughput lever

For generating many clips (eval runs, datasets), batching is the big win. B = 32–64 is the sweet spot on an M4 Pro; beyond that, throughput saturates (see the performance notes below).

texts = [f"This is sample number {i}." for i in range(64)]

for r in model.batch_generate(texts, max_tokens=1024, seed=42):
    audio_write(f"out_{r.sequence_idx:03d}.wav", r.audio, r.sample_rate)

Tip: group prompts of similar length in a batch. The loop runs until the longest sequence finishes, so mixing very short and very long prompts wastes compute.
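
One way to follow the tip above is to sort prompts by length before slicing them into batches. This is a minimal sketch, not part of the fork: `length_bucketed_batches` is a hypothetical helper, and character count is only a rough proxy for how long the generated audio will be.

```python
from typing import Iterator, List, Tuple


def length_bucketed_batches(
    texts: List[str], batch_size: int
) -> Iterator[List[Tuple[int, str]]]:
    """Yield batches of (original_index, text), shortest prompts first.

    Sorting by character count is a cheap proxy for audio length, so
    each batch holds prompts that should finish at similar steps.
    """
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    for start in range(0, len(order), batch_size):
        yield [(i, texts[i]) for i in order[start:start + batch_size]]


texts = [
    "Hi.",
    "A much longer sentence that takes a while to say.",
    "Ok.",
    "Another fairly long sentence for the second batch.",
]
batches = list(length_bucketed_batches(texts, batch_size=2))
for batch in batches:
    print([i for i, _ in batch])
```

Each batch's texts can then be passed to model.batch_generate. Keep the original index around for naming output files, since the index reported per result is presumably relative to the batch rather than to the full list.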

Voice cloning

Extract the speaker embedding once and reuse it so the voice stays consistent:

spk = model.extract_speaker_embedding("speaker.wav")     # 2048-D, compute once
result = next(model.generate(
    text="This sentence is spoken in the cloned reference voice.",
    speaker_embedding=spk, max_tokens=1024,
))
audio_write("cloned.wav", result.audio, result.sample_rate)

Long-form text (chunking)

There is no built-in long-form handling: a single call is capped by max_tokens (default 1024 ≈ 12 s) and quality is best within the model's window. For paragraphs/articles, split on sentence boundaries, batch the chunks, and concatenate:

import re

import mlx.core as mx

def speak_long(model, text, ref_audio=None, max_chars=350, gap_s=0.12, seed=42):
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, cur = [], ""
    for s in sents:
        if cur and len(cur) + len(s) > max_chars:
            chunks.append(cur)  # current chunk is full; start a new one
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)

    spk = model.extract_speaker_embedding(ref_audio) if ref_audio else None  # voice consistency
    results = sorted(model.batch_generate(chunks, speaker_embedding=spk, max_tokens=1024, seed=seed),
                     key=lambda r: r.sequence_idx)
    gap = mx.zeros((int(gap_s * model.sample_rate),))
    pieces = []
    for i, r in enumerate(results):
        if i: pieces.append(gap)
        pieces.append(r.audio)
    return mx.concatenate(pieces, axis=0), model.sample_rate

with open("article.txt", encoding="utf-8") as f:
    audio, sr = speak_long(model, f.read())
audio_write("long.wav", audio, sr)
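
When picking max_chars and max_tokens together, it can help to estimate how many tokens a chunk needs. The sketch below rests on two assumptions: the frame rate implied by "1024 tokens ≈ 12 s" above (about 85 frames/s), and a rough speaking rate of 15 characters per second, which is a rule of thumb and not a measurement of this model. Both helper names are hypothetical.

```python
import math

# Assumption: ~85 frames/s, derived from "1024 tokens ≈ 12 s" above.
FRAMES_PER_SECOND = 1024 / 12
# Assumption: rough speaking rate; not measured for this model.
CHARS_PER_SECOND = 15.0


def seconds_for_tokens(max_tokens: int) -> float:
    """Audio duration covered by a given token budget."""
    return max_tokens / FRAMES_PER_SECOND


def tokens_for_text(text: str, headroom: float = 1.25) -> int:
    """Token budget for a chunk, padded by a safety margin."""
    est_seconds = len(text) / CHARS_PER_SECOND
    return math.ceil(est_seconds * FRAMES_PER_SECOND * headroom)


print(round(seconds_for_tokens(1024), 1))  # 12.0
print(tokens_for_text("x" * 350))
```

Under these assumptions a 350-character chunk wants more than 1024 tokens, so if clips come out truncated, either lower max_chars or raise max_tokens for the batch.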

CLI

python -m mlx_audio.tts.generate \
  --model amal-david/Zyphra-ZONOS2-4bit \
  --text "Hello from the quantized ZONOS two model." \
  --output_path outputs --file_prefix zonos2_4bit

How this was quantized (and why it's mixed-precision)

ZONOS2 is a top-1 mixture-of-experts model: per token only ~1 of 16 experts is read, so quantization noise on the single active expert does not average out. Uniform 4-bit breaks generation — the model emits the end-of-audio token immediately and returns empty clips. The culprit, found by bisection, is 4-bit attention (not the experts). This recipe is verified to keep generation healthy across seeds:

Component	Precision	Why
MoE experts (`SwitchGLU` gate/up/down) — 94.5% of weights	4-bit gs64	The bandwidth lever; tolerates 4-bit
Attention `wq` / `wo`	8-bit gs64	4-bit attention → spurious early EOS; must stay ≥8-bit
Token embeddings	8-bit gs64	Phonetic stability
Output head, MoE router, gater, per-head temperature, RMSNorm, `ChunkedLinear` (wkv/w_in)	bf16	Quality-critical and/or tiny
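
The table can be cross-checked against the 4.88 bits/weight figure. This sketch assumes affine quantization stores one 16-bit scale and one 16-bit bias per group (so group size 64 adds 0.5 bits per weight), and it brackets the average because the table does not give the exact split of the remaining 5.5% of weights.

```python
# Assumption: affine quantization stores one 16-bit scale and one
# 16-bit bias per group, i.e. 32 extra bits per `group_size` weights.
def effective_bits(bits: int, group_size: int, overhead_bits: int = 32) -> float:
    return bits + overhead_bits / group_size


EXPERT_FRACTION = 0.945          # from the table above
rest = 1.0 - EXPERT_FRACTION     # attention, embeddings, bf16 layers

experts = EXPERT_FRACTION * effective_bits(4, 64)
low = experts + rest * effective_bits(8, 64)   # if the rest were all 8-bit
high = experts + rest * 16.0                   # if the rest were all bf16

print(f"{low:.2f} to {high:.2f} bits/weight")  # 4.72 to 5.13 bits/weight
```

The reported 4.88 falls inside that range, which is what a mix of 8-bit and bf16 for the non-expert weights would produce.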

Reproduce with mlx-audio's converter:

from mlx_audio.tts.utils import convert

SKIP = ("router", "multi_output", "gater", "norm", "temp")
def predicate(path, module):
    if any(s in path for s in SKIP):       return False
    if "experts"  in path:                 return {"group_size": 64, "bits": 4}
    if "attention" in path:                return {"group_size": 64, "bits": 8}  # 4-bit breaks EOS
    if "embedders" in path:                return {"group_size": 64, "bits": 8}
    return False                                                                 # keep bf16

convert("mlx-community/Zyphra-ZONOS2", "Zyphra-ZONOS2-4bit",
        quantize=True, q_group_size=64, q_bits=4, q_mode="affine",
        quant_predicate=predicate)
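
Because the predicate matches on substrings, the order of its checks matters: the skip list wins over the later rules. The snippet below restates the predicate and runs it on a few module paths. The paths are hypothetical and chosen only to exercise each branch; the real parameter names in the checkpoint may differ.

```python
SKIP = ("router", "multi_output", "gater", "norm", "temp")


def predicate(path, module=None):
    if any(s in path for s in SKIP):
        return False                             # keep bf16
    if "experts" in path:
        return {"group_size": 64, "bits": 4}
    if "attention" in path:
        return {"group_size": 64, "bits": 8}
    if "embedders" in path:
        return {"group_size": 64, "bits": 8}
    return False                                 # keep bf16


# Hypothetical module paths, for illustration only.
for path in (
    "layers.0.moe.experts.gate_proj",
    "layers.0.moe.router",
    "layers.0.attention.wq",
    "layers.0.attention.temp",
    "embedders.0",
    "multi_output.heads.0",
):
    print(f"{path:32s} -> {predicate(path) or 'bf16'}")
```

Note that a path containing both "attention" and "temp" stays in bf16, since the skip list is checked first.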

Performance notes (Apple M4 Pro, 20-core GPU, 273 GB/s, MLX 0.31)

Metric	BF16 base	This 4-bit
Weights on disk	15.34 GB	4.68 GB
Single-stream RTF (lower = faster)	~1.40	~0.85
Forward-only compute	1.0×	~1.6×
Batched throughput (B=32–96)	—	~4.0–4.35× real-time, ~350–375 frames/s
Peak memory @ B=32	~19 GB	~12 GB
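
For readers comparing the two kinds of numbers in the table: RTF is taken here to mean generation time divided by audio duration (consistent with "lower = faster"), so its reciprocal is the "× real-time" figure. A small sketch of the conversions, using only values from the table:

```python
def realtime_multiple(rtf: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    return 1.0 / rtf


def speedup(rtf_before: float, rtf_after: float) -> float:
    """How many times faster the second configuration is."""
    return rtf_before / rtf_after


print(f"{realtime_multiple(0.85):.2f}x real-time")  # 1.18x real-time
print(f"{speedup(1.40, 0.85):.2f}x vs BF16")        # 1.65x vs BF16
print(f"{15.34 / 4.68:.2f}x smaller on disk")       # 3.28x smaller on disk
```

The single-stream speedup of about 1.65× lines up with the ~1.6× forward-only figure, and it is well below the 3.28× size reduction, which fits the point that single-stream decode is not purely bandwidth-bound.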

What we learned profiling this on Apple Silicon — useful if you're optimizing it further:

The decode bottleneck is split between weight bandwidth (the forward pass) and lost pipelining from a per-frame GPU→CPU sync in the sampler — not raw compute. Quantization attacks the first; a decode-loop rewrite (on-device sampling + mx.async_eval + mx.compile) attacks the second. They stack to roughly 2.5–3× single-stream.
Top-1 MoE is sparse at batch 1 (reads ~1/16 of experts), which is why B=1 is cheap and also why batching saturates (~B=32–64 here): more sequences route to more distinct experts, so the MoE degenerates toward a dense all-expert read and hits the 273 GB/s wall. Throughput peaks around B=64 and regresses past it.
Quantization helps batching by cutting per-expert bandwidth, so it raises the batch ceiling.
System knobs (free): call mx.set_wired_limit(...) to avoid paging on long runs; run a 1–2 token warmup before timing (the first run takes ~2× longer because it pays for Metal kernel compilation); export MLX_METAL_FAST_SYNCH=1; keep decode single-process (concurrent processes contend for the one GPU and its memory bandwidth).
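
The batch-saturation argument in the notes above can be made concrete with a toy model. This sketch assumes each sequence picks its top-1 expert uniformly and independently, which real routers only approximate, so treat the numbers as an illustration of the trend and not as a measurement.

```python
def expected_expert_fraction(batch_size: int, n_experts: int = 16) -> float:
    """Expected fraction of experts hit by one decode step.

    Assumes each sequence picks its top-1 expert uniformly and
    independently, which real routers only approximate.
    """
    return 1.0 - (1.0 - 1.0 / n_experts) ** batch_size


for b in (1, 8, 32, 64, 96):
    frac = expected_expert_fraction(b)
    print(f"B={b:3d}: {frac:.0%} of experts read per layer")
```

Under this model a single sequence touches 1/16 of the experts, while B=32 already touches roughly 87% and B=64 roughly 98%, so extra sequences past that point add compute without amortizing any more weight reads.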

Done / further speedups available

✅ On-device batch sampling (single sync/step instead of one .tolist() per sequence) — ~+17% batch throughput at B=32, byte-identical (verified). Implemented in the fork.
✅ Ragged / continuous batching — drop finished rows mid-batch so the GPU never forwards completed sequences. On mixed-length real-EOS batches: 1.41× at B=32, up to ~1.78× at B=48 (workload-dependent). Not bit-exact: the longest surviving sequence finishes alone (B=1), and a B=1 matmul rounds slightly differently than B=8 — the AR loop turns that into a different but equally-valid sample (7/8 sequences identical in testing; the longest diverged, same length, perceptually equivalent). Lossy-but-quality-preserving. Implemented in the fork.
Single-stream pipelining (mx.async_eval + mx.compile): forward-bound at RTF ~0.82; measured mx.compile gain ~1.0–1.1× — not worth it. ChunkedLinear quant gave no forward gain either (tested). The single-stream forward is at its bandwidth/dispatch floor.
Speculative / multi-token decode — needs lightweight trained draft heads (backbone frozen); ~2–3× single-stream on this hardware. A clean 5× single-stream is below the bandwidth floor (physics).
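
To see where the ragged-batching gain listed above comes from, count how many sequence-rows each strategy forwards. This is a back-of-the-envelope sketch with made-up frame counts; the ratio is an upper bound on saved row-forwards, not a wall-clock prediction, because a decode step that is bandwidth-bound does not get proportionally cheaper when rows are dropped.

```python
from typing import Sequence, Tuple


def row_steps(lengths: Sequence[int]) -> Tuple[int, int]:
    """Return (padded, ragged) counts of sequence-rows forwarded.

    padded: every row runs until the longest sequence ends.
    ragged: each row is dropped as soon as it finishes.
    """
    padded = len(lengths) * max(lengths)
    ragged = sum(lengths)
    return padded, ragged


lengths = [120, 200, 340, 400, 410, 650, 700, 1024]  # made-up frame counts
padded, ragged = row_steps(lengths)
print(padded, ragged, round(padded / ragged, 2))  # 8192 3844 2.13
```

The measured 1.41× to 1.78× speedups are smaller than a ratio like this, which is the expected direction when each step's cost is dominated by reading expert weights.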

⚠️ Quality: 4-bit experts are perceptually close to BF16 on typical utterances but can show intermittent artifacts on hard/outlier cases (top-1 MoE). A blind A/B vs the BF16 base is recommended before production use; an 8-bit build is the safe-fidelity fallback.

Code & credits

Optimized inference fork: Amal-David/mlx-audio @ zonos2-optimized — quantization recipe, batching tooling, and benchmarks (this model is built with it)
Model: Zyphra/ZONOS2
MLX framework: mlx-audio (Prince Canuma, MIT)
Reference for the ZONOS2 streaming/batching code: Lucas Newman's experimental mlx-audio branch

See the base model card for upstream license and details.
