Zyphra-ZONOS2-4bit (MLX, mixed-precision)

A mixed-precision quantized MLX build of Zyphra's ZONOS2 autoregressive text-to-speech model, for fast local inference on Apple Silicon with mlx-audio.

  • Base model: mlx-community/Zyphra-ZONOS2 (BF16)
  • Size: 4.68 GB (vs 15.34 GB BF16), 4.88 bits/weight
  • Decode @ 44.1 kHz, 9 DAC codebooks, 16-expert top-1 MoE backbone, optional speaker cloning
  • Single-stream RTF ~0.85 (faster than real-time) and ~4–4.3× real-time batched on an M4 Pro (with our on-device batch-sampling optimization)
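
The bits/weight figure can be recovered from the two on-disk sizes. The sketch below assumes the BF16 checkpoint stores exactly 16 bits per weight and that both sizes are quoted in the same unit; the function name is illustrative only.

```python
# Sizes quoted in this card (same unit for both, so the unit cancels out).
BF16_SIZE_GB = 15.34
QUANT_SIZE_GB = 4.68
BF16_BITS = 16  # assumption: the base checkpoint stores 16 bits per weight


def effective_bits_per_weight(quant_size, base_size, base_bits=BF16_BITS):
    """Average bits per weight implied by the size ratio of two checkpoints."""
    return quant_size / base_size * base_bits


bpw = effective_bits_per_weight(QUANT_SIZE_GB, BF16_SIZE_GB)
shrink = BF16_SIZE_GB / QUANT_SIZE_GB
print(f"{bpw:.2f} bits/weight, {shrink:.2f}x smaller on disk")
```

The result sits above 4.0 because only the experts are 4-bit; the other tensors stay at 8-bit or bf16.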

Install

ZONOS2 needs a build of mlx-audio that includes the ZONOS2 model and batching support. Install our optimized fork:

pip install git+https://github.com/Amal-David/mlx-audio.git@zonos2-optimized

Optimization details, benchmarks, and the quantization recipe live in the fork's OPTIMIZATIONS.md.

πŸ“ Runnable scripts are in examples/: generate.py (single / clone / long-form), batch_generate.py (throughput runner), and quantize.py (reproduce this build from BF16).

Quick start

from mlx_audio.tts import load
from mlx_audio.audio_io import write as audio_write

model = load("amal-david/Zyphra-ZONOS2-4bit", lazy=False)

result = next(model.generate(
    text="Hello, this is the four bit ZONOS two model running locally with MLX audio.",
    max_tokens=1024,
))
audio_write("zonos2.wav", result.audio, result.sample_rate)
print(result.audio_duration, "RTF", result.real_time_factor)

Batch generation: the throughput lever

For generating many clips (eval runs, datasets), batching is the big win. B = 32–64 is the sweet spot on an M4 Pro; beyond that, throughput saturates (see notes).

texts = [f"This is sample number {i}." for i in range(64)]

for r in model.batch_generate(texts, max_tokens=1024, seed=42):
    audio_write(f"out_{r.sequence_idx:03d}.wav", r.audio, r.sample_rate)

Tip: group prompts of similar length in a batch. The loop runs until the longest sequence finishes, so mixing very short and very long prompts wastes compute.
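
One way to follow this tip is to sort prompts by length before slicing them into batches. The helper below is a minimal sketch, not part of mlx-audio: `bucket_by_length` is a hypothetical name, and it uses character count as a rough proxy for audio length, which is an assumption.

```python
def bucket_by_length(texts, batch_size):
    """Group prompts into batches of similar length.

    Returns a list of batches; each batch is a list of (original_index, text)
    pairs so outputs can be mapped back to the input order afterwards.
    """
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    return [
        [(i, texts[i]) for i in order[start:start + batch_size]]
        for start in range(0, len(order), batch_size)
    ]


texts = [
    "Hi.",
    "A much longer sentence that takes far more frames to speak.",
    "Short one.",
    "Another fairly long sentence for the second bucket.",
]
batches = bucket_by_length(texts, batch_size=2)
for batch in batches:
    print([i for i, _ in batch])
```

Each batch's texts can then be passed to model.batch_generate on their own, using the stored original indices to name the output files.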

Voice cloning

Extract the speaker embedding once and reuse it so the voice stays consistent:

spk = model.extract_speaker_embedding("speaker.wav")     # 2048-D, compute once
result = next(model.generate(
    text="This sentence is spoken in the cloned reference voice.",
    speaker_embedding=spk, max_tokens=1024,
))
audio_write("cloned.wav", result.audio, result.sample_rate)

Long-form text (chunking)

There is no built-in long-form handling: a single call is capped by max_tokens (default 1024 ≈ 12 s), and quality is best within the model's window. For paragraphs or articles, split on sentence boundaries, batch the chunks, and concatenate:

import re

import mlx.core as mx

def speak_long(model, text, ref_audio=None, max_chars=350, gap_s=0.12, seed=42):
    # Split on sentence boundaries, then pack sentences into chunks of up to max_chars.
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, cur = [], ""
    for s in sents:
        if cur and len(cur) + len(s) > max_chars:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)

    spk = model.extract_speaker_embedding(ref_audio) if ref_audio else None  # voice consistency
    # Sort by sequence_idx so the audio is joined in the original chunk order.
    results = sorted(model.batch_generate(chunks, speaker_embedding=spk, max_tokens=1024, seed=seed),
                     key=lambda r: r.sequence_idx)
    gap = mx.zeros((int(gap_s * model.sample_rate),))  # short silence between chunks
    pieces = []
    for i, r in enumerate(results):
        if i:
            pieces.append(gap)
        pieces.append(r.audio)
    return mx.concatenate(pieces, axis=0), model.sample_rate

with open("article.txt") as f:
    audio, sr = speak_long(model, f.read())
audio_write("long.wav", audio, sr)

CLI

python -m mlx_audio.tts.generate \
  --model amal-david/Zyphra-ZONOS2-4bit \
  --text "Hello from the quantized ZONOS two model." \
  --output_path outputs --file_prefix zonos2_4bit

How this was quantized (and why it's mixed-precision)

ZONOS2 is a top-1 mixture-of-experts model: per token only ~1 of 16 experts is read, so quantization noise on the single active expert does not average out. Uniform 4-bit quantization breaks generation: the model emits the end-of-audio token immediately and returns empty clips. The culprit, found by bisection, is 4-bit attention (not the experts). This recipe is verified to keep generation healthy across seeds:

| Component | Precision (gs = group size) | Why |
|---|---|---|
| MoE experts (SwitchGLU gate/up/down), 94.5% of weights | 4-bit gs64 | The bandwidth lever; tolerates 4-bit |
| Attention wq / wo | 8-bit gs64 | 4-bit attention → spurious early EOS; must stay ≥8-bit |
| Token embeddings | 8-bit gs64 | Phonetic stability |
| Output head, MoE router, gater, per-head temperature, RMSNorm, ChunkedLinear (wkv/w_in) | bf16 | Quality-critical and/or tiny |
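
As a rough consistency check on the table, the sketch below works out the effective bits per weight of each quantized tier. It assumes group-wise affine quantization stores one 16-bit scale and one 16-bit bias per group, which is an assumption about the storage layout and not something this card states. It then solves for the average bits the remaining 5.5% of weights must carry to reach the 4.88 bits/weight overall figure.

```python
def quantized_bits(bits, group_size, scale_bits=16, bias_bits=16):
    """Effective bits per weight for group-wise affine quantization.

    Each group of `group_size` weights shares one scale and one bias
    (assumed 16 bits each here), which adds a per-weight overhead.
    """
    return bits + (scale_bits + bias_bits) / group_size


EXPERT_SHARE = 0.945  # share of weights in the MoE experts (from the table)
OVERALL_BPW = 4.88    # overall figure quoted in this card

expert_bpw = quantized_bits(4, 64)   # 4-bit gs64 tier
eight_bit_bpw = quantized_bits(8, 64)  # 8-bit gs64 tier

# Average bits/weight the non-expert 5.5% must carry to hit the overall figure.
rest_bpw = (OVERALL_BPW - EXPERT_SHARE * expert_bpw) / (1 - EXPERT_SHARE)

print(f"experts: {expert_bpw} bpw, 8-bit tier: {eight_bit_bpw} bpw")
print(f"non-expert weights average about {rest_bpw:.1f} bpw")
```

Under these assumptions the non-expert average falls between the 8-bit tier and bf16, which is what a mix of 8-bit attention and embeddings with bf16 heads and norms would produce.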

Reproduce with mlx-audio's converter:

from mlx_audio.tts.utils import convert

SKIP = ("router", "multi_output", "gater", "norm", "temp")
def predicate(path, module):
    if any(s in path for s in SKIP):       return False
    if "experts"  in path:                 return {"group_size": 64, "bits": 4}
    if "attention" in path:                return {"group_size": 64, "bits": 8}  # 4-bit breaks EOS
    if "embedders" in path:                return {"group_size": 64, "bits": 8}
    return False                                                                 # keep bf16

convert("mlx-community/Zyphra-ZONOS2", "Zyphra-ZONOS2-4bit",
        quantize=True, q_group_size=64, q_bits=4, q_mode="affine",
        quant_predicate=predicate)
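
Before running a full conversion, it can help to check what the predicate returns for a few module paths. The sketch below repeats the predicate so it runs on its own. The paths are hypothetical and only illustrate the matching rules; the real names depend on the model implementation in the fork.

```python
SKIP = ("router", "multi_output", "gater", "norm", "temp")


def predicate(path, module=None):
    """Same rules as the recipe above; `module` is unused in this sketch."""
    if any(s in path for s in SKIP):
        return False
    if "experts" in path:
        return {"group_size": 64, "bits": 4}
    if "attention" in path:
        return {"group_size": 64, "bits": 8}
    if "embedders" in path:
        return {"group_size": 64, "bits": 8}
    return False


# Hypothetical module paths, for illustration only.
sample_paths = [
    "layers.0.moe.experts.gate_proj",  # "gate_proj" does not contain "gater"
    "layers.0.moe.router",
    "layers.0.attention.wq",
    "layers.0.attention.norm",         # SKIP is checked first, so this stays bf16
    "embedders.0",
    "multi_output.heads.0",
]

for p in sample_paths:
    decision = predicate(p)
    label = f"{decision['bits']}-bit gs{decision['group_size']}" if decision else "keep bf16"
    print(f"{p:35s} {label}")
```

Because the rules are substring matches, order matters: the SKIP check runs first, so a norm inside an attention block is not caught by the attention rule. A path such as a hypothetical layers.0.attention.wkv would match the attention rule here; whether that module is actually quantized depends on whether the converter treats it as quantizable, which this sketch does not model.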

Performance notes (Apple M4 Pro, 20-core GPU, 273 GB/s, MLX 0.31)

| Metric | BF16 base | This 4-bit |
|---|---|---|
| Weights on disk | 15.34 GB | 4.68 GB |
| Single-stream RTF (lower = faster) | ~1.40 | ~0.85 |
| Forward-only compute | 1.0× | ~1.6× |
| Batched throughput (B=32–96) | - | ~4.0–4.35× real-time, ~350–375 frames/s |
| Peak memory @ B=32 | ~19 GB | ~12 GB |
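
The units in this table convert into each other with a little arithmetic. The sketch below assumes RTF means generation time divided by audio duration, and that one decode frame covers 512 audio samples at 44.1 kHz. The 512-sample hop is an assumption not stated in this card; it is used here because it reproduces the card's own figures.

```python
SAMPLE_RATE = 44_100
HOP_SAMPLES = 512  # assumption: audio samples per decode frame
FRAMES_PER_AUDIO_SECOND = SAMPLE_RATE / HOP_SAMPLES  # about 86.1


def realtime_multiple_from_rtf(rtf):
    """RTF = generation time / audio duration, so the speed multiple is 1 / RTF."""
    return 1.0 / rtf


def realtime_multiple_from_frames(frames_per_wall_second):
    """Seconds of audio produced per wall-clock second."""
    return frames_per_wall_second / FRAMES_PER_AUDIO_SECOND


def seconds_of_audio(frames):
    """Audio duration covered by a number of decode frames."""
    return frames / FRAMES_PER_AUDIO_SECOND


print(f"RTF 0.85       -> {realtime_multiple_from_rtf(0.85):.2f}x real-time")
print(f"350 frames/s   -> {realtime_multiple_from_frames(350):.2f}x real-time")
print(f"375 frames/s   -> {realtime_multiple_from_frames(375):.2f}x real-time")
print(f"1024 frames    -> {seconds_of_audio(1024):.1f} s of audio")
```

With that hop size, 350–375 frames/s lines up with the ~4.0–4.35× real-time range in the table, and 1024 frames comes out close to the 12 s cap mentioned for max_tokens.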

What we learned profiling this on Apple Silicon β€” useful if you're optimizing it further:

  • The decode bottleneck is split between weight bandwidth (the forward pass) and lost pipelining from a per-frame GPU→CPU sync in the sampler, not raw compute. Quantization attacks the first; a decode-loop rewrite (on-device sampling + mx.async_eval + mx.compile) attacks the second. They stack to roughly 2.5–3× single-stream.
  • Top-1 MoE is sparse at batch 1 (reads 1/16 of experts), which is why B=1 is cheap and why batching saturates (B=32–64 here): more sequences route to more distinct experts, so the MoE degenerates toward a dense all-expert read and hits the 273 GB/s wall. Throughput peaks around B=64 and regresses past it.
  • Quantization helps batching by cutting per-expert bandwidth, so it raises the batch ceiling.
  • System knobs (free): mx.set_wired_limit(...) to avoid paging on long runs; run a 1–2 token warmup before timing (the first run takes ~2× as long because it includes Metal kernel compilation); export MLX_METAL_FAST_SYNCH=1; keep decode single-process (concurrent processes contend for the one GPU and its bandwidth).
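
The saturation argument in the second bullet can be made concrete with a small probability estimate. The sketch below assumes each token picks its expert uniformly and independently. That is a simplification, since a trained router is neither uniform nor independent, so the numbers are only a rough guide to the trend.

```python
NUM_EXPERTS = 16


def expected_active_experts(batch_size, num_experts=NUM_EXPERTS):
    """Expected number of distinct experts hit in one decode step.

    Assumes uniform, independent top-1 routing: a given expert is missed by
    all `batch_size` tokens with probability (1 - 1/num_experts) ** batch_size.
    """
    miss_probability = (1 - 1 / num_experts) ** batch_size
    return num_experts * (1 - miss_probability)


for b in (1, 8, 32, 64, 96):
    active = expected_active_experts(b)
    print(f"B={b:>2}: ~{active:.1f} of {NUM_EXPERTS} experts read per MoE layer")
```

Under this model a single sequence reads one expert per layer, while a batch of 32 already touches about 14 of 16, so the per-step weight traffic is close to a dense read well before B=64.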

Done / further speedups available

  • ✅ On-device batch sampling (single sync/step instead of one .tolist() per sequence): ~+17% batch throughput at B=32, byte-identical output (verified). Implemented in the fork.
  • ✅ Ragged / continuous batching: drop finished rows mid-batch so the GPU never forwards completed sequences. On mixed-length real-EOS batches: 1.41× at B=32, up to ~1.78× at B=48 (workload-dependent). Not bit-exact: the longest surviving sequence finishes alone (B=1), and a B=1 matmul rounds slightly differently than B=8, so the AR loop turns that into a different but equally valid sample (7/8 sequences identical in testing; the longest diverged, same length, perceptually equivalent). Lossy but quality-preserving. Implemented in the fork.
  • Single-stream pipelining (mx.async_eval + mx.compile): forward-bound at RTF ~0.82; measured mx.compile gain ~1.0–1.1×, which is not worth it. ChunkedLinear quantization gave no forward gain either (tested). The single-stream forward is at its bandwidth/dispatch floor.
  • Speculative / multi-token decode: needs lightweight trained draft heads (backbone frozen); ~2–3× single-stream on this hardware. A clean 5× single-stream is below the bandwidth floor (physics).
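
To see where the ragged-batching gain comes from, the toy loop below counts how many rows are forwarded with and without dropping finished sequences. It is a simulation with made-up frame counts, not the fork's implementation, and `simulate_decode` is a hypothetical name.

```python
def simulate_decode(lengths, ragged):
    """Toy decode loop for one batch; returns (steps, row_forwards).

    lengths[i] is the number of frames sequence i emits before EOS.
    """
    remaining = list(lengths)
    active = list(range(len(lengths)))
    steps = row_forwards = 0
    while any(remaining[i] > 0 for i in active):
        if ragged:
            # Drop rows that already hit EOS so they are not forwarded again.
            active = [i for i in active if remaining[i] > 0]
        row_forwards += len(active)
        for i in active:
            remaining[i] = max(0, remaining[i] - 1)
        steps += 1
    return steps, row_forwards


lengths = [120, 300, 450, 900]  # made-up frame counts for a mixed-length batch
padded_steps, padded_rows = simulate_decode(lengths, ragged=False)
ragged_steps, ragged_rows = simulate_decode(lengths, ragged=True)
print(f"padded: {padded_steps} steps, {padded_rows} row-forwards")
print(f"ragged: {ragged_steps} steps, {ragged_rows} row-forwards")
```

Both variants take the same number of steps, because the longest sequence sets the loop length, but the ragged loop forwards 1770 rows against 3600. Row counts overstate the wall-clock gain when a step's cost is dominated by weight reads instead of the number of rows, so the measured speedups above are the figures to rely on.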

⚠️ Quality: 4-bit experts are perceptually close to BF16 on typical utterances but can show intermittent artifacts on hard/outlier cases (top-1 MoE). A blind A/B vs the BF16 base is recommended before production use; an 8-bit build is the safe-fidelity fallback.

Code & credits

See the base model card for upstream license and details.
