Zyphra-ZONOS2-4bit (MLX, mixed-precision)
A mixed-precision quantized MLX build of Zyphra's ZONOS2
autoregressive text-to-speech model, for fast local inference on Apple Silicon with
mlx-audio.
- Base model: mlx-community/Zyphra-ZONOS2 (BF16)
- Size: 4.68 GB (vs 15.34 GB BF16), about 4.88 bits/weight
- Decode @ 44.1 kHz, 9 DAC codebooks, 16-expert top-1 MoE backbone, optional speaker cloning
- Single-stream RTF ~0.85 (faster than real-time) and ~4–4.3× real-time batched on an M4 Pro (with our on-device batch-sampling optimization)
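The bits/weight figure can be sanity-checked from the two checkpoint sizes alone. This is a minimal sketch; it assumes both files hold the same number of weights and that non-weight file overhead is negligible, and the helper name is ours, not part of mlx-audio.

```python
def bits_per_weight(quantized_gb: float, bf16_gb: float, bf16_bits: int = 16) -> float:
    """Average stored bits per weight, inferred from the size ratio to the BF16 checkpoint."""
    return quantized_gb / bf16_gb * bf16_bits

print(round(bits_per_weight(4.68, 15.34), 2))  # 4.88
```

The result sits above 4.0 because only the experts are stored at 4-bit; attention, embeddings, and the bf16 parts pull the average up.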
Install
ZONOS2 needs an mlx-audio build that includes the ZONOS2 model and batching support. Install our optimized fork:
```bash
pip install git+https://github.com/Amal-David/mlx-audio.git@zonos2-optimized
```
Optimization details, benchmarks, and the quantization recipe live in the fork's
OPTIMIZATIONS.md.
Runnable scripts are in examples/: generate.py (single / clone / long-form), batch_generate.py (throughput runner), and quantize.py (reproduce this build from BF16).
Quick start
```python
from mlx_audio.tts import load
from mlx_audio.audio_io import write as audio_write

model = load("amal-david/Zyphra-ZONOS2-4bit", lazy=False)
result = next(model.generate(
    text="Hello, this is the four bit ZONOS two model running locally with MLX audio.",
    max_tokens=1024,
))
audio_write("zonos2.wav", result.audio, result.sample_rate)
print(result.audio_duration, "RTF", result.real_time_factor)
```
Batch generation: the throughput lever
For generating many clips (eval runs, datasets), batching is the big win. B = 32–64 is the sweet spot on an M4 Pro; beyond that, throughput saturates (see notes).
```python
# Reuses `model` and `audio_write` from the Quick start.
texts = [f"This is sample number {i}." for i in range(64)]
for r in model.batch_generate(texts, max_tokens=1024, seed=42):
    audio_write(f"out_{r.sequence_idx:03d}.wav", r.audio, r.sample_rate)
```
Tip: group prompts of similar length in a batch. The loop runs until the longest sequence finishes, so mixing very short and very long prompts wastes compute.
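One way to follow this tip is to sort prompts by length before slicing them into batches. The sketch below is plain Python and does not touch the model; `length_buckets` is a hypothetical helper, and character count is only a rough proxy for spoken duration.

```python
def length_buckets(texts, batch_size):
    """Yield (original_indices, prompts) batches of similar-length prompts."""
    # Sort indices by prompt length so each batch holds prompts of similar size.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [texts[i] for i in idx]

texts = [
    "Hi.",
    "A much longer sentence that takes a while to speak aloud.",
    "OK.",
    "Another fairly long sentence for the second batch.",
]
for idx, batch in length_buckets(texts, batch_size=2):
    print(idx, [len(t) for t in batch])
```

Each batch can then be passed to model.batch_generate. Assuming sequence_idx is the position within the list passed in, idx[r.sequence_idx] recovers the prompt's original position.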
Voice cloning
Extract the speaker embedding once and reuse it so the voice stays consistent:
```python
spk = model.extract_speaker_embedding("speaker.wav")  # 2048-D, compute once
result = next(model.generate(
    text="This sentence is spoken in the cloned reference voice.",
    speaker_embedding=spk, max_tokens=1024,
))
audio_write("cloned.wav", result.audio, result.sample_rate)
```
Long-form text (chunking)
There is no built-in long-form handling: a single call is capped by max_tokens
(default 1024 ≈ 12 s) and quality is best within the model's window. For paragraphs/articles,
split on sentence boundaries, batch the chunks, and concatenate:
```python
import re
import mlx.core as mx

def speak_long(model, text, ref_audio=None, max_chars=350, gap_s=0.12, seed=42):
    # Split on sentence boundaries, then pack sentences into chunks of roughly max_chars.
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, cur = [], ""
    for s in sents:
        if cur and len(cur) + len(s) > max_chars:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)

    # One speaker embedding shared by every chunk keeps the voice consistent.
    spk = model.extract_speaker_embedding(ref_audio) if ref_audio else None
    results = sorted(
        model.batch_generate(chunks, speaker_embedding=spk, max_tokens=1024, seed=seed),
        key=lambda r: r.sequence_idx,
    )

    # Join the chunks in order, with a short silence between them.
    gap = mx.zeros((int(gap_s * model.sample_rate),))
    pieces = []
    for i, r in enumerate(results):
        if i:
            pieces.append(gap)
        pieces.append(r.audio)
    return mx.concatenate(pieces, axis=0), model.sample_rate

with open("article.txt") as f:
    audio, sr = speak_long(model, f.read())
audio_write("long.wav", audio, sr)
```
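The "1024 tokens ≈ 12 s" figure follows from the codec frame rate. The sketch below assumes one generated token step per codec frame and a hop of 512 samples per frame at 44.1 kHz; the hop value is our assumption (the card states only the sample rate), so check it against the model config before relying on it.

```python
import math

SAMPLE_RATE = 44_100                      # decode rate stated in the card
HOP = 512                                 # ASSUMPTION: samples per codec frame
FRAMES_PER_SECOND = SAMPLE_RATE / HOP     # about 86.1 frames/s

def seconds_for(max_tokens: int) -> float:
    """Longest clip a given max_tokens budget can produce."""
    return max_tokens / FRAMES_PER_SECOND

def tokens_for(seconds: float) -> int:
    """Smallest max_tokens budget that covers a clip of this length."""
    return math.ceil(seconds * FRAMES_PER_SECOND)

print(round(seconds_for(1024), 1))  # 11.9
print(tokens_for(12))               # 1034
```

Under these assumptions, a chunk that needs more than about 12 s of speech would be cut off at the default budget, which is one reason to keep max_chars conservative.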
CLI
```bash
python -m mlx_audio.tts.generate \
  --model amal-david/Zyphra-ZONOS2-4bit \
  --text "Hello from the quantized ZONOS two model." \
  --output_path outputs --file_prefix zonos2_4bit
```
How this was quantized (and why it's mixed-precision)
ZONOS2 is a top-1 mixture-of-experts model: per token only ~1 of 16 experts is read, so quantization noise on the single active expert does not average out. Uniform 4-bit breaks generation: the model emits the end-of-audio token immediately and returns empty clips. The culprit, found by bisection, is 4-bit attention (not the experts). This recipe is verified to keep generation healthy across seeds:
| Component | Precision | Why |
|---|---|---|
| MoE experts (SwitchGLU gate/up/down), ~94.5% of weights | 4-bit gs64 | The bandwidth lever; tolerates 4-bit |
| Attention wq / wo | 8-bit gs64 | 4-bit attention → spurious early EOS; must stay ≥8-bit |
| Token embeddings | 8-bit gs64 | Phonetic stability |
| Output head, MoE router, gater, per-head temperature, RMSNorm, ChunkedLinear (wkv/w_in) | bf16 | Quality-critical and/or tiny |
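As a rough cross-check of the recipe against the reported ~4.88 bits/weight, the sketch below bounds the average from the 94.5% expert share alone. It assumes affine quantization stores one 16-bit scale and one 16-bit bias per group (so 4-bit gs64 costs 4.5 bits per weight). The card does not say how the remaining 5.5% splits between 8-bit and bf16, so only a lower and an upper bound are computed.

```python
def affine_bits(bits: int, group_size: int, meta_bits: int = 32) -> float:
    """Stored bits per weight: payload plus per-group scale and bias (ASSUMED 16 bits each)."""
    return bits + meta_bits / group_size

expert_share = 0.945              # from the card
rest = 1 - expert_share           # attention, embeddings, and bf16 parts combined

experts = expert_share * affine_bits(4, 64)
low = experts + rest * affine_bits(8, 64)   # if everything else were 8-bit gs64
high = experts + rest * 16                  # if everything else stayed bf16

print(round(low, 2), round(high, 2))
```

The reported 4.88 falls between the two bounds, which is consistent with the remainder being a mix of 8-bit and bf16 tensors.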
Reproduce with mlx-audio's converter:
```python
from mlx_audio.tts.utils import convert

SKIP = ("router", "multi_output", "gater", "norm", "temp")

def predicate(path, module):
    if any(s in path for s in SKIP):
        return False
    if "experts" in path:
        return {"group_size": 64, "bits": 4}
    if "attention" in path:
        return {"group_size": 64, "bits": 8}  # 4-bit breaks EOS
    if "embedders" in path:
        return {"group_size": 64, "bits": 8}
    return False  # keep bf16

convert("mlx-community/Zyphra-ZONOS2", "Zyphra-ZONOS2-4bit",
        quantize=True, q_group_size=64, q_bits=4, q_mode="affine",
        quant_predicate=predicate)
```
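Because the predicate matches on substrings and checks the skip list first, it can help to dry-run it over parameter paths before a long conversion. The sketch below repeats the predicate so it runs without mlx-audio installed; the paths are hypothetical examples, not names read from the real checkpoint.

```python
SKIP = ("router", "multi_output", "gater", "norm", "temp")

def predicate(path, module=None):
    """Same rules as the recipe; `module` is unused in this dry run."""
    if any(s in path for s in SKIP):
        return False
    if "experts" in path:
        return {"group_size": 64, "bits": 4}
    if "attention" in path:
        return {"group_size": 64, "bits": 8}
    if "embedders" in path:
        return {"group_size": 64, "bits": 8}
    return False

def describe(path):
    """Human-readable precision that the predicate assigns to a path."""
    cfg = predicate(path)
    return f"{cfg['bits']}-bit gs{cfg['group_size']}" if cfg else "bf16"

# HYPOTHETICAL paths, for illustration only; real names come from the checkpoint.
for path in (
    "layers.0.moe.experts.up_proj",
    "layers.0.moe.router",
    "layers.0.attention.wq",
    "layers.0.attention.norm",
    "embedders.0",
    "multi_output.0",
):
    print(f"{path:32s} {describe(path)}")
```

The skip list wins over the later rules, so a norm nested under an attention block stays bf16 even though its path contains "attention".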
Performance notes (Apple M4 Pro, 20-core GPU, 273 GB/s, MLX 0.31)
| Metric | BF16 base | This 4-bit |
|---|---|---|
| Weights on disk | 15.34 GB | 4.68 GB |
| Single-stream RTF (lower = faster) | ~1.40 | ~0.85 |
| Forward-only compute | 1.0× | ~1.6× |
| Batched throughput (B=32–96) | n/a | ~4.0–4.35× real-time, ~350–375 frames/s |
| Peak memory @ B=32 | ~19 GB | ~12 GB |
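The table mixes three units: RTF, a speedup ratio, and frames per second. The sketch below shows how they relate, taking RTF as generation time divided by audio duration. The codec frame rate of 44100 / 512 frames/s is our assumption, not a figure from the card.

```python
CODEC_FPS = 44_100 / 512   # ASSUMPTION: codec frames per second of audio

def realtime_multiple_from_rtf(rtf: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    return 1.0 / rtf

def realtime_multiple_from_fps(frames_per_s: float, codec_fps: float = CODEC_FPS) -> float:
    """Convert decode throughput in frames/s to a multiple of real time."""
    return frames_per_s / codec_fps

print(round(realtime_multiple_from_rtf(0.85), 2))   # 1.18
print(round(1.40 / 0.85, 2))                        # 1.65
print(round(realtime_multiple_from_fps(350), 2))    # 4.06
print(round(realtime_multiple_from_fps(375), 2))    # 4.35
```

Under that assumption, 350–375 frames/s lines up with the ~4.0–4.35× real-time figure in the same row, and the BF16-to-4-bit RTF ratio is close to the ~1.6× forward-only figure.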
What we learned profiling this on Apple Silicon (useful if you're optimizing it further):
- The decode bottleneck is split between weight bandwidth (the forward pass) and lost pipelining from a per-frame GPU→CPU sync in the sampler, not raw compute. Quantization attacks the first; a decode-loop rewrite (on-device sampling + mx.async_eval + mx.compile) attacks the second. They stack to roughly 2.5–3× single-stream.
- Top-1 MoE is sparse at batch 1 (reads 1/16 of experts), which is why B=1 is cheap, and why batching saturates (B=32–64 here): more sequences route to more distinct experts, so the MoE degenerates toward a dense all-expert read and hits the 273 GB/s wall. Throughput peaks around B=64 and regresses past it.
- Quantization helps batching by cutting per-expert bandwidth, so it raises the batch ceiling.
- System knobs (free): mx.set_wired_limit(...) to avoid paging on long runs; run a 1–2 token warmup before timing (the first run takes ~2× as long because of Metal kernel compilation); export MLX_METAL_FAST_SYNCH=1; keep decode single-process (concurrent processes contend for the one GPU + bandwidth).
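The saturation argument can be made concrete with a back-of-the-envelope model. The sketch below assumes each sequence in the batch routes independently and uniformly to one of 16 experts per step, which real routers do not guarantee, and estimates how many distinct experts a single decode step has to read.

```python
def expected_active_experts(batch: int, n_experts: int = 16) -> float:
    """Expected number of distinct experts hit by `batch` independent top-1 routings."""
    # Probability that one given expert is missed by every sequence in the batch.
    p_unused = (1 - 1 / n_experts) ** batch
    return n_experts * (1 - p_unused)

for b in (1, 8, 32, 64):
    print(b, round(expected_active_experts(b), 2))
```

Under this model a batch of 32 already touches about 14 of the 16 experts and a batch of 64 about 15.7, so each step reads nearly the full expert weights and further batching adds little reuse.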
Done / further speedups available
- ✅ On-device batch sampling (single sync/step instead of one .tolist() per sequence): ~+17% batch throughput at B=32, byte-identical (verified). Implemented in the fork.
- ✅ Ragged / continuous batching: drop finished rows mid-batch so the GPU never forwards completed sequences. On mixed-length real-EOS batches: 1.41× at B=32, up to ~1.78× at B=48 (workload-dependent). Not bit-exact: the longest surviving sequence finishes alone (B=1), and a B=1 matmul rounds slightly differently than B=8, so the AR loop turns that into a different but equally valid sample (7/8 sequences identical in testing; the longest diverged, same length, perceptually equivalent). Lossy-but-quality-preserving. Implemented in the fork.
- Single-stream pipelining (mx.async_eval + mx.compile): forward-bound at RTF ~0.82; measured mx.compile gain ~1.0–1.1×, not worth it. ChunkedLinear quant gave no forward gain either (tested). The single-stream forward is at its bandwidth/dispatch floor.
- Speculative / multi-token decode: needs lightweight trained draft heads (backbone frozen); ~2–3× single-stream on this hardware. A clean 5× single-stream is below the bandwidth floor (physics).
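To see why ragged batching pays off on mixed-length batches, it helps to count row-steps, meaning how many (sequence, decode step) pairs the model forwards. This is an idealized sketch with hypothetical lengths; it ignores that smaller batches are less efficient per row, so measured wall-clock gains will be lower than the ratio it prints.

```python
def static_row_steps(lengths):
    """Every row is forwarded until the longest sequence finishes."""
    return len(lengths) * max(lengths)

def ragged_row_steps(lengths):
    """Finished rows are dropped, so each row is forwarded only for its own length."""
    return sum(lengths)

lengths = [200, 350, 500, 1000]   # HYPOTHETICAL frame counts at end-of-audio
static, ragged = static_row_steps(lengths), ragged_row_steps(lengths)
print(static, ragged, round(static / ragged, 2))
```

When all sequences have the same length the two counts are equal, which matches the earlier tip about grouping prompts of similar length.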
⚠️ Quality: 4-bit experts are perceptually close to BF16 on typical utterances but can show intermittent artifacts on hard/outlier cases (top-1 MoE). A blind A/B vs the BF16 base is recommended before production use; an 8-bit build is the safe-fidelity fallback.
Code & credits
- Optimized inference fork: Amal-David/mlx-audio@zonos2-optimized (quantization recipe, batching tooling, and benchmarks; this model is built with it)
- Model: Zyphra/ZONOS2
- MLX framework: mlx-audio (Prince Canuma, MIT)
- Reference for the ZONOS2 streaming/batching code: Lucas Newman's experimental mlx-audio branch
See the base model card for upstream license and details.