Paper: ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching (2506.13053)
Architecture design documents for Astra-TTS — a lightweight, high-quality text-to-speech system based on ZipVoice/Zipformer.
| File | Description |
|---|---|
| `model_a_slim.md` | Model A: ZipVoice naively shrunk to ~55M params. Serves as the baseline. |
| `model_b_enhanced.md` | Model B: ~55M params with architectural improvements (GQA, DepthSep Conv, Grouped Param Sharing, Dilated ConvNeXt, RoPE, etc.) plus inference optimizations (EPSS, Midpoint ODE, SmoothCache). |
| `benchmark_prd.md` | Benchmark PRD: full evaluation protocol comparing Original ZipVoice (123M), Model A (55M), and Model B (55M) on LibriTTS. |
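To give a feel for why GQA (grouped-query attention) helps at a fixed parameter budget, here is a minimal sketch that counts key/value projection weights for standard multi-head attention versus GQA. The dimensions (`d_model=512`, 8 query heads, 2 KV heads) and the helper name `kv_projection_params` are illustrative assumptions, not values taken from the design documents.

```python
def kv_projection_params(d_model, num_heads, num_kv_heads, bias=False):
    """Count weights in the K and V projections of one attention layer.

    With GQA, several query heads share one K/V head, so the K/V
    projections map d_model -> num_kv_heads * head_dim instead of
    d_model -> num_heads * head_dim.
    """
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    head_dim = d_model // num_heads
    kv_dim = num_kv_heads * head_dim
    per_projection = d_model * kv_dim + (kv_dim if bias else 0)
    return 2 * per_projection  # one projection for K, one for V


# Hypothetical layer sizes, chosen only for illustration.
d_model, num_heads = 512, 8

mha_params = kv_projection_params(d_model, num_heads, num_kv_heads=8)
gqa_params = kv_projection_params(d_model, num_heads, num_kv_heads=2)

print(f"MHA K/V params: {mha_params}")
print(f"GQA K/V params: {gqa_params}")
print(f"Reduction: {mha_params / gqa_params:.1f}x")
```

Under these assumed sizes the K/V projections shrink by the ratio of query heads to KV heads. The freed parameters can then be spent elsewhere in the model, and the K/V activations computed per layer are smaller as well.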
The goal is to determine whether targeted architectural changes at ~55M params can match or exceed the quality of a naive shrink, while enabling 6-8× faster inference through combined architectural and inference-time optimizations.
| | Original ZipVoice | Model A (Slim) | Model B (Enhanced) |
|---|---|---|---|
| Params | 123M | ~55M | ~55M |
| Approach | Full size | Naive shrink | Smart redesign |
| Key changes | — | Smaller dims/fewer layers | GQA, DepthSep FFN, Grouped Sharing, Dilated ConvNeXt, RoPE, ConvNeXt text refinement, no NLA |
| Inference | Euler 16 NFE | Euler 16 NFE | Midpoint 4-step + EPSS + SmoothCache |
| Expected speed | 1× | ~1.5× | ~6-8× |
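The "Inference" row compares a 16-step Euler sampler with a 4-step midpoint sampler. The sketch below shows how the two differ in number of function evaluations (NFE), assuming a standard explicit midpoint rule with two evaluations per step. The `toy_field` function is a hypothetical stand-in for the learned flow-matching velocity network, and nothing here reproduces EPSS or SmoothCache.

```python
import math


def euler_solve(v, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with explicit Euler steps."""
    x, nfe = x0, 0
    h = 1.0 / num_steps
    for i in range(num_steps):
        t = i * h
        x = x + h * v(x, t)  # one evaluation per step
        nfe += 1
    return x, nfe


def midpoint_solve(v, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with the explicit midpoint rule."""
    x, nfe = x0, 0
    h = 1.0 / num_steps
    for i in range(num_steps):
        t = i * h
        x_mid = x + 0.5 * h * v(x, t)       # half step to the midpoint
        x = x + h * v(x_mid, t + 0.5 * h)   # full step using the midpoint slope
        nfe += 2                            # two evaluations per step
    return x, nfe


def toy_field(x, t):
    # Hypothetical velocity field dx/dt = -x, with exact solution x(1) = x0 * exp(-1).
    return -x


exact = math.exp(-1.0)
x_euler, nfe_euler = euler_solve(toy_field, 1.0, num_steps=16)
x_mid, nfe_mid = midpoint_solve(toy_field, 1.0, num_steps=4)

print(f"Euler:    NFE={nfe_euler}, abs error={abs(x_euler - exact):.5f}")
print(f"Midpoint: NFE={nfe_mid}, abs error={abs(x_mid - exact):.5f}")
```

On this toy problem the midpoint sampler uses half the evaluations of the Euler sampler and still lands closer to the exact solution, because it is second-order accurate. Whether the same trade-off holds for the actual model is what the benchmark is meant to establish. Under this assumption the solver change alone would account for roughly 2× of the expected speedup, with the remainder attributed to the other optimizations listed in the table.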
Apache-2.0
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Praha-Labs/Astra-TTS-Arch"

# Load the tokenizer and model from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.