ZONOS2

krcv

gabrielclark3330 committed 17 days ago

Commit

2dc4c4a

0 Parent(s):

Duplicate from Zyphra/ZONOS2

Co-authored-by: Gabriel Clark <gabrielclark3330@users.noreply.huggingface.co>

Files changed (6)

.gitattributes +38 -0
README.md +121 -0
assets/ZONOS2BlogThumbnail.png +3 -0
assets/zonos2_arlooop_animated.gif +3 -0
model.pth +3 -0
params.json +59 -0

.gitattributes ADDED

	@@ -0,0 +1,38 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+zonos2_arlooop_animated.gif filter=lfs diff=lfs merge=lfs -text
+ZONOS2[[:space:]]Blog[[:space:]]Thumbnail.png filter=lfs diff=lfs merge=lfs -text
+assets/ZONOS2BlogThumbnail.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,121 @@

+---
+license: apache-2.0
+pipeline_tag: text-to-speech
+library_name: ZONOS2
+---
+# ZONOS2
+<p align="center">
+  <img src="./assets/ZONOS2BlogThumbnail.png" alt="ZONOS2 title card" width="750" />
+</p>
+<div align="center">
+  <a href="https://discord.gg/gTW9JwST8q" target="_blank">
+    <img src="https://img.shields.io/badge/Join%20Our%20Discord-7289DA?style=for-the-badge&logo=discord&logoColor=white" alt="Discord">
+  </a>
+</div>
+---
+ZONOS2 is our latest text-to-speech model, trained on more than 6 million hours of varied multilingual speech. It delivers expressiveness and quality on par with, or even surpassing, top TTS providers, and its mixture-of-experts (MoE) backbone keeps latency low. ZONOS2 excels at high-fidelity, naturalistic voice cloning.
+During inference, the MoE backbone generates DAC tokens from the UTF-8 bytes of NeMo text-normalized (TN) input and an ECAPA-TDNN speaker embedding. An inference overview is shown below.
+<p align="center">
+  <img src="./assets/zonos2_arlooop_animated.gif" alt="ZONOS2 inference overview" width="750" />
+</p>
+Language support is as follows.
+| Tier   | Languages                                                                                                                                                                                      |
+| ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Tier 1 | English, Mandarin Chinese, Japanese                                                                                                                                                            |
+| Tier 2 | Korean, Russian, Italian, Portuguese, French, Spanish, Vietnamese, German, Hebrew, Dutch                                                                                                       |
+| Tier 3 | Swedish, Hindi, Tamil, Telugu, Thai, Norwegian, Bengali, Tagalog, Arabic, Danish, Indonesian, Polish, Ukrainian, Romanian, Finnish, Hungarian, Lithuanian, Estonian, Slovak, Croatian, Latvian |
+For local inference we provide a high-performance TTS inference server built on [Mini-SGLang](https://github.com/sgl-project/mini-sglang).
+**For more details and speech samples, check out our [blog](https://www.zyphra.com/our-work/zonos2).**
+**We also have a hosted version available at [cloud.zyphra.com/audio-playground](https://cloud.zyphra.com/audio-playground).**
+---
+## Quick Start
+> **Platform Support**: Linux only (x86_64). Requires an NVIDIA GPU with a CUDA toolkit that matches your driver version (run `nvidia-smi` to check).
+### 1. Installation
+Requires [uv](https://docs.astral.sh/uv/getting-started/installation/).
+```bash
+git clone https://github.com/Zyphra/ZONOS2.git
+cd ZONOS2
+uv sync
+```
+### 2. Launch the TTS Server
+```bash
+uv run python -m minisgl --model-path Zyphra/ZONOS2 --tts-default-voices-dir ./default_voices/
+```
+`uv run` always uses the project environment, so no venv activation is needed.
+The server starts on `http://localhost:1919` by default. TTS mode is auto-detected for zonos2 models.
+`--tts-default-voices-dir <folder>` pre-populates the web UI with voice-clone
+speakers from disk; the folder is scanned recursively for speaker audio
+(`.wav`, `.mp3`, `.flac`, `.m4a`, `.ogg`, `.opus`, `.aac`, `.webm`) and saved
+embeddings (`.npy`, `.npz`). The newest voice is selected automatically on
+startup.
+### 3. Generate Speech
+**curl:**
+```bash
+curl -X POST http://localhost:1919/tts/generate \
+  -H "Content-Type: application/json" \
+  -d '{"text": "Hello world", "stream": true}' \
+  --output output.pcm
+# Convert to WAV
+ffmpeg -f f32le -ar 44100 -ac 1 -i output.pcm output.wav
+```
+**Web UI:** Open `http://localhost:1919/` in your browser.
+## Python API (offline inference)
+You can also run the engine directly in a Python script, without starting a
+server, via `TTSLLM`:
+```python
+from minisgl.message import TTSSamplingParams
+from minisgl.tts import TTSLLM
+tts = TTSLLM(model_path="Zyphra/ZONOS2")
+results = tts.generate(
+    ["Hello from the offline Python API.", "Batched prompts work too."],
+    TTSSamplingParams(seed=42),
+)
+for i, result in enumerate(results):
+    print(f"frames={len(result['audio_tokens'])}, eos_frame={result['eos_frame']}")
+    tts.save_audio(result["audio"], f"output_{i}.wav")
+```
+## Citation
+If you find this model useful in an academic context, please cite it as:
+```
+@misc{zyphra2025zonos,
+  title     = {Zonos V2 Technical Report},
+  author    = {Gabriel Clark and Sofian Mejjoute and Mohamed Osman and George Close and Beren Millidge},
+  year      = {2026},
+}
+```
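
The curl example in the README saves the streamed response and converts it with `ffmpeg -f f32le -ar 44100 -ac 1`, which implies raw 32-bit little-endian float samples, mono, at 44.1 kHz. A minimal standard-library sketch of the same round trip in Python might look like the following. The endpoint and JSON fields are copied from the curl call; the helper names `build_tts_request` and `f32le_to_wav_bytes` are hypothetical, and writing 16-bit PCM is a choice made here for portability, not something the server is known to prescribe.

```python
import io
import json
import struct
import urllib.request
import wave

TTS_URL = "http://localhost:1919/tts/generate"  # default address from the README


def build_tts_request(text: str, stream: bool = True) -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl example."""
    body = json.dumps({"text": text, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def f32le_to_wav_bytes(raw: bytes, sample_rate: int = 44100, channels: int = 1) -> bytes:
    """Convert raw float32 little-endian samples to a 16-bit PCM WAV file in memory.

    Assumes the layout implied by the README's ffmpeg flags (f32le, 44100 Hz, mono).
    """
    usable = len(raw) - (len(raw) % 4)  # ignore a trailing partial sample, if any
    samples = struct.unpack(f"<{usable // 4}f", raw[:usable])
    pcm16 = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # keep values inside [-1, 1]
        pcm16 += struct.pack("<h", int(round(clipped * 32767)))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm16))
    return buffer.getvalue()


# Usage against a running server (not executed here):
# with urllib.request.urlopen(build_tts_request("Hello world")) as resp:
#     with open("output.wav", "wb") as f:
#         f.write(f32le_to_wav_bytes(resp.read()))
```

This reads the whole response before converting, so it does not take advantage of streaming; for long outputs, processing the response in chunks would be the natural next step.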

assets/ZONOS2BlogThumbnail.png ADDED

Git LFS Details

SHA256: d9c3c09b213fe59c7bd0214a75219f37ff4cf51da45a245f4943e759fdeef47c
Pointer size: 131 Bytes
Size of remote file: 595 kB

assets/zonos2_arlooop_animated.gif ADDED

Git LFS Details

SHA256: 6fe9bb07bfe7651272be63beed980231b3e2c27d0dd2c7b5eb33554e5fa24900
Pointer size: 133 Bytes
Size of remote file: 14.1 MB

model.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5f6aa0fff9036ee44ccbc625d40aa6bdd8ea223480a5447e9f6aad70c38b6ecd
+size 15336390655
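
The three added lines are a Git LFS pointer rather than the weights themselves: a spec version, a SHA-256 object ID, and the size in bytes of the real file. As a rough illustration, a pointer like this can be parsed as shown below; `parse_lfs_pointer` is a hypothetical helper written for this note, not part of any Git LFS tooling.

```python
# Pointer text copied from the model.pth hunk above (without the diff "+" prefix).
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:5f6aa0fff9036ee44ccbc625d40aa6bdd8ea223480a5447e9f6aad70c38b6ecd
size 15336390655
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a small dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)
# Report the checkpoint size in decimal gigabytes and binary gibibytes.
print(f"{info['size_bytes'] / 1e9:.2f} GB, {info['size_bytes'] / 2**30:.2f} GiB")
```

For this pointer the checkpoint comes to roughly 15.34 GB (about 14.28 GiB), which is worth knowing before the first `--model-path Zyphra/ZONOS2` download.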

params.json ADDED

	@@ -0,0 +1,59 @@

+{
+  "model_type": "zonos2",
+  "dtype": "bfloat16",
+  "n_layers": 28,
+  "dim": 2048,
+  "head_dim": 128,
+  "n_heads": null,
+  "n_kv_heads": 4,
+  "ffn_dim_multiplier": 1.5,
+  "multiple_of": 256,
+  "norm_eps": 1e-05,
+  "rope_theta": 10000.0,
+  "max_seqlen": 6144,
+  "n_codebooks": 9,
+  "codebook_size": 1024,
+  "eoa_id": 1024,
+  "audio_pad_id": 1025,
+  "text_vocab": 519,
+  "loss_softcap": 15.0,
+  "speaker_enabled": true,
+  "speaker_embedding_dim": 2048,
+  "speaker_lda_dim": 1024,
+  "speaker_background_token_enabled": true,
+  "accurate_mode_token_enabled": true,
+  "speaking_rate_num_buckets": 8,
+  "speaking_rate_buckets": ["0-8", "8-11", "11-14", "14-17", "17-21", "21-28", "28-40", "40+"],
+  "quality_num_buckets": 60,
+  "quality_features": [
+    "lufs",
+    "estimated_snr",
+    "max_pause",
+    "estimated_bandlimit_hz",
+    "leading_silence_s",
+    "trailing_silence_s"
+  ],
+  "quality_buckets": {
+    "lufs": ["-1000--50", "-50--45.5", "-45.5--41", "-41--36.5", "-36.5--32", "-32--27.5", "-27.5--23", "-23--18.5", "-18.5--14", "-14--9.5", "-9.5--5", "-5+"],
+    "estimated_snr": ["-1000-0", "0-6", "6-12", "12-18", "18-24", "24-30", "30-36", "36-42", "42-48", "48-54", "54-60", "60+"],
+    "max_pause": ["0-0.5", "0.5-1", "1-1.5", "1.5-2", "2-2.5", "2.5-3", "3-3.5", "3.5-4", "4-4.5", "4.5-5", "5-5.5", "5.5-6"],
+    "estimated_bandlimit_hz": ["495.3-3433", "3433-6371", "6371-9310", "9310-12248", "12248-15186", "15186-18124", "18124-21062", "21062-24000"],
+    "leading_silence_s": ["0-0.05", "0.05-0.1", "0.1-0.25", "0.25-0.5", "0.5-1", "1-2", "2-4", "4+"],
+    "trailing_silence_s": ["0-0.05", "0.05-0.1", "0.1-0.25", "0.25-0.5", "0.5-1", "1-2", "2-4", "4+"]
+  },
+  "quality_dropout": {
+    "lufs": 0.25,
+    "estimated_snr": 0.25,
+    "max_pause": 0.25,
+    "estimated_bandlimit_hz": 0.25,
+    "leading_silence_s": 0.25,
+    "trailing_silence_s": 0.25
+  },
+  "moe_impl": "sonic",
+  "moe_n_experts": 16,
+  "moe_router_topk": 1,
+  "special_topk_layers": {"26": 2},
+  "moe_router_dim": 128,
+  "moe_start_from_layer": 3,
+  "moe_end_from_layer": 1
+}
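
The bucket labels in `params.json` read as numeric ranges: `lo-hi` for a bounded bucket and `lo+` for an open-ended top bucket, with negative bounds written inline (for example `-50--45.5`). The sketch below shows one way such labels could be parsed and a value mapped to a bucket index. The half-open `[lo, hi)` convention, the clamping of out-of-range values, and the helper names are assumptions made for illustration; the model's actual conditioning code is not part of this commit. One observation that does follow from the file itself: the six per-feature lists hold 12 + 12 + 12 + 8 + 8 + 8 = 60 labels, which equals `quality_num_buckets`.

```python
import math
import re

_NUM = r"-?\d+(?:\.\d+)?"
_RANGE = re.compile(rf"^({_NUM})-({_NUM})$")  # e.g. "8-11" or "-50--45.5"
_OPEN = re.compile(rf"^({_NUM})\+$")          # e.g. "40+" or "-5+"

# Labels copied verbatim from params.json.
SPEAKING_RATE = ["0-8", "8-11", "11-14", "14-17", "17-21", "21-28", "28-40", "40+"]
QUALITY_BUCKETS = {
    "lufs": ["-1000--50", "-50--45.5", "-45.5--41", "-41--36.5", "-36.5--32", "-32--27.5", "-27.5--23", "-23--18.5", "-18.5--14", "-14--9.5", "-9.5--5", "-5+"],
    "estimated_snr": ["-1000-0", "0-6", "6-12", "12-18", "18-24", "24-30", "30-36", "36-42", "42-48", "48-54", "54-60", "60+"],
    "max_pause": ["0-0.5", "0.5-1", "1-1.5", "1.5-2", "2-2.5", "2.5-3", "3-3.5", "3.5-4", "4-4.5", "4.5-5", "5-5.5", "5.5-6"],
    "estimated_bandlimit_hz": ["495.3-3433", "3433-6371", "6371-9310", "9310-12248", "12248-15186", "15186-18124", "18124-21062", "21062-24000"],
    "leading_silence_s": ["0-0.05", "0.05-0.1", "0.1-0.25", "0.25-0.5", "0.5-1", "1-2", "2-4", "4+"],
    "trailing_silence_s": ["0-0.05", "0.05-0.1", "0.1-0.25", "0.25-0.5", "0.5-1", "1-2", "2-4", "4+"],
}


def parse_bucket(label: str) -> tuple[float, float]:
    """Turn a label such as '-50--45.5' or '40+' into (lo, hi) bounds."""
    match = _RANGE.match(label)
    if match:
        return float(match.group(1)), float(match.group(2))
    match = _OPEN.match(label)
    if match:
        return float(match.group(1)), math.inf
    raise ValueError(f"unrecognised bucket label: {label!r}")


def bucket_index(value: float, labels: list[str]) -> int:
    """Return the index of the bucket containing value, assuming [lo, hi) ranges."""
    for i, label in enumerate(labels):
        lo, hi = parse_bucket(label)
        if lo <= value < hi:
            return i
    # Assumption: values outside every range clamp to the nearest end bucket.
    return 0 if value < parse_bucket(labels[0])[0] else len(labels) - 1


total_quality_labels = sum(len(labels) for labels in QUALITY_BUCKETS.values())
```

Under these assumptions, `bucket_index(12.0, SPEAKING_RATE)` falls in the `11-14` bucket and a loudness of -23.0 LUFS falls in `-23--18.5`; a different boundary convention would shift values that sit exactly on an edge.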