+---
+license: apache-2.0
+language:
+- vi
+library_name: pytorch
+pipeline_tag: text-to-speech
+tags:
+- text-to-speech
+- zero-shot-tts
+- voice-cloning
+- vietnamese
+- zipvoice
+base_model: k2-fsa/ZipVoice
+---
+# ViZipVoice
+ViZipVoice is a Vietnamese zero-shot TTS and voice-cloning model fine-tuned from [ZipVoice](https://github.com/k2-fsa/ZipVoice).
+- GitHub: https://github.com/iamdinhthuan/ViZipvoice
+- Model repo: https://huggingface.co/contextboxai/ViZipvoice
+- Space: https://huggingface.co/spaces/dinhthuan/ViZipvoice
+- Latest checkpoint: `checkpoint-700000.pt`, FP16 inference state dict
+- Training data: about `7000` total hours, including roughly `6500` hours of Vietnamese and `500` hours of English
+- Tokenizer: `SimpleTokenizer`, character-level, `244` tokens
+- Sample rate: `24 kHz`
+- Default vocoder: `charactr/vocos-mel-24khz`
+The wrapper automatically loads the `checkpoint-<step>.pt` file with the highest step number and uses `soe-vinorm` for Vietnamese text normalization.
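The checkpoint selection described above can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: the function name `latest_checkpoint` and the regular expression are assumptions made for this example. The point it shows is that steps should be compared as integers, since a plain string sort would rank `checkpoint-90000.pt` above `checkpoint-700000.pt`.

```python
import re
from pathlib import Path
from typing import Optional

# Matches names such as "checkpoint-700000.pt" and captures the step number.
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)\.pt$")


def latest_checkpoint(model_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step in model_dir, or None.

    Hypothetical helper for illustration; the real wrapper may differ.
    """
    best_step, best_path = -1, None
    for path in Path(model_dir).iterdir():
        match = CHECKPOINT_RE.match(path.name)
        if match is None:
            continue  # skip config.json, tokens.txt, and other files
        step = int(match.group(1))  # numeric comparison, not lexicographic
        if step > best_step:
            best_step, best_path = step, path
    return best_path
```

Calling `latest_checkpoint("models/ViZipvoice")` on a local copy of this repo would return the path to the highest-step checkpoint present there.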
+## Audio Demo
+The samples below were generated with `checkpoint-700000.pt`, the current wrapper flow, and the demo text in `demo/demo_text.txt`.
+**Đinh-Quyết**
+<audio controls src="https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_01_%C4%90inh-Quy%E1%BA%BFt.wav"></audio>
+[Open audio](https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_01_%C4%90inh-Quy%E1%BA%BFt.wav)
+**Nhã-Uyên**
+<audio controls src="https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_02_Nh%C3%A3-Uy%C3%AAn.wav"></audio>
+[Open audio](https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_02_Nh%C3%A3-Uy%C3%AAn.wav)
+**MC**
+<audio controls src="https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_03_MC.wav"></audio>
+[Open audio](https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_03_MC.wav)
+## Install
+```bash
+git clone https://github.com/iamdinhthuan/ViZipvoice.git
+cd ViZipvoice
+pip install -r requirements.txt
+export PYTHONPATH="$PWD:$PYTHONPATH"
+```
+## CLI
+```bash
+python3 -m zipvoice.bin.infer_vizipvoice \
+  --prompt-wav prompt.wav \
+  --prompt-text "Xin chào, đây là giọng mẫu của tôi." \
+  --text "ViZipVoice có thể tổng hợp giọng nói tiếng Việt từ một đoạn mẫu ngắn." \
+  --res-wav-path output.wav
+```
+By default, the CLI downloads this model repo. To run from a local copy instead, download the files first and pass `--model-dir models/ViZipvoice`.
+## Python
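+```python
+from zipvoice.vizipvoice import ViZipVoiceTTS
+
+# Create the wrapper with its default settings.
+tts = ViZipVoiceTTS()
+
+# Clone the voice in prompt.wav and write the synthesized speech to output.wav.
+metrics = tts.synthesize(
+    prompt_wav="prompt.wav",  # short reference recording of the target voice
+    prompt_text="Xin chào, đây là giọng mẫu của tôi.",  # transcript of prompt.wav
+    text="Đây là câu tiếng Việt được sinh bởi ViZipVoice.",  # text to speak
+    output_path="output.wav",
+)
+print(metrics)
+```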
+## Reference Audio
+`audio/` contains 30 reference prompts. Each audio file has a sidecar `.txt` transcript with the same basename:
+```text
+audio/Đinh-Quyết.mp3
+audio/Đinh-Quyết.txt
+```
+File names keep only the audio or speaker name; the original `lar_*` prefix and `Pro` suffix are removed. The Gradio app reads this sidecar format automatically. To try it locally, download the model repo and launch the app:
+```bash
+huggingface-cli download contextboxai/ViZipvoice \
+  --local-dir models/ViZipvoice \
+  --local-dir-use-symlinks False
+python3 egs/zipvoice/gradio_app.py --exp-dir models/ViZipvoice
+```
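The sidecar convention used in `audio/` (an audio file plus a `.txt` transcript with the same basename) can be read with a few lines of Python. The sketch below is illustrative only: `load_reference_prompts` and `AUDIO_SUFFIXES` are hypothetical names, and the set of accepted audio extensions is an assumption, since only `.mp3` is shown above.

```python
from pathlib import Path

# Assumption: only .mp3 appears in the listing above; other suffixes are guesses.
AUDIO_SUFFIXES = {".mp3", ".wav", ".flac"}


def load_reference_prompts(audio_dir):
    """Map each prompt name to (audio_path, transcript_text).

    Audio files without a matching .txt sidecar are skipped.
    """
    prompts = {}
    for audio_path in sorted(Path(audio_dir).iterdir()):
        if audio_path.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        transcript_path = audio_path.with_suffix(".txt")  # same basename
        if not transcript_path.is_file():
            continue
        text = transcript_path.read_text(encoding="utf-8").strip()
        prompts[audio_path.stem] = (audio_path, text)
    return prompts
```

Each entry supplies a prompt recording and its transcript, which are the two prompt inputs that the CLI and the Python wrapper expect.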
+## Inference Flow
+The CLI, Python wrapper, and Gradio app use the same default flow:
+- normalize Vietnamese text with `soe-vinorm`, then clean spaces around punctuation;
+- split long text into sentences;
+- for a sentence of `1` word, use at least `24` steps and `speed=0.6`;
+- for a sentence of `2` to `4` words, use `speed=0.8`;
+- generate each segment separately;
+- merge segments with silence, crossfade, fade in, and fade out.
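The per-sentence rules in the list above can be sketched as a small planning step. This is not the wrapper's implementation: the function names, the punctuation-based sentence splitter, and the `base_steps` / `base_speed` parameters are assumptions for illustration, and the defaults the wrapper actually uses for longer sentences are not stated here, so they are passed in explicitly.

```python
import re


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption, not the wrapper's rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def segment_settings(sentence, base_steps, base_speed):
    """Apply the short-sentence overrides from the list above."""
    n_words = len(sentence.split())
    steps, speed = base_steps, base_speed
    if n_words == 1:
        steps = max(base_steps, 24)  # "at least 24 steps"
        speed = 0.6
    elif 2 <= n_words <= 4:
        speed = 0.8
    return {"text": sentence, "steps": steps, "speed": speed}


def plan_segments(text, base_steps, base_speed):
    """Return one settings dict per sentence, in order."""
    return [
        segment_settings(sentence, base_steps, base_speed)
        for sentence in split_sentences(text)
    ]
```

With placeholder values such as `base_steps=16` and `base_speed=1.0`, a one-word sentence is planned at `24` steps and `speed=0.6`, a four-word sentence keeps `16` steps at `speed=0.8`, and longer sentences keep the base values. Each planned segment would then be generated separately and merged.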
+Useful knobs:
+```bash
+--no-vietnamese-normalize
+--no-split-sentences
+--crossfade-ms 80
+--silence-ms 180
+--fade-in-ms 20
+--fade-out-ms 80
+```
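To make the millisecond knobs concrete, the sketch below converts them to sample counts at the model's `24 kHz` sample rate and shows a linear edge fade and a linear crossfade on plain Python lists. How the wrapper orders and combines silence, crossfade, and fades is not described here, so these helpers (`ms_to_samples`, `apply_fades`, `crossfade`, all hypothetical names) only illustrate the individual operations.

```python
def ms_to_samples(ms, sample_rate=24000):
    """Convert a duration in milliseconds to a sample count."""
    return int(sample_rate * ms / 1000)


def apply_fades(samples, fade_in, fade_out):
    """Return a copy with a linear fade-in and fade-out (lengths in samples)."""
    out = list(samples)
    n = len(out)
    fade_in = min(fade_in, n)
    fade_out = min(fade_out, n)
    for i in range(fade_in):
        out[i] *= i / fade_in            # ramps up from 0 at the first sample
    for i in range(fade_out):
        out[n - 1 - i] *= i / fade_out   # ramps down to 0 at the last sample
    return out


def crossfade(a, b, overlap):
    """Join a and b, linearly blending the last `overlap` samples of a
    with the first `overlap` samples of b."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)
    head = list(a[: len(a) - overlap])
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)      # weight of b rises across the overlap
        mixed.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return head + mixed + list(b[overlap:])
```

At `24 kHz`, `--crossfade-ms 80` corresponds to `1920` samples, `--silence-ms 180` to `4320` samples (for example `[0.0] * ms_to_samples(180)`), and `--fade-in-ms 20` to `480` samples.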
+## Files
+- `checkpoint-700000.pt`: latest FP16 checkpoint
+- `config.json`, `model.json`: model config
+- `tokens.txt`: token list for the character-level `SimpleTokenizer` (`244` tokens)
+- `audio/`: 30 reference audios plus `.txt` transcripts
+- `demo/`: regenerated audio demos and `metadata.json`
+- `vizipvoice.py`: wrapper mirrored from GitHub
+## Responsible Use
+This model can clone voices from short audio prompts. Use only voices you own or have explicit permission to use. Do not use it for impersonation, fraud, harassment, misinformation, or other harmful content.
+## License
+Apache License 2.0. Please also credit the original ZipVoice project.