dinhthuan committed
Commit f29fd21 · verified · 1 Parent(s): befd0e1

Upload README.md

Files changed (1)
  1. README.md +145 -0
README.md ADDED
@@ -0,0 +1,145 @@
+ ---
+ license: apache-2.0
+ language:
+ - vi
+ library_name: pytorch
+ pipeline_tag: text-to-speech
+ tags:
+ - text-to-speech
+ - zero-shot-tts
+ - voice-cloning
+ - vietnamese
+ - zipvoice
+ base_model: k2-fsa/ZipVoice
+ ---
+
+ # ViZipVoice
+
+ Vietnamese zero-shot TTS and voice cloning model, fine-tuned from [ZipVoice](https://github.com/k2-fsa/ZipVoice).
+
+ - GitHub: https://github.com/iamdinhthuan/ViZipvoice
+ - Model repo: https://huggingface.co/contextboxai/ViZipvoice
+ - Space: https://huggingface.co/spaces/dinhthuan/ViZipvoice
+ - Latest checkpoint: `checkpoint-700000.pt`, FP16 inference state dict
+ - Training data: about `7000` hours in total, roughly `6500` hours of Vietnamese and `500` hours of English
+ - Tokenizer: `SimpleTokenizer`, character-level, `244` tokens
+ - Sample rate: `24 kHz`
+ - Default vocoder: `charactr/vocos-mel-24khz`
+
+ The wrapper automatically loads the `checkpoint-<step>.pt` file with the highest step number and uses `soe-vinorm` for Vietnamese text normalization.
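To illustrate the checkpoint selection described above, here is a minimal sketch. It is not the wrapper's actual code: the helper name `find_latest_checkpoint` and the regular expression are assumptions, and only the `checkpoint-<step>.pt` naming comes from this README.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical helper; the real wrapper may select checkpoints differently.
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)\.pt$")


def find_latest_checkpoint(model_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step>.pt file with the largest step, or None."""
    best_step = -1
    best_path = None
    for path in Path(model_dir).iterdir():
        match = CHECKPOINT_RE.match(path.name)
        if match is None:
            continue  # ignore config files, tokens.txt, and so on
        step = int(match.group(1))  # compare numerically, not as strings
        if step > best_step:
            best_step = step
            best_path = path
    return best_path
```

Comparing the step as an integer matters: sorted as strings, `checkpoint-90000.pt` would rank above `checkpoint-700000.pt`.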
+
+ ## Audio Demo
+
+ Generated with `checkpoint-700000.pt`, the current wrapper flow, and the demo text in `demo/demo_text.txt`.
+
+ **Đinh-Quyết**
+
+ <audio controls src="https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_01_%C4%90inh-Quy%E1%BA%BFt.wav"></audio>
+
+ [Open audio](https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_01_%C4%90inh-Quy%E1%BA%BFt.wav)
+
+ **Nhã-Uyên**
+
+ <audio controls src="https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_02_Nh%C3%A3-Uy%C3%AAn.wav"></audio>
+
+ [Open audio](https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_02_Nh%C3%A3-Uy%C3%AAn.wav)
+
+ **MC**
+
+ <audio controls src="https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_03_MC.wav"></audio>
+
+ [Open audio](https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_03_MC.wav)
+
+ ## Install
+
+ ```bash
+ git clone https://github.com/iamdinhthuan/ViZipvoice.git
+ cd ViZipvoice
+ pip install -r requirements.txt
+ export PYTHONPATH="$PWD:$PYTHONPATH"
+ ```
+
+ ## CLI
+
+ ```bash
+ python3 -m zipvoice.bin.infer_vizipvoice \
+   --prompt-wav prompt.wav \
+   --prompt-text "Xin chào, đây là giọng mẫu của tôi." \
+   --text "ViZipVoice có thể tổng hợp giọng nói tiếng Việt từ một đoạn mẫu ngắn." \
+   --res-wav-path output.wav
+ ```
+
+ The CLI downloads this model repo by default. To use a local copy instead, download the files and pass `--model-dir models/ViZipvoice`.
+
+ ## Python
+
+ ```python
+ from zipvoice.vizipvoice import ViZipVoiceTTS
+
+ tts = ViZipVoiceTTS()
+ metrics = tts.synthesize(
+     prompt_wav="prompt.wav",  # reference audio of the voice to clone
+     prompt_text="Xin chào, đây là giọng mẫu của tôi.",  # "Hello, this is my sample voice."
+     text="Đây là câu tiếng Việt được sinh bởi ViZipVoice.",  # "This is a Vietnamese sentence generated by ViZipVoice."
+     output_path="output.wav",
+ )
+ print(metrics)
+ ```
+
+ ## Reference Audio
+
+ `audio/` contains 30 reference prompts. Each audio file has a sidecar `.txt` transcript with the same basename:
+
+ ```text
+ audio/Đinh-Quyết.mp3
+ audio/Đinh-Quyết.txt
+ ```
+
+ Each file name keeps only the voice or speaker name; the original `lar_*` prefix and `Pro` suffix are removed. The Gradio app reads this sidecar format automatically. To run the app against a local copy of the model:
+
+ ```bash
+ huggingface-cli download contextboxai/ViZipvoice \
+   --local-dir models/ViZipvoice \
+   --local-dir-use-symlinks False
+
+ python3 egs/zipvoice/gradio_app.py --exp-dir models/ViZipvoice
+ ```
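The sidecar layout described in this section can also be read outside the Gradio app. The following is a rough sketch under stated assumptions: the function name `load_reference_prompts` and the set of audio extensions are hypothetical (the README example only shows `.mp3`), and the app itself may load prompts differently.

```python
from pathlib import Path
from typing import Dict, Tuple

# Assumed audio extensions; the README example only shows .mp3.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac"}


def load_reference_prompts(audio_dir: str) -> Dict[str, Tuple[Path, str]]:
    """Map each prompt name to (audio path, transcript text)."""
    prompts: Dict[str, Tuple[Path, str]] = {}
    for audio_path in sorted(Path(audio_dir).iterdir()):
        if audio_path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        # The transcript shares the basename and differs only in extension.
        transcript_path = audio_path.with_suffix(".txt")
        if not transcript_path.exists():
            continue  # skip audio files that have no sidecar transcript
        text = transcript_path.read_text(encoding="utf-8").strip()
        prompts[audio_path.stem] = (audio_path, text)
    return prompts
```

Each entry then supplies a matching `prompt_wav` and `prompt_text` pair for the CLI or the Python wrapper.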
+
+ ## Inference Flow
+
+ The CLI, Python wrapper, and Gradio app use the same default flow:
+
+ - normalize Vietnamese text with `soe-vinorm`, then clean up spacing around punctuation;
+ - split long text into sentences;
+ - for a `1`-word sentence: use at least `24` steps and `speed=0.6`;
+ - for a sentence of `2-4` words: use `speed=0.8`;
+ - generate each segment separately;
+ - merge segments with silence, crossfade, fade in, and fade out.
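The splitting and per-sentence rules in the list above can be sketched as follows. This is an illustration, not the project's implementation: the function names, the naive punctuation-based splitter, and the defaults for longer sentences (`DEFAULT_SPEED`, `DEFAULT_STEPS`) are assumptions. Only the `0.6`, `0.8`, and `24` values come from this README.

```python
import re
from typing import List, Tuple

# Assumed defaults for sentences of five or more words; not stated in the README.
DEFAULT_SPEED = 1.0
DEFAULT_STEPS = 16


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., !, ? or … when followed by whitespace."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [part for part in parts if part]


def settings_for(sentence: str, steps: int = DEFAULT_STEPS) -> Tuple[float, int]:
    """Return (speed, steps) for one sentence based on its word count."""
    n_words = len(sentence.split())
    if n_words == 1:
        # One-word sentences: slow down and use at least 24 steps.
        return 0.6, max(steps, 24)
    if 2 <= n_words <= 4:
        # Short sentences: slow down slightly, keep the step count.
        return 0.8, steps
    return DEFAULT_SPEED, steps


if __name__ == "__main__":
    for sentence in split_sentences("Vâng. Xin chào bạn. Đây là một câu dài hơn."):
        print(sentence, settings_for(sentence))
```

Each sentence would then be synthesized separately with its own settings before the segments are merged.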
+
+ Useful knobs:
+
+ ```bash
+ --no-vietnamese-normalize
+ --no-split-sentences
+ --crossfade-ms 80
+ --silence-ms 180
+ --fade-in-ms 20
+ --fade-out-ms 80
+ ```
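To make the timing knobs concrete, the sketch below shows what linear fades, a linear crossfade, and a silence gap could look like on raw sample lists at the model's 24 kHz sample rate. It is an assumption-laden illustration: the function names are hypothetical, the values shown above are used as parameter defaults only for convenience, and the README does not say how the project combines silence with crossfade or which fade curves it uses.

```python
from typing import List

SAMPLE_RATE = 24_000  # 24 kHz, as listed in this README


def ms_to_samples(ms: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a duration in milliseconds to a sample count."""
    return int(sample_rate * ms / 1000)


def apply_fades(samples: List[float], fade_in_ms: float = 20,
                fade_out_ms: float = 80,
                sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Ramp the start up from 0 and the end down to 0 (linear ramps assumed)."""
    out = list(samples)
    n_in = min(ms_to_samples(fade_in_ms, sample_rate), len(out))
    n_out = min(ms_to_samples(fade_out_ms, sample_rate), len(out))
    for i in range(n_in):
        out[i] *= i / n_in
    for i in range(n_out):
        out[len(out) - 1 - i] *= i / n_out
    return out


def crossfade(a: List[float], b: List[float], crossfade_ms: float = 80,
              sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Overlap the tail of `a` with the head of `b` using linear gains."""
    n = min(ms_to_samples(crossfade_ms, sample_rate), len(a), len(b))
    if n == 0:
        return list(a) + list(b)
    mixed = [a[len(a) - n + i] * (1 - i / n) + b[i] * (i / n) for i in range(n)]
    return list(a[:len(a) - n]) + mixed + list(b[n:])


def silence(silence_ms: float = 180, sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Return a block of zero samples of the given duration."""
    return [0.0] * ms_to_samples(silence_ms, sample_rate)
```

At 24 kHz, `--silence-ms 180` corresponds to 4320 samples and `--crossfade-ms 80` to 1920 samples.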
+
+ ## Files
+
+ - `checkpoint-700000.pt`: latest FP16 checkpoint
+ - `config.json`, `model.json`: model config
+ - `tokens.txt`: Vietnamese character tokenizer
+ - `audio/`: 30 reference audio files plus `.txt` transcripts
+ - `demo/`: regenerated audio demos and `metadata.json`
+ - `vizipvoice.py`: wrapper mirrored from GitHub
+
+ ## Responsible Use
+
+ This model can clone voices from short audio prompts. Use only voices you own or have explicit permission to use. Do not use it for impersonation, fraud, harassment, misinformation, or other harmful content.
+
+ ## License
+
+ Apache License 2.0. Please also credit the original ZipVoice project.