	---
	license: apache-2.0
	language:
	- vi
	library_name: pytorch
	pipeline_tag: text-to-speech
	tags:
	- text-to-speech
	- zero-shot-tts
	- voice-cloning
	- vietnamese
	- zipvoice
	base_model: k2-fsa/ZipVoice
	---

	# ViZipVoice

	ViZipVoice is a Vietnamese zero-shot TTS and voice-cloning model fine-tuned from [ZipVoice](https://github.com/k2-fsa/ZipVoice).

	- GitHub: https://github.com/iamdinhthuan/ViZipvoice
	- Model repo: https://huggingface.co/contextboxai/ViZipvoice
	- Space: https://huggingface.co/spaces/dinhthuan/ViZipvoice
	- Latest checkpoint: `checkpoint-1860000.pt`, FP16 inference state dict
	- Training data: about `7000` hours in total, roughly `6500` hours of Vietnamese and `500` hours of English
	- Tokenizer: `SimpleTokenizer`, character-level, `244` tokens
	- Sample rate: `24 kHz`
	- Default vocoder: `charactr/vocos-mel-24khz`

	The wrapper automatically loads the `checkpoint-<step>.pt` file with the highest step number and normalizes Vietnamese text with `soe-vinorm`.
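
As a rough illustration of that selection rule, the sketch below picks the checkpoint with the highest step from a list of file names. It assumes "largest" refers to the step number in the file name. `latest_checkpoint` is a hypothetical helper written for this README, not the wrapper's actual code.

```python
import re
from pathlib import Path
from typing import Iterable, Optional

# Matches names such as "checkpoint-1860000.pt" and captures the step.
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)\.pt$")


def latest_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the checkpoint name with the highest step, or None if none match.

    Hypothetical helper: assumes the step in the file name decides which
    checkpoint counts as the "largest".
    """
    best_step, best_name = -1, None
    for name in names:
        match = CHECKPOINT_RE.match(Path(name).name)
        if match is None:
            continue  # skip config.json, tokens.txt, and similar files
        step = int(match.group(1))
        if step > best_step:
            best_step, best_name = step, name
    return best_name


files = ["checkpoint-900000.pt", "checkpoint-1860000.pt", "config.json"]
print(latest_checkpoint(files))  # prints "checkpoint-1860000.pt"
```

The steps are compared as integers because a plain string sort would rank `checkpoint-900000.pt` above `checkpoint-1860000.pt`.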

	## Audio Demo

	Generated with `checkpoint-1860000.pt`, the current wrapper flow, and the demo text in `demo/demo_text.txt`.

	Đinh-Quyết

	<audio controls src="https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_01_%C4%90inh-Quy%E1%BA%BFt.wav"></audio>

	[Open audio](https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_01_%C4%90inh-Quy%E1%BA%BFt.wav)

	Nhã-Uyên

	<audio controls src="https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_02_Nh%C3%A3-Uy%C3%AAn.wav"></audio>

	[Open audio](https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_02_Nh%C3%A3-Uy%C3%AAn.wav)

	MC

	<audio controls src="https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_03_MC.wav"></audio>

	[Open audio](https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_03_MC.wav)

	## Install

	```bash
	git clone https://github.com/iamdinhthuan/ViZipvoice.git
	cd ViZipvoice
	pip install -r requirements.txt
	export PYTHONPATH="$PWD:$PYTHONPATH"
	```

	## CLI

	```bash
	python3 -m zipvoice.bin.infer_vizipvoice \
	  --prompt-wav prompt.wav \
	  --prompt-text "Xin chào, đây là giọng mẫu của tôi." \
	  --text "ViZipVoice có thể tổng hợp giọng nói tiếng Việt từ một đoạn mẫu ngắn." \
	  --res-wav-path output.wav
	```

	By default, the CLI downloads this model repo. To use files you have already downloaded, pass `--model-dir models/ViZipvoice`.

	## Python

	```python
	from zipvoice.vizipvoice import ViZipVoiceTTS

	tts = ViZipVoiceTTS()
	metrics = tts.synthesize(
	    prompt_wav="prompt.wav",
	    prompt_text="Xin chào, đây là giọng mẫu của tôi.",
	    text="Đây là câu tiếng Việt được sinh bởi ViZipVoice.",
	    output_path="output.wav",
	)
	print(metrics)
	```

	## Reference Audio

	`audio/` contains 30 reference prompts. Each audio file has a sidecar `.txt` transcript with the same basename:

	```text
	audio/Đinh-Quyết.mp3
	audio/Đinh-Quyết.txt
	```

	File names keep only the voice or speaker name; the original `lar_*` prefix and `Pro` suffix are removed. The Gradio app reads this sidecar format automatically. To try the prompts there, download the model repo and launch the app:

	```bash
	huggingface-cli download contextboxai/ViZipvoice \
	  --local-dir models/ViZipvoice \
	  --local-dir-use-symlinks False

	python3 egs/zipvoice/gradio_app.py --exp-dir models/ViZipvoice
	```

	## Inference Flow

	The CLI, Python wrapper, and Gradio app use the same default flow:

	- normalize Vietnamese text with `soe-vinorm`, then clean spaces around punctuation;
	- split long text into sentences;
	- for a `1`-word sentence: use at least `24` steps and `speed=0.6`;
	- for a `2-4` word sentence: use `speed=0.8`;
	- generate each segment separately;
	- merge segments with silence, crossfade, fade in, and fade out.
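
The sentence-splitting and short-sentence rules above can be sketched as a small planning step. Every name here (`SegmentPlan`, `split_sentences`, `plan_segments`, `base_steps`, `base_speed`) is hypothetical. The splitting regex is an assumption, since this README does not state how the wrapper splits sentences. The README also does not state the base step count or speed for longer sentences, so the caller supplies them.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentPlan:
    text: str
    num_steps: int
    speed: float


def split_sentences(text: str) -> List[str]:
    # Assumption: a sentence ends at . ! ? or … followed by whitespace.
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [part for part in parts if part]


def plan_segments(text: str, base_steps: int, base_speed: float) -> List[SegmentPlan]:
    """Choose steps and speed per sentence using the rules listed above."""
    plans = []
    for sentence in split_sentences(text):
        n_words = len(sentence.split())
        steps, speed = base_steps, base_speed
        if n_words == 1:
            # One-word sentence: at least 24 steps and speed 0.6.
            steps, speed = max(base_steps, 24), 0.6
        elif 2 <= n_words <= 4:
            # Two to four words: speed 0.8, steps unchanged.
            speed = 0.8
        plans.append(SegmentPlan(sentence, steps, speed))
    return plans


# base_steps and base_speed are illustrative values, not documented defaults.
for plan in plan_segments("Vâng. Tôi đồng ý. Hôm nay trời rất đẹp và mát mẻ.", 16, 1.0):
    print(plan)
```

Each resulting plan would then be synthesized separately before the segments are merged.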

	Useful knobs:

	```bash
	--no-vietnamese-normalize
	--no-split-sentences
	--crossfade-ms 80
	--silence-ms 180
	--fade-in-ms 20
	--fade-out-ms 80
	```
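
To make the four timing flags concrete, here is one possible way to merge segments, written in plain Python so it runs without extra dependencies. It is only a sketch under stated assumptions. `merge_segments` and its helpers are hypothetical, and the way silence and crossfade are combined is a guess rather than the wrapper's confirmed behavior: a silence gap is appended after each segment, and the next segment is overlapped onto the last `crossfade_ms` of the running output. The default values copy the flag list above. Whether those are the CLI's actual defaults is not stated here.

```python
from typing import List, Sequence


def ms_to_samples(ms: float, sample_rate: int) -> int:
    return int(round(ms * sample_rate / 1000))


def crossfade_append(out: List[float], nxt: Sequence[float], overlap: int) -> None:
    """Append nxt to out in place with a linear crossfade over `overlap` samples."""
    overlap = min(overlap, len(out), len(nxt))
    start = len(out) - overlap
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the incoming segment
        out[start + i] = out[start + i] * (1.0 - w) + nxt[i] * w
    out.extend(nxt[overlap:])


def merge_segments(
    segments: Sequence[Sequence[float]],
    sample_rate: int = 24000,
    silence_ms: float = 180,
    crossfade_ms: float = 80,
    fade_in_ms: float = 20,
    fade_out_ms: float = 80,
) -> List[float]:
    """Join mono float segments (assumed layout: silence, then crossfade)."""
    overlap = ms_to_samples(crossfade_ms, sample_rate)
    gap = [0.0] * ms_to_samples(silence_ms, sample_rate)
    out: List[float] = []
    for index, segment in enumerate(segments):
        if index > 0:
            out.extend(gap)  # pause between sentences
        crossfade_append(out, list(segment), overlap)
    # Linear fade-in at the very start and fade-out at the very end.
    n_in = min(ms_to_samples(fade_in_ms, sample_rate), len(out))
    for i in range(n_in):
        out[i] *= i / n_in
    n_out = min(ms_to_samples(fade_out_ms, sample_rate), len(out))
    for i in range(n_out):
        out[len(out) - 1 - i] *= i / n_out
    return out
```

Real inference code would more likely operate on NumPy or PyTorch arrays. The list-based version is only meant to show the order of operations.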

	## Files

	- `checkpoint-1860000.pt`: latest FP16 checkpoint
	- `config.json`, `model.json`: model config
	- `tokens.txt`: Vietnamese character tokenizer
	- `audio/`: 30 reference audios plus `.txt` transcripts
	- `demo/`: regenerated audio demos and `metadata.json`
	- `vizipvoice.py`: wrapper mirrored from GitHub

	## Responsible Use

	This model can clone voices from short audio prompts. Use only voices you own or have explicit permission to use. Do not use it for impersonation, fraud, harassment, misinformation, or other harmful content.

	## License

	Apache License 2.0. Please also credit the original ZipVoice project.