Paper: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (arXiv:2106.06103)
Multi-speaker VITS text-to-speech model for Mongolian, trained with Coqui TTS.
| File | Description |
|---|---|
| `best_model.pth` | Best VITS checkpoint (eval loss) |
| `config.json` | Coqui TTS training/inference config |
| `speakers.pth` | Speaker manager / speaker id map |
| `tensorboard/` | TensorBoard event files (training curves) |
```python
from huggingface_hub import hf_hub_download
from TTS.utils.synthesizer import Synthesizer

repo = "Bokhbat/mongolian-vits-tts"

# Download the checkpoint, config, and speaker map from the Hub.
model_path = hf_hub_download(repo, "best_model.pth")
config_path = hf_hub_download(repo, "config.json")
speakers_path = hf_hub_download(repo, "speakers.pth")

synth = Synthesizer(model_path, config_path, speakers_path, use_cuda=False)

# Synthesize with the first available speaker and write the result to disk.
speaker = synth.tts_model.speaker_manager.speaker_names[0]
wav = synth.tts("Сайн байна уу?", speaker_name=speaker)
synth.save_wav(wav, "out.wav")
```
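
Because the model is multi-speaker, it can be useful to hear the same sentence in every voice. The following is a minimal sketch, assuming a `synth` object built as in the snippet above. The helper name `synthesize_all_speakers` and the `samples` output directory are hypothetical and are not part of Coqui TTS.

```python
from pathlib import Path


def synthesize_all_speakers(synth, text, out_dir="samples"):
    """Render `text` once per speaker and return the written file paths.

    `synth` is assumed to be a Coqui TTS `Synthesizer` created as in the
    usage snippet. This helper is illustrative, not a Coqui TTS API.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    names = synth.tts_model.speaker_manager.speaker_names
    for index, name in enumerate(names):
        wav = synth.tts(text, speaker_name=name)
        # Speaker names may contain characters that are unsafe in file names,
        # so keep only alphanumerics, "-" and "_"; the index keeps names unique.
        safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in name)
        path = out / f"{index:02d}_{safe}.wav"
        synth.save_wav(wav, str(path))
        written.append(path)
    return written
```

Calling `synthesize_all_speakers(synth, "Сайн байна уу?")` would write one WAV file per speaker into `samples/`, which makes it easier to pick a voice before fixing `speaker_name` in your own code.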
TensorBoard logs are included under tensorboard/ and render in the
Training metrics tab of this repository.