Mongolian VITS TTS

Multi-speaker VITS text-to-speech model for Mongolian, trained with Coqui TTS.

Architecture: VITS (end-to-end, multi-speaker)
Language: Mongolian (mn)
Sample rate: 22050 Hz
Speakers: 78 (see `speakers.pth`)
Checkpoint: best model at training step 241549

Files

File	Description
`best_model.pth`	Best VITS checkpoint (selected by eval loss)
`config.json`	Coqui TTS training/inference config
`speakers.pth`	Speaker manager / speaker id map
`tensorboard/`	TensorBoard event files (training curves)
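
Before synthesis, it can be useful to confirm that a downloaded `config.json` agrees with the 22050 Hz sample rate listed above. The following is a minimal sketch using only the standard library. It assumes the rate is stored under the `audio` -> `sample_rate` keys, which is a common layout for Coqui TTS configs but is not stated in this model card; the function names are hypothetical helpers, not part of any library.

```python
import json
from pathlib import Path

EXPECTED_SAMPLE_RATE = 22050  # value stated in this model card


def read_sample_rate(config_path):
    """Return the sample rate stored in a Coqui-style config.json, or None.

    Assumption: the rate lives under "audio" -> "sample_rate". If the key
    is missing, None is returned instead of raising.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return config.get("audio", {}).get("sample_rate")


def matches_model_card(config_path, expected=EXPECTED_SAMPLE_RATE):
    """True if the config's sample rate equals the rate listed in the card."""
    return read_sample_rate(config_path) == expected
```

With the `config_path` returned by `hf_hub_download` in the Usage section, `matches_model_card(config_path)` returning `False` would suggest either a different key layout or a mismatched config.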

Usage

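from huggingface_hub import hf_hub_download
from TTS.utils.synthesizer import Synthesizer

# Download the checkpoint, config, and speaker map from the Hub (cached locally).
repo = "Bokhbat/mongolian-vits-tts"
model_path = hf_hub_download(repo, "best_model.pth")
config_path = hf_hub_download(repo, "config.json")
speakers_path = hf_hub_download(repo, "speakers.pth")

# Load the model on CPU; set use_cuda=True to run on a GPU.
synth = Synthesizer(model_path, config_path, speakers_path, use_cuda=False)

# The model is multi-speaker, so a speaker must be chosen; here, the first one.
speaker = synth.tts_model.speaker_manager.speaker_names[0]
wav = synth.tts("Сайн байна уу?", speaker_name=speaker)  # Mongolian greeting ("Hello")
synth.save_wav(wav, "out.wav")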
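
Because the model has 78 speakers, comparing a few voices on the same sentence is a natural next step. The sketch below loops over speaker names and writes one file per speaker. It relies only on the attributes already used in the snippet above (`tts`, `save_wav`, and `tts_model.speaker_manager.speaker_names`); the helper name `synthesize_per_speaker`, the `max_speakers` parameter, and the output file naming are illustrative assumptions.

```python
from pathlib import Path


def synthesize_per_speaker(synth, text, out_dir, max_speakers=3):
    """Write one WAV per speaker and return the list of written paths.

    `synth` is expected to be a Synthesizer built as in the Usage snippet.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Take only the first few speakers to keep the run short.
    names = list(synth.tts_model.speaker_manager.speaker_names)[:max_speakers]

    written = []
    for index, name in enumerate(names):
        wav = synth.tts(text, speaker_name=name)
        # Index-based file names avoid trouble with unusual characters
        # in speaker names.
        path = out_dir / f"speaker_{index:02d}.wav"
        synth.save_wav(wav, str(path))
        written.append(path)
    return written
```

For example, `synthesize_per_speaker(synth, "Сайн байна уу?", "samples")` would write `samples/speaker_00.wav` through `samples/speaker_02.wav`.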

Training metrics

TensorBoard logs are included under `tensorboard/` and render in the Training metrics tab of this repository.
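
The same logs can also be read offline once the `tensorboard/` folder has been downloaded. The sketch below locates event files and extracts one scalar series. It assumes the files follow TensorBoard's usual `events.out.tfevents.*` naming and that the `tensorboard` package is installed for the second function; the scalar tag names are not listed in this model card, so the tag has to be looked up first. Both function names are hypothetical helpers.

```python
from pathlib import Path


def find_event_files(log_dir):
    """Return TensorBoard event files under log_dir, sorted by path."""
    # TensorBoard writers name their files "events.out.tfevents.<suffix>".
    return sorted(Path(log_dir).rglob("events.out.tfevents.*"))


def load_scalars(event_file, tag):
    """Return (step, value) pairs for one scalar tag in an event file.

    Requires the `tensorboard` package. It is imported lazily so that
    find_event_files() works without it.
    """
    from tensorboard.backend.event_processing.event_accumulator import (
        EventAccumulator,
    )

    accumulator = EventAccumulator(str(event_file))
    accumulator.Reload()  # parse the event file from disk
    return [(event.step, event.value) for event in accumulator.Scalars(tag)]
```

Available tags can be listed with `EventAccumulator(path).Reload().Tags()["scalars"]` before calling `load_scalars`.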


Paper

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (arXiv:2106.06103, published Jun 11, 2021)