Mongolian VITS TTS

Multi-speaker VITS text-to-speech model for Mongolian, trained with Coqui TTS.

  • Architecture: VITS (end-to-end, multi-speaker)
  • Language: Mongolian (mn)
  • Sample rate: 22050 Hz
  • Speakers: 78 (see speakers.pth)
  • Checkpoint: best model at training step 241549

Files

File              Description
best_model.pth    Best VITS checkpoint (eval loss)
config.json       Coqui TTS training/inference config
speakers.pth      Speaker manager / speaker id map
tensorboard/      TensorBoard event files (training curves)

Usage

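from huggingface_hub import hf_hub_download
from TTS.utils.synthesizer import Synthesizer

# Download the checkpoint, config, and speaker map from the Hub (cached locally).
repo = "Bokhbat/mongolian-vits-tts"
model_path = hf_hub_download(repo, "best_model.pth")
config_path = hf_hub_download(repo, "config.json")
speakers_path = hf_hub_download(repo, "speakers.pth")

# Build the synthesizer on CPU; set use_cuda=True to run on a GPU.
synth = Synthesizer(
    tts_checkpoint=model_path,
    tts_config_path=config_path,
    tts_speakers_file=speakers_path,
    use_cuda=False,
)

# Synthesize a greeting ("Hello") with the first listed speaker.
speaker = synth.tts_model.speaker_manager.speaker_names[0]
wav = synth.tts("Сайн байна уу?", speaker_name=speaker)
synth.save_wav(wav, "out.wav")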
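
Because the model has 78 speakers, taking the first entry of speaker_names is only a placeholder choice. The sketch below is not part of the original card: it validates a requested speaker name against the list and estimates the clip length from the 22050 Hz sample rate. The helper names pick_speaker, duration_seconds, and synthesize are hypothetical, and synth is assumed to be the Synthesizer created in the usage snippet.

```python
def pick_speaker(speaker_names, requested=None):
    """Return a validated speaker name; default to the first one when none is requested."""
    names = list(speaker_names)
    if not names:
        raise ValueError("the speaker list is empty")
    if requested is None:
        return names[0]
    if requested not in names:
        raise KeyError(f"unknown speaker {requested!r}; {len(names)} speakers available")
    return requested


def duration_seconds(wav, sample_rate=22050):
    """Length of a clip in seconds, given its samples and sample rate."""
    return len(wav) / sample_rate


def synthesize(synth, text, out_path, speaker=None):
    """Synthesize `text` with a validated speaker, save it, and return its duration."""
    # `synth` is assumed to be a Coqui TTS Synthesizer for this multi-speaker model.
    names = synth.tts_model.speaker_manager.speaker_names
    wav = synth.tts(text, speaker_name=pick_speaker(names, speaker))
    synth.save_wav(wav, out_path)
    return duration_seconds(wav, synth.output_sample_rate)
```

For example, synthesize(synth, "Сайн байна уу?", "out.wav") writes the file with the first speaker and returns the clip length in seconds. Passing a name that is not in speakers.pth raises a KeyError before any audio is generated.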

Training metrics

TensorBoard logs are included under tensorboard/ and render in the Training metrics tab of this repository.
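
The same logs can also be read offline. The following is a minimal sketch, assuming the tensorboard and huggingface_hub packages are installed and that the event files sit directly under tensorboard/. The helper names load_scalars, lowest_point, and main are hypothetical. Scalar tag names depend on how the run was logged, so the code lists whatever tags it finds instead of assuming any.

```python
def load_scalars(logdir):
    """Read every scalar series under `logdir` into {tag: [(step, value), ...]}."""
    # Imported lazily so lowest_point() works without tensorboard installed.
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    acc = EventAccumulator(logdir)
    acc.Reload()  # parse the event files on disk
    return {
        tag: [(event.step, event.value) for event in acc.Scalars(tag)]
        for tag in acc.Tags()["scalars"]
    }


def lowest_point(series):
    """Return the (step, value) pair with the smallest value in a scalar series."""
    if not series:
        raise ValueError("empty series")
    return min(series, key=lambda pair: pair[1])


def main():
    from huggingface_hub import snapshot_download

    # Fetch only the TensorBoard logs, not the model checkpoint.
    local_dir = snapshot_download(
        "Bokhbat/mongolian-vits-tts", allow_patterns=["tensorboard/*"]
    )
    for tag, series in load_scalars(f"{local_dir}/tensorboard").items():
        step, value = lowest_point(series)
        print(f"{tag}: {len(series)} points, minimum {value:.4f} at step {step}")
```

Calling main() downloads the logs and prints one summary line per scalar tag. If the logs are nested in run subdirectories, point load_scalars at the subdirectory that holds the event files.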
