VITS Open Bible — Turkish
A multispeaker text-to-speech model for Turkish, trained from scratch on the Open Bible corpus using the VITS architecture (end-to-end TTS with adversarial learning, 22,050 Hz output) via the Coqui TTS framework.
Unlike zero-shot TTS models, this model cannot imitate unseen voices: it is conditioned on speaker embeddings learned during training, so a speaker name from the training set must be supplied at inference time.
Files
| File | Purpose |
|---|---|
| `model_last.pth` | Trained model weights. |
| `config.json` | Coqui TTS model configuration. |
| `speakers.pth` | Speaker ID → embedding mapping. |
Intended use
- Multispeaker TTS for Turkish using one of the training-set speaker voices.
- Research on multilingual TTS, low-resource TTS evaluation, and listening studies on Open Bible–style read-speech.
How to use
Install Coqui TTS:
```shell
pip install TTS
```
Download the checkpoint and run inference:
```python
import torch
from huggingface_hub import hf_hub_download
from TTS.tts.utils.speakers import SpeakerManager
from TTS.utils.synthesizer import Synthesizer

repo_id = "multilingual-tts/VITS-OpenBible-Turkish"
ckpt = hf_hub_download(repo_id, "model_last.pth")
config = hf_hub_download(repo_id, "config.json")
speakers = hf_hub_download(repo_id, "speakers.pth")

use_cuda = torch.cuda.is_available()
synthesizer = Synthesizer(
    tts_checkpoint=ckpt,
    tts_config_path=config,
    tts_speakers_file=speakers,
    use_cuda=use_cuda,
)

# Coqui's Synthesizer may not inject the speakers file into the model config
# automatically; restore the SpeakerManager manually when needed.
if synthesizer.tts_model.speaker_manager is None:
    synthesizer.tts_model.speaker_manager = SpeakerManager(
        speaker_id_file_path=speakers
    )

# List available speaker names
print(sorted(synthesizer.tts_model.speaker_manager.speaker_names))

wav = synthesizer.tts(
    text="...",  # text to synthesise in Turkish
    speaker_name="...",  # one of the speaker names printed above
    split_sentences=True,
)
```
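Because the speaker must be one of the training-set voices, it can help to validate the requested name before calling the synthesizer. The following is a minimal sketch, not part of Coqui TTS: `resolve_speaker` is a hypothetical helper and the names in the example are placeholders, not actual speakers of this model.

```python
import difflib


def resolve_speaker(requested, available):
    """Return `requested` if it is a known speaker, else raise with suggestions."""
    names = sorted(available)
    if requested in names:
        return requested
    # Suggest close matches to help catch typos and casing differences.
    suggestions = difflib.get_close_matches(requested, names, n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown speaker {requested!r}.{hint}")


# Placeholder names for illustration only; the real list comes from
# synthesizer.tts_model.speaker_manager.speaker_names.
example_names = ["speaker_a", "speaker_b"]
print(resolve_speaker("speaker_a", example_names))
```

In practice the second argument would be the speaker list printed by the inference snippet, and the returned value would be passed as `speaker_name`.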
Training data
- Source: `davidguzmanr/open-bible-resources`, config `Turkish`
- Size: approximately 27,483 utterances
- Speakers: multispeaker; speaker identity is fixed to one of the training-set voices and selected by name at inference time
- Sample rate: 22,050 Hz
Training procedure
- Architecture: VITS (Conditional Variational Autoencoder + adversarial training).
- Grapheme-level tokenizer, built from the training transcripts.
- Optimizer: AdamW, learning rate 2e-4.
- Training budget: 500,000 optimizer updates on 2 GPUs with mixed precision (bf16).
Audio preprocessing and training are reproducible via the upstream open-bible-models repo.
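To illustrate what a grapheme-level tokenizer built from the transcripts amounts to, here is a minimal sketch. It is an assumption about the general technique, not the tokenizer code used for this model: the function names, the `<PAD>` symbol, and the example transcripts are all hypothetical.

```python
def build_grapheme_vocab(transcripts, pad="<PAD>"):
    """Collect every distinct character in the transcripts into a sorted vocabulary."""
    chars = sorted({ch for line in transcripts for ch in line})
    # Index 0 is reserved for padding; graphemes start at 1.
    return {pad: 0, **{ch: i for i, ch in enumerate(chars, start=1)}}


def encode(text, vocab):
    """Map text to token IDs, failing loudly on characters unseen in training."""
    unknown = sorted({ch for ch in text if ch not in vocab})
    if unknown:
        raise KeyError(f"Characters not in vocabulary: {unknown}")
    return [vocab[ch] for ch in text]


# Illustrative transcripts, not taken from the corpus.
transcripts = ["ışık", "göz"]
vocab = build_grapheme_vocab(transcripts)
print(encode("göz", vocab))
```

A vocabulary built this way only covers characters that occur in the training transcripts, so input text containing other symbols needs to be normalised or rejected before synthesis. If text is lowercased first, note that Python's `str.lower()` is locale-independent and maps `I` to `i` rather than to the Turkish dotless `ı`.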
Evaluation
The model was evaluated alongside other Open-Bible TTS systems on character and word error rate (via Meta's Omnilingual ASR) and on UTMOSv2 naturalness scores. See the open-bible-models repository for the evaluation pipeline and the open-bible-surveys repository for the human-listening survey methodology.
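For readers unfamiliar with the error-rate metrics, the sketch below shows how character and word error rate are commonly computed from a reference text and an ASR transcript, using Levenshtein distance. It is a generic illustration under simple assumptions (whitespace word splitting, no text normalisation) and does not reproduce the actual evaluation pipeline; the function names and example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="char"):
    """Edit distance divided by reference length, over characters or words."""
    ref = list(reference) if unit == "char" else reference.split()
    hyp = list(hypothesis) if unit == "char" else hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up strings: the input text and a hypothetical ASR transcript of the audio.
reference = "bir iki üç"
hypothesis = "bir iki uç"
print("CER:", error_rate(reference, hypothesis, unit="char"))
print("WER:", error_rate(reference, hypothesis, unit="word"))
```

Reported scores depend heavily on how casing and punctuation are normalised before comparison, so numbers from this sketch are not directly comparable to those of the evaluation pipeline.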