Soro-TTS — Yoruba 🇳🇬

Part of Soro-TTS, a multilingual text-to-speech system for Nigerian languages. This checkpoint is a fine-tune of facebook/mms-tts-yor on the google/WaxalNLP yor_tts subset.

Languages in the Soro-TTS suite

Quick start

from transformers import VitsModel, AutoTokenizer
import torch
import scipy.io.wavfile

model = VitsModel.from_pretrained("Shinzmann/soro-tts-yor")
tokenizer = AutoTokenizer.from_pretrained("Shinzmann/soro-tts-yor")

# Yoruba input should carry its tone diacritics; see "Limitations" below.
text = "Ẹ kú àbọ̀ sí Nàìjíríà, orílẹ̀-èdè wa tó kún fún ìbùkún."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # waveform has shape (batch, samples); take the first (only) item
    waveform = model(**inputs).waveform[0].numpy()

# model.config.sampling_rate is 16000 for MMS-TTS checkpoints
scipy.io.wavfile.write("out.wav", rate=model.config.sampling_rate, data=waveform)
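Note that scipy writes the float32 array above as a 32-bit float WAV, which some players do not handle. A minimal conversion sketch to 16-bit PCM (`to_int16_pcm` is our own helper, not part of the release):

```python
import numpy as np

def to_int16_pcm(waveform: np.ndarray) -> np.ndarray:
    """Convert a float waveform in [-1, 1] to 16-bit PCM samples."""
    clipped = np.clip(waveform, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)
```

You would then call `scipy.io.wavfile.write("out.wav", model.config.sampling_rate, to_int16_pcm(waveform))` instead of writing the float array directly.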

Training data

Trained on the yor_tts configuration of WAXAL — studio-quality, phonetically balanced single-speaker recordings collected by Media Trust under Google Research's WAXAL initiative.

Statistic Value
Total audio 22.52 hours
Training audio 19.17 hours (2233 clips)
Validation audio 1.78 hours
Test audio 1.56 hours
Speakers (train) 8
% words containing diacritics 73.3%
Sample rate 16 kHz
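The diacritic share reported above can be checked with a simple Unicode test. The sketch below (our own helper, not part of the release) counts words that contain at least one combining mark after NFD decomposition, which covers both tone marks and under-dots:

```python
import unicodedata

def diacritic_word_share(text: str) -> float:
    """Fraction of whitespace-separated words containing a combining mark
    (tone marks, under-dots) after NFD decomposition."""
    words = text.split()
    if not words:
        return 0.0

    def has_mark(word: str) -> bool:
        return any(unicodedata.combining(ch)
                   for ch in unicodedata.normalize("NFD", word))

    return sum(has_mark(w) for w in words) / len(words)
```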

Architecture

VITS / MMS-TTS — a conditional VAE with adversarial training, a flow-based prior, and a HiFi-GAN-style decoder.

  • Parameters: ~83M
  • Sample rate: 16 kHz
  • Base model: facebook/mms-tts-yor (Pratap et al., 2023)

Training procedure

Hyperparameter Value
Epochs 100
Batch size 128
Learning rate 2e-05
Optimizer AdamW (β₁=0.8, β₂=0.99)
Precision bf16
Loss weights mel=35, kl=1.5, gen=1, fmaps=1, disc=3, duration=1
Recipe ylacombe/finetune-hf-vits
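The loss weights in the table scale the individual VITS terms into a single generator-side objective (the disc=3 weight scales the separate discriminator objective). Schematically, with hypothetical per-term values:

```python
# Generator-side loss weights from the table above (disc is a separate objective).
LOSS_WEIGHTS = {"mel": 35.0, "kl": 1.5, "gen": 1.0, "fmaps": 1.0, "duration": 1.0}

def generator_loss(terms: dict) -> float:
    """Weighted sum of unweighted per-term loss values for one batch."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in terms.items())
```

This is a sketch of how the recipe weighs the terms, not a reproduction of its training loop.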

Evaluation

Character Error Rate (CER) measured by transcribing synthesised audio with facebook/mms-1b-all ASR (target_lang=yor):

Metric n Value
CER (ASR-based) 20 47.72%

This proxy metric measures intelligibility, not naturalness. Human MOS evaluation by native speakers is recommended for the latter.
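CER here is the character-level edit distance between the ASR transcript and the reference text, divided by the reference length. A minimal sketch (plain Levenshtein, with none of the text normalization a real evaluation would add):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / len(reference)."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)
```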

Limitations and biases

  • Single voice. WAXAL TTS is recorded by 1–2 professional voice actors per language. The model inherits that voice and accent.
  • Domain. Training text covers news, narration, and read speech; conversational, code-switched, or highly informal text may be out of distribution.
  • Tonal nuance. Yoruba relies on tone marks for meaning. Inputs without proper diacritics will produce flat or incorrect prosody.
  • Non-commercial. MMS-TTS base is CC BY-NC 4.0; this fine-tune inherits that license.
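Because tone marks can be encoded either as precomposed characters or as a base letter plus a combining mark, it is worth normalizing input to a single Unicode form before tokenizing. NFC is a reasonable default, though we have not verified which form the tokenizer was trained on:

```python
import unicodedata

def normalize_yoruba(text: str) -> str:
    """Collapse combining-mark sequences into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)

# "á" typed as "a" + combining acute becomes the single precomposed "á".
```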

License

CC BY-NC 4.0 (inherited from facebook/mms-tts-yor). The WAXAL data itself is CC BY 4.0. This model is for research use only and may not be used commercially.

Citation

@misc{soro_tts_yor_2026,
  title  = {{Soro-TTS: A Multilingual Text-to-Speech System for Nigerian Languages — Yoruba}},
  author = {{Soro-TTS authors}},
  year   = {{2026}},
  url    = {{https://huggingface.co/Shinzmann/soro-tts-yor}},
}
@article{pratap2023mms,
  title   = {Scaling Speech Technology to 1,000+ Languages},
  author  = {Pratap, Vineel and Tjandra, Andros and Shi, Bowen and others},
  journal = {arXiv preprint arXiv:2305.13516},
  year    = {2023}
}

Acknowledgements

  • Google Research and Media Trust for releasing WAXAL
  • Meta AI for the MMS base models
  • Yoach Lacombe for finetune-hf-vits