Ewe Text-to-Speech Model (Fine-tuned MMS-TTS)

This model is a fine-tuned version of facebook/mms-tts-ewe on the WAXAL TTS dataset for Ewe (ewe).

Model Description

  • Base model: MMS-TTS (VITS architecture)
  • Fine-tuned on: google/WaxalNLP (ewe_tts split)
  • Language: Ewe (ewe)
  • Sampling rate: 16000 Hz
  • Speaker support: Single speaker (average voice)

Intended Use

This model is intended for text-to-speech synthesis in Ewe. It can be used for:

  • Audiobook generation
  • Voice assistants
  • Educational tools
  • Any application requiring natural Ewe speech

Training Procedure

Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 5e-05 |
| Batch size (per device) | 32 |
| Gradient accumulation steps | 2 |
| Effective batch size | 64 |
| Number of epochs (planned) | 30 |
| Actual epochs completed | N/A |
| Total training steps | N/A |
| Warmup steps | 91 |
| Optimizer | AdamW (β1=0.8, β2=0.99) |
| LR scheduler | Linear |
| Mixed precision | fp16 |
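The linear warmup/decay schedule above can be sketched as a plain function. This is a sketch only: `total_steps` is a hypothetical placeholder, since the actual step count is listed as N/A in the table.

```python
def linear_lr(step, base_lr=5e-05, warmup_steps=91, total_steps=10000):
    """Linear warmup to base_lr over warmup_steps, then linear decay to 0.

    total_steps is a placeholder value; the card lists the actual count as N/A.
    """
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr (at warmup_steps) to 0 (at total_steps)
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)
```

For example, the learning rate peaks at 5e-05 exactly at step 91 and reaches 0 at the final step.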

Loss weights

| Loss component | Weight |
|---|---|
| Mel-spectrogram | 45.0 |
| KL divergence | 1.5 |
| Duration | 1.0 |
| Discriminator (adversarial) | 1.0 |
| Generator (adversarial) | 1.0 |
| Feature matching | 1.0 |
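For reference, the generator objective in VITS training is the weighted sum of these components. The sketch below only mirrors the table above; the dictionary keys are illustrative names, not identifiers from the actual training code.

```python
# Weights taken from the table above; key names are illustrative
LOSS_WEIGHTS = {
    "mel": 45.0,        # mel-spectrogram reconstruction loss
    "kl": 1.5,          # KL divergence between prior and posterior
    "duration": 1.0,    # duration predictor loss
    "gen_adv": 1.0,     # generator adversarial loss
    "feat_match": 1.0,  # feature-matching loss
}

def generator_loss(components):
    """Combine per-component loss values into the total generator objective."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in components.items())
```

Note that the discriminator loss (weight 1.0) is optimized in a separate step of the GAN setup, so it is not part of the generator's weighted sum.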

Training metrics

| Metric | Value |
|---|---|
| Final validation loss (mel+kl) | N/A |
| Final training loss | N/A |
| Final learning rate | N/A |

How to Use

from transformers import VitsModel, AutoTokenizer
import torch
import soundfile as sf

# Load the fine-tuned Ewe model and its tokenizer
model = VitsModel.from_pretrained("waxal-benchmarking/mms-tts-ewe-1nnocent")
tokenizer = AutoTokenizer.from_pretrained("waxal-benchmarking/mms-tts-ewe-1nnocent")

text = "Woezɔ."  # Ewe for "welcome"; replace with your own Ewe text
inputs = tokenizer(text, return_tensors="pt")

# No gradients are needed for synthesis
with torch.no_grad():
    outputs = model(**inputs)

# Save the generated waveform at the model's 16 kHz sampling rate
waveform = outputs.waveform[0].cpu().numpy()
sf.write("output.wav", waveform, 16000)
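If `soundfile` is unavailable, the standard-library `wave` module can write the output instead, assuming mono float samples in [-1, 1]. This is a minimal sketch of the float-to-16-bit-PCM conversion; the sine tone below is synthetic stand-in data, not model output.

```python
import math
import struct
import wave

def write_wav(path, samples, sample_rate=16000):
    """Write mono float samples in [-1, 1] to path as 16-bit PCM."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)          # mono
        f.setsampwidth(2)          # 16-bit samples
        f.setframerate(sample_rate)
        # Clamp each sample, then scale to the signed 16-bit range
        pcm = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        f.writeframes(pcm)

# One second of a synthetic 440 Hz tone standing in for outputs.waveform[0]
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
write_wav("output.wav", tone)
```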

Evaluation

Qualitative evaluation was performed by listening to generated samples.

For quantitative metrics (MOS, speaker similarity), please refer to the forthcoming WAXAL benchmarking paper.

Ethical Considerations

This model is intended for research and non-commercial use.

  • The WAXAL dataset contains recordings from volunteer speakers who consented to their voice being used for TTS research.

  • The model may reflect biases present in the training data (e.g., limited dialectal coverage).

Acknowledgements

This work was made possible by:

  1. WAXAL dataset contributors – Makerere University, University of Ghana, Media Trust, Digital Umuganda, AIMS Senegal

  2. Meta AI – for releasing the MMS-TTS models

  3. Hugging Face – for the transformers library and model hosting

Citation

If you use this model, please cite the original MMS-TTS paper and the WAXAL dataset:

@article{mms_tts,
    title={Scaling Speech Technology to 1,000+ Languages},
    author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
    journal={arXiv},
    year={2023}
}
@article{waxal2026,
      title={WAXAL: A Large-Scale Multilingual African Language Speech Corpus},
      author={Abdoulaye Diack and Perry Nelson and Kwaku Agbesi and Angela Nakalembe and MohamedElfatih MohamedKhair and Vusumuzi Dube and Tavonga Siyavora and Subhashini Venugopalan and Jason Hickey and Uche Okonkwo and Abhishek Bapna and Isaac Wiafe and Raynard Dodzi Helegah and Elikem Doe Atsakpo and Charles Nutrokpor and Fiifi Baffoe Payin Winful and Kafui Kwashie Solaga and Jamal-Deen Abdulai and Akon Obu Ekpezu and Audace Niyonkuru and Samuel Rutunda and Boris Ishimwe and Michael Melese and Engineer Bainomugisha and Joyce Nakatumba-Nabende and Andrew Katumba and Claire Babirye and Jonathan Mukiibi and Vincent Kimani and Samuel Kibacia and James Maina and Fridah Emmah and Ahmed Ibrahim Shekarau and Ibrahim Shehu Adamu and Yusuf Abdullahi and Howard Lakougna and Bob MacDonald and Hadar Shemtov and Aisha Walcott-Bryant and Moustapha Cisse and Avinatan Hassidim and Jeff Dean and Yossi Matias},
      year={2026},
      eprint={2602.02734},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2602.02734},
}

Model Author

Innocent Anyaele (@1nnocent)

DISCLAIMER: This model card was automatically generated on 2026-04-09 19:21:55.
