
Sauti TTS

Sauti TTS is a Swahili text-to-speech checkpoint released by MsingiAI and built on top of F5-TTS v1 Base. This checkpoint is intended for research and development workflows involving Swahili speech synthesis and reference-audio-conditioned voice transfer.

This upload is a model checkpoint package, not a standalone Python library. It is designed to be used with the sauti-tts inference code and the upstream F5-TTS stack.

Model Summary

  • Model name: Sauti TTS
  • Developer: MsingiAI
  • Primary language: Swahili
  • Base model family: F5-TTS v1 Base
  • Vocoder path: Vocos via the F5-TTS stack
  • Task: text-to-speech
  • Conditioning mode: text + short reference audio

Release Snapshot

This Hub release contains:

  • model_last.pt
  • vocab.txt
  • training_config.json
  • LICENSE
  • THIRD_PARTY_NOTICES.md

The uploaded checkpoint was taken from the Modal checkpoint volume path /sauti_tts_multi/model_last.pt and corresponds to a full fine-tuning run of the multi-GPU recipe. The checkpoint includes:

  • model weights
  • EMA weights
  • optimizer state
  • scheduler state

The checkpoint metadata reports update = 15350.
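Because this is a full training checkpoint (weights, EMA weights, optimizer, and scheduler state), it can be shrunk considerably for inference-only use. The sketch below shows the idea with a plain dict standing in for the result of `torch.load(..., map_location="cpu")`; the key names (`ema_model_state_dict`, `optimizer_state_dict`, and so on) follow common F5-TTS conventions but are assumptions here, so inspect your checkpoint's keys before relying on them.

```python
# Sketch: shrinking the full training checkpoint to an inference-only dict.
# Key names are assumed F5-TTS conventions -- confirm them against
# torch.load("model_last.pt", map_location="cpu").keys() first.

def strip_for_inference(ckpt: dict) -> dict:
    """Keep only what inference needs: weights (EMA preferred) and step count."""
    keep = {}
    # Prefer EMA weights when present; they usually synthesize more smoothly.
    if "ema_model_state_dict" in ckpt:
        keep["ema_model_state_dict"] = ckpt["ema_model_state_dict"]
    elif "model_state_dict" in ckpt:
        keep["model_state_dict"] = ckpt["model_state_dict"]
    if "update" in ckpt:
        keep["update"] = ckpt["update"]
    # Optimizer and scheduler state are only needed to resume training.
    return keep

# Stand-in for the loaded checkpoint:
full = {
    "model_state_dict": {"w": [0.1]},
    "ema_model_state_dict": {"w": [0.1]},
    "optimizer_state_dict": {"state": {}},
    "scheduler_state_dict": {"last_epoch": 15350},
    "update": 15350,
}
slim = strip_for_inference(full)
print(sorted(slim))  # ['ema_model_state_dict', 'update']
```

With real tensors you would then save the slim dict with `torch.save`, which is what yields the smaller inference-only export mentioned under Limitations.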

Training Data

The model was trained on the Google WaxalNLP Swahili TTS subset (swa_tts), after local preparation and export into an F5-TTS-compatible format.

Statistics for the prepared subset used in this release:

  • Total prepared utterances: 1,245
  • Total prepared audio: 4.20 hours
  • Speakers: 7
  • Gender distribution: 696 female / 549 male utterances
  • Average utterance duration: 12.15 seconds
  • Min / max utterance duration: 2.94 s / 29.95 s

Observed split sizes:

  • Train: 976 utterances, 3.31 hours
  • Validation: 133 utterances, 0.44 hours
  • Test: 136 utterances, 0.45 hours

Data preparation in the sauti-tts project includes:

  • resampling
  • silence trimming
  • loudness normalization
  • Swahili text normalization
  • F5-TTS-compatible metadata export
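The text-side and loudness steps above can be sketched as below. The function names and the exact normalization rules are illustrative, not the sauti-tts project's actual API; in particular the character set kept by the regex and the RMS target are assumptions for this example.

```python
import math
import re
import unicodedata

# Illustrative sketch of two preparation steps: Swahili text normalization
# and loudness normalization. Names and rules here are assumptions, not the
# project's actual implementation.

def normalize_swahili_text(text: str) -> str:
    """Lowercase, normalize unicode, drop stray symbols, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^a-z'\s.,?!]", " ", text)   # keep letters + basic punctuation
    return re.sub(r"\s+", " ", text).strip()

def rms_normalize(samples: list[float], target_rms: float = 0.1) -> list[float]:
    """Scale a waveform so its RMS level matches target_rms."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return samples
    gain = target_rms / rms
    return [s * gain for s in samples]

print(normalize_swahili_text("Habari,   KARIBU!!"))  # habari, karibu!!
```

In the real pipeline the waveform step would operate on a NumPy array or tensor after resampling and silence trimming, but the arithmetic is the same.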

Training Configuration

This checkpoint corresponds to the multi-GPU training recipe:

  • learning_rate = 2e-5
  • batch_size_per_gpu = 2000 frames
  • num_warmup_updates = 300
  • mixed_precision = bf16
  • use_ema = true
  • ema_decay = 0.9999

The release also includes the exact training_config.json used for the run.
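The schedule arithmetic implied by these settings can be sketched as follows. Linear warmup to the peak rate is the usual F5-TTS setup, but the post-warmup decay actually used is defined by `training_config.json`, so the constant rate after warmup here is an assumption for illustration.

```python
# Sketch of the schedule arithmetic implied by the config above:
# linear warmup over num_warmup_updates=300 to learning_rate=2e-5,
# plus one EMA step with ema_decay=0.9999. The flat rate after warmup
# is an assumption; the real decay is set in training_config.json.

PEAK_LR = 2e-5
WARMUP = 300

def lr_at(update: int) -> float:
    """Learning rate at a given update: linear warmup, then flat (assumed)."""
    if update < WARMUP:
        return PEAK_LR * (update + 1) / WARMUP
    return PEAK_LR

def ema_update(ema: float, weight: float, decay: float = 0.9999) -> float:
    """One EMA step: ema <- decay * ema + (1 - decay) * weight."""
    return decay * ema + (1.0 - decay) * weight

print(lr_at(0), lr_at(299), lr_at(15350))
```

At `update = 15350` (the reported step of this checkpoint), the run is long past warmup, so the EMA weights have had ample time to stabilize.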

Intended Use

This checkpoint is intended for:

  • Swahili TTS research
  • speech generation experiments for African language technology
  • reference-audio-conditioned voice transfer experiments
  • benchmarking and reproducibility work around F5-TTS fine-tuning

Out-of-Scope Use

This checkpoint is not intended for:

  • impersonation, fraud, or deception
  • biometric identification or identity claims
  • safety-critical systems
  • any deployment that ignores upstream license or dataset obligations

Limitations

  • Output quality depends strongly on reference audio quality and the accuracy of the reference transcript.
  • This release is focused on Swahili; quality outside Swahili has not been established.
  • Very short pauses and waveform boundaries may still benefit from cleanup during inference.
  • This is a full training checkpoint package, so the file is significantly larger than a stripped inference-only export.
  • This release does not yet include benchmark tables or objective evaluation results in the Hub repo itself.
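The boundary cleanup mentioned above can be as simple as a short linear fade at each chunk edge so joins do not click. This is a minimal pure-Python sketch (in practice you would apply it to a NumPy array or tensor); the 240-sample default is an assumed ~10 ms at 24 kHz, not a project setting.

```python
# Minimal sketch of waveform-boundary cleanup: linearly fade the first and
# last few samples of a generated chunk. The fade length is an assumption
# (240 samples ~= 10 ms at 24 kHz), not a sauti-tts default.

def apply_edge_fades(samples: list[float], fade_len: int = 240) -> list[float]:
    """Linearly fade in the first and fade out the last fade_len samples."""
    n = len(samples)
    fade_len = min(fade_len, n // 2)
    out = list(samples)
    for i in range(fade_len):
        g = i / fade_len
        out[i] *= g            # fade in
        out[n - 1 - i] *= g    # fade out
    return out
```

A short equal-power crossfade between consecutive chunks is the natural next step when concatenating long-text output.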

How to Use

1. Clone the inference code

Use this checkpoint with the sauti-tts codebase and the upstream F5-TTS dependency.

git clone https://github.com/Msingi-AI/sauti-tts
cd sauti-tts
pip install -r requirements.txt
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS && pip install -e . && cd ..

2. Download the checkpoint and vocab

hf download msingiai/sauti-tts model_last.pt --local-dir ckpts/sauti_tts_multi
hf download msingiai/sauti-tts vocab.txt --local-dir data/waxal_swahili

3. Run inference

python scripts/inference.py \
  --checkpoint ckpts/sauti_tts_multi/model_last.pt \
  --vocab_path data/waxal_swahili/vocab.txt \
  --text "Habari, karibu kwenye Sauti TTS." \
  --ref_audio path/to/reference.wav \
  --ref_text "Habari, karibu kwenye Sauti TTS." \
  --output outputs/generated.wav

The inference script also supports:

  • long-text chunking
  • configurable sampling steps
  • configurable guidance strength
  • batch generation from a text file
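The long-text chunking idea can be sketched as splitting on sentence boundaries and packing sentences under a size budget. The boundary regex and character budget below are assumptions for illustration; the real `scripts/inference.py` may chunk differently (e.g. by audio duration rather than characters).

```python
import re

# Illustrative long-text chunking: split on sentence-final punctuation and
# pack sentences into chunks under a character budget. This mirrors the
# feature described above but is not the project's actual implementation.

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack sentence-bounded chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip() if current else s
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("Habari. Karibu kwenye Sauti TTS. Hii ni mfano.", max_chars=25))
```

Each chunk would then be synthesized separately (conditioned on the same reference audio) and the waveforms concatenated, ideally with a short crossfade at the joins.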

Repository Notes

This model card describes the Hub release artifact. The checkpoint was produced from the sauti-tts training stack, which provides:

  • dataset preparation
  • model wrapping for F5-TTS
  • training scripts
  • Modal training and inference launchers
  • evaluation utilities

Licensing

This Hub repo is intentionally marked as license: other because:

  • the uploaded artifact is a derived model checkpoint
  • the checkpoint is based on F5-TTS
  • the training data comes from WaxalNLP

You must comply with the obligations attached to the upstream model and dataset in addition to any licensing that applies to the surrounding repository code. See:

  • LICENSE
  • THIRD_PARTY_NOTICES.md

Citation

If you use this release, cite the project and the upstream work:

@misc{sauti_tts_2026,
  title={Sauti TTS: Swahili Text-to-Speech via F5-TTS Fine-tuning on WaxalNLP},
  author={MsingiAI},
  year={2026}
}

@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and others},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}

@misc{waxal2026,
  title={WAXAL: A Large-Scale Multilingual African Speech Corpus},
  author={Diack, Abdoulaye and others},
  year={2026},
  url={https://huggingface.co/datasets/google/WaxalNLP}
}