Sauti TTS
Sauti TTS is a Swahili text-to-speech checkpoint released by MsingiAI and built on top of F5-TTS v1 Base. This checkpoint is intended for research and development workflows involving Swahili speech synthesis and reference-audio-conditioned voice transfer.
This upload is a model checkpoint package, not a standalone Python library.
It is designed to be used with the sauti-tts inference code and the upstream
F5-TTS stack.
Model Summary
- Model name: Sauti TTS
- Developer: MsingiAI
- Primary language: Swahili
- Base model family: F5-TTS v1 Base
- Vocoder path: Vocos via the F5-TTS stack
- Task: text-to-speech
- Conditioning mode: text + short reference audio
Release Snapshot
This Hub release contains:
- model_last.pt
- vocab.txt
- training_config.json
- LICENSE
- THIRD_PARTY_NOTICES.md
The uploaded checkpoint was taken from the Modal checkpoint volume path
/sauti_tts_multi/model_last.pt and corresponds to a full fine-tuning run of
the multi-GPU recipe. The checkpoint includes:
- model weights
- EMA weights
- optimizer state
- scheduler state
The checkpoint metadata reports update = 15350.
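Because the upload is a full training checkpoint rather than a stripped export, it can be reduced to EMA weights for inference-only use. A minimal sketch, assuming the checkpoint is a dict with keys like `ema_model_state_dict` and `update` (a typical F5-TTS layout; the exact key names are not confirmed by this card):

```python
# Sketch: strip a full training checkpoint down to EMA weights only,
# producing a much smaller inference-only file.
# Key names ("ema_model_state_dict", "update") are assumptions based on
# common F5-TTS training checkpoints; adjust to the actual file.
import torch

def strip_to_ema(src_path: str, dst_path: str) -> int:
    """Save an EMA-only copy of a training checkpoint; return its update count."""
    ckpt = torch.load(src_path, map_location="cpu", weights_only=True)
    slim = {
        "ema_model_state_dict": ckpt["ema_model_state_dict"],
        "update": ckpt.get("update", -1),
    }
    torch.save(slim, dst_path)
    return slim["update"]
```

Dropping the optimizer and scheduler state this way does not affect inference, since only the (EMA) model weights are used at generation time.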
Training Data
The model was trained on the Google WaxalNLP Swahili TTS subset
(swa_tts), after local preparation and export into an F5-TTS-compatible
format.
Prepared subset statistics for the run artifacts used in this release:
- Total prepared utterances: 1,245
- Total prepared audio: 4.20 hours
- Speakers: 7
- Gender distribution: 696 female / 549 male utterances
- Average utterance duration: 12.15 seconds
- Min / max utterance duration: 2.94 s / 29.95 s
Observed split sizes:
- Train: 976 utterances, 3.31 hours
- Validation: 133 utterances, 0.44 hours
- Test: 136 utterances, 0.45 hours
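The reported figures are internally consistent and can be cross-checked directly, assuming the published numbers are exact:

```python
# Cross-check the prepared-subset statistics reported above.
utts_female, utts_male = 696, 549
total_utts = 1245
avg_dur_s = 12.15
total_hours = 4.20

# Gender counts should sum to the utterance total.
assert utts_female + utts_male == total_utts

# Average duration times utterance count should reproduce total audio.
derived_hours = total_utts * avg_dur_s / 3600  # ~4.2019 h
assert abs(derived_hours - total_hours) < 0.01

# Split sizes should sum back to the prepared totals.
splits = {"train": (976, 3.31), "val": (133, 0.44), "test": (136, 0.45)}
assert sum(n for n, _ in splits.values()) == total_utts
assert abs(sum(h for _, h in splits.values()) - total_hours) < 0.01
```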
Data preparation in the sauti-tts project includes:
- resampling
- silence trimming
- loudness normalization
- Swahili text normalization
- F5-TTS-compatible metadata export
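Two of the signal-level steps above can be sketched in a few lines of numpy. The threshold and target level here are illustrative defaults, not the values used by the sauti-tts pipeline:

```python
# Illustrative versions of two preparation steps: threshold-based
# silence trimming and peak loudness normalization.
import numpy as np

def trim_silence(wav: np.ndarray, threshold: float = 1e-3) -> np.ndarray:
    """Drop leading/trailing samples whose magnitude is below threshold."""
    voiced = np.nonzero(np.abs(wav) > threshold)[0]
    if voiced.size == 0:
        return wav[:0]
    return wav[voiced[0] : voiced[-1] + 1]

def normalize_peak(wav: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale the waveform so its maximum absolute sample equals `peak`."""
    m = np.max(np.abs(wav))
    return wav if m == 0 else wav * (peak / m)
```

A production pipeline would typically use energy-windowed trimming and LUFS-based loudness instead, but the shape of the transform is the same.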
Training Configuration
This checkpoint corresponds to the multi-GPU training recipe:
- learning_rate = 2e-5
- batch_size_per_gpu = 2000 (frames)
- num_warmup_updates = 300
- mixed_precision = bf16
- use_ema = true
- ema_decay = 0.9999
The release also includes the exact training_config.json used for the run.
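Since the exact config ships with the release, the recipe values above can be checked against it. A small sketch, assuming `training_config.json` is a flat JSON object with keys matching the names listed in this card (adjust if the real file nests them differently):

```python
# Sketch: compare the bundled training_config.json against the recipe
# values listed in this card. Key names are assumed to be flat and to
# match the card; the actual file layout may differ.
import json

EXPECTED = {
    "learning_rate": 2e-5,
    "batch_size_per_gpu": 2000,
    "num_warmup_updates": 300,
    "mixed_precision": "bf16",
    "use_ema": True,
    "ema_decay": 0.9999,
}

def check_recipe(path: str) -> list:
    """Return the names of recipe keys whose values differ from EXPECTED."""
    with open(path) as f:
        cfg = json.load(f)
    return [k for k, v in EXPECTED.items() if cfg.get(k) != v]
```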
Intended Use
This checkpoint is intended for:
- Swahili TTS research
- speech generation experiments for African language technology
- reference-audio-conditioned voice transfer experiments
- benchmarking and reproducibility work around F5-TTS fine-tuning
Out-of-Scope Use
This checkpoint is not intended for:
- impersonation, fraud, or deception
- biometric identification or identity claims
- safety-critical systems
- any deployment that ignores upstream license or dataset obligations
Limitations
- Output quality depends strongly on reference audio quality and the accuracy of the reference transcript.
- This release is focused on Swahili; quality outside Swahili has not been established.
- Very short pauses and waveform boundaries may still benefit from cleanup during inference.
- This is a full training checkpoint package, so the file is significantly larger than a stripped inference-only export.
- This release does not yet include benchmark tables or objective evaluation results in the Hub repo itself.
How to Use
1. Clone the inference code
Use this checkpoint with the sauti-tts codebase and the upstream F5-TTS
dependency.
git clone https://github.com/Msingi-AI/sauti-tts
cd sauti-tts
pip install -r requirements.txt
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS && pip install -e . && cd ..
2. Download the checkpoint and vocab
hf download msingiai/sauti-tts model_last.pt --local-dir ckpts/sauti_tts_multi
hf download msingiai/sauti-tts vocab.txt --local-dir data/waxal_swahili
3. Run inference
python scripts/inference.py \
--checkpoint ckpts/sauti_tts_multi/model_last.pt \
--vocab_path data/waxal_swahili/vocab.txt \
--text "Habari, karibu kwenye Sauti TTS." \
--ref_audio path/to/reference.wav \
--ref_text "Habari, karibu kwenye Sauti TTS." \
--output outputs/generated.wav
The inference script also supports:
- long-text chunking
- configurable sampling steps
- configurable guidance strength
- batch generation from a text file
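The long-text chunking option can be illustrated with a simple sentence-boundary splitter that packs sentences into chunks under a character budget. This mirrors the idea, not the script's actual implementation:

```python
# Illustrative long-text chunking: split on sentence-ending punctuation,
# then greedily pack sentences into chunks under a character budget.
# Not the sauti-tts implementation; a sketch of the same idea.
import re

def chunk_text(text: str, max_chars: int = 200) -> list:
    """Split text into chunks of at most max_chars, on sentence boundaries."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip() if current else sent
    if current:
        chunks.append(current)
    return chunks
```

Chunking at sentence boundaries matters for TTS because mid-sentence cuts produce unnatural prosody at chunk joins.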
Repository Notes
This model card describes the Hub release artifact. The checkpoint was produced
from the sauti-tts training stack, which provides:
- dataset preparation
- model wrapping for F5-TTS
- training scripts
- Modal training and inference launchers
- evaluation utilities
Licensing
This Hub repo is intentionally marked as license: other.
Reason:
- the uploaded artifact is a derived model checkpoint
- the checkpoint is based on F5-TTS
- the training data comes from WaxalNLP
You must comply with the obligations attached to the upstream model and dataset in addition to any licensing that applies to the surrounding repository code. See:
- LICENSE
- THIRD_PARTY_NOTICES.md
Citation
If you use this release, cite the project and the upstream work:
@misc{sauti_tts_2026,
title={Sauti TTS: Swahili Text-to-Speech via F5-TTS Fine-tuning on WaxalNLP},
author={MsingiAI},
year={2026}
}
@article{chen2024f5tts,
title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
author={Chen, Yushen and others},
journal={arXiv preprint arXiv:2410.06885},
year={2024}
}
@misc{waxal2026,
title={WAXAL: A Large-Scale Multilingual African Speech Corpus},
author={Diack, Abdoulaye and others},
year={2026},
url={https://huggingface.co/datasets/google/WaxalNLP}
}