Sauti TTS
Sauti TTS is a Swahili text-to-speech checkpoint released by MsingiAI and built on top of F5-TTS v1 Base. This checkpoint is intended for research and development workflows involving Swahili speech synthesis and reference-audio-conditioned voice transfer.
This upload is a model checkpoint package, not a standalone Python library.
It is designed to be used with the sauti-tts inference code and the upstream
F5-TTS stack.
Model Summary
- Model name: Sauti TTS
- Developer: MsingiAI
- Primary language: Swahili
- Base model family: F5-TTS v1 Base
- Vocoder path: Vocos via the F5-TTS stack
- Task: text-to-speech
- Conditioning mode: text + short reference audio
Release Snapshot
This Hub release contains:
- model_last.pt
- vocab.txt
- training_config.json
- LICENSE
- THIRD_PARTY_NOTICES.md
The uploaded checkpoint was taken from the Modal checkpoint volume path
/sauti_tts_multi/model_last.pt and corresponds to a full fine-tuning run of
the multi-GPU recipe. The checkpoint includes:
- model weights
- EMA weights
- optimizer state
- scheduler state
The checkpoint metadata reports update = 15350.
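Because the upload is a full training checkpoint rather than a stripped export, it can be reduced to EMA weights for inference-only use. A minimal sketch, assuming the checkpoint is a dict with keys like `ema_model_state_dict` and `update` (a typical F5-TTS layout; the exact key names are not confirmed by this card):

```python
# Sketch: strip a full training checkpoint down to EMA weights only,
# producing a much smaller inference-only file.
# Key names ("ema_model_state_dict", "update") are assumptions based on
# common F5-TTS training checkpoints; adjust to the actual file.
import torch

def strip_to_ema(src_path: str, dst_path: str) -> int:
    """Save an EMA-only copy of a training checkpoint; return its update count."""
    ckpt = torch.load(src_path, map_location="cpu", weights_only=True)
    slim = {
        "ema_model_state_dict": ckpt["ema_model_state_dict"],
        "update": ckpt.get("update", -1),
    }
    torch.save(slim, dst_path)
    return slim["update"]
```

Dropping the optimizer and scheduler state this way does not affect inference, since only the (EMA) model weights are used at generation time.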
Training Data
The model was trained on the Google WaxalNLP Swahili TTS subset
(swa_tts), after local preparation and export into an F5-TTS-compatible
format.
Prepared subset statistics for the run artifacts used in this release:
- Total prepared utterances: 1,245
- Total prepared audio: 4.20 hours
- Speakers: 7
- Gender distribution: 696 female / 549 male utterances
- Average utterance duration: 12.15 seconds
- Min / max utterance duration: 2.94 s / 29.95 s
Observed split sizes:
- Train: 976 utterances, 3.31 hours
- Validation: 133 utterances, 0.44 hours
- Test: 136 utterances, 0.45 hours
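The reported figures are internally consistent and can be cross-checked directly, assuming the published numbers are exact:

```python
# Cross-check the prepared-subset statistics reported above.
utts_female, utts_male = 696, 549
total_utts = 1245
avg_dur_s = 12.15
total_hours = 4.20

# Gender counts should sum to the utterance total.
assert utts_female + utts_male == total_utts

# Average duration times utterance count should reproduce total audio.
derived_hours = total_utts * avg_dur_s / 3600  # ~4.2019 h
assert abs(derived_hours - total_hours) < 0.01

# Split sizes should sum back to the prepared totals.
splits = {"train": (976, 3.31), "val": (133, 0.44), "test": (136, 0.45)}
assert sum(n for n, _ in splits.values()) == total_utts
assert abs(sum(h for _, h in splits.values()) - total_hours) < 0.01
```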
Data preparation in the sauti-tts project includes:
- resampling
- silence trimming
- loudness normalization
- Swahili text normalization
- F5-TTS-compatible metadata export
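Two of the signal-level steps above can be sketched in a few lines of numpy. The threshold and target level here are illustrative defaults, not the values used by the sauti-tts pipeline:

```python
# Illustrative versions of two preparation steps: threshold-based
# silence trimming and peak loudness normalization.
import numpy as np

def trim_silence(wav: np.ndarray, threshold: float = 1e-3) -> np.ndarray:
    """Drop leading/trailing samples whose magnitude is below threshold."""
    voiced = np.nonzero(np.abs(wav) > threshold)[0]
    if voiced.size == 0:
        return wav[:0]
    return wav[voiced[0] : voiced[-1] + 1]

def normalize_peak(wav: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale the waveform so its maximum absolute sample equals `peak`."""
    m = np.max(np.abs(wav))
    return wav if m == 0 else wav * (peak / m)
```

A production pipeline would typically use energy-windowed trimming and LUFS-based loudness instead, but the shape of the transform is the same.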
Training Configuration
This checkpoint corresponds to the multi-GPU training recipe:
- learning_rate = 2e-5
- batch_size_per_gpu = 2000 (frames)
- num_warmup_updates = 300
- mixed_precision = bf16
- use_ema = true
- ema_decay = 0.9999
The release also includes the exact training_config.json used for the run.
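Since the exact config ships with the release, the recipe values above can be checked against it. A small sketch, assuming `training_config.json` is a flat JSON object with keys matching the names listed in this card (adjust if the real file nests them differently):

```python
# Sketch: compare the bundled training_config.json against the recipe
# values listed in this card. Key names are assumed to be flat and to
# match the card; the actual file layout may differ.
import json

EXPECTED = {
    "learning_rate": 2e-5,
    "batch_size_per_gpu": 2000,
    "num_warmup_updates": 300,
    "mixed_precision": "bf16",
    "use_ema": True,
    "ema_decay": 0.9999,
}

def check_recipe(path: str) -> list:
    """Return the names of recipe keys whose values differ from EXPECTED."""
    with open(path) as f:
        cfg = json.load(f)
    return [k for k, v in EXPECTED.items() if cfg.get(k) != v]
```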
Intended Use
This checkpoint is intended for:
- Swahili TTS research
- speech generation experiments for African language technology
- reference-audio-conditioned voice transfer experiments
- benchmarking and reproducibility work around F5-TTS fine-tuning
Out-of-Scope Use
This checkpoint is not intended for:
- impersonation, fraud, or deception
- biometric identification or identity claims
- safety-critical systems
- any deployment that ignores upstream license or dataset obligations
Limitations
- Output quality depends strongly on reference audio quality and the accuracy of the reference transcript.
- This release is focused on Swahili; quality outside Swahili has not been established.
- Very short pauses and waveform boundaries may still benefit from cleanup during inference.
- This is a full training checkpoint package, so the file is significantly larger than a stripped inference-only export.
- This release does not yet include benchmark tables or objective evaluation results in the Hub repo itself.
How to Use
1. Clone the inference code
Use this checkpoint with the sauti-tts codebase and the upstream F5-TTS
dependency.
git clone https://github.com/Msingi-AI/sauti-tts
cd sauti-tts
pip install -r requirements.txt
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS && pip install -e . && cd ..
2. Download the checkpoint and vocab
hf download msingiai/sauti-tts model_last.pt --local-dir ckpts/sauti_tts_multi
hf download msingiai/sauti-tts vocab.txt --local-dir data/waxal_swahili
3. Run inference
python scripts/inference.py \
--checkpoint ckpts/sauti_tts_multi/model_last.pt \
--vocab_path data/waxal_swahili/vocab.txt \
--text "Habari, karibu kwenye Sauti TTS." \
--ref_audio path/to/reference.wav \
--ref_text "Habari, karibu kwenye Sauti TTS." \
--output outputs/generated.wav
The inference script also supports:
- long-text chunking
- configurable sampling steps
- configurable guidance strength
- batch generation from a text file
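The long-text chunking option can be illustrated with a simple sentence-boundary splitter that packs sentences into chunks under a character budget. This mirrors the idea, not the script's actual implementation:

```python
# Illustrative long-text chunking: split on sentence-ending punctuation,
# then greedily pack sentences into chunks under a character budget.
# Not the sauti-tts implementation; a sketch of the same idea.
import re

def chunk_text(text: str, max_chars: int = 200) -> list:
    """Split text into chunks of at most max_chars, on sentence boundaries."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip() if current else sent
    if current:
        chunks.append(current)
    return chunks
```

Chunking at sentence boundaries matters for TTS because mid-sentence cuts produce unnatural prosody at chunk joins.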
Repository Notes
This model card describes the Hub release artifact. The checkpoint was produced
from the sauti-tts training stack, which provides:
- dataset preparation
- model wrapping for F5-TTS
- training scripts
- Modal training and inference launchers
- evaluation utilities
Licensing
This Hub repo is intentionally marked as license: other.
Reason:
- the uploaded artifact is a derived model checkpoint
- the checkpoint is based on F5-TTS
- the training data comes from WaxalNLP
You must comply with the obligations attached to the upstream model and dataset in addition to any licensing that applies to the surrounding repository code. See:
- LICENSE
- THIRD_PARTY_NOTICES.md
Citation
If you use this release, cite the project and the upstream work:
@misc{sauti_tts_2026,
title={Sauti TTS: Swahili Text-to-Speech via F5-TTS Fine-tuning on WaxalNLP},
author={MsingiAI},
year={2026}
}
@article{chen2024f5tts,
title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
author={Chen, Yushen and others},
journal={arXiv preprint arXiv:2410.06885},
year={2024}
}
@misc{waxal2026,
title={WAXAL: A Large-Scale Multilingual African Speech Corpus},
author={Diack, Abdoulaye and others},
year={2026},
url={https://huggingface.co/datasets/google/WaxalNLP}
}