F5-TTS v1 - Nigerian Multilingual
A text-to-speech model for Nigerian languages based on F5-TTS, fine-tuned for natural speech synthesis in Igbo, Hausa, Yoruba, Nigerian Pidgin, and Nigerian English.
Supported Languages
| Code | Language | Native Name |
|---|---|---|
| ig | Igbo | Asụsụ Igbo |
| ha | Hausa | Harshen Hausa |
| yo | Yoruba | Èdè Yorùbá |
| pcm | Nigerian Pidgin | Naija |
| en | Nigerian English | - |
Quick Start
Installation
pip install f5-tts
Inference
from f5_tts.api import F5TTS

# Initialize with the Nigerian checkpoint and its vocabulary
tts = F5TTS(
    ckpt_file="path/to/model.pt",
    vocab_file="path/to/vocab.txt",
)

# Generate speech; infer() returns the waveform, its sample rate, and the spectrogram
wav, sr, spec = tts.infer(
    ref_file="reference.wav",  # 5-15 second reference audio
    ref_text="Reference transcript",
    gen_text="Text to synthesize",
    file_wave="output.wav",  # optional: also write the 24 kHz audio to disk
)
Using with Hugging Face
from huggingface_hub import hf_hub_download
# Download model
model_path = hf_hub_download(repo_id="babs/f5tts-v1", filename="model.pt")
vocab_path = hf_hub_download(repo_id="babs/f5tts-v1", filename="vocab.txt")
# Use with F5-TTS
from f5_tts.api import F5TTS
tts = F5TTS(ckpt_file=model_path, vocab_file=vocab_path)
Model Details
Architecture
- Base: F5-TTS DiT (Diffusion Transformer)
- Vocoder: Vocos (24kHz)
- Tokenizer: Character-level (2,624 tokens)
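Because the tokenizer is character-level, any character missing from the vocabulary cannot be represented as intended. The following is a minimal sketch for spotting such characters before synthesis. It assumes `vocab.txt` holds one token per line and that text is compared in NFC-normalized form; neither is confirmed by this card. The helper names `load_vocab` and `unknown_chars` and the toy vocabulary are hypothetical.

```python
import unicodedata


def load_vocab(path):
    """Read a vocabulary file, assuming one token per line (unverified assumption)."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f}


def unknown_chars(text, vocab, form="NFC"):
    """Return the sorted characters of the normalized text that are not in vocab."""
    normalized = unicodedata.normalize(form, text)
    return sorted({ch for ch in normalized if ch not in vocab})


# Toy vocabulary for illustration only; a real check would use load_vocab("vocab.txt").
# \u1ee5 is "ụ" (u with dot below).
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ") | {"\u1ee5"}

print(unknown_chars("as\u1ee5s\u1ee5 igbo", toy_vocab))  # every character is covered
print(unknown_chars("e\u0300de\u0300", toy_vocab))       # "è" is missing from the toy set
```

If the check reports missing characters, try the other normalization form (`"NFD"`) first, since the form used to build the vocabulary is not stated here.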
Training
| Parameter | Value |
|---|---|
| Base Model | F5TTS_v1_Base (1.25M steps) |
| Fine-tuning Steps | 151,540 |
| Final Loss | 0.731 |
| Hardware | 2x NVIDIA A100-SXM4-80GB |
| Precision | bf16 |
| Learning Rate | 5e-5 |
| Batch Size | 51,200 frames/GPU |
| Epochs | 20 |
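The batch size is given in mel frames rather than utterances, so it helps to convert it to seconds of audio. The sketch below does that arithmetic under two assumptions not stated in this card: a hop length of 256 samples (the upstream F5-TTS default) and no gradient accumulation.

```python
SAMPLE_RATE = 24_000     # Hz, matching the 24kHz Vocos vocoder
HOP_LENGTH = 256         # assumption: upstream F5-TTS default, not stated in this card
FRAMES_PER_GPU = 51_200  # from the training table
NUM_GPUS = 2             # from the training table


def frames_to_seconds(frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Convert a mel-frame count to seconds of audio."""
    return frames * hop_length / sample_rate


per_gpu = frames_to_seconds(FRAMES_PER_GPU)
per_step = per_gpu * NUM_GPUS  # assumes no gradient accumulation
print(f"{per_gpu:.1f} s per GPU, {per_step:.1f} s per step")
# prints: 546.1 s per GPU, 1092.3 s per step
```

Under these assumptions each step sees at most about 18 minutes of audio across both GPUs; actual batches may be smaller if they are not filled to the frame limit.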
Dataset
Trained on an internal Nigerian-language speech corpus covering:
- Multiple speakers per language
- Diverse recording conditions
- Natural conversational and read speech
Usage Tips
- Reference Audio: Use 5-15 seconds of clear speech from your target speaker
- Reference Text: Provide accurate transcription of the reference audio
- Language Mixing: The model handles code-switching between supported languages
- Punctuation: Include punctuation for natural prosody
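The 5-15 second guideline for reference audio can be checked before inference. This is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files only; the helper names `wav_duration_seconds` and `check_reference` are hypothetical and not part of F5-TTS.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_reference(path, min_s=5.0, max_s=15.0):
    """Return (ok, duration), where ok is True if the clip is within the recommended range."""
    duration = wav_duration_seconds(path)
    return min_s <= duration <= max_s, duration
```

A typical call would be `ok, seconds = check_reference("reference.wav")`, trimming or re-recording the clip when `ok` is false.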
Limitations
- Output quality depends on how closely the reference audio matches the target speaker's characteristics
- May struggle with very long sentences (>50 words)
- Quality may drop for tonal languages (Igbo, Yoruba, Hausa) when the input text lacks tone marks
- Not designed for real-time streaming (use dedicated streaming models for <100ms latency)
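One way to work around the long-sentence limitation is to split the input into chunks of at most 50 words and synthesize each chunk separately. The sketch below groups whole sentences where possible and falls back to a hard split on word count for a single oversized sentence. The `split_text` helper is hypothetical and is not part of F5-TTS; the sentence splitting is a simple punctuation-based heuristic.

```python
import re


def split_text(text, max_words=50):
    """Split text into chunks of at most max_words words, preferring sentence boundaries."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        # A single sentence longer than the limit is cut on word count.
        while len(words) > max_words:
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be passed as `gen_text` in its own inference call and the resulting waveforms concatenated. Hard splits inside a sentence may produce unnatural prosody at the boundary.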
Files
| File | Description | Size |
|---|---|---|
| model.pt | Model weights | 5.1 GB |
| vocab.txt | Character vocabulary | 12 KB |
| config.json | Model configuration | 1 KB |
| samples/ | Audio samples from training | ~2 MB |
Citation
If you use this model, please cite:
@misc{f5tts-nigerian-2026,
title={F5-TTS Nigerian Multilingual},
author={Spitch AI},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/babs/f5tts-v1}
}
License
MIT License - See the base F5-TTS repository for details.
Contact
For questions or issues, contact the Spitch AI team.