F5-TTS: Gujarati fine-tune
A fine-tuned F5-TTS model for Gujarati text-to-speech using flow matching.
Model details
| Attribute | Value |
|---|---|
| Base model | SWivid/F5-TTS (F5TTS_v1_Base) |
| Architecture | Diffusion Transformer with ConvNeXt V2 |
| Language | Gujarati (gu) |
| Training | |
| Tokenizer | Custom (extended for Gujarati characters) |
Training data
Fine-tuned on Gujarati clips from Arjun4707/gu-hi-tts (~36K clips remaining after filtering by characters per second (CPS) and by duration).
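The CPS and duration filter can be illustrated with a minimal sketch. The thresholds, the `Clip` type, and the function names below are hypothetical and are not taken from the actual training pipeline. CPS is computed here as Unicode code points per second, so Gujarati vowel signs and the virama each count as one character.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    text: str          # transcript of the clip
    duration_s: float  # audio length in seconds


def chars_per_second(clip: Clip) -> float:
    """Speaking-rate proxy: Unicode code points per second of audio."""
    return len(clip.text.strip()) / clip.duration_s


def keep_clip(
    clip: Clip,
    min_dur: float = 1.0,   # hypothetical bounds, tune for your data
    max_dur: float = 20.0,
    min_cps: float = 5.0,
    max_cps: float = 25.0,
) -> bool:
    """Keep a clip only if both its duration and its CPS are in range."""
    # Duration is checked first, so the division below never sees 0 seconds.
    if not (min_dur <= clip.duration_s <= max_dur):
        return False
    return min_cps <= chars_per_second(clip) <= max_cps


clips = [
    Clip("નમસ્તે, તમે કેમ છો?", 2.0),      # plausible speaking rate
    Clip("હા", 6.0),                      # far too little text for 6 s
    Clip("આ એક પરીક્ષણ વાક્ય છે.", 0.4),   # clip too short
]
kept = [c for c in clips if keep_clip(c)]
```

A very low CPS usually points to long silences or a truncated transcript, and a very high CPS to a transcript that covers more than the audio, which is why both bounds are useful.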
Data source: audio clips scraped from publicly available YouTube videos. Each clip was preprocessed to 24 kHz mono 16-bit PCM, silence-trimmed, and peak-normalized to -3 dBFS.
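As a rough illustration of the trimming and normalization steps, here is a dependency-free sketch that works on a list of float samples in [-1.0, 1.0]. The function names and the silence threshold are assumptions, not the code used for this model, and resampling to 24 kHz and the mono mixdown are not shown. A real pipeline would more likely use an audio library.

```python
import math

TARGET_DBFS = -3.0  # peak level stated above


def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return []
    return samples[loud[0]: loud[-1] + 1]


def peak_normalize(samples, target_dbfs=TARGET_DBFS):
    """Scale samples so that the largest magnitude sits at target_dbfs."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    target_peak = 10.0 ** (target_dbfs / 20.0)  # dBFS to linear amplitude
    gain = target_peak / peak
    return [s * gain for s in samples]


def to_pcm16(samples):
    """Convert floats in [-1, 1] to clamped signed 16-bit integers."""
    return [max(-32768, min(32767, round(s * 32767))) for s in samples]


raw = [0.0, 0.0, 0.1, -0.25, 0.2, 0.0]
trimmed = trim_silence(raw)
normalized = peak_normalize(trimmed)
pcm = to_pcm16(normalized)
peak_dbfs = 20.0 * math.log10(max(abs(s) for s in normalized))
```

Normalizing after trimming means the gain is computed only from the audio that is actually kept.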
Known limitations
- Single-speaker clips produce good quality; multi-speaker clips in the training data caused stopping and babbling artifacts
- Total generation length (prompt + generated) capped at ~30 seconds
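One way to stay under the ~30 second cap is to split long input text into chunks and synthesize each chunk separately. The sketch below is an assumption-laden illustration: it estimates the speaking rate from the prompt (characters per second) and assumes the generated speech follows the same rate. The helper names are hypothetical and are not part of the F5-TTS API.

```python
MAX_TOTAL_S = 30.0  # approximate cap on prompt + generated audio


def max_gen_chars(prompt_s, prompt_chars, max_total_s=MAX_TOTAL_S):
    """Estimate how many characters fit in the time left after the prompt."""
    if prompt_s <= 0 or prompt_chars <= 0:
        raise ValueError("prompt duration and length must be positive")
    rate = prompt_chars / prompt_s               # chars per second in the prompt
    remaining = max(0.0, max_total_s - prompt_s)  # seconds left for generation
    return int(rate * remaining)


def chunk_text(text, limit):
    """Greedily pack whitespace-separated words into chunks of <= limit chars.

    A single word longer than limit becomes its own (oversized) chunk.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


budget = max_gen_chars(prompt_s=6.0, prompt_chars=60)
demo_chunks = chunk_text("a b c d", 3)
```

A shorter prompt leaves more of the budget for generated speech, so a reference clip of a few seconds is usually the better trade-off under this estimate.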
Training code
Full training pipeline and troubleshooting: BhammarArjun/TTS_2_training
License
CC-BY-NC-4.0: non-commercial use only.
The base F5-TTS model is CC-BY-NC-4.0 (trained on the Emilia in-the-wild dataset). Our fine-tuning data was also sourced from YouTube audio.
Citation
@misc{arjun2026f5ttsgu,
title = {F5-TTS fine-tuned for Gujarati},
author = {Arjun Bhammar},
year = {2026},
url = {https://huggingface.co/Arjun4707/F5-TTS-Gujarati}
}
Acknowledgements
- SWivid / F5-TTS for the base model and training framework