metadata
language:
- gu
license: cc-by-nc-4.0
tags:
- text-to-speech
- tts
- gujarati
- f5-tts
- flow-matching
- indic
pipeline_tag: text-to-speech
base_model: SWivid/F5-TTS
datasets:
- Arjun4707/gu-hi-tts
F5-TTS — Gujarati fine-tune
A fine-tuned F5-TTS model for Gujarati text-to-speech using flow matching.
Model details
| Attribute | Value |
|---|---|
| Base model | SWivid/F5-TTS (F5TTS_v1_Base) |
| Architecture | Diffusion Transformer with ConvNeXt V2 |
| Language | Gujarati (gu) |
| Training | |
| Tokenizer | Custom (extended for Gujarati characters) |
Training data
Fine-tuned on the Gujarati clips from Arjun4707/gu-hi-tts. Roughly 36K clips remain after filtering by characters per second (CPS) and duration.
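As a rough illustration of this kind of filter, the sketch below keeps a clip only if both its duration and its transcript rate fall inside a window. It assumes CPS means characters per second. The `Clip` class, the helper names, and all four thresholds are hypothetical, since the card does not state the values actually used.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the model card does not state the actual values.
MIN_SECONDS, MAX_SECONDS = 1.0, 30.0
MIN_CPS, MAX_CPS = 5.0, 25.0


@dataclass
class Clip:
    text: str          # transcript
    duration_s: float  # audio length in seconds


def chars_per_second(clip: Clip) -> float:
    """Non-whitespace Unicode code points per second of audio.

    Gujarati vowel signs and the virama count as separate code points here.
    """
    n_chars = sum(1 for ch in clip.text if not ch.isspace())
    return n_chars / clip.duration_s


def keep(clip: Clip) -> bool:
    """True if the clip passes both the duration and the CPS window."""
    if not (MIN_SECONDS <= clip.duration_s <= MAX_SECONDS):
        return False
    return MIN_CPS <= chars_per_second(clip) <= MAX_CPS


clips = [
    Clip("નમસ્તે, તમે કેમ છો?", 2.0),  # plausible speaking rate
    Clip("હા", 8.0),                    # far too little text for 8 s of audio
    Clip("આ એક લાંબુ વાક્ય છે", 0.4),   # shorter than MIN_SECONDS
]
kept = [c for c in clips if keep(c)]
```

A very low CPS usually points to long silences or a truncated transcript, and a very high CPS to a transcript that covers more than the audio, which is why both ends of the window are checked.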
Data source: audio clips scraped from publicly available YouTube videos, then preprocessed to 24 kHz mono PCM-16, silence-trimmed, and peak-normalized to -3 dBFS.
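The peak-normalization and PCM-16 steps can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, samples are assumed to be floats in [-1, 1], and resampling and silence trimming are left out.

```python
import math

TARGET_DBFS = -3.0  # peak target stated in the model card


def peak_normalize(samples, target_dbfs=TARGET_DBFS):
    """Scale float samples so the largest absolute value sits at target_dbfs."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    target_amplitude = 10 ** (target_dbfs / 20)  # -3 dBFS is about 0.708
    gain = target_amplitude / peak
    return [s * gain for s in samples]


def to_pcm16(samples):
    """Quantize float samples in [-1, 1] to signed 16-bit integers, with clipping."""
    return [max(-32768, min(32767, round(s * 32767))) for s in samples]


# 10 ms of a 440 Hz tone at 24 kHz, deliberately quiet (amplitude 0.1)
sr = 24_000
tone = [0.1 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr // 100)]
normalized = peak_normalize(tone)
pcm = to_pcm16(normalized)
```

Peak normalization fixes the loudest sample rather than perceived loudness, so clips with different dynamics can still sound uneven after this step.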
Known limitations
- Single-speaker clips produce good quality; multi-speaker clips in the training data caused artifacts such as failing to stop and babbling
- Total generation length (prompt + generated) capped at ~30 seconds
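One way to stay under the length cap is to estimate the duration of the new text from the reference clip and split long inputs into chunks. The sketch below assumes the generated speech keeps the reference clip's characters-per-second rate, which is a rough heuristic and not necessarily how F5-TTS computes duration. The function names are hypothetical, and the 30-second constant mirrors the approximate cap listed above.

```python
MAX_TOTAL_SECONDS = 30.0  # approximate cap from the limitation above


def estimate_generated_seconds(ref_seconds, ref_text, gen_text):
    """Assume the new text is spoken at the reference clip's characters-per-second rate."""
    return ref_seconds * len(gen_text) / len(ref_text)


def fits_budget(ref_seconds, ref_text, gen_text, cap=MAX_TOTAL_SECONDS):
    """True if prompt plus estimated generated audio stays within the cap."""
    total = ref_seconds + estimate_generated_seconds(ref_seconds, ref_text, gen_text)
    return total <= cap


def split_to_fit(ref_seconds, ref_text, gen_text, cap=MAX_TOTAL_SECONDS):
    """Greedily pack whitespace-separated words into chunks that fit the cap.

    A single word that exceeds the budget on its own is still emitted as a chunk.
    """
    chunks, current = [], ""
    for word in gen_text.split():
        candidate = f"{current} {word}".strip()
        if current and not fits_budget(ref_seconds, ref_text, candidate, cap):
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Under this estimate a shorter reference clip leaves more of the budget for generated speech, so long passages benefit from a short prompt as well as from chunking.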
Training code
Full training pipeline and troubleshooting: BhammarArjun/TTS_2_training
License
CC-BY-NC-4.0 — Non-commercial use only.
The base F5-TTS model is licensed under CC-BY-NC-4.0 (it was trained on the Emilia in-the-wild dataset). Our fine-tuning data was likewise sourced from YouTube audio.
Citation
@misc{arjun2026f5ttsgu,
title = {F5-TTS fine-tuned for Gujarati},
author = {Arjun Bhammar},
year = {2026},
url = {https://huggingface.co/Arjun4707/F5-TTS-Gujarati}
}
Acknowledgements
- SWivid / F5-TTS for the base model and training framework