F5-TTS — Gujarati fine-tune

A fine-tuned F5-TTS model for Gujarati text-to-speech using flow matching.

Model details

Attribute	Value
Base model	SWivid/F5-TTS (F5TTS_v1_Base)
Architecture	Diffusion Transformer with ConvNeXt V2
Language	Gujarati (`gu`)
Training	~150K steps (~21 epochs) on NVIDIA L4 (24 GB)
Tokenizer	Custom (extended for Gujarati characters)

Training data

Fine-tuned on Gujarati clips from Arjun4707/gu-hi-tts (~36K clips remaining after filtering by characters per second (CPS) and clip duration).
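As a rough illustration of the filtering step, the sketch below keeps a clip only if its duration and its characters-per-second rate both fall inside a window. The `Clip` type, the function names, and every threshold are hypothetical; the actual values used for this model are not stated here. Counting with `len()` counts Unicode code points, so Gujarati vowel signs count as separate characters, which is also an assumption about how CPS was measured.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    text: str          # transcript of the clip
    duration_s: float  # audio length in seconds


def chars_per_second(clip: Clip) -> float:
    """Characters (Unicode code points) spoken per second of audio."""
    return len(clip.text.strip()) / clip.duration_s


def keep_clip(
    clip: Clip,
    min_dur: float = 1.0,   # hypothetical lower duration bound (seconds)
    max_dur: float = 30.0,  # hypothetical upper duration bound (seconds)
    min_cps: float = 5.0,   # hypothetical: slower suggests long silences or a bad transcript
    max_cps: float = 25.0,  # hypothetical: faster suggests a truncated clip or extra text
) -> bool:
    # Check duration first so the CPS division never sees a zero-length clip.
    if not (min_dur <= clip.duration_s <= max_dur):
        return False
    return min_cps <= chars_per_second(clip) <= max_cps
```

A CPS filter of this kind is a cheap way to catch transcript/audio mismatches without listening to every clip.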

Data source: Audio clips scraped from publicly available YouTube videos. Preprocessed to 24 kHz mono PCM-16, silence-trimmed, and peak-normalized to -3 dBFS.
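The peak-normalization and PCM-16 steps can be sketched with NumPy as follows. This is a minimal example, not the project's preprocessing code: it assumes the audio is already a mono float array in [-1, 1] at the target sample rate, and the function names are hypothetical. Resampling and silence trimming are left out.

```python
import numpy as np


def peak_normalize(audio: np.ndarray, target_dbfs: float = -3.0) -> np.ndarray:
    """Scale audio so its largest absolute sample sits at target_dbfs."""
    peak = np.max(np.abs(audio))
    if peak == 0:
        # Pure silence: nothing to scale.
        return audio.copy()
    # dBFS to linear amplitude: -3 dBFS is about 0.708 of full scale.
    target_amp = 10.0 ** (target_dbfs / 20.0)
    return audio * (target_amp / peak)


def to_pcm16(audio: np.ndarray) -> np.ndarray:
    """Convert float audio in [-1, 1] to 16-bit signed integer samples."""
    return (np.clip(audio, -1.0, 1.0) * 32767.0).astype(np.int16)
```

Normalizing to a peak below 0 dBFS leaves headroom, so later processing is less likely to clip.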

Known limitations

Single-speaker clips produce good quality; multi-speaker clips in training data caused stopping/blabbering artifacts
Total generation length (prompt + generated) capped at ~30 seconds
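One way to work within the ~30 second cap is to split long input into chunks whose estimated speech duration, added to the reference prompt, stays under the limit. The sketch below does this greedily by character count. The function name, the assumed speaking rate (`chars_per_s`), and the idea of pre-splitting into sentences are all assumptions for illustration, not part of this model's tooling.

```python
def chunk_text(
    sentences: list[str],
    prompt_s: float,
    chars_per_s: float = 12.0,  # hypothetical speaking rate; measure it on your prompt
    max_total_s: float = 30.0,  # approximate cap on prompt + generated audio
) -> list[str]:
    """Greedily group sentences so each chunk fits the remaining time budget."""
    budget_chars = int((max_total_s - prompt_s) * chars_per_s)
    if budget_chars <= 0:
        raise ValueError("prompt audio already uses the whole duration budget")

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > budget_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than the budget is kept as its own chunk.
    return chunks
```

Each chunk can then be synthesized separately and the resulting audio concatenated. A shorter reference prompt leaves more of the budget for generated speech.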

Training code

Full training pipeline and troubleshooting: BhammarArjun/TTS_2_training

License

CC-BY-NC-4.0 — Non-commercial use only.

The base F5-TTS model is CC-BY-NC-4.0 (trained on the Emilia in-the-wild dataset). Our fine-tuning data was also sourced from YouTube audio.

Citation

@misc{arjun2026f5ttsgu,
  title   = {F5-TTS fine-tuned for Gujarati},
  author  = {Arjun Bhammar},
  year    = {2026},
  url     = {https://huggingface.co/Arjun4707/F5-TTS-Gujarati}
}

Acknowledgements

SWivid / F5-TTS for the base model and training framework
