mozilla-foundation/common_voice_17_0
A Swedish text-to-speech model fine-tuned from F5-TTS on approximately 200 hours of speech drawn from the Common Voice dataset and from parliamentary recordings in the RixVox dataset. Training was done locally on an RTX 4080.
Dataset preparation scripts are available at https://github.com/ChiliOlavi/F5-TTS/tree/swedish-tts
Training arguments:
- --exp_name
- F5TTS_v1_Base
- --learning_rate
- "0.0001"
- --batch_size_per_gpu
- "2000"
- --batch_size_type
- frame
- --max_samples
- "96"
- --grad_accumulation_steps
- "16"
- --max_grad_norm
- "0.3"
- --epochs
- "100"
- --num_warmup_updates
- "3000"
- --save_per_updates
- "10000"
- --keep_last_n_checkpoints
- "-1"
- --last_per_updates
- "5000"
- --tokenizer
- pinyin
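As a rough illustration of what the arguments above imply, the sketch below pairs each flag with its value, computes how many mel frames contribute to one optimizer update, and joins the list into a single command-line string. It assumes a single GPU, and it assumes 24 kHz audio with a mel hop length of 256 samples; neither audio setting is stated in this card, so treat the duration figure as an estimate only.

```python
import shlex

# Flag/value pairs copied from the argument list above.
ARGS = [
    "--exp_name", "F5TTS_v1_Base",
    "--learning_rate", "0.0001",
    "--batch_size_per_gpu", "2000",
    "--batch_size_type", "frame",
    "--max_samples", "96",
    "--grad_accumulation_steps", "16",
    "--max_grad_norm", "0.3",
    "--epochs", "100",
    "--num_warmup_updates", "3000",
    "--save_per_updates", "10000",
    "--keep_last_n_checkpoints", "-1",
    "--last_per_updates", "5000",
    "--tokenizer", "pinyin",
]


def as_dict(args):
    """Pair each --flag with the value that follows it."""
    return {args[i].lstrip("-"): args[i + 1] for i in range(0, len(args), 2)}


cfg = as_dict(ARGS)

# With batch_size_type=frame, the batch size counts mel frames, not utterances.
# Gradient accumulation multiplies the frames seen per optimizer update.
# ASSUMPTION: one GPU, so no extra multiplier for data parallelism.
frames_per_update = int(cfg["batch_size_per_gpu"]) * int(cfg["grad_accumulation_steps"])

# ASSUMPTION: 24 kHz audio and a hop length of 256 samples (not stated in this card).
SAMPLE_RATE = 24_000
HOP_LENGTH = 256
seconds_per_update = frames_per_update * HOP_LENGTH / SAMPLE_RATE

# The same arguments as one shell-safe string, to append to a training command.
command_tail = shlex.join(ARGS)

if __name__ == "__main__":
    print(f"frames per optimizer update: {frames_per_update}")
    print(f"approx. audio per update: {seconds_per_update:.1f} s")
    print(command_tail)
```

Under these assumptions, each optimizer update sees 32,000 frames, or a little under six minutes of audio. The small per-GPU batch combined with 16 accumulation steps is a common way to fit training on a single consumer card.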
Model configuration: {dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4}
Special thanks to Amos Wallgren for quality assurance.
Base model: SWivid/F5-TTS