A Swedish text-to-speech (TTS) model fine-tuned from F5-TTS on approximately 200 hours of speech drawn from the Common Voice dataset and from parliamentary recordings in the RixVox dataset. Training was run locally on an RTX 4080.

Dataset preparation scripts can be found at https://github.com/ChiliOlavi/F5-TTS/tree/swedish-tts

Training Configuration

    --exp_name F5TTS_v1_Base
    --learning_rate 0.0001
    --batch_size_per_gpu 2000
    --batch_size_type frame
    --max_samples 96
    --grad_accumulation_steps 16
    --max_grad_norm 0.3
    --epochs 100
    --num_warmup_updates 3000
    --save_per_updates 10000
    --keep_last_n_checkpoints -1
    --last_per_updates 5000
    --tokenizer pinyin
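
Because the batch size is counted in frames and gradients are accumulated over several steps, the amount of audio behind each optimizer update is not obvious from the flags alone. The sketch below works it out. It assumes one GPU and mel-spectrogram settings of 24 kHz audio with a hop length of 256 samples; these values are not stated in this card and should be checked against the actual training setup.

```python
# Values copied from the training configuration above.
batch_size_per_gpu = 2000        # frames per GPU per step (batch_size_type = frame)
grad_accumulation_steps = 16

# Assumptions, not stated in this card.
num_gpus = 1                     # assumed: one RTX 4080
sample_rate = 24_000             # assumed audio sample rate in Hz
hop_length = 256                 # assumed mel-spectrogram hop in samples

# Frames that contribute to a single optimizer update.
frames_per_update = batch_size_per_gpu * grad_accumulation_steps * num_gpus

# Each mel frame covers hop_length / sample_rate seconds of audio.
seconds_per_frame = hop_length / sample_rate
audio_seconds_per_update = frames_per_update * seconds_per_frame

print(frames_per_update)
print(round(audio_seconds_per_update, 1))
```

Under these assumptions each update covers 32,000 frames, which is a little under six minutes of audio.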

Inference Parameters

{dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4}
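
The brace notation above is not valid Python as written. A minimal sketch of the same settings as a dictionary, suitable for passing as keyword arguments to a model constructor, might look like the following. The two derived values are illustrative only: they assume the attention width is split evenly across heads and that the feed-forward hidden width is `dim * ff_mult`, neither of which is stated in this card.

```python
# Values copied from the "Inference Parameters" line above.
model_cfg = dict(
    dim=1024,
    depth=22,
    heads=16,
    ff_mult=2,
    text_dim=512,
    conv_layers=4,
)

# Derived sizes, under the assumptions named in the lead-in.
head_dim = model_cfg["dim"] // model_cfg["heads"]       # per-head width
ff_inner_dim = model_cfg["dim"] * model_cfg["ff_mult"]  # feed-forward hidden width

print(head_dim)
print(ff_inner_dim)
```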

Thanks

Special thanks to Amos Wallgren for quality assurance.

Model tree for EkhoCollective/f5-tts-swedish

Base model: SWivid/F5-TTS (this model is a fine-tune)
