---
language:
- dv
license: cc-by-nc-4.0
tags:
- tts
- text-to-speech
- f5-tts
- flow-matching
- dhivehi
- maldivian
- thaana
- voice-cloning
- zero-shot-tts
datasets:
- Serialtechlab/dhivehi-mms-v5-combined
- Serialtechlab/dv-presidential-speech
- alakxender/dv-audio-syn-lg
base_model: SWivid/F5-TTS
pipeline_tag: text-to-speech
---

# F5-TTS Fine-tuned for Dhivehi (ދިވެހި)

Fine-tuned [F5-TTS](https://github.com/SWivid/F5-TTS) model for Dhivehi (Maldivian)
text-to-speech with zero-shot voice cloning.

## Model Details

- **Architecture:** DiT (dim=1024, depth=22, heads=16)
- **Base Model:** F5-TTS v1 Base
- **Vocoder:** Vocos (24 kHz)
- **Tokenizer:** Custom character-level (Thaana + Latin + punctuation)
- **Vocab size:** 2604 characters (59 Thaana chars added to base vocab)

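To sanity-check the extended vocabulary, a short script can count how many entries fall in the Unicode Thaana block (U+0780 to U+07BF). This is a minimal sketch, not part of the released tooling: it assumes `vocab.txt` holds one token per line, as upstream F5-TTS vocab files do, and the helper names `is_thaana`, `count_thaana`, and `load_vocab` are hypothetical. The count may differ from 59 if some of the added characters lie outside the Thaana block.

```python
# Thaana occupies the Unicode block U+0780..U+07BF.
THAANA_START, THAANA_END = 0x0780, 0x07BF


def is_thaana(token: str) -> bool:
    """Return True if token is a single character in the Thaana block."""
    return len(token) == 1 and THAANA_START <= ord(token) <= THAANA_END


def count_thaana(tokens) -> int:
    """Count the tokens that are single Thaana characters."""
    return sum(1 for t in tokens if is_thaana(t))


def load_vocab(path: str) -> list:
    """Read one token per line (assumed format), keeping a bare-space token."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# In-memory demo: three Latin/punctuation tokens and three Thaana code points.
sample = ["a", " ", "!", "\u078b", "\u07a8", "\u0788"]
thaana_count = count_thaana(sample)
print(thaana_count)
```

For the real file, `count_thaana(load_vocab("vocab.txt"))` gives the corresponding count.
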
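## Usage

```python
from f5_tts.api import F5TTS

# Load the fine-tuned checkpoint together with the extended vocabulary.
tts = F5TTS(
    model="F5TTS_v1_Base",   # architecture config of the base model
    ckpt_file="model.pt",    # fine-tuned weights from this repository
    vocab_file="vocab.txt",  # extended vocabulary from this repository
)

# ref_file is a short clip of the voice to clone; ref_text is its transcript.
# The third return value (the spectrogram) is not needed here.
wav, sr, _ = tts.infer(
    ref_file="reference.wav",
    ref_text="reference text in Dhivehi",
    gen_text="ދިވެހިރާއްޖެއަކީ ވަރަކް ރީތި ޔައުމެކެވެ",
)
```
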
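The snippet above leaves the audio in memory. A minimal sketch for writing it to disk with only the standard library follows; it assumes `wav` is a one-dimensional sequence of float samples in [-1, 1] and `sr` is the sample rate in Hz. The helper `save_wav` is hypothetical, and a synthetic tone stands in for the model output so the example runs without the checkpoint.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    pcm = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)          # mono
        f.setsampwidth(2)          # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(bytes(pcm))


# Demo: one second of a 440 Hz tone at 24 kHz stands in for (wav, sr).
sr = 24000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
out_path = os.path.join(tempfile.gettempdir(), "f5tts_demo.wav")
save_wav(out_path, tone, sr)
```

With real output, `save_wav("output.wav", wav, sr)` would be the equivalent call, under the same assumptions about the shape and range of `wav`.
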
## Training Data

| Dataset | Samples |
|---------|---------|
| Serialtechlab/dhivehi-mms-v5-combined | ~9,660 |
| Serialtechlab/dv-presidential-speech | ~1,660 |
| alakxender/dv-audio-syn-lg | ~50,000 (synthetic) |

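As a rough worked example, the approximate counts in the table can be added up to see how much of the training mix is synthetic. The figures are the rounded values from the table, so the result is only indicative.

```python
# Approximate sample counts copied from the table above.
samples = {
    "Serialtechlab/dhivehi-mms-v5-combined": 9_660,
    "Serialtechlab/dv-presidential-speech": 1_660,
    "alakxender/dv-audio-syn-lg": 50_000,
}
synthetic = {"alakxender/dv-audio-syn-lg"}  # marked "(synthetic)" in the table

total = sum(samples.values())
synthetic_share = sum(n for name, n in samples.items() if name in synthetic) / total

print(f"total ~{total:,} samples, synthetic share ~{synthetic_share:.1%}")
# total ~61,320 samples, synthetic share ~81.5%
```
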
## Training Config

- Learning rate: 1e-05
- Batch size: 19200 frames
- Epochs: 100
- Mixed precision: bf16
- GPU: NVIDIA A100 40GB

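A frame-based batch size is easier to interpret once converted to seconds of audio. The sketch below does that conversion under one assumption not stated in this card: a mel hop length of 256 samples, which is the upstream F5-TTS default. The 24 kHz sample rate and the 19200-frame batch size come from the sections above. Whether the batch size is per GPU or global is also not stated here.

```python
SAMPLE_RATE = 24_000   # from this card (Vocos, 24 kHz)
HOP_LENGTH = 256       # ASSUMPTION: upstream F5-TTS default, not stated in this card
BATCH_FRAMES = 19_200  # from the training config above

frames_per_second = SAMPLE_RATE / HOP_LENGTH      # mel frames per second of audio
batch_seconds = BATCH_FRAMES / frames_per_second  # audio duration per batch

print(f"{frames_per_second} frames/s, ~{batch_seconds:.1f} s of audio per batch")
# 93.75 frames/s, ~204.8 s of audio per batch
```
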
## Files

- `model.pt` - Fine-tuned F5-TTS weights
- `vocab.txt` - Extended character vocabulary (Thaana + base)
- `config.json` - Training configuration