---
language:
- dv
license: cc-by-nc-4.0
tags:
- tts
- text-to-speech
- f5-tts
- flow-matching
- dhivehi
- maldivian
- thaana
- voice-cloning
- zero-shot-tts
datasets:
- Serialtechlab/dhivehi-mms-v5-combined
- Serialtechlab/dv-presidential-speech
- alakxender/dv-audio-syn-lg
base_model: SWivid/F5-TTS
pipeline_tag: text-to-speech
---
# F5-TTS Fine-tuned for Dhivehi (ދިވެހި)
An [F5-TTS](https://github.com/SWivid/F5-TTS) model fine-tuned for Dhivehi (Maldivian)
text-to-speech, with zero-shot voice cloning from a reference recording.
## Model Details
- **Architecture:** DiT (dim=1024, depth=22, heads=16)
- **Base Model:** F5-TTS v1 Base
- **Vocoder:** Vocos (24 kHz)
- **Tokenizer:** Custom character-level (Thaana + Latin + punctuation)
- **Vocab size:** 2604 characters (59 Thaana chars added to base vocab)
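
The vocabulary figures above imply a base vocabulary of 2604 - 59 = 2545 entries. The card does not describe how the extended vocabulary was built, so the following is only a minimal sketch of one plausible approach: appending new characters to the end of the base list so that every existing entry keeps its index. The helper `extend_vocab` and the toy base vocabulary are hypothetical and are not taken from the training scripts.

```python
import unicodedata


def extend_vocab(base_vocab, new_chars):
    """Append characters missing from base_vocab, keeping the existing order."""
    seen = set(base_vocab)
    extended = list(base_vocab)
    for ch in new_chars:
        if ch not in seen:
            seen.add(ch)
            extended.append(ch)
    return extended


# Assigned code points in the Unicode Thaana block (U+0780..U+07BF).
# unicodedata.name() returns the default "" for unassigned code points.
thaana = [chr(cp) for cp in range(0x0780, 0x07C0) if unicodedata.name(chr(cp), "")]

# Toy stand-in for the base vocabulary (hypothetical, for illustration only).
toy_base = [" ", "a", "b", "."]
extended = extend_vocab(toy_base, thaana + ["a"])  # the duplicate "a" is skipped

print(len(toy_base), "->", len(extended))
```

Appending rather than re-sorting means indices of the base entries are unchanged, which is what allows pretrained embedding rows to be reused when a vocabulary grows.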
## Usage
```python
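from f5_tts.api import F5TTS

# Load the F5-TTS v1 Base architecture with the fine-tuned weights and extended vocabulary
tts = F5TTS(
    model="F5TTS_v1_Base",
    ckpt_file="model.pt",
    vocab_file="vocab.txt",
)

# Clone the voice in reference.wav; ref_text is the transcript of that clip.
# Returns the waveform, its sample rate, and the mel spectrogram (unused here).
wav, sr, _ = tts.infer(
    ref_file="reference.wav",
    ref_text="reference text in Dhivehi",
    gen_text="ދިވެހިރާއްޖެއަކީ ވަރަކް ރީތި ޔައުމެކެވެ",
)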
```
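
Because the tokenizer is character-level, it can help to check the input text against `vocab.txt` before calling `infer`. The sketch below assumes the vocabulary file holds one token per line, as in the upstream F5-TTS vocabulary files. The helpers `load_vocab` and `out_of_vocab` are hypothetical and are not part of the `f5_tts` package.

```python
def load_vocab(path):
    """Read a vocabulary file with one token per line into a set."""
    with open(path, encoding="utf-8") as f:
        # Strip only the newline so that a line holding a single space survives.
        return {line.rstrip("\n") for line in f}


def out_of_vocab(text, vocab):
    """Return the sorted list of distinct characters in text that vocab lacks."""
    return sorted({ch for ch in text if ch not in vocab})


# Toy vocabulary for illustration; in practice use load_vocab("vocab.txt").
toy_vocab = set("ދިވެހި ")
missing = out_of_vocab("ދިވެހި x!", toy_vocab)
print(missing)
```

An empty result means every character of the text has its own vocabulary entry. Anything reported here is a candidate for normalization or removal before synthesis.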
## Training Data
| Dataset | Samples |
|---------|---------|
| Serialtechlab/dhivehi-mms-v5-combined | ~9,660 |
| Serialtechlab/dv-presidential-speech | ~1,660 |
| alakxender/dv-audio-syn-lg | ~50,000 (synthetic) |
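
To make the balance of the mix explicit, the approximate counts in the table can be summed. This is plain arithmetic on the rounded figures above, so the results are approximate as well.

```python
# Approximate sample counts copied from the table above.
samples = {
    "Serialtechlab/dhivehi-mms-v5-combined": 9_660,
    "Serialtechlab/dv-presidential-speech": 1_660,
    "alakxender/dv-audio-syn-lg": 50_000,  # synthetic
}

total = sum(samples.values())
synthetic_share = samples["alakxender/dv-audio-syn-lg"] / total

print(f"total ~{total:,} samples, synthetic share ~{synthetic_share:.1%}")
```

By these figures the synthetic dataset accounts for roughly four fifths of the training samples.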
## Training Config
- Learning rate: 1e-05
- Batch size: 19200 frames
- Epochs: 100
- Mixed precision: bf16
- GPU: NVIDIA A100 40GB
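
The batch size is given in mel frames rather than in utterances. As a rough conversion to audio duration, the sketch below assumes a mel hop length of 256 samples, which is the default in the upstream F5-TTS configs but is not stated in this card, together with the 24 kHz sample rate of the Vocos vocoder listed under Model Details.

```python
SAMPLE_RATE = 24_000   # Hz, matching the 24 kHz Vocos vocoder
HOP_LENGTH = 256       # assumption: upstream F5-TTS default, not stated in this card
BATCH_FRAMES = 19_200  # from the training config above


def frames_to_seconds(frames, sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Convert a number of mel frames to seconds of audio."""
    return frames * hop_length / sample_rate


batch_seconds = frames_to_seconds(BATCH_FRAMES)
print(f"{BATCH_FRAMES} frames is about {batch_seconds:.1f} s of audio per batch")
```

Under these assumptions, one batch holds a little under three and a half minutes of audio.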
## Files
- `model.pt` - Fine-tuned F5-TTS weights
- `vocab.txt` - Extended character vocabulary (Thaana + base)
- `config.json` - Training configuration