metadata
language:
  - dv
license: cc-by-nc-4.0
tags:
  - tts
  - text-to-speech
  - f5-tts
  - flow-matching
  - dhivehi
  - maldivian
  - thaana
  - voice-cloning
  - zero-shot-tts
datasets:
  - Serialtechlab/dhivehi-mms-v5-combined
  - Serialtechlab/dv-presidential-speech
  - alakxender/dv-audio-syn-lg
base_model: SWivid/F5-TTS
pipeline_tag: text-to-speech

F5-TTS Fine-tuned for Dhivehi (ދިވެހި)

Fine-tuned F5-TTS model for Dhivehi (Maldivian) text-to-speech with zero-shot voice cloning.

Model Details

  • Architecture: DiT (dim=1024, depth=22, heads=16)
  • Base Model: F5-TTS v1 Base
  • Vocoder: Vocos (24kHz)
  • Tokenizer: Custom character-level (Thaana + Latin + punctuation)
  • Vocab size: 2,604 characters (base vocabulary plus 59 added Thaana characters)
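Because the tokenizer is character-level, a character that is missing from vocab.txt has no dedicated token, so it can be worth checking input text before synthesis. The following is a minimal sketch of such a check, assuming vocab.txt stores one token per line (the layout used by upstream F5-TTS vocab files). `load_vocab` and `find_oov_chars` are hypothetical helper names for illustration and are not part of the f5_tts package.

```python
from pathlib import Path


def load_vocab(path):
    """Read a vocab file, assuming one token per line."""
    lines = Path(path).read_text(encoding="utf-8").split("\n")
    # Keep whitespace tokens such as a single space; drop only empty lines.
    return {line for line in lines if line != ""}


def find_oov_chars(text, vocab):
    """Return characters of `text` missing from `vocab`, in first-seen order."""
    missing = []
    for ch in text:
        if ch not in vocab and ch not in missing:
            missing.append(ch)
    return missing


# Toy vocabulary for illustration only; the real one lives in vocab.txt.
# It contains a space, two Latin letters, and two Thaana code points.
toy_vocab = {" ", "a", "b", "\u078b", "\u07a8"}

# The last character (U+0788) is a Thaana letter that the toy vocab lacks.
print(find_oov_chars("ab \u078b\u07a8\u0788", toy_vocab))
```

With the real file, `find_oov_chars(gen_text, load_vocab("vocab.txt"))` returning an empty list would suggest that every character in the input is covered.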

Usage

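from f5_tts.api import F5TTS

# Load the base architecture with the fine-tuned weights and extended vocabulary.
tts = F5TTS(
    model="F5TTS_v1_Base",
    ckpt_file="model.pt",
    vocab_file="vocab.txt",
)

# ref_file is a short clip of the voice to clone; ref_text is its transcript.
# infer() returns the waveform, its sample rate, and the mel spectrogram.
wav, sr, _ = tts.infer(
    ref_file="reference.wav",
    ref_text="reference text in Dhivehi",
    gen_text="ދިވެހިރާއްޖެއަކީ ވަރަކް ރީތި ޔައުމެކެވެ",
    file_wave="output.wav",  # optional: also write the generated audio to disk
)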

Training Data

| Dataset | Samples |
|---|---|
| Serialtechlab/dhivehi-mms-v5-combined | ~9,660 |
| Serialtechlab/dv-presidential-speech | ~1,660 |
| alakxender/dv-audio-syn-lg | ~50,000 (synthetic) |
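
To put the mix in perspective, the approximate counts listed above can be summed to estimate the total corpus size and the share of synthetic audio. This is simple arithmetic on the rounded figures, so the result is only as precise as those figures.

```python
# Approximate sample counts copied from the list above.
samples = {
    "Serialtechlab/dhivehi-mms-v5-combined": 9_660,
    "Serialtechlab/dv-presidential-speech": 1_660,
    "alakxender/dv-audio-syn-lg": 50_000,  # marked as synthetic
}
synthetic = {"alakxender/dv-audio-syn-lg"}

total = sum(samples.values())
synthetic_total = sum(n for name, n in samples.items() if name in synthetic)
synthetic_share = synthetic_total / total

print(f"total ~{total:,} samples, synthetic share ~{synthetic_share:.1%}")
```

On these figures the combined set is roughly 61,320 samples, of which about 81.5% are synthetic.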

Training Config

  • Learning rate: 1e-05
  • Batch size: 19200 frames
  • Epochs: 100
  • Mixed precision: bf16
  • GPU: NVIDIA A100 40GB
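
The batch size is given in mel frames rather than in samples or utterances. A rough conversion to seconds of audio is sketched below, assuming the upstream F5-TTS default mel hop length of 256 samples at the 24 kHz rate listed under Model Details. The hop length is an assumption here and should be checked against config.json; `frames_to_seconds` is a hypothetical helper name.

```python
SAMPLE_RATE_HZ = 24_000  # 24 kHz, as listed for the Vocos vocoder
HOP_LENGTH = 256         # assumption: upstream F5-TTS default mel hop length


def frames_to_seconds(frames, sample_rate=SAMPLE_RATE_HZ, hop_length=HOP_LENGTH):
    """Convert a count of mel frames to seconds of audio."""
    return frames * hop_length / sample_rate


batch_seconds = frames_to_seconds(19_200)
print(f"19,200 frames is about {batch_seconds:.1f} s of audio per batch")
```

Under that assumption, 19,200 frames corresponds to about 204.8 seconds of audio per batch.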

Files

  • model.pt - Fine-tuned F5-TTS weights
  • vocab.txt - Extended character vocabulary (Thaana + base)
  • config.json - Training configuration