	---
	language:
	- dv
	license: cc-by-nc-4.0
	tags:
	- tts
	- text-to-speech
	- f5-tts
	- flow-matching
	- dhivehi
	- maldivian
	- thaana
	- voice-cloning
	- zero-shot-tts
	datasets:
	- Serialtechlab/dhivehi-mms-v5-combined
	- Serialtechlab/dv-presidential-speech
	- alakxender/dv-audio-syn-lg
	base_model: SWivid/F5-TTS
	pipeline_tag: text-to-speech
	---

	# F5-TTS Fine-tuned for Dhivehi (ދިވެހި)

	A fine-tuned [F5-TTS](https://github.com/SWivid/F5-TTS) model for Dhivehi (Maldivian)
	text-to-speech. It supports zero-shot voice cloning from a reference clip and its transcript.

	## Model Details

	- Architecture: DiT (dim=1024, depth=22, heads=16)
	- Base Model: F5-TTS v1 Base
	- Vocoder: Vocos (24 kHz)
	- Tokenizer: Custom character-level (Thaana + Latin + punctuation)
	- Vocab size: 2,604 entries (base vocabulary plus 59 added Thaana characters)
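The vocabulary figures above imply a base vocabulary of 2604 - 59 = 2545 entries. Because the tokenizer is character-level, a character missing from the vocabulary cannot be represented, so it can help to check input text before synthesis. The following is a minimal sketch of such a check; the helper names `load_vocab` and `missing_chars` and the one-entry-per-line reading of `vocab.txt` are assumptions for illustration, not part of the F5-TTS API.

```python
def load_vocab(path):
    """Read a character vocabulary, assuming one entry per line (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        # Strip only the newline so that an entry consisting of a space survives.
        return {line.rstrip("\n") for line in f}


def missing_chars(text, vocab):
    """Return characters of `text` absent from `vocab`, in order of first appearance."""
    seen = []
    for ch in text:
        if ch not in vocab and ch not in seen:
            seen.append(ch)
    return seen


# Toy vocabulary for demonstration only: a space, two Latin letters, two Thaana letters.
toy_vocab = {" ", "a", "b", "\u078b", "\u0788"}

# U+0780 is a Thaana letter that is deliberately left out of the toy vocabulary.
print(missing_chars("ab \u078b\u0788\u0780", toy_vocab))
```

With the real file, one would call `missing_chars(gen_text, load_vocab("vocab.txt"))` and expect an empty list for well-formed Dhivehi input.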

	## Usage

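	```python
	from f5_tts.api import F5TTS

	# Load the fine-tuned checkpoint together with its extended vocabulary.
	tts = F5TTS(
	    model="F5TTS_v1_Base",
	    ckpt_file="model.pt",
	    vocab_file="vocab.txt",
	)

	# ref_file is the voice to clone; ref_text is the transcript of that clip.
	wav, sr, _ = tts.infer(
	    ref_file="reference.wav",
	    ref_text="reference text in Dhivehi",
	    gen_text="ދިވެހިރާއްޖެއަކީ ވަރަކް ރީތި ޔައުމެކެވެ",
	)
	```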

	## Training Data

	| Dataset | Samples |
	|---------|---------|
	| Serialtechlab/dhivehi-mms-v5-combined | ~9,660 |
	| Serialtechlab/dv-presidential-speech | ~1,660 |
	| alakxender/dv-audio-syn-lg | ~50,000 (synthetic) |
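To put the table in perspective, the short calculation below sums the approximate sample counts and works out how much of the mix is synthetic. The counts are the rounded figures from the table, so the results are approximate too, and they measure sample counts rather than audio duration.

```python
# Approximate sample counts copied from the table above.
samples = {
    "Serialtechlab/dhivehi-mms-v5-combined": 9_660,
    "Serialtechlab/dv-presidential-speech": 1_660,
    "alakxender/dv-audio-syn-lg": 50_000,  # synthetic
}

total = sum(samples.values())
synthetic_share = samples["alakxender/dv-audio-syn-lg"] / total

print(f"total ~{total:,} samples, synthetic share ~{synthetic_share:.1%}")
# prints: total ~61,320 samples, synthetic share ~81.5%
```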

	## Training Config

	- Learning rate: 1e-5
	- Batch size: 19,200 frames
	- Epochs: 100
	- Mixed precision: bf16
	- GPU: NVIDIA A100 40GB
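The batch size is given in mel-spectrogram frames rather than in utterances. As a rough sketch, the conversion to seconds of audio looks like this. The hop length of 256 samples is an assumption (the usual F5-TTS mel setting) and is not stated in this card; the sample rate comes from the 24 kHz vocoder entry.

```python
SAMPLE_RATE = 24_000   # Hz, from the Vocos (24 kHz) entry in this card
HOP_LENGTH = 256       # assumption: samples per mel frame, not stated in this card
BATCH_FRAMES = 19_200  # from the training config above

# Each frame advances by HOP_LENGTH samples, so frames * hop / rate gives seconds.
seconds_per_batch = BATCH_FRAMES * HOP_LENGTH / SAMPLE_RATE

print(f"{seconds_per_batch:.1f} s of audio per batch")
# prints: 204.8 s of audio per batch
```

Under that assumption, each batch holds a little under three and a half minutes of audio.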

	## Files

	- `model.pt` - Fine-tuned F5-TTS weights
	- `vocab.txt` - Extended character vocabulary (Thaana + base)
	- `config.json` - Training configuration
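One possible way to fetch these files programmatically is with `hf_hub_download` from the `huggingface_hub` package. This is a sketch only: the repository id `Serialtechlab/f5-tts-dhivehi` is an assumption based on the namespace and model name shown on the hosting page, and `fetch_files` is a hypothetical helper. The `download` parameter exists so that the function can be exercised without network access.

```python
REPO_ID = "Serialtechlab/f5-tts-dhivehi"  # assumption: not confirmed by the card text
FILES = ("model.pt", "vocab.txt", "config.json")


def fetch_files(download=None):
    """Download each listed file and return a {filename: local_path} mapping."""
    if download is None:
        # Imported lazily so the sketch can be read and tested without the package.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    return {name: download(repo_id=REPO_ID, filename=name) for name in FILES}
```

The returned paths for `model.pt` and `vocab.txt` can then be passed as `ckpt_file` and `vocab_file` when constructing `F5TTS`.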