---
language:
- dv
license: other
license_name: coqui-public-model-license
license_link: https://coqui.ai/cpml
tags:
- text-to-speech
- tts
- voice-cloning
- xtts
- dhivehi
- thaana
- maldives
library_name: coqui
pipeline_tag: text-to-speech
base_model: coqui/XTTS-v2
---

# XTTS v2 - Dhivehi (Thaana)

Fine-tuned [XTTS v2.0](https://huggingface.co/coqui/XTTS-v2) for **Dhivehi** (Maldivian, Thaana script) text-to-speech with zero-shot voice cloning.

## Model Details

- **Base model:** XTTS v2.0 (Coqui)
- **Language:** Dhivehi (dv), Thaana script
- **Architecture:** GPT-2-style autoregressive decoder + DVAE + HiFiGAN vocoder
- **Audio:** 24 kHz output
- **Training step:** 95,366

## Training Data

~59,000 samples (~75+ hours) from multiple Dhivehi speech sources:

- [Serialtechlab/dhivehi-javaabu-speech-parquet](https://huggingface.co/datasets/Serialtechlab/dhivehi-javaabu-speech-parquet) - news/article narration
- [Serialtechlab/dv-presidential-speech](https://huggingface.co/datasets/Serialtechlab/dv-presidential-speech) - presidential addresses
- [Serialtechlab/dhivehi-tts-female-01](https://huggingface.co/datasets/Serialtechlab/dhivehi-tts-female-01) - female speaker
- [alakxender/dv-audio-syn-lg](https://huggingface.co/datasets/alakxender/dv-audio-syn-lg) - synthetic speech (subset)

## Usage

### Install

```bash
pip install coqui-tts
```

### Inference

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download all files from this repo into a local directory first.

config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    eval=True,
    strict=False,
)
if torch.cuda.is_available():
    model.cuda()

# Compute speaker conditioning from a reference WAV (5-15 s of clean speech)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"],
    gpt_cond_len=24,
    gpt_cond_chunk_len=4,
)

# Generate speech (the text below is Thaana, written as Unicode escapes)
out = model.inference(
    text="\u0784\u07a8\u0790\u07b0\u0789\u07a8\ufdf2 \u0783\u07a6\u0781\u07aa\u0789\u07a7\u0782\u07a8 \u0783\u07a6\u0781\u07a9\u0789\u07a8",
    language="dv",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
```
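
The model was fine-tuned only on Thaana-script text, so non-Thaana input may produce unpredictable audio. A minimal pre-check is a useful guard before calling `model.inference` (a sketch; the helper name and the decision to reject non-Thaana letters are assumptions, not part of the model's API):

```python
# Thaana occupies the Unicode block U+0780-U+07BF.
THAANA_START, THAANA_END = 0x0780, 0x07BF

def is_thaana_text(text: str) -> bool:
    """Return True if every letter in `text` falls in the Thaana block.

    Vowel signs (combining marks), punctuation, digits, and whitespace are
    not classified as letters by str.isalpha(), so they are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    return all(THAANA_START <= ord(ch) <= THAANA_END for ch in letters)

print(is_thaana_text("\u0784\u07a8\u0790\u07b0\u0789\u07a8"))  # Thaana word -> True
print(is_thaana_text("hello"))  # Latin script -> False
```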

## Files

- `model.pth` - Fine-tuned GPT checkpoint
- `config.json` - Model configuration
- `vocab.json` - Extended BPE vocabulary (base XTTS + Thaana characters)
- `dvae.pth` - Discrete VAE (from base XTTS v2.0)
- `mel_stats.pth` - Mel spectrogram normalization stats (from base XTTS v2.0)
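
If you download the files manually, a quick completeness check before loading can save a confusing stack trace later (a minimal sketch; the directory path is a placeholder you supply):

```python
from pathlib import Path

# The five files listed above, all expected in one local directory.
REQUIRED_FILES = ["model.pth", "config.json", "vocab.json", "dvae.pth", "mel_stats.pth"]

def missing_files(model_dir: str) -> list[str]:
    """Return the names of required checkpoint files absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

# Example: an empty directory is missing all five files.
import tempfile
with tempfile.TemporaryDirectory() as d:
    print(missing_files(d))
```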

## Limitations

- Voice cloning quality depends on the reference audio (clean, 5-15 seconds recommended)
- Text longer than ~300 characters may be truncated
- Some rare Dhivehi words may be mispronounced
- Model is still being actively trained; newer checkpoints may be uploaded
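
To work around the ~300-character limit, long input can be split at sentence boundaries and synthesized chunk by chunk (a sketch; the splitting heuristic, the helper name, and the 300-character budget are assumptions based on the limitation above, not model API):

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends.

    Splits after '.', '!', '?', or the Arabic full stop, keeping each
    terminator attached to its sentence, then greedily packs sentences.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?\u06d4])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence  # a single oversized sentence still passes through whole
    if current:
        chunks.append(current)
    return chunks

# At a tiny budget, each sentence becomes its own chunk.
print(chunk_text("First sentence. Second sentence. Third sentence.", max_chars=20))
```

Each chunk can then be passed to `model.inference` separately and the resulting `wav` arrays concatenated before saving.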

## License

This model inherits the [Coqui Public Model License](https://coqui.ai/cpml) from the base XTTS v2.0 model.