---
language:
- dv
license: other
license_name: coqui-public-model-license
license_link: https://coqui.ai/cpml
tags:
- text-to-speech
- tts
- voice-cloning
- xtts
- dhivehi
- thaana
- maldives
library_name: coqui
pipeline_tag: text-to-speech
base_model: coqui/XTTS-v2
---

# XTTS v2 - Dhivehi (Thaana)

Fine-tuned [XTTS v2.0](https://huggingface.co/coqui/XTTS-v2) for **Dhivehi** (Maldivian, Thaana script) text-to-speech with zero-shot voice cloning.

## Model Details

- **Base model:** XTTS v2.0 (Coqui)
- **Language:** Dhivehi (dv), Thaana script
- **Architecture:** GPT-2-style autoregressive decoder + DVAE + HiFiGAN vocoder
- **Audio:** 24 kHz output
- **Training step:** 95,366

## Training Data

~59,000 samples (~75+ hours) from multiple Dhivehi speech sources:

- [Serialtechlab/dhivehi-javaabu-speech-parquet](https://huggingface.co/datasets/Serialtechlab/dhivehi-javaabu-speech-parquet) - news/article narration
- [Serialtechlab/dv-presidential-speech](https://huggingface.co/datasets/Serialtechlab/dv-presidential-speech) - presidential addresses
- [Serialtechlab/dhivehi-tts-female-01](https://huggingface.co/datasets/Serialtechlab/dhivehi-tts-female-01) - female speaker
- [alakxender/dv-audio-syn-lg](https://huggingface.co/datasets/alakxender/dv-audio-syn-lg) - synthetic speech (subset)

## Usage

### Install

```bash
pip install coqui-tts
```

### Inference

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download all files from this repo into a local directory first.

config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    eval=True,
    strict=False,
)
if torch.cuda.is_available():
    model.cuda()

# Compute speaker conditioning from a reference WAV (5-15 s of clean speech)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"],
    gpt_cond_len=24,
    gpt_cond_chunk_len=4,
)

# Generate speech (the text below is Thaana, written as Unicode escapes)
out = model.inference(
    text="\u0784\u07a8\u0790\u07b0\u0789\u07a8\ufdf2 \u0783\u07a6\u0781\u07aa\u0789\u07a7\u0782\u07a8 \u0783\u07a6\u0781\u07a9\u0789\u07a8",
    language="dv",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
```
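
The model was fine-tuned only on Thaana-script text, so non-Thaana input may produce unpredictable audio. A minimal pre-check is a useful guard before calling `model.inference` (a sketch; the helper name and the decision to reject non-Thaana letters are assumptions, not part of the model's API):

```python
# Thaana occupies the Unicode block U+0780-U+07BF.
THAANA_START, THAANA_END = 0x0780, 0x07BF

def is_thaana_text(text: str) -> bool:
    """Return True if every letter in `text` falls in the Thaana block.

    Vowel signs (combining marks), punctuation, digits, and whitespace are
    not classified as letters by str.isalpha(), so they are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    return all(THAANA_START <= ord(ch) <= THAANA_END for ch in letters)

print(is_thaana_text("\u0784\u07a8\u0790\u07b0\u0789\u07a8"))  # Thaana word -> True
print(is_thaana_text("hello"))  # Latin script -> False
```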

## Files

- `model.pth` - Fine-tuned GPT checkpoint
- `config.json` - Model configuration
- `vocab.json` - Extended BPE vocabulary (base XTTS + Thaana characters)
- `dvae.pth` - Discrete VAE (from base XTTS v2.0)
- `mel_stats.pth` - Mel spectrogram normalization stats (from base XTTS v2.0)
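
If you download the files manually, a quick completeness check before loading can save a confusing stack trace later (a minimal sketch; the directory path is a placeholder you supply):

```python
from pathlib import Path

# The five files listed above, all expected in one local directory.
REQUIRED_FILES = ["model.pth", "config.json", "vocab.json", "dvae.pth", "mel_stats.pth"]

def missing_files(model_dir: str) -> list[str]:
    """Return the names of required checkpoint files absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

# Example: an empty directory is missing all five files.
import tempfile
with tempfile.TemporaryDirectory() as d:
    print(missing_files(d))
```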

## Limitations

- Voice cloning quality depends on the reference audio (clean, 5-15 seconds recommended)
- Text longer than ~300 characters may be truncated
- Some rare Dhivehi words may be mispronounced
- Model is still being actively trained; newer checkpoints may be uploaded
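
To work around the ~300-character limit, long input can be split at sentence boundaries and synthesized chunk by chunk (a sketch; the splitting heuristic, the helper name, and the 300-character budget are assumptions based on the limitation above, not model API):

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends.

    Splits after '.', '!', '?', or the Arabic full stop, keeping each
    terminator attached to its sentence, then greedily packs sentences.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?\u06d4])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence  # a single oversized sentence still passes through whole
    if current:
        chunks.append(current)
    return chunks

# At a tiny budget, each sentence becomes its own chunk.
print(chunk_text("First sentence. Second sentence. Third sentence.", max_chars=20))
```

Each chunk can then be passed to `model.inference` separately and the resulting `wav` arrays concatenated before saving.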

## License

This model inherits the [Coqui Public Model License](https://coqui.ai/cpml) from the base XTTS v2.0 model.