XTTS v2 - Dhivehi (Thaana)

Fine-tuned XTTS v2.0 for Dhivehi (Maldivian, Thaana script) text-to-speech with zero-shot voice cloning.

Model Details

Base model: XTTS v2.0 (Coqui)
Language: Dhivehi (dv) - Thaana script
Architecture: GPT-2 + DVAE + HiFiGAN vocoder
Audio: 24 kHz output
Training step: 95366

Training Data

~59,000 samples (~75+ hours) from multiple Dhivehi speech sources:

Serialtechlab/dhivehi-javaabu-speech-parquet - news/article narration
Serialtechlab/dv-presidential-speech - presidential addresses
Serialtechlab/dhivehi-tts-female-01 - female speaker
alakxender/dv-audio-syn-lg - synthetic speech (subset)

Usage

Install

pip install coqui-tts

Inference

import torch, torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download all files from this repo into a local directory

config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    eval=True,
    strict=False,
)
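# Use the GPU when one is available; otherwise fall back to CPU (slower)
model.to("cuda" if torch.cuda.is_available() else "cpu")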

# Get speaker embedding from a reference WAV (5-15 sec of clean speech)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"],
    gpt_cond_len=24,
    gpt_cond_chunk_len=4,
)

# Generate speech
out = model.inference(
    text="\u0784\u07a8\u0790\u07b0\u0789\u07a8\ufdf2 \u0783\u07a6\u0781\u07aa\u0789\u07a7\u0782\u07a8 \u0783\u07a6\u0781\u07a9\u0789\u07a8",
    language="dv",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
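Because cloning quality depends on the reference clip, it can help to check its length before computing the conditioning latents. The following is a minimal sketch using only the standard library; the helper names `wav_duration_seconds` and `reference_is_recommended_length` are hypothetical, and the `wave` module only reads uncompressed PCM WAV files. It checks duration only and cannot tell whether the speech is clean.

```python
import wave

# Recommended reference length from this model card: 5-15 seconds of clean speech.
MIN_SECONDS = 5.0
MAX_SECONDS = 15.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds (stdlib only)."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def reference_is_recommended_length(path: str) -> bool:
    """True when the clip falls inside the 5-15 s range (hypothetical helper)."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

For example, `reference_is_recommended_length("reference.wav")` could be called before `model.get_conditioning_latents(...)` to warn about clips that are too short or too long.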

Files

model.pth - Fine-tuned GPT checkpoint
config.json - Model configuration
vocab.json - Extended BPE vocabulary (base XTTS + Thaana characters)
dvae.pth - Discrete VAE (from base XTTS v2.0)
mel_stats.pth - Mel spectrogram normalization stats (from base XTTS v2.0)
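The files listed above need to sit in one local directory before running the inference code. One possible way to fetch and verify them is sketched below. It assumes the `huggingface_hub` package is installed and that the repository ID is `Serialtechlab/xtts-v2-dhivehi`; the helper names `missing_files` and `fetch_model_files` are hypothetical. If the repository requires accepting access conditions, you would likely need to do that on the model page and authenticate with a Hugging Face token first.

```python
from pathlib import Path

# File names taken from the list in this model card.
EXPECTED_FILES = ["model.pth", "config.json", "vocab.json", "dvae.pth", "mel_stats.pth"]

# Assumption: the repository ID of this model on the Hugging Face Hub.
REPO_ID = "Serialtechlab/xtts-v2-dhivehi"


def missing_files(local_dir: str) -> list[str]:
    """Return the expected files that are not present in local_dir."""
    root = Path(local_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def fetch_model_files(local_dir: str) -> list[str]:
    """Download the expected files, then report any that are still missing."""
    # Imported lazily so missing_files() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=EXPECTED_FILES,  # skip anything not in the list
    )
    return missing_files(local_dir)
```

Calling `fetch_model_files("xtts-dv")` should return an empty list when everything downloaded; the paths passed to `config.load_json` and `model.load_checkpoint` would then point into that directory.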

Limitations

Voice cloning quality depends on the reference audio (clean, 5-15 seconds recommended)
Text longer than ~300 characters may be truncated
Some rare Dhivehi words may be mispronounced
Model is still being actively trained - newer checkpoints may be uploaded
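One way to work around the length limit is to split long input into shorter chunks and synthesize each chunk separately. The sketch below is an illustration, not part of this model or of the TTS library: `split_text` is a hypothetical helper, the 250-character budget is an assumed safety margin under the ~300-character figure, and the punctuation set (Latin `.`, `!`, `?` plus the Arabic question mark and full stop) is a guess that may need adjusting for your text.

```python
import re

# Assumption: stay under the ~300-character limit with some margin.
MAX_CHARS = 250

# Split after sentence-ending punctuation, including the Arabic question
# mark (U+061F) and Arabic full stop (U+06D4).
_SENTENCE_END = re.compile(r"(?<=[.!?\u061F\u06D4])\s+")


def split_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        # The sentence fits into the chunk being built: append it.
        if current and len(current) + 1 + len(sentence) <= max_chars:
            current = f"{current} {sentence}"
            continue
        if current:
            chunks.append(current)
            current = ""
        # Pack this sentence word by word; a sentence longer than the budget
        # spills into several chunks. A single word longer than max_chars is
        # kept whole rather than cut.
        for word in sentence.split():
            if current and len(current) + 1 + len(word) > max_chars:
                chunks.append(current)
                current = word
            else:
                current = f"{current} {word}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `model.inference(...)` with the same `gpt_cond_latent` and `speaker_embedding`, and the resulting waveforms concatenated before saving. Splitting at sentence boundaries tends to keep prosody more natural than cutting at a fixed character count.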

License

This model inherits the Coqui Public Model License from the base XTTS v2.0 model.
