XTTS v2 - Dhivehi (Thaana)

Fine-tuned XTTS v2.0 for Dhivehi (Maldivian, Thaana script) text-to-speech with zero-shot voice cloning.

Model Details

  • Base model: XTTS v2.0 (Coqui)
  • Language: Dhivehi (dv) - Thaana script
  • Architecture: GPT-2 + DVAE + HiFiGAN vocoder
  • Audio: 24 kHz output
  • Training step: 95366

Training Data

59,000 samples (75+ hours) from multiple Dhivehi speech sources.

Usage

Install

pip install coqui-tts

Inference

import torch, torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download all files from this repo into a local directory

config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    eval=True,
    strict=False,
)
model.cuda()  # move the model to the GPU (requires a CUDA device)

# Get speaker embedding from a reference WAV (5-15 sec of clean speech)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"],
    gpt_cond_len=24,
    gpt_cond_chunk_len=4,
)

# Generate speech
out = model.inference(
    text="\u0784\u07a8\u0790\u07b0\u0789\u07a8\ufdf2 \u0783\u07a6\u0781\u07aa\u0789\u07a7\u0782\u07a8 \u0783\u07a6\u0781\u07a9\u0789\u07a8",
    language="dv",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

# Add a channel dimension and save at the model's 24 kHz output rate
wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
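
Since the reference clip is expected to be 5-15 seconds long, it can help to check its duration before computing the conditioning latents. The following is a minimal sketch using only the Python standard library. The helper names `wav_duration_seconds` and `is_good_reference_length` are hypothetical and not part of coqui-tts, and the `wave` module only reads uncompressed PCM WAV files.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_good_reference_length(path, min_s=5.0, max_s=15.0):
    """True if the clip falls inside the recommended 5-15 second window.

    The bounds default to the range recommended in this card; they are a
    guideline, not a hard limit enforced by the model.
    """
    return min_s <= wav_duration_seconds(path) <= max_s
```

A call such as `is_good_reference_length("reference.wav")` before `get_conditioning_latents` gives an early warning when a clip is too short or too long. It does not say anything about how clean the recording is.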

Files

  • model.pth - Fine-tuned GPT checkpoint
  • config.json - Model configuration
  • vocab.json - Extended BPE vocabulary (base XTTS + Thaana characters)
  • dvae.pth - Discrete VAE (from base XTTS v2.0)
  • mel_stats.pth - Mel spectrogram normalization stats (from base XTTS v2.0)
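
Before loading the model, it may be worth confirming that the files listed above are present in the local directory. This is a small sketch under that assumption. The helper `missing_files` and the constant `LISTED_FILES` are hypothetical names introduced here, and the check only looks at file names, not at file contents or integrity.

```python
from pathlib import Path

# File names as listed in the "Files" section of this card.
LISTED_FILES = (
    "model.pth",
    "config.json",
    "vocab.json",
    "dvae.pth",
    "mel_stats.pth",
)


def missing_files(directory, names=LISTED_FILES):
    """Return the listed file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in names if not (root / name).is_file()]
```

An empty list from `missing_files(".")` means every listed file was found in the current directory.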

Limitations

  • Voice cloning quality depends on the reference audio (clean, 5-15 seconds recommended)
  • Text longer than ~300 characters may be truncated
  • Some rare Dhivehi words may be mispronounced
  • Model is still being actively trained - newer checkpoints may be uploaded
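
One way to work around the ~300 character limit is to split long input into shorter chunks and synthesize each chunk separately. The sketch below is an illustration, not part of this repository or of coqui-tts. `split_text` is a hypothetical helper, and the set of sentence-ending characters (`.`, `!`, `?` and the Arabic question mark U+061F) is an assumption about typical Dhivehi punctuation.

```python
import re


def _split_words(sentence, max_chars):
    """Break one over-long sentence on whitespace into pieces <= max_chars.

    A single word longer than max_chars is kept whole rather than cut.
    """
    parts, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            parts.append(current)
            current = word
    if current:
        parts.append(current)
    return parts


def split_text(text, max_chars=300):
    """Greedily pack sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?\u061F])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = _split_words(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `model.inference` in turn and the resulting waveforms concatenated before saving. Prosody at the chunk boundaries may be less smooth than in a single pass.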

License

This model inherits the Coqui Public Model License from the base XTTS v2.0 model.
