---
language:
  - dv
license: other
license_name: coqui-public-model-license
license_link: https://coqui.ai/cpml
tags:
  - text-to-speech
  - tts
  - voice-cloning
  - xtts
  - dhivehi
  - thaana
  - maldives
library_name: coqui
pipeline_tag: text-to-speech
base_model: coqui/XTTS-v2
---

# XTTS v2 - Dhivehi (Thaana)

Fine-tuned XTTS v2.0 for Dhivehi (Maldivian, Thaana script) text-to-speech with zero-shot voice cloning.

## Model Details

- Base model: XTTS v2.0 (Coqui)
- Language: Dhivehi (`dv`), Thaana script
- Architecture: GPT-2 decoder + DVAE + HiFi-GAN vocoder
- Output audio: 24 kHz
- Training step: 95366

## Training Data

59,000 samples (75+ hours) drawn from multiple Dhivehi speech sources.

## Usage

### Install

```bash
pip install coqui-tts
```

### Inference

```python
import torch, torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download all files from this repo into a local directory first

config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    eval=True,
    strict=False,
)
if torch.cuda.is_available():
    model.cuda()

# Compute speaker conditioning from a reference WAV (5-15 s of clean speech)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"],
    gpt_cond_len=24,
    gpt_cond_chunk_len=4,
)

# Generate speech
out = model.inference(
    text="\u0784\u07a8\u0790\u07b0\u0789\u07a8\ufdf2 \u0783\u07a6\u0781\u07aa\u0789\u07a7\u0782\u07a8 \u0783\u07a6\u0781\u07a9\u0789\u07a8",
    language="dv",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

# Save the 24 kHz waveform
wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
```
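Since cloning quality is sensitive to the reference clip, it can be worth checking its duration before conditioning. A minimal stdlib-only sketch (the `reference_duration_seconds` helper is hypothetical, not part of this repo, and assumes an uncompressed PCM WAV file):

```python
import wave

def reference_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()

# Example usage against the reference clip from the snippet above:
# duration = reference_duration_seconds("reference.wav")
# if not 5.0 <= duration <= 15.0:
#     print(f"warning: reference is {duration:.1f}s; 5-15 s of clean speech works best")
```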

## Files

- `model.pth` - Fine-tuned GPT checkpoint
- `config.json` - Model configuration
- `vocab.json` - Extended BPE vocabulary (base XTTS + Thaana characters)
- `dvae.pth` - Discrete VAE (from base XTTS v2.0)
- `mel_stats.pth` - Mel spectrogram normalization stats (from base XTTS v2.0)
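Loading fails in confusing ways when one of these files is absent, so a pre-flight check can save debugging time. A hypothetical sketch (the `missing_files` helper is illustrative, assuming all five files sit in one local directory):

```python
from pathlib import Path

REQUIRED_FILES = ["model.pth", "config.json", "vocab.json", "dvae.pth", "mel_stats.pth"]

def missing_files(model_dir):
    """Return the required checkpoint files absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (model_dir / name).exists()]

# Example usage:
# missing = missing_files("./xtts-v2-dhivehi")
# if missing:
#     raise FileNotFoundError(f"missing model files: {missing}")
```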

## Limitations

- Voice-cloning quality depends on the reference audio; 5-15 seconds of clean speech is recommended
- Text longer than ~300 characters may be truncated
- Some rare Dhivehi words may be mispronounced
- The model is still being actively trained; newer checkpoints may be uploaded
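Because inputs beyond roughly 300 characters may be truncated, long passages can be split at word boundaries and synthesized chunk by chunk. A minimal sketch (the 300-character limit is taken from the note above; the commented concatenation step assumes the `model.inference` call from the Usage section):

```python
def chunk_text(text, max_chars=300):
    """Split text into whitespace-delimited chunks of at most max_chars characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

# Example usage, synthesizing each chunk and joining the waveforms:
# wavs = [model.inference(text=c, language="dv", gpt_cond_latent=gpt_cond_latent,
#                         speaker_embedding=speaker_embedding)["wav"]
#         for c in chunk_text(long_text)]
# full = torch.cat([torch.tensor(w) for w in wavs])
```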

## License

This model inherits the Coqui Public Model License from the base XTTS v2.0 model.