widget:
  - src: https://huggingface.co/spaces/abidlabs/xtts-v2/raw/main/app.py
    example_title: Text-to-Speech
    inputs:
      - interface: text
        label: Text Input
        value: >-
          Ekikokyo kino nakyo kyetaagisa omuntu okutunula n'alengera ekintu
          ekyakula nga kyefaanaanyirizaako ennyumba.
      - interface: audio
        label: Speaker Reference
        value: https://huggingface.co/coqui/XTTS-v2/resolve/main/female.wav
      - interface: slider
        label: Temperature
        value: 0.75
        minimum: 0
        maximum: 1
        step: 0.05
      - interface: slider
        label: Top-P
        value: 0.85
        minimum: 0
        maximum: 1
        step: 0.05
      - interface: slider
        label: Top-K
        value: 50
        minimum: 1
        maximum: 100
        step: 1
      - interface: slider
        label: Repetition Penalty
        value: 5
        minimum: 1
        maximum: 10
        step: 0.1

XTTS Luganda Fine-tuned Model

This is a fine-tuned XTTS model for the Luganda language, trained using the Common Voice Luganda dataset.

Model Details

  • Base Model: Coqui XTTS v2
  • Language: Luganda (lg)
  • Dataset: Common Voice Luganda
  • Fine-tuning Date: May 2024

How to Use

This model loads with the Coqui TTS library in the same way as other XTTS models. Inference requires a reference audio clip of the target speaker.

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import torch

# Load config
config = XttsConfig()
config.load_json("config.json")

# Init model
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="best_model.pth",
    vocab_path="vocab.json",
    eval=True,
    use_deepspeed=False,
)

# Move model to GPU if available
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(DEVICE)

# Generate speaker latents from a reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["path/to/speaker_reference.wav"],
    gpt_cond_len=config.gpt_cond_len,
    max_ref_length=config.max_ref_len,
    sound_norm_refs=config.sound_norm_refs,
)

# Synthesize text
text = "Yasalawo kutandika kusuubula mwanyi."
output = model.inference(
    text=text,
    language='lg',
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.75,
    top_p=0.85,
    top_k=50,
    repetition_penalty=5.0,
    enable_text_splitting=True,
)

# output['wav'] holds the synthesized waveform.
# Save it to a file or play it back.
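
The last step, saving the audio, can be sketched with the Python standard library alone. The helper name save_wav is hypothetical and not part of the TTS library. It assumes output['wav'] is a one-dimensional sequence of float samples in the range [-1.0, 1.0]. The 24 kHz sample rate is also an assumption, based on the XTTS v2 default; confirm it against the audio settings in config.json.

```python
import wave
from array import array


def save_wav(samples, path, sample_rate=24000):
    """Write mono float samples to a 16-bit PCM WAV file.

    `samples` is any iterable of floats, assumed to lie in [-1.0, 1.0];
    values outside that range are clipped. `sample_rate` defaults to
    24000 Hz, which is an assumption (see the note above).
    Returns the number of frames written.
    """
    # Clip each sample, then scale it to the signed 16-bit integer range.
    pcm = array(
        "h",
        (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples),
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)


# Hypothetical usage with the inference result from the example above:
# save_wav(output["wav"], "output.wav")
```

Libraries such as torchaudio or soundfile can write the file as well; the standard-library version is shown only because it adds no dependencies.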