---
widget:
- src: https://huggingface.co/spaces/abidlabs/xtts-v2/raw/main/app.py
  example_title: Text-to-Speech
  inputs:
  - interface: text
    label: Text Input
    value: "Ekikokyo kino nakyo kyetaagisa omuntu okutunula n'alengera ekintu ekyakula nga kyefaanaanyirizaako ennyumba."
  - interface: audio
    label: Speaker Reference
    value: https://huggingface.co/coqui/XTTS-v2/resolve/main/female.wav
  - interface: slider
    label: Temperature
    value: 0.75
    minimum: 0
    maximum: 1
    step: 0.05
  - interface: slider
    label: Top-P
    value: 0.85
    minimum: 0
    maximum: 1
    step: 0.05
  - interface: slider
    label: Top-K
    value: 50
    minimum: 1
    maximum: 100
    step: 1
  - interface: slider
    label: Repetition Penalty
    value: 5
    minimum: 1
    maximum: 10
    step: 0.1
---
# XTTS Luganda Fine-tuned Model
This is a fine-tuned XTTS model for the Luganda language, trained using the Common Voice Luganda dataset.
## Model Details
- **Base Model:** Coqui XTTS v2
- **Language:** Luganda (lg)
- **Dataset:** Common Voice Luganda
- **Fine-tuning Date:** May 2024
## How to use
Load the model with the Coqui `TTS` library, as you would any other XTTS model. Inference requires a reference audio clip of the target speaker. The example below expects `config.json`, `best_model.pth`, and `vocab.json` in the working directory.
```python
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load config
config = XttsConfig()
config.load_json("config.json")

# Init model
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="best_model.pth",
    vocab_path="vocab.json",
    eval=True,
    use_deepspeed=False,
)

# Move model to GPU if available
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(DEVICE)

# Generate speaker latents from a reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["path/to/speaker_reference.wav"],
    gpt_cond_len=config.gpt_cond_len,
    max_ref_length=config.max_ref_len,
    sound_norm_refs=config.sound_norm_refs,
)

# Synthesize text
text = "Yasalawo kutandika kusuubula mwanyi."
output = model.inference(
    text=text,
    language='lg',
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.75,
    top_p=0.85,
    top_k=50,
    repetition_penalty=5.0,
    enable_text_splitting=True,
)

# The synthesized audio is in output['wav']
# You can save it or play it.
```
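To save the result, the waveform in `output['wav']` can be written to a WAV file. The following is a minimal sketch using only the Python standard library. The helper `save_wav` and the file name `output.wav` are hypothetical and not part of the `TTS` library. The sketch assumes the waveform is a one-dimensional sequence of float samples in the range [-1.0, 1.0] and that the sample rate is 24 kHz, the usual XTTS v2 output rate. If your `config.json` sets a different `audio.output_sample_rate`, pass that value instead. A short synthetic tone stands in for the model output so the snippet runs on its own.

```python
import math
import struct
import wave


def save_wav(samples, path, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper; sample_rate=24000 is an assumption about the model output.
    """
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for output['wav']: half a second of a 440 Hz tone at 24 kHz.
demo = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(12000)]
save_wav(demo, "output.wav")
```

With the real model, replace `demo` with `output['wav']`, for example `save_wav(output['wav'], "output.wav")`.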