|
|
| --- |
| widget: |
| - src: https://huggingface.co/spaces/abidlabs/xtts-v2/raw/main/app.py |
|   example_title: Text-to-Speech |
|   inputs: |
|   - interface: text |
|     label: Text Input |
|     value: "Ekikokyo kino nakyo kyetaagisa omuntu okutunula n'alengera ekintu ekyakula nga kyefaanaanyirizaako ennyumba." |
|   - interface: audio |
|     label: Speaker Reference |
|     value: https://huggingface.co/coqui/XTTS-v2/resolve/main/female.wav |
|   - interface: slider |
|     label: Temperature |
|     value: 0.75 |
|     minimum: 0 |
|     maximum: 1 |
|     step: 0.05 |
|   - interface: slider |
|     label: Top-P |
|     value: 0.85 |
|     minimum: 0 |
|     maximum: 1 |
|     step: 0.05 |
|   - interface: slider |
|     label: Top-K |
|     value: 50 |
|     minimum: 1 |
|     maximum: 100 |
|     step: 1 |
|   - interface: slider |
|     label: Repetition Penalty |
|     value: 5 |
|     minimum: 1 |
|     maximum: 10 |
|     step: 0.1 |
| --- |
| # XTTS Luganda Fine-tuned Model |
|
|
| This is a Coqui XTTS v2 model fine-tuned for Luganda text-to-speech on the Common Voice Luganda dataset. |
|
|
| ## Model Details |
|
|
| - **Base Model:** Coqui XTTS v2 |
| - **Language:** Luganda (lg) |
| - **Dataset:** Common Voice Luganda |
| - **Fine-tuning Date:** May 2024 |
|
|
| ## How to use |
|
|
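| Load this model with the Coqui `TTS` library in the same way as other XTTS checkpoints. Inference requires a speaker reference audio file, which determines the voice of the synthesized speech. The example below assumes `config.json`, `best_model.pth`, and `vocab.json` are in the working directory. |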
|
|
| ```python |
| from TTS.tts.configs.xtts_config import XttsConfig |
| from TTS.tts.models.xtts import Xtts |
| import torch |
| |
| # Load config |
| config = XttsConfig() |
| config.load_json("config.json") |
| |
| # Init model |
| model = Xtts.init_from_config(config) |
| model.load_checkpoint( |
|     config, |
|     checkpoint_path="best_model.pth", |
|     vocab_path="vocab.json", |
|     eval=True, |
|     use_deepspeed=False, |
| ) |
| |
| # Move model to GPU if available |
| DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu' |
| model.to(DEVICE) |
| |
| # Generate speaker latents from a reference audio |
| gpt_cond_latent, speaker_embedding = model.get_conditioning_latents( |
|     audio_path=["path/to/speaker_reference.wav"], |
|     gpt_cond_len=config.gpt_cond_len, |
|     max_ref_length=config.max_ref_len, |
|     sound_norm_refs=config.sound_norm_refs, |
| ) |
| |
| # Synthesize text |
| text = "Yasalawo kutandika kusuubula mwanyi." |
| output = model.inference( |
|     text=text, |
|     language='lg', |
|     gpt_cond_latent=gpt_cond_latent, |
|     speaker_embedding=speaker_embedding, |
|     temperature=0.75, |
|     top_p=0.85, |
|     top_k=50, |
|     repetition_penalty=5.0, |
|     enable_text_splitting=True, |
| ) |
| |
| # output['wav'] holds the synthesized waveform as an array of float samples. |
| # Save it to a file or play it back at the model's output sample rate. |
| ``` |
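
The example above leaves the waveform in `output['wav']` without showing how to write it to disk. The following is a minimal sketch using only the Python standard library. It assumes the waveform is a mono sequence of floats in the range [-1, 1] sampled at 24 kHz, which is the usual XTTS v2 output rate; confirm this against `config.audio.output_sample_rate` for your checkpoint. `save_wav` is a hypothetical helper written for illustration, not part of the `TTS` library, and the sine tone merely stands in for real model output.

```python
import math
import struct
import wave


def save_wav(samples, path, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    sample_rate defaults to 24000 Hz, an assumption based on the usual
    XTTS v2 output rate. Returns the number of frames written.
    """
    ints = []
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        ints.append(int(round(clipped * 32767)))  # scale to the int16 range
    # WAV files store samples as little-endian signed 16-bit integers.
    frames = struct.pack("<%dh" % len(ints), *ints)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)
    return len(ints)


# Stand-in for output['wav']: one second of a 440 Hz tone at half amplitude.
demo = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
frames_written = save_wav(demo, "output.wav")
```

With the model loaded as shown earlier, the equivalent call would be `save_wav(output['wav'], "output.wav")`.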
|
|