---
language:
- rw
tags:
- text-to-speech
- tts
- xtts
- kinyarwanda
- african-languages
pipeline_tag: text-to-speech
---

# XTTS v2 — Kinyarwanda

A fine-tuned [Coqui XTTS v2](https://huggingface.co/coqui/XTTS-v2) text-to-speech model for **Kinyarwanda (rw)**, trained on speech from Mozilla Common Voice.

## Usage

### Requirements

The upstream `TTS` package requires a patched installation. Clone the fine-tuning repo and install its dependencies:

```bash
git clone https://github.com/Alexgichamba/XTTSv2-Finetuning-for-New-Languages.git
cd XTTSv2-Finetuning-for-New-Languages
pip install -r requirements.txt
```

### Quick Start

```python
import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    use_deepspeed=False,
)
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Get speaker embedding from a reference audio clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference_speaker.wav",
    gpt_cond_len=model.config.gpt_cond_len,
    max_ref_length=model.config.max_ref_len,
    sound_norm_refs=model.config.sound_norm_refs,
)

# Synthesize
result = model.inference(
    text="Ndashaka amazi n'ibiryo",
    language="rw",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.1,
    length_penalty=1.0,
    repetition_penalty=10.0,
    top_k=10,
    top_p=0.3,
)
torchaudio.save("output.wav", torch.tensor(result["wav"]).unsqueeze(0), 24000)
```

### CLI Inference

A full inference script is included:

```bash
python inference.py \
    -t "Ndashaka amazi n'ibiryo" \
    -s reference_speaker.wav \
    -l rw \
    -o output.wav
```

## Files

- `model.pth` — Model weights (85k-step checkpoint)
- `config.json` — Model configuration
- `vocab.json` — Tokenizer vocabulary
- `inference.py` — Standalone inference script
- `reference_speaker.wav` — Sample reference audio for voice cloning
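## Notes on Sampling Settings

The Quick Start above uses deliberately conservative decoding settings (low `temperature`, small `top_k`, tight `top_p`), which keeps pronunciation stable at the cost of prosodic variety. As a rough intuition for what top-k plus top-p (nucleus) filtering does to the candidate token set, here is a minimal standalone sketch over a made-up probability distribution (the distribution and function name are illustrative, not part of the XTTS API):

```python
def top_k_top_p_filter(probs, k=10, p=0.3):
    """Keep the k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches p; renormalize the survivors."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    kept, total = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        total += pr
        if total >= p:
            break
    z = sum(pr for _, pr in kept)
    return {tok: pr / z for tok, pr in kept}

# With a peaked (confident) distribution and p=0.3, only the top token survives,
# so decoding becomes nearly greedy:
probs = {"a": 0.5, "b": 0.2, "c": 0.15, "d": 0.1, "e": 0.05}
print(top_k_top_p_filter(probs))  # {'a': 1.0}
```

Raising `temperature` or `top_p` widens this candidate set, which tends to produce more varied but occasionally less faithful speech.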
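## Saving Audio Without torchaudio

The Quick Start saves output with `torchaudio`, but the synthesized waveform in `result["wav"]` is just a mono float sequence at 24 kHz with values in roughly [-1, 1], so it can also be written with Python's standard-library `wave` module. A minimal sketch, assuming that input format (`save_wav` is a hypothetical helper, not part of the `TTS` package):

```python
import struct
import wave

def save_wav(path, samples, sample_rate=24000):
    """Write a mono float waveform (values in [-1, 1]) as 16-bit PCM WAV."""
    # Clip to [-1, 1] and quantize to little-endian signed 16-bit integers
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)          # mono
        wf.setsampwidth(2)          # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)

# Example: write half a second of silence
save_wav("silence.wav", [0.0] * 12000)
```

For the Quick Start output, `save_wav("output.wav", result["wav"])` would replace the `torchaudio.save` call.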