---
language:
- rw
tags:
- text-to-speech
- tts
- xtts
- kinyarwanda
- african-languages
pipeline_tag: text-to-speech
---

# XTTS v2 – Kinyarwanda

A fine-tuned [Coqui XTTS v2](https://huggingface.co/coqui/XTTS-v2) text-to-speech model for **Kinyarwanda (rw)**, trained on Kinyarwanda speech from the Mozilla Common Voice dataset.

## Usage

### Requirements

The upstream `TTS` package requires a patched installation. Clone the fine-tuning repo and install its dependencies:

```bash
git clone https://github.com/Alexgichamba/XTTSv2-Finetuning-for-New-Languages.git
cd XTTSv2-Finetuning-for-New-Languages
pip install -r requirements.txt
```

### Quick Start

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path="model.pth", vocab_path="vocab.json", use_deepspeed=False)
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Get speaker embedding from a reference audio clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference_speaker.wav",
    gpt_cond_len=model.config.gpt_cond_len,
    max_ref_length=model.config.max_ref_len,
    sound_norm_refs=model.config.sound_norm_refs,
)

# Synthesize
result = model.inference(
    text="Ndashaka amazi n'ibiryo",
    language="rw",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.1,
    length_penalty=1.0,
    repetition_penalty=10.0,
    top_k=10,
    top_p=0.3,
)

torchaudio.save("output.wav", torch.tensor(result["wav"]).unsqueeze(0), 24000)
```
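XTTS v2 limits how many characters a single `inference` call may synthesize (roughly 250 characters for most languages), so longer passages need to be split into shorter pieces first. A minimal sentence-chunking sketch — the helper name `chunk_text` and the 200-character budget are illustrative choices, not part of this model's API:

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if appending this sentence would exceed the budget
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

paragraph = "Muraho. Ndashaka amazi n'ibiryo. Murakoze cyane."
print(chunk_text(paragraph, max_chars=30))
```

Each chunk can then be passed to `model.inference` as in the Quick Start above, and the resulting waveforms concatenated before saving.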

### CLI Inference

A full inference script is included:

```bash
python inference.py \
    -t "Ndashaka amazi n'ibiryo" \
    -s reference_speaker.wav \
    -l rw \
    -o output.wav
```

## Files

- `model.pth` – Model weights (85k-step checkpoint)
- `config.json` – Model configuration
- `vocab.json` – Tokenizer vocabulary
- `inference.py` – Standalone inference script
- `reference_speaker.wav` – Sample reference audio for voice cloning
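The loading code above fails with an opaque error if any of these files is absent, so a quick pre-flight check can save a debugging round-trip. A small sketch — the helper name `missing_files` is illustrative:

```python
from pathlib import Path

# The five files listed above, all expected alongside the loading script
REQUIRED_FILES = ["model.pth", "config.json", "vocab.json", "inference.py", "reference_speaker.wav"]

def missing_files(model_dir: str) -> list[str]:
    """Return the required files that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

missing = missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
```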