---
language:
- rw
tags:
- text-to-speech
- tts
- xtts
- kinyarwanda
- african-languages
pipeline_tag: text-to-speech
---
# XTTS v2 - Kinyarwanda
A fine-tuned [Coqui XTTS v2](https://huggingface.co/coqui/XTTS-v2) text-to-speech model for **Kinyarwanda (rw)**, trained on speech from the Mozilla Common Voice dataset.
## Usage
### Requirements
The upstream `TTS` package requires a patched installation. Clone the fine-tuning repo and install its dependencies:
```bash
git clone https://github.com/Alexgichamba/XTTSv2-Finetuning-for-New-Languages.git
cd XTTSv2-Finetuning-for-New-Languages
pip install -r requirements.txt
```
### Quick Start
```python
import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned checkpoint
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    use_deepspeed=False,
)
model.to(device)

# Compute conditioning latents and a speaker embedding from a reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference_speaker.wav",
    gpt_cond_len=model.config.gpt_cond_len,
    max_ref_length=model.config.max_ref_len,
    sound_norm_refs=model.config.sound_norm_refs,
)

# Synthesize
result = model.inference(
    text="Ndashaka amazi n'ibiryo",
    language="rw",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.1,
    length_penalty=1.0,
    repetition_penalty=10.0,
    top_k=10,
    top_p=0.3,
)

# XTTS outputs 24 kHz audio
torchaudio.save("output.wav", torch.tensor(result["wav"]).unsqueeze(0), 24000)
```
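XTTS tends to work best on short utterances, so long passages are usually split into sentence-sized chunks and synthesized one at a time. A minimal sketch of such a splitting helper (pure Python; the function name and the 200-character limit are our own choices, not part of this model card):

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence boundaries, then pack sentences into
    chunks no longer than max_chars. A single overlong sentence is
    kept whole rather than cut mid-word."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `model.inference(...)` and the resulting waveforms concatenated before saving.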
### CLI Inference
A full inference script is included:
```bash
python inference.py \
-t "Ndashaka amazi n'ibiryo" \
-s reference_speaker.wav \
-l rw \
-o output.wav
```
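For batch synthesis, the same script can be driven from Python by building one command per input line and running each with `subprocess.run`. A sketch under our own assumptions: the helper name, the input file of one sentence per line, and the output naming scheme are illustrative, not part of the repo:

```python
from pathlib import Path


def synthesize_lines(text_file: str, speaker_wav: str, out_dir: str = "out") -> list[list[str]]:
    """Build one inference.py command per non-empty line of text_file.
    Returns the argument lists; execute each with subprocess.run."""
    Path(out_dir).mkdir(exist_ok=True)
    commands = []
    for i, line in enumerate(Path(text_file).read_text(encoding="utf-8").splitlines()):
        line = line.strip()
        if not line:
            continue  # skip blank lines, but keep line numbering stable
        commands.append([
            "python", "inference.py",
            "-t", line,
            "-s", speaker_wav,
            "-l", "rw",
            "-o", f"{out_dir}/line_{i:03d}.wav",
        ])
    return commands
```

Usage: `for cmd in synthesize_lines("sentences.txt", "reference_speaker.wav"): subprocess.run(cmd, check=True)`.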
## Files
- `model.pth` - model weights (85k-step checkpoint)
- `config.json` - model configuration
- `vocab.json` - tokenizer vocabulary
- `inference.py` - standalone inference script
- `reference_speaker.wav` - sample reference audio for voice cloning