---
language:
  - rw
tags:
  - text-to-speech
  - tts
  - xtts
  - kinyarwanda
  - african-languages
pipeline_tag: text-to-speech
---

# XTTS v2 — Kinyarwanda

A fine-tuned [Coqui XTTS v2](https://huggingface.co/coqui/XTTS-v2) text-to-speech model for **Kinyarwanda (rw)**, trained on speech data from [Mozilla Common Voice](https://commonvoice.mozilla.org/).

## Usage

### Requirements

Inference requires a patched build of the `TTS` package that supports added languages. Clone the fine-tuning repo and install its dependencies:

```bash
git clone https://github.com/Alexgichamba/XTTSv2-Finetuning-for-New-Languages.git
cd XTTSv2-Finetuning-for-New-Languages
pip install -r requirements.txt
```

### Quick Start

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    use_deepspeed=False,
)
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Get speaker embedding from a reference audio clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference_speaker.wav",
    gpt_cond_len=model.config.gpt_cond_len,
    max_ref_length=model.config.max_ref_len,
    sound_norm_refs=model.config.sound_norm_refs,
)

# Synthesize
result = model.inference(
    text="Ndashaka amazi n'ibiryo",
    language="rw",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.1,
    length_penalty=1.0,
    repetition_penalty=10.0,
    top_k=10,
    top_p=0.3,
)

torchaudio.save("output.wav", torch.tensor(result["wav"]).unsqueeze(0), 24000)
```
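XTTS models typically cap the input length per `inference` call, so longer passages are usually synthesized sentence by sentence and the waveforms concatenated. A minimal chunking sketch (the regex splitter and the 250-character budget are illustrative assumptions, not values read from this checkpoint's config):

```python
import re

def chunk_text(text: str, max_chars: int = 250) -> list[str]:
    """Split text into sentence-aligned chunks of at most max_chars characters.

    Note: max_chars=250 is an assumed budget; check config.json for the
    actual per-language limit of this checkpoint.
    """
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `model.inference(...)` in turn and the resulting waveforms concatenated before saving.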

### CLI Inference

A full inference script is included:

```bash
python inference.py \
  -t "Ndashaka amazi n'ibiryo" \
  -s reference_speaker.wav \
  -l rw \
  -o output.wav
```

## Files

- `model.pth` — Model weights (85k-step checkpoint)
- `config.json` — Model configuration
- `vocab.json` — Tokenizer vocabulary
- `inference.py` — Standalone inference script
- `reference_speaker.wav` — Sample reference audio for voice cloning
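
To use the checkpoint outside a full repo checkout, the files above can also be fetched programmatically with `huggingface_hub` (the `repo_id` argument is a placeholder for this model's Hub id, which you should substitute):

```python
# Files published with this checkpoint (see the list above)
FILES = ["model.pth", "config.json", "vocab.json", "inference.py", "reference_speaker.wav"]

def fetch_checkpoint(repo_id: str) -> dict[str, str]:
    """Download the checkpoint files and return local paths keyed by filename."""
    # Imported here so the file list is usable without huggingface_hub installed
    from huggingface_hub import hf_hub_download
    return {name: hf_hub_download(repo_id=repo_id, filename=name) for name in FILES}
```

For example, `paths = fetch_checkpoint("<your-username>/<this-repo>")` returns a dict whose `"model.pth"` and `"config.json"` entries can be passed to `load_checkpoint` and `XttsConfig.load_json` in the Quick Start above.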