---
language:
- rw
tags:
- text-to-speech
- tts
- xtts
- kinyarwanda
- african-languages
pipeline_tag: text-to-speech
---

# XTTS v2 – Kinyarwanda

A fine-tuned [Coqui XTTS v2](https://huggingface.co/coqui/XTTS-v2) text-to-speech model for **Kinyarwanda (rw)**, trained on Kinyarwanda speech from the Mozilla Common Voice dataset.

## Usage

### Requirements

The upstream `TTS` package requires a patched installation. Clone the fine-tuning repo and install its dependencies:

```bash
git clone https://github.com/Alexgichamba/XTTSv2-Finetuning-for-New-Languages.git
cd XTTSv2-Finetuning-for-New-Languages
pip install -r requirements.txt
```

### Quick Start

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path="model.pth", vocab_path="vocab.json", use_deepspeed=False)
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Get speaker embedding from a reference audio clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference_speaker.wav",
    gpt_cond_len=model.config.gpt_cond_len,
    max_ref_length=model.config.max_ref_len,
    sound_norm_refs=model.config.sound_norm_refs,
)

# Synthesize
result = model.inference(
    text="Ndashaka amazi n'ibiryo",
    language="rw",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.1,
    length_penalty=1.0,
    repetition_penalty=10.0,
    top_k=10,
    top_p=0.3,
)

torchaudio.save("output.wav", torch.tensor(result["wav"]).unsqueeze(0), 24000)
```
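XTTS v2 limits how many characters a single `inference` call may synthesize (roughly 250 characters for most languages), so longer passages need to be split into shorter pieces first. A minimal sentence-chunking sketch — the helper name `chunk_text` and the 200-character budget are illustrative choices, not part of this model's API:

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if appending this sentence would exceed the budget
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

paragraph = "Muraho. Ndashaka amazi n'ibiryo. Murakoze cyane."
print(chunk_text(paragraph, max_chars=30))
```

Each chunk can then be passed to `model.inference` as in the Quick Start above, and the resulting waveforms concatenated before saving.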

### CLI Inference

A full inference script is included:

```bash
python inference.py \
    -t "Ndashaka amazi n'ibiryo" \
    -s reference_speaker.wav \
    -l rw \
    -o output.wav
```

## Files

- `model.pth` – Model weights (85k-step checkpoint)
- `config.json` – Model configuration
- `vocab.json` – Tokenizer vocabulary
- `inference.py` – Standalone inference script
- `reference_speaker.wav` – Sample reference audio for voice cloning
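The loading code above fails with an opaque error if any of these files is absent, so a quick pre-flight check can save a debugging round-trip. A small sketch — the helper name `missing_files` is illustrative:

```python
from pathlib import Path

# The five files listed above, all expected alongside the loading script
REQUIRED_FILES = ["model.pth", "config.json", "vocab.json", "inference.py", "reference_speaker.wav"]

def missing_files(model_dir: str) -> list[str]:
    """Return the required files that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

missing = missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
```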