MiraTTS Kinyarwanda (Phase 1 - Language Acquisition)
- Developed by: Professor
- License: Apache 2.0
- Finetuned from model: YatharthS/MiraTTS
- Languages: Kinyarwanda (rw), English (en)
Model Overview
This is a foundational Text-to-Speech (TTS) model for the Kinyarwanda language. It is built on the MiraTTS architecture (which utilizes a 0.5B parameter Qwen2.5 LLM backbone) and was fine-tuned to map Kinyarwanda text to its correct phonetic and acoustic representations.
Note: This is a "Phase 1" checkpoint. It was trained on a combined dataset of high-fidelity human speech and synthetic speech to teach the model the core phonetic rules, prefixes, and rhythm of Kinyarwanda. It is capable of generating intelligible Kinyarwanda speech but may exhibit occasional synthetic artifacts or hallucinated padding. A Phase 2 model (refined strictly on human data) is recommended for production use.
Training Details
The model was trained using the Unsloth framework for optimized hardware utilization. Training was intentionally halted early (around Epoch 10) to prevent the LLM backbone from memorizing the dataset and losing natural prosody.
- Dataset Size: 28,629 audio-text pairs
- Effective Batch Size: 256 (64 per device * 4 gradient accumulation steps)
- Total Steps Trained: 1,189
- Starting Loss: 10.84
- Final Loss: 5.76
- Hardware: Trained on a single NVIDIA GPU in bfloat16 precision (where supported).
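As a sanity check, the reported step count lines up with the stated early stop: with 28,629 pairs and an effective batch of 256, one epoch is roughly 112 optimizer steps, so 1,189 steps comes out to about 10.6 epochs, consistent with halting around epoch 10. A minimal sketch of that arithmetic:

```python
# Sanity-check the training arithmetic reported above.
dataset_size = 28_629          # audio-text pairs
per_device_batch = 64
grad_accum_steps = 4
total_steps = 1_189

effective_batch = per_device_batch * grad_accum_steps   # 256
steps_per_epoch = dataset_size / effective_batch        # ~111.8
epochs_trained = total_steps / steps_per_epoch          # ~10.6

print(f"effective batch: {effective_batch}")
print(f"epochs trained:  {epochs_trained:.1f}")
```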
How to Use (Inference)
Because this model utilizes the highly optimized LMDeploy backend for rapid audio generation, it requires a modern NVIDIA GPU (such as an L4 or A100) to run at full speed.
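If you are not sure whether your machine meets this requirement, a quick best-effort check is to query `nvidia-smi` directly. This is a sketch using only the standard library, so it runs even before PyTorch or MiraTTS are installed:

```python
import shutil
import subprocess

def detect_nvidia_gpu():
    """Return the name of the first visible NVIDIA GPU, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # NVIDIA driver tooling not on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    names = result.stdout.strip().splitlines()
    return names[0] if names else None

gpu = detect_nvidia_gpu()
print(gpu or "No NVIDIA GPU detected; expect a much slower fallback path.")
```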
Below is the standard inference script to generate Kinyarwanda audio using a reference voice clip.
1. Installation
Ensure you install the optimized MiraTTS library and align your PyTorch audio dependencies:
pip install git+https://github.com/ysharma3501/MiraTTS.git
# Ensure torchaudio and torchvision match your active PyTorch version
2. Python Inference Code
import torch
from mira.model import MiraTTS
from IPython.display import Audio, display
print("Loading Kinyarwanda Phase 1 Model...")
# Initialize the model directly from the Hub
mira_tts = MiraTTS("Professor/MiraTTS-Kinyarwanda-Phase1")
# Provide a path to a real, high-quality audio file to use as the voice print
reference_audio_path = "/path/to/your/reference_audio.wav"
test_text = "Muraho neza! Uyu munsi turimo kugerageza porogaramu nshya y'ikinyarwanda."
# Extract voice context and synthesize
print("Synthesizing audio...")
context_tokens = mira_tts.encode_audio(reference_audio_path)
audio = mira_tts.generate(test_text, context_tokens)
# Play the audio (if running in a Jupyter/Colab notebook)
display(Audio(audio, rate=48000))
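Outside a notebook you will likely want to write the waveform to disk instead of playing it inline. The sketch below assumes `audio` is a 1-D float waveform at 48 kHz (as implied by the `Audio(audio, rate=48000)` call above) and uses only the standard-library `wave` module:

```python
import wave
import numpy as np

def save_waveform(audio, path, rate=48_000):
    """Write a 1-D float waveform in [-1, 1] to a 16-bit mono WAV file."""
    pcm = np.clip(np.asarray(audio, dtype=np.float32), -1.0, 1.0)
    pcm = (pcm * 32767).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)    # mono
        wf.setsampwidth(2)    # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(pcm.tobytes())

# Example with a stand-in 1-second 440 Hz tone (replace with the model output)
t = np.linspace(0, 1, 48_000, endpoint=False)
save_waveform(0.1 * np.sin(2 * np.pi * 440 * t), "output.wav")
```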
Limitations
- Hardware Constraints: Requires a CUDA-enabled NVIDIA GPU. Running on older architectures (like the T4) requires bypassing the optimized pipeline and forcing float32 precision, which is significantly slower.
- End-of-Sequence Hallucinations: Because this is an LLM-based generative model, it may occasionally continue generating extra Kinyarwanda syllables after the input text is finished.
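One cheap mitigation for run-on endings is to cap the output length based on the input text. The sketch below is a crude post-processing heuristic, not part of the MiraTTS API: the 0.09 seconds-per-character rate is an assumed figure that you should calibrate against a handful of known-good generations.

```python
import numpy as np

def cap_duration(audio, text, rate=48_000, sec_per_char=0.09, margin_sec=1.5):
    """Truncate a waveform to a rough upper bound implied by the text length.

    `sec_per_char` is an assumption, not a measured property of the model;
    tune it on real Kinyarwanda outputs before relying on it.
    """
    audio = np.asarray(audio)
    max_samples = int((len(text) * sec_per_char + margin_sec) * rate)
    return audio[:max_samples]

# Example: a 10-second buffer capped for a short sentence
waveform = np.zeros(10 * 48_000, dtype=np.float32)
trimmed = cap_duration(waveform, "Muraho neza!")
```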
This model was trained 2x faster with Unsloth.