MiraTTS Kinyarwanda (Phase 1 - Language Acquisition)
- Developed by: Professor
- License: Apache 2.0
- Finetuned from model: YatharthS/MiraTTS
- Languages: Kinyarwanda (rw), English (en)
Model Overview
This is a foundational Text-to-Speech (TTS) model for the Kinyarwanda language. It is built on the MiraTTS architecture (which utilizes a 0.5B parameter Qwen2.5 LLM backbone) and was fine-tuned to map Kinyarwanda text to its correct phonetic and acoustic representations.
Note: This is a "Phase 1" checkpoint. It was trained on a combined dataset of high-fidelity human speech and synthetic speech to teach the model the core phonetic rules, prefixes, and rhythm of Kinyarwanda. It is capable of generating intelligible Kinyarwanda speech but may exhibit occasional synthetic artifacts or hallucinated padding. A Phase 2 model (refined strictly on human data) is recommended for production use.
Training Details
The model was trained using the Unsloth framework for optimized hardware utilization. Training was intentionally halted early (around Epoch 10) to prevent the LLM backbone from memorizing the dataset and losing natural prosody.
- Dataset Size: 28,629 audio-text pairs
- Effective Batch Size: 256 (64 per device * 4 gradient accumulation steps)
- Total Steps Trained: 1,189
- Starting Loss: 10.84
- Final Loss: 5.76
- Hardware: Trained on a single NVIDIA GPU in bfloat16 precision (where supported).
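As a sanity check, the reported step count lines up with the stated early stop: with 28,629 pairs and an effective batch of 256, one epoch is roughly 112 optimizer steps, so 1,189 steps comes out to about 10.6 epochs, consistent with halting around epoch 10. A minimal sketch of that arithmetic:

```python
# Sanity-check the training arithmetic reported above.
dataset_size = 28_629          # audio-text pairs
per_device_batch = 64
grad_accum_steps = 4
total_steps = 1_189

effective_batch = per_device_batch * grad_accum_steps   # 256
steps_per_epoch = dataset_size / effective_batch        # ~111.8
epochs_trained = total_steps / steps_per_epoch          # ~10.6

print(f"effective batch: {effective_batch}")
print(f"epochs trained:  {epochs_trained:.1f}")
```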
How to Use (Inference)
Because this model utilizes the highly optimized LMDeploy backend for rapid audio generation, it requires a modern NVIDIA GPU (such as an L4 or A100) to run at full speed.
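If you are not sure whether your machine meets this requirement, a quick best-effort check is to query `nvidia-smi` directly. This is a sketch using only the standard library, so it runs even before PyTorch or MiraTTS are installed:

```python
import shutil
import subprocess

def detect_nvidia_gpu():
    """Return the name of the first visible NVIDIA GPU, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # NVIDIA driver tooling not on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    names = result.stdout.strip().splitlines()
    return names[0] if names else None

gpu = detect_nvidia_gpu()
print(gpu or "No NVIDIA GPU detected; expect a much slower fallback path.")
```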
Below is the standard inference script to generate Kinyarwanda audio using a reference voice clip.
1. Installation
Ensure you install the optimized MiraTTS library and align your PyTorch audio dependencies:
pip install git+https://github.com/ysharma3501/MiraTTS.git
# Ensure torchaudio and torchvision match your active PyTorch version
2. Python Inference Code
import torch
from mira.model import MiraTTS
from IPython.display import Audio, display
print("Loading Kinyarwanda Phase 1 Model...")
# Initialize the model directly from the Hub
mira_tts = MiraTTS("Professor/MiraTTS-Kinyarwanda-Phase1")
# Provide a path to a real, high-quality audio file to use as the voice print
reference_audio_path = "/path/to/your/reference_audio.wav"
test_text = "Muraho neza! Uyu munsi turimo kugerageza porogaramu nshya y'ikinyarwanda."
# Extract voice context and synthesize
print("Synthesizing audio...")
context_tokens = mira_tts.encode_audio(reference_audio_path)
audio = mira_tts.generate(test_text, context_tokens)
# Play the audio (if running in a Jupyter/Colab notebook)
display(Audio(audio, rate=48000))
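Outside a notebook you will likely want to write the waveform to disk instead of playing it inline. The sketch below assumes `audio` is a 1-D float waveform at 48 kHz (as implied by the `Audio(audio, rate=48000)` call above) and uses only the standard-library `wave` module:

```python
import wave
import numpy as np

def save_waveform(audio, path, rate=48_000):
    """Write a 1-D float waveform in [-1, 1] to a 16-bit mono WAV file."""
    pcm = np.clip(np.asarray(audio, dtype=np.float32), -1.0, 1.0)
    pcm = (pcm * 32767).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)    # mono
        wf.setsampwidth(2)    # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(pcm.tobytes())

# Example with a stand-in 1-second 440 Hz tone (replace with the model output)
t = np.linspace(0, 1, 48_000, endpoint=False)
save_waveform(0.1 * np.sin(2 * np.pi * 440 * t), "output.wav")
```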
Limitations
- Hardware Constraints: Requires a CUDA-enabled NVIDIA GPU. Running on older architectures (like the T4) requires bypassing the optimized pipeline and forcing float32 precision, which is significantly slower.
- End-of-Sequence Hallucinations: Because this is an LLM-based generative model, it may occasionally continue generating extra Kinyarwanda syllables after the input text is finished.
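One cheap mitigation for run-on endings is to cap the output length based on the input text. The sketch below is a crude post-processing heuristic, not part of the MiraTTS API: the 0.09 seconds-per-character rate is an assumed figure that you should calibrate against a handful of known-good generations.

```python
import numpy as np

def cap_duration(audio, text, rate=48_000, sec_per_char=0.09, margin_sec=1.5):
    """Truncate a waveform to a rough upper bound implied by the text length.

    `sec_per_char` is an assumption, not a measured property of the model;
    tune it on real Kinyarwanda outputs before relying on it.
    """
    audio = np.asarray(audio)
    max_samples = int((len(text) * sec_per_char + margin_sec) * rate)
    return audio[:max_samples]

# Example: a 10-second buffer capped for a short sentence
waveform = np.zeros(10 * 48_000, dtype=np.float32)
trimmed = cap_duration(waveform, "Muraho neza!")
```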
This model was trained 2x faster with Unsloth.