---
language: lg
license: apache-2.0
library_name: nemo
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- nemo
- fastpitch
- hifigan
- luganda
- african-languages
- low-resource
- edge
- on-device
datasets:
- Sunbird/salt
---
# Ganda NeMo (Experimental) — Luganda TTS
A research-preview Luganda text-to-speech system built on NVIDIA NeMo, targeting **on-device and edge deployment on mobile phones**. Ships as two NeMo checkpoints: a FastPitch acoustic model and a HiFi-GAN vocoder.
> **Status:** experimental. Released as-is for research and evaluation.
## Model summary
| Component | File | Arch | Size |
|---|---|---|---|
| Acoustic model | `luganda_fastpitch.nemo` | FastPitch (FFTransformer, 6L, d=384) | 187 MB |
| Vocoder | `luganda_hifigan.nemo` | HiFi-GAN v1 (upsample [8,8,2,2], 512 ch) | 339 MB |
- **Language:** Luganda (ISO 639-1 `lg`)
- **Sample rate:** 22,050 Hz
- **Mel config:** 80 bins, `n_fft=1024`, `hop=256`, `win=1024`, range 0–8000 Hz
- **License:** Apache-2.0
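The mel settings above fix the time and frequency resolution the two models share. The following is a small arithmetic sketch using only the values listed in this card; the variable names are illustrative and are not NeMo config keys.

```python
# Mel front-end settings copied from this card.
SAMPLE_RATE = 22_050   # Hz
N_FFT = 1024
HOP_LENGTH = 256
WIN_LENGTH = 1024
N_MELS = 80
F_MAX_HZ = 8000

frames_per_second = SAMPLE_RATE / HOP_LENGTH   # mel frames per second of audio
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE       # time step between frames
window_ms = 1000 * WIN_LENGTH / SAMPLE_RATE    # analysis window length
linear_bins = N_FFT // 2 + 1                   # STFT bins before the mel filterbank
nyquist_hz = SAMPLE_RATE / 2                   # highest representable frequency

print(f"{frames_per_second:.2f} frames/s, hop {hop_ms:.2f} ms, window {window_ms:.2f} ms")
print(f"{linear_bins} STFT bins -> {N_MELS} mel bins, up to {F_MAX_HZ} of {nyquist_hz:.0f} Hz")
```

Each mel frame therefore covers roughly 11.6 ms of audio, which is the granularity at which duration and pitch are predicted.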
## Intended use
- Research on low-resource African-language TTS.
- Prototyping **on-device / edge** Luganda voice output on mobile (primary deployment target).
- Benchmarking and comparison against other Luganda / Bantu-language TTS systems.
## Training data
Trained on the **[Sunbird SALT](https://huggingface.co/datasets/Sunbird/salt)** Luganda subset — a mixed male/female multi-speaker corpus. Approximately **2,380 clips / ~2.69 hours** were used.
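For a sense of scale, the figures above imply short utterances. This is a back-of-the-envelope calculation from the two approximate numbers in this card, not a statistic taken from the dataset itself.

```python
# Approximate corpus figures quoted in this card.
CLIPS = 2_380
HOURS = 2.69

total_seconds = HOURS * 3600
mean_clip_seconds = total_seconds / CLIPS

print(f"total audio: ~{total_seconds:.0f} s")
print(f"mean clip length: ~{mean_clip_seconds:.1f} s")
```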
## Model architecture & training
### FastPitch (acoustic model)
- FFTransformer encoder/decoder: 6 layers, 1 head, `d_model=384`, `d_inner=1536`, dropout 0.1
- Duration + pitch predictors: 2-layer temporal predictors, filter size 256
- Learned alignment (`learn_alignment: true`), bin-loss warmup 100 epochs
- Pitch stats (z-score): μ=190.65 Hz, σ=51.25 Hz
- Optimizer: Adam, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-6, batch size 24
- Training steps: ~20,000
- NeMo version at training time: 1.8.0rc0
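To illustrate what the z-score pitch statistics mean, here is a minimal sketch of normalizing and de-normalizing an F0 contour with the μ and σ listed above. It is not NeMo's code: the function names are hypothetical, and the convention that unvoiced frames are stored as 0 Hz and left at 0 after normalization is an assumption.

```python
from typing import List

# Pitch statistics from this card.
PITCH_MEAN_HZ = 190.65
PITCH_STD_HZ = 51.25


def normalize_pitch(f0_hz: List[float]) -> List[float]:
    """Z-score voiced frames; keep unvoiced frames (assumed to be 0 Hz) at 0."""
    return [(f - PITCH_MEAN_HZ) / PITCH_STD_HZ if f > 0 else 0.0 for f in f0_hz]


def denormalize_pitch(z: List[float]) -> List[float]:
    """Map z-scores back to Hz (voiced frames only)."""
    return [v * PITCH_STD_HZ + PITCH_MEAN_HZ for v in z]


contour = [190.65, 241.90, 0.0]   # mean pitch, one sigma above, unvoiced
print(normalize_pitch(contour))
```

A frame at the corpus mean maps to 0 and a frame one standard deviation above it maps to about 1, so the pitch predictor works on values centered around zero.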
### HiFi-GAN (vocoder)
- Generator: upsample rates `[8, 8, 2, 2]`, kernel sizes `[16, 16, 4, 4]`, initial channels 512
- Resblock type 1, kernel sizes `[3, 7, 11]`, dilations `[[1,3,5], [1,3,5], [1,3,5]]`
- Optimizer: AdamW, lr=2e-4, betas=(0.8, 0.99), batch size 16
- Training steps: ~20,000
- NeMo version at training time: 1.23.0
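The generator settings have to agree with the mel hop length: the product of the upsample rates must equal the 256-sample hop so that each mel frame becomes exactly 256 audio samples. The sketch below checks this with plain arithmetic. It assumes the standard HiFi-GAN generator layout (transposed convolutions with padding `(kernel - stride) // 2`, channels halved at each stage); the helper name is hypothetical.

```python
from math import prod

# Generator settings from this card.
UPSAMPLE_RATES = [8, 8, 2, 2]
UPSAMPLE_KERNELS = [16, 16, 4, 4]
INITIAL_CHANNELS = 512
HOP_LENGTH = 256


def transposed_conv_len(length: int, kernel: int, stride: int) -> int:
    """Output length of a 1-D transposed conv with padding (kernel - stride) // 2."""
    pad = (kernel - stride) // 2
    return (length - 1) * stride - 2 * pad + kernel


total_upsampling = prod(UPSAMPLE_RATES)

# Channel count after each upsampling stage (halved every time).
channels = [INITIAL_CHANNELS // 2 ** (i + 1) for i in range(len(UPSAMPLE_RATES))]

# Follow a 10-frame mel chunk through the four stages.
length = 10
for kernel, stride in zip(UPSAMPLE_KERNELS, UPSAMPLE_RATES):
    length = transposed_conv_len(length, kernel, stride)

print(total_upsampling, channels, length)
```

Ten mel frames come out as 2,560 samples, matching 10 × 256.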
## Limitations
**Text frontend: English G2P.** Luganda does not yet have a mature open-source grapheme-to-phoneme (G2P) resource. FastPitch here uses NeMo's `EnglishPhonemesTokenizer` with `EnglishG2p` (CMUdict) as the text frontend, since Luganda's spelling is largely phonemic. Building a proper Luganda phonemizer is an obvious next step, and contributions are welcome.
**Single output voice.** The model is trained on multi-speaker data but emits one averaged voice. No speaker conditioning is available at inference.
**Low-resource training.** ~2.7 hours of speech is small for TTS. Expect audible artifacts, uneven prosody on long sentences, and reduced robustness on numerals, code-switched English, and out-of-distribution domains.
**Text normalization.** The packaged text normalizer is `nemo_text_processing.text_normalization.Normalizer` with `lang: en`. Non-trivial Luganda text normalization (e.g., number reading, abbreviations) is not handled — pre-normalize input in your pipeline.
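One way to pre-normalize input is a small pass that runs before `fastpitch.parse`. The sketch below is not part of this release: `pre_normalize` is a hypothetical helper, and the replacement table must be supplied and checked by a Luganda speaker. The single entry shown is only illustrative.

```python
import re
import unicodedata
from typing import Dict, List, Tuple


def pre_normalize(text: str, replacements: Dict[str, str]) -> Tuple[str, List[str]]:
    """Apply caller-supplied replacements and report tokens that still contain digits."""
    text = unicodedata.normalize("NFC", text)
    # Replace longer keys first so "15" would win over "1" if both were present.
    for src, dst in sorted(replacements.items(), key=lambda kv: -len(kv[0])):
        # Only match whole tokens, not digits or letters inside a longer token.
        text = re.sub(rf"(?<!\w){re.escape(src)}(?!\w)", dst, text)
    text = re.sub(r"\s+", " ", text).strip()
    unresolved = re.findall(r"\S*\d\S*", text)
    return text, unresolved


# Illustrative table entry only; verify real entries with a Luganda speaker.
table = {"2": "bbiri"}
clean, unresolved = pre_normalize("Oli otya?   2, 15", table)
print(clean)        # Oli otya? bbiri, 15
print(unresolved)   # ['15']
```

Returning the unresolved tokens lets the calling application decide whether to reject the input, spell the digits out some other way, or pass the text through unchanged.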
## Usage
```python
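import soundfile as sf
import torch
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Load both checkpoints and switch them to inference mode.
fastpitch = FastPitchModel.restore_from("luganda_fastpitch.nemo").eval()
hifigan = HifiGanModel.restore_from("luganda_hifigan.nemo").eval()

text = "Oli otya?"
with torch.no_grad():  # gradients are not needed for synthesis
    parsed = fastpitch.parse(text)  # text -> token IDs
    spectrogram = fastpitch.generate_spectrogram(tokens=parsed)  # tokens -> mel
    audio = hifigan.convert_spectrogram_to_audio(spec=spectrogram)  # mel -> waveform

# Write mono audio at the model's 22,050 Hz sample rate.
sf.write("out.wav", audio.detach().cpu().numpy().squeeze(), 22050)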
```
### Loading from the Hub
```python
from huggingface_hub import hf_hub_download
from nemo.collections.tts.models import FastPitchModel, HifiGanModel
fp = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_fastpitch.nemo")
hg = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_hifigan.nemo")
fastpitch = FastPitchModel.restore_from(fp)
hifigan = HifiGanModel.restore_from(hg)
```
## Edge / on-device deployment
The primary deployment target is mobile. Suggested paths:
- Export FastPitch and HiFi-GAN to ONNX via NeMo's exporter, then run with ONNX Runtime Mobile / ExecuTorch.
- HiFi-GAN dominates runtime; a distilled / smaller vocoder (e.g., HiFi-GAN v3 or iSTFTNet) is recommended for phone-class CPUs.
- Streaming is possible by chunking mel output and vocoding incrementally; latency budget depends on device.
Quantization, pruning, and distillation have not been applied in this release.
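The chunked-streaming idea above can be sketched independently of any particular runtime. In the sketch below, `stream_vocode` and its parameters are hypothetical, the vocoder is passed in as a plain callable, and the mel is treated as a sequence of frames along the time axis. Each chunk is vocoded with a few frames of context on both sides, and the samples belonging to that context are trimmed. The chunk and context sizes are placeholders, not tuned values.

```python
from typing import Callable, Iterator, List, Sequence

HOP = 256  # audio samples per mel frame, from the mel config in this card


def stream_vocode(
    frames: Sequence,
    vocode: Callable[[Sequence], List[float]],
    chunk: int = 32,
    context: int = 4,
    hop: int = HOP,
) -> Iterator[List[float]]:
    """Yield audio for `chunk` mel frames at a time.

    `vocode` must return `hop` samples per input frame. It also sees up to
    `context` frames on either side of the chunk; those samples are dropped.
    """
    total = len(frames)
    for start in range(0, total, chunk):
        end = min(start + chunk, total)
        lo = max(0, start - context)       # left context, clipped at the start
        hi = min(total, end + context)     # right context, clipped at the end
        audio = vocode(frames[lo:hi])
        left = (start - lo) * hop          # samples produced by the left context
        yield audio[left:left + (end - start) * hop]


# Stand-in vocoder for demonstration: repeats each frame value `hop` times.
def fake_vocoder(fs: Sequence, hop: int = 4) -> List[float]:
    return [float(f) for f in fs for _ in range(hop)]


pieces = list(stream_vocode(list(range(100)), fake_vocoder, chunk=32, context=4, hop=4))
print([len(p) for p in pieces])
```

With the card's settings, a 32-frame chunk corresponds to about 0.37 s of audio. The right-hand context adds latency because those frames must exist before a chunk can be vocoded, so the context size trades boundary artifacts against responsiveness.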
## Ethical considerations
- The speaker(s) in the Sunbird SALT corpus consented to the original dataset's terms; downstream use must respect those terms.
- Because the model emits a synthesized Luganda voice, downstream applications should disclose synthetic speech to end users where appropriate (accessibility, consent, anti-impersonation).
## Attribution
- **Model author:** Caleb Lwanga, **[Crane AI Labs](https://huggingface.co/CraneAILabs)**
- **Base framework:** [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
- **Training corpus:** [Sunbird SALT](https://huggingface.co/datasets/Sunbird/salt)
## License
Apache-2.0. Use of the model must also comply with the license of the Sunbird SALT dataset used for training.