| language: lg | |
| license: apache-2.0 | |
| library_name: nemo | |
| pipeline_tag: text-to-speech | |
| tags: | |
| - text-to-speech | |
| - tts | |
| - nemo | |
| - fastpitch | |
| - hifigan | |
| - luganda | |
| - african-languages | |
| - low-resource | |
| - edge | |
| - on-device | |
| datasets: | |
| - Sunbird/salt | |
| # Ganda NeMo (Experimental) — Luganda TTS | |
| A research-preview Luganda text-to-speech system built on NVIDIA NeMo, targeting **on-device and edge deployment on mobile phones**. Ships as two NeMo checkpoints: a FastPitch acoustic model and a HiFi-GAN vocoder. | |
| > **Status:** experimental. Released as-is for research and evaluation. | |
| ## Model summary | |
| | Component | File | Arch | Size | | |
| |---|---|---|---| | |
| | Acoustic model | `luganda_fastpitch.nemo` | FastPitch (FFTransformer, 6L, d=384) | 187 MB | | |
| | Vocoder | `luganda_hifigan.nemo` | HiFi-GAN v1 (upsample [8,8,2,2], 512 ch) | 339 MB | | |
| - **Language:** Luganda (ISO 639-1 `lg`) | |
| - **Sample rate:** 22,050 Hz | |
| - **Mel config:** 80 bins, `n_fft=1024`, `hop=256`, `win=1024`, range 0–8000 Hz | |
| - **License:** Apache-2.0 | |
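The sample rate and mel settings listed above fix the time resolution of the whole pipeline. The following is a small sketch that derives the frame rate and frame spacing from those values; the constant names are illustrative and are not taken from the NeMo config.

```python
# Values copied from the model summary above.
SAMPLE_RATE = 22_050  # Hz
HOP_LENGTH = 256      # samples between consecutive mel frames
WIN_LENGTH = 1024     # samples per analysis window
F_MAX = 8_000         # Hz, upper edge of the mel filterbank

frames_per_second = SAMPLE_RATE / HOP_LENGTH
frame_shift_ms = 1000 * HOP_LENGTH / SAMPLE_RATE
window_ms = 1000 * WIN_LENGTH / SAMPLE_RATE
nyquist_hz = SAMPLE_RATE / 2

print(f"{frames_per_second:.2f} mel frames per second")
print(f"{frame_shift_ms:.2f} ms frame shift, {window_ms:.2f} ms window")
print(f"f_max below Nyquist ({nyquist_hz:.0f} Hz): {F_MAX < nyquist_hz}")
```

In other words, one second of speech corresponds to roughly 86 mel frames, and each frame advances the signal by about 11.6 ms.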
| ## Intended use | |
| - Research on low-resource African-language TTS. | |
| - Prototyping **on-device / edge** Luganda voice output on mobile (primary deployment target). | |
| - Benchmarking and comparison against other Luganda / Bantu-language TTS systems. | |
| ## Training data | |
| Trained on the **[Sunbird SALT](https://huggingface.co/datasets/Sunbird/salt)** Luganda subset — a mixed male/female multi-speaker corpus. Approximately **2,380 clips / ~2.69 hours** were used. | |
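For a sense of scale, the reported totals imply short utterances. A back-of-the-envelope sketch, using only the approximate figures stated above:

```python
CLIPS = 2_380
HOURS = 2.69  # approximate total duration, as reported above

total_seconds = HOURS * 3600
mean_clip_seconds = total_seconds / CLIPS

print(f"total: {total_seconds:.0f} s, mean clip length: {mean_clip_seconds:.2f} s")
```

The average clip is therefore on the order of four seconds. This is a derived estimate, not a statistic published with the dataset.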
| ## Model architecture & training | |
| ### FastPitch (acoustic model) | |
| - FFTransformer encoder/decoder: 6 layers, 1 head, `d_model=384`, `d_inner=1536`, dropout 0.1 | |
| - Duration + pitch predictors: 2-layer temporal predictors, filter size 256 | |
| - Learned alignment (`learn_alignment: true`), bin-loss warmup 100 epochs | |
| - Pitch stats (z-score): μ=190.65 Hz, σ=51.25 Hz | |
| - Optimizer: Adam, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-6, batch size 24 | |
| - Training steps: ~20,000 | |
| - NeMo version at training time: 1.8.0rc0 | |
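The pitch statistics above describe a z-score transform on fundamental frequency. A minimal sketch of that transform and its inverse follows; the function names are hypothetical, and the handling of unvoiced frames (which NeMo treats separately) is deliberately left out.

```python
# Pitch statistics copied from the list above.
PITCH_MEAN_HZ = 190.65
PITCH_STD_HZ = 51.25


def normalize_pitch(f0_hz: float) -> float:
    """Map a voiced-frame F0 in Hz to a z-score."""
    return (f0_hz - PITCH_MEAN_HZ) / PITCH_STD_HZ


def denormalize_pitch(z: float) -> float:
    """Map a z-score back to F0 in Hz."""
    return z * PITCH_STD_HZ + PITCH_MEAN_HZ


# One standard deviation above the mean, back in Hz.
one_sigma_up_hz = denormalize_pitch(1.0)
print(f"+1 sigma corresponds to {one_sigma_up_hz:.2f} Hz")
```

Working in z-scores makes pitch edits easy to reason about: shifting the predicted contour by a constant moves the voice up or down in units of the training-set standard deviation.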
| ### HiFi-GAN (vocoder) | |
| - Generator: upsample rates `[8, 8, 2, 2]`, kernel sizes `[16, 16, 4, 4]`, initial channels 512 | |
| - Resblock type 1, kernel sizes `[3, 7, 11]`, dilations `[[1,3,5], [1,3,5], [1,3,5]]` | |
| - Optimizer: AdamW, lr=2e-4, betas=(0.8, 0.99), batch size 16 | |
| - Training steps: ~20,000 | |
| - NeMo version at training time: 1.23.0 | |
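The generator's upsample rates have to multiply out to the mel hop length, otherwise the vocoder would not emit exactly one hop of audio per mel frame. The sketch below checks that for the values listed above. The per-stage channel counts assume the reference HiFi-GAN design, in which channels halve at every upsampling stage; that detail is an assumption and is not stated in this card.

```python
import math

# Values copied from the lists above.
UPSAMPLE_RATES = [8, 8, 2, 2]
INITIAL_CHANNELS = 512
HOP_LENGTH = 256  # from the mel config

# Each mel frame is expanded by the product of all upsample rates.
total_upsampling = math.prod(UPSAMPLE_RATES)

# Assumption: channels halve after each upsampling stage (reference HiFi-GAN).
stage_channels = [INITIAL_CHANNELS // 2 ** (i + 1) for i in range(len(UPSAMPLE_RATES))]


def samples_for(mel_frames: int) -> int:
    """Number of audio samples produced for a given number of mel frames."""
    return mel_frames * total_upsampling


print(f"total upsampling: {total_upsampling} (hop length: {HOP_LENGTH})")
print(f"channels per stage: {stage_channels}")
```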
| ## Limitations | |
| **Text frontend: English G2P.** Luganda does not yet have a mature open-source grapheme-to-phoneme (G2P) resource. FastPitch here uses NeMo's `EnglishPhonemesTokenizer` with `EnglishG2p` (CMUdict) as the text frontend, a workable stopgap because Luganda's spelling is largely phonemic. Building a proper Luganda phonemizer is an obvious next step, and contributions are welcome. | |
| **Single output voice.** The model is trained on multi-speaker data but emits one averaged voice. No speaker conditioning is available at inference. | |
| **Low-resource training.** ~2.7 hours of speech is small for TTS. Expect audible artifacts, uneven prosody on long sentences, and reduced robustness on numerals, code-switched English, and out-of-distribution domains. | |
| **Text normalization.** The packaged text normalizer is `nemo_text_processing.text_normalization.Normalizer` with `lang: en`. Non-trivial Luganda text normalization (e.g., number reading, abbreviations) is not handled — pre-normalize input in your pipeline. | |
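One way to approach pre-normalization is a small cleanup pass that runs before the text reaches `fastpitch.parse`. The sketch below is illustrative only: `pre_normalize` is a hypothetical helper, the abbreviation table must be supplied by the caller, and number reading is not attempted. Tokens containing digits are merely reported so that the calling pipeline can spell them out.

```python
import re
import unicodedata


def pre_normalize(text: str, abbreviations: dict) -> tuple:
    """Clean up text and report tokens that still need manual expansion.

    Returns (cleaned_text, unresolved_tokens).
    """
    # Use one canonical Unicode form so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)

    # Expand caller-supplied abbreviations, matching standalone tokens only.
    for abbr, expansion in abbreviations.items():
        pattern = r"(?<!\w)" + re.escape(abbr) + r"(?!\w)"
        text = re.sub(pattern, lambda _match, e=expansion: e, text)

    # Collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text).strip()

    # Digits are not verbalized here; flag them for the caller.
    unresolved = [tok for tok in text.split() if any(ch.isdigit() for ch in tok)]
    return text, unresolved


# "ABBR" and "EXPANSION" are placeholders, not real Luganda entries.
cleaned, unresolved = pre_normalize("  Oli   otya?  ABBR 25 ", {"ABBR": "EXPANSION"})
print(cleaned)
print(unresolved)
```

Reporting unresolved tokens rather than guessing keeps the failure visible: the English normalizer would otherwise read digits with English number words.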
| ## Usage | |
| ```python | |
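| import soundfile as sf | |
| import torch | |
| from nemo.collections.tts.models import FastPitchModel, HifiGanModel | |
| # Load both checkpoints and switch to inference mode (disables dropout). | |
| fastpitch = FastPitchModel.restore_from("luganda_fastpitch.nemo").eval() | |
| hifigan = HifiGanModel.restore_from("luganda_hifigan.nemo").eval() | |
| text = "Oli otya?" | |
| with torch.no_grad():  # no gradients are needed for synthesis | |
|     parsed = fastpitch.parse(text)  # text -> token IDs | |
|     spectrogram = fastpitch.generate_spectrogram(tokens=parsed)  # tokens -> mel | |
|     audio = hifigan.convert_spectrogram_to_audio(spec=spectrogram)  # mel -> waveform | |
| # Write mono audio at the model's 22,050 Hz sample rate. | |
| sf.write("out.wav", audio.detach().cpu().numpy().squeeze(), 22050) | |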
| ``` | |
| ### Loading from the Hub | |
| ```python | |
| from huggingface_hub import hf_hub_download | |
| from nemo.collections.tts.models import FastPitchModel, HifiGanModel | |
| fp = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_fastpitch.nemo") | |
| hg = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_hifigan.nemo") | |
| fastpitch = FastPitchModel.restore_from(fp) | |
| hifigan = HifiGanModel.restore_from(hg) | |
| ``` | |
| ## Edge / on-device deployment | |
| The primary deployment target is mobile. Suggested paths: | |
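| - Export FastPitch and HiFi-GAN to ONNX via NeMo's exporter, then run them with ONNX Runtime Mobile. ExecuTorch is an alternative on-device runtime, but it consumes models exported through PyTorch's own export flow rather than ONNX files. | |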
| - HiFi-GAN dominates runtime; a distilled / smaller vocoder (e.g., HiFi-GAN v3 or iSTFTNet) is recommended for phone-class CPUs. | |
| - Streaming is possible by chunking mel output and vocoding incrementally; latency budget depends on device. | |
| Quantization, pruning, and distillation have not been applied in this release. | |
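The chunked vocoding idea mentioned above can be sketched as a planning step that decides which mel frames go into each vocoder call. The sketch assumes each mel frame yields exactly `hop` audio samples, which follows from the upsample rates multiplying to 256. It also assumes a few frames of overlapping context on each side help hide chunk boundaries. `plan_chunks`, `kept_samples`, and the chunk and context sizes are hypothetical; suitable values depend on the device and have not been measured for this release.

```python
HOP_LENGTH = 256      # audio samples per mel frame
SAMPLE_RATE = 22_050  # Hz


def plan_chunks(total_frames: int, chunk_frames: int, context_frames: int) -> list:
    """Return (start, end, trim_left, trim_right) tuples, all in mel frames.

    Frames [start, end) are vocoded together; trim_left and trim_right are the
    context frames whose audio is discarded after vocoding.
    """
    plans = []
    for core_start in range(0, total_frames, chunk_frames):
        core_end = min(core_start + chunk_frames, total_frames)
        start = max(0, core_start - context_frames)
        end = min(total_frames, core_end + context_frames)
        plans.append((start, end, core_start - start, end - core_end))
    return plans


def kept_samples(plan: tuple) -> int:
    """Audio samples retained from one chunk after trimming its context."""
    start, end, trim_left, trim_right = plan
    return (end - start - trim_left - trim_right) * HOP_LENGTH


plans = plan_chunks(total_frames=100, chunk_frames=40, context_frames=4)
chunk_seconds = 40 * HOP_LENGTH / SAMPLE_RATE  # audio covered by one full chunk

for plan in plans:
    print(plan, kept_samples(plan))
print(f"each full chunk covers {chunk_seconds:.3f} s of audio")
```

For playback to keep up, the vocoder has to process each chunk in less time than the audio that chunk covers. Smaller chunks lower the time to first audio but spend proportionally more compute on discarded context.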
| ## Ethical considerations | |
| - The speaker(s) in the Sunbird SALT corpus consented to the original dataset's terms; downstream use must respect those terms. | |
| - Because the model emits a synthesized Luganda voice, downstream applications should disclose synthetic speech to end users where appropriate (accessibility, consent, anti-impersonation). | |
| ## Attribution | |
| - **Model author:** Caleb Lwanga, **[Crane AI Labs](https://huggingface.co/CraneAILabs)** | |
| - **Base framework:** [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) | |
| - **Training corpus:** [Sunbird SALT](https://huggingface.co/datasets/Sunbird/salt) | |
| ## License | |
| Apache-2.0. Use of the model must also comply with the license of the Sunbird SALT dataset used for training. | |