docs: rewrite model card (architecture, training, limitations, deployment)
README.md
CHANGED
|
@@ -1,34 +1,140 @@
|
|
| 1 |
---
|
| 2 |
language: lg
|
| 3 |
license: apache-2.0
|
| 4 |
-
|
| 5 |
-
|
| 6 |
---
|
| 7 |
|
| 8 |
-
# Luganda TTS
|
| 9 |
|
| 10 |
-
|
| 11 |
|
| 12 |
-
|
| 13 |
-
| Model | Description | Size |
|
| 14 |
-
|-------|-------------|------|
|
| 15 |
-
| `luganda_fastpitch.nemo` | FastPitch spectrogram generator | 187 MB |
|
| 16 |
-
| `luganda_hifigan.nemo` | HiFi-GAN neural vocoder | 339 MB |
|
| 17 |
|
| 18 |
-
##
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
-
|
| 22 |
-
|
| 23 |
|
| 24 |
## Usage
|
| 25 |
```python
|
| 26 |
from nemo.collections.tts.models import FastPitchModel, HifiGanModel
|
| 27 |
|
| 28 |
fastpitch = FastPitchModel.restore_from("luganda_fastpitch.nemo")
|
| 29 |
hifigan = HifiGanModel.restore_from("luganda_hifigan.nemo")
|
| 30 |
|
| 31 |
text = "Oli otya?"
|
| 32 |
-
|
| 33 |
-
|
| 34 |
```
|
| 1 |
---
|
| 2 |
language: lg
|
| 3 |
license: apache-2.0
|
| 4 |
+
library_name: nemo
|
| 5 |
+
pipeline_tag: text-to-speech
|
| 6 |
+
tags:
|
| 7 |
+
- text-to-speech
|
| 8 |
+
- tts
|
| 9 |
+
- nemo
|
| 10 |
+
- fastpitch
|
| 11 |
+
- hifigan
|
| 12 |
+
- luganda
|
| 13 |
+
- african-languages
|
| 14 |
+
- low-resource
|
| 15 |
+
- edge
|
| 16 |
+
- on-device
|
| 17 |
+
datasets:
|
| 18 |
+
- Sunbird/salt
|
| 19 |
---
|
| 20 |
|
| 21 |
+
# Ganda NeMo (Experimental) — Luganda TTS
|
| 22 |
|
| 23 |
+
A research-preview Luganda text-to-speech system built on NVIDIA NeMo, targeting **on-device and edge deployment on mobile phones**. It ships as two NeMo checkpoints: a FastPitch acoustic model and a HiFi-GAN vocoder.
|
| 24 |
|
| 25 |
+
> **Status:** experimental. Released as-is for research, evaluation, and partnership discussion.
|
| 26 |
|
| 27 |
+
## Model summary
|
| 28 |
+
|
| 29 |
+
| Component | File | Arch | Size |
|
| 30 |
+
|---|---|---|---|
|
| 31 |
+
| Acoustic model | `luganda_fastpitch.nemo` | FastPitch (FFTransformer, 6L, d=384) | 187 MB |
|
| 32 |
+
| Vocoder | `luganda_hifigan.nemo` | HiFi-GAN v1 (upsample [8,8,2,2], 512 ch) | 339 MB |
|
| 33 |
+
|
| 34 |
+
- **Language:** Luganda (ISO 639-1 `lg`)
|
| 35 |
+
- **Sample rate:** 22,050 Hz
|
| 36 |
+
- **Mel config:** 80 bins, `n_fft=1024`, `hop=256`, `win=1024`, range 0–8000 Hz
|
| 37 |
+
- **License:** Apache-2.0
|
| 38 |
+
|
| 39 |
+
## Intended use
|
| 40 |
+
|
| 41 |
+
- Research on low-resource African-language TTS.
|
| 42 |
+
- Prototyping **on-device / edge** Luganda voice output on mobile (primary deployment target).
|
| 43 |
+
- Benchmarking and comparison against other Luganda / Bantu-language TTS systems.
|
| 44 |
+
|
| 45 |
+
### Out of scope
|
| 46 |
+
|
| 47 |
+
- Production deployments without clinical, safety, or accessibility review.
|
| 48 |
+
- Voice cloning or speaker impersonation (model is single-speaker only, see limitations).
|
| 49 |
+
- Any use that violates the Apache-2.0 license or the licensing of the underlying Sunbird SALT corpus.
|
| 50 |
+
|
| 51 |
+
## Training data
|
| 52 |
+
|
| 53 |
+
Trained on the **[Sunbird SALT](https://huggingface.co/datasets/Sunbird/salt)** Luganda subset — a mixed male/female multi-speaker corpus. Approximately **2,380 clips / ~2.69 hours** were used.
|
| 54 |
+
|
| 55 |
+
Although the training data is multi-speaker, the acoustic model is configured with `n_speakers: 1` and **contains no speaker embedding** (verified in the checkpoint). The model therefore produces a single speaker-averaged voice and cannot synthesize a specific speaker from the corpus.
|
| 56 |
+
|
| 57 |
+
## Model architecture & training
|
| 58 |
+
|
| 59 |
+
### FastPitch (acoustic model)
|
| 60 |
+
|
| 61 |
+
- FFTransformer encoder/decoder: 6 layers, 1 head, `d_model=384`, `d_inner=1536`, dropout 0.1
|
| 62 |
+
- Duration + pitch predictors: 2-layer temporal predictors, filter size 256
|
| 63 |
+
- Learned alignment (`learn_alignment: true`), bin-loss warmup 100 epochs
|
| 64 |
+
- Pitch stats (z-score): μ=190.65 Hz, σ=51.25 Hz
|
| 65 |
+
- Optimizer: Adam, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-6, batch size 24
|
| 66 |
+
- Training steps: ~20,000
|
| 67 |
+
- NeMo version at training time: 1.8.0rc0
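The pitch statistics listed above describe a z-score transform. A minimal sketch of that arithmetic follows; the function names are hypothetical, and mapping unvoiced frames (0 Hz) to 0.0 is an assumption that this card does not confirm.

```python
# Pitch statistics quoted in the FastPitch section of this card.
PITCH_MEAN_HZ = 190.65
PITCH_STD_HZ = 51.25


def normalize_pitch(f0_hz: float) -> float:
    """Convert a pitch value in Hz to a z-score.

    Assumption: an f0 of exactly 0.0 marks an unvoiced frame and stays 0.0.
    """
    if f0_hz == 0.0:
        return 0.0
    return (f0_hz - PITCH_MEAN_HZ) / PITCH_STD_HZ


def denormalize_pitch(z: float) -> float:
    """Invert the z-score for a voiced frame, returning Hz."""
    return z * PITCH_STD_HZ + PITCH_MEAN_HZ
```

Under these statistics, a frame one standard deviation above the mean corresponds to roughly 241.9 Hz.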
|
| 68 |
+
|
| 69 |
+
### HiFi-GAN (vocoder)
|
| 70 |
+
|
| 71 |
+
- Generator: upsample rates `[8, 8, 2, 2]`, kernel sizes `[16, 16, 4, 4]`, initial channels 512
|
| 72 |
+
- Resblock type 1, kernel sizes `[3, 7, 11]`, dilations `[[1,3,5], [1,3,5], [1,3,5]]`
|
| 73 |
+
- Optimizer: AdamW, lr=2e-4, betas=(0.8, 0.99), batch size 16
|
| 74 |
+
- Training steps: ~20,000
|
| 75 |
+
- NeMo version at training time: 1.23.0
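The vocoder must emit exactly one hop of audio per mel frame, so the generator's upsample rates should multiply to the mel hop length. The sketch below checks that using only the values stated in this card; the helper name is hypothetical.

```python
from math import prod

# Values quoted in this card.
SAMPLE_RATE = 22_050
HOP_LENGTH = 256
UPSAMPLE_RATES = [8, 8, 2, 2]

# Each mel frame is expanded by the product of the upsample rates.
samples_per_frame = prod(UPSAMPLE_RATES)

# Number of mel frames FastPitch must produce per second of audio.
frames_per_second = SAMPLE_RATE / HOP_LENGTH


def mel_frames_to_seconds(n_frames: int) -> float:
    """Duration of the audio produced from n_frames mel frames."""
    return n_frames * samples_per_frame / SAMPLE_RATE
```

With these settings the acoustic model produces about 86 mel frames per second of speech, which is a useful number when sizing buffers on a phone.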
|
| 76 |
+
|
| 77 |
+
## Limitations
|
| 78 |
+
|
| 79 |
+
**Text frontend — English G2P.** Luganda does not yet have a mature open-source grapheme-to-phoneme (G2P) resource, and we did not find a training corpus large enough to reliably train one from scratch. FastPitch here uses NeMo's `EnglishPhonemesTokenizer` with `EnglishG2p` (CMUdict) as the text frontend. Because Luganda spelling is largely phonemic, words that the English G2P passes through as graphemes are pronounced reasonably well; expect degraded pronunciation on words it re-phonemizes as English. Building a proper Luganda phonemizer is an obvious next step, and contributions are welcome.
|
| 80 |
+
|
| 81 |
+
**Single output voice.** As noted above, the model is trained on multi-speaker data but emits one averaged voice. No speaker conditioning is available at inference.
|
| 82 |
+
|
| 83 |
+
**Low-resource training.** ~2.7 hours of speech is small for TTS. Expect audible artifacts, uneven prosody on long sentences, and reduced robustness on numerals, code-switched English, and out-of-distribution domains.
|
| 84 |
+
|
| 85 |
+
**Text normalization.** The packaged text normalizer is `nemo_text_processing.text_normalization.Normalizer` with `lang: en`. Non-trivial Luganda text normalization (e.g., number reading, abbreviations) is not handled — pre-normalize input in your pipeline.
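One way to act on the pre-normalization advice is to clean whitespace and flag tokens that the English normalizer is likely to mishandle before text reaches the model. The sketch below is illustrative only: `prepare_text` is a hypothetical helper, and treating any token containing a digit as "needs manual handling" is an assumption rather than part of the released pipeline.

```python
import re
import unicodedata
from typing import List, Tuple

# Assumption: tokens containing digits need Luganda-specific verbalization
# that the packaged English normalizer does not provide.
_HAS_DIGIT = re.compile(r"\d")


def prepare_text(text: str) -> Tuple[str, List[str]]:
    """Return (cleaned_text, flagged_tokens).

    Applies Unicode NFC normalization, collapses runs of whitespace, and
    lists tokens the caller should verbalize before synthesis.
    """
    cleaned = " ".join(unicodedata.normalize("NFC", text).split())
    flagged = [tok for tok in cleaned.split(" ") if _HAS_DIGIT.search(tok)]
    return cleaned, flagged
```

A caller could refuse to synthesize, or route flagged tokens to a project-specific verbalizer, whenever the second return value is non-empty.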
|
| 86 |
|
| 87 |
## Usage
|
| 88 |
+
|
| 89 |
```python
|
| 90 |
from nemo.collections.tts.models import FastPitchModel, HifiGanModel
|
| 91 |
+
import soundfile as sf
|
| 92 |
|
| 93 |
fastpitch = FastPitchModel.restore_from("luganda_fastpitch.nemo")
|
| 94 |
hifigan = HifiGanModel.restore_from("luganda_hifigan.nemo")
|
| 95 |
|
| 96 |
text = "Oli otya?"
|
| 97 |
+
parsed = fastpitch.parse(text)
|
| 98 |
+
spectrogram = fastpitch.generate_spectrogram(tokens=parsed)
|
| 99 |
+
audio = hifigan.convert_spectrogram_to_audio(spec=spectrogram)
|
| 100 |
+
|
| 101 |
+
sf.write("out.wav", audio.detach().cpu().numpy().squeeze(), 22050)
|
| 102 |
+
```
|
| 103 |
+
|
| 104 |
+
### Loading from the Hub
|
| 105 |
+
|
| 106 |
+
```python
|
| 107 |
+
from huggingface_hub import hf_hub_download
|
| 108 |
+
from nemo.collections.tts.models import FastPitchModel, HifiGanModel
|
| 109 |
+
|
| 110 |
+
fp = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_fastpitch.nemo")
|
| 111 |
+
hg = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_hifigan.nemo")
|
| 112 |
+
|
| 113 |
+
fastpitch = FastPitchModel.restore_from(fp)
|
| 114 |
+
hifigan = HifiGanModel.restore_from(hg)
|
| 115 |
```
|
| 116 |
+
|
| 117 |
+
## Edge / on-device deployment
|
| 118 |
+
|
| 119 |
+
The primary deployment target is mobile. Suggested paths:
|
| 120 |
+
|
| 121 |
+
- Export FastPitch and HiFi-GAN to ONNX via NeMo's exporter, then run with ONNX Runtime Mobile / ExecuTorch.
|
| 122 |
+
- HiFi-GAN dominates runtime; a distilled / smaller vocoder (e.g., HiFi-GAN v3 or iSTFTNet) is recommended for phone-class CPUs.
|
| 123 |
+
- Streaming is possible by chunking mel output and vocoding incrementally; latency budget depends on device.
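The chunked-vocoding idea in the last bullet can be sketched as index arithmetic over mel frames. This is a minimal sketch under stated assumptions: the function names, the default chunk size, and the use of a few context frames on each side to reduce boundary artifacts are all illustrative and not part of this release.

```python
from typing import Iterator, Tuple

HOP_LENGTH = 256  # audio samples per mel frame, from the mel config in this card


def iter_mel_chunks(
    n_frames: int, chunk: int = 64, context: int = 8
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (lo, hi, start, end) frame indices.

    Frames [lo, hi) are fed to the vocoder; only the audio for frames
    [start, end) is kept, so the context frames are discarded after vocoding.
    """
    if chunk <= 0 or context < 0:
        raise ValueError("chunk must be positive and context non-negative")
    for start in range(0, n_frames, chunk):
        end = min(start + chunk, n_frames)
        lo = max(0, start - context)
        hi = min(n_frames, end + context)
        yield lo, hi, start, end


def samples_to_keep(lo: int, start: int, end: int, hop: int = HOP_LENGTH) -> Tuple[int, int]:
    """Sample range, relative to the vocoded window, that covers frames [start, end)."""
    return (start - lo) * hop, (end - lo) * hop
```

Each window would be vocoded separately and the kept sample ranges concatenated; whether 8 context frames are enough to hide seams for this vocoder has not been measured.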
|
| 124 |
+
|
| 125 |
+
Quantization, pruning, and distillation have not been applied in this release.
|
| 126 |
+
|
| 127 |
+
## Ethical considerations
|
| 128 |
+
|
| 129 |
+
- The speaker(s) in the Sunbird SALT corpus consented to the original dataset's terms; downstream use must respect those terms.
|
| 130 |
+
- Because the model emits a synthesized Luganda voice, downstream applications should disclose synthetic speech to end users where appropriate (accessibility, consent, anti-impersonation).
|
| 131 |
+
|
| 132 |
+
## Attribution
|
| 133 |
+
|
| 134 |
+
- **Model author:** Caleb Lwanga, **[Crane AI Labs](https://huggingface.co/CraneAILabs)**
|
| 135 |
+
- **Base framework:** [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
|
| 136 |
+
- **Training corpus:** [Sunbird SALT](https://huggingface.co/datasets/Sunbird/salt)
|
| 137 |
+
|
| 138 |
+
## License
|
| 139 |
+
|
| 140 |
+
Apache-2.0. Use of the model must also comply with the license of the Sunbird SALT dataset used for training.
|