---
language: lg
license: apache-2.0
library_name: nemo
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - tts
  - nemo
  - fastpitch
  - hifigan
  - luganda
  - african-languages
  - low-resource
  - edge
  - on-device
datasets:
  - Sunbird/salt
---

# Ganda NeMo (Experimental) — Luganda TTS

A research-preview Luganda text-to-speech system built on NVIDIA NeMo, targeting **on-device and edge deployment on mobile phones**. It ships as two NeMo checkpoints: a FastPitch acoustic model and a HiFi-GAN vocoder.

> **Status:** experimental. Released as-is for research and evaluation.

## Model summary

| Component | File | Architecture | Size |
|---|---|---|---|
| Acoustic model | `luganda_fastpitch.nemo` | FastPitch (FFTransformer, 6L, d=384) | 187 MB |
| Vocoder | `luganda_hifigan.nemo` | HiFi-GAN v1 (upsample [8,8,2,2], 512 ch) | 339 MB |

- **Language:** Luganda (ISO 639-1 `lg`)
- **Sample rate:** 22,050 Hz
- **Mel config:** 80 bins, `n_fft=1024`, `hop=256`, `win=1024`, range 0–8000 Hz
- **License:** Apache-2.0
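
As a quick sanity check on the mel configuration above, the sketch below derives the frame rate and the hop and window durations from the listed sample rate, `hop`, and `win` values. The function names are illustrative and not part of NeMo.

```python
SAMPLE_RATE = 22_050  # Hz, from the model summary
HOP_LENGTH = 256      # samples between consecutive mel frames
WIN_LENGTH = 1024     # samples per analysis window


def frames_per_second(sample_rate: int = SAMPLE_RATE, hop: int = HOP_LENGTH) -> float:
    """Number of mel frames produced per second of audio."""
    return sample_rate / hop


def mel_frames_to_seconds(n_frames: int, sample_rate: int = SAMPLE_RATE,
                          hop: int = HOP_LENGTH) -> float:
    """Approximate audio duration covered by n_frames mel frames."""
    return n_frames * hop / sample_rate


print(f"{frames_per_second():.2f} mel frames per second")
print(f"{1000 * HOP_LENGTH / SAMPLE_RATE:.1f} ms per hop")
print(f"{1000 * WIN_LENGTH / SAMPLE_RATE:.1f} ms per window")
```

These numbers are useful when budgeting latency: each mel frame corresponds to roughly 11.6 ms of audio.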

## Intended use

- Research on low-resource African-language TTS.
- Prototyping **on-device / edge** Luganda voice output on mobile (primary deployment target).
- Benchmarking and comparison against other Luganda / Bantu-language TTS systems.



## Training data

Trained on the **[Sunbird SALT](https://huggingface.co/datasets/Sunbird/salt)** Luganda subset — a mixed male/female multi-speaker corpus. Approximately **2,380 clips / ~2.69 hours** were used.
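
For a sense of scale, the two approximate figures above imply an average clip length of about 4 seconds. The arithmetic, as a minimal sketch (the helper name is hypothetical):

```python
N_CLIPS = 2380      # approximate clip count from this card
TOTAL_HOURS = 2.69  # approximate total duration from this card


def average_clip_seconds(n_clips: int = N_CLIPS, total_hours: float = TOTAL_HOURS) -> float:
    """Mean clip duration in seconds implied by the corpus totals."""
    return total_hours * 3600 / n_clips


print(f"{average_clip_seconds():.2f} s per clip on average")
```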


## Model architecture & training

### FastPitch (acoustic model)

- FFTransformer encoder/decoder: 6 layers, 1 head, `d_model=384`, `d_inner=1536`, dropout 0.1
- Duration + pitch predictors: 2-layer temporal predictors, filter size 256
- Learned alignment (`learn_alignment: true`), bin-loss warmup 100 epochs
- Pitch stats (z-score): μ=190.65 Hz, σ=51.25 Hz
- Optimizer: Adam, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-6, batch size 24
- Training steps: ~20,000
- NeMo version at training time: 1.8.0rc0
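
To illustrate what the pitch statistics above are used for, here is a minimal sketch of z-score pitch normalization with the listed μ and σ. It assumes that unvoiced frames are marked with an F0 of 0 and are kept at 0 after normalization; that convention and the function names are assumptions for illustration, not a description of the exact training pipeline.

```python
PITCH_MEAN_HZ = 190.65  # mu from this card
PITCH_STD_HZ = 51.25    # sigma from this card


def normalize_pitch(f0_hz):
    """Z-score voiced frames; leave unvoiced frames (f0 == 0) at 0 (assumed convention)."""
    return [0.0 if f0 == 0 else (f0 - PITCH_MEAN_HZ) / PITCH_STD_HZ for f0 in f0_hz]


def denormalize_voiced(z):
    """Map a normalized value for a voiced frame back to Hz."""
    return z * PITCH_STD_HZ + PITCH_MEAN_HZ


contour = [0.0, 180.0, 190.65, 241.9, 0.0]  # made-up F0 values in Hz
print(normalize_pitch(contour))
```

A frame at the mean (190.65 Hz) maps to 0, and a frame one standard deviation above the mean (about 241.9 Hz) maps to roughly 1.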

### HiFi-GAN (vocoder)

- Generator: upsample rates `[8, 8, 2, 2]`, kernel sizes `[16, 16, 4, 4]`, initial channels 512
- Resblock type 1, kernel sizes `[3, 7, 11]`, dilations `[[1,3,5], [1,3,5], [1,3,5]]`
- Optimizer: AdamW, lr=2e-4, betas=(0.8, 0.99), batch size 16
- Training steps: ~20,000
- NeMo version at training time: 1.23.0
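
The generator's upsample rates have to multiply out to the mel hop length, so that each mel frame expands to exactly one hop of audio. The sketch below checks that for the values on this card. It also lists the channel width after each upsampling stage, assuming the usual HiFi-GAN convention of halving the channel count at every stage; that halving rule is an assumption about the generator layout, not something stated on this card.

```python
from math import prod

UPSAMPLE_RATES = [8, 8, 2, 2]  # from the vocoder config above
HOP_LENGTH = 256               # from the mel config
INITIAL_CHANNELS = 512         # from the vocoder config above


def samples_per_frame(rates) -> int:
    """Total upsampling factor: audio samples generated per mel frame."""
    return prod(rates)


def channels_per_stage(initial: int, n_stages: int):
    """Channel width after each upsampling stage, assuming halving per stage."""
    return [initial // 2 ** (i + 1) for i in range(n_stages)]


assert samples_per_frame(UPSAMPLE_RATES) == HOP_LENGTH
print(samples_per_frame(UPSAMPLE_RATES))
print(channels_per_stage(INITIAL_CHANNELS, len(UPSAMPLE_RATES)))
```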

## Limitations

**Text frontend (English G2P).** Luganda does not yet have a mature open-source grapheme-to-phoneme (G2P) resource. FastPitch here uses NeMo's `EnglishPhonemesTokenizer` with `EnglishG2p` (CMUdict) as the text frontend, on the grounds that Luganda's spelling is largely phonemic. Building a proper Luganda phonemizer is an obvious next step, and contributions are welcome.

**Single output voice.** The model is trained on multi-speaker data but emits one averaged voice. No speaker conditioning is available at inference.

**Low-resource training.** ~2.7 hours of speech is small for TTS. Expect audible artifacts, uneven prosody on long sentences, and reduced robustness on numerals, code-switched English, and out-of-distribution domains.

**Text normalization.** The packaged text normalizer is `nemo_text_processing.text_normalization.Normalizer` with `lang: en`. Non-trivial Luganda text normalization (e.g., number reading, abbreviations) is not handled — pre-normalize input in your pipeline.
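
Since Luganda normalization is left to the caller, one simple safeguard is to flag tokens that still contain digits or symbols before they reach the model. The sketch below is a hypothetical pre-check, not part of NeMo or this release. The character set it looks for is an assumption and should be tuned to your input domain.

```python
import re

# Characters that usually indicate text the packaged English normalizer
# will not read correctly in Luganda (assumed set; adjust as needed).
_NEEDS_NORMALIZATION = re.compile(r"[0-9%$&@#+=/]")


def find_unnormalized_tokens(text: str):
    """Return whitespace-separated tokens that contain digits or symbols."""
    return [tok for tok in text.split() if _NEEDS_NORMALIZATION.search(tok)]


flagged = find_unnormalized_tokens("Oli otya? 2024 50%")
print(flagged)
```

A pipeline could reject such input, or route the flagged tokens to a Luganda-specific expansion step before calling `fastpitch.parse`.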

## Usage

```python
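import soundfile as sf
import torch
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Load both checkpoints from local .nemo files and switch to inference mode.
fastpitch = FastPitchModel.restore_from("luganda_fastpitch.nemo").eval()
hifigan = HifiGanModel.restore_from("luganda_hifigan.nemo").eval()

text = "Oli otya?"
with torch.no_grad():
    parsed = fastpitch.parse(text)  # text -> token IDs
    spectrogram = fastpitch.generate_spectrogram(tokens=parsed)  # tokens -> mel
    audio = hifigan.convert_spectrogram_to_audio(spec=spectrogram)  # mel -> waveform

# Detach before converting to NumPy, then write mono audio at 22,050 Hz.
sf.write("out.wav", audio.detach().cpu().numpy().squeeze(), 22050)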
```

### Loading from the Hub

```python
from huggingface_hub import hf_hub_download
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

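# Download each checkpoint into the local Hugging Face cache; each call returns a file path.
fp = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_fastpitch.nemo")
hg = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_hifigan.nemo")

# Restore the models from the downloaded paths.
fastpitch = FastPitchModel.restore_from(fp)
hifigan = HifiGanModel.restore_from(hg)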
```

## Edge / on-device deployment

The primary deployment target is mobile. Suggested paths:

- Export FastPitch and HiFi-GAN to ONNX via NeMo's exporter, then run with ONNX Runtime Mobile / ExecuTorch.
- HiFi-GAN dominates runtime; a distilled / smaller vocoder (e.g., HiFi-GAN v3 or iSTFTNet) is recommended for phone-class CPUs.
- Streaming is possible by chunking mel output and vocoding incrementally; latency budget depends on device.
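
The chunked-vocoding idea in the last bullet can be sketched as a planning step: split the mel frames into fixed-size chunks, pad each chunk with a few frames of context on either side, vocode the padded slice, then trim the context from the audio. The code below only computes the slice and trim indices and does not call the vocoder. It assumes the vocoder emits exactly `HOP_LENGTH` samples per mel frame. The chunk and context sizes are placeholder values, and the function names are hypothetical.

```python
HOP_LENGTH = 256  # audio samples per mel frame, from the mel config


def plan_mel_chunks(n_frames: int, chunk_frames: int = 64, context_frames: int = 8):
    """Plan overlapping mel slices and how many audio samples to trim from each."""
    if chunk_frames <= 0 or context_frames < 0:
        raise ValueError("chunk_frames must be > 0 and context_frames >= 0")
    plans = []
    for start in range(0, n_frames, chunk_frames):
        end = min(start + chunk_frames, n_frames)
        lo = max(0, start - context_frames)       # left context, clipped at 0
        hi = min(n_frames, end + context_frames)  # right context, clipped at the end
        plans.append({
            "mel_slice": (lo, hi),
            "trim_left": (start - lo) * HOP_LENGTH,
            "trim_right": (hi - end) * HOP_LENGTH,
        })
    return plans


def kept_samples(plan) -> int:
    """Audio samples that remain after trimming the context from a vocoded slice."""
    lo, hi = plan["mel_slice"]
    return (hi - lo) * HOP_LENGTH - plan["trim_left"] - plan["trim_right"]


plans = plan_mel_chunks(150)
print(len(plans), sum(kept_samples(p) for p in plans))
```

Larger context reduces audible seams at chunk boundaries at the cost of redundant compute; the right trade-off depends on the device.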

Quantization, pruning, and distillation have not been applied in this release.

## Ethical considerations

- The speaker(s) in the Sunbird SALT corpus consented to the original dataset's terms; downstream use must respect those terms.
- Because the model emits a synthesized Luganda voice, downstream applications should disclose synthetic speech to end users where appropriate (accessibility, consent, anti-impersonation).

## Attribution

- **Model author:** Caleb Lwanga, **[Crane AI Labs](https://huggingface.co/CraneAILabs)**
- **Base framework:** [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
- **Training corpus:** [Sunbird SALT](https://huggingface.co/datasets/Sunbird/salt)

## License

Apache-2.0. Use of the model must also comply with the license of the Sunbird SALT dataset used for training.