Cal3bd3v committed on
Commit cfef2d4 · verified
Parent: 8d32fb8

docs: rewrite model card (architecture, training, limitations, deployment)

Files changed (1)
  1. README.md +122 -16
README.md CHANGED
@@ -1,34 +1,140 @@
1
  ---
2
  language: lg
3
  license: apache-2.0
4
- tags: [text-to-speech, tts, nemo, luganda, african-languages]
5
- datasets: [Sunbird/salt]
6
  ---
7
 
8
- # Luganda TTS v3
9
 
10
- Text-to-Speech system for Luganda using NVIDIA NeMo.
11
 
12
- ## Models
13
- | Model | Description | Size |
14
- |-------|-------------|------|
15
- | `luganda_fastpitch.nemo` | FastPitch spectrogram generator | 187 MB |
16
- | `luganda_hifigan.nemo` | HiFi-GAN neural vocoder | 339 MB |
17
 
18
- ## Training
19
- - **Dataset**: Sunbird/salt (~2,380 samples, 2.69 hours)
20
- - **FastPitch**: 20,000 steps
21
- - **HiFi-GAN**: 20,000 steps
22
- - **Sample Rate**: 22,050 Hz
23
 
24
  ## Usage
25
  ```python
26
  from nemo.collections.tts.models import FastPitchModel, HifiGanModel
27
 
28
  fastpitch = FastPitchModel.restore_from("luganda_fastpitch.nemo")
29
  hifigan = HifiGanModel.restore_from("luganda_hifigan.nemo")
30
 
31
  text = "Oli otya?"
32
- spec = fastpitch.generate_spectrogram(tokens=fastpitch.parse(text))
33
- audio = hifigan.convert_spectrogram_to_audio(spec=spec)
34
  ```
1
  ---
2
  language: lg
3
  license: apache-2.0
4
+ library_name: nemo
5
+ pipeline_tag: text-to-speech
6
+ tags:
7
+ - text-to-speech
8
+ - tts
9
+ - nemo
10
+ - fastpitch
11
+ - hifigan
12
+ - luganda
13
+ - african-languages
14
+ - low-resource
15
+ - edge
16
+ - on-device
17
+ datasets:
18
+ - Sunbird/salt
19
  ---
20
 
21
+ # Ganda NeMo (Experimental) — Luganda TTS
22
 
23
+ A research-preview Luganda text-to-speech system built on NVIDIA NeMo, targeting **on-device and edge deployment on mobile phones**. Ships as two NeMo checkpoints: a FastPitch acoustic model and a HiFi-GAN vocoder.
24
 
25
+ > **Status:** experimental. Released as-is for research, evaluation, and partnership discussion.
26
 
27
+ ## Model summary
28
+
29
+ | Component | File | Arch | Size |
30
+ |---|---|---|---|
31
+ | Acoustic model | `luganda_fastpitch.nemo` | FastPitch (FFTransformer, 6L, d=384) | 187 MB |
32
+ | Vocoder | `luganda_hifigan.nemo` | HiFi-GAN v1 (upsample [8,8,2,2], 512 ch) | 339 MB |
33
+
34
+ - **Language:** Luganda (ISO 639-1 `lg`)
35
+ - **Sample rate:** 22,050 Hz
36
+ - **Mel config:** 80 bins, `n_fft=1024`, `hop=256`, `win=1024`, range 0–8000 Hz
37
+ - **License:** Apache-2.0
38
+
39
+ ## Intended use
40
+
41
+ - Research on low-resource African-language TTS.
42
+ - Prototyping **on-device / edge** Luganda voice output on mobile (primary deployment target).
43
+ - Benchmarking and comparison against other Luganda / Bantu-language TTS systems.
44
+
45
+ ### Out of scope
46
+
47
+ - Production deployments without clinical, safety, or accessibility review.
48
+ - Voice cloning or speaker impersonation (model is single-speaker only, see limitations).
49
+ - Any use that violates the Apache-2.0 license or the licensing of the underlying Sunbird SALT corpus.
50
+
51
+ ## Training data
52
+
53
+ Trained on the **[Sunbird SALT](https://huggingface.co/datasets/Sunbird/salt)** Luganda subset — a mixed male/female multi-speaker corpus. Approximately **2,380 clips / ~2.69 hours** were used.
54
+
55
+ Note that although the training data is multi-speaker, the acoustic model is configured with `n_speakers: 1` and **contains no speaker embedding** (verified in the checkpoint). The model therefore produces a single speaker-averaged voice rather than per-speaker conditional synthesis.
56
+
57
+ ## Model architecture & training
58
+
59
+ ### FastPitch (acoustic model)
60
+
61
+ - FFTransformer encoder/decoder: 6 layers, 1 head, `d_model=384`, `d_inner=1536`, dropout 0.1
62
+ - Duration + pitch predictors: 2-layer temporal predictors, filter size 256
63
+ - Learned alignment (`learn_alignment: true`), bin-loss warmup 100 epochs
64
+ - Pitch stats (z-score): μ=190.65 Hz, σ=51.25 Hz
65
+ - Optimizer: Adam, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-6, batch size 24
66
+ - Training steps: ~20,000
67
+ - NeMo version at training time: 1.8.0rc0
68
+
69
+ ### HiFi-GAN (vocoder)
70
+
71
+ - Generator: upsample rates `[8, 8, 2, 2]`, kernel sizes `[16, 16, 4, 4]`, initial channels 512
72
+ - Resblock type 1, kernel sizes `[3, 7, 11]`, dilations `[[1,3,5], [1,3,5], [1,3,5]]`
73
+ - Optimizer: AdamW, lr=2e-4, betas=(0.8, 0.99), batch size 16
74
+ - Training steps: ~20,000
75
+ - NeMo version at training time: 1.23.0
76
+
77
+ ## Limitations
78
+
79
+ **Text frontend — English G2P.** Luganda does not yet have a mature open-source grapheme-to-phoneme (G2P) resource, and we did not find a training corpus large enough to reliably train one from scratch. FastPitch here uses NeMo's `EnglishPhonemesTokenizer` with `EnglishG2p` (CMUdict) as the text frontend. Luganda spelling is largely phonemic, so this works reasonably via grapheme pass-through, but expect degraded pronunciation on words the English G2P re-phonemizes. Building a proper Luganda phonemizer is an obvious next step and contributions are welcome.
80
+
81
+ **Single output voice.** As noted above, the model is trained on multi-speaker data but emits one averaged voice. No speaker conditioning is available at inference.
82
+
83
+ **Low-resource training.** ~2.7 hours of speech is small for TTS. Expect audible artifacts, uneven prosody on long sentences, and reduced robustness on numerals, code-switched English, and out-of-distribution domains.
84
+
85
+ **Text normalization.** The packaged text normalizer is `nemo_text_processing.text_normalization.Normalizer` with `lang: en`. Non-trivial Luganda text normalization (e.g., number reading, abbreviations) is not handled — pre-normalize input in your pipeline.
86
 
87
  ## Usage
88
+
89
  ```python
90
  from nemo.collections.tts.models import FastPitchModel, HifiGanModel
91
+ import soundfile as sf
92
 
93
  fastpitch = FastPitchModel.restore_from("luganda_fastpitch.nemo")
94
  hifigan = HifiGanModel.restore_from("luganda_hifigan.nemo")
95
 
96
  text = "Oli otya?"
97
+ parsed = fastpitch.parse(text)
98
+ spectrogram = fastpitch.generate_spectrogram(tokens=parsed)
99
+ audio = hifigan.convert_spectrogram_to_audio(spec=spectrogram)
100
+
101
+ sf.write("out.wav", audio.detach().cpu().numpy().squeeze(), 22050)
102
+ ```
103
+
104
+ ### Loading from the Hub
105
+
106
+ ```python
107
+ from huggingface_hub import hf_hub_download
108
+ from nemo.collections.tts.models import FastPitchModel, HifiGanModel
109
+
110
+ fp = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_fastpitch.nemo")
111
+ hg = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_hifigan.nemo")
112
+
113
+ fastpitch = FastPitchModel.restore_from(fp)
114
+ hifigan = HifiGanModel.restore_from(hg)
115
  ```
116
+
117
+ ## Edge / on-device deployment
118
+
119
+ The primary deployment target is mobile. Suggested paths:
120
+
121
+ - Export FastPitch and HiFi-GAN to ONNX via NeMo's exporter, then run with ONNX Runtime Mobile / ExecuTorch.
122
+ - HiFi-GAN dominates runtime; a distilled / smaller vocoder (e.g., HiFi-GAN v3 or iSTFTNet) is recommended for phone-class CPUs.
123
+ - Streaming is possible by chunking mel output and vocoding incrementally; latency budget depends on device.
124
+
125
+ Quantization, pruning, and distillation have not been applied in this release.
126
+
127
+ ## Ethical considerations
128
+
129
+ - The speaker(s) in the Sunbird SALT corpus consented to the original dataset's terms; downstream use must respect those terms.
130
+ - Because the model emits a synthesized Luganda voice, downstream applications should disclose synthetic speech to end users where appropriate (accessibility, consent, anti-impersonation).
131
+
132
+ ## Attribution
133
+
134
+ - **Model author:** Caleb Lwanga, **[Crane AI Labs](https://huggingface.co/CraneAILabs)**
135
+ - **Base framework:** [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
136
+ - **Training corpus:** [Sunbird SALT](https://huggingface.co/datasets/Sunbird/salt)
137
+
138
+ ## License
139
+
140
+ Apache-2.0. Use of the model must also comply with the license of the Sunbird SALT dataset used for training.
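
Review note: the rewritten card states a mel hop of 256 samples, a 22,050 Hz sample rate, and HiFi-GAN upsample rates of `[8, 8, 2, 2]`. For a mel-conditioned vocoder these figures should agree, since the generator has to expand each mel frame into exactly one hop of audio. The sketch below checks that using only the numbers quoted in the card; the constant and function names are hypothetical and do not come from the NeMo configs.

```python
from math import prod

# Values copied from the model card; the names are illustrative only.
SAMPLE_RATE_HZ = 22_050
HOP_LENGTH = 256
UPSAMPLE_RATES = [8, 8, 2, 2]


def samples_per_frame(upsample_rates):
    """Audio samples the vocoder emits per input mel frame."""
    return prod(upsample_rates)


def frame_duration_ms(hop_length, sample_rate_hz):
    """Duration of one mel frame in milliseconds."""
    return 1000.0 * hop_length / sample_rate_hz


def frames_per_second(hop_length, sample_rate_hz):
    """Mel frames the acoustic model must produce per second of audio."""
    return sample_rate_hz / hop_length


if __name__ == "__main__":
    # The product of the upsample rates should equal the mel hop length.
    assert samples_per_frame(UPSAMPLE_RATES) == HOP_LENGTH
    print(f"frame duration: {frame_duration_ms(HOP_LENGTH, SAMPLE_RATE_HZ):.2f} ms")
    print(f"frames per second: {frames_per_second(HOP_LENGTH, SAMPLE_RATE_HZ):.2f}")
```

With the card's figures the check passes (8 × 8 × 2 × 2 = 256), so one mel frame covers roughly 11.6 ms of audio. That frame duration is also the natural unit for sizing the chunks mentioned in the streaming bullet.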