	---
	language: lg
	license: apache-2.0
	library_name: nemo
	pipeline_tag: text-to-speech
	tags:
	- text-to-speech
	- tts
	- nemo
	- fastpitch
	- hifigan
	- luganda
	- african-languages
	- low-resource
	- edge
	- on-device
	datasets:
	- Sunbird/salt
	---

	# Ganda NeMo (Experimental) — Luganda TTS

	A research-preview Luganda text-to-speech system built on NVIDIA NeMo, targeting on-device and edge deployment on mobile phones. Ships as two NeMo checkpoints: a FastPitch acoustic model and a HiFi-GAN vocoder.

	> Status: experimental. Released as-is for research and evaluation.

	## Model summary

	| Component | File | Architecture | Size |
	|---|---|---|---|
	| Acoustic model | `luganda_fastpitch.nemo` | FastPitch (FFTransformer, 6 layers, `d_model=384`) | 187 MB |
	| Vocoder | `luganda_hifigan.nemo` | HiFi-GAN v1 (upsample rates `[8, 8, 2, 2]`, 512 initial channels) | 339 MB |

	- Language: Luganda (ISO 639-1 `lg`)
	- Sample rate: 22,050 Hz
	- Mel config: 80 bins, `n_fft=1024`, `hop=256`, `win=1024`, range 0–8000 Hz
	- License: Apache-2.0
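
	As a rough illustration of what the mel configuration above implies, the sketch below derives the frame rate, frame shift, and window duration from the listed values. The helper `mel_frames_for` is hypothetical and assumes a centered STFT (one extra frame per clip); the checkpoint's actual padding behaviour may differ.

```python
SAMPLE_RATE = 22_050  # Hz, from the model summary
N_FFT = 1024          # FFT size (also the window length here)
HOP_LENGTH = 256      # samples between successive mel frames
F_MAX_HZ = 8000       # upper edge of the mel range

frames_per_second = SAMPLE_RATE / HOP_LENGTH      # mel frames per second of audio
frame_shift_ms = 1000 * HOP_LENGTH / SAMPLE_RATE  # time step between frames
window_ms = 1000 * N_FFT / SAMPLE_RATE            # duration of one analysis window
nyquist_hz = SAMPLE_RATE / 2                      # highest representable frequency


def mel_frames_for(seconds: float) -> int:
    """Approximate mel frame count for a clip (assumption: centered STFT)."""
    return int(seconds * SAMPLE_RATE) // HOP_LENGTH + 1


print(f"{frames_per_second:.2f} frames/s, "
      f"{frame_shift_ms:.2f} ms shift, {window_ms:.2f} ms window")
print(f"mel range tops out at {F_MAX_HZ} Hz, below the {nyquist_hz:.0f} Hz Nyquist limit")
```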

	## Intended use

	- Research on low-resource African-language TTS.
	- Prototyping on-device / edge Luganda voice output on mobile (primary deployment target).
	- Benchmarking and comparison against other Luganda / Bantu-language TTS systems.



	## Training data

	Trained on the [Sunbird SALT](https://huggingface.co/datasets/Sunbird/salt) Luganda subset — a mixed male/female multi-speaker corpus. Approximately 2,380 clips / ~2.69 hours were used.


	## Model architecture & training

	### FastPitch (acoustic model)

	- FFTransformer encoder/decoder: 6 layers, 1 head, `d_model=384`, `d_inner=1536`, dropout 0.1
	- Duration + pitch predictors: 2-layer temporal predictors, filter size 256
	- Learned alignment (`learn_alignment: true`), bin-loss warmup 100 epochs
	- Pitch stats (z-score): μ=190.65 Hz, σ=51.25 Hz
	- Optimizer: Adam, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-6, batch size 24
	- Training steps: ~20,000
	- NeMo version at training time: 1.8.0rc0
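
	The pitch statistics above describe a z-score transform. A minimal sketch of that transform follows, using the listed mean and standard deviation. The function names are hypothetical, and leaving unvoiced frames (F0 of 0 Hz) at zero is an assumption about the convention used, not something this card states.

```python
PITCH_MEAN_HZ = 190.65  # mean from the list above
PITCH_STD_HZ = 51.25    # standard deviation from the list above


def normalize_pitch(f0_hz: float) -> float:
    """Map an F0 value in Hz to a z-score; unvoiced frames (0 Hz) stay at 0."""
    if f0_hz == 0.0:  # assumption: 0 Hz marks an unvoiced frame
        return 0.0
    return (f0_hz - PITCH_MEAN_HZ) / PITCH_STD_HZ


def denormalize_pitch(z: float) -> float:
    """Inverse transform for voiced frames: z-score back to Hz."""
    return z * PITCH_STD_HZ + PITCH_MEAN_HZ


# A voiced frame one standard deviation above the mean maps to roughly 1.0.
print(normalize_pitch(PITCH_MEAN_HZ + PITCH_STD_HZ))
```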

	### HiFi-GAN (vocoder)

	- Generator: upsample rates `[8, 8, 2, 2]`, kernel sizes `[16, 16, 4, 4]`, initial channels 512
	- Resblock type 1, kernel sizes `[3, 7, 11]`, dilations `[[1,3,5], [1,3,5], [1,3,5]]`
	- Optimizer: AdamW, lr=2e-4, betas=(0.8, 0.99), batch size 16
	- Training steps: ~20,000
	- NeMo version at training time: 1.23.0
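
	The generator settings above can be sanity-checked against the mel configuration: the product of the upsample rates should equal the hop length (256), so each mel frame expands to exactly one hop of audio. The sketch below checks this. The channel-halving line reflects the standard HiFi-GAN generator layout and is an assumption about this checkpoint.

```python
import math

UPSAMPLE_RATES = [8, 8, 2, 2]
UPSAMPLE_KERNEL_SIZES = [16, 16, 4, 4]
INITIAL_CHANNELS = 512
HOP_LENGTH = 256  # from the mel config

# Each mel frame is expanded by the product of the stage rates.
total_upsampling = math.prod(UPSAMPLE_RATES)
if total_upsampling != HOP_LENGTH:
    raise ValueError("vocoder upsampling does not match the mel hop length")

# For the values listed, every transposed-conv kernel is twice its stride.
kernels_are_double_stride = all(
    k == 2 * r for k, r in zip(UPSAMPLE_KERNEL_SIZES, UPSAMPLE_RATES)
)

# Assumption: channels halve after every upsampling stage, as in standard HiFi-GAN.
channels = [INITIAL_CHANNELS // 2**i for i in range(len(UPSAMPLE_RATES) + 1)]


def samples_for(n_frames: int) -> int:
    """Audio samples produced for a mel spectrogram of n_frames frames."""
    return n_frames * total_upsampling


print(total_upsampling, kernels_are_double_stride, channels)
```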

	## Limitations

	Text frontend (English G2P). Luganda does not yet have a mature open-source grapheme-to-phoneme (G2P) resource, so FastPitch here uses NeMo's `EnglishPhonemesTokenizer` with `EnglishG2p` (CMUdict) as a stopgap text frontend, which is workable because Luganda's spelling is largely phonemic. Building a proper Luganda phonemizer is an obvious next step, and contributions are welcome.

	Single output voice. The model is trained on multi-speaker data but emits one averaged voice. No speaker conditioning is available at inference.

	Low-resource training. ~2.7 hours of speech is small for TTS. Expect audible artifacts, uneven prosody on long sentences, and reduced robustness on numerals, code-switched English, and out-of-distribution domains.

	Text normalization. The packaged text normalizer is `nemo_text_processing.text_normalization.Normalizer` with `lang: en`. Non-trivial Luganda text normalization (e.g., number reading, abbreviations) is not handled — pre-normalize input in your pipeline.
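
	One possible shape for the pre-normalization step mentioned above is sketched here. It applies a caller-supplied replacement table and flags digits that remain afterwards. The function names are hypothetical, and the table entries in the usage line are placeholders, not real Luganda expansions; a real pipeline would supply its own number and abbreviation readings.

```python
import re
import unicodedata
from typing import Mapping


def pre_normalize(text: str, replacements: Mapping[str, str]) -> str:
    """Apply whole-token replacements, then collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Longest keys first, so longer abbreviations win over their prefixes.
    for src in sorted(replacements, key=len, reverse=True):
        pattern = rf"(?<!\w){re.escape(src)}(?!\w)"
        # A function replacement avoids backslash interpretation in the target.
        text = re.sub(pattern, lambda _m, dst=replacements[src]: dst, text)
    return re.sub(r"\s+", " ", text).strip()


def has_unhandled_digits(text: str) -> bool:
    """True if any digit survived normalization and still needs a reading."""
    return any(ch.isdigit() for ch in text)


# Placeholder table: the right-hand sides are NOT real Luganda.
table = {"ABBR": "expanded form", "3": "THREE_WORD"}
cleaned = pre_normalize("  ABBR   3 x ", table)
print(cleaned, has_unhandled_digits(cleaned))
```

	Replacements are applied one after another, so a replacement target that itself contains another key would be rewritten again; keep the table free of such overlaps.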

	## Usage

	```python
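	import soundfile as sf
	import torch
	from nemo.collections.tts.models import FastPitchModel, HifiGanModel

	# Load both checkpoints and switch them to inference mode.
	fastpitch = FastPitchModel.restore_from("luganda_fastpitch.nemo").eval()
	hifigan = HifiGanModel.restore_from("luganda_hifigan.nemo").eval()

	text = "Oli otya?"
	with torch.no_grad():
	    parsed = fastpitch.parse(text)  # text -> token IDs
	    spectrogram = fastpitch.generate_spectrogram(tokens=parsed)  # tokens -> mel
	    audio = hifigan.convert_spectrogram_to_audio(spec=spectrogram)  # mel -> waveform

	# Detach before converting to NumPy, then write at the model's 22,050 Hz rate.
	sf.write("out.wav", audio.detach().cpu().numpy().squeeze(), 22050)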
	```

	### Loading from the Hub

	```python
	from huggingface_hub import hf_hub_download
	from nemo.collections.tts.models import FastPitchModel, HifiGanModel

	REPO_ID = "Cal3bd3v/ganda-nemo-experimental"

	# Download each checkpoint into the local Hugging Face cache; the call returns the file path.
	fp = hf_hub_download(repo_id=REPO_ID, filename="luganda_fastpitch.nemo")
	hg = hf_hub_download(repo_id=REPO_ID, filename="luganda_hifigan.nemo")

	fastpitch = FastPitchModel.restore_from(fp)
	hifigan = HifiGanModel.restore_from(hg)
	```

	## Edge / on-device deployment

	The primary deployment target is mobile. Suggested paths:

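	- Export FastPitch and HiFi-GAN to ONNX via NeMo's exporter, then run with ONNX Runtime Mobile. ExecuTorch is an alternative path, but it consumes programs exported directly from PyTorch rather than ONNX files.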
	- HiFi-GAN dominates runtime; a distilled / smaller vocoder (e.g., HiFi-GAN v3 or iSTFTNet) is recommended for phone-class CPUs.
	- Streaming is possible by chunking mel output and vocoding incrementally; latency budget depends on device.

	Quantization, pruning, and distillation have not been applied in this release.
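
	The streaming suggestion above (chunk the mel output and vocode incrementally) could be planned along the following lines. This is a sketch under stated assumptions: the `Chunk` type, `plan_chunks`, and the default chunk and context sizes are hypothetical, and it assumes the vocoder emits exactly 256 samples per mel frame. Each chunk is vocoded with a few frames of context on either side, and the context audio is trimmed off to reduce boundary artifacts.

```python
from typing import Iterator, NamedTuple

HOP_LENGTH = 256  # audio samples per mel frame


class Chunk(NamedTuple):
    frame_start: int  # first mel frame fed to the vocoder (includes left context)
    frame_end: int    # one past the last mel frame fed (includes right context)
    trim_left: int    # samples to drop from the start of the vocoder output
    keep: int         # samples to keep after trimming


def plan_chunks(n_frames: int, chunk_frames: int = 64,
                context_frames: int = 8) -> Iterator[Chunk]:
    """Yield vocoder input ranges and the output samples to keep for each."""
    if chunk_frames <= 0 or context_frames < 0:
        raise ValueError("chunk_frames must be > 0 and context_frames >= 0")
    for start in range(0, n_frames, chunk_frames):
        end = min(start + chunk_frames, n_frames)
        left = max(start - context_frames, 0)          # clamp at utterance start
        right = min(end + context_frames, n_frames)    # clamp at utterance end
        yield Chunk(left, right,
                    (start - left) * HOP_LENGTH,
                    (end - start) * HOP_LENGTH)


chunks = list(plan_chunks(150, chunk_frames=64, context_frames=8))
for c in chunks:
    print(c)
```

	For each chunk, the caller would vocode mel frames `frame_start:frame_end`, discard the first `trim_left` samples, and emit the next `keep` samples. Larger chunks lower per-call overhead but raise time-to-first-audio, so the right sizes depend on the device.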

	## Ethical considerations

	- The speaker(s) in the Sunbird SALT corpus consented to the original dataset's terms; downstream use must respect those terms.
	- Because the model emits a synthesized Luganda voice, downstream applications should disclose synthetic speech to end users where appropriate (accessibility, consent, anti-impersonation).

	## Attribution

	- Model author: Caleb Lwanga, [Crane AI Labs](https://huggingface.co/CraneAILabs)
	- Base framework: [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
	- Training corpus: [Sunbird SALT](https://huggingface.co/datasets/Sunbird/salt)

	## License

	Apache-2.0. Use of the model must also comply with the license of the Sunbird SALT dataset used for training.