# Revolab VITS: Multi-Speaker Bahasa Melayu TTS

VITS voice models trained on Revolab Malay speech datasets.

## Speakers (2 production-quality)

| Speaker ID | Name | Samples | CER | WER |
|------------|------|---------|-----|-----|
| sarah | sarah | 27,792 | 0.0537 | 0.1835 |
| paan | Paan | 27,434 | 0.0681 | 0.1561 |

CER (character error rate) and WER (word error rate) are reported as fractions, so 0.0537 means 5.37%. Both speakers score a CER below 0.10 (10%), the threshold used here for production quality.
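
The threshold check above can be sketched in a few lines of Python. This is a minimal illustration, not the project's evaluation code: the metrics are copied by hand from the table, and the names `SPEAKER_METRICS`, `CER_THRESHOLD`, and `production_ready` are hypothetical. The actual schema of `speakers.json` is not documented here, so the dictionary layout is an assumption.

```python
# Metrics copied from the speaker table; values are fractions, not percentages.
# The dictionary layout is an assumption, not the documented speakers.json schema.
SPEAKER_METRICS = {
    "sarah": {"samples": 27792, "cer": 0.0537, "wer": 0.1835},
    "paan": {"samples": 27434, "cer": 0.0681, "wer": 0.1561},
}

CER_THRESHOLD = 0.10  # 10% CER, the production-quality cutoff stated above


def production_ready(metrics, threshold=CER_THRESHOLD):
    """Return the sorted speaker IDs whose CER is strictly below the threshold."""
    return sorted(sid for sid, m in metrics.items() if m["cer"] < threshold)


if __name__ == "__main__":
    for sid in production_ready(SPEAKER_METRICS):
        m = SPEAKER_METRICS[sid]
        print(f"{sid}: CER {m['cer']:.2%}, WER {m['wer']:.2%}")
```

Keeping the threshold as a named constant makes it easy to re-run the check against a stricter cutoff when more speakers are added.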

## Structure

```
speakers.json              # Speaker registry with eval metrics
speakers/
  <name>/
    model.onnx             # ONNX export for inference
    model.onnx.json        # Phoneme config
```
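
Given that layout, locating one speaker's files is a matter of joining paths. The sketch below assumes only the directory structure shown above; `speaker_files` and `missing_files` are hypothetical helper names, and nothing here loads or runs the model.

```python
from pathlib import Path


def speaker_files(root, name):
    """Return (model, config) paths for one speaker under the repository root.

    Hypothetical helper: it only mirrors the directory layout shown above.
    """
    speaker_dir = Path(root) / "speakers" / name
    return speaker_dir / "model.onnx", speaker_dir / "model.onnx.json"


def missing_files(root, name):
    """List the expected files for a speaker that are not present on disk."""
    return [p for p in speaker_files(root, name) if not p.is_file()]


if __name__ == "__main__":
    model, config = speaker_files(".", "sarah")
    print(model.as_posix())
    print(config.as_posix())
    print("missing:", [p.as_posix() for p in missing_files(".", "sarah")])
```

Checking for missing files before starting an inference session gives a clearer error than a failed model load.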

## Performance (CPU)

| Metric | Value |
|--------|-------|
| Avg latency | ~54 ms |
| Avg RTF | 0.030 |
| Speed | **33.6x realtime** |
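
For readers unfamiliar with the metric, the real-time factor (RTF) is synthesis time divided by the duration of the audio produced, and the speed figure is its reciprocal. The sketch below shows that arithmetic. The 1.8 s clip length is an assumption chosen so that 54 ms of compute yields an RTF of 0.030; it is not a figure from the benchmark. Note that 1 / 0.030 is about 33.3, so the 33.6x in the table presumably comes from an unrounded RTF slightly below 0.030.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(rtf):
    """How many times faster than realtime synthesis runs (1 / RTF)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


if __name__ == "__main__":
    # Illustrative numbers only: 54 ms of compute for an assumed 1.8 s clip.
    rtf = real_time_factor(0.054, 1.8)
    print(f"RTF {rtf:.3f}, {speedup(rtf):.1f}x realtime")
```

An RTF below 1.0 means synthesis finishes before the audio would finish playing, which is the condition for streaming playback without gaps.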

## Training

All models were trained with:

- Architecture: VITS
- Phonemizer: espeak-ng (`ms` voice)
- Sample rate: 22,050 Hz
- GPU: NVIDIA H200 NVL
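
To inspect what the phonemizer produces for Malay text, the `espeak-ng` command-line tool can be called with the `ms` voice. The sketch below is an illustration under assumptions: `espeak_command` and `phonemize` are hypothetical helpers, and the training pipeline may normalize text or post-process phonemes differently. It requires `espeak-ng` on the `PATH` and returns `None` when the tool is not installed.

```python
import shutil
import subprocess


def espeak_command(text, voice="ms"):
    """Build the espeak-ng argument list for IPA output without audio playback."""
    # -q: produce no speech, --ipa: write IPA phonemes to stdout, -v: select voice
    return ["espeak-ng", "-q", "--ipa", "-v", voice, text]


def phonemize(text, voice="ms"):
    """Return espeak-ng's IPA string for text, or None if espeak-ng is missing."""
    if shutil.which("espeak-ng") is None:
        return None
    result = subprocess.run(
        espeak_command(text, voice), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(phonemize("Selamat pagi"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with punctuation in the input text.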