# Revolab VITS — Multi-Speaker Bahasa Melayu TTS
VITS voice models trained on Revolab Malay speech datasets.
## Speakers (2 production-quality)
| Speaker ID | Name | Samples | CER | WER |
|------------|------|---------|-----|-----|
| sarah | sarah | 27,792 | 0.0537 | 0.1835 |
| paan | Paan | 27,434 | 0.0681 | 0.1561 |
Both speakers score a CER below 0.10 (10%), the threshold treated here as production quality. CER and WER in the table are fractions, not percentages.
## Structure
```
speakers.json            # Speaker registry with eval metrics
speakers/
  <name>/
    model.onnx           # ONNX export for inference
    model.onnx.json      # Phoneme config
```
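The layout above can be turned into a small path helper. This is a minimal sketch, not part of the repository: the function names `speaker_files` and `missing_files` are hypothetical, and the local checkout directory (`vits` in the usage note) is an assumption. It only resolves paths and does not assume anything about the contents of `speakers.json`.

```python
from pathlib import Path
from typing import Dict, List


def speaker_files(root: str, name: str) -> Dict[str, Path]:
    """Return the expected file paths for one speaker (hypothetical helper)."""
    speaker_dir = Path(root) / "speakers" / name
    return {
        "model": speaker_dir / "model.onnx",        # ONNX export for inference
        "config": speaker_dir / "model.onnx.json",  # phoneme config
    }


def missing_files(root: str, name: str) -> List[str]:
    """List the expected files that are not present on disk."""
    return [str(p) for p in speaker_files(root, name).values() if not p.is_file()]
```

For example, `speaker_files("vits", "sarah")["model"]` resolves to `vits/speakers/sarah/model.onnx`, and `missing_files` gives a quick check that a download is complete before loading a model.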
## Performance (CPU)
| Metric | Value |
|--------|-------|
| Avg latency | ~54 ms |
| Avg RTF (real-time factor) | 0.030 |
| Speed | **33.6x realtime** |
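The figures in this table are related by simple arithmetic: RTF is synthesis time divided by the duration of the audio produced, and the realtime speed is its reciprocal. The sketch below illustrates that relationship, assuming this standard definition of RTF; the function names are hypothetical and the benchmark script itself is not shown in this README.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def realtime_speedup(rtf: float) -> float:
    """How many seconds of audio are generated per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Using the rounded table values: 1 / 0.030 is roughly 33.3x realtime.
speedup = realtime_speedup(0.030)

# A 54 ms synthesis at RTF 0.030 corresponds to roughly 1.8 s of audio.
implied_audio_seconds = 0.054 / 0.030
```

The reciprocal of the rounded RTF is about 33.3, slightly below the reported 33.6x. That gap is consistent with the RTF having been rounded to three decimals, though the README does not say how the averages were computed.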
## Training
All models trained with:
- Architecture: VITS
- Phonemizer: espeak-ng (`ms` voice)
- Sample rate: 22,050 Hz
- GPU: NVIDIA H200 NVL
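
To see what the `ms` voice produces for a given sentence, espeak-ng can be called from the command line. The sketch below wraps that call in Python. It is an illustration only: the helper names are hypothetical, and the exact phonemization options used during training are not stated here, so the output may not match the training pipeline symbol for symbol.

```python
import shutil
import subprocess
from typing import List, Optional


def espeak_command(text: str, voice: str = "ms") -> List[str]:
    """Build an espeak-ng invocation that prints IPA phonemes without audio."""
    # -v selects the voice, -q suppresses audio output, --ipa prints IPA phonemes.
    return ["espeak-ng", "-v", voice, "-q", "--ipa", text]


def phonemize(text: str, voice: str = "ms") -> Optional[str]:
    """Return the IPA string, or None if espeak-ng is not installed."""
    if shutil.which("espeak-ng") is None:
        return None
    result = subprocess.run(
        espeak_command(text, voice),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `phonemize("selamat pagi")` returns the IPA string on a machine with espeak-ng installed and `None` otherwise.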