Commit 21f6b3c
Parent(s): a6a634a
Add model checkpoints and config files
Signed-off-by: adrianstanea <adrianstanea1@gmail.com>
- .gitattributes +1 -0
- README.md +98 -0
- config.json +21 -0
- hifigan_config.json +38 -0
- models/bas/grad-tts-bas-10_100.pt +3 -0
- models/bas/grad-tts-bas-10_15.pt +3 -0
- models/bas/grad-tts-bas-10_50.pt +3 -0
- models/bas/grad-tts-bas-950_100.pt +3 -0
- models/bas/grad-tts-bas-950_15.pt +3 -0
- models/bas/grad-tts-bas-950_50.pt +3 -0
- models/sgs/grad-tts-sgs-10_100.pt +3 -0
- models/sgs/grad-tts-sgs-10_15.pt +3 -0
- models/sgs/grad-tts-sgs-10_50.pt +3 -0
- models/sgs/grad-tts-sgs-950_100.pt +3 -0
- models/sgs/grad-tts-sgs-950_15.pt +3 -0
- models/sgs/grad-tts-sgs-950_50.pt +3 -0
- models/swara/grad-tts-base-1000.pt +3 -0
- models/vocoder/hifigan_univ_v1 +3 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
models/vocoder/hifigan_univ_v1 filter=lfs diff=lfs merge=lfs -text
|
README.md
CHANGED
|
@@ -1,3 +1,101 @@
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
+
language:
|
| 4 |
+
- ro
|
| 5 |
+
tags:
|
| 6 |
+
- text-to-speech
|
| 7 |
+
- Grad-TTS
|
| 8 |
+
- Diffusion
|
| 9 |
+
library_name: pytorch
|
| 10 |
+
datasets:
|
| 11 |
+
- SWARA-1.0
|
| 12 |
---
|
| 13 |
+
|
| 14 |
+
# Ro-Grad-TTS: Romanian Text-to-Speech
|
| 15 |
+
|
| 16 |
+
Romanian adaptation of [Grad-TTS](https://arxiv.org/abs/2105.06337), trained on the [SWARA 1.0 dataset](https://speech.utcluj.ro/swarasc/).
|
| 17 |
+
|
| 18 |
+
## Quick Start
|
| 19 |
+
|
| 20 |
+
This repository only contains the pretrained model weights for Romanian Grad-TTS. The actual package for Romanian TTS inference, including installation and usage instructions, is hosted on GitHub at [adrianstanea/Ro-Grad-TTS](https://github.com/adrianstanea/Ro-Grad-TTS.git).
|
| 21 |
+
|
| 22 |
+
When using the Romanian Grad-TTS package, the weights from this repository will be automatically downloaded as needed. To install and run Romanian TTS inference, please follow the instructions in the main repository linked above.
|
| 23 |
+
|
| 24 |
+
## Details
|
| 25 |
+
|
| 26 |
+
- **Architecture**: Grad-TTS (diffusion-based TTS)
|
| 27 |
+
- **Language**: Romanian
|
| 28 |
+
- **Phonemization**: Espeak-ng
|
| 29 |
+
- **Vocoder**: HiFi-GAN (universal v1)
|
| 30 |
+
- **Sample rate**: 22050 Hz
|
| 31 |
+
- **Training data**: SWARA 1.0 Romanian speech corpus
|
| 32 |
+
|
| 33 |
+
## Available Models
|
| 34 |
+
|
| 35 |
+
### Baseline Model
|
| 36 |
+
|
| 37 |
+
| Model | Type | Description |
|
| 38 |
+
| --------- | -------- | ---------------------------------------------------- |
|
| 39 |
+
| **swara** | Baseline | Speaker-agnostic model trained on full SWARA dataset |
|
| 40 |
+
|
| 41 |
+
### Fine-tuned Speaker Models
|
| 42 |
+
|
| 43 |
+
| Model | Speaker | Training Samples | Fine-tune Epochs | Use Case |
|
| 44 |
+
| ----------- | ------------ | ---------------- | ---------------- | -------------------------------- |
|
| 45 |
+
| **bas_10** | BAS (Female) | 10 samples | 100 | Few-shot learning / Low-resource |
|
| 46 |
+
| **bas_950** | BAS (Female) | 950 samples | 100 | Production-ready speaker |
|
| 47 |
+
| **sgs_10** | SGS (Male) | 10 samples | 100 | Few-shot learning / Low-resource |
|
| 48 |
+
| **sgs_950** | SGS (Male) | 950 samples | 100 | Production-ready speaker |
|
| 49 |
+
|
| 50 |
+
**Vocoder**: Universal HiFi-GAN vocoder
|
| 51 |
+
|
| 52 |
+
## Repository Structure
|
| 53 |
+
|
| 54 |
+
```sh
|
| 55 |
+
adrianstanea/Ro-Grad-TTS/
|
| 56 |
+
├── config.json # Model hyperparameters
|
| 57 |
+
├── hifigan_config.json # Vocoder configuration
|
| 58 |
+
└── models/
|
| 59 |
+
    ├── swara/
|
| 60 |
+
    │   └── grad-tts-base-1000.pt # Baseline model
|
| 61 |
+
    ├── bas/
|
| 62 |
+
    │   └── grad-tts-bas-{10,950}_{15,50,100}.pt
|
| 63 |
+
    ├── sgs/
|
| 64 |
+
    │   └── grad-tts-sgs-{10,950}_{15,50,100}.pt
|
| 65 |
+
    └── vocoder/
|
| 66 |
+
        └── hifigan_univ_v1 # Universal HiFi-GAN
|
| 67 |
+
```
|
| 68 |
+
|
| 69 |
+
## Citation
|
| 70 |
+
|
| 71 |
+
If you use this Romanian adaptation in your research, please cite:
|
| 72 |
+
|
| 73 |
+
```bibtex
|
| 74 |
+
@ARTICLE{11269795,
|
| 75 |
+
author={Răgman, Teodora and Bogdan Stânea, Adrian and Cucu, Horia and Stan, Adriana},
|
| 76 |
+
journal={IEEE Access},
|
| 77 |
+
title={How Open Is Open TTS? A Practical Evaluation of Open Source TTS Tools},
|
| 78 |
+
year={2025},
|
| 79 |
+
volume={13},
|
| 80 |
+
number={},
|
| 81 |
+
pages={203415-203428},
|
| 82 |
+
keywords={Computer architecture;Training;Text to speech;Spectrogram;Decoding;Computational modeling;Codecs;Predictive models;Acoustics;Low latency communication;Speech synthesis;open tools;evaluation;computational requirements;TTS adaptation;text-to-speech;objective measures;listening test;Romanian},
|
| 83 |
+
doi={10.1109/ACCESS.2025.3637322}
|
| 84 |
+
}
|
| 85 |
+
```
|
| 86 |
+
|
| 87 |
+
### Original Grad-TTS Citation
|
| 88 |
+
|
| 89 |
+
```bibtex
|
| 90 |
+
@article{popov2021grad,
|
| 91 |
+
title={Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech},
|
| 92 |
+
author={Popov, Vadim and Vovk, Ivan and Gogoryan, Vladimir and Sadekova, Tasnima and Kudinov, Mikhail},
|
| 93 |
+
journal={International Conference on Machine Learning},
|
| 94 |
+
year={2021}
|
| 95 |
+
}
|
| 96 |
+
```
|
| 97 |
+
|
| 98 |
+
## References
|
| 99 |
+
|
| 100 |
+
- [adrianstanea/Ro-Grad-TTS](https://github.com/adrianstanea/Ro-Grad-TTS.git) - Training, documentation, and research details
|
| 101 |
+
- [huawei-noah/Speech-Backbones](https://github.com/huawei-noah/Speech-Backbones) - Base architecture and paper
|
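The checkpoint names added in this commit follow a regular pattern, so a loader can build a path instead of hard-coding it. The following is a minimal sketch, assuming the two numbers in each file name are the number of training samples and the number of fine-tuning epochs (the README's `{10,950}_{15,50,100}` pattern suggests this but does not state it). `finetuned_checkpoint` and `all_checkpoints` are hypothetical helper names, not part of the Ro-Grad-TTS package.

```python
from itertools import product

# Values taken from the file list of this commit.
SPEAKERS = ("bas", "sgs")
SAMPLE_COUNTS = (10, 950)
EPOCHS = (15, 50, 100)  # assumed meaning of the second number in the name
BASELINE = "models/swara/grad-tts-base-1000.pt"


def finetuned_checkpoint(speaker, n_samples, epochs):
    """Return the repository-relative path of a fine-tuned checkpoint."""
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker!r}")
    if n_samples not in SAMPLE_COUNTS or epochs not in EPOCHS:
        raise ValueError(f"no checkpoint for {n_samples} samples / {epochs} epochs")
    return f"models/{speaker}/grad-tts-{speaker}-{n_samples}_{epochs}.pt"


def all_checkpoints():
    """List the baseline plus every fine-tuned Grad-TTS checkpoint path."""
    paths = [BASELINE]
    for speaker, n_samples, epochs in product(SPEAKERS, SAMPLE_COUNTS, EPOCHS):
        paths.append(finetuned_checkpoint(speaker, n_samples, epochs))
    return paths
```

A path built this way could be passed as the `filename` argument of `huggingface_hub.hf_hub_download`, although the README states that the main package downloads weights automatically.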
config.json
ADDED
|
@@ -0,0 +1,21 @@
| 1 |
+
{
|
| 2 |
+
"model_type": "grad-tts",
|
| 3 |
+
"language": "ro",
|
| 4 |
+
"n_spks": 1,
|
| 5 |
+
"spk_emb_dim": 64,
|
| 6 |
+
"n_enc_channels": 192,
|
| 7 |
+
"filter_channels": 768,
|
| 8 |
+
"filter_channels_dp": 256,
|
| 9 |
+
"n_heads": 2,
|
| 10 |
+
"n_enc_layers": 6,
|
| 11 |
+
"enc_kernel": 3,
|
| 12 |
+
"enc_dropout": 0.1,
|
| 13 |
+
"window_size": 4,
|
| 14 |
+
"n_feats": 80,
|
| 15 |
+
"dec_dim": 64,
|
| 16 |
+
"beta_min": 0.05,
|
| 17 |
+
"beta_max": 20.0,
|
| 18 |
+
"pe_scale": 1000,
|
| 19 |
+
"sample_rate": 22050,
|
| 20 |
+
"add_blank": true
|
| 21 |
+
}
|
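The `beta_min` and `beta_max` entries bound the diffusion noise schedule. The sketch below assumes the linear schedule described in the Grad-TTS paper, `beta(t) = beta_min + (beta_max - beta_min) * t` for `t` in `[0, 1]`. The function names are illustrative only; the code that actually consumes `config.json` lives in the GitHub repository and may differ.

```python
# Values copied from config.json in this commit.
BETA_MIN = 0.05
BETA_MAX = 20.0


def beta(t, beta_min=BETA_MIN, beta_max=BETA_MAX):
    """Instantaneous noise level at time t in [0, 1] (assumed linear schedule)."""
    return beta_min + (beta_max - beta_min) * t


def cumulative_beta(t, beta_min=BETA_MIN, beta_max=BETA_MAX):
    """Integral of beta(s) for s from 0 to t, in closed form."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0):
        print(f"t={t:.1f}  beta={beta(t):.4f}  cumulative={cumulative_beta(t):.4f}")
```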
hifigan_config.json
ADDED
|
@@ -0,0 +1,38 @@
| 1 |
+
{
|
| 2 |
+
"resblock": "1",
|
| 3 |
+
"num_gpus": 0,
|
| 4 |
+
"batch_size": 16,
|
| 5 |
+
"learning_rate": 0.0004,
|
| 6 |
+
"adam_b1": 0.8,
|
| 7 |
+
"adam_b2": 0.99,
|
| 8 |
+
"lr_decay": 0.999,
|
| 9 |
+
"seed": 1234,
|
| 10 |
+
|
| 11 |
+
"upsample_rates": [8,8,2,2],
|
| 12 |
+
"upsample_kernel_sizes": [16,16,4,4],
|
| 13 |
+
"upsample_initial_channel": 512,
|
| 14 |
+
"resblock_kernel_sizes": [3,7,11],
|
| 15 |
+
"resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
|
| 16 |
+
"resblock_initial_channel": 256,
|
| 17 |
+
|
| 18 |
+
"segment_size": 8192,
|
| 19 |
+
"num_mels": 80,
|
| 20 |
+
"num_freq": 1025,
|
| 21 |
+
"n_fft": 1024,
|
| 22 |
+
"hop_size": 256,
|
| 23 |
+
"win_size": 1024,
|
| 24 |
+
|
| 25 |
+
"sampling_rate": 22050,
|
| 26 |
+
|
| 27 |
+
"fmin": 0,
|
| 28 |
+
"fmax": 8000,
|
| 29 |
+
"fmax_loss": null,
|
| 30 |
+
|
| 31 |
+
"num_workers": 4,
|
| 32 |
+
|
| 33 |
+
"dist_config": {
|
| 34 |
+
"dist_backend": "nccl",
|
| 35 |
+
"dist_url": "tcp://localhost:54321",
|
| 36 |
+
"world_size": 1
|
| 37 |
+
}
|
| 38 |
+
}
|
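Two values in this vocoder config are tied together: the generator upsamples each mel frame by the product of `upsample_rates`, which has to equal `hop_size` for the output waveform to line up with the input spectrogram. A small sanity check, with hypothetical helper names and only the relevant keys copied from the file above, might look like this:

```python
import json
from math import prod

# Subset of hifigan_config.json from this commit.
HIFIGAN_CONFIG = json.loads("""
{
  "upsample_rates": [8, 8, 2, 2],
  "hop_size": 256,
  "sampling_rate": 22050,
  "num_mels": 80
}
""")


def samples_per_frame(cfg):
    """Audio samples generated per mel frame (product of upsampling factors)."""
    return prod(cfg["upsample_rates"])


def check_upsampling(cfg):
    """Raise if the total upsampling factor does not match the hop size."""
    total = samples_per_frame(cfg)
    if total != cfg["hop_size"]:
        raise ValueError(f"upsample product {total} != hop_size {cfg['hop_size']}")


def mel_frames_to_seconds(n_frames, cfg):
    """Duration of the audio synthesized from n_frames mel frames."""
    return n_frames * cfg["hop_size"] / cfg["sampling_rate"]


check_upsampling(HIFIGAN_CONFIG)
```

With these values one mel frame covers 256 samples, so the `sample_rate` of 22050 Hz in `config.json` corresponds to roughly 86 frames per second.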
models/bas/grad-tts-bas-10_100.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:2496c1451640dbf50d247f4ffc520fbb768bf5d9512f3d0875e1b0431f7625c7
|
| 3 |
+
size 59484571
|
models/bas/grad-tts-bas-10_15.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:56ea54d5cde11cff79d57c34af9d3407b1cea294ceadfe5cb9949c5caba64025
|
| 3 |
+
size 59484571
|
models/bas/grad-tts-bas-10_50.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:df6401ee7f066b7e8b83e5185030d55a29bca9ae87897dcb2b5ec41c64ef001c
|
| 3 |
+
size 59484571
|
models/bas/grad-tts-bas-950_100.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:6bf8faa190b2f5fa361581b365c471327c237b1b818a390b5b7016760ad607a6
|
| 3 |
+
size 59484571
|
models/bas/grad-tts-bas-950_15.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:9f9eeb8d028c84c14b20bba26c107475bef35e3cee33fec4f33d096713c52bb4
|
| 3 |
+
size 59484571
|
models/bas/grad-tts-bas-950_50.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8c0e96fe8fe2ec1f6a8f0f88f05b2923671ea56bde8cfb34306552d8db48b386
|
| 3 |
+
size 59484571
|
models/sgs/grad-tts-sgs-10_100.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:3100c5f1fb2e3f2d94790e5b27b22ea990ecf01f5db694ab279eaeb2fd874e29
|
| 3 |
+
size 59484571
|
models/sgs/grad-tts-sgs-10_15.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:30b3bdaf8595f4c04f5f939839126c3e134b557a55842605770b8d4ac1b1f1d4
|
| 3 |
+
size 59484571
|
models/sgs/grad-tts-sgs-10_50.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:5833afde853640b228b90ad23d006f200e7eaccab288916dff5b21864ed10de6
|
| 3 |
+
size 59484571
|
models/sgs/grad-tts-sgs-950_100.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8fe8ef94739bde87025a26c87874ff46beccad144b6696b61c6a976f8c69e919
|
| 3 |
+
size 59484571
|
models/sgs/grad-tts-sgs-950_15.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:664f6d77470501f2fbaa74477f1e828b94fe4d59f265ed6ff322c6865c55fcac
|
| 3 |
+
size 59484571
|
models/sgs/grad-tts-sgs-950_50.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:9462ab4808bc7a15fd19ebe3db14084f5b00a7302c888c9b0505de531023bec4
|
| 3 |
+
size 59484571
|
models/swara/grad-tts-base-1000.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:141842ea3fc006215aa66234c5ef59b333ccd9c501f4288c3ff743f4a35c5d43
|
| 3 |
+
size 59484571
|
models/vocoder/hifigan_univ_v1
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:771eaf4876485a35e25577563d390c262e23c2421e4a8c929eacfde34a5b7a60
|
| 3 |
+
size 55788858
|
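The `.pt` and vocoder entries above are Git LFS pointer files: three `key value` lines giving the spec version, the SHA-256 object id, and the size in bytes of the real file. After downloading a checkpoint, those two fields are enough to verify it. A minimal sketch follows; `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, and for files of this size (about 59 MB) a real check would probably hash in chunks rather than hold the whole file in memory.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Return True if the bytes have the size and SHA-256 digest the pointer records."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    if len(data) != pointer["size"]:
        return False
    return hashlib.sha256(data).hexdigest() == pointer["digest"]


# Pointer for models/vocoder/hifigan_univ_v1, copied from this commit.
VOCODER_POINTER = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:771eaf4876485a35e25577563d390c262e23c2421e4a8c929eacfde34a5b7a60\n"
    "size 55788858\n"
)
```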