---
license: apache-2.0
language:
- ro
tags:
- text-to-speech
- Grad-TTS
- Diffusion
library_name: pytorch
datasets:
- SWARA-1.0
---

# Ro-Grad-TTS: Romanian Text-to-Speech

Romanian adaptation of [Grad-TTS](https://arxiv.org/abs/2105.06337), trained on the [SWARA 1.0 dataset](https://speech.utcluj.ro/swarasc/).

## Quick Start

This repository contains only the pretrained model weights for Romanian Grad-TTS. The inference package, along with installation and usage instructions, is hosted on GitHub at [adrianstanea/Ro-Grad-TTS](https://github.com/adrianstanea/Ro-Grad-TTS.git).

The Romanian Grad-TTS package downloads the weights from this repository automatically as needed. To install it and run Romanian TTS inference, follow the instructions in the GitHub repository linked above.

## Details

- **Architecture**: Grad-TTS (diffusion-based TTS)
- **Language**: Romanian
- **Phonemization**: Espeak-ng
- **Vocoder**: HiFi-GAN (universal v1)
- **Sample rate**: 22050 Hz
- **Training data**: SWARA 1.0 Romanian speech corpus

## Available Models

### Baseline Model

| Model     | Type     | Description                                          |
| --------- | -------- | ---------------------------------------------------- |
| **swara** | Baseline | Speaker-agnostic model trained on full SWARA dataset |

### Fine-tuned Speaker Models

| Model       | Speaker      | Training Samples | Fine-tune Epochs | Use Case                         |
| ----------- | ------------ | ---------------- | ---------------- | -------------------------------- |
| **bas_10**  | BAS (Female) | 10 samples       | 100              | Few-shot learning / Low-resource |
| **bas_950** | BAS (Female) | 950 samples      | 100              | Production-ready speaker         |
| **sgs_10**  | SGS (Male)   | 10 samples       | 100              | Few-shot learning / Low-resource |
| **sgs_950** | SGS (Male)   | 950 samples      | 100              | Production-ready speaker         |

**Vocoder**: Universal HiFi-GAN vocoder

## Repository Structure

```sh
adrianstanea/Ro-Grad-TTS/
├── config.json                  # Model hyperparameters
├── hifigan_config.json          # Vocoder configuration
└── models/
    ├── swara/
    │   └── grad-tts-base-1000.pt    # Baseline model
    ├── bas/
    │   └── grad-tts-bas-{10,950}_{15,50,100}.pt
    ├── sgs/
    │   └── grad-tts-sgs-{10,950}_{15,50,100}.pt
    └── vocoder/
        └── hifigan_univ_v1          # Universal HiFi-GAN
```

## Citation

If you use this Romanian adaptation in your research, please cite:

```bibtex
@ARTICLE{11269795,
  author={Răgman, Teodora and Bogdan Stânea, Adrian and Cucu, Horia and Stan, Adriana},
  journal={IEEE Access},
  title={How Open Is Open TTS? A Practical Evaluation of Open Source TTS Tools},
  year={2025},
  volume={13},
  number={},
  pages={203415-203428},
  keywords={Computer architecture;Training;Text to speech;Spectrogram;Decoding;Computational modeling;Codecs;Predictive models;Acoustics;Low latency communication;Speech synthesis;open tools;evaluation;computational requirements;TTS adaptation;text-to-speech;objective measures;listening test;Romanian},
  doi={10.1109/ACCESS.2025.3637322}
}
```

### Original Grad-TTS Citation

```bibtex
@article{popov2021grad,
  title={Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech},
  author={Popov, Vadim and Vovk, Ivan and Gogoryan, Vladimir and Sadekova, Tasnima and Kudinov, Mikhail},
  journal={International Conference on Machine Learning},
  year={2021}
}
```

## References

- [adrianstanea/Ro-Grad-TTS](https://github.com/adrianstanea/Ro-Grad-TTS.git) - Training, documentation, and research details
- [huawei-noah/Speech-Backbones](https://github.com/huawei-noah/Speech-Backbones) - Base architecture and paper
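
## Example: Resolving Checkpoint Paths

The Romanian Grad-TTS package downloads weights automatically, so the following is only a minimal sketch for readers who want to fetch a single checkpoint by hand. It expands the file-name pattern listed under Repository Structure into concrete paths. Several things here are assumptions rather than documented behavior: the helper names (`checkpoint_path`, `all_speaker_checkpoints`, `download`) are hypothetical, the Hub repository ID is assumed to match the root shown in the tree, and the second number in each file name is assumed to be the number of fine-tuning epochs.

```python
from itertools import product

# Assumption: the Hub repo ID matches the root shown under Repository Structure.
REPO_ID = "adrianstanea/Ro-Grad-TTS"
BASELINE = "models/swara/grad-tts-base-1000.pt"

SPEAKERS = ("bas", "sgs")
SAMPLE_COUNTS = (10, 950)
# Assumption: the second number in the file name is the fine-tuning epoch count.
EPOCHS = (15, 50, 100)


def checkpoint_path(speaker=None, samples=None, epochs=100):
    """Return the repo-relative path of a checkpoint (hypothetical helper).

    With no speaker, the baseline SWARA checkpoint is returned.
    """
    if speaker is None:
        return BASELINE
    if speaker not in SPEAKERS or samples not in SAMPLE_COUNTS or epochs not in EPOCHS:
        raise ValueError(f"no such checkpoint: {speaker}, {samples}, {epochs}")
    return f"models/{speaker}/grad-tts-{speaker}-{samples}_{epochs}.pt"


def all_speaker_checkpoints():
    """Expand the {10,950}_{15,50,100} pattern for every speaker."""
    return [
        checkpoint_path(speaker, samples, epochs)
        for speaker, samples, epochs in product(SPEAKERS, SAMPLE_COUNTS, EPOCHS)
    ]


def download(path):
    """Fetch one file from the Hub and return its local cache path.

    Needs the `huggingface_hub` package; it is imported lazily so the
    path helpers above work without it.
    """
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=path)
```

Under these assumptions, `download(checkpoint_path("bas", 950))` would fetch the BAS model fine-tuned on 950 samples, and `download(checkpoint_path())` would fetch the baseline.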