	---
	license: apache-2.0
	language:
	- ro
	tags:
	- text-to-speech
	- Grad-TTS
	- Diffusion
	library_name: pytorch
	datasets:
	- SWARA-1.0
	---

	# Ro-Grad-TTS: Romanian Text-to-Speech

	Romanian adaptation of [Grad-TTS](https://arxiv.org/abs/2105.06337), trained on the [SWARA 1.0 dataset](https://speech.utcluj.ro/swarasc/).

	## Quick Start

	This repository contains only the pretrained model weights and configuration files for Romanian Grad-TTS. The inference package, with installation and usage instructions, is hosted on GitHub at [adrianstanea/Ro-Grad-TTS](https://github.com/adrianstanea/Ro-Grad-TTS.git).

	When using the Romanian Grad-TTS package, the weights from this repository will be automatically downloaded as needed. To install and run Romanian TTS inference, please follow the instructions in the main repository linked above.
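
If you only need the raw files (for example, to inspect `config.json`), the sketch below uses the `hf_hub_download` function from the `huggingface_hub` library. It assumes that the Hub repo id is `adrianstanea/Ro-Grad-TTS` (the root shown under "Repository Structure") and that `huggingface_hub` is installed. `fetch_files` is a hypothetical helper and is not part of the Ro-Grad-TTS package.

```python
# Minimal sketch: download selected files from this repository via huggingface_hub.
# ASSUMPTIONS: the Hub repo id and file paths mirror the "Repository Structure" tree
# in this card; fetch_files is a hypothetical helper, not part of Ro-Grad-TTS.
from typing import Dict, Sequence

REPO_ID = "adrianstanea/Ro-Grad-TTS"  # assumed Hub repo id

DEFAULT_FILES = (
    "config.json",                         # model hyperparameters
    "hifigan_config.json",                 # vocoder configuration
    "models/swara/grad-tts-base-1000.pt",  # baseline checkpoint
)


def fetch_files(filenames: Sequence[str] = DEFAULT_FILES, repo_id: str = REPO_ID) -> Dict[str, str]:
    """Download each file into the local Hugging Face cache and map repo path -> local path."""
    # Imported here so the module loads even when huggingface_hub is not installed.
    from huggingface_hub import hf_hub_download

    return {name: hf_hub_download(repo_id=repo_id, filename=name) for name in filenames}
```

Calling `fetch_files()` returns a dictionary that maps each repository path to its local cached path. Repeated calls reuse the cache instead of downloading again.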

	## Details

	- Architecture: Grad-TTS (diffusion-based TTS)
	- Language: Romanian
	- Phonemization: eSpeak NG (`espeak-ng`)
	- Vocoder: HiFi-GAN (universal v1)
	- Sample rate: 22050 Hz
	- Training data: SWARA 1.0 Romanian speech corpus
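
You can inspect the Romanian phoneme strings for a sentence by calling the `espeak-ng` command-line tool directly. The sketch below uses the `ro` voice, `-q` (no audio output) and `--ipa` (print IPA to stdout). These options are an assumption: the Ro-Grad-TTS text front end may invoke eSpeak NG through a wrapper library with its own settings and symbol set. `espeak_command` and `phonemize_ro` are hypothetical helpers.

```python
# Minimal sketch: phonemize Romanian text with the espeak-ng command-line tool.
# ASSUMPTION: Ro-Grad-TTS's own front end may use different options or a wrapper
# library; this only illustrates calling espeak-ng with the Romanian voice.
import shutil
import subprocess
from typing import List


def espeak_command(text: str, voice: str = "ro") -> List[str]:
    """Build the espeak-ng argv: select the voice, suppress audio, print IPA."""
    return ["espeak-ng", "-v", voice, "-q", "--ipa", text]


def phonemize_ro(text: str) -> str:
    """Return the IPA transcription from espeak-ng (requires espeak-ng on PATH)."""
    if shutil.which("espeak-ng") is None:
        raise RuntimeError("espeak-ng is not installed or not on PATH")
    result = subprocess.run(
        espeak_command(text), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

For example, `phonemize_ro("Bună ziua")` returns the IPA string that eSpeak NG produces for the greeting. The output may vary slightly between eSpeak NG versions.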

	## Available Models

	### Baseline Model

	| Model | Type     | Description                                          |
	| ----- | -------- | ---------------------------------------------------- |
	| swara | Baseline | Speaker-agnostic model trained on full SWARA dataset |

	### Fine-tuned Speaker Models

	| Model   | Speaker      | Training Samples | Fine-tune Epochs | Use Case                         |
	| ------- | ------------ | ---------------- | ---------------- | -------------------------------- |
	| bas_10  | BAS (Female) | 10 samples       | 100              | Few-shot learning / Low-resource |
	| bas_950 | BAS (Female) | 950 samples      | 100              | Production-ready speaker         |
	| sgs_10  | SGS (Male)   | 10 samples       | 100              | Few-shot learning / Low-resource |
	| sgs_950 | SGS (Male)   | 950 samples      | 100              | Production-ready speaker         |

	All models above use the universal HiFi-GAN vocoder (v1).
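
The model names in the tables correspond to checkpoint files. The repository tree also lists fine-tuned checkpoints saved after 15 and 50 epochs in addition to 100. The sketch below maps a model name, and optionally an epoch count, to its repository-relative path. It assumes that the file names follow the pattern `models/<speaker>/grad-tts-<speaker>-<samples>_<epochs>.pt` shown in the tree. `resolve_checkpoint` is a hypothetical helper.

```python
# Minimal sketch: map a model name from the tables to its checkpoint path.
# ASSUMPTION: paths follow the naming pattern in "Repository Structure";
# resolve_checkpoint is a hypothetical helper, not part of Ro-Grad-TTS.
BASELINE_PATH = "models/swara/grad-tts-base-1000.pt"
SPEAKERS = {"bas", "sgs"}
SAMPLE_COUNTS = {10, 950}
EPOCHS = {15, 50, 100}


def resolve_checkpoint(model: str, epochs: int = 100) -> str:
    """Return the checkpoint path for 'swara' or '<speaker>_<samples>' (e.g. 'bas_10')."""
    if model == "swara":
        return BASELINE_PATH
    speaker, _, samples = model.partition("_")
    if speaker not in SPEAKERS or not samples.isdigit() or int(samples) not in SAMPLE_COUNTS:
        raise ValueError(f"unknown model name: {model!r}")
    if epochs not in EPOCHS:
        raise ValueError(f"no checkpoint for {epochs} fine-tune epochs")
    # Normalize the sample count (e.g. '010' -> '10') before formatting the file name.
    return f"models/{speaker}/grad-tts-{speaker}-{int(samples)}_{epochs}.pt"
```

For example, `resolve_checkpoint("sgs_950")` returns `models/sgs/grad-tts-sgs-950_100.pt`, and `resolve_checkpoint("bas_10", epochs=15)` returns `models/bas/grad-tts-bas-10_15.pt`.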

	## Repository Structure

	```text
	adrianstanea/Ro-Grad-TTS/
	├── config.json                 # Model hyperparameters
	├── hifigan_config.json         # Vocoder configuration
	└── models/
	    ├── swara/
	    │   └── grad-tts-base-1000.pt   # Baseline model
	    ├── bas/
	    │   └── grad-tts-bas-{10,950}_{15,50,100}.pt
	    ├── sgs/
	    │   └── grad-tts-sgs-{10,950}_{15,50,100}.pt
	    └── vocoder/
	        └── hifigan_univ_v1     # Universal HiFi-GAN
	```
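
Grad-TTS predicts mel-spectrogram frames, and HiFi-GAN upsamples each frame into a fixed number of audio samples. The vocoder's total upsampling factor must therefore equal the STFT hop size. The sketch below reads `hifigan_config.json`, checks this constraint, and converts a frame count into seconds at the configured sample rate. It assumes that the file uses the key names from the original HiFi-GAN configs (`upsample_rates`, `hop_size`, `sampling_rate`). The helper names are hypothetical.

```python
# Minimal sketch: sanity-check hifigan_config.json and convert mel frames to seconds.
# ASSUMPTION: hifigan_config.json uses the key names of the original HiFi-GAN configs
# ("upsample_rates", "hop_size", "sampling_rate"); adjust if this repository differs.
import json
import math
from pathlib import Path
from typing import Any, Dict


def load_config(path: str) -> Dict[str, Any]:
    """Read a JSON config file from the repository."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def samples_per_frame(vocoder_cfg: Dict[str, Any]) -> int:
    """HiFi-GAN emits prod(upsample_rates) samples per mel frame; this must equal hop_size."""
    total = math.prod(vocoder_cfg["upsample_rates"])
    if total != vocoder_cfg["hop_size"]:
        raise ValueError(f"upsampling factor {total} != hop_size {vocoder_cfg['hop_size']}")
    return total


def frames_to_seconds(n_frames: int, vocoder_cfg: Dict[str, Any]) -> float:
    """Duration of the waveform the vocoder produces from n_frames mel frames."""
    return n_frames * samples_per_frame(vocoder_cfg) / vocoder_cfg["sampling_rate"]
```

With the original HiFi-GAN v1 values (`upsample_rates` of `[8, 8, 2, 2]`, `hop_size` of 256, `sampling_rate` of 22050), each frame covers 256 samples. For example, `frames_to_seconds(862, load_config("hifigan_config.json"))` comes to roughly 10 seconds of audio.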

	## Citation

	If you use this Romanian adaptation in your research, please cite:

	```bibtex
	@ARTICLE{11269795,
	author={Răgman, Teodora and Bogdan Stânea, Adrian and Cucu, Horia and Stan, Adriana},
	journal={IEEE Access},
	title={How Open Is Open TTS? A Practical Evaluation of Open Source TTS Tools},
	year={2025},
	volume={13},
	number={},
	pages={203415-203428},
	keywords={Computer architecture;Training;Text to speech;Spectrogram;Decoding;Computational modeling;Codecs;Predictive models;Acoustics;Low latency communication;Speech synthesis;open tools;evaluation;computational requirements;TTS adaptation;text-to-speech;objective measures;listening test;Romanian},
	doi={10.1109/ACCESS.2025.3637322}
	}
	```

	### Original Grad-TTS Citation

	```bibtex
	@inproceedings{popov2021grad,
	title={Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech},
	author={Popov, Vadim and Vovk, Ivan and Gogoryan, Vladimir and Sadekova, Tasnima and Kudinov, Mikhail},
	booktitle={International Conference on Machine Learning},
	year={2021}
	}
	```

	## References

	- [adrianstanea/Ro-Grad-TTS](https://github.com/adrianstanea/Ro-Grad-TTS.git) - Training, documentation, and research details
	- [huawei-noah/Speech-Backbones](https://github.com/huawei-noah/Speech-Backbones) - Official Grad-TTS implementation (base architecture) and link to the original paper