---
license: apache-2.0
language:
  - ro
tags:
  - text-to-speech
  - Grad-TTS
  - Diffusion
library_name: pytorch
datasets:
  - SWARA-1.0
---

# Ro-Grad-TTS: Romanian Text-to-Speech

Romanian adaptation of [Grad-TTS](https://arxiv.org/abs/2105.06337), trained on the [SWARA 1.0 dataset](https://speech.utcluj.ro/swarasc/).

## Quick Start

This repository contains only the pretrained model weights for Romanian Grad-TTS. The inference package, along with installation and usage instructions, is hosted on GitHub at [adrianstanea/Ro-Grad-TTS](https://github.com/adrianstanea/Ro-Grad-TTS.git).

The package downloads the weights from this repository automatically as needed. To install it and run Romanian TTS inference, follow the instructions in the GitHub repository linked above.

## Details

- **Architecture**: Grad-TTS (diffusion-based TTS)
- **Language**: Romanian
- **Phonemization**: Espeak-ng
- **Vocoder**: HiFi-GAN (universal v1)
- **Sample rate**: 22050 Hz
- **Training data**: SWARA 1.0 Romanian speech corpus

## Available Models

### Baseline Model

| Model     | Type     | Description                                              |
| --------- | -------- | -------------------------------------------------------- |
| **swara** | Baseline | Speaker-agnostic model trained on the full SWARA dataset |

### Fine-tuned Speaker Models

| Model       | Speaker      | Training Samples | Fine-tune Epochs | Use Case                         |
| ----------- | ------------ | ---------------- | ---------------- | -------------------------------- |
| **bas_10**  | BAS (Female) | 10 samples       | 100              | Few-shot learning / Low-resource |
| **bas_950** | BAS (Female) | 950 samples      | 100              | Production-ready speaker         |
| **sgs_10**  | SGS (Male)   | 10 samples       | 100              | Few-shot learning / Low-resource |
| **sgs_950** | SGS (Male)   | 950 samples      | 100              | Production-ready speaker         |

**Vocoder**: All models above are paired with the universal HiFi-GAN vocoder.

## Repository Structure

```sh
adrianstanea/Ro-Grad-TTS/
├── config.json                                      # Model hyperparameters
├── hifigan_config.json                              # Vocoder configuration
└── models/
    ├── swara/
    │   └── grad-tts-base-1000.pt                    # Baseline model
    ├── bas/
    │   └── grad-tts-bas-{10,950}_{15,50,100}.pt
    ├── sgs/
    │   └── grad-tts-sgs-{10,950}_{15,50,100}.pt
    └── vocoder/
        └── hifigan_univ_v1                          # Universal HiFi-GAN
```
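
The package normally fetches these files itself, but the layout above can also be mapped to checkpoint paths by hand. The following is a minimal sketch, not part of the Ro-Grad-TTS package: the helper names `checkpoint_path` and `download_checkpoint` are hypothetical, the Hub repo id is assumed to match the root shown in the tree, and the `{15,50,100}` suffix is assumed to denote the fine-tuning epoch count. The only third-party call used is `hf_hub_download` from `huggingface_hub`.

```python
# Assumption: the Hub repo id matches the root of the tree shown above.
REPO_ID = "adrianstanea/Ro-Grad-TTS"

SPEAKERS = ("bas", "sgs")
SAMPLE_COUNTS = (10, 950)
# Assumption: the trailing number in the filename is the fine-tuning epoch count.
CHECKPOINT_SUFFIXES = (15, 50, 100)


def checkpoint_path(model="swara", samples=None, suffix=100):
    """Build the in-repo path of a checkpoint from the documented layout."""
    if model == "swara":
        # The baseline has a single checkpoint file.
        return "models/swara/grad-tts-base-1000.pt"
    if model not in SPEAKERS:
        raise ValueError(f"unknown model {model!r}; expected 'swara', 'bas' or 'sgs'")
    if samples not in SAMPLE_COUNTS:
        raise ValueError(f"samples must be one of {SAMPLE_COUNTS}")
    if suffix not in CHECKPOINT_SUFFIXES:
        raise ValueError(f"suffix must be one of {CHECKPOINT_SUFFIXES}")
    return f"models/{model}/grad-tts-{model}-{samples}_{suffix}.pt"


def download_checkpoint(model="swara", samples=None, suffix=100):
    """Download one checkpoint and return its local cache path (needs network)."""
    # Imported lazily so that checkpoint_path() works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=checkpoint_path(model, samples, suffix),
    )
```

For example, `checkpoint_path("bas", 950, 100)` yields `models/bas/grad-tts-bas-950_100.pt`, and `download_checkpoint("bas", 950, 100)` would fetch that file into the local Hugging Face cache, provided the assumptions above hold.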

## Citation

If you use this Romanian adaptation in your research, please cite:

```bibtex
@ARTICLE{11269795,
  author={Răgman, Teodora and Bogdan Stânea, Adrian and Cucu, Horia and Stan, Adriana},
  journal={IEEE Access},
  title={How Open Is Open TTS? A Practical Evaluation of Open Source TTS Tools},
  year={2025},
  volume={13},
  number={},
  pages={203415-203428},
  keywords={Computer architecture;Training;Text to speech;Spectrogram;Decoding;Computational modeling;Codecs;Predictive models;Acoustics;Low latency communication;Speech synthesis;open tools;evaluation;computational requirements;TTS adaptation;text-to-speech;objective measures;listening test;Romanian},
  doi={10.1109/ACCESS.2025.3637322}
}
```

### Original Grad-TTS Citation

```bibtex
@article{popov2021grad,
  title={Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech},
  author={Popov, Vadim and Vovk, Ivan and Gogoryan, Vladimir and Sadekova, Tasnima and Kudinov, Mikhail},
  journal={International Conference on Machine Learning},
  year={2021}
}
```

## References

- [adrianstanea/Ro-Grad-TTS](https://github.com/adrianstanea/Ro-Grad-TTS.git) - Training, documentation, and research details
- [huawei-noah/Speech-Backbones](https://github.com/huawei-noah/Speech-Backbones) - Base architecture and paper