Romanian TTS Model (Finetuned)

This is a VITS text-to-speech model for the Romanian language. The base model was trained from scratch on the SWARA dataset and then finetuned on samples from two individual speakers (BAS and SGS).

Model Details

Architecture: VITS
Language: Romanian (ro)
Base Dataset: The SWARA Speech Corpus (18k samples)
Base Model: trained on 16 speakers (includes both male & female voices, balanced data). The base model components can be found in the 'swara' directory.
Finetuning: finetuned separately on 2 speakers (BAS and SGS). Their checkpoints can be found in the 'bas' and 'sgs' directories.
Sample rate: 22050 Hz
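
The directory layout described above can be captured in a small lookup. This is only a sketch: `voice_dir` and the `model_root` argument are hypothetical names introduced for illustration, and the files inside each directory are not listed here, so the helper stops at the directory level.

```python
from pathlib import Path

# Directory names as given in the model details above.
# "swara" is the 16-speaker base model; "bas" and "sgs" are the finetuned voices.
VOICE_DIRS = {"swara": "swara", "bas": "bas", "sgs": "sgs"}


def voice_dir(model_root, voice):
    """Return the directory for `voice` under `model_root` (hypothetical helper)."""
    key = voice.lower()
    if key not in VOICE_DIRS:
        raise ValueError(
            f"unknown voice {voice!r}; expected one of {sorted(VOICE_DIRS)}"
        )
    return Path(model_root) / VOICE_DIRS[key]
```

For example, `voice_dir("ro_vits", "BAS")` resolves to the `bas` directory under an assumed local download folder named `ro_vits`.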

Usage instructions

Usage instructions are included in the official VITS repository: https://github.com/jaywalnut310/vits.git
Our code for finetuning various TTS models for the Romanian language is available at: https://gitlab.com/opentts_ragman/OpenTTS
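
Once a waveform has been synthesized with the VITS inference code, it should be saved at the model's 22050 Hz sample rate. Below is a minimal sketch using only the Python standard library. `save_wav` is a hypothetical helper, the input is assumed to be mono float samples in [-1, 1], and the sine tone merely stands in for real model output.

```python
import math
import struct
import wave

SAMPLE_RATE_HZ = 22050  # sample rate stated in the model details


def save_wav(path, samples, sample_rate=SAMPLE_RATE_HZ):
    """Write mono float samples in [-1, 1] as 16-bit PCM (hypothetical helper)."""
    pcm = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # clamp out-of-range values
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))


# Placeholder for synthesized audio: one second of a 440 Hz tone.
tone = [
    0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ)
    for n in range(SAMPLE_RATE_HZ)
]
save_wav("example.wav", tone)
```

Writing the file at a different rate than the one used in training would change the pitch and speed of the speech on playback, which is why the rate is fixed as a constant here.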

Citation

If you use this model, please cite the original VITS paper and the SWARA dataset:

@inproceedings{kim2021vits,
  title={{Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech}},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={{Proceedings of the 38th International Conference on Machine Learning (ICML)}},
  pages={5530--5540},
  year={2021},
  organization={PMLR}
}

@inproceedings{stan_sped2017,
  author = {Stan, Adriana and Dinescu, Florina and Tiple, Cristina and Meza, Serban and Orza, Bogdan and Chirila, Magdalena and Giurgiu, Mircea},
  title = {{The SWARA Speech Corpus: A Large Parallel Romanian Read Speech Dataset}},
  year = 2017,
  address = {Bucharest, Romania},
  booktitle = {{Proceedings of the 9th Conference on Speech Technology and Human-Computer Dialogue (SpeD)}},
  month = {July, 6-9},
}

If you use these finetuned checkpoints in your work, please cite them as follows:

@ARTICLE{11269795,
  author={Răgman, Teodora and Bogdan Stânea, Adrian and Cucu, Horia and Stan, Adriana},
  journal={IEEE Access}, 
  title={How Open Is Open TTS? A Practical Evaluation of Open Source TTS Tools}, 
  year={2025},
  volume={13},
  number={},
  pages={203415-203428},
  keywords={Computer architecture;Training;Text to speech;Spectrogram;Decoding;Computational modeling;Codecs;Predictive models;Acoustics;Low latency communication;Speech synthesis;open tools;evaluation;computational requirements;TTS adaptation;text-to-speech;objective measures;listening test;Romanian},
  doi={10.1109/ACCESS.2025.3637322}}
