---
license: apache-2.0
language:
- ro
pipeline_tag: text-to-speech
tags:
- tts
- romanian
- matcha-tts
- conditional-flow-matching
- swara
library_name: pytorch
datasets:
- SWARA-1.0
---

# Matcha-TTS Romanian Models
Pre-trained Romanian text-to-speech models based on [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS), trained on the SWARA 1.0 dataset.

## Quick Start

### Clone Repository

Since this repository contains custom inference code and model-loading utilities, clone it first:

```bash
# Clone from the Hugging Face Hub
git clone https://huggingface.co/adrianstanea/Ro-Matcha-TTS
cd Ro-Matcha-TTS

# Install Git LFS (if not already installed) and fetch the large model files
git lfs install
git lfs pull
```
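If `git lfs pull` is skipped, the `.ckpt` files under `models/` remain small text pointers rather than real checkpoints, and inference fails with a confusing deserialization error. As a quick sanity check (a stdlib sketch, not part of this repository's tooling), you can detect checkpoints that are still LFS pointers:

```python
from pathlib import Path

def is_lfs_pointer(path: Path) -> bool:
    """Heuristic: Git LFS pointer files are tiny text files that begin
    with the LFS spec line instead of binary checkpoint data."""
    try:
        head = path.read_bytes()[:64]
    except OSError:
        return False
    return head.startswith(b"version https://git-lfs.github.com/spec/v1")

def unfetched_checkpoints(repo_root: str = ".") -> list[Path]:
    """List .ckpt files that are still pointers, i.e. `git lfs pull` is needed."""
    return [p for p in Path(repo_root).rglob("*.ckpt") if is_lfs_pointer(p)]
```

If `unfetched_checkpoints(".")` returns a non-empty list after cloning, run `git lfs pull` again from the repository root.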
### Installation

```bash
# Install system dependencies (required for phonemization)
sudo apt-get install espeak-ng

# Install the main Matcha-TTS repository
pip install git+https://github.com/adrianstanea/Matcha-TTS.git

# Install required dependencies
pip install -r requirements.txt
```
### Usage

```python
import sys
sys.path.append("src")
from model_loader import ModelLoader

# Load from the local cloned repository
loader = ModelLoader.from_pretrained("./")

# List available models
print(loader.list_models())
# {'swara': {...}, 'bas_10': {...}, 'bas_950': {...}, ...}

# Load the production-ready BAS speaker
model_info = loader.load_models(model="bas_950")
print(f"Model: {model_info['model_name']}")
print(f"Path: {model_info['model_path']}")

# Load the few-shot SGS speaker
model_info = loader.load_models(model="sgs_10")
print(f"Training data: {model_info['model_info']['training_data']}")

# Use with the original Matcha-TTS inference code;
# see examples/inference_example.py for complete usage
```
### Run Example

```bash
cd examples
python inference_example.py
```
## Available Models

### Baseline Model

| Model     | Type     | Description                                          |
| --------- | -------- | ---------------------------------------------------- |
| **swara** | Baseline | Speaker-agnostic model trained on full SWARA dataset |

### Fine-tuned Speaker Models

| Model       | Speaker    | Training Samples | Fine-tune Epochs | Use Case                         |
| ----------- | ---------- | ---------------- | ---------------- | -------------------------------- |
| **bas_10**  | BAS (Male) | 10 samples       | 100              | Few-shot learning / low-resource |
| **bas_950** | BAS (Male) | 950 samples      | 100              | Production-ready speaker         |
| **sgs_10**  | SGS (Male) | 10 samples       | 100              | Few-shot learning / low-resource |
| **sgs_950** | SGS (Male) | 950 samples      | 100              | Production-ready speaker         |

**Vocoder**: Universal HiFi-GAN vocoder
### Research Methodology

- **Training Strategy**: Baseline → Speaker Fine-tuning (100 epochs)
- **Data Efficiency Study**: 10 vs. 950 samples comparison
- **Low-Resource Learning**: Demonstrates few-shot TTS adaptation
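The 10- vs. 950-sample conditions amount to drawing fixed subsets of one speaker's utterances before fine-tuning. The sketch below shows how such subsets could be drawn reproducibly; the seed and sampling strategy here are illustrative assumptions, not the ones used to train the released checkpoints:

```python
import random

def sample_finetune_subset(utterance_ids, n_samples, seed=0):
    """Draw a fixed, reproducible subset of a speaker's utterances for
    fine-tuning, mirroring the 10- vs. 950-sample comparison. The seed and
    selection strategy are illustrative, not those used for the released models."""
    if n_samples > len(utterance_ids):
        raise ValueError("requested subset is larger than the available data")
    rng = random.Random(seed)  # fixed seed so both conditions are rerunnable
    return sorted(rng.sample(list(utterance_ids), n_samples))
```

The same seeded call can then build both the few-shot (10) and full (950) fine-tuning sets from one speaker's utterance list.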
## Model Details

- **Architecture**: Matcha-TTS (Conditional Flow Matching)
- **Dataset**: SWARA 1.0 Romanian Speech Corpus
- **Sample Rate**: 22,050 Hz
- **Language**: Romanian (ro)
- **Text Processing**: eSpeak Romanian phonemizer
- **Model Size**: ~100M parameters per model
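At 22,050 Hz, audio duration maps to mel-spectrogram frame counts through the vocoder hop length. The arithmetic below assumes the common Matcha-TTS/HiFi-GAN hop of 256 samples, which is an assumption and not stated in this card:

```python
SAMPLE_RATE = 22_050   # from the model card
HOP_LENGTH = 256       # assumed: the usual Matcha-TTS / HiFi-GAN default

def mel_frames_for(seconds: float) -> int:
    """Approximate number of mel frames the acoustic model produces
    for a given audio duration."""
    return int(seconds * SAMPLE_RATE / HOP_LENGTH)

def audio_samples_for(frames: int) -> int:
    """Number of waveform samples the vocoder renders from `frames` mel frames."""
    return frames * HOP_LENGTH
```

Under these assumptions, one second of speech corresponds to roughly 86 mel frames.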
## Repository Structure

```
├── models/                          # Model checkpoints (Git LFS)
│   ├── swara/
│   │   └── matcha-base-1000.ckpt    # Baseline model (1000 epochs)
│   ├── bas/
│   │   ├── matcha-bas-10_100.ckpt   # BAS speaker (10 samples, 100 epochs)
│   │   └── matcha-bas-950_100.ckpt  # BAS speaker (950 samples, 100 epochs)
│   ├── sgs/
│   │   ├── matcha-sgs-10_100.ckpt   # SGS speaker (10 samples, 100 epochs)
│   │   └── matcha-sgs-950_100.ckpt  # SGS speaker (950 samples, 100 epochs)
│   └── vocoder/
│       └── hifigan_univ_v1          # Universal HiFi-GAN vocoder
├── configs/
│   └── config.json                  # Model configuration
├── src/
│   └── model_loader.py              # HuggingFace-compatible loader
└── examples/
    ├── sample_texts_ro.txt          # Sample Romanian texts
    └── inference_example.py         # Complete usage example
```
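The tree above fixes where each named model lives on disk. As a standalone illustration (the real mapping is handled by `ModelLoader` via `configs/config.json`, whose format this sketch does not reproduce), model names from the tables can be resolved to checkpoint paths like this:

```python
from pathlib import Path

# Checkpoint layout as documented in the repository tree; this registry is
# an illustrative stand-in for ModelLoader's own configuration.
CHECKPOINTS = {
    "swara":   "models/swara/matcha-base-1000.ckpt",
    "bas_10":  "models/bas/matcha-bas-10_100.ckpt",
    "bas_950": "models/bas/matcha-bas-950_100.ckpt",
    "sgs_10":  "models/sgs/matcha-sgs-10_100.ckpt",
    "sgs_950": "models/sgs/matcha-sgs-950_100.ckpt",
}

def resolve_checkpoint(name: str, repo_root: str = ".") -> Path:
    """Map a model name from the tables above to its checkpoint path."""
    try:
        return Path(repo_root) / CHECKPOINTS[name]
    except KeyError:
        raise ValueError(f"unknown model {name!r}; choose from {sorted(CHECKPOINTS)}")
```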
## Usage with Original Repository

This repository provides model weights and HuggingFace integration. For training, evaluation, and advanced features, use the [main repository](https://github.com/adrianstanea/Matcha-TTS).

```python
# After loading models with ModelLoader
from matcha.models.matcha_tts import MatchaTTS
import torch

# Load using paths from ModelLoader
model = MatchaTTS.load_from_checkpoint(model_info['model_path'])
# ... continue with the original inference code
```
## Requirements

- Python 3.10
- The main Matcha-TTS repository for inference
- HuggingFace Hub for model downloading
## License

Same as the original [Matcha-TTS repository](https://github.com/adrianstanea/Matcha-TTS).
## Citation

If you use this Romanian adaptation in your research, please cite:

```bibtex
@ARTICLE{11269795,
  author={Răgman, Teodora and Bogdan Stânea, Adrian and Cucu, Horia and Stan, Adriana},
  journal={IEEE Access},
  title={How Open Is Open TTS? A Practical Evaluation of Open Source TTS Tools},
  year={2025},
  volume={13},
  number={},
  pages={203415-203428},
  keywords={Computer architecture;Training;Text to speech;Spectrogram;Decoding;Computational modeling;Codecs;Predictive models;Acoustics;Low latency communication;Speech synthesis;open tools;evaluation;computational requirements;TTS adaptation;text-to-speech;objective measures;listening test;Romanian},
  doi={10.1109/ACCESS.2025.3637322}
}
```
**Original Matcha-TTS Citation:**

```bibtex
@inproceedings{mehta2024matcha,
  title={Matcha-{TTS}: A fast {TTS} architecture with conditional flow matching},
  author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  year={2024}
}
```
## Links

- [Main Repository](https://github.com/adrianstanea/Matcha-TTS) - Training, documentation, and research details
- [Original Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS) - Base architecture and paper