---
license: apache-2.0
language:
- ro
pipeline_tag: text-to-speech
tags:
- tts
- romanian
- matcha-tts
- conditional-flow-matching
- swara
library_name: pytorch
datasets:
- SWARA-1.0
---

# Matcha-TTS Romanian Models

Pre-trained Romanian text-to-speech models based on [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS), trained on the SWARA 1.0 dataset.

## Quick Start

### Clone Repository

Since this repository contains custom inference code and model loading utilities, you need to clone it:

```bash
# Clone from HuggingFace Hub
git clone https://huggingface.co/adrianstanea/Ro-Matcha-TTS
cd Ro-Matcha-TTS

# Install Git LFS (if not already installed) to download large model files
git lfs install
git lfs pull
```

### Installation

```bash
# Install system dependencies (required for phonemization)
sudo apt-get install espeak-ng

# Install the main Matcha-TTS repository
pip install git+https://github.com/adrianstanea/Matcha-TTS.git

# Install required dependencies
pip install -r requirements.txt
```
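Phonemization fails silently in confusing ways when `espeak-ng` is missing, so a quick preflight check can save debugging time. This is an illustrative sketch, not part of the repository's code; it only checks that the binary is on `PATH`:

```python
# Illustrative preflight check (not part of this repository): verify that the
# espeak-ng binary required for Romanian phonemization is available on PATH.
import shutil


def espeak_available() -> bool:
    """Return True if the espeak-ng executable can be found on PATH."""
    return shutil.which("espeak-ng") is not None


if not espeak_available():
    print("espeak-ng not found; install it with `sudo apt-get install espeak-ng`")
```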

### Usage

```python
import sys
sys.path.append("src")
from model_loader import ModelLoader

# Load from local cloned repository
loader = ModelLoader.from_pretrained("./")

# List available models
print(loader.list_models())
# {'swara': {...}, 'bas_10': {...}, 'bas_950': {...}, ...}

# Load production-ready BAS speaker
model_info = loader.load_models(model="bas_950")
print(f"Model: {model_info['model_name']}")
print(f"Path: {model_info['model_path']}")

# Load few-shot SGS speaker
model_info = loader.load_models(model="sgs_10")
print(f"Training data: {model_info['model_info']['training_data']}")

# Use with original Matcha-TTS inference code
# See examples/inference_example.py for complete usage
```

### Run Example

```bash
cd examples
python inference_example.py
```

## Available Models

### Baseline Model

| Model     | Type     | Description                                          |
| --------- | -------- | ---------------------------------------------------- |
| **swara** | Baseline | Speaker-agnostic model trained on full SWARA dataset |

### Fine-tuned Speaker Models

| Model       | Speaker    | Training Samples | Fine-tune Epochs | Use Case                         |
| ----------- | ---------- | ---------------- | ---------------- | -------------------------------- |
| **bas_10**  | BAS (Male) | 10 samples       | 100              | Few-shot learning / Low-resource |
| **bas_950** | BAS (Male) | 950 samples      | 100              | Production-ready speaker         |
| **sgs_10**  | SGS (Male) | 10 samples       | 100              | Few-shot learning / Low-resource |
| **sgs_950** | SGS (Male) | 950 samples      | 100              | Production-ready speaker         |

**Vocoder**: All models use the universal HiFi-GAN vocoder

### Research Methodology

- **Training Strategy**: Baseline → Speaker Fine-tuning (100 epochs)
- **Data Efficiency Study**: 10 vs 950 samples comparison
- **Low-Resource Learning**: Demonstrates few-shot TTS adaptation

## Model Details

- **Architecture**: Matcha-TTS (Conditional Flow Matching)
- **Dataset**: SWARA 1.0 Romanian Speech Corpus
- **Sample Rate**: 22,050 Hz
- **Language**: Romanian (ro)
- **Text Processing**: eSpeak Romanian phonemizer
- **Model Size**: ~100M parameters per model
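To catch mismatches early (for example, a vocoder expecting a different sample rate), the values above can be checked against a loaded configuration. The key names below are assumptions for illustration, not the schema of `configs/config.json`:

```python
# Illustrative sanity check: compare a loaded config dict against the values
# stated on this card. The key names ("sample_rate", "language") are assumed
# for the sketch and may differ from the repository's actual config schema.
EXPECTED = {"sample_rate": 22050, "language": "ro"}


def check_config(cfg: dict) -> list[str]:
    """Return human-readable mismatches between cfg and the card's stated values."""
    return [
        f"{key}: expected {want}, got {cfg.get(key)}"
        for key, want in EXPECTED.items()
        if cfg.get(key) != want
    ]


print(check_config({"sample_rate": 22050, "language": "ro"}))  # []
```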

## Repository Structure

```
├── models/                          # Model checkpoints (Git LFS)
│   ├── swara/
│   │   └── matcha-base-1000.ckpt   # Baseline model (1000 epochs)
│   ├── bas/
│   │   ├── matcha-bas-10_100.ckpt  # BAS speaker (10 samples, 100 epochs)
│   │   └── matcha-bas-950_100.ckpt # BAS speaker (950 samples, 100 epochs)
│   ├── sgs/
│   │   ├── matcha-sgs-10_100.ckpt  # SGS speaker (10 samples, 100 epochs)
│   │   └── matcha-sgs-950_100.ckpt # SGS speaker (950 samples, 100 epochs)
│   └── vocoder/
│       └── hifigan_univ_v1         # Universal HiFi-GAN vocoder
├── configs/
│   └── config.json                  # Model configuration
├── src/
│   └── model_loader.py              # HuggingFace-compatible loader
└── examples/
    ├── sample_texts_ro.txt          # Sample Romanian texts
    └── inference_example.py         # Complete usage example
```
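The layout above follows a consistent naming scheme (`matcha-<speaker>-<samples>_<epochs>.ckpt`), so the checkpoint for each model name can be derived directly. The helper below is hypothetical, written only to make that mapping explicit; it is not part of `src/model_loader.py`:

```python
# Hypothetical helper (not part of this repository's code): map the model names
# used on this card to the checkpoint paths implied by the layout above.
from pathlib import Path

CHECKPOINTS = {
    "swara":   "models/swara/matcha-base-1000.ckpt",
    "bas_10":  "models/bas/matcha-bas-10_100.ckpt",
    "bas_950": "models/bas/matcha-bas-950_100.ckpt",
    "sgs_10":  "models/sgs/matcha-sgs-10_100.ckpt",
    "sgs_950": "models/sgs/matcha-sgs-950_100.ckpt",
}


def checkpoint_path(model: str, repo_root: str = ".") -> Path:
    """Return the expected checkpoint path for a model name from this card."""
    try:
        return Path(repo_root) / CHECKPOINTS[model]
    except KeyError:
        raise ValueError(f"Unknown model {model!r}; choose from {sorted(CHECKPOINTS)}") from None


print(checkpoint_path("bas_950"))  # models/bas/matcha-bas-950_100.ckpt
```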

## Usage with Original Repository

This repository provides model weights and HuggingFace integration. For training, evaluation, and advanced features, use the [main repository](https://github.com/adrianstanea/Matcha-TTS).

```python
# After loading models with ModelLoader
from matcha.models.matcha_tts import MatchaTTS
import torch

# Load using paths from ModelLoader
model = MatchaTTS.load_from_checkpoint(model_info['model_path'])
# ... continue with original inference code
```

## Requirements

- Python 3.10
- Main Matcha-TTS repository for inference
- HuggingFace Hub for model downloading

## License

Same as the original [Matcha-TTS repository](https://github.com/adrianstanea/Matcha-TTS).

## Citation

If you use this Romanian adaptation in your research, please cite:

```bibtex
@ARTICLE{11269795,
  author={Răgman, Teodora and Bogdan Stănea, Adrian and Cucu, Horia and Stan, Adriana},
  journal={IEEE Access},
  title={How Open Is Open TTS? A Practical Evaluation of Open Source TTS Tools},
  year={2025},
  volume={13},
  number={},
  pages={203415-203428},
  keywords={Computer architecture;Training;Text to speech;Spectrogram;Decoding;Computational modeling;Codecs;Predictive models;Acoustics;Low latency communication;Speech synthesis;open tools;evaluation;computational requirements;TTS adaptation;text-to-speech;objective measures;listening test;Romanian},
  doi={10.1109/ACCESS.2025.3637322}
}
```

**Original Matcha-TTS Citation:**

```bibtex
@inproceedings{mehta2024matcha,
  title={Matcha-{TTS}: A fast {TTS} architecture with conditional flow matching},
  author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  year={2024}
}
```

## Links

- [Main Repository](https://github.com/adrianstanea/Matcha-TTS) - Training, documentation, and research details
- [Original Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS) - Base architecture and paper