---
license: apache-2.0
language:
- ro
pipeline_tag: text-to-speech
tags:
- tts
- romanian
- matcha-tts
- conditional-flow-matching
- swara
library_name: pytorch
datasets:
- SWARA-1.0
---

# Matcha-TTS Romanian Models
Pre-trained Romanian text-to-speech models based on [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS), trained on the SWARA 1.0 dataset.

## Quick Start

### Clone Repository

Since this repository contains custom inference code and model-loading utilities, clone it first:

```bash
# Clone from the Hugging Face Hub
git clone https://huggingface.co/adrianstanea/Ro-Matcha-TTS
cd Ro-Matcha-TTS

# Install Git LFS (if not already installed) and fetch the large model files
git lfs install
git lfs pull
```
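If `git lfs pull` is skipped, the `.ckpt` files under `models/` remain small text pointers rather than real checkpoints, and inference fails with a confusing deserialization error. As a quick sanity check (a stdlib sketch, not part of this repository's tooling), you can detect checkpoints that are still LFS pointers:

```python
from pathlib import Path

def is_lfs_pointer(path: Path) -> bool:
    """Heuristic: Git LFS pointer files are tiny text files that begin
    with the LFS spec line instead of binary checkpoint data."""
    try:
        head = path.read_bytes()[:64]
    except OSError:
        return False
    return head.startswith(b"version https://git-lfs.github.com/spec/v1")

def unfetched_checkpoints(repo_root: str = ".") -> list[Path]:
    """List .ckpt files that are still pointers, i.e. `git lfs pull` is needed."""
    return [p for p in Path(repo_root).rglob("*.ckpt") if is_lfs_pointer(p)]
```

If `unfetched_checkpoints(".")` returns a non-empty list after cloning, run `git lfs pull` again from the repository root.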
### Installation

```bash
# Install system dependencies (required for phonemization)
sudo apt-get install espeak-ng

# Install the main Matcha-TTS repository
pip install git+https://github.com/adrianstanea/Matcha-TTS.git

# Install required dependencies
pip install -r requirements.txt
```
### Usage

```python
import sys
sys.path.append("src")
from model_loader import ModelLoader

# Load from the local cloned repository
loader = ModelLoader.from_pretrained("./")

# List available models
print(loader.list_models())
# {'swara': {...}, 'bas_10': {...}, 'bas_950': {...}, ...}

# Load the production-ready BAS speaker
model_info = loader.load_models(model="bas_950")
print(f"Model: {model_info['model_name']}")
print(f"Path: {model_info['model_path']}")

# Load the few-shot SGS speaker
model_info = loader.load_models(model="sgs_10")
print(f"Training data: {model_info['model_info']['training_data']}")

# Use with the original Matcha-TTS inference code;
# see examples/inference_example.py for complete usage
```
### Run Example

```bash
cd examples
python inference_example.py
```
## Available Models

### Baseline Model

| Model     | Type     | Description                                          |
| --------- | -------- | ---------------------------------------------------- |
| **swara** | Baseline | Speaker-agnostic model trained on full SWARA dataset |

### Fine-tuned Speaker Models

| Model       | Speaker    | Training Samples | Fine-tune Epochs | Use Case                         |
| ----------- | ---------- | ---------------- | ---------------- | -------------------------------- |
| **bas_10**  | BAS (Male) | 10 samples       | 100              | Few-shot learning / low-resource |
| **bas_950** | BAS (Male) | 950 samples      | 100              | Production-ready speaker         |
| **sgs_10**  | SGS (Male) | 10 samples       | 100              | Few-shot learning / low-resource |
| **sgs_950** | SGS (Male) | 950 samples      | 100              | Production-ready speaker         |

**Vocoder**: Universal HiFi-GAN vocoder
### Research Methodology

- **Training Strategy**: Baseline → Speaker Fine-tuning (100 epochs)
- **Data Efficiency Study**: 10 vs. 950 samples comparison
- **Low-Resource Learning**: Demonstrates few-shot TTS adaptation
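The 10- vs. 950-sample conditions amount to drawing fixed subsets of one speaker's utterances before fine-tuning. The sketch below shows how such subsets could be drawn reproducibly; the seed and sampling strategy here are illustrative assumptions, not the ones used to train the released checkpoints:

```python
import random

def sample_finetune_subset(utterance_ids, n_samples, seed=0):
    """Draw a fixed, reproducible subset of a speaker's utterances for
    fine-tuning, mirroring the 10- vs. 950-sample comparison. The seed and
    selection strategy are illustrative, not those used for the released models."""
    if n_samples > len(utterance_ids):
        raise ValueError("requested subset is larger than the available data")
    rng = random.Random(seed)  # fixed seed so both conditions are rerunnable
    return sorted(rng.sample(list(utterance_ids), n_samples))
```

The same seeded call can then build both the few-shot (10) and full (950) fine-tuning sets from one speaker's utterance list.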
## Model Details

- **Architecture**: Matcha-TTS (Conditional Flow Matching)
- **Dataset**: SWARA 1.0 Romanian Speech Corpus
- **Sample Rate**: 22,050 Hz
- **Language**: Romanian (ro)
- **Text Processing**: eSpeak Romanian phonemizer
- **Model Size**: ~100M parameters per model
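At 22,050 Hz, audio duration maps to mel-spectrogram frame counts through the vocoder hop length. The arithmetic below assumes the common Matcha-TTS/HiFi-GAN hop of 256 samples, which is an assumption and not stated in this card:

```python
SAMPLE_RATE = 22_050   # from the model card
HOP_LENGTH = 256       # assumed: the usual Matcha-TTS / HiFi-GAN default

def mel_frames_for(seconds: float) -> int:
    """Approximate number of mel frames the acoustic model produces
    for a given audio duration."""
    return int(seconds * SAMPLE_RATE / HOP_LENGTH)

def audio_samples_for(frames: int) -> int:
    """Number of waveform samples the vocoder renders from `frames` mel frames."""
    return frames * HOP_LENGTH
```

Under these assumptions, one second of speech corresponds to roughly 86 mel frames.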
## Repository Structure

```
├── models/                          # Model checkpoints (Git LFS)
│   ├── swara/
│   │   └── matcha-base-1000.ckpt    # Baseline model (1000 epochs)
│   ├── bas/
│   │   ├── matcha-bas-10_100.ckpt   # BAS speaker (10 samples, 100 epochs)
│   │   └── matcha-bas-950_100.ckpt  # BAS speaker (950 samples, 100 epochs)
│   ├── sgs/
│   │   ├── matcha-sgs-10_100.ckpt   # SGS speaker (10 samples, 100 epochs)
│   │   └── matcha-sgs-950_100.ckpt  # SGS speaker (950 samples, 100 epochs)
│   └── vocoder/
│       └── hifigan_univ_v1          # Universal HiFi-GAN vocoder
├── configs/
│   └── config.json                  # Model configuration
├── src/
│   └── model_loader.py              # HuggingFace-compatible loader
└── examples/
    ├── sample_texts_ro.txt          # Sample Romanian texts
    └── inference_example.py         # Complete usage example
```
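The tree above fixes where each named model lives on disk. As a standalone illustration (the real mapping is handled by `ModelLoader` via `configs/config.json`, whose format this sketch does not reproduce), model names from the tables can be resolved to checkpoint paths like this:

```python
from pathlib import Path

# Checkpoint layout as documented in the repository tree; this registry is
# an illustrative stand-in for ModelLoader's own configuration.
CHECKPOINTS = {
    "swara":   "models/swara/matcha-base-1000.ckpt",
    "bas_10":  "models/bas/matcha-bas-10_100.ckpt",
    "bas_950": "models/bas/matcha-bas-950_100.ckpt",
    "sgs_10":  "models/sgs/matcha-sgs-10_100.ckpt",
    "sgs_950": "models/sgs/matcha-sgs-950_100.ckpt",
}

def resolve_checkpoint(name: str, repo_root: str = ".") -> Path:
    """Map a model name from the tables above to its checkpoint path."""
    try:
        return Path(repo_root) / CHECKPOINTS[name]
    except KeyError:
        raise ValueError(f"unknown model {name!r}; choose from {sorted(CHECKPOINTS)}")
```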
## Usage with Original Repository

This repository provides model weights and HuggingFace integration. For training, evaluation, and advanced features, use the [main repository](https://github.com/adrianstanea/Matcha-TTS).

```python
# After loading models with ModelLoader
from matcha.models.matcha_tts import MatchaTTS
import torch

# Load using paths from ModelLoader
model = MatchaTTS.load_from_checkpoint(model_info['model_path'])
# ... continue with the original inference code
```
## Requirements

- Python 3.10
- The main Matcha-TTS repository for inference
- HuggingFace Hub for model downloading
## License

Same as the original [Matcha-TTS repository](https://github.com/adrianstanea/Matcha-TTS).
## Citation

If you use this Romanian adaptation in your research, please cite:

```bibtex
@ARTICLE{11269795,
  author={Răgman, Teodora and Bogdan Stânea, Adrian and Cucu, Horia and Stan, Adriana},
  journal={IEEE Access},
  title={How Open Is Open TTS? A Practical Evaluation of Open Source TTS Tools},
  year={2025},
  volume={13},
  number={},
  pages={203415-203428},
  keywords={Computer architecture;Training;Text to speech;Spectrogram;Decoding;Computational modeling;Codecs;Predictive models;Acoustics;Low latency communication;Speech synthesis;open tools;evaluation;computational requirements;TTS adaptation;text-to-speech;objective measures;listening test;Romanian},
  doi={10.1109/ACCESS.2025.3637322}
}
```
**Original Matcha-TTS Citation:**

```bibtex
@inproceedings{mehta2024matcha,
  title={Matcha-{TTS}: A fast {TTS} architecture with conditional flow matching},
  author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  year={2024}
}
```
## Links

- [Main Repository](https://github.com/adrianstanea/Matcha-TTS) - Training, documentation, and research details
- [Original Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS) - Base architecture and paper