XTTS-v2 Romanian
Fine-tuned XTTS-v2 for high-quality Romanian text-to-speech with voice cloning. Achieves 6.3% average WER (measured with Whisper large-v3) across 5 distinct voices, trained on ~150 hours of Romanian speech.
Live Demo & Audio Samples | Training Code (Codeberg)
Audio Samples
Costel (male, literary narration) — 1.9% WER
"Profesorul a explicat cu răbdare lecția dificilă de matematică."
"Această carte reprezintă o contribuție importantă la literatura contemporană."
"Bună ziua, mă numesc Alexandru și sunt din București."
Mărioara (female, expressive storytelling) — 6.1% WER
"Țara românească și-a păstrat tradițiile străvechi de-a lungul secolelor."
"Bucură-te de bucuria Bucuroaiei cum s-a bucurat și ea de bucuria lui Bucurel când a venit de la București."
Georgel (male, solemn delivery) — 1.8% WER
"Fișierele și rețelele informatice sunt esențiale în științele moderne."
Lăcrămioara (female, clear broadcast style) — 7.2% WER
"Ce-ntâmplare întâmplăreață s-a-ntâmplat în tâmplărie, un tâmplar din întâmplare s-a lovit cu tâmpla-n cap."
Dorel (male, conversational) — 14.5% WER
"Țara românească și-a păstrat tradițiile străvechi de-a lungul secolelor."
Key Innovation: Unicode Diacritics Normalization
Romanian uses comma-below diacritics (ș U+0219, ț U+021B), but many text sources contain visually near-identical cedilla variants (ş U+015F, ţ U+0163) inherited from legacy encodings. These are distinct Unicode codepoints that map to different token embeddings.
This model solves the problem at two levels:
- Smart embedding initialization -- new Romanian token embeddings were initialized from their closest existing donors (Turkish cedilla characters) rather than random weights, giving the model a meaningful starting point.
- Runtime normalization -- all input text must be normalized to comma-below form before inference (see Quick Start below).
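The difference is invisible on screen but real at the codepoint level. A minimal sketch of the runtime normalization (the same translation table appears in Quick Start below):

```python
# Map legacy cedilla codepoints to the correct comma-below codepoints
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0162": "\u021a",  # Ţ -> Ț
})

# Legacy text written with cedilla codepoints (visually near-identical)
legacy = "Bucure\u015fti \u015fi \u0163ara"
fixed = legacy.translate(CEDILLA_TO_COMMA)

print(fixed)            # București și țara (comma-below form)
print(legacy == fixed)  # False -- the strings differ at the codepoint level
```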
Quick Start
Installation
```bash
pip install TTS==0.22.0
```
Note: TTS 0.22.0 requires patches for PyTorch 2.x compatibility and Romanian tokenizer support. See setup_runpod.sh for the exact patches needed.
Download Model
```bash
# Clone the model repo
git lfs install
git clone https://huggingface.co/eduardem/xtts-v2-romanian
cd xtts-v2-romanian
```
Or download individual files:
```python
from huggingface_hub import hf_hub_download

for fname in ["config.json", "model.pth", "dvae.pth", "mel_stats.pth", "vocab.json", "speakers_xtts.pth"]:
    hf_hub_download(repo_id="eduardem/xtts-v2-romanian", filename=fname, local_dir="xtts-v2-romanian")
```
Basic Inference
```python
import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# ---------------------------------------------------------------
# REQUIRED: Normalize cedilla -> comma-below before every inference.
# Without this, diacritics will be silently mispronounced or skipped.
# ---------------------------------------------------------------
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș (lowercase s)
    "\u0163": "\u021b",  # ţ -> ț (lowercase t)
    "\u015e": "\u0218",  # Ş -> Ș (uppercase S)
    "\u0162": "\u021a",  # Ţ -> Ț (uppercase T)
})

# Load model
config = XttsConfig()
config.load_json("xtts-v2-romanian/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts-v2-romanian", use_deepspeed=False)
model.cuda()

# Prepare text -- always normalize!
text = "Bună ziua, mă numesc Alexandru și sunt din București."
text = text.translate(CEDILLA_TO_COMMA)

# Clone a voice from a ~6 s reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["xtts-v2-romanian/reference_voices/costel.wav"]
)

# Generate speech
out = model.inference(
    text=text,
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.3,
    top_p=0.7,
    top_k=30,
    length_penalty=0.8,
    repetition_penalty=10.0,
)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
Voice Cloning with Your Own Voice
You can clone any voice from a ~6-second WAV reference clip:
```python
# Use your own reference audio (WAV, ~6 seconds, clear speech)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["my_voice.wav"]
)

out = model.inference(
    text="Aceasta este o propoziție de test în limba română.".translate(CEDILLA_TO_COMMA),
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)
```
Inference Parameters
| Parameter | Default | Description |
|---|---|---|
| `temperature` | 0.3 | Sampling temperature. Lower = more deterministic. |
| `top_p` | 0.7 | Nucleus sampling threshold. |
| `top_k` | 30 | Top-k sampling. |
| `length_penalty` | 0.8 | Values < 1.0 discourage overly long output. |
| `repetition_penalty` | 10.0 | Penalizes repeated tokens. |
Voices
Five voices are included with reference audio clips in the reference_voices/ directory.
| Voice | Speaker ID | Description | WER |
|---|---|---|---|
| Costel | `speaker_male_literature` | Male, literary narration | 1.9% |
| Mărioara | `speaker_female_hp` | Female, expressive storytelling | 6.1% |
| Lăcrămioara | `speaker_female_adr` | Female, clear broadcast style | 7.2% |
| Georgel | `speaker_male_bible` | Male, solemn delivery | 1.8% |
| Dorel | `speaker_male_4` | Male, conversational | 14.5% |
WER is measured per-voice on 10 test sentences (5 diacritics-heavy + 5 common phrases) using Whisper large-v3. Lower is better.
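The metric is the standard word error rate: word-level edit distance between the Whisper transcript and the reference text, divided by the reference length. A self-contained sketch (the Whisper transcription step is elided; `wer` here is a plain reimplementation, not the project's evaluation script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic dynamic-programming edit distance, one row at a time
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)

ref = "țara românească și-a păstrat tradițiile străvechi"
hyp = "țara romaneasca și-a păstrat tradițiile străvechi"  # 1 substitution
print(f"{wer(ref, hyp):.1%}")  # 16.7% -> 1 error over 6 reference words
```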
Training
Dataset
eduardem/romanian-speech-v1 -- approximately 62,000 clips totaling ~150 hours of Romanian speech from 5 speakers, sampled at 22,050 Hz.
Infrastructure
- GPU: NVIDIA RTX 5000 Ada (32 GB) on RunPod
- Training time: ~42 hours (50 epochs, ~51 minutes per epoch)
- Global steps: 678,350
- Final loss (epoch 45): mel=2.72, text=0.025
Two-Phase Training Strategy
Phase 1 -- Embedding Warmup (1 epoch)
- Freeze all GPT layers; train only `text_embedding` and `text_pos_embedding` (1.4% of parameters)
- Learning rate: 1e-4
- Purpose: bring the 6 new Romanian token embeddings (ă, â, î, ș, ț, `[ro]`) from random initialization to a meaningful representation before unfreezing the full model
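A toy PyTorch sketch of the warmup setup (sizes and token ids are made up; in the real run the donor rows are the Turkish cedilla characters and the trained modules are XTTS's `text_embedding` / `text_pos_embedding`):

```python
import torch
import torch.nn as nn

# Toy stand-in for the XTTS GPT text embedding table
vocab_size, dim = 6700, 256
text_embedding = nn.Embedding(vocab_size, dim)

# Smart init: copy donor rows (e.g. Turkish ş/ţ) into the new Romanian
# token rows instead of leaving them randomly initialized.
donor_to_new = {101: 6694, 102: 6695}   # hypothetical ids: ş->ș, ţ->ț
with torch.no_grad():
    for donor, new in donor_to_new.items():
        text_embedding.weight[new] = text_embedding.weight[donor]

# Phase 1: freeze everything except the embedding tables
model = nn.ModuleDict({"text_embedding": text_embedding})  # plus the frozen GPT in reality
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("text_embedding")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```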
Phase 2 -- Full GPT Fine-Tune (50 epochs)
- Unfreeze all GPT layers
- Learning rate: 5e-6
- Batch size: 4, gradient accumulation: 63 (effective batch size = 252)
- Optimizer: AdamW (betas=0.9, 0.96)
- `text_ce_weight=0.01` (auxiliary regularizer, not primary objective)
- `gpt_use_masking_gt_prompt_approach=True` (prevents reference audio parroting)
Critical Training Parameters
These values are non-negotiable for XTTS-v2 fine-tuning:
- Effective batch size >= 252 -- the official Coqui recipe minimum. Smaller batches cause mode collapse.
- text_ce_weight = 0.01 -- increasing this breaks the mel/text loss balance and causes text-prediction shortcuts.
- gpt_use_masking_gt_prompt_approach = True -- without this, the model learns to copy the reference audio instead of conditioning on input text.
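The effective batch size is reached through gradient accumulation rather than one large batch; a minimal sketch of the pattern with a toy model (the real recipe applies this inside the Coqui trainer):

```python
import torch
import torch.nn as nn

batch_size, grad_accum = 4, 63
assert batch_size * grad_accum == 252      # effective batch size

model = nn.Linear(8, 1)                    # toy stand-in for the GPT
opt = torch.optim.AdamW(model.parameters(), lr=5e-6, betas=(0.9, 0.96))

opt.zero_grad()
for step in range(grad_accum):
    x = torch.randn(batch_size, 8)
    loss = model(x).pow(2).mean() / grad_accum  # scale so gradients average
    loss.backward()                             # gradients accumulate in .grad
opt.step()                                      # one update per 252 samples
opt.zero_grad()
```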
Evaluation
WER progression during training, measured on held-out test sentences with Whisper large-v3:
| Checkpoint | Overall WER | Diacritics WER | Common WER |
|---|---|---|---|
| Phase 1 (warmup) | 144.9% | 146.3% | 143.6% |
| Epoch 5 | 18.3% | 15.9% | 20.7% |
| Epoch 10 | 7.7% | 7.9% | 7.5% |
| Epoch 15 | 6.6% | 7.0% | 6.2% |
| Epoch 20 | 4.7% | 6.0% | 3.4% |
| Epoch 25 | 9.1% | 11.0% | 7.2% |
| Epoch 30 | 7.3% | 6.9% | 7.8% |
| Epoch 35 | 11.5% | 10.7% | 12.2% |
| Epoch 40 | 7.3% | 3.8% | 10.9% |
| Epoch 45 | 6.3% | 5.9% | 6.7% |
| Epoch 50 | 7.9% | 7.6% | 8.3% |
Epoch 45 is selected as the release checkpoint. Epoch 50 shows WER regression (7.9%), suggesting early overfitting. Training loss decreases monotonically throughout, while WER fluctuates -- a common pattern in TTS where automatic metrics don't perfectly correlate with perceptual quality.
Model Files
| File | Size | Description |
|---|---|---|
| `config.json` | 4 KB | XTTS-v2 configuration |
| `model.pth` | 2.1 GB | Fine-tuned model weights |
| `dvae.pth` | 211 MB | Discrete VAE for mel-spectrogram tokenization |
| `mel_stats.pth` | 1 KB | Mel-spectrogram normalization statistics |
| `vocab.json` | 270 KB | Extended vocabulary with Romanian diacritics + `[ro]` |
| `speakers_xtts.pth` | 8 MB | Speaker embedding defaults |
| `reference_voices/` | — | ~6 s WAV clips for each of the 5 voices |
Limitations
- 5 voices only -- the model may not generalize well to Romanian accents or dialects not represented in the training data.
- WER is Whisper-based -- Whisper large-v3 is used as an automated metric; no human evaluation has been conducted. Whisper itself may have biases on Romanian text.
- Library patches required -- the TTS 0.22.0 library needs patches for PyTorch 2.x compatibility and Romanian tokenizer support. See `setup_runpod.sh` in the Codeberg repo.
- Cedilla normalization is mandatory -- forgetting to normalize input text will silently degrade output quality for any word containing s or t with diacritics.
License
This model is released under the Coqui Public Model License (CPML), inherited from the base XTTS-v2 model.
Attribution
- XTTS-v2 base model: Coqui AI / TTS
- ADR speech data: gigant/romanian_speech_synthesis_0_8_1
- Whisper evaluation: OpenAI Whisper
Citation
```bibtex
@misc{musat2026xttsromanian,
  title={Fine-tuning XTTS-v2 for Romanian: Unicode Normalization and Smart Embedding Initialization},
  author={Musat, Eduard},
  year={2026},
  eprint={TODO},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```
Links
- Training code & scripts -- full training pipeline, evaluation, and setup scripts
- Live demo -- audio samples with WER comparison across all voices and epochs
- Training dataset -- ~62k clips, ~150 hours, 5 Romanian speakers