XTTS-v2 Romanian

Fine-tuned XTTS-v2 for high-quality Romanian text-to-speech with voice cloning. Trained on ~150 hours of Romanian speech, it achieves 6.3% WER (measured with Whisper large-v3) across 5 distinct voices.

Live Demo & Audio Samples | Training Code (Codeberg)

Audio Samples

Costel (male, literary narration) — 1.9% WER

"Profesorul a explicat cu răbdare lecția dificilă de matematică."

"Această carte reprezintă o contribuție importantă la literatura contemporană."

"Bună ziua, mă numesc Alexandru și sunt din București."

Mărioara (female, expressive storytelling) — 6.1% WER

"Țara românească și-a păstrat tradițiile străvechi de-a lungul secolelor."

"Bucură-te de bucuria Bucuroaiei cum s-a bucurat și ea de bucuria lui Bucurel când a venit de la București."

Georgel (male, solemn delivery) — 1.8% WER

"Fișierele și rețelele informatice sunt esențiale în științele moderne."

Lăcrămioara (female, clear broadcast style) — 7.2% WER

"Ce-ntâmplare întâmplăreață s-a-ntâmplat în tâmplărie, un tâmplar din întâmplare s-a lovit cu tâmpla-n cap."

Dorel (male, conversational) — 14.5% WER

"Țara românească și-a păstrat tradițiile străvechi de-a lungul secolelor."

Key Innovation: Unicode Diacritics Normalization

Romanian uses comma-below diacritics (s-comma U+0219, t-comma U+021B), but many text sources contain visually near-identical cedilla variants (s-cedilla U+015F, t-cedilla U+0163) inherited from legacy encodings. Although they render almost the same, these are distinct Unicode codepoints that map to different token embeddings.

This model solves the problem at two levels:

  1. Smart embedding initialization -- new Romanian token embeddings were initialized from their closest existing donors (Turkish cedilla characters) rather than random weights, giving the model a meaningful starting point.
  2. Runtime normalization -- all input text must be normalized to comma-below form before inference (see Quick Start below).
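The mismatch is easy to demonstrate, and the runtime fix is a one-line str.translate with the same table used in Quick Start:

```python
# Cedilla and comma-below forms look alike but are distinct codepoints,
# so unnormalized text hits different (legacy) token embeddings.
legacy = "\u015ftiin\u0163a"   # "stiinta" written with legacy cedilla forms
modern = "\u0219tiin\u021ba"   # the same word with correct comma-below forms
assert legacy != modern        # different strings despite near-identical glyphs

# One-time translation table: cedilla -> comma-below, both cases
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0162": "\u021a",  # Ţ -> Ț
})

print(legacy.translate(CEDILLA_TO_COMMA) == modern)  # True
```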

Quick Start

Installation

```shell
pip install TTS==0.22.0
```

Note: TTS 0.22.0 requires patches for PyTorch 2.x compatibility and Romanian tokenizer support. See setup_runpod.sh for the exact patches needed.

Download Model

```shell
# Clone the model repo
git lfs install
git clone https://huggingface.co/eduardem/xtts-v2-romanian
cd xtts-v2-romanian
```

Or download individual files:

```python
from huggingface_hub import hf_hub_download

for fname in ["config.json", "model.pth", "dvae.pth", "mel_stats.pth", "vocab.json", "speakers_xtts.pth"]:
    hf_hub_download(repo_id="eduardem/xtts-v2-romanian", filename=fname, local_dir="xtts-v2-romanian")
```
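A quick sanity check that all checkpoint files made it down before loading the model (the helper name is mine, not part of any API):

```python
from pathlib import Path

# The six files the model loader expects, per the Model Files section
REQUIRED = ["config.json", "model.pth", "dvae.pth", "mel_stats.pth",
            "vocab.json", "speakers_xtts.pth"]

def missing_files(model_dir: str) -> list[str]:
    """Return any required checkpoint files absent from model_dir."""
    root = Path(model_dir)
    return [f for f in REQUIRED if not (root / f).is_file()]
```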

Basic Inference

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# ---------------------------------------------------------------
# REQUIRED: Normalize cedilla -> comma-below before every inference
# Without this, diacritics will be silently mispronounced or skipped.
# ---------------------------------------------------------------
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș  (lowercase s)
    "\u0163": "\u021b",  # ţ -> ț  (lowercase t)
    "\u015e": "\u0218",  # Ş -> Ș  (uppercase S)
    "\u0162": "\u021a",  # Ţ -> Ț  (uppercase T)
})

# Load model
config = XttsConfig()
config.load_json("xtts-v2-romanian/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts-v2-romanian", use_deepspeed=False)
model.cuda()

# Prepare text -- always normalize!
text = "Bună ziua, mă numesc Alexandru și sunt din București."
text = text.translate(CEDILLA_TO_COMMA)

# Clone a voice from a ~6s reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["xtts-v2-romanian/reference_voices/costel.wav"]
)

# Generate speech
out = model.inference(
    text=text,
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.3,
    top_p=0.7,
    top_k=30,
    length_penalty=0.8,
    repetition_penalty=10.0,
)

torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```

Voice Cloning with Your Own Voice

You can clone any voice from a ~6-second WAV reference clip:

```python
# Use your own reference audio (WAV, ~6 seconds, clear speech)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["my_voice.wav"]
)

out = model.inference(
    text="Aceasta este o propoziție de test în limba română.".translate(CEDILLA_TO_COMMA),
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)
```

Inference Parameters

| Parameter | Default | Description |
|---|---|---|
| temperature | 0.3 | Sampling temperature. Lower = more deterministic. |
| top_p | 0.7 | Nucleus sampling threshold. |
| top_k | 30 | Top-k sampling. |
| length_penalty | 0.8 | Values < 1.0 discourage overly long output. |
| repetition_penalty | 10.0 | Penalizes repeated tokens. |

Voices

Five voices are included with reference audio clips in the reference_voices/ directory.

| Voice | Speaker ID | Description | WER |
|---|---|---|---|
| Costel | speaker_male_literature | Male, literary narration | 1.9% |
| Mărioara | speaker_female_hp | Female, expressive storytelling | 6.1% |
| Lăcrămioara | speaker_female_adr | Female, clear broadcast style | 7.2% |
| Georgel | speaker_male_bible | Male, solemn delivery | 1.8% |
| Dorel | speaker_male_4 | Male, conversational | 14.5% |

WER is measured per-voice on 10 test sentences (5 diacritics-heavy + 5 common phrases) using Whisper large-v3. Lower is better.
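For reference, WER is the word-level edit distance between the Whisper transcript and the ground-truth sentence, divided by the reference word count; a minimal pure-Python sketch (illustrative, not the project's evaluation script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five reference words -> 20% WER
print(wer("bună ziua mă numesc alexandru",
          "bună ziua ma numesc alexandru"))  # 0.2
```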

Training

Dataset

eduardem/romanian-speech-v1 -- approximately 62,000 clips totaling ~150 hours of Romanian speech from 5 speakers, sampled at 22,050 Hz.

Infrastructure

  • GPU: NVIDIA RTX 5000 Ada (32 GB) on RunPod
  • Training time: ~42 hours (50 epochs, ~51 minutes per epoch)
  • Global steps: 678,350
  • Final loss (epoch 45): mel=2.72, text=0.025

Two-Phase Training Strategy

Phase 1 -- Embedding Warmup (1 epoch)

  • Freeze all GPT layers; train only text_embedding and text_pos_embedding (1.4% of parameters)
  • Learning rate: 1e-4
  • Purpose: bring the 6 new Romanian token embeddings (ă, â, î, ș, ț, [ro]) from random initialization to a meaningful representation before unfreezing the full model

Phase 2 -- Full GPT Fine-Tune (50 epochs)

  • Unfreeze all GPT layers
  • Learning rate: 5e-6
  • Batch size: 4, gradient accumulation: 63 (effective batch size = 252)
  • Optimizer: AdamW (betas=0.9, 0.96)
  • text_ce_weight=0.01 (auxiliary regularizer, not primary objective)
  • gpt_use_masking_gt_prompt_approach=True (prevents reference audio parroting)

Critical Training Parameters

These values are non-negotiable for XTTS-v2 fine-tuning:

  • Effective batch size >= 252 -- the official Coqui recipe minimum. Smaller batches cause mode collapse.
  • text_ce_weight = 0.01 -- increasing this breaks the mel/text loss balance and causes text-prediction shortcuts.
  • gpt_use_masking_gt_prompt_approach = True -- without this, the model learns to copy the reference audio instead of conditioning on input text.
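The effective batch of 252 comes from accumulating gradients over 63 micro-batches of 4. A framework-agnostic Python sketch of that update pattern (the real loop uses PyTorch; the names here are illustrative, and gradients are plain floats for brevity):

```python
BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 63
EFFECTIVE_BATCH = BATCH_SIZE * GRAD_ACCUM_STEPS  # 252, the recipe minimum

def training_step(micro_batch_grads, apply_update):
    """Accumulate per-micro-batch gradients, then apply one optimizer update.

    A real loop would call loss.backward() per micro-batch and
    optimizer.step() once per accumulation window.
    """
    accumulated = 0.0
    for step, grad in enumerate(micro_batch_grads, start=1):
        accumulated += grad / GRAD_ACCUM_STEPS  # average, not sum
        if step % GRAD_ACCUM_STEPS == 0:
            apply_update(accumulated)
            accumulated = 0.0

updates = []
training_step([1.0] * (GRAD_ACCUM_STEPS * 2), updates.append)
print(len(updates), EFFECTIVE_BATCH)  # 2 252
```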

Evaluation

WER progression during training, measured on held-out test sentences with Whisper large-v3:

| Checkpoint | Overall WER | Diacritics WER | Common WER |
|---|---|---|---|
| Phase 1 (warmup) | 144.9% | 146.3% | 143.6% |
| Epoch 5 | 18.3% | 15.9% | 20.7% |
| Epoch 10 | 7.7% | 7.9% | 7.5% |
| Epoch 15 | 6.6% | 7.0% | 6.2% |
| Epoch 20 | 4.7% | 6.0% | 3.4% |
| Epoch 25 | 9.1% | 11.0% | 7.2% |
| Epoch 30 | 7.3% | 6.9% | 7.8% |
| Epoch 35 | 11.5% | 10.7% | 12.2% |
| Epoch 40 | 7.3% | 3.8% | 10.9% |
| Epoch 45 | 6.3% | 5.9% | 6.7% |
| Epoch 50 | 7.9% | 7.6% | 8.3% |

Epoch 45 is selected as the release checkpoint. Epoch 50 shows WER regression (7.9%), suggesting early overfitting. Training loss decreases monotonically throughout, while WER fluctuates -- a common pattern in TTS where automatic metrics don't perfectly correlate with perceptual quality.

Model Files

| File | Size | Description |
|---|---|---|
| config.json | 4 KB | XTTS-v2 configuration |
| model.pth | 2.1 GB | Fine-tuned model weights |
| dvae.pth | 211 MB | Discrete VAE for mel-spectrogram tokenization |
| mel_stats.pth | 1 KB | Mel-spectrogram normalization statistics |
| vocab.json | 270 KB | Extended vocabulary with Romanian diacritics + [ro] |
| speakers_xtts.pth | 8 MB | Speaker embedding defaults |
| reference_voices/ | | ~6 s WAV clips for each of the 5 voices |

Limitations

  • 5 voices only -- the model may not generalize well to Romanian accents or dialects not represented in the training data.
  • WER is Whisper-based -- Whisper large-v3 is used as an automated metric; no human evaluation has been conducted. Whisper itself may have biases on Romanian text.
  • Library patches required -- the TTS 0.22.0 library needs patches for PyTorch 2.x compatibility and Romanian tokenizer support. See setup_runpod.sh in the Codeberg repo.
  • Cedilla normalization is mandatory -- forgetting to normalize input text will silently degrade output quality for any word containing s or t with diacritics.
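Because the failure mode is silent, a cheap guard before inference can catch unnormalized input early (the helper is illustrative, not part of the model API):

```python
LEGACY_CEDILLAS = set("\u015f\u0163\u015e\u0162")  # ş ţ Ş Ţ

def assert_normalized(text: str) -> str:
    """Raise if text still contains legacy cedilla codepoints."""
    found = sorted(LEGACY_CEDILLAS & set(text))
    if found:
        raise ValueError(
            f"Legacy cedilla characters {found} found; "
            "apply the cedilla -> comma-below translation first."
        )
    return text

assert_normalized("știința")      # passes: comma-below forms
# assert_normalized("ştiinţa")    # would raise ValueError: cedilla forms
```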

License

This model is released under the Coqui Public Model License (CPML), inherited from the base XTTS-v2 model.

Citation

```bibtex
@misc{musat2026xttsromanian,
  title={Fine-tuning XTTS-v2 for Romanian: Unicode Normalization and Smart Embedding Initialization},
  author={Musat, Eduard},
  year={2026},
  eprint={TODO},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```
