XTTS-v2 Romanian
Fine-tuned XTTS-v2 for high-quality Romanian text-to-speech with voice cloning. Achieves 6.3% average WER (measured with Whisper large-v3) across 5 distinct voices, trained on ~150 hours of Romanian speech.
Live Demo & Audio Samples | Training Code (Codeberg)
Audio Samples
Costel (male, literary narration) — 1.9% WER
"Profesorul a explicat cu răbdare lecția dificilă de matematică."
"Această carte reprezintă o contribuție importantă la literatura contemporană."
"Bună ziua, mă numesc Alexandru și sunt din București."
Mărioara (female, expressive storytelling) — 6.1% WER
"Țara românească și-a păstrat tradițiile străvechi de-a lungul secolelor."
"Bucură-te de bucuria Bucuroaiei cum s-a bucurat și ea de bucuria lui Bucurel când a venit de la București."
Georgel (male, solemn delivery) — 1.8% WER
"Fișierele și rețelele informatice sunt esențiale în științele moderne."
Lăcrămioara (female, clear broadcast style) — 7.2% WER
"Ce-ntâmplare întâmplăreață s-a-ntâmplat în tâmplărie, un tâmplar din întâmplare s-a lovit cu tâmpla-n cap."
Dorel (male, conversational) — 14.5% WER
"Țara românească și-a păstrat tradițiile străvechi de-a lungul secolelor."
Key Innovation: Unicode Diacritics Normalization
Romanian uses comma-below diacritics (ș U+0219, ț U+021B), but many text sources contain visually near-identical cedilla variants (ş U+015F, ţ U+0163) inherited from legacy encodings. These are distinct Unicode codepoints that map to different token embeddings.
This model solves the problem at two levels:
- Smart embedding initialization -- new Romanian token embeddings were initialized from their closest existing donors (Turkish cedilla characters) rather than random weights, giving the model a meaningful starting point.
- Runtime normalization -- all input text must be normalized to comma-below form before inference (see Quick Start below).
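The difference is invisible on screen but real at the codepoint level. A minimal sketch of the runtime normalization (the same translation table appears in Quick Start below):

```python
# Map legacy cedilla codepoints to the correct comma-below codepoints
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0162": "\u021a",  # Ţ -> Ț
})

# Legacy text written with cedilla codepoints (visually near-identical)
legacy = "Bucure\u015fti \u015fi \u0163ara"
fixed = legacy.translate(CEDILLA_TO_COMMA)

print(fixed)            # București și țara (comma-below form)
print(legacy == fixed)  # False -- the strings differ at the codepoint level
```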
Quick Start
Installation
```bash
pip install TTS==0.22.0
```
Note: TTS 0.22.0 requires patches for PyTorch 2.x compatibility and Romanian tokenizer support. See setup_runpod.sh for the exact patches needed.
Download Model
```bash
# Clone the model repo
git lfs install
git clone https://huggingface.co/eduardem/xtts-v2-romanian
cd xtts-v2-romanian
```
Or download individual files:
```python
from huggingface_hub import hf_hub_download

for fname in ["config.json", "model.pth", "dvae.pth", "mel_stats.pth", "vocab.json", "speakers_xtts.pth"]:
    hf_hub_download(repo_id="eduardem/xtts-v2-romanian", filename=fname, local_dir="xtts-v2-romanian")
```
Basic Inference
```python
import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# ---------------------------------------------------------------
# REQUIRED: Normalize cedilla -> comma-below before every inference.
# Without this, diacritics will be silently mispronounced or skipped.
# ---------------------------------------------------------------
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș (lowercase s)
    "\u0163": "\u021b",  # ţ -> ț (lowercase t)
    "\u015e": "\u0218",  # Ş -> Ș (uppercase S)
    "\u0162": "\u021a",  # Ţ -> Ț (uppercase T)
})

# Load model
config = XttsConfig()
config.load_json("xtts-v2-romanian/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts-v2-romanian", use_deepspeed=False)
model.cuda()

# Prepare text -- always normalize!
text = "Bună ziua, mă numesc Alexandru și sunt din București."
text = text.translate(CEDILLA_TO_COMMA)

# Clone a voice from a ~6 s reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["xtts-v2-romanian/reference_voices/costel.wav"]
)

# Generate speech
out = model.inference(
    text=text,
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.3,
    top_p=0.7,
    top_k=30,
    length_penalty=0.8,
    repetition_penalty=10.0,
)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
Voice Cloning with Your Own Voice
You can clone any voice from a ~6-second WAV reference clip:
```python
# Use your own reference audio (WAV, ~6 seconds, clear speech)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["my_voice.wav"]
)

out = model.inference(
    text="Aceasta este o propoziție de test în limba română.".translate(CEDILLA_TO_COMMA),
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)
```
Inference Parameters
| Parameter | Default | Description |
|---|---|---|
| `temperature` | 0.3 | Sampling temperature. Lower = more deterministic. |
| `top_p` | 0.7 | Nucleus sampling threshold. |
| `top_k` | 30 | Top-k sampling. |
| `length_penalty` | 0.8 | Values < 1.0 discourage overly long output. |
| `repetition_penalty` | 10.0 | Penalizes repeated tokens. |
Voices
Five voices are included with reference audio clips in the reference_voices/ directory.
| Voice | Speaker ID | Description | WER |
|---|---|---|---|
| Costel | `speaker_male_literature` | Male, literary narration | 1.9% |
| Mărioara | `speaker_female_hp` | Female, expressive storytelling | 6.1% |
| Lăcrămioara | `speaker_female_adr` | Female, clear broadcast style | 7.2% |
| Georgel | `speaker_male_bible` | Male, solemn delivery | 1.8% |
| Dorel | `speaker_male_4` | Male, conversational | 14.5% |
WER is measured per-voice on 10 test sentences (5 diacritics-heavy + 5 common phrases) using Whisper large-v3. Lower is better.
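The metric is the standard word error rate: word-level edit distance between the Whisper transcript and the reference text, divided by the reference length. A self-contained sketch (the Whisper transcription step is elided; `wer` here is a plain reimplementation, not the project's evaluation script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic dynamic-programming edit distance, one row at a time
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)

ref = "țara românească și-a păstrat tradițiile străvechi"
hyp = "țara romaneasca și-a păstrat tradițiile străvechi"  # 1 substitution
print(f"{wer(ref, hyp):.1%}")  # 16.7% -> 1 error over 6 reference words
```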
Training
Dataset
eduardem/romanian-speech-v1 -- approximately 62,000 clips totaling ~150 hours of Romanian speech from 5 speakers, sampled at 22,050 Hz.
Infrastructure
- GPU: NVIDIA RTX 5000 Ada (32 GB) on RunPod
- Training time: ~42 hours (50 epochs, ~51 minutes per epoch)
- Global steps: 678,350
- Final loss (epoch 45): mel=2.72, text=0.025
Two-Phase Training Strategy
Phase 1 -- Embedding Warmup (1 epoch)
- Freeze all GPT layers; train only `text_embedding` and `text_pos_embedding` (1.4% of parameters)
- Learning rate: 1e-4
- Purpose: bring the 6 new Romanian token embeddings (ă, â, î, ș, ț, `[ro]`) from random initialization to a meaningful representation before unfreezing the full model
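A toy PyTorch sketch of the warmup setup (sizes and token ids are made up; in the real run the donor rows are the Turkish cedilla characters and the trained modules are XTTS's `text_embedding` / `text_pos_embedding`):

```python
import torch
import torch.nn as nn

# Toy stand-in for the XTTS GPT text embedding table
vocab_size, dim = 6700, 256
text_embedding = nn.Embedding(vocab_size, dim)

# Smart init: copy donor rows (e.g. Turkish ş/ţ) into the new Romanian
# token rows instead of leaving them randomly initialized.
donor_to_new = {101: 6694, 102: 6695}   # hypothetical ids: ş->ș, ţ->ț
with torch.no_grad():
    for donor, new in donor_to_new.items():
        text_embedding.weight[new] = text_embedding.weight[donor]

# Phase 1: freeze everything except the embedding tables
model = nn.ModuleDict({"text_embedding": text_embedding})  # plus the frozen GPT in reality
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("text_embedding")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```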
Phase 2 -- Full GPT Fine-Tune (50 epochs)
- Unfreeze all GPT layers
- Learning rate: 5e-6
- Batch size: 4, gradient accumulation: 63 (effective batch size = 252)
- Optimizer: AdamW (betas=0.9, 0.96)
- `text_ce_weight=0.01` (auxiliary regularizer, not primary objective)
- `gpt_use_masking_gt_prompt_approach=True` (prevents reference audio parroting)
Critical Training Parameters
These values are non-negotiable for XTTS-v2 fine-tuning:
- Effective batch size >= 252 -- the official Coqui recipe minimum. Smaller batches cause mode collapse.
- text_ce_weight = 0.01 -- increasing this breaks the mel/text loss balance and causes text-prediction shortcuts.
- gpt_use_masking_gt_prompt_approach = True -- without this, the model learns to copy the reference audio instead of conditioning on input text.
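The effective batch size is reached through gradient accumulation rather than one large batch; a minimal sketch of the pattern with a toy model (the real recipe applies this inside the Coqui trainer):

```python
import torch
import torch.nn as nn

batch_size, grad_accum = 4, 63
assert batch_size * grad_accum == 252      # effective batch size

model = nn.Linear(8, 1)                    # toy stand-in for the GPT
opt = torch.optim.AdamW(model.parameters(), lr=5e-6, betas=(0.9, 0.96))

opt.zero_grad()
for step in range(grad_accum):
    x = torch.randn(batch_size, 8)
    loss = model(x).pow(2).mean() / grad_accum  # scale so gradients average
    loss.backward()                             # gradients accumulate in .grad
opt.step()                                      # one update per 252 samples
opt.zero_grad()
```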
Evaluation
WER progression during training, measured on held-out test sentences with Whisper large-v3:
| Checkpoint | Overall WER | Diacritics WER | Common WER |
|---|---|---|---|
| Phase 1 (warmup) | 144.9% | 146.3% | 143.6% |
| Epoch 5 | 18.3% | 15.9% | 20.7% |
| Epoch 10 | 7.7% | 7.9% | 7.5% |
| Epoch 15 | 6.6% | 7.0% | 6.2% |
| Epoch 20 | 4.7% | 6.0% | 3.4% |
| Epoch 25 | 9.1% | 11.0% | 7.2% |
| Epoch 30 | 7.3% | 6.9% | 7.8% |
| Epoch 35 | 11.5% | 10.7% | 12.2% |
| Epoch 40 | 7.3% | 3.8% | 10.9% |
| Epoch 45 | 6.3% | 5.9% | 6.7% |
| Epoch 50 | 7.9% | 7.6% | 8.3% |
Epoch 45 is selected as the release checkpoint. Epoch 50 shows WER regression (7.9%), suggesting early overfitting. Training loss decreases monotonically throughout, while WER fluctuates -- a common pattern in TTS where automatic metrics don't perfectly correlate with perceptual quality.
Model Files
| File | Size | Description |
|---|---|---|
| `config.json` | 4 KB | XTTS-v2 configuration |
| `model.pth` | 2.1 GB | Fine-tuned model weights |
| `dvae.pth` | 211 MB | Discrete VAE for mel-spectrogram tokenization |
| `mel_stats.pth` | 1 KB | Mel-spectrogram normalization statistics |
| `vocab.json` | 270 KB | Extended vocabulary with Romanian diacritics + `[ro]` |
| `speakers_xtts.pth` | 8 MB | Speaker embedding defaults |
| `reference_voices/` | — | ~6 s WAV clips for each of the 5 voices |
Limitations
- 5 voices only -- the model may not generalize well to Romanian accents or dialects not represented in the training data.
- WER is Whisper-based -- Whisper large-v3 is used as an automated metric; no human evaluation has been conducted. Whisper itself may have biases on Romanian text.
- Library patches required -- the TTS 0.22.0 library needs patches for PyTorch 2.x compatibility and Romanian tokenizer support. See `setup_runpod.sh` in the Codeberg repo.
- Cedilla normalization is mandatory -- forgetting to normalize input text will silently degrade output quality for any word containing s or t with diacritics.
License
This model is released under the Coqui Public Model License (CPML), inherited from the base XTTS-v2 model.
Attribution
- XTTS-v2 base model: Coqui AI / TTS
- ADR speech data: gigant/romanian_speech_synthesis_0_8_1
- Whisper evaluation: OpenAI Whisper
Citation
```bibtex
@misc{musat2026xttsromanian,
  title={Fine-tuning XTTS-v2 for Romanian: Unicode Normalization and Smart Embedding Initialization},
  author={Musat, Eduard},
  year={2026},
  eprint={TODO},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```
Links
- Training code & scripts -- full training pipeline, evaluation, and setup scripts
- Live demo -- audio samples with WER comparison across all voices and epochs
- Training dataset -- ~62k clips, ~150 hours, 5 Romanian speakers