# F5-TTS Georgian
A fine-tuned version of SWivid/F5-TTS (335M params) for Georgian text-to-speech. The model produces high-quality Georgian speech when using training speakers as reference. Generalization to arbitrary voice cloning is a work in progress.
## Model Details

| | |
|---|---|
| Base model | SWivid/F5-TTS v1 Base (335M params, DiT + ConvNeXt V2) |
| Fine-tuning | Full fine-tune (continuation of flow-matching pretraining), no LoRA |
| Training data | NMikka/Common-Voice-Geo-Cleaned – 20,300 samples, 12 speakers |
| Training | 110,000 updates (~100 epochs), single NVIDIA RTX A6000 (48GB) |
| Sample rate | 24 kHz |
| Voice cloning | Works well with training speakers; generalizing to new voices is WIP |
| License | CC-BY-NC-4.0 (inherited from F5-TTS pretrained weights) |
## Evaluation – FLEURS Georgian Benchmark (979 unseen samples)

Round-trip CER: the TTS model generates audio → Meta Omnilingual ASR 7B transcribes it → the transcript is compared against the original text.
| Metric | Value |
|---|---|
| CER mean | 0.0509 |
| CER median | 0.0309 |
| CER p90 | 0.1183 |
| CER std | 0.0558 |
| WER mean | 0.1866 |
| WER median | 0.1600 |
CER distribution:
- 65.9% of samples < 5% CER
- 85.9% of samples < 10% CER
- 96.5% of samples < 20% CER
- 0 catastrophic failures (> 50% CER)
Evaluated with speaker 3 reference audio (NISQA MOS 4.99).
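The round-trip metric above reduces to a character-level Levenshtein distance normalized by reference length. A minimal, self-contained sketch of that scoring step (the TTS and ASR stages themselves are omitted; this is an illustration, not the exact evaluation script):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance between the ASR transcript
    and the original text, normalized by the reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Levenshtein distance via a single-row dynamic program
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev + (r != h))   # substitution (free if chars match)
            prev = cur
    return dp[-1] / max(len(ref), 1)
```

A WER variant is identical except the strings are split into words before scoring.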
## Usage

### Install

```shell
pip install f5-tts
```
### Download Model

```python
from huggingface_hub import hf_hub_download

# Download checkpoint and vocab
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")
```
### Inference
The model works best with reference audio from the training dataset. Voice cloning to arbitrary Georgian speakers is a work in progress.
```python
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from f5_tts.api import F5TTS
import soundfile as sf
import numpy as np

# Download model
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")

# Load a reference sample from the training dataset
ds = load_dataset("NMikka/Common-Voice-Geo-Cleaned", split="test")
ref_sample = ds[92]  # Any sample works as the voice reference; this one was used heavily during testing

# Save reference audio to a temp file (F5-TTS expects a file path)
ref_path = "/tmp/ref.wav"
sf.write(ref_path, np.array(ref_sample["audio"]["array"]), ref_sample["audio"]["sampling_rate"])

# Load model
model = F5TTS(
    ckpt_file=ckpt_path,
    vocab_file=vocab_path,
    device="cuda",
    use_ema=False,  # Important: this checkpoint was not trained with EMA
)

# Generate speech using a training speaker as reference
wav, sr, _ = model.infer(
    ref_file=ref_path,
    ref_text=ref_sample["text"],
    gen_text="გამარჯობა, ეს ქართული მეტყველების სინთეზის მაგალითია.",  # Georgian text to synthesize
)
sf.write("output.wav", wav, sr)
```
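F5-TTS generally performs best with short reference clips (guidance around the upstream project suggests keeping references to roughly 15 seconds or less). If your chosen dataset sample is long, a small helper can trim it before writing the temp file. A sketch, assuming a mono NumPy waveform and its sampling rate; the 15-second limit is an assumption, not a hard requirement of this checkpoint:

```python
import numpy as np

MAX_REF_SECONDS = 15.0  # assumption: common guidance for F5-TTS reference clips

def trim_reference(audio: np.ndarray, sr: int,
                   max_seconds: float = MAX_REF_SECONDS) -> np.ndarray:
    """Trim a mono waveform to at most max_seconds."""
    max_len = int(sr * max_seconds)
    return audio[:max_len]
```

For example, `trim_reference(np.array(ref_sample["audio"]["array"]), ref_sample["audio"]["sampling_rate"])` can be passed to `sf.write` in place of the raw array.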
### Generation Parameters

```python
wav, sr, _ = model.infer(
    ref_file="reference.wav",
    ref_text="reference transcript",
    gen_text="text to synthesize",
    nfe_step=32,       # Denoising steps (default 32; higher = better quality, slower)
    cfg_strength=2.0,  # Classifier-free guidance (default 2.0)
    speed=1.0,         # Speech speed multiplier
)
```
## Training Details

| | |
|---|---|
| Method | Full fine-tune (flow-matching loss, continuation of pretraining) |
| Base checkpoint | F5TTS_v1_Base/model_1250000.safetensors |
| Learning rate | 1e-5 |
| Warmup | 500 steps |
| Batch size | 9,600 audio frames per GPU |
| Max sequences/batch | 64 |
| Optimizer | 8-bit Adam (bitsandbytes) |
| Epochs | 100 |
| Total updates | 110,000 |
| Tokenizer | Character-level (char, not pinyin) |
| Vocab | 2,579 tokens (2,545 pretrained + 34 Georgian characters) |
| GPU | 1x NVIDIA RTX A6000 (48GB) |
## Vocab Extension

The pretrained F5-TTS uses a pinyin-based vocabulary (2,545 tokens). For Georgian, the vocabulary was extended by appending 34 Georgian Unicode characters (the Mkhedruli letters ა–ჰ plus one additional character). New embeddings were initialized with the mean of the existing pretrained embeddings, and the text embedding layer was resized from 2,546 to 2,580 entries.
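The mean-initialization step described above can be sketched as follows. This uses NumPy as a stand-in for the actual PyTorch embedding matrix, with the row counts from this model card and an arbitrary example embedding width; it is an illustration of the technique, not the training code:

```python
import numpy as np

# Stand-in sizes: 2,546 pretrained rows extended to 2,580 rows.
old_vocab, new_tokens, dim = 2546, 34, 512  # dim=512 is an arbitrary example width

pretrained = np.random.randn(old_vocab, dim).astype(np.float32)

# Each new row starts at the mean of all pretrained embeddings, so the
# appended Georgian tokens begin in a "typical" region of embedding space
# rather than at random, which tends to stabilize early fine-tuning.
mean_vec = pretrained.mean(axis=0, keepdims=True)
extended = np.concatenate(
    [pretrained, np.repeat(mean_vec, new_tokens, axis=0)], axis=0
)
```

In PyTorch the same effect is achieved by creating a larger `nn.Embedding`, copying the pretrained weights into the first rows, and filling the remaining rows with the column-wise mean.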
## Limitations and Future Work

- License: CC-BY-NC-4.0 – non-commercial use only (inherited from the F5-TTS weights)
- Voice cloning to new speakers is limited – the model clones training speakers well but does not yet generalize to arbitrary Georgian voices. This is an active area of improvement.
- Trained on 12 speakers from Common Voice Georgian – limited speaker diversity
- Some complex Georgian text with rare characters may produce higher error rates
- No emotion or prosody control beyond what the reference audio provides
## Part of the Georgian TTS Benchmark

This model was trained as part of the first Georgian TTS benchmark – a comparative study of six open-source TTS architectures. See the full project: github.com/NMikaa/TTS_pipelines
## Citation

```bibtex
@misc{f5tts-georgian-2026,
  title={F5-TTS Georgian: Fine-tuned Flow-Matching TTS for Georgian},
  author={NMikka},
  year={2026},
  url={https://huggingface.co/NMikka/F5-TTS-Georgian}
}
```