F5-TTS Georgian

A fine-tuned version of SWivid/F5-TTS (335M params) for Georgian text-to-speech. The model produces high-quality Georgian speech when using training speakers as reference. Generalization to arbitrary voice cloning is a work in progress.

Model Details

Base model SWivid/F5-TTS v1 Base (335M params, DiT + ConvNeXt V2)
Fine-tuning Full fine-tune (continuation of flow-matching pretraining), no LoRA
Training data NMikka/Common-Voice-Geo-Cleaned β€” 20,300 samples, 12 speakers
Training 110,000 updates (~100 epochs), single NVIDIA RTX A6000 (48GB)
Sample rate 24 kHz
Voice cloning Works well with training speakers; generalizing to new voices is WIP
License CC-BY-NC-4.0 (inherited from F5-TTS pretrained weights)

Evaluation β€” FLEURS Georgian Benchmark (979 unseen samples)

Round-trip CER: TTS generates audio β†’ Meta Omnilingual ASR 7B transcribes β†’ compare to original text.

Metric Value
CER mean 0.0509
CER median 0.0309
CER p90 0.1183
CER std 0.0558
WER mean 0.1866
WER median 0.1600

CER distribution:

  • 65.9% of samples < 5% CER
  • 85.9% of samples < 10% CER
  • 96.5% of samples < 20% CER
  • 0 catastrophic failures (> 50% CER)

Evaluated with speaker 3 reference audio (NISQA MOS 4.99).
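
The round-trip metric above can be reproduced with a plain Levenshtein-based CER once the ASR transcriptions are in hand. A minimal sketch (the `summarize` helper and its field names are illustrative, not part of the official evaluation code):

```python
import statistics

def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = edit distance between ref[:i] and hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

def summarize(cers: list[float]) -> dict:
    """Aggregate per-sample CERs into the statistics reported in the table."""
    s = sorted(cers)
    return {
        "mean": statistics.mean(s),
        "median": statistics.median(s),
        "p90": s[int(0.9 * (len(s) - 1))],
        "under_5pct": sum(c < 0.05 for c in s) / len(s),
    }
```

The distribution buckets (< 5%, < 10%, < 20% CER) follow the same pattern as `under_5pct` with different thresholds.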

Usage

Install

pip install f5-tts

Download Model

from huggingface_hub import hf_hub_download

# Download checkpoint and vocab
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")

Inference

The model works best with reference audio from the training dataset. Voice cloning to arbitrary Georgian speakers is a work in progress.

from datasets import load_dataset
from huggingface_hub import hf_hub_download
from f5_tts.api import F5TTS
import soundfile as sf
import numpy as np

# Download model
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")

# Load a reference sample from the training dataset
ds = load_dataset("NMikka/Common-Voice-Geo-Cleaned", split="test")
ref_sample = ds[92]  # Any sample works as a voice reference; this one was used most during testing

# Save reference audio to temp file (F5-TTS expects a file path)
ref_path = "/tmp/ref.wav"
sf.write(ref_path, np.array(ref_sample["audio"]["array"]), ref_sample["audio"]["sampling_rate"])

# Load model
model = F5TTS(
    ckpt_file=ckpt_path,
    vocab_file=vocab_path,
    device="cuda",
    use_ema=False,  # Important: this checkpoint was not trained with EMA
)

# Generate speech using a training speaker as reference
wav, sr, _ = model.infer(
    ref_file=ref_path,
    ref_text=ref_sample["text"],
    # Text: "Georgia is located in the Caucasus region, at the crossroads of Europe and Asia"
    gen_text="ბაαƒ₯αƒαƒ αƒ—αƒ•αƒ”αƒšαƒ მდებარეობბ αƒ™αƒαƒ•αƒ™αƒαƒ‘αƒ˜αƒ˜αƒ‘ αƒ αƒ”αƒ’αƒ˜αƒαƒœαƒ¨αƒ˜, αƒ”αƒ•αƒ αƒαƒžαƒ˜αƒ‘αƒ და αƒαƒ–αƒ˜αƒ˜αƒ‘ გაბაყარზე",
)
sf.write("output.wav", wav, sr)

Generation Parameters

wav, sr, _ = model.infer(
    ref_file="reference.wav",
    ref_text="reference transcript",
    gen_text="text to synthesize",
    nfe_step=32,       # Denoising steps (default 32, higher = better quality, slower)
    cfg_strength=2.0,  # Classifier-free guidance (default 2.0)
    speed=1.0,         # Speech speed multiplier
)

Training Details

Method Full fine-tune (flow-matching loss, continuation of pretraining)
Base checkpoint F5TTS_v1_Base/model_1250000.safetensors
Learning rate 1e-5
Warmup 500 steps
Batch size 9,600 audio frames per GPU
Max sequences/batch 64
Optimizer 8-bit Adam (bitsandbytes)
Epochs 100
Total updates 110,000
Tokenizer Character-level (char, not pinyin)
Vocab 2,579 tokens (2,545 pretrained + 34 Georgian characters)
GPU 1x NVIDIA RTX A6000 (48GB)
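
With a character-level tokenizer, each Georgian codepoint maps to one vocabulary index. A minimal sketch of loading a vocab file (assuming one token per line, with the line number as the embedding index, as in `extended_vocab.txt`) and encoding text; the exact F5-TTS mapping may differ in details such as unknown-character handling:

```python
def load_char_map(path: str) -> dict[str, int]:
    # One token per line; line number = embedding index.
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}

def encode(text: str, char_map: dict[str, int]) -> list[int]:
    # Characters missing from the vocab are simply skipped in this sketch.
    return [char_map[ch] for ch in text if ch in char_map]
```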

Vocab Extension

The pretrained F5-TTS uses a pinyin-based vocabulary (2,545 tokens). For Georgian, we extended the vocabulary by appending 34 Georgian Unicode characters (ა-αƒ° plus β€ž), giving 2,579 tokens. New embeddings were initialized with the mean of the existing pretrained embeddings, and the text embedding table (vocabulary plus one filler token) was resized from 2,546 to 2,580 rows.
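
The mean-initialization step can be sketched in a few lines of numpy (illustrative only; in practice the resize is applied to the model's PyTorch embedding table before fine-tuning):

```python
import numpy as np

def extend_embedding_table(weight: np.ndarray, num_new: int) -> np.ndarray:
    """Append num_new rows, each initialized to the mean of the pretrained rows."""
    mean_row = weight.mean(axis=0, keepdims=True)
    return np.concatenate([weight, np.repeat(mean_row, num_new, axis=0)], axis=0)

# 2,546 pretrained rows -> 2,580 rows after adding 34 Georgian characters
table = extend_embedding_table(np.random.randn(2546, 512).astype(np.float32), 34)
print(table.shape)  # (2580, 512)
```

Mean initialization keeps the new tokens inside the distribution of the pretrained embedding space, which tends to be a safer starting point than random initialization when continuing training from a converged checkpoint.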

Limitations and Future Work

  • License: CC-BY-NC-4.0 β€” non-commercial use only (inherited from F5-TTS weights)
  • Voice cloning to new speakers is limited β€” the model clones training speakers well but does not yet generalize to arbitrary Georgian voices. This is an active area of improvement.
  • Trained on 12 speakers from Common Voice Georgian β€” limited speaker diversity
  • Some complex Georgian text with rare characters may produce higher error rates
  • No emotion or prosody control beyond what the reference audio provides

Part of the Georgian TTS Benchmark

This model was trained as part of the first Georgian TTS benchmark β€” a comparative study of 6 open-source TTS architectures. See the full project: github.com/NMikaa/TTS_pipelines

Citation

@misc{f5tts-georgian-2026,
  title={F5-TTS Georgian: Fine-tuned Flow-Matching TTS for Georgian},
  author={NMikka},
  year={2026},
  url={https://huggingface.co/NMikka/F5-TTS-Georgian}
}