# F5-TTS Georgian
A fine-tuned version of SWivid/F5-TTS (335M params) for Georgian text-to-speech. The model produces high-quality Georgian speech when using training speakers as reference. Generalization to arbitrary voice cloning is a work in progress.
## Model Details

| | |
|---|---|
| Base model | SWivid/F5-TTS v1 Base (335M params, DiT + ConvNeXt V2) |
| Fine-tuning | Full fine-tune (continuation of flow-matching pretraining), no LoRA |
| Training data | NMikka/Common-Voice-Geo-Cleaned – 20,300 samples, 12 speakers |
| Training | 110,000 updates (~100 epochs), single NVIDIA RTX A6000 (48GB) |
| Sample rate | 24 kHz |
| Voice cloning | Works well with training speakers; generalizing to new voices is WIP |
| License | CC-BY-NC-4.0 (inherited from F5-TTS pretrained weights) |
## Evaluation – FLEURS Georgian Benchmark (979 unseen samples)

Round-trip CER: the TTS model generates audio → Meta Omnilingual ASR 7B transcribes it → the transcript is compared against the original text.
| Metric | Value |
|---|---|
| CER mean | 0.0509 |
| CER median | 0.0309 |
| CER p90 | 0.1183 |
| CER std | 0.0558 |
| WER mean | 0.1866 |
| WER median | 0.1600 |
CER distribution:
- 65.9% of samples < 5% CER
- 85.9% of samples < 10% CER
- 96.5% of samples < 20% CER
- 0 catastrophic failures (> 50% CER)
Evaluated with speaker 3 reference audio (NISQA MOS 4.99).
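The round-trip metric above reduces to a character-level Levenshtein distance normalized by reference length. A minimal, self-contained sketch of that scoring step (the TTS and ASR stages themselves are omitted; this is an illustration, not the exact evaluation script):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance between the ASR transcript
    and the original text, normalized by the reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Levenshtein distance via a single-row dynamic program
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev + (r != h))   # substitution (free if chars match)
            prev = cur
    return dp[-1] / max(len(ref), 1)
```

A WER variant is identical except the strings are split into words before scoring.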
## Usage

### Install

```shell
pip install f5-tts
```
### Download Model

```python
from huggingface_hub import hf_hub_download

# Download checkpoint and vocab
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")
```
### Inference
The model works best with reference audio from the training dataset. Voice cloning to arbitrary Georgian speakers is a work in progress.
```python
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from f5_tts.api import F5TTS
import soundfile as sf
import numpy as np

# Download model
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")

# Load a reference sample from the training dataset
ds = load_dataset("NMikka/Common-Voice-Geo-Cleaned", split="test")
ref_sample = ds[92]  # Any sample works as the voice reference; this one was used heavily during testing

# Save reference audio to a temp file (F5-TTS expects a file path)
ref_path = "/tmp/ref.wav"
sf.write(ref_path, np.array(ref_sample["audio"]["array"]), ref_sample["audio"]["sampling_rate"])

# Load model
model = F5TTS(
    ckpt_file=ckpt_path,
    vocab_file=vocab_path,
    device="cuda",
    use_ema=False,  # Important: this checkpoint was not trained with EMA
)

# Generate speech using a training speaker as reference
wav, sr, _ = model.infer(
    ref_file=ref_path,
    ref_text=ref_sample["text"],
    gen_text="გამარჯობა, ეს ქართული მეტყველების სინთეზის მაგალითია.",  # Georgian text to synthesize
)
sf.write("output.wav", wav, sr)
```
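F5-TTS generally performs best with short reference clips (guidance around the upstream project suggests keeping references to roughly 15 seconds or less). If your chosen dataset sample is long, a small helper can trim it before writing the temp file. A sketch, assuming a mono NumPy waveform and its sampling rate; the 15-second limit is an assumption, not a hard requirement of this checkpoint:

```python
import numpy as np

MAX_REF_SECONDS = 15.0  # assumption: common guidance for F5-TTS reference clips

def trim_reference(audio: np.ndarray, sr: int,
                   max_seconds: float = MAX_REF_SECONDS) -> np.ndarray:
    """Trim a mono waveform to at most max_seconds."""
    max_len = int(sr * max_seconds)
    return audio[:max_len]
```

For example, `trim_reference(np.array(ref_sample["audio"]["array"]), ref_sample["audio"]["sampling_rate"])` can be passed to `sf.write` in place of the raw array.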
### Generation Parameters

```python
wav, sr, _ = model.infer(
    ref_file="reference.wav",
    ref_text="reference transcript",
    gen_text="text to synthesize",
    nfe_step=32,       # Denoising steps (default 32; higher = better quality, slower)
    cfg_strength=2.0,  # Classifier-free guidance (default 2.0)
    speed=1.0,         # Speech speed multiplier
)
```
## Training Details

| | |
|---|---|
| Method | Full fine-tune (flow-matching loss, continuation of pretraining) |
| Base checkpoint | F5TTS_v1_Base/model_1250000.safetensors |
| Learning rate | 1e-5 |
| Warmup | 500 steps |
| Batch size | 9,600 audio frames per GPU |
| Max sequences/batch | 64 |
| Optimizer | 8-bit Adam (bitsandbytes) |
| Epochs | 100 |
| Total updates | 110,000 |
| Tokenizer | Character-level (char, not pinyin) |
| Vocab | 2,579 tokens (2,545 pretrained + 34 Georgian characters) |
| GPU | 1x NVIDIA RTX A6000 (48GB) |
## Vocab Extension

The pretrained F5-TTS uses a pinyin-based vocabulary (2,545 tokens). For Georgian, the vocabulary was extended by appending 34 Georgian Unicode characters (the Mkhedruli letters ა–ჰ plus one additional character). New embeddings were initialized with the mean of the existing pretrained embeddings, and the text embedding layer was resized from 2,546 to 2,580 entries.
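The mean-initialization step described above can be sketched as follows. This uses NumPy as a stand-in for the actual PyTorch embedding matrix, with the row counts from this model card and an arbitrary example embedding width; it is an illustration of the technique, not the training code:

```python
import numpy as np

# Stand-in sizes: 2,546 pretrained rows extended to 2,580 rows.
old_vocab, new_tokens, dim = 2546, 34, 512  # dim=512 is an arbitrary example width

pretrained = np.random.randn(old_vocab, dim).astype(np.float32)

# Each new row starts at the mean of all pretrained embeddings, so the
# appended Georgian tokens begin in a "typical" region of embedding space
# rather than at random, which tends to stabilize early fine-tuning.
mean_vec = pretrained.mean(axis=0, keepdims=True)
extended = np.concatenate(
    [pretrained, np.repeat(mean_vec, new_tokens, axis=0)], axis=0
)
```

In PyTorch the same effect is achieved by creating a larger `nn.Embedding`, copying the pretrained weights into the first rows, and filling the remaining rows with the column-wise mean.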
## Limitations and Future Work

- License: CC-BY-NC-4.0 – non-commercial use only (inherited from the F5-TTS weights)
- Voice cloning to new speakers is limited – the model clones training speakers well but does not yet generalize to arbitrary Georgian voices. This is an active area of improvement.
- Trained on 12 speakers from Common Voice Georgian – limited speaker diversity
- Some complex Georgian text with rare characters may produce higher error rates
- No emotion or prosody control beyond what the reference audio provides
## Part of the Georgian TTS Benchmark

This model was trained as part of the first Georgian TTS benchmark – a comparative study of six open-source TTS architectures. See the full project: github.com/NMikaa/TTS_pipelines
## Citation

```bibtex
@misc{f5tts-georgian-2026,
  title={F5-TTS Georgian: Fine-tuned Flow-Matching TTS for Georgian},
  author={NMikka},
  year={2026},
  url={https://huggingface.co/NMikka/F5-TTS-Georgian}
}
```