MagPIE TTS: Georgian

A fine-tuned MagPIE TTS model for Georgian (ქართული) text-to-speech synthesis.

This is an open-source TTS model fine-tuned specifically for Georgian, produced as part of the Georgian TTS Benchmark.

Evaluation Results

Evaluated on the full FLEURS Georgian test set (979 samples) using round-trip intelligibility:

Metric  Score
CER     2.16%
WER     7.08%

CER/WER measured via round-trip: TTS generates audio → Meta Omnilingual ASR 7B transcribes it → compare to the original text.
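The two error rates above are standard edit-distance metrics; a minimal self-contained sketch of how they are computed (the ASR transcription step itself is external and not shown here):

```python
def edit_distance(ref, hyp):
    # Classic single-row dynamic-programming Levenshtein distance
    # over any pair of sequences (characters or words).
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: edits over characters / reference length.
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: edits over whitespace-separated tokens.
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

In the benchmark, `reference` is the original FLEURS text and `hypothesis` is the ASR transcript of the synthesized audio.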

Quick Start

Installation

# MagPIE TTS requires NeMo 2.8+ (not yet on PyPI; install from source)
git clone https://github.com/NVIDIA/NeMo.git
cd NeMo && pip install -e ".[tts]"
pip install huggingface_hub

Requires Python 3.10+, PyTorch 2.0+, and CUDA 11.8+.

Inference

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from nemo.collections.tts.models import MagpieTTSModel
from nemo.collections.tts.parts.utils.tts_dataset_utils import chunk_text_for_inference

# Download and load model
nemo_path = hf_hub_download(repo_id="NMikka/Magpie-TTS-Geo-357m", filename="magpie_tts_georgian.nemo")
model = MagpieTTSModel.restore_from(nemo_path, map_location="cpu")
model = model.eval().cuda()

# Synthesize
text = "გამარჯობა, მე მქვია მაგპაი და ქართულად ვლაპარაკობ."

chunked_tokens, chunked_tokens_len, _ = chunk_text_for_inference(
    text=text,
    language="ka",
    tokenizer_name="text_ce_tokenizer",
    text_tokenizer=model.tokenizer,
    eos_token_id=model.eos_id,
)

chunk_state = model.create_chunk_state(batch_size=1)
all_codes = []

for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
    batch = {
        "text": toks.unsqueeze(0).cuda(),
        "text_lens": torch.tensor([toks_len], device="cuda", dtype=torch.long),
        "speaker_indices": 1,  # speaker index (0-4)
    }
    with torch.no_grad():
        output = model.generate_speech(
            batch,
            chunk_state=chunk_state,
            end_of_text=[i == len(chunked_tokens) - 1],
            beginning_of_text=(i == 0),
            use_cfg=True,
            use_local_transformer_for_inference=True,
        )
    if output.predicted_codes_lens[0] > 0:
        all_codes.append(output.predicted_codes[0, :, :output.predicted_codes_lens[0]])

# Decode to waveform
codes = torch.cat(all_codes, dim=1).unsqueeze(0)
codes_lens = torch.tensor([codes.shape[2]], device="cuda", dtype=torch.long)
audio, audio_lens, _ = model.codes_to_audio(codes, codes_lens)
waveform = audio[0, :audio_lens[0]].cpu().float().unsqueeze(0)

torchaudio.save("output.wav", waveform, 22050)

Convenience Wrapper

For easier use, here's a helper function:

def synthesize(model, text, speaker=1, use_cfg=True):
    """Generate Georgian speech from text.

    Args:
        model: Loaded MagpieTTSModel
        text: Georgian text string
        speaker: Baked speaker index (0-4). Speaker 1 recommended.
        use_cfg: Use classifier-free guidance (better quality, 2x slower)

    Returns:
        waveform (torch.Tensor): Audio tensor, shape (1, num_samples), 22050 Hz
    """
    chunked_tokens, chunked_tokens_len, _ = chunk_text_for_inference(
        text=text,
        language="ka",
        tokenizer_name="text_ce_tokenizer",
        text_tokenizer=model.tokenizer,
        eos_token_id=model.eos_id,
    )

    chunk_state = model.create_chunk_state(batch_size=1)
    all_codes = []

    for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
        batch = {
            "text": toks.unsqueeze(0).cuda(),
            "text_lens": torch.tensor([toks_len], device="cuda", dtype=torch.long),
            "speaker_indices": speaker,
        }
        with torch.no_grad():
            output = model.generate_speech(
                batch,
                chunk_state=chunk_state,
                end_of_text=[i == len(chunked_tokens) - 1],
                beginning_of_text=(i == 0),
                use_cfg=use_cfg,
                use_local_transformer_for_inference=True,
            )
        if output.predicted_codes_lens[0] > 0:
            all_codes.append(output.predicted_codes[0, :, :output.predicted_codes_lens[0]])

    if not all_codes:
        return None

    codes = torch.cat(all_codes, dim=1).unsqueeze(0)
    codes_lens = torch.tensor([codes.shape[2]], device="cuda", dtype=torch.long)
    audio, audio_lens, _ = model.codes_to_audio(codes, codes_lens)
    return audio[0, :audio_lens[0]].cpu().float().unsqueeze(0)


# Usage:
waveform = synthesize(model, "გამარჯობა მსოფლიო")
torchaudio.save("hello_world.wav", waveform, 22050)

How It Works

MagPIE TTS is an encoder-decoder transformer (not a diffusion or flow model):

  1. ByT5-small encodes text at the byte level, so no language-specific tokenizer is needed
  2. 6-layer causal encoder processes text embeddings
  3. CTC monotonic alignment maps text to audio frames (prevents hallucinations: no skipped or repeated words)
  4. 12-layer causal decoder autoregressively generates NanoCodec tokens
  5. NanoCodec (22kHz, 8 codebooks) decodes tokens to waveform

Classifier-Free Guidance (CFG) runs two forward passes (with/without text conditioning) and interpolates. Set use_cfg=False for ~2x faster inference with slightly lower quality.
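The CFG interpolation described above reduces to a one-line combination of the two logit sets; a minimal sketch (plain Python over per-token logits, not MagPIE's internal implementation):

```python
def apply_cfg(cond_logits, uncond_logits, cfg_scale):
    """Classifier-free guidance: push logits away from the unconditional
    prediction toward the text-conditioned one. cfg_scale = 1.0 reduces
    to the conditional logits; larger values follow the text more strictly."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

# With cfg_scale=2.5 (the model's default, see Parameters below),
# the conditional signal is amplified relative to the unconditional one:
guided = apply_cfg([2.0, 0.0], [1.0, 0.5], cfg_scale=2.5)
print(guided)  # [3.5, -0.75]
```

Because both the conditional and unconditional logits require a decoder pass, skipping CFG (`use_cfg=False`) halves the compute per generated frame.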

Speakers

The model has 5 baked speaker embeddings from pretraining. Set via speaker_indices in the batch dict.

Index  Quality
1      Best (recommended)
0      Good
2      Acceptable
3      Mediocre
4      Mediocre

Parameters

You can tune inference parameters via model.inference_parameters:

model.inference_parameters.temperature = 0.6    # sampling temperature (lower = more deterministic)
model.inference_parameters.topk = 80            # top-k sampling (lower = more focused)
model.inference_parameters.cfg_scale = 2.5      # CFG strength (higher = follows text more strictly)
model.inference_parameters.max_decoder_steps = 500  # max generation length in frames
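For intuition, temperature and top-k act on the decoder's next-token distribution roughly as follows. This is a generic sampling sketch (not MagPIE's internal sampler), with defaults mirroring the values above:

```python
import math
import random

def sample_top_k(logits, temperature=0.6, topk=80, rng=random):
    # Keep only the topk highest-scoring tokens, rescale by temperature,
    # apply a numerically stable softmax, then draw one token index.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:topk]
    scaled = [logits[i] / temperature for i in order]
    m = max(scaled)  # subtract max before exp for numerical stability
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    return rng.choices(order, weights=[p / total for p in probs], k=1)[0]
```

Lowering `temperature` sharpens the distribution toward the top token; lowering `topk` removes low-probability tokens from consideration entirely.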

Training Details

Base model nvidia/magpie_tts_multilingual_357m
Method Full SFT via NeMo
Training data NMikka/Common-Voice-Geo-Cleaned (~20,300 clips, 24kHz, resampled to 22,050 Hz)
Parameters 357M (all trainable)
Epochs 37
Steps 15,614
Learning rate 2e-5
Precision bf16-mixed
GPU 1x A6000 (48GB)
Best val_loss 9.5569
Sample rate 22,050 Hz
Codec NanoCodec (8 codebooks, 21.5 fps, 1.89 kbps)
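The codec figures in the table are mutually consistent, which makes a quick sanity check possible: 8 codebooks at 21.5 frames/s and 1.89 kbps implies roughly 11 bits (a 2048-entry codebook) per code:

```python
frame_rate = 21.5   # codec frames per second
codebooks = 8       # parallel codebooks per frame
bitrate = 1890      # bits per second (1.89 kbps)

# Bits available per individual code
bits_per_code = bitrate / (frame_rate * codebooks)
print(round(bits_per_code, 2))  # 10.99, i.e. ~11-bit (2048-entry) codebooks

# Total codes the decoder must generate for a 10-second utterance
codes_for_10s = round(frame_rate * 10) * codebooks
print(codes_for_10s)  # 1720
```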

Limitations

  • Single language: Fine-tuned on Georgian only. The base model supports 105 languages but this checkpoint is specialized.
  • No voice cloning: Uses 5 baked speaker embeddings from pretraining. Reference audio cloning was not trained.
  • Autoregressive: Not real-time. RTF ~0.6-0.8 on A6000 with CFG, ~0.4-0.7 without.
  • NeMo dependency: Requires NVIDIA NeMo toolkit. Not a standalone model.
  • NanoCodec dependency: The codec model (nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps) is downloaded automatically on first use.
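RTF (real-time factor) in the list above is synthesis wall-clock time divided by the duration of the audio produced; values below 1.0 mean faster than real time. As a sketch:

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate=22050):
    # RTF = time spent generating / duration of audio produced.
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds

# e.g. generating a 10 s clip (220,500 samples at 22,050 Hz) in 7 s
print(real_time_factor(7.0, 220_500))  # 0.7
```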

Citation

@misc{magpie-tts-georgian-2026,
  title={MagPIE TTS Georgian: Fine-tuned Text-to-Speech for Georgian},
  author={TODO},
  year={2026},
  url={https://huggingface.co/NMikka/Magpie-TTS-Geo-357m}
}
