MagPIE TTS Georgian
An open-source MagPIE TTS model fine-tuned specifically for Georgian (ქართული) text-to-speech synthesis, produced as part of the Georgian TTS Benchmark.
Evaluation Results
Evaluated on the full FLEURS Georgian test set (979 samples) using round-trip intelligibility:
| Metric | Score |
|---|---|
| CER | 2.16% |
| WER | 7.08% |
CER/WER measured via round-trip: TTS generates audio → Meta Omnilingual ASR 7B transcribes it → the transcript is compared to the original text.
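For reference, CER and WER are both edit-distance metrics: Levenshtein distance between reference and hypothesis, normalized by reference length, computed over characters (CER) or words (WER). In practice a library such as jiwer is typically used; the pure-Python sketch below is only for illustration of the computation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```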
Quick Start
Installation
```bash
# MagPIE TTS requires NeMo 2.8+ (not yet on PyPI; install from source)
git clone https://github.com/NVIDIA/NeMo.git
cd NeMo && pip install -e ".[tts]"
pip install huggingface_hub
```
Requires Python 3.10+, PyTorch 2.0+, and CUDA 11.8+.
Inference
```python
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from nemo.collections.tts.models import MagpieTTSModel
from nemo.collections.tts.parts.utils.tts_dataset_utils import chunk_text_for_inference

# Download and load the model
nemo_path = hf_hub_download(repo_id="NMikka/Magpie-TTS-Geo-357m", filename="magpie_tts_georgian.nemo")
model = MagpieTTSModel.restore_from(nemo_path, map_location="cpu")
model = model.eval().cuda()

# Synthesize
text = "გამარჯობა, ეს არის ქართული მეტყველების სინთეზის მაგალითი."
chunked_tokens, chunked_tokens_len, _ = chunk_text_for_inference(
    text=text,
    language="ka",
    tokenizer_name="text_ce_tokenizer",
    text_tokenizer=model.tokenizer,
    eos_token_id=model.eos_id,
)

chunk_state = model.create_chunk_state(batch_size=1)
all_codes = []
for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
    batch = {
        "text": toks.unsqueeze(0).cuda(),
        "text_lens": torch.tensor([toks_len], device="cuda", dtype=torch.long),
        "speaker_indices": 1,  # baked speaker index (0-4); speaker 1 recommended
    }
    with torch.no_grad():
        output = model.generate_speech(
            batch,
            chunk_state=chunk_state,
            end_of_text=[i == len(chunked_tokens) - 1],
            beginning_of_text=(i == 0),
            use_cfg=True,
            use_local_transformer_for_inference=True,
        )
    if output.predicted_codes_lens[0] > 0:
        all_codes.append(output.predicted_codes[0, :, :output.predicted_codes_lens[0]])

# Decode codec tokens to a waveform
codes = torch.cat(all_codes, dim=1).unsqueeze(0)
codes_lens = torch.tensor([codes.shape[2]], device="cuda", dtype=torch.long)
audio, audio_lens, _ = model.codes_to_audio(codes, codes_lens)
waveform = audio[0, :audio_lens[0]].cpu().float().unsqueeze(0)
torchaudio.save("output.wav", waveform, 22050)
```
Convenience Wrapper
For easier use, here's a helper function:
```python
def synthesize(model, text, speaker=1, use_cfg=True):
    """Generate Georgian speech from text.

    Args:
        model: Loaded MagpieTTSModel.
        text: Georgian text string.
        speaker: Baked speaker index (0-4). Speaker 1 recommended.
        use_cfg: Use classifier-free guidance (better quality, ~2x slower).

    Returns:
        torch.Tensor of shape (1, num_samples) at 22,050 Hz,
        or None if nothing was generated.
    """
    chunked_tokens, chunked_tokens_len, _ = chunk_text_for_inference(
        text=text,
        language="ka",
        tokenizer_name="text_ce_tokenizer",
        text_tokenizer=model.tokenizer,
        eos_token_id=model.eos_id,
    )
    chunk_state = model.create_chunk_state(batch_size=1)
    all_codes = []
    for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
        batch = {
            "text": toks.unsqueeze(0).cuda(),
            "text_lens": torch.tensor([toks_len], device="cuda", dtype=torch.long),
            "speaker_indices": speaker,
        }
        with torch.no_grad():
            output = model.generate_speech(
                batch,
                chunk_state=chunk_state,
                end_of_text=[i == len(chunked_tokens) - 1],
                beginning_of_text=(i == 0),
                use_cfg=use_cfg,
                use_local_transformer_for_inference=True,
            )
        if output.predicted_codes_lens[0] > 0:
            all_codes.append(output.predicted_codes[0, :, :output.predicted_codes_lens[0]])
    if not all_codes:
        return None
    codes = torch.cat(all_codes, dim=1).unsqueeze(0)
    codes_lens = torch.tensor([codes.shape[2]], device="cuda", dtype=torch.long)
    audio, audio_lens, _ = model.codes_to_audio(codes, codes_lens)
    return audio[0, :audio_lens[0]].cpu().float().unsqueeze(0)

# Usage:
waveform = synthesize(model, "გამარჯობა მსოფლიო")
torchaudio.save("hello_world.wav", waveform, 22050)
```
How It Works
MagPIE TTS is an encoder-decoder transformer (not a diffusion or flow model):
- ByT5-small encodes text at the byte level, so no language-specific tokenizer is needed
- 6-layer causal encoder processes text embeddings
- CTC monotonic alignment maps text to audio frames, preventing hallucinations (no skipped or repeated words)
- 12-layer causal decoder autoregressively generates NanoCodec tokens
- NanoCodec (22kHz, 8 codebooks) decodes tokens to waveform
Classifier-Free Guidance (CFG) runs two forward passes (with and without text conditioning) and interpolates between them. Set use_cfg=False for roughly 2x faster inference at slightly lower quality.
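The interpolation step can be sketched in a few lines (illustrative only; the actual NeMo implementation operates on logit tensors inside generate_speech, and its exact formulation may differ):

```python
def cfg_combine(cond_logits, uncond_logits, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional logits
    toward the text-conditioned ones. cfg_scale=1.0 recovers the
    conditional distribution; larger values follow the text more strictly."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
```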
Speakers
The model has 5 baked speaker embeddings from pretraining. Set via speaker_indices in the batch dict.
| Index | Quality |
|---|---|
| 1 | Best (recommended) |
| 0 | Good |
| 2 | Acceptable |
| 3 | Mediocre |
| 4 | Mediocre |
Parameters
You can tune inference parameters via model.inference_parameters:
```python
model.inference_parameters.temperature = 0.6        # sampling temperature (lower = more deterministic)
model.inference_parameters.topk = 80                # top-k sampling (lower = more focused)
model.inference_parameters.cfg_scale = 2.5          # CFG strength (higher = follows text more strictly)
model.inference_parameters.max_decoder_steps = 500  # max generation length in frames
```
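As a rough illustration of what temperature and topk control (this is a hypothetical standalone sketch, not the NeMo sampling code): the logits are restricted to the k highest-scoring tokens, scaled by the temperature, and sampled from the resulting softmax.

```python
import math
import random

def sample_top_k(logits, temperature=0.6, topk=80, rng=random):
    """Temperature + top-k sampling over a list of logits.
    Returns the index of the sampled token."""
    # Keep only the topk highest-scoring token indices
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:topk]
    # Temperature-scaled softmax over the kept entries (shifted for stability)
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

With topk=1 this degenerates to greedy argmax decoding; lowering the temperature concentrates probability mass on the highest-scoring tokens.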
Training Details
| Parameter | Value |
|---|---|
| Base model | nvidia/magpie_tts_multilingual_357m |
| Method | Full SFT via NeMo |
| Training data | NMikka/Common-Voice-Geo-Cleaned (~20,300 clips, 24kHz, resampled to 22,050 Hz) |
| Parameters | 357M (all trainable) |
| Epochs | 37 |
| Steps | 15,614 |
| Learning rate | 2e-5 |
| Precision | bf16-mixed |
| GPU | 1x A6000 (48GB) |
| Best val_loss | 9.5569 |
| Sample rate | 22,050 Hz |
| Codec | NanoCodec (8 codebooks, 21.5 fps, 1.89 kbps) |
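The codec figures in the table are mutually consistent: 8 codebooks × 21.5 frames/s = 172 codes per second, and at roughly 11 bits per code (which would imply 2^11 = 2048 entries per codebook, an assumption not stated in the table) that comes out to about 1.89 kbps:

```python
codebooks = 8
frames_per_second = 21.5
bits_per_code = 11  # assumption: implies 2**11 = 2048 entries per codebook

codes_per_second = codebooks * frames_per_second        # 172.0 codes/s
bitrate_kbps = codes_per_second * bits_per_code / 1000  # ~1.89 kbps
```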
Limitations
- Single language: Fine-tuned on Georgian only. The base model supports 105 languages but this checkpoint is specialized.
- No voice cloning: Uses 5 baked speaker embeddings from pretraining. Reference audio cloning was not trained.
- Autoregressive: Not real-time. RTF ~0.6-0.8 on A6000 with CFG, ~0.4-0.7 without.
- NeMo dependency: Requires NVIDIA NeMo toolkit. Not a standalone model.
- NanoCodec dependency: The codec model (nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps) is downloaded automatically on first use.
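RTF (real-time factor) above means synthesis wall-clock time divided by the duration of the generated audio, so values below 1.0 are faster than real time. A trivial helper to compute it (hypothetical, not part of the model API):

```python
def measure_rtf(elapsed_seconds, num_samples, sample_rate=22050):
    """Real-time factor: wall-clock synthesis time / generated audio duration.
    RTF < 1.0 means synthesis is faster than real time."""
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds
```

For example, 6 seconds of wall-clock time to generate 10 seconds of audio gives an RTF of 0.6.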
Citation
```bibtex
@misc{magpie-tts-georgian-2026,
  title={MagPIE TTS Georgian: Fine-tuned Text-to-Speech for Georgian},
  author={TODO},
  year={2026},
  url={https://huggingface.co/NMikka/Magpie-TTS-Geo-357m}
}
```
Acknowledgments
- NVIDIA NeMo for the MagPIE TTS architecture and training framework
- NMikka/Common-Voice-Geo-Cleaned for the cleaned Georgian speech dataset
- Google FLEURS for the evaluation benchmark