CSM-1B Georgian

A fine-tuned version of sesame/csm-1b for Georgian text-to-speech. This is an open-source Georgian TTS model based on the CSM architecture.

Model Details

Base model sesame/csm-1b (1B Llama backbone + Mimi codec)
Fine-tuning LoRA (rank=64, alpha=64) via Unsloth, merged into base weights
Training data NMikka/Common-Voice-Geo-Cleaned (21,421 samples, 12 speakers, 35 hours)
Training ~14 epochs, ~25 hours on single NVIDIA RTX A6000 (48GB)
Sample rate 24 kHz
Speakers 12 (multi-speaker; speaker IDs are non-contiguous, see Speaker Selection)

Evaluation

In-Domain (Common Voice Georgian)

Metric Value
CER 0.0281
WER 0.1363
MCD 5.43 dB
ECAPA-TDNN similarity 0.5609

FLEURS Georgian Benchmark (979 unseen samples)

Metric Value
CER mean 0.1081
CER median 0.0541
CER p90 0.2507
WER mean 0.2494

48.5% of samples achieve CER < 5%, and 65.9% are below 10% CER. Evaluated using round-trip ASR with Meta Omnilingual ASR 7B (1.9% baseline CER on Georgian).
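Round-trip ASR evaluation means each test sentence is synthesized, transcribed back by the ASR model, and the transcript is compared against the input text. A minimal CER computation in pure Python (a sketch; the actual evaluation pipeline and its text normalization may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    ref = reference.strip().lower()
    hyp = hypothesis.strip().lower()
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

Per-sample CER values computed this way can then be aggregated into the mean, median, and p90 figures shown in the table above.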

Usage

import torch
import soundfile as sf
from transformers import CsmForConditionalGeneration, AutoProcessor

# Load model
model = CsmForConditionalGeneration.from_pretrained("NMikka/CSM-1B-Georgian", device_map="cuda")
processor = AutoProcessor.from_pretrained("NMikka/CSM-1B-Georgian")
model.eval()

# Generate speech (speaker 7 recommended: best CER)
text = "გამარჯობა, როგორ ხარ?"
inputs = processor(f"[7]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")

with torch.no_grad():
    audio = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 10)

# Save to file
sf.write("output.wav", audio[0].cpu().numpy(), 24000)

Speaker Selection

The model supports 12 speakers, but the valid IDs are non-contiguous: ['1', '10', '11', '12', '14', '2', '3', '4', '5', '6', '7', '8'] (sorry about this — it will be fixed in future models). Speaker 7 produces the most intelligible output (0.42% CER in-domain). Per-speaker quality varies:

Speaker CER Recommended
7 0.0042 Best overall
3 0.0174 Best speaker similarity

Use the speaker ID in the text prefix: [7]your text here.
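Because the valid IDs are non-contiguous, a small helper (hypothetical, not part of the released code) can guard against silently prompting with a speaker the model was never trained on:

```python
# Speaker IDs present in the training data (non-contiguous, as listed above).
VALID_SPEAKERS = {1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 14}

def build_prompt(text: str, speaker: int = 7) -> str:
    """Prefix text with a [speaker] tag, rejecting IDs outside the trained set."""
    if speaker not in VALID_SPEAKERS:
        raise ValueError(f"speaker {speaker} not in trained set {sorted(VALID_SPEAKERS)}")
    return f"[{speaker}]{text}"
```

For example, `build_prompt("გამარჯობა")` returns `"[7]გამარჯობა"`, ready to pass to the processor.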

Generation Parameters

# Default (recommended)
audio = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 10)

# For longer utterances (up to 15s)
audio = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 15)

# With temperature (lower = more stable, higher = more expressive)
audio = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 10, temperature=0.7)

Warning: Do not use temperature > 0.9 or repetition_penalty — these can cause CUDA errors with out-of-range token IDs on LoRA models.
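The examples above size max_new_tokens at roughly 125 tokens per second of target audio (this figure is inferred from the `125 * seconds` pattern in the snippets, not from the codec specification). A tiny helper makes the budget explicit:

```python
TOKENS_PER_SECOND = 125  # matches the max_new_tokens = 125 * seconds pattern above

def token_budget(seconds: float) -> int:
    """max_new_tokens for roughly `seconds` of generated audio."""
    return int(TOKENS_PER_SECOND * seconds)
```

So `token_budget(10)` gives the default budget of 1250 tokens, and `token_budget(15)` the 15-second ceiling of 1875.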

Batch Generation

texts = ["[7]პირველი წინადადება.", "[7]მეორე წინადადება.", "[7]მესამე წინადადება."]
inputs = processor(texts, add_special_tokens=True, padding=True, return_tensors="pt").to("cuda")

with torch.no_grad():
    audios = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 10)

for i, audio in enumerate(audios):
    sf.write(f"output_{i}.wav", audio.cpu().numpy(), 24000)

Training Details

LoRA rank 64
LoRA alpha 64
Target modules q/k/v/o_proj, gate/up/down_proj, n_embed
Trainable params 58M / 1.69B (3.44%)
Batch size 64 (effective 128 with grad accum=2)
Learning rate 5e-5 (cosine schedule)
Optimizer AdamW (weight decay 0.002)
Final eval loss 5.553
Framework Unsloth + HuggingFace Transformers
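The LoRA hyperparameters above can be expressed as a PEFT-style config. This is a sketch under the assumption that the Unsloth run used standard PEFT semantics; module names are copied from the table, and lora_dropout is an assumption since the card does not state it:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,            # LoRA rank, as listed above
    lora_alpha=64,   # LoRA alpha, as listed above
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",      # attention projections
        "gate_proj", "up_proj", "down_proj",         # MLP projections
        "n_embed",                                   # as listed in the table
    ],
    lora_dropout=0.0,  # assumption: not stated in the card
)
```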

Limitations

  • Trained on 12 speakers from Common Voice Georgian (limited speaker diversity)
  • Long sentences (>10s of audio) may produce hallucinations or truncations
  • 4.1% of FLEURS samples had CER > 50% (failure cases on complex text)
  • Georgian only

Citation

@misc{csm1b-georgian-2026,
  title={CSM-1B Georgian: Fine-tuned Text-to-Speech for Georgian},
  author={NMikka},
  year={2026},
  url={https://huggingface.co/NMikka/CSM-1B-Georgian}
}