CSM-1B Georgian
A fine-tuned version of sesame/csm-1b for Georgian text-to-speech: an open-source Georgian TTS model based on the CSM architecture.
Model Details
| | |
|---|---|
| Base model | sesame/csm-1b (1B Llama backbone + Mimi codec) |
| Fine-tuning | LoRA (rank=64, alpha=64) via Unsloth, merged into base weights |
| Training data | NMikka/Common-Voice-Geo-Cleaned: 21,421 samples, 12 speakers, 35 hours |
| Training | ~14 epochs, ~25 hours on a single NVIDIA RTX A6000 (48 GB) |
| Sample rate | 24 kHz |
| Speakers | 12 (multi-speaker; IDs are non-contiguous, see Speaker Selection) |
Evaluation
In-Domain (Common Voice Georgian)
| Metric | Value |
|---|---|
| CER | 0.0281 |
| WER | 0.1363 |
| MCD | 5.43 dB |
| ECAPA-TDNN similarity | 0.5609 |
FLEURS Georgian Benchmark (979 unseen samples)
| Metric | Value |
|---|---|
| CER mean | 0.1081 |
| CER median | 0.0541 |
| CER p90 | 0.2507 |
| WER mean | 0.2494 |
48.5% of samples achieve CER below 5%, and 65.9% below 10%. Evaluated via round-trip ASR using Meta Omnilingual ASR 7B (which has a 1.9% baseline CER on Georgian).
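Character error rate here is the Levenshtein edit distance between the reference text and the ASR transcript, divided by reference length. A minimal, self-contained sketch of the metric (the actual evaluation script is not included in this card):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # DP table: dist[i][j] = edits to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[len(ref)][len(hyp)] / len(ref)
```

In practice a library such as `jiwer` computes the same quantity; the sketch above just makes the definition concrete.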
Usage
```python
import torch
import soundfile as sf
from transformers import CsmForConditionalGeneration, AutoProcessor

# Load model
model = CsmForConditionalGeneration.from_pretrained("NMikka/CSM-1B-Georgian", device_map="cuda")
processor = AutoProcessor.from_pretrained("NMikka/CSM-1B-Georgian")
model.eval()

# Generate speech (speaker 7 recommended: best CER)
text = "გამარჯობა, როგორ ხარ?"
inputs = processor(f"[7]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    audio = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 10)

# Save to file
sf.write("output.wav", audio[0].cpu().numpy(), 24000)
```
Speaker Selection
The model supports 12 speakers. Note that the IDs are non-contiguous strings: `1`–`8`, `10`–`12`, and `14` (an artifact of data preparation; this will be fixed in future releases). Speaker 7 produces the most intelligible output (0.42% CER in-domain). Per-speaker quality varies:
| Speaker | CER | Notes |
|---|---|---|
| 7 | 0.0042 | Best overall |
| 3 | 0.0174 | Best speaker similarity |
Use the speaker ID as a text prefix: `[7]your text here`.
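Because the valid IDs are non-contiguous, a small guard can catch invalid speaker prefixes before generation. `format_prompt` below is a hypothetical convenience helper, not part of this repository:

```python
# Valid speaker IDs for this model (non-contiguous; see Speaker Selection)
VALID_SPEAKERS = {"1", "2", "3", "4", "5", "6", "7", "8", "10", "11", "12", "14"}

def format_prompt(text: str, speaker: str = "7") -> str:
    """Prefix text with a validated speaker ID, e.g. '[7]...'."""
    if speaker not in VALID_SPEAKERS:
        raise ValueError(f"Unknown speaker {speaker!r}; valid IDs: {sorted(VALID_SPEAKERS)}")
    return f"[{speaker}]{text}"
```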
Generation Parameters
```python
# Default (recommended)
audio = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 10)

# For longer utterances (up to 15 s)
audio = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 15)

# With temperature (lower = more stable, higher = more expressive)
audio = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 10, temperature=0.7)
```
Warning: Do not use `temperature > 0.9` or `repetition_penalty`; these can cause CUDA errors from out-of-range token IDs on LoRA models.
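The `max_new_tokens` values above scale linearly with target duration (125 tokens per second of audio, judging by the `125 * 10` convention in the examples). A small helper, assuming that rate, makes the intent explicit; `max_tokens_for` is a hypothetical name, not part of the released code:

```python
def max_tokens_for(seconds: float, tokens_per_second: int = 125) -> int:
    """Token budget for a target audio duration.

    Assumes ~125 generated tokens per second, matching the
    `max_new_tokens=125 * N` convention used in the examples above.
    """
    return int(seconds * tokens_per_second)
```

Usage: `model.generate(**inputs, output_audio=True, max_new_tokens=max_tokens_for(12))`.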
Batch Generation
```python
texts = ["[7]პირველი წინადადება.", "[7]მეორე წინადადება.", "[7]მესამე წინადადება."]
inputs = processor(texts, add_special_tokens=True, padding=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    audios = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 10)

for i, audio in enumerate(audios):
    sf.write(f"output_{i}.wav", audio.cpu().numpy(), 24000)
```
Training Details
| | |
|---|---|
| LoRA rank | 64 |
| LoRA alpha | 64 |
| Target modules | q/k/v/o_proj, gate/up/down_proj, n_embed |
| Trainable params | 58M / 1.69B (3.44%) |
| Batch size | 64 (effective 128 with grad accum=2) |
| Learning rate | 5e-5 (cosine schedule) |
| Optimizer | AdamW (weight decay 0.002) |
| Final eval loss | 5.553 |
| Framework | Unsloth + HuggingFace Transformers |
Limitations
- Trained on 12 speakers from Common Voice Georgian, so speaker diversity is limited
- Long sentences (>10s of audio) may produce hallucinations or truncations
- 4.1% of FLEURS samples had CER > 50% (failure cases on complex text)
- Georgian only
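One way to work around the long-utterance limitation is to split input text at sentence boundaries and synthesize each chunk separately, then concatenate the audio. `split_sentences` below is a hypothetical helper sketch, not part of the model code:

```python
import re

def split_sentences(text: str, max_chars: int = 120) -> list[str]:
    """Split text at sentence-ending punctuation, merging short pieces up to max_chars."""
    pieces = [p.strip() for p in re.split(r"(?<=[.!?])\s+", text) if p.strip()]
    chunks: list[str] = []
    for piece in pieces:
        # Merge into the previous chunk if the combined length stays within budget
        if chunks and len(chunks[-1]) + 1 + len(piece) <= max_chars:
            chunks[-1] += " " + piece
        else:
            chunks.append(piece)
    return chunks
```

Each chunk can then be passed through the generation loop from the Usage section with its own `[7]` prefix.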
Citation
```bibtex
@misc{csm1b-georgian-2026,
  title={CSM-1B Georgian: Fine-tuned Text-to-Speech for Georgian},
  author={NMikka},
  year={2026},
  url={https://huggingface.co/NMikka/CSM-1B-Georgian}
}
```