CSM-1B Georgian
A fine-tuned version of sesame/csm-1b for Georgian text-to-speech: an open-source Georgian TTS model based on the CSM architecture.
Model Details
| | |
|---|---|
| Base model | sesame/csm-1b (1B Llama backbone + Mimi codec) |
| Fine-tuning | LoRA (rank=64, alpha=64) via Unsloth, merged into base weights |
| Training data | NMikka/Common-Voice-Geo-Cleaned: 21,421 samples, 12 speakers, 35 hours |
| Training | ~14 epochs, ~25 hours on a single NVIDIA RTX A6000 (48 GB) |
| Sample rate | 24 kHz |
| Speakers | 12 (multi-speaker; IDs are non-contiguous, see Speaker Selection) |
Evaluation
In-Domain (Common Voice Georgian)
| Metric | Value |
|---|---|
| CER | 0.0281 |
| WER | 0.1363 |
| MCD | 5.43 dB |
| ECAPA-TDNN similarity | 0.5609 |
FLEURS Georgian Benchmark (979 unseen samples)
| Metric | Value |
|---|---|
| CER mean | 0.1081 |
| CER median | 0.0541 |
| CER p90 | 0.2507 |
| WER mean | 0.2494 |
48.5% of samples achieve CER below 5%, and 65.9% below 10%. Evaluated via round-trip ASR using Meta Omnilingual ASR 7B (which has a 1.9% baseline CER on Georgian).
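Character error rate here is the Levenshtein edit distance between the reference text and the ASR transcript, divided by reference length. A minimal, self-contained sketch of the metric (the actual evaluation script is not included in this card):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # DP table: dist[i][j] = edits to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[len(ref)][len(hyp)] / len(ref)
```

In practice a library such as `jiwer` computes the same quantity; the sketch above just makes the definition concrete.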
Usage
```python
import torch
import soundfile as sf
from transformers import CsmForConditionalGeneration, AutoProcessor

# Load model
model = CsmForConditionalGeneration.from_pretrained("NMikka/CSM-1B-Georgian", device_map="cuda")
processor = AutoProcessor.from_pretrained("NMikka/CSM-1B-Georgian")
model.eval()

# Generate speech (speaker 7 recommended: best CER)
text = "გამარჯობა, როგორ ხარ?"
inputs = processor(f"[7]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    audio = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 10)

# Save to file
sf.write("output.wav", audio[0].cpu().numpy(), 24000)
```
Speaker Selection
The model supports 12 speakers. Note that the IDs are non-contiguous strings: `1`–`8`, `10`–`12`, and `14` (an artifact of data preparation; this will be fixed in future releases). Speaker 7 produces the most intelligible output (0.42% CER in-domain). Per-speaker quality varies:
| Speaker | CER | Notes |
|---|---|---|
| 7 | 0.0042 | Best overall |
| 3 | 0.0174 | Best speaker similarity |
Use the speaker ID as a text prefix: `[7]your text here`.
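Because the valid IDs are non-contiguous, a small guard can catch invalid speaker prefixes before generation. `format_prompt` below is a hypothetical convenience helper, not part of this repository:

```python
# Valid speaker IDs for this model (non-contiguous; see Speaker Selection)
VALID_SPEAKERS = {"1", "2", "3", "4", "5", "6", "7", "8", "10", "11", "12", "14"}

def format_prompt(text: str, speaker: str = "7") -> str:
    """Prefix text with a validated speaker ID, e.g. '[7]...'."""
    if speaker not in VALID_SPEAKERS:
        raise ValueError(f"Unknown speaker {speaker!r}; valid IDs: {sorted(VALID_SPEAKERS)}")
    return f"[{speaker}]{text}"
```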
Generation Parameters
```python
# Default (recommended)
audio = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 10)

# For longer utterances (up to 15 s)
audio = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 15)

# With temperature (lower = more stable, higher = more expressive)
audio = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 10, temperature=0.7)
```
Warning: Do not use `temperature > 0.9` or `repetition_penalty`; these can cause CUDA errors from out-of-range token IDs on LoRA models.
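The `max_new_tokens` values above scale linearly with target duration (125 tokens per second of audio, judging by the `125 * 10` convention in the examples). A small helper, assuming that rate, makes the intent explicit; `max_tokens_for` is a hypothetical name, not part of the released code:

```python
def max_tokens_for(seconds: float, tokens_per_second: int = 125) -> int:
    """Token budget for a target audio duration.

    Assumes ~125 generated tokens per second, matching the
    `max_new_tokens=125 * N` convention used in the examples above.
    """
    return int(seconds * tokens_per_second)
```

Usage: `model.generate(**inputs, output_audio=True, max_new_tokens=max_tokens_for(12))`.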
Batch Generation
```python
texts = ["[7]პირველი წინადადება.", "[7]მეორე წინადადება.", "[7]მესამე წინადადება."]
inputs = processor(texts, add_special_tokens=True, padding=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    audios = model.generate(**inputs, output_audio=True, max_new_tokens=125 * 10)

for i, audio in enumerate(audios):
    sf.write(f"output_{i}.wav", audio.cpu().numpy(), 24000)
```
Training Details
| | |
|---|---|
| LoRA rank | 64 |
| LoRA alpha | 64 |
| Target modules | q/k/v/o_proj, gate/up/down_proj, n_embed |
| Trainable params | 58M / 1.69B (3.44%) |
| Batch size | 64 (effective 128 with grad accum=2) |
| Learning rate | 5e-5 (cosine schedule) |
| Optimizer | AdamW (weight decay 0.002) |
| Final eval loss | 5.553 |
| Framework | Unsloth + HuggingFace Transformers |
Limitations
- Trained on 12 speakers from Common Voice Georgian, so speaker diversity is limited
- Long sentences (>10s of audio) may produce hallucinations or truncations
- 4.1% of FLEURS samples had CER > 50% (failure cases on complex text)
- Georgian only
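One way to work around the long-utterance limitation is to split input text at sentence boundaries and synthesize each chunk separately, then concatenate the audio. `split_sentences` below is a hypothetical helper sketch, not part of the model code:

```python
import re

def split_sentences(text: str, max_chars: int = 120) -> list[str]:
    """Split text at sentence-ending punctuation, merging short pieces up to max_chars."""
    pieces = [p.strip() for p in re.split(r"(?<=[.!?])\s+", text) if p.strip()]
    chunks: list[str] = []
    for piece in pieces:
        # Merge into the previous chunk if the combined length stays within budget
        if chunks and len(chunks[-1]) + 1 + len(piece) <= max_chars:
            chunks[-1] += " " + piece
        else:
            chunks.append(piece)
    return chunks
```

Each chunk can then be passed through the generation loop from the Usage section with its own `[7]` prefix.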
Citation
```bibtex
@misc{csm1b-georgian-2026,
  title={CSM-1B Georgian: Fine-tuned Text-to-Speech for Georgian},
  author={NMikka},
  year={2026},
  url={https://huggingface.co/NMikka/CSM-1B-Georgian}
}
```