bol-tts-marathi — Kokoro-82M fine-tuned for Marathi

A Marathi (मराठी) text-to-speech fine-tune of hexgrad/Kokoro-82M, trained with the semidark/kokoro-deutsch recipe. It handles pure Marathi directly and Minglish (Marathi + English code-switching) via a client-side preprocessor that transliterates English words into Devanagari.

Architecture: StyleTTS2 acoustic model + ISTFTNet decoder (Kokoro-82M, unchanged)
Parameters: 81.76 M
Sample rate: 24 kHz
Voices: 25 (4 Marathi-trained + 19 stock-Kokoro crossovers + 2 synthetic) — see voice catalog below
Live demo: shreyask/bol-tts-marathi (in-browser via WebGPU)
Write-up: kshreyas.dev/post/bol-tts-marathi — full design + debugging story with audio samples
Code: github.com/shreyaskarnik/bol-tts-marathi
ONNX export: shreyask/bol-tts-marathi-onnx

Voice catalog (25 voices)

Marathi-trained (4)

ID	Display	Source	Default speed
`mf_asha`	Asha (आशा)	Rasa `marathi_female`	1.00×
`mm_vivek`	Vivek (विवेक)	Rasa `marathi_male`	1.00×
`mf_mukta`	Mukta (मुक्ता)	SPRINGLab female	0.80×
`mm_dnyanesh`	Dnyanesh (ज्ञानेश)	SPRINGLab male	0.80×
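The default speeds in the table can be kept next to the voice IDs so callers do not hard-code them. This is a minimal sketch: `DEFAULT_SPEED` and `speed_for` are hypothetical names that do not come from the repo, and falling back to 1.0 for voices not listed here is an assumption.

```python
# Default speeds copied from the table above; the names are illustrative only.
DEFAULT_SPEED = {
    "mf_asha": 1.00,
    "mm_vivek": 1.00,
    "mf_mukta": 0.80,
    "mm_dnyanesh": 0.80,
}


def speed_for(voice_id: str) -> float:
    """Return the table's default speed, or 1.0 for unlisted voices (assumption)."""
    return DEFAULT_SPEED.get(voice_id, 1.0)
```

The returned value can be passed as the `speed` argument of the pipeline call shown under Usage.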

Stock-Kokoro crossovers (19)

Stock voicepacks from hexgrad/kokoro.js, used as ref_s on this fine-tune. Because v0.2 is a continuation fine-tune, its encoder latent space stays close enough to stock Kokoro's that stock voicepacks plug in directly. Each voice was pre-screened for peak amplitude < 0.95 to filter out those that clip.

ID	Display	Source language
`af_heart`	Svara (स्वरा)	US English F
`af_alloy`	Anvita (अन्विता)	US English F
`af_aoede`	Sanika (सानिका)	US English F
`af_bella`	Naina (नैना)	US English F
`af_jessica`	Ishani (ईशानी)	US English F
`af_nova`	Tara (तारा)	US English F
`af_sarah`	Kavya (काव्या)	US English F
`af_sky`	Akasha (आकाशा)	US English F
`am_liam`	Atharv (अथर्व)	US English M
`bf_isabella`	Ira (इरा)	UK English F
`bm_fable`	Aaryan (आर्यन)	UK English M
`ff_siwis`	Esha (ईशा)	French F
`hm_omega`	Vihaan (विहान)	Hindi M
`im_nicola`	Niraj (निरज)	Italian M
`pf_dora`	Rhea (रिया)	Portuguese F
`zf_xiaoni`	Nyra (नयरा)	Mandarin F
`zf_xiaoxiao`	Pari (परी)	Mandarin F (kid)
`zf_xiaoyi`	Vir (वीर)	Mandarin F (perceived M kid)
`zm_yunyang`	Aakash (आकाश)	Mandarin M
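The peak pre-screen mentioned before the table could look roughly like the sketch below. It rests on assumptions: the audio is a float waveform in [-1, 1], the 0.95 threshold applies to the absolute peak of a synthesized test utterance, and `passes_peak_screen` is a hypothetical helper, not the script used for this release.

```python
import numpy as np

PEAK_LIMIT = 0.95  # threshold quoted in the text above


def passes_peak_screen(audio: np.ndarray, limit: float = PEAK_LIMIT) -> bool:
    """True if the waveform's absolute peak stays strictly below `limit`."""
    if audio.size == 0:
        return False  # nothing was synthesized; treat that as a failure
    return float(np.max(np.abs(audio))) < limit
```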

Synthetic — generated arithmetically with no reference audio (2)

ID	Display	Recipe
`syn_sama`	Sama (समा)	Centroid (mean) of 5 modern English female voicepacks
`syn_navya`	Navya (नव्या)	Centroid + per-position Gaussian noise (1σ)

The voicepack tensor [510, 1, 256] is a plain embedding — it can be constructed by averaging existing voicepacks, sampling near the centroid, or interpolating. See voicepack zoo in the repo for recipes.
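The three constructions named above can be sketched with plain array arithmetic. This is an illustration and not the repo's recipe: the helper names are hypothetical, the stand-in voicepacks are random arrays, and reading "1σ" as the per-position standard deviation across the source voicepacks is an assumption.

```python
import numpy as np


def centroid(packs):
    """Element-wise mean of voicepacks that share one shape."""
    return np.mean(np.stack(packs), axis=0)


def noisy_centroid(packs, sigma_scale=1.0, seed=0):
    """Centroid plus Gaussian noise scaled by the per-position std across packs."""
    stacked = np.stack(packs)
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(stacked.shape[1:]) * stacked.std(axis=0)
    return stacked.mean(axis=0) + sigma_scale * noise


def interpolate(pack_a, pack_b, t):
    """Linear blend: t=0 returns pack_a, t=1 returns pack_b."""
    return (1.0 - t) * pack_a + t * pack_b


# Random stand-ins with the documented shape; real voicepacks would be loaded from .pt files.
rng = np.random.default_rng(42)
packs = [rng.standard_normal((510, 1, 256)).astype(np.float32) for _ in range(5)]
mean_pack = centroid(packs)
```

To use a result with the pipeline, it would need to be converted back to a float32 torch tensor, for example with `torch.from_numpy(mean_pack.astype(np.float32))`.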

Usage

import torch, soundfile as sf
from kokoro import KModel, KPipeline
import kokoro.pipeline as _kp

_kp.LANG_CODES["m"] = "mr"  # monkey-patch Marathi lang code

kmodel = KModel(
    repo_id="shreyask/bol-tts-marathi",
    config="config.json",
    model="kokoro-mr-v0_2.pth",
)
kmodel.eval()  # inference mode; equivalent to kmodel.train(False)

pipeline = KPipeline(lang_code="m", repo_id="shreyask/bol-tts-marathi", model=kmodel)
voice = torch.load("voices/mf_asha.pt", map_location="cpu", weights_only=True)

text = "नमस्कार, मी मराठी बोलतो."
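chunks = []
# The pipeline yields one (graphemes, phonemes, audio) triple per text chunk.
for _gs, _ps, audio in pipeline(text, voice=voice, speed=1.0):
    chunks.append(audio)

# torch.cat also covers the single-chunk case, so no special-casing is needed.
sf.write("out.wav", torch.cat(chunks).numpy(), 24000)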

Minglish (loanword) handling

For Marathi mixed with English ("Friday ला Zomato वर dinner order करूया का?"), run the loanword preprocessor first. It transliterates Latin-script tokens to Devanagari before phonemization:

from preprocess_loanwords import preprocess
text = preprocess("Friday ला Zomato वर dinner order करूया का?")
# → "फ्रायडे ला झोमॅटो वर डिनर ऑर्डर करूया का?"
# Then feed to the pipeline as usual.

Source + ~19,500-entry lookup table: scripts/preprocess_loanwords.py and data/loanword_map.json.
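The lookup-table idea can be illustrated with a few lines of Python. This is not the implementation in scripts/preprocess_loanwords.py: the four-entry table is a stand-in built from the example above, `transliterate_loanwords` is a hypothetical name, and the case-insensitive lookup and the `[A-Za-z]+` token pattern are assumptions.

```python
import re

# Tiny stand-in for the real lookup table; entries are taken from the example above.
LOANWORDS = {
    "friday": "फ्रायडे",
    "zomato": "झोमॅटो",
    "dinner": "डिनर",
    "order": "ऑर्डर",
}

LATIN_TOKEN = re.compile(r"[A-Za-z]+")


def transliterate_loanwords(text, table=LOANWORDS):
    """Replace each Latin-script token found in `table`; leave everything else unchanged."""
    def swap(match):
        token = match.group(0)
        return table.get(token.lower(), token)  # lowercase lookup is an assumption
    return LATIN_TOKEN.sub(swap, text)


result = transliterate_loanwords("Friday ला Zomato वर dinner order करूया का?")
```

Tokens missing from the table pass through untouched, which mirrors the fall-through behavior described under Limitations.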

Per-phoneme timestamps

Kokoro predicts per-phoneme durations. KModel.forward_with_tokens returns (audio, pred_dur), where pred_dur is measured in predictor frames. One frame corresponds to 600 audio samples, or 25 ms at 24 kHz (the prosody predictor runs at half the mel-frame rate; the decoder upsamples 2× before iSTFT):

audio, pred_dur = kmodel.forward_with_tokens(input_ids, ref_s, speed=1.0)
durations_sec = pred_dur.squeeze().cpu().numpy() * 600 / 24000
starts = durations_sec.cumsum() - durations_sec
# (starts[i], starts[i] + durations_sec[i]) is the time span of phoneme[i]
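As a worked check of the conversion: 600 / 24000 = 0.025, so each predictor frame covers 25 ms. The sketch below applies that factor to plain integer frame counts (for example the output of `pred_dur.squeeze().tolist()`). `frame_spans` is a hypothetical helper and the three durations are made up for illustration.

```python
from itertools import accumulate

SAMPLES_PER_FRAME = 600  # predictor frame size quoted above
SAMPLE_RATE = 24_000     # Hz


def frame_spans(frame_counts):
    """Turn per-token frame counts into (start_sec, end_sec) pairs."""
    ends = list(accumulate(frame_counts))     # cumulative frames at the end of each token
    starts = [0] + ends[:-1]                  # each token starts where the previous one ended
    to_sec = SAMPLES_PER_FRAME / SAMPLE_RATE  # 0.025 s per frame
    return [(s * to_sec, e * to_sec) for s, e in zip(starts, ends)]


spans = frame_spans([2, 4, 3])  # made-up durations for three tokens
```

With these inputs the tokens last 50 ms, 100 ms, and 75 ms, so the last span ends at 0.225 s.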

Training

Phase	Details
Base	`hexgrad/Kokoro-82M`
Stage 1	10 epochs, bs=12, fp32, ~9h on A100 SXM 80GB. Final val_loss ≈ 0.23
Stage 2	10 epochs, bs=8, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, ~13h
Train utts	24,676 (95/5 split)
Speakers	331 (2 Rasa + 329 IndicVoices-R), plus 2 from SPRINGLab IndicTTS-Marathi (1 F + 1 M)
Vocab change	`ɭ` (U+026D, retroflex lateral) at Kokoro slot 144 — Marathi-specific phoneme that Hindi doesn't have

Full methodology: TRAINING_GUIDE.md.
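A quick sanity check for the vocab change listed in the table might look like this. It is a sketch that assumes the model config exposes a `vocab` mapping from symbol to slot index; `has_retroflex_l` is a hypothetical helper.

```python
import unicodedata

RETROFLEX_L = "\u026d"  # ɭ
EXPECTED_SLOT = 144     # slot quoted in the training table


def has_retroflex_l(vocab: dict) -> bool:
    """True if the vocab maps ɭ to the expected slot."""
    return vocab.get(RETROFLEX_L) == EXPECTED_SLOT


char_name = unicodedata.name(RETROFLEX_L)
```

Under that assumption, the check could be run on the `vocab` entry parsed from config.json.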

Datasets

AI4Bharat/Rasa (CC-BY-4.0) — Marathi, 13,900 studio-quality utts, 2 speakers.
AI4Bharat/IndicVoices-R (CC-BY-4.0, gated) — Marathi, ~11,910 utts, 329 speakers after filtering.
SPRINGLab/IndicTTS-Marathi (IITM EULA, commercial-OK) — single female + single male speaker, used for Mukta + Dnyanesh.

Limitations

Pure-English-only sentences — the decoder hallucinates Marathi acoustics if you don't give it any Devanagari context. The Minglish trick handles mixed input via Devanagari transliteration; pure English needs a different fallback.
Long-tail loanwords — the 19,500-entry map covers high-frequency English words in Indian usage; rarer words fall through to espeak-mr unchanged.
Decoder English-leakage is accidental, not designed — v0.2's decoder happens to render /ɟʰ/ (Devanagari झ) with an English-flavored /z/ quality, which makes "amazing" → अमेझिंग → audible "amazing." The follow-up v0.5 retraining lost this property by being more correctly Marathi; v0.6 is planned to preserve the leakage deliberately.
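Given the first limitation, a caller may want to detect pure-English input before synthesis and route it elsewhere. The sketch below is one possible guard, not something the repo ships: both function names are hypothetical, and applying the check to the raw input text is an assumption.

```python
def has_devanagari(text: str) -> bool:
    """True if any character falls in the Devanagari block (U+0900 to U+097F)."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)


def needs_english_fallback(text: str) -> bool:
    """Flag input that contains letters but no Devanagari at all."""
    return any(ch.isalpha() for ch in text) and not has_devanagari(text)
```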

License

Apache 2.0. Training data under their respective licenses (Rasa CC-BY-4.0, IndicVoices-R CC-BY-4.0, SPRINGLab IITM EULA).

Citation

@software{bol_tts_marathi_2026,
  title={bol-tts-marathi: Kokoro-82M fine-tuned for Marathi},
  author={Karnik, Shreyas},
  year={2026},
  url={https://github.com/shreyaskarnik/bol-tts-marathi},
  license={Apache-2.0}
}
@software{kokoro_2025,
  title={Kokoro-82M},
  author={hexgrad},
  year={2025},
  url={https://github.com/hexgrad/kokoro}
}
@software{kokoro_deutsch_2026,
  title={kokoro-deutsch},
  author={semidark},
  year={2026},
  url={https://github.com/semidark/kokoro-deutsch}
}
