| --- |
| license: apache-2.0 |
| language: |
| - mr |
| library_name: kokoro |
| pipeline_tag: text-to-speech |
| base_model: hexgrad/Kokoro-82M |
| base_model_relation: finetune |
| datasets: |
| - ai4bharat/Rasa |
| - ai4bharat/indicvoices_r |
| - SPRINGLab/IndicTTS_Marathi |
| tags: |
| - text-to-speech |
| - tts |
| - kokoro |
| - marathi |
| - minglish |
| - indic |
| - styletts2 |
| - bol-tts |
| --- |
| |
| # bol-tts-marathi — Kokoro-82M fine-tuned for Marathi |
|
|
| A Marathi (मराठी) text-to-speech fine-tune of [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M), trained with the [semidark/kokoro-deutsch](https://github.com/semidark/kokoro-deutsch) recipe. Handles pure Marathi and **Minglish** (Marathi + English code-switching) via a client-side Devanagari-transliteration preprocessor. |
|
|
| - **Architecture:** StyleTTS2 acoustic model + ISTFTNet decoder (Kokoro-82M, unchanged) |
| - **Parameters:** 81.76 M |
| - **Sample rate:** 24 kHz |
| - **Voices:** 25 (4 Marathi-trained + 19 stock-Kokoro crossovers + 2 synthetic) — see voice catalog below |
| - **Live demo:** [shreyask/bol-tts-marathi](https://huggingface.co/spaces/shreyask/bol-tts-marathi) (in-browser via WebGPU) |
| - **Write-up:** [kshreyas.dev/post/bol-tts-marathi](https://kshreyas.dev/post/bol-tts-marathi/) — full design + debugging story with audio samples |
| - **Code:** [github.com/shreyaskarnik/bol-tts-marathi](https://github.com/shreyaskarnik/bol-tts-marathi) |
| - **ONNX export:** [shreyask/bol-tts-marathi-onnx](https://huggingface.co/shreyask/bol-tts-marathi-onnx) |
|
|
| ## Voice catalog (25 voices) |
|
|
| ### Marathi-trained (4) |
|
|
| | ID | Display | Source | Default speed | |
| |---|---|---|---| |
| | `mf_asha` | Asha (आशा) | [Rasa](https://huggingface.co/datasets/ai4bharat/Rasa) `marathi_female` | 1.00× | |
| | `mm_vivek` | Vivek (विवेक) | Rasa `marathi_male` | 1.00× | |
| | `mf_mukta` | Mukta (मुक्ता) | [SPRINGLab](https://huggingface.co/datasets/SPRINGLab/IndicTTS_Marathi) female | 0.80× | |
| | `mm_dnyanesh` | Dnyanesh (ज्ञानेश) | SPRINGLab male | 0.80× | |
|
|
| ### Stock-Kokoro crossovers (19) |
|
|
| These are stock voicepacks from [hexgrad/kokoro.js](https://github.com/hexgrad/kokoro.js), used as `ref_s` on this fine-tune. Because v0.2 is a continuation fine-tune, its encoder latent space stays close enough to stock Kokoro's that stock voicepacks plug in directly. Each voice was pre-screened with a `peak < 0.95` check to drop the ones that clip. |
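The screen described above can be sketched as follows. This is a minimal sketch, assuming `peak` means the largest absolute sample value of a test render made with the candidate voicepack; the source only states the `peak < 0.95` threshold, and `passes_peak_check` and the `renders` data are hypothetical names for illustration.

```python
def passes_peak_check(samples, threshold=0.95):
    """Return True when the loudest sample stays below `threshold`.

    `samples` is any iterable of floats in [-1.0, 1.0], for example audio
    synthesized from a fixed test sentence with the candidate voicepack
    (an assumption; the source does not say what is measured).
    """
    peak = max((abs(float(s)) for s in samples), default=0.0)
    return peak < threshold


# Hypothetical test renders, keyed by voice ID.
renders = {
    "quiet_voice": [0.10, -0.42, 0.31],
    "clipping_voice": [0.20, -0.99, 1.00],
}
kept = [voice_id for voice_id, audio in renders.items() if passes_peak_check(audio)]
print(kept)
```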
|
|
| | ID | Display | Source language | |
| |---|---|---| |
| | `af_heart` | Svara (स्वरा) | US English F | |
| | `af_alloy` | Anvita (अन्विता) | US English F | |
| | `af_aoede` | Sanika (सानिका) | US English F | |
| | `af_bella` | Naina (नैना) | US English F | |
| | `af_jessica` | Ishani (ईशानी) | US English F | |
| | `af_nova` | Tara (तारा) | US English F | |
| | `af_sarah` | Kavya (काव्या) | US English F | |
| | `af_sky` | Akasha (आकाशा) | US English F | |
| | `am_liam` | Atharv (अथर्व) | US English M | |
| | `bf_isabella` | Ira (इरा) | UK English F | |
| | `bm_fable` | Aaryan (आर्यन) | UK English M | |
| | `ff_siwis` | Esha (ईशा) | French F | |
| | `hm_omega` | Vihaan (विहान) | Hindi M | |
| | `im_nicola` | Niraj (निरज) | Italian M | |
| | `pf_dora` | Rhea (रिया) | Portuguese F | |
| | `zf_xiaoni` | Nyra (नयरा) | Mandarin F | |
| | `zf_xiaoxiao` | Pari (परी) | Mandarin F (kid) | |
| | `zf_xiaoyi` | Vir (वीर) | Mandarin F (perceived M kid) | |
| | `zm_yunyang` | Aakash (आकाश) | Mandarin M | |
|
|
| ### Synthetic — generated arithmetically with no reference audio (2) |
|
|
| | ID | Display | Recipe | |
| |---|---|---| |
| | `syn_sama` | Sama (समा) | Centroid (mean) of 5 modern English female voicepacks | |
| | `syn_navya` | Navya (नव्या) | Centroid + per-position Gaussian noise (1σ) | |
|
|
| The voicepack tensor `[510, 1, 256]` is a plain embedding — it can be constructed by averaging existing voicepacks, sampling near the centroid, or interpolating. See [voicepack zoo](https://github.com/shreyaskarnik/bol-tts-marathi#synthetic-voicepacks) in the repo for recipes. |
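A minimal sketch of those three constructions, using NumPy arrays as stand-ins for real voicepacks. The function names are hypothetical, the random arrays replace voicepacks you would load from `.pt` files, and reading "1σ" as the per-position standard deviation across the source voicepacks is an assumption; the recipes in the repo may differ.

```python
import numpy as np


def centroid_voicepack(voicepacks):
    """Element-wise mean of voicepacks that all share one shape."""
    return np.stack(voicepacks, axis=0).mean(axis=0).astype(np.float32)


def noisy_voicepack(voicepacks, scale=1.0, seed=0):
    """Centroid plus Gaussian noise, scaled per position.

    Assumption: the noise std at each position equals the std across
    the source voicepacks at that position, times `scale`.
    """
    stacked = np.stack(voicepacks, axis=0)
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(stacked.shape[1:]) * stacked.std(axis=0)
    return (stacked.mean(axis=0) + scale * noise).astype(np.float32)


def interpolate_voicepacks(a, b, t):
    """Linear blend: t=0 returns `a`, t=1 returns `b`."""
    return ((1.0 - t) * a + t * b).astype(np.float32)


# Stand-in data with the voicepack shape [510, 1, 256].
rng = np.random.default_rng(42)
packs = [rng.standard_normal((510, 1, 256)).astype(np.float32) for _ in range(5)]

sama_like = centroid_voicepack(packs)
navya_like = noisy_voicepack(packs)
blend = interpolate_voicepacks(packs[0], packs[1], 0.25)
print(sama_like.shape, navya_like.shape, blend.shape)
```

Convert a result with `torch.from_numpy(...)` before passing it to the pipeline as `voice`.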
|
|
| ## Usage |
|
|
| ```python |
| import torch, soundfile as sf |
| from kokoro import KModel, KPipeline |
| import kokoro.pipeline as _kp |
| |
| _kp.LANG_CODES["m"] = "mr" # monkey-patch Marathi lang code |
| |
| kmodel = KModel( |
|     repo_id="shreyask/bol-tts-marathi", |
|     config="config.json", |
|     model="kokoro-mr-v0_2.pth", |
| ) |
| kmodel.train(False)  # inference mode, equivalent to kmodel.eval() |
| |
| pipeline = KPipeline(lang_code="m", repo_id="shreyask/bol-tts-marathi", model=kmodel) |
| # Local path to a voicepack file from this repo |
| voice = torch.load("voices/mf_asha.pt", map_location="cpu", weights_only=True) |
| |
| text = "नमस्कार, मी मराठी बोलतो." |
| chunks = [] |
| # The pipeline yields (graphemes, phonemes, audio) for each text segment |
| for _gs, _ps, audio in pipeline(text, voice=voice, speed=1.0): |
|     chunks.append(audio) |
| |
| # Concatenate all segments and write a 24 kHz WAV |
| sf.write("out.wav", torch.cat(chunks).numpy(), 24000) |
| ``` |
|
|
| ### Minglish (loanword) handling |
|
|
| For Marathi mixed with English (`"Friday ला Zomato वर dinner order करूया का?"`), run the loanword preprocessor before phonemization to transliterate Latin-script tokens to Devanagari: |
|
|
| ```python |
| from preprocess_loanwords import preprocess |
| text = preprocess("Friday ला Zomato वर dinner order करूया का?") |
| # → "फ्रायडे ला झोमॅटो वर डिनर ऑर्डर करूया का?" |
| # Then feed to the pipeline as usual. |
| ``` |
|
|
| Source + ~19,500-entry lookup table: [scripts/preprocess_loanwords.py](https://github.com/shreyaskarnik/bol-tts-marathi/blob/main/scripts/preprocess_loanwords.py) and [data/loanword_map.json](https://github.com/shreyaskarnik/bol-tts-marathi/blob/main/data/loanword_map.json). |
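The idea behind the preprocessor can be illustrated with a dictionary lookup over Latin-script tokens. This is a sketch, not the code in `scripts/preprocess_loanwords.py`: the function name, the lowercase-key convention, and the four-entry `demo_map` are assumptions for illustration.

```python
import re

# Matches runs of ASCII letters; Devanagari text is left untouched.
LATIN_TOKEN = re.compile(r"[A-Za-z]+")


def transliterate_loanwords(text, loanword_map):
    """Replace each Latin-script token found in `loanword_map`.

    Tokens missing from the map are returned unchanged, so they reach
    the phonemizer as-is.
    """
    def swap(match):
        token = match.group(0)
        return loanword_map.get(token.lower(), token)

    return LATIN_TOKEN.sub(swap, text)


# Tiny stand-in for the real lookup table (hypothetical key format).
demo_map = {
    "friday": "फ्रायडे",
    "zomato": "झोमॅटो",
    "dinner": "डिनर",
    "order": "ऑर्डर",
}
result = transliterate_loanwords("Friday ला Zomato वर dinner order करूया का?", demo_map)
print(result)
```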
|
|
| ### Per-phoneme timestamps |
|
|
| Kokoro predicts per-phoneme durations. `KModel.forward_with_tokens` returns `(audio, pred_dur)`, where `pred_dur` is measured in **predictor frames**. One frame is 600 audio samples (25 ms at 24 kHz), because the prosody predictor runs at half the mel-frame rate and the decoder upsamples 2× before iSTFT: |
|
|
| ```python |
| audio, pred_dur = kmodel.forward_with_tokens(input_ids, ref_s, speed=1.0) |
| durations_sec = pred_dur.squeeze().cpu().numpy() * 600 / 24000 |
| starts = durations_sec.cumsum() - durations_sec |
| # (starts[i], starts[i] + durations_sec[i]) is the time span of phoneme[i] |
| ``` |
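The same conversion can be written without NumPy, pairing each token with its time span. This is a sketch under stated assumptions: `phoneme_spans` is a hypothetical helper, the token list is illustrative, and `frame_counts` stands for `pred_dur.squeeze().tolist()` with one entry per input token.

```python
from itertools import accumulate

SAMPLES_PER_FRAME = 600  # one predictor frame, as stated above
SAMPLE_RATE = 24_000     # Hz


def phoneme_spans(tokens, frame_counts):
    """Return (token, start_sec, end_sec) for each token."""
    if len(tokens) != len(frame_counts):
        raise ValueError("need exactly one frame count per token")
    # Accumulate in integer samples to avoid floating-point drift.
    end_samples = list(accumulate(int(n) * SAMPLES_PER_FRAME for n in frame_counts))
    start_samples = [0] + end_samples[:-1]
    return [
        (tok, start / SAMPLE_RATE, end / SAMPLE_RATE)
        for tok, start, end in zip(tokens, start_samples, end_samples)
    ]


# Illustrative tokens and frame counts, not real model output.
spans = phoneme_spans(["n", "ə", "m"], [2, 4, 6])
for tok, start, end in spans:
    print(f"{tok}: {start:.3f}s to {end:.3f}s")
```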
|
|
| ## Training |
|
|
| | Phase | Details | |
| |---|---| |
| | Base | `hexgrad/Kokoro-82M` | |
| | Stage 1 | 10 epochs, bs=12, fp32, ~9h on A100 SXM 80GB. Final val_loss ≈ 0.23 | |
| | Stage 2 | 10 epochs, bs=8, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, ~13h | |
| | Train utts | 24,676 (95/5 split) | |
| | Speakers | 331 (2 Rasa + 329 IndicVoices-R) + SPRINGLab IndicTTS-Marathi (single F + single M) | |
| | Vocab change | `ɭ` (U+026D, retroflex lateral) at Kokoro slot 144 — Marathi-specific phoneme that Hindi doesn't have | |
| |
| Full methodology: [TRAINING_GUIDE.md](https://github.com/shreyaskarnik/bol-tts-marathi/blob/main/docs/TRAINING_GUIDE.md). |
| |
| ## Datasets |
| |
| - **[AI4Bharat/Rasa](https://huggingface.co/datasets/ai4bharat/Rasa)** (CC-BY-4.0) — Marathi, 13,900 studio-quality utts, 2 speakers. |
| - **[AI4Bharat/IndicVoices-R](https://huggingface.co/datasets/ai4bharat/indicvoices_r)** (CC-BY-4.0, gated) — Marathi, ~11,910 utts, 329 speakers after filtering. |
| - **[SPRINGLab/IndicTTS-Marathi](https://huggingface.co/datasets/SPRINGLab/IndicTTS_Marathi)** (IITM EULA, commercial-OK) — single female + single male speaker, used for Mukta + Dnyanesh. |
| |
| ## Limitations |
| |
| - **English-only sentences:** the decoder hallucinates Marathi acoustics when the input contains no Devanagari context. The Minglish preprocessor handles mixed input via Devanagari transliteration; pure English needs a different fallback. |
| - **Long-tail loanwords:** the ~19,500-entry map covers high-frequency English words in Indian usage; rarer words fall through to espeak-mr unchanged. |
| - **Decoder English leakage is accidental, not designed:** v0.2's decoder happens to render `/ɟʰ/` (Devanagari झ) with an English-flavored `/z/` quality, so "amazing" → अमेझिंग still comes out as an audible "amazing." The follow-up v0.5 retraining lost this property by becoming more correctly Marathi; v0.6 is planned to preserve the leakage deliberately. |
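One way to route around the first limitation is to check the raw input for Devanagari before preprocessing. This is a minimal sketch, assuming the basic Devanagari Unicode block (U+0900 to U+097F) is a sufficient test; `has_devanagari` is a hypothetical helper, and the choice of fallback for English-only input is left to the caller.

```python
def has_devanagari(text):
    """Return True if any character falls in the Devanagari block."""
    return any("\u0900" <= ch <= "\u097F" for ch in text)


for sample in ["Friday ला dinner करूया का?", "Let's order dinner on Friday."]:
    route = "bol-tts-marathi" if has_devanagari(sample) else "fallback needed"
    print(f"{route}: {sample}")
```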
| |
| ## License |
| |
| Apache 2.0. The training data remains under its own licenses (Rasa CC-BY-4.0, IndicVoices-R CC-BY-4.0, SPRINGLab IITM EULA). |
| |
| ## Citation |
| |
| ```bibtex |
| @software{bol_tts_marathi_2026, |
| title={bol-tts-marathi: Kokoro-82M fine-tuned for Marathi}, |
| author={Karnik, Shreyas}, |
| year={2026}, |
| url={https://github.com/shreyaskarnik/bol-tts-marathi}, |
| license={Apache-2.0} |
| } |
| @software{kokoro_2025, |
| title={Kokoro-82M}, |
| author={hexgrad}, |
| year={2025}, |
| url={https://github.com/hexgrad/kokoro} |
| } |
| @software{kokoro_deutsch_2026, |
| title={kokoro-deutsch}, |
| author={semidark}, |
| year={2026}, |
| url={https://github.com/semidark/kokoro-deutsch} |
| } |
| ``` |
| |