---
license: apache-2.0
language:
- mr
library_name: kokoro
pipeline_tag: text-to-speech
base_model: hexgrad/Kokoro-82M
base_model_relation: finetune
datasets:
- ai4bharat/Rasa
- ai4bharat/indicvoices_r
- SPRINGLab/IndicTTS_Marathi
tags:
- text-to-speech
- tts
- kokoro
- marathi
- minglish
- indic
- styletts2
- bol-tts
---

# bol-tts-marathi — Kokoro-82M fine-tuned for Marathi

A Marathi (मराठी) text-to-speech fine-tune of [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M), trained with the [semidark/kokoro-deutsch](https://github.com/semidark/kokoro-deutsch) recipe. Handles pure Marathi and **Minglish** (Marathi + English code-switching) via a client-side Devanagari-transliteration preprocessor.

- **Architecture:** StyleTTS2 acoustic model + ISTFTNet decoder (Kokoro-82M, unchanged)
- **Parameters:** 81.76 M
- **Sample rate:** 24 kHz
- **Voices:** 25 (4 Marathi-trained + 19 stock-Kokoro crossovers + 2 synthetic) — see voice catalog below
- **Live demo:** [shreyask/bol-tts-marathi](https://huggingface.co/spaces/shreyask/bol-tts-marathi) (in-browser via WebGPU)
- **Write-up:** [kshreyas.dev/post/bol-tts-marathi](https://kshreyas.dev/post/bol-tts-marathi/) — full design + debugging story with audio samples
- **Code:** [github.com/shreyaskarnik/bol-tts-marathi](https://github.com/shreyaskarnik/bol-tts-marathi)
- **ONNX export:** [shreyask/bol-tts-marathi-onnx](https://huggingface.co/shreyask/bol-tts-marathi-onnx)

## Voice catalog (25 voices)

### Marathi-trained (4)

| ID | Display | Source | Default speed |
|---|---|---|---|
| `mf_asha` | Asha (आशा) | [Rasa](https://huggingface.co/datasets/ai4bharat/Rasa) `marathi_female` | 1.00× |
| `mm_vivek` | Vivek (विवेक) | Rasa `marathi_male` | 1.00× |
| `mf_mukta` | Mukta (मुक्ता) | [SPRINGLab](https://huggingface.co/datasets/SPRINGLab/IndicTTS_Marathi) female | 0.80× |
| `mm_dnyanesh` | Dnyanesh (ज्ञानेश) | SPRINGLab male | 0.80× |

### Stock-Kokoro crossovers (19)

Stock voicepacks from
[hexgrad/kokoro.js](https://github.com/hexgrad/kokoro.js) used as `ref_s` on this fine-tune. Because v0.2 is a continuation fine-tune, the encoder latent space stays close enough to stock Kokoro's that stock voicepacks plug in directly. Each voicepack was pre-screened with `peak < 0.95` to filter out the ones that clip.

| ID | Display | Source language |
|---|---|---|
| `af_heart` | Svara (स्वरा) | US English F |
| `af_alloy` | Anvita (अन्विता) | US English F |
| `af_aoede` | Sanika (सानिका) | US English F |
| `af_bella` | Naina (नैना) | US English F |
| `af_jessica` | Ishani (ईशानी) | US English F |
| `af_nova` | Tara (तारा) | US English F |
| `af_sarah` | Kavya (काव्या) | US English F |
| `af_sky` | Akasha (आकाशा) | US English F |
| `am_liam` | Atharv (अथर्व) | US English M |
| `bf_isabella` | Ira (इरा) | UK English F |
| `bm_fable` | Aaryan (आर्यन) | UK English M |
| `ff_siwis` | Esha (ईशा) | French F |
| `hm_omega` | Vihaan (विहान) | Hindi M |
| `im_nicola` | Niraj (निरज) | Italian M |
| `pf_dora` | Rhea (रिया) | Portuguese F |
| `zf_xiaoni` | Nyra (नयरा) | Mandarin F |
| `zf_xiaoxiao` | Pari (परी) | Mandarin F (kid) |
| `zf_xiaoyi` | Vir (वीर) | Mandarin F (perceived M kid) |
| `zm_yunyang` | Aakash (आकाश) | Mandarin M |

### Synthetic — generated arithmetically with no reference audio (2)

| ID | Display | Recipe |
|---|---|---|
| `syn_sama` | Sama (समा) | Centroid (mean) of 5 modern English female voicepacks |
| `syn_navya` | Navya (नव्या) | Centroid + per-position Gaussian noise (1σ) |

The voicepack tensor `[510, 1, 256]` is a plain embedding, so a new one can be constructed by averaging existing voicepacks, sampling near the centroid, or interpolating between two of them. See [voicepack zoo](https://github.com/shreyaskarnik/bol-tts-marathi#synthetic-voicepacks) in the repo for recipes.

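
The three constructions named above (averaging, sampling near the centroid, interpolating) can be sketched roughly as follows. This is an illustration, not the repo's actual recipe: the function names are hypothetical, NumPy arrays stand in for the `.pt` tensors, and reading "1σ" as the per-position standard deviation across the source voicepacks is an assumption.

```python
import numpy as np


def centroid_voicepack(packs):
    """Mean of several voicepacks, each shaped [510, 1, 256]."""
    stacked = np.stack(packs, axis=0)  # [n, 510, 1, 256]
    return stacked.mean(axis=0)


def jittered_voicepack(packs, scale=1.0, seed=0):
    """Centroid plus Gaussian noise scaled per position.

    Assumption: the noise scale is the standard deviation of the source
    voicepacks at each position, multiplied by `scale`.
    """
    stacked = np.stack(packs, axis=0)
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0)
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(mean.shape).astype(mean.dtype)
    return mean + scale * std * noise


def interpolate_voicepacks(a, b, t):
    """Linear blend of two voicepacks: t=0 returns a, t=1 returns b."""
    return (1.0 - t) * a + t * b
```

A pack loaded with `torch.load` can be converted with `.numpy()`, and a result can be turned back into a tensor with `torch.from_numpy(...)` before it is passed as `voice`.
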
## Usage

```python
import soundfile as sf
import torch
from kokoro import KModel, KPipeline
import kokoro.pipeline as _kp

_kp.LANG_CODES["m"] = "mr"  # monkey-patch Marathi lang code

kmodel = KModel(
    repo_id="shreyask/bol-tts-marathi",
    config="config.json",
    model="kokoro-mr-v0_2.pth",
)
kmodel.train(False)  # same as kmodel.eval()

pipeline = KPipeline(lang_code="m", repo_id="shreyask/bol-tts-marathi", model=kmodel)
voice = torch.load("voices/mf_asha.pt", map_location="cpu", weights_only=True)

text = "नमस्कार, मी मराठी बोलतो."

# The pipeline yields one (graphemes, phonemes, audio) triple per text chunk.
chunks = []
for _gs, _ps, audio in pipeline(text, voice=voice, speed=1.0):
    chunks.append(audio)

# Join the chunks and write them out at the model's 24 kHz sample rate.
sf.write("out.wav", torch.cat(chunks).numpy(), 24000)
```

### Minglish (loanword) handling

For Marathi mixed with English (`"Friday ला Zomato वर dinner order करूया का?"`), run the loanword preprocessor first to transliterate Latin tokens to Devanagari before phonemization:

```python
from preprocess_loanwords import preprocess

text = preprocess("Friday ला Zomato वर dinner order करूया का?")
# → "फ्रायडे ला झोमॅटो वर डिनर ऑर्डर करूया का?"
# Then feed to the pipeline as usual.
```

Source + ~19,500-entry lookup table: [scripts/preprocess_loanwords.py](https://github.com/shreyaskarnik/bol-tts-marathi/blob/main/scripts/preprocess_loanwords.py) and [data/loanword_map.json](https://github.com/shreyaskarnik/bol-tts-marathi/blob/main/data/loanword_map.json).

### Per-voice timestamps

Kokoro predicts per-phoneme durations. `KModel.forward_with_tokens` returns `(audio, pred_dur)`.
`pred_dur` is in **predictor frames** where 1 frame = 600 audio samples at 24 kHz (the prosody predictor runs at half the mel-frame rate; the decoder upsamples 2× before iSTFT):

```python
audio, pred_dur = kmodel.forward_with_tokens(input_ids, ref_s, speed=1.0)
durations_sec = pred_dur.squeeze().cpu().numpy() * 600 / 24000
starts = durations_sec.cumsum() - durations_sec
# (starts[i], starts[i] + durations_sec[i]) is the time span of phoneme[i]
```

## Training

| Phase | Details |
|---|---|
| Base | `hexgrad/Kokoro-82M` |
| Stage 1 | 10 epochs, bs=12, fp32, ~9h on A100 SXM 80GB. Final val_loss ≈ 0.23 |
| Stage 2 | 10 epochs, bs=8, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, ~13h |
| Train utts | 24,676 (95/5 split) |
| Speakers | 331 (2 Rasa + 329 IndicVoices-R) + SPRINGLab IndicTTS-Marathi (single F + single M) |
| Vocab change | `ɭ` (U+026D, retroflex lateral) at Kokoro slot 144 — Marathi-specific phoneme that Hindi doesn't have |

Full methodology: [TRAINING_GUIDE.md](https://github.com/shreyaskarnik/bol-tts-marathi/blob/main/docs/TRAINING_GUIDE.md).

## Datasets

- **[AI4Bharat/Rasa](https://huggingface.co/datasets/ai4bharat/Rasa)** (CC-BY-4.0) — Marathi, 13,900 studio-quality utts, 2 speakers.
- **[AI4Bharat/IndicVoices-R](https://huggingface.co/datasets/ai4bharat/indicvoices_r)** (CC-BY-4.0, gated) — Marathi, ~11,910 utts, 329 speakers after filtering.
- **[SPRINGLab/IndicTTS-Marathi](https://huggingface.co/datasets/SPRINGLab/IndicTTS_Marathi)** (IITM EULA, commercial-OK) — single female + single male speaker, used for Mukta + Dnyanesh.

## Limitations

- **Pure-English-only sentences** — the decoder hallucinates Marathi acoustics if you don't give it any Devanagari context. The Minglish trick handles mixed input via Devanagari transliteration; pure English needs a different fallback.
- **Long-tail loanwords** — the 19,500-entry map covers high-frequency English words in Indian usage; rarer words fall through to espeak-mr unchanged.
- **Decoder English-leakage is accidental, not designed** — v0.2's decoder happens to render `/ɟʰ/` (Devanagari झ) with an English-flavored `/z/` quality, which makes "amazing" → अमेझिंग → audible "amazing." The follow-up v0.5 retraining lost this property by being more correctly Marathi; v0.6 is planned to preserve the leakage deliberately.

## License

Apache 2.0. Training data under their respective licenses (Rasa CC-BY-4.0, IndicVoices-R CC-BY-4.0, SPRINGLab IITM EULA).

## Citation

```bibtex
@software{bol_tts_marathi_2026,
  title={bol-tts-marathi: Kokoro-82M fine-tuned for Marathi},
  author={Karnik, Shreyas},
  year={2026},
  url={https://github.com/shreyaskarnik/bol-tts-marathi},
  license={Apache-2.0}
}

@software{kokoro_2025,
  title={Kokoro-82M},
  author={hexgrad},
  year={2025},
  url={https://github.com/hexgrad/kokoro}
}

@software{kokoro_deutsch_2026,
  title={kokoro-deutsch},
  author={semidark},
  year={2026},
  url={https://github.com/semidark/kokoro-deutsch}
}
```