---
base_model: kenpath/svara-tts-v1
license: apache-2.0
language:
- hi
- bn
- mr
- te
- kn
- bho
- mag
- hne
- mai
- as
- brx
- doi
- gu
- ml
- pa
- ta
- ne
- sa
- en
tags:
- text-to-speech
- speech-synthesis
- transformers
- multilingual
- indic
- orpheus
- lora
- low-latency
- gguf
- zero-shot
- emotions
- discrete-audio-tokens
task_categories:
- text-to-speech
pipeline_tag: text-to-speech
pretty_name: Svara-TTS v1
datasets:
- SYSPIN
- RASA
- IndicTTS
- SPICOR
---

# svara-tts-voiceclone-beta: Voice Cloning + Expressive TTS for Indic Languages


[Model on Hugging Face](https://huggingface.co/kenpath/svara-tts-voiceclone-beta) · [Demo Space](https://huggingface.co/spaces/kenpath/svara-tts) · [Colab](https://colab.research.google.com/) · [Inference code on GitHub](https://github.com/Kenpath/svara-tts-inference)


**svara-tts-voiceclone-beta** is an experimental extension of **svara-tts-v1**, designed to bring **lightweight voice cloning** and improved **accent preservation** to Indic languages. It introduces a simple but effective **reference-swap finetuning** technique, enabling more stable zero-shot speaker identity across long, expressive utterances.

Built on an Orpheus-style discrete audio token architecture, the model supports **19 languages**, expressive cues (`<laugh>`, `<yawn>`, `<angry>`), and low-latency TTS on commodity hardware.

---

## At a Glance

* **Languages (19):** Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Magahi, Chhattisgarhi, Maithili, Assamese, Bodo, Dogri, Gujarati, Malayalam, Punjabi, Tamil, Nepali, Sanskrit, Indian English.
* **Voice Cloning:** Improved speaker consistency through **reference-swap finetuning**; works with short (≈10 s) reference audio.
* **Expressivity:** Emotion tags, non-verbal cues, and improved Indic prosody.
* **Low-Latency Deployment:** Fully compatible with GGUF and **vLLM**.
* **Adaptability:** LoRA-ready; easy to specialize for speakers, domains, or dialects.

Demo playback uses the same Space as svara-tts-v1.

---

## Prompting (Orpheus-Style)

* Place style/emotion tags at the end:
  `आज शाम को जल्दी मिलते हैं। <neutral>`
* Provide reference audio tokens before the target text.
* Use punctuation to control rhythm, pauses, and emphasis.

**Zero-shot example:**

```
<BOS>
<reference_audio_tokens_here>
कल शाम को जल्दी मिलते हैं। <neutral>
<SOA>
```
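
As a rough illustration of the layout above, the prompt can be assembled with a small helper. This is a minimal sketch, not the project's inference code: `build_zero_shot_prompt` is a hypothetical name, `<BOS>` and `<SOA>` are treated as literal marker strings copied from the example, and the newline separators simply mirror how the example is printed. A real pipeline would likely work with token IDs from the model's tokenizer rather than plain strings.

```python
def build_zero_shot_prompt(reference_audio_tokens: str, text: str, tag: str = "neutral") -> str:
    """Assemble a prompt in the same order as the zero-shot example.

    Assumptions (not confirmed by the model card): the markers are plain
    strings and the parts are joined with newlines.
    """
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    parts = [
        "<BOS>",                         # start marker, as printed in the example
        reference_audio_tokens.strip(),  # reference audio tokens come before the text
        f"{text} <{tag}>",               # style/emotion tag goes at the end of the text
        "<SOA>",                         # marker after which audio tokens are generated
    ]
    return "\n".join(parts)


prompt = build_zero_shot_prompt("<reference_audio_tokens_here>", "कल शाम को जल्दी मिलते हैं।")
print(prompt)
```

Keeping the tag as a separate argument makes it harder to forget that it belongs after the target text, not before it.
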

Speaker IDs remain compatible with svara-tts-v1: **`Language (Gender)`**.

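
To make the `Language (Gender)` pattern concrete, the sketch below checks a speaker ID string before it is sent to the model. It is only an illustration under stated assumptions: `parse_speaker_id` is a hypothetical helper, the language names are taken from the list in "At a Glance", and the exact spellings and capitalization the model accepts (including `Male`/`Female`) are assumptions that should be checked against the svara-tts-v1 model card.

```python
import re

# Language names as listed in "At a Glance"; the exact strings the model
# expects are an assumption.
SUPPORTED_LANGUAGES = {
    "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Bhojpuri", "Magahi",
    "Chhattisgarhi", "Maithili", "Assamese", "Bodo", "Dogri", "Gujarati",
    "Malayalam", "Punjabi", "Tamil", "Nepali", "Sanskrit", "Indian English",
}

# Matches "Language (Gender)"; the gender spellings are assumed.
_SPEAKER_ID = re.compile(r"^(?P<language>.+) \((?P<gender>Male|Female)\)$")


def parse_speaker_id(speaker_id: str) -> tuple[str, str]:
    """Split a 'Language (Gender)' string into (language, gender)."""
    match = _SPEAKER_ID.match(speaker_id)
    if match is None:
        raise ValueError(f"not in 'Language (Gender)' form: {speaker_id!r}")
    language = match["language"]
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    return language, match["gender"]


print(parse_speaker_id("Hindi (Female)"))
```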

---

## Training Data Summary

`svara-tts-voiceclone-beta` extends the multilingual base model **svara-tts-v1**, trained on:

* **SYSPIN**, **RASA**, **IndicTTS**, **SPICOR**
* ~2,000 hours, ~50 speakers, balanced male/female
* Rich phoneme coverage across the 19 supported languages

The **reference-swap augmentation** uses multi-utterance samples to improve speaker consistency across Indic phonetic variation.

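
The model card does not spell out how reference-swap pairs are built. One plausible reading, sketched below, is that each target utterance is paired with a *different* utterance from the same speaker, which then serves as the reference. Everything here is an assumption for illustration: the function name, the `(speaker, utterance_id)` input format, and the uniform random choice are not taken from the actual training code.

```python
import random
from collections import defaultdict


def make_reference_swap_pairs(utterances: list[tuple[str, str]], seed: int = 0) -> list[tuple[str, str]]:
    """Return (reference_id, target_id) pairs drawn from the same speaker.

    `utterances` is a list of (speaker, utterance_id). Speakers with a single
    utterance are skipped because there is nothing to swap in.
    """
    rng = random.Random(seed)  # seeded for reproducible pairing
    by_speaker: dict[str, list[str]] = defaultdict(list)
    for speaker, utt_id in utterances:
        by_speaker[speaker].append(utt_id)

    pairs = []
    for utt_ids in by_speaker.values():
        if len(utt_ids) < 2:
            continue
        for target in utt_ids:
            # The reference must differ from the target it conditions.
            candidates = [u for u in utt_ids if u != target]
            pairs.append((rng.choice(candidates), target))
    return pairs


data = [("spk1", "a"), ("spk1", "b"), ("spk1", "c"), ("spk2", "x")]
for reference, target in make_reference_swap_pairs(data):
    print(f"reference={reference} -> target={target}")
```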

---

## Intended Uses

* Zero-shot voice cloning for Indic voices
* Dialogue systems, IVR, learning apps, accessibility solutions
* Content creation, localization, storytelling
* Research on speech identity, expressivity, and multilingual TTS

## Out-of-Scope / Not Intended

* Impersonating private individuals without consent
* Fraud, targeted deception, harassment
* High-risk or safety-critical deployments
* Perfect 1:1 replication of voices (this is a beta research release)

---

## Limitations

* Zero-shot cloning is **not** equivalent to dedicated finetuning
* Speaker similarity may degrade over long utterances
* Quality varies by language due to dataset imbalance
* Emotion emphasis may differ across low-resource languages
* Rare names and numbers may require normalization or rewriting

These limitations can be reduced with targeted LoRA finetuning or higher-quality data.

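
Two of the limitations above (long utterances, and numbers that need normalization) can also be worked around on the input side. The sketch below is one possible pre-processing step, not something the model card prescribes: it splits text at sentence-final punctuation so each chunk can be synthesized with the same reference, and it reports digit runs that may need rewriting as words. The function names are hypothetical, and the splitter is deliberately naive (it would also split after abbreviations such as "Dr.").

```python
import re

# Whitespace that follows sentence-final punctuation: danda, double danda, ., ?, !
_SENTENCE_BREAK = re.compile(r"(?<=[।॥.?!])\s+")
# \d matches any Unicode decimal digit, so Devanagari digits are caught too.
_DIGIT_RUN = re.compile(r"\d+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentence-sized chunks, keeping the punctuation."""
    return [chunk for chunk in _SENTENCE_BREAK.split(text.strip()) if chunk]


def find_digit_runs(text: str) -> list[str]:
    """Return digit sequences that may need normalization before synthesis."""
    return _DIGIT_RUN.findall(text)


sample = "आज मिलते हैं। कल 5 बजे आना! ठीक है?"
for sentence in split_sentences(sample):
    print(sentence, "| digits:", find_digit_runs(sentence))
```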

---

## Responsible Use

By using this model, you agree to follow applicable laws and ethical guidelines. Synthetic speech should be disclosed when appropriate. Avoid impersonation or harmful use cases.

---

## Sources & Links

* **Base Model (svara-tts-v1):** [https://huggingface.co/kenpath/svara-tts-v1](https://huggingface.co/kenpath/svara-tts-v1)
* **Demo Space:** [https://huggingface.co/spaces/kenpath/svara-tts](https://huggingface.co/spaces/kenpath/svara-tts)
* **Inference Repo:** [https://github.com/Kenpath/svara-tts-inference](https://github.com/Kenpath/svara-tts-inference)
* **Indic Text Normalizer:** [https://github.com/Kenpath/indic-text-normalization](https://github.com/Kenpath/indic-text-normalization)

---

## 🙏 Acknowledgments

Developed by **Kenpath Technologies**. Special thanks to:

* **Canopy Labs: Orpheus** (architecture & research release)
* **SYSPIN / SPICOR: IISc Bangalore**
* **AI4Bharat: RASA**
* **IIT Madras: IndicTTS**
* **Unsloth** (training tools & LoRA insights)
* **RunPod** (GPU compute credits)

---

## License

**Apache-2.0**

---

## Versioning & Changelog

* **v0.1.0-beta:** Initial release with reference-swap voice cloning