Swarlekha
Swarlekha is a Nepali and English text-to-speech and voice cloning model based on Resemble AI's open-source Chatterbox architecture. This repository contains the base model components derived from Chatterbox together with Swarlekha's Nepali fine-tuned T3 checkpoint and Nepali tokenizer.
The project is currently under active development.
Model Summary
Swarlekha supports:
- Text-to-speech generation
- Zero-shot voice cloning from a short reference audio sample
- Nepali text-to-speech using a custom Nepali tokenizer and fine-tuned checkpoint
- English text-to-speech through the Chatterbox-derived base components
- Voice conditioning with speaker embeddings and prompt speech tokens
The model pipeline follows the Chatterbox-style architecture:
Text -> Tokenizer -> T3 -> Speech Tokens -> S3Gen -> Audio
^
VoiceEncoder / reference audio
Repository Contents
This model repository is intended to host:
| File | Description |
|---|---|
| ve.safetensors | Voice encoder weights derived from Chatterbox |
| s3gen.safetensors | Speech generation / vocoder weights derived from Chatterbox |
| conds.pt | Optional default voice conditionals |
| t3_nepali_checkpoint.pt | Swarlekha Nepali fine-tuned T3 checkpoint |
| tokenizer_np.json | Swarlekha Nepali tokenizer |
The Chatterbox-derived files provide the foundation for speech tokenization, voice conditioning, and audio generation. The Nepali checkpoint was fine-tuned and the Nepali tokenizer was built by the Swarlekha team.
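As a rough illustration of how the files listed above could be fetched individually, the sketch below uses `hf_hub_download` from the `huggingface_hub` package. It assumes the files are already published under the `indra17/swarlekha` repository with exactly the names in the table; the helper name `download_swarlekha_files` is hypothetical and not part of the Swarlekha code.

```python
# File names are taken from the table above; the repo id is assumed to be
# this model's Hub id. Files that are not yet uploaded will raise an error.
REPO_ID = "indra17/swarlekha"
REPO_FILES = [
    "ve.safetensors",
    "s3gen.safetensors",
    "conds.pt",  # described as optional, so it may be absent
    "t3_nepali_checkpoint.pt",
    "tokenizer_np.json",
]


def download_swarlekha_files(repo_id=REPO_ID, filenames=REPO_FILES):
    """Download each file and return a {filename: local_path} mapping.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so the file list can be inspected without the dependency.
    from huggingface_hub import hf_hub_download

    return {
        name: hf_hub_download(repo_id=repo_id, filename=name)
        for name in filenames
    }
```

Calling `download_swarlekha_files()` would place the files in the local Hugging Face cache and return their paths, which can then be passed to whatever loading code the project provides.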
Intended Use
Swarlekha is intended for research, education, prototyping, and development of Nepali and English speech synthesis systems. It may be useful for:
- Nepali TTS experiments
- Voice cloning research with consented reference audio
- Assistive speech applications
- Speech interface prototypes
- Evaluation of multilingual TTS workflows
Do not use this model to impersonate people, clone voices without permission, generate misleading audio, or violate applicable laws or platform policies.
Quick Start
Install the project dependencies from the Swarlekha code repository:
pip install -r requirements.txt
Example Nepali TTS with voice cloning:
import torch
import torchaudio
from swarlekha_model.tts_nepali import SwarlekhaNepaliTTS
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SwarlekhaNepaliTTS.from_pretrained(device=device)
text = "मेरो देश नेपाल हो। म पोखरामा बस्छु। पोखरा निकै सुन्दर छ।"
reference_audio = "examples/input/indra.wav"
wav = model.generate(text, audio_prompt_path=reference_audio)
torchaudio.save("swarlekha_output.wav", wav, model.sr)
The generated audio is saved as a WAV file at the model's synthesis sample rate.
Note: update the model loading code's Hugging Face repo_id to this repository before release if it still points to ResembleAI/chatterbox.
Training Details
The Nepali adaptation was built by extending the original Chatterbox tokenizer and fine-tuning the T3 component for Nepali text-to-speech.
Tokenizer strategy:
- Preserves the original English tokenizer IDs for compatibility with the pretrained model
- Adds a Nepali language tag, [ne]
- Adds Devanagari characters and Nepali BPE merge tokens
- Supports Nepali punctuation, Devanagari digits, and standard Nepali text normalization
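The ID-preserving idea behind the tokenizer strategy can be sketched with a plain dictionary. This is a minimal illustration, not the project's tokenizer code: the function `extend_vocab`, the toy base vocabulary, and its IDs are all assumptions made for the example. New tokens are appended after the highest existing ID, so every original token keeps its ID and the pretrained embeddings stay aligned.

```python
def extend_vocab(base_vocab, new_tokens):
    """Return a new vocab that keeps every base ID and appends unseen tokens."""
    vocab = dict(base_vocab)  # copy, so the base vocabulary is left untouched
    next_id = max(vocab.values(), default=-1) + 1
    for token in new_tokens:
        if token not in vocab:  # tokens already present keep their old ID
            vocab[token] = next_id
            next_id += 1
    return vocab


# Toy base vocabulary standing in for the English tokenizer (hypothetical IDs).
base = {"[START]": 0, "[STOP]": 1, "a": 2, "b": 3}

# The [ne] language tag plus a few Devanagari characters, the danda, and a digit.
additions = ["[ne]", "क", "ख", "।", "१", "a"]

extended = extend_vocab(base, additions)
print(extended["[ne]"], extended["a"], len(extended))
```

In a real model the text embedding table would also need new rows for the appended IDs; how Swarlekha initializes those rows is not stated here.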
Fine-tuning strategy:
- Base architecture: Chatterbox-derived T3, S3Gen, S3Tokenizer, and VoiceEncoder
- Fine-tuned component: T3 text-to-speech model
- Efficient training: LoRA adapters on attention projections
- Memory optimizations: gradient checkpointing, mixed precision, small batch size with gradient accumulation
- Dataset format: Nepali TTS metadata with utterance ID, speaker ID, text, and audio
- Audio duration filtering: approximately 0.5 to 15 seconds
- Validation split: 5%
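The duration filter and validation split listed above might look like the following sketch. It assumes each metadata row is a dictionary that already carries a precomputed `duration` in seconds; that field, the other key names, the random seed, and the function name `filter_and_split` are illustrative assumptions rather than the project's actual data loader.

```python
import random


def filter_and_split(rows, min_sec=0.5, max_sec=15.0, val_fraction=0.05, seed=0):
    """Drop clips outside [min_sec, max_sec], then split into (train, val)."""
    kept = [r for r in rows if min_sec <= r["duration"] <= max_sec]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(kept)
    n_val = max(1, round(len(kept) * val_fraction)) if kept else 0
    return kept[n_val:], kept[:n_val]


# Synthetic metadata: durations 0.0, 0.25, ..., 19.75 seconds.
rows = [
    {
        "utt_id": f"utt{i:03d}",
        "speaker_id": "spk0",
        "text": "नमस्ते",
        "audio": f"wavs/utt{i:03d}.wav",
        "duration": i * 0.25,
    }
    for i in range(80)
]

train_rows, val_rows = filter_and_split(rows)
print(len(train_rows), len(val_rows))
```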
Example Training Configuration
The training pipeline was designed to run on modest GPU hardware such as an RTX 3050 with 6 GB VRAM:
LoRA rank: 8
LoRA alpha: 16
LoRA dropout: 0.1
Batch size: 1
Gradient accumulation steps: 16
Learning rate: 2e-5
Weight decay: 0.01
Warmup steps: 500
Max steps: 50000
Mixed precision: fp16
Scheduler: cosine
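A few quantities follow directly from these values. The sketch below computes the effective batch size, the LoRA scaling factor under the common alpha / rank convention, and a learning-rate curve. The exact schedule shape is an assumption: linear warmup from zero followed by cosine decay to zero, with steps counted as optimizer updates. The configuration only says "cosine", so the real training script may differ.

```python
import math

BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 16
BASE_LR = 2e-5
WARMUP_STEPS = 500
MAX_STEPS = 50_000
LORA_RANK = 8
LORA_ALPHA = 16

# Samples contributing to each optimizer update.
effective_batch_size = BATCH_SIZE * GRAD_ACCUM_STEPS

# Common LoRA convention: the low-rank update is scaled by alpha / rank.
lora_scaling = LORA_ALPHA / LORA_RANK


def lr_at(step):
    """Learning rate at an optimizer step (assumed warmup + cosine shape)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS  # linear warmup from zero
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))


print(effective_batch_size, lora_scaling)
print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(MAX_STEPS))
```

Under these assumptions the learning rate peaks at 2e-5 right after warmup and decays toward zero at step 50000, while each update averages gradients over 16 samples.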
Audio Requirements
For best voice cloning quality:
- Use a clean reference WAV file
- Prefer 3-10 seconds of speech
- Avoid background noise, music, clipping, or overlapping speakers
- Use reference speech only from a speaker you have permission to clone
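The duration guideline above can be checked before synthesis. The sketch below uses Python's standard `wave` module, so it only handles uncompressed PCM WAV files; the function names and the hard 3 to 10 second bounds are illustrative assumptions, and the check says nothing about noise, clipping, or overlapping speakers.

```python
import wave


def reference_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_good_reference_length(path, min_sec=3.0, max_sec=10.0):
    """True if the clip length falls inside the suggested 3-10 second range."""
    return min_sec <= reference_duration_seconds(path) <= max_sec
```

For example, `is_good_reference_length("examples/input/indra.wav")` would report whether the Quick Start reference clip falls within the suggested range.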
Limitations
Swarlekha is still under development. Known limitations may include:
- Nepali pronunciation quality may vary across dialects, rare words, punctuation patterns, and code-switched text
- Voice cloning quality depends strongly on reference audio quality
- Long-form generation may require sentence-level chunking for best results
- The model may produce artifacts, skipped words, repetitions, or unstable prosody
- English support comes from the Chatterbox-derived base components; Nepali support is the main Swarlekha adaptation
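Sentence-level chunking for long-form text, mentioned in the limitations above, could be approached as in the sketch below. It splits on the Devanagari danda (।) as well as `!` and `?`, then packs sentences into chunks up to a character budget. The function names and the 200-character default are assumptions, not settings taken from the Swarlekha code.

```python
import re

# Split on whitespace that follows a sentence terminator (danda, "!" or "?").
_SENTENCE_END = re.compile(r"(?<=[।!?])\s+")


def split_sentences(text):
    """Split text into sentences, keeping the terminator with each sentence."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole, not cut.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "मेरो देश नेपाल हो। म पोखरामा बस्छु। पोखरा निकै सुन्दर छ।"
for chunk in chunk_text(text, max_chars=20):
    print(chunk)
```

Each chunk could then be passed to `model.generate` separately and the resulting waveforms joined, although how well prosody carries across chunk boundaries would need to be checked by listening.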
Ethical Use
Voice cloning can be sensitive. Users are responsible for obtaining consent from speakers whose voices are used as references. Generated audio should be clearly disclosed where appropriate, especially in public, commercial, political, legal, financial, or identity-sensitive contexts.
Attribution
Swarlekha is based on the Chatterbox architecture and publicly available model components by Resemble AI:
- Chatterbox GitHub: https://github.com/resemble-ai/chatterbox
- Chatterbox Hugging Face model: https://huggingface.co/ResembleAI/chatterbox
We thank Resemble AI, Hugging Face, PyTorch, and the open-source speech synthesis community for making this work possible.
License
This repository is released under the MIT License.
Contact
Indra Prasad Sapkota
Email: bishal.sap21@gmail.com