Swarlekha

Swarlekha is a Nepali and English text-to-speech and voice cloning model based on Resemble AI's open-source Chatterbox architecture. This repository contains the base model components derived from Chatterbox together with Swarlekha's Nepali fine-tuned T3 checkpoint and Nepali tokenizer.

The project is currently under active development.

Model Summary

Swarlekha supports:

Text-to-speech generation
Zero-shot voice cloning from a short reference audio sample
Nepali text-to-speech using a custom Nepali tokenizer and fine-tuned checkpoint
English text-to-speech through the Chatterbox-derived base components
Voice conditioning with speaker embeddings and prompt speech tokens

The model pipeline follows the Chatterbox-style architecture:

Text -> Tokenizer -> T3 -> Speech Tokens -> S3Gen -> Audio
                         ^
                  VoiceEncoder / reference audio

Repository Contents

This model repository is intended to host:

File	Description
`ve.safetensors`	Voice encoder weights derived from Chatterbox
`s3gen.safetensors`	Speech generation / vocoder weights derived from Chatterbox
`conds.pt`	Optional default voice conditionals
`t3_nepali_checkpoint.pt`	Swarlekha Nepali fine-tuned T3 checkpoint
`tokenizer_np.json`	Swarlekha Nepali tokenizer

The Chatterbox-derived files provide the foundation for speech tokenization, voice conditioning, and audio generation. The Swarlekha team fine-tuned the Nepali T3 checkpoint and trained the Nepali tokenizer.
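
As a rough illustration, the files in the table could be fetched with `huggingface_hub` as sketched below. The repo ID `indra17/swarlekha`, the helper name `download_swarlekha_files`, and the decision to tolerate a missing `conds.pt` are assumptions for this example; none of them is part of the Swarlekha code.

```python
from pathlib import Path

# File names taken from the table above.
SWARLEKHA_FILES = [
    "ve.safetensors",
    "s3gen.safetensors",
    "conds.pt",
    "t3_nepali_checkpoint.pt",
    "tokenizer_np.json",
]

# conds.pt is described as optional, so a missing copy is not treated as fatal.
OPTIONAL_FILES = {"conds.pt"}


def download_swarlekha_files(repo_id: str = "indra17/swarlekha") -> dict[str, Path]:
    """Download each listed file and return a mapping of file name to local path.

    The default repo_id is an assumption; point it at the actual repository.
    """
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    from huggingface_hub.utils import EntryNotFoundError

    paths: dict[str, Path] = {}
    for name in SWARLEKHA_FILES:
        try:
            paths[name] = Path(hf_hub_download(repo_id=repo_id, filename=name))
        except EntryNotFoundError:
            # Only optional files may be absent from the repository.
            if name not in OPTIONAL_FILES:
                raise
    return paths
```

Calling `download_swarlekha_files()` requires network access and stores the files in the local Hugging Face cache.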

Intended Use

Swarlekha is intended for research, education, prototyping, and development of Nepali and English speech synthesis systems. It may be useful for:

Nepali TTS experiments
Voice cloning research with consented reference audio
Assistive speech applications
Speech interface prototypes
Evaluation of multilingual TTS workflows

Do not use this model to impersonate people, clone voices without permission, generate misleading audio, or violate applicable laws or platform policies.

Quick Start

Install the project dependencies from the Swarlekha code repository:

pip install -r requirements.txt

Example Nepali TTS with voice cloning:

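import torch
import torchaudio
from swarlekha_model.tts_nepali import SwarlekhaNepaliTTS

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained Swarlekha Nepali model onto the chosen device.
model = SwarlekhaNepaliTTS.from_pretrained(device=device)

text = "मेरो देश नेपाल हो। म पोखरामा बस्छु। पोखरा निकै सुन्दर छ।"
# Reference recording of the voice to clone (use consented audio only).
reference_audio = "examples/input/indra.wav"

# Synthesize the text in the reference speaker's voice.
wav = model.generate(text, audio_prompt_path=reference_audio)
# Save at the model's synthesis sample rate.
torchaudio.save("swarlekha_output.wav", wav, model.sr)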

The generated audio is saved as a WAV file at the model's synthesis sample rate.

Note: if the model loading code's Hugging Face repo_id still points to ResembleAI/chatterbox, update it to this repository before release.

Training Details

The Nepali adaptation was built by extending the original Chatterbox tokenizer and fine-tuning the T3 component for Nepali text-to-speech.

Tokenizer strategy:

Preserves the original English tokenizer IDs for compatibility with the pretrained model
Adds a Nepali language tag, [ne]
Adds Devanagari characters and Nepali BPE merge tokens
Supports Nepali punctuation, Devanagari digits, and standard Nepali text normalization
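
The ID-preserving extension described above can be sketched in plain Python. This is a toy illustration, not the Swarlekha tokenizer code: the function names, the four-entry base vocabulary, the use of NFC normalization, and the choice to prepend `[ne]` directly to the text are all assumptions.

```python
import unicodedata


def extend_vocab(base_vocab: dict[str, int], new_tokens: list[str]) -> dict[str, int]:
    """Append new tokens after the highest existing ID, leaving old IDs untouched."""
    vocab = dict(base_vocab)
    next_id = max(vocab.values(), default=-1) + 1
    for token in new_tokens:
        if token not in vocab:  # never reassign an existing token
            vocab[token] = next_id
            next_id += 1
    return vocab


def prepare_nepali_text(text: str) -> str:
    """Normalize Unicode and prepend the Nepali language tag (assumed placement)."""
    # NFC maps canonically equivalent Devanagari sequences to a single form.
    normalized = unicodedata.normalize("NFC", text.strip())
    return f"[ne]{normalized}"


# Toy vocabulary standing in for the pretrained English tokenizer.
base = {"[START]": 0, "[STOP]": 1, "a": 2, "b": 3}
# New entries: language tag, Devanagari letters, danda, a Devanagari digit.
# "a" already exists, so it keeps its original ID.
extended = extend_vocab(base, ["[ne]", "क", "ख", "।", "०", "a"])
```

Because every original token keeps its ID, embeddings learned by the pretrained model still line up with the same entries; only the rows for the new IDs need to be learned.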

Fine-tuning strategy:

Base architecture: Chatterbox-derived T3, S3Gen, S3Tokenizer, and VoiceEncoder
Fine-tuned component: T3 text-to-speech model
Efficient training: LoRA adapters on attention projections
Memory optimizations: gradient checkpointing, mixed precision, small batch size with gradient accumulation
Dataset format: Nepali TTS metadata with utterance ID, speaker ID, text, and audio
Audio duration filtering: approximately 0.5 to 15 seconds
Validation split: 5%
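
The duration filter and validation split listed above could look roughly like the following. The `Utterance` fields mirror the metadata columns named in the list; the precomputed `duration_s` field, the function names, and the seeded shuffle are assumptions for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Utterance:
    utt_id: str
    speaker_id: str
    text: str
    audio_path: str
    duration_s: float  # assumed to be precomputed from the audio file


def filter_by_duration(items, min_s=0.5, max_s=15.0):
    """Keep utterances whose duration lies within [min_s, max_s] seconds."""
    return [u for u in items if min_s <= u.duration_s <= max_s]


def train_val_split(items, val_fraction=0.05, seed=0):
    """Shuffle deterministically and hold out val_fraction of items for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one item whenever the dataset is non-empty.
    n_val = max(1, round(len(shuffled) * val_fraction)) if shuffled else 0
    return shuffled[n_val:], shuffled[:n_val]
```

A fixed seed keeps the validation set stable across runs, which makes validation losses comparable between experiments.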

Example Training Configuration

The training pipeline was designed to run on modest GPU hardware such as an RTX 3050 with 6 GB VRAM:

LoRA rank: 8
LoRA alpha: 16
LoRA dropout: 0.1
Batch size: 1
Gradient accumulation steps: 16
Learning rate: 2e-5
Weight decay: 0.01
Warmup steps: 500
Max steps: 50000
Mixed precision: fp16
Scheduler: cosine
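
To make the interaction between these values concrete, here is a small sketch that derives the effective batch size, the LoRA scaling factor, and a learning-rate curve from the configuration above. The class and function names are hypothetical, and the schedule shape (linear warmup followed by cosine decay to zero at max steps) is an assumption; the source states only "cosine" with 500 warmup steps.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    lora_rank: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.1
    batch_size: int = 1
    grad_accum_steps: int = 16
    learning_rate: float = 2e-5
    weight_decay: float = 0.01
    warmup_steps: int = 500
    max_steps: int = 50_000

    @property
    def effective_batch_size(self) -> int:
        # Gradients are accumulated over grad_accum_steps micro-batches per update.
        return self.batch_size * self.grad_accum_steps

    @property
    def lora_scale(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank


def lr_at(step: int, cfg: TrainConfig) -> float:
    """Assumed schedule: linear warmup to the peak rate, then cosine decay to zero."""
    if step < cfg.warmup_steps:
        return cfg.learning_rate * step / cfg.warmup_steps
    progress = (step - cfg.warmup_steps) / (cfg.max_steps - cfg.warmup_steps)
    return cfg.learning_rate * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

With the listed values, each optimizer update sees 1 × 16 = 16 utterances and the LoRA update is scaled by 16 / 8 = 2.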

Audio Requirements

For best voice cloning quality:

Use a clean reference WAV file
Prefer 3-10 seconds of speech
Avoid background noise, music, clipping, or overlapping speakers
Use reference speech only from a speaker who has given you permission to clone their voice
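
A reference file can be screened against the duration and clipping guidelines above before it is passed to the model. The sketch below uses only the Python standard library and handles 16-bit PCM WAV files; the function name and the 0.1% clipping threshold are assumptions, and it does not detect background noise, music, or overlapping speakers.

```python
import struct
import wave


def check_reference_wav(path, min_s=3.0, max_s=10.0, clip_ratio=0.001):
    """Return a list of warnings for a 16-bit PCM WAV reference file."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            return ["expected 16-bit PCM audio"]
        n_frames = wf.getnframes()
        duration = n_frames / wf.getframerate()
        raw = wf.readframes(n_frames)

    # Decode little-endian signed 16-bit samples (all channels interleaved).
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)

    warnings = []
    if not min_s <= duration <= max_s:
        warnings.append(f"duration {duration:.1f}s is outside {min_s}-{max_s}s")
    # Samples pinned at full scale suggest the recording clipped.
    clipped = sum(1 for s in samples if abs(s) >= 32767)
    if samples and clipped / len(samples) > clip_ratio:
        warnings.append("possible clipping")
    return warnings
```

An empty list means the file passed both checks; for example, `check_reference_wav("examples/input/indra.wav")` would screen the reference used in the Quick Start example.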

Limitations

Swarlekha is still under development. Known limitations include:

Nepali pronunciation quality may vary across dialects, rare words, punctuation patterns, and code-switched text
Voice cloning quality depends strongly on reference audio quality
Long-form generation may require sentence-level chunking for best results
The model may produce artifacts, skipped words, repetitions, or unstable prosody
English support comes from the Chatterbox-derived base components; Nepali support is the main Swarlekha adaptation
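
Sentence-level chunking for long-form text can be sketched as follows. The splitter treats the Devanagari danda (।), double danda (॥), `?`, and `!` as sentence terminators when they are followed by whitespace; the function names and the 200-character default are assumptions, not part of the Swarlekha code.

```python
import re

# Split on whitespace that follows a danda, double danda, question mark,
# or exclamation mark. The terminator stays attached to its sentence.
_SENTENCE_BREAK = re.compile(r"(?<=[।॥?!])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the terminating punctuation."""
    return [s.strip() for s in _SENTENCE_BREAK.split(text) if s.strip()]


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the resulting waveforms joined in order. Text that omits the space after a danda is not split by this sketch.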

Ethical Use

Voice cloning can be sensitive. Users are responsible for obtaining consent from speakers whose voices are used as references. Generated audio should be clearly disclosed where appropriate, especially in public, commercial, political, legal, financial, or identity-sensitive contexts.

Attribution

Swarlekha is based on the Chatterbox architecture and publicly available model components by Resemble AI:

Chatterbox GitHub: https://github.com/resemble-ai/chatterbox
Chatterbox Hugging Face model: https://huggingface.co/ResembleAI/chatterbox

We thank Resemble AI, Hugging Face, PyTorch, and the open-source speech synthesis community for making this work possible.

License

This repository is released under the MIT License.

Contact

Indra Prasad Sapkota
Email: bishal.sap21@gmail.com
