Swarlekha

Swarlekha is a Nepali and English text-to-speech and voice cloning model based on Resemble AI's open-source Chatterbox architecture. This repository contains the base model components derived from Chatterbox together with Swarlekha's Nepali fine-tuned T3 checkpoint and Nepali tokenizer.

The project is currently under active development.

Model Summary

Swarlekha supports:

  • Text-to-speech generation
  • Zero-shot voice cloning from a short reference audio sample
  • Nepali text-to-speech using a custom Nepali tokenizer and fine-tuned checkpoint
  • English text-to-speech through the Chatterbox-derived base components
  • Voice conditioning with speaker embeddings and prompt speech tokens

The model pipeline follows the Chatterbox-style architecture:

Text -> Tokenizer -> T3 -> Speech Tokens -> S3Gen -> Audio
                         ^
                  VoiceEncoder / reference audio
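
The diagram can be read as a composition of four stages. The sketch below only illustrates that data flow: every function and parameter name in it is hypothetical and stands in for a real component, and it mirrors the diagram rather than the actual Swarlekha or Chatterbox API.

```python
from typing import Callable, Sequence


def synthesize(
    text: str,
    reference_audio: Sequence[float],
    tokenize: Callable,      # text -> text token IDs
    encode_voice: Callable,  # reference audio -> voice conditioning
    t3: Callable,            # (text tokens, conditioning) -> speech tokens
    s3gen: Callable,         # speech tokens -> waveform samples
):
    """Hypothetical outline of the pipeline diagram; not the real API."""
    text_tokens = tokenize(text)
    conditioning = encode_voice(reference_audio)
    speech_tokens = t3(text_tokens, conditioning)
    return s3gen(speech_tokens)
```

In practice these stages are wrapped by a single call, as in the Quick Start example.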

Repository Contents

This model repository is intended to host:

File                      Description
ve.safetensors            Voice encoder weights derived from Chatterbox
s3gen.safetensors         Speech generation / vocoder weights derived from Chatterbox
conds.pt                  Optional default voice conditionals
t3_nepali_checkpoint.pt   Swarlekha Nepali fine-tuned T3 checkpoint
tokenizer_np.json         Swarlekha Nepali tokenizer

The Chatterbox-derived files provide the foundation for speech tokenization, voice conditioning, and audio generation. The Swarlekha team fine-tuned the Nepali T3 checkpoint and trained the Nepali tokenizer.
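
When working from a local copy of these files, a quick check can confirm that nothing is missing before loading. This is a minimal sketch: the file names come from the list above, while the function name and the idea of a local model directory are assumptions made for illustration.

```python
from pathlib import Path

# File names as listed in this repository; conds.pt is described as optional.
REQUIRED_FILES = (
    "ve.safetensors",
    "s3gen.safetensors",
    "t3_nepali_checkpoint.pt",
    "tokenizer_np.json",
)
OPTIONAL_FILES = ("conds.pt",)


def missing_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means all required files are present; the check does not validate file contents.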

Intended Use

Swarlekha is intended for research, education, prototyping, and development of Nepali and English speech synthesis systems. It may be useful for:

  • Nepali TTS experiments
  • Voice cloning research with consented reference audio
  • Assistive speech applications
  • Speech interface prototypes
  • Evaluation of multilingual TTS workflows

Do not use this model to impersonate people, clone voices without permission, generate misleading audio, or violate applicable laws or platform policies.

Quick Start

Install the project dependencies from the Swarlekha code repository:

pip install -r requirements.txt

Example Nepali TTS with voice cloning:

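import torch
import torchaudio
from swarlekha_model.tts_nepali import SwarlekhaNepaliTTS

# Use the GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained Swarlekha Nepali TTS model.
model = SwarlekhaNepaliTTS.from_pretrained(device=device)

text = "मेरो देश नेपाल हो। म पोखरामा बस्छु। पोखरा निकै सुन्दर छ।"
# Short reference clip of the voice to clone.
reference_audio = "examples/input/indra.wav"

# Synthesize speech in the reference voice and save it at the model's sample rate.
wav = model.generate(text, audio_prompt_path=reference_audio)
torchaudio.save("swarlekha_output.wav", wav, model.sr)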

The generated audio is saved as a WAV file at the model's synthesis sample rate.

Note: before release, check the Hugging Face repo_id in the model loading code. If it still points to ResembleAI/chatterbox, change it to this repository.

Training Details

The Nepali adaptation was built by extending the original Chatterbox tokenizer and fine-tuning the T3 component for Nepali text-to-speech.

Tokenizer strategy:

  • Preserves the original English tokenizer IDs for compatibility with the pretrained model
  • Adds a Nepali language tag, [ne]
  • Adds Devanagari characters and Nepali BPE merge tokens
  • Supports Nepali punctuation, Devanagari digits, and standard Nepali text normalization
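
The ID-preserving idea can be sketched as follows. This is a toy illustration, not the actual tokenizer-building code: the base vocabulary, the function name, and the example tokens (other than the [ne] tag) are hypothetical.

```python
def extend_vocab(base_vocab, new_tokens):
    """Append new tokens after the highest existing ID, leaving old IDs untouched."""
    vocab = dict(base_vocab)
    next_id = max(vocab.values(), default=-1) + 1
    for token in new_tokens:
        if token not in vocab:  # never reassign an existing token
            vocab[token] = next_id
            next_id += 1
    return vocab


# Toy base vocabulary standing in for the English tokenizer (illustrative only).
base = {"[START]": 0, "[STOP]": 1, "a": 2, "b": 3}
extended = extend_vocab(base, ["[ne]", "क", "ख", "a"])
```

Because new entries only receive IDs above the existing range, embeddings learned for the original tokens remain valid, and only the rows for the added tokens need to be learned during fine-tuning.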

Fine-tuning strategy:

  • Base architecture: Chatterbox-derived T3, S3Gen, S3Tokenizer, and VoiceEncoder
  • Fine-tuned component: T3 text-to-speech model
  • Efficient training: LoRA adapters on attention projections
  • Memory optimizations: gradient checkpointing, mixed precision, small batch size with gradient accumulation
  • Dataset format: Nepali TTS metadata with utterance ID, speaker ID, text, and audio
  • Audio duration filtering: approximately 0.5 to 15 seconds
  • Validation split: 5%
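
The duration filter and validation split can be sketched as below. This is an assumption-laden illustration rather than the project's data loader: the record layout, the precomputed "duration" field, the function name, and the seeded random shuffle are all hypothetical.

```python
import random


def filter_and_split(records, min_sec=0.5, max_sec=15.0, val_fraction=0.05, seed=0):
    """Drop clips outside the duration range, then hold out a validation share."""
    kept = [r for r in records if min_sec <= r["duration"] <= max_sec]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(kept)
    n_val = round(len(kept) * val_fraction)
    return kept[n_val:], kept[:n_val]  # (train, validation)
```

A split like this is random over utterances. If the dataset has few speakers, splitting by speaker ID instead may give a more honest estimate of voice cloning quality.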

Example Training Configuration

The training pipeline was designed to run on modest GPU hardware such as an RTX 3050 with 6 GB VRAM:

LoRA rank: 8
LoRA alpha: 16
LoRA dropout: 0.1
Batch size: 1
Gradient accumulation steps: 16
Learning rate: 2e-5
Weight decay: 0.01
Warmup steps: 500
Max steps: 50000
Mixed precision: fp16
Scheduler: cosine
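
A few quantities follow directly from these values. The sketch below derives them and shows one plausible shape for the learning rate schedule. The linear warmup and the decay to zero are assumptions, since the card only states "cosine"; the alpha / rank scale is the usual LoRA convention and is not confirmed by the training script. Whether max steps counts optimizer updates or micro-batches is also not stated.

```python
import math

CONFIG = {
    "lora_rank": 8,
    "lora_alpha": 16,
    "batch_size": 1,
    "grad_accum_steps": 16,
    "learning_rate": 2e-5,
    "warmup_steps": 500,
    "max_steps": 50000,
}

# Samples contributing to each optimizer update.
effective_batch = CONFIG["batch_size"] * CONFIG["grad_accum_steps"]
# Conventional LoRA scaling factor applied to the adapter output.
lora_scale = CONFIG["lora_alpha"] / CONFIG["lora_rank"]


def lr_at(step, cfg=CONFIG):
    """Linear warmup followed by cosine decay to zero (assumed schedule shape)."""
    peak, warmup, total = cfg["learning_rate"], cfg["warmup_steps"], cfg["max_steps"]
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With these settings the effective batch size is 16 and the LoRA scale is 2.0, so gradient accumulation recovers a moderate batch size within 6 GB of VRAM.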

Audio Requirements

For best voice cloning quality:

  • Use a clean reference WAV file
  • Prefer 3-10 seconds of speech
  • Avoid background noise, music, clipping, or overlapping speakers
  • Use reference speech only from a speaker who has given you permission to clone their voice
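
The length recommendation is easy to check before cloning. This is a minimal sketch using Python's standard wave module, which reads uncompressed PCM WAV files only; the function names are hypothetical, and the check does not detect noise, clipping, or overlapping speakers.

```python
import wave


def reference_duration_seconds(wav_path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(wav_path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_recommended_length(wav_path, min_sec=3.0, max_sec=10.0):
    """True if the clip falls in the 3-10 second range suggested above."""
    return min_sec <= reference_duration_seconds(wav_path) <= max_sec
```

For example, is_recommended_length("examples/input/indra.wav") would report whether the Quick Start reference clip is within the suggested range.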

Limitations

Swarlekha is still under development. Known limitations include:

  • Nepali pronunciation quality may vary across dialects, rare words, punctuation patterns, and code-switched text
  • Voice cloning quality depends strongly on reference audio quality
  • Long-form generation may require sentence-level chunking for best results
  • The model may produce artifacts, skipped words, repetitions, or unstable prosody
  • English support comes from the Chatterbox-derived base components; Nepali support is the main Swarlekha adaptation
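
Sentence-level chunking for long-form text can be sketched as below. This is one simple approach, assumed for illustration: it splits after a danda (।), question mark, or exclamation mark, and deliberately ignores the Latin full stop to avoid breaking on abbreviations and decimals. The function name is hypothetical.

```python
import re

# Split on whitespace that follows a danda, question mark, or exclamation mark.
_SENTENCE_END = re.compile(r"(?<=[।?!])\s+")


def split_sentences(text):
    """Split text into sentences, keeping the terminating punctuation."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
```

Each chunk could then be passed to model.generate separately and the resulting waveforms concatenated along the time axis, at the cost of possible prosody breaks between chunks.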

Ethical Use

Voice cloning can be sensitive. Users are responsible for obtaining consent from speakers whose voices are used as references. Generated audio should be clearly disclosed where appropriate, especially in public, commercial, political, legal, financial, or identity-sensitive contexts.

Attribution

Swarlekha is based on the Chatterbox architecture and publicly available model components by Resemble AI.

We thank Resemble AI, Hugging Face, PyTorch, and the open-source speech synthesis community for making this work possible.

License

This repository is released under the MIT License.

Contact

Indra Prasad Sapkota
Email: bishal.sap21@gmail.com
