Fine-tuned Chatterbox Multilingual TTS

This is a fine-tuned version of the ResembleAI/chatterbox multilingual TTS model, adapted with LoRA (Low-Rank Adaptation).

Model Description

[Add a brief description of what you improved or what the model is specialized for]

For example:

  • Improved voice quality for specific languages
  • Better pronunciation for certain accents
  • Optimized for specific use cases

Installation

pip install chatterbox-tts torch torchaudio huggingface_hub

Usage

import torch
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base multilingual model
model = ChatterboxMultilingualTTS.from_pretrained(device=device)

# Download and apply fine-tuned weights
# Option 1: Load t3_cfg (text-to-speech model)
t3_path = hf_hub_download(
    repo_id="YOUR-USERNAME/YOUR-REPO-NAME",
    filename="t3_cfg.pt"
)
t3_state = torch.load(t3_path, map_location="cpu")
model.t3.load_state_dict(t3_state)

# Option 2: Load all fine-tuned components
conds_path = hf_hub_download(repo_id="YOUR-USERNAME/YOUR-REPO-NAME", filename="conds.pt")
s3gen_path = hf_hub_download(repo_id="YOUR-USERNAME/YOUR-REPO-NAME", filename="s3gen.pt")
ve_path = hf_hub_download(repo_id="YOUR-USERNAME/YOUR-REPO-NAME", filename="ve.pt")

model.conds.load_state_dict(torch.load(conds_path, map_location="cpu"))
model.s3gen.load_state_dict(torch.load(s3gen_path, map_location="cpu"))
model.ve.load_state_dict(torch.load(ve_path, map_location="cpu"))

# Generate speech
text = "Hello, this is a test of the fine-tuned model."
wav = model.generate(text, language_id="en")
ta.save("output.wav", wav, model.sr)

With Voice Cloning

# Generate with reference audio
reference_audio = "path/to/reference.wav"
wav = model.generate(
    text,
    language_id="en",
    audio_prompt_path=reference_audio,
    exaggeration=0.5,
    cfg_weight=0.5
)
ta.save("output_cloned.wav", wav, model.sr)

Training Details

  • Base Model: ResembleAI/chatterbox multilingual
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Training Dataset: [Add your dataset info here]
  • Training Duration: [Add training time/epochs]
  • Improvements: [Describe what you optimized for]

Model Files

  • conds.pt - Conditioning model weights
  • s3gen.pt - Speech generation model weights
  • t3_cfg.pt - Text-to-speech transformer weights (main component)
  • ve.pt - Voice encoder weights
  • tokenizer.json - Tokenizer configuration
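When loading every component (Option 2 in the usage example), each checkpoint file maps to one model attribute. The mapping below is a plain-Python summary of that correspondence (the attribute names follow the usage example above; the helper function is illustrative, not part of the chatterbox API):

```python
# Checkpoint files in this repo and the model attribute each one loads into
# (attribute names follow the usage example above).
COMPONENT_FILES = {
    "conds.pt": "conds",   # conditioning weights
    "s3gen.pt": "s3gen",   # speech generation weights
    "t3_cfg.pt": "t3",     # text-to-speech transformer (main component)
    "ve.pt": "ve",         # voice encoder weights
}

def missing_components(downloaded: set[str]) -> set[str]:
    """Return the checkpoint filenames still needed for a full load."""
    return set(COMPONENT_FILES) - downloaded
```

For example, after downloading only `t3_cfg.pt` (Option 1), `missing_components({"t3_cfg.pt"})` reports the three remaining files.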

Supported Languages

Arabic (ar), Chinese (zh), Danish (da), Dutch (nl), English (en), Finnish (fi), French (fr), German (de), Greek (el), Hebrew (he), Hindi (hi), Italian (it), Japanese (ja), Korean (ko), Malay (ms), Norwegian (no), Polish (pl), Portuguese (pt), Russian (ru), Spanish (es), Swahili (sw), Swedish (sv), Turkish (tr)
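The `language_id` passed to `generate()` should be one of the codes above. A small validation helper (hypothetical, not part of the chatterbox API) can catch typos before synthesis starts:

```python
# Language codes supported by the base Chatterbox multilingual model
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "zh": "Chinese", "da": "Danish", "nl": "Dutch",
    "en": "English", "fi": "Finnish", "fr": "French", "de": "German",
    "el": "Greek", "he": "Hebrew", "hi": "Hindi", "it": "Italian",
    "ja": "Japanese", "ko": "Korean", "ms": "Malay", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "es": "Spanish",
    "sw": "Swahili", "sv": "Swedish", "tr": "Turkish",
}

def check_language_id(language_id: str) -> str:
    """Return the code unchanged if supported, otherwise raise ValueError."""
    if language_id not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language_id {language_id!r}; "
            f"choose one of: {', '.join(sorted(SUPPORTED_LANGUAGES))}"
        )
    return language_id
```

Usage: `wav = model.generate(text, language_id=check_language_id("en"))`.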

Citation

If you use this model, please cite the original Chatterbox work:

@misc{chatterboxtts2025,
  author = {{Resemble AI}},
  title = {{Chatterbox-TTS}},
  year = {2025},
  howpublished = {\url{https://github.com/resemble-ai/chatterbox}},
  note = {GitHub repository}
}

License

This model inherits the MIT license from the base Chatterbox model.
