Fine-tuned Chatterbox Multilingual TTS

This is a fine-tuned version of the ResembleAI/chatterbox multilingual TTS model, adapted with LoRA (Low-Rank Adaptation).

Model Description

[Add a brief description of what you improved or what the model is specialized for]

For example:

  • Improved voice quality for specific languages
  • Better pronunciation for certain accents
  • Optimized for specific use cases

Installation

pip install chatterbox-tts torch torchaudio huggingface_hub

Usage

import torch
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base multilingual model
model = ChatterboxMultilingualTTS.from_pretrained(device=device)

# Download and apply fine-tuned weights
# Option 1: Load t3_cfg (text-to-speech model)
t3_path = hf_hub_download(
    repo_id="YOUR-USERNAME/YOUR-REPO-NAME",
    filename="t3_cfg.pt"
)
t3_state = torch.load(t3_path, map_location="cpu")
model.t3.load_state_dict(t3_state)

# Option 2: Load all fine-tuned components
conds_path = hf_hub_download(repo_id="YOUR-USERNAME/YOUR-REPO-NAME", filename="conds.pt")
s3gen_path = hf_hub_download(repo_id="YOUR-USERNAME/YOUR-REPO-NAME", filename="s3gen.pt")
ve_path = hf_hub_download(repo_id="YOUR-USERNAME/YOUR-REPO-NAME", filename="ve.pt")

model.conds.load_state_dict(torch.load(conds_path, map_location="cpu"))
model.s3gen.load_state_dict(torch.load(s3gen_path, map_location="cpu"))
model.ve.load_state_dict(torch.load(ve_path, map_location="cpu"))

# Generate speech
text = "Hello, this is a test of the fine-tuned model."
wav = model.generate(text, language_id="en")
ta.save("output.wav", wav, model.sr)

With Voice Cloning

# Generate with reference audio
reference_audio = "path/to/reference.wav"
wav = model.generate(
    text,
    language_id="en",
    audio_prompt_path=reference_audio,
    exaggeration=0.5,
    cfg_weight=0.5
)
ta.save("output_cloned.wav", wav, model.sr)

Training Details

  • Base Model: ResembleAI/chatterbox multilingual
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Training Dataset: [Add your dataset info here]
  • Training Duration: [Add training time/epochs]
  • Improvements: [Describe what you optimized for]

Model Files

  • conds.pt - Conditioning model weights
  • s3gen.pt - Speech generation model weights
  • t3_cfg.pt - Text-to-speech transformer weights (main component)
  • ve.pt - Voice encoder weights
  • tokenizer.json - Tokenizer configuration
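When loading every component (Option 2 in the usage example), each checkpoint file maps to one model attribute. The mapping below is a plain-Python summary of that correspondence (the attribute names follow the usage example above; the helper function is illustrative, not part of the chatterbox API):

```python
# Checkpoint files in this repo and the model attribute each one loads into
# (attribute names follow the usage example above).
COMPONENT_FILES = {
    "conds.pt": "conds",   # conditioning weights
    "s3gen.pt": "s3gen",   # speech generation weights
    "t3_cfg.pt": "t3",     # text-to-speech transformer (main component)
    "ve.pt": "ve",         # voice encoder weights
}

def missing_components(downloaded: set[str]) -> set[str]:
    """Return the checkpoint filenames still needed for a full load."""
    return set(COMPONENT_FILES) - downloaded
```

For example, after downloading only `t3_cfg.pt` (Option 1), `missing_components({"t3_cfg.pt"})` reports the three remaining files.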

Supported Languages

Arabic (ar), Chinese (zh), Danish (da), Dutch (nl), English (en), Finnish (fi), French (fr), German (de), Greek (el), Hebrew (he), Hindi (hi), Italian (it), Japanese (ja), Korean (ko), Malay (ms), Norwegian (no), Polish (pl), Portuguese (pt), Russian (ru), Spanish (es), Swahili (sw), Swedish (sv), Turkish (tr)
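The `language_id` passed to `generate()` should be one of the codes above. A small validation helper (hypothetical, not part of the chatterbox API) can catch typos before synthesis starts:

```python
# Language codes supported by the base Chatterbox multilingual model
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "zh": "Chinese", "da": "Danish", "nl": "Dutch",
    "en": "English", "fi": "Finnish", "fr": "French", "de": "German",
    "el": "Greek", "he": "Hebrew", "hi": "Hindi", "it": "Italian",
    "ja": "Japanese", "ko": "Korean", "ms": "Malay", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "es": "Spanish",
    "sw": "Swahili", "sv": "Swedish", "tr": "Turkish",
}

def check_language_id(language_id: str) -> str:
    """Return the code unchanged if supported, otherwise raise ValueError."""
    if language_id not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language_id {language_id!r}; "
            f"choose one of: {', '.join(sorted(SUPPORTED_LANGUAGES))}"
        )
    return language_id
```

Usage: `wav = model.generate(text, language_id=check_language_id("en"))`.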

Citation

If you use this model, please cite the original Chatterbox work:

@misc{chatterboxtts2025,
  author = {{Resemble AI}},
  title = {{Chatterbox-TTS}},
  year = {2025},
  howpublished = {\url{https://github.com/resemble-ai/chatterbox}},
  note = {GitHub repository}
}

License

This model inherits the MIT license from the base Chatterbox model.
