Fine-tuned Chatterbox Multilingual TTS
This is a fine-tuned version of the ResembleAI/chatterbox multilingual model, trained with LoRA (Low-Rank Adaptation).
Model Description
[Add a brief description of what you improved or what the model is specialized for]
For example:
- Improved voice quality for specific languages
- Better pronunciation for certain accents
- Optimized for specific use cases
Installation
```bash
pip install chatterbox-tts torch torchaudio huggingface_hub
```
Usage
```python
import torch
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS, Conditionals
from huggingface_hub import hf_hub_download

REPO_ID = "juliardi/chatterbox-multilingual-finetuned"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base multilingual model
model = ChatterboxMultilingualTTS.from_pretrained(device=device)

# Option 1: replace only the T3 transformer (text tokens -> speech tokens),
# which is the main fine-tuned component
t3_path = hf_hub_download(repo_id=REPO_ID, filename="t3_cfg.pt")
t3_state = torch.load(t3_path, map_location="cpu")
model.t3.load_state_dict(t3_state)

# Option 2: additionally replace the remaining components
s3gen_path = hf_hub_download(repo_id=REPO_ID, filename="s3gen.pt")
ve_path = hf_hub_download(repo_id=REPO_ID, filename="ve.pt")
conds_path = hf_hub_download(repo_id=REPO_ID, filename="conds.pt")
model.s3gen.load_state_dict(torch.load(s3gen_path, map_location="cpu"))
model.ve.load_state_dict(torch.load(ve_path, map_location="cpu"))
# conds.pt stores the default-voice conditionals, not an nn.Module state dict,
# so it is loaded with Conditionals.load instead of load_state_dict
model.conds = Conditionals.load(conds_path).to(device)

# Generate speech
text = "Hello, this is a test of the fine-tuned model."
wav = model.generate(text, language_id="en")
ta.save("output.wav", wav, model.sr)
```
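If `model.t3.load_state_dict` fails with missing or unexpected keys, it helps to see exactly which keys disagree. The helper below is a minimal sketch, not part of the chatterbox API: `load_weights_checked` is a hypothetical name, and the assumption that some checkpoints nest their weights under a `"state_dict"` or `"model"` key is a common convention, not something this card states.

```python
from collections.abc import Mapping

import torch
from torch import nn


def load_weights_checked(module: nn.Module, path: str):
    """Hypothetical helper: load a .pt state dict into `module` non-strictly and
    return (missing_keys, unexpected_keys) instead of raising on key mismatches.

    Note: shape mismatches still raise a RuntimeError even with strict=False.
    """
    state = torch.load(path, map_location="cpu", weights_only=True)
    # Assumption: some training scripts nest the weights under one of these keys
    for key in ("state_dict", "model"):
        if isinstance(state, Mapping) and isinstance(state.get(key), Mapping):
            state = state[key]
            break
    result = module.load_state_dict(state, strict=False)
    return list(result.missing_keys), list(result.unexpected_keys)
```

For example, `missing, unexpected = load_weights_checked(model.t3, t3_path)`; if either list is non-empty, the file does not match the base T3 layout and generation would run with partially loaded weights.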
With Voice Cloning
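```python
# Generate with a reference audio clip whose voice should be cloned
reference_audio = "path/to/reference.wav"
wav = model.generate(
    text,
    language_id="en",
    audio_prompt_path=reference_audio,
    exaggeration=0.5,  # emotion/intensity exaggeration (0.5 is the default)
    cfg_weight=0.5,    # classifier-free guidance weight (0.5 is the default)
)
ta.save("output_cloned.wav", wav, model.sr)
```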
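To render one cloned voice in several languages, the `generate` call above can be looped over `(language_id, text)` pairs. This is a minimal sketch assuming `model.generate` accepts the keyword arguments used above; `clone_across_languages` is a hypothetical helper, and the language codes are copied from the Supported Languages list in this card. All codes are validated before any audio is generated, so a typo fails fast.

```python
from typing import Any, Iterable, List, Tuple

# Language codes from the "Supported Languages" list in this card
SUPPORTED_LANGUAGE_IDS = {
    "ar", "zh", "da", "nl", "en", "fi", "fr", "de", "el", "he", "hi", "it",
    "ja", "ko", "ms", "no", "pl", "pt", "ru", "es", "sw", "sv", "tr",
}


def clone_across_languages(
    model: Any,
    items: Iterable[Tuple[str, str]],
    reference_audio: str,
    exaggeration: float = 0.5,
    cfg_weight: float = 0.5,
) -> List[Tuple[str, Any]]:
    """Hypothetical helper: synthesize each (language_id, text) pair with one
    reference voice and return a list of (language_id, wav) results."""
    items = list(items)
    # Validate every code up front so no generation time is wasted on a bad batch
    unknown = sorted({lang for lang, _ in items} - SUPPORTED_LANGUAGE_IDS)
    if unknown:
        raise ValueError(f"Unsupported language_id(s): {unknown}")
    results = []
    for language_id, text in items:
        wav = model.generate(
            text,
            language_id=language_id,
            audio_prompt_path=reference_audio,
            exaggeration=exaggeration,
            cfg_weight=cfg_weight,
        )
        results.append((language_id, wav))
    return results
```

Each result can then be written with `ta.save(f"cloned_{lang}.wav", wav, model.sr)` as in the example above.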
Training Details
- Base Model: ResembleAI/chatterbox multilingual
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Training Dataset: [Add your dataset info here]
- Training Duration: [Add training time/epochs]
- Improvements: [Describe what you optimized for]
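LoRA keeps a pretrained weight matrix W frozen and learns a low-rank update, so an adapted linear layer computes y = W x + (alpha / r) B A x, where A has r rows and B has r columns. After training, the update can be folded into W. The usage code above loads t3_cfg.pt directly into `model.t3` with `load_state_dict`, which suggests the adapter was merged before saving; that is an inference, not something this card states. The sketch below is a generic PyTorch illustration of LoRA and merging, not the training code used for this model; `LoRALinear`, the rank `r` and the scaling `alpha` are assumptions.

```python
import torch
from torch import nn


class LoRALinear(nn.Module):
    """Generic LoRA wrapper around a frozen nn.Linear (illustrative only)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained W (and bias) stay frozen
        self.scaling = alpha / r
        # A: (r x in_features), B: (out_features x r). B starts at zero, so the
        # wrapped layer initially computes exactly what the base layer does.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + b plus the scaled low-rank path B A x
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold the update into a plain nn.Linear: W' = W + scaling * B @ A."""
        merged = nn.Linear(
            self.base.in_features,
            self.base.out_features,
            bias=self.base.bias is not None,
        )
        merged.weight.copy_(self.base.weight + self.scaling * self.lora_B @ self.lora_A)
        if self.base.bias is not None:
            merged.bias.copy_(self.base.bias)
        return merged
```

Because the merged layer has only the ordinary `weight` and `bias` keys, a model built from merged layers has the same state-dict layout as the base model.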
Model Files
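- conds.pt - Built-in default-voice conditionals (speaker embedding and prompt tokens) used when no reference audio is given
- s3gen.pt - S3Gen speech generator weights (speech tokens to waveform)
- t3_cfg.pt - T3 transformer weights (text tokens to speech tokens; the main fine-tuned component)
- ve.pt - Voice encoder weights
- tokenizer.json - Tokenizer configuration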
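Before loading any of these files, it can help to check what they contain, for example whether t3_cfg.pt holds full weights or leftover LoRA adapter tensors. A minimal sketch: `summarize_checkpoint` is a hypothetical helper, and treating keys containing `lora_` as adapter tensors is an assumption based on common naming (such as PEFT's `lora_A` / `lora_B`). It walks nested dicts because conds.pt is not a flat state dict.

```python
import torch


def summarize_checkpoint(path: str) -> dict:
    """Hypothetical helper: count tensors and elements in a .pt file and list
    keys that look like LoRA adapter tensors."""
    obj = torch.load(path, map_location="cpu", weights_only=True)
    tensors = {}

    def walk(node, prefix=""):
        # Recurse through nested dicts, recording tensors under dotted key paths
        if isinstance(node, dict):
            for k, v in node.items():
                walk(v, f"{prefix}{k}.")
        elif torch.is_tensor(node):
            tensors[prefix.rstrip(".")] = node

    walk(obj)
    return {
        "num_tensors": len(tensors),
        "num_elements": sum(t.numel() for t in tensors.values()),
        "lora_keys": [k for k in tensors if "lora_" in k],
    }
```

A non-empty `lora_keys` list for t3_cfg.pt would mean the adapter was saved unmerged and cannot be loaded into `model.t3` as-is.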
Supported Languages
Arabic (ar), Chinese (zh), Danish (da), Dutch (nl), English (en), Finnish (fi), French (fr), German (de), Greek (el), Hebrew (he), Hindi (hi), Italian (it), Japanese (ja), Korean (ko), Malay (ms), Norwegian (no), Polish (pl), Portuguese (pt), Russian (ru), Spanish (es), Swahili (sw), Swedish (sv), Turkish (tr)
Citation
If you use this model, please cite the original Chatterbox work:
```bibtex
@misc{chatterboxtts2025,
  author       = {{Resemble AI}},
  title        = {{Chatterbox-TTS}},
  year         = {2025},
  howpublished = {\url{https://github.com/resemble-ai/chatterbox}},
  note         = {GitHub repository}
}
```
License
This model inherits the MIT license from the base Chatterbox model.
Model tree: juliardi/chatterbox-multilingual-finetuned, fine-tuned from the base model ResembleAI/chatterbox.