---
language:
- en
license: mit
tags:
- text-to-speech
- tts
- voice-synthesis
- voice-cloning
- zero-shot
- emotion-control
library_name: chatterbox-tts
pipeline_tag: text-to-speech
---
# Chatterbox TTS

## Model Description
Chatterbox is Resemble AI's first production-grade open-source text-to-speech (TTS) model. Built on a 0.5B Llama backbone and trained on 0.5M hours of cleaned data, it delivers state-of-the-art zero-shot TTS performance that consistently outperforms leading closed-source systems like ElevenLabs in side-by-side evaluations.
## Key Features
- State-of-the-art zero-shot TTS: Generate natural-sounding speech without fine-tuning
- Emotion exaggeration control: First open-source TTS model with adjustable emotional intensity
- Ultra-stable generation: Alignment-informed inference for consistent outputs
- Voice cloning: Easy voice conversion with audio prompts
- Built-in watermarking: PerTh (Perceptual Threshold) watermarking for responsible AI
- Production-ready: Sub-200ms latency suitable for real-time applications
## Intended Uses & Limitations

### Intended Uses
- Content creation (videos, memes, games)
- AI agents and voice assistants
- Interactive media and applications
- Educational content
- Accessibility tools
- Creative projects requiring expressive speech
### Limitations
- Currently supports English only
- Requires CUDA-capable GPU for optimal performance
- Output includes imperceptible watermarks for traceability
## Ethical Considerations
- All generated audio includes Resemble AI's PerTh watermarking for responsible use tracking
- Users must comply with applicable laws and ethical guidelines
- Not intended for creating deceptive or harmful content
- Please review the disclaimer section before use
## How to Use

### Installation

```bash
pip install chatterbox-tts
```

### Basic Usage
```python
import torchaudio as ta

from chatterbox.tts import ChatterboxTTS

# Initialize model
model = ChatterboxTTS.from_pretrained(device="cuda")

# Generate speech from text
text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("output.wav", wav, model.sr)

# Generate with a custom voice from an audio prompt
AUDIO_PROMPT_PATH = "path/to/voice/sample.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("output_custom_voice.wav", wav, model.sr)
```
## Advanced Usage Tips

### General Use (TTS and Voice Agents)

- Default settings (`exaggeration=0.5`, `cfg_weight=0.5`) work well for most prompts
- For fast-speaking reference voices, lower `cfg_weight` to ~0.3 for better pacing

### Expressive or Dramatic Speech

- Use lower `cfg_weight` values (~0.3) with higher `exaggeration` (≥0.7)
- Higher exaggeration tends to speed up speech; a lower `cfg_weight` compensates with slower, more deliberate pacing
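These tuning tips can be captured as a small preset helper so call sites stay readable. A minimal sketch, assuming the `exaggeration` and `cfg_weight` keyword arguments described above; the helper name and preset labels are illustrative, not part of the library:

```python
# Named presets mirroring the tuning tips: defaults for general use,
# a lower cfg_weight for fast-speaking reference voices, and
# high exaggeration with low cfg_weight for dramatic delivery.
PRESETS = {
    "default": {"exaggeration": 0.5, "cfg_weight": 0.5},
    "fast_reference_voice": {"exaggeration": 0.5, "cfg_weight": 0.3},
    "dramatic": {"exaggeration": 0.7, "cfg_weight": 0.3},
}


def tts_kwargs(style: str = "default") -> dict:
    """Return generation keyword arguments for a named style."""
    if style not in PRESETS:
        raise ValueError(f"unknown style {style!r}; choose from {sorted(PRESETS)}")
    # Copy so callers can tweak values without mutating the preset table.
    return dict(PRESETS[style])
```

A call site would then read `wav = model.generate(text, **tts_kwargs("dramatic"))`, keeping the tuning knobs in one place.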
## Model Details

### Architecture
- Backbone: 0.5B parameter Llama-based architecture
- Training Data: 0.5M hours of cleaned speech data
- Special Features: Alignment-informed inference for stability
### Performance
- Consistently preferred over ElevenLabs in side-by-side evaluations
- Ultra-low latency (<200ms) suitable for production use
- Stable generation with minimal artifacts
## Citation
If you use Chatterbox in your research or projects, please cite:
```bibtex
@software{chatterbox2024,
  title  = {Chatterbox TTS},
  author = {Resemble AI},
  year   = {2024},
  url    = {https://github.com/resemble-ai/chatterbox}
}
```
## Links
- 🎧 Listen to demo samples
- 🤗 Try it on Hugging Face Spaces
- 📊 View benchmarks on Podonos
- 🏢 Resemble AI TTS Service (for scaled production use)
## Disclaimer
This model should not be used for creating deceptive, harmful, or malicious content. Users are responsible for ensuring their use complies with all applicable laws and ethical guidelines. All generated audio includes imperceptible watermarks for responsible AI tracking.
## License
This model is licensed under the MIT License. See the LICENSE file for details.
Made with ♥️ by Resemble AI