---
language:
- en
license: mit
tags:
- text-to-speech
- tts
- voice-synthesis
- voice-cloning
- zero-shot
- emotion-control
library_name: chatterbox-tts
pipeline_tag: text-to-speech
---

# Chatterbox TTS

<img width="1200" alt="cb-big2" src="https://github.com/user-attachments/assets/bd8c5f03-e91d-4ee5-b680-57355da204d1" />

## Model Description

Chatterbox is Resemble AI's first production-grade open-source text-to-speech (TTS) model. Built on a 0.5B Llama backbone and trained on 0.5M hours of cleaned data, it delivers state-of-the-art zero-shot TTS and is consistently preferred over leading closed-source systems such as ElevenLabs in side-by-side evaluations.

### Key Features

- **State-of-the-art zero-shot TTS**: Generate natural-sounding speech without fine-tuning
- **Emotion exaggeration control**: First open-source TTS model with adjustable emotional intensity
- **Ultra-stable generation**: Alignment-informed inference for consistent outputs
- **Voice cloning**: Easy voice conversion with audio prompts
- **Built-in watermarking**: PerTh (Perceptual Threshold) watermarking for responsible AI
- **Production-ready**: Sub-200ms latency suitable for real-time applications

## Intended Uses & Limitations

### Intended Uses

- Content creation (videos, memes, games)
- AI agents and voice assistants
- Interactive media and applications
- Educational content
- Accessibility tools
- Creative projects requiring expressive speech

### Limitations

- Currently supports English only
- Requires CUDA-capable GPU for optimal performance
- Output includes imperceptible watermarks for traceability

### Ethical Considerations

- All generated audio includes Resemble AI's PerTh watermarking for responsible use tracking
- Users must comply with applicable laws and ethical guidelines
- Not intended for creating deceptive or harmful content
- Please review the disclaimer section before use

## How to Use

### Installation

```bash
pip install chatterbox-tts
```

### Basic Usage

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

# Initialize model
model = ChatterboxTTS.from_pretrained(device="cuda")

# Generate speech from text
text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("output.wav", wav, model.sr)

# Generate with custom voice
AUDIO_PROMPT_PATH = "path/to/voice/sample.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("output_custom_voice.wav", wav, model.sr)
```
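
If a CUDA GPU is not guaranteed to be present, the device string can be chosen at runtime instead of hard-coding `"cuda"`. The sketch below is one way to do that. `pick_device` is a hypothetical helper, not part of `chatterbox-tts`, and whether `from_pretrained` accepts `"mps"` or `"cpu"` in your installed version is an assumption you should verify.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports.

    Hypothetical helper for illustration; falls back to "cpu" when
    PyTorch is not installed at all.
    """
    try:
        import torch
    except ImportError:
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # torch.backends.mps only exists on builds with Apple Metal support.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


device = pick_device()
print(f"Selected device: {device}")

# Assumed usage, mirroring the Basic Usage snippet above:
# model = ChatterboxTTS.from_pretrained(device=device)
```

Expect noticeably slower generation on CPU, in line with the GPU note under Limitations.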

### Advanced Usage Tips

#### General Use (TTS and Voice Agents)
- The default settings (`exaggeration=0.5`, `cfg_weight=0.5`) work well for most prompts
- If the reference speaker talks fast, lower `cfg_weight` to around 0.3 to improve pacing

#### Expressive or Dramatic Speech
- Try a lower `cfg_weight` (around 0.3) together with a higher `exaggeration` (0.7 or above)
- Higher `exaggeration` tends to speed up speech; a lower `cfg_weight` compensates with slower, more deliberate pacing
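
The tips above can be collected into a small lookup of presets. This is a minimal sketch: `PRESETS` and `generation_kwargs` are hypothetical names introduced here, and it assumes that `generate` accepts `exaggeration` and `cfg_weight` as keyword arguments alongside `audio_prompt_path`. The expressive preset uses 0.7, the lower end of the suggested range.

```python
from typing import Optional

# Hypothetical presets derived from the tips above; the values are
# starting points, not tuned recommendations.
PRESETS = {
    "default": {"exaggeration": 0.5, "cfg_weight": 0.5},
    "fast_reference": {"exaggeration": 0.5, "cfg_weight": 0.3},
    "expressive": {"exaggeration": 0.7, "cfg_weight": 0.3},
}


def generation_kwargs(style: str = "default",
                      audio_prompt_path: Optional[str] = None) -> dict:
    """Build keyword arguments for a generate() call from a named preset."""
    if style not in PRESETS:
        raise ValueError(f"Unknown style {style!r}; choose from {sorted(PRESETS)}")
    kwargs = dict(PRESETS[style])  # copy so callers cannot mutate the preset
    if audio_prompt_path is not None:
        kwargs["audio_prompt_path"] = audio_prompt_path
    return kwargs


print(generation_kwargs("expressive"))
```

With a loaded model, the result could be unpacked as `model.generate(text, **generation_kwargs("expressive"))`.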

## Model Details

### Architecture
- **Backbone**: 0.5B parameter Llama-based architecture
- **Training Data**: 0.5M hours of cleaned speech data
- **Special Features**: Alignment-informed inference for stability

### Performance
- Consistently preferred over ElevenLabs in side-by-side evaluations
- Ultra-low latency (<200ms) suitable for production use
- Stable generation with minimal artifacts

## Citation

If you use Chatterbox in your research or projects, please cite:

```bibtex
@software{chatterbox2024,
  title = {Chatterbox TTS},
  author = {Resemble AI},
  year = {2024},
  url = {https://github.com/resemble-ai/chatterbox}
}
```

## Acknowledgments

- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
- [HiFT-GAN](https://github.com/yl4579/HiFTNet)
- [Llama 3](https://github.com/meta-llama/llama3)

## Links

- 🎧 [Listen to demo samples](https://resemble-ai.github.io/chatterbox_demopage/)
- 🤗 [Try it on Hugging Face Spaces](https://huggingface.co/spaces/ResembleAI/Chatterbox)
- 📊 [View benchmarks on Podonos](https://podonos.com/resembleai/chatterbox)
- 🏢 [Resemble AI TTS Service](https://resemble.ai) (for scaled production use)

## Disclaimer

This model should not be used for creating deceptive, harmful, or malicious content. Users are responsible for ensuring their use complies with all applicable laws and ethical guidelines. All generated audio includes imperceptible watermarks for responsible AI tracking.

## License

This model is licensed under the MIT License. See the LICENSE file for details.

---

*Made with ♥️ by [Resemble AI](https://resemble.ai)*