language:
  - en
license: mit
tags:
  - text-to-speech
  - tts
  - voice-synthesis
  - voice-cloning
  - zero-shot
  - emotion-control
library_name: chatterbox-tts
pipeline_tag: text-to-speech

Chatterbox TTS

Model Description

Chatterbox is Resemble AI's first production-grade open-source text-to-speech (TTS) model. Built on a 0.5B Llama backbone and trained on 0.5M hours of cleaned data, it delivers state-of-the-art zero-shot TTS performance that consistently outperforms leading closed-source systems like ElevenLabs in side-by-side evaluations.

Key Features

State-of-the-art zero-shot TTS: Generate natural-sounding speech without fine-tuning
Emotion exaggeration control: First open-source TTS model with adjustable emotional intensity
Ultra-stable generation: Alignment-informed inference for consistent outputs
Voice cloning: Easy voice conversion with audio prompts
Built-in watermarking: PerTh (Perceptual Threshold) watermarking for responsible AI
Production-ready: Sub-200ms latency suitable for real-time applications

Intended Uses & Limitations

Intended Uses

Content creation (videos, memes, games)
AI agents and voice assistants
Interactive media and applications
Educational content
Accessibility tools
Creative projects requiring expressive speech

Limitations

Currently supports English only
Requires CUDA-capable GPU for optimal performance
Output includes imperceptible watermarks for traceability

Ethical Considerations

All generated audio includes Resemble AI's PerTh watermarking for responsible use tracking
Users must comply with applicable laws and ethical guidelines
Not intended for creating deceptive or harmful content
Please review the disclaimer section before use

How to Use

Installation

pip install chatterbox-tts

Basic Usage

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

# Initialize model
model = ChatterboxTTS.from_pretrained(device="cuda")

# Generate speech from text
text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("output.wav", wav, model.sr)

# Generate with custom voice
AUDIO_PROMPT_PATH = "path/to/voice/sample.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("output_custom_voice.wav", wav, model.sr)
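
The example above hard-codes device="cuda". On machines where a GPU may be absent, a fallback along the following lines can avoid a startup error. This is only a sketch: pick_device and detect_cuda are hypothetical helpers, not part of the library, and passing "cpu" to from_pretrained is an assumption that this card does not confirm. Expect much slower generation without a GPU.

```python
def pick_device(cuda_available: bool) -> str:
    """Prefer CUDA when it is present, otherwise fall back to the CPU."""
    return "cuda" if cuda_available else "cpu"


def detect_cuda() -> bool:
    """Report whether PyTorch can see a CUDA device (False if torch is missing)."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


device = pick_device(detect_cuda())
print(device)

# With chatterbox-tts installed, the result would be passed on as
# (assumption: from_pretrained accepts "cpu" as well as "cuda"):
# model = ChatterboxTTS.from_pretrained(device=device)
```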

Advanced Usage Tips

General Use (TTS and Voice Agents)

Default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts
If the reference speaker talks fast, lower cfg_weight to around 0.3 to improve pacing

Expressive or Dramatic Speech

Use a lower cfg_weight (around 0.3) together with a higher exaggeration (0.7 or above)
Higher exaggeration tends to speed up speech; lowering cfg_weight compensates with slower, more deliberate pacing
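
As a minimal sketch, the tips above can be captured as keyword-argument presets. The helper name generation_settings is hypothetical, and the sketch assumes that generate accepts the two settings as keyword arguments named exaggeration and cfg_weight. Check the installed chatterbox-tts version before relying on those names.

```python
def generation_settings(style="general", fast_reference=False):
    """Return keyword arguments for model.generate (hypothetical helper).

    The numbers come from the tips above; the keyword names are assumptions.
    """
    if style == "general":
        # Defaults suit most prompts; lower the guidance for fast-talking references.
        cfg_weight = 0.3 if fast_reference else 0.5
        return {"exaggeration": 0.5, "cfg_weight": cfg_weight}
    if style == "expressive":
        # More emotion, with lower guidance to offset the faster speech.
        return {"exaggeration": 0.7, "cfg_weight": 0.3}
    raise ValueError(f"unknown style: {style!r}")


settings = generation_settings("expressive")
print(settings)

# With a loaded model, the preset would be passed through as:
# wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH, **settings)
```

Keeping the presets in one place makes it easier to tune the values per voice, since the best combination depends on the reference audio.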

Model Details

Architecture

Backbone: 0.5B parameter Llama-based architecture
Training Data: 0.5M hours of cleaned speech data
Special Features: Alignment-informed inference for stability

Performance

Consistently preferred over ElevenLabs in side-by-side evaluations
Ultra-low latency (<200ms) suitable for production use
Stable generation with minimal artifacts
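
Latency depends on hardware, text length, and how it is measured, and this card does not describe the measurement behind the figure above. To check timings on your own setup, a small wall-clock harness such as the following may help. It is a sketch: measure_latency_ms is a hypothetical helper, fake_generate is a stand-in so the code runs without a model, and timing a full generate call measures end-to-end time rather than time to first audio. If the timed call leaves work queued on the GPU, synchronize before reading the clock.

```python
import time


def measure_latency_ms(fn, *args, repeats=3, **kwargs):
    """Call fn repeatedly and return each call's wall-clock time in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def fake_generate(text):
    """Stand-in workload; replace with model.generate to time the real model."""
    return text.upper()


timings = measure_latency_ms(fake_generate, "hello", repeats=3)
print([round(t, 3) for t in timings])

# With a loaded model (the first call usually includes warm-up cost):
# timings = measure_latency_ms(model.generate, text, repeats=5)
```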

Citation

If you use Chatterbox in your research or projects, please cite:

@software{chatterbox2024,
  title = {Chatterbox TTS},
  author = {Resemble AI},
  year = {2024},
  url = {https://github.com/resemble-ai/chatterbox}
}

Acknowledgments

Disclaimer

This model should not be used for creating deceptive, harmful, or malicious content. Users are responsible for ensuring their use complies with all applicable laws and ethical guidelines. All generated audio includes imperceptible watermarks for responsible AI tracking.

License

This model is licensed under the MIT License. See the LICENSE file for details.

Made with ♥️ by Resemble AI
