metadata
language:
  - en
license: mit
tags:
  - text-to-speech
  - tts
  - voice-synthesis
  - voice-cloning
  - zero-shot
  - emotion-control
library_name: chatterbox-tts
pipeline_tag: text-to-speech

Chatterbox TTS

Model Description

Chatterbox is Resemble AI's first production-grade open-source text-to-speech (TTS) model. Built on a 0.5B-parameter Llama backbone and trained on 0.5M hours of cleaned data, it delivers state-of-the-art zero-shot TTS and is consistently preferred over leading closed-source systems such as ElevenLabs in side-by-side evaluations.

Key Features

  • State-of-the-art zero-shot TTS: Generate natural-sounding speech without fine-tuning
  • Emotion exaggeration control: First open-source TTS model with adjustable emotional intensity
  • Ultra-stable generation: Alignment-informed inference for consistent outputs
  • Voice cloning: Easy voice conversion with audio prompts
  • Built-in watermarking: PerTh (Perceptual Threshold) watermarking for responsible AI
  • Production-ready: Sub-200ms latency suitable for real-time applications

Intended Uses & Limitations

Intended Uses

  • Content creation (videos, memes, games)
  • AI agents and voice assistants
  • Interactive media and applications
  • Educational content
  • Accessibility tools
  • Creative projects requiring expressive speech

Limitations

  • Currently supports English only
  • Requires CUDA-capable GPU for optimal performance
  • Output includes imperceptible watermarks for traceability

Ethical Considerations

  • All generated audio includes Resemble AI's PerTh watermarking for responsible use tracking
  • Users must comply with applicable laws and ethical guidelines
  • Not intended for creating deceptive or harmful content
  • Please review the disclaimer section before use
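
The watermark mentioned above can, in principle, be checked on a generated file. The following is a minimal sketch rather than an official procedure. It assumes that the `perth` package (which chatterbox-tts is believed to install as a dependency) exposes `PerthImplicitWatermarker` with a `get_watermark` method, and that `librosa` is available for loading audio. The helper name `extract_watermark` is made up for this example.

```python
def extract_watermark(audio_path: str):
    """Return the watermark value that PerTh reports for an audio file.

    Assumption: perth.PerthImplicitWatermarker().get_watermark(audio, sample_rate=...)
    is the detection call; check the perth documentation before relying on it.
    """
    # Imported lazily so that this helper can be defined without the
    # audio dependencies being installed.
    import librosa
    import perth

    # sr=None keeps the file's native sample rate instead of resampling.
    audio, sample_rate = librosa.load(audio_path, sr=None)

    watermarker = perth.PerthImplicitWatermarker()
    return watermarker.get_watermark(audio, sample_rate=sample_rate)
```

Calling `extract_watermark("output.wav")` on a file produced by the model should then indicate whether the watermark is present. The exact return format depends on the `perth` version.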

How to Use

Installation

pip install chatterbox-tts

Basic Usage

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

# Initialize model
model = ChatterboxTTS.from_pretrained(device="cuda")

# Generate speech from text
text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("output.wav", wav, model.sr)

# Generate with custom voice
AUDIO_PROMPT_PATH = "path/to/voice/sample.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("output_custom_voice.wav", wav, model.sr)
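
The Limitations section notes that a CUDA-capable GPU is needed for optimal performance, and the example above hard-codes `device="cuda"`. A minimal sketch of a fallback follows. It assumes that `ChatterboxTTS.from_pretrained` also accepts `device="cpu"`, which is expected to be much slower. The helper names `pick_device` and `load_model` are hypothetical.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without torch the model cannot run at all; returning "cpu" only
        # keeps this helper safe to call in a bare environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_model():
    """Load Chatterbox on the best available device (assumes CPU is accepted)."""
    # Imported lazily so pick_device() can be used without chatterbox installed.
    from chatterbox.tts import ChatterboxTTS

    return ChatterboxTTS.from_pretrained(device=pick_device())
```

With this in place, `model = load_model()` can replace the `from_pretrained(device="cuda")` call in the example above.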

Advanced Usage Tips

General Use (TTS and Voice Agents)

  • The default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts
  • If the reference voice speaks quickly, lower cfg_weight to around 0.3 to improve pacing

Expressive or Dramatic Speech

  • Combine a lower cfg_weight (around 0.3) with a higher exaggeration (0.7 or above)
  • Higher exaggeration tends to speed up speech; lowering cfg_weight compensates with slower, more deliberate pacing
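
The tips above can be collected into a small set of presets. This is a sketch under stated assumptions, not part of the library. It assumes `generate` accepts the exaggeration setting as the keyword `exaggeration` and the cfg setting as the keyword `cfg_weight`. The preset names and the helper `generation_kwargs` are made up for this example, and the values are the ones suggested in the tips.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationPreset:
    exaggeration: float  # emotional intensity
    cfg_weight: float    # lower values give slower, more deliberate pacing


# Values come from the tips above; the preset names are hypothetical.
PRESETS = {
    "default": GenerationPreset(exaggeration=0.5, cfg_weight=0.5),
    "fast_reference": GenerationPreset(exaggeration=0.5, cfg_weight=0.3),
    "expressive": GenerationPreset(exaggeration=0.7, cfg_weight=0.3),
}


def generation_kwargs(style: str = "default") -> dict:
    """Return keyword arguments for generate() for the named preset."""
    if style not in PRESETS:
        raise ValueError(f"unknown style {style!r}; choose from {sorted(PRESETS)}")
    return asdict(PRESETS[style])


# Intended use (requires a loaded model, so it is left commented out):
# wav = model.generate(text, **generation_kwargs("expressive"))
```

Treat these presets as starting points; the best values depend on the reference voice and the text.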

Model Details

Architecture

  • Backbone: 0.5B parameter Llama-based architecture
  • Training Data: 0.5M hours of cleaned speech data
  • Special Features: Alignment-informed inference for stability

Performance

  • Consistently preferred over ElevenLabs in side-by-side evaluations
  • Ultra-low latency (<200ms) suitable for production use
  • Stable generation with minimal artifacts

Citation

If you use Chatterbox in your research or projects, please cite:

@software{chatterbox2024,
  title = {Chatterbox TTS},
  author = {Resemble AI},
  year = {2024},
  url = {https://github.com/resemble-ai/chatterbox}
}

Acknowledgments

Links

Disclaimer

This model should not be used for creating deceptive, harmful, or malicious content. Users are responsible for ensuring their use complies with all applicable laws and ethical guidelines. All generated audio includes imperceptible watermarks for responsible AI tracking.

License

This model is licensed under the MIT License. See the LICENSE file for details.


Made with ♥️ by Resemble AI