| --- |
| title: VoxLibris Chatterbox TTS Engine |
| emoji: 🗣️ |
| colorFrom: purple |
| colorTo: indigo |
| sdk: docker |
| app_port: 7860 |
| pinned: false |
| --- |
| |
| # VoxLibris Chatterbox TTS Engine |
|
|
| A HuggingFace Space that serves [Chatterbox TTS](https://github.com/resemble-ai/chatterbox) |
| as a REST API, implementing the |
| [VoxLibris TTS Engine API Contract](https://github.com/your-repo/docs/tts-api-contract.md). |
|
|
| ## Endpoints |
|
|
| ### POST /GetEngineDetails |
|
|
| Returns engine capabilities, supported emotions, and voice cloning support. |
|
|
| ### POST /ConvertTextToSpeech |
|
|
| Converts text to speech with voice cloning. Requires a `voice_to_clone_sample` |
| (base64-encoded WAV). Supports emotion-driven expressiveness via the `exaggeration` |
| parameter, which is mapped automatically from VoxLibris emotions. |
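
As a rough sketch, a request body for this endpoint might be assembled as shown here. Only `voice_to_clone_sample` is named in this README; the `text`, `emotion`, and `intensity` field names and the `build_tts_payload` helper are assumptions for illustration, so check the API contract for the actual schema.

```python
import base64
import json


def build_tts_payload(text: str, wav_bytes: bytes, emotion: str = "neutral",
                      intensity: int = 50) -> str:
    """Return a JSON request body for POST /ConvertTextToSpeech.

    Only `voice_to_clone_sample` is documented in this README; the other
    field names are assumptions and may differ in the real contract.
    """
    payload = {
        "text": text,            # assumed field name
        "emotion": emotion,      # assumed field name
        "intensity": intensity,  # assumed field name
        # Documented field: the reference voice as base64-encoded WAV bytes.
        "voice_to_clone_sample": base64.b64encode(wav_bytes).decode("ascii"),
    }
    return json.dumps(payload)
```

The helper only builds the body; sending it is left to whichever HTTP client you prefer.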
|
|
| ### GET /health |
|
|
| Returns model loading status. |
|
|
| ## Authentication |
|
|
| Set the `API_KEY` secret in your HuggingFace Space settings. |
| Requests must include an `Authorization: Bearer <your-key>` header. |
| Leave `API_KEY` unset to disable authentication. |
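
A minimal client-side sketch of the header rule above, using only the Python standard library. The `make_request` helper and the example URL are hypothetical; the `Content-Type: application/json` header is an assumption about what the API expects.

```python
import urllib.request
from typing import Optional


def make_request(url: str, body: bytes,
                 api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a POST request, adding the Bearer header only when a key is given."""
    headers = {"Content-Type": "application/json"}  # assumed content type
    if api_key:
        # Matches the documented `Authorization: Bearer <your-key>` format.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical Space URL, for illustration only.
req = make_request("https://example.hf.space/GetEngineDetails", b"{}", api_key="secret")
```

Pass the resulting object to `urllib.request.urlopen` to send it. When the Space runs without `API_KEY`, omit `api_key` and no `Authorization` header is added.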
|
|
| ## Voice Cloning |
|
|
| Chatterbox is a voice-cloning TTS engine, so every request requires a reference |
| voice sample. Send a base64-encoded WAV file in the `voice_to_clone_sample` |
| field. A clear speech sample of 6 to 15 seconds works best. |
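
One way to prepare a sample is to check its duration before encoding it. The sketch below uses the standard `wave` module; the `encode_voice_sample` helper is hypothetical, and treating the 6 to 15 second recommendation as a hard check is a choice made here, not something the engine is documented to enforce.

```python
import base64
import io
import wave


def encode_voice_sample(wav_bytes: bytes, min_s: float = 6.0,
                        max_s: float = 15.0) -> str:
    """Return the WAV as base64 text after checking its duration."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        # Duration in seconds = number of frames / frames per second.
        duration = wav.getnframes() / wav.getframerate()
    if not min_s <= duration <= max_s:
        raise ValueError(
            f"sample is {duration:.1f}s; expected {min_s:.0f}-{max_s:.0f}s"
        )
    return base64.b64encode(wav_bytes).decode("ascii")
```

The returned string can be placed directly in the `voice_to_clone_sample` field.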
|
|
| ## Emotion Support |
|
|
| Chatterbox controls expressiveness through its `exaggeration` parameter (0.0-1.0). |
| The engine automatically maps VoxLibris emotions to appropriate exaggeration levels: |
|
|
| | Emotion | Exaggeration | Description | |
| |-----------|-------------|---------------------------| |
| | neutral | 0.50 | Normal, conversational | |
| | calm | 0.40 | Subdued, relaxed | |
| | happy | 0.70 | Cheerful, upbeat | |
| | sad | 0.60 | Somber, downcast | |
| | angry | 0.85 | Intense, forceful | |
| | fear | 0.75 | Tense, urgent | |
| | excited | 0.90 | High energy, enthusiastic | |
| | surprise | 0.80 | Startled, astonished | |
|
|
| The `intensity` parameter (1-100) scales the exaggeration further. |
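
The table can be expressed as a simple lookup. In the sketch below, the base values come from the table above, but everything else is an assumption: this README does not document how `intensity` scales the result, so the linear rule around a midpoint of 50, the clamp to 0.0-1.0, and the fallback to `neutral` for unknown emotions are placeholders rather than the engine's actual behavior.

```python
# Base exaggeration per emotion, copied from the table above.
EMOTION_EXAGGERATION = {
    "neutral": 0.50,
    "calm": 0.40,
    "happy": 0.70,
    "sad": 0.60,
    "angry": 0.85,
    "fear": 0.75,
    "excited": 0.90,
    "surprise": 0.80,
}


def exaggeration_for(emotion: str, intensity: int = 50) -> float:
    """Map an emotion and intensity (1-100) to an exaggeration in [0.0, 1.0]."""
    # ASSUMPTION: unknown emotions fall back to the neutral level.
    base = EMOTION_EXAGGERATION.get(emotion.lower(), EMOTION_EXAGGERATION["neutral"])
    intensity = max(1, min(100, intensity))
    # ASSUMPTION: linear scaling where intensity 50 leaves the base unchanged.
    scaled = base * (intensity / 50)
    # Keep the result inside the documented 0.0-1.0 range.
    return max(0.0, min(1.0, scaled))
```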
|
|
| ## Limits |
|
|
| - Maximum 300 characters per request (longer text is truncated at a word boundary) |
| - Output: 24 kHz mono 16-bit WAV |
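
To avoid surprises from server-side truncation, a client can apply the same kind of limit before sending. This is a sketch of word-boundary truncation in general; the engine's exact rule is not shown in this README, and `truncate_at_word_boundary` is a hypothetical helper.

```python
MAX_CHARS = 300  # documented per-request limit


def truncate_at_word_boundary(text: str, limit: int = MAX_CHARS) -> str:
    """Cut text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    # Look one character past the limit so a cut that lands exactly
    # between two words keeps the whole last word.
    head = text[:limit + 1]
    for i in range(len(head) - 1, 0, -1):
        if head[i].isspace():
            return head[:i].rstrip()
    # No whitespace to break on: fall back to a hard cut.
    return text[:limit]
```

For longer passages, splitting the text into several requests at sentence boundaries avoids losing content.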
|
|
| ## Deployment |
|
|
| 1. Create a new HuggingFace Space with the **Docker** SDK |
| 2. Upload the contents of this folder |
| 3. Set the `API_KEY` secret in Space settings (optional) |
| 4. Wait for the first startup to finish; the model (~500 MB) downloads automatically |
| 5. Make sure the Space runs on GPU hardware (T4 minimum recommended) |
| 6. Register the Space URL in VoxLibris Settings under TTS Engine Management |
|
|