| --- |
| title: VoxLibris IndexTTS2 Engine |
| emoji: 🎙️ |
| colorFrom: purple |
| colorTo: indigo |
| sdk: docker |
| app_port: 7860 |
| pinned: false |
| --- |
| |
| # VoxLibris IndexTTS2 Engine |
|
|
| A HuggingFace Space that serves [IndexTTS2](https://github.com/index-tts/index-tts) |
| as a REST API, implementing the |
| [VoxLibris TTS Engine API Contract](https://github.com/your-repo/docs/tts-api-contract.md). |
|
|
| ## Endpoints |
|
|
| ### POST /GetEngineDetails |
|
|
| Returns engine capabilities, supported emotions, and voice cloning support. |
|
|
| ### POST /ConvertTextToSpeech |
|
|
| Converts text to speech with zero-shot voice cloning. Requires a |
| `voice_to_clone_sample` (base64-encoded WAV). Supports 14 emotions mapped |
| to IndexTTS2's 8-dimensional emotion vector system. |
|
|
| ### GET /health |
|
|
| Returns model loading status. |
|
|
| ## Authentication |
|
|
| Set the `API_KEY` secret in your HuggingFace Space settings. |
| Requests must include `Authorization: Bearer <your-key>` header. |
| Leave `API_KEY` unset to disable authentication. |
|
|
| ## Voice Cloning |
|
|
| IndexTTS2 is a zero-shot voice cloning engine — every request requires a |
| reference voice sample. Send a base64-encoded WAV file in the |
| `voice_to_clone_sample` field. A 6-15 second clear speech sample works best. |
|
|
| The engine disentangles speaker timbre from emotional expression, allowing |
| the cloned voice to speak with different emotions without affecting voice |
| identity. |
|
|
| ## Emotion Support |
|
|
| IndexTTS2 uses an 8-dimensional emotion vector system (happy, angry, sad, |
| afraid, disgusted, melancholic, surprised, calm) with a fine-tuned Qwen3 |
| model for emotion analysis. VoxLibris emotions are automatically mapped |
| to appropriate vector blends: |
|
|
| | Emotion | Mapping Strategy | |
| |-------------|---------------------------------------| |
| | neutral | High calm (0.8) | |
| | happy | High happy (0.8) | |
| | sad | High sad (0.8) | |
| | angry | High angry (0.8) | |
| | fear | High afraid (0.8) | |
| | disgust | High disgusted (0.8) | |
| | surprise | High surprised (0.7) | |
| | calm | High calm (0.8) | |
| | excited | Happy (0.6) + surprised (0.2) | |
| | melancholy | Sad (0.2) + melancholic (0.6) | |
| | anxious | Afraid (0.5) + slight calm (0.2) | |
| | hopeful | Happy (0.5) + calm (0.3) | |
| | tender | Happy (0.2) + calm (0.5) | |
| | proud | Happy (0.5) + surprised (0.1) | |
|
|
| The `intensity` parameter (1-100) scales the emotion vectors. Additional |
| prosody reinforcement is applied via pyrubberband speed/pitch adjustments. |
|
|
| ## Key Features |
|
|
| - **Emotion-Speaker Disentanglement**: Independent control over voice timbre |
| (from reference audio) and emotional expression (from emotion vectors) |
| - **Zero-Shot Voice Cloning**: Clone any voice from a short reference audio |
| - **Duration Control**: Supports both free generation and explicit token-count |
| modes for precise audio length |
| - **Multilingual**: Chinese and English (with more languages supported) |
| - **Built-in Qwen3 Emotion Model**: Fine-tuned for text-to-emotion analysis |
|
|
| ## Limits |
|
|
| - Maximum 500 characters per request (longer text is truncated at word boundary) |
| - Output: 22050 Hz mono 16-bit WAV |
| - Reference audio: max 15 seconds (longer clips are auto-truncated) |
|
|
| ## Environment Variables |
|
|
| | Variable | Description | Default | |
| |-------------|----------------------------------------|-----------------| |
| | `API_KEY` | Bearer token for authentication | (none/disabled) | |
| | `MODEL_DIR` | Path to model checkpoints directory | `checkpoints` | |
| | `USE_FP16` | Enable half-precision inference | `true` | |
|
|
| ## Deployment |
|
|
| 1. Create a new HuggingFace Space with **Docker** SDK |
| 2. Upload the contents of this folder |
| 3. Set the `API_KEY` secret in Space settings (optional) |
| 4. The model downloads automatically during build (~5 GB) |
| 5. Requires GPU (A10G or better recommended for reasonable speed) |
| 6. Register the Space URL in VoxLibris Settings under TTS Engine Management |
|
|