---
title: VoxLibris IndexTTS2 Engine
emoji: 🎙️
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

# VoxLibris IndexTTS2 Engine

A HuggingFace Space that serves [IndexTTS2](https://github.com/index-tts/index-tts) as a REST API, implementing the [VoxLibris TTS Engine API Contract](https://github.com/your-repo/docs/tts-api-contract.md).

## Endpoints

### POST /GetEngineDetails

Returns engine capabilities, supported emotions, and voice cloning support.

### POST /ConvertTextToSpeech

Converts text to speech with zero-shot voice cloning. Requires a `voice_to_clone_sample` (base64-encoded WAV). Supports 14 emotions mapped to IndexTTS2's 8-dimensional emotion vector system.

### GET /health

Returns model loading status.

## Authentication

Set the `API_KEY` secret in your HuggingFace Space settings. Requests must then include an `Authorization: Bearer` header whose token is the `API_KEY` value. Leave `API_KEY` unset to disable authentication.

## Voice Cloning

IndexTTS2 is a zero-shot voice cloning engine, so every request requires a reference voice sample. Send a base64-encoded WAV file in the `voice_to_clone_sample` field. A clear speech sample of 6 to 15 seconds works best.

The engine disentangles speaker timbre from emotional expression, so the cloned voice can speak with different emotions without its identity changing.

## Emotion Support

IndexTTS2 uses an 8-dimensional emotion vector system (happy, angry, sad, afraid, disgusted, melancholic, surprised, calm) with a fine-tuned Qwen3 model for emotion analysis.
VoxLibris emotions are automatically mapped to appropriate vector blends:

| Emotion    | Mapping Strategy                 |
|------------|----------------------------------|
| neutral    | High calm (0.8)                  |
| happy      | High happy (0.8)                 |
| sad        | High sad (0.8)                   |
| angry      | High angry (0.8)                 |
| fear       | High afraid (0.8)                |
| disgust    | High disgusted (0.8)             |
| surprise   | High surprised (0.7)             |
| calm       | High calm (0.8)                  |
| excited    | Happy (0.6) + surprised (0.2)    |
| melancholy | Sad (0.2) + melancholic (0.6)    |
| anxious    | Afraid (0.5) + slight calm (0.2) |
| hopeful    | Happy (0.5) + calm (0.3)         |
| tender     | Happy (0.2) + calm (0.5)         |
| proud      | Happy (0.5) + surprised (0.1)    |

The `intensity` parameter (1-100) scales the emotion vectors. Additional prosody reinforcement is applied via pyrubberband speed/pitch adjustments.

## Key Features

- **Emotion-Speaker Disentanglement**: Independent control over voice timbre (from reference audio) and emotional expression (from emotion vectors)
- **Zero-Shot Voice Cloning**: Clone any voice from a short reference audio clip
- **Duration Control**: Supports both free generation and explicit token-count modes for precise audio length
- **Multilingual**: Chinese and English (with more languages supported)
- **Built-in Qwen3 Emotion Model**: Fine-tuned for text-to-emotion analysis

## Limits

- Maximum 500 characters per request (longer text is truncated at a word boundary)
- Output: 22050 Hz mono 16-bit WAV
- Reference audio: max 15 seconds (longer clips are auto-truncated)

## Environment Variables

| Variable    | Description                         | Default         |
|-------------|-------------------------------------|-----------------|
| `API_KEY`   | Bearer token for authentication     | (none/disabled) |
| `MODEL_DIR` | Path to model checkpoints directory | `checkpoints`   |
| `USE_FP16`  | Enable half-precision inference     | `true`          |

## Deployment

1. Create a new HuggingFace Space with the **Docker** SDK
2. Upload the contents of this folder
3. Set the `API_KEY` secret in Space settings (optional)
4. The model downloads automatically during build (~5 GB)
5. Requires a GPU (A10G or better recommended for reasonable speed)
6. Register the Space URL in VoxLibris Settings under TTS Engine Management
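
## Example: Building a Request

The Authentication and Voice Cloning sections describe a bearer token header and a base64-encoded WAV in `voice_to_clone_sample`. The sketch below shows one way a client might assemble such a request using only the Python standard library. Only `voice_to_clone_sample` and the `Authorization: Bearer` header come from this README; the `text`, `emotion`, and `intensity` JSON field names and the `build_tts_request` helper are assumptions for illustration, so check the API contract for the real schema.

```python
import base64
import json
import urllib.request


def build_tts_request(base_url, text, wav_bytes, api_key=None,
                      emotion="neutral", intensity=50):
    """Build (but do not send) a POST request for /ConvertTextToSpeech."""
    payload = {
        "text": text,            # hypothetical field name
        "emotion": emotion,      # hypothetical field name
        "intensity": intensity,  # parameter named in the README; JSON placement assumed
        # Field name taken from the README: the reference voice as base64 WAV.
        "voice_to_clone_sample": base64.b64encode(wav_bytes).decode("ascii"),
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the Space has the API_KEY secret set.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ConvertTextToSpeech",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen`. The response format is defined by the API contract and is not assumed here.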
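
## Example: Emotion Vector Mapping

The mapping table in Emotion Support can be expressed as a lookup from emotion name to an 8-dimensional vector. This is a minimal sketch, not the engine's actual code. The blend weights are copied from the table, but two things are assumptions: that the vector uses the dimension order listed in this README, and that `intensity` scales the blend linearly by `intensity / 100`. The names `EMOTION_DIMS`, `BLENDS`, and `emotion_vector` are hypothetical.

```python
# Dimension order as listed in the README (assumed to match the engine's order).
EMOTION_DIMS = ("happy", "angry", "sad", "afraid",
                "disgusted", "melancholic", "surprised", "calm")

# Blend weights copied from the mapping table.
BLENDS = {
    "neutral": {"calm": 0.8},
    "happy": {"happy": 0.8},
    "sad": {"sad": 0.8},
    "angry": {"angry": 0.8},
    "fear": {"afraid": 0.8},
    "disgust": {"disgusted": 0.8},
    "surprise": {"surprised": 0.7},
    "calm": {"calm": 0.8},
    "excited": {"happy": 0.6, "surprised": 0.2},
    "melancholy": {"sad": 0.2, "melancholic": 0.6},
    "anxious": {"afraid": 0.5, "calm": 0.2},
    "hopeful": {"happy": 0.5, "calm": 0.3},
    "tender": {"happy": 0.2, "calm": 0.5},
    "proud": {"happy": 0.5, "surprised": 0.1},
}


def emotion_vector(emotion, intensity=100):
    """Return the 8-dim vector for an emotion, scaled by intensity (1-100)."""
    if not 1 <= intensity <= 100:
        raise ValueError("intensity must be between 1 and 100")
    blend = BLENDS[emotion]   # raises KeyError for unknown emotions
    scale = intensity / 100   # assumed linear scaling
    return [round(blend.get(dim, 0.0) * scale, 4) for dim in EMOTION_DIMS]
```

For instance, `emotion_vector("excited", 50)` halves both the happy and surprised components and leaves the other six dimensions at zero.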
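
## Example: Word-Boundary Truncation

The Limits section says text longer than 500 characters is truncated at a word boundary. The sketch below shows one plausible way to do that, so clients can predict or pre-apply the cut. The function name `truncate_at_word_boundary` and the fallback to a hard cut when no whitespace is found are assumptions; the engine's exact rule may differ.

```python
def truncate_at_word_boundary(text, limit=500):
    """Cut text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the next character is whitespace, the cut already ends on a whole word.
    if text[limit].isspace():
        return cut.rstrip()
    # Otherwise back up to the last whitespace inside the cut.
    last_space = max((i for i, ch in enumerate(cut) if ch.isspace()), default=-1)
    if last_space > 0:
        return cut[:last_space].rstrip()
    # No usable whitespace: fall back to a hard cut (assumed behavior).
    return cut
```

Applying this on the client side before sending a request avoids surprises when a long passage is silently shortened by the server.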