---
title: VoxLibris IndexTTS2 Engine
emoji: 🎙️
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
VoxLibris IndexTTS2 Engine
A HuggingFace Space that serves IndexTTS2 as a REST API, implementing the VoxLibris TTS Engine API Contract.
Endpoints
POST /GetEngineDetails
Returns engine capabilities, supported emotions, and voice cloning support.
POST /ConvertTextToSpeech
Converts text to speech with zero-shot voice cloning. Requires a
voice_to_clone_sample (base64-encoded WAV). Supports 14 emotions mapped
to IndexTTS2's 8-dimensional emotion vector system.
GET /health
Returns model loading status.
Authentication
Set the API_KEY secret in your HuggingFace Space settings.
Requests must include an Authorization: Bearer <your-key> header.
Leave API_KEY unset to disable authentication.
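
As a rough client-side sketch of the header described above, the snippet below builds (but does not send) a POST request with the bearer token attached. The Space URL, the helper name `build_request`, and the empty JSON body for /GetEngineDetails are assumptions made for illustration; check the VoxLibris TTS Engine API Contract for the actual request shape.

```python
import json
import urllib.request


def build_request(base_url, path, payload=None, api_key=None):
    """Build a JSON POST request; attach a bearer token only when a key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Matches the "Authorization: Bearer <your-key>" header described above.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps(payload or {}).encode("utf-8")
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical Space URL and key, for illustration only.
req = build_request("https://example-space.hf.space", "/GetEngineDetails", api_key="my-secret")
print(req.full_url)
print(req.get_header("Authorization"))

# To actually send it (needs a running Space):
# with urllib.request.urlopen(req) as resp:
#     details = json.load(resp)
```

When API_KEY is unset on the Space, the same helper can be called without `api_key` and no Authorization header is added.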
Voice Cloning
IndexTTS2 is a zero-shot voice cloning engine — every request requires a
reference voice sample. Send a base64-encoded WAV file in the
voice_to_clone_sample field. A 6-15 second clear speech sample works best.
The engine disentangles speaker timbre from emotional expression, allowing the cloned voice to speak with different emotions without affecting voice identity.
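
The sketch below shows one way to prepare the reference sample: read the WAV bytes, check the duration against the recommended 6-15 second range, and base64-encode them into the voice_to_clone_sample field. Only voice_to_clone_sample is named in this README; the `text`, `emotion`, and `intensity` keys and all helper names are assumptions, and the silent clip is just a placeholder so the example runs without a recording.

```python
import base64
import io
import wave


def make_placeholder_wav(seconds=8.0, rate=22050):
    """Return a silent mono 16-bit WAV; a stand-in for a real speech recording."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


def wav_duration_seconds(wav_bytes):
    """Compute the clip length from the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()


def build_tts_payload(text, wav_bytes, emotion="neutral", intensity=50):
    """Assemble a request body; key names other than voice_to_clone_sample are assumed."""
    duration = wav_duration_seconds(wav_bytes)
    if not 6.0 <= duration <= 15.0:
        print(f"warning: sample is {duration:.1f}s; 6-15 seconds is recommended")
    return {
        "text": text,
        "voice_to_clone_sample": base64.b64encode(wav_bytes).decode("ascii"),
        "emotion": emotion,
        "intensity": intensity,
    }


wav = make_placeholder_wav(seconds=8.0)
payload = build_tts_payload("Hello from VoxLibris.", wav, emotion="happy", intensity=70)
print(len(payload["voice_to_clone_sample"]), "base64 characters")
```

In real use, replace `make_placeholder_wav` with the bytes of an actual recording, for example `open("speaker.wav", "rb").read()`.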
Emotion Support
IndexTTS2 uses an 8-dimensional emotion vector system (happy, angry, sad, afraid, disgusted, melancholic, surprised, calm) with a fine-tuned Qwen3 model for emotion analysis. VoxLibris emotions are automatically mapped to appropriate vector blends:
| Emotion | Mapping Strategy |
|---|---|
| neutral | High calm (0.8) |
| happy | High happy (0.8) |
| sad | High sad (0.8) |
| angry | High angry (0.8) |
| fear | High afraid (0.8) |
| disgust | High disgusted (0.8) |
| surprise | High surprised (0.7) |
| calm | High calm (0.8) |
| excited | Happy (0.6) + surprised (0.2) |
| melancholy | Sad (0.2) + melancholic (0.6) |
| anxious | Afraid (0.5) + slight calm (0.2) |
| hopeful | Happy (0.5) + calm (0.3) |
| tender | Happy (0.2) + calm (0.5) |
| proud | Happy (0.5) + surprised (0.1) |
The intensity parameter (1-100) scales the emotion vectors. Additional
prosody reinforcement is applied via pyrubberband speed/pitch adjustments.
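
To make the table concrete, here is a minimal sketch of how the blends could be expanded into 8-dimensional vectors. The weights are copied from the table; the dimension order follows the list in the paragraph above, and the linear `intensity / 100` scaling is an assumption, since this README does not state the exact scaling rule. The pyrubberband prosody step is not modeled.

```python
# Dimension order as listed in this README (assumed to match the engine's vector layout).
DIMENSIONS = ("happy", "angry", "sad", "afraid",
              "disgusted", "melancholic", "surprised", "calm")

# Blend weights taken from the mapping table.
EMOTION_BLENDS = {
    "neutral": {"calm": 0.8},
    "happy": {"happy": 0.8},
    "sad": {"sad": 0.8},
    "angry": {"angry": 0.8},
    "fear": {"afraid": 0.8},
    "disgust": {"disgusted": 0.8},
    "surprise": {"surprised": 0.7},
    "calm": {"calm": 0.8},
    "excited": {"happy": 0.6, "surprised": 0.2},
    "melancholy": {"sad": 0.2, "melancholic": 0.6},
    "anxious": {"afraid": 0.5, "calm": 0.2},
    "hopeful": {"happy": 0.5, "calm": 0.3},
    "tender": {"happy": 0.2, "calm": 0.5},
    "proud": {"happy": 0.5, "surprised": 0.1},
}


def emotion_vector(emotion, intensity=100):
    """Return the 8-dim vector for a VoxLibris emotion, scaled by intensity (1-100)."""
    if not 1 <= intensity <= 100:
        raise ValueError("intensity must be between 1 and 100")
    blend = EMOTION_BLENDS[emotion]
    scale = intensity / 100  # assumed linear scaling
    return [round(blend.get(dim, 0.0) * scale, 4) for dim in DIMENSIONS]


print(emotion_vector("excited", intensity=50))
```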
Key Features
- Emotion-Speaker Disentanglement: Independent control over voice timbre (from reference audio) and emotional expression (from emotion vectors)
- Zero-Shot Voice Cloning: Clone any voice from a short reference audio
- Duration Control: Supports both free generation and explicit token-count modes for precise audio length
- Multilingual: Chinese and English, with additional languages also supported
- Built-in Qwen3 Emotion Model: Fine-tuned for text-to-emotion analysis
Limits
- Maximum 500 characters per request (longer text is truncated at a word boundary)
- Output: 22050 Hz mono 16-bit WAV
- Reference audio: max 15 seconds (longer clips are auto-truncated)
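
Clients that want to avoid silent truncation can trim text themselves before sending. The helper below is one possible way to cut at a word boundary under the 500-character limit; it is a sketch, not the engine's own truncation code, and the function name is hypothetical.

```python
def truncate_at_word_boundary(text, limit=500):
    """Shorten text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the next character is whitespace, the cut already ends on a whole word.
    if text[limit].isspace():
        return cut.rstrip()
    head, sep, _ = cut.rpartition(" ")
    # With no space at all in the window, fall back to a hard cut.
    return head.rstrip() if sep else cut


long_text = "word " * 200  # 1000 characters
short_text = truncate_at_word_boundary(long_text)
print(len(short_text))
```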
Environment Variables
| Variable | Description | Default |
|---|---|---|
| API_KEY | Bearer token for authentication | (none/disabled) |
| MODEL_DIR | Path to model checkpoints directory | checkpoints |
| USE_FP16 | Enable half-precision inference | true |
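
For reference, a server could read these variables roughly as follows. This is a sketch under assumptions: the `EngineConfig` class, the `load_config` helper, and the set of strings accepted as "true" for USE_FP16 are illustrative and are not taken from the Space's source.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngineConfig:
    api_key: Optional[str]  # None means authentication is disabled
    model_dir: str
    use_fp16: bool


def load_config(env=None):
    """Read settings from a mapping (defaults to os.environ) using the documented defaults."""
    env = os.environ if env is None else env
    raw_fp16 = env.get("USE_FP16", "true").strip().lower()
    return EngineConfig(
        api_key=env.get("API_KEY") or None,  # empty string is treated as unset
        model_dir=env.get("MODEL_DIR", "checkpoints"),
        use_fp16=raw_fp16 in {"1", "true", "yes", "on"},  # accepted spellings are assumed
    )


config = load_config({})
print(config)
```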
Deployment
- Create a new HuggingFace Space with Docker SDK
- Upload the contents of this folder
- Set the API_KEY secret in Space settings (optional)
- The model downloads automatically during build (~5 GB)
- Requires GPU (A10G or better recommended for reasonable speed)
- Register the Space URL in VoxLibris Settings under TTS Engine Management