---
title: VoxLibris IndexTTS2 Engine
emoji: 🎙️
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# VoxLibris IndexTTS2 Engine
A HuggingFace Space that serves [IndexTTS2](https://github.com/index-tts/index-tts)
as a REST API, implementing the
[VoxLibris TTS Engine API Contract](https://github.com/your-repo/docs/tts-api-contract.md).
## Endpoints
### POST /GetEngineDetails
Returns engine capabilities, supported emotions, and voice cloning support.
### POST /ConvertTextToSpeech
Converts text to speech with zero-shot voice cloning. Requires a
`voice_to_clone_sample` (base64-encoded WAV). Supports 14 emotions mapped
to IndexTTS2's 8-dimensional emotion vector system.
### GET /health
Returns model loading status.
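
As a rough illustration of calling `/ConvertTextToSpeech`, the sketch below builds the HTTP request with the Python standard library. Only `voice_to_clone_sample` and `intensity` are named in this README; the `text` and `emotion` field names, the JSON body format, and the `build_tts_request` helper itself are assumptions made for the example, so check them against the API contract before relying on them.

```python
import base64
import json
import urllib.request


def build_tts_request(base_url, text, wav_bytes, emotion="neutral",
                      intensity=50, api_key=None):
    """Build (but do not send) a POST request for /ConvertTextToSpeech."""
    payload = {
        "text": text,            # assumed field name
        "emotion": emotion,      # assumed field name
        "intensity": intensity,  # 1-100, per the Emotion Support section
        # Reference voice as base64-encoded WAV, as described in this README.
        "voice_to_clone_sample": base64.b64encode(wav_bytes).decode("ascii"),
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the Space has the API_KEY secret set.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ConvertTextToSpeech",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen(...)`. The response format is not described here, so the sketch stops short of parsing it.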
## Authentication
Set the `API_KEY` secret in your HuggingFace Space settings.
Requests must include an `Authorization: Bearer <your-key>` header.
Leave `API_KEY` unset to disable authentication.
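
The rule above (a bearer token is required when `API_KEY` is set, and no check is made when it is unset) can be sketched as a small function. This is an illustration of the behavior, not the Space's actual code; `is_authorized` is a hypothetical helper name.

```python
import hmac


def is_authorized(authorization_header, api_key):
    """Return True if a request may proceed under the bearer-token rule."""
    if not api_key:
        # API_KEY unset: authentication is disabled.
        return True
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(token.strip().encode("utf-8"),
                               api_key.encode("utf-8"))
```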
## Voice Cloning
IndexTTS2 is a zero-shot voice cloning engine, so every request requires a
reference voice sample. Send a base64-encoded WAV file in the
`voice_to_clone_sample` field. A clear speech sample of 6 to 15 seconds works best.
The engine disentangles speaker timbre from emotional expression, allowing
the cloned voice to speak with different emotions without affecting voice
identity.
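
A client might prepare the reference sample along these lines. The sketch uses only the standard library, reads the clip length from the WAV header, and reports whether it falls in the 6 to 15 second range suggested above. The `encode_reference_wav` helper is hypothetical and assumes an uncompressed PCM WAV that Python's `wave` module can read.

```python
import base64
import io
import wave


def encode_reference_wav(wav_bytes, min_seconds=6.0, max_seconds=15.0):
    """Return (base64_string, duration_seconds, in_recommended_range)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    in_range = min_seconds <= duration <= max_seconds
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return encoded, duration, in_range
```

The range check is advisory only: the engine itself truncates clips longer than 15 seconds (see Limits).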
## Emotion Support
IndexTTS2 uses an 8-dimensional emotion vector system (happy, angry, sad,
afraid, disgusted, melancholic, surprised, calm) with a fine-tuned Qwen3
model for emotion analysis. VoxLibris emotions are automatically mapped
to appropriate vector blends:
| Emotion | Mapping Strategy |
|-------------|---------------------------------------|
| neutral | High calm (0.8) |
| happy | High happy (0.8) |
| sad | High sad (0.8) |
| angry | High angry (0.8) |
| fear | High afraid (0.8) |
| disgust | High disgusted (0.8) |
| surprise | High surprised (0.7) |
| calm | High calm (0.8) |
| excited | Happy (0.6) + surprised (0.2) |
| melancholy | Sad (0.2) + melancholic (0.6) |
| anxious | Afraid (0.5) + slight calm (0.2) |
| hopeful | Happy (0.5) + calm (0.3) |
| tender | Happy (0.2) + calm (0.5) |
| proud | Happy (0.5) + surprised (0.1) |
The `intensity` parameter (1-100) scales the emotion vectors. Additional
prosody reinforcement is applied via pyrubberband speed/pitch adjustments.
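
To make the table concrete, here is a minimal sketch that turns a VoxLibris emotion name into an 8-dimensional vector in the dimension order listed above. The blend values are copied from the table; the linear scaling (`intensity / 100`, so that 100 yields the table values) is an assumption, since this README does not specify how `intensity` is applied. The prosody adjustment step is not modeled.

```python
# Dimension order as listed in this section.
EMOTION_DIMS = ("happy", "angry", "sad", "afraid",
                "disgusted", "melancholic", "surprised", "calm")

# Blend weights taken from the mapping table.
EMOTION_BLENDS = {
    "neutral": {"calm": 0.8},
    "happy": {"happy": 0.8},
    "sad": {"sad": 0.8},
    "angry": {"angry": 0.8},
    "fear": {"afraid": 0.8},
    "disgust": {"disgusted": 0.8},
    "surprise": {"surprised": 0.7},
    "calm": {"calm": 0.8},
    "excited": {"happy": 0.6, "surprised": 0.2},
    "melancholy": {"sad": 0.2, "melancholic": 0.6},
    "anxious": {"afraid": 0.5, "calm": 0.2},
    "hopeful": {"happy": 0.5, "calm": 0.3},
    "tender": {"happy": 0.2, "calm": 0.5},
    "proud": {"happy": 0.5, "surprised": 0.1},
}


def emotion_vector(emotion, intensity=100):
    """Return the 8-dim vector for an emotion, scaled by intensity (1-100)."""
    if not 1 <= intensity <= 100:
        raise ValueError("intensity must be between 1 and 100")
    blend = EMOTION_BLENDS[emotion]
    scale = intensity / 100  # assumed linear scaling
    return [round(blend.get(dim, 0.0) * scale, 4) for dim in EMOTION_DIMS]
```

For example, `emotion_vector("excited")` places 0.6 on `happy` and 0.2 on `surprised`, leaving the other six dimensions at zero.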
## Key Features
- **Emotion-Speaker Disentanglement**: Independent control over voice timbre
(from reference audio) and emotional expression (from emotion vectors)
- **Zero-Shot Voice Cloning**: Clone any voice from a short reference audio
- **Duration Control**: Supports both free generation and explicit token-count
modes for precise audio length
- **Multilingual**: Chinese and English (with more languages supported)
- **Built-in Qwen3 Emotion Model**: Fine-tuned for text-to-emotion analysis
## Limits
- Maximum 500 characters per request (longer text is truncated at a word boundary)
- Output: 22050 Hz mono 16-bit WAV
- Reference audio: max 15 seconds (longer clips are auto-truncated)
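
Clients that want to predict what will be synthesized can apply a similar cut before sending. The following sketch shows one way to truncate at a word boundary. The engine's exact rule may differ (for example in how it treats punctuation), and `truncate_at_word_boundary` is a hypothetical helper.

```python
def truncate_at_word_boundary(text, max_chars=500):
    """Cut text to at most max_chars without splitting a word when possible."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # The cut already ends between words: just drop trailing whitespace.
    if text[max_chars].isspace() or cut[-1].isspace():
        return cut.rstrip()
    # Otherwise drop the partial last word, if there is an earlier word.
    parts = cut.rsplit(None, 1)
    return parts[0] if len(parts) == 2 else cut
```

When the first `max_chars` characters contain no whitespace at all, the sketch falls back to a hard cut.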
## Environment Variables
| Variable | Description | Default |
|-------------|----------------------------------------|-----------------|
| `API_KEY` | Bearer token for authentication | (none/disabled) |
| `MODEL_DIR` | Path to model checkpoints directory | `checkpoints` |
| `USE_FP16` | Enable half-precision inference | `true` |
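
For reference, reading these variables with the documented defaults could look like the sketch below. The `EngineConfig` and `load_config` names are hypothetical, and the set of strings accepted as "true" for `USE_FP16` is an assumption; the engine's own parsing may be stricter or looser.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngineConfig:
    api_key: Optional[str]  # None means authentication is disabled
    model_dir: str
    use_fp16: bool


def load_config(env=None):
    """Read engine settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    raw_fp16 = env.get("USE_FP16", "true").strip().lower()
    return EngineConfig(
        api_key=env.get("API_KEY") or None,  # empty string counts as unset
        model_dir=env.get("MODEL_DIR", "checkpoints"),
        use_fp16=raw_fp16 in {"1", "true", "yes", "on"},  # assumed truthy values
    )
```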
## Deployment
1. Create a new HuggingFace Space with **Docker** SDK
2. Upload the contents of this folder
3. Set the `API_KEY` secret in Space settings (optional)
4. The model downloads automatically during build (~5 GB)
5. Assign GPU hardware to the Space (A10G or better recommended for reasonable speed)
6. Register the Space URL in VoxLibris Settings under TTS Engine Management