title: VoxLibris IndexTTS2 Engine
emoji: 🎙️
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false

VoxLibris IndexTTS2 Engine

A HuggingFace Space that serves IndexTTS2 as a REST API, implementing the VoxLibris TTS Engine API Contract.

Endpoints

POST /GetEngineDetails

Returns engine capabilities, supported emotions, and voice cloning support.

POST /ConvertTextToSpeech

Converts text to speech with zero-shot voice cloning. Requires a voice_to_clone_sample (base64-encoded WAV). Supports 14 emotions mapped to IndexTTS2's 8-dimensional emotion vector system.

GET /health

Returns model loading status.
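As a rough illustration of how a client might address these endpoints, the sketch below builds (but does not send) requests with Python's standard library. The base URL is a placeholder, and sending an empty JSON object as the body of POST /GetEngineDetails is an assumption; the README does not document the request or response schemas.

```python
import json
import urllib.request

# Placeholder URL (assumption): replace with the URL of your own Space.
BASE_URL = "https://example-space.hf.space"


def build_request(path, payload=None, method="POST"):
    """Build, without sending, a request for one of the endpoints above."""
    headers = {}
    data = None
    if payload is not None:
        # Assumption: the API accepts JSON request bodies.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


# GET /health takes no body.
health = build_request("/health", method="GET")

# POST /GetEngineDetails; the empty JSON body is an assumption.
details = build_request("/GetEngineDetails", payload={})
```

Passing either object to urllib.request.urlopen would perform the actual call against a running Space.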

Authentication

Set the API_KEY secret in your HuggingFace Space settings. Requests must then include an Authorization: Bearer <your-key> header. Leave API_KEY unset to disable authentication.
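The behaviour described above could be checked along the following lines. This is only a sketch of the rule (key unset means open access, otherwise a matching Bearer token is required), not the Space's actual code; the function name is_authorized is hypothetical.

```python
import hmac


def is_authorized(authorization_header, api_key):
    """Return True if a request with this Authorization header should pass."""
    if not api_key:
        # API_KEY unset or empty: authentication is disabled.
        return True
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking the key through timing.
    return hmac.compare_digest(token.strip().encode(), api_key.encode())
```

On the client side, the matching header is simply {"Authorization": "Bearer <your-key>"}.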

Voice Cloning

IndexTTS2 is a zero-shot voice cloning engine, so every request requires a reference voice sample. Send a base64-encoded WAV file in the voice_to_clone_sample field. A clear speech sample of 6 to 15 seconds works best.

The engine disentangles speaker timbre from emotional expression, allowing the cloned voice to speak with different emotions without affecting voice identity.
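A minimal sketch of preparing the reference sample might look like this. Only the voice_to_clone_sample field name comes from this README; the text field name is an assumption, and the silent WAV generated here is just a stand-in for a real recording.

```python
import base64
import io
import wave


def make_silent_wav(seconds=1.0, sample_rate=22050):
    """Create an in-memory mono 16-bit WAV (stand-in for a real voice sample)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


def wav_duration_seconds(wav_bytes):
    """Return the duration of a WAV file given as bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()


def build_tts_payload(text, wav_bytes):
    """Build a request body for POST /ConvertTextToSpeech."""
    return {
        "text": text,  # hypothetical field name, not confirmed by the README
        "voice_to_clone_sample": base64.b64encode(wav_bytes).decode("ascii"),
    }


sample = make_silent_wav(seconds=8.0)
payload = build_tts_payload("Hello from VoxLibris.", sample)
```

Checking wav_duration_seconds before sending makes it easy to confirm that a sample falls in the recommended 6 to 15 second range.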

Emotion Support

IndexTTS2 uses an 8-dimensional emotion vector system (happy, angry, sad, afraid, disgusted, melancholic, surprised, calm) with a fine-tuned Qwen3 model for emotion analysis. VoxLibris emotions are automatically mapped to appropriate vector blends:

| Emotion | Mapping Strategy |
|---|---|
| neutral | High calm (0.8) |
| happy | High happy (0.8) |
| sad | High sad (0.8) |
| angry | High angry (0.8) |
| fear | High afraid (0.8) |
| disgust | High disgusted (0.8) |
| surprise | High surprised (0.7) |
| calm | High calm (0.8) |
| excited | Happy (0.6) + surprised (0.2) |
| melancholy | Sad (0.2) + melancholic (0.6) |
| anxious | Afraid (0.5) + slight calm (0.2) |
| hopeful | Happy (0.5) + calm (0.3) |
| tender | Happy (0.2) + calm (0.5) |
| proud | Happy (0.5) + surprised (0.1) |

The intensity parameter (1-100) scales the emotion vectors. Additional prosody reinforcement is applied via pyrubberband speed/pitch adjustments.
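The mapping can be pictured as a lookup followed by a scaling step. In the sketch below the blend weights are taken from the mapping listed above, and the vector order follows the dimension list given earlier. The linear scaling by intensity / 100 is an assumption, since the README does not state the exact formula, and the pyrubberband prosody step is not modelled.

```python
# Dimension order as listed in this README.
EMOTION_DIMS = [
    "happy", "angry", "sad", "afraid",
    "disgusted", "melancholic", "surprised", "calm",
]

# Blend weights copied from the mapping above.
EMOTION_BLENDS = {
    "neutral": {"calm": 0.8},
    "happy": {"happy": 0.8},
    "sad": {"sad": 0.8},
    "angry": {"angry": 0.8},
    "fear": {"afraid": 0.8},
    "disgust": {"disgusted": 0.8},
    "surprise": {"surprised": 0.7},
    "calm": {"calm": 0.8},
    "excited": {"happy": 0.6, "surprised": 0.2},
    "melancholy": {"sad": 0.2, "melancholic": 0.6},
    "anxious": {"afraid": 0.5, "calm": 0.2},
    "hopeful": {"happy": 0.5, "calm": 0.3},
    "tender": {"happy": 0.2, "calm": 0.5},
    "proud": {"happy": 0.5, "surprised": 0.1},
}


def emotion_vector(emotion, intensity=100):
    """Return an 8-element vector for a VoxLibris emotion name."""
    if not 1 <= intensity <= 100:
        raise ValueError("intensity must be between 1 and 100")
    blend = EMOTION_BLENDS[emotion]
    # Assumption: intensity scales the blend linearly (100 = table values).
    scale = intensity / 100.0
    return [blend.get(dim, 0.0) * scale for dim in EMOTION_DIMS]
```

For example, emotion_vector("excited") places weight on the happy and surprised dimensions and leaves the other six at zero.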

Key Features

  • Emotion-Speaker Disentanglement: Independent control over voice timbre (from reference audio) and emotional expression (from emotion vectors)
  • Zero-Shot Voice Cloning: Clone any voice from a short reference audio
  • Duration Control: Supports both free generation and explicit token-count modes for precise audio length
  • Multilingual: Chinese and English, with additional languages also supported
  • Built-in Qwen3 Emotion Model: Fine-tuned for text-to-emotion analysis

Limits

  • Maximum 500 characters per request (longer text is truncated at a word boundary)
  • Output: 22050 Hz mono 16-bit WAV
  • Reference audio: max 15 seconds (longer clips are auto-truncated)
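Clients that prefer to stay within these limits themselves, rather than rely on server-side truncation, could pre-process inputs roughly as follows. Both helpers are illustrative sketches with hypothetical names; the engine's own truncation logic may differ in detail.

```python
import io
import wave


def truncate_at_word_boundary(text, limit=500):
    """Cut text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the next character is not whitespace, the cut fell mid-word:
    # back up to the last space, if there is one.
    if not text[limit].isspace():
        head, sep, _ = cut.rpartition(" ")
        if sep:
            cut = head
    return cut.rstrip()


def trim_wav(wav_bytes, max_seconds=15.0):
    """Keep only the first `max_seconds` of a WAV file given as bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        max_frames = int(max_seconds * src.getframerate())
        frames = src.readframes(min(src.getnframes(), max_frames))
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)
        # The header's frame count is corrected to match the data written.
        dst.writeframes(frames)
    return out.getvalue()
```

A single word longer than the limit is cut hard at the limit, since no word boundary is available.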

Environment Variables

| Variable | Description | Default |
|---|---|---|
| API_KEY | Bearer token for authentication | (none/disabled) |
| MODEL_DIR | Path to model checkpoints directory | checkpoints |
| USE_FP16 | Enable half-precision inference | true |
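One way these variables might be read at startup is sketched below, using the defaults listed above. The function name load_settings and the set of strings accepted as true for USE_FP16 are assumptions.

```python
import os


def load_settings(env=None):
    """Read engine settings from the environment, applying the defaults above."""
    if env is None:
        env = os.environ
    return {
        # An unset or empty API_KEY means authentication is disabled.
        "api_key": env.get("API_KEY") or None,
        "model_dir": env.get("MODEL_DIR", "checkpoints"),
        # Assumption: "1", "true" and "yes" (any case) enable half precision.
        "use_fp16": env.get("USE_FP16", "true").strip().lower()
        in {"1", "true", "yes"},
    }
```

Passing a plain dict instead of os.environ makes the parsing easy to try out without touching the real environment.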

Deployment

  1. Create a new HuggingFace Space with the Docker SDK
  2. Upload the contents of this folder
  3. Set the API_KEY secret in Space settings (optional)
  4. The model downloads automatically during build (~5 GB)
  5. Requires GPU (A10G or better recommended for reasonable speed)
  6. Register the Space URL in VoxLibris Settings under TTS Engine Management