Spaces:

CherithCutestory
/

vlengine-indextts2

Paused

App Files Files Community

vlengine-indextts2 / README.md

CherithCutestory

Moved files to right place

d8290d9 3 months ago

preview code

raw

history blame contribute delete

4.13 kB

	---
	title: VoxLibris IndexTTS2 Engine
	emoji: 🎙️
	colorFrom: purple
	colorTo: indigo
	sdk: docker
	app_port: 7860
	pinned: false
	---

	# VoxLibris IndexTTS2 Engine

	A HuggingFace Space that serves [IndexTTS2](https://github.com/index-tts/index-tts)
	as a REST API, implementing the
	[VoxLibris TTS Engine API Contract](https://github.com/your-repo/docs/tts-api-contract.md).

	## Endpoints

	### POST /GetEngineDetails

	Returns engine capabilities, supported emotions, and voice cloning support.

	### POST /ConvertTextToSpeech

	Converts text to speech with zero-shot voice cloning. Requires a
	`voice_to_clone_sample` (base64-encoded WAV). Supports 14 emotions mapped
	to IndexTTS2's 8-dimensional emotion vector system.

	### GET /health

	Returns model loading status.

	## Authentication

	Set the `API_KEY` secret in your HuggingFace Space settings.
	Requests must include `Authorization: Bearer <your-key>` header.
	Leave `API_KEY` unset to disable authentication.

	## Voice Cloning

	IndexTTS2 is a zero-shot voice cloning engine — every request requires a
	reference voice sample. Send a base64-encoded WAV file in the
	`voice_to_clone_sample` field. A 6-15 second clear speech sample works best.

	The engine disentangles speaker timbre from emotional expression, allowing
	the cloned voice to speak with different emotions without affecting voice
	identity.

	## Emotion Support

	IndexTTS2 uses an 8-dimensional emotion vector system (happy, angry, sad,
	afraid, disgusted, melancholic, surprised, calm) with a fine-tuned Qwen3
	model for emotion analysis. VoxLibris emotions are automatically mapped
	to appropriate vector blends:

	\| Emotion \| Mapping Strategy \|
	\|-------------\|---------------------------------------\|
	\| neutral \| High calm (0.8) \|
	\| happy \| High happy (0.8) \|
	\| sad \| High sad (0.8) \|
	\| angry \| High angry (0.8) \|
	\| fear \| High afraid (0.8) \|
	\| disgust \| High disgusted (0.8) \|
	\| surprise \| High surprised (0.7) \|
	\| calm \| High calm (0.8) \|
	\| excited \| Happy (0.6) + surprised (0.2) \|
	\| melancholy \| Sad (0.2) + melancholic (0.6) \|
	\| anxious \| Afraid (0.5) + slight calm (0.2) \|
	\| hopeful \| Happy (0.5) + calm (0.3) \|
	\| tender \| Happy (0.2) + calm (0.5) \|
	\| proud \| Happy (0.5) + surprised (0.1) \|

	The `intensity` parameter (1-100) scales the emotion vectors. Additional
	prosody reinforcement is applied via pyrubberband speed/pitch adjustments.

	## Key Features

	- Emotion-Speaker Disentanglement: Independent control over voice timbre
	(from reference audio) and emotional expression (from emotion vectors)
	- Zero-Shot Voice Cloning: Clone any voice from a short reference audio
	- Duration Control: Supports both free generation and explicit token-count
	modes for precise audio length
	- Multilingual: Chinese and English (with more languages supported)
	- Built-in Qwen3 Emotion Model: Fine-tuned for text-to-emotion analysis

	## Limits

	- Maximum 500 characters per request (longer text is truncated at word boundary)
	- Output: 22050 Hz mono 16-bit WAV
	- Reference audio: max 15 seconds (longer clips are auto-truncated)

	## Environment Variables

	\| Variable \| Description \| Default \|
	\|-------------\|----------------------------------------\|-----------------\|
	\| `API_KEY` \| Bearer token for authentication \| (none/disabled) \|
	\| `MODEL_DIR` \| Path to model checkpoints directory \| `checkpoints` \|
	\| `USE_FP16` \| Enable half-precision inference \| `true` \|

	## Deployment

	1. Create a new HuggingFace Space with Docker SDK
	2. Upload the contents of this folder
	3. Set the `API_KEY` secret in Space settings (optional)
	4. The model downloads automatically during build (~5 GB)
	5. Requires GPU (A10G or better recommended for reasonable speed)
	6. Register the Space URL in VoxLibris Settings under TTS Engine Management