---
title: VoxLibris StyleTTS2 Engine
emoji: 🎭
colorFrom: purple
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
---

# VoxLibris StyleTTS2 TTS Engine

A HuggingFace Space that serves [StyleTTS2](https://github.com/yl4579/StyleTTS2) as a REST API for text-to-speech with emotion control and voice cloning, implementing the [VoxLibris TTS Engine API Contract](../../docs/tts-api-contract.md).

StyleTTS2 achieves human-level TTS synthesis through style diffusion and adversarial training with large speech language models (SLMs). It uses a diffusion model to generate the most suitable speaking style for the given text.

## Endpoints

### POST /GetEngineDetails

Returns engine capabilities including supported emotions and voice cloning support.

### POST /ConvertTextToSpeech

Converts text to speech with rich style control:

- **Emotion control**: neutral, happy, sad, angry, fear, excited, calm, surprised, whisper
- **Intensity**: Scales `embedding_scale` to control how strongly the emotion affects output
- **Voice cloning**: Upload reference audio (base64 WAV) to clone voice timbre and prosody
- **Speed/pitch adjustment**: Via pyrubberband post-processing
- **Long-form support**: Automatically uses `long_inference` for longer texts with style continuity

### GET /health

Returns model loading status.

### GET /

Built-in test frontend with emotion selection, voice cloning upload, and parameter controls.

## How Emotion Control Works

StyleTTS2 doesn't use explicit emotion tags.
Instead, it controls expressiveness through diffusion parameters:

| Parameter | What it controls | Range |
|-----------|-----------------|-------|
| `alpha` | Timbre (0=reference voice, 1=text-predicted style) | 0.0 - 1.0 |
| `beta` | Prosody (0=reference voice, 1=text-predicted style) | 0.0 - 1.0 |
| `embedding_scale` | Expressiveness (higher=more emotional/dramatic) | 0.1 - 5.0 |
| `diffusion_steps` | Style diversity (more steps=more varied output) | 3 - 20 |

Each emotion preset maps to a tuned combination of these parameters. The intensity slider scales `embedding_scale` to make the emotion more or less pronounced.

## Authentication

Set the `API_KEY` secret in your HuggingFace Space settings. Requests must include an `Authorization: Bearer <API_KEY>` header, where `<API_KEY>` is the value of that secret. Leave `API_KEY` unset to disable authentication.

## Hardware Requirements

Requires a GPU for reasonable inference speed. Recommended: T4 or better. The LibriTTS multi-speaker model (~1.8 GB) downloads automatically on first startup.

## Deployment

1. Create a new HuggingFace Space with the **Docker** SDK
2. Upload the contents of this folder
3. Select a GPU runtime (T4 recommended)
4. Set the `API_KEY` secret in Space settings (optional)
5. The model downloads automatically on first startup (~2 GB)
6. Register the Space URL in VoxLibris Settings under TTS Engine Management
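
Once the Space is running, a quick smoke test against `GET /health` can confirm that the model has loaded before you register the URL. The following is a minimal sketch rather than an official client: the helper names (`build_headers`, `endpoint_url`, `check_health`) are hypothetical, and it assumes `/health` returns a JSON body and that sending the `Authorization` header to it is harmless. The exact response fields are defined by the API contract and are not assumed here.

```python
import json
import urllib.request
from typing import Optional


def build_headers(api_key: Optional[str] = None) -> dict:
    """Build request headers; add the Bearer token only when a key is given."""
    headers = {"Accept": "application/json"}
    if api_key:
        # Matches the `Authorization: Bearer <API_KEY>` scheme described above.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


def endpoint_url(space_url: str, path: str) -> str:
    """Join the Space base URL and an endpoint path with exactly one slash."""
    return space_url.rstrip("/") + "/" + path.lstrip("/")


def check_health(space_url: str, api_key: Optional[str] = None,
                 timeout: float = 30.0) -> dict:
    """Call GET /health and return the decoded body.

    Assumption: the endpoint responds with JSON. Adjust the decoding if
    your deployment returns something else.
    """
    request = urllib.request.Request(
        endpoint_url(space_url, "/health"),
        headers=build_headers(api_key),
        method="GET",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To use it, call `check_health` with your Space URL and, if authentication is enabled, the value of the `API_KEY` secret. With `API_KEY` unset on the Space, omit the key and no `Authorization` header is sent.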