title: VoxLibris StyleTTS2 Engine
emoji: 🎭
colorFrom: purple
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
VoxLibris StyleTTS2 TTS Engine
A HuggingFace Space that serves StyleTTS2 as a REST API for text-to-speech with emotion control and voice cloning, implementing the VoxLibris TTS Engine API Contract.
StyleTTS2 achieves human-level TTS synthesis through style diffusion and adversarial training with large speech language models (SLMs). It uses a diffusion model to generate the most suitable speaking style for the given text.
Endpoints
POST /GetEngineDetails
Returns engine capabilities including supported emotions and voice cloning support.
POST /ConvertTextToSpeech
Converts text to speech with rich style control:
- Emotion control: neutral, happy, sad, angry, fear, excited, calm, surprised, whisper
- Intensity: Scales the embedding_scale to control how strongly the emotion affects output
- Voice cloning: Upload reference audio (base64 WAV) to clone voice timbre and prosody
- Speed/pitch adjustment: Via pyrubberband post-processing
- Long-form support: Automatically uses long_inference for longer texts with style continuity
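As a rough illustration of how a client might call this endpoint, the sketch below builds (but does not send) a JSON POST request using only the Python standard library. The JSON field names (`text`, `emotion`, `intensity`, `reference_audio`) and the helper `build_tts_request` are assumptions made for illustration; the authoritative schema is the VoxLibris TTS Engine API Contract, which this README does not reproduce.

```python
import base64
import json
import urllib.request


def build_tts_request(base_url, text, emotion="neutral", intensity=1.0,
                      reference_wav=None, api_key=None):
    """Build (without sending) a POST request for /ConvertTextToSpeech.

    NOTE: the payload field names are hypothetical; check the VoxLibris
    TTS Engine API Contract for the real ones.
    """
    payload = {"text": text, "emotion": emotion, "intensity": intensity}

    # Voice cloning: reference audio is sent as base64-encoded WAV bytes.
    if reference_wav is not None:
        payload["reference_audio"] = base64.b64encode(reference_wav).decode("ascii")

    headers = {"Content-Type": "application/json"}
    # Only attach the bearer token when the Space has API_KEY configured.
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ConvertTextToSpeech",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The response format is not described here, so the sketch stops at request construction.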
GET /health
Returns model loading status.
GET /
Built-in test frontend with emotion selection, voice cloning upload, and parameter controls.
How Emotion Control Works
StyleTTS2 doesn't use explicit emotion tags. Instead, it controls expressiveness through diffusion parameters:
| Parameter | What it controls | Range |
|---|---|---|
| alpha | Timbre (0=reference voice, 1=text-predicted style) | 0.0 - 1.0 |
| beta | Prosody (0=reference voice, 1=text-predicted style) | 0.0 - 1.0 |
| embedding_scale | Expressiveness (higher=more emotional/dramatic) | 0.1 - 5.0 |
| diffusion_steps | Style diversity (more steps=more varied output) | 3 - 20 |
Each emotion preset maps to tuned combinations of these parameters. The intensity slider scales embedding_scale to make the emotion more or less pronounced.
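A minimal sketch of this mapping follows. The preset values, the multiplicative scaling rule, and the names `StyleParams`, `PRESETS`, and `resolve_style` are all assumptions for illustration; the engine's tuned presets are not listed in this README. Only the clamping range for embedding_scale (0.1 to 5.0) comes from the table above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StyleParams:
    alpha: float            # timbre: 0 = reference voice, 1 = text-predicted
    beta: float             # prosody: 0 = reference voice, 1 = text-predicted
    embedding_scale: float  # expressiveness
    diffusion_steps: int    # style diversity


# Hypothetical preset values, NOT the engine's actual tuning.
PRESETS = {
    "neutral": StyleParams(alpha=0.3, beta=0.7, embedding_scale=1.0, diffusion_steps=5),
    "happy": StyleParams(alpha=0.3, beta=0.8, embedding_scale=2.0, diffusion_steps=10),
    "whisper": StyleParams(alpha=0.2, beta=0.9, embedding_scale=1.5, diffusion_steps=10),
}


def resolve_style(emotion, intensity=1.0):
    """Look up a preset and scale its embedding_scale by intensity.

    Assumes intensity acts as a simple multiplier; the result is clamped
    to the documented 0.1 - 5.0 range. Unknown emotions fall back to neutral.
    """
    base = PRESETS.get(emotion, PRESETS["neutral"])
    scaled = max(0.1, min(5.0, base.embedding_scale * intensity))
    return replace(base, embedding_scale=scaled)
```

Under these assumed values, `resolve_style("happy", 1.5)` leaves alpha, beta, and diffusion_steps untouched and raises only embedding_scale, which is the behavior the intensity slider is described as having.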
Authentication
Set the API_KEY secret in your HuggingFace Space settings.
Requests must include an Authorization: Bearer <your-key> header.
Leave API_KEY unset to disable authentication.
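The rule above can be sketched as a small check. This is an illustrative reconstruction, not the Space's actual code: the function name `is_authorized` is hypothetical, and the use of a constant-time comparison is a design choice made here, not something the README states.

```python
import hmac


def is_authorized(authorization_header, api_key):
    """Return True if a request may proceed under the API_KEY rule.

    authorization_header: raw value of the Authorization header, or None.
    api_key: the configured API_KEY secret, or None/"" when unset.
    """
    # An unset API_KEY disables authentication entirely.
    if not api_key:
        return True
    if not authorization_header:
        return False

    # Expect the form "Bearer <token>".
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False

    # Constant-time comparison avoids leaking key contents via timing.
    return hmac.compare_digest(token.strip().encode("utf-8"),
                               api_key.encode("utf-8"))
```

A server would typically call this with the incoming header and `os.environ.get("API_KEY")`, returning a 401 response when it yields False.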
Hardware Requirements
Requires GPU for reasonable inference speed. Recommended: T4 or better. The LibriTTS multi-speaker model (~1.8 GB) downloads automatically on first startup.
Deployment
- Create a new HuggingFace Space with Docker SDK
- Upload the contents of this folder
- Select a GPU runtime (T4 recommended)
- Set the API_KEY secret in Space settings (optional)
- The model downloads automatically on first startup (~2 GB)
- Register the Space URL in VoxLibris Settings under TTS Engine Management