---
title: VoxLibris StyleTTS2 Engine
emoji: 🎭
colorFrom: purple
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
---
# VoxLibris StyleTTS2 TTS Engine
A HuggingFace Space that serves [StyleTTS2](https://github.com/yl4579/StyleTTS2) as a REST API for
text-to-speech with emotion control and voice cloning, implementing the
[VoxLibris TTS Engine API Contract](../../docs/tts-api-contract.md).
StyleTTS2 targets human-level TTS synthesis by combining style diffusion with adversarial
training against large speech language models (SLMs). It uses a diffusion model to
sample the speaking style that best suits the given text.
## Endpoints
### POST /GetEngineDetails
Returns engine capabilities including supported emotions and voice cloning support.
### POST /ConvertTextToSpeech
Converts text to speech with rich style control:
- **Emotion control**: neutral, happy, sad, angry, fear, excited, calm, surprised, whisper
- **Intensity**: Scales `embedding_scale` to control how strongly the emotion affects the output
- **Voice cloning**: Upload reference audio (base64 WAV) to clone voice timbre and prosody
- **Speed/pitch adjustment**: Via pyrubberband post-processing
- **Long-form support**: Automatically uses `long_inference` for longer texts with style continuity
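
As an illustration of the options above, the following is a minimal sketch of how a client might assemble a request body for `POST /ConvertTextToSpeech`. The field names (`text`, `emotion`, `intensity`, `reference_audio_base64`) and the helper `build_tts_request` are assumptions made for this example; the actual schema is defined by the VoxLibris TTS Engine API Contract. Only the emotion names and the base64 WAV encoding come from this README.

```python
import base64
import json

# Emotion names listed in this README.
SUPPORTED_EMOTIONS = {
    "neutral", "happy", "sad", "angry", "fear",
    "excited", "calm", "surprised", "whisper",
}


def build_tts_request(text, emotion="neutral", intensity=1.0, reference_wav=None):
    """Return a JSON string for POST /ConvertTextToSpeech.

    NOTE: every field name below is hypothetical; consult the API
    contract for the real request schema.
    """
    if emotion not in SUPPORTED_EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")

    body = {
        "text": text,
        "emotion": emotion,
        "intensity": float(intensity),
    }
    if reference_wav is not None:
        # Voice cloning: reference audio travels as base64-encoded WAV bytes.
        body["reference_audio_base64"] = base64.b64encode(reference_wav).decode("ascii")
    return json.dumps(body)
```

The resulting string can be sent with any HTTP client as the request body, together with `Content-Type: application/json`.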
### GET /health
Returns model loading status.
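
Because the model downloads on first startup, a client may want to poll `/health` before sending synthesis requests. The sketch below shows one way to do that. The response field `model_loaded` and the helper `wait_until_ready` are assumptions for illustration; the real response shape may differ. The HTTP call is passed in as a callable, so the polling logic does not depend on any particular HTTP library.

```python
import time


def wait_until_ready(fetch_status, timeout_s=600.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the engine reports the model as loaded.

    fetch_status: callable returning the parsed JSON body of GET /health.
    Returns True once the (hypothetical) "model_loaded" field is truthy,
    or False if timeout_s elapses first.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status.get("model_loaded"):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```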
### GET /
Built-in test frontend with emotion selection, voice cloning upload, and parameter controls.
## How Emotion Control Works
StyleTTS2 doesn't use explicit emotion tags. Instead, it controls expressiveness through
diffusion parameters:
| Parameter | What it controls | Range |
|-----------|-----------------|-------|
| `alpha` | Timbre (0=reference voice, 1=text-predicted style) | 0.0 - 1.0 |
| `beta` | Prosody (0=reference voice, 1=text-predicted style) | 0.0 - 1.0 |
| `embedding_scale` | Expressiveness (higher=more emotional/dramatic) | 0.1 - 5.0 |
| `diffusion_steps` | Style diversity (more steps=more varied output) | 3 - 20 |
Each emotion preset maps to tuned combinations of these parameters. The intensity
slider scales `embedding_scale` to make the emotion more or less pronounced.
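
The preset-plus-intensity idea can be sketched as follows. The preset values, the `StyleParams` type, and the `apply_intensity` helper are all hypothetical and are not the engine's tuned settings. The sketch also assumes the scaled value is clamped to the 0.1 - 5.0 range given in the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StyleParams:
    alpha: float            # timbre: 0 = reference voice, 1 = text-predicted
    beta: float             # prosody: 0 = reference voice, 1 = text-predicted
    embedding_scale: float  # expressiveness
    diffusion_steps: int    # style diversity


# Hypothetical presets for illustration only.
PRESETS = {
    "neutral": StyleParams(alpha=0.3, beta=0.7, embedding_scale=1.0, diffusion_steps=5),
    "excited": StyleParams(alpha=0.3, beta=0.9, embedding_scale=2.0, diffusion_steps=10),
    "calm": StyleParams(alpha=0.3, beta=0.5, embedding_scale=0.8, diffusion_steps=5),
}

EMBEDDING_SCALE_MIN = 0.1
EMBEDDING_SCALE_MAX = 5.0


def apply_intensity(emotion, intensity):
    """Scale the preset's embedding_scale by intensity, clamped to the table's range."""
    base = PRESETS[emotion]
    scaled = base.embedding_scale * intensity
    clamped = min(EMBEDDING_SCALE_MAX, max(EMBEDDING_SCALE_MIN, scaled))
    # Only embedding_scale changes; alpha, beta and diffusion_steps keep their preset values.
    return replace(base, embedding_scale=clamped)
```

Under these assumptions, `apply_intensity("excited", 1.5)` keeps the preset's `alpha`, `beta`, and `diffusion_steps` and raises `embedding_scale` from 2.0 to 3.0.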
## Authentication
Set the `API_KEY` secret in your HuggingFace Space settings.
Requests must include an `Authorization: Bearer <your-key>` header.
Leave `API_KEY` unset to disable authentication.
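
The rules above amount to a small check on each request. The following is a sketch of that logic, not the Space's actual implementation. The function name `is_authorized` is hypothetical, and the constant-time comparison is a design choice made for this example.

```python
import hmac


def is_authorized(auth_header, api_key):
    """Return True if the request may proceed.

    auth_header: value of the Authorization header, or None if absent.
    api_key: configured API_KEY secret, or None/"" when unset.
    """
    if not api_key:
        # API_KEY unset: authentication is disabled.
        return True
    if not auth_header:
        return False

    scheme, _, token = auth_header.partition(" ")
    if scheme.lower() != "bearer":
        return False

    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(token.strip().encode("utf-8"), api_key.encode("utf-8"))
```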
## Hardware Requirements
A GPU is required for reasonable inference speed. Recommended: T4 or better.
The LibriTTS multi-speaker model (~1.8 GB) downloads automatically on first startup.
## Deployment
1. Create a new HuggingFace Space with the **Docker** SDK
2. Upload the contents of this folder
3. Select a GPU runtime (T4 recommended)
4. Set the `API_KEY` secret in Space settings (optional)
5. The model downloads automatically on first startup (~1.8 GB)
6. Register the Space URL in VoxLibris Settings under TTS Engine Management