File size: 2,857 Bytes
aa976ea
42b0869
 
 
 
c615db3
 
 
aa976ea
42b0869
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
---
title: VoxLibris StyleTTS2 Engine
emoji: 🎭
colorFrom: purple
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
---

# VoxLibris StyleTTS2 TTS Engine

A HuggingFace Space that serves [StyleTTS2](https://github.com/yl4579/StyleTTS2) as a REST API for
text-to-speech with emotion control and voice cloning, implementing the
[VoxLibris TTS Engine API Contract](../../docs/tts-api-contract.md).

StyleTTS2 achieves human-level TTS synthesis through style diffusion and adversarial
training with large speech language models (SLMs). It uses a diffusion model to
generate the most suitable speaking style for the given text.

## Endpoints

### POST /GetEngineDetails

Returns engine capabilities including supported emotions and voice cloning support.

### POST /ConvertTextToSpeech

Converts text to speech with rich style control:
- **Emotion control**: neutral, happy, sad, angry, fear, excited, calm, surprised, whisper
- **Intensity**: Scales the embedding_scale to control how strongly the emotion affects output
- **Voice cloning**: Upload reference audio (base64 WAV) to clone voice timbre and prosody
- **Speed/pitch adjustment**: Via pyrubberband post-processing
- **Long-form support**: Automatically uses `long_inference` for longer texts with style continuity

### GET /health

Returns model loading status.

### GET /

Built-in test frontend with emotion selection, voice cloning upload, and parameter controls.

## How Emotion Control Works

StyleTTS2 doesn't use explicit emotion tags. Instead, it controls expressiveness through
diffusion parameters:

| Parameter | What it controls | Range |
|-----------|-----------------|-------|
| `alpha` | Timbre (0=reference voice, 1=text-predicted style) | 0.0 - 1.0 |
| `beta` | Prosody (0=reference voice, 1=text-predicted style) | 0.0 - 1.0 |
| `embedding_scale` | Expressiveness (higher=more emotional/dramatic) | 0.1 - 5.0 |
| `diffusion_steps` | Style diversity (more steps=more varied output) | 3 - 20 |

Each emotion preset maps to tuned combinations of these parameters. The intensity
slider scales `embedding_scale` to make the emotion more or less pronounced.

## Authentication

Set the `API_KEY` secret in your HuggingFace Space settings.
Requests must include `Authorization: Bearer <your-key>` header.
Leave `API_KEY` unset to disable authentication.

## Hardware Requirements

Requires GPU for reasonable inference speed. Recommended: T4 or better.
The LibriTTS multi-speaker model (~1.8 GB) downloads automatically on first startup.

## Deployment

1. Create a new HuggingFace Space with **Docker** SDK
2. Upload the contents of this folder
3. Select a GPU runtime (T4 recommended)
4. Set the `API_KEY` secret in Space settings (optional)
5. The model downloads automatically on first startup (~2 GB)
6. Register the Space URL in VoxLibris Settings under TTS Engine Management