metadata

title: VoxLibris StyleTTS2 Engine
emoji: 🎭
colorFrom: purple
colorTo: pink
sdk: docker
app_port: 7860
pinned: false

VoxLibris StyleTTS2 TTS Engine

A HuggingFace Space that serves StyleTTS2 as a REST API for text-to-speech with emotion control and voice cloning, implementing the VoxLibris TTS Engine API Contract.

StyleTTS2 achieves human-level TTS synthesis through style diffusion and adversarial training with large speech language models (SLMs). It uses a diffusion model to generate the most suitable speaking style for the given text.

Endpoints

POST /GetEngineDetails

Returns engine capabilities including supported emotions and voice cloning support.

POST /ConvertTextToSpeech

Converts text to speech with rich style control:

Emotion control: neutral, happy, sad, angry, fear, excited, calm, surprised, whisper
Intensity: Scales embedding_scale to control how strongly the emotion affects the output
Voice cloning: Upload reference audio (base64 WAV) to clone voice timbre and prosody
Speed/pitch adjustment: Via pyrubberband post-processing
Long-form support: Automatically uses long_inference for longer texts with style continuity
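
As a rough illustration of the options above, the following sketch assembles a JSON body for POST /ConvertTextToSpeech. The field names (`text`, `emotion`, `intensity`, `reference_audio`) are hypothetical placeholders, not taken from this README; the actual schema is defined by the VoxLibris TTS Engine API Contract. Only the emotion list and the base64 WAV encoding come from the description above.

```python
import base64
import json

# Emotion names as listed for this endpoint.
SUPPORTED_EMOTIONS = {
    "neutral", "happy", "sad", "angry", "fear",
    "excited", "calm", "surprised", "whisper",
}


def build_tts_request(text, emotion="neutral", intensity=1.0, reference_wav=None):
    """Return a JSON string for POST /ConvertTextToSpeech.

    All field names are assumptions made for illustration only.
    """
    if emotion not in SUPPORTED_EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    payload = {"text": text, "emotion": emotion, "intensity": intensity}
    if reference_wav is not None:
        # Voice cloning: the reference audio travels as base64-encoded WAV bytes.
        payload["reference_audio"] = base64.b64encode(reference_wav).decode("ascii")
    return json.dumps(payload)


# Placeholder bytes stand in for a real WAV file read from disk.
body = build_tts_request(
    "Hello there.", emotion="happy", intensity=0.8, reference_wav=b"RIFF....WAVE"
)
```

Validating the emotion name on the client side is optional; it simply fails fast before a request is sent.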

GET /health

Returns model loading status.

GET /

Built-in test frontend with emotion selection, voice cloning upload, and parameter controls.

How Emotion Control Works

StyleTTS2 doesn't use explicit emotion tags. Instead, it controls expressiveness through diffusion parameters:

Parameter	What it controls	Range
`alpha`	Timbre (0=reference voice, 1=text-predicted style)	0.0 - 1.0
`beta`	Prosody (0=reference voice, 1=text-predicted style)	0.0 - 1.0
`embedding_scale`	Expressiveness (higher=more emotional/dramatic)	0.1 - 5.0
`diffusion_steps`	Style diversity (more steps=more varied output)	3 - 20

Each emotion preset maps to tuned combinations of these parameters. The intensity slider scales embedding_scale to make the emotion more or less pronounced.
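
The preset-plus-intensity idea can be sketched as follows. The preset values are invented for illustration and are not the Space's tuned numbers. Multiplying embedding_scale by the intensity and clamping the result to the 0.1 - 5.0 range from the table are also assumptions about how the scaling might be done.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StyleParams:
    alpha: float            # timbre: 0 = reference voice, 1 = text-predicted style
    beta: float             # prosody: 0 = reference voice, 1 = text-predicted style
    embedding_scale: float  # expressiveness
    diffusion_steps: int    # style diversity


# Hypothetical presets; the real tuned values are not documented here.
PRESETS = {
    "neutral": StyleParams(alpha=0.3, beta=0.7, embedding_scale=1.0, diffusion_steps=5),
    "excited": StyleParams(alpha=0.3, beta=0.9, embedding_scale=2.0, diffusion_steps=10),
}


def apply_intensity(preset, intensity):
    """Scale embedding_scale by intensity, clamped to the documented 0.1 - 5.0 range."""
    scaled = preset.embedding_scale * intensity
    clamped = min(5.0, max(0.1, scaled))
    # Only embedding_scale changes; alpha, beta, and diffusion_steps are kept.
    return replace(preset, embedding_scale=clamped)


params = apply_intensity(PRESETS["excited"], 1.5)
```

Keeping the presets immutable and returning a modified copy means one request's intensity cannot leak into the next.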

Authentication

Set the API_KEY secret in your HuggingFace Space settings. Requests must then include an Authorization: Bearer <your-key> header. Leave API_KEY unset to disable authentication.
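
A minimal client-side sketch of attaching the header, using only the Python standard library. The Space URL is a placeholder, and sending a JSON content type is an assumption; the function only builds the request object and does not send it.

```python
import urllib.request


def build_request(space_url, path, api_key=None, data=None):
    """Build a POST request, adding the bearer token only when a key is given."""
    headers = {"Content-Type": "application/json"}  # assumed content type
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        space_url.rstrip("/") + path, data=data, headers=headers, method="POST"
    )


# Placeholder URL; substitute the URL of your own Space.
req = build_request("https://example.invalid/", "/GetEngineDetails", api_key="secret")
```

Passing the result to `urllib.request.urlopen(req)` would send it. With API_KEY unset on the Space, omit `api_key` and no Authorization header is added.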

Hardware Requirements

A GPU is required for reasonable inference speed; a T4 or better is recommended. The LibriTTS multi-speaker model (~1.8 GB) downloads automatically on first startup.

Deployment

Create a new HuggingFace Space with Docker SDK
Upload the contents of this folder
Select a GPU runtime (T4 recommended)
Set the API_KEY secret in Space settings (optional)
The model downloads automatically on first startup (~1.8 GB)
Register the Space URL in VoxLibris Settings under TTS Engine Management