	---
	title: VoxLibris StyleTTS2 Engine
	emoji: 🎭
	colorFrom: purple
	colorTo: pink
	sdk: docker
	app_port: 7860
	pinned: false
	---

	# VoxLibris StyleTTS2 TTS Engine

	A HuggingFace Space that serves [StyleTTS2](https://github.com/yl4579/StyleTTS2) as a REST API for
	text-to-speech with emotion control and voice cloning, implementing the
	[VoxLibris TTS Engine API Contract](../../docs/tts-api-contract.md).

	StyleTTS2 aims for human-level TTS synthesis through style diffusion and adversarial
	training with large speech language models (SLMs). It uses a diffusion model to
	sample the speaking style best suited to the given text.

	## Endpoints

	### POST /GetEngineDetails

	Returns engine capabilities including supported emotions and voice cloning support.

	### POST /ConvertTextToSpeech

	Converts text to speech with the following style controls:
	- Emotion control: neutral, happy, sad, angry, fear, excited, calm, surprised, whisper
	- Intensity: scales `embedding_scale` to control how strongly the emotion affects the output
	- Voice cloning: supply reference audio (base64-encoded WAV) to clone voice timbre and prosody
	- Speed/pitch adjustment: applied as post-processing via pyrubberband
	- Long-form support: longer texts automatically use `long_inference` to preserve style continuity
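
As a rough illustration of the voice-cloning input, the sketch below base64-encodes a WAV file and places it in a JSON request body. The field names (`text`, `emotion`, `intensity`, `reference_audio`), the 24 kHz sample rate, and both helper functions are assumptions made for this example; the API contract linked above is the authority on the real schema.

```python
import base64
import io
import json
import wave


def encode_wav_base64(wav_bytes: bytes) -> str:
    """Base64-encode raw WAV file bytes as an ASCII string."""
    return base64.b64encode(wav_bytes).decode("ascii")


def make_silent_wav(seconds: float = 0.1, sample_rate: int = 24000) -> bytes:
    """Build a short mono 16-bit WAV in memory (a stand-in for real reference audio)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


# Hypothetical field names; check the API contract for the real ones.
payload = {
    "text": "Hello from VoxLibris.",
    "emotion": "calm",
    "intensity": 0.8,
    "reference_audio": encode_wav_base64(make_silent_wav()),
}
body = json.dumps(payload)
```

In practice the reference audio would be read from disk (for example with `open(path, "rb").read()`) instead of being generated, and `body` would be sent as the POST body to `/ConvertTextToSpeech`.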

	### GET /health

	Returns model loading status.

	### GET /

	Built-in test frontend with emotion selection, voice cloning upload, and parameter controls.

	## How Emotion Control Works

	StyleTTS2 doesn't use explicit emotion tags. Instead, it controls expressiveness through
	diffusion parameters:

	| Parameter | What it controls | Range |
	|-----------|------------------|-------|
	| `alpha` | Timbre (0 = reference voice, 1 = text-predicted style) | 0.0 - 1.0 |
	| `beta` | Prosody (0 = reference voice, 1 = text-predicted style) | 0.0 - 1.0 |
	| `embedding_scale` | Expressiveness (higher = more emotional/dramatic) | 0.1 - 5.0 |
	| `diffusion_steps` | Style diversity (more steps = more varied output) | 3 - 20 |

	Each emotion preset maps to tuned combinations of these parameters. The intensity
	slider scales `embedding_scale` to make the emotion more or less pronounced.
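
The preset-plus-intensity idea can be sketched as follows. This is a minimal illustration, not the Space's actual code: the preset values, the `PRESETS` table, and the `resolve_params` helper are hypothetical, and clamping to the documented range is an assumption. Only the parameter names and the `embedding_scale` range come from the table above.

```python
from dataclasses import dataclass, replace

# Valid range for embedding_scale, taken from the parameter table.
EMBEDDING_SCALE_MIN, EMBEDDING_SCALE_MAX = 0.1, 5.0


@dataclass(frozen=True)
class StyleParams:
    alpha: float            # timbre: 0 = reference voice, 1 = text-predicted
    beta: float             # prosody: 0 = reference voice, 1 = text-predicted
    embedding_scale: float  # expressiveness
    diffusion_steps: int    # style diversity


# Hypothetical preset values, chosen only for illustration.
PRESETS = {
    "neutral": StyleParams(alpha=0.3, beta=0.7, embedding_scale=1.0, diffusion_steps=5),
    "excited": StyleParams(alpha=0.3, beta=0.9, embedding_scale=2.0, diffusion_steps=10),
}


def resolve_params(emotion: str, intensity: float = 1.0) -> StyleParams:
    """Look up a preset and scale its embedding_scale by `intensity`."""
    base = PRESETS[emotion]
    scaled = base.embedding_scale * intensity
    # Clamp so an extreme intensity cannot leave the documented range.
    clamped = min(max(scaled, EMBEDDING_SCALE_MIN), EMBEDDING_SCALE_MAX)
    return replace(base, embedding_scale=clamped)
```

Under these assumed values, `resolve_params("excited", 1.5)` keeps `alpha`, `beta`, and `diffusion_steps` from the preset and raises `embedding_scale` from 2.0 to 3.0.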

	## Authentication

	To enable authentication, set the `API_KEY` secret in your HuggingFace Space settings.
	Requests must then include an `Authorization: Bearer <your-key>` header.
	Leave `API_KEY` unset to disable authentication.
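
On the client side, the header can be built conditionally so the same code works against Spaces with and without authentication. The sketch below is an assumption-laden example: `build_headers` and the `VOXLIBRIS_TTS_API_KEY` environment variable are hypothetical names, and the JSON content type is assumed, not taken from the API contract.

```python
import os
from typing import Dict, Optional


def build_headers(api_key: Optional[str]) -> Dict[str, str]:
    """Return request headers, adding a Bearer token only when a key is set."""
    # JSON content type is an assumption about the request format.
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


# Hypothetical client-side variable name; the Space itself reads `API_KEY`.
headers = build_headers(os.environ.get("VOXLIBRIS_TTS_API_KEY"))
```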

	## Hardware Requirements

	A GPU is required for reasonable inference speed; a T4 or better is recommended.
	The LibriTTS multi-speaker model (~1.8 GB) downloads automatically on first startup.

	## Deployment

	1. Create a new HuggingFace Space with the Docker SDK
	2. Upload the contents of this folder
	3. Select a GPU runtime (T4 recommended)
	4. Set the `API_KEY` secret in Space settings (optional)
	5. The model downloads automatically on first startup (~1.8 GB)
	6. Register the Space URL in VoxLibris Settings under TTS Engine Management