StyleTTS2 Test Console

Style diffusion TTS: human-level speech with emotion control and voice cloning

Text Input

Text to speak

Emotion & Style

Select emotion

Intensity

Subtle to Strong

Controls how strongly the emotion affects the output (scales embedding_scale)
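
The note above says only that intensity scales embedding_scale; it does not give the formula. As a minimal sketch, assuming the slider reports a value between 0.0 (Subtle) and 1.0 (Strong) and that it is mapped linearly onto a hypothetical embedding_scale range of 1.0 to 2.0, the conversion could look like this. The function name, the slider range, and both bounds are illustrative assumptions, not values taken from the console.

```python
def intensity_to_embedding_scale(
    intensity: float,
    min_scale: float = 1.0,  # assumed scale at the "Subtle" end
    max_scale: float = 2.0,  # assumed scale at the "Strong" end
) -> float:
    """Linearly map a 0.0-1.0 slider position onto an embedding_scale value."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError(f"intensity must be within [0.0, 1.0], got {intensity}")
    return min_scale + intensity * (max_scale - min_scale)


if __name__ == "__main__":
    # Print the scale for the two slider ends and the midpoint.
    for position in (0.0, 0.5, 1.0):
        print(position, intensity_to_embedding_scale(position))
```

A linear map keeps the slider predictable: equal steps on the slider give equal steps in the scale. A real deployment may use different bounds or a non-linear curve.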

Voice Cloning (Optional)

Upload reference audio to clone voice style

Click to upload or drag & drop a WAV/MP3 file

StyleTTS2 uses reference audio to extract voice style (timbre and prosody). Without reference audio, it generates a style from the text using diffusion. A 3-10 second clip of clear speech works best.
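
The 3-10 second recommendation can be checked before a clip is uploaded. The sketch below uses only Python's standard-library wave module, so it covers WAV files but not MP3. The helper names are hypothetical, and make_silent_wav exists only to produce demo input; none of this is part of the console itself.

```python
import io
import wave

# Bounds taken from the hint text: a 3-10 second clip works best.
MIN_REF_SECONDS = 3.0
MAX_REF_SECONDS = 10.0


def wav_duration_seconds(source) -> float:
    """Return the duration in seconds of a WAV file (path or binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_length(seconds: float) -> bool:
    """True when the clip length falls inside the recommended window."""
    return MIN_REF_SECONDS <= seconds <= MAX_REF_SECONDS


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit silent WAV, used here only as demo input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    duration = wav_duration_seconds(make_silent_wav(5.0))
    print(duration, is_recommended_length(duration))
```

Duration is only a proxy for quality. A clip of the right length can still be noisy or contain several speakers, so this check complements listening to the clip rather than replacing it.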

Audio Parameters

Volume (1-100)

Speed (-5 to 5)

Pitch (-5 to 5)
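
The three ranges above can be enforced before a request is sent. This is a minimal sketch, assuming the values arrive as integers or numeric strings and that out-of-range input should be clamped rather than rejected. The AudioParams class, its defaults, and the from_form helper are hypothetical; how the backend interprets each value is not stated here.

```python
from dataclasses import dataclass

# Ranges as shown in the console labels.
VOLUME_RANGE = (1, 100)
SPEED_RANGE = (-5, 5)
PITCH_RANGE = (-5, 5)


def clamp(value: int, low: int, high: int) -> int:
    """Limit value to the inclusive range [low, high]."""
    return max(low, min(high, value))


@dataclass(frozen=True)
class AudioParams:
    volume: int = 50  # assumed default, not stated in the console
    speed: int = 0    # assumed neutral value
    pitch: int = 0    # assumed neutral value

    @classmethod
    def from_form(cls, volume, speed, pitch) -> "AudioParams":
        """Parse raw form values and clamp each one into its documented range."""
        return cls(
            volume=clamp(int(volume), *VOLUME_RANGE),
            speed=clamp(int(speed), *SPEED_RANGE),
            pitch=clamp(int(pitch), *PITCH_RANGE),
        )


if __name__ == "__main__":
    # Out-of-range volume and speed are pulled back to the nearest bound.
    print(AudioParams.from_form("150", -9, 3))
```

Clamping keeps the form forgiving. Raising an error instead would be the stricter choice if silently adjusted values are undesirable.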