Style diffusion TTS — human-level speech with emotion control & voice cloning
Text Input
Emotion & Style
Subtle / Strong
Controls how strongly the emotion affects the output (scales embedding_scale)
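As a rough illustration of how a strength slider could drive embedding_scale, the sketch below maps a slider position between 0 and 1 linearly onto a scale range. The function name and the 1.0 to 3.0 bounds are assumptions made for this example; the page does not state the actual range or mapping.

```python
def slider_to_embedding_scale(strength: float, low: float = 1.0, high: float = 3.0) -> float:
    """Map a slider position in [0, 1] to an embedding_scale value.

    The low/high bounds are hypothetical defaults chosen for illustration,
    not values taken from the demo.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Linear interpolation: 0 gives the subtle end, 1 gives the strong end.
    return low + strength * (high - low)
```

With these assumed bounds, the "Subtle" end of the slider yields 1.0 and the "Strong" end yields 3.0.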
Voice Cloning (Optional)
Click to upload or drag & drop a WAV/MP3 file
StyleTTS2 uses reference audio to extract voice style (timbre and prosody). Without reference audio, it generates a style from the text using diffusion. A clip of 3 to 10 seconds of clear speech works best.
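Before uploading, it may help to check that a reference clip falls in the recommended length range. The sketch below does this for WAV files using Python's standard wave module. The function names and constants are illustrative and not part of StyleTTS2, and MP3 files would need a different reader.

```python
import wave

# Recommended clip length from the guidance above, in seconds.
MIN_SECONDS = 3.0
MAX_SECONDS = 10.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        # Duration is the number of frames divided by frames per second.
        return wav_file.getnframes() / wav_file.getframerate()


def is_recommended_length(seconds: float) -> bool:
    """Check whether a duration falls in the 3 to 10 second range."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

A call such as is_recommended_length(wav_duration_seconds("reference.wav")) returns True for a clip inside the range, where "reference.wav" is a placeholder file name.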