metadata
title: XTTSv2 Optimized TTS
emoji: 🐸
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
license: other
tags:
- tts
- text-to-speech
- voice-cloning
- xtts
- coqui
suggested_hardware: t4-small
🐸 XTTSv2 Optimized Text-to-Speech
High-quality multilingual voice cloning powered by XTTSv2 with performance optimizations.
Features
- 17 Languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, Hindi
- Voice Cloning: Clone any voice from ~6 seconds of reference audio
- Streaming Mode: Low-latency streaming for real-time applications
- Optimizations:
  - DeepSpeed acceleration
  - FP16 inference
  - torch.compile() optimization
  - Speaker embedding caching
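
Speaker embedding caching means repeated requests with the same reference audio can skip the embedding step. The snippet below is a minimal sketch of one way to do this with a least-recently-used cache. It is not taken from app.py: `SpeakerCache`, `audio_key`, and the `compute` callback are hypothetical names, and keying the cache by a SHA-256 hash of the audio bytes is an assumption.

```python
import hashlib
from collections import OrderedDict
from typing import Any, Callable


def audio_key(audio_bytes: bytes) -> str:
    """Derive a cache key from the raw reference audio (assumed keying scheme)."""
    return hashlib.sha256(audio_bytes).hexdigest()


class SpeakerCache:
    """Small LRU cache for speaker embeddings (hypothetical helper)."""

    def __init__(self, max_size: int = 10) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling compute() only on a miss."""
        if key in self._entries:
            # Mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        value = compute()
        self._entries[key] = value
        if len(self._entries) > self.max_size:
            # Drop the least recently used entry.
            self._entries.popitem(last=False)
        return value

    def __len__(self) -> int:
        return len(self._entries)
```

In a real app, `compute` would wrap whatever call produces the speaker embedding for the uploaded audio, and `max_size` would come from MAX_CACHE_SIZE.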
Usage
1. Upload a reference audio file (WAV/MP3, 6-30 seconds recommended)
2. Enter your text
3. Select the language
4. Click "Generate Speech"
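
If you want to check a reference clip before uploading it, the sketch below measures a WAV file's duration with Python's standard `wave` module and compares it with the recommended 6-30 second range. Both function names are hypothetical and not part of this Space. MP3 files are not covered, since `wave` reads only WAV.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_length(seconds: float, low: float = 6.0, high: float = 30.0) -> bool:
    """Check a duration against the recommended reference-audio range."""
    return low <= seconds <= high
```

A clip outside the range is not necessarily rejected by the app. The range is only the recommendation given above.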
Performance
| Hardware | Latency (per sentence) |
|---|---|
| T4 | ~2-3 seconds |
| A10G | ~1 second |
| A100 | ~0.5 seconds |
Configuration
Environment variables for tuning:
- USE_DEEPSPEED: Enable DeepSpeed (default: true)
- USE_FP16: Enable FP16 inference (default: true)
- USE_TORCH_COMPILE: Enable torch.compile (default: true)
- MAX_CACHE_SIZE: Number of speakers to cache (default: 10)
- STREAMING_CHUNK_SIZE: Streaming chunk size (default: 20)
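
As an illustration, these variables could be read at startup roughly as follows. This is a sketch, not the code in app.py: `Settings`, `env_bool`, and `load_settings` are hypothetical names, and the set of strings accepted as "true" is an assumption. Only the variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumption: these spellings count as "true"; anything else is false.
_TRUTHY = {"1", "true", "yes", "on"}


def env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Read a boolean flag, falling back to the default when unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


@dataclass(frozen=True)
class Settings:
    use_deepspeed: bool = True
    use_fp16: bool = True
    use_torch_compile: bool = True
    max_cache_size: int = 10
    streaming_chunk_size: int = 20


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using the documented defaults."""
    return Settings(
        use_deepspeed=env_bool(env, "USE_DEEPSPEED", True),
        use_fp16=env_bool(env, "USE_FP16", True),
        use_torch_compile=env_bool(env, "USE_TORCH_COMPILE", True),
        max_cache_size=int(env.get("MAX_CACHE_SIZE", "10")),
        streaming_chunk_size=int(env.get("STREAMING_CHUNK_SIZE", "20")),
    )
```

For example, setting USE_TORCH_COMPILE=false in the Space settings would disable compilation while leaving the other defaults in place.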
License
This model uses the Coqui Public Model License.