---
title: XTTSv2 Optimized TTS
emoji: 🐸
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
license: other
tags:
  - tts
  - text-to-speech
  - voice-cloning
  - xtts
  - coqui
suggested_hardware: t4-small
---

# 🐸 XTTSv2 Optimized Text-to-Speech

High-quality multilingual voice cloning powered by XTTSv2 with performance optimizations.

## Features

- **17 Languages**: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, Hindi
- **Voice Cloning**: Clone any voice from ~6 seconds of reference audio
- **Streaming Mode**: Low-latency streaming for real-time applications
- **Optimizations**:
  - DeepSpeed acceleration
  - FP16 inference
  - `torch.compile()` optimization
  - Speaker embedding caching

## Usage

1. Upload a reference audio file (WAV/MP3, 6-30 seconds recommended)
2. Enter your text
3. Select the language
4. Click "Generate Speech"

## Performance

| Hardware | Latency (per sentence) |
|----------|------------------------|
| T4       | ~2-3 seconds           |
| A10G     | ~1 second              |
| A100     | ~0.5 seconds           |

## Configuration

Environment variables for tuning:

- `USE_DEEPSPEED`: Enable DeepSpeed (default: true)
- `USE_FP16`: Enable FP16 inference (default: true)
- `USE_TORCH_COMPILE`: Enable torch.compile (default: true)
- `MAX_CACHE_SIZE`: Number of speakers to cache (default: 10)
- `STREAMING_CHUNK_SIZE`: Streaming chunk size (default: 20)

## License

This model uses the [Coqui Public Model License](https://coqui.ai/cpml).

## Credits

- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [XTTS Paper](https://arxiv.org/abs/2406.04904)
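
## Configuration Example

The environment variables listed under Configuration could be read along the lines of the sketch below, which also shows one way `MAX_CACHE_SIZE` might bound a speaker embedding cache. This is a minimal illustration, not the code in `app.py`: the helper names (`env_bool`, `env_int`, `Settings`, `load_settings`, `SpeakerCache`), the set of strings treated as "true", and the least-recently-used eviction policy are all assumptions. The README only states the variable names and their defaults.

```python
import os
from collections import OrderedDict
from dataclasses import dataclass


def env_bool(name, default, env=None):
    """Read a boolean variable; unset or empty values fall back to the default."""
    env = os.environ if env is None else env
    raw = env.get(name, "").strip().lower()
    if not raw:
        return default
    # Assumption: these spellings count as "true"; anything else is false.
    return raw in {"1", "true", "yes", "on"}


def env_int(name, default, env=None):
    """Read an integer variable; unset or empty values fall back to the default."""
    env = os.environ if env is None else env
    raw = env.get(name, "").strip()
    return int(raw) if raw else default


@dataclass(frozen=True)
class Settings:
    use_deepspeed: bool
    use_fp16: bool
    use_torch_compile: bool
    max_cache_size: int
    streaming_chunk_size: int


def load_settings(env=None):
    """Build Settings using the defaults documented in the Configuration section."""
    return Settings(
        use_deepspeed=env_bool("USE_DEEPSPEED", True, env),
        use_fp16=env_bool("USE_FP16", True, env),
        use_torch_compile=env_bool("USE_TORCH_COMPILE", True, env),
        max_cache_size=env_int("MAX_CACHE_SIZE", 10, env),
        streaming_chunk_size=env_int("STREAMING_CHUNK_SIZE", 20, env),
    )


class SpeakerCache:
    """Hypothetical LRU cache for speaker embeddings, keyed by a caller-chosen ID."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        self._items = OrderedDict()

    def get(self, key):
        """Return the cached value and mark it as recently used, or None on a miss."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Store a value, evicting the least recently used entries when over capacity."""
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self._max_size:
            self._items.popitem(last=False)

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    settings = load_settings()
    cache = SpeakerCache(settings.max_cache_size)
    print(settings)
```

Passing a plain dictionary to `load_settings` instead of relying on `os.environ` makes the parsing easy to try out, for example `load_settings({"USE_FP16": "false", "MAX_CACHE_SIZE": "2"})`.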