---
title: XTTSv2 Optimized TTS
emoji: 🐸
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
license: other
tags:
- tts
- text-to-speech
- voice-cloning
- xtts
- coqui
suggested_hardware: t4-small
---
# 🐸 XTTSv2 Optimized Text-to-Speech
High-quality multilingual voice cloning powered by XTTSv2 with performance optimizations.
## Features
- **17 Languages**: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, Hindi
- **Voice Cloning**: Clone any voice from ~6 seconds of reference audio
- **Streaming Mode**: Low-latency streaming for real-time applications
- **Optimizations**:
- DeepSpeed acceleration
- FP16 inference
- torch.compile() optimization
- Speaker embedding caching
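
Speaker embedding caching avoids recomputing conditioning latents when the same reference voice is reused. A minimal sketch of such a cache, using a stdlib LRU structure; the `compute_embedding` callable is a hypothetical stand-in for the model's conditioning step, not an API from this repo:

```python
from collections import OrderedDict

class SpeakerCache:
    """Keep the most recently used speaker embeddings, evicting the oldest."""

    def __init__(self, max_size=10):  # mirrors the MAX_CACHE_SIZE default
        self.max_size = max_size
        self._cache = OrderedDict()

    def get(self, speaker_id, compute_embedding):
        # Reuse a cached embedding when available.
        if speaker_id in self._cache:
            self._cache.move_to_end(speaker_id)
            return self._cache[speaker_id]
        # Otherwise compute it (the expensive step) and cache the result.
        embedding = compute_embedding(speaker_id)
        self._cache[speaker_id] = embedding
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # drop the least recently used entry
        return embedding
```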
## Usage
1. Upload a reference audio file (WAV/MP3, 6-30 seconds recommended)
2. Enter your text
3. Select the language
4. Click "Generate Speech"
## Performance
| Hardware | Latency (per sentence) |
|----------|------------------------|
| T4 | ~2-3 seconds |
| A10G | ~1 second |
| A100 | ~0.5 seconds |
## Configuration
Environment variables for tuning:
- `USE_DEEPSPEED`: Enable DeepSpeed (default: true)
- `USE_FP16`: Enable FP16 inference (default: true)
- `USE_TORCH_COMPILE`: Enable torch.compile (default: true)
- `MAX_CACHE_SIZE`: Number of speakers to cache (default: 10)
- `STREAMING_CHUNK_SIZE`: Streaming chunk size (default: 20)
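
A minimal sketch of how `app.py` might read these variables at startup; the `env_flag` helper and the `config` dict are illustrative, with defaults taken from the list above:

```python
import os

def env_flag(name, default):
    """Interpret an environment variable as a boolean flag."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "yes", "on")

# Defaults match the documented values above.
config = {
    "use_deepspeed": env_flag("USE_DEEPSPEED", True),
    "use_fp16": env_flag("USE_FP16", True),
    "use_torch_compile": env_flag("USE_TORCH_COMPILE", True),
    "max_cache_size": int(os.environ.get("MAX_CACHE_SIZE", "10")),
    "streaming_chunk_size": int(os.environ.get("STREAMING_CHUNK_SIZE", "20")),
}
```

For example, running the Space with `USE_FP16=false` disables half-precision inference while leaving the other defaults in place.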
## License
This model uses the [Coqui Public Model License](https://coqui.ai/cpml).
## Credits
- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [XTTS Paper](https://arxiv.org/abs/2406.04904)