metadata
title: XTTSv2 Optimized TTS
emoji: 🐸
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
license: other
tags:
  - tts
  - text-to-speech
  - voice-cloning
  - xtts
  - coqui
suggested_hardware: t4-small

🐸 XTTSv2 Optimized Text-to-Speech

High-quality multilingual voice cloning powered by XTTSv2 with performance optimizations.

Features

  • 17 Languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, Hindi
  • Voice Cloning: Clone any voice from ~6 seconds of reference audio
  • Streaming Mode: Low-latency streaming for real-time applications
  • Optimizations:
    • DeepSpeed acceleration
    • FP16 inference
    • torch.compile() optimization
    • Speaker embedding caching
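Speaker embedding caching can be pictured as a small least-recently-used (LRU) cache keyed by a hash of the reference audio, so repeated requests with the same voice skip the embedding step. The sketch below is only an illustration under that assumption, not the Space's actual code: `SpeakerCache` and the `compute_embedding` callable are hypothetical names, and the real app may key or evict entries differently.

```python
import hashlib
from collections import OrderedDict
from typing import Any, Callable


class SpeakerCache:
    """Hypothetical LRU cache for speaker embeddings, keyed by audio content."""

    def __init__(self, max_size: int = 10) -> None:
        # 10 mirrors the documented MAX_CACHE_SIZE default.
        self.max_size = max_size
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    @staticmethod
    def key_for(audio_bytes: bytes) -> str:
        # Hash the raw bytes so the same upload maps to the same key.
        return hashlib.sha256(audio_bytes).hexdigest()

    def get_or_compute(
        self, audio_bytes: bytes, compute_embedding: Callable[[bytes], Any]
    ) -> Any:
        key = self.key_for(audio_bytes)
        if key in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: compute, store, and evict the oldest entry if over capacity.
        embedding = compute_embedding(audio_bytes)
        self._entries[key] = embedding
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)
        return embedding

    def __len__(self) -> int:
        return len(self._entries)
```

In a real deployment `compute_embedding` would wrap whatever the model uses to derive speaker conditioning from the reference clip; here it is left as a plain callable so the caching logic stays independent of the model.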

Usage

  1. Upload a reference audio file (WAV/MP3, 6-30 seconds recommended)
  2. Enter your text
  3. Select the language
  4. Click "Generate Speech"
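Before uploading, it can help to check that a reference clip falls in the recommended 6-30 second range from step 1. The following is a minimal sketch using only Python's standard `wave` module, so it handles WAV files only (an MP3 would need a third-party decoder). The function names are illustrative and are not part of the Space.

```python
import io
import wave

MIN_SECONDS = 6.0   # lower bound recommended in step 1
MAX_SECONDS = 30.0  # upper bound recommended in step 1


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_length(source) -> bool:
    """True if the clip duration lies within the recommended range."""
    return MIN_SECONDS <= wav_duration_seconds(source) <= MAX_SECONDS


if __name__ == "__main__":
    # Build 7 seconds of 16-bit mono silence in memory as a stand-in clip.
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(8000)
        out.writeframes(b"\x00\x00" * 8000 * 7)
    buffer.seek(0)
    print(is_recommended_length(buffer))
```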

Performance

Hardware | Latency (per sentence)
T4       | ~2-3 seconds
A10G     | ~1 second
A100     | ~0.5 seconds
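To compare your own hardware against these figures, per-sentence latency can be measured with a simple timing loop. This is a rough sketch, not the benchmark used for the numbers above: `synthesize` is a hypothetical callable standing in for whatever function produces audio for one sentence, and it is assumed to block until the audio is ready. A few warm-up calls are excluded from the timings because the first call after `torch.compile()` includes compilation time.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def measure_latency(
    synthesize: Callable[[str], object],
    sentences: Sequence[str],
    warmup: int = 1,
) -> Dict[str, float]:
    """Time synthesize() on each sentence and summarize the results in seconds."""
    if not sentences:
        raise ValueError("at least one sentence is required")

    # Warm-up runs are executed but not timed.
    for sentence in sentences[:warmup]:
        synthesize(sentence)

    timings = []
    for sentence in sentences:
        start = time.perf_counter()
        synthesize(sentence)
        timings.append(time.perf_counter() - start)

    return {
        "count": float(len(timings)),
        "mean_s": statistics.mean(timings),
        "max_s": max(timings),
    }
```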

Configuration

Environment variables for tuning:

  • USE_DEEPSPEED: Enable DeepSpeed (default: true)
  • USE_FP16: Enable FP16 inference (default: true)
  • USE_TORCH_COMPILE: Enable torch.compile (default: true)
  • MAX_CACHE_SIZE: Number of speakers to cache (default: 10)
  • STREAMING_CHUNK_SIZE: Streaming chunk size (default: 20)
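One way these variables might be read at startup is sketched below. The variable names and defaults come from the list above; everything else is an assumption, including the `Settings` class, the `load_settings` helper, and the set of strings treated as "true". The Space's `app.py` may parse them differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: these spellings count as "enabled"; anything else disables the flag.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


def _env_int(env: Mapping[str, str], name: str, default: int) -> int:
    raw = env.get(name)
    if raw is None:
        return default
    return int(raw)  # raises ValueError on non-numeric input


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented tuning options."""
    use_deepspeed: bool = True
    use_fp16: bool = True
    use_torch_compile: bool = True
    max_cache_size: int = 10
    streaming_chunk_size: int = 20


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from the environment, falling back to documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        use_deepspeed=_env_bool(env, "USE_DEEPSPEED", True),
        use_fp16=_env_bool(env, "USE_FP16", True),
        use_torch_compile=_env_bool(env, "USE_TORCH_COMPILE", True),
        max_cache_size=_env_int(env, "MAX_CACHE_SIZE", 10),
        streaming_chunk_size=_env_int(env, "STREAMING_CHUNK_SIZE", 20),
    )
```

Passing a plain dictionary instead of relying on `os.environ` makes the parsing easy to test, for example `load_settings({"USE_DEEPSPEED": "false"})` to simulate a host without DeepSpeed.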

License

This model uses the Coqui Public Model License.

Credits