---
title: XTTSv2 Optimized TTS
emoji: 🐸
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
license: other
tags:
- tts
- text-to-speech
- voice-cloning
- xtts
- coqui
suggested_hardware: t4-small
---

# 🐸 XTTSv2 Optimized Text-to-Speech

High-quality multilingual text-to-speech with zero-shot voice cloning, powered by Coqui XTTSv2 and tuned for faster inference.

## Features

- 17 Languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, Hindi
- Voice Cloning: Clone any voice from ~6 seconds of reference audio
- Streaming Mode: Low-latency streaming for real-time applications
- Optimizations:
  - DeepSpeed acceleration
  - FP16 inference
  - torch.compile() optimization
  - Speaker embedding caching
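
Speaker embedding caching avoids recomputing the per-speaker conditioning every time the same reference clip is reused. The sketch below is a minimal, hypothetical LRU cache, not this Space's actual code: it keys entries on a SHA-256 hash of the uploaded audio bytes and evicts the least recently used speaker once `max_size` (compare `MAX_CACHE_SIZE` under Configuration) is exceeded. `compute` is an assumed stand-in for the expensive step; in Coqui TTS that step is typically `Xtts.get_conditioning_latents()`, but how this app wires it up is not documented here.

```python
import hashlib
from collections import OrderedDict
from typing import Any, Callable


class SpeakerCache:
    """Hypothetical LRU cache: reference-audio bytes -> speaker conditioning.

    `compute` stands in for the expensive per-speaker step (for example,
    extracting conditioning latents from the reference clip).
    """

    def __init__(self, compute: Callable[[bytes], Any], max_size: int = 10) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._compute = compute
        self._max_size = max_size
        self._entries: "OrderedDict[str, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, audio: bytes) -> Any:
        # Key on a hash of the audio content, so re-uploading the same
        # file under a different name still hits the cache.
        key = hashlib.sha256(audio).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = self._compute(audio)
        self._entries[key] = value
        if len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value

    def __len__(self) -> int:
        return len(self._entries)
```

Hashing the content rather than the filename is a deliberate choice: Gradio uploads usually land under temporary paths, so the same clip can arrive with a different name on each request.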

## Usage

1. Upload a reference audio file (WAV/MP3, 6-30 seconds recommended)
2. Enter your text
3. Select the language
4. Click "Generate Speech"
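
Before uploading, you can check a WAV reference clip against the recommended 6-30 second window. This is an illustrative sketch using only Python's standard `wave` module, so it handles PCM WAV files only; MP3 would need a separate decoder. The helper names and messages are hypothetical, and the app itself may apply different limits.

```python
import io
import wave

MIN_SECONDS = 6.0   # lower bound of the recommended window
MAX_SECONDS = 30.0  # upper bound of the recommended window


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of a PCM WAV file given as raw bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(data: bytes) -> str:
    """Classify a reference clip against the recommended 6-30 s window."""
    seconds = wav_duration_seconds(data)
    if seconds < MIN_SECONDS:
        return f"too short ({seconds:.1f} s); cloning quality may suffer"
    if seconds > MAX_SECONDS:
        return f"long ({seconds:.1f} s); consider trimming to {MAX_SECONDS:.0f} s or less"
    return f"ok ({seconds:.1f} s)"


def make_silent_wav(seconds: float, rate: int = 22050) -> bytes:
    """Build a mono 16-bit silent WAV in memory (handy for testing)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()
```

For a real file, pass `open("reference.wav", "rb").read()` to `check_reference`.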

## Performance

| Hardware | Latency (per sentence) |
|----------|------------------------|
| T4       | ~2-3 seconds           |
| A10G     | ~1 second              |
| A100     | ~0.5 seconds           |
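
These latencies are approximate. To get a comparable per-sentence number on your own hardware, you could time each synthesis call as in the sketch below. `synthesize` is a hypothetical callable wrapping whatever generation call you use; on a GPU it should return only after the audio is actually ready (for example, after copying it to a CPU array), because CUDA work runs asynchronously and would otherwise be under-counted.

```python
import statistics
import time
from typing import Callable, Iterable, List


def measure_latency(
    synthesize: Callable[[str], object],
    sentences: Iterable[str],
    warmup: int = 1,
) -> dict:
    """Time `synthesize` per sentence and summarize wall-clock latency in seconds."""
    items = list(sentences)
    if not items:
        raise ValueError("need at least one sentence")
    # Warm-up calls are excluded from the stats: early runs can include
    # one-time costs such as torch.compile graph capture or CUDA init.
    for s in items[:warmup]:
        synthesize(s)
    timings: List[float] = []
    for s in items:
        start = time.perf_counter()
        synthesize(s)
        timings.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(timings),
        "max_s": max(timings),
        "n": len(timings),
    }
```

Reporting the median rather than the mean keeps a single slow outlier (for example, a cache miss on a new speaker) from skewing the figure.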

## Configuration

Environment variables for tuning:

- `USE_DEEPSPEED`: Enable DeepSpeed (default: true)
- `USE_FP16`: Enable FP16 inference (default: true)
- `USE_TORCH_COMPILE`: Enable torch.compile (default: true)
- `MAX_CACHE_SIZE`: Maximum number of speaker embeddings kept in the cache (default: 10)
- `STREAMING_CHUNK_SIZE`: Chunk size used in streaming mode (default: 20)
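
The settings above could be read into a typed config object along the lines of the sketch below. The defaults match the list; everything else is an assumption, since this README does not say how the app parses values. In particular, the accepted boolean spellings (`1/true/yes/on`, `0/false/no/off`) and the rejection of non-positive integers are choices made for this example, and `TTSConfig` is a hypothetical name.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed spellings; the README does not say how the app parses booleans.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    raw = env.get(name, "").strip().lower()
    if not raw:
        return default  # unset or empty: fall back to the documented default
    if raw in _TRUE:
        return True
    if raw in _FALSE:
        return False
    raise ValueError(f"{name}={raw!r} is not a recognized boolean")


def env_positive_int(env: Mapping[str, str], name: str, default: int) -> int:
    raw = env.get(name, "").strip()
    if not raw:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 1:
        raise ValueError(f"{name} must be >= 1, got {value}")
    return value


@dataclass(frozen=True)
class TTSConfig:
    use_deepspeed: bool = True
    use_fp16: bool = True
    use_torch_compile: bool = True
    max_cache_size: int = 10
    streaming_chunk_size: int = 20

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "TTSConfig":
        # Defaults mirror the values listed in the README.
        env = os.environ if env is None else env
        return cls(
            use_deepspeed=env_bool(env, "USE_DEEPSPEED", True),
            use_fp16=env_bool(env, "USE_FP16", True),
            use_torch_compile=env_bool(env, "USE_TORCH_COMPILE", True),
            max_cache_size=env_positive_int(env, "MAX_CACHE_SIZE", 10),
            streaming_chunk_size=env_positive_int(env, "STREAMING_CHUNK_SIZE", 20),
        )
```

Failing loudly on an unrecognized value (rather than silently using the default) makes a typo such as `USE_FP16=flase` visible at startup. On a Space, these variables would normally be set in the Space settings; locally, something like `USE_DEEPSPEED=false python app.py` would apply.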

## License

The XTTSv2 model weights are licensed under the [Coqui Public Model License](https://coqui.ai/cpml).

## Credits

- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [XTTS Paper](https://arxiv.org/abs/2406.04904)