---
library_name: transformers
license: mpl-2.0
tags:
- text-to-speech
- tts
- xtts-v2
- voice-cloning
- multilingual
- coqui
language:
- en
- th
- es
- fr
- de
- it
- pt
- pl
- tr
- ru
- nl
- cs
- ar
- zh
---
# XTTS-v2 Model Mirror for Quantum Sync
This is a mirror/backup of the **Coqui XTTS-v2** model for use with the [Quantum Sync](https://github.com/Useforclaude/quantum-sync-v5) project.
## 🎯 Purpose
This mirror serves as:
- **Backup** in case the original model becomes unavailable
- **Faster access** for Quantum Sync users
- **Stable reference** for production deployments
## 📋 Model Information
**Original Model:** [coqui/XTTS-v2](https://huggingface.co/coqui/XTTS-v2)
**Architecture:** XTTS-v2 (Zero-shot multi-lingual TTS)
**Model Size:** ~1.87 GB
**Supported Languages:** 14 languages
- English (en)
- Thai (th)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Polish (pl)
- Turkish (tr)
- Russian (ru)
- Dutch (nl)
- Czech (cs)
- Arabic (ar)
- Chinese (zh-cn)
## 🚀 Usage
### With Quantum Sync (Recommended)
```bash
git clone https://github.com/Useforclaude/quantum-sync-v5.git
cd quantum-sync-v5/quantum-sync-v11-production
# Configure to use this mirror
# Edit tts_engines/xtts.py, change model_name to:
# model_name = "useclaude/quantum-sync-xtts-v2"
python main_v11.py input/file.srt \
    --voice MyVoice \
    --voice-sample /path/to/voice.wav \
    --tts-engine xtts-v2 \
    --tts-language en
```
### Direct Usage with TTS Library
```python
from TTS.api import TTS
# Use this mirror
tts = TTS(model_name="useclaude/quantum-sync-xtts-v2")
# Generate speech
tts.tts_to_file(
    text="Hello, this is a test.",
    speaker_wav="reference_voice.wav",
    language="en",
    file_path="output.wav"
)
```
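Inference is far faster on a GPU. A minimal device-selection sketch, assuming a recent Coqui TTS release where the `TTS` wrapper supports `.to(device)` and reusing the mirror name from the example above:
```python
import torch
from TTS.api import TTS

# Prefer CUDA when available; CPU inference works but is much slower
device = "cuda" if torch.cuda.is_available() else "cpu"

tts = TTS(model_name="useclaude/quantum-sync-xtts-v2").to(device)

tts.tts_to_file(
    text="Hello, this is a test.",
    speaker_wav="reference_voice.wav",
    language="en",
    file_path="output.wav",
)
```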
### Voice Cloning Example
```python
from TTS.api import TTS
# Initialize
tts = TTS(model_name="useclaude/quantum-sync-xtts-v2")
# Clone voice from reference audio (6-30 seconds)
tts.tts_to_file(
    text="The quick brown fox jumps over the lazy dog.",
    speaker_wav="my_voice_sample.wav",  # Your voice reference
    language="en",
    file_path="output_cloned.wav"
)
```
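The same call works across all supported languages by changing only the `language` code. A short sketch that batches a few of the listed languages with one reference voice (the sample sentences are placeholders):
```python
from TTS.api import TTS

tts = TTS(model_name="useclaude/quantum-sync-xtts-v2")

# Placeholder sentences; any text in the target language works
samples = {
    "en": "Hello, welcome to the demo.",
    "es": "Hola, bienvenido a la demostración.",
    "fr": "Bonjour, bienvenue dans la démonstration.",
    "de": "Hallo, willkommen zur Demo.",
}

for lang, text in samples.items():
    tts.tts_to_file(
        text=text,
        speaker_wav="my_voice_sample.wav",  # same reference voice for every language
        language=lang,
        file_path=f"output_{lang}.wav",
    )
```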
## 📊 Performance
**From Quantum Sync Production Tests (2025-10-13):**
| Metric | Value |
|--------|-------|
| **Synthesis Speed** | ~3.7 seconds per segment (~16 segments/minute) |
| **Processing Time** | 17 min for 277 segments (23 min audio) |
| **Duration Accuracy** | ~87% audio, ~13% silence gaps |
| **Timeline Drift** | -1.7% (excellent) |
| **Voice Quality** | 8/10 |
| **Cloning Accuracy** | Excellent |
| **VRAM Usage** | 6-8 GB |
**Comparison:**
- **XTTS-v2**: 15-17 min, 8/10 quality, FREE, 87% audio
- **F5-TTS**: 20-25 min, 7/10 quality, FREE, 55% audio
- **AWS Polly**: 5 min, 9/10 quality, ~$0.06, no cloning
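For planning longer jobs, the ~3.7 seconds-per-segment figure above gives a rough runtime estimate. A back-of-envelope helper (the rate is an assumption taken from the benchmark table, not a guarantee):
```python
def estimate_runtime_minutes(num_segments: int, seconds_per_segment: float = 3.7) -> float:
    """Rough XTTS-v2 synthesis time estimate based on the benchmark above."""
    return num_segments * seconds_per_segment / 60

# Example: the 277-segment test file -> roughly 17 minutes
print(f"{estimate_runtime_minutes(277):.0f} min")
```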
## 🎛️ Advanced Parameters
```python
# Speed control (0.5 - 2.0)
tts.tts_to_file(
    text="Hello world",
    speaker_wav="voice.wav",
    language="en",
    speed=0.8,  # Slower speech
    file_path="output.wav"
)
# Temperature control (0.1 - 1.0)
tts.tts_to_file(
    text="Hello world",
    speaker_wav="voice.wav",
    language="en",
    temperature=0.75,  # More expressive
    file_path="output.wav"
)
```
## 📦 Model Files
```
quantum-sync-xtts-v2/
├── model.pth           (1.87 GB - Neural network weights)
├── config.json         (Model configuration)
├── vocab.json          (Vocabulary for tokenization)
├── speakers_xtts.pth   (Speaker embeddings)
├── dvae.pth            (DVAE component)
├── mel_stats.pth       (Mel-spectrogram statistics)
├── LICENSE             (MPL 2.0)
└── README.md           (This file)
```
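These files can also be fetched directly from the mirror and loaded with the lower-level XTTS classes in the Coqui TTS library. A sketch under the assumption that the repo id matches this card; treat it as illustrative rather than the project's canonical loading code:
```python
from huggingface_hub import snapshot_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download model.pth, config.json, vocab.json, etc. from this mirror
model_dir = snapshot_download(repo_id="useclaude/quantum-sync-xtts-v2")

config = XttsConfig()
config.load_json(f"{model_dir}/config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir=model_dir, eval=True)
model.cuda()

out = model.synthesize(
    "Hello from the low-level API.",
    config,
    speaker_wav="reference_voice.wav",
    language="en",
)
# out["wav"] holds the 24 kHz waveform; save it with soundfile or torchaudio
```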
## 📜 License
**Mozilla Public License 2.0 (MPL 2.0)**
This model is licensed under the Mozilla Public License 2.0. You can:
- ✅ Use commercially
- ✅ Modify the model
- ✅ Distribute the model
- ✅ Use in proprietary software
**Requirements:**
- Include license and copyright notice
- State changes if you modify the model
- Disclose source for modifications
**Full License:** [LICENSE](./LICENSE)
## 🙏 Attribution
**Original Work:**
- **Project:** [Coqui TTS](https://github.com/coqui-ai/TTS)
- **Model:** XTTS-v2
- **Authors:** Coqui TTS Team
- **License:** Mozilla Public License 2.0
**This Mirror:**
- **Purpose:** Backup for Quantum Sync project
- **Maintained by:** [Your Name/Organization]
- **Original Source:** https://huggingface.co/coqui/XTTS-v2
All credit goes to the original Coqui TTS team. This is simply a mirror for backup and convenience.
## 📚 Documentation
**Quantum Sync Documentation:**
- [XTTS-v2 Quick Start Guide](https://github.com/Useforclaude/quantum-sync-v5/blob/tts-experiments/quantum-sync-v11-production/XTTS-QUICK-START.md)
- [Paperspace Testing Guide](https://github.com/Useforclaude/quantum-sync-v5/blob/tts-experiments/quantum-sync-v11-production/PAPERSPACE-TTS-TESTING.md)
**Original Documentation:**
- [Coqui TTS GitHub](https://github.com/coqui-ai/TTS)
- [XTTS-v2 Paper](https://arxiv.org/abs/2406.04904)
## 🔗 Links
- **This Mirror:** https://huggingface.co/useclaude/quantum-sync-xtts-v2
- **Original Model:** https://huggingface.co/coqui/XTTS-v2
- **Quantum Sync Project:** https://github.com/Useforclaude/quantum-sync-v5
- **TTS Library:** https://github.com/coqui-ai/TTS
## ⚠️ Disclaimer
This is an unofficial mirror maintained for backup purposes. For the latest version and official support, please refer to the [original model](https://huggingface.co/coqui/XTTS-v2) and [Coqui TTS repository](https://github.com/coqui-ai/TTS).
## 📊 Model Card
### Model Description
XTTS-v2 is a state-of-the-art zero-shot multi-lingual text-to-speech model that can clone voices from short audio samples (6-30 seconds).
**Key Features:**
- Zero-shot voice cloning
- Multi-lingual support (14 languages)
- High-quality natural speech
- No fine-tuning required
- Commercial use allowed
### Intended Use
**Primary Use Cases:**
- Voice cloning for content creation
- Multi-lingual speech synthesis
- Accessibility applications
- Audiobook narration
- Video dubbing
**Out-of-Scope Use:**
- Impersonation without consent
- Generating misleading content
- Illegal activities
### Training Data
XTTS-v2 was trained on diverse multi-lingual speech data. For details, see the [original model card](https://huggingface.co/coqui/XTTS-v2).
### Performance
See **Performance** section above for detailed benchmarks from Quantum Sync project.
### Ethical Considerations
**Voice Cloning Ethics:**
- Always obtain consent before cloning someone's voice
- Clearly label AI-generated content
- Do not use for impersonation or fraud
- Follow local regulations on synthetic media
### Limitations
- May not perfectly preserve all voice characteristics
- Quality varies with reference audio quality
- Requires GPU for reasonable speed
- ~6-8 GB VRAM recommended
- Some languages may have better quality than others
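Since output quality tracks the quality of the reference clip, it can help to clean the reference audio first. A minimal preprocessing sketch using librosa and soundfile (resample to 24 kHz mono and trim leading/trailing silence; the threshold and clip length are illustrative choices, not project defaults):
```python
import librosa
import soundfile as sf

# Load the reference clip as 24 kHz mono
wav, sr = librosa.load("raw_reference.wav", sr=24000, mono=True)

# Trim leading/trailing silence (top_db is an illustrative threshold)
wav, _ = librosa.effects.trim(wav, top_db=30)

# Keep at most ~30 seconds of speech, matching the 6-30 second recommendation above
wav = wav[: 30 * sr]

sf.write("my_voice_sample.wav", wav, sr)
```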
## 🛠️ Technical Specifications
**Model Type:** Autoregressive Transformer-based TTS
**Framework:** PyTorch
**Input:** Text + Reference Audio (6-30 sec WAV)
**Output:** 24 kHz WAV audio
**Inference Time:** ~3-5 seconds per segment (GPU)
**Hardware Requirements:**
- GPU: NVIDIA with CUDA support
- VRAM: 6-8 GB recommended
- RAM: 16 GB
- Disk: ~2 GB for model
**Software Requirements:**
- Python 3.9+
- PyTorch 2.0+
- TTS library
- CUDA 11.8+ (for GPU)
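A quick way to confirm the environment meets these requirements before loading the model (a small sketch; the VRAM threshold mirrors the recommendation above):
```python
import torch

assert torch.cuda.is_available(), "CUDA GPU not detected; CPU inference will be very slow"

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")

if vram_gb < 6:
    print("Warning: less than the recommended 6-8 GB of VRAM")
```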
## 📞 Support
**For this mirror:**
- Issues: [Quantum Sync GitHub Issues](https://github.com/Useforclaude/quantum-sync-v5/issues)
**For original model:**
- Issues: [Coqui TTS GitHub Issues](https://github.com/coqui-ai/TTS/issues)
---
**Last Updated:** 2025-10-13
**Mirror Version:** 1.0
**Model Version:** XTTS-v2 (Latest as of upload date)