---
library_name: coqui
license: mpl-2.0
tags:
  - text-to-speech
  - tts
  - xtts-v2
  - voice-cloning
  - multilingual
  - coqui
language:
  - en
  - th
  - es
  - fr
  - de
  - it
  - pt
  - pl
  - tr
  - ru
  - nl
  - cs
  - ar
  - zh
---

# XTTS-v2 Model Mirror for Quantum Sync

This is a mirror/backup of the **Coqui XTTS-v2** model for use with the [Quantum Sync](https://github.com/Useforclaude/quantum-sync-v5) project.

## 🎯 Purpose

This mirror serves as:
- **Backup** in case the original model becomes unavailable
- **Faster access** for Quantum Sync users
- **Stable reference** for production deployments

## 📋 Model Information

**Original Model:** [coqui/XTTS-v2](https://huggingface.co/coqui/XTTS-v2)

**Architecture:** XTTS-v2 (Zero-shot multi-lingual TTS)

**Model Size:** ~1.87 GB

**Supported Languages:** 14 languages
- English (en)
- Thai (th)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Polish (pl)
- Turkish (tr)
- Russian (ru)
- Dutch (nl)
- Czech (cs)
- Arabic (ar)
- Chinese (zh-cn)

## 🚀 Usage

### With Quantum Sync (Recommended)

```bash
git clone https://github.com/Useforclaude/quantum-sync-v5.git
cd quantum-sync-v5/quantum-sync-v11-production

# Configure to use this mirror
# Edit tts_engines/xtts.py, change model_name to:
# model_name = "useclaude/quantum-sync-xtts-v2"

python main_v11.py input/file.srt \
  --voice MyVoice \
  --voice-sample /path/to/voice.wav \
  --tts-engine xtts-v2 \
  --tts-language en
```

### Direct Usage with TTS Library

```python
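import os

from huggingface_hub import snapshot_download
from TTS.api import TTS

# TTS(model_name=...) resolves Coqui catalog names rather than Hugging Face
# repo IDs, so download the mirror first and load it from the local path.
model_dir = snapshot_download(repo_id="useclaude/quantum-sync-xtts-v2")
tts = TTS(
    model_path=model_dir,
    config_path=os.path.join(model_dir, "config.json"),
)

# Generate speech
tts.tts_to_file(
    text="Hello, this is a test.",
    speaker_wav="reference_voice.wav",
    language="en",
    file_path="output.wav"
)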
```

### Voice Cloning Example

```python
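import os

from huggingface_hub import snapshot_download
from TTS.api import TTS

# Initialize from a local copy of the mirror (repo IDs are not accepted
# as model_name, so the files are downloaded and loaded by path)
model_dir = snapshot_download(repo_id="useclaude/quantum-sync-xtts-v2")
tts = TTS(
    model_path=model_dir,
    config_path=os.path.join(model_dir, "config.json"),
)

# Clone voice from reference audio (6-30 seconds)
tts.tts_to_file(
    text="The quick brown fox jumps over the lazy dog.",
    speaker_wav="my_voice_sample.wav",  # Your voice reference
    language="en",
    file_path="output_cloned.wav"
)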
```
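
Because the reference clip is expected to be 6-30 seconds long, it can help to check its duration before synthesis. The following is a minimal sketch using only Python's standard `wave` module. The helper names `wav_duration_seconds` and `is_valid_reference` are hypothetical (they are not part of the TTS library), and the sketch assumes an uncompressed PCM WAV file.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_reference(
    path: str, min_seconds: float = 6.0, max_seconds: float = 30.0
) -> bool:
    """Check that a reference clip falls inside the 6-30 second window."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```

Calling `is_valid_reference("my_voice_sample.wav")` before `tts_to_file` lets a pipeline reject clips that are too short or too long instead of producing a poor clone.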

## 📊 Performance

**From Quantum Sync Production Tests (2025-10-13):**

| Metric | Value |
|--------|-------|
| **Synthesis Speed** | ~3.7 seconds/segment (~16 segments/minute) |
| **Processing Time** | 17 min for 277 segments (23 min audio) |
| **Duration Accuracy** | ~87% audio, ~13% silence gaps |
| **Timeline Drift** | -1.7% (excellent) |
| **Voice Quality** | 8/10 |
| **Cloning Accuracy** | Excellent |
| **VRAM Usage** | 6-8 GB |
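
The raw run figures above (277 segments, 17 minutes of processing, 23 minutes of output audio) determine the per-segment throughput. A short sketch of that arithmetic, with variable names chosen here purely for illustration:

```python
# Raw numbers from the production test run
segments = 277
processing_minutes = 17
audio_minutes = 23

seconds_per_segment = processing_minutes * 60 / segments
segments_per_minute = segments / processing_minutes
# Real-time factor: processing time divided by output audio duration
real_time_factor = processing_minutes / audio_minutes

print(f"{seconds_per_segment:.1f} s/segment")          # 3.7 s/segment
print(f"{segments_per_minute:.1f} segments/minute")    # 16.3 segments/minute
print(f"real-time factor {real_time_factor:.2f}")      # real-time factor 0.74
```

A real-time factor below 1.0 means the run synthesized audio faster than its playback duration.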

**Comparison:**
- **XTTS-v2**: 15-17 min, 8/10 quality, FREE, 87% audio
- **F5-TTS**: 20-25 min, 7/10 quality, FREE, 55% audio
- **AWS Polly**: 5 min, 9/10 quality, ~$0.06, no cloning

## 🎛️ Advanced Parameters

```python
# Assumes `tts` was initialized as in the usage examples above

# Speed control (0.5 - 2.0)
tts.tts_to_file(
    text="Hello world",
    speaker_wav="voice.wav",
    language="en",
    speed=0.8,  # Slower speech
    file_path="output.wav"
)

# Temperature control (0.1 - 1.0)
tts.tts_to_file(
    text="Hello world",
    speaker_wav="voice.wav",
    language="en",
    temperature=0.75,  # More expressive
    file_path="output.wav"
)
```
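
The ranges noted in the comments above (speed 0.5-2.0, temperature 0.1-1.0) can be enforced before the values reach `tts_to_file`. A minimal sketch, assuming those are the ranges you want to enforce. `build_synthesis_kwargs` is a hypothetical helper rather than a TTS library function, and the default speed of 1.0 is an assumption.

```python
def build_synthesis_kwargs(speed: float = 1.0, temperature: float = 0.75) -> dict:
    """Validate speed and temperature against the ranges documented above."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"speed must be in [0.5, 2.0], got {speed}")
    if not 0.1 <= temperature <= 1.0:
        raise ValueError(f"temperature must be in [0.1, 1.0], got {temperature}")
    return {"speed": speed, "temperature": temperature}
```

The result can be unpacked into a call, for example `tts.tts_to_file(text="Hello world", speaker_wav="voice.wav", language="en", file_path="output.wav", **build_synthesis_kwargs(speed=0.8))`.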

## 📦 Model Files

```
quantum-sync-xtts-v2/
├── model.pth              (1.87 GB - Neural network weights)
├── config.json            (Model configuration)
├── vocab.json             (Vocabulary for tokenization)
├── speakers_xtts.pth      (Speaker embeddings)
├── dvae.pth               (DVAE component)
├── mel_stats.pth          (Mel-spectrogram statistics)
├── LICENSE                (MPL 2.0)
└── README.md              (This file)
```
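
After downloading the mirror, checking that every file in the tree above is present can catch an interrupted download. A minimal sketch using `pathlib`: the function name `missing_model_files` is hypothetical, and the file list is copied from the listing above. Leaving out `LICENSE` and `README.md` on the grounds that inference does not need them is an assumption.

```python
from pathlib import Path

# File names taken from the model file listing
EXPECTED_FILES = [
    "model.pth",
    "config.json",
    "vocab.json",
    "speakers_xtts.pth",
    "dvae.pth",
    "mel_stats.pth",
]


def missing_model_files(model_dir: str) -> list[str]:
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all expected files were found. Any names returned point to files that need to be downloaded again.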

## 📜 License

**Mozilla Public License 2.0 (MPL 2.0)**

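This mirror is distributed with the Mozilla Public License 2.0 file listed above. The upstream [coqui/XTTS-v2](https://huggingface.co/coqui/XTTS-v2) model card lists the Coqui Public Model License (CPML) for the model weights, which restricts commercial use, so confirm the upstream terms before any commercial deployment. Under MPL 2.0 you can:
- ✅ Use commercially, subject to the requirements below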
- ✅ Modify the model
- ✅ Distribute the model
- ✅ Use in proprietary software

**Requirements:**
- Include license and copyright notice
- State changes if you modify the model
- Disclose source for modifications

**Full License:** [LICENSE](./LICENSE)

## 🙏 Attribution

**Original Work:**
- **Project:** [Coqui TTS](https://github.com/coqui-ai/TTS)
- **Model:** XTTS-v2
- **Authors:** Coqui TTS Team
- **License:** Mozilla Public License 2.0

**This Mirror:**
- **Purpose:** Backup for Quantum Sync project
- **Maintained by:** [Your Name/Organization]
- **Original Source:** https://huggingface.co/coqui/XTTS-v2

All credit goes to the original Coqui TTS team. This is simply a mirror for backup and convenience.

## 📚 Documentation

**Quantum Sync Documentation:**
- [XTTS-v2 Quick Start Guide](https://github.com/Useforclaude/quantum-sync-v5/blob/tts-experiments/quantum-sync-v11-production/XTTS-QUICK-START.md)
- [Paperspace Testing Guide](https://github.com/Useforclaude/quantum-sync-v5/blob/tts-experiments/quantum-sync-v11-production/PAPERSPACE-TTS-TESTING.md)

**Original Documentation:**
- [Coqui TTS GitHub](https://github.com/coqui-ai/TTS)
- [XTTS paper (arXiv:2406.04904)](https://arxiv.org/abs/2406.04904)

## 🔗 Links

- **This Mirror:** https://huggingface.co/useclaude/quantum-sync-xtts-v2
- **Original Model:** https://huggingface.co/coqui/XTTS-v2
- **Quantum Sync Project:** https://github.com/Useforclaude/quantum-sync-v5
- **TTS Library:** https://github.com/coqui-ai/TTS

## ⚠️ Disclaimer

This is an unofficial mirror maintained for backup purposes. For the latest version and official support, please refer to the [original model](https://huggingface.co/coqui/XTTS-v2) and [Coqui TTS repository](https://github.com/coqui-ai/TTS).

## 📊 Model Card

### Model Description

XTTS-v2 is a state-of-the-art zero-shot multi-lingual text-to-speech model that can clone voices from short audio samples (6-30 seconds).

**Key Features:**
- Zero-shot voice cloning
- Multi-lingual support (14 languages)
- High-quality natural speech
- No fine-tuning required
- Commercial use subject to the license terms (see the **License** section above)

### Intended Use

**Primary Use Cases:**
- Voice cloning for content creation
- Multi-lingual speech synthesis
- Accessibility applications
- Audiobook narration
- Video dubbing

**Out-of-Scope Use:**
- Impersonation without consent
- Generating misleading content
- Illegal activities

### Training Data

XTTS-v2 was trained on diverse multi-lingual speech data. For details, see the [original model card](https://huggingface.co/coqui/XTTS-v2).

### Performance

See the **Performance** section above for detailed benchmarks from the Quantum Sync project.

### Ethical Considerations

**Voice Cloning Ethics:**
- Always obtain consent before cloning someone's voice
- Clearly label AI-generated content
- Do not use for impersonation or fraud
- Follow local regulations on synthetic media

### Limitations

- May not perfectly preserve all voice characteristics
- Quality varies with reference audio quality
- Requires GPU for reasonable speed
- ~6-8 GB VRAM recommended
- Some languages may have better quality than others

## 🛠️ Technical Specifications

**Model Type:** Autoregressive Transformer-based TTS

**Framework:** PyTorch

**Input:** Text + Reference Audio (6-30 sec WAV)

**Output:** 24 kHz WAV audio

**Inference Time:** ~3-5 seconds per segment (GPU)

**Hardware Requirements:**
- GPU: NVIDIA with CUDA support
- VRAM: 6-8 GB recommended
- RAM: 16 GB
- Disk: ~2 GB for model

**Software Requirements:**
- Python 3.9+
- PyTorch 2.0+
- TTS library
- CUDA 11.8+ (for GPU)
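
The requirements above can be checked from Python before loading the model. The following is a rough sketch, not an official check: `check_environment` is a hypothetical helper, and it reports on the Python version, whether PyTorch is installed, and the memory of the first CUDA device if one is visible.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 9)) -> dict:
    """Report Python version, PyTorch presence, and GPU memory if available."""
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        "vram_gb": None,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without PyTorch

        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            total_bytes = torch.cuda.get_device_properties(0).total_memory
            report["vram_gb"] = round(total_bytes / 1024**3, 1)
    return report
```

Comparing `report["vram_gb"]` against the 6-8 GB recommendation gives an early warning before a long synthesis run starts.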

## 📞 Support

**For this mirror:**
- Issues: [Quantum Sync GitHub Issues](https://github.com/Useforclaude/quantum-sync-v5/issues)

**For original model:**
- Issues: [Coqui TTS GitHub Issues](https://github.com/coqui-ai/TTS/issues)

---

**Last Updated:** 2025-10-13

**Mirror Version:** 1.0

**Model Version:** XTTS-v2 (Latest as of upload date)