|
|
---
library_name: transformers
license: mpl-2.0
tags:
- text-to-speech
- tts
- xtts-v2
- voice-cloning
- multilingual
- coqui
language:
- en
- th
- es
- fr
- de
- it
- pt
- pl
- tr
- ru
- nl
- cs
- ar
- zh
---
|
|
|
|
|
# XTTS-v2 Model Mirror for Quantum Sync |
|
|
|
|
|
This is a mirror/backup of the **Coqui XTTS-v2** model for use with the [Quantum Sync](https://github.com/Useforclaude/quantum-sync-v5) project. |
|
|
|
|
|
## 🎯 Purpose
|
|
|
|
|
This mirror serves as: |
|
|
- **Backup** in case the original model becomes unavailable |
|
|
- **Faster access** for Quantum Sync users |
|
|
- **Stable reference** for production deployments |
|
|
|
|
|
## 📋 Model Information
|
|
|
|
|
**Original Model:** [coqui/XTTS-v2](https://huggingface.co/coqui/XTTS-v2) |
|
|
|
|
|
**Architecture:** XTTS-v2 (Zero-shot multi-lingual TTS) |
|
|
|
|
|
**Model Size:** ~1.87 GB |
|
|
|
|
|
**Supported Languages:** 14 languages
|
|
- English (en) |
|
|
- Thai (th) |
|
|
- Spanish (es) |
|
|
- French (fr) |
|
|
- German (de) |
|
|
- Italian (it) |
|
|
- Portuguese (pt) |
|
|
- Polish (pl) |
|
|
- Turkish (tr) |
|
|
- Russian (ru) |
|
|
- Dutch (nl) |
|
|
- Czech (cs) |
|
|
- Arabic (ar) |
|
|
- Chinese (zh-cn) |
|
|
|
|
|
## 🚀 Usage
|
|
|
|
|
### With Quantum Sync (Recommended) |
|
|
|
|
|
```bash
git clone https://github.com/Useforclaude/quantum-sync-v5.git
cd quantum-sync-v5/quantum-sync-v11-production

# Configure to use this mirror:
# edit tts_engines/xtts.py and set
#   model_name = "useclaude/quantum-sync-xtts-v2"

python main_v11.py input/file.srt \
    --voice MyVoice \
    --voice-sample /path/to/voice.wav \
    --tts-engine xtts-v2 \
    --tts-language en
```
|
|
|
|
|
### Direct Usage with TTS Library |
|
|
|
|
|
```python
from TTS.api import TTS

# Load the model from this mirror
tts = TTS(model_name="useclaude/quantum-sync-xtts-v2")

# Generate speech with a cloned voice
tts.tts_to_file(
    text="Hello, this is a test.",
    speaker_wav="reference_voice.wav",
    language="en",
    file_path="output.wav",
)
```
|
|
|
|
|
### Voice Cloning Example |
|
|
|
|
|
```python
from TTS.api import TTS

# Initialize from this mirror
tts = TTS(model_name="useclaude/quantum-sync-xtts-v2")

# Clone a voice from reference audio (6-30 seconds)
tts.tts_to_file(
    text="The quick brown fox jumps over the lazy dog.",
    speaker_wav="my_voice_sample.wav",  # your voice reference
    language="en",
    file_path="output_cloned.wav",
)
```
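Reference clips outside the recommended 6-30 second window are a common source of poor cloning results, so it can be worth checking durations up front. A minimal stdlib sketch (the `reference_is_usable` helper is illustrative, not part of Quantum Sync or the TTS library):

```python
import wave

def reference_is_usable(path: str, min_s: float = 6.0, max_s: float = 30.0) -> bool:
    """Return True if the WAV file's duration is in the recommended range."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / float(w.getframerate())
    return min_s <= duration <= max_s
```

Run this over your reference files before a long synthesis job to catch clips that are too short or too long.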
|
|
|
|
|
## 📊 Performance
|
|
|
|
|
**From Quantum Sync Production Tests (2025-10-13):** |
|
|
|
|
|
| Metric | Value |
|--------|-------|
| **Synthesis Speed** | ~3.7 seconds/segment |
| **Processing Time** | 17 min for 277 segments (23 min of audio) |
| **Duration Accuracy** | ~87% audio, ~13% silence gaps |
| **Timeline Drift** | -1.7% (excellent) |
| **Voice Quality** | 8/10 |
| **Cloning Accuracy** | Excellent |
| **VRAM Usage** | 6-8 GB |
|
|
|
|
|
**Comparison:** |
|
|
- **XTTS-v2**: 15-17 min, 8/10 quality, FREE, 87% audio |
|
|
- **F5-TTS**: 20-25 min, 7/10 quality, FREE, 55% audio |
|
|
- **AWS Polly**: 5 min, 9/10 quality, ~$0.06, no cloning |
|
|
|
|
|
## 🎛️ Advanced Parameters
|
|
|
|
|
```python
# Speed control (0.5 - 2.0)
tts.tts_to_file(
    text="Hello world",
    speaker_wav="voice.wav",
    language="en",
    speed=0.8,  # slower speech
    file_path="output_slow.wav",
)

# Temperature control (0.1 - 1.0)
tts.tts_to_file(
    text="Hello world",
    speaker_wav="voice.wav",
    language="en",
    temperature=0.75,  # more varied, expressive delivery
    file_path="output_expressive.wav",
)
```
|
|
|
|
|
## 📦 Model Files
|
|
|
|
|
```
quantum-sync-xtts-v2/
├── model.pth          (1.87 GB - neural network weights)
├── config.json        (model configuration)
├── vocab.json         (vocabulary for tokenization)
├── speakers_xtts.pth  (speaker embeddings)
├── dvae.pth           (DVAE component)
├── mel_stats.pth      (mel-spectrogram statistics)
├── LICENSE            (MPL 2.0)
└── README.md          (this file)
```
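A partially downloaded mirror fails in confusing ways at load time, so it can help to verify the files above are present first. A stdlib sketch (the `missing_files` helper is illustrative, not part of this repository):

```python
from pathlib import Path

# Core model files from the listing above (LICENSE/README omitted)
REQUIRED_FILES = [
    "model.pth",
    "config.json",
    "vocab.json",
    "speakers_xtts.pth",
    "dvae.pth",
    "mel_stats.pth",
]

def missing_files(model_dir: str) -> list[str]:
    """Return the required model files not present in model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty return value means all core files are in place; otherwise, re-download the listed files from the mirror.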
|
|
|
|
|
## 📄 License
|
|
|
|
|
**Mozilla Public License 2.0 (MPL 2.0)** |
|
|
|
|
|
This model is licensed under the Mozilla Public License 2.0. You can: |
|
|
- ✅ Use commercially
- ✅ Modify the model
- ✅ Distribute the model
- ✅ Use in proprietary software
|
|
|
|
|
**Requirements:** |
|
|
- Include license and copyright notice |
|
|
- State changes if you modify the model |
|
|
- Disclose source for modifications |
|
|
|
|
|
**Full License:** [LICENSE](./LICENSE) |
|
|
|
|
|
## 🙏 Attribution
|
|
|
|
|
**Original Work:** |
|
|
- **Project:** [Coqui TTS](https://github.com/coqui-ai/TTS) |
|
|
- **Model:** XTTS-v2 |
|
|
- **Authors:** Coqui TTS Team |
|
|
- **License:** Mozilla Public License 2.0 |
|
|
|
|
|
**This Mirror:** |
|
|
- **Purpose:** Backup for Quantum Sync project |
|
|
- **Maintained by:** [Your Name/Organization] |
|
|
- **Original Source:** https://huggingface.co/coqui/XTTS-v2 |
|
|
|
|
|
All credit goes to the original Coqui TTS team. This is simply a mirror for backup and convenience. |
|
|
|
|
|
## 📚 Documentation
|
|
|
|
|
**Quantum Sync Documentation:** |
|
|
- [XTTS-v2 Quick Start Guide](https://github.com/Useforclaude/quantum-sync-v5/blob/tts-experiments/quantum-sync-v11-production/XTTS-QUICK-START.md) |
|
|
- [Paperspace Testing Guide](https://github.com/Useforclaude/quantum-sync-v5/blob/tts-experiments/quantum-sync-v11-production/PAPERSPACE-TTS-TESTING.md) |
|
|
|
|
|
**Original Documentation:** |
|
|
- [Coqui TTS GitHub](https://github.com/coqui-ai/TTS) |
|
|
- [XTTS Paper](https://arxiv.org/abs/2406.04904)
|
|
|
|
|
## 🔗 Links
|
|
|
|
|
- **This Mirror:** https://huggingface.co/useclaude/quantum-sync-xtts-v2 |
|
|
- **Original Model:** https://huggingface.co/coqui/XTTS-v2 |
|
|
- **Quantum Sync Project:** https://github.com/Useforclaude/quantum-sync-v5 |
|
|
- **TTS Library:** https://github.com/coqui-ai/TTS |
|
|
|
|
|
## ⚠️ Disclaimer
|
|
|
|
|
This is an unofficial mirror maintained for backup purposes. For the latest version and official support, please refer to the [original model](https://huggingface.co/coqui/XTTS-v2) and [Coqui TTS repository](https://github.com/coqui-ai/TTS). |
|
|
|
|
|
## 📝 Model Card
|
|
|
|
|
### Model Description |
|
|
|
|
|
XTTS-v2 is a state-of-the-art zero-shot multi-lingual text-to-speech model that can clone voices from short audio samples (6-30 seconds). |
|
|
|
|
|
**Key Features:** |
|
|
- Zero-shot voice cloning |
|
|
- Multi-lingual support (14 languages)
|
|
- High-quality natural speech |
|
|
- No fine-tuning required |
|
|
- Commercial use allowed |
|
|
|
|
|
### Intended Use |
|
|
|
|
|
**Primary Use Cases:** |
|
|
- Voice cloning for content creation |
|
|
- Multi-lingual speech synthesis |
|
|
- Accessibility applications |
|
|
- Audiobook narration |
|
|
- Video dubbing |
|
|
|
|
|
**Out-of-Scope Use:** |
|
|
- Impersonation without consent |
|
|
- Generating misleading content |
|
|
- Illegal activities |
|
|
|
|
|
### Training Data |
|
|
|
|
|
XTTS-v2 was trained on diverse multi-lingual speech data. For details, see the [original model card](https://huggingface.co/coqui/XTTS-v2). |
|
|
|
|
|
### Performance |
|
|
|
|
|
See **Performance** section above for detailed benchmarks from Quantum Sync project. |
|
|
|
|
|
### Ethical Considerations |
|
|
|
|
|
**Voice Cloning Ethics:** |
|
|
- Always obtain consent before cloning someone's voice |
|
|
- Clearly label AI-generated content |
|
|
- Do not use for impersonation or fraud |
|
|
- Follow local regulations on synthetic media |
|
|
|
|
|
### Limitations |
|
|
|
|
|
- May not perfectly preserve all voice characteristics |
|
|
- Quality varies with reference audio quality |
|
|
- Requires GPU for reasonable speed |
|
|
- ~6-8 GB VRAM recommended |
|
|
- Some languages may have better quality than others |
|
|
|
|
|
## 🛠️ Technical Specifications
|
|
|
|
|
**Model Type:** Autoregressive Transformer-based TTS |
|
|
|
|
|
**Framework:** PyTorch |
|
|
|
|
|
**Input:** Text + Reference Audio (6-30 sec WAV) |
|
|
|
|
|
**Output:** 24kHz WAV audio |
|
|
|
|
|
**Inference Time:** ~3-5 seconds per segment (GPU) |
|
|
|
|
|
**Hardware Requirements:** |
|
|
- GPU: NVIDIA with CUDA support |
|
|
- VRAM: 6-8 GB recommended |
|
|
- RAM: 16 GB |
|
|
- Disk: ~2 GB for model |
|
|
|
|
|
**Software Requirements:** |
|
|
- Python 3.9+ |
|
|
- PyTorch 2.0+ |
|
|
- TTS library |
|
|
- CUDA 11.8+ (for GPU) |
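A quick way to confirm the software requirements above before a long run is to check the Python version and whether the key packages are importable. A stdlib-only sketch (the `check_environment` helper is illustrative, not project code; it does not verify package versions or CUDA):

```python
import sys
import importlib.util

def check_environment() -> dict:
    """Report whether the basic software requirements look satisfied."""
    return {
        "python_ok": sys.version_info >= (3, 9),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "tts_installed": importlib.util.find_spec("TTS") is not None,
    }
```

Any `False` entry points to the requirement to fix first; GPU/CUDA availability still needs to be checked separately (e.g. via `torch.cuda.is_available()` once PyTorch is installed).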
|
|
|
|
|
## 📞 Support
|
|
|
|
|
**For this mirror:** |
|
|
- Issues: [Quantum Sync GitHub Issues](https://github.com/Useforclaude/quantum-sync-v5/issues) |
|
|
|
|
|
**For original model:** |
|
|
- Issues: [Coqui TTS GitHub Issues](https://github.com/coqui-ai/TTS/issues) |
|
|
|
|
|
--- |
|
|
|
|
|
**Last Updated:** 2025-10-13 |
|
|
|
|
|
**Mirror Version:** 1.0 |
|
|
|
|
|
**Model Version:** XTTS-v2 (Latest as of upload date) |
|
|
|