|
|
---
library_name: transformers
license: mpl-2.0
tags:
- text-to-speech
- tts
- xtts-v2
- voice-cloning
- multilingual
- coqui
language:
- en
- th
- es
- fr
- de
- it
- pt
- pl
- tr
- ru
- nl
- cs
- ar
- zh
---
|
|
|
|
|
# XTTS-v2 Model Mirror for Quantum Sync |
|
|
|
|
|
This is a mirror/backup of the **Coqui XTTS-v2** model for use with the [Quantum Sync](https://github.com/Useforclaude/quantum-sync-v5) project. |
|
|
|
|
|
## 🎯 Purpose
|
|
|
|
|
This mirror serves as: |
|
|
- **Backup** in case the original model becomes unavailable |
|
|
- **Faster access** for Quantum Sync users |
|
|
- **Stable reference** for production deployments |
|
|
|
|
|
## 📋 Model Information
|
|
|
|
|
**Original Model:** [coqui/XTTS-v2](https://huggingface.co/coqui/XTTS-v2) |
|
|
|
|
|
**Architecture:** XTTS-v2 (Zero-shot multi-lingual TTS) |
|
|
|
|
|
**Model Size:** ~1.87 GB |
|
|
|
|
|
**Supported Languages:** 14 languages
|
|
- English (en) |
|
|
- Thai (th) |
|
|
- Spanish (es) |
|
|
- French (fr) |
|
|
- German (de) |
|
|
- Italian (it) |
|
|
- Portuguese (pt) |
|
|
- Polish (pl) |
|
|
- Turkish (tr) |
|
|
- Russian (ru) |
|
|
- Dutch (nl) |
|
|
- Czech (cs) |
|
|
- Arabic (ar) |
|
|
- Chinese (zh-cn) |
|
|
|
|
|
## 🚀 Usage
|
|
|
|
|
### With Quantum Sync (Recommended) |
|
|
|
|
|
```bash
git clone https://github.com/Useforclaude/quantum-sync-v5.git
cd quantum-sync-v5/quantum-sync-v11-production

# Configure to use this mirror:
# edit tts_engines/xtts.py and set
#   model_name = "useclaude/quantum-sync-xtts-v2"

python main_v11.py input/file.srt \
    --voice MyVoice \
    --voice-sample /path/to/voice.wav \
    --tts-engine xtts-v2 \
    --tts-language en
```
|
|
|
|
|
### Direct Usage with TTS Library |
|
|
|
|
|
```python
from TTS.api import TTS

# Load the model from this mirror
tts = TTS(model_name="useclaude/quantum-sync-xtts-v2")

# Generate speech with a cloned voice
tts.tts_to_file(
    text="Hello, this is a test.",
    speaker_wav="reference_voice.wav",
    language="en",
    file_path="output.wav",
)
```
|
|
|
|
|
### Voice Cloning Example |
|
|
|
|
|
```python
from TTS.api import TTS

# Initialize from this mirror
tts = TTS(model_name="useclaude/quantum-sync-xtts-v2")

# Clone a voice from reference audio (6-30 seconds)
tts.tts_to_file(
    text="The quick brown fox jumps over the lazy dog.",
    speaker_wav="my_voice_sample.wav",  # your voice reference
    language="en",
    file_path="output_cloned.wav",
)
```
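Reference clips outside the recommended 6-30 second window are a common source of poor cloning results, so it can be worth checking durations up front. A minimal stdlib sketch (the `reference_is_usable` helper is illustrative, not part of Quantum Sync or the TTS library):

```python
import wave

def reference_is_usable(path: str, min_s: float = 6.0, max_s: float = 30.0) -> bool:
    """Return True if the WAV file's duration is in the recommended range."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / float(w.getframerate())
    return min_s <= duration <= max_s
```

Run this over your reference files before a long synthesis job to catch clips that are too short or too long.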
|
|
|
|
|
## 📊 Performance
|
|
|
|
|
**From Quantum Sync Production Tests (2025-10-13):** |
|
|
|
|
|
| Metric | Value |
|--------|-------|
| **Synthesis Speed** | ~3.7 seconds/segment |
| **Processing Time** | 17 min for 277 segments (23 min of audio) |
| **Duration Accuracy** | ~87% audio, ~13% silence gaps |
| **Timeline Drift** | -1.7% (excellent) |
| **Voice Quality** | 8/10 |
| **Cloning Accuracy** | Excellent |
| **VRAM Usage** | 6-8 GB |
|
|
|
|
|
**Comparison:** |
|
|
- **XTTS-v2**: 15-17 min, 8/10 quality, FREE, 87% audio |
|
|
- **F5-TTS**: 20-25 min, 7/10 quality, FREE, 55% audio |
|
|
- **AWS Polly**: 5 min, 9/10 quality, ~$0.06, no cloning |
|
|
|
|
|
## 🎛️ Advanced Parameters
|
|
|
|
|
```python
# Speed control (0.5 - 2.0)
tts.tts_to_file(
    text="Hello world",
    speaker_wav="voice.wav",
    language="en",
    speed=0.8,  # slower speech
    file_path="output_slow.wav",
)

# Temperature control (0.1 - 1.0)
tts.tts_to_file(
    text="Hello world",
    speaker_wav="voice.wav",
    language="en",
    temperature=0.75,  # more varied, expressive delivery
    file_path="output_expressive.wav",
)
```
|
|
|
|
|
## 📦 Model Files
|
|
|
|
|
```
quantum-sync-xtts-v2/
├── model.pth          (1.87 GB - neural network weights)
├── config.json        (model configuration)
├── vocab.json         (vocabulary for tokenization)
├── speakers_xtts.pth  (speaker embeddings)
├── dvae.pth           (DVAE component)
├── mel_stats.pth      (mel-spectrogram statistics)
├── LICENSE            (MPL 2.0)
└── README.md          (this file)
```
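A partially downloaded mirror fails in confusing ways at load time, so it can help to verify the files above are present first. A stdlib sketch (the `missing_files` helper is illustrative, not part of this repository):

```python
from pathlib import Path

# Core model files from the listing above (LICENSE/README omitted)
REQUIRED_FILES = [
    "model.pth",
    "config.json",
    "vocab.json",
    "speakers_xtts.pth",
    "dvae.pth",
    "mel_stats.pth",
]

def missing_files(model_dir: str) -> list[str]:
    """Return the required model files not present in model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty return value means all core files are in place; otherwise, re-download the listed files from the mirror.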
|
|
|
|
|
## 📄 License
|
|
|
|
|
**Mozilla Public License 2.0 (MPL 2.0)** |
|
|
|
|
|
This model is licensed under the Mozilla Public License 2.0. You can: |
|
|
- ✅ Use commercially
- ✅ Modify the model
- ✅ Distribute the model
- ✅ Use in proprietary software
|
|
|
|
|
**Requirements:** |
|
|
- Include license and copyright notice |
|
|
- State changes if you modify the model |
|
|
- Disclose source for modifications |
|
|
|
|
|
**Full License:** [LICENSE](./LICENSE) |
|
|
|
|
|
## 🙏 Attribution
|
|
|
|
|
**Original Work:** |
|
|
- **Project:** [Coqui TTS](https://github.com/coqui-ai/TTS) |
|
|
- **Model:** XTTS-v2 |
|
|
- **Authors:** Coqui TTS Team |
|
|
- **License:** Mozilla Public License 2.0 |
|
|
|
|
|
**This Mirror:** |
|
|
- **Purpose:** Backup for Quantum Sync project |
|
|
- **Maintained by:** [Your Name/Organization] |
|
|
- **Original Source:** https://huggingface.co/coqui/XTTS-v2 |
|
|
|
|
|
All credit goes to the original Coqui TTS team. This is simply a mirror for backup and convenience. |
|
|
|
|
|
## 📚 Documentation
|
|
|
|
|
**Quantum Sync Documentation:** |
|
|
- [XTTS-v2 Quick Start Guide](https://github.com/Useforclaude/quantum-sync-v5/blob/tts-experiments/quantum-sync-v11-production/XTTS-QUICK-START.md) |
|
|
- [Paperspace Testing Guide](https://github.com/Useforclaude/quantum-sync-v5/blob/tts-experiments/quantum-sync-v11-production/PAPERSPACE-TTS-TESTING.md) |
|
|
|
|
|
**Original Documentation:** |
|
|
- [Coqui TTS GitHub](https://github.com/coqui-ai/TTS) |
|
|
- [XTTS Paper](https://arxiv.org/abs/2406.04904)
|
|
|
|
|
## 🔗 Links
|
|
|
|
|
- **This Mirror:** https://huggingface.co/useclaude/quantum-sync-xtts-v2 |
|
|
- **Original Model:** https://huggingface.co/coqui/XTTS-v2 |
|
|
- **Quantum Sync Project:** https://github.com/Useforclaude/quantum-sync-v5 |
|
|
- **TTS Library:** https://github.com/coqui-ai/TTS |
|
|
|
|
|
## ⚠️ Disclaimer
|
|
|
|
|
This is an unofficial mirror maintained for backup purposes. For the latest version and official support, please refer to the [original model](https://huggingface.co/coqui/XTTS-v2) and [Coqui TTS repository](https://github.com/coqui-ai/TTS). |
|
|
|
|
|
## 📝 Model Card
|
|
|
|
|
### Model Description |
|
|
|
|
|
XTTS-v2 is a state-of-the-art zero-shot multi-lingual text-to-speech model that can clone voices from short audio samples (6-30 seconds). |
|
|
|
|
|
**Key Features:** |
|
|
- Zero-shot voice cloning |
|
|
- Multi-lingual support (14 languages)
|
|
- High-quality natural speech |
|
|
- No fine-tuning required |
|
|
- Commercial use allowed |
|
|
|
|
|
### Intended Use |
|
|
|
|
|
**Primary Use Cases:** |
|
|
- Voice cloning for content creation |
|
|
- Multi-lingual speech synthesis |
|
|
- Accessibility applications |
|
|
- Audiobook narration |
|
|
- Video dubbing |
|
|
|
|
|
**Out-of-Scope Use:** |
|
|
- Impersonation without consent |
|
|
- Generating misleading content |
|
|
- Illegal activities |
|
|
|
|
|
### Training Data |
|
|
|
|
|
XTTS-v2 was trained on diverse multi-lingual speech data. For details, see the [original model card](https://huggingface.co/coqui/XTTS-v2). |
|
|
|
|
|
### Performance |
|
|
|
|
|
See **Performance** section above for detailed benchmarks from Quantum Sync project. |
|
|
|
|
|
### Ethical Considerations |
|
|
|
|
|
**Voice Cloning Ethics:** |
|
|
- Always obtain consent before cloning someone's voice |
|
|
- Clearly label AI-generated content |
|
|
- Do not use for impersonation or fraud |
|
|
- Follow local regulations on synthetic media |
|
|
|
|
|
### Limitations |
|
|
|
|
|
- May not perfectly preserve all voice characteristics |
|
|
- Quality varies with reference audio quality |
|
|
- Requires GPU for reasonable speed |
|
|
- ~6-8 GB VRAM recommended |
|
|
- Some languages may have better quality than others |
|
|
|
|
|
## 🛠️ Technical Specifications
|
|
|
|
|
**Model Type:** Autoregressive Transformer-based TTS |
|
|
|
|
|
**Framework:** PyTorch |
|
|
|
|
|
**Input:** Text + Reference Audio (6-30 sec WAV) |
|
|
|
|
|
**Output:** 24kHz WAV audio |
|
|
|
|
|
**Inference Time:** ~3-5 seconds per segment (GPU) |
|
|
|
|
|
**Hardware Requirements:** |
|
|
- GPU: NVIDIA with CUDA support |
|
|
- VRAM: 6-8 GB recommended |
|
|
- RAM: 16 GB |
|
|
- Disk: ~2 GB for model |
|
|
|
|
|
**Software Requirements:** |
|
|
- Python 3.9+ |
|
|
- PyTorch 2.0+ |
|
|
- TTS library |
|
|
- CUDA 11.8+ (for GPU) |
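A quick way to confirm the software requirements above before a long run is to check the Python version and whether the key packages are importable. A stdlib-only sketch (the `check_environment` helper is illustrative, not project code; it does not verify package versions or CUDA):

```python
import sys
import importlib.util

def check_environment() -> dict:
    """Report whether the basic software requirements look satisfied."""
    return {
        "python_ok": sys.version_info >= (3, 9),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "tts_installed": importlib.util.find_spec("TTS") is not None,
    }
```

Any `False` entry points to the requirement to fix first; GPU/CUDA availability still needs to be checked separately (e.g. via `torch.cuda.is_available()` once PyTorch is installed).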
|
|
|
|
|
## 📞 Support
|
|
|
|
|
**For this mirror:** |
|
|
- Issues: [Quantum Sync GitHub Issues](https://github.com/Useforclaude/quantum-sync-v5/issues) |
|
|
|
|
|
**For original model:** |
|
|
- Issues: [Coqui TTS GitHub Issues](https://github.com/coqui-ai/TTS/issues) |
|
|
|
|
|
--- |
|
|
|
|
|
**Last Updated:** 2025-10-13 |
|
|
|
|
|
**Mirror Version:** 1.0 |
|
|
|
|
|
**Model Version:** XTTS-v2 (Latest as of upload date) |
|
|
|