---
language:
- en
license: apache-2.0
tags:
- text-to-speech
- tts
- xtts
- voice-cloning
- coqui
library_name: coqui-tts
pipeline_tag: text-to-speech
---

# XTTS v2 Fine-tuned Model (English)

This is a fine-tuned version of [Coqui XTTS v2](https://github.com/coqui-ai/TTS) for English text-to-speech synthesis.

## Model Description

- **Base Model:** XTTS v2
- **Language:** English
- **Training Data:** Custom English speech dataset (~14 minutes)
- **Training Epochs:** 10
- **Best Checkpoint:** Epoch 7 (lowest eval loss: 3.07)
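As a quick sanity check on the figures above (a sketch, assuming the ~14 minutes covers all 168 samples), the average clip length works out to about 5 seconds, comfortably under the 11-second max audio length used in training:

```python
# Back-of-the-envelope check on the dataset figures above.
total_minutes = 14   # approximate total audio in the training set
num_samples = 168    # total training samples
avg_seconds = total_minutes * 60 / num_samples
print(avg_seconds)  # 5.0
```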
## Training Details

| Parameter | Value |
|-----------|-------|
| Batch Size | 4 |
| Learning Rate | 5e-06 |
| Max Audio Length | 11 seconds |
| Total Training Samples | 168 |

### Loss Progression

| Epoch | Eval Loss |
|-------|-----------|
| 0 | 3.36 |
| 1 | 3.23 |
| 2 | 3.17 |
| 3 | 3.12 |
| 4 | 3.10 |
| 5 | 3.08 |
| 6 | 3.07 |
| 7 | **3.07** (best) |
| 8 | 3.11 |
| 9 | 3.10 |

## Usage

### Installation

```bash
pip install TTS==0.22.0 torch==2.5.1 torchaudio==2.5.1 transformers==4.40.0
pip install huggingface_hub
```

### Quick Start

```python
import os
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download model files
repo_id = "TurkishCodeMan/xtts-v2-english-finetuned"
model_path = hf_hub_download(repo_id=repo_id, filename="model.pth")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.json")

# Load model
config = XttsConfig()
config.load_json(config_path)

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_dir=os.path.dirname(model_path),
    checkpoint_path=model_path,
    vocab_path=vocab_path,
    use_deepspeed=False,
)
model.cuda()

# Generate speech (download a sample reference audio first)
ref_audio = hf_hub_download(repo_id=repo_id, filename="samples/speaker_reference.wav")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=ref_audio)

out = model.inference(
    text="Hello, this is a test of the fine-tuned XTTS model.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)

wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
```
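XTTS tends to work best on sentence-length inputs, so longer passages are usually split into chunks before being passed to `model.inference` one at a time. A minimal stdlib splitter sketch (the helper name and the 200-character threshold are illustrative assumptions, not limits stated by this model card):

```python
import re

def split_into_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence boundaries, merging short sentences up to max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    for sentence in sentences:
        # Merge into the previous chunk while it stays under the size cap.
        if chunks and len(chunks[-1]) + 1 + len(sentence) <= max_chars:
            chunks[-1] += " " + sentence
        else:
            chunks.append(sentence)
    return chunks

print(split_into_sentences("Hello there. This is a test. Goodbye!", max_chars=20))
# → ['Hello there.', 'This is a test.', 'Goodbye!']
```

Each chunk can then be synthesized with the same `gpt_cond_latent` and `speaker_embedding`, since the conditioning latents depend only on the reference audio.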
## Audio Samples

| Type | File |
|------|------|
| Speaker Reference | [speaker_reference.wav](samples/speaker_reference.wav) |
| Generated Output | [generated_output.wav](samples/generated_output.wav) |

## Requirements

⚠️ **Important:** Use these specific versions to avoid compatibility issues.

- Python 3.10+
- PyTorch 2.5.1
- torchaudio 2.5.1 (NOT 2.9.1+)
- transformers 4.40.0 (NOT 4.50+)
- TTS 0.22.0
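The pins above can be kept in a `requirements.txt` so the environment is reproducible in one command (the file name and the unpinned `huggingface_hub` line are conventions, not taken from this card):

```text
TTS==0.22.0
torch==2.5.1
torchaudio==2.5.1
transformers==4.40.0
huggingface_hub
```

Install with `pip install -r requirements.txt`.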
## Known Issues & Solutions

1. **StopIteration error in trainer:** Patch `trainer/generic_utils.py`, or apply a monkey-patch before importing TTS.
2. **Multi-GPU error:** Set `CUDA_VISIBLE_DEVICES=0` before any imports.
3. **torchcodec error:** Downgrade torchaudio to 2.5.1.
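For issue 2 the key detail is ordering: the environment variable must be set before torch or TTS is imported, because device discovery happens at import time. A minimal sketch (the StopIteration monkey-patch itself is not reproduced here):

```python
import os

# Work around the multi-GPU trainer error: pin training to a single GPU
# *before* any torch / TTS / trainer imports.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Imports of torch, TTS, and the trainer go below this line.
```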
## License

Apache 2.0

## Acknowledgments

- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [XTTS v2](https://huggingface.co/coqui/XTTS-v2)