---
language:
- en
license: apache-2.0
tags:
- text-to-speech
- tts
- xtts
- voice-cloning
- coqui
library_name: coqui-tts
pipeline_tag: text-to-speech
---
# XTTS v2 Fine-tuned Model (English)
This is a fine-tuned version of [Coqui XTTS v2](https://github.com/coqui-ai/TTS) for English text-to-speech synthesis.
## Model Description
- **Base Model:** XTTS v2
- **Language:** English
- **Training Data:** Custom English speech dataset (~14 minutes)
- **Training Epochs:** 10
- **Best Checkpoint:** Epoch 7 (lowest eval loss: 3.07)
## Training Details
| Parameter | Value |
|-----------|-------|
| Batch Size | 4 |
| Learning Rate | 5e-06 |
| Max Audio Length | 11 seconds |
| Total Training Samples | 168 |
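
As a rough sanity check on these numbers, the sketch below derives the implied step counts and average clip length. It assumes no gradient accumulation and that the ~14 minutes of audio corresponds to the 168 training samples; neither is stated in this card.

```python
import math

# Values taken from the tables above.
num_samples = 168
batch_size = 4
epochs = 10
total_audio_minutes = 14  # approximate figure from the model description

# Assumption: one optimizer step per batch (no gradient accumulation).
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: the ~14 minutes covers exactly these 168 samples.
avg_clip_seconds = total_audio_minutes * 60 / num_samples

print(f"steps/epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"average clip length: {avg_clip_seconds:.1f} s")
```

Under those assumptions the average clip is well below the 11-second maximum audio length.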
### Loss Progression
| Epoch | Eval Loss |
|-------|-----------|
| 0 | 3.36 |
| 1 | 3.23 |
| 2 | 3.17 |
| 3 | 3.12 |
| 4 | 3.10 |
| 5 | 3.08 |
| 6 | 3.07 |
| 7 | **3.07** (best) |
| 8 | 3.11 |
| 9 | 3.10 |
## Usage
### Installation
```bash
pip install TTS==0.22.0 torch==2.5.1 torchaudio==2.5.1 transformers==4.40.0
pip install huggingface_hub
```
### Quick Start
```python
import os
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
# Download model files
repo_id = "TurkishCodeMan/xtts-v2-english-finetuned"
model_path = hf_hub_download(repo_id=repo_id, filename="model.pth")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.json")
# Load model
config = XttsConfig()
config.load_json(config_path)
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_dir=os.path.dirname(model_path),
    checkpoint_path=model_path,
    vocab_path=vocab_path,
    use_deepspeed=False,
)
model.cuda()
# Generate speech (download a sample reference audio first)
ref_audio = hf_hub_download(repo_id=repo_id, filename="samples/speaker_reference.wav")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=ref_audio)
out = model.inference(
    text="Hello, this is a test of the fine-tuned XTTS model.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)
wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
```
## Audio Samples
| Type | File |
|------|------|
| Speaker Reference | [speaker_reference.wav](samples/speaker_reference.wav) |
| Generated Output | [generated_output.wav](samples/generated_output.wav) |
## Requirements
⚠️ **Important:** Use the exact versions listed below to avoid compatibility issues.
- Python 3.10+
- PyTorch 2.5.1
- torchaudio 2.5.1 (NOT 2.9.1+)
- transformers 4.40.0 (NOT 4.50+)
- TTS 0.22.0
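
To confirm that an environment matches these pins before loading the model, a small check along the following lines may help. This is a minimal sketch: the `check_versions` helper and the `PINNED` mapping are hypothetical names introduced here, and it relies only on the standard-library `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above; PINNED and check_versions are
# illustrative names, not part of the TTS library.
PINNED = {
    "TTS": "0.22.0",
    "torch": "2.5.1",
    "torchaudio": "2.5.1",
    "transformers": "4.40.0",
}


def check_versions(pinned=PINNED):
    """Return a list of human-readable mismatches (empty if all pins match)."""
    problems = []
    for name, wanted in pinned.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {wanted})")
            continue
        # Ignore local build suffixes such as "+cu121" on PyTorch wheels.
        if found.split("+")[0] != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems


for line in check_versions():
    print("version mismatch:", line)
```

An empty result means every pinned package is installed at the expected version.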
## Known Issues & Solutions
1. **StopIteration error in trainer:** Patch `trainer/generic_utils.py`, or apply a monkey-patch before importing TTS.
2. **Multi-GPU error:** Set `CUDA_VISIBLE_DEVICES=0` before importing `torch` or `TTS`.
3. **torchcodec error:** Downgrade torchaudio to 2.5.1.
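
For the multi-GPU issue, one way to set the variable from within Python is sketched below. The key point is ordering: the assignment has to run before CUDA is initialized, so it is placed ahead of the `torch` and `TTS` imports (shown commented out so the snippet runs on its own).

```python
import os

# Restrict this process to the first GPU. This must happen before
# torch (or TTS, which imports torch) initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Only now import the GPU-dependent libraries:
# import torch
# from TTS.tts.models.xtts import Xtts

print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```

Alternatively, the variable can be set in the shell when launching the script, which avoids any dependence on import order.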
## License
Apache 2.0
## Acknowledgments
- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [XTTS v2](https://huggingface.co/coqui/XTTS-v2)