---
language:
- en
license: apache-2.0
tags:
- text-to-speech
- tts
- xtts
- voice-cloning
- coqui
library_name: coqui-tts
pipeline_tag: text-to-speech
---

# XTTS v2 Fine-tuned Model (English)

This is a fine-tuned version of [Coqui XTTS v2](https://github.com/coqui-ai/TTS) for English text-to-speech synthesis.

## Model Description

- **Base Model:** XTTS v2
- **Language:** English
- **Training Data:** Custom English speech dataset (~14 minutes)
- **Training Epochs:** 10
- **Best Checkpoint:** Epoch 7 (lowest eval loss: 3.07)

## Training Details

| Parameter | Value |
|-----------|-------|
| Batch Size | 4 |
| Learning Rate | 5e-06 |
| Max Audio Length | 11 seconds |
| Total Training Samples | 168 |
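
The figures above imply a rough training schedule. The sketch below works through the arithmetic, assuming no gradient accumulation and that the ~14 minutes of audio is spread over the 168 training samples (the card states neither, so treat both as assumptions):

```python
import math

num_samples = 168      # Total Training Samples (from the table above)
batch_size = 4         # Batch Size (from the table above)
epochs = 10            # Training Epochs (from the model description)
dataset_minutes = 14   # approximate dataset duration

# Optimizer steps per epoch, assuming one step per batch (no accumulation).
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Average clip length, assuming the full duration covers the training samples.
avg_clip_seconds = dataset_minutes * 60 / num_samples

print(steps_per_epoch, total_steps, avg_clip_seconds)  # 42 420 5.0
```

An average clip of about 5 seconds sits comfortably under the 11-second maximum audio length.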

### Loss Progression

| Epoch | Eval Loss |
|-------|-----------|
| 0 | 3.36 |
| 1 | 3.23 |
| 2 | 3.17 |
| 3 | 3.12 |
| 4 | 3.10 |
| 5 | 3.08 |
| 6 | 3.07 |
| 7 | **3.07** (best) |
| 8 | 3.11 |
| 9 | 3.10 |
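
The table can be summarized with a few lines of arithmetic. The sketch below uses only the rounded values shown above, so it cannot separate epochs 6 and 7; the card reports epoch 7 as best, presumably based on unrounded losses (an assumption).

```python
# Eval loss per epoch (index = epoch), copied from the table above.
eval_loss = [3.36, 3.23, 3.17, 3.12, 3.10, 3.08, 3.07, 3.07, 3.11, 3.10]

best = min(eval_loss)
# Every epoch that reaches the minimum at two-decimal precision.
best_epochs = [epoch for epoch, loss in enumerate(eval_loss) if loss == best]

# First epoch whose eval loss is higher than the previous epoch's.
first_rise = next(
    epoch for epoch in range(1, len(eval_loss))
    if eval_loss[epoch] > eval_loss[epoch - 1]
)

# Relative improvement from epoch 0 to the best eval loss.
rel_improvement = (eval_loss[0] - best) / eval_loss[0]

print(best_epochs, first_rise, f"{rel_improvement:.1%}")
```

Eval loss stops improving after epoch 7 and rises at epoch 8, which is consistent with picking the epoch 7 checkpoint rather than the final one.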

## Usage

### Installation

```bash
pip install TTS==0.22.0 torch==2.5.1 torchaudio==2.5.1 transformers==4.40.0
pip install huggingface_hub
```

### Quick Start

```python
import os
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download model files
repo_id = "TurkishCodeMan/xtts-v2-english-finetuned"
model_path = hf_hub_download(repo_id=repo_id, filename="model.pth")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.json")

# Load model
config = XttsConfig()
config.load_json(config_path)

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_dir=os.path.dirname(model_path),
    checkpoint_path=model_path,
    vocab_path=vocab_path,
    use_deepspeed=False
)
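# Use the GPU when available; fall back to CPU otherwise (slower)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)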

# Generate speech (download a sample reference audio first)
ref_audio = hf_hub_download(repo_id=repo_id, filename="samples/speaker_reference.wav")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=ref_audio)

out = model.inference(
    text="Hello, this is a test of the fine-tuned XTTS model.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)

# Add a channel dimension and save at the 24 kHz sample rate used by XTTS v2
wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
```
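
For longer passages, it can help to synthesize text in shorter pieces, since training clips were capped at 11 seconds. The helper below is a minimal sketch of sentence-based chunking; `split_into_chunks` and the `max_chars=200` default are hypothetical choices for illustration, not part of the Coqui API.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    max_chars=200 is an illustrative default, not a documented model limit.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two. Three.", max_chars=9))
```

Each chunk can then be passed to `model.inference` with the same `gpt_cond_latent` and `speaker_embedding`, and the resulting waveforms concatenated before saving.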

## Audio Samples

| Type | File |
|------|------|
| Speaker Reference | [speaker_reference.wav](samples/speaker_reference.wav) |
| Generated Output | [generated_output.wav](samples/generated_output.wav) |

## Requirements

⚠️ **Important:** Use the exact versions listed below to avoid compatibility issues.

- Python 3.10+
- PyTorch 2.5.1
- torchaudio 2.5.1 (NOT 2.9.1+)
- transformers 4.40.0 (NOT 4.50+)
- TTS 0.22.0
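
A quick way to confirm that an environment matches these pins is to compare installed versions against them. This is a minimal sketch using the standard library; `PINNED` and `find_mismatches` are hypothetical names, and ignoring local build suffixes such as `+cu121` is an assumption about how strict the check should be.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions taken from the requirements list above.
PINNED = {
    "TTS": "0.22.0",
    "torch": "2.5.1",
    "torchaudio": "2.5.1",
    "transformers": "4.40.0",
}


def find_mismatches(pinned, get_version=version):
    """Return {package: (expected, installed_or_None)} for each mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu121" when comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in find_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

If the loop prints nothing, all four packages match the pinned versions.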

## Known Issues & Solutions

1. **StopIteration error in trainer:** Patch `trainer/generic_utils.py`, or apply a monkey-patch before importing TTS.
2. **Multi-GPU error:** Set `CUDA_VISIBLE_DEVICES=0` before importing torch or TTS so that only one GPU is visible.
3. **torchcodec error:** Downgrade torchaudio to 2.5.1.
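
For the multi-GPU workaround, ordering matters: the environment variable has to be set before CUDA is initialized. A minimal sketch, assuming GPU index 0 is the device you want to use:

```python
import os

# Set this before torch or TTS is imported, because CUDA reads the
# variable when it is first initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Heavy imports go after the environment is configured, for example:
# import torch
# from TTS.tts.models.xtts import Xtts

print(os.environ["CUDA_VISIBLE_DEVICES"])  # 0
```

Exporting the variable in the shell before launching Python has the same effect.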

## License

Apache 2.0

## Acknowledgments

- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [XTTS v2](https://huggingface.co/coqui/XTTS-v2)