---
language:
- en
license: apache-2.0
tags:
- text-to-speech
- tts
- xtts
- voice-cloning
- coqui
library_name: coqui-tts
pipeline_tag: text-to-speech
---

# XTTS v2 Fine-tuned Model (English)

This is a fine-tuned version of [Coqui XTTS v2](https://github.com/coqui-ai/TTS) for English text-to-speech synthesis.

## Model Description

- **Base Model:** XTTS v2
- **Language:** English
- **Training Data:** Custom English speech dataset (~14 minutes)
- **Training Epochs:** 10 (numbered 0 to 9)
- **Best Checkpoint:** Epoch 7 (lowest eval loss: 3.07)

## Training Details

| Parameter | Value |
|-----------|-------|
| Batch Size | 4 |
| Learning Rate | 5e-06 |
| Max Audio Length | 11 seconds |
| Total Training Samples | 168 |
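
As a rough sanity check on these numbers, the sketch below derives the steps per epoch and the average clip length. It assumes all 168 samples are used for training, that there is no gradient accumulation, and that the dataset is exactly 14 minutes long; none of these is confirmed by the training logs.

```python
import math

# Values taken from this model card
num_samples = 168
batch_size = 4
epochs = 10
dataset_minutes = 14  # the card says "~14 minutes", so this is approximate

# Assumption: one optimizer step per batch, last partial batch kept
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Average clip duration in seconds
avg_clip_seconds = dataset_minutes * 60 / num_samples

print(steps_per_epoch, total_steps, avg_clip_seconds)
```

Under these assumptions, training runs about 42 steps per epoch (420 in total) on clips averaging roughly 5 seconds, well under the 11-second maximum.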

### Loss Progression

| Epoch | Eval Loss |
|-------|-----------|
| 0 | 3.36 |
| 1 | 3.23 |
| 2 | 3.17 |
| 3 | 3.12 |
| 4 | 3.10 |
| 5 | 3.08 |
| 6 | 3.07 |
| 7 | **3.07** (best) |
| 8 | 3.11 |
| 9 | 3.10 |
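
The table can be summarized with a few lines of Python. This is only a sketch over the rounded values shown above; the unrounded losses from the training run are not available here, so ties at two decimal places cannot be resolved.

```python
# Eval loss per epoch, copied from the table (rounded to two decimals)
eval_loss = {
    0: 3.36, 1: 3.23, 2: 3.17, 3: 3.12, 4: 3.10,
    5: 3.08, 6: 3.07, 7: 3.07, 8: 3.11, 9: 3.10,
}

lowest = min(eval_loss.values())

# Every epoch that reaches the minimum at this precision
best_epochs = [epoch for epoch, loss in eval_loss.items() if loss == lowest]

# Relative improvement from the first epoch to the lowest loss
relative_drop = (eval_loss[0] - lowest) / eval_loss[0]

print(best_epochs, f"{relative_drop:.1%}")
```

At this precision epochs 6 and 7 tie, so the choice of epoch 7 as the best checkpoint presumably rests on the unrounded values. The loss rises again after epoch 7, which is why a later checkpoint was not selected.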

## Usage

### Installation

```bash
pip install TTS==0.22.0 torch==2.5.1 torchaudio==2.5.1 transformers==4.40.0
pip install huggingface_hub
```

### Quick Start

```python
import os
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download model files
repo_id = "TurkishCodeMan/xtts-v2-english-finetuned"
model_path = hf_hub_download(repo_id=repo_id, filename="model.pth")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.json")

# Load model
config = XttsConfig()
config.load_json(config_path)

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_dir=os.path.dirname(model_path),
    checkpoint_path=model_path,
    vocab_path=vocab_path,
    use_deepspeed=False
)
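# Use the GPU when available; fall back to CPU (much slower) otherwise
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)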

# Download the sample reference audio and compute the speaker conditioning
ref_audio = hf_hub_download(repo_id=repo_id, filename="samples/speaker_reference.wav")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=ref_audio)

out = model.inference(
    text="Hello, this is a test of the fine-tuned XTTS model.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)

# Add a channel dimension and save; XTTS v2 produces 24 kHz audio
wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
```

## Audio Samples

| Type | File |
|------|------|
| Speaker Reference | [speaker_reference.wav](samples/speaker_reference.wav) |
| Generated Output | [generated_output.wav](samples/generated_output.wav) |

## Requirements

⚠️ **Important:** Pin the exact versions listed below. Newer torchaudio and transformers releases cause compatibility errors.

- Python 3.10+
- PyTorch 2.5.1
- torchaudio 2.5.1 (NOT 2.9.1+)
- transformers 4.40.0 (NOT 4.50+)
- TTS 0.22.0
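
To confirm that an environment matches these pins before loading the model, a small check along the following lines may help. It is a sketch, not part of the original setup: the `check_pins` helper is hypothetical, and stripping local build suffixes such as `+cu121` is an assumption about how the installed wheels are labeled.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions from this model card
PINNED = {
    "TTS": "0.22.0",
    "torch": "2.5.1",
    "torchaudio": "2.5.1",
    "transformers": "4.40.0",
}


def check_pins(pins, get_version=version):
    """Return {package: installed_version_or_None} for every mismatch."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed
        # Ignore a local build suffix such as "+cu121" when comparing
        if found is None or found.split("+")[0] != wanted:
            problems[name] = found
    return problems


for name, found in check_pins(PINNED).items():
    print(f"{name}: expected {PINNED[name]}, found {found}")
```

An empty result means every pinned package is installed at the expected version.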

## Known Issues & Solutions

1. **StopIteration error in trainer:** Patch `trainer/generic_utils.py`, or apply a monkey-patch before importing TTS.
2. **Multi-GPU error:** Set `CUDA_VISIBLE_DEVICES=0` before importing torch or TTS.
3. **torchcodec error:** Downgrade torchaudio to 2.5.1.
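
For the multi-GPU workaround, the ordering matters: the variable has to be set before any CUDA-using library is imported. A minimal sketch, where the `pin_single_gpu` helper is a hypothetical name introduced only for illustration:

```python
import os


def pin_single_gpu(device_index: int = 0) -> str:
    """Restrict CUDA to one device; call before importing torch or TTS."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


pin_single_gpu(0)

# Only import the CUDA-using libraries after the variable is set, e.g.:
# import torch
# from TTS.tts.models.xtts import Xtts
```

Setting the variable in the shell (`CUDA_VISIBLE_DEVICES=0 python train.py`, with `train.py` standing in for your own script) has the same effect and avoids the import-order concern.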

## License

Apache 2.0

## Acknowledgments

- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [XTTS v2](https://huggingface.co/coqui/XTTS-v2)