---
language:
- en
license: apache-2.0
tags:
- text-to-speech
- tts
- xtts
- voice-cloning
- coqui
library_name: coqui-tts
pipeline_tag: text-to-speech
---
# XTTS v2 Fine-tuned Model (English)
This is a fine-tuned version of [Coqui XTTS v2](https://github.com/coqui-ai/TTS) for English text-to-speech synthesis.
## Model Description
- **Base Model:** XTTS v2
- **Language:** English
- **Training Data:** Custom English speech dataset (~14 minutes)
- **Training Epochs:** 10
- **Best Checkpoint:** Epoch 7 (lowest eval loss: 3.07)
## Training Details
| Parameter | Value |
|-----------|-------|
| Batch Size | 4 |
| Learning Rate | 5e-06 |
| Max Audio Length | 11 seconds |
| Total Training Samples | 168 |
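The step count per epoch follows directly from the table above; a quick sanity check, assuming no gradient accumulation (not stated in the card):

```python
# Values taken from the Training Details and Model Description sections.
total_samples = 168
batch_size = 4
epochs = 10

steps_per_epoch = total_samples // batch_size
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 42 steps/epoch, 420 steps total
```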
### Loss Progression
| Epoch | Eval Loss |
|-------|-----------|
| 0 | 3.36 |
| 1 | 3.23 |
| 2 | 3.17 |
| 3 | 3.12 |
| 4 | 3.10 |
| 5 | 3.08 |
| 6 | 3.07 |
| 7 | **3.07** (best) |
| 8 | 3.11 |
| 9 | 3.10 |
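Epochs 6 and 7 tie at 3.07 (to two decimals), and the card keeps epoch 7. The exact tie-breaking used in training isn't stated; a minimal sketch that reproduces the table's outcome — lowest eval loss, latest epoch on ties — is:

```python
# Eval losses copied from the table above.
eval_losses = {0: 3.36, 1: 3.23, 2: 3.17, 3: 3.12, 4: 3.10,
               5: 3.08, 6: 3.07, 7: 3.07, 8: 3.11, 9: 3.10}

best_loss = min(eval_losses.values())
# On a tie, prefer the later checkpoint (more training at the same eval loss).
best_epoch = max(e for e, loss in eval_losses.items() if loss == best_loss)
print(best_epoch, best_loss)  # 7 3.07
```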
## Usage
### Installation
```bash
pip install TTS==0.22.0 torch==2.5.1 torchaudio==2.5.1 transformers==4.40.0
pip install huggingface_hub
```
### Quick Start
```python
import os
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
# Download model files
repo_id = "TurkishCodeMan/xtts-v2-english-finetuned"
model_path = hf_hub_download(repo_id=repo_id, filename="model.pth")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.json")
# Load model
config = XttsConfig()
config.load_json(config_path)
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_dir=os.path.dirname(model_path),
    checkpoint_path=model_path,
    vocab_path=vocab_path,
    use_deepspeed=False,
)
model.cuda()  # requires a CUDA GPU; use model.to("cpu") for CPU-only inference (much slower)
# Generate speech (download a sample reference audio first)
ref_audio = hf_hub_download(repo_id=repo_id, filename="samples/speaker_reference.wav")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=ref_audio)
out = model.inference(
text="Hello, this is a test of the fine-tuned XTTS model.",
language="en",
gpt_cond_latent=gpt_cond_latent,
speaker_embedding=speaker_embedding,
)
# XTTS outputs a float waveform at 24 kHz
wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
```
## Audio Samples
| Type | File |
|------|------|
| Speaker Reference | [speaker_reference.wav](samples/speaker_reference.wav) |
| Generated Output | [generated_output.wav](samples/generated_output.wav) |
## Requirements
⚠️ **Important:** Use the pinned versions below to avoid compatibility issues.
- Python 3.10+
- PyTorch 2.5.1
- torchaudio 2.5.1 (NOT 2.9.1+)
- transformers 4.40.0 (NOT 4.50+)
- TTS 0.22.0
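The pins above can be checked at runtime before importing TTS. A small stdlib-only sketch (assuming the installed package names match the pip names above):

```python
from importlib.metadata import version, PackageNotFoundError

# Pinned versions from the Requirements section.
PINNED = {
    "TTS": "0.22.0",
    "torch": "2.5.1",
    "torchaudio": "2.5.1",
    "transformers": "4.40.0",
}

def check_pins(pins=PINNED):
    """Return human-readable warnings for missing or mismatched packages."""
    problems = []
    for pkg, want in pins.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            problems.append(f"{pkg} is not installed (want {want})")
            continue
        if have != want:
            problems.append(f"{pkg}=={have} installed, but {want} is pinned")
    return problems

for warning in check_pins():
    print("WARNING:", warning)
```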
## Known Issues & Solutions
1. **StopIteration error in trainer:** Patch `trainer/generic_utils.py`, or apply a monkey patch before importing TTS.
2. **Multi-GPU error:** Set `CUDA_VISIBLE_DEVICES=0` before any torch/TTS imports.
3. **torchcodec error:** Downgrade torchaudio to 2.5.1.
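For issue 2 above, the environment variable only takes effect if it is set before CUDA initializes; a minimal sketch:

```python
import os

# Pin to a single GPU. This must run before `import torch` / `import TTS`;
# once CUDA has enumerated devices, changing CUDA_VISIBLE_DEVICES is ignored.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```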
## License
Apache 2.0
## Acknowledgments
- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [XTTS v2](https://huggingface.co/coqui/XTTS-v2)