---
language:
- en
license: apache-2.0
tags:
- text-to-speech
- tts
- xtts
- voice-cloning
- coqui
library_name: coqui-tts
pipeline_tag: text-to-speech
---

# XTTS v2 Fine-tuned Model (English)

This is a fine-tuned version of [Coqui XTTS v2](https://github.com/coqui-ai/TTS) for English text-to-speech synthesis.

## Model Description

- **Base Model:** XTTS v2
- **Language:** English
- **Training Data:** Custom English speech dataset (~14 minutes)
- **Training Epochs:** 10
- **Best Checkpoint:** Epoch 7 (lowest eval loss: 3.07)
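As a quick sanity check on the figures above (a sketch, assuming the ~14 minutes covers all 168 samples), the average clip length works out to about 5 seconds, comfortably under the 11-second max audio length used in training:

```python
# Back-of-the-envelope check on the dataset figures above.
total_minutes = 14   # approximate total audio in the training set
num_samples = 168    # total training samples
avg_seconds = total_minutes * 60 / num_samples
print(avg_seconds)  # 5.0
```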
## Training Details

| Parameter | Value |
|-----------|-------|
| Batch Size | 4 |
| Learning Rate | 5e-06 |
| Max Audio Length | 11 seconds |
| Total Training Samples | 168 |

### Loss Progression

| Epoch | Eval Loss |
|-------|-----------|
| 0 | 3.36 |
| 1 | 3.23 |
| 2 | 3.17 |
| 3 | 3.12 |
| 4 | 3.10 |
| 5 | 3.08 |
| 6 | 3.07 |
| 7 | **3.07** (best) |
| 8 | 3.11 |
| 9 | 3.10 |

## Usage

### Installation

```bash
pip install TTS==0.22.0 torch==2.5.1 torchaudio==2.5.1 transformers==4.40.0
pip install huggingface_hub
```

### Quick Start

```python
import os
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download model files
repo_id = "TurkishCodeMan/xtts-v2-english-finetuned"
model_path = hf_hub_download(repo_id=repo_id, filename="model.pth")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.json")

# Load model
config = XttsConfig()
config.load_json(config_path)

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_dir=os.path.dirname(model_path),
    checkpoint_path=model_path,
    vocab_path=vocab_path,
    use_deepspeed=False,
)
model.cuda()

# Generate speech (download a sample reference audio first)
ref_audio = hf_hub_download(repo_id=repo_id, filename="samples/speaker_reference.wav")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=ref_audio)

out = model.inference(
    text="Hello, this is a test of the fine-tuned XTTS model.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)

wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
```
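XTTS tends to work best on sentence-length inputs, so longer passages are usually split into chunks before being passed to `model.inference` one at a time. A minimal stdlib splitter sketch (the helper name and the 200-character threshold are illustrative assumptions, not limits stated by this model card):

```python
import re

def split_into_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence boundaries, merging short sentences up to max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    for sentence in sentences:
        # Merge into the previous chunk while it stays under the size cap.
        if chunks and len(chunks[-1]) + 1 + len(sentence) <= max_chars:
            chunks[-1] += " " + sentence
        else:
            chunks.append(sentence)
    return chunks

print(split_into_sentences("Hello there. This is a test. Goodbye!", max_chars=20))
# → ['Hello there.', 'This is a test.', 'Goodbye!']
```

Each chunk can then be synthesized with the same `gpt_cond_latent` and `speaker_embedding`, since the conditioning latents depend only on the reference audio.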
## Audio Samples

| Type | File |
|------|------|
| Speaker Reference | [speaker_reference.wav](samples/speaker_reference.wav) |
| Generated Output | [generated_output.wav](samples/generated_output.wav) |

## Requirements

⚠️ **Important:** Use these specific versions to avoid compatibility issues.

- Python 3.10+
- PyTorch 2.5.1
- torchaudio 2.5.1 (NOT 2.9.1+)
- transformers 4.40.0 (NOT 4.50+)
- TTS 0.22.0
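The pins above can be kept in a `requirements.txt` so the environment is reproducible in one command (the file name and the unpinned `huggingface_hub` line are conventions, not taken from this card):

```text
TTS==0.22.0
torch==2.5.1
torchaudio==2.5.1
transformers==4.40.0
huggingface_hub
```

Install with `pip install -r requirements.txt`.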
## Known Issues & Solutions

1. **StopIteration error in trainer:** Patch `trainer/generic_utils.py`, or apply a monkey-patch before importing TTS.
2. **Multi-GPU error:** Set `CUDA_VISIBLE_DEVICES=0` before any imports.
3. **torchcodec error:** Downgrade torchaudio to 2.5.1.
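For issue 2 the key detail is ordering: the environment variable must be set before torch or TTS is imported, because device discovery happens at import time. A minimal sketch (the StopIteration monkey-patch itself is not reproduced here):

```python
import os

# Work around the multi-GPU trainer error: pin training to a single GPU
# *before* any torch / TTS / trainer imports.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Imports of torch, TTS, and the trainer go below this line.
```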
## License

Apache 2.0

## Acknowledgments

- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [XTTS v2](https://huggingface.co/coqui/XTTS-v2)