---
language:
- en
license: apache-2.0
tags:
- text-to-speech
- tts
- xtts
- voice-cloning
- coqui
library_name: coqui-tts
pipeline_tag: text-to-speech
---

# XTTS v2 Fine-tuned Model (English)

This is a fine-tuned version of [Coqui XTTS v2](https://github.com/coqui-ai/TTS) for English text-to-speech synthesis.

## Model Description

- **Base Model:** XTTS v2
- **Language:** English
- **Training Data:** Custom English speech dataset (~14 minutes)
- **Training Epochs:** 10
- **Best Checkpoint:** Epoch 7 (lowest eval loss: 3.07)

## Training Details

| Parameter | Value |
|-----------|-------|
| Batch Size | 4 |
| Learning Rate | 5e-06 |
| Max Audio Length | 11 seconds |
| Total Training Samples | 168 |

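To put the training parameters in perspective, the sketch below derives the number of optimizer steps and the average clip length from the figures on this card. It assumes all 168 samples are seen once per epoch, no gradient accumulation, and a dataset of exactly 14 minutes; none of these is confirmed by the card, so treat the results as rough estimates.

```python
import math

# Figures taken from this model card
total_samples = 168
batch_size = 4
epochs = 10
dataset_minutes = 14  # the card says "~14 minutes"; assumed exact here

# Assumption: every sample is used once per epoch, no gradient accumulation
steps_per_epoch = math.ceil(total_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Rough average clip length in seconds
avg_clip_seconds = dataset_minutes * 60 / total_samples

print(steps_per_epoch, total_steps, avg_clip_seconds)
```

Under these assumptions the run comes to 42 steps per epoch, 420 steps in total, and clips of about 5 seconds on average, well under the 11-second maximum audio length.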
### Loss Progression

| Epoch | Eval Loss |
|-------|-----------|
| 0 | 3.36 |
| 1 | 3.23 |
| 2 | 3.17 |
| 3 | 3.12 |
| 4 | 3.10 |
| 5 | 3.08 |
| 6 | 3.07 |
| 7 | **3.07** (best) |
| 8 | 3.11 |
| 9 | 3.10 |

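The table can be summarized with a few lines of Python. This is only a sketch over the rounded values shown above: at two decimals, epochs 6 and 7 tie, and the card marks epoch 7 as best, presumably based on the unrounded loss (an assumption).

```python
# Eval loss per epoch, copied from the table (rounded to two decimals)
eval_loss = {
    0: 3.36, 1: 3.23, 2: 3.17, 3: 3.12, 4: 3.10,
    5: 3.08, 6: 3.07, 7: 3.07, 8: 3.11, 9: 3.10,
}

best = min(eval_loss.values())
# All epochs that reach the minimum at this precision
best_epochs = [epoch for epoch, loss in eval_loss.items() if loss == best]

# Relative improvement from the first epoch to the best value
relative_drop = (eval_loss[0] - best) / eval_loss[0]

print(best_epochs, f"{relative_drop:.1%}")
```

The loss falls by roughly 8.6% from epoch 0 to the best value and rises slightly afterwards, which is consistent with selecting an earlier checkpoint rather than the final one.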
## Usage

### Installation

```bash
pip install TTS==0.22.0 torch==2.5.1 torchaudio==2.5.1 transformers==4.40.0
pip install huggingface_hub
```

### Quick Start

```python
import os
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download model files from the Hugging Face Hub
repo_id = "TurkishCodeMan/xtts-v2-english-finetuned"
model_path = hf_hub_download(repo_id=repo_id, filename="model.pth")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.json")

# Load the model configuration and weights
config = XttsConfig()
config.load_json(config_path)

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_dir=os.path.dirname(model_path),
    checkpoint_path=model_path,
    vocab_path=vocab_path,
    use_deepspeed=False,
)
model.cuda()  # requires a CUDA-capable GPU

# Compute speaker conditioning from the sample reference audio in this repo
ref_audio = hf_hub_download(repo_id=repo_id, filename="samples/speaker_reference.wav")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=ref_audio)

# Generate speech
out = model.inference(
    text="Hello, this is a test of the fine-tuned XTTS model.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)

# Save the waveform as a mono 24 kHz WAV file
wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
```

## Audio Samples

| Type | File |
|------|------|
| Speaker Reference | [speaker_reference.wav](samples/speaker_reference.wav) |
| Generated Output | [generated_output.wav](samples/generated_output.wav) |

## Requirements

⚠️ **Important:** Use these exact versions to avoid compatibility issues.

- Python 3.10+
- PyTorch 2.5.1
- torchaudio 2.5.1 (not 2.9.1+)
- transformers 4.40.0 (not 4.50+)
- TTS 0.22.0

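One way to confirm an environment matches these pins is to compare installed package versions with `importlib.metadata`. The sketch below is illustrative only: `PINNED` and `check_pins` are hypothetical names, and ignoring local build suffixes such as `+cu121` is an assumption about how strict the check should be.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions from the installation command on this card
PINNED = {
    "TTS": "0.22.0",
    "torch": "2.5.1",
    "torchaudio": "2.5.1",
    "transformers": "4.40.0",
}


def check_pins(pinned=PINNED):
    """Return {package: installed_version_or_None} for every mismatch."""
    problems = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Compare the public version only, dropping a local suffix like "+cu121"
        if installed is None or installed.split("+")[0] != wanted:
            problems[name] = installed
    return problems


for name, found in check_pins().items():
    print(f"{name}: expected {PINNED[name]}, found {found}")
```

An empty result means every pinned package is installed at the expected version; any printed line points to a package to reinstall.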
## Known Issues & Solutions

1. **StopIteration error in trainer:** Patch `trainer/generic_utils.py`, or apply a monkey-patch before importing TTS.
2. **Multi-GPU error:** Set `CUDA_VISIBLE_DEVICES=0` before importing `torch` or TTS.
3. **torchcodec error:** Downgrade torchaudio to 2.5.1.

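For the multi-GPU workaround, the environment variable has to be set before any CUDA-aware library is imported, because the visible devices are read when CUDA is initialized. A minimal sketch, where `pin_single_gpu` is a hypothetical helper and not part of TTS or PyTorch:

```python
import os


def pin_single_gpu(index: int = 0) -> str:
    """Expose a single GPU to this process; call before importing torch or TTS."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


pin_single_gpu(0)

# Heavy imports belong after the variable is set, for example:
# import torch
# from TTS.tts.models.xtts import Xtts
```

The same effect can be had from the shell by prefixing the command that launches the script with `CUDA_VISIBLE_DEVICES=0`.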
## License

Apache 2.0

## Acknowledgments

- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [XTTS v2](https://huggingface.co/coqui/XTTS-v2)