---
language:
- en
license: apache-2.0
tags:
- text-to-speech
- tts
- xtts
- voice-cloning
- coqui
library_name: coqui-tts
pipeline_tag: text-to-speech
---

# XTTS v2 Fine-tuned Model (English)

This is a fine-tuned version of [Coqui XTTS v2](https://github.com/coqui-ai/TTS) for English text-to-speech synthesis.

## Model Description

- **Base Model:** XTTS v2
- **Language:** English
- **Training Data:** Custom English speech dataset (~14 minutes)
- **Training Epochs:** 10
- **Best Checkpoint:** Epoch 7 (lowest eval loss: 3.07)

## Training Details

| Parameter | Value |
|-----------|-------|
| Batch Size | 4 |
| Learning Rate | 5e-06 |
| Max Audio Length | 11 seconds |
| Total Training Samples | 168 |

### Loss Progression

| Epoch | Eval Loss |
|-------|-----------|
| 0 | 3.36 |
| 1 | 3.23 |
| 2 | 3.17 |
| 3 | 3.12 |
| 4 | 3.10 |
| 5 | 3.08 |
| 6 | 3.07 |
| 7 | **3.07** (best) |
| 8 | 3.11 |
| 9 | 3.10 |

## Usage

### Installation

```bash
pip install TTS==0.22.0 torch==2.5.1 torchaudio==2.5.1 transformers==4.40.0
pip install huggingface_hub
```

### Quick Start

```python
import os

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download the model files
repo_id = "TurkishCodeMan/xtts-v2-english-finetuned"
model_path = hf_hub_download(repo_id=repo_id, filename="model.pth")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.json")

# Load the model
config = XttsConfig()
config.load_json(config_path)
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_dir=os.path.dirname(model_path),
    checkpoint_path=model_path,
    vocab_path=vocab_path,
    use_deepspeed=False,
)
model.cuda()

# Download a sample reference clip and compute the speaker conditioning
ref_audio = hf_hub_download(repo_id=repo_id, filename="samples/speaker_reference.wav")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=ref_audio)

# Generate speech
out = model.inference(
    text="Hello, this is a test of the fine-tuned XTTS model.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)
wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
```
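XTTS also supports streaming inference, which is useful when playback should start before the whole utterance has been synthesized. A minimal sketch reusing `model`, `gpt_cond_latent`, and `speaker_embedding` from the Quick Start above (the text and output filename are placeholders; generation parameters are left at their defaults):

```python
import torch
import torchaudio

# Stream the waveform chunk by chunk instead of waiting for the full utterance
chunks = model.inference_stream(
    text="Streaming playback can start before synthesis finishes.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)

wav_chunks = []
for i, chunk in enumerate(chunks):
    # Each chunk is a short waveform tensor; hand it to an audio player here
    print(f"chunk {i}: {chunk.shape[-1]} samples")
    wav_chunks.append(chunk)

wav = torch.cat(wav_chunks, dim=0)
torchaudio.save("output_streamed.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)
```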
## Audio Samples

| Type | File |
|------|------|
| Speaker Reference | [speaker_reference.wav](samples/speaker_reference.wav) |
| Generated Output | [generated_output.wav](samples/generated_output.wav) |

## Requirements

⚠️ **Important:** Pin these exact versions to avoid compatibility issues.

- Python 3.10+
- PyTorch 2.5.1
- torchaudio 2.5.1 (NOT 2.9.1+)
- transformers 4.40.0 (NOT 4.50+)
- TTS 0.22.0

## Known Issues & Solutions

1. **StopIteration error in trainer:** Patch `trainer/generic_utils.py` or apply a monkey-patch before importing TTS.
2. **Multi-GPU error:** Set `CUDA_VISIBLE_DEVICES=0` before any imports (see the sketch below).
3. **torchcodec error:** Downgrade torchaudio to 2.5.1.
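To illustrate issue 2: the environment variable only takes effect if it is set before PyTorch initializes CUDA, so it has to come before the imports. A minimal sketch (the rest of the script is unchanged from the Quick Start):

```python
import os

# Must be set before torch / TTS are imported; once CUDA has enumerated
# the GPUs, changing this variable has no effect in the same process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # noqa: E402 -- intentionally imported after the env var
from TTS.tts.models.xtts import Xtts  # noqa: E402
```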
## License

Apache 2.0

## Acknowledgments

- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [XTTS v2](https://huggingface.co/coqui/XTTS-v2)