	---
	language:
	- en
	license: apache-2.0
	tags:
	- text-to-speech
	- tts
	- xtts
	- voice-cloning
	- coqui
	library_name: coqui-tts
	pipeline_tag: text-to-speech
	---

	# XTTS v2 Fine-tuned Model (English)

	This is a fine-tuned version of [Coqui XTTS v2](https://github.com/coqui-ai/TTS) for English text-to-speech synthesis.

	## Model Description

	- Base Model: XTTS v2
	- Language: English
	- Training Data: Custom English speech dataset (~14 minutes)
	- Training Epochs: 10 (numbered 0 to 9)
	- Best Checkpoint: Epoch 7 (lowest eval loss: 3.07; epoch 6 matches it to two decimals)

	## Training Details

	| Parameter | Value |
	|-----------|-------|
	| Batch Size | 4 |
	| Learning Rate | 5e-06 |
	| Max Audio Length | 11 seconds |
	| Total Training Samples | 168 |
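
As a rough sanity check, the figures above imply the following step counts and average clip length. This is a sketch under two assumptions the card does not state: no gradient accumulation, and every one of the 168 samples is seen once per epoch.

```python
import math

# Figures taken from the model card.
total_samples = 168
batch_size = 4
epochs = 10
dataset_minutes = 14  # approximate, per the card

# Assumptions (not stated in the card): no gradient accumulation,
# and every sample is seen exactly once per epoch.
steps_per_epoch = math.ceil(total_samples / batch_size)
total_steps = steps_per_epoch * epochs
avg_clip_seconds = dataset_minutes * 60 / total_samples

print(steps_per_epoch)   # 42
print(total_steps)       # 420
print(avg_clip_seconds)  # 5.0
```

An average clip of about 5 seconds sits comfortably under the 11-second maximum audio length.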

	### Loss Progression

	| Epoch | Eval Loss |
	|-------|-----------|
	| 0 | 3.36 |
	| 1 | 3.23 |
	| 2 | 3.17 |
	| 3 | 3.12 |
	| 4 | 3.10 |
	| 5 | 3.08 |
	| 6 | 3.07 |
	| 7 | 3.07 (best) |
	| 8 | 3.11 |
	| 9 | 3.10 |
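
Picking the best checkpoint from these numbers can be sketched as below. The values are the rounded ones from the table, so epochs 6 and 7 tie at two decimals; the card marks epoch 7 as best, presumably based on the unrounded loss (an assumption, since the raw values are not published here).

```python
# Eval loss per epoch, copied from the table (rounded to two decimals).
eval_loss = {0: 3.36, 1: 3.23, 2: 3.17, 3: 3.12, 4: 3.10,
             5: 3.08, 6: 3.07, 7: 3.07, 8: 3.11, 9: 3.10}

lowest = min(eval_loss.values())
# Collect every epoch that reaches the minimum, so ties stay visible.
best_epochs = [epoch for epoch, loss in eval_loss.items() if loss == lowest]

print(lowest)       # 3.07
print(best_epochs)  # [6, 7]
```

The loss rises again after epoch 7, which is consistent with keeping an earlier checkpoint rather than the final one.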

	## Usage

	### Installation

	```bash
	pip install TTS==0.22.0 torch==2.5.1 torchaudio==2.5.1 transformers==4.40.0
	pip install huggingface_hub
	```

	### Quick Start

	```python
	import os
	import torch
	import torchaudio
	from huggingface_hub import hf_hub_download
	from TTS.tts.configs.xtts_config import XttsConfig
	from TTS.tts.models.xtts import Xtts

	# Download model files
	repo_id = "TurkishCodeMan/xtts-v2-english-finetuned"
	model_path = hf_hub_download(repo_id=repo_id, filename="model.pth")
	config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
	vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.json")

	# Load model
	config = XttsConfig()
	config.load_json(config_path)

	model = Xtts.init_from_config(config)
	model.load_checkpoint(
	    config,
	    checkpoint_dir=os.path.dirname(model_path),
	    checkpoint_path=model_path,
	    vocab_path=vocab_path,
	    use_deepspeed=False,
	)
	# Use the GPU when one is available; otherwise fall back to CPU (slower)
	device = "cuda" if torch.cuda.is_available() else "cpu"
	model.to(device)

	# Generate speech (download a sample reference audio first)
	ref_audio = hf_hub_download(repo_id=repo_id, filename="samples/speaker_reference.wav")
	gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=ref_audio)

	out = model.inference(
	    text="Hello, this is a test of the fine-tuned XTTS model.",
	    language="en",
	    gpt_cond_latent=gpt_cond_latent,
	    speaker_embedding=speaker_embedding,
	)

	# out["wav"] is a 1-D waveform; add a channel dimension and save at 24 kHz
	wav = torch.tensor(out["wav"]).unsqueeze(0)
	torchaudio.save("output.wav", wav, 24000)
	```

	## Audio Samples

	| Type | File |
	|------|------|
	| Speaker Reference | [speaker_reference.wav](samples/speaker_reference.wav) |
	| Generated Output | [generated_output.wav](samples/generated_output.wav) |

	## Requirements

	⚠️ Important: Use specific versions to avoid compatibility issues.

	- Python 3.10+
	- PyTorch 2.5.1
	- torchaudio 2.5.1 (NOT 2.9.1+)
	- transformers 4.40.0 (NOT 4.50+)
	- TTS 0.22.0
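
To confirm an environment matches these pins before loading the model, a small check along the following lines may help. It is a minimal sketch using only the standard library; the helper names `installed_versions` and `find_mismatches` are hypothetical, and stripping local build tags such as `+cu121` is an assumption about how the wheels are labelled.

```python
from importlib import metadata

# Pins copied from the Requirements list.
PINNED = {
    "TTS": "0.22.0",
    "torch": "2.5.1",
    "torchaudio": "2.5.1",
    "transformers": "4.40.0",
}


def installed_versions(names):
    """Return {name: version or None} for each distribution name."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(installed, pinned=PINNED):
    """Return {name: (installed, expected)} for every package off its pin."""
    mismatches = {}
    for name, expected in pinned.items():
        actual = installed.get(name)
        # Ignore local build tags such as "2.5.1+cu121" (assumed labelling).
        base = actual.split("+")[0] if actual else None
        if base != expected:
            mismatches[name] = (actual, expected)
    return mismatches


if __name__ == "__main__":
    report = find_mismatches(installed_versions(PINNED))
    for name, (actual, expected) in report.items():
        print(f"{name}: found {actual}, expected {expected}")
```

An empty report means all four packages match; a `None` in the "found" position means the package is not installed.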

	## Known Issues & Solutions

	1. `StopIteration` error in the trainer: patch `trainer/generic_utils.py`, or apply a monkey-patch before importing TTS.
	2. Multi-GPU error: set `CUDA_VISIBLE_DEVICES=0` before any imports.
	3. `torchcodec` error: downgrade torchaudio to 2.5.1.
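
For the multi-GPU workaround, the ordering matters: the variable is read when CUDA is initialised, so setting it at the very top of the script, before `torch` or `TTS` is imported, is the safe arrangement. A minimal sketch:

```python
import os

# Restrict the process to the first GPU. This must run before torch
# (or TTS, which imports torch) is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Imports that touch CUDA go below this line, for example:
# import torch
# from TTS.tts.models.xtts import Xtts

print(os.environ["CUDA_VISIBLE_DEVICES"])  # 0
```

Setting the variable in the shell (`CUDA_VISIBLE_DEVICES=0 python script.py`) has the same effect without touching the code.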

	## License

	Apache 2.0

	## Acknowledgments

	- [Coqui TTS](https://github.com/coqui-ai/TTS)
	- [XTTS v2](https://huggingface.co/coqui/XTTS-v2)