XTTS v2 Fine-tuned Voice Model

A fine-tuned XTTS v2 model for cloning a single target voice in English and French.

Model Details

  • Base Model: XTTS v2 (Coqui TTS)
  • Training: 400 epochs
  • Loss Improvement: 57% reduction (2.863 → 1.231)
  • Languages: English & French
  • Sample Rate: 24 kHz
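
The reported loss reduction can be checked from the two loss values listed above. The snippet below is only a worked calculation; the function name `relative_reduction` is made up for illustration.

```python
# Loss values as reported in the Model Details list.
initial_loss = 2.863
final_loss = 1.231


def relative_reduction(start: float, end: float) -> float:
    """Return the fractional drop from start to end."""
    return (start - end) / start


reduction = relative_reduction(initial_loss, final_loss)
print(f"Loss reduction: {reduction:.1%}")  # prints "Loss reduction: 57.0%"
```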

Files

  • best_model.pth - Fine-tuned GPT model weights
  • config.json - Model configuration
  • vocab.json - Tokenizer vocabulary
  • dvae.pth - Discrete VAE for audio encoding
  • mel_stats.pth - Mel-spectrogram normalization stats
  • reference.wav - Voice reference sample
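
Before loading, it can help to confirm that a downloaded folder contains every file listed above. This is a minimal sketch using only the standard library; the helper name `missing_files` and the folder path are hypothetical.

```python
from pathlib import Path

# File names copied from the Files list above.
EXPECTED_FILES = [
    "best_model.pth",
    "config.json",
    "vocab.json",
    "dvae.pth",
    "mel_stats.pth",
    "reference.wav",
]


def missing_files(model_dir: str) -> list[str]:
    """Return the expected file names that are not present in model_dir."""
    folder = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    # Placeholder path; point it at the real download location.
    missing = missing_files("path/to/model/folder")
    print("Missing files:", missing or "none")
```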

Usage

from TTS.api import TTS

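# Load the model. model_path is the folder that holds the checkpoint and vocab.json.
# Assumption: the XTTS loader may look for a checkpoint named model.pth in that
# folder, so best_model.pth may need to be copied or renamed if loading fails.
tts = TTS(model_path="path/to/model/folder", config_path="path/to/config.json")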

# Generate speech
tts.tts_to_file(
    text="Your text here",
    file_path="output.wav",
    speaker_wav="reference.wav",
    language="en"
)
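
If the high-level API does not pick up `best_model.pth`, the checkpoint can also be loaded through the lower-level `Xtts` class, which accepts an explicit checkpoint path. The sketch below assumes all files from this repository sit in one folder and that the fine-tuned checkpoint loads without conversion; the function name `synthesize` and its arguments are illustrative, not part of the library.

```python
from pathlib import Path


def synthesize(model_dir: str, text: str, language: str = "en",
               out_path: str = "output.wav") -> None:
    """Load the fine-tuned checkpoint from model_dir and write speech to out_path."""
    # Heavy dependencies are imported lazily so this module imports without them.
    import torch
    import torchaudio
    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts

    folder = Path(model_dir)

    # Build the model from the shipped configuration.
    config = XttsConfig()
    config.load_json(str(folder / "config.json"))
    model = Xtts.init_from_config(config)

    # Point the loader at the fine-tuned weights and tokenizer explicitly.
    model.load_checkpoint(
        config,
        checkpoint_dir=str(folder),
        checkpoint_path=str(folder / "best_model.pth"),
        vocab_path=str(folder / "vocab.json"),
        use_deepspeed=False,
    )

    # Derive the speaker conditioning from the reference clip.
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=[str(folder / "reference.wav")]
    )

    out = model.inference(text, language, gpt_cond_latent, speaker_embedding)

    # 24000 Hz matches the sample rate given under Model Details.
    torchaudio.save(out_path, torch.tensor(out["wav"]).unsqueeze(0), 24000)
```

A call such as `synthesize("path/to/model/folder", "Your text here")` would then write `output.wav`. On a GPU machine, calling `model.cuda()` after loading is the usual next step, left out here to keep the sketch CPU-safe.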

Requirements

TTS>=0.22.0
torch>=2.0.0
torchaudio

Training Details

  • Dataset: Custom voice recordings
  • Training Duration: ~50 hours total
  • Batch Size: 2
  • Gradient Accumulation: 126 steps
  • Optimizer: AdamW
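
Batch size and gradient accumulation together determine how many samples contribute to each optimizer update. The arithmetic below assumes a single training device, which the list above does not state.

```python
# Values from the Training Details list.
batch_size = 2
grad_accum_steps = 126

# Assumption: one device, so no extra multiplier for data parallelism.
effective_batch_size = batch_size * grad_accum_steps
print(f"Samples per optimizer step: {effective_batch_size}")  # prints 252
```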

Limitations

  • Optimized for the specific voice in the training data
  • Performs best on text similar to the training data
  • Quality may degrade on very long texts
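
One common workaround for the long-text limitation is to split the input at sentence boundaries and synthesize each piece separately. The helper below is a sketch in plain Python; the name `split_into_chunks` and the 200-character default are illustrative choices, not values taken from this model.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("One. Two. Three.", max_chars=9):
        print(chunk)
```

Each chunk can then be passed to `tts.tts_to_file` with its own output file, and the resulting clips joined afterwards.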

License

Apache 2.0
